Before starting ‘Languages and tools you should know to become a data scientist-Part 1’, you may want to read our introductory article, Introduction to Data Science.
Let’s start…
Languages for Data Science
One of the most confusing decisions for someone entering the field of data science is which programming language to choose. The answer depends on the field you work in, what you need the language for, and who you work for. Some of the languages commonly used by data scientists are:
Python, R, SQL, Scala, Java, C++, Julia, JavaScript, PHP, Go, Ruby, and Visual Basic
The most widely used among these are:
- Python
- R
- SQL
Python
Python is an open-source, high-level programming language created by Guido van Rossum in the late 1980s and first released in 1991. It is one of the most popular programming languages.
What makes Python so popular is its simple syntax, which makes it easy to learn even for someone new to programming. Python also ships with a large standard library and has a huge ecosystem of third-party libraries that provide additional functionality.
For Data Science it has:
- Scientific computing libraries such as Pandas, NumPy, and Matplotlib.
- Artificial intelligence and machine learning libraries such as PyTorch, TensorFlow, Keras, Scikit-Learn, and NLTK (the Natural Language Toolkit, used for natural language processing).
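To give a feel for how readable Python is, here is a minimal sketch that uses only the built-in statistics module. The temperature values are made up for illustration; for larger datasets you would normally reach for Pandas or NumPy instead.

```python
import statistics

# Hypothetical sample: daily temperatures in degrees Celsius (made-up values).
temperatures = [21.5, 23.0, 19.8, 22.4, 24.1]

# Compute the mean, then keep only the days that were warmer than the mean.
mean_temp = statistics.mean(temperatures)
warm_days = [t for t in temperatures if t > mean_temp]

print(f"Mean temperature: {mean_temp:.2f}")
print(f"Days warmer than the mean: {len(warm_days)}")
```

The list comprehension on the `warm_days` line reads almost like plain English, which is a big part of why beginners find the language approachable.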
R
Learning more than one programming language can be an advantage if you are aiming for a high-paying job. R is free software created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is most often used by statisticians, mathematicians, and data miners for developing statistical software, graphing, and data analysis.
Data mining is the process of finding patterns in a huge set of data.
Advantages of R programming language:
- Easy to translate from math to code
- Easy for programming beginners
R has a large set of libraries that make exploratory data analysis easier. It also integrates well with other languages and platforms such as C, C++, Java, .NET, and Python. R also has stronger support for object-oriented programming than many other statistical computing languages.
SQL (Structured Query Language)
SQL was developed at IBM by Donald D. Chamberlin and Raymond F. Boyce during the 1970s. Even though it is not a general-purpose programming language like Python or R, it is widely used by data scientists. SQL is used to query and manage relational databases, which store data in tables of rows and columns, much like the sheets of a spreadsheet.
Some of the most common SQL databases available are:
MySQL, IBM Db2, Oracle Database, SQLite, Microsoft SQL Server, etc.
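As a small illustration of what working with SQL looks like, the sketch below uses Python's built-in sqlite3 module to create an in-memory SQLite database and run a grouped query against it. The employees table and its rows are hypothetical, chosen only to show the idea.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration.
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Engineering", 85000),
        ("Ben", "Engineering", 78000),
        ("Chloe", "Marketing", 62000),
    ],
)

# Ask the database for the average salary in each department.
rows = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()
conn.close()

for department, avg_salary in rows:
    print(department, avg_salary)
```

The same SELECT statement would work with little or no change on most of the databases listed above, which is one reason SQL is worth learning.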
Other languages for Data science
Even though Python, R, and SQL are the most commonly used languages by data scientists, many other languages can be used to serve similar purposes in solving a particular data science problem. Some of these include:
Scala, Java, C++, Julia, JavaScript, PHP, Go, Ruby, and Visual Basic.
Tools for Data Science
Before moving to tools for data science we shall look at some of the basic tasks in Data science.
- Data Management – Process of storing and retrieving data.
- Data Integration and Transformation – Process of extracting data from different source systems into one place and converting it to the format required by the target system.
- Data Visualization – Used in initial data exploration and for presenting the final results.
- Model Building – Process of creating a machine learning (ML) or deep learning (DL) model from the data (see the Introduction to Data Science article to know more about ML and DL).
- Model Deployment – Making the ML or DL model available to applications.
- Model Monitoring & Assessment – Tracking the deployed model’s performance so that it can be improved.
- Code Asset Management – Uses versioning (keeping track of successive versions of the same code) and other collaborative features to facilitate teamwork.
- Data Asset Management – Supports replication, backup, and access rights to data.
- Development Environment – Tools that help developers prototype, build, test, and deploy their work.
- Execution Environment – Tools for data processing, model training, and deployment.
- Fully Integrated Visual Tools – Cover all the tasks mentioned above, either partially or fully, through a user interface where functions can be triggered with a button click rather than by writing code.
Now we shall look into the tools…
Open Source Tools for Data Science
Data Management Tools
The most commonly used open-source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, CouchDB, and Cassandra; file-based tools such as the Hadoop Distributed File System (HDFS) and Ceph; and Elasticsearch, which is mainly used for text data.
Data Integration and Transformation Tools
The task of data integration and transformation is traditionally called ETL (Extract, Transform, and Load). When the data is loaded into the target system first and transformed there, it is called ELT (Extract, Load, and Transform). It is also referred to as data refining and cleaning these days. The most widely used tools for this are:
Apache Airflow, Kubeflow, Apache Kafka, Apache NiFi, Apache Spark SQL, and Node-RED (which consumes so few resources that it can run on devices like a Raspberry Pi).
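The extract, transform, and load steps can be sketched in a few lines of plain Python. This is only a toy illustration, not how the tools above work internally: the CSV text, the column names, and the temperatures table are all made up, and an in-memory string stands in for a real source file.

```python
import csv
import io
import sqlite3

# Extract: read raw records. An in-memory string stands in for a real source file.
raw_csv = io.StringIO(
    "city,temp_f\n"
    "Auckland,68\n"
    "Kochi,86\n"
    "Oslo,\n"  # missing reading, dropped during cleaning
)
records = list(csv.DictReader(raw_csv))


def fahrenheit_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius, rounded to one decimal place."""
    return round((temp_f - 32) * 5 / 9, 1)


# Transform: drop incomplete rows and convert the unit.
cleaned = [
    (row["city"], fahrenheit_to_celsius(float(row["temp_f"])))
    for row in records
    if row["temp_f"]
]

# Load: write the cleaned rows into the target database (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperatures (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO temperatures VALUES (?, ?)", cleaned)
loaded = conn.execute(
    "SELECT city, temp_c FROM temperatures ORDER BY city"
).fetchall()
conn.close()

print(loaded)
```

Dedicated tools add what this sketch leaves out, such as scheduling, retries, and connectors to many different source systems.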
Data Visualization Tools
- Hue – Creates visualizations from SQL queries.
- Kibana – A web application that creates visualizations from data stored in Elasticsearch (the data management tool mentioned above).
- Apache Superset – A web application for data exploration and visualization.
In addition to these, some of the most popular Python libraries for data visualization are Plotly, Matplotlib, Bokeh, Seaborn, and Altair.
Model Deployment Tools
After we create an ML model, we have to expose it through an API (Application Programming Interface) so that other applications can use it. For that we can use these tools:
Apache PredictionIO, Seldon, MLeap, and TensorFlow Serving
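To make the idea of "a model behind an API" concrete, here is a minimal sketch built only on Python's standard library. It does not use any of the tools listed above. The predict function is a hypothetical stand-in for a trained model (a fixed linear formula), and the function names, the "x" feature, and the port number are assumptions made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear formula."""
    return 2.0 * features["x"] + 1.0


def handle_prediction_request(body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)
    return json.dumps({"prediction": predict(features)})


class PredictionHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the model's prediction as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_prediction_request(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the API on localhost; blocks until interrupted."""
    HTTPServer(("localhost", port), PredictionHandler).serve_forever()
```

Calling serve() starts the server, after which any application can send a POST request with a body such as {"x": 3} and receive the prediction back. Dedicated deployment tools add what this sketch lacks, such as model versioning, scaling, and security.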
Model Monitoring and Assessment Tools
Once we deploy the ML model, we have to keep track of its performance as new data comes in and update the model if any change is needed. Some of the common tools used for that are:
- ModelDB – Stores data about the model’s performance.
- Prometheus – A general-purpose monitoring tool that is also widely used for ML model monitoring.
- IBM Research Trusted AI – Also helps in detecting bias in the model’s predictions, for example towards a particular gender.
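The core idea behind model monitoring can be sketched without any of these tools: compare the model's predictions with the actual outcomes as they arrive and raise a flag when recent accuracy drops. The class below is a hypothetical illustration; its name, the window size, and the alert threshold are assumptions, and the prediction data is made up.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, window_size, alert_threshold):
        # A deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, predicted, actual):
        """Store whether a single prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """Return True when recent accuracy falls below the threshold."""
        current = self.accuracy()
        return current is not None and current < self.alert_threshold


# Made-up (predicted, actual) pairs arriving after deployment.
monitor = AccuracyMonitor(window_size=4, alert_threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_attention())
```

When needs_attention() returns True, that is the signal to investigate and possibly retrain or update the model.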
Code Asset Management Tools (Version Control)
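Git is the most popular version control software. GitHub and GitLab are hosting services built around Git that make it easier to collaborate on version-controlled code.
Note: Version control means tracking and managing changes to your code over time, so that you can review the history, collaborate with others, and return to an earlier version if needed.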
Data Asset Management Tools
Data asset management, also known as data governance, is the process of versioning and annotating data. Apache Atlas, ODPi Egeria, and Kylo are some of the tools that help with this.
Annotating means attaching tags to data based on its features. The tags can later be used to tell different types or categories of data apart and to group the patterns found in the data under those categories.
Development Environment Tools
- Jupyter Notebooks – Mostly used as a tool for interactive Python programming. It supports more than 100 languages through kernels, which encapsulate the interactive interpreters for different programming languages. It unifies documentation, code, outputs, shell commands, and visualizations into a single document.
An interpreter is a program that directly executes programs written in a programming language, without requiring them to be compiled into a machine-code program beforehand.
- JupyterLab – The next generation of Jupyter Notebooks. The main difference is the ability to open many types of files, including Jupyter notebook files.
- Apache Zeppelin – Inspired by Jupyter Notebooks, it works in much the same way. One major difference is its integrated plotting capability; in Jupyter Notebooks, we have to use separate libraries to do the plotting.
- RStudio – Runs the R language and its libraries. It also integrates programming, visualization, and data exploration into a single tool.
- Spyder – Similar to RStudio, but for Python instead of R.
Execution Environment Tools
Sometimes, when we have a large amount of data, a single machine is not enough to execute the different tasks. That’s where cluster execution environments play a big role: each machine added to the cluster increases the capacity of the environment. Some of the most common environments are:
- Apache Spark – Supports linear scalability (double the machines, double the performance).
- Apache Flink – Processes real-time data streams (for example, real-time data from a temperature sensor).
- Ray – Developed at RISELab.
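Cluster environments get their scalability from a simple pattern: split the data into chunks, let each worker process one chunk, and combine the partial results. The sketch below imitates that pattern on a single machine, with threads standing in for the machines of a cluster. It only illustrates the pattern and will not show the speed-up of a real cluster; the function names are made up for this example, and it assumes a non-empty input list.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_count):
    """Split a non-empty list into roughly equal consecutive chunks."""
    chunk_size = -(-len(values) // chunk_count)  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def sum_of_squares(chunk):
    """The work done by one worker on its own chunk."""
    return sum(v * v for v in chunk)


def distributed_sum_of_squares(values, worker_count):
    """Split the data, process each chunk on a worker, then combine the results."""
    chunks = split_into_chunks(values, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


data = list(range(1, 101))
total = distributed_sum_of_squares(data, worker_count=4)
print(total)
```

Because each chunk is processed independently, adding more workers means more chunks handled at the same time, which is the intuition behind "double the machines, double the performance".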
Fully Integrated Visual Tools
- KNIME – Has a visual user interface with integrated visualization capabilities. It can be extended by programming in R or Python.
- Orange – Easier to use but less flexible than KNIME.
Commercial Tools for Data Science
Enterprise projects use commercial products to do the data science jobs mentioned above. Let’s see some of the tools that are commercially available to do these tasks.
Data Management Tools
Oracle Database, Microsoft SQL Server, and IBM Db2. These vendors also provide commercial support, which is important for many organizations.
Data Integration and Transformation Tools
Some of the ETL (Extract, Transform, and Load) tools are:
Informatica, IBM InfoSphere DataStage, and products from Microsoft and Oracle
Data Visualization Tools
Some of the commercial tools for Data Visualization are:
Tableau and Microsoft Power BI.
Model Building Tools
SPSS Modeler and SAS Enterprise Miner
Model deployment is tightly integrated with model building in commercial environments.
Model monitoring and assessment and code asset management usually rely on the open-source tools mentioned above, as they are the industry standard for these tasks.
Data Asset Management Tools
Informatica and IBM InfoSphere Information Governance Catalog.
IBM Watson Studio is one of the most popular development environments, and it provides fully integrated visual tools for data science tasks.
Cloud-Based Tools for Data Science
Fully Integrated Visual Tools and Platforms
IBM Watson, Microsoft Azure, and H2O Driverless AI are cloud-based, fully integrated data science tools that cover all the tasks mentioned above.
Data Management Tools
Some of the cloud-based data management services offered as SaaS (Software as a Service, where the software is maintained and updated by the cloud provider itself) are:
Amazon DynamoDB, IBM Cloudant (which is based on Apache CouchDB), and IBM Db2
Data Integration and Transformation Tools
Informatica and IBM Data Refinery
Data Visualization Tools
Datameer, IBM Cognos Analytics
Model Building Tools
Google AI and IBM Watson Machine Learning
Model Deployment Tools
IBM Watson Machine Learning
Model Monitoring and Assessment Tools
Amazon SageMaker Model Monitor and IBM Watson OpenScale
Please have a look at Part 2 of this article.
Thank You
About the Author
Deepak Jose is a B-Tech CS student with a passion for Data Science. He loves learning about data science, coding, and science in general, and does data analysis and visualization as a hobby. Even though he is on the Computer Science path, he always finds time to learn about space, automobiles, geography, energy, architecture, arts, etc. He loves solving problems and learning about new inventions.