
Python Libraries for Data Science

Data scientists across the world use Python as their go-to programming language. One reason is its vast set of libraries, which offer solutions to most problems in the field of Data Science. Another is that Python is easy to learn and use, thanks to a syntax that reads much like spoken language. Together, these make Python a powerful tool for Data Science.

There is a self-assessment test at the end of this article. Take it after reading to assess your knowledge of Python libraries for Data Science.

What makes Python so powerful and popular is its Data Science libraries. These include Pandas, NumPy, and scikit-learn, which offer functions for turning your data into insights and models that you can use for your business purposes.

In this article, we will learn about the 11 most important Python libraries for Data Science, which will help you become a successful Data Scientist. We will also discuss their installation, important functions, and example code.

Before we begin, if you want to learn more about Data Science, read the articles below.

  1. Introduction to Data Science
  2. Languages and Tools you should know to become a Data Scientist
  3. Tools for Data Science
  4. Statistics for Data Science – Descriptive Statistics
  5. Complete Data Science Roadmap – With resources
  6. Introduction To Natural Language Processing

Let’s dive deep into the 11 most essential Python libraries you must learn to become a successful Data Scientist.

1. NumPy

Used by data scientists across the world for scientific computing, NumPy is a powerful open-source library whose core is written in C. NumPy provides data structures such as multidimensional arrays and matrices, along with a wide range of math operations on them. As a result, NumPy is the base library for many other powerful Data Science libraries, so learning NumPy should be one of your first steps in becoming a Data Scientist. Even though a NumPy array looks similar to a Python list, it consumes less memory and is much faster for numerical work.

Visit the official NumPy website to know more.

Installation

If you have already installed Python on your machine, open the terminal (cmd on Windows) and type the following command.

pip install numpy

Or, if you have Anaconda installed on your machine, open the Anaconda Prompt and type the following command.

conda install -c anaconda numpy

Example code

If you have installed Anaconda on your machine, you can use Jupyter Notebook to test this code. Or you can use any other code editor you like.

1. Creating an array

The following code is used to create a NumPy array.

Code :

#importing the library
import numpy as np 
  
a = np.array([(8, 9, 10), (11, 12, 13)])
print(a)

output:

[[ 8  9 10]
 [11 12 13]]

2. min and max: 

These functions return the minimum and maximum value of the numbers in a NumPy array.

Code :

arr = np.array([1,2,3,4,5])
print(np.max(arr), np.min(arr))

Output :

5 1

3. std: 

This function returns the standard deviation of the numbers in a NumPy array.

Code :

arr = np.array([1,2,3,4,5])
print(np.std(arr))

Output :

1.4142135623730951

4. mean: 

This function returns the mean value of the numbers in a NumPy array.

Code:

arr = np.array([1,2,3,4,5])
print(np.mean(arr))

Output:

3.0

5. median: 

This function returns the median of the numbers in a NumPy array.

Code :

arr = np.array([1,2,3,4,5])
print(np.median(arr))

Output :

3.0

6. percentile: 

This function returns the given percentile of the numbers in a NumPy array, that is, the value below which that percentage of the numbers falls.

Code :

arr = np.array([1,2,3,4,5])
print(np.percentile(arr,25)) #prints 25th percentile
print(np.percentile(arr,50)) #prints 50th percentile
print(np.percentile(arr,75)) #prints 75th percentile

Output

2.0
3.0
4.0

7. linspace:

This function gives evenly spaced numbers within a specified interval.

Code:

print(np.linspace(2.0, 3.0, num=5, retstep=True))

"""
Here 2.0 is the lower limit and 3.0 is the upper limit.
num=5 means we generate 5 evenly spaced numbers from 2 to 3, endpoints included.
retstep=True means the step size is also returned.
"""

Output :

(array([2.  , 2.25, 2.5 , 2.75, 3.  ]), 0.25)
"""
Here 0.25 is the step size.
"""

8. shape: 

This function returns the shape of an array.

Code :

arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(arr)
print(np.shape(arr))

Output :

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
(2, 5) #Two rows and 5 columns

9. reshape: 

This function is used to reshape a NumPy array, giving it new dimensions without changing its data.

Code :

arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(arr,"\n")
print(arr.reshape(5,2))

Output :

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]] 
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]

10. transpose: 

This function is used to reverse the axes of the NumPy array. 

Code :

arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(arr,"\n")
print(np.transpose(arr))

Output :

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]] 
[[ 1  6]
 [ 2  7]
 [ 3  8]
 [ 4  9]
 [ 5 10]]

11. sort: 

This function is used for sorting the NumPy array.

Code :

arr = np.array([11,2,30,4,5])
print(arr, "\n")
print(np.sort(arr))

Output :

[11  2 30  4  5] 
[ 2  4  5 11 30]
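The claim made at the start of this section, that a NumPy array consumes less memory than a Python list, can be checked with a rough sketch like the one below. The size of 1000 elements and the variable names are arbitrary choices for illustration, and the exact byte count for the list depends on your Python version and platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An array stores the raw 8-byte values in one contiguous buffer.
array_bytes = np_arr.nbytes

print(list_bytes, array_bytes)

# Arithmetic applies to the whole array at once, with no explicit loop.
doubled_list = [v * 2 for v in py_list]
doubled_arr = np_arr * 2
print(doubled_arr[:5])
```

The list needs an explicit loop (here a list comprehension) for the same arithmetic that NumPy expresses as a single array operation, which is also where most of the speed difference comes from.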

2. SciPy

SciPy, which stands for Scientific Python, is a scientific computation library built on top of NumPy. It provides functions for mathematical computations such as the determinant of a matrix, eigenvalues, and eigenvectors. SciPy combines Python code with compiled C, C++, and Fortran routines, and it offers far more functionality than NumPy alone. It is also used for solving problems in integration, optimization, linear algebra, and statistics.

Visit the official SciPy website to know more.

Installation

For a standard Python environment, use the following command.

pip install scipy

For a conda environment, use the following command.

conda install -c anaconda scipy

Example Code

Here we will see some examples of using the library.

Importing the necessary libraries

We start by importing the libraries used in these examples.

import numpy as np
from scipy import linalg

Now we will see the code for some common tasks using this library.

1. Code to find the Determinant of a Matrix.

arr = np.array([[4,2,3],[4,5,6],[9,8,9]]) #creating a matrix using numpy
linalg.det(arr)

Output :

-15.0

2. Code to find the eigenvalues and eigenvectors of this matrix.

We will use the array built above to find these.

eig_vals, eig_vects = linalg.eig(arr)
print(eig_vals)
print(eig_vects)

Output :

[16.88108059+0.j  1.65561839+0.j -0.53669898+0.j]
[[-0.26938637 -0.68838898 -0.19730392]
 [-0.5048473   0.72319052 -0.65139713]
 [-0.82009767  0.05582181  0.73263425]]
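The introduction to this section also mentions integration and optimization. As a minimal sketch of both, the code below uses `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`. The two functions being integrated and minimized are simple examples chosen here because their exact answers are known; they are not from the original article.

```python
from scipy import integrate, optimize

# Numerically integrate f(x) = x**2 from 0 to 3 (exact answer: 9).
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)
print(area)

# Find the x that minimizes g(x) = (x - 2)**2 (exact answer: x = 2).
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)
```

Both results are numerical approximations, so expect values very close to 9 and 2 rather than exactly equal to them.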

3. Matplotlib

Data visualization is an important skill for becoming a successful Data Scientist. Examples include bar graphs, pie charts, and scatter plots. Data Scientists use these graphs to understand insights from data more easily.

Matplotlib is an open-source plotting library built on NumPy arrays. It is a comprehensive library that supports many plot types, such as line, bar, scatter, and histogram. To know more, visit the official Matplotlib website.

Installation

For a standard Python environment, use the following command.

pip install matplotlib

For a conda environment, use the following command.

conda install -c conda-forge matplotlib

Example Code

Here we will see some examples of using the library.

Importing the necessary libraries

We start by importing the library used in these examples.

import matplotlib.pyplot as plt

Now we will see the code for some common tasks using this library.

1. Code to draw a Line Plot.

x = [4,5,3,1,2] #values on x-axis
y = [10,9,6,8,7] #values on y-axis
  

plt.plot(x,y) #plotting function
plt.show()

Output :

2. Code to draw a Bar Plot.

x = [4,5,3,1,2] #values on x-axis
y = [10,9,6,8,7] #values on y-axis
  
plt.bar(x,y) #plotting function
plt.show()

Output :

3. Code to draw a Histogram.

y = [10,9,6,8,7] #data values to be grouped into bins
  
plt.hist(y) #plotting function
plt.show()

Output :

4. Code to draw a Pie Chart.

langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23,17,35,29,12]

plt.pie(students, labels=langs,autopct='%1.2f%%')
plt.show() 

Output :

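The autopct parameter labels each wedge with its percentage of the total, and the label is drawn inside the wedge. When autopct is a format string, the label is computed as fmt % pct; here '%1.2f%%' shows two decimal places followed by a percent sign.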
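Scatter plots are listed among the plot types Matplotlib supports, so here is a minimal sketch of one, reusing the same x and y values as the line and bar plots. The axis labels and title are placeholders added for illustration.

```python
import matplotlib.pyplot as plt

x = [4, 5, 3, 1, 2]   # values on x-axis
y = [10, 9, 6, 8, 7]  # values on y-axis

fig, ax = plt.subplots()
points = ax.scatter(x, y)  # one marker per (x, y) pair
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot")
plt.show()
```

Unlike the line plot, the scatter plot does not connect the points, so the order of the values in the lists does not affect the picture.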

4. Seaborn

Built on top of Matplotlib, Seaborn is a high-level interface for drawing statistical graphics. Visit the official Seaborn website to know more.

Installation

For a standard Python environment, use the following command.

pip install seaborn

For a conda environment, use the following command.

conda install -c anaconda seaborn

Example Code

Here we will see an example of using the library.

Importing the necessary libraries

We start by importing the library used in this example.

import seaborn as sns

Now we will see the code for a common task using this library.

Code to draw an lmplot.

An lmplot draws a scatter plot together with a fitted linear regression line, showing the relationship between two variables (x and y).

# Loading the sample dataset that comes with seaborn
df = sns.load_dataset("anscombe")
  
# Show the results of a linear regression
sns.lmplot(x="x", y="y", data=df)

Output :

5. Pandas

Pandas is built on top of NumPy and works together with Matplotlib for plotting. On this foundation, Pandas offers fast methods for working with datasets: loading them, cleaning them, transforming them, and so on. As a result, Pandas is one of the most important libraries that you must learn.

Visit the official Pandas website to know more.

Installation

For a standard Python environment, use the following command.

pip install pandas

For a conda environment, use the following command.

conda install -c anaconda pandas

Example code

Here we will see some examples of using the library.

Importing the necessary libraries

We start by importing the libraries used in these examples.

import pandas as pd
import numpy as np

Now we will see the code for some common tasks using this library.

1. Code for Creating a Series.

A Pandas Series is a one-dimensional array that can hold data of any type. It can be compared with a column in an Excel sheet. We can create it from lists, dictionaries, NumPy arrays, and so on. In real-world work, however, the data is usually loaded from databases, CSV files, or Excel sheets.

data = np.array(['a','b','c','d']) #using numpy array
data1 = ['q','r','s','t'] #using list
data2 = {'Names':['Depak','Melvin','Abi','Meet']} #using dictionary
   
print(pd.Series(data),'\n')
print(pd.Series(data1),'\n')
print(pd.Series(data2))

Output :

0    a
1    b
2    c
3    d
dtype: object 

0    q
1    r
2    s
3    t
dtype: object 

Names    [Depak, Melvin, Abi, Meet]
dtype: object

2. Code for Creating DataFrame.

A DataFrame is a table-like data structure that stores values in rows and columns. It can be compared to an Excel sheet. Each column represents a variable, and each row represents one sample with a value for each of those variables.

data = {'Name':['Depak','John','Jose','Tojo'],
        'Age':[22, 35, 28, 30]}
 
df = pd.DataFrame(data)
 
df #can only be done while using a notebook, use print otherwise
#print(df)

Output :

    Name  Age
0  Depak   22
1   John   35
2   Jose   28
3   Tojo   30
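The introduction to this section says Pandas is used for cleaning and transforming datasets. The sketch below shows one possible version of each step. The missing age and the Over25 column are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age, for illustration only.
df = pd.DataFrame({'Name': ['Depak', 'John', 'Jose', 'Tojo'],
                   'Age': [22, np.nan, 28, 30]})

# Cleaning: replace the missing age with the mean of the known ages.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Transforming: derive a new boolean column from an existing one.
df['Over25'] = df['Age'] > 25

print(df)
```

Filling with the mean is only one of several cleaning strategies; dropping the incomplete row with dropna() is another common choice.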

6. scikit-learn

Built on NumPy, SciPy, and Matplotlib, scikit-learn offers simple and efficient tools for predictive data analysis. Examples include Linear Regression, Logistic Regression, and K-Nearest Neighbors (KNN). This library is most commonly used for building machine learning models for business applications.

Visit the official scikit-learn website to know more.

Installation

For a standard Python environment, use the following command.

pip install -U scikit-learn

For conda, the following command creates a new environment named sklearn-env with scikit-learn installed from conda-forge.

conda create -n sklearn-env -c conda-forge scikit-learn

Let’s now see an example. In this code, we create a linear regression model that predicts salary based on age.

Example Code

Here we will see an example of using the library.

Importing the libraries

We start by importing the libraries used in this example.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression as LR

Now we will see the code, step by step.

1. Code for Creating our Dataset.

dataset = pd.DataFrame({'Name':['Depak','John','Jose','Tojo'],
        'Age':[22, 35, 28, 30],'Salary':[15000,25000,18000,20000]})

dataset

2. Code for Creating a model & doing the prediction.

"""
Reshaping is done to convert 1D data to 2D data because the model accepts only 2D data.

"""
X = np.array(dataset['Age']).reshape(-1, 1)

y = np.array(dataset['Salary']).reshape(-1, 1)

lm = LR() #loading the linear regression model.

lm.fit(X,y) #fitting the model with the variables.

lm.predict(np.array([25]).reshape(-1,1)) #predicting the salary for person aged 25

Output :

The salary of a person aged 25 is predicted as 16625.36023055.

array([[16625.36023055]])
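KNN is mentioned above as another of the models scikit-learn provides. As a minimal sketch, the code below fits a `KNeighborsClassifier` on a tiny made-up dataset. The ages, the 0/1 labels, and the choice of 3 neighbors are all assumptions for illustration and are not part of the salary example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: age -> 0 (junior) or 1 (senior).
X = np.array([22, 24, 26, 35, 38, 40]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each prediction is the majority label among the 3 nearest training ages.
pred = knn.predict(np.array([[25], [37]]))
print(pred)  # [0 1]
```

The fit and predict calls follow the same pattern as the linear regression model, which is what makes it easy to swap one scikit-learn model for another.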

7. TensorFlow 2.0

TensorFlow is one of the most popular Deep Learning frameworks, widely used for developing and training neural network models. It offers a high-level programming interface for these tasks. TensorFlow 2.0 is a major update to this framework. One of its most important changes is the integration of Keras, a high-level deep learning API, as the standard way to build models. As a result, model building is much easier than before. For these reasons, learning TensorFlow 2.0 can be a big addition to your Data Science armory.

Uses

Listed below are some of the uses of this library.

  • Image Recognition.
  • Voice Recognition.
  • Video Detection.
  • Text Analysis.

Installation

For a standard Python environment, use the following command.

pip install tensorflow

For a conda environment, use the following command.

conda install -c conda-forge tensorflow

Visit the official TensorFlow website to know more.
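To give a feel for model building with Keras, here is a minimal sketch of a one-layer model trained through `tf.keras`. The toy data, the number of epochs, and the optimizer are assumptions chosen to keep the example short; a real model would need more data and tuning.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data following y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
y = 2 * x + 1

# A single Dense unit is a linear model: y_hat = w * x + b.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=20, verbose=0)

# Predict for one new input; the result has one row and one column.
pred = model.predict(np.array([[4.0]], dtype=np.float32), verbose=0)
print(pred.shape)
```

Because the weights start from random values and training here is brief, the predicted number itself will vary from run to run.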

8. NLTK

NLTK, which stands for Natural Language Toolkit, is a Python package used for Natural Language Processing (NLP). As the name suggests, the library contains tools for performing NLP tasks. Learn more about NLP here.

Features

Some of the features provided by the library are as follows:

  • Tokenizing
  • Filtering Stop Words
  • Stemming
  • Tagging Parts of Speech
  • Lemmatizing
  • Chunking
  • Chinking
  • Named Entity Recognition

All these features mentioned above make NLTK a powerful tool to know.

Installation

For a standard Python environment, use the following command.

pip install nltk

For a conda environment, use the following command.

conda install -c anaconda nltk

Visit the official NLTK website to know more.
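Three of the features listed above (tokenizing, filtering stop words, and stemming) can be sketched as follows. The sample sentence and the tiny stop word set are assumptions for illustration; NLTK ships a fuller stop word corpus, but it requires a separate data download, so it is not used here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Data scientists are running experiments"

# Tokenizing: split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Filtering stop words with a tiny hand-written set (an assumption for
# this sketch, not NLTK's own stop word list).
stop_words = {"are", "the", "a", "is"}
filtered = [t for t in tokens if t.lower() not in stop_words]

# Stemming: reduce each remaining word to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]

print(tokens)
print(filtered)
print(stems)
```

Stems are not always dictionary words, since the stemmer only strips suffixes by rule; lemmatizing, also in the list above, is the feature to use when real words are needed.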

9. Beautiful Soup

Getting information from web pages, often known as web scraping, is an important Data Science task, and Beautiful Soup is a library built for it. You might have wondered why it is called “Beautiful Soup”. This is because it converts the poorly formed markup found on the web, referred to as “tag soup”, into a more understandable and usable form.

Installation

For a standard Python environment, use the following command.

pip install beautifulsoup4

For a conda environment, use the following command.

conda install -c anaconda beautifulsoup4

Visit the official Beautiful Soup website to know more.
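As a minimal sketch of turning “tag soup” into something usable, the code below parses a small HTML snippet whose list items are never closed. The snippet and its example.com links are made up for illustration; in real scraping the HTML would come from a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with unclosed <li> tags, for illustration only.
html = """
<ul>
  <li><a href="https://example.com/numpy">NumPy</a>
  <li><a href="https://example.com/pandas">Pandas</a>
</ul>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# Collect the text and the target of every link in the document.
names = [a.get_text() for a in soup.find_all("a")]
links = [a["href"] for a in soup.find_all("a")]
print(names)
print(links)
```

Even though the markup is not well formed, both links are found, which is exactly the situation the library is designed for.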

10. OpenCV

OpenCV is an open-source library used for image and video processing. After processing the images, the library can be used to detect objects, human faces, handwriting, and more. As a result, it is a powerful tool for computer vision and artificial intelligence applications.

Applications

Some of the applications of OpenCV are listed below.

  • Face detection.
  • Face Recognition.
  • Object detection.
  • Automated inspection and surveillance.
  • Medical image analysis.
  • Driverless car navigation and control.
  • Robots

Installation

For a standard Python environment, use the following command.

pip install opencv-python

For a conda environment, use the following command.

conda install -c conda-forge opencv

11. Face Recognition

Built using dlib’s state-of-the-art face recognition model, the Face Recognition library has an accuracy of 99.38% on the Labeled Faces in the Wild benchmark, which is more than enough for many computer vision applications. It also comes with a simple face_recognition command-line tool that lets you do face recognition from the command line. Some of the applications of this library are listed below.

Applications

  • Finding faces in pictures.
  • Manipulate the facial features in the pictures that we have found.
  • Give an identity to these faces.

Installation

For a Python environment on macOS or Linux, use the following command.

pip3 install face_recognition

To install this library on Windows, use the following tutorial.

Windows Installation Tutorial · Issue #175 · ageitgey/face_recognition (github.com)

Hope you got an idea about Python Libraries for Data Science.

Thanks for reading!

After learning all these libraries, you will be well prepared to take on Data Science problems and projects. But don’t stop learning here, because the world of Data Science is vast and there is much more to learn. Make sure you practice what you have learned, because only through practice can you master anything. After that, apply that knowledge to solving real-life problems.

I hope you have enjoyed the article. To keep up with new articles, visit the website.

Now take this self-assessment test related to this article.
