Humans are widely regarded as the most intelligent species on our planet, and our path to that intelligence has a lot to do with our ability to communicate with each other. More than 7,000 languages are spoken in different parts of the world today. Collectively, they are called human languages or natural languages.
In the world of Data Science, it is often estimated that around 80% of available data is unstructured, and much of it is text written in human language. This data is collected from sources like Instagram, Facebook, Twitter, etc. To analyze this data and understand the insights it provides, we need techniques such as Text Mining and Natural Language Processing.
Text Mining
Text mining is the process of structuring unstructured input text, analyzing it, deriving patterns from it, and evaluating those patterns to get meaningful output that can be used for solving problems.
Natural Language Processing
Natural Language Processing (NLP) is an interdisciplinary field of computer science and linguistics that studies communication between machines and humans using natural language. Its main objective is to enable machines to understand human language and to communicate in it about any subject as seamlessly as a human.
Applications of NLP
NLP is one of the most exciting areas of study in the world of Machine Learning. Some of the most interesting applications of NLP are given below:
- Sentiment Analysis
- Chatbots & Virtual Assistants
- Machine Translation
- Speech Recognition
Let’s learn some of the basic tasks done in NLP.
Components of NLP
- Tokenization
- Stemming
- Lemmatization
- POS Tagging
- Named Entity Recognition
- Chunking
Tokenization
Tokenization is usually the first step in NLP. It is the process of splitting a string into small units, or tokens, which can be used for further analysis.
Example:
The sentence “India is my country” can be divided into 4 tokens: “India”, “is”, “my”, and “country”.
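A minimal sketch of word-level tokenization using only Python's standard `re` module. The function name `tokenize` and the regular expression are illustrative assumptions; library tokenizers such as NLTK's `word_tokenize` handle many more cases (contractions, abbreviations, and so on).

```python
import re

def tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("India is my country")
print(tokens)  # ['India', 'is', 'my', 'country']
```

Splitting on whitespace alone would also work for this sentence, but the pattern above additionally separates punctuation, so "country." would become two tokens rather than one.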
Stemming
Stemming is the process of converting different tokens into their root word, or word stem. For example, the stem of the words eating, eaten, eats, etc. is eat.
Stemming is done by cutting off the prefixes or suffixes of a word to form the root word. But this doesn’t always work. For example, gone, going, and went all share the root word ‘go’, but stemming cannot recover it for irregular forms such as ‘gone’ and ‘went’. So we use another technique called lemmatization.
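The suffix-cutting idea can be sketched in a few lines. This is a deliberately naive toy, not the Porter algorithm or any library's stemmer: the suffix list, the function name `naive_stem`, and the minimum stem length are all assumptions chosen to fit the examples in the text.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "en", "es", "s")

def naive_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word  # no suffix matched: return the word unchanged

for w in ["eating", "eaten", "eats", "going", "gone", "went"]:
    print(w, "->", naive_stem(w))
# eating -> eat, eaten -> eat, eats -> eat, going -> go, gone -> gone, went -> went
```

The last two results show the limitation: no amount of suffix cutting turns "gone" or "went" into "go".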
Lemmatization
Lemmatization performs a morphological analysis of the word and returns its base form, known as a lemma. This means that lemmatization uses a dictionary to identify the root word rather than just cutting off suffixes and prefixes.
It’s considerably slower than stemming but is more accurate.
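The dictionary-lookup part of lemmatization can be illustrated as follows. The tiny `LEMMA_DICT` below is a hypothetical stand-in for a real lexical database (NLTK's `WordNetLemmatizer`, for instance, relies on WordNet), and the sketch skips the morphological analysis a real lemmatizer performs.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma
LEMMA_DICT = {
    "gone": "go", "going": "go", "went": "go", "goes": "go",
    "eating": "eat", "eaten": "eat", "eats": "eat", "ate": "eat",
}

def lemmatize(word):
    """Return the dictionary lemma, or the lowercased word if it is not listed."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

print([lemmatize(w) for w in ["gone", "going", "went"]])  # ['go', 'go', 'go']
```

Because the answer comes from a lookup rather than from cutting characters, irregular forms such as "went" resolve correctly, at the cost of needing the dictionary in the first place.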
Parts of Speech Tagging
In simple terms, it is the process of mapping each word in a sentence to a particular part of speech (Noun, Verb, Adverb, Adjective, etc.)
Example: I (Pronoun) want (Verb) an (Determiner) early (Adjective) upgrade (Noun)
It is done to understand the grammatical structure of a sentence.
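As a rough illustration, the mapping from words to parts of speech can be sketched as a lookup tagger. The hand-written `LEXICON` and the function `lookup_tag` are hypothetical; real taggers (for example NLTK's `pos_tag`) are trained on annotated corpora and take context into account.

```python
# Hypothetical hand-written lexicon covering only the example sentence
LEXICON = {
    "i": "Pronoun",
    "want": "Verb",
    "an": "Determiner",
    "early": "Adjective",
    "upgrade": "Noun",
}

def lookup_tag(sentence, default="Unknown"):
    """Tag each whitespace-separated word with its lexicon entry."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]

tagged = lookup_tag("I want an early upgrade")
print(tagged)
# [('I', 'Pronoun'), ('want', 'Verb'), ('an', 'Determiner'),
#  ('early', 'Adjective'), ('upgrade', 'Noun')]
```

A lookup tagger like this always gives a word the same tag, so it cannot handle words whose part of speech depends on how they are used.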
But sometimes a word can have multiple POS tags based on how it is used. For example, we often say “Google that on the internet” when we mean searching for something. Here Google describes the act of searching, so it is a verb. But Google is also the name of a company, and in that sense it is a noun.
To identify words that refer to names like this, we use another technique called named entity recognition.
Named Entity Recognition
Named entity recognition is used to identify the named entities in a sentence and classify them into pre-defined categories like persons, places, quantities, organizations, etc.
Example: My name is Deepak Jose (Person). I am from India (Place). I write articles for the Know Industrial Engineering website (Website).
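One simple way to sketch this is a gazetteer lookup, where known names are matched against the text. The `GAZETTEER` contents and the function `find_entities` are assumptions made for illustration; practical NER systems (such as NLTK's `ne_chunk`) use trained models rather than a fixed list.

```python
import re

# Hypothetical gazetteer: known entity names and their categories
GAZETTEER = {
    "Deepak Jose": "Person",
    "India": "Place",
    "Know Industrial Engineering": "Website",
}

def find_entities(text):
    """Return (entity, category) pairs in the order they appear in text."""
    found = []
    for name, label in GAZETTEER.items():
        # \b keeps "India" from matching inside a longer word such as "Indiana"
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, label))
    found.sort()  # order by position in the text
    return [(name, label) for _, name, label in found]

text = ("My name is Deepak Jose. I am from India. "
        "I write articles for the Know Industrial Engineering website.")
entities = find_entities(text)
print(entities)
# [('Deepak Jose', 'Person'), ('India', 'Place'),
#  ('Know Industrial Engineering', 'Website')]
```

A gazetteer only finds names it already knows, which is why trained models are preferred for open-ended text.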
Now after we have tokenized the sentence, found root words, attached POS tags, and identified the named entities, it’s time for us to put together all of these to make some sense out of it.
Chunking
Chunking is the process of grouping tokens (words), based on their POS tags, into meaningful phrases such as noun phrases and verb phrases.
Example: In the sentence “We saw the yellow dog”, ‘We’ is a noun phrase, ‘saw’ is a verb, and ‘the yellow dog’ is a noun phrase, where ‘the’ is a determiner, ‘yellow’ is an adjective, and ‘dog’ is a noun.
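The grouping in this example can be sketched by running a regular expression over the sequence of POS tags. The tag names, the one-letter `TAG_CODES`, and the noun-phrase pattern are illustrative assumptions, and the input is tagged by hand; NLTK's `RegexpParser` offers a fuller version of the same idea.

```python
import re

# Hand-tagged input; in practice the tags would come from a POS tagger
tagged = [("We", "PRON"), ("saw", "VERB"), ("the", "DET"),
          ("yellow", "ADJ"), ("dog", "NOUN")]

# One letter per tag so a regular expression can scan the tag sequence
TAG_CODES = {"PRON": "P", "VERB": "V", "DET": "D", "ADJ": "A", "NOUN": "N"}

# Noun phrase: a pronoun, or an optional determiner, any adjectives, then noun(s)
NP_PATTERN = re.compile(r"P|D?A*N+")

def chunk_noun_phrases(tagged_tokens):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "?") for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

print(chunk_noun_phrases(tagged))  # ['We', 'the yellow dog']
```

Because each token maps to exactly one character, match positions in the code string line up directly with positions in the token list.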
Python library for NLP
NLTK – NLTK (Natural Language Toolkit) is a Python library that can be used for all the natural language processing tasks mentioned above.
If you want to learn how to do all of these tasks using Python, here is a tutorial for you to try them out hands-on.
About the Author
Deepak Jose is a B-Tech CS student with a passion for Data Science who loves learning about Data Science, coding, and science in general, and does data analysis and visualization as a hobby. Although on the Computer Science path, Deepak always finds time to learn about space, automobiles, geography, energy, architecture, arts, etc. Deepak also loves solving problems and learning about new inventions.