Module 5
Specialization in Business Analytics
Course Content
• Natural language processing with Python and Excel
• Deep learning with Excel and TensorFlow
• Business analytics with MySQL and Python
• Business intelligence with Power BI and Tableau
• Data engineering with PySpark and Sqoop
Natural Language Processing
• Only 21% of the available data is present in structured form.
• Data is generated constantly as we speak, tweet, send messages
on WhatsApp and carry out various other activities; this data is textual and
highly unstructured in nature.
• Natural Language Processing (NLP) helps you extract insights from
customer emails, tweets, and text messages.
• To produce significant and actionable insights from text data, it is
important to get acquainted with the techniques and principles
of Natural Language Processing (NLP).
NLP Definition/ Meaning
• Natural language processing (NLP) is a field that focuses on making
natural human language usable by computer programs.
• NLTK (Natural Language Toolkit) is a Python package that one can use
for NLP.
• NLP is a branch of data science that consists of systematic processes for
analyzing, understanding, and deriving information from the text data
in a smart and efficient manner.
• By utilizing NLP and its components, one can organize the massive
chunks of text data, perform numerous automated tasks and solve a
wide range of problems such as – speech recognition, sentiment
analysis, topic segmentation etc.
Terms in NLP
• Tokenization – the process of splitting up text by word or by sentence; the
first step in turning unstructured text into structured data.
• Tokens – words or entities present in the text
• Text object – a sentence or a phrase or a word or an article
• Natural Language Toolkit (NLTK):
- is a popular open-source Python library for natural language
processing (NLP).
- It includes packages that help machines understand human
languages and respond appropriately. NLTK can be used for a variety of
tasks, including data cleaning, visualization, and tokenization.
Tokenization
• Tokenizing by word: e.g. ‘Today’, ‘is’, ‘Monday’
• Tokenizing by sentence
• In Python, import the relevant parts of NLTK so one can tokenize by word
and by sentence:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
Text Preprocessing
• The entire process of cleaning and standardization of text, making it
noise-free and ready for analysis is known as text preprocessing.
• It predominantly comprises three steps:
• Noise Removal, e.g. removing stop words (is, the, am), URLs, and links
• Lemmatization, e.g. converting play, player, plays, and played to play
• Object Standardization, e.g. expanding acronyms and hashtags
Stemming vs Lemmatization
• Stemming: Stemming is the process of removing the last few characters of
a given word, to obtain a shorter form
• Its primary goal is to reduce words to their base or root form, known as
the stem.
• E.g. “history” and “historical” both reduce to “histori”; similarly,
“finally” and “final” reduce to “fina”.
• Use cases: sentiment analysis, spam classification, restaurant reviews
• Lemmatization: reduces a word to its lemma, which is an actual language
word with meaning.
• Use cases: chatbots, question answering
Stemming vs Lemmatization
Stemming:
• Stems or removes the last few characters from a word, often leading to
incorrect meanings and spelling.
• For instance, stemming the word ‘Caring‘ would return ‘Car‘.
• Used for large datasets where performance is an issue.
Lemmatization:
• Considers the context and converts the word to its meaningful base form,
which is called the Lemma.
• For instance, lemmatizing the word ‘Caring‘ would return ‘Care‘.
• Computationally expensive since it involves look-up tables etc.
Steps in NLP
• Tokenization: The first step is to break down a text into individual words or
tokens.
• POS Tagging: Parts-of-speech tagging involves assigning a grammatical
category (like noun, verb, adjective, etc.) to each token.
• Lemmatization: Once each word has been tokenized and assigned a part-of-
speech tag, the lemmatization algorithm uses a lexicon or linguistic rules to
determine the lemma of each word. For example, the lemma of “running” is
“run,” and the lemma of “better” (in the context of an adjective) is “good.”
• Applying Rules: Lemmatization algorithms often rely on linguistic rules and
patterns. For irregular verbs or words with multiple possible lemmas, these
rules help in making the correct lemmatization decision.
• Output: The result of lemmatization is a set of words in their base or
dictionary form, making it easier to analyze and understand the underlying
meaning of a text.
Uses of NLP
• Classify documents. For instance, you can label documents as sensitive or spam.
• Summarize text by identifying the entities that are present in the document.
• Tag documents with keywords. For the keywords, NLP can use identified
entities.
• Do content-based search and retrieval. Tagging makes this functionality
possible.
• Summarize a document's important topics. NLP can combine identified entities
into topics.
• Categorize documents for navigation. For this purpose, NLP uses detected
topics.
• Enumerate related documents based on a selected topic. For this purpose, NLP
uses detected topics.