
ASSIGNMENT

SESSION: FEBRUARY - MARCH 2024

PROGRAM: MASTER OF BUSINESS ADMINISTRATION (MBA)
SEMESTER: IV
COURSE CODE & NAME: DADS402 - UNSTRUCTURED DATA ANALYSIS
NAME: NAMRITA MISHRA
ROLL NUMBER: 2214511886

Assignment Set – 1
Question-1 (a) List down a few differences between structured and unstructured data.
(b) What is the difference between Text and Big data?
Answer- 1 (a) Differences between Structured and Unstructured Data

Structured Data:

1. Format: Structured data is organized in a predefined format, typically in rows and
columns. Examples include databases, spreadsheets, and tables.
2. Storage: It is stored in relational databases (SQL databases) where data relationships
are well-defined.
3. Ease of Analysis: Structured data is easier to analyze using traditional data processing
techniques and tools due to its organized nature.
4. Schema: Structured data relies on a fixed schema that defines the data types,
relationships, and constraints.
5. Examples: Customer information in CRM systems, transaction records, inventory
data, and financial data.

Unstructured Data:

1. Format: Unstructured data lacks a predefined format or organizational structure. It
can include text, images, videos, and audio files.
2. Storage: It is stored in non-relational databases (NoSQL databases) or data lakes,
which are designed to handle diverse data types.
3. Ease of Analysis: Analyzing unstructured data is more complex and often requires
advanced techniques such as natural language processing (NLP), machine learning,
and image recognition.
4. Schema: Unstructured data does not have a fixed schema, making it more flexible but
also more challenging to manage.
5. Examples: Emails, social media posts, documents, multimedia content, and sensor
data from IoT devices.

(b) Difference between Text and Big Data

Text Data:
1. Definition: Text data refers to data that is in textual form, including written or printed
words. It is typically unstructured and can be found in documents, emails, social
media posts, and web pages.
2. Volume: The volume of text data can vary from small to large datasets, but it does not
necessarily encompass the vast scale associated with big data.
3. Analysis Techniques: Text data analysis involves techniques like text mining, natural
language processing (NLP), sentiment analysis, and keyword extraction.
4. Sources: Text data is generated from sources such as books, articles, emails, social
media, and chat logs.
5. Tools: Tools for text data analysis include NLP libraries (like NLTK and SpaCy), text
mining software, and sentiment analysis tools.

Big Data:

1. Definition: Big data refers to extremely large and complex datasets that cannot be
easily managed or processed with traditional data processing tools. It encompasses
structured, unstructured, and semi-structured data.
2. Volume: Big data involves vast volumes of data that are continuously generated at
high velocity. It is characterized by the three Vs: Volume, Velocity, and Variety.
3. Analysis Techniques: Analyzing big data requires advanced analytics, including
machine learning, artificial intelligence, distributed computing, and big data
frameworks like Hadoop and Spark.
4. Sources: Big data sources are diverse and include transactional systems, social media,
IoT devices, sensors, logs, and multimedia content.
5. Tools: Tools for big data processing and analysis include Hadoop, Spark, NoSQL
databases (like MongoDB and Cassandra), data warehouses, and big data analytics
platforms (like Apache Flink and Google BigQuery).

Question-2 (a) What is a word cloud? What are some libraries that you need to import to
create a word cloud in Python?

(b) What is a naive Bayes classifier and how does it work in text classification?

Answer-2 Word Cloud: A word cloud is a visual representation of text data where the size
of each word indicates its frequency or importance within the text. It helps in identifying the
most prominent terms in a body of text and can be a powerful tool for data visualization and
text analysis. Words that appear more frequently in the source text are displayed in larger
fonts, while less frequent words are shown in smaller fonts.

Libraries for Creating Word Clouds in Python: To create a word cloud in Python, you
typically need to import the following libraries:

1. wordcloud: This is the primary library for generating word clouds. It provides
functions to create and customize word clouds from text data.
2. matplotlib: This library is used for plotting and visualizing the word cloud.
3. numpy (optional): Useful for handling arrays and numerical data, often used for
image masking when creating custom-shaped word clouds.
4. PIL (Python Imaging Library) or its fork Pillow: Used for image manipulation,
such as creating masks or adding color to the word cloud.
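
As a small illustrative sketch of how these libraries fit together (the sample text, figure size, and styling below are arbitrary choices made for the example):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sample text; in practice this would come from documents, reviews, or posts.
text = "data analysis text mining data visualization data science machine learning data"

# Generate the word cloud; more frequent words are drawn in larger fonts.
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")  # hide the axes for a cleaner visual
plt.show()
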
(b) Naive Bayes Classifier and Its Working in Text Classification

Naive Bayes Classifier: The Naive Bayes classifier is a probabilistic machine learning
algorithm based on Bayes' Theorem, used primarily for classification tasks. It is called
"naive" because it assumes that the features (in this case, words) are independent of each
other, which is rarely true in real-world data but simplifies the computation significantly.

How Naive Bayes Works in Text Classification: In text classification, the Naive Bayes
classifier is commonly used due to its simplicity and effectiveness. It works by calculating
the probability of each category (or class) given the words in a document. The category with
the highest probability is then assigned to the document.

Steps Involved:

1. Training Phase:
o Calculate Prior Probabilities: Determine the prior probability of each class
based on the training data. This is the probability of any document belonging
to a specific class.
o Calculate Likelihoods: For each word in the vocabulary, calculate the
likelihood of that word given each class. This involves counting the frequency
of each word in documents of a particular class and normalizing it by the total
number of words in that class.

2. Prediction Phase:
o Calculate Posterior Probabilities: For a new document, calculate the
posterior probability for each class by combining the prior probabilities and
the likelihoods of the words in the document using Bayes' Theorem.
o Class Assignment: Assign the class with the highest posterior probability to
the document.

Mathematical Representation: By Bayes' Theorem,

P(c|d) = [P(d|c) · P(c)] / P(d)

Where:

 P(c|d) is the posterior probability of class c given document d.
 P(d|c) is the likelihood of document d given class c.
 P(c) is the prior probability of class c.
 P(d) is the probability of document d (a normalizing constant).

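The training and prediction phases above can be sketched with scikit-learn's MultinomialNB; the toy documents and labels below are invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled documents (invented examples).
docs = ["great product, loved it", "terrible service, very bad",
        "excellent quality and fast delivery", "bad experience, would not recommend"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words counts supply the word frequencies used for the likelihoods P(word|class).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Training phase: fit estimates the priors P(c) and the word likelihoods.
clf = MultinomialNB()
clf.fit(X, labels)

# Prediction phase: the class with the highest posterior probability is returned.
print(clf.predict(vectorizer.transform(["loved the fast delivery"])))
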
Conclusion

Word clouds are a simple yet effective way to visualize the frequency of words in a text,
using libraries like wordcloud and matplotlib in Python. The Naive Bayes classifier, on the
other hand, is a powerful algorithm for text classification that leverages the principles of
Bayes' Theorem to predict the likelihood of different classes based on the occurrence of
words in documents. Despite its simplicity and the assumption of feature independence, it
performs remarkably well in many text classification tasks.

Question- 3 (a) What is the Machine Learning approach in sentiment analysis?


(b) What are some applications of topic modeling?

Answer- 3 (a) Machine Learning Approach in Sentiment Analysis

Sentiment Analysis: Sentiment analysis, also known as opinion mining, is the process of
using natural language processing (NLP), text analysis, and computational linguistics to
identify and extract subjective information from text. The goal is to determine the sentiment
expressed in a text, whether it is positive, negative, or neutral.

Machine Learning Approach: The machine learning approach to sentiment analysis
involves training algorithms on labeled datasets where the sentiment is predefined. These
algorithms learn patterns and features associated with different sentiments and apply this
knowledge to new, unseen data.

Steps Involved:

1. Data Collection:
o Gather a large corpus of text data from sources such as social media, reviews,
blogs, and forums. This data should be labeled with sentiments (e.g., positive,
negative, neutral).

2. Data Preprocessing:
o Text Cleaning: Remove noise such as punctuation, special characters, and
stop words.
o Tokenization: Split text into individual words or tokens.
o Normalization: Convert text to lowercase, stem or lemmatize words to their
base forms.

3. Feature Extraction:
o Bag of Words (BoW): Represent text by the frequency of words appearing in
the document.
o TF-IDF (Term Frequency-Inverse Document Frequency): Weigh the
importance of words based on their frequency in a document relative to the
entire corpus.
o Word Embeddings: Use pre-trained embeddings like Word2Vec or GloVe to
capture semantic meaning.

4. Model Training:
o Algorithms: Train machine learning models such as Logistic Regression,
Naive Bayes, Support Vector Machines (SVM), or advanced deep learning
models like Recurrent Neural Networks (RNN) and Transformers.
o Training: Split the data into training and testing sets and train the model on
the training data while validating on the test data.

5. Model Evaluation:
o Evaluate the model using metrics like accuracy, precision, recall, F1-score,
and AUC-ROC to assess its performance.

6. Prediction:
o Use the trained model to predict the sentiment of new, unseen text data.
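
A compact sketch of these steps using scikit-learn, with TF-IDF feature extraction and a Logistic Regression classifier (the tiny labeled corpus below is a placeholder; a real project would use thousands of labeled examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder labeled data (invented examples).
texts = ["love this phone", "worst purchase ever", "works perfectly fine",
         "completely useless", "amazing battery life", "awful screen quality"]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Feature extraction (TF-IDF) followed by model training (Logistic Regression).
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Model evaluation with precision, recall, and F1-score; then prediction on new text.
print(classification_report(y_test, model.predict(X_test)))
print(model.predict(["the battery is amazing"]))
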
(b) Applications of Topic Modelling

Topic Modelling: Topic modelling is a type of statistical model used to discover the abstract
topics that occur in a collection of documents. It helps in identifying hidden patterns in the
text and clustering documents based on topics. Latent Dirichlet Allocation (LDA) is one of
the most commonly used algorithms for topic modelling.
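
As a brief illustration, LDA can be fitted with scikit-learn; the four short documents and the choice of two topics below are assumptions made for the example:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny example corpus (invented documents).
docs = ["stock market prices rise as investors buy shares",
        "the team wins the football match in the final minute",
        "investors watch market trends and share prices",
        "the player scores twice and the team celebrates the match"]

# LDA works on word-count features.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 2-topic model; the number of topics is a modelling choice.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")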

Applications:

1. Content Recommendation:
o Application: Streaming services like Netflix and Spotify use topic modelling
to recommend content to users. By analyzing the topics of movies, TV shows,
or music, they can suggest similar content that matches the user's preferences.
o Example: A user who watches a lot of science fiction movies might get
recommendations for new sci-fi releases based on topic analysis of their
viewing history.

2. Document Classification:
o Application: Topic modelling helps in classifying documents into predefined
categories. This is useful in organizing large repositories of documents, such
as news articles, research papers, and legal documents.
o Example: Classifying news articles into categories like sports, politics,
technology, and entertainment based on the topics they discuss.

3. Trend Analysis:
o Application: Analyzing social media posts, blogs, and news articles over time
to identify emerging trends and public opinions.
o Example: Businesses can use topic modelling to detect changes in consumer
sentiment and preferences, allowing them to adapt their marketing strategies
accordingly.

4. Customer Feedback Analysis:


o Application: Understanding customer feedback from reviews, surveys, and
support tickets to identify common issues and areas for improvement.
o Example: An e-commerce company can use topic modelling to analyze
product reviews and determine the most frequently mentioned problems, such
as delivery delays or product defects.

5. Academic Research:
o Application: Helping researchers organize and analyze large volumes of
academic papers by identifying the main topics of research and clustering
related papers together.
o Example: A researcher studying climate change can use topic modelling to
find and group papers discussing similar subtopics, such as carbon emissions,
renewable energy, and climate policy.

Conclusion:

The machine learning approach to sentiment analysis involves data preprocessing, feature
extraction, model training, and evaluation, allowing for accurate prediction of sentiments in
text data. Topic modelling, on the other hand, has diverse applications across content
recommendation, document classification, trend analysis, customer feedback analysis, and
academic research, enabling organizations and individuals to extract meaningful insights
from large text datasets.

Assignment Set – 2
Question- 4 (a) What is Fast Fourier Transform (FFT)?

(b) What is audio data preprocessing in machine learning?

Answer-4 (a) What is Fast Fourier Transform (FFT)?

The Fast Fourier Transform (FFT) is an efficient algorithm used to compute the Discrete
Fourier Transform (DFT) and its inverse. Fourier Transform is a mathematical technique that
transforms a function of time (or space) into a function of frequency. In the context of digital
signal processing, the DFT converts a sequence of complex numbers into another sequence of
complex numbers, representing the signal in the frequency domain.

Key Concepts:

 Discrete Fourier Transform (DFT): The DFT is defined by the formula
X_k = Σ_{n=0}^{N−1} x_n · e^(−i·2πkn/N),
where N is the number of points, x_n is the time-domain signal, and X_k is the
frequency-domain representation.

 Efficiency: Direct computation of the DFT requires O(N²) operations, where N is the
number of data points. The FFT reduces this complexity to O(N log N), making it much
faster and practical for large datasets.
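
A short NumPy sketch of this idea, using a synthetic 50 Hz sine wave as an assumed test signal:

import numpy as np

fs = 1000                              # sampling rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)            # one second of time samples
x = np.sin(2 * np.pi * 50 * t)         # 50 Hz sine wave in the time domain

# FFT computes the frequency-domain representation X_k of the time-domain signal x_n.
X = np.fft.fft(x)
freqs = np.fft.fftfreq(len(x), d=1 / fs)

# The magnitude spectrum peaks near 50 Hz, recovering the signal's frequency.
peak_freq = freqs[np.argmax(np.abs(X[:len(x) // 2]))]
print(peak_freq)                       # approximately 50.0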

Applications:

 Signal Processing: FFT is widely used in audio, image, and speech processing to
analyze the frequency components of signals.
 Communication Systems: Used in modulation, demodulation, and signal
compression techniques.
 Biomedical Engineering: Analyzing the frequency content of EEG, ECG, and other
biomedical signals.
 Astronomy: Processing radio signals from space.

(b) What is Audio Data Preprocessing in Machine Learning?

Audio data preprocessing is a critical step in preparing raw audio signals for machine
learning tasks. It involves transforming the audio data into a format that can be effectively
used by machine learning algorithms. The primary goals of audio preprocessing are to reduce
noise, extract relevant features, and normalize the data.

Steps in Audio Data Preprocessing:

1. Loading and Resampling:


o Loading: Audio data is typically loaded using libraries like librosa or pydub.
o Resampling: Standardizing the sample rate ensures consistency across
different audio files. Common sample rates are 16 kHz or 44.1 kHz.

2. Noise Reduction:
o Filtering: Applying filters to remove background noise and unwanted
frequencies.
o Spectral Gating: Reducing noise based on the spectral properties of the audio
signal.

3. Segmentation:
o Silence Removal: Cutting out silent sections of the audio to focus on the
meaningful parts.
o Framing: Dividing the audio signal into short frames (e.g., 20-40
milliseconds) for analysis.

4. Feature Extraction:
o Time-Domain Features: Extracting features like zero-crossing rate
(frequency of sign changes) and energy (signal strength).
o Frequency-Domain Features: Using FFT to transform the signal and extract
features like spectral centroid (brightness), spectral bandwidth (spread), and
Mel-Frequency Cepstral Coefficients (MFCCs), which are particularly useful
for speech and audio recognition.
o Temporal Features: Extracting features that capture the changes over time,
such as delta and delta-delta MFCCs.

5. Normalization:
o Scaling: Normalizing the amplitude of the audio signal to a standard range
(e.g., between -1 and 1) to ensure consistency.
o Standardization: Adjusting the mean and variance of features to improve the
performance of machine learning models.

6. Data Augmentation:
o Techniques: Applying transformations like pitch shifting, time stretching, and
adding background noise to increase the diversity of the training data.
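
A brief sketch of a few of these steps with the librosa library (the file name, sample rate, and parameter values below are placeholder assumptions):

import numpy as np
import librosa

# Loading and resampling to a standard 16 kHz rate ("speech.wav" is a placeholder path).
y, sr = librosa.load("speech.wav", sr=16000)

# Silence removal: trim leading and trailing quiet sections.
y_trimmed, _ = librosa.effects.trim(y, top_db=20)

# Normalization: scale the amplitude into the range [-1, 1].
y_norm = y_trimmed / np.max(np.abs(y_trimmed))

# Feature extraction: 13 MFCCs computed over short frames (librosa default framing).
mfccs = librosa.feature.mfcc(y=y_norm, sr=sr, n_mfcc=13)
print(mfccs.shape)   # (13, number_of_frames)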

Applications:

 Speech Recognition: Converting spoken language into text.


 Music Classification: Categorizing music by genre, mood, or artist.
 Speaker Identification: Recognizing individuals based on their voice.
 Sound Event Detection: Identifying specific sounds within an audio clip, such as
sirens or dog barks.

Question- 5 (a) What are the benefits of using histogram equalization?

(b) What is the advantage of using a CNN for image classification?


Answer- 5 (a) Benefits of Using Histogram Equalization

Histogram Equalization: Histogram equalization is a technique in image processing used to
improve the contrast of an image. This method adjusts the intensity distribution of an image
to span a broader range, enhancing its visual quality. The process involves redistributing the
image's histogram so that the output image has a more uniform histogram, which generally
improves the visibility of features in the image.
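
As a small illustration, OpenCV applies this redistribution to a grayscale image with a single call (the file names below are placeholders):

import cv2

# Read the image in grayscale ("input.jpg" is a placeholder path).
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Redistribute pixel intensities so the output histogram is approximately uniform.
equalized = cv2.equalizeHist(img)

# Save the contrast-enhanced result.
cv2.imwrite("equalized.jpg", equalized)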

Benefits:

1. Enhanced Contrast:
o Histogram equalization increases the global contrast of images, especially
when the usable data of the image is represented by close contrast values. By
stretching out the intensity range, it makes the dark regions darker and bright
regions brighter, thus improving the overall visibility.

2. Better Feature Representation:


o By improving the contrast, histogram equalization can make important
features more discernible. This is particularly useful in medical imaging (e.g.,
X-rays, MRI scans) where subtle details need to be highlighted for accurate
diagnosis.

3. Improved Detail Visibility:


o Details in shadowed or highlighted regions become more visible. This is
beneficial in applications like satellite imagery, where enhancing details can
lead to better interpretation of geographical data.

4. Uniform Histogram:
o The process aims to produce a uniform histogram, which means the pixel
intensity values are evenly distributed. This can lead to better performance in
various computer vision tasks since the dynamic range of the pixel values is
maximized.

5. Preprocessing for Further Analysis:


o Histogram equalization is often used as a preprocessing step in image
processing and computer vision tasks. By normalizing the intensity
distribution, it prepares the image for subsequent analysis, such as edge
detection, object recognition, and image segmentation, improving the accuracy
and robustness of these tasks.

(b) Advantages of Using a Convolutional Neural Network (CNN) for Image Classification

Convolutional Neural Networks (CNNs): CNNs are a class of deep neural networks
specifically designed for processing structured grid data, such as images. They have proven
highly effective for various image-related tasks, including image classification, object
detection, and segmentation.

Advantages:

1. Automatic Feature Extraction:


o CNNs automatically learn and extract features from raw image data. Unlike
traditional methods that require manual feature extraction, CNNs use
convolutional layers to learn spatial hierarchies of features, such as edges,
textures, and shapes, through backpropagation.

2. Spatial Hierarchies:
o The convolutional layers in CNNs detect low-level features (e.g., edges and
textures) in the initial layers and higher-level features (e.g., objects and
shapes) in deeper layers. This hierarchical feature extraction is highly effective
for recognizing complex patterns in images.

3. Parameter Sharing:
o Convolutional layers use the same weights (filters) across different regions of
the image, significantly reducing the number of parameters compared to fully
connected layers. This parameter sharing makes CNNs more efficient and less
prone to overfitting, especially when dealing with large images.

4. Translation Invariance:
o CNNs inherently possess translation invariance due to their convolutional and
pooling operations. This means they can recognize objects regardless of their
position in the image. Pooling layers further enhance this property by
downsampling the feature maps, making the network robust to spatial
variations.

5. Reduction in Computational Complexity:


o The local connectivity of convolutional layers reduces the computational
complexity by focusing on small regions of the input image at a time. This
makes CNNs more efficient and scalable to larger and deeper networks,
enabling the processing of high-resolution images.
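
A compact sketch of such a network in Keras; the input resolution (64x64 RGB) and the number of classes (10) are assumptions for the example:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    # Convolution + pooling layers learn local features with shared filter weights.
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps and classify into the 10 assumed categories.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()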

Applications:

 Image Classification: Categorizing images into predefined classes, such as
identifying animals, vehicles, or landmarks.
 Object Detection: Identifying and localizing objects within an image, useful in
applications like autonomous driving and security surveillance.
 Image Segmentation: Partitioning an image into meaningful segments, used in
medical imaging and scene understanding.

Question- 6 (a) What are some common techniques used for video classification?

(b) What is the difference between feature extraction and feature selection?

Answer- 6 (a) Common Techniques Used for Video Classification

Video Classification: Video classification involves categorizing video clips into predefined
categories based on their content. This task is more complex than image classification due to
the temporal dimension of videos, requiring methods that can capture both spatial and
temporal information.
Common Techniques:

1. Convolutional Neural Networks (CNNs):


o 2D CNNs: Used to extract spatial features from individual frames of the
video. These networks treat each frame as a separate image, and features are
extracted independently from each frame.
o 3D CNNs: Extend 2D CNNs to the temporal dimension, allowing
simultaneous extraction of spatial and temporal features. 3D convolutions are
applied to a sequence of frames, capturing motion information effectively.
o Example: A 3D CNN could analyze a sequence of frames from a sports video
to classify the type of sport.

2. Recurrent Neural Networks (RNNs):


o LSTM (Long Short-Term Memory): A type of RNN designed to capture
long-term dependencies in sequential data. LSTMs can process frame-level
features extracted by CNNs, learning the temporal dynamics of the video.
o GRU (Gated Recurrent Unit): A simplified version of LSTM, also used to
handle temporal dependencies in video sequences.
o Example: LSTMs can be used to analyze a sequence of actions in a cooking
video to classify the recipe.

3. Two-Stream Networks:
o Spatial Stream: Processes spatial information from video frames using a 2D
CNN.
o Temporal Stream: Captures motion information using optical flow or
temporal differences between frames, often processed by another 2D CNN.
o Fusion: The outputs from both streams are fused to make the final
classification decision.
o Example: Two-stream networks can classify human activities in surveillance
videos by combining appearance and motion information.

4. Transformers:
o Self-Attention Mechanism: Transformers, originally designed for natural
language processing, have been adapted for video classification. They use self-
attention mechanisms to capture relationships between different parts of the
video sequence.
o Example: Vision transformers can process long video sequences by attending
to important frames and actions, classifying videos based on learned
representations.

5. Hybrid Models:
o Combination: These models combine CNNs for spatial feature extraction and
RNNs or transformers for temporal modeling. This approach leverages the
strengths of both types of networks.
o Example: A hybrid model might use a CNN to extract features from frames
and an LSTM to capture temporal dependencies, effectively classifying
complex video content like movie genres.
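
A schematic sketch of a hybrid model of this kind in Keras, where a small per-frame CNN feeds an LSTM; the frame count, resolution, and number of classes are assumed values for illustration:

from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, CHANNELS, NUM_CLASSES = 16, 64, 64, 3, 5  # assumed sizes

# Per-frame spatial feature extractor (a small 2D CNN).
frame_cnn = models.Sequential([
    layers.Input(shape=(HEIGHT, WIDTH, CHANNELS)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
])

# Apply the CNN to every frame, then model temporal dependencies with an LSTM.
model = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
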
(b) Difference Between Feature Extraction and Feature Selection

Feature Extraction: Feature extraction involves transforming raw data into a set of
meaningful features that can be used for machine learning tasks. The goal is to create new
features that represent the data's important characteristics, often reducing dimensionality
while retaining critical information.

Key Points:

 Transformation: Converts raw data into informative and non-redundant features.


 Dimensionality Reduction: Reduces the number of features by combining or
transforming original features into a smaller set of new features.
 Methods: Includes techniques like Principal Component Analysis (PCA),
Independent Component Analysis (ICA), and autoencoders.
 Example: In image processing, feature extraction might involve extracting edges,
textures, or shapes from images to create a set of descriptive features.

Feature Selection: Feature selection involves selecting a subset of relevant features from the
original dataset. The goal is to improve model performance by removing irrelevant or
redundant features, thus simplifying the model and reducing overfitting.

Key Points:

 Subset Selection: Chooses a subset of original features without transforming them.


 Relevance: Focuses on selecting features that are most relevant to the target variable.
 Methods: Includes techniques like filter methods (e.g., correlation coefficient scores),
wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g.,
regularization techniques like LASSO).
 Example: In a dataset with numerous attributes, feature selection might involve
choosing only the most important attributes, such as age, income, and education level,
for predicting credit risk.
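
A short side-by-side sketch with scikit-learn, using PCA for feature extraction and SelectKBest for feature selection; the synthetic data and the choice of three output features are assumptions:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 100 samples, 10 original features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# Feature extraction: PCA transforms the 10 original features into 3 new components.
X_extracted = PCA(n_components=3).fit_transform(X)

# Feature selection: keep the 3 original features most associated with the target.
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

print(X_extracted.shape, X_selected.shape)   # (100, 3) (100, 3)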
