
Introduction to Data Science

Unit 1

Data Science
Data science is the study of data that helps us derive useful insights for business decision-making. It is all about using tools, techniques, and creativity to uncover insights hidden within data. It combines math, computer science, and domain expertise to tackle real-world challenges in a variety of fields.

Data science processes raw data to solve business problems and even make predictions about future trends or requirements. For example, from the huge amount of raw data a company holds, data science can help answer questions such as:

• What do customers want?
• How can we improve our services?
• What will be the upcoming trend in sales?
• How much stock will be needed for the upcoming festival?

In short, data science empowers industries to make smarter, faster, and more informed decisions. To find patterns and achieve such insights, expertise in the relevant domain is required. For example, with expertise in healthcare, a data scientist can predict patient risks and suggest personalized treatments.

Data science involves these key steps:

• Data Collection: Gathering raw data from various sources, such as databases,
sensors, or user interactions.
• Data Cleaning: Ensuring the data is accurate, complete, and ready for analysis.
• Data Analysis: Applying statistical and computational methods to identify patterns,
trends, or relationships.
• Data Visualization: Creating charts, graphs, and dashboards to present findings
clearly.
• Decision-Making: Using insights to inform strategies, create solutions, or predict
outcomes.

Importance of Data Science


Here are some key reasons why it is so important:

• Helps Businesses in Decision-Making: By analyzing data, businesses can understand trends and make informed choices that reduce risks and maximize profits.
• Improves Efficiency: Organizations can use data science to identify areas where they
can save time and resources.
• Personalizes Experiences: Data science helps create customized recommendations
and offers that improve customer satisfaction.

• Predicts the Future: Businesses can use data to forecast trends, demand, and other
important factors.

• Drives Innovation: New ideas and products often come from insights discovered
through data science.

• Benefits Society: Data science improves public services like healthcare, education,
and transportation by helping allocate resources more effectively.

Real Life Example of Data Science

There are many examples of data science in use around you, for example social media, medicine, or preparing strategy for cricket or FIFA by analyzing past matches. Here are some more real-life examples:

• Social Media Recommendations:
Have you ever wondered why the Instagram Reels you see are aligned with your interests? These platforms use data science to analyze your past activity (likes, comments, watch history, etc.) and create personalized recommendations that serve content matching your interests.
• Early Diagnosis of Disease:
Data science can predict the risk of conditions like diabetes or heart disease by analyzing a patient's medical records and lifestyle habits. This allows doctors to act early and improve lives. In the future, it may help doctors detect diseases before symptoms even start to appear, for example predicting a tumor or cancer at a very early stage. Data science uses medical history and image data for such predictions.
• E-commerce recommendation and Demand Forecast:
E-commerce platforms like Amazon or Flipkart use data science to enhance the
shopping experience. By analyzing your browsing history, purchase behavior, and
search patterns, they recommend products based on your preferences. It can also help
in predicting demand for products by studying past sales trends, seasonal patterns etc.

Applications of Data Science

Data science has a wide range of applications across various industries, transforming how they operate and deliver results. Here are some examples:

• Data science is used to analyze patient data, predict diseases, develop personalized treatments, and optimize hospital operations.
• It helps detect fraudulent transactions, manage risks, and provide personalized financial advice.
• Businesses use data science to understand customer behavior, recommend products, optimize inventory, and improve supply chains.
• Data science powers innovations like search engines, virtual assistants, and recommendation systems.
• It enables route optimization, traffic management, and predictive maintenance for vehicles.
• Data science helps in designing personalized learning experiences, tracking student performance, and improving administrative efficiency.
• Streaming platforms and content creators use data science to recommend shows, analyze viewer preferences, and optimize content delivery.
• Companies leverage data science to segment audiences, predict campaign outcomes, and personalize advertisements.

Industries where data science is used

Data science is transforming every industry by unlocking the power of data. Here are some
key sectors where data science plays a vital role:
• Healthcare: Data science improves patient outcomes by using predictive analytics to
detect diseases early, creating personalized treatment plans and optimizing hospital
operations for efficiency.
• Finance: Data science helps detect fraudulent activities, assess and manage financial
risks, and provide tailored financial solutions to customers.

• Retail: Data science enhances customer experiences by delivering targeted marketing campaigns, optimizing inventory management, and forecasting sales trends accurately.

• Technology: Data science powers cutting-edge AI applications such as voice assistants, intelligent search engines, and smart home devices.

• Transportation: Data science optimizes travel routes, manages vehicle fleets effectively, and enhances traffic management systems for smoother journeys.

• Manufacturing: Data science predicts potential equipment failures, streamlines supply chain processes, and improves production efficiency through data-driven decisions.

• Energy: Data science forecasts energy demand, optimizes energy consumption, and facilitates the integration of renewable energy resources.

• Agriculture: Data science drives precision farming practices by monitoring crop health, managing resources efficiently, and boosting agricultural yields.

Important Data Science Skills


Data scientists need a mix of technical and soft skills to excel in this domain. To start with data science, it is important to learn the basics, such as mathematics and basic programming. Here are some essential skills for a successful career in data science:

• Programming: Proficiency in programming languages like Python, R, or SQL is crucial for analyzing and processing data effectively.

• Statistics and Mathematics: A strong foundation in statistics and linear algebra helps
in understanding data patterns and building predictive models.
• Machine Learning: Knowledge of machine learning algorithms and frameworks is
key to creating intelligent data-driven solutions.
• Data Visualization: The ability to present data insights through tools like Tableau,
Power BI, or Matplotlib ensures findings are clear and actionable.

• Data Wrangling: Skills in cleaning, transforming, and preparing raw data for
analysis are vital for maintaining data quality.

• Big Data Tools: Familiarity with tools like Hadoop, Spark, or cloud platforms helps
in handling large datasets efficiently.

• Critical Thinking: Analytical skills to interpret data and solve problems creatively
are essential for uncovering actionable insights.

• Communication: The ability to explain complex data findings in simple terms to stakeholders is a valuable asset.

Data Preprocessing in Data Science


Data preprocessing is the process of preparing raw data for analysis by cleaning and
transforming it into a usable format. In data mining it refers to preparing raw data for mining
by performing tasks like cleaning, transforming, and organizing it into a format suitable for
mining algorithms.

• Goal is to improve the quality of the data.

• Helps in handling missing values, removing duplicates, and normalizing data.

• Ensures the accuracy and consistency of the dataset.

Steps in Data Preprocessing


Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.
1. Data Cleaning: It is the process of identifying and correcting errors or inconsistencies in
the dataset. It involves handling missing values, removing duplicates, and correcting incorrect
or outlier data to ensure the dataset is accurate and reliable. Clean data is essential for
effective analysis, as it improves the quality of results and enhances the performance of data
models.

• Missing Values: These occur when data is absent from a dataset. You can either ignore the rows with missing data or fill the gaps manually, with the attribute mean, or with the most probable value. This ensures the dataset remains accurate and complete for analysis (a short pandas sketch follows this list).

• Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in
several ways:

o Binning Method: The data is sorted into equal segments, and each segment is
smoothed by replacing values with the mean or boundary values.

o Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to predict values.

o Clustering: This method groups similar data points together, with outliers either being undetected or falling outside the clusters. These techniques help remove noise and improve data quality.

• Removing Duplicates: It involves identifying and eliminating repeated data entries to ensure accuracy and consistency in the dataset. This process prevents errors and ensures reliable analysis by keeping only unique records.
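A minimal pandas sketch of these cleaning ideas, using a small made-up DataFrame (the column names and values are purely illustrative):

import numpy as np
import pandas as pd

# Hypothetical mini-dataset: one missing value and one duplicate row
df = pd.DataFrame({
    'age':   [25, 30, np.nan, 30, 47],
    'spend': [200, 350, 300, 350, 400],
})

# Missing values: fill gaps with the attribute mean
df['age'] = df['age'].fillna(df['age'].mean())

# Noisy data (binning): sort values into equal-width bins, then smooth by bin mean
df['age_bin'] = pd.cut(df['age'], bins=2)
df['age_smoothed'] = df.groupby('age_bin', observed=True)['age'].transform('mean')

# Removing duplicates: keep only unique records
df = df.drop_duplicates(subset=['age', 'spend'])

print(df)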
2. Data Integration: It involves merging data from various sources into a single, unified
dataset. It can be challenging due to differences in data formats, structures, and meanings.
Techniques like record linkage and data fusion help in combining data efficiently, ensuring
consistency and accuracy.
• Record Linkage is the process of identifying and matching records from different
datasets that refer to the same entity, even if they are represented differently. It helps
in combining data from various sources by finding corresponding records based on
common identifiers or attributes.

• Data Fusion involves combining data from multiple sources to create a more
comprehensive and accurate dataset. It integrates information that may be inconsistent
or incomplete from different sources, ensuring a unified and richer dataset for
analysis.
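As a rough illustration of data integration, here is a small pandas sketch that links records from two hypothetical sources on a common identifier; the simple merge stands in for record linkage in the easiest case, where a shared key already exists:

import pandas as pd

# Two hypothetical sources describing the same customers
crm = pd.DataFrame({'customer_id': [1, 2, 3],
                    'name': ['Asha', 'Ravi', 'Meena']})
orders = pd.DataFrame({'customer_id': [1, 1, 3],
                       'amount': [250, 400, 150]})

# Link records via the common identifier, then fuse them into one dataset
merged = crm.merge(orders, on='customer_id', how='left')
print(merged)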

3. Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and discretization,
which converts continuous data into discrete categories. These techniques help prepare the
data for more accurate analysis.
• Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
• Discretization: Converting continuous data into discrete categories for easier
analysis.

• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.

• Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to provide a higher-level view for better understanding and analysis.
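A small sketch of some of these transformations with pandas and scikit-learn, assuming a single numeric column with made-up values:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'income': [20000, 35000, 50000, 120000]})

# Normalization: scale values to a common [0, 1] range
df['income_norm'] = MinMaxScaler().fit_transform(df[['income']]).ravel()

# Discretization: convert the continuous values into labelled categories
df['income_band'] = pd.cut(df['income'], bins=3, labels=['low', 'mid', 'high'])

# Aggregation: summarize multiple data points (here, the mean income per band)
print(df.groupby('income_band', observed=True)['income'].mean())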

4. Data Reduction: It reduces the dataset's size while maintaining key information. This can
be done through feature selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional space while preserving
important details. It uses various reduction techniques such as,

• Dimensionality Reduction (e.g., Principal Component Analysis): A technique that reduces the number of variables in a dataset while retaining its essential information.

• Numerosity Reduction: Reducing the number of data points by methods like sampling to simplify the dataset without losing critical patterns.

• Data Compression: Reducing the size of data by encoding it in a more compact form, making it easier to store and process.
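A minimal scikit-learn sketch of dimensionality reduction with PCA, run on a small random feature matrix (the numbers are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

pca = PCA(n_components=2)              # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component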

Uses of Data Preprocessing

Data preprocessing is utilized across various fields to ensure that raw data is transformed into
a usable format for analysis and decision-making. Here are some key areas where data
preprocessing is applied:

1. Data Warehousing: In data warehousing, preprocessing is essential for cleaning, integrating, and structuring data before it is stored in a centralized repository. This ensures the data is consistent and reliable for future queries and reporting.

2. Data Mining: Data preprocessing in data mining involves cleaning and transforming raw
data to make it suitable for analysis. This step is crucial for identifying patterns and extracting
insights from large datasets.

3. Machine Learning: In machine learning, preprocessing prepares raw data for model
training. This includes handling missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets to improve model performance
and accuracy.

4. Data Science: Data preprocessing is a fundamental step in data science projects, ensuring
that the data used for analysis or building predictive models is clean, structured, and relevant.
It enhances the overall quality of insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs to extract
meaningful user behavior patterns. This can inform marketing strategies and improve user
experience through personalized recommendations.

6. Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning data to create dashboards and reports that provide actionable insights for decision-makers.

7. Deep Learning: As in machine learning, deep learning applications require preprocessing to normalize or enhance features of the input data, optimizing model training processes.

Advantages of Data Preprocessing


• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.

• Better Model Performance: Reduces noise and irrelevant data, leading to more
accurate predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better
business decisions.

Disadvantages of Data Preprocessing

• Time-Consuming: Requires significant time and effort to clean, transform, and organize data.

• Resource-Intensive: Demands computational power and skilled personnel for complex preprocessing tasks.

• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.

Data Cleaning

Data cleaning is a step in machine learning (ML) that involves identifying and removing any missing, duplicate, or irrelevant data.

• Raw data (log files, transactions, audio/video recordings, etc.) is often noisy, incomplete, and inconsistent, which can negatively impact the accuracy of a model.

• The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors.

• Clean datasets are also important in EDA (Exploratory Data Analysis), which enhances the interpretability of data so that the right actions can be taken based on insights.
How to Perform Data Cleaning

The process begins by identifying issues like missing values, duplicates and outliers.
Performing data cleaning involves a systematic process to identify and remove errors in a
dataset. The following steps are essential to perform data cleaning:

• Remove Unwanted Observations: Eliminate duplicates, irrelevant entries, or redundant data that add noise.

• Fix Structural Errors: Standardize data formats and variable types for consistency.

• Manage Outliers: Detect and handle extreme values that can skew results, either by
removal or transformation.

• Handle Missing Data: Address gaps using imputation, deletion, or advanced techniques to maintain accuracy and integrity.

Implementation for Data Cleaning

Let's understand each step of data cleaning using the Titanic dataset.

Step 1: Import Libraries and Load Dataset

We will import the necessary libraries, i.e., pandas and NumPy.
import pandas as pd
import numpy as np
df = pd.read_csv('Titanic-Dataset.csv')
df.info()
df.head()

Step 2: Check for Duplicate Rows

df.duplicated(): Returns a boolean Series indicating duplicate rows.


df.duplicated()

Step 3: Identify Column Data Types

• List comprehensions with the .dtype attribute separate categorical and numerical columns.

• object dtype: generally used for text or categorical data.

cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Categorical columns:', cat_col)
print('Numerical columns:', num_col)
Step 4: Count Unique Values in the Categorical Columns

df[cat_col].nunique(): Returns the count of unique values per categorical column.

df[cat_col].nunique()

Step 5: Calculate Missing Values as Percentage


• df.isnull(): Detects missing values, returning boolean DataFrame.

• Sum missing across columns, normalize by total rows and multiply by 100.

round((df.isnull().sum() / df.shape[0]) * 100, 2)

Step 6: Drop Irrelevant or Data-Heavy Missing Columns

• df.drop(columns=[]): Drops specified columns from the DataFrame.

• df.dropna(subset=[]): Removes rows where specified columns have missing values.

• fillna(): Fills missing values with a specified value (e.g., the mean).

df1 = df.drop(columns=['Name', 'Ticket', 'Cabin'])

df1.dropna(subset=['Embarked'], inplace=True)

df1['Age'] = df1['Age'].fillna(df1['Age'].mean())

Step 7: Detect Outliers with Box Plot

• matplotlib.pyplot.boxplot(): Displays the distribution of data, highlighting the median, quartiles, and outliers.

• plt.show(): Renders the plot.

import matplotlib.pyplot as plt

plt.boxplot(df1['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()

Step 8: Calculate Outlier Boundaries and Remove Them

• Calculate the mean and standard deviation (std) using df1['Age'].mean() and df1['Age'].std().

• Define bounds as mean ± 2 * std for outlier detection.

• Filter DataFrame rows within the bounds using Boolean indexing.

mean = df1['Age'].mean()

std = df1['Age'].std()

lower_bound = mean - 2 * std

upper_bound = mean + 2 * std

df2 = df1[(df1['Age'] >= lower_bound) & (df1['Age'] <= upper_bound)]

Step 9: Impute Missing Data Again if Any

fillna() applied again on filtered data to handle any remaining missing values.

df3 = df2.fillna(df2['Age'].mean())

df3.isnull().sum()

Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data

• mean = df3['Age'].mean(): Calculates the average (mean) value of the Age column in the DataFrame df3.

• std = df3['Age'].std(): Computes the standard deviation (spread or variability) of the Age column in df3.

• lower_bound = mean - 2 * std: Defines the lower limit for acceptable Age values, set as two standard deviations below the mean.

• upper_bound = mean + 2 * std: Defines the upper limit for acceptable Age values, set as two standard deviations above the mean.

• df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]: Creates a new DataFrame df4 by selecting only rows where the Age value falls between the lower and upper bounds, effectively removing outlier ages outside this range.

mean = df3['Age'].mean()

std = df3['Age'].std()

lower_bound = mean - 2 * std

upper_bound = mean + 2 * std

print('Lower Bound :', lower_bound)

print('Upper Bound :', upper_bound)

df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]


Step 11: Data validation and verification

Data validation and verification involve ensuring that the data is accurate and consistent by comparing it with external sources or expert knowledge. For machine learning prediction we separate the independent and target features. Here we will consider only 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' as the independent features and 'Survived' as the target variable, because PassengerId will not affect the survival rate.

X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']

Step 12: Data formatting

Data formatting involves converting the data into a standard format or structure that can be
easily processed by the algorithms or models used for analysis. Here we will discuss
commonly used data formatting techniques i.e. Scaling and Normalization.
Scaling involves transforming the values of features to a specific range. It maintains the
shape of the original distribution while changing the scale. It is useful when features have
different scales and certain algorithms are sensitive to the magnitude of the features.
Common scaling methods include:

1. Min-Max Scaling: Min-Max scaling rescales the values to a specified range, typically
between 0 and 1. It preserves the original distribution and ensures that the minimum value
maps to 0 and the maximum value maps to 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

num_col_ = [col for col in X.columns if X[col].dtype != 'object']

x1 = X.copy()  # work on a copy so the original features stay unchanged

x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()

2. Standardization (Z-score scaling): Standardization transforms the values to have a mean of 0 and a standard deviation of 1. It centers the data around the mean and scales it based on the standard deviation. Standardization makes the data more suitable for algorithms that assume a Gaussian distribution or require features to have zero mean and unit variance.

Z = (X - μ) / σ

Where,
• X = Data

• μ = Mean value of X
• σ = Standard deviation of X
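The same idea can be sketched with scikit-learn's StandardScaler, applied here to the numeric columns built in the previous steps (this assumes X and num_col_ from above are available):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # implements Z = (X - mean) / std for each column

x2 = X.copy()
x2[num_col_] = scaler.fit_transform(x2[num_col_])

# Each scaled numeric column now has approximately zero mean and unit variance
print(x2[num_col_].mean().round(2))
print(x2[num_col_].std().round(2))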
Data Cleaning Tools

Some data cleansing tools:

• OpenRefine: A free, open-source tool for cleaning, transforming and enriching messy
data with an easy-to-use interface and powerful features like clustering and faceting.

• Trifacta Wrangler: An AI-powered, user-friendly platform that helps automate data cleaning and transformation workflows for faster, more accurate preparation.

• TIBCO Clarity: A data profiling and cleansing tool that ensures high-quality, standardized, and consistent datasets across diverse sources.

• Cloudingo: A cloud-based solution focused on deduplication and data cleansing, especially useful for maintaining accurate CRM data.

• IBM InfoSphere QualityStage: An enterprise-grade tool designed for large-scale, complex data quality management including profiling, matching, and cleansing.

Advantages

• Improved model performance: Removal of errors, inconsistencies, and irrelevant data helps the model to better learn from the data.

• Increased accuracy: Helps ensure that the data is accurate, consistent and free of
errors.

• Better representation of the data: Data cleaning allows the data to be transformed
into a format that better represents the underlying relationships and patterns in the
data.

• Improved data quality: Improve the quality of the data, making it more reliable and
accurate.
• Improved data security: Helps to identify and remove sensitive or confidential
information that could compromise data security.
Disadvantages

• Time-consuming: It is a very time-consuming task, especially for large and complex datasets.

• Error-prone: It can result in loss of important information.

• Cost and resource-intensive: It is a resource-intensive process that requires significant time, effort, and expertise. It can also require the use of specialized software tools.

• Overfitting: Data cleaning can contribute to overfitting by removing too much data.
Data Collection
Data Collection is the process of collecting information from relevant sources to find a
solution to the given statistical inquiry. Collection of Data is the first and foremost step in a
statistical investigation. It's an essential step because it helps us make informed decisions,
spot trends, and measure progress.

Different methods of collecting data include


• Interviews
• Questionnaires
• Observations
• Experiments
• Published Sources and Unpublished Sources

Terms Related to Data Collection

• Data: Data is a tool that helps an investigator understand the problem by providing the required information. Data can be classified into two types: Primary Data and Secondary Data.

• Investigator: An investigator is a person who conducts the statistical enquiry.

• Enumerators: To collect information for a statistical enquiry, an investigator needs the help of other people. These people are known as enumerators.

• Respondents: A respondent is a person from whom the statistical information required for the enquiry is collected.

• Survey: A method of collecting information from individuals. The basic purpose of a survey is to collect data to describe different characteristics such as usefulness, quality, price, kindness, etc. It involves asking questions about a product or service from a large number of people.

1. Surveys and Questionnaires


• What it is: Collecting responses directly from individuals using structured questions.
• How it works: Can be online (Google Forms, SurveyMonkey), paper-based, or
telephonic.
• Advantages:
o Quick and cost-effective for large groups.
o Can cover wide geographic areas.
• Disadvantages:
o Biased or inaccurate responses possible.
o Low response rate in online surveys.

Example: An e-commerce company asking customers about satisfaction levels after purchase.
2. Interviews

• What it is: Face-to-face, telephonic, or video-based conversations to collect in-depth information.

• Types:
o Structured (fixed questions)
o Semi-structured (some flexibility)
o Unstructured (free-flow discussions)
• Advantages:
o Rich, detailed insights.
o Clarification possible immediately.

• Disadvantages:
o Time-consuming and costly.
o Requires skilled interviewers.

Example: HR interviewing employees to understand job satisfaction.

3. Observation

• What it is: Watching and recording behaviors, events, or processes without direct
interaction.
• Types:
o Direct (watching events as they occur)
o Participant (researcher becomes part of the group)
o Mechanical (using cameras, sensors, CCTV)

• Advantages:
o Real-time and natural behavior captured.
o Reduces reliance on self-reported data.

• Disadvantages:
o Observer bias possible.
o Some behaviors/events may not be observable.

Example: Observing customers’ shopping patterns in a supermarket.

4. Experiments
• What it is: Gathering data by conducting controlled experiments.
• Process: Involves changing one or more variables and observing the impact.

• Advantages:
o Provides causal relationships.
o High reliability if properly designed.
• Disadvantages:

o Expensive and time-consuming.

o Not always feasible in real-world scenarios.

Example: A/B testing on websites, showing two different versions of a webpage to users and tracking which performs better.

5. Focus Groups

• What it is: Small group discussions led by a moderator to gather opinions and
perceptions.

• Advantages:
o Generates diverse ideas.
o Useful for exploring attitudes and motivations.

• Disadvantages:
o Groupthink may occur (people influenced by others).
o Difficult to analyze subjective data.

Example: A company conducting a focus group to test reactions to a new product design.

6. Case Studies

• What it is: An in-depth study of a single case (individual, group, event, or organization).
• Advantages:
o Provides deep, contextual understanding.
o Useful for rare or complex issues.

• Disadvantages:
o Limited generalizability.
o Time-intensive.

Example: A hospital analyzing patient recovery patterns after a new treatment.

7. Secondary Data (Existing Sources)

• What it is: Using already available data instead of collecting new data.

• Sources: Research papers, government reports, company databases, public datasets, online repositories (like Kaggle, UCI ML).

• Advantages:
o Cost-effective and time-saving.
o Useful for historical analysis.
• Disadvantages:
o May not exactly fit current research needs.
o Data reliability/accuracy issues possible.

Example: Using World Bank data for economic research.

Qualitative Data:
Data collected on the basis of categorical variables is qualitative data. Qualitative data is more descriptive and conceptual in nature; it measures data by type, grouping, or category.

The collection is based on what kind of quality is present. Qualitative data is categorized into different groups based on characteristics, and the data obtained from such analysis or research is used for theorization, understanding perceptions, and developing hypotheses. These data are collected from texts, documents, transcripts, audio and video recordings, etc.

Examples of Qualitative Data


Examples of qualitative data include:

• Textual responses from open-ended survey questions


• Observational notes or fieldwork observations
• Interview transcripts
• Photographs or videos
• Personal narratives or case studies

Quantitative Data
Data collected on the basis of numerical variables is quantitative data. Quantitative data is more objective and conclusive in nature; it measures values and is expressed in numbers. The collection is based on "how much" of a quantity there is. Because quantitative data is expressed in numbers, it can be counted or measured. The data is extracted from experiments, surveys, market reports, matrices, etc.

Examples of Quantitative Data

Some examples of quantitative data are:

• Age, Height, Weight, etc.


• Temperature
• Income
• Number of siblings
• GPA
• Test scores
• Stock prices
Difference between Qualitative and Quantitative Data

The key differences between Qualitative and Quantitative Data are:

Qualitative Data | Quantitative Data
Uses methods like interviews, participant observation, and focus group discussions to gain collective information. | Uses methods such as questionnaires, surveys, and structured observations to gain collective information.
The data format is textual; datasheets consist of audio or video recordings and notes. | The data format is numerical; datasheets are obtained in the form of numerical values.
Talks about the experience or quality and answers questions like 'why' and 'how'. | Talks about the quantity and answers questions like 'how much' and 'how many'.
The data is analyzed by grouping it into different categories. | The data is analyzed by statistical methods.
Qualitative data is subjective and open to further interpretation. | Quantitative data is fixed and universal.

Structured Data:
Structured data refers to data that is organized in a predefined format, making it easily
readable and understandable by both humans and machines. This is achieved through a well-
defined schema or data model, where data is stored in an orderly way such as rows and
columns.

For Example: A customer database might contain structured records with fields like Name,
Address, Phone Number, and Email.

Characteristics of Structured Data

• Data conforms to a data model and has easily identifiable structure

• Stored in tabular form (rows and columns), e.g., relational databases.

• Data is well organised, so the definition, format, and meaning of the data are explicitly known

• Data resides in fixed fields within a record or file


• Data elements are addressable, so efficient to analyse and process

Common Sources of Structured Data

• Relational Databases (e.g., MySQL, PostgreSQL)

• Spreadsheets (e.g., Excel, Google Sheets)

• OLTP Systems (Online Transaction Processing)

• Online forms and surveys

• IoT sensors (e.g., GPS, RFID tags)


• Web and server logs
• Medical monitoring devices
Advantages of Structured Data

1. Easy to understand and use: Structured data has a well-defined schema or data
model, making it easy to understand and use. This allows for easy data retrieval,
analysis, and reporting.

2. Consistency: The well-defined structure of structured data ensures consistency and accuracy in the data, making it easier to compare and analyze data across different sources.
3. Efficient storage and retrieval: Structured data is typically stored in relational
databases, which are designed to efficiently store and retrieve large amounts of data.
This makes it easy to access and process data quickly.

4. Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be controlled through
database security protocols.

5. Clear data lineage: Structured data typically has a clear lineage or history, making it
easy to track changes and ensure data quality.

Unstructured Data

• Unstructured data refers to information that does not have a predefined format or
structure. It is messy, unorganized and hard to sort. Unlike structured data, which is
organized into rows and columns (like an Excel sheet), unstructured data comes in
many different forms such as text documents, images, audio files, videos and social
media posts. Because this type of data does not follow a clear pattern, it’s harder to
store, process and search.


Characteristics of Unstructured Data


• Lack of Format: Unstructured data does not fit neatly into tables or databases. It can
be textual or non-textual, making it difficult to categorize and organize.
• Variety: This type of data can include a wide range of formats, such as:

• Text documents (e.g., emails, reports, articles)

• Multimedia files (e.g., images, audio, video)

• Social media content (e.g., posts, comments, tweets)

• Web pages and blogs

• Volume: Unstructured data represents a significant portion of the data generated today. It is often larger in volume compared to structured data.
• Diverse Sources: It can originate from various sources, including user-generated
content, sensor data, customer interactions and more.
Importance of unstructured Data

• Even though unstructured data is harder to deal with, it is extremely valuable. Let us see why:

• It helps businesses understand their customers better. For example, businesses can
learn what customers think about their products by reading reviews or social media
posts.

• It contains real world insights, like what people are talking about online or what
videos are trending.

• It’s growing rapidly. More and more data being created today is unstructured like
photos, tweets and videos.

Examples of Unstructured Data

• Unstructured data can come in many different forms. Here are some examples:

• Social Media: Posts, tweets, comments and pictures on Facebook, Instagram, or Twitter

• Emails: Your inbox full of messages, attachments and conversations

• Photos & Videos: Pictures on your phone or videos on YouTube

• Audio Files: Podcasts, voice messages, music files

• Documents: Reports, articles, PDFs, or Word files

• Websites & Blogs: Articles, reviews and posts online


Extracting Information from Unstructured Data
• Unstructured data does not have any structure, so it cannot easily be interpreted by conventional algorithms. It is also difficult to tag and index unstructured data, so extracting information from it is a tough job. However, there are ways to organize and extract useful information from it (a short sketch follows this list):

• Tagging: We can label or tag data with keywords. For example, a photo of a dog
might be tagged with the words “dog,” “pet,” or “animal” so it can be found easily
later.

• Classifying Data: This is like organizing things into groups. For example, grouping
customer reviews into positive or negative feedback. This makes it easier to search
and analyze.

• Data Mining: This technique helps find patterns in unstructured data. For example,
analyzing customer reviews to see common complaints or finding patterns in social
media posts to predict trends.
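As a toy illustration of tagging and classifying unstructured text, here is a small Python sketch that labels made-up customer reviews as positive or negative using simple keyword matching; real systems would rely on NLP libraries or machine-learning models rather than this hand-rolled rule:

# Toy example only: classify free-text reviews with simple keyword tagging
reviews = [
    'Great product, fast delivery and excellent quality',
    'Terrible experience, the item arrived broken',
    'Good value for money, very happy',
]

positive_words = {'great', 'excellent', 'good', 'happy'}
negative_words = {'terrible', 'broken', 'bad', 'poor'}

for review in reviews:
    words = set(review.lower().split())
    if words & positive_words:
        label = 'positive'
    elif words & negative_words:
        label = 'negative'
    else:
        label = 'neutral'
    print(label, '->', review)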
Storing Unstructured Data
• Unstructured data can be converted to easily manageable formats.

• Using a content addressable storage system (CAS) to store unstructured data.

• It stores data based on their metadata and a unique name is assigned to every object
stored in it. The object is retrieved based on content, not its location.

• Unstructured data can be stored in XML format.

• Unstructured data can be stored in RDBMS which supports BLOBs.

Unstructured Data vs Structured Data


• Structured data is neatly organized into rows and columns, much like a spreadsheet
or a database. For instance, a table listing people's names, ages and addresses is
structured data ,it follows a clear format and is easy to search or analyze.

• Unstructured data, on the other hand, doesn’t follow a set structure. It includes
things like photos, videos, audio clips or tweets. There's no consistent format, which
makes it harder to organize or process.

Feature | Structured Data | Unstructured Data
Format | Organized in rows and columns (e.g., tables, spreadsheets). | No fixed format or predefined structure.
Examples | Names, ages and addresses in a database. | Photos, videos, emails, social media posts.
Storage | Stored in relational databases (e.g., SQL). | Stored in files, cloud storage, or NoSQL databases.
Ease of Analysis | Easy to search, sort and analyze with tools. | Requires advanced processing (e.g., NLP, image recognition).
Data Type | Text and numbers in a predictable format. | Mixed data types: text, audio, video, etc.
Real-World Analogy | A neatly arranged bookshelf with categorized books. | A scattered pile of books, photos, papers and sticky notes.

Applications

• Unstructured data is already being used across industries:

• Healthcare: Doctors use unstructured patient records, lab notes and imaging reports
to diagnose and personalize treatment.

• Retail: Analyzing customer reviews and social media comments to improve product
quality and customer experience.
• Finance: Processing news feeds, analyst reports and customer emails to manage risk
and improve investment decisions.

• Legal: Automating document review and e-discovery in law firms through text
mining.

• Media & Entertainment: Recommending content based on viewing habits, comments and user preferences.

Challenges with Unstructured Data

• There are a few challenges with unstructured data that make it difficult to manage:

• Hard to Store: Since unstructured data comes in so many different formats (like
images or audio), it takes up a lot of space to store. You need big storage systems to
hold it all.

• Difficult to Search: Without labels or organization, it's hard to find specific information in unstructured data. For example, if you have thousands of tweets, finding one tweet might be tricky.

• Hard to Analyze: Unlike structured data, which is easy to analyze using simple tools,
unstructured data requires special software and complex techniques to make sense
of it.

Semi-Structured:
Semi-structured data is data that does not reside in a traditional relational database (like SQL)
but still has some organizational properties, such as tags or markers, that make it easier to
analyze than completely unstructured data.
It doesn't follow a strict schema like structured data, but it still contains elements like labels
or keys that make the data identifiable and searchable.

Characteristics of Semi-Structured Data

1. Flexible Schema: The structure can vary from one entry to another. For example, one
JSON object may have five fields while another has only three.

2. Human-Readable Format: Many types like XML or JSON are easy for humans and
machines to understand.

3. Scalable: Easily handled by modern NoSQL databases, making it great for Big Data
environments.

4. Metadata-Rich: Tags and attributes provide context that helps with sorting and
analysis.

Importance of Semi-Structured Data

As data becomes more complex and varied, semi-structured formats offer a balance between
flexibility and manageability. They allow organizations to store and process different types of
information in one place, making it easier to handle diverse data formats. Additionally, semi-
structured data enables quick adaptation to new data sources without the need to redesign
existing databases. This flexibility supports more efficient data analysis and integration,
especially when combining structured and unstructured data, making it a valuable asset in
modern data-driven environments.

Examples of Semi-Structured Data:

• JSON (JavaScript Object Notation)


• XML (eXtensible Markup Language)
• CSV files with inconsistent rows
• Emails (with structured headers and unstructured body text)
• Sensor data from IoT devices
• HTML web pages
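To make the flexible-schema idea concrete, here is a short Python sketch with two hypothetical JSON product records that carry different fields yet can still be parsed and queried through their keys:

import json

# Two hypothetical product records: same collection, different fields
records = [
    '{"id": 1, "name": "Laptop", "price": 55000, "brand": "Acme"}',
    '{"id": 2, "name": "Pen drive", "price": 450}',
]

for raw in records:
    product = json.loads(raw)                 # tags/keys make the data identifiable
    brand = product.get('brand', 'unknown')   # a missing key is handled gracefully
    print(product['name'], product['price'], brand)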
Extracting Information from Semi-Structured Data

Semi-structured data can have different structures because of the heterogeneity of its sources; sometimes parts of it contain no structure at all. This makes it difficult to tag and index, so extracting information from it is a tough job. Here are some possible solutions:

• Graph-based models (e.g., OEM) can be used to index semi-structured data.

• The data modelling technique in OEM allows the data to be stored as a graph, which is easier to search and index.

• XML allows data to be arranged in hierarchical order, which enables the data to be indexed and searched.

• Use of various data mining tools.
Semi-Structured Data Management

Unlike structured data, semi-structured data is best managed using NoSQL databases or
document stores. Popular technologies include:

• MongoDB: A document-based NoSQL database that works well with JSON-like formats.

• Cassandra: Handles wide-column data with semi-structured schema design.

• Elasticsearch: Can index and search through semi-structured log files and
documents.

• Cloud Storage (e.g. AWS S3, Azure Blob): Used to store large volumes of semi-
structured data like logs, emails, and telemetry data.
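As a hedged illustration of a document store, the sketch below assumes a local MongoDB instance and the pymongo driver, with made-up database, collection, and field names; it shows how documents with different shapes can live in one collection and still be queried together:

from pymongo import MongoClient

# Assumes a MongoDB server running locally; all names are illustrative only
client = MongoClient('mongodb://localhost:27017/')
catalog = client['shop']['products']

# Documents in the same collection may carry different fields (flexible schema)
catalog.insert_one({'name': 'Laptop', 'price': 55000, 'specs': {'ram_gb': 16}})
catalog.insert_one({'name': 'Pen drive', 'price': 450})

# Query by a shared field even though the documents differ in shape
for doc in catalog.find({'price': {'$lt': 1000}}):
    print(doc['name'], doc['price'])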

Applications

Semi-structured data is used across various industries:

• E-commerce: Product catalogs stored in JSON format, allowing flexibility in item attributes.

• Healthcare: Patient forms and reports stored in XML with variable fields.

• IoT and Smart Devices: Sensor data captured in key-value formats.


• Web Development: HTML and JSON used to render dynamic content on websites.

• Social Media Platforms: User activity and messages logged in semi-structured logs.

Challenges

Despite its flexibility, semi-structured data comes with a few challenges:

• Complex Querying: Not as straightforward as SQL queries on structured data.

• Data Cleaning: Irregular structure may lead to inconsistency and harder integration.

• Tool Compatibility: Not all analytics tools support semi-structured formats out of the
box.

Differences between Structured, Semi-structured and Unstructured data:


Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transaction handling and various concurrency techniques | Transaction handling adapted from the DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; there is an absence of schema
Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | Newer technology, not very widespread | --
Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
