Introduction To Data Science Notes of Unit 1
Unit 1
Data Science
Data science is the study of data that helps us derive useful insights for business decision-making. Data science is all about using tools, techniques, and creativity to uncover insights hidden within data. It combines mathematics, computer science, and domain expertise to tackle real-world challenges in a variety of fields.
Data science processes raw data to solve business problems and even make predictions about future trends or requirements. For example, from the huge amount of raw data a company holds, data science can help answer important business questions.
In short, data science empowers industries to make smarter, faster, and more informed decisions. In order to find patterns and achieve such insights, expertise in the relevant domain is required. With expertise in healthcare, for example, a data scientist can predict patient risks and suggest personalized treatments.
• Data Collection: Gathering raw data from various sources, such as databases,
sensors, or user interactions.
• Data Cleaning: Ensuring the data is accurate, complete, and ready for analysis.
• Data Analysis: Applying statistical and computational methods to identify patterns,
trends, or relationships.
• Data Visualization: Creating charts, graphs, and dashboards to present findings
clearly.
• Decision-Making: Using insights to inform strategies, create solutions, or predict outcomes (a minimal Python sketch of these steps follows below).
• Predicts the Future: Businesses can use data to forecast trends, demand, and other
important factors.
• Drives Innovation: New ideas and products often come from insights discovered
through data science.
• Benefits Society: Data science improves public services like healthcare, education,
and transportation by helping allocate resources more effectively.
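As a rough illustration of these steps, here is a minimal sketch in Python; the file name sales.csv and the column names region and revenue are hypothetical, not taken from these notes:
import pandas as pd
import matplotlib.pyplot as plt

# Data collection: load raw data from a CSV file (hypothetical file and columns)
df = pd.read_csv('sales.csv')
# Data cleaning: drop rows with missing values
df = df.dropna()
# Data analysis: average revenue per region
summary = df.groupby('region')['revenue'].mean()
# Data visualization: present the result as a bar chart for decision-making
summary.plot(kind='bar', title='Average revenue by region')
plt.show()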
There are many examples around you where data science is being used, for example in social media, in medicine, and in preparing strategies for cricket or FIFA matches by analyzing past games. Here are some more real-life examples:
Data science has a wide range of applications across various industries, transforming how they operate and deliver results:
• Data science is used to analyze patient data, predict diseases, develop personalized
treatments, and optimize hospital operations.
• Streaming platforms and content creators use data science to recommend shows,
analyze viewer preferences, and optimize content delivery.
Data science is transforming every industry by unlocking the power of data. Here are some
key sectors where data science plays a vital role:
• Healthcare: Data science improves patient outcomes by using predictive analytics to
detect diseases early, creating personalized treatment plans and optimizing hospital
operations for efficiency.
• Finance: Data science helps detect fraudulent activities, assess and manage financial
risks, and provide tailored financial solutions to customers.
• Energy: Data science forecasts energy demand, optimizes energy consumption, and
facilitates the integration of renewable energy resources.
• Statistics and Mathematics: A strong foundation in statistics and linear algebra helps
in understanding data patterns and building predictive models.
• Machine Learning: Knowledge of machine learning algorithms and frameworks is
key to creating intelligent data-driven solutions.
• Data Visualization: The ability to present data insights through tools like Tableau,
Power BI, or Matplotlib ensures findings are clear and actionable.
• Data Wrangling: Skills in cleaning, transforming, and preparing raw data for
analysis are vital for maintaining data quality.
• Big Data Tools: Familiarity with tools like Hadoop, Spark, or cloud platforms helps
in handling large datasets efficiently.
• Critical Thinking: Analytical skills to interpret data and solve problems creatively
are essential for uncovering actionable insights.
Data preprocessing transforms raw data into a usable format through a series of steps:
1. Data Cleaning: It involves handling missing values and noisy data.
• Missing Values: These occur when data is absent from a dataset. You can either ignore the rows with missing data or fill the gaps manually, with the attribute mean, or with the most probable value. This ensures the dataset remains accurate and complete for analysis.
• Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in
several ways:
o Binning Method: The data is sorted and divided into equal-sized segments (bins), and each segment is smoothed by replacing its values with the bin mean or the boundary values (a short sketch follows this list).
o Clustering: This method groups similar data points together, with outliers either going undetected or falling outside the clusters. These techniques help remove noise and improve data quality.
2. Data Integration:
• Data Fusion: This involves combining data from multiple sources to create a more comprehensive and accurate dataset. It integrates information that may be inconsistent or incomplete across different sources, ensuring a unified and richer dataset for analysis.
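A minimal sketch of smoothing by bin means, as mentioned under the binning method above (the sample values and the number of bins are only illustrative):
import pandas as pd

ages = pd.Series([21, 23, 24, 25, 40, 41, 42, 43, 60, 61, 62, 300])   # 300 is a noisy value
bins = pd.qcut(ages, q=4)                         # sort the values into equal-sized segments
smoothed = ages.groupby(bins).transform('mean')   # replace each value with its bin mean
print(smoothed)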
3. Data Transformation: It involves converting data into a format suitable for analysis. Common techniques include normalization, which scales data to a common range; standardization, which adjusts data to have zero mean and unit variance; and discretization, which converts continuous data into discrete categories. These techniques help prepare the data for more accurate analysis (a short code sketch follows the list below).
• Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
• Discretization: Converting continuous data into discrete categories for easier
analysis.
• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
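A short sketch of these transformations on a single numeric column; the column name value is hypothetical and scikit-learn is assumed to be available:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'value': [10, 20, 30, 40, 100]})
# Normalization: scale the column to the 0-1 range
df['value_norm'] = MinMaxScaler().fit_transform(df[['value']]).ravel()
# Standardization: rescale to zero mean and unit variance
df['value_std'] = StandardScaler().fit_transform(df[['value']]).ravel()
# Discretization: convert the continuous column into three categories
df['value_bin'] = pd.cut(df['value'], bins=3, labels=['low', 'medium', 'high'])
# Aggregation: summarise many values into one figure (here, the average)
print(df['value'].mean())
print(df)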
4. Data Reduction: It reduces the dataset's size while maintaining the key information. This can be done through feature selection, which chooses the most relevant features, and feature extraction, which transforms the data into a lower-dimensional space while preserving the important details.
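A minimal sketch of both approaches using scikit-learn; the synthetic data and the parameter choices are only illustrative:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                # 100 samples, 10 features
y = np.random.randint(0, 2, size=100)      # binary target
# Feature selection: keep the 3 features most related to the target
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)
# Feature extraction: project the data onto 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_selected.shape, X_reduced.shape)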
Data preprocessing is utilized across various fields to ensure that raw data is transformed into
a usable format for analysis and decision-making. Here are some key areas where data
preprocessing is applied:
2. Data Mining: Data preprocessing in data mining involves cleaning and transforming raw
data to make it suitable for analysis. This step is crucial for identifying patterns and extracting
insights from large datasets.
3. Machine Learning: In machine learning, preprocessing prepares raw data for model
training. This includes handling missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets to improve model performance
and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science projects, ensuring
that the data used for analysis or building predictive models is clean, structured, and relevant.
It enhances the overall quality of insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs to extract
meaningful user behavior patterns. This can inform marketing strategies and improve user
experience through personalized recommendations.
7. Deep Learning: Similar to machine learning, deep learning applications require preprocessing to normalize or enhance features of the input data, optimizing the model training process.
Advantages of Data Preprocessing
• Better Model Performance: Reduces noise and irrelevant data, leading to more accurate predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better
business decisions.
Disadvantages of Data Preprocessing
• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.
Data Cleaning
Data cleaning is a step in machine learning (ML) that involves identifying and removing missing, duplicate, or irrelevant data.
• Raw data (log files, transactions, audio/video recordings, etc.) is often noisy, incomplete, and inconsistent, which can negatively impact the accuracy of the model.
• The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors.
• Clean datasets are also important in EDA (Exploratory Data Analysis), as they enhance the interpretability of the data so that the right actions can be taken based on the insights.
Benefits of Data Cleaning
How to Perform Data Cleaning
The process begins by identifying issues like missing values, duplicates and outliers.
Performing data cleaning involves a systematic process to identify and remove errors in a
dataset. The following steps are essential to perform data cleaning:
• Fix Structural Errors: Standardize data formats and variable types for consistency.
• Manage Outliers: Detect and handle extreme values that can skew results, either by
removal or transformation.
Let's understand each step of data cleaning using the Titanic dataset.
We will import all the necessary libraries, i.e., pandas, numpy, and matplotlib.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# load the dataset and inspect its structure
df = pd.read_csv('Titanic-Dataset.csv')
df.info()
df.head()
# categorical (object-type) columns (cat_col is assumed here; its definition is not shown in the original notes)
cat_col = [col for col in df.columns if df[col].dtype == 'object']
df[cat_col].nunique()
• Sum the missing values across columns, normalize by the total number of rows, and multiply by 100.
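A minimal sketch of that calculation on the DataFrame df loaded above:
# percentage of missing values per column
round((df.isnull().sum() / df.shape[0]) * 100, 2)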
df1 = df.copy()   # work on a copy of the data (assumed; the original notes do not show how df1 is created)
df1.dropna(subset=['Embarked'], inplace=True)        # drop the rows where Embarked is missing
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())    # fill missing Age values with the mean age
plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
std = df1['Age'].std()
fillna() is applied again on the filtered data to handle any remaining missing values.
df3 = df2.fillna(df2['Age'].mean())   # fill any remaining missing values with the mean age
df3.isnull().sum()                    # check how many missing values remain
Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data
• mean = df3['Age'].mean(): Calculates the average (mean) value of the Age column
in the DataFrame df3.
• lower_bound = mean - 2 * std: Defines the lower limit for acceptable Age values,
set as two standard deviations below the mean.
• upper_bound = mean + 2 * std: Defines the upper limit for acceptable Age values,
set as two standard deviations above the mean.
mean = df3['Age'].mean()
std = df3['Age'].std()
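Completing the step described above, the bounds can then be applied to filter the data (the name df4 is only for illustration; it is not in the original notes):
lower_bound = mean - 2 * std    # two standard deviations below the mean
upper_bound = mean + 2 * std    # two standard deviations above the mean
# keep only the rows whose Age lies within the acceptable range
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]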
Data validation and verification involve ensuring that the data is accurate and consistent by comparing it with external sources or expert knowledge. For the machine learning prediction we separate the independent and target features. Here we will consider only 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' as the independent features and 'Survived' as the target variable, because PassengerId will not affect the survival rate.
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']
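If one wanted to continue toward model training, the categorical columns would typically be encoded and the data split into training and testing sets. This is a sketch under those assumptions, not part of the original walkthrough:
from sklearn.model_selection import train_test_split

# one-hot encode the categorical columns 'Sex' and 'Embarked'
X_encoded = pd.get_dummies(X, columns=['Sex', 'Embarked'])
# hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y, test_size=0.2, random_state=42)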
Data formatting involves converting the data into a standard format or structure that can be
easily processed by the algorithms or models used for analysis. Here we will discuss
commonly used data formatting techniques i.e. Scaling and Normalization.
Scaling involves transforming the values of features to a specific range. It maintains the
shape of the original distribution while changing the scale. It is useful when features have
different scales and certain algorithms are sensitive to the magnitude of the features.
Common scaling methods include:
1. Min-Max Scaling: Min-Max scaling rescales the values to a specified range, typically
between 0 and 1. It preserves the original distribution and ensures that the minimum value
maps to 0 and the maximum value maps to 1.
from sklearn.preprocessing import MinMaxScaler

num_col_ = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']   # numeric columns to scale (assumed)
x1 = X.copy()
x1[num_col_] = MinMaxScaler().fit_transform(x1[num_col_])
x1.head()
2. Standardization (Z-score Scaling): Standardization rescales the data so that it has a mean of 0 and a standard deviation of 1, using the formula:
Z = (X - μ) / σ
Where,
• X = Data
• μ = Mean value of X
• σ = Standard deviation of X
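A minimal sketch of this standardization applied to the same (assumed) numeric columns using scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler

x2 = X.copy()
x2[num_col_] = StandardScaler().fit_transform(x2[num_col_])
x2.head()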
Data Cleaning Tools
• OpenRefine: A free, open-source tool for cleaning, transforming and enriching messy
data with an easy-to-use interface and powerful features like clustering and faceting.
• TIBCO Clarity: A data profiling and cleansing tool that ensures high-quality,
standardized and consistent datasets across diverse sources.
Advantages
• Increased accuracy: Helps ensure that the data is accurate, consistent and free of
errors.
• Better representation of the data: Data cleaning allows the data to be transformed
into a format that better represents the underlying relationships and patterns in the
data.
• Improved data quality: Improve the quality of the data, making it more reliable and
accurate.
• Improved data security: Helps to identify and remove sensitive or confidential
information that could compromise data security.
Disadvantages
• Time-consuming: Data cleaning is a very time-consuming task, especially for large and complex datasets.
• Overfitting: Data cleaning can contribute to overfitting by removing too much data.
Data Collection
Data Collection is the process of collecting information from relevant sources to find a
solution to the given statistical inquiry. Collection of Data is the first and foremost step in a
statistical investigation. It's an essential step because it helps us make informed decisions,
spot trends, and measure progress.
2. Interviews
• Types:
o Structured (fixed questions)
o Semi-structured (some flexibility)
o Unstructured (free-flow discussions)
• Advantages:
o Rich, detailed insights.
o Clarification possible immediately.
• Disadvantages:
o Time-consuming and costly.
o Requires skilled interviewers.
3. Observation
• What it is: Watching and recording behaviors, events, or processes without direct
interaction.
• Types:
o Direct (watching events as they occur)
o Participant (researcher becomes part of the group)
o Mechanical (using cameras, sensors, CCTV)
• Advantages:
o Real-time and natural behavior captured.
o Reduces reliance on self-reported data.
• Disadvantages:
o Observer bias possible.
o Some behaviors/events may not be observable.
4. Experiments
• What it is: Gathering data by conducting controlled experiments.
• Process: Involves changing one or more variables and observing the impact.
• Advantages:
o Provides causal relationships.
o High reliability if properly designed.
• Disadvantages:
5. Focus Groups
• What it is: Small group discussions led by a moderator to gather opinions and
perceptions.
• Advantages:
o Generates diverse ideas.
o Useful for exploring attitudes and motivations.
• Disadvantages:
o Groupthink may occur (people influenced by others).
o Difficult to analyze subjective data.
Example: A company conducting a focus group to test reactions to a new product design.
6. Case Studies
• Disadvantages:
o Limited generalizability.
o Time-intensive.
7. Secondary Data Analysis
• What it is: Using already available data instead of collecting new data.
• Advantages:
o Cost-effective and time-saving.
o Useful for historical analysis.
• Disadvantages:
o May not exactly fit current research needs.
o Data reliability/accuracy issues possible.
Qualitative Data:
Qualitative data is data collected on the basis of categorical variables. It is more descriptive and conceptual in nature, describing the data in terms of its type, category, or quality rather than numbers.
The data collection is based on what type of quality is present. Qualitative data is categorized into different groups based on characteristics. The data obtained from this kind of analysis or research is used for theorization, perceptions, and developing hypotheses. Such data is collected from texts, documents, transcripts, audio and video recordings, etc.
Quantitative Data
The data collected on the grounds of the numerical variables are quantitative data.
Quantitative data are more objective and conclusive in nature. It measures the values and is
expressed in numbers. The data collection is based on "how much" is the quantity. The data in
quantitative analysis is expressed in numbers so it can be counted or measured. The data is
extracted from experiments, surveys, market reports, matrices, etc.
Qualitative data talks about the experience or quality and explains questions like 'why' and 'how'. Quantitative data talks about the quantity and explains questions like 'how much' and 'how many'.
Structured Data:
Structured data refers to data that is organized in a predefined format, making it easily
readable and understandable by both humans and machines. This is achieved through a well-
defined schema or data model, where data is stored in an orderly way such as rows and
columns.
For Example: A customer database might contain structured records with fields like Name,
Address, Phone Number, and Email.
Some examples of structured data are Excel spreadsheets, SQL database tables, and CSV files.
• The data is well organised, so the definition, format, and meaning of the data are explicitly known.
Advantages of Structured Data
1. Easy to understand and use: Structured data has a well-defined schema or data model, making it easy to understand and use. This allows for easy data retrieval, analysis, and reporting.
4. Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be controlled through
database security protocols.
5. Clear data lineage: Structured data typically has a clear lineage or history, making it
easy to track changes and ensure data quality.
Unstructured Data
• Unstructured data refers to information that does not have a predefined format or
structure. It is messy, unorganized and hard to sort. Unlike structured data, which is
organized into rows and columns (like an Excel sheet), unstructured data comes in
many different forms such as text documents, images, audio files, videos and social
media posts. Because this type of data does not follow a clear pattern, it’s harder to
store, process and search.
• Even though unstructured data is harder to deal with, it is extremely valuable. Let us see why below:
• It helps businesses understand their customers better. For example, businesses can
learn what customers think about their products by reading reviews or social media
posts.
• It contains real world insights, like what people are talking about online or what
videos are trending.
• It’s growing rapidly. More and more of the data being created today is unstructured, such as photos, tweets and videos.
• Although unstructured data has no fixed format, several techniques can make it easier to work with (a toy sketch follows this list):
• Tagging: We can label or tag data with keywords. For example, a photo of a dog
might be tagged with the words “dog,” “pet,” or “animal” so it can be found easily
later.
• Classifying Data: This is like organizing things into groups. For example, grouping
customer reviews into positive or negative feedback. This makes it easier to search
and analyze.
• Data Mining: This technique helps find patterns in unstructured data. For example,
analyzing customer reviews to see common complaints or finding patterns in social
media posts to predict trends.
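A toy sketch of keyword-based tagging/classification of text reviews; the reviews and keyword lists are made up purely for illustration:
# classify short reviews as positive/negative using keyword tags (illustrative only)
reviews = ["Great product, works perfectly", "Terrible quality, broke in a week"]
positive_words = {"great", "perfect", "perfectly", "good"}
negative_words = {"terrible", "broke", "bad"}

for review in reviews:
    words = set(review.lower().replace(",", "").split())
    if words & positive_words:
        label = "positive"
    elif words & negative_words:
        label = "negative"
    else:
        label = "neutral"
    print(label, "->", review)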
Storing Unstructured Data
• Unstructured data can be converted into more easily manageable formats.
• Object storage is commonly used for this: it stores data along with its metadata, and a unique identifier is assigned to every object stored in it. An object is then retrieved based on its content and metadata, not its location.
• Unstructured data, by contrast, doesn't follow a set structure. It includes things like photos, videos, audio clips or tweets. There's no consistent format, which makes it harder to organize or process.
• Ease of processing and analysis: Structured data is easy to search, sort, and analyze with standard tools, while unstructured data requires advanced techniques (e.g., NLP, image recognition).
Applications
• Healthcare: Doctors use unstructured patient records, lab notes and imaging reports
to diagnose and personalize treatment.
• Retail: Analyzing customer reviews and social media comments to improve product
quality and customer experience.
• Finance: Processing news feeds, analyst reports and customer emails to manage risk
and improve investment decisions.
• Legal: Automating document review and e-discovery in law firms through text
mining.
Challenges
There are a few challenges with unstructured data that make it difficult to manage:
• Hard to Store: Since unstructured data comes in so many different formats (like
images or audio), it takes up a lot of space to store. You need big storage systems to
hold it all.
• Hard to Analyze: Unlike structured data, which is easy to analyze using simple tools,
unstructured data requires special software and complex techniques to make sense
of it.
Semi-Structured:
Semi-structured data is data that does not reside in a traditional relational database (like SQL)
but still has some organizational properties, such as tags or markers, that make it easier to
analyze than completely unstructured data.
It doesn't follow a strict schema like structured data, but it still contains elements like labels
or keys that make the data identifiable and searchable.
1. Flexible Schema: The structure can vary from one entry to another. For example, one JSON object may have five fields while another has only three (see the sketch after this list).
2. Human-Readable Format: Many types like XML or JSON are easy for humans and
machines to understand.
3. Scalable: Easily handled by modern NoSQL databases, making it great for Big Data
environments.
4. Metadata-Rich: Tags and attributes provide context that helps with sorting and
analysis.
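A small sketch of this flexible-schema idea: two JSON-like records of the same kind carry different fields, and pandas can still flatten them (the field names are made up for illustration):
import pandas as pd

# two semi-structured records with different sets of fields
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com", "tags": ["vip"]},
    {"id": 2, "name": "Ravi", "phone": "555-0100"}
]
# json_normalize flattens the records; fields missing from a record simply become NaN
df = pd.json_normalize(records)
print(df)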
As data becomes more complex and varied, semi-structured formats offer a balance between
flexibility and manageability. They allow organizations to store and process different types of
information in one place, making it easier to handle diverse data formats. Additionally, semi-
structured data enables quick adaptation to new data sources without the need to redesign
existing databases. This flexibility supports more efficient data analysis and integration,
especially when combining structured and unstructured data, making it a valuable asset in
modern data-driven environments.
• Graph-based models (e.g., OEM, the Object Exchange Model) can be used to index semi-structured data.
• The data modelling technique in OEM allows the data to be stored in a graph-based model, which makes the data easier to search and index.
• XML allows data to be arranged in hierarchical order, which enables the data to be indexed and searched.
• Use of various data mining tools
Semi-Structured Data Management
Unlike structured data, semi-structured data is best managed using NoSQL databases or
document stores. Popular technologies include:
• Elasticsearch: Can index and search through semi-structured log files and
documents.
• Cloud Storage (e.g. AWS S3, Azure Blob): Used to store large volumes of semi-
structured data like logs, emails, and telemetry data.
Applications
• Healthcare: Patient forms and reports stored in XML with variable fields.
• Social Media Platforms: User activity and messages logged in semi-structured logs.
Challenges
• Data Cleaning: Irregular structure may lead to inconsistency and harder integration.
• Tool Compatibility: Not all analytics tools support semi-structured formats out of the
box.
Transaction management: Structured data has matured transaction management and various concurrency techniques; semi-structured data uses transactions adapted from the DBMS, which are not matured; unstructured data has no transaction management and no concurrency.