Module 2 Data Science

Data science involves analyzing large amounts of structured and unstructured data to extract useful insights for businesses, utilizing methods from various fields like math and AI. It encompasses data collection techniques, data cleaning, preprocessing, and visualization to ensure high-quality data analysis. Effective data visualization aids in understanding trends and patterns, but care must be taken to avoid misinterpretations.

Introduction to Data Science

Data Science
• Data science is the study of data to find useful information for
businesses. It involves using ideas and methods from different areas like
math, statistics, artificial intelligence, and computer engineering to
examine large amounts of data. This helps data scientists understand
things like what happened, why it happened, what might happen next, and
what actions can be taken based on the results.
Importance of Data Science
• Data science is important because it uses tools, methods, and technology
to make sense of data. Today, organizations collect a huge amount of data
from many sources, like devices that automatically gather and store
information. Online systems, payment platforms, and other areas like e-
commerce, healthcare, and finance capture large amounts of data. This
includes text, audio, video, and images, all in very large quantities. Data
science helps turn this information into useful insights.
Structured and Unstructured Data
Structured Data
Structured data, also called quantitative data, is information that is
neatly organized, making it easy for computers to read, search, and
analyze. You often find it in tables with rows and columns, like in a
spreadsheet. Each column has a specific heading, and the data in
each row matches that category, such as names, addresses, or dates.
This structure makes it easy for search engines and algorithms to
understand the data. Since everything is clearly labeled, both people
and computer programs can quickly search and analyze large
amounts of this data.
Unstructured Data
Unstructured data is information that doesn’t have a set format or
organization. Each piece of unstructured data is called an "object"
because it doesn’t have a specific key or label to easily identify it. To
make it searchable, each object needs to be tagged or labeled with
something that helps identify it.

Examples of unstructured data include videos, emails, images, and web content. Unstructured data makes up 80 to 90 percent of all data worldwide, but it’s harder to work with and less immediately useful compared to structured data because it’s more difficult to analyze and get insights from.
Data Collection Methods
Data Collection Methods
Data collection methods are ways to gather information for
research. They can be as simple as asking people questions through
surveys or as complex as running experiments.

Some common methods include surveys, interviews, watching people (observations), group discussions (focus groups), experiments, and analyzing data that already exists (secondary data analysis). After collecting the data, researchers analyze it to see if it supports or challenges their ideas and to draw conclusions about the topic they are studying.
Types of Data Collection Methods

Primary & Secondary Data Collection Methods
Primary Data Collection
Primary data is collected from first-hand experience and has not been used before. Data gathered through primary data collection methods is highly accurate and specific to the research’s purpose.
Primary Data Collection
1. Surveys - Surveys gather information from a specific group of people to understand
their preferences, opinions, and feedback about products and services. Survey tools
usually provide different types of questions to collect this information, such as
multiple-choice, rating scales, or open-ended questions. This helps businesses learn
what their customers think and make better decisions.

2. Polls - Polls consist of just one or a few multiple-choice questions. They are helpful
when you want to quickly gauge the opinions or feelings of an audience. Since polls
are short, people are more likely to respond, making it a fast and easy way to
gather feedback.

3. Interviews - In face-to-face interviews, the interviewer asks the interviewee a set of questions in person and records their answers. If meeting in person isn't possible, the interviewer can conduct the interview over the phone instead.
Primary Data Collection
4. Delphi Technique - In the Delphi method, market experts are given the predictions
and assumptions made by other experts. After reviewing this information, they may
adjust their own estimates. The final demand forecast is created by combining the
opinions of all the experts to reach a shared agreement.

5. Focus Groups - Focus groups are a type of qualitative data collection used in
education. In a focus group, a small group of about 8-10 people discusses a specific
research topic. Each person shares their thoughts and ideas about the issue being
studied, helping researchers understand different perspectives.

6. Questionnaires - A questionnaire is a printed or digital list of questions, either open-ended (where people can give detailed answers) or closed-ended (with set answer choices). Respondents answer based on their knowledge and experience with the topic. A questionnaire is often used in surveys, but it doesn't always have to be part of one.
Secondary Data Collection
Secondary data is information that has
already been collected and used in the
past. Researchers can gather this data
from both internal sources (within an
organization) and external sources (outside
the organization) to use for their studies.
Secondary Data Collection
Internal sources of secondary data:
•Organization’s health and safety records
•Mission and vision statements
•Financial Statements
•Magazines
•Sales Report
•CRM Software
•Executive summaries
Data Cleaning and Preprocessing Basics
In data science and machine learning, the quality of the input data is extremely
important. The performance of machine learning models relies heavily on how
good the data is. This is why data cleaning—finding and fixing (or removing)
incorrect or incomplete data—is a key step in the process.

Data cleaning isn't just about deleting data or filling in missing values. It's a
detailed process that uses different techniques to make raw data ready for
analysis. These techniques include handling missing data, removing duplicates,
converting data types, and more. Each method is used based on the type of
data and the needs of the analysis.
Common Data Cleaning Techniques

Handling Missing Values: Missing data can occur for various reasons, such as errors in data collection or
transfer. There are several ways to handle missing data, depending on the nature and extent of the missing
values.
• Imputation: Here, you replace missing values with substituted values. The substituted value could be a
central tendency measure like mean, median, or mode for numerical data or the most frequent category for
categorical data. More sophisticated imputation methods include regression imputation and multiple
imputation.
• Deletion: You remove the instances with missing values from the dataset. While this method is straightforward, it can lead to loss of information, especially if the data is not missing at random.
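A minimal pandas sketch of the two approaches above, imputation and deletion (the DataFrame and its columns are hypothetical, not from the slides):

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Imputation: fill numerical gaps with the mean, categorical gaps with the mode
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].mean())
df_imputed["city"] = df_imputed["city"].fillna(df_imputed["city"].mode()[0])

# Deletion: drop any row that still contains a missing value
df_dropped = df.dropna()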
Common Data Cleaning Techniques

Removing Duplicates: Duplicate entries can occur for various reasons, such as data entry errors or data merging. These
duplicates can skew the data and lead to biased results. Techniques for removing duplicates involve identifying these
redundant entries based on key attributes and eliminating them from the dataset.
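A minimal sketch of duplicate removal with pandas, assuming a hypothetical key column customer_id:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "purchase": [250, 400, 400, 120],
})

# Flag duplicates based on the key attribute, then drop them,
# keeping only the first occurrence of each customer_id
is_duplicate = df.duplicated(subset=["customer_id"])
df_clean = df.drop_duplicates(subset=["customer_id"], keep="first")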

Data Type Conversion: Sometimes, the data may be in an inappropriate format for a particular analysis or model. For
instance, a numerical attribute may be recorded as a string. In such cases, data type conversion, also known as type casting,
is used to change the data type of a particular attribute or set of attributes. This process involves converting the data into a
suitable format that machine learning algorithms can easily process.
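For example, a sketch of casting a hypothetical price column stored as strings into a numeric type with pandas:

import pandas as pd

df = pd.DataFrame({"price": ["10.5", "20.0", "15.75"]})

# The numerical attribute was recorded as strings; cast it to float
df["price"] = df["price"].astype(float)

# pd.to_numeric is an alternative that can coerce or flag unparseable values
df["price"] = pd.to_numeric(df["price"], errors="coerce")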

Outlier Detection: Outliers are data points that significantly deviate from other observations. They can be caused by
variability in the data or errors. Outlier detection techniques are used to identify these anomalies. These techniques include
statistical methods, such as the Z-score or IQR method, and machine learning methods, such as clustering or anomaly
detection algorithms.
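A sketch of the Z-score and IQR methods mentioned above, on a hypothetical series of values:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# IQR method: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers_iqr = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
outliers_z = s[z.abs() > 3]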
Data Preprocessing

Data preprocessing is critical in data science, particularly for machine learning applications. It
involves preparing and cleaning the dataset to make it more suitable for machine learning
algorithms. This process can reduce complexity, prevent overfitting, and improve the model's
overall performance.

The data preprocessing phase begins with understanding your dataset's nuances and the data's
main issues through Exploratory Data Analysis. Real-world data often presents inconsistencies,
typos, missing data, and different scales. You must address these issues to make the data more
useful and understandable. This process of cleaning and solving most of the issues in the data is
what we call the data preprocessing step.
Common Data Preprocessing Techniques

Data Scaling
Data scaling is a technique used to standardize the range of independent variables or features so that no single feature dominates the others, which matters especially when working with large datasets. This is a crucial step in data preprocessing, particularly for algorithms sensitive to the range of the data, such as deep learning models.
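A minimal scikit-learn sketch of two common scalers (the feature matrix X is hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squeeze each feature into the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)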

Encoding Categorical Variables
Machine learning models require inputs to be numerical. If your data contains categorical variables, you must encode them as numerical values before fitting and evaluating a model. This process, known as encoding categorical variables, is a common data preprocessing technique. One common method is One-Hot Encoding, which creates a new binary column for each category/label in the original column.
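A sketch of One-Hot Encoding with pandas (the color column is hypothetical):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one new binary column per category/label
encoded = pd.get_dummies(df, columns=["color"])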

Data Splitting
Data Splitting is a technique to divide the dataset into two or three sets, typically training, validation, and test sets. You use
the training set to train the model and the validation set to tune the model's parameters. The test set provides an unbiased
evaluation of the final model. This technique is essential when dealing with large data, as it ensures the model is not
overfitted to a particular subset of data.
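A minimal scikit-learn sketch of splitting into training, validation, and test sets (X and y are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features
y = np.arange(10)                 # hypothetical target

# Hold out 20% of the data as an unbiased test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the remaining training data for tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)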
Common Data Preprocessing Techniques

Handling Missing Values
Missing data in the dataset can lead to misleading results. Therefore, it's essential to handle missing values appropriately. Techniques for handling missing values include deletion (removing the rows with missing values) and imputation (replacing the missing values with statistical measures like the mean, median, or mode). This step is crucial in ensuring the quality of data used for training machine learning models.
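A complementary sketch to the earlier pandas example, this time using scikit-learn's SimpleImputer on a hypothetical array:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)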
Feature Selection
Feature selection is a process in machine learning where you automatically select those features in your data that contribute
most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the
accuracy of many models, especially linear algorithms like linear and logistic regression. This process is particularly
important for data scientists working with high-dimensional data, as it reduces overfitting, improves accuracy, and reduces
training time.

Three benefits of performing feature selection before modeling your data are:
•Reduces Overfitting: Less redundant data means less opportunity to make noise-based decisions.
•Improves Accuracy: Less misleading data means modeling accuracy improves.
•Reduces Training Time: Fewer features mean lower algorithm complexity, so models train faster.
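A sketch of univariate feature selection with scikit-learn's SelectKBest (the dataset and the choice of k are arbitrary, for illustration only):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical relationship to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected features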
Step-by-Step Guide to Data Cleaning

1.Identifying and Removing Duplicate or Irrelevant Data: Duplicate data can arise from various sources, such as the same individual
participating in a survey multiple times or redundant fields in the data collection process. Irrelevant data refers to information you can safely
remove because it is not likely to contribute to the model's predictive capacity. This step is particularly important when dealing with large datasets.

2.Fixing Syntax Errors: Syntax errors can occur due to inconsistencies in data entry, such as date formats, spelling mistakes, or grammatical
errors. You must identify and correct these errors to ensure the data's consistency. This step is crucial in maintaining the quality of data.

3.Filtering out Unwanted Outliers: Outliers, or data points that significantly deviate from the rest of the data, can distort the model's learning
process. These outliers must be identified and handled appropriately by removal or statistical treatment. This process is a part of data reduction.

4.Handling Missing Data: Missing data is a common issue in data collection. Depending on the extent and nature of the missing data, you can
employ different strategies, including dropping the data points or imputing missing values. This step is especially important when dealing with
large data.

5.Validating Data Accuracy: Validate the accuracy of the data through cross-checks and other verification methods. Ensuring data accuracy is
crucial for maintaining the reliability of the machine-learning model. This step is particularly important for data scientists as it directly impacts the
model's performance.
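Putting the five steps above together, a compact and purely illustrative pandas sketch (the file name and column names are hypothetical):

import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical input file

# 1. Remove duplicate records and an irrelevant column
df = df.drop_duplicates().drop(columns=["internal_notes"], errors="ignore")

# 2. Fix simple syntax inconsistencies, e.g. date formats and stray whitespace
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["country"] = df["country"].str.strip().str.title()

# 3. Filter out unwanted outliers (here: implausible ages)
df = df[df["age"].between(0, 120)]

# 4. Handle missing data by imputing a central value
df["age"] = df["age"].fillna(df["age"].median())

# 5. Validate accuracy with a simple cross-check
assert df["customer_id"].is_unique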
Data Visualization
Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion.

In the world of Big Data, data visualization tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions.
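As a small illustration with hypothetical monthly sales figures, a basic line chart with matplotlib:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]  # hypothetical values

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()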
Data Visualization Advantages

•Easily sharing information - Data visualization simplifies complex data into easy-to-understand visuals, making it quicker to share insights with a wider audience.

•Interactively explore opportunities - Interactive visualizations allow users to manipulate data in real time, helping them discover new trends and potential opportunities.

•Visualize patterns and relationships - Visuals like charts and graphs make it easier to spot trends, correlations, and outliers that might not be apparent in raw data.
Data Visualization Disadvantages

•Biased or inaccurate information: Poorly designed visuals or selective data presentation can mislead viewers, resulting in biased or incorrect conclusions.

•Correlation doesn’t always mean causation: Visuals may highlight correlations between variables, but these relationships don’t necessarily imply cause and effect, leading to potential misinterpretations.

•Core messages can get lost in translation: Complex visuals or excessive details can distract viewers, causing the main point or key insights to be overlooked.
As the “age of Big Data” kicks into high gear, visualization is an
increasingly key tool to make sense of the trillions of rows of data
generated every day. Data visualization helps to tell stories by curating
data into a form easier to understand, highlighting the trends and
outliers. A good visualization tells a story, removing the noise from data
and highlighting useful information.
