Module 2 Data Science

Data science involves analyzing large amounts of structured and unstructured data to extract useful insights for businesses, utilizing methods from various fields like math and AI. It encompasses data collection techniques, data cleaning, preprocessing, and visualization to ensure high-quality data analysis. Effective data visualization aids in understanding trends and patterns, but care must be taken to avoid misinterpretations.

Introduction to Data Science

Data Science
• Data science is the study of data to find useful information for
businesses. It involves using ideas and methods from different areas like
math, statistics, artificial intelligence, and computer engineering to
examine large amounts of data. This helps data scientists understand
things like what happened, why it happened, what might happen next, and
what actions can be taken based on the results.
Importance of Data Science
• Data science is important because it uses tools, methods, and technology
to make sense of data. Today, organizations collect a huge amount of data
from many sources, like devices that automatically gather and store
information. Online systems, payment platforms, and other areas like e-
commerce, healthcare, and finance capture large amounts of data. This
includes text, audio, video, and images, all in very large quantities. Data
science helps turn this information into useful insights.
Structured and Unstructured Data
Structured Data
Structured data, also called quantitative data, is information that is
neatly organized, making it easy for computers to read, search, and
analyze. You often find it in tables with rows and columns, like in a
spreadsheet. Each column has a specific heading, and the data in
each row matches that category, such as names, addresses, or dates.
This structure makes it easy for search engines and algorithms to
understand the data. Since everything is clearly labeled, both people
and computer programs can quickly search and analyze large
amounts of this data.
Unstructured Data
Unstructured data is information that doesn’t have a set format or
organization. Each piece of unstructured data is called an "object"
because it doesn’t have a specific key or label to easily identify it. To
make it searchable, each object needs to be tagged or labeled with
something that helps identify it.

Examples of unstructured data include videos, emails, images, and web content. Unstructured data makes up 80 to 90 percent of all data worldwide, but it’s harder to work with and less immediately useful compared to structured data because it’s more difficult to analyze and get insights from.
Data Collection Methods
Data Collection Methods
Data collection methods are ways to gather information for
research. They can be as simple as asking people questions through
surveys or as complex as running experiments.

Some common methods include surveys, interviews, watching people (observations), group discussions (focus groups), experiments, and analyzing data that already exists (secondary data analysis). After collecting the data, researchers analyze it to see if it supports or challenges their ideas and to draw conclusions about the topic they are studying.
Types of Data Collection Methods

Primary & Secondary Data Collection Methods
Primary Data Collection
Primary data is collected from first-hand experience and has not been used before. Data gathered through primary data collection methods is highly accurate and specific to the research’s purpose.
Primary Data Collection
1. Surveys - Surveys gather information from a specific group of people to understand
their preferences, opinions, and feedback about products and services. Survey tools
usually provide different types of questions to collect this information, such as
multiple-choice, rating scales, or open-ended questions. This helps businesses learn
what their customers think and make better decisions.

2. Polls - Polls consist of just one or a few multiple-choice questions. They are helpful
when you want to quickly gauge the opinions or feelings of an audience. Since polls
are short, people are more likely to respond, making it a fast and easy way to
gather feedback.

3. Interviews - In face-to-face interviews, the interviewer asks the interviewee a set of questions in person and records their answers. If meeting in person isn't possible, the interviewer can conduct the interview over the phone instead.
Primary Data Collection
4. Delphi Technique - In the Delphi method, market experts are given the predictions
and assumptions made by other experts. After reviewing this information, they may
adjust their own estimates. The final demand forecast is created by combining the
opinions of all the experts to reach a shared agreement.

5. Focus Groups - Focus groups are a type of qualitative data collection used in
education. In a focus group, a small group of about 8-10 people discusses a specific
research topic. Each person shares their thoughts and ideas about the issue being
studied, helping researchers understand different perspectives.

6. Questionnaires - A questionnaire is a printed or digital list of questions, either open-ended (where people can give detailed answers) or closed-ended (with set answer choices). Respondents answer based on their knowledge and experience with the topic. A questionnaire is often used in surveys, but it doesn't always have to be part of one.
Secondary Data Collection
Secondary data is information that has
already been collected and used in the
past. Researchers can gather this data
from both internal sources (within an
organization) and external sources (outside
the organization) to use for their studies.
Secondary Data Collection
Internal sources of secondary data:
•Organization’s health and safety records
•Mission and vision statements
•Financial Statements
•Magazines
•Sales Report
•CRM Software
•Executive summaries
Data Cleaning and Preprocessing Basics
In data science and machine learning, the quality of the input data is extremely
important. The performance of machine learning models relies heavily on how
good the data is. This is why data cleaning—finding and fixing (or removing)
incorrect or incomplete data—is a key step in the process.

Data cleaning isn't just about deleting data or filling in missing values. It's a
detailed process that uses different techniques to make raw data ready for
analysis. These techniques include handling missing data, removing duplicates,
converting data types, and more. Each method is used based on the type of
data and the needs of the analysis.
Common Data Cleaning Techniques

Handling Missing Values: Missing data can occur for various reasons, such as errors in data collection or
transfer. There are several ways to handle missing data, depending on the nature and extent of the missing
values.
• Imputation: Here, you replace missing values with substituted values. The substituted value could be a
central tendency measure like mean, median, or mode for numerical data or the most frequent category for
categorical data. More sophisticated imputation methods include regression imputation and multiple
imputation.
• Deletion: You remove the instances with missing values from the dataset. While this method is straightforward, it can lead to loss of information, especially if the data is not missing at random.
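A minimal pandas sketch of the two approaches above, imputation and deletion (the DataFrame and its columns are hypothetical, not from the slides):

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Imputation: fill numerical gaps with the mean, categorical gaps with the mode
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].mean())
df_imputed["city"] = df_imputed["city"].fillna(df_imputed["city"].mode()[0])

# Deletion: drop any row that still contains a missing value
df_dropped = df.dropna()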
Common Data Cleaning Techniques

Removing Duplicates: Duplicate entries can occur for various reasons, such as data entry errors or data merging. These
duplicates can skew the data and lead to biased results. Techniques for removing duplicates involve identifying these
redundant entries based on key attributes and eliminating them from the dataset.
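A minimal sketch of duplicate removal with pandas, assuming a hypothetical key column customer_id:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "purchase": [250, 400, 400, 120],
})

# Flag duplicates based on the key attribute, then drop them,
# keeping only the first occurrence of each customer_id
is_duplicate = df.duplicated(subset=["customer_id"])
df_clean = df.drop_duplicates(subset=["customer_id"], keep="first")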

Data Type Conversion: Sometimes, the data may be in an inappropriate format for a particular analysis or model. For
instance, a numerical attribute may be recorded as a string. In such cases, data type conversion, also known as type casting,
is used to change the data type of a particular attribute or set of attributes. This process involves converting the data into a
suitable format that machine learning algorithms can easily process.
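For example, a sketch of casting a hypothetical price column stored as strings into a numeric type with pandas:

import pandas as pd

df = pd.DataFrame({"price": ["10.5", "20.0", "15.75"]})

# The numerical attribute was recorded as strings; cast it to float
df["price"] = df["price"].astype(float)

# pd.to_numeric is an alternative that can coerce or flag unparseable values
df["price"] = pd.to_numeric(df["price"], errors="coerce")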

Outlier Detection: Outliers are data points that significantly deviate from other observations. They can be caused by
variability in the data or errors. Outlier detection techniques are used to identify these anomalies. These techniques include
statistical methods, such as the Z-score or IQR method, and machine learning methods, such as clustering or anomaly
detection algorithms.
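A sketch of the Z-score and IQR methods mentioned above, on a hypothetical series of values:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# IQR method: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers_iqr = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
outliers_z = s[z.abs() > 3]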
Data Preprocessing

Data preprocessing is critical in data science, particularly for machine learning applications. It
involves preparing and cleaning the dataset to make it more suitable for machine learning
algorithms. This process can reduce complexity, prevent overfitting, and improve the model's
overall performance.

The data preprocessing phase begins with understanding your dataset's nuances and the data's
main issues through Exploratory Data Analysis. Real-world data often presents inconsistencies,
typos, missing data, and different scales. You must address these issues to make the data more
useful and understandable. This process of cleaning and solving most of the issues in the data is
what we call the data preprocessing step.
Common Data Preprocessing Techniques

Data Scaling
Data scaling is a technique used to standardize the range of independent variables or features so that no single feature dominates the others, which matters especially when working with large datasets. This is a crucial step in data preprocessing, particularly for algorithms sensitive to the range of the data, such as deep learning models.
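A minimal scikit-learn sketch of two common scalers (the feature matrix X is hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: squeeze each feature into the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)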

Encoding Categorical Variables
Machine learning models require inputs to be numerical. If your data contains categorical variables, you must encode them as numerical values before fitting and evaluating a model. This process, known as encoding categorical variables, is a common data preprocessing technique. One common method is One-Hot Encoding, which creates a new binary column for each category/label in the original column.
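A sketch of One-Hot Encoding with pandas (the color column is hypothetical):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one new binary column per category/label
encoded = pd.get_dummies(df, columns=["color"])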

Data Splitting
Data Splitting is a technique to divide the dataset into two or three sets, typically training, validation, and test sets. You use
the training set to train the model and the validation set to tune the model's parameters. The test set provides an unbiased
evaluation of the final model. This technique is essential when dealing with large data, as it ensures the model is not
overfitted to a particular subset of data.
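A minimal scikit-learn sketch of splitting into training, validation, and test sets (X and y are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features
y = np.arange(10)                 # hypothetical target

# Hold out 20% of the data as an unbiased test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the remaining training data for tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)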
Common Data Preprocessing Techniques

Handling Missing Values
Missing data in the dataset can lead to misleading results. Therefore, it's essential to handle missing values appropriately. Techniques for handling missing values include deletion (removing the rows with missing values) and imputation (replacing the missing values with statistical measures like the mean, median, or mode). This step is crucial in ensuring the quality of data used for training machine learning models.
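A complementary sketch to the earlier pandas example, this time using scikit-learn's SimpleImputer on a hypothetical array:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)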
Feature Selection
Feature selection is a process in machine learning where you automatically select those features in your data that contribute
most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the
accuracy of many models, especially linear algorithms like linear and logistic regression. This process is particularly
important for data scientists working with high-dimensional data, as it reduces overfitting, improves accuracy, and reduces
training time.

Three benefits of performing feature selection before modeling your data are:
•Reduces Overfitting: Less redundant data means less opportunity to make noise-based decisions.
•Improves Accuracy: Less misleading data means modeling accuracy improves.
•Reduces Training Time: Fewer features mean lower algorithm complexity, so models train faster.
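A sketch of univariate feature selection with scikit-learn's SelectKBest (the dataset and the choice of k are arbitrary, for illustration only):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical relationship to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected features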
Step-by-Step Guide to Data Cleaning

1.Identifying and Removing Duplicate or Irrelevant Data: Duplicate data can arise from various sources, such as the same individual
participating in a survey multiple times or redundant fields in the data collection process. Irrelevant data refers to information you can safely
remove because it is not likely to contribute to the model's predictive capacity. This step is particularly important when dealing with large datasets.

2.Fixing Syntax Errors: Syntax errors can occur due to inconsistencies in data entry, such as date formats, spelling mistakes, or grammatical
errors. You must identify and correct these errors to ensure the data's consistency. This step is crucial in maintaining the quality of data.

3.Filtering out Unwanted Outliers: Outliers, or data points that significantly deviate from the rest of the data, can distort the model's learning
process. These outliers must be identified and handled appropriately by removal or statistical treatment. This process is a part of data reduction.

4.Handling Missing Data: Missing data is a common issue in data collection. Depending on the extent and nature of the missing data, you can
employ different strategies, including dropping the data points or imputing missing values. This step is especially important when dealing with
large data.

5.Validating Data Accuracy: Validate the accuracy of the data through cross-checks and other verification methods. Ensuring data accuracy is
crucial for maintaining the reliability of the machine-learning model. This step is particularly important for data scientists as it directly impacts the
model's performance.
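Putting the five steps above together, a compact and purely illustrative pandas sketch (the file name and column names are hypothetical):

import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical input file

# 1. Remove duplicate records and an irrelevant column
df = df.drop_duplicates().drop(columns=["internal_notes"], errors="ignore")

# 2. Fix simple syntax inconsistencies, e.g. date formats and stray whitespace
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["country"] = df["country"].str.strip().str.title()

# 3. Filter out unwanted outliers (here: implausible ages)
df = df[df["age"].between(0, 120)]

# 4. Handle missing data by imputing a central value
df["age"] = df["age"].fillna(df["age"].median())

# 5. Validate accuracy with a simple cross-check
assert df["customer_id"].is_unique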
Data Visualization
Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion.

In the world of Big Data, data visualization tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions.
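As a small illustration with hypothetical monthly sales figures, a basic line chart with matplotlib:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]  # hypothetical values

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()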
Data Visualization Advantages

•Easily sharing information - Data visualization simplifies complex data into easy-to-understand visuals, making it quicker to share insights with a wider audience.

•Interactively explore opportunities - Interactive visualizations allow users to manipulate data in real time, helping them discover new trends and potential opportunities.

•Visualize patterns and relationships - Visuals like charts and graphs make it easier to spot trends, correlations, and outliers that might not be apparent in raw data.
Data Visualization Disadvantages

•Biased or inaccurate information: Poorly designed visuals or selective data presentation can mislead viewers, resulting in biased or incorrect conclusions.

•Correlation doesn’t always mean causation: Visuals may highlight correlations between variables, but these relationships don’t necessarily imply cause and effect, leading to potential misinterpretations.

•Core messages can get lost in translation: Complex visuals or excessive details can distract viewers, causing the main point or key insights to be overlooked.
As the “age of Big Data” kicks into high gear, visualization is an
increasingly key tool to make sense of the trillions of rows of data
generated every day. Data visualization helps to tell stories by curating
data into a form easier to understand, highlighting the trends and
outliers. A good visualization tells a story, removing the noise from data
and highlighting useful information.
