Unit 1

Uploaded by

mayura.shelke

Unit 1: Introduction to Data Science (07 Hours)

Basics and need of Data Science, Applications of Data Science, Relationship between Data
Science and Information Science, Business intelligence versus Data Science,

Data: Data Types, Data Collection.

Need of Data wrangling, Methods: Data Cleaning, Data Integration, Data Reduction, Data
Transformation, and Data Discretization.
Data Science Jobs
As per various surveys, data scientist is becoming the most in-demand job
of the 21st century due to the increasing demand for data science.

Some people also call it "the hottest job title of the 21st century". Data
scientists are experts who can use various statistical tools and machine
learning algorithms to understand and analyze data.

The average salary range for a data scientist is approximately $95,000 to
$165,000 per annum, and as per different studies, about 11.5 million jobs
are expected to be created by the year 2026.
Types of Data Science Jobs

If you learn data science, you get the opportunity to pursue various exciting job roles in
this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine Learning Expert
4. Data Engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
Data science is a field of study and practice that involves the collection, storage,
and processing of data in order to derive important insights into a problem or a
phenomenon. Such data may be generated by humans (surveys, logs, etc.) or
machines (weather data, road vision, etc.), and could be in different formats
(text, audio, video, augmented or virtual reality, etc.).
Need of Data Science
Applications of Data Science
•Fraud and risk detection: Over the years, financial organizations have learned
to analyze the probabilities of risks and defaults through customer profiling, past
expenditures, and other variables available through data.

•Healthcare: Data science makes it possible to manage and analyze very large
diverse datasets in healthcare systems, drug development, medical image
analysis, and more. Recently Data Science approaches were brought in to combat
the COVID-19 pandemic. Data Scientists helped in digital contact tracing,
diagnosis, risk assessment, resource allocation, estimating epidemiological
parameters, drug development, social media analytics, etc.
•Internet search: All search engines, including Google, use data science
algorithms to deliver the best result for searched queries within seconds.

•Targeted advertising: Digital ads have a higher click-through rate (CTR) than
traditional ads because targeted advertising is based on a user’s past behavior
with the help of data science algorithms.

•Recommendation systems: Internet giants as well as other businesses have
fervidly made use of recommendation engines to promote their products based
on users’ previous search results and their interests.
•Advanced image, speech, or character recognition: Facial recognition
algorithms on Facebook, speech recognition products, such as Siri, Cortana,
Alexa, etc., and Google Lens are all perfect examples of data science applications
in image, speech, and character recognition.

•Gaming: Today, games use machine learning algorithms to improve or upgrade
themselves as players move up to higher levels. In motion gaming, the opponent
(computer) is able to analyze a player’s previous moves and accordingly shape up
its game. This is all possible because of data science.

•Augmented reality (AR): Augmented reality promises an exciting future
through Data Science. A VR headset, for example, contains algorithms, data, and
computing knowledge to offer the best viewing experience.
How Does Data Science relate to other fields?
Data is everywhere.
Humans and machines are constantly creating new data.
Data scientists are interested in investigating the characteristics of data – looking
for patterns that reveal how people and society can benefit from data.
1. Data Science and Statistics
2. Data Science and Computer Science
3. Data Science and Engineering
4. Data Science and Business Analytics
Relationship between Data Science and Information Science
1.4.1 Information vs. Data
• Data is something raw, meaningless, an object that, when analyzed or converted
to a useful form, becomes information.
• Information is also defined as “data that are endowed with meaning and
purpose.”
For example, the number “480,000” is a data point. But when we add an
explanation that it represents the number of deaths per year in the USA from
cigarette smoking, it becomes information.
1.4.2 Users in Information Science
Different users may not agree on a piece of information’s relevancy depending on
various factors that affect judgment, such as “usefulness.”
Usefulness is a criterion that determines how useful the interaction between the
user and the information object (data) is in accomplishing the task or goal of the user.
Business intelligence (BI)
• Business intelligence (BI) is a set of strategies and technologies enterprises use to analyze
business information and transform it into actionable insights that inform strategic and
tactical business decisions.

• BI tools access and analyze data sets and present analytical findings in reports, summaries,
dashboards, graphs, charts, and maps to provide users with detailed intelligence about the
state of the business.
Business Intelligence versus Data Science

| Factor | Business Intelligence | Data Science |
| --- | --- | --- |
| Concept | It is a collection of processes, tools, and technologies that help a business with data analysis. | It consists of mathematical and statistical models used for processing the data, discovering hidden patterns, and predicting future actions based on those patterns. |
| Data | It deals mainly with structured data. | It accepts both structured and unstructured data. |
| Flexibility | Data sources should be planned before the visualization. | Data sources can be added anytime based on the requirements. |
| Approach | It has both statistical and visual approaches toward data analysis. | Graph analysis, NLP, machine learning, neural networks, and other methods can be used to process the data. |
| Expertise | It is made for business users to visualize raw business information without any technical knowledge. | It requires sound knowledge of data analysis and programming. |
| Complexity | For a single user, compared to data science, business intelligence is much simpler to use and visualize data. | Data science is much more complex when compared to business intelligence. |
Data
1. Data Types
- Structured data
- Unstructured data
2. Data Collection
- Open Data
- Social Media Data
- Multimodal Data
- Data Storage and Presentation
Structured Data
• Structured data is the most important data type.
• It is highly organized information that can be seamlessly included in a database
and readily searched via simple search operations.
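As a minimal sketch of what "readily searched via simple search operations" can look like, the snippet below loads a few rows into an in-memory SQLite database and filters them with one query. The table name, column names, and rows are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Asha", "Pune", 34), ("Ravi", "Mumbai", 41), ("Meera", "Pune", 29)],
)

# Every row has the same named fields, so one simple query finds the matches.
pune_customers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
    )
]
conn.close()
print(pune_customers)
```

The same question asked of free text (for example, a pile of customer emails) would first require extracting names and cities, which is where the extra effort of unstructured data comes from.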
Unstructured data
• Unstructured data is data without labels.
• Examples of unstructured data include text, mobile activity, social media
posts, Internet of Things (IoT) sensor data, etc.
• Challenges with unstructured data:
• The lack of structure makes compiling and organizing unstructured data a
time- and energy-consuming task.
• Structured data, by contrast, is akin to machine language, in that it makes
information much easier for computers to parse.
Data Collections
1. Open Data
- Data should be freely available in the public domain.
- It can be used by anyone as they wish, without restrictions from copyright, patents, or other
mechanisms of control.
Principles associated with open data:
a. Public.
b. Accessible.
c. Described.
d. Reusable.
e. Complete.
f. Timely.
2. Social Media Data
- Social media provides data to analyze for research or marketing purposes.
- This is facilitated by the Application Programming Interfaces (APIs) that social media companies
provide to researchers and developers.
3. Multimodal Data
- Internet of Things (IoT).

4. Data Storage and Presentation
- Depending on its nature, data is stored in various formats.
- The most commonly used formats store data as simple text: comma-separated values (CSV)
and tab-separated values (TSV).

Other formats are:
- XML (eXtensible Markup Language)
- RSS (Really Simple Syndication)
- JSON (JavaScript Object Notation)
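To make the storage formats above concrete, here is a small sketch that reads the same record from CSV, TSV, JSON, and XML text using only the Python standard library. The field names (`city`, `temp_c`) and the values are assumptions made up for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record expressed in four text formats.
csv_text = "city,temp_c\nPune,31\n"
tsv_text = "city\ttemp_c\nPune\t31\n"
json_text = '{"city": "Pune", "temp_c": 31}'
xml_text = "<reading><city>Pune</city><temp_c>31</temp_c></reading>"

# CSV and TSV differ only in the delimiter character.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
tsv_row = next(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# JSON keeps basic types such as numbers.
json_row = json.loads(json_text)

# XML nests values inside named tags; here each child tag becomes a field.
root = ET.fromstring(xml_text)
xml_row = {child.tag: child.text for child in root}

print(csv_row, tsv_row, json_row, xml_row)
```

Note that CSV, TSV, and XML deliver every value as text, so `temp_c` arrives as the string "31", while JSON delivers it as the number 31. Such type differences are one reason wrangling is needed after loading.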

Need of Data wrangling/ Data Preprocessing
Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data
into the desired format for analysts to use for prompt decision-making. It is also known as data
cleaning or data munging.

Data wrangling enables businesses to tackle more complex data in less time, produce more
accurate results, and make better decisions.

Here are some of the factors that indicate that data is not clean or ready to process:

1. Incomplete.
2. Noisy.
3. Inconsistent.
Forms of data pre-processing
Data Cleaning
1. Data Munging
- Often the data is not in a format that is easy to work with.
- It may be stored or presented in a way that is hard to process.
- We need to convert it to something more suitable for a computer to understand.
- This can be done manually, automatically, or, in many cases, semi-automatically.
For example, consider the following text recipe:
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
This can be turned into an “analysis-friendly” table.
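One possible way to turn the recipe sentence into table rows is sketched below. The regular expression and the word-to-number mapping are assumptions written for this one sentence; they are not a general recipe parser.

```python
import re

recipe = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."

# Hypothetical lookup for the quantity words that appear in this sentence.
WORD_TO_NUMBER = {"a": 1, "two": 2, "three": 3}

# Hand-written pattern: quantity, then unit/size, an optional "of", then ingredient.
pattern = re.compile(
    r"(?P<quantity>two|three|a)\s+"
    r"(?P<unit>diced|cloves|pinch)\s+"
    r"(?:of\s+)?"
    r"(?P<ingredient>tomatoes|garlic|salt)"
)

# Each match becomes one row: (ingredient, quantity, unit/size).
rows = [
    (m["ingredient"], WORD_TO_NUMBER[m["quantity"]], m["unit"])
    for m in pattern.finditer(recipe)
]

for ingredient, quantity, unit in rows:
    print(f"{ingredient:10} {quantity:3} {unit}")
```

This is the semi-automatic case: a person writes the pattern by inspecting the text, and the computer applies it.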
2. Handling Missing Data
• Sometimes data may be in the right format, but some of the values are
missing.
• What should we do when we encounter missing data? There is no single good
answer. We need to find a suitable strategy based on the situation.
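The slides do not name specific strategies, so the sketch below shows three common ones as an illustration: dropping missing entries, filling with the mean of the observed values, and carrying the last observed value forward. The readings are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [21.0, None, 23.0, 22.0, None, 24.0]

# Strategy 1: drop the missing entries (simple, but loses rows).
dropped = [x for x in readings if x is not None]

# Strategy 2: fill each gap with the mean of the observed values.
fill_value = mean(dropped)
mean_filled = [fill_value if x is None else x for x in readings]

# Strategy 3: carry the last observed value forward (suits ordered data).
forward_filled = []
last_seen = None
for x in readings:
    if x is not None:
        last_seen = x
    forward_filled.append(last_seen)

print(dropped)
print(mean_filled)
print(forward_filled)
```

Which strategy fits depends on the situation: dropping is reasonable when few values are missing, mean filling keeps the row count but flattens variation, and forward filling assumes neighboring values are similar.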
3. Smooth Noisy Data
• There are times when the data is not missing, but it is corrupted for
some reason.
• This is, in some ways, a bigger problem than missing data.
• Data corruption may be a result of faulty data collection instruments,
data entry problems, or technology limitations.
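One common smoothing technique, not named in the slides, is binning: sort the values, split them into equal-sized bins, and replace each value with the mean of its bin. The sketch below assumes this technique; the price list and the function name `smooth_by_bin_means` are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative prices, not taken from the slides.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Averaging within a bin dampens the effect of a single corrupted reading, at the cost of losing fine detail.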
Data Integration
The following steps describe how to integrate multiple databases or files.

1. Combine data from multiple sources into a coherent storage place (e.g., a single file or a
database).

2. Engage in schema integration, or the combining of metadata from different sources.

3. Detect and resolve data value conflicts. Such conflicts can arise from different
representations or different scales, for example metric vs. British units.

4. Address redundant data in data integration. Redundant data is commonly generated in
the process of integrating multiple databases. For example:

a. The same attribute may have different names in different databases.
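The integration steps above can be sketched on two tiny, hypothetical sources that store the same attribute under different names and units. All field names, values, and the "first source wins" rule for duplicates are assumptions made for this example.

```python
# Hypothetical sources: the same attribute under different names and units.
source_a = [
    {"cust_id": 1, "height_cm": 170.0},
    {"cust_id": 2, "height_cm": 182.0},
]
source_b = [
    {"customer_number": 2, "height_in": 71.65},
    {"customer_number": 3, "height_in": 65.0},
]

INCH_TO_CM = 2.54


def from_a(record):
    # Schema integration: rename source A's key to the shared name.
    return {"customer_id": record["cust_id"], "height_cm": record["height_cm"]}


def from_b(record):
    # Resolve the unit conflict by converting inches to centimetres.
    return {
        "customer_id": record["customer_number"],
        "height_cm": round(record["height_in"] * INCH_TO_CM, 1),
    }


# Combine into one store; on a redundant customer_id the first source seen wins.
merged = {}
for record in [from_a(r) for r in source_a] + [from_b(r) for r in source_b]:
    merged.setdefault(record["customer_id"], record)

integrated = [merged[key] for key in sorted(merged)]
print(integrated)
```

Customer 2 appears in both sources, so only one record is kept. A real project would pick the rule for such duplicates deliberately (most recent, most trusted source, and so on).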

Data Transformation
Data must be transformed so it is consistent and readable (by a system).
The following five processes may be used for data transformation.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Values are scaled to fall within a small, specified range. Some of
the techniques that are used for accomplishing normalization are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction.
a. New attributes constructed from the given ones.
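The three normalization techniques listed above can be sketched as follows, using their usual textbook definitions: min-max maps values linearly onto a target range, z-score subtracts the mean and divides by the standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The sample data are hypothetical, and the sketch assumes the values are not all identical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    # v' = (v - mean) / standard deviation (population form used here)
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    # v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical attribute values.
data = [200, 300, 400, 600, 1000]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```

Min-max is sensitive to outliers because a single extreme value sets the range, while z-score is often preferred when the minimum and maximum are unknown or unstable.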
Data Reduction

• Data reduction is a key process in which a reduced representation of a dataset
that produces the same or similar analytical results is obtained.

• Two of the most common techniques used for data reduction are:

• Data Cube Aggregation: This technique is used to aggregate data in a
simpler form. Data cube aggregation is a multidimensional aggregation that
uses aggregation at various levels of a data cube to represent the original data
set, thus achieving data reduction.

• Dimensionality Reduction
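As an illustration of data cube aggregation, the sketch below rolls hypothetical quarterly sales up to coarser levels: per year and branch, then per year only. The records, the dimension order, and the helper `aggregate` are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, branch, amount).
sales = [
    (2022, "Q1", "A", 100), (2022, "Q2", "A", 150),
    (2022, "Q3", "A", 120), (2022, "Q4", "A", 130),
    (2023, "Q1", "A", 110), (2023, "Q2", "A", 160),
    (2023, "Q3", "A", 140), (2023, "Q4", "A", 170),
    (2022, "Q1", "B", 90), (2023, "Q1", "B", 95),
]


def aggregate(records, key_indexes):
    """Sum the amount (last field) over every dimension not listed in key_indexes."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[i] for i in key_indexes)
        totals[key] += record[-1]
    return dict(totals)


# Roll up the quarter dimension: one total per (year, branch).
by_year_branch = aggregate(sales, key_indexes=(0, 2))

# Roll up quarter and branch: one total per year.
by_year = aggregate(sales, key_indexes=(0,))

print(by_year_branch)
print(by_year)
```

Ten detailed records shrink to four totals, then to two. If the analysis only needs annual figures, the smaller representation gives the same answer.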
Data Discretization

Some data are collected from processes that are continuous, such as
temperature, ambient light, and a company’s stock price. But
sometimes we need to convert these continuous values into more
manageable parts.
There are three types of attributes involved in discretization:
a. Nominal: Values from an unordered set
b. Ordinal: Values from an ordered set
c. Continuous: Real numbers
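A minimal sketch of discretization: continuous temperature readings are mapped onto an ordered set of labels using fixed cut points. The cut points (15 and 25 degrees Celsius), the labels, and the readings are assumptions chosen for illustration.

```python
def discretize(value, edges, labels):
    """Return the label of the first interval whose upper edge exceeds value."""
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    # Values at or above the last edge fall into the final interval.
    return labels[-1]


# Hypothetical cut points in degrees Celsius and their ordinal labels.
edges = [15.0, 25.0]
labels = ["cold", "mild", "hot"]

temperatures = [9.5, 15.0, 22.3, 25.0, 31.8]
categories = [discretize(t, edges, labels) for t in temperatures]
print(categories)
```

The result turns a continuous attribute into an ordinal one: the labels have a natural order (cold, mild, hot) but no longer carry exact distances between values.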
