Unit 1
Prepared by: Prof. Megha Mehta
Defining data science
Recognizing the different types of data
Gaining insight into the data science process
Fields of data science
Data science is the study of data to extract meaningful insights for business.
It is a multidisciplinary approach that combines principles and practices from the fields
of mathematics, statistics, artificial intelligence, and computer engineering to analyse
large amounts of data.
This analysis helps data scientists to ask and answer questions like what happened, why
it happened, what will happen, and what can be done with the results.
Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today. It adds methods from computer science to the
repertoire of statistics.
Data science is important because it combines tools, methods, and technology to
generate meaning from data.
Modern organizations are inundated with data; there is a proliferation of devices that
can automatically collect and store information.
Online systems and payment portals capture ever more data in e-commerce, medicine,
finance, and nearly every other aspect of human life.
We have text, audio, video, and image data available in vast quantities.
In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques.
The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Structured data is data that depends on a data model and resides in a fixed field within a
record.
As such, it’s often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.
You may also come across structured data that is still difficult to store in a traditional
relational database.
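As a minimal sketch of how structured data is queried, the following uses Python's built-in sqlite3 module; the customers table and its rows are invented purely for illustration.

```python
import sqlite3

# Create an in-memory database with a small, hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Ahmedabad"), ("Ravi", "Mumbai"), ("Meera", "Ahmedabad")],
)

# Because every record fits the same fixed fields, a simple SQL query can answer
# questions such as "how many customers per city?".
for city, count in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(city, count)

conn.close()
```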
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it's
a challenge to find, for example, the number of people who have written an email
complaint about a specific employee, because there are so many ways to refer to a person.
The thousands of different languages and dialects out there further complicate this.
Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain don’t generalize well to other domains.
Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of
text.
This shouldn’t be a surprise though: humans struggle with natural language as well. It’s
ambiguous by nature.
The concept of meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can vary
when coming from someone upset or joyous.
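The following is only a toy sketch of lexicon-based sentiment scoring, using just the Python standard library; the word lists and example sentences are invented, and real NLP systems rely on trained models rather than hand-written lists.

```python
import re
from collections import Counter

# A toy, lexicon-based sentiment scorer. The word lists below are invented for
# illustration; real NLP systems learn such signals from data.
POSITIVE = {"great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "angry", "complaint", "terrible"}

def naive_sentiment(text: str) -> int:
    """Return a crude sentiment score: positive word count minus negative word count."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)

print(naive_sentiment("I love this product, it is excellent"))       # prints 2
print(naive_sentiment("Terrible service, I am filing a complaint"))  # prints -2
```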
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by
Frost & Sullivan to refer to the integration of complex physical machinery with
networked sensors and software) will be approximately $540 billion in 2020.
IDC (International Data Corporation) has estimated there will be 26 times more
connected things than people in 2020. This network is commonly referred to as the
internet of things.
The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
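As a small sketch of working with machine-generated data, the following parses one (made-up) web server log line, assuming the common Apache/NGINX access-log layout.

```python
import re

# One line of machine-generated data: a web server access log entry (made up).
LOG_LINE = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

# Pattern for the common/combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

match = LOG_PATTERN.match(LOG_LINE)
if match:
    record = match.groupdict()
    print(record["ip"], record["method"], record["path"], record["status"])
```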
“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory.
In graph theory, a graph is a mathematical structure to model pair-wise relationships
between objects.
Graph or network data is, in short, data that focuses on the relationship or adjacency of
objects. Graph structures use nodes, edges, and properties to represent and store
graph data.
Graph-based data is a natural way to represent social networks, and its structure allows
you to calculate specific metrics such as the influence of a person and the shortest path
between two people.
Examples of graph-based data can be found on many social media websites.
For instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.
The power and sophistication come from multiple, overlapping graphs over the same
nodes.
For example, imagine one graph whose edges connect "friends" on Facebook.
Imagine another graph with the same people that connects business colleagues via
LinkedIn, and a third graph based on movie interests on Netflix.
Overlapping the three different-looking graphs makes more interesting questions
possible.
Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.
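A full graph database is overkill for a sketch, so the following models a tiny, hypothetical friend network as a plain Python adjacency dictionary and uses breadth-first search to find the shortest path between two people, one of the metrics mentioned above.

```python
from collections import deque

# A tiny social graph stored as an adjacency dictionary. Names are hypothetical.
friends = {
    "Asha":  ["Ravi", "Meera"],
    "Ravi":  ["Asha", "John"],
    "Meera": ["Asha", "John"],
    "John":  ["Ravi", "Meera", "Priya"],
    "Priya": ["John"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of people linking start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "Asha", "Priya"))  # ['Asha', 'Ravi', 'John', 'Priya']
```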
Graph data poses its own challenges, but audio and image data can be even more
difficult for a computer to interpret.
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers. MLBAM (Major League Baseball Advanced Media)
announced in 2014 that it would increase video capture to approximately 7 TB per game
for the purpose of live, in-game analytics.
High-speed cameras at stadiums will capture ball and athlete movements to calculate in
real time, for example, the path taken by a defender relative to two baselines.
Recently a company called DeepMind succeeded in creating an algorithm that's
capable of learning how to play video games.
This algorithm takes the video screen as input and learns to interpret everything via a
complex process of deep learning.
It's a remarkable feat that prompted Google to buy the company for its own Artificial
Intelligence (AI) development plans.
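Both recognizing objects in pictures and learning from a game screen start from the same raw material: to a computer, an image or video frame is just a grid of numbers. The tiny grayscale "image" below is invented purely for illustration.

```python
# A grayscale image is a grid of pixel intensities (0 = black, 255 = white).
# This 4x4 "image" is made up for illustration.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

# What a person instantly sees as a pattern, an algorithm must infer from raw
# numbers, e.g. by comparing average brightness per region.
top_left = [pixel for row in image[:2] for pixel in row[:2]]
print(sum(top_left) / len(top_left))  # 0.0 -> the top-left block is dark
```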
The learning algorithm takes in data as it’s produced by the computer game; it’s
streaming data.
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a
data store in a batch.
Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your process to deal with this type of information.
Examples are the "What's trending" section on Twitter, live sporting or music events, and
the stock market.
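A minimal sketch of the difference between batch loading and streaming is shown below: a Python generator simulates events arriving one at a time, and each event is processed as it comes in. The simulated stock-price ticks are made up.

```python
import random
import time

def event_stream(n_events=5):
    """Simulate a data stream: events arrive one at a time, not as a stored batch."""
    for _ in range(n_events):
        yield {"price": round(random.uniform(99.0, 101.0), 2)}  # made-up stock ticks
        time.sleep(0.1)  # pretend we are waiting for the next real-world event

# Process each event as it arrives, e.g. by maintaining a running average.
total, count = 0.0, 0
for event in event_stream():
    count += 1
    total += event["price"]
    print(f"event {count}: price={event['price']}, running average={total / count:.2f}")
```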
The data science process typically consists of six steps, as you can see in the mind map.
The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
Data science is mostly applied in the context of an organization.
When the business asks you to perform a data science project, you’ll first prepare a
project charter.
This charter contains information such as what you’re going to research, how the
company benefits from that, what data and resources you need, a timetable, and
deliverables.
The second phase is data retrieval.
You want to have data available for analysis, so this step includes finding suitable data
and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
You’ve stated in the project charter which data you need and where you can find it.
In this step you ensure that you can use the data in your program, which means checking
the data's existence, its quality, and your access to it.
Data can also be delivered by third-party companies and takes many forms ranging
from Excel spreadsheets to different types of databases.
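A sketch of what data retrieval can look like in practice is shown below, assuming pandas is available; the file, database, and table names are placeholders, and tiny stand-in sources are created first so the example is self-contained.

```python
import sqlite3
import pandas as pd

# In a real project these sources come from data owners or third parties; here we
# create tiny stand-ins so the sketch runs on its own. Names are placeholders.
with open("sales_2023.csv", "w") as f:
    f.write("customer_id,amount\n1,100.0\n2,250.0\n")

conn = sqlite3.connect("company.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, region TEXT)")
conn.execute("DELETE FROM customers")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "West"), (2, "North")])
conn.commit()

# Data retrieval: a flat-file delivery and a table owned by another team.
sales = pd.read_csv("sales_2023.csv")
customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

# A quick check that the raw data exists and is accessible.
print(sales.shape, customers.shape)
print(sales.head())
```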
Now that you have the raw data, it’s time to prepare it.
This includes transforming the data from a raw form into data that’s directly usable in
your models.
To achieve this, you’ll detect and correct different kinds of errors in the data, combine
data from different data sources, and transform it.
If you have successfully completed this step, you can progress to data visualization and
modeling.
This phase consists of three subphases: data cleansing removes false values from a data
source and inconsistencies across data sources; data integration enriches data sources
by combining information from multiple data sources; and data transformation ensures
that the data is in a suitable format for use in your models.
Data collection is an error-prone process; in this phase you enhance the quality of the
data and prepare it for use in subsequent steps.
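A sketch of the three subphases using pandas on a small, invented dataset is shown below; the column names and values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical raw data with typical collection errors.
sales = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, -1.0, 250.0, None],   # -1 and None are invalid entries
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "North", "West"],
})

# Data cleansing: drop impossible and missing amounts.
sales = sales[sales["amount"] > 0].dropna(subset=["amount"])

# Data integration: enrich the sales records with customer information.
combined = sales.merge(customers, on="customer_id", how="left")

# Data transformation: reshape into a form a model can use, e.g. totals per region.
per_region = combined.groupby("region")["amount"].sum().reset_index()
print(per_region)
```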
The fourth step is data exploration.
The goal of this step is to gain a deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive techniques. The insights
you gain from this phase will enable you to start modeling.
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the data,
and whether there are outliers.
To achieve this you mainly use descriptive statistics, visual techniques, and simple
modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
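A sketch of typical EDA steps is given below, assuming pandas and matplotlib; the small dataset is invented purely to illustrate descriptive statistics, correlations, and visual checks for outliers.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A small, invented dataset standing in for the data retrieved earlier.
df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29, 41, 95],   # 95 looks like a possible outlier
    "income": [28, 52, 45, 80, 70, 38, 60, 61],   # in thousands
})

# Descriptive statistics and pairwise correlations: the core of EDA.
print(df.describe())
print(df.corr())

# Visual techniques: a histogram and a scatter plot reveal distributions and outliers.
df["age"].hist(bins=8)
plt.title("Age distribution")
plt.show()

df.plot.scatter(x="age", y="income", title="Age vs. income")
plt.show()
```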
In this phase you use models, domain knowledge, and insights about the data you found
in the previous steps to answer the research question.
You select a technique from the fields of statistics, machine learning, operations
research, and so on.
Building a model is an iterative process that involves selecting the variables for the
model, executing the model, and model diagnostics.
It is now that you attempt to gain the insights or make the predictions stated in your
project charter. Now is the time to bring out the heavy guns, but remember that research
has taught us that a combination of simple models often (but not always) outperforms
one complicated model.
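A sketch of this iterative modeling step is shown below, assuming scikit-learn; it uses synthetic data and compares two simple models with a combination of them (a voting ensemble), echoing the point that combined simple models can compete with a single complicated one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Two simple models, plus a combination of them (a voting ensemble).
simple_a = LogisticRegression(max_iter=1000)
simple_b = DecisionTreeClassifier(max_depth=3, random_state=0)
combined = VotingClassifier([("lr", simple_a), ("tree", simple_b)], voting="soft")

# Model diagnostics via cross-validation: compare candidates and iterate.
for name, model in [("logistic", simple_a), ("tree", simple_b), ("combined", combined)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```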
If you’ve done this phase right, you’re almost done.
The last step of the data science process is presenting your results and automating the
analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business
process as expected. This is where you can shine in your influencer role.
The importance of this step is more apparent in projects on a strategic and tactical level.
Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Finally, you present the results to your business.
These results can take many forms, ranging from presentations to research reports.
Sometimes you’ll need to automate the execution of the process because the business
will want to use the insights you gained in another project or enable an operational
process to use the outcome from your model.
The field of data science encompasses multiple subdisciplines such as data analytics,
data mining, artificial intelligence, machine learning, and others.
Data Analytics
While data analysts are focused on extracting meaningful insights from various data sources,
data scientists go beyond that to “forecast the future based on past patterns,” according to
SimpliLearn. “A data scientist creates questions, while a data analyst finds answers to the
existing set of questions.”
Artificial Intelligence
Commonly called AI, artificial intelligence, according to Techopedia, “aims to imbue software
with the ability to analyze its environment using either predetermined rules and search
algorithms or pattern recognizing machine learning models, and then make decisions based
on those analyses. In this way, AI attempts to mimic biological intelligence to allow the software
application or system to act with varying degrees of autonomy, thereby reducing manual
human intervention for a wide range of functions.”
Machine Learning
Machine learning algorithms use statistics to find patterns in massive amounts of data,
according to MIT Technology Review. A subdiscipline of AI, “machine learning is the process
that powers many of the services we use today — recommendation systems like those on
Netflix, YouTube, and Spotify; search engines like Google and Baidu; social-media feeds like
Facebook and Twitter; voice assistants like Siri and Alexa. The list goes on.”
Data Visualization
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
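A sketch of a basic visualization with matplotlib is shown below; the monthly sales figures are invented and serve only to show how a chart makes a trend easy to see.

```python
import matplotlib.pyplot as plt

# Invented monthly sales figures, used only to illustrate a simple visualization.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 175, 190]

plt.plot(months, sales, marker="o")
plt.title("Monthly sales (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.grid(True)
plt.show()
```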
Data mining
Data mining is the process of extracting and discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems.
Statistics and Probability
Statistics and probability represent a considerable area of mathematics that greatly impacts
data science and is among its most widely used foundations. This specialty area is all about
establishing and working with finite figures as well as the effects of the ever-present factor of
“chance” in all things. Data scientists with training in this area are a great asset to general and
specialized areas of the data science industry today, including:
Epidemiologist
Statistician
Business Intelligence Analyst
Social Science Data Analyst
General Data Scientist