
Guru Ghasidas Vishwavidyalaya, Bilaspur (C.G.)
Department of Computer Science and Information Technology
Lecture Notes
B.Sc. III Sem. Data Science (MDC3)

Unit – I
Introduction to Data Science

Data science:
Data science is the domain of study that deals with vast volumes of data, using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. Data science uses complex machine learning algorithms to build
predictive models. The data used for analysis can come from many different sources and
may be presented in various formats. Data science is about the extraction, preparation,
analysis, visualization, and maintenance of information. It is a cross-disciplinary field
which uses scientific methods and processes to draw insights from data.

Benefits and Uses of Data Science


a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as "important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition: Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines can refer
users to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on
amenities
h) Optimization: Scheduling ride-share pickups and package deliveries

Big Data:
Big data can be defined as very large volumes of data available from various sources, in
varying degrees of complexity, generated at different speeds (velocities) and with varying
degrees of ambiguity, which cannot be processed using traditional technologies,
processing methods, algorithms or any commercial off-the-shelf solutions. 'Big data' is a
term used to describe collections of data that are huge in size and yet growing exponentially
with time. In short, such data is so large and complex that none of the traditional data
management tools are able to store or process it efficiently.

Characteristics of Big Data:
Characteristics of big data are volume, velocity and variety. They are often referred to as
the three V's.
1. Volume: The volume of data is larger than conventional relational database
infrastructure can cope with, often consisting of terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the
data is generated and processed to meet demands determines the real potential in the
data. It is often created in or near real time.
3. Variety: Variety refers to heterogeneous sources and the nature of the data, both
structured and unstructured.

Benefits and Use of Big Data


• Benefits of Big Data:
1. Improved customer service
2. Businesses can utilize outside intelligence when making decisions
3. Reducing maintenance costs
4. Re-developing our products: Big data can also help us understand how others perceive
our products, so that we can adapt them or our marketing, if need be.
5. Early identification of risks to the product/services, if any
6. Better operational efficiency

• Some examples of big data are:

1. Social media: Social media is one of the biggest contributors to the flood of data we
have today. Facebook generates around 500+ terabytes of data every day in the form of
content generated by users, such as status messages, photo and video uploads,
messages, comments etc.
2. Stock exchange: Data generated by stock exchanges also amounts to terabytes per day.
Most of this data is the trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data during
a 30-minute flight.
4. Survey data: Online or offline surveys conducted on various topics typically have
hundreds or thousands of responses, which need to be processed for analysis and
visualization by clustering the population and their associated responses.
5. Compliance data: Many organizations in healthcare, hospitals, life sciences,
finance, etc. have to file compliance reports.

Facets of Data

Big data and data science involve very large amounts of data. This data comes in
various types; the main categories are as follows:

a) Structured

b) Natural language

c) Graph-based (Network)

d) Streaming

e) Unstructured

f) Machine-generated

g) Audio, video and images

Structured Data

• Structured data is arranged in a row and column format, which makes it easy for
applications to retrieve and process it. A database management system (DBMS) is
typically used to store structured data.

• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data or records is a database where
specific information is stored based on a methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
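
• As a small illustration, here is a minimal sketch, assuming the pandas library is
available, of structured data held in rows and columns; the table and values are
invented for illustration.

import pandas as pd

# A small table of invented student records: each row is a record,
# each column holds one typed attribute.
students = pd.DataFrame({
    "roll_no": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [78, 85, 91],
})

# Because the structure is known, retrieval by column and condition is easy.
toppers = students[students["marks"] > 80]
print(toppers[["roll_no", "name"]])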

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns
are not used for unstructured data, so it is difficult to retrieve required information.
Unstructured data has no identifiable structure.

• Unstructured data can be in the form of text (documents, email messages, customer
feedback), audio, video or images. Email is an example of unstructured data.

• Even today, in most organizations more than 80% of the data is in unstructured
form. This data carries a lot of information, but extracting information from these
various sources remains a very big challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restrictions or sequences for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
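
• To illustrate the extraction challenge noted above, here is a minimal sketch that pulls
email addresses out of free-form text with a regular expression; the sample text and the
deliberately simplified pattern are invented for illustration.

import re

# Free-form text with no rows, columns, or predefined format.
feedback = """Great service! Contact me at priya.k@example.com.
Also cc my colleague at ravi_s@example.org for the invoice."""

# A simplified email pattern; real-world email matching is far messier.
pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(pattern, feedback)
print(emails)  # ['priya.k@example.com', 'ravi_s@example.org']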

Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps
machines to understand language as humans do.

• Natural language processing is the driving force behind machine intelligence in many
modern real-world applications. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text completion and
sentiment analysis.

• For natural language processing to help machines understand human language, it
must go through speech recognition, natural language understanding and machine
translation. It is an iterative process comprising several layers of text analysis.
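
• As a hedged illustration of sentiment analysis, one of the NLP successes listed above,
the sketch below uses NLTK's VADER analyzer; it assumes the nltk package and its
vader_lexicon resource are available, and the sample sentences are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon (a rule-based sentiment model).
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
for sentence in ["The lectures are clear and very helpful.",
                 "The exam schedule is confusing and frustrating."]:
    scores = sia.polarity_scores(sentence)
    # 'compound' is an overall score in [-1, 1]; its sign gives the polarity.
    print(f"{scores['compound']:+.2f}  {sentence}")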

Machine-Generated Data

• Machine-generated data is information that is created without human interaction, as
a result of a computer process or application activity. This means that data entered
manually by an end user is not considered machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers,
users, transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the
output of diagnostic commands, call detail records, sensor data from remote
equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs
and telemetry.
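
• Web server logs, listed above, are a good concrete case: each line is written by a
machine in a fixed pattern. Below is a minimal sketch, using Python's standard re
module, that parses one invented line in the common Apache log format.

import re

# One invented line in the Apache "common log format".
line = '192.168.1.20 - - [10/Jan/2025:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

# Pattern for host, timestamp, request, status code, and response size.
pattern = (r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
           r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)')

match = re.match(pattern, line)
if match:
    fields = match.groupdict()
    print(fields["host"], fields["status"], fields["size"])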

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based
system, as well as by many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the volume of machine
data has surged. The expansion of mobile devices, virtual servers and desktops, as well
as cloud-based services and RFID technologies, is making IT infrastructures more
complex.

Graph-based or Network Data

• Graphs are data structures that describe relationships and interactions between entities
in complex systems. In general, a graph contains a collection of entities called nodes
and a collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph (network)
of nodes.

• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just as we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of thinking about
and using it.

• Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions in
near-real time. With fast graph queries, we are able to detect, for example, that a
potential purchaser is using the same email address and credit card as in a
known fraud case.

• Graph databases can also help users easily detect relationship patterns, such as
multiple people associated with the same personal email address, or multiple people
sharing the same IP address but residing at different physical addresses.
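
• A minimal sketch of the shared-attribute pattern just described, using the networkx
library as a stand-in (a production system would more likely use a graph database);
the accounts and email addresses are invented.

import networkx as nx

# Bipartite graph: account nodes connected to the attribute nodes they use.
G = nx.Graph()
G.add_edges_from([
    ("account_A", "email:joe@example.com"),
    ("account_B", "email:joe@example.com"),   # same email as A -- suspicious
    ("account_C", "email:sue@example.com"),
])

# Flag attribute nodes shared by more than one account.
for node in G.nodes:
    if node.startswith("email:") and G.degree(node) > 1:
        print("shared attribute:", node, "->", sorted(G.neighbors(node)))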

• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories such
as customer interests, friends and purchase history. We can use a highly available
graph database to make product recommendations to a user based on which products
are purchased by others who follow the same sport and have similar purchase history.

• Graph theory was probably the main method in social network analysis in the early
history of the social network concept. The approach is applied to social network
analysis in order to determine important features of the network, such as the nodes and
links (for example, influencers and followers).

• Influencers on a social network have been identified as users that have an impact on
the activities or opinions of other users, by way of followership or influence on decisions
made by other users on the network, as shown in Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social
network data. This is because it can bypass building an actual visual representation of
the data and run directly on data matrices.

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn
out to be challenging for computers.

• The terms audio and video commonly refer to time-based media storage formats for
sound/music and moving-picture information. Digital audio and video recordings,
encoded with audio and video codecs, can be uncompressed, losslessly compressed
or lossily compressed, depending on the desired quality and use case.

• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia
data bring significant challenges in data management and analysis. Many challenges
have to be addressed including big data, multidisciplinary nature of Data Science and
heterogeneity.

• Data Science is playing an important role to address these challenges in multimedia
data. Multimedia data usually contains various forms of media, such as text, image,
video, geographic coordinates and even pulse waveforms, which come from multiple
sources. Data Science can be a key instrument covering big data, machine learning and
data mining solutions to store, handle and analyze such heterogeneous data.

Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources,
which typically send the data records simultaneously and in small sizes (on the order
of kilobytes).

• Streaming data includes a wide variety of data, such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or geospatial
services, and telemetry from connected devices or instrumentation in data centers.
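
• A minimal sketch of the idea: a Python generator stands in for a continuous source,
and records are aggregated one at a time as they arrive instead of after the whole
dataset has been collected. The record shape and values are invented.

import random

def sensor_stream(n_records):
    """Stand-in for a continuous source emitting small records."""
    for i in range(n_records):
        yield {"device": f"sensor-{i % 3}", "temp": round(random.uniform(20, 30), 1)}

# Process records one at a time, keeping only a running aggregate.
count, total = 0, 0.0
for record in sensor_stream(1000):
    count += 1
    total += record["temp"]

print(f"running mean over {count} records: {total / count:.2f}")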

Data Science Process

The Data Science Lifecycle:


The data science lifecycle consists of five distinct stages, each with its own tasks:
Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage
involves gathering raw structured and unstructured data.
Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data
Architecture. This stage covers taking the raw data and putting it in a form that can be
used.
Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization.
Data scientists take the prepared data and examine its patterns, ranges, and biases to
determine how useful it will be in predictive analysis.
Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. Here is the real meat of the lifecycle. This stage involves performing
the various analyses on the data.
Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision
Making. In this final step, analysts prepare the analyses in easily readable forms such as
charts, graphs, and reports.

Steps in the Data Science Process:

Step 1: Define the Problem and Create a Project Charter

Clearly defining the research goals is the first step in the Data Science Process.
A project charter outlines the objectives, resources, deliverables, and timeline,
ensuring that all stakeholders are aligned.

Step 2: Retrieve Data

Data can be stored in databases, data warehouses, or data lakes within an organization.
Accessing this data often involves navigating company policies and requesting
permissions.
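
A minimal sketch of retrieval, using Python's built-in sqlite3 module as a stand-in for
an organizational database; the table, rows and query are invented for illustration.

import sqlite3

# In-memory database standing in for an organizational data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 95.5), ("north", 80.0)])

# Retrieval is typically a query against tables you have permission to read.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
print(rows)  # [('north', 200.0), ('south', 95.5)]
conn.close()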

Step 3: Data Cleansing, Integration, and Transformation

Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data
integration combines datasets from different sources, while data
transformation prepares the data for modeling by reshaping variables or creating new
features.
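
A minimal sketch of all three activities using pandas and NumPy (an assumption; the
source does not prescribe a tool); the two source tables are invented.

import numpy as np
import pandas as pd

# Two invented source tables to be cleaned, integrated, and transformed.
orders = pd.DataFrame({"customer_id": [1, 2, 2, None],
                       "amount": [250.0, 90.0, 90.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "city": ["Bilaspur", "Raipur"]})

orders = orders.dropna(subset=["customer_id"]).drop_duplicates()  # cleansing
orders["customer_id"] = orders["customer_id"].astype(int)

merged = orders.merge(customers, on="customer_id")        # integration: join sources
merged["log_amount"] = np.log(merged["amount"]).round(2)  # transformation: new feature
print(merged)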

Step 4: Exploratory Data Analysis (EDA)

During EDA, various graphical techniques like scatter plots, histograms, and box plots
are used to visualize data and identify trends. This phase helps in selecting the right
modeling techniques.
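
A minimal sketch of two of these graphical techniques using matplotlib and NumPy;
the data is randomly generated for illustration.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)          # invented feature
y = 2 * x + rng.normal(0, 15, 200)   # invented target, roughly linear in x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)                 # histogram: distribution of one variable
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=10)              # scatter plot: relationship between two
ax2.set_title("x vs y")
plt.tight_layout()
plt.show()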

Step 5: Build Models

In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of
the problem and the type of data.
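
A minimal sketch using scikit-learn (an assumption; the source does not prescribe a
library), training a simple classifier on a bundled dataset and checking it on held-out
data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split a bundled dataset so the model is judged on unseen data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)  # a simple baseline classifier
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")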

Step 6: Present Findings and Deploy Models

Once the analysis is complete, results are presented to stakeholders. Models are
deployed into production systems to automate decision-making or support ongoing
analysis.

