
Guru Ghasidas Vishwavidyalaya, Bilaspur (C.G.)
Department of Computer Science and Information Technology
Lecture Notes
B.Sc. III Sem. Data Science (MDC3)

Unit – I
Introduction to Data Science

Data science:
Data science is the domain of study that deals with vast volumes of data, using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. Data science uses complex machine learning algorithms to build
predictive models. The data used for analysis can come from many different sources and
may be presented in various formats. Data science is about the extraction, preparation,
analysis, visualization, and maintenance of information. It is a cross-disciplinary field
which uses scientific methods and processes to draw insights from data.

Benefits and Uses of Data Science


a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as "important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition: Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines can refer
users to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on
amenities
h) Optimization: Scheduling ride-share pickups and package deliveries

Big Data:
Big data can be defined as very large volumes of data available from various sources, in
varying degrees of complexity, generated at different speeds (velocities) and with varying
degrees of ambiguity, which cannot be processed using traditional technologies,
processing methods, algorithms or any commercial off-the-shelf solutions. 'Big data' is a
term used to describe collections of data that are huge in size and yet growing exponentially
with time. In short, such data is so large and complex that none of the traditional data
management tools are able to store or process it efficiently.

Characteristics of Big Data:
Characteristics of big data are volume, velocity and variety. They are often referred to as
the three V's.
1. Volume: The volume of data is larger than conventional relational database
infrastructure can cope with, often consisting of terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the
data is generated and processed to meet demands determines the real potential in the
data. It is often created in or near real time.
3. Variety: Variety refers to heterogeneous sources and the nature of the data, both
structured and unstructured.

Benefits and Use of Big Data


• Benefits of Big Data:
1. Improved customer service
2. Businesses can utilize outside intelligence when making decisions
3. Reducing maintenance costs
4. Re-developing our products: Big data can also help us understand how others perceive
our products, so that we can adapt them or our marketing, if need be.
5. Early identification of risks to the product/services, if any
6. Better operational efficiency

• Some examples of big data are:

1. Social media: Social media is one of the biggest contributors to the flood of data we
have today. Facebook generates around 500+ terabytes of data every day in the form of
content generated by users, such as status messages, photo and video uploads,
messages, comments etc.
2. Stock exchange: Data generated by stock exchanges also amounts to terabytes per day.
Most of this data is the trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data during
a 30-minute flight.
4. Survey data: Online or offline surveys conducted on various topics typically have
hundreds or thousands of responses, which need to be processed for analysis and
visualization by clustering the population and their associated responses.
5. Compliance data: Many organizations in healthcare, hospitals, life sciences,
finance, etc. have to file compliance reports.

Facets of Data

Big data and data science involve very large amounts of data. This data comes in
various types; the main categories are as follows:

a) Structured

b) Natural language

c) Graph-based (Network)

d) Streaming

e) Unstructured

f) Machine-generated

g) Audio, video and images

Structured Data

• Structured data is arranged in a row and column format, which makes it easy for
applications to retrieve and process it. A database management system (DBMS) is
typically used to store structured data.

• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data or records is a database where
specific information is stored based on a methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
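
• As a small illustration, here is a minimal sketch, assuming the pandas library is
available, of structured data held in rows and columns; the table and values are
invented for illustration.

import pandas as pd

# A small table of invented student records: each row is a record,
# each column holds one typed attribute.
students = pd.DataFrame({
    "roll_no": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [78, 85, 91],
})

# Because the structure is known, retrieval by column and condition is easy.
toppers = students[students["marks"] > 80]
print(toppers[["roll_no", "name"]])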

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns
are not used for unstructured data, so it is difficult to retrieve required information.
Unstructured data has no identifiable structure.

• Unstructured data can be in the form of text (documents, email messages, customer
feedback), audio, video or images. Email is an example of unstructured data.

• Even today, in most organizations more than 80% of the data is in unstructured
form. This data carries a lot of information, but extracting information from these
various sources remains a very big challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restrictions or sequences for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
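
• To illustrate the extraction challenge noted above, here is a minimal sketch that pulls
email addresses out of free-form text with a regular expression; the sample text and the
deliberately simplified pattern are invented for illustration.

import re

# Free-form text with no rows, columns, or predefined format.
feedback = """Great service! Contact me at priya.k@example.com.
Also cc my colleague at ravi_s@example.org for the invoice."""

# A simplified email pattern; real-world email matching is far messier.
pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(pattern, feedback)
print(emails)  # ['priya.k@example.com', 'ravi_s@example.org']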

Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps
machines to understand language as humans do.

• Natural language processing is the driving force behind machine intelligence in many
modern real-world applications. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text completion and
sentiment analysis.

• For natural language processing to help machines understand human language, it
must go through speech recognition, natural language understanding and machine
translation. It is an iterative process comprising several layers of text analysis.
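
• As a hedged illustration of sentiment analysis, one of the NLP successes listed above,
the sketch below uses NLTK's VADER analyzer; it assumes the nltk package and its
vader_lexicon resource are available, and the sample sentences are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon (a rule-based sentiment model).
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
for sentence in ["The lectures are clear and very helpful.",
                 "The exam schedule is confusing and frustrating."]:
    scores = sia.polarity_scores(sentence)
    # 'compound' is an overall score in [-1, 1]; its sign gives the polarity.
    print(f"{scores['compound']:+.2f}  {sentence}")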

Machine-Generated Data

• Machine-generated data is information that is created without human interaction, as
a result of a computer process or application activity. This means that data entered
manually by an end user is not considered machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers,
users, transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the
output of diagnostic commands, call detail records, sensor data from remote
equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs
and telemetry.
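
• Web server logs, listed above, are a good concrete case: each line is written by a
machine in a fixed pattern. Below is a minimal sketch, using Python's standard re
module, that parses one invented line in the common Apache log format.

import re

# One invented line in the Apache "common log format".
line = '192.168.1.20 - - [10/Jan/2025:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

# Pattern for host, timestamp, request, status code, and response size.
pattern = (r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
           r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)')

match = re.match(pattern, line)
if match:
    fields = match.groupdict()
    print(fields["host"], fields["status"], fields["size"])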

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based
system, as well as by many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the volume of machine
data has surged. The expansion of mobile devices, virtual servers and desktops, as well
as cloud-based services and RFID technologies, is making IT infrastructures more
complex.

Graph-based or Network Data

• Graphs are data structures that describe relationships and interactions between entities
in complex systems. In general, a graph contains a collection of entities called nodes
and a collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph (network)
of nodes.

• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just as we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of thinking about
and using it.

• Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions in
near-real time. With fast graph queries, we are able to detect, for example, that a
potential purchaser is using the same email address and credit card as in a
known fraud case.

• Graph databases can also help users easily detect relationship patterns, such as
multiple people associated with the same personal email address, or multiple people
sharing the same IP address but residing at different physical addresses.
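
• A minimal sketch of the shared-attribute pattern just described, using the networkx
library as a stand-in (a production system would more likely use a graph database);
the accounts and email addresses are invented.

import networkx as nx

# Bipartite graph: account nodes connected to the attribute nodes they use.
G = nx.Graph()
G.add_edges_from([
    ("account_A", "email:joe@example.com"),
    ("account_B", "email:joe@example.com"),   # same email as A -- suspicious
    ("account_C", "email:sue@example.com"),
])

# Flag attribute nodes shared by more than one account.
for node in G.nodes:
    if node.startswith("email:") and G.degree(node) > 1:
        print("shared attribute:", node, "->", sorted(G.neighbors(node)))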

• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories such
as customer interests, friends and purchase history. We can use a highly available
graph database to make product recommendations to a user based on which products
are purchased by others who follow the same sport and have similar purchase history.

• Graph theory was probably the main method in social network analysis in the early
history of the social network concept. The approach is applied to social network
analysis in order to determine important features of the network, such as the nodes and
links (for example, influencers and followers).

• Influencers on a social network have been identified as users that have an impact on
the activities or opinions of other users, by way of followership or influence on decisions
made by other users on the network, as shown in Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social
network data. This is because it can bypass building an actual visual representation of
the data and run directly on data matrices.

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn
out to be challenging for computers.

• The terms audio and video commonly refer to time-based media storage formats for
sound/music and moving-picture information. Digital audio and video recordings,
encoded with audio and video codecs, can be uncompressed, losslessly compressed
or lossily compressed, depending on the desired quality and use case.

• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia
data bring significant challenges in data management and analysis. Many challenges
have to be addressed including big data, multidisciplinary nature of Data Science and
heterogeneity.

• Data Science is playing an important role to address these challenges in multimedia
data. Multimedia data usually contains various forms of media, such as text, image,
video, geographic coordinates and even pulse waveforms, which come from multiple
sources. Data Science can be a key instrument covering big data, machine learning and
data mining solutions to store, handle and analyze such heterogeneous data.

Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources,
which typically send the data records simultaneously and in small sizes (on the order
of kilobytes).

• Streaming data includes a wide variety of data, such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or geospatial
services, and telemetry from connected devices or instrumentation in data centers.
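
• A minimal sketch of the idea: a Python generator stands in for a continuous source,
and records are aggregated one at a time as they arrive instead of after the whole
dataset has been collected. The record shape and values are invented.

import random

def sensor_stream(n_records):
    """Stand-in for a continuous source emitting small records."""
    for i in range(n_records):
        yield {"device": f"sensor-{i % 3}", "temp": round(random.uniform(20, 30), 1)}

# Process records one at a time, keeping only a running aggregate.
count, total = 0, 0.0
for record in sensor_stream(1000):
    count += 1
    total += record["temp"]

print(f"running mean over {count} records: {total / count:.2f}")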

Data Science Process

The Data Science Lifecycle:


The data science lifecycle consists of five distinct stages, each with its own tasks:
Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage
involves gathering raw structured and unstructured data.
Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data
Architecture. This stage covers taking the raw data and putting it in a form that can be
used.
Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization.
Data scientists take the prepared data and examine its patterns, ranges, and biases to
determine how useful it will be in predictive analysis.
Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. Here is the real meat of the lifecycle. This stage involves performing
the various analyses on the data.
Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision
Making. In this final step, analysts prepare the analyses in easily readable forms such as
charts, graphs, and reports.

Steps in the Data Science Process:

Step 1: Define the Problem and Create a Project Charter

Clearly defining the research goals is the first step in the Data Science Process.
A project charter outlines the objectives, resources, deliverables, and timeline,
ensuring that all stakeholders are aligned.

Step 2: Retrieve Data

Data can be stored in databases, data warehouses, or data lakes within an organization.
Accessing this data often involves navigating company policies and requesting
permissions.
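
A minimal sketch of retrieval, using Python's built-in sqlite3 module as a stand-in for
an organizational database; the table, rows and query are invented for illustration.

import sqlite3

# In-memory database standing in for an organizational data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 95.5), ("north", 80.0)])

# Retrieval is typically a query against tables you have permission to read.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
print(rows)  # [('north', 200.0), ('south', 95.5)]
conn.close()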

Step 3: Data Cleansing, Integration, and Transformation

Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data
integration combines datasets from different sources, while data
transformation prepares the data for modeling by reshaping variables or creating new
features.
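
A minimal sketch of all three activities using pandas and NumPy (an assumption; the
source does not prescribe a tool); the two source tables are invented.

import numpy as np
import pandas as pd

# Two invented source tables to be cleaned, integrated, and transformed.
orders = pd.DataFrame({"customer_id": [1, 2, 2, None],
                       "amount": [250.0, 90.0, 90.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "city": ["Bilaspur", "Raipur"]})

orders = orders.dropna(subset=["customer_id"]).drop_duplicates()  # cleansing
orders["customer_id"] = orders["customer_id"].astype(int)

merged = orders.merge(customers, on="customer_id")        # integration: join sources
merged["log_amount"] = np.log(merged["amount"]).round(2)  # transformation: new feature
print(merged)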

Step 4: Exploratory Data Analysis (EDA)

During EDA, various graphical techniques like scatter plots, histograms, and box plots
are used to visualize data and identify trends. This phase helps in selecting the right
modeling techniques.
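
A minimal sketch of two of these graphical techniques using matplotlib and NumPy;
the data is randomly generated for illustration.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)          # invented feature
y = 2 * x + rng.normal(0, 15, 200)   # invented target, roughly linear in x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)                 # histogram: distribution of one variable
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=10)              # scatter plot: relationship between two
ax2.set_title("x vs y")
plt.tight_layout()
plt.show()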

Step 5: Build Models

In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of
the problem and the type of data.
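
A minimal sketch using scikit-learn (an assumption; the source does not prescribe a
library), training a simple classifier on a bundled dataset and checking it on held-out
data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split a bundled dataset so the model is judged on unseen data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)  # a simple baseline classifier
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")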

Step 6: Present Findings and Deploy Models

Once the analysis is complete, results are presented to stakeholders. Models are
deployed into production systems to automate decision-making or support ongoing
analysis.

