Unit 1

Uploaded by

mayura.shelke

Unit 1: Introduction to Data Science (07 Hours)

Basics and need of Data Science, Applications of Data Science, Relationship between Data
Science and Information Science, Business Intelligence versus Data Science.

Data: Data Types, Data Collection.

Need of Data wrangling, Methods: Data Cleaning, Data Integration, Data Reduction, Data
Transformation, and Data Discretization.
Data Science Jobs
According to various surveys, data scientist is becoming the most in-demand job
of the 21st century because of the growing demand for data science.

Some people have also called it "the hottest job title of the 21st century". Data
scientists are experts who use various statistical tools and machine
learning algorithms to understand and analyze data.

The average salary for a data scientist is approximately $95,000 to
$165,000 per annum, and according to various studies, about 11.5 million jobs
will be created by the year 2026.
Types of Data Science Jobs

Learning data science opens up a variety of exciting job roles in
this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine Learning Expert
4. Data Engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
Data science is a field of study and practice that involves the collection, storage,
and processing of data in order to derive important insights into a problem or a
phenomenon. Such data may be generated by humans (surveys, logs, etc.) or
machines (weather data, road vision, etc.), and could be in different formats
(text, audio, video, augmented or virtual reality, etc.).
Need of Data Science
Applications of Data Science
•Fraud and risk detection: Over the years, financial organizations have learned
to analyze the probabilities of risks and defaults through customer profiling, past
expenditures, and other variables available through data.

•Healthcare: Data science makes it possible to manage and analyze very large
diverse datasets in healthcare systems, drug development, medical image
analysis, and more. Recently Data Science approaches were brought in to combat
the COVID-19 pandemic. Data Scientists helped in digital contact tracing,
diagnosis, risk assessment, resource allocation, estimating epidemiological
parameters, drug development, social media analytics, etc.
•Internet search: All search engines, including Google, use data science
algorithms to deliver the best result for searched queries within seconds.

•Targeted advertising: Digital ads have a higher click-through rate (CTR) than
traditional ads because targeted advertising is based on a user’s past behavior,
analyzed with the help of data science algorithms.

•Recommendation systems: Internet giants as well as other businesses have
fervidly made use of recommendation engines to promote their products based
on users’ previous search results and their interests.
•Advanced image, speech, or character recognition: Facial recognition
algorithms on Facebook, speech recognition products, such as Siri, Cortana,
Alexa, etc., and Google Lens are all perfect examples of data science applications
in image, speech, and character recognition.

•Gaming: Today, games use machine learning algorithms to improve or upgrade
themselves as players move up to higher levels. In motion gaming, the opponent
(computer) is able to analyze a player’s previous moves and accordingly shape up
its game. This is all possible because of data science.

•Augmented reality (AR): Augmented reality promises an exciting future
through data science. A VR headset, for example, contains algorithms, data, and
computing knowledge to offer the best viewing experience.
How Does Data Science relate to other fields?
Data is everywhere.
Humans and machines are constantly creating new data.
Data scientists are interested in investigating the characteristics of data – looking
for patterns that reveal how people and society can benefit from data.
1. Data Science and Statistics
2. Data Science and Computer Science
3. Data Science and Engineering
4. Data Science and Business Analytics
Relationship between Data Science and Information Science
1.4.1 Information vs. Data
• Data is something raw, meaningless, an object that, when analyzed or converted
to a useful form, becomes information.
• Information is also defined as “data that are endowed with meaning and
purpose.”
For example, the number “480,000” is a data point. But when we add an
explanation that it represents the number of deaths per year in the USA from
cigarette smoking, it becomes information.
1.4.2 Users in Information Science
Different users may not agree on the relevance of a piece of information, depending on
various factors that affect judgment, such as “usefulness.”
Usefulness is a criterion that determines how useful the interaction between the
user and the information object (data) is in accomplishing the task or goal of the user.
Business intelligence (BI)
• Business intelligence (BI) is a set of strategies and technologies enterprises use to analyze
business information and transform it into actionable insights that inform strategic and
tactical business decisions.

• BI tools access and analyze data sets and present analytical findings in reports, summaries,
dashboards, graphs, charts, and maps to provide users with detailed intelligence about the
state of the business.
Business Intelligence versus Data Science

Factor | Business Intelligence | Data Science
Concept | It is a collection of processes, tools, and technologies that help a business with data analysis. | It consists of mathematical and statistical models used for processing the data, discovering hidden patterns, and predicting future actions based on those patterns.
Data | It deals mainly with structured data. | It accepts both structured and unstructured data.
Flexibility | Data sources should be planned before the visualization. | Data sources can be added anytime based on the requirements.
Approach | It has both statistical and visual approaches toward data analysis. | Graph analysis, NLP, machine learning, neural networks, and other methods can be used to process the data.
Expertise | It is made for business users to visualize raw business information without any technical knowledge. | It requires sound knowledge of data analysis and programming.
Complexity | For a single user, compared to data science, business intelligence is much simpler to use and visualize data. | Data science is much more complex when compared to business intelligence.
Data
1. Data Types
- Structured data
- Unstructured data
2. Data Collection
- Open Data
- Social Media Data
- Multimodal Data
- Data Storage and Presentation
Structured data
• Structured data is the most important data type.
• It is highly organized information that can be seamlessly included in a database
and readily searched via simple search operations.
Unstructured data
• Unstructured data is data without labels.
• Examples of unstructured data include text, mobile activity, social media
posts, Internet of Things (IoT) sensor data, etc.
Challenges with Unstructured Data
• The lack of structure makes compiling and organizing unstructured data a
time- and energy-consuming task.
• Structured data is akin to machine language, in that it makes information
much easier for computers to parse.
Data Collection
1. Open Data
- Data should be freely available in the public domain.
- It can be used by anyone as they wish, without restrictions from copyright, patents, or other
mechanisms of control.
List of principles associated with open data:
a. Public.
b. Accessible.
c. Described.
d. Reusable.
e. Complete.
f. Timely.
2. Social Media Data
Social media is a source of data to analyze for research or marketing purposes.
Access is facilitated by the Application Programming Interfaces (APIs) that social media companies
provide to researchers and developers.
3. Multimodal Data

-Internet of Things (IoT).

4. Data Storage and Presentation

- Depending on its nature, data is stored in various formats.
- The most commonly used formats store data as simple text: comma-separated values (CSV)
and tab-separated values (TSV).

Other formats are:
- XML (eXtensible Markup Language)
- RSS (Really Simple Syndication)
- JSON (JavaScript Object Notation)
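
As a rough illustration of the text formats listed above, the following sketch stores the same small table as CSV, TSV, and JSON and reads each one back with Python's standard library. The sample records and variable names are made up for this example; they are not from any real dataset.

```python
import csv
import io
import json

# The same hypothetical record, stored in three text formats.
csv_text = "name,city\nAsha,Pune\n"
tsv_text = "name\tcity\nAsha\tPune\n"
json_text = '[{"name": "Asha", "city": "Pune"}]'

# CSV and TSV differ only in the delimiter character.
csv_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
tsv_rows = [dict(row) for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t")]

# JSON carries the field names inside every record.
json_rows = json.loads(json_text)

print(csv_rows)
print(tsv_rows)
print(json_rows)
```

All three reads produce the same list of records, which shows that the choice of format affects how the data is stored rather than what it contains.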


Need of Data Wrangling / Data Preprocessing
Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data
into the desired format for analysts to use for prompt decision-making. It is also known as data
cleaning or data munging.

Data wrangling enables businesses to tackle more complex data in less time, produce more
accurate results, and make better decisions.

Here are some of the factors that indicate that data is not clean or ready to process:

1. Incomplete
2. Noisy
3. Inconsistent
Forms of data pre-processing
Data Cleaning
1. Data Munging
- Often the data is not in a format that is easy to work with.
- It may be stored or presented in a way that is hard to process.
- We need to convert it to something more suitable for a computer to understand.
- This can be done manually, automatically, or, in many cases, semi-automatically.
For example, consider the following text recipe:
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
This can be turned into an “analysis friendly” table.
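
A minimal sketch of how the recipe sentence above could be munged into table rows is shown below. The parsing rules, the `NUMBER_WORDS` lookup, and the column names are assumptions made for this one sentence; real text would need far more robust handling.

```python
import re

# Hypothetical lookup covering only the number words used in the example.
NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3}

def recipe_to_rows(sentence):
    """Split one recipe sentence into (ingredient, quantity, unit_or_size) rows."""
    body = re.sub(r"^Add\s+", "", sentence.strip())   # drop the leading verb
    body = re.sub(r"\s+in the mix\.?$", "", body)     # drop the trailing phrase
    rows = []
    for part in re.split(r",\s*(?:and\s+)?", body):
        words = part.split()
        quantity = NUMBER_WORDS[words[0].lower()]
        if "of" in words:
            # Pattern "three cloves of garlic": unit comes before "of".
            split_at = words.index("of")
            unit = " ".join(words[1:split_at])
            ingredient = " ".join(words[split_at + 1:])
        else:
            # Pattern "two diced tomatoes": last word is the ingredient.
            unit = " ".join(words[1:-1])
            ingredient = words[-1]
        rows.append({"ingredient": ingredient, "quantity": quantity, "unit_or_size": unit})
    return rows

recipe = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."
rows = recipe_to_rows(recipe)
for row in rows:
    print(row)
```

Each printed row corresponds to one line of the table, with the free text broken into columns a program can filter or count.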
2. Handling Missing Data
• Sometimes data may be in the right format, but some of the values are
missing.
• What should we do when we encounter missing data? There is no single good
answer.
• We need to find a suitable strategy based on the situation.
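
Two commonly used strategies, dropping the missing values and filling them with the mean of the observed values, can be sketched as follows. The sample `ages` list is made up, and neither strategy is claimed to be the right one for every situation.

```python
from statistics import mean

# Hypothetical column with missing entries marked as None.
ages = [25, None, 31, 40, None]

def drop_missing(values):
    """Strategy 1: discard the missing entries."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Strategy 2: replace each missing entry with the mean of the observed ones."""
    observed_mean = mean(drop_missing(values))
    return [observed_mean if v is None else v for v in values]

dropped = drop_missing(ages)
filled = fill_with_mean(ages)
print(dropped)
print(filled)
```

Dropping keeps only real observations but shrinks the dataset, while filling keeps the size but introduces estimated values, so the choice depends on how much data is missing and why.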
3. Smooth Noisy Data
• There are times when the data is not missing, but it is corrupted for
some reason.
• This is, in some ways, a bigger problem than missing data.
• Data corruption may be a result of faulty data collection instruments,
data entry problems, or technology limitations.
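
One simple way to smooth noisy readings is a moving median, sketched below under the assumption that the noise appears as isolated spikes. The sensor readings and the window size are illustrative choices, not values from the source.

```python
from statistics import median

def median_smooth(values, window=3):
    """Replace each interior value with the median of its neighborhood.

    Values too close to either end to have a full window are left unchanged.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i >= len(values) - half:
            smoothed.append(values[i])
        else:
            smoothed.append(median(values[i - half:i + half + 1]))
    return smoothed

# Hypothetical temperature readings with one corrupted value (95.0).
readings = [20.1, 20.3, 95.0, 20.2, 20.4]
smoothed = median_smooth(readings)
print(smoothed)
```

The spike is replaced by a neighboring plausible value, while the remaining readings stay close to the originals.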
Data Integration
The following steps describe how to integrate multiple databases or files.

1. Combine data from multiple sources into a coherent storage place (e.g., a single file or a
database).

2. Engage in schema integration, or the combining of metadata from different sources.

3. Detect and resolve data value conflicts. Such conflicts may arise from
different representations or different scales, for example, metric vs. British units.

4. Address redundant data in data integration. Redundant data is commonly generated in
the process of integrating multiple databases. For example:

a. The same attribute may have different names in different databases.
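
The integration steps above can be sketched on two small in-memory sources. Everything here is hypothetical: the attribute names (`cust_id`, `customer_number`, `height_cm`, `height_in`), the records, and the rule of keeping the first source when a record is redundant.

```python
INCH_TO_CM = 2.54

# Source A uses metric units; source B uses inches and different attribute names.
source_a = [
    {"cust_id": 1, "height_cm": 180.0},
    {"cust_id": 2, "height_cm": 165.0},
]
source_b = [
    {"customer_number": 2, "height_in": 65.0},  # same customer as cust_id 2
    {"customer_number": 3, "height_in": 70.0},
]

def integrate(rows_a, rows_b):
    """Combine both sources under one schema: customer_id, height_cm."""
    merged = {}
    for row in rows_a:
        merged[row["cust_id"]] = {"customer_id": row["cust_id"], "height_cm": row["height_cm"]}
    for row in rows_b:
        customer_id = row["customer_number"]  # schema integration: rename attribute
        if customer_id in merged:
            continue  # redundant record: keep the version from source A
        height_cm = round(row["height_in"] * INCH_TO_CM, 1)  # resolve unit conflict
        merged[customer_id] = {"customer_id": customer_id, "height_cm": height_cm}
    return [merged[key] for key in sorted(merged)]

integrated = integrate(source_a, source_b)
for record in integrated:
    print(record)
```

The result holds one record per customer, with a single attribute name and a single unit of measurement.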


Data Transformation
Data must be transformed so it is consistent and readable (by a system).
The following five processes may be used for data transformation.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Values are scaled to fall within a small, specified range.
Some of the techniques that are used for accomplishing normalization are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction.
a. New attributes constructed from the given ones.
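
The three normalization techniques named above can be sketched with the standard library. The sample values are made up, and the z-score here uses the population standard deviation, which is one possible convention.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

def decimal_scaling_normalize(values):
    """Divide by the smallest power of 10 that brings every absolute value below 1."""
    largest = max(abs(v) for v in values)
    power = 0
    while largest / 10 ** power >= 1:
        power += 1
    return [v / 10 ** power for v in values]

incomes = [200, 300, 400, 600, 1000]  # hypothetical attribute values
min_max = min_max_normalize(incomes)
z_scores = z_score_normalize(incomes)
decimal_scaled = decimal_scaling_normalize(incomes)
print(min_max)
print(z_scores)
print(decimal_scaled)
```

Min-max normalization fixes the output range, z-score normalization expresses each value as a distance from the mean, and decimal scaling only moves the decimal point.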
Data Reduction

• Data reduction is a key process in which a reduced representation of a dataset
that produces the same or similar analytical results is obtained.

• Two of the most common techniques used for data reduction are:

• Data Cube Aggregation: This technique is used to aggregate data in a
simpler form. Data cube aggregation is a multidimensional aggregation that
uses aggregation at various levels of a data cube to represent the original
dataset, thus achieving data reduction.

• Dimensionality Reduction
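
The idea behind data cube aggregation described above can be sketched by rolling quarterly figures up to yearly totals, which is one level of aggregation along the time dimension. The sales figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, amount).
quarterly_sales = [
    (2021, "Q1", 120), (2021, "Q2", 150), (2021, "Q3", 130), (2021, "Q4", 200),
    (2022, "Q1", 140), (2022, "Q2", 160), (2022, "Q3", 155), (2022, "Q4", 210),
]

def aggregate_by_year(rows):
    """Collapse the quarter level so each year is represented by a single total."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

yearly_sales = aggregate_by_year(quarterly_sales)
print(yearly_sales)
```

Eight records are reduced to two, and any analysis that only needs yearly totals gives the same result on the smaller dataset.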
Data Discretization

Some data are collected from processes that are continuous, such as
temperature, ambient light, and a company’s stock price. Sometimes
we need to convert these continuous values into more
manageable parts. This conversion is called discretization.
There are three types of attributes involved in discretization:
a. Nominal: Values from an unordered set
b. Ordinal: Values from an ordered set
c. Continuous: Real numbers
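
As a small sketch of discretization, the function below maps continuous temperature readings onto an ordered set of labels. The cut-off points (15 and 25 degrees Celsius) and the label names are arbitrary assumptions chosen for the example.

```python
def discretize_temperature(celsius):
    """Map a continuous temperature to an ordinal label using assumed thresholds."""
    if celsius < 15:
        return "cold"
    if celsius < 25:
        return "mild"
    return "hot"

# Hypothetical continuous readings in degrees Celsius.
readings = [8.5, 14.9, 15.0, 22.3, 31.0]
labels = [discretize_temperature(value) for value in readings]
print(labels)
```

The continuous attribute becomes an ordinal one: the labels have a natural order (cold, mild, hot) but no longer carry exact numeric values.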
