Primary Data, Secondary Data, Process of Data Science, The 3 V’s (Volume, Velocity, Variety), APPLICATIONS OF DATA SCIENCE, DATA SCIENCE LIFE CYCLE, DATA SCIENTIST’s TOOLBOX, Python Programming, R Programming
Primary data is data that has not been collected before and can be gathered in a variety of ways, such as participatory or non-participatory observation, conducting interviews, collecting data through questionnaires or schedules, and so on.
Secondary data, on the other hand, is data that has already been gathered and can be accessed and used easily by other users. Secondary data can come from existing case studies, government reports, newspapers, journals, books, and also from many popular dedicated websites that provide datasets.
Process of Data Science: • Data science builds algorithms and systems for discovering knowledge, detecting patterns, and generating useful information from massive data. • To do so, it encompasses an entire data analysis process that starts with data extraction and cleaning, and extends to data analysis, description, and summarization.
The 3 V’s (Volume, Velocity, Variety) • Why is data science so important now? We have a lot of data, and we continue to generate a staggering amount of it at an unprecedented and ever-increasing speed. Analyzing data wisely necessitates the involvement of competent and well-trained practitioners, and analyzing such data can provide actionable insights.
APPLICATIONS OF DATA SCIENCE • Traditionally, data was mostly structured and small in size, and it could be analyzed by using simple BI (Business Intelligence) tools. • Unlike data in the traditional systems, which was mostly structured, today most of the data is unstructured or semi-structured. • This data is generated from different sources like financial logs, text files, multimedia forms, sensors, and instruments.
DATA SCIENCE LIFE CYCLE • The life cycle of data science outlines the steps/phases, from start to finish, that
projects usually follow when they are executed. • The lifecycle of data analytics provides a framework for the best performance of each phase, from the creation of the project until its completion. Setting Goal • The entire cycle revolves around the business or research goal. What will we solve if we do not have a precise problem? It is essential to understand the business objective clearly because that will be the final goal of the analysis. Data Understanding • Data understanding involves the collection of all the available data. • We need to understand what data is present and what data could be used for the given problem. Data Preparation • The data preparation step includes selecting the relevant data, integrating the data by merging the data sets, cleaning them, and treating the missing values by either removing them or imputing them. Exploratory Data Analysis • This step involves getting some idea about the solution and the factors affecting it before building the
actual model. Data Modeling • Data modeling is the heart of data analysis. A model takes the prepared data
as input and provides the desired output.
DATA SCIENTIST’s TOOLBOX • A data scientist is a professional who is responsible for extracting, manipulating, preprocessing and generating predictions out of data. In order to do so, he/she requires various statistical tools and programming languages.
Python Programming: • Choosing the right programming language for Data Science is of utmost importance. Python offers various libraries designed explicitly for Data Science operations. • The Python programming language is an open-source tool and is an object-oriented scripting language. It was created in the late 1980s by Guido van Rossum.
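As a brief illustration (a minimal sketch; the column names and values below are made up, and it assumes the pandas and NumPy libraries are installed), a few lines of Python already cover loading and summarizing data:

# Minimal sketch of Python's data science libraries; the values are made up.
import numpy as np
import pandas as pd

# A small structured dataset held in a DataFrame (rows and columns).
df = pd.DataFrame({
    "age": [23, 31, 27, 45],
    "income": [35000, 52000, 41000, 78000],
})

print(df.describe())                          # summary statistics per column
print(np.corrcoef(df["age"], df["income"]))   # correlation matrix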
R Programming: • R is a popular language used in Data Science that provides a scalable software environment for statistical analysis. R is versatile and can run on any platform, such as UNIX, Windows, and Mac operating systems.
SAS (Statistical Analysis System), Tableau Public, Microsoft Excel, Types of Data, Structure of Data, Unstructured Data, Semi-structured Data, DATA SOURCES, Open Data Source, Social Media Data Source
SAS (Statistical Analysis System): • SAS is used by large organizations to analyze data; it uses the SAS programming language for performing statistical modeling. • SAS offers numerous statistical libraries and tools that Data Scientists can use for modeling and organizing their data.
Tableau Public: • Tableau is data visualization software whose free version is named Tableau Public. It is a data visualization software/tool that is packed with powerful graphics to make interactive visualizations.
Microsoft Excel: • Microsoft Excel is an analytical tool for Data Science used by data scientists for data visualization. • Excel represents the data in a simple way using rows and columns and comes with various formulae and filters for data.
TYPES OF DATA • Data is a set of raw facts such as descriptions, observations and numbers that needs to be processed to make it meaningful. Data processed in a meaningful way is known as information. • One purpose of Data Science is to structure data, making it interpretable and simple to work with.
Structured Data • Structured data, as the name suggests, is a type of data that is well organized. Structured data is data that depends on a data model and resides in a fixed field within a record. • Structured data is comprised of clearly defined data types whose pattern makes them easily searchable. It is often easy to store structured data in tables within databases or Excel files.
Unstructured Data • Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model. Unstructured data may have an internal structure but is not structured via pre-defined data models or schemas.
Semi-structured Data • Semi-structured data is a data type that contains semantic tags, but does not conform to the structure associated with typical relational databases. Markup Language XML: This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web. Open Standard JSON (JavaScript Object Notation): It is another semi-structured data interchange format. JavaScript is implicit in the name, but other C-like programming languages recognize it. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list).
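As a small illustration of semi-structured data (the sample records below are invented for illustration), Python's standard json and xml modules can parse both formats:

# Parsing semi-structured data with Python's standard library.
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "Asha", "age": 28, "skills": ["Python", "SQL"]}'
record = json.loads(json_text)        # name/value pairs and an ordered list
print(record["name"], record["skills"])

xml_text = "<student><name>Asha</name><age>28</age></student>"
root = ET.fromstring(xml_text)        # tag-driven, human- and machine-readable
print(root.find("name").text, root.find("age").text)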
DATA SOURCES • A data source in data science is the initial location where the data being used comes from. • Data collection is the process of acquiring, collecting, extracting, and storing the huge amount of data, which may be in structured or unstructured form like text, video, audio, XML files, records, or other image files, used in later stages of data analysis.
Open Data Source • The idea behind open data is that some data should be freely available in a public domain
that can be used by anyone as they wish, without restrictions from copyright, patents, or other mechanisms
of control. • Local and federal governments, Non-Government Organizations (NGOs) and academic
communities all lead open data initiatives. For example, Open Government Data Platform India is a platform for supporting the Open Data initiative of the Government of India. Open Government Data Platform India is also packaged as a product and made available in open source for implementation by countries globally.
Social Media Data Source • Social media channels are an abundant source of data. Social media are interactive Web 2.0 Internet-based applications. Social media are a reflection of the public. • Social media are interactive technologies that allow the creation or sharing/exchange of information, ideas, career interests and other forms of expression via virtual communities and networks.
Multi-model Data, Standard Datasets, DATA FORMATS (Integers, Floats, Strings), Classification of Data Samples
KEY-wise comparison of Structured, Semi-structured and Unstructured Data:
• Level of organizing: Structured data, as the name suggests, is well organized; hence the level of organizing is highest in this type of data. Semi-structured data is organized only up to some extent and the rest is non-organized; hence its level of organizing is less than that of structured data and higher than that of unstructured data. Unstructured data is non-organized; hence the level of organizing is lowest in the case of unstructured data.
• Means of data organization: Structured data gets organized by means of a relational database. Semi-structured data is partially organized by means of XML/RDF. Unstructured data is based on simple character and binary data.
• Transaction management: In structured data, transaction management and concurrency of data are present, and hence it is mostly preferred in multitasking processes. In semi-structured data, transaction management is not present by default but can be adapted from a DBMS; data concurrency is not present. In unstructured data, no transaction management and no concurrency are present.
• Technology: Structured data is based on relational database tables. Semi-structured data is based on XML/RDF. Unstructured data is based on character and binary data.
Multi-model Data • Today, the explosion of unstructured data is evolving as a big challenge for industry and researchers. • IoT (Internet of Things) has allowed us to always remain connected with the help of different electronic gadgets. This communication network generates huge amounts of data having different formats and data types. • When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and so on.
Standard Datasets • A dataset or data set is simply a collection of data. • In the case of tabular data (in the
form of table), a data set corresponds to one or more database tables, where every column of a table
represents a particular variable and each row corresponds to a given record of the data set in question.
DATA FORMATS 1. Integers: • An integer is a datum of integral data type, a data type that represents some range of mathematical integers. • Integral data types may be of different sizes and may or may not be allowed to contain negative values. 2. Floats: • A floating point (known as a float) number has decimal points even if the decimal point value is 0. 3. Strings: • The text data type is known as Strings in Python, or Objects in Pandas. Strings can contain numbers and/or characters. • For example, a string might be a word, a sentence, or several sentences. A string can also contain or consist of numbers.
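A minimal sketch of how these formats appear in Python and pandas (the variable names and values are made up; pandas is assumed to be installed):

# Integers, floats and strings in Python, and their dtypes in pandas.
import pandas as pd

whole = 42              # integer
price = 99.0            # float: keeps a decimal point even when it is .0
note = "42 apples"      # string: may mix characters and numbers

df = pd.DataFrame({"whole": [1, 2], "price": [9.5, 7.0], "note": ["a", "b"]})
print(df.dtypes)        # int64, float64, object (pandas' label for text)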
Classification of Data Samples: • This is a statistical method that is used under the same name in the data science and data mining fields. • Classification is used to categorize available data into accurate, observable analyses. Such organization is key for companies who plan to use these insights to make business plans.
Probability Distribution and Estimation: • These statistical methods help in learning the basics of machine learning and algorithms like logistic regression. • Cross-validation and LOOCV (Leave One Out Cross Validation) techniques are also inherently statistical tools that have been brought into the Machine Learning and Data Analytics world for inference-based research, A/B testing, and hypothesis testing.
ROLE OF STATISTICS IN DATA SCIENCE, DESCRIPTIVE STATISTICS, Measures of Frequency, Measures of Central Tendency, Measures of Dispersion, Range, Coefficient of Range, Estimation of Parameter Values
ROLE OF STATISTICS IN DATA SCIENCE • Statistics has evolved along with technology and the growth of data. The context of statistics and its applications has changed tremendously over time. Strategies for taking business decisions using statistical results are now more expansive. Statistics can be used both on large, complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. • Framing questions statistically allows researchers to leverage data resources to extract knowledge and obtain better answers.
DESCRIPTIVE STATISTICS • The study of numerical and graphical ways to describe and display the data is
called descriptive statistics. • Descriptive statistics use data to carry out descriptions of the population in the
form of numerical calculations, visualization graphs, or tables.
Measures of Frequency • The measures of frequency are widely used in statistical analysis (analyze and
interpret data to gain meaningful insights) to analyze how often a particular data value or a feature occurs. •
The frequency distribution can be tabulated as a frequency chart or it can be graphically represented by
drawing a bar chart or a histogram
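For example (a minimal sketch with made-up values, assuming pandas and matplotlib are installed), a frequency distribution can be tabulated and drawn as a bar chart:

# Frequency distribution of a categorical feature; the values are made up.
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])
freq = colors.value_counts()   # frequency chart: how often each value occurs
print(freq)
freq.plot(kind="bar")          # bar chart of the same distribution (uses matplotlib)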
Measures of Central Tendency • One of the simplest and yet important measures of statistical analysis is to
find one such value that describes the characteristic of the entire huge set of data. • This single value is
referred to as a central tendency that provides a number to represent the whole set of scores of a feature. •
A measure of central tendency is a summary statistic that represents the center point or typical value
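A minimal sketch with made-up scores, assuming pandas is installed:

# Mean, median and mode as measures of central tendency; the scores are made up.
import pandas as pd

scores = pd.Series([56, 63, 72, 72, 72, 85, 90])
print(scores.mean())      # arithmetic mean
print(scores.median())    # middle value of the ordered data
print(scores.mode()[0])   # most frequently occurring value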
Measures of Dispersion • The measures of central tendency may not be adequate to describe data unless we know the manner in which the individual items scatter around it. • In other words, a further description of a series on the scatter or variability, known as dispersion, is necessary if we are to gauge how representative the average is.
Range: • The value of the range is the simplest measure of dispersion and is found by calculating the
difference between the largest data value (L) and the smallest data value (S) in a given data distribution. Thus,
Range (R) = L – S.
Coefficient of Range: It is a relative measure of the range. It is used in the comparative study of dispersion. Coefficient of Range = (L – S) / (L + S).
Standard Deviation: • The standard deviation is the measure of how far the data deviates from the mean
value. • Standard deviation is the most common measure of dispersion and is found by finding the square
root of the sum of squared deviation from the mean divided by the number of observations in a given dataset.
Variance: • The variance is a measure of variability. It is the average squared deviation from the mean. Variance measures how far data points are spread out from the mean.
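The sketch below (made-up values, NumPy assumed) computes the range, coefficient of range, variance and standard deviation defined above:

# Measures of dispersion for a small made-up dataset.
import numpy as np

data = np.array([12, 15, 9, 21, 18, 14])
L, S = data.max(), data.min()
data_range = L - S                    # Range = L - S
coeff_range = (L - S) / (L + S)       # Coefficient of Range = (L - S) / (L + S)
variance = data.var(ddof=1)           # sample variance
std_dev = data.std(ddof=1)            # sample standard deviation (square root of variance)
print(data_range, coeff_range, variance, std_dev)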
Hypothesis Testing • Hypothesis testing is one of the most promising inferential statistical techniques used in data analysis to check whether a stated hypothesis is accepted or rejected. • The process of determining whether the stated hypothesis is accepted or rejected from sample data is called hypothesis testing. • Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population.
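As one common example (a minimal sketch; the sample values are made up and SciPy is assumed to be installed), a one-sample t-test checks whether a population mean equals a stated value:

# One-sample t-test as a simple example of hypothesis testing.
# H0: the population mean equals 50; the sample values are made up.
from scipy import stats

sample = [52, 49, 55, 51, 48, 53, 54]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)
# If p_value is below the chosen significance level (e.g., 0.05),
# the null hypothesis is rejected; otherwise it is not rejected.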
Estimation of Parameter Values • Parameter estimation plays a vital role in statistics. In statistics, estimation or inference refers to the task of drawing conclusions about a population based on the information provided by a sample. • This means that the task of estimation of parameter values involves making inferences from a given sample about an unknown population parameter. • This can be done in two ways namely, using a point estimate and using an interval estimate. Both of these ways of estimating parameter values draw conclusions about the population from the given sample.
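A minimal sketch of both approaches for a population mean (the sample is made up; NumPy and SciPy are assumed): the sample mean serves as the point estimate, and a t-based confidence interval serves as the interval estimate.

# Point estimate and a 95% interval estimate of a population mean.
import numpy as np
from scipy import stats

sample = np.array([52, 49, 55, 51, 48, 53, 54])
n = len(sample)
point_estimate = sample.mean()                 # point estimate of the mean
se = sample.std(ddof=1) / np.sqrt(n)           # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)          # critical t value for 95% confidence
interval = (point_estimate - t_crit * se, point_estimate + t_crit * se)
print(point_estimate, interval)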
MEASURING DATA SIMILARITY AND DISSIMILARITY, Data Matrix versus Dissimilarity Matrix, Proximity Measures for Nominal Attributes, Proximity Measures for Binary Attributes, CONCEPT OF OUTLIERS, Outlier Detection Methods
MEASURING DATA SIMILARITY AND DISSIMILARITY • In data science, a similarity measure is a way of measuring how data samples are related or close to each other. A dissimilarity measure tells how distinct the data objects are. • In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison.
Data Matrix versus Dissimilarity Matrix • Consider the objects described by multiple attributes. Suppose
that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements
or features, such as age, height, weight, or gender). • The objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value for object xi of the jth attribute. • For brevity, we hereafter refer to object xi as object i. The objects may be tuples in a relational database, samples, or feature vectors.
Proximity Measures for Nominal Attributes • A nominal attribute can take on two or more states. For
example, map color is a nominal attribute that may have, say, five states namely, red, yellow, green, pink and
blue. • Let the number of states of a nominal attribute be M. The states can be denoted by letters, symbols,
or a set of integers, such as 1, 2, ..., M.
Proximity Measures for Binary Attributes • A binary attribute has only one of two states: 0 and 1, where 0
means that the attribute is absent and 1 means that it is present. • To compute the dissimilarity between two binary attributes, one approach involves computing a dissimilarity matrix from the given binary data.
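One simple way to do this for symmetric binary attributes is the matching approach sketched below (a hedged illustration; the attribute vectors are made up): the dissimilarity is the fraction of attributes on which the two objects disagree.

# Dissimilarity between two objects described by binary attributes,
# using simple matching; the attribute values are made up.
import numpy as np

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
mismatches = np.sum(x != y)          # attributes on which the objects differ
dissimilarity = mismatches / len(x)  # 0 = identical, 1 = completely different
print(dissimilarity)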
Dissimilarity of Numeric Data • Distance measures are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan and Minkowski distances.
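A minimal sketch (made-up coordinates, NumPy assumed) computing all three distances between two numeric objects:

# Euclidean, Manhattan and Minkowski distances between two numeric objects.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # L2 distance
manhattan = np.sum(np.abs(x - y))                   # L1 distance
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # general Lp distance
print(euclidean, manhattan, minkowski)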
Proximity Measures for Ordinal Attributes • Ordinal attributes may also be obtained from the discretization
of numeric attributes by splitting the value range into a finite number of categories. • These categories are
organized into ranks. That is, the range of a numeric attribute can be mapped to an ordinal attribute f having
Mf states. • For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized
into the following states: -30 to -10, -10 to 10, 10 to 30, representing the categories cold temperature,
moderate temperature, and warm temperature, respectively.
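A minimal sketch of that discretization (made-up temperature readings, pandas assumed):

# Discretizing a numeric attribute (temperature in Celsius) into ordered
# categories matching the ranges above; the readings are made up.
import pandas as pd

temps = pd.Series([-25, -5, 3, 18, 27])
ordinal = pd.cut(temps, bins=[-30, -10, 10, 30],
                 labels=["cold", "moderate", "warm"])
print(ordinal)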
CONCEPT OF OUTLIERS • Outliers are a very important aspect of data analysis. This has many applications in
determining fraud and potential new trends in the market. • In a purely statistical sense, an outlier is an observation point that is distant from other observations. • Probably the first definition was given by Grubbs
in 1969 as “an outlying observation, or outlier is one that appears to deviate markedly from other members
of the sample in which it occurs”.
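One common, simple way to flag such distant observations is the interquartile-range (IQR) rule sketched below (a hedged illustration; the data values are made up and 1.5 × IQR is a conventional rather than universal cutoff):

# Flagging observations that are distant from the rest using the IQR rule.
import numpy as np

data = np.array([22, 25, 27, 24, 26, 23, 80])    # 80 looks anomalous
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)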
Types of Outliers • Outliers can be classified into the following three categories: 1. Global Outliers (or Point Outliers): • If an individual data point can be considered anomalous with respect to the rest of the data, then the datum is termed a point outlier. For example, intrusion detection in computer networks. 2. Contextual Outliers: • If an individual data instance is anomalous in a specific context or condition (but not otherwise), then it is termed a contextual outlier. 3. Collective Outliers: • If a collection of data points is anomalous with respect to the entire data set, it is termed a collective outlier.
Outlier Detection Methods • The outlier detection methods can be divided into supervised methods, semi-supervised methods and unsupervised methods. 1. Supervised Methods: • Supervised methods model data normality and abnormality. Domain experts examine and label a sample of the underlying data. • Outlier detection can then be modeled as a classification problem. The task is to learn a classifier that can recognize outliers. The sample is used for training and testing. • In some applications, the experts may label just the normal objects, and any other objects not matching the model of normal objects are reported as outliers.
Data Attributes, Data Objects, Types of Data Attributes, Discrete versus Continuous Attributes, DATA QUALITY: WHY PREPROCESS THE DATA, DATA QUALITY, DATA MUNGING / WRANGLING OPERATIONS, Data Cleaning, Missing Values, Noisy Data
Data Attributes • An attribute is a property or characteristic of an object. A data attribute is a single value
descriptor for a data object. For example, eye color of a person, name of a student, etc.
Data Objects • A collection of attributes describes an object. Data objects can also be referred to as samples, examples, instances, cases, entities, data points, or objects. • If the data objects are stored in a database, they
are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond
to the attributes
Types of Data Attributes
Nominal Attribute: • Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things. • Each value in nominal attribute represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
Binary Attributes: • A binary attribute is a nominal attribute with only two categories or states namely, 0 or
1 where 0 typically means that the attribute is absent and 1 means that it is present.
Ordinal Attributes: • An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Discrete versus Continuous Attributes • There are many ways to organize attribute types. Many machine learning algorithms, especially classification algorithms, advocate categorizing attributes as either discrete or continuous. • A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
DATA QUALITY: WHY PREPROCESS THE DATA? • Data have quality if they satisfy the requirements of the
intended use. Data quality can be defined as, “the ability of a given data set to serve an intended purpose”.
• Data preprocessing is responsible for maintaining the quality of data. The phrase "garbage in, garbage out" is
particularly applicable to such projects. • Data-collection methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes) and
missing values, etc.
DATA MUNGING / WRANGLING OPERATIONS • Data wrangling is the task of converting data into a feasible format that is suitable for consumption. • The goal of data wrangling is to assure quality and
useful data. Data analysts typically spend the majority of their time in the process of data wrangling
compared to the actual analysis of the data
Data Cleaning • Real-world data tend to be incomplete, noisy, and inconsistent. This dirty data can cause an
error while doing data analysis. Data cleaning is done to handle irrelevant or missing data. • Data cleaning is also known as data cleansing or scrubbing. Data is cleaned by filling in the missing values, smoothing any
noisy data, identifying and removing outliers, and resolving any inconsistencies
Missing Values • The raw data that is collected for analyzing usually consists of several types of errors that
need to be prepared and processed for data analysis. • Some values in the data may not be filled up for
various reasons and hence are considered missing.
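A minimal sketch of the two usual treatments, removal and imputation (made-up records, pandas assumed):

# Treating missing values by removing them or imputing them.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "salary": [30000, 42000, np.nan, 55000]})
dropped = df.dropna()                             # remove rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute with column means
print(dropped)
print(imputed)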
Noisy Data • The noisy data contains errors or outliers. For example, for stored employee details, all values
of the age attribute are within the range 22-45 years whereas one record reflects the age attribute value as
80. • There are times when the data is not missing, but it is corrupted for some reason. This is, in some ways,
a bigger problem than missing data.
Data Transformation, Data Reduction, Data Discretization, Advantages of Visualization, INTRODUCTION TO EXPLORATORY DATA ANALYSIS, Data Visualization, Visual Encoding, BASIC DATA VISUALIZATION TOOLS (Histogram, Box Plot), ADVANCED DATA VISUALIZATION TOOLS
Data Transformation • Data transformation is the process of converting raw data into a format or structure
that would be more suitable for data analysis. • Data transformation is a data preprocessing technique that
transforms or consolidates the data into alternate forms appropriate for mining.
Data Reduction • When the data is collected from different data sources for analysis, it results in a huge
amount of data. It is difficult for a data analyst to deal with this large volume of data. • It is even difficult to
run the complex queries on the huge amount of data as it takes a long time and sometimes it even becomes
impossible to track the desired data.
Data Discretization • Data discretization is characterized as a method of translating attribute values of
continuous data into a finite set of intervals with minimal information loss. • Data discretization facilitates
the transfer of data by substituting interval marks for the values of numeric data.
Advantages of Visualization: 1. Visualization makes it easier for humans to detect trends, patterns, correlations, and outliers in a group of data. Data visualization helps humans understand the big picture of big data using small, impactful visualizations.
INTRODUCTION TO EXPLORATORY DATA ANALYSIS • Exploratory Data Analysis (EDA) is a process of
examining or understanding the data and extracting insights of the data. • EDA is an important step in any
Data Science project. EDA is the process of investigating the dataset to discover patterns, and anomalies
(outliers) and form hypotheses based on the understanding of the dataset.
Data Visualization • Data visualization is the presentation of data in graphical format. Data visualization is a generic term that describes any attempt to help the understanding of data by providing a visual representation. • Visualization of data makes it much easier to analyze and understand textual and numeric data. • Apart from saving time, the increased use of data for decision making further adds to the importance of and need for data visualization.
Visual Encoding • Encoding in data visualization means translating the data into a visual element on a chart
or map through position, shape, size, symbols and color. • The visual encoding is the way in which data is
mapped into visual structures, upon which we build the images on a screen. • Visual encoding is the
approach/technique used to map data into visual structures, thus building an image on the screen
BASIC DATA VISUALIZATION TOOLS Histogram: • A histogram is a graphical display of data using bars of
different heights. A histogram shows an accurate representation of the distribution of numeric data. • A
histogram is a way to represent the distribution of numerical data elements (mainly statistical) in an
approximate manner. A histogram uses a "bin" or a "bucket" for a set or range of values to be distributed.
Box Plot: • A box plot is a chart commonly used in business and professional settings, and extensively in data science-related visualizations. • A box plot is used to show the distribution of two or more data elements in a summarized manner.
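A minimal sketch drawing both basic charts with matplotlib (randomly generated values; matplotlib and NumPy assumed):

# A histogram and a box plot of the same numeric data.
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=15)    # distribution of numeric data grouped into "bins"
ax1.set_title("Histogram")
ax2.boxplot(values)          # summarized distribution (median, quartiles, outliers)
ax2.set_title("Box plot")
plt.show()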
ADVANCED DATA VISUALIZATION TOOL : WORD CLOUDS • There are more advanced and complex
visualization tools that are used in data analytics namely, word clouds, waffle charts and seaborn plots. • A
word cloud (or tag cloud) is a word visualization that displays the most used words in a text from small to
large, according to how often each appears. • Word clouds (also known as text clouds or tag clouds) work in
a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or
database), the bigger and bolder it appears in the word cloud
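A minimal sketch (the text is made up; it assumes the third-party wordcloud package and matplotlib are installed):

# Generating a word cloud: more frequent words are drawn bigger and bolder.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "data science data analysis data visualization statistics python data"
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()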