
UNIT- 2 MATHEMATICAL PRELIMINARIES

PROBABILITY
 Probability Vs Statistics
 Compound Events and Independence
 Conditional Probability
 Probability Distributions
DESCRIPTIVE STATISTICS
 Centrality Measure
 Variability Measures
 Interpreting Variance
 Characterizing Distributions
CORRELATION ANALYSIS
 Correlation Coefficients: Pearson and Spearman Rank
 The Power and Significance of Correlation
 Correlation does not imply Causation
 Detecting Periodicities by Autocorrelation

DATA MUNGING:
LANGUAGES FOR DATA SCIENCE
 The importance of Notebook Environments
 Standard Data Formats
COLLECTING DATA
 Hunting
 Scraping
 Logging
CLEANING DATA
 Errors Vs Artifacts
 Data Compatibility
 Dealing with missing values
 Outlier Detection
CROWDSOURCING
 The Penny Demo
 When is the crowd wise ?
 Mechanisms for Aggregation
 Crowdsourcing services


You must walk before you can run. Similarly, there is a certain level of mathematical
maturity which is necessary before you should be trusted to do anything meaningful
with numerical data.

PROBABILITY
Probability is a numerical representation of the chance of occurrence of a particular
event, where an event is any particular set of outcomes of an experiment.

 experiment
An experiment is a procedure which yields one of a set of possible outcomes.

Example: a coin is tossed 10 times; heads is recorded 7 times and tails 3 times.

 sample space
A sample space S is the set of possible outcomes of an experiment.
Example: when we roll two dice together, there are 36 possible outcomes, namely
S = {(1, 1),(1, 2),(1, 3),(1, 4),(1, 5),(1, 6),(2, 1),(2, 2),(2, 3),(2, 4),(2, 5),(2, 6), (3, 1),
(3, 2),(3, 3),(3, 4),(3, 5),(3, 6),(4, 1),(4, 2),(4, 3),(4, 4),(4, 5),(4, 6), (5, 1),(5, 2),
(5, 3), (5, 4),(5, 5),(5, 6),(6, 1),(6, 2),(6, 3),(6, 4),(6, 5),(6, 6)}.

 event
An event E is a specified subset of the outcomes of an experiment.
example The event that the sum of the dice equals 7 or 11
subset E = {(1, 6),(2, 5),(3, 4),(4, 3),(5, 2),(6, 1),(5, 6),(6, 5)}.

 Probability of an outcome
The probability of an outcome s, denoted p(s), is the chance that this single outcome
occurs; when all outcomes are equally likely, it is 1 divided by the number of possible
outcomes.
Example: If we assume two distinct fair dice, the probability p(s) = (1/6) × (1/6) =
1/36 for all outcomes s ∈ S.


 Probability of an event
The probability of an event E is the sum of the probabilities of the outcomes it
contains. An alternate formulation is in terms of the complement of the event
E¯, the case when E does not occur. Then P(E) = 1 − P(E¯).
Example: If you pull a random card from a deck of playing cards, the probability
it is not a heart is P(E) = 1 − P(heart) = 1 − 13/52 = 3/4.

 Random variable
A random variable V is a numerical function on the outcomes of a probability
space.
Example: Suppose 2 dice are rolled and the random variable X is used to
represent the sum of the numbers. Then the smallest value of X will be
2 (1 + 1), while the highest value will be 12 (6 + 6). Thus, X can
take on any value from 2 to 12 (inclusive). If probabilities are
attached to each outcome, then the probability distribution of X can be
determined.

 Expected value
The expected value of a random variable is the probability-weighted average of the
values it takes: E[V] = Σ v · P(V = v).
Example: Toss a fair coin three times and let R be the random variable defined as
R = Number of heads

Sample Space = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
R takes the value 3 for HHH; 2 for HHT, HTH, THH; 1 for HTT, THT, TTH; and 0 for TTT.
Hence E[R] = 3·(1/8) + 2·(3/8) + 1·(3/8) + 0·(1/8) = 12/8 = 1.5.
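As a quick check of this arithmetic, the expected value can be computed by brute-force
enumeration of the sample space. Below is a minimal Python sketch using only the standard
library (itertools and fractions); the coin and variable names are just illustrative:

import itertools
from fractions import Fraction

# Enumerate all 8 equally likely outcomes of three coin flips
outcomes = list(itertools.product("HT", repeat=3))

# R = number of heads in each outcome; every outcome has probability 1/8
p = Fraction(1, len(outcomes))
expected_R = sum(outcome.count("H") * p for outcome in outcomes)

print(expected_R)   # 3/2, i.e. 1.5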


 Probability Vs Statistics
Probability and statistics are related areas of mathematics which concern
themselves with analyzing the relative frequency of events. Still, there are
fundamental differences in the way they see the world:

• Probability is primarily a theoretical branch of mathematics, which studies the
consequences of mathematical definitions. Probability deals with predicting the
likelihood of future events; it is all about chance.
For example, when we flip a coin in the air, what is the chance of getting a head?
The answer is based on the number of possible outcomes: the outcome will be either
a head or a tail, so the probability of getting a head is 1/2.


• Statistics is primarily an applied branch of mathematics, which tries to make sense of
observations in the real world. While probability concerns future events, statistics
involves the analysis of the frequency of past events: it is about how we handle data
using different techniques, and it helps represent complicated data in an easy and
understandable way.
Example: You have a coin of unknown fairness. To investigate whether it is fair, you toss
it 100 times and count the number of heads. Say you count 60 heads. Your job as a
statistician is to draw a conclusion (an inference) from this data. Different
statisticians may draw different conclusions, because they may use different forms of
conclusion or different methods for estimating the probability (e.g. of landing heads).

 Compound Events and Independence


A compound event is an event that has more than one possible outcome.
As opposed to a simple event, if an event contains more than one sample point of the
sample space, it is called a compound event. It involves combining two or more events
together and finding the probability of such a combination of events.

For example, when we throw a die, the event that an even number appears is a compound
event, as there is more than one possibility; there are three possibilities, i.e.
E = {2, 4, 6}.

Set difference
If there are two sets A and B, then the difference of two sets A and B is equal to the set
which consists of elements present in A but not in B. It is represented by A-B.
Example: If A = {1,2,3,4,5,6,7} and B = {6,7} are two sets.
Then, the difference of set A and set B is given by;
A – B = {1,2,3,4,5}

Union


If two sets A and B are given, then the union of A and B is equal to the set that
contains all the elements present in set A and set B.
Example: If set A = {1,2,3,4} and B = {6,7}
Then, Union of sets, A ∪ B = {1,2,3,4,6,7}

Intersection
If two sets A and B are given, then the intersection of A and B is the subset of the
universal set which consists of the elements common to both A and B. It is denoted by
the symbol ‘∩’, and the operation is written A ∩ B.
Example: Let A = {1,2,3} and B = {3,4,5}
Then, A∩B = {3}; because 3 is common to both the sets.

Independent events are those events whose occurrence is not dependent on any
other event.

For example, if we flip a coin in the air and get the outcome as Head, then again if
we flip the coin but this time we get the outcome as Tail. In both cases, the
occurrence of both events is independent of each other.

The events A and B are independent if and only if


P(A ∩ B) = P(A) × P(B)
Probability theorists love independent events, because it simplifies their calculations.
But data scientists generally don’t. When building models to predict the likelihood of
some future event B, given knowledge of some previous event A, we want as strong a
dependence of B on A as possible.


 Conditional Probability
Conditional probability is one type of probability in which the possibility of an
event depends upon the existence of a previous event.
The conditional probability of A given B, P(A|B) is defined:
P(A|B) = P (A ∩ B) / P(B)
Where,
 P (A ∩ B) represents the probability of both events A and B occurring
simultaneously.
 P(B) represents the probability of event B occurring.

Example: A die is thrown two times and the sum of the scores appearing on the
die is observed to be a multiple of 4. Then the conditional probability that the
score 4 has appeared at least once is: Let A be the event that the sum obtained is a
multiple of 4.B be the event that the score of 4 has appeared at least once.

A = {(1, 3), (2, 2), (3, 1), (2, 6), (3, 5), (4, 4), (5, 3), (6, 2), (6, 6)}
B = {(1, 4), (2, 4), (3, 4), (4, 4), (5, 4), (6, 4), (4, 1), (4, 2), (4, 3), (4, 5), (4, 6)}
(A ∩ B) = (4, 4)

n(A ∩ B) = 1 and n(A) = 9
Required probability = P(B|A)
= P(A ∩ B)/P(A)
= 1/9
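This calculation can be verified by brute-force enumeration. The following small Python
sketch (standard library only) recomputes P(B|A) for the dice example above:

from itertools import product

rolls = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes
A = [r for r in rolls if sum(r) % 4 == 0]       # sum is a multiple of 4
A_and_B = [r for r in A if 4 in r]              # ... and a 4 appeared at least once

print(len(A_and_B), "/", len(A))                # 1 / 9, so P(B|A) = 1/9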

The Bayes theorem is a mathematical formula for calculating conditional


probability in probability and statistics. In other words, it is used to update how
likely an event is, given knowledge of a related event. Bayes law or Bayes rule are
other names for the theorem.
P(A ∣ B) = P(B ∣ A)P(A) / P(B)
P(A ∣ B) is the conditional probability of event A occurring, given that B is true.
P(B ∣ A) is the conditional probability of event B occurring, given that A is true.
P(A) and P(B) are the probabilities of A and B occurring independently of one
another.

 Probability Distributions
A probability distribution assigns a probability to each possible outcome of a random
experiment or event. Understanding random variables’ behavior, features, and
distributions depends critically on the PDF (Probability Density Function) and
CDF (Cumulative Distribution Function) operations.

Probability Density Function (PDF)

The Probability Density Function describes the probability distribution of continuous


random variables. It provides a smooth curve representing the probability
distribution over possible values.

Example: Imagine a continuous probability distribution, such as the height of adult


males. The probability for various height ranges will be displayed in the PDF. It
might suggest, for instance, that people with heights between 5’9″ and 5’10” are
more numerous than those with heights outside of that range.

Cumulative Distribution Function (CDF)

The Cumulative Distribution Function gives, for each value x, the probability that the
random variable takes a value less than or equal to x. For a discrete random variable the
CDF is a step function that jumps at specific values; for a continuous variable it is a
smooth, non-decreasing curve.
Example: The CDF would give the likelihood of finding a male with a height less than or
equal to a certain value, such as 5’9″, using the male heights example from before. The
CDF allows us to answer questions like “What percentage of adult males is shorter than
5’9″?” by presenting cumulative probabilities.
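To make the PDF/CDF distinction concrete, here is a hedged sketch using scipy.stats,
assuming (purely for illustration) that adult male heights follow a normal distribution
with mean 69 inches and standard deviation 3 inches; those parameters are invented for
the example:

from scipy.stats import norm

# Hypothetical model: heights ~ Normal(mean = 69 inches, sd = 3 inches)
heights = norm(loc=69, scale=3)

density_at_69 = heights.pdf(69)    # PDF: relative likelihood of heights near 69 inches
p_under_69 = heights.cdf(69)       # CDF: P(height <= 69 inches) = 0.5 for this model

print(density_at_69, p_under_69)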


DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.

There are two main types of descriptive statistics:


• Central tendency measures, which capture the center around which the data is
distributed.
• Variation or variability measures, which describe the data spread, i.e. how far the
measurements lie from the center.

 Centrality Measure
The measures of central tendency are used to describe data by determining a single
representative central value. The important measures of central tendency are given
below:

Mean: The mean or arithmetic mean can be defined as the sum of all observations
divided by the total number of observations. The formula for the mean is:
x̄ = (x1 + x2 + … + xn) / n = (Σ xi) / n

Geometric Mean
The geometric mean is defined as the nth root of the product of n numbers: we multiply
the numbers together and take the nth root of the product, where n is the
total number of data values.
For example: for a given set of two numbers such as 3 and 1, the geometric mean is
equal to √(3×1) = √3 = 1.732.

Median: The median can be defined as the center-most observation that is obtained
by arranging the data in ascending order. For an odd number n of observations it is the
((n + 1)/2)th value; for an even n it is the average of the (n/2)th and (n/2 + 1)th values.
Example: For the 21 values below, the median is the 11th number:

53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
Median = 61

Mode: The mode is the most frequently occurring observation in the data set, i.e. the
value with the highest frequency.

Example:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
62 appears three times, more often than the other values, so Mode = 62
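Python’s built-in statistics module can reproduce these centrality measures for the 21
values above; a minimal sketch (geometric_mean requires Python 3.8 or later):

import statistics

data = [53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61,
        62, 62, 62, 64, 65, 65, 67, 68, 68, 70]

print(statistics.mean(data))             # arithmetic mean
print(statistics.geometric_mean(data))   # geometric mean
print(statistics.median(data))           # 61
print(statistics.mode(data))             # 62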

 Variability Measures
The most common measure of variability is the standard deviation σ, which is based on
the sum of squared differences between the individual elements and the mean:
σ = √( Σ (xi − x̄)² / (n − 1) )   (use n in the denominator for a full population)

Variability in statistics refers to the difference being exhibited by data points within
a data set, as related to each other or as related to the mean. This can be expressed
through the range, variance or standard deviation of a data set. The field of finance
uses these concepts as they are specifically applied to price data and the returns that
changes in price imply.
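Continuing with the same 21 values, the spread can be computed directly with the
standard library; a small sketch:

import statistics

data = [53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61,
        62, 62, 62, 64, 65, 65, 67, 68, 68, 70]

print(max(data) - min(data))        # range
print(statistics.pvariance(data))   # population variance
print(statistics.pstdev(data))      # population standard deviation (sigma)
print(statistics.stdev(data))       # sample standard deviation (n - 1 denominator)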

 Interpreting Variance
Repeated observations of the same phenomenon do not always produce the same
results, due to random noise or error.


Sampling errors result when our observations capture unrepresentative


circumstances, like measuring rush hour traffic on weekends as well as during the
work week.

Measurement errors reflect the limits of precision inherent in any sensing device.
The notion of signal to noise ratio captures the degree to which a series of
observations reflects a quantity of interest as opposed to data variance.

As data scientists, we care about changes in the signal rather than the noise, yet we
must accept variance as an inherent property of the universe.
Example: Each morning you weigh yourself on a scale you are guaranteed to get a
different number, with changes reflecting when you last ate (sampling error), the
flatness of the floor, or the age of the scale (both measurement error) as much as
changes in your body mass (actual variation).

So what is your real weight? Every measured quantity is subject to some level of
variance, yet data scientists still seek to explain the world through such data.

 Characterizing Distributions
Distributions do not necessarily have much probability mass exactly at the mean.
Consider what your wealth would look like after you borrow $100 million, and then
bet it all on an even money coin flip. Heads you are now $100 million in the clear, tails
you are $100 million in hock. Your expected wealth is zero, but this mean does not
tell you much about the shape of your wealth distribution.

However, taken together the mean and standard deviation do a decent job of
characterizing any distribution.

CORRELATION ANALYSIS
Correlation
Suppose we are given two variables x and y, represented by a sample of n points of
the form (xi , yi), for 1 ≤ i ≤ n. We say that x and y are correlated when the value of x
has some predictive power on the value of y.

The correlation coefficient r(X, Y ) is a statistic that measures the degree to which Y
is a function of X, and vice versa. The value of the correlation coefficient ranges from
−1 to 1, where 1 means fully correlated and 0 implies no relation, or independent


variables.

Negative correlations imply that the variables are anti-correlated, meaning that when
X goes up, Y goes down. Perfectly anti-correlated variables have a correlation of −1.
Note that negative correlations are just as good for predictive purposes as positive
ones.

That you are less likely to be unemployed the more education you have is an example
of a negative correlation, so the level of education can indeed help predict job status.
Correlations around 0 are useless for forecasting. Observed correlations drive many
of the predictive models we build in data science.

Representative strengths of correlations include:


• Are taller people more likely to remain lean? The observed correlation between
height and BMI is r = −0.711, so height is indeed negatively correlated with body
mass index (BMI).

• Does financial status affect health? The observed correlation between household
income and the prevalence of coronary artery disease is r = −0.717, so there is a
strong negative correlation. So yes, the wealthier you are, the lower your risk of
having a heart attack.

 Correlation Coefficients: Pearson and Spearman Rank


There are two primary statistics used to measure correlation. Both operate on the
same −1 to 1 scale, although they measure somewhat different things. These different
statistics are appropriate in different situations.

The Pearson Correlation Coefficient


The more prominent of the two statistics is Pearson correlation, defined as

r = Σ (Xi − X̄)(Yi − Ȳ) / ( √(Σ (Xi − X̄)²) · √(Σ (Yi − Ȳ)²) )
Suppose X and Y are strongly correlated. Then we would expect that when xi is greater
than the mean X , then yi should be bigger than its mean Y . When xi is lower than its
mean, yi should follow. Now look at the numerator. The sign of each term is positive
when both values are above (1 ×1) or below (−1× −1) their respective means. The sign
of each term is negative ((−1×1) or (1×−1)) if they move in opposite directions,

suggesting negative correlation. If X and Y were uncorrelated, then positive and


negative terms should occur with equal frequency, offsetting each other and driving
the value to zero.

The numerator’s operation of determining the sign of the correlation is so useful that we
give it a name, covariance, computed as
Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)   (some texts omit the 1/(n − 1) factor).

The denominator of the Pearson formula reflects the amount of variance in the two
variables, as measured by their standard deviations. The covariance between X and Y
potentially increases with the variance of these variables, and this denominator is the
magic amount to divide it by to bring correlation to a −1 to 1 scale.

The Spearman Rank Correlation Coefficient


Pearson correlation measures how well the best linear predictors can work, but says
nothing about weirder functions like absolute value.

The Spearman rank correlation coefficient essentially counts the number of pairs of
input points which are out of order. Suppose that our data set contains points (x1, y1)
and (x2, y2) where x1 < x2 and y1 < y2. This is a vote that the values are positively
correlated, whereas the vote would be for a negative correlation if y2 < y1.

Summing up over all pairs of points and normalizing properly gives us the Spearman rank
correlation. Let rank(xi) be the rank position of xi in sorted order among all xi, so the
rank of the smallest value is 1 and that of the largest value is n. Then

ρ = 1 − ( 6 Σ di² ) / ( n(n² − 1) ),   where di = rank(xi) − rank(yi)
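Both coefficients are available in scipy.stats. The sketch below uses a small made-up
data set, where y = x² is a monotone but non-linear function of x, to contrast the two
statistics:

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 2 for v in x]       # monotone but non-linear in x

r, _ = pearsonr(x, y)         # strong, but less than 1: the relationship is not linear
rho, _ = spearmanr(x, y)      # exactly 1.0: the ranks of x and y agree perfectly

print(round(r, 3), round(rho, 3))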

 The Power and Significance of Correlation


The correlation coefficient, r, tells us about the strength and direction of the linear
relationship between x and y. However, the reliability of the linear model also
depends on how many observed data points are in the sample. We need to look at
both the value of the correlation coefficient r and the sample size n, together.


We perform a hypothesis test of the “significance of the correlation coefficient”


to decide whether the linear relationship in the sample data is strong enough to use
to model the relationship in the population.
The sample data are used to compute r, the correlation coefficient for the sample. If
we had data for the entire population, we could find the population correlation
coefficient. But because we only have sample data, we cannot calculate the
population correlation coefficient. The sample correlation coefficient, r, is our
estimate of the unknown population correlation coefficient.
 The symbol for the population correlation coefficient is ρ, the Greek letter “rho.”
 ρ = population correlation coefficient (unknown)
 r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation
coefficient ρ is “close to zero” or “significantly different from zero”. We decide this
based on the sample correlation coefficient r and the sample size n.

If the test concludes that the correlation coefficient is significantly different


from zero, we say that the correlation coefficient is “significant.”
Conclusion: There is sufficient evidence to conclude that there is a significant linear
relationship between x and y because the correlation coefficient is significantly
different from zero. What the conclusion means: There is a significant linear
relationship between x and y. We can use the regression line to model the linear
relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly


different from zero (it is close to zero), we say that the correlation coefficient is
“not significant.”

Conclusion: “There is insufficient evidence to conclude that there is a significant


linear relationship between x and y because the correlation coefficient is not
significantly different from zero.” What the conclusion means: There is not a
significant linear relationship between x and y. Therefore, we CANNOT use the
regression line to model a linear relationship between x and y in the population.
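In practice this hypothesis test is usually delegated to library code: scipy.stats.pearsonr
returns both r and the p-value of the test that ρ = 0. A hedged sketch, using made-up
sample data and an arbitrary significance level of 0.05:

from scipy.stats import pearsonr

x = [2, 4, 5, 7, 9, 11, 12, 15, 16, 18]
y = [3, 5, 4, 8, 10, 10, 13, 14, 17, 19]    # made-up sample of n = 10 points

r, p_value = pearsonr(x, y)

if p_value < 0.05:
    print(f"r = {r:.3f} is significant: evidence of a linear relationship")
else:
    print(f"r = {r:.3f} is not significant at the 0.05 level")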

 Correlation does not imply Causation


Many observed correlations are completely spurious, with neither variable
having any real impact on the other.

Still, assuming that correlation implies causation is a common error in thinking, even among


those who understand logical reasoning. Generally speaking, few statistical tools
are available to tease out whether A really causes B.

For example, the fact that we can put people on a diet that makes them lose weight
without getting shorter is convincing evidence that weight does not cause height.
But it is often harder to do these experiments the other way, e.g. there is no
reasonable way to make people shorter other than by hacking off limbs.

 Detecting Periodicities by Autocorrelation


Autocorrelation can help verify the presence of cycles and determine their durations.
Seasonal trends reflect cycles of a fixed duration, rising and falling in a regular
pattern. To detect them, we correlate a time series against a copy of itself shifted by a
lag of k positions. If the values are in sync for a particular period length p, then this
correlation with itself will be unusually high at lag p relative to other possible lags.
Comparing a sequence to itself is called autocorrelation, and the series of
correlations for all 1 ≤ k ≤ n − 1 is called the autocorrelation function.

The figure presents a time series of daily sales, and the associated autocorrelation function
for this data. The peak at a shift of seven days (and every multiple of seven days)
establishes that there is a weekly periodicity in sales: more stuff gets sold on
weekends.

Autocorrelation is an important concept in predicting future events, because it means


we can use previous observations as features in a model. The heuristic that
tomorrow’s weather will be similar to today’s is based on autocorrelation, with a lag
of p = 1 days. Certainly we would expect such a model to be more accurate than
predictions made on weather data from six months ago (lag p = 180 days).


Generally speaking, the autocorrelation function for many quantities tends to be


highest for very short lags. This is why long-term predictions are less accurate than
short-term forecasts: the autocorrelations are generally much weaker. But periodic
cycles do sometimes stretch much longer. Indeed, a weather forecast based on a lag
of p = 365 days will be much better than one of p = 180, because of seasonal effects.
Computing the full autocorrelation function requires calculating n−1 different
correlations on points of the time series, which can get expensive for large n.

Fortunately, there is an efficient algorithm based on the fast Fourier transform (FFT),
which makes it possible to construct the autocorrelation function even for very long
sequences.
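A hedged sketch of detecting a weekly cycle with pandas is shown below; the daily sales
numbers are synthetic, generated only so that they contain a period-7 pattern plus noise.
For the full autocorrelation function over all lags, the acf function in statsmodels
(statsmodels.tsa.stattools.acf) can compute all lags efficiently using the FFT.

import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly (period-7) cycle plus random noise
days = np.arange(365)
sales = 100 + 20 * np.sin(2 * np.pi * days / 7) + np.random.normal(0, 5, len(days))
series = pd.Series(sales)

# Autocorrelation at a few candidate lags; the peak at lag 7 reveals the weekly cycle
for lag in (1, 3, 7, 14):
    print(lag, round(series.autocorr(lag=lag), 3))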

DATA MUNGING:
Data munging is the process of cleaning and transforming raw data into a structured
format that is suitable for analysis. This step is essential because real-world data is
often messy, incomplete, or inconsistent. Data munging aims to address these issues,
ensuring that the data is accurate, consistent, and ready for further exploration.

Example 1: Handling Missing Data

One common challenge in real-world datasets is missing values. Let's consider a


dataset containing customer information, where some entries have missing values for
the "Email" field.

Data munging involves deciding how to handle these missing values, whether by
imputing them with averages, removing the corresponding rows, or using other
strategies.

import pandas as pd


# Load the dataset

df = pd.read_csv('customer_data.csv')

# Handling missing values: a text field like Email cannot be averaged, so fill it
# with a placeholder; a numeric column could instead be imputed with its mean

df['Email'].fillna('unknown', inplace=True)

Example 2: Dealing with Duplicates

Duplicate records can skew analysis results. Data munging includes identifying and
handling duplicate entries. In the following example, we use pandas to identify and
remove duplicates from a dataset.

# Identifying and removing duplicate records

df.drop_duplicates(inplace=True)

Example 3: Text Data Cleanup

When dealing with text data, data munging may involve cleaning and preprocessing text
for analysis. Let's say we have a dataset with a "Description" column containing text
data. We can use regular expressions to remove special characters and convert text to
lowercase.

import re

# Cleaning text data

df['Description'] = df['Description'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

Conclusion:

Data munging is a critical skill for anyone working with data. Whether you're a data
scientist, analyst, or business professional, mastering the art of data munging can
significantly impact the quality and reliability of your analyses. The examples provided
demonstrate some common scenarios encountered during data munging, but the field
is vast and ever-evolving. Continuous learning and practice will empower you to

efficiently prepare data for insightful analysis, driving informed decision-making in


your organization.

LANGUAGES FOR DATA SCIENCE


The primary data science programming languages to be aware of are:

 Python: This is today’s bread-and-butter programming language for data science.


Python contains a variety of language features to make basic data munging easier, like
regular expressions. It is an interpreted language, making the development process
quicker and enjoyable. Python is supported by an enormous variety of libraries, doing
everything from scraping to visualization to linear algebra and machine learning.
Perhaps the biggest strike against Python is efficiency: interpreted languages cannot
compete with compiled ones for speed. But Python compilers exist in a fashion, and
support linking in efficient C/assembly language libraries for computationally-
intensive tasks. Bottom line: Python should probably be your primary tool in working with data.

 Perl: This used to be the go to language for data munging on the web, before
Python ate it for lunch. In the TIOBE programming language popularity index
(http://www.tiobe.com/tiobe-index), Python first exceeded Perl in popularity in 2008
and hasn’t looked back. There are several reasons for this, including stronger support
for object-oriented programming and better available libraries, but the bottom line is
that there are few good reasons to start projects in Perl at this point.

 R: This is the programming language of statisticians, with the deepest libraries


available for data analysis and visualization. The data science world is split between R
and Python camps, with R perhaps more suitable for exploration and Python better for
production use. The style of interaction with R is somewhat of an acquired taste, so I
encourage you to play with it a bit to see whether it feels natural to you. Linkages exist
between R and Python, so you can conveniently call R library functions in Python code.
This provides access to advanced statistical methods, which may not be supported by
the native Python libraries.

 Matlab: The Mat here stands for matrix, as Matlab is a language designed for the
fast and efficient manipulation of matrices. As we will see, many machine learning
algorithms reduce to operations on matrices, making Matlab a natural choice for
engineers programming at a high-level of abstraction. Matlab is a proprietary system.
However, much of its functionality is available in GNU Octave, an open-source
alternative.

 Java and C/C++: These mainstream programming languages for the development
of large systems are important in big data applications. Parallel processing systems
like Hadoop and Spark are based on Java and C++, respectively. If you are living in the
world of distributed computing, then you are living in a world of Java and C++ instead
of the other languages listed here.

 Mathematica/Wolfram Alpha: Mathematica is a proprietary system providing


computational support for all aspects of numerical and symbolic mathematics, built
upon the less proprietary Wolfram programming language. It is the foundation of the
Wolfram Alpha computational knowledge engine, which processes natural language-
like queries through a mix of algorithms and pre-digested data sources. Check it out at
http://www.wolframalpha.com. I will confess a warm spot for Mathematica. It is
what I tend to reach for when I am doing a small data analysis or simulation, but cost
has traditionally put it out of the range of many users. The release of the Wolfram
language perhaps now opens it up to a wider community.

 Excel: Spreadsheet programs like Excel are powerful tools for exploratory data
analysis, such as playing with a given data set to see what it contains. They deserve
our respect for such applications. Full featured spreadsheet programs contain a
surprising amount of hidden functionality for power users. A student of mine who rose
to become a Microsoft executive told me that 25% of all new feature requests for Excel
proposed functionality already present there. The special functions and data
manipulation features you want probably are in Excel if you look hard enough, in the
same way that a Python library for what you need probably will be found if you search
for it.

 The importance of Notebook Environments


Notebooks provide an interactive computing environment that allows data scientists
to run code in a step-by-step manner. This feature is particularly valuable when
exploring and analyzing data, as it enables users to execute portions of the code and
immediately see the results. This iterative process enhances efficiency and facilitates a
more dynamic and responsive workflow.

 Computations need to be reproducible: We must be able to run the same


programs again from scratch, and get exactly the same result. This means that data
pipelines must be complete: taking raw input and producing the final output.
 Computations must be tweakable: Often reconsideration or evaluation will
prompt a change to one or more parameters or algorithms. This requires rerunning
the notebook to produce the new computation. A notebook is never finished until
after the entire project is done.
 Data pipelines need to be documented: That notebooks permit you to integrate
text and visualizations with your code provides a powerful way to communicate
what you are doing and why, in ways that traditional programming environments
cannot match.

 Standard Data Formats


Data comes from all sorts of places, and in all kinds of formats. Which representation
is best depends upon who the ultimate consumer is. Charts and graphs are marvelous
ways to convey the meaning of numerical data to people.

The best computational data formats have several useful properties:


 They are easy for computers to parse: Data written in a useful format is destined
to be used again, elsewhere. Sophisticated data formats are often supported by APIs
that govern technical details ensuring proper format.

 They are easy for people to read: Which of the data files in this directory is the
right one for me to use? What do we know about the data fields in this file? What is the
gross range of values for each particular field? These use cases speak to the enormous
value of being able to open a data file in a text editor to look at it. Typically, this means
presenting the data in a human-readable text-encoded format, with records demarcated
by separate lines, and fields separated by delimiting symbols.

 They are widely used by other tools and systems: The urge to invent
proprietary data standards beats firmly in the corporate heart, and most software
developers would rather share a toothbrush than a file format. But these are impulses
to be avoided. The power of data comes from mixing and matching it with other data
resources, which is best facilitated by using popular standard formats.

A data format's structure defines how data is organized in a database or file system,
giving it meaning. The most important data formats/representations to be aware of are
discussed below:
• CSV (comma separated value) files: Comma-separated values (CSV) is a common

format for storing and exchanging tabular data. It's similar to Excel and is often used
to process data with pandas. CSV is good for storing and processing text, numbers,
and dates. However, if your data contains strings or sentences with commas, you
should wrap the strings in quotation marks or use a different delimiter.

• XML (eXtensible Markup Language): is a markup language to structure


documents. The document is organized by elements and their attributes, and the
scope of an element is marked by tags for the start and the end. Elements can be
nested and may include values. This allows for a representation of complex data
models, including multidimensional data and hierarchies.

 SQL (structured query language) databases: Relational databases prove


excellent for manipulating multiple distinct but related tables, using SQL to provide a
clunky but powerful query language. Any reasonable database system imports and
exports records as either csv or XML files, as well as an internal content dump. The
internal representation in databases is opaque, so it really isn’t accurate to describe
them as a data format. Still, I emphasize them here because SQL databases generally
prove a better and more powerful solution than manipulating multiple data files in an
ad hoc manner.

• JSON (JavaScript Object Notation): This is a format for transmitting data objects
between programs. It is a natural way to communicate the state of variables/data
structures from one system to another. This representation is basically a list of
attribute-value pairs corresponding to variable/field names, and the associated
values:

Because library functions that support reading and writing JSON objects are readily
available in all modern programming languages, it has become a very convenient way
to store data structures for later use. JSON objects are human readable, but look quite
cluttered when representing arrays of records, compared to CSV files. Use them for
complex structured objects, but not for simple tables of data.
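A minimal sketch of round-tripping a record through JSON with Python's standard json
module (the record fields are invented for illustration):

import json

# A record as attribute-value pairs (a Python dict)
record = {"name": "Ann", "age": 34, "purchases": [19.99, 5.50]}

text = json.dumps(record)      # serialize to a JSON string
restored = json.loads(text)    # parse it back into a data structure

print(text)
print(restored["purchases"][0])    # 19.99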

• Protocol buffers: These are a language/platform-neutral way of serializing


structured data for communications and storage across applications. They are
essentially lighter weight versions of XML (where you define the format of your
structured data), designed to communicate small amounts of data across programs


like JSON. This data format is used for much of the intermachine communication at
Google. Apache Thrift is a related standard, used at Facebook.

COLLECTING DATA
The most critical issue in any data science or modeling project is finding the right data
set.
Identifying viable data sources is an art, one that revolves around three basic questions:
• Who might actually have the data I need?
• Why might they decide to make it available to me?
• How can I get my hands on it?

In this section, we will explore the answers to these questions. We look at common
sources of data, and what you are likely to be able to find and why.

 Hunting
Who has the data, and how can you get it? Some of the likely suspects are reviewed
below.

Companies and Proprietary Data Sources


Large companies like Facebook, Google, Amazon, American Express, and Blue Cross
have amazing amounts of exciting data about users and transactions, data which could
be used to improve how the world works. The problem is that getting outside access is
usually impossible.
Companies are reluctant to share data for two good reasons:
• Business issues, and the fear of helping their competition.
• Privacy issues, and the fear of offending their customers.

Many responsible companies like The New York Times, Twitter, Facebook, and
Google do release certain data, typically by rate-limited application program
interfaces (APIs).
They generally have two motives:
• Providing customers and third parties with data that can increase sales. For
example, releasing data about query frequency and ad pricing can encourage more
people to place ads on a given platform.
• It is generally better for the company to provide well-behaved APIs than having
cowboys repeatedly hammer and scrape their site.

You won’t find exactly the content or volume that you dream of, but probably
something that will suffice to get started. Be aware of limits and terms of use.

Other organizations do provide bulk downloads of interesting data for offline


analysis, as with the Google Ngrams, IMDb, and the taxi fare data sets. Large data sets
often come with valuable metadata, such as book titles, image captions, and edit
history, which can be re-purposed with proper imagination.

Finally, most organizations have internal data sets of relevance to their business. As
an employee, you should be able to get privileged access while you work there. Be
aware that companies have internal data access policies, so you will still be subject to
certain restrictions. Violating the terms of these policies is an excellent way to
become an ex-employee.

Government Data Sources


Collecting data is one of the important things that governments do. Indeed, the
requirement that the United States conduct a census of its population is mandated by
our constitution, and has been running on schedule every ten years since 1790.

City, state, and federal governments have become increasingly committed to open
data, to facilitate novel applications and improve how government can fulfill its
mission. The website http://Data.gov is an initiative by the federal government to
centrally collect its data sources, and at last count points to over 100,000 data sets!

Government data differs from industrial data in that, in principle, it belongs to the
People. The Freedom of Information Act (FOI) enables any citizen to make a formal
request for any government document or data set. Such a request triggers a process
to determine what can be released without compromising the national interest or
violating privacy.

State governments operate under fifty different sets of laws, so data that is tightly
held in one jurisdiction may be freely available in others. Major cities like New York
have larger data processing operations than many states, again with restrictions that
vary by location.

Academic Data Sets


There is a vast world of academic scholarship, covering all that humanity has deemed
worth knowing. An increasing fraction of academic research involves the creation of


large data sets. Many journals now require making source data available to other
researchers prior to publication. Expect to be able to find vast amounts of economic,
medical, demographic, historical, and scientific data if you look hard enough.

The key to finding these data sets is to track down the relevant papers. There is an
academic literature on just about any topic of interest. Google Scholar is the most
accessible source of research publications. Search by topic, and perhaps “Open
Science” or “data.” Research publications will typically provide pointers to where its
associated data can be found. If not, contacting the author directly with a request
should quickly yield the desired result.

The biggest catch with using published data sets is that someone else has worked
hard to analyze them before you got to them, so these previously mined sources may
have been sucked dry of interesting new results. But bringing fresh questions to old
data generally opens new possibilities.

Often interesting data science projects involve collaborations between researchers


from different disciplines, such as the social and natural sciences. These people speak
different languages than you do, and may seem intimidating at first. But they often
welcome collaboration, and once you get past the jargon it is usually possible to
understand their issues on a reasonable level without specialized study. Be assured
that people from other disciplines are generally not any smarter than you are.

Sweat Equity
Sometimes you will have to work for your data, instead of just taking it from others.
Much historical data still exists only in books or other paper documents, thus
requiring manual entry and curation. A graph or table might contain information that
we need, but it can be hard to get numbers from a graphic locked in a PDF (portable
document format) file.

I have observed that computationally-oriented people vastly over-estimate the


amount of effort it takes to do manual data entry. At one record per minute, you can
easily enter 1,000 records in only two work days. Instead, computational people tend
to devote massive efforts trying to avoid such grunt work, like hunting in vain for
optical character recognition (OCR) systems that don’t make a mess of the file, or
spending more time cleaning up a noisy scan than it would take to just type it in again
fresh.

Crowdsourcing platforms like Amazon Turk and CrowdFlower enable you to pay for
armies of people to help you extract data, or even collect it in the first place. Tasks
requiring human annotation, like labeling images or answering surveys, are a
particularly good use of remote workers.

Many amazing open data resources have been built up by teams of contributors, like
Wikipedia, Freebase, and IMDb. But there is an important concept to remember:
people generally work better when you pay them.

 Scraping
Data scraping, also known as web scraping, is a technique that involves using a
computer program to extract data from a website, database, or other source. The data
can be text, images, or videos, and it can be copied into a spreadsheet or local file for
later use.

Suppose you want some information about Mahatma Gandhi from Wikipedia or any
other website, you can extract this data by copying and pasting the information into
your file. But if you want this information for hundreds of different personalities,
manually getting this data is impossible, and you need an automated and efficient
method to scrape all of this information quickly. And here, Web Scraping comes into the
picture.

Web Scraping can be defined as an automated process to extract content or data from
the Internet. It provides various intelligent and automated methods to quickly extract
large volumes of data from websites. Most of this data will be in unstructured or HTML
format, which can be further parsed and converted into a structured format for further
analysis. In theory, you can scrape any data on the Internet. The most common data
types scraped include text, images, videos, pricing, reviews, product information, etc.

There are many ways you can perform Web Scraping to collect data from websites. You
can use online Web Scraping services or APIs, or create your own custom-built code to
scrape the information. Many popular websites such as Google, Twitter, Facebook, etc.
provide APIs that allow the collection of the required data directly in a structured
format.
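As an illustration only, here is a minimal scraping sketch using the widely used
requests and BeautifulSoup (bs4) libraries. The URL is a placeholder, and in practice you
should always check a site's terms of use and robots.txt before scraping it:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute a page you are permitted to scrape
url = "https://example.com/article"

response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and the text of every paragraph
title = soup.title.string if soup.title else ""
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(title)
print(len(paragraphs), "paragraphs scraped")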


What Is Web Scraping Used For?

Web Scraping has countless applications across industries. A few of the most common
use cases of Web Scraping include -

Price Monitoring

Organizations scrape the pricing and other related information for their and
competitors' products to analyze and fix optimal pricing for the products to maximize
revenue.

Market Research

Organizations use Web Scraping to extract product data, reviews, and other relevant
information to perform sentiment analysis, consumer trends, and competitor analysis.

News Monitoring

Organizations dependent on daily news for their day-to-day functioning can use Web
Scraping to generate reports based on the daily news.

Sentiment Analysis

Companies can collect product-related data from Social Media such as Facebook,
Twitter, etc., and other online forums to analyze the general sentiment for their
products among consumers.

Contact Scraping

Organizations scrape websites to collect contact information such as email IDs, and
mobile numbers to send bulk promotional and marketing emails and SMS.


Other than the above use cases, Web Scraping can also be used for numerous other
scenarios such as Weather Forecasting, Sports Analytics, Real Estate Listings, etc.

 Logging
Logging is a vital part of programming, providing a record of events and important
information that can be used to monitor and optimise system performance.

One of the primary uses of logging in programming is for debugging and


troubleshooting. When a program is running, logging can provide a record of events and
any errors or issues that may have occurred. This can be especially useful when
debugging complex programs, as logging can provide valuable information about the
root cause of an error and help programmers identify and fix the problem. By analysing
logs, programmers can more easily identify and fix problems, saving time and resources
in the development process.

Logging can play a vital part in machine learning by monitoring and optimizing the
performance of a system. Machine learning (ML) log files are an essential component of
the ML pipeline. They serve as a record of the training and evaluation processes,
providing valuable insights and debugging information for ML practitioners.
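A minimal sketch of Python's built-in logging module, recording progress and errors for
a hypothetical training run (the file name and messages are arbitrary):

import logging

# Write timestamped messages to a log file
logging.basicConfig(
    filename="pipeline.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

log.info("training started")
try:
    accuracy = 0.92    # placeholder for a real evaluation step
    log.info("epoch 1 finished, accuracy=%.3f", accuracy)
except Exception:
    log.exception("training failed")    # records the full traceback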

CLEANING DATA
Data cleansing or data scrubbing is the act of detecting and correcting (or removing)
corrupt or inaccurate records from a record set, table, or database. Used mainly in
databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant
etc. parts of the data and then replacing, modifying or deleting this dirty data.
Data in the real world is often dirty: there is lots of potentially incorrect data,
 e.g., instrument faulty, human or computer error, transmission error
o incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
o noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
o inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
o Intentional (e.g., disguised missing data)


 Jan. 1 as everyone’s birthday?

 Errors Vs Artifacts

Errors: information that is lost during acquisition and can never be recovered, e.g.
due to a power outage or crashed servers.
Example: Consider yourself working in a bottled water company; during production you
accidentally drop a sack of sugar into the water container. This is an error! You cannot
recover the original water or the sugar once it is all mixed up.

Artifacts: systematic problems that arise from the data cleaning process. These
problems can be corrected, but we must first discover them.
Example: Now consider that the data you possess is a tin of oil. If there is a rock
present in the tin of oil, you can easily remove the rock; this is therefore a classic
example of an artifact.

 Data Compatibility
We say that a comparison of two items is “apples to apples” when it is a fair comparison,
i.e. the items involved are similar enough that they can be meaningfully stood up
against each other. In contrast, “apples to oranges” comparisons are ultimately
meaningless.

For example: It makes no sense to compare weights of 123.5 against 78.9, when one is
in pounds and the other is in kilograms.

These types of data comparability issues arise whenever data sets are merged.

Unit Conversions
A unit of measurement denotes how a value is measured. For instance, something
might be 4 pounds, 4 seconds, 4 inches, etc. All of these measurements contain the
same value, but the units make the measurements fundamentally different.

Quantifying observations in physical systems requires standard units of measurement.


Unfortunately there exist many functionally equivalent but incompatible systems of
measurement. My 12-year old daughter and I both weigh about 70, but one of us is in
pounds and the other in kilograms.

Disastrous things like rocket explosions happen when measurements are entered into
computer systems using the wrong units of measurement. In particular, NASA lost the
$125 million Mars Climate Orbiter space mission on September 23, 1999 due to a
metric-to-English conversion issue.

Such problems are best addressed by selecting a single system of measurements and
sticking to it. In particular, individual measurements are naturally expressed as single
decimal quantities (like 3.28 meters) instead of incomparable pairs of quantities (5
feet, 8 inches). This same issue arises in measuring angles (radians vs.
degrees/seconds) and weight (kilograms vs. pounds/oz).

When merging records from diverse sources, it is an excellent practice to create a new
“origin” or “source” field to identify where each record came from. This provides at
least the hope that unit conversion mistakes can be corrected later, by systematically
operating on the records from the problematic source.
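A hedged pandas sketch of this practice: two hypothetical sources report weight in
different units, so everything is converted to kilograms and an explicit source column is
kept for later auditing (the column names and values are invented):

import pandas as pd

# Hypothetical records: one source reports pounds, the other kilograms
us_data = pd.DataFrame({"name": ["Ann", "Bob"], "weight": [154.0, 176.0]})
eu_data = pd.DataFrame({"name": ["Carla", "Dev"], "weight": [70.0, 80.0]})

us_data["source"] = "us_survey"                     # track where each record came from
us_data["weight"] = us_data["weight"] * 0.453592    # pounds -> kilograms
eu_data["source"] = "eu_survey"                     # already in kilograms

merged = pd.concat([us_data, eu_data], ignore_index=True)
print(merged)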

Numerical Representation Conversions


Numerical features are the easiest to incorporate into mathematical models. Indeed,
certain machine learning algorithms such as linear regression and support vector
machines work only with numerically-coded data. But even turning numbers into
numbers can be a subtle problem. Numerical fields might be represented in different
ways: as integers (123), as decimals (123.5), or even as fractions (123 1/2). Numbers
can even be represented as text, requiring the conversion from “ten million” to
10000000 for numerical processing.

The distinction between integers and floating point (real) numbers is important to
maintain. Integers are counting numbers: quantities which are really discrete should
be represented as integers. Physically measured quantities are never precisely
quantified, because we live in a continuous world. Thus all measurements should be
reported as real numbers. Integer approximations of real numbers are sometimes
used in a misbegotten attempt to save space. Don’t do this: the quantization effects
of rounding or truncation introduce artifacts.

In one particularly clumsy data set we encountered, baby weights were represented as
two integer fields (pounds and the remaining ounces). Much better would have been to
combine them into a single decimal quantity.


Name Unification
Integrating records from two distinct data sets requires them to share a common key
field. Names are frequently used as key fields, but they are often reported
inconsistently. Is José the same fellow as Jose? Such diacritic marks are banned from
the official birth records of several U.S. states, in an aggressive attempt to force them
to be consistent.

As another case in point, databases show my publications as authored by the Cartesian


product of my first (Steve, Steven, or S.), middle (Sol, S., or blank), and last (Skiena)
names, allowing for nine different variations. And things get worse if we include
misspellings. I can find myself on Google with a first name of Stephen and last names
of Skienna and Skeina.

Unifying records by key is a hard problem, which is why ID numbers were invented; use them as


keys if you possibly can. The best general technique is unification: doing simple text
transformations to reduce each name to a single canonical version.
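A minimal sketch of such unification with Python's standard unicodedata module:
lowercase the name, strip diacritics and periods, and collapse whitespace. This is a
deliberately simple rule, not a complete record-linkage solution:

import unicodedata

def canonical_name(name: str) -> str:
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, remove periods, and collapse repeated whitespace
    return " ".join(stripped.lower().replace(".", "").split())

print(canonical_name("José"))                  # jose
print(canonical_name("Steven S. Skiena"))      # steven s skiena
print(canonical_name("Steve   Skiena"))        # steve skiena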

Time/Date Unification
Data/time stamps are used to infer the relative order of events, and group events by
relative simultaneity. Integrating event data from multiple sources requires careful
cleaning to ensure meaningful results.

First let us consider issues in measuring time. The clocks from two computers never
exactly agree. There are also time zone issues when dealing with data from different
regions, as well as diversities in local rules governing changes in daylight saving time.

The right answer here is to align all time measurements to Coordinated Universal
Time (UTC), a modern standard subsuming the traditional Greenwich Mean Time
(GMT).
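
A minimal sketch of such alignment, assuming Python 3.9+ with the standard zoneinfo module (the timestamp and time zone are invented for illustration):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# An event logged in local Indian Standard Time
local = datetime(2024, 3, 10, 9, 30, tzinfo=ZoneInfo("Asia/Kolkata"))

# Align it to UTC before merging with events from other sources
utc = local.astimezone(timezone.utc)
print(utc.isoformat())   # 2024-03-10T04:00:00+00:00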

The Gregorian calendar is common throughout the technology world, although many
other calendar systems are in use in different countries.

Financial Unification


Money makes the world go round, which is why so many data science projects
revolve around financial time series.

One issue here is currency conversion, representing international prices using a standardized financial unit. Currency exchange rates can vary by a few percent within a given day, so certain applications require time-sensitive conversions. Conversion rates are not truly standardized: different markets will each have different rates and spreads, the gap between buying and selling prices that covers the cost of conversion.

The time value of money implies that a dollar today is (generally) more valuable than
a dollar a year from now, with interest rates providing the right way to discount
future dollars. Inflation rates are estimated by tracking price changes over baskets of
items, and provide a way to standardize the purchasing power of a dollar over time.
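
A small sketch of the standard discounting formula, present value = future value / (1 + r)^n, in Python (the 5% rate is just an example):

def present_value(future_amount, annual_rate, years):
    # Discount a future cash amount back to today's money,
    # assuming yearly compounding at the given interest rate.
    return future_amount / (1 + annual_rate) ** years

# A dollar promised one year from now, discounted at 5% interest
print(round(present_value(1.00, 0.05, 1), 4))   # 0.9524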

 Dealing with missing values


Not all data sets are complete. An important aspect of data cleaning is identifying
fields for which data isn’t there, and then properly compensating for them:
• What is the year of death of a living person?
• What should you do with a survey question left blank, or filled with an obviously
outlandish value?
• What is the relative frequency of events too rare to see in a limited-size sample?

Numerical data sets expect a value for every element in a matrix. Setting missing
values to zero is tempting, but generally wrong, because there is always some
ambiguity as to whether these values should be interpreted as data or not.

Is someone’s salary zero because he is unemployed, or did he just not answer the
question?

The danger with using nonsense values as not-data symbols is that they can get
misinterpreted as data when it comes time to build models. A linear regression
model trained to predict salaries from age, education, and gender will have trouble
with people who refused to answer the question.

Using a value like −1 as a no-data symbol has exactly the same deficiencies as zero.
Indeed, be like the mathematician who is afraid of negative numbers: stop at nothing
to avoid them.

Take-Home Lesson: Separately maintain both the raw data and its cleaned version.
The raw data is the ground truth, and must be preserved intact for future analysis.
The cleaned data may be improved using imputation to fill in missing values. But
keep raw data distinct from cleaned, so we can investigate different approaches to
guessing.

So how should we deal with missing values? The simplest approach is to drop all
records containing missing values. This works just fine when it leaves enough
training data, provided the missing values are absent for non-systematic reasons. If
the people refusing to state their salary were generally those above the mean,
dropping these records will lead to biased results.

But typically we want to make use of records with missing fields. It can be better to
estimate or impute missing values, instead of leaving them blank. We need general
methods for filling in missing values. Candidates include:

 Heuristic-based imputation: Given sufficient knowledge of the underlying domain, we should be able to make a reasonable guess for the value of certain fields.

 Mean value imputation: Using the mean value of a variable as a proxy for missing
values is generally sensible.

 Random value imputation: Another approach is to select a random value from the
column to replace the missing value. This would seem to set us up for potentially
lousy guesses, but that is actually the point. If we run the model ten times with ten
different imputed values and get widely varying results, then we probably shouldn’t
have much confidence in the model.

 Imputation by nearest neighbor: What if we identify the complete record which matches most closely on all fields present, and use this nearest neighbor to infer the values of what is missing? Such predictions should be more accurate than the mean, when there are systematic reasons to explain variance among records. This approach requires a distance function to identify the most similar records. Nearest neighbor methods are an important technique in data science (a small sketch combining mean and nearest-neighbor imputation appears after this list).

 Imputation by interpolation: More generally, we can use a method like linear regression to predict the values of the target column, given the other fields in the record. Such models can be trained over full records and then applied to those with missing values. Using linear regression to predict missing values works best when there is only one field missing per record. Regression models can easily turn an incomplete record into an outlier, by filling the missing fields in with unusually high or low values.
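
The sketch below illustrates two of these strategies on a toy table, assuming Python with pandas and scikit-learn; the column names and values are invented:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy records with one missing salary
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "years_education": [16, 18, 12, 16, 20],
    "salary": [38000, 52000, np.nan, 61000, 83000],
})

# Mean value imputation: replace the blank with the column average
mean_filled = df.fillna({"salary": df["salary"].mean()})

# Nearest-neighbor imputation: borrow from the most similar complete records
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(mean_filled["salary"].tolist())
print(knn_filled["salary"].tolist())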

 Outlier Detection
An outlier is a data point that significantly deviates from the rest of the data. It can
be either much higher or much lower than the other data points, and its presence
can have a significant impact on the results of machine learning algorithms. They
can be caused by measurement or execution errors. The analysis of outlier data is
referred to as outlier analysis or outlier mining.

It is important for a data scientist to find outliers and remove them from the
dataset as part of the feature engineering before training machine learning
algorithms for predictive modeling. Outliers present in a classification or
regression dataset can lead to lower predictive modeling performance.

Example: Imagine you have a group of friends, and you’re all about the same
age, but one person is much older or younger than the rest. That person would
be considered an outlier because they stand out from the usual pattern. In data,
outliers are points that deviate significantly from the majority, and detecting
them helps identify unusual patterns or errors in the information. This method is
like finding the odd one out in a group, helping us spot data points that might
need special attention or investigation.

Reasons for outliers in data


 Errors during data entry or a faulty measuring device (a faulty sensor may result
in extreme readings).
 Natural occurrence (salaries of junior level employees vs C-level employees)
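
A minimal sketch of one common screening rule, the 1.5 × IQR test, in Python (the salary values, in thousands, are made up):

import numpy as np

salaries = np.array([42, 45, 48, 51, 55, 58, 61, 64, 300])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1                                  # interquartile range
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # conventional fences

outliers = salaries[(salaries < low) | (salaries > high)]
print(outliers)   # [300]

Whether a flagged point is an error to remove or a legitimate extreme value (like the C-level salary above) still requires human judgment.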


CROWDSOURCING

No single person has all the answers. Not even me. Much of what passes for wisdom is
how we aggregate expertise, assembling opinions from the knowledge and experience
of others. Crowdsourcing is the collection of information, opinions, or work from a
group of people.

Crowdsourcing serves as an important source of data in building models, especially for tasks associated with human perception. Humans remain the state-of-the-art system in natural language processing and computer vision, achieving the highest level of performance. The best way to gather training data often requires asking people to score a particular text or image. Social media and other new technologies have made it easier to collect and aggregate opinions on a massive scale.

 The Penny Demo

Figure contains photos of a jar of pennies I accumulated in my office over many years.
How many pennies do I have in this jar? Make your own guess now.

To get the right answer, I had my biologist-collaborator Justin Garden weigh the
pennies on a precision laboratory scale. Dividing by the weight of a single penny gives
the count. Justin can be seen diligently performing his task in Figure (right). So I ask
again: how many pennies do you think I have in this jar? I performed this experiment
on students in my data science class. How will your answer compare to theirs?

I first asked eleven of my students to write their opinions on cards and quietly pass
them up to me at the front of the room. Thus these guesses were completely
independent of each other. The results, sorted for convenience, were:

537, 556, 600, 636, 1200, 1250, 2350, 3000, 5000, 11,000, 15,000


I then wrote these numbers on the board and computed some statistics. The median of these guesses was 1250, with a mean of 3739. In fact, there were exactly 1879 pennies in the jar. The median of the independent guesses came closer to the true count than all but one of the individual guesses. As the next section notes, it was only once the guesses were shared and discussed as a group that it became clear how quickly group-think can settle in.

 When is the crowd wise ?


According to James Surowiecki in his book The Wisdom of Crowds, crowds are wise when four conditions are satisfied:

 When the opinions are independent: Our experiment highlighted how easy it is
for a group to lapse into group-think. People naturally get influenced by others. If you
want someone’s true opinion, you must ask them in isolation.
 When crowds are people with diverse knowledge and methods: Crowds only
add information when there is disagreement. A committee composed of perfectly-
correlated experts contributes nothing more than you could learn from any one of
them.
 When the problem is in a domain that does not need specialized knowledge: I
trust the consensus of the crowd in certain important decisions, like which type of car
to buy or who should serve as the president of my country (gulp).
 Opinions can be fairly aggregated: The least useful part of any mass survey form
is the open response field “Tell us what you think!”. The problem here is that there is no
way to combine these opinions to form a consensus, because different people have
different issues and concerns.

 Mechanisms for Aggregation


Collecting wisdom from a set of responses requires using the right aggregation
mechanism. For estimating numerical quantities, standard techniques like plotting the
frequency distribution and computing summary statistics are appropriate. Both the
mean and median implicitly assume that the errors are symmetrically distributed.

A quick look at the shape of the distribution can generally confirm or reject that
hypothesis. The median is, generally speaking, a more appropriate choice than the
mean in such aggregation problems. It reduces the influence of outliers, which is a
particular problem in the case of mass experiments where a certain fraction of your
participants are likely to be bozos.

Removing outliers is a very good strategy, but we may have other grounds to judge
the reliability of our subjects, such as their performance on other tests where we do
know the answer. Taking a weighted average, where we give more weight to the scores
deemed more reliable, provides a way to take such confidence measures into account.
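
A small sketch of these aggregation mechanisms applied to the penny guesses from earlier, assuming Python with NumPy and SciPy (the reliability weights are invented for illustration):

import numpy as np
from scipy import stats

# The eleven independent guesses from the penny demo
guesses = np.array([537, 556, 600, 636, 1200, 1250, 2350, 3000, 5000, 11000, 15000])

print(np.mean(guesses))               # 3739.0, pulled far upward by the largest guesses
print(np.median(guesses))             # 1250.0, robust to the wild values
print(stats.trim_mean(guesses, 0.2))  # mean after trimming 20% off each end

# Weighted average, giving more weight to guessers deemed more reliable
reliability = np.array([1, 1, 1, 1, 2, 2, 2, 1, 1, 0.5, 0.5])
print(np.average(guesses, weights=reliability))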

 Crowdsourcing services
Crowdsourcing services like Amazon Mechanical Turk and CrowdFlower provide the opportunity for you to hire large numbers of people to do small amounts of piecework.
They help you to wrangle people, in order to create data for you to wrangle. These
crowdsourcing services maintain a large stable of freelance workers, serving as the
middleman between them and potential employers. These workers, generally called
Turkers, are provided with lists of available jobs and what they will pay, as shown in
Figure.

Employers generally have some ability to control the location and credentials of who
they hire, and the power to reject a worker’s efforts without pay, if they deem it
inadequate. But statistics on employers’ acceptance rates are published, and good
workers are unlikely to labor for bad actors.

The tasks assigned to Turkers generally involve simple cognitive efforts that cannot
currently be performed well by computers. Good applications of Turkers include:


• Measuring aspects of human perception: Crowdsourcing systems provide efficient ways to gather representative opinions on simple tasks. One nice application was establishing linkages between colors in red-green-blue space, and the names by which people typically identify them in a language. This is important to know when writing descriptions of products and images.

• Obtaining training data for machine learning classifiers: Our primary interest in
crowdsourcing will be to produce human annotations that serve as training data. Many
machine learning problems seek to do a particular task “as well as people do.” Doing so
requires a large number of training instances to establish what people did, when given
the chance. For example, suppose we sought to build a sentiment analysis system
capable of reading a written review and deciding whether its opinion of a product is
favorable or unfavorable. We will need a large number of reviews labeled by annotators
to serve as testing/training data. Further, we need the same reviews labeled repeatedly
by different annotators, so as to identify any inter-annotator disagreements concerning
the exact meaning of a text.
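
A tiny sketch of how such repeated labels might be aggregated by majority vote (Python; the three annotators and their labels are hypothetical):

from collections import Counter

# Labels given to three reviews, each judged by three different annotators
labels = [
    ["positive", "positive", "negative"],
    ["negative", "negative", "negative"],
    ["positive", "negative", "positive"],
]

# Majority vote turns the repeated annotations into one training label each,
# while the vote margin flags reviews with inter-annotator disagreement
consensus = [Counter(votes).most_common(1)[0][0] for votes in labels]
print(consensus)   # ['positive', 'negative', 'positive']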

• Obtaining evaluation data for computer systems: A/B testing is a standard method
for optimizing user interfaces: show half of the judges version A of a given system and
the other half version B. Then test which group did better according to some metric.
Turkers can provide feedback on how interesting a given app is, or how well a new
classifier is performing. One of my grad students (Yanqing Chen) used CrowdFlower to
evaluate a system he built to identify the most relevant Wikipedia category for a
particular entity. Which category better describes Barack Obama: Presidents of the
United States or African-American Authors? For $200, he got people to answer a total of
10,000 such multiple-choice questions, enough for him to properly evaluate his system.

• Putting humans into the machine: There still exist many cognitive tasks that people
do much better than machines. A cleverly-designed interface can supply user queries to
people sitting inside the computer, waiting to serve those in need. Suppose you wanted
to build an app to help the visually impaired, enabling the user to snap a picture and ask
someone for help. Maybe they are in their kitchen, and need someone to read the label
on a can to them. This app could call a Turker as a subroutine, to do such a task as it is
needed. Of course, these image-annotation pairs should be retained for future analysis.
They could serve as training data for a machine learning program to take the people out
of the loop, as much as possible.

• Independent creative efforts: Crowdsourcing can be used to commission large numbers of creative works on demand. You can order blog posts or articles on demand, or written product reviews both good and bad. Anything that you might imagine can be created, if you just specify what you want. Here are two silly examples that I somehow find inspiring:
– The Sheep Market (http://www.thesheepmarket.com) commissioned 10,000 drawings of sheep for pennies each. As a conceptual art piece, it tries to sell them to the highest bidder. What creative endeavors can you think of that people will do for you at $0.25 a pop?
– Emoji Dick (http://www.emojidick.com) was a crowdsourced effort to translate the great American novel Moby Dick completely into emoji images. Its creators partitioned the book into roughly 10,000 parts, and farmed out each part to be translated by three separate Turkers. Other Turkers were hired to select the best one of these to be incorporated into the final book. Over 800 Turkers were involved, with the total cost of $3,676 raised by the crowd-funding site Kickstarter.

• Economic/psychological experiments: Crowdsourcing has proven a boon to social scientists conducting experiments in behavioral economics and psychology. Instead of bribing local undergraduates to participate in their studies, these investigators can now expand their subject pool to the entire world. They get the power to harness larger populations, perform independent replications in different countries, and thus test whether there are cultural biases affecting their hypotheses.

QUESTION BANK

UNIT-II

2 MARKS QUESTION

1. Define Probability.
2. Define experiment with example.
3. Define sample space with example.
4. Define event with example.
5. What is Probability of an outcome explain with example.
6. What is Probability of an event explain with example.
7. What is Random variable explain with example.
8. What is Expected value explain with example.
9. What is Variability Measures ?
10. What is Sampling error.
11. What is Measurement error.

12. What is Characterizing Distributions.


13. Define Correlation and Correlation Coefficient.
14. Define Data Munging with example.
15. What is Hunting ?
16. What is Scraping?
17. What is Logging?
18. What is Outlier ?
19. What is crowdsourcing ?

5 MARKS QUESTION
1. State Probability Vs Statistics with example.
2. State Compound Events and Independence with example.
3. What is Conditional Probability? Explain with example.
4. Explain Interpreting Variance with example.
5. Write a short note on Correlation does not imply Causation.
6. Explain Detecting Periodicities by Autocorrelation.
7. Explain in brief the languages for Data Science.
8. Write the importance of Notebook Environments.
9. How to clean the data? Explain with examples.
10. State Errors Vs Artifacts with example.
11. Explain Outlier Detection with example in detail.
12. Explain the Penny Demo in Crowdsourcing.
13. When is the crowd wise?
14. What are the Mechanisms for Aggregation?

10 MARKS QUESTION
1. Write a short note on union, intersection, set difference and Independent
events with example.
2. What is Probability distributions? Explain types also with example.
3. What is Descriptive Statistics explain with example. OR Define Centrality
measure – Mean, Geometric Mean, Median, Mode.
4. What is Correlation Analysis ? Explain its types with examples.
5. Explain in detail Standard Data Formats.
6. How to collect the data from different resources explain in detail with
example.
7. What is Data Compatibility ? Explain with its different types.
8. How to deal with missing values explain in detail.


9. Explain crowdsourcing services in detail.

