FDSUnit2 Data Science
PROBABILITY
Probability Vs Statistics
Compound Events and Independence
Conditional Probability
Probability Distributions
DESCRIPTIVE STATISTICS
Centrality Measure
Variability Measures
Interpreting Variance
Characterizing Distributions
CORRELATION ANALYSIS
Correlation Coefficients: Pearson and Spearman Rank
The Power and Significance of Correlation
Correlation does not imply Causation
Detecting Periodicities by Autocorrelation
DATA MUNGING:
LANGUAGES FOR DATA SCIENCE
The importance of Notebook Environments
Standard Data Formats
COLLECTING DATA
Hunting
Scraping
Logging
CLEANING DATA
Errors Vs Artifacts
Data Compatibility
Dealing with missing values
Outlier Detection
CROWDSOURCING
The Penny Demo
When is the crowd wise ?
Mechanisms for Aggregation
Crowdsourcing services
A.S.Patil College of Commerce(Autonomous), Vijayapur BCA Programme
UNIT- 2 MATHEMATICAL PRELIMINARIES
You must walk before you can run. Similarly, there is a certain level of mathematical
maturity which is necessary before you should be trusted to do anything meaningful
with numerical data.
PROBABILITY
Probability is a numerical representation of the chance of occurrence of a particular
event, where an event is the term used to describe any particular set of outcomes.
Experiment
An experiment is a procedure which yields one of a set of possible outcomes.
Example: A coin is tossed 10 times; heads is recorded 7 times and tails is recorded 3 times.
Sample space
A sample space S is the set of possible outcomes of an experiment.
Example: When two dice are rolled together, there are 36 possible outcomes, namely
S = {(1, 1),(1, 2),(1, 3),(1, 4),(1, 5),(1, 6),(2, 1),(2, 2),(2, 3),(2, 4),(2, 5),(2, 6), (3, 1),
(3, 2),(3, 3),(3, 4),(3, 5),(3, 6),(4, 1),(4, 2),(4, 3),(4, 4),(4, 5),(4, 6), (5, 1),(5, 2),
(5, 3), (5, 4),(5, 5),(5, 6),(6, 1),(6, 2),(6, 3),(6, 4),(6, 5),(6, 6)}.
Event
An event E is a specified subset of the outcomes of an experiment.
Example: The event that the sum of the dice equals 7 or 11 is the subset
E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1), (5, 6), (6, 5)}.
Probability of an outcome
The probability of an outcome s, denoted p(s), measures how likely that outcome is to
occur. When all outcomes are equally likely, the probability of each outcome is 1 divided
by the total number of possible outcomes.
Example: If we assume two distinct fair dice, the probability p(s) = (1/6) × (1/6) =
1/36 for all outcomes s ∈ S.
Probability of an event
The probability of an event E is the sum of the probabilities of the outcomes making up
the event. An alternate formulation is in terms of the complement of the event, E̅, the
case when E does not occur. Then P(E) = 1 − P(E̅).
Example: If you pull a random card from a deck of playing cards, what is the probability
it is not a heart? Since P(heart) = 13/52 = 1/4, the complement rule gives
P(not a heart) = 1 − 1/4 = 3/4.
Random variable
A random variable V is a numerical function on the outcomes of a probability
space.
Example: Suppose 2 dice are rolled and the random variable, X, is used to
represent the sum of the numbers. Then, the smallest value of X will be
equal to 2 (1 + 1), while the highest value would be 12 (6 + 6). Thus, X could
take on any value from 2 to 12 (inclusive). Now if probabilities are
attached to each outcome then the probability distribution of X can be
determined.
Expected value
The expected value of a random variable V defined on a sample space S is the
probability-weighted average of its values: E(V) = Σ p(s) · V(s), summed over all
outcomes s in S.
Example: Toss a fair coin three times, and let R be the random variable defined as
R = number of heads.
Sample Space = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
The value of R for each outcome is: HHH → 3, HHT → 2, HTH → 2, HTT → 1,
THH → 2, THT → 1, TTH → 1, TTT → 0. Since each of the eight outcomes has
probability 1/8, E(R) = (3 + 2 + 2 + 1 + 2 + 1 + 1 + 0)/8 = 1.5.
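As an illustration (not from the original text), a minimal Python sketch of this computation: it enumerates the eight equally likely outcomes of three coin flips and averages the value of R over them.

from itertools import product

outcomes = list(product("HT", repeat=3))              # the eight equally likely outcomes
R_values = [flips.count("H") for flips in outcomes]   # R = number of heads for each outcome
expected_R = sum(R_values) / len(R_values)            # every outcome has probability 1/8
print(expected_R)                                     # prints 1.5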
Probability Vs Statistics
Probability and statistics are related areas of mathematics which concern
themselves with analyzing the relative frequency of events. Still, there are
fundamental differences in the way they see the world: probability deals with
predicting the likelihood of future events from an idealized model of how the world
works, while statistics involves analyzing the frequency of past events in observed
data, to measure how well such a model fits.
Compound Events and Independence
A compound event is an event that consists of more than one possible outcome. For
example, when we throw a die, the appearance of an even number is a compound event,
since there are three possibilities: E = {2, 4, 6}.
Set difference
If there are two sets A and B, then the difference of two sets A and B is equal to the set
which consists of elements present in A but not in B. It is represented by A-B.
Example: If A = {1,2,3,4,5,6,7} and B = {6,7} are two sets.
Then, the difference of set A and set B is given by;
A – B = {1,2,3,4,5}
Union
If two sets A and B are given, then the union of A and B is equal to the set that
contains all the elements present in set A and set B.
Example: If set A = {1,2,3,4} and B = {6,7}
Then, Union of sets, A ∪ B = {1,2,3,4,6,7}
Intersection
If two sets A and B are given, then the intersection of A and B is the subset of the
universal set which consists of the elements common to both A and B. It is denoted by
the symbol '∩': A ∩ B = {x : x ∈ A and x ∈ B}.
Example: Let A = {1,2,3} and B = {3,4,5}
Then, A ∩ B = {3}, because 3 is common to both sets.
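These three set operations map directly onto Python's built-in set type. A minimal sketch reproducing the examples above:

A = {1, 2, 3, 4, 5, 6, 7}
B = {6, 7}
print(A - B)                   # set difference: {1, 2, 3, 4, 5}
print({1, 2, 3, 4} | B)        # union: {1, 2, 3, 4, 6, 7}
print({1, 2, 3} & {3, 4, 5})   # intersection: {3}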
Independent events are those events whose occurrence does not depend on any other
event; formally, events A and B are independent when P(A ∩ B) = P(A) × P(B).
For example, if we flip a coin and get heads, and then flip the coin again and get tails,
the outcome of the second flip is not influenced by the first. In both cases, the
occurrence of each event is independent of the other.
Conditional Probability
Conditional probability measures the probability of an event occurring given that
another event has already occurred.
The conditional probability of A given B, P(A|B) is defined:
P(A|B) = P (A ∩ B) / P(B)
Where,
P (A ∩ B) represents the probability of both events A and B occurring
simultaneously.
P(B) represents the probability of event B occurring.
Example: A die is thrown two times and the sum of the scores appearing on the
die is observed to be a multiple of 4. Then the conditional probability that the
score 4 has appeared at least once is required. Let A be the event that the sum obtained
is a multiple of 4, and B be the event that the score of 4 has appeared at least once.
A = {(1, 3), (2, 2), (3, 1), (2, 6), (3, 5), (4, 4), (5, 3), (6, 2), (6, 6)}
B = {(1, 4), (2, 4), (3, 4), (4, 4), (5, 4), (6, 4), (4, 1), (4, 2), (4, 3), (4, 5), (4, 6)}
(A ∩ B) = (4, 4)
n(A ∩ B) = 1
Required probability = P(B|A)
= P(A ∩ B)/P(A)
= (1/36)/(9/36) = 1/9
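The same answer can be checked by brute force. A minimal Python sketch that enumerates the 36 equally likely outcomes and counts the relevant events:

from itertools import product

S = list(product(range(1, 7), repeat=2))     # all 36 outcomes of two dice
A = [s for s in S if sum(s) % 4 == 0]        # sum is a multiple of 4
B = [s for s in S if 4 in s]                 # at least one die shows a 4
A_and_B = [s for s in A if s in B]

print(len(A), len(B), len(A_and_B))          # 9 11 1
print(len(A_and_B) / len(A))                 # P(B|A) = 1/9 ≈ 0.111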
Probability Distributions
A probability distribution describes how probability is assigned across the set of all
possible outcomes of a random experiment, or across the values of a random variable.
Understanding the behavior, features, and distribution of a random variable depends
critically on the PDF (probability density function), which gives the probability of each
possible value, and the CDF (cumulative distribution function), which gives the
probability of observing a value no larger than x.
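As a concrete illustration (not from the original text), the sketch below tabulates the distribution of X = sum of two fair dice from the earlier example: the dictionary pmf plays the role of the probability mass function and cdf accumulates it.

from collections import Counter
from itertools import product

# Probability mass function of X = sum of two fair dice
counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in sorted(counts.items())}

# Cumulative distribution function: P(X <= x)
cdf, running = {}, 0.0
for x, p in pmf.items():
    running += p
    cdf[x] = running

print(pmf[7])   # 6/36 ≈ 0.167, the most likely sum
print(cdf[7])   # P(X <= 7) = 21/36 ≈ 0.583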
DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
Centrality Measure
The measures of central tendency are used to describe data by determining a single
representative central value. The important measures of central tendency are given
below:
Mean: The mean or arithmetic mean can be defined as the sum of all observations
divided by the total number of observations:
Mean = (x1 + x2 + … + xn) / n
Geometric Mean: The geometric mean is defined as the nth root of the product of n
numbers. We multiply the numbers together and take the nth root of the product, where
n is the total number of data values:
Geometric Mean = (x1 × x2 × … × xn)^(1/n)
For example, for a given set of two numbers such as 3 and 1, the geometric mean is
√(3 × 1) = √3 ≈ 1.73.
Median: The median can be defined as the center-most observation, obtained by
arranging the data in ascending order. For an odd number of observations n, the median
is the ((n + 1)/2)-th value; for an even number, it is the average of the (n/2)-th and
((n/2) + 1)-th values.
Example: In this case the median is the 11th number:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
Median = 61
Mode: The mode is the most frequently occurring observation in the data set. It is found
by counting how often each value appears and selecting the value with the highest
frequency.
Example:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
62 appears three times, more often than the other values, so Mode = 62
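All four centrality measures can be computed with Python's standard statistics module. A minimal sketch using the 21 values from the median and mode examples above (geometric_mean requires Python 3.8 or later):

import statistics

data = [53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61,
        62, 62, 62, 64, 65, 65, 67, 68, 68, 70]

print(statistics.mean(data))             # arithmetic mean
print(statistics.geometric_mean(data))   # geometric mean
print(statistics.median(data))           # 61
print(statistics.mode(data))             # 62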
Variability Measures
The most common measure of variability is the standard deviation σ, which is based on
the sum of squared differences between the individual elements and the mean x̄:
σ = √( Σ (xi − x̄)² / (n − 1) )
Variability in statistics refers to the differences exhibited by data points within a data
set, relative to each other or to the mean. It can be expressed through the range,
variance, or standard deviation of a data set. The field of finance uses these concepts as
they are specifically applied to price data and the returns that changes in price imply.
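A short sketch computing the standard deviation of the same 21 values, showing both the population form (divide by n) and the sample form (divide by n − 1) used in the formula above:

import statistics

data = [53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61,
        62, 62, 62, 64, 65, 65, 67, 68, 68, 70]

print(statistics.pstdev(data))   # population standard deviation (divide by n)
print(statistics.stdev(data))    # sample standard deviation (divide by n - 1)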
Interpreting Variance
Repeated observations of the same phenomenon do not always produce the same
results, due to random noise or error.
Measurement errors reflect the limits of precision inherent in any sensing device.
The notion of signal to noise ratio captures the degree to which a series of
observations reflects a quantity of interest as opposed to data variance.
As data scientists, we care about changes in the signal instead of the noise, yet such
variance is an inherent property of the universe that we can never fully remove.
Example: Each morning you weigh yourself on a scale you are guaranteed to get a
different number, with changes reflecting when you last ate (sampling error), the
flatness of the floor, or the age of the scale (both measurement error) as much as
changes in your body mass (actual variation).
So what is your real weight? Every measured quantity is subject to some level of
variance, yet data scientists must still seek to explain the world through data.
Characterizing Distributions
Distributions do not necessarily have much probability mass exactly at the mean.
Consider what your wealth would look like after you borrow $100 million, and then
bet it all on an even-money coin flip. Heads, you are now $100 million in the clear; tails,
you are $100 million in hock. Your expected wealth is zero, but this mean does not
tell you much about the shape of your wealth distribution.
However, taken together the mean and standard deviation do a decent job of
characterizing any distribution.
CORRELATION ANALYSIS
Correlation
Suppose we are given two variables x and y, represented by a sample of n points of
the form (xi , yi), for 1 ≤ i ≤ n. We say that x and y are correlated when the value of x
has some predictive power on the value of y.
The correlation coefficient r(X, Y) is a statistic that measures the degree to which Y
is a function of X, and vice versa. The value of the correlation coefficient ranges from
−1 to 1, where 1 means fully correlated and 0 means uncorrelated.
Negative correlations imply that the variables are anti-correlated, meaning that when
X goes up, Y goes down. Perfectly anti-correlated variables have a correlation of −1.
Note that negative correlations are just as good for predictive purposes as positive
ones.
That you are less likely to be unemployed the more education you have is an example
of a negative correlation, so the level of education can indeed help predict job status.
Correlations around 0 are useless for forecasting. Observed correlations drive many
of the predictive models we build in data science.
• Does financial status affect health? The observed correlation between household
income and the prevalence of coronary artery disease is r = −0.717, so there is a
strong negative correlation. So yes, the wealthier you are, the lower your risk of
having a heart attack.
Correlation Coefficients: Pearson and Spearman Rank
The Pearson correlation coefficient is defined as the covariance of X and Y divided by
the product of their standard deviations:
r(X, Y) = Cov(X, Y) / (σ(X) · σ(Y))
Suppose X and Y are strongly correlated. Then we would expect that when xi is greater
than its mean X̄, yi should be bigger than its mean Ȳ; when xi is lower than its mean, yi
should follow. Now look at the numerator: the sign of each term is positive when both
values are above (1 × 1) or below (−1 × −1) their respective means, and negative
((−1 × 1) or (1 × −1)) if they move in opposite directions.
The numerator's operation determining the sign of the correlation is so useful that we
give it a name, covariance, computed as
Cov(X, Y) = Σ (xi − X̄)(yi − Ȳ)
The denominator of the Pearson formula reflects the amount of variance in the two
variables, as measured by their standard deviations. The covariance between X and Y
potentially increases with the variance of these variables, and this denominator is the
magic amount to divide it by to bring correlation to a −1 to 1 scale.
The Spearman rank correlation coefficient essentially counts the number of pairs of
input points which are out of order. Suppose that our data set contains points (x1, y1)
and (x2, y2) where x1 < x2 and y1 < y2. This is a vote that the values are positively
correlated, whereas the vote would be for a negative correlation if y2 < y1.
Summing up over all pairs of points and normalizing properly gives us the Spearman
rank correlation. Let rank(xi) be the rank position of xi in sorted order among all xi, so
the rank of the smallest value is 1 and that of the largest value is n. Then
ρ = 1 − (6 Σ di²) / (n(n² − 1)), where di = rank(xi) − rank(yi).
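Both coefficients are available in SciPy. A minimal sketch on a small made-up data set (the numbers are only for illustration):

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 8, 6, 9]

r, _ = pearsonr(x, y)        # Pearson: strength of the linear relationship
rho, _ = spearmanr(x, y)     # Spearman: agreement between the rank orderings

print(round(r, 3), round(rho, 3))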
The Power and Significance of Correlation
The hypothesis test lets us decide whether the value of the population correlation
coefficient ρ is "close to zero" or "significantly different from zero". We decide this
based on the sample correlation coefficient r and the sample size n: the larger the
sample, the smaller the correlation that can be reliably distinguished from zero.
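A sketch of how sample size drives significance, using the p-value returned by scipy.stats.pearsonr on synthetic data (simulated here purely for illustration) with the same weak underlying relationship at each size:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

for n in (10, 100, 1000):
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)    # a weak underlying relationship
    r, p = pearsonr(x, y)
    print(f"n={n:5d}  r={r:+.3f}  p-value={p:.3g}")

In general, a correlation of the same modest strength is far more likely to be judged significantly different from zero as n grows.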
Correlation does not imply Causation
An observed correlation between two variables, such as height and weight, does not by
itself tell us which one causes the other; establishing causation requires controlled
experiments. For example, the fact that we can put people on a diet that makes them
lose weight without getting shorter is convincing evidence that weight does not cause
height. But it is often harder to do these experiments the other way, e.g. there is no
reasonable way to make people shorter other than by hacking off limbs.
Detecting Periodicities by Autocorrelation
The autocorrelation function correlates a time series against shifted copies of itself,
one value for each possible lag; peaks in this function reveal periodic patterns in the
data. Consider a time series of daily sales and its associated autocorrelation function:
a peak at a shift of seven days (and at every multiple of seven days) establishes that
there is a weekly periodicity in sales, with more stuff sold on weekends.
Fortunately, there is an efficient algorithm based on the fast Fourier transform (FFT),
which makes it possible to construct the autocorrelation function even for very long
sequences.
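A minimal NumPy sketch of this idea (the daily-sales series here is synthetic, invented only to show the weekly peak): it uses the FFT to compute the autocorrelation of a zero-padded, mean-centered series.

import numpy as np

def autocorrelation(x):
    """Autocorrelation at every lag, computed via the FFT (Wiener-Khinchin relation)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, n=2 * n)              # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(f * np.conj(f))[:n]
    return acf / acf[0]                      # normalize so lag 0 equals 1

days = np.arange(365)
sales = 100 + 20 * (days % 7 >= 5) + np.random.default_rng(1).normal(0, 5, size=365)
acf = autocorrelation(sales)
print(round(acf[7], 2), round(acf[3], 2))    # the lag-7 value is much larger than lag-3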
DATA MUNGING:
Data munging is the process of cleaning and transforming raw data into a structured
format that is suitable for analysis. This step is essential because real-world data is
often messy, incomplete, or inconsistent. Data munging aims to address these issues,
ensuring that the data is accurate, consistent, and ready for further exploration.
One common issue is missing values. Data munging involves deciding how to handle
these missing values, whether by imputing them with averages, removing the
corresponding rows, or using other strategies, as in the pandas sketch below.
import pandas as pd

df = pd.read_csv('customer_data.csv')        # hypothetical customer data set
# 'Email' is text, so a mean makes no sense; fill missing entries with a placeholder
df['Email'] = df['Email'].fillna('unknown')
Duplicate records can skew analysis results. Data munging includes identifying and
handling duplicate entries. In the following example, we use pandas to identify and
remove duplicates from a dataset.
# Drop exact duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)
When dealing with text data, data munging may involve cleaning and preprocessing text
for analysis. Let's say we have a dataset with a "Description" column containing text
data. We can use regular expressions to remove special characters and convert text to
lowercase.
import re
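The snippet above was truncated; a minimal completion, assuming the DataFrame df from earlier and a "Description" column (both hypothetical), might look like this:

import re

def clean_text(text):
    # Keep only letters, digits, and spaces, then lowercase the result
    text = re.sub(r'[^A-Za-z0-9 ]', '', str(text))
    return text.lower()

df['Description'] = df['Description'].apply(clean_text)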
Conclusion:
Data munging is a critical skill for anyone working with data. Whether you're a data
scientist, analyst, or business professional, mastering the art of data munging can
significantly impact the quality and reliability of your analyses. The examples provided
demonstrate some common scenarios encountered during data munging, but the field
is vast and ever-evolving. Continuous learning and practice will empower you to handle
whatever messy data comes your way.
LANGUAGES FOR DATA SCIENCE
Perl: This used to be the go-to language for data munging on the web, before
Python ate it for lunch. In the TIOBE programming language popularity index
(http://www.tiobe.com/tiobe-index), Python first exceeded Perl in popularity in 2008
and hasn’t looked back. There are several reasons for this, including stronger support
for object-oriented programming and better available libraries, but the bottom line is
that there are few good reasons to start projects in Perl at this point.
Matlab: The Mat here stands for matrix, as Matlab is a language designed for the
fast and efficient manipulation of matrices. As we will see, many machine learning
algorithms reduce to operations on matrices, making Matlab a natural choice for
engineers programming at a high-level of abstraction. Matlab is a proprietary system.
However, much of its functionality is available in GNU Octave, an open-source
alternative.
Java and C/C++: These mainstream programming languages for the development
of large systems are important in big data applications. Parallel processing systems
like Hadoop and Spark are based on Java and Scala, respectively. If you are living in the
world of distributed computing, then you are living in a world of Java and C++ instead
of the other languages listed here.
Excel: Spreadsheet programs like Excel are powerful tools for exploratory data
analysis, such as playing with a given data set to see what it contains. They deserve
our respect for such applications. Full featured spreadsheet programs contain a
surprising amount of hidden functionality for power users. A student of mine who rose
to become a Microsoft executive told me that 25% of all new feature requests for Excel
proposed functionality already present there. The special functions and data
manipulation features you want probably are in Excel if you look hard enough, in the
same way that a Python library for what you need probably will be found if you search
for it.
The importance of Notebook Environments
Notebook environments encourage reproducible computation: anyone should be able to
rerun the programs again from scratch, and get exactly the same result. This means that
data pipelines must be complete: taking raw input and producing the final output.
Computations must be tweakable: Often reconsideration or evaluation will
prompt a change to one or more parameters or algorithms. This requires rerunning
the notebook to produce the new computation. A notebook is never finished until
after the entire project is done.
Data pipelines need to be documented: That notebooks permit you to integrate
text and visualizations with your code provides a powerful way to communicate
what you are doing and why, in ways that traditional programming environments
cannot match.
Standard Data Formats
The best data formats share several desirable properties:
They are easy for people to read: Which of the data files in this directory is the
right one for me to use? What do we know about the data fields in this file? What is the
gross range of values for each particular field? These use cases speak to the enormous
value of being able to open a data file in a text editor to look at it. Typically, this means
presenting the data in a human-readable text-encoded format, with records demarcated
by separate lines, and fields separated by delimiting symbols.
They are widely used by other tools and systems: The urge to invent
proprietary data standards beats firmly in the corporate heart, and most software
developers would rather share a toothbrush than a file format. But these are impulses
to be avoided. The power of data comes from mixing and matching it with other data
resources, which is best facilitated by using popular standard formats.
A data format's structure defines how data is organized in a database or file system,
giving it meaning. The most important data formats/representations to be aware of are
discussed below:
• CSV (comma separated value) files: Comma-separated values (CSV) is a common
format for storing and exchanging tabular data. It's similar to Excel and is often used
to process data with pandas. CSV is good for storing and processing text, numbers,
and dates. However, if your data contains strings or sentences with commas, you
should wrap the strings in quotation marks or use a different delimiter.
• JSON (JavaScript Object Notation): This is a format for transmitting data objects
between programs. It is a natural way to communicate the state of variables/data
structures from one system to another. This representation is basically a list of
attribute-value pairs corresponding to variable/field names, and the associated
values:
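For example, a minimal sketch with a hypothetical record (the field names are invented), parsed and re-saved using Python's json module:

import json

record = '{"name": "Asha", "age": 21, "city": "Vijayapur", "scores": [78, 85, 91]}'

data = json.loads(record)           # parse the JSON text into a Python dict
print(data["name"], data["scores"])

with open("record.json", "w") as f:
    json.dump(data, f, indent=2)    # write the structure back out for later use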
Because library functions that support reading and writing JSON objects are readily
available in all modern programming languages, it has become a very convenient way
to store data structures for later use. JSON objects are human readable, but look quite
cluttered when representing arrays of records, compared to CSV files. Use them for
complex structured objects, but not for simple tables of data.
COLLECTING DATA
The most critical issue in any data science or modeling project is finding the right data
set.
Identifying viable data sources is an art, one that revolves around three basic questions:
• Who might actually have the data I need?
• Why might they decide to make it available to me?
• How can I get my hands on it?
In this section, we will explore the answers to these questions. We look at common
sources of data, and what you are likely to be able to find and why.
Hunting
Who has the data, and how can you get it? Some of the likely suspects are reviewed
below.
Many responsible companies like The New York Times, Twitter, Facebook, and
Google do release certain data, typically by rate-limited application program
interfaces (APIs).
They generally have two motives:
• Providing customers and third parties with data that can increase sales. For
example, releasing data about query frequency and ad pricing can encourage more
people to place ads on a given platform.
• It is generally better for the company to provide well-behaved APIs than having
cowboys repeatedly hammer and scrape their site.
You won’t find exactly the content or volume that you dream of, but probably
something that will suffice to get started. Be aware of limits and terms of use.
Finally, most organizations have internal data sets of relevance to their business. As
an employee, you should be able to get privileged access while you work there. Be
aware that companies have internal data access policies, so you will still be subject to
certain restrictions. Violating the terms of these policies is an excellent way to
become an ex-employee.
City, state, and federal governments have become increasingly committed to open
data, to facilitate novel applications and improve how government can fulfill its
mission. The website http://Data.gov is an initiative by the federal government to
centrally collect its data sources, and at last count points to over 100,000 data sets!
Government data differs from industrial data in that, in principle, it belongs to the
People. The Freedom of Information Act (FOI) enables any citizen to make a formal
request for any government document or data set. Such a request triggers a process
to determine what can be released without compromising the national interest or
violating privacy.
State governments operate under fifty different sets of laws, so data that is tightly
held in one jurisdiction may be freely available in others. Major cities like New York
have larger data processing operations than many states, again with restrictions that
vary by location.
The key to finding these data sets is to track down the relevant papers. There is an
academic literature on just about any topic of interest. Google Scholar is the most
accessible source of research publications. Search by topic, and perhaps “Open
Science" or "data." Research publications will typically provide pointers to where their
associated data can be found. If not, contacting the author directly with a request
should quickly yield the desired result.
The biggest catch with using published data sets is that someone else has worked
hard to analyze them before you got to them, so these previously mined sources may
have been sucked dry of interesting new results. But bringing fresh questions to old
data generally opens new possibilities.
Sweat Equity
Sometimes you will have to work for your data, instead of just taking it from others.
Much historical data still exists only in books or other paper documents, thus
requiring manual entry and curation. A graph or table might contain information that
we need, but it can be hard to get numbers from a graphic locked in a PDF (portable
document format) file.
Crowdsourcing platforms like Amazon Mechanical Turk and CrowdFlower enable you to pay
armies of people to help you extract data, or even collect it in the first place. Tasks
requiring human annotation like labeling images or answering surveys are
particularly good use of remote workers.
Many amazing open data resources have been built up by teams of contributors, like
Wikipedia, Freebase, and IMDb. But there is an important concept to remember:
people generally work better when you pay them.
Scraping
Data scraping, also known as web scraping, is a technique that involves using a
computer program to extract data from a website, database, or other source. The data
can be text, images, or videos, and it can be copied into a spreadsheet or local file for
later use.
Suppose you want some information about Mahatma Gandhi from Wikipedia or any
other website, you can extract this data by copying and pasting the information into
your file. But if you want this information for hundreds of different personalities,
manually getting this data is impossible, and you need an automated and efficient
method to scrape all of this information quickly. And here, Web Scraping comes into the
picture.
Web Scraping can be defined as an automated process to extract content or data from
the Internet. It provides various intelligent and automated methods to quickly extract
large volumes of data from websites. Most of this data will be in unstructured or HTML
format, which can be further parsed and converted into a structured format for further
analysis. In theory, you can scrape any data on the Internet. The most common data
types scraped include text, images, videos, pricing, reviews, product information, etc.
There are many ways you can perform Web Scraping to collect data from websites. You
can use online Web Scraping services or APIs, or create your own custom-built code to
scrape the information. Many popular websites such as Google, Twitter, Facebook, etc.
provide APIs that allow the collection of the required data directly in a structured
format.
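A minimal scraping sketch using the requests and BeautifulSoup libraries, fetching the Wikipedia page mentioned above and pulling out its paragraph text (always check a site's terms of use and rate limits before scraping it):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Mahatma_Gandhi"
response = requests.get(url, timeout=10)

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Print the start of the first non-empty paragraph of the article body
print(next(p for p in paragraphs if p)[:200])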
Web Scraping has countless applications across industries. A few of the most common
use cases of Web Scraping include -
Price Monitoring
Organizations scrape the pricing and other related information for their and
competitors' products to analyze and fix optimal pricing for the products to maximize
revenue.
Market Research
Organizations use Web Scraping to extract product data, reviews, and other relevant
information to perform sentiment analysis, consumer trends, and competitor analysis.
News Monitoring
Organizations dependent on daily news for their day-to-day functioning can use Web
Scraping to generate reports based on the daily news.
Sentiment Analysis
Companies can collect product-related data from Social Media such as Facebook,
Twitter, etc., and other online forums to analyze the general sentiment for their
products among consumers.
Contact Scraping
Organizations scrape websites to collect contact information such as email IDs, and
mobile numbers to send bulk promotional and marketing emails and SMS.
Other than the above use cases, Web Scraping can also be used for numerous other
scenarios such as Weather Forecasting, Sports Analytics, Real Estate Listings, etc.
Logging
Logging is a vital part of programming, providing a record of events and important
information that can be used to monitor and optimise system performance.
Logging can play a vital part in machine learning by monitoring and optimizing the
performance of a system. Machine learning (ML) log files are an essential component of
the ML pipeline. They serve as a record of the training and evaluation processes,
providing valuable insights and debugging information for ML practitioners.
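A minimal sketch of Python's built-in logging module recording progress from a (hypothetical) training loop, so the run can be inspected later from a log file:

import logging

logging.basicConfig(
    filename="training.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

for epoch in range(1, 4):
    loss = 1.0 / epoch                              # placeholder for a real training metric
    logging.info("epoch=%d loss=%.4f", epoch, loss)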
CLEANING DATA
Data cleansing or data scrubbing is the act of detecting and correcting (or removing)
corrupt or inaccurate records from a record set, table, or database. Used mainly in
databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant
parts of the data and then replacing, modifying, or deleting this dirty data.
Real-world data is frequently dirty, due to faulty instruments, human or computer error,
or transmission problems. Common forms of dirty data include:
o Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data, e.g., Occupation = " " (missing data).
o Noisy: containing noise, errors, or outliers, e.g., Salary = "−10" (an error).
o Inconsistent: containing discrepancies in codes or names, e.g., Age = "42" but
Birthday = "03/07/2010"; a rating that was "1, 2, 3" and is now "A, B, C";
discrepancies between duplicate records.
o Intentional: e.g., disguised missing data.
Errors Vs Artifacts
Errors: information that is lost during acquisition and can never be recovered, e.g.,
due to a power outage or crashed servers.
Example: In a similar scenario, consider yourself working in a water bottling company,
and during production you accidentally drop a sack of sugar into the water container.
This is an error! You cannot recover the original water or the sugar once it is all
mixed up.
Artifacts: systematic problems that arise from the data cleaning process. These
problems can be corrected, but we must first discover them.
Example: Now consider that the data you possess is a tin of oil. If there is a rock
present in the tin of oil, you can easily remove the rock; this is therefore a classic
example of an artifact.
Data Compatibility
We say that a comparison of two items is "apples to apples" when it is a fair comparison,
that is, when the items involved are similar enough that they can be meaningfully stood
up against each other. In contrast, "apples to oranges" comparisons are ultimately
meaningless.
For example: It makes no sense to compare weights of 123.5 against 78.9, when one is
in pounds and the other is in kilograms.
These types of data comparability issues arise whenever data sets are merged.
Unit Conversions
A unit of measurement denotes how a value is measured. For instance, something
might be 4 pounds, 4 seconds, 4 inches, etc. All of these measurements contain the
same value, but the units make the measurements fundamentally different.
Disastrous things like rocket explosions happen when measurements are entered into
computer systems using the wrong units of measurement. In particular, NASA lost the
$125 million Mars Climate Orbiter space mission on September 23, 1999 due to a
metric-to-English conversion issue.
Such problems are best addressed by selecting a single system of measurements and
sticking to it. In particular, individual measurements are naturally expressed as single
decimal quantities (like 3.28 meters) instead of incomparable pairs of quantities (5
feet, 8 inches). This same issue arises in measuring angles (radians vs.
degrees/minutes/seconds) and weight (kilograms vs. pounds/oz).
When merging records from diverse sources, it is an excellent practice to create a new
“origin” or “source” field to identify where each record came from. This provides at
least the hope that unit conversion mistakes can be corrected later, by systematically
operating on the records from the problematic source.
The distinction between integers and floating point (real) numbers is important to
maintain. Integers are counting numbers: quantities which are really discrete should
be represented as integers. Physically measured quantities are never precisely
quantified, because we live in a continuous world. Thus all measurements should be
reported as real numbers. Integer approximations of real numbers are sometimes
used in a misbegotten attempt to save space. Don't do this: the quantization effects
of rounding or truncation introduce artifacts.
In one particularly clumsy data set we encountered, baby weights were represented as
two integer fields (pounds and the remaining ounces). Much better would have been to
combine them into a single decimal quantity.
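A tiny sketch of the fix suggested above: combine a (pounds, ounces) pair into one decimal number, and convert it into kilograms if a metric system is preferred (the 7 lb 11 oz value is invented for illustration):

def pounds_ounces_to_pounds(pounds, ounces):
    """Combine a (pounds, ounces) pair into a single decimal quantity."""
    return pounds + ounces / 16.0

def pounds_to_kilograms(pounds):
    return pounds * 0.45359237               # definition of the pound in kilograms

weight_lb = pounds_ounces_to_pounds(7, 11)   # a hypothetical 7 lb 11 oz baby
print(round(weight_lb, 4), round(pounds_to_kilograms(weight_lb), 3))   # 7.6875 lb ≈ 3.487 kg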
Name Unification
Integrating records from two distinct data sets requires them to share a common key
field. Names are frequently used as key fields, but they are often reported
inconsistently. Is José the same fellow as Jose? Such diacritic marks are banned from
the official birth records of several U.S. states, in an aggressive attempt to force them
to be consistent.
Time/Date Unification
Date/time stamps are used to infer the relative order of events, and group events by
relative simultaneity. Integrating event data from multiple sources requires careful
cleaning to ensure meaningful results.
First let us consider issues in measuring time. The clocks from two computers never
exactly agree. There are also time zone issues when dealing with data from different
regions, as well as diversities in local rules governing changes in daylight saving time.
The right answer here is to align all time measurements to Coordinated Universal
Time (UTC), a modern standard subsuming the traditional Greenwich Mean Time
(GMT).
The Gregorian calendar is common throughout the technology world, although many
other calendar systems are in use in different countries.
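A short sketch of aligning timestamps to UTC with Python's standard datetime and zoneinfo modules (Python 3.9+); the two local times are invented, and both turn out to name the same instant:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

t_ny  = datetime(2023, 3, 1, 9, 30, tzinfo=ZoneInfo("America/New_York"))
t_ind = datetime(2023, 3, 1, 20, 0, tzinfo=ZoneInfo("Asia/Kolkata"))

# Align both to Coordinated Universal Time before ordering or comparing them
print(t_ny.astimezone(timezone.utc))                                    # 2023-03-01 14:30:00+00:00
print(t_ind.astimezone(timezone.utc))                                   # 2023-03-01 14:30:00+00:00
print(t_ind.astimezone(timezone.utc) - t_ny.astimezone(timezone.utc))   # 0:00:00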
Financial Unification
Money makes the world go round, which is why so many data science projects
revolve around financial time series.
The time value of money implies that a dollar today is (generally) more valuable than
a dollar a year from now, with interest rates providing the right way to discount
future dollars. Inflation rates are estimated by tracking price changes over baskets of
items, and provide a way to standardize the purchasing power of a dollar over time.
Dealing with missing values
Numerical data sets expect a value for every element in a matrix. Setting missing
values to zero is tempting, but generally wrong, because there is always some
ambiguity as to whether these values should be interpreted as data or not.
Is someone’s salary zero because he is unemployed, or did he just not answer the
question?
The danger with using nonsense values as not-data symbols is that they can get
misinterpreted as data when it comes time to build models. A linear regression
model trained to predict salaries from age, education, and gender will have trouble
with people who refused to answer the question.
Using a value like −1 as a no-data symbol has exactly the same deficiencies as zero.
Indeed, be like the mathematician who is afraid of negative numbers: stop at nothing
to avoid them.
Take-Home Lesson: Separately maintain both the raw data and its cleaned version.
The raw data is the ground truth, and must be preserved intact for future analysis.
The cleaned data may be improved using imputation to fill in missing values. But
keep raw data distinct from cleaned, so we can investigate different approaches to
guessing.
So how should we deal with missing values? The simplest approach is to drop all
records containing missing values. This works just fine when it leaves enough
training data, provided the missing values are absent for non-systematic reasons. If
the people refusing to state their salary were generally those above the mean,
dropping these records will lead to biased results.
But typically we want to make use of records with missing fields. It can be better to
estimate or impute missing values, instead of leaving them blank. We need general
methods for filling in missing values. Candidates include:
Mean value imputation: Using the mean value of a variable as a proxy for missing
values is generally sensible.
Random value imputation: Another approach is to select a random value from the
column to replace the missing value. This would seem to set us up for potentially
lousy guesses, but that is actually the point. If we run the model ten times with ten
different imputed values and get widely varying results, then we probably shouldn’t
have much confidence in the model.
Imputation by interpolation: More generally, we can use a method like linear regression
to predict the value of the missing field from the other fields in the record. Such models
can be trained over full records and then applied to those with missing values. Using
linear regression to predict missing values works best when there is only one field
missing per record. Note that regression models can easily turn an incomplete record
into an outlier, by filling the missing fields in with unusually high or low values.
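A small pandas sketch of the first two approaches, applied to a hypothetical salary column with missing entries:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"salary": [42000, 55000, None, 61000, None, 48000]})

# Mean value imputation
mean_filled = df["salary"].fillna(df["salary"].mean())

# Random value imputation: draw replacements from the observed values
observed = df["salary"].dropna().to_numpy()
random_filled = df["salary"].apply(lambda v: rng.choice(observed) if pd.isna(v) else v)

print(mean_filled.tolist())
print(random_filled.tolist())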
Outlier Detection
An outlier is a data point that significantly deviates from the rest of the data. It can
be either much higher or much lower than the other data points, and its presence
can have a significant impact on the results of machine learning algorithms. They
can be caused by measurement or execution errors. The analysis of outlier data is
referred to as outlier analysis or outlier mining.
It is important for a data scientist to find outliers and remove them from the
dataset as part of the feature engineering before training machine learning
algorithms for predictive modeling. Outliers present in a classification or
regression dataset can lead to lower predictive modeling performance.
Example: Imagine you have a group of friends, and you’re all about the same
age, but one person is much older or younger than the rest. That person would
be considered an outlier because they stand out from the usual pattern. In data,
outliers are points that deviate significantly from the majority, and detecting
them helps identify unusual patterns or errors in the information. This method is
like finding the odd one out in a group, helping us spot data points that might
need special attention or investigation.
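Two common detection rules, sketched in NumPy on a small invented sample where one value (150) clearly stands apart:

import numpy as np

data = np.array([52, 55, 57, 58, 60, 61, 61, 63, 64, 66,
                 59, 62, 58, 60, 63, 57, 61, 65, 64, 60, 150])

# z-score rule: flag points more than three standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])                                       # [150]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])   # [150]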
CROWDSOURCING
No single person has all the answers. Not even me. Much of what passes for wisdom is
how we aggregate expertise, assembling opinions from the knowledge and experience
of others. Crowdsourcing is the collection of information, opinions, or work from a
group of people.
The Penny Demo
Picture a jar of pennies that I accumulated in my office over many years. How many
pennies do I have in this jar? Make your own guess now.
To get the right answer, I had my biologist-collaborator Justin Garden weigh the
pennies on a precision laboratory scale. Dividing by the weight of a single penny gives
the count. So I ask
again: how many pennies do you think I have in this jar? I performed this experiment
on students in my data science class. How will your answer compare to theirs?
I first asked eleven of my students to write their opinions on cards and quietly pass
them up to me at the front of the room. Thus these guesses were completely
independent of each other. The results, sorted for convenience, were:
537, 556, 600, 636, 1200, 1250, 2350, 3000, 5000, 11,000, 15,000
I then wrote these numbers on the board, and computed some statistics.
The median of these guesses was 1250, with a mean of 3739. In fact, there were
exactly 1879 pennies in the jar. The median score among my students came closer to
the right amount than all but one of the individual guesses. Only when the guessing was
later repeated out loud, with students hearing each other's estimates, did the answers
drift badly: it was clear that group-think had settled in.
When is the crowd wise?
When the opinions are independent: Our experiment highlighted how easy it is
for a group to lapse into group-think. People naturally get influenced by others. If you
want someone’s true opinion, you must ask them in isolation.
When crowds are people with diverse knowledge and methods: Crowds only
add information when there is disagreement. A committee composed of perfectly-
correlated experts contributes nothing more than you could learn from any one of
them.
When the problem is in a domain that does not need specialized knowledge: I
trust the consensus of the crowd in certain important decisions, like which type of car
to buy or who should serve as the president of my country (gulp).
Opinions can be fairly aggregated: The least useful part of any mass survey form
is the open response field “Tell us what you think!”. The problem here is that there is no
way to combine these opinions to form a consensus, because different people have
different issues and concerns.
Mechanisms for Aggregation
Taking the mean of the individual estimates presumes that their errors are
symmetrically distributed around the truth; a quick look at the shape of the distribution
can generally confirm or reject that hypothesis. The median is, generally speaking, a
more appropriate choice than the
mean in such aggregation problems. It reduces the influence of outliers, which is a
particular problem in the case of mass experiments where a certain fraction of your
participants are likely to be bozos.
Removing outliers is a very good strategy, but we may have other grounds to judge
the reliability of our subjects, such as their performance on other tests where we do
know the answer. Taking a weighted average, where we give more weight to the scores
deemed more reliable, provides a way to take such confidence measures into account.
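A short sketch applying these aggregation mechanisms to the eleven penny guesses recorded earlier; the reliability weights in the last step are purely hypothetical:

import numpy as np

guesses = [537, 556, 600, 636, 1200, 1250, 2350, 3000, 5000, 11000, 15000]

print(np.mean(guesses))      # 3739.0: dragged upward by a few huge guesses
print(np.median(guesses))    # 1250.0: much closer to the true count of 1879

# Weighted average: give more weight to guessers judged more reliable
weights = [1, 1, 1, 2, 2, 2, 1, 1, 1, 0.5, 0.5]   # hypothetical reliability scores
print(np.average(guesses, weights=weights))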
Crowdsourcing services
Crowdsourcing services like Amazon Mechanical Turk and CrowdFlower provide the
opportunity for you to hire large numbers of people to do small amounts of piecework.
They help you to wrangle people, in order to create data for you to wrangle. These
crowdsourcing services maintain a large stable of freelance workers, serving as the
middleman between them and potential employers. These workers, generally called
Turkers, are provided with lists of available jobs and what they will pay, as shown in
Figure.
Employers generally have some ability to control the location and credentials of who
they hire, and the power to reject a worker’s efforts without pay, if they deem it
inadequate. But statistics on employers’ acceptance rates are published, and good
workers are unlikely to labor for bad actors.
The tasks assigned to Turkers generally involve simple cognitive efforts that cannot
currently be performed well by computers. Good applications of Turkers include:
Obtaining training data for machine learning classifiers: Our primary interest in
crowdsourcing will be to produce human annotations that serve as training data. Many
machine learning problems seek to do a particular task “as well as people do.” Doing so
requires a large number of training instances to establish what people did, when given
the chance. For example, suppose we sought to build a sentiment analysis system
capable of reading a written review and deciding whether its opinion of a product is
favorable or unfavorable. We will need a large number of reviews labeled by annotators
to serve as testing/training data. Further, we need the same reviews labeled repeatedly
by different annotators, so as to identify any inter-annotator disagreements concerning
the exact meaning of a text.
Obtaining evaluation data for computer systems: A/B testing is a standard method
for optimizing user interfaces: show half of the judges version A of a given system and
the other half version B. Then test which group did better according to some metric.
Turkers can provide feedback on how interesting a given app is, or how well a new
classifier is performing. One of my grad students (Yanqing Chen) used CrowdFlower to
evaluate a system he built to identify the most relevant Wikipedia category for a
particular entity. Which category better describes Barack Obama: Presidents of the
United States or African-American Authors? For $200, he got people to answer a total of
10,000 such multiple-choice questions, enough for him to properly evaluate his system.
Putting humans into the machine: There still exist many cognitive tasks that people
do much better than machines. A cleverly-designed interface can supply user queries to
people sitting inside the computer, waiting to serve those in need. Suppose you wanted
to build an app to help the visually impaired, enabling the user to snap a picture and ask
someone for help. Maybe they are in their kitchen, and need someone to read the label
on a can to them. This app could call a Turker as a subroutine, to do such a task as it is
needed. Of course, these image-annotation pairs should be retained for future analysis.
They could serve as training data for a machine learning program to take the people out
of the loop, as much as possible.
Independent creative production: Crowdsourcing platforms can also be used to
commission large numbers of creative works on demand. You can order blog posts or
articles on demand,
or written product reviews both good and bad. Anything that you might imagine can be
created, if you just specify what you want. Here are two silly examples that I somehow
find inspiring: – The Sheep Market (http://www.thesheepmarket.com) commissioned
10,000 drawings of sheep for pennies each. As a conceptual art piece, it tries to sell
them to the highest bidder. What creative endeavors can you think of that people will
do for you at $0.25 a pop? – Emoji Dick (http://www.emojidick.com) was a
crowdsourced effort to translate the great American novel Moby Dick completely into
emoji images. Its creators partitioned the book into roughly 10,000 parts, and farmed
out each part to be translated by three separate Turkers. Other Turkers were hired to
select the best one of these to be incorporated into the final book. Over 800 Turkers
were involved, with the total cost of $3,676 raised by the crowd-funding site Kickstarter.
QUESTION BANK
UNIT-II
2 MARKS QUESTION
1. Define Probability.
2. Define experiment with example.
3. Define sample space with example.
4. Define event with example.
5. What is the probability of an outcome? Explain with an example.
6. What is the probability of an event? Explain with an example.
7. What is a random variable? Explain with an example.
8. What is expected value? Explain with an example.
9. What are variability measures?
10. What is sampling error?
11. What is measurement error?
5 MARKS QUESTION
1. State Probability vs Statistics with an example.
2. State compound events and independence with an example.
3. What is conditional probability? Explain with an example.
4. Explain interpreting variance with an example.
5. Write a short note on "Correlation does not imply Causation".
6. State how periodicities can be detected by autocorrelation.
7. Explain in brief the languages for Data Science.
8. Write the importance of notebook environments.
9. How do you clean data? Explain with examples.
10. State errors vs artifacts with an example.
11. Explain outlier detection in detail, with an example.
12. Explain the penny demo in crowdsourcing.
13. When is the crowd wise?
14. What are mechanisms for aggregation?
10 MARKS QUESTION
1. Write a short note on union, intersection, set difference, and independent
events, with examples.
2. What are probability distributions? Explain the types, with examples.
3. What is descriptive statistics? Explain with examples. OR Define the centrality
measures: mean, geometric mean, median, mode.
4. What is correlation analysis? Explain its types with examples.
5. Explain standard data formats in detail.
6. How do you collect data from different sources? Explain in detail with
examples.
7. What is data compatibility? Explain its different types.
8. How do you deal with missing values? Explain in detail.