FDS 4 Unit
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Let’s explore all these interesting data types.
Structured data
• Structured data is data that depends on a data model and resides in a fixed field within a record. As such,
it’s often easy to store structured data in tables within databases or Excel files
• SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.
One example of unstructured data is your regular email
Natural language
• Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created by a computer, process, application,
or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples
of machine data are web server logs, call detail records, network event logs, and telemetry.
Graph-based or network data
• “Graph data” can be a confusing term because any data can be shown in a graph.
• Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural way to represent social networks, and its structure allows you to calculate
specific metrics such as the influence of a person and the shortest path between two people.
Streaming data
• The data flows into the system when an event happens instead of being loaded into a data store in a
batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
The data science process
1. The first step of this process is setting a research goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of the project. In every serious project this will result
in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still need to convince
the business that your findings will indeed change the business process as expected. This is where you
can shine in your influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Retrieving data
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field
and design a data collection process yourself, but most of the time you won’t be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don’t have can
often be bought from third parties.
• More and more organizations are making even high-quality data freely available for public and
commercial use.
• Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.
• Most companies have a program for maintaining key data, so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
• Data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or
raw format.
• Finding data even within your own company can sometimes be a challenge. As companies grow, their
data becomes scattered around many places. The data may be dispersed as people change positions and
leave the company.
• Getting access to data is another difficult task. Organizations understand the value and sensitivity of data
and often have policies in place so everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.
External Data
• If data isn’t available inside your organization, look outside your organization. Companies provide data
so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn,
and Facebook.
• More and more governments and organizations share their data for free with the world.
• A list of open data providers is a good place to get you started.
Cleansing data
Data cleansing is a sub process of the data science process that focuses on removing errors in your data so your
data becomes a true and consistent representation of the processes it originates from.
• The first type is the interpretation error, such as when you take the value in your data for granted, like
saying that a person’s age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or against your company’s
standardized values.
An example of this class of errors is putting “Female” in one table and “F” in another when they represent
the same thing: that the person is female.
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors; diagnostic plots can be especially insightful. For example, you can use a measure to identify data points that seem out of place, or fit a regression to get acquainted with the data and detect the influence of individual observations on the regression line.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Redundant Whitespace
• Whitespace characters tend to be hard to detect but cause errors like other redundant characters would.
• Redundant whitespace causes mismatches between strings such as “FR ” and “FR”, which can lead to dropping observations that couldn’t be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most programming languages. They all provide string functions that remove leading and trailing whitespace. For instance, in Python you can use the strip() function to remove leading and trailing spaces, as in the sketch below.
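A minimal sketch of stripping redundant whitespace before matching strings; the country codes used here are invented for illustration:

# Hypothetical example: trailing whitespace causes a mismatch between "FR " and "FR"
codes = ["FR ", " DE", "FR"]
cleaned = [c.strip() for c in codes]   # remove leading and trailing whitespace
print(cleaned)                         # ['FR', 'DE', 'FR']
print("FR " == "FR")                   # False - the trailing space breaks the match
print("FR ".strip() == "FR")           # True after stripping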
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side
when a normal distribution is expected.
Such outliers can point to an error in data collection or to an error that happened in the ETL process. Common techniques data scientists use to handle them are listed in the accompanying table.
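As a rough illustration of spotting outliers with minimum and maximum values, here is a small sketch using pandas; the data values are invented for the example:

import pandas as pd

ages = pd.Series([23, 31, 28, 45, 39, 27, 350])  # 350 is clearly an interpretation error
print(ages.describe())      # min, max, mean, and std give a quick sanity check
print(ages[ages > 120])     # flag impossible ages for correction or removal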
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
Joining Tables
• Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table. The focus is on enriching a single observation.
• Let’s say that the first table contains information about the purchases of a customer and the other table
contains information about the region where your customer lives.
• Joining the tables allows you to combine the information so that you can use it for your model, as shown
in figure.
To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use
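A small sketch of joining two tables on a key with pandas; the column names and values are assumptions for illustration:

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 80, 200]})
regions   = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})

# customer_id acts as the key; an inner join keeps only customers present in both tables
enriched = purchases.merge(regions, on="customer_id", how="inner")
print(enriched)

Changing how= to "left", "right", or "outer" changes the number of resulting rows, as noted above.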
Appending Tables
• Appending or stacking tables is effectively adding observations from one table to another table.
• One table contains the observations from the month January and the second table contains observations
from the month February. The result of appending these tables is a larger one with the observations from
January as well as February.
Figure: Appending data from tables is a common operation, but it requires an equal structure in the tables being appended.
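A minimal sketch of appending (stacking) two tables with the same structure using pandas; the month data are invented:

import pandas as pd

january  = pd.DataFrame({"product": ["A", "B"], "sales": [10, 15]})
february = pd.DataFrame({"product": ["A", "B"], "sales": [12, 18]})

# Stacking rows requires both tables to share the same columns
combined = pd.concat([january, february], ignore_index=True)
print(combined)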
Transforming data
Certain models require their data to be in a certain shape, so transforming your data into a suitable form for data modeling is an important substep.
Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a relationship of the form y = ae^(bx). Taking the logarithm transforms it into a linear relationship, log(y) = log(a) + bx, which simplifies the estimation problem dramatically. Other times you might want to combine two variables into a new variable.
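A brief sketch of such a transformation with NumPy, assuming data that follow y = a·e^(bx); after taking the log, an ordinary linear fit recovers the parameters (the values of a and b here are invented):

import numpy as np

# Hypothetical data following y = a * exp(b * x) with a = 2, b = 0.5
x = np.linspace(0, 5, 20)
y = 2.0 * np.exp(0.5 * x)

# After the log transform the relationship is linear: log(y) = log(a) + b * x
b, log_a = np.polyfit(x, np.log(y), 1)
print(b, np.exp(log_a))   # approximately 0.5 and 2.0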
Figure shows how reducing the number of variables makes it easier to understand the key values. It also shows
how two variables account for 50.6% of the variation within the data set (component1 = 27.8% + component2
= 22.8%). These variables, called “component1” and “component2,” are both combinations of the original
variables. They’re the principal components of the underlying data structure
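The figure itself is not reproduced here; as a rough sketch of how such principal components could be computed, assuming scikit-learn is available (the data set is invented):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))          # invented data set with five original variables

pca = PCA(n_components=2)              # keep only two components
components = pca.fit_transform(X)      # "component1" and "component2" for each observation
print(pca.explained_variance_ratio_)   # fraction of the variation each component accounts for

The two ratios printed here play the same role as the 27.8% and 22.8% mentioned above; the exact numbers depend on the data.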
• Dummy variables can only take two values: true(1) or false(0). They’re used to indicate the absence of
a categorical effect that may explain the observation.
• In this case you’ll make separate columns for the classes stored in one variable and indicate it with 1 if
the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the columns Monday through Sunday. You
use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
• Turning variables into dummies is a technique that’s used in modeling and is popular with, but not
exclusive to, economists.
Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple
classes into multiple variables, each having only two possible values: 0 or 1
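A minimal sketch of this transformation with pandas; the Weekdays column is invented for illustration:

import pandas as pd

df = pd.DataFrame({"Weekdays": ["Monday", "Tuesday", "Monday", "Sunday"]})

# Each class in the Weekdays column becomes its own 0/1 indicator column
dummies = pd.get_dummies(df["Weekdays"])
print(dummies)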
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see figure below). Information becomes
much easier to grasp when shown in a picture, therefore you mainly use graphical techniques to gain an
understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
• The visualization techniques you use in this phase range from simple line graphs or histograms, as shown
in below figure , to more complex diagrams such as Sankey and network graphs.
• Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make them easier to grasp and, let’s admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of exploratory
analysis. Even building simple models can be a part of this step.
Building the models
Building a model is an iterative process. The way you build your model depends on whether you go with classic
statistics or the somewhat more recent machine learning school, and the type of technique you want to use.
Either way, most models consist of the following main steps:
• Selection of a modeling technique and variables to enter in the model
• Execution of the model
• Diagnosis and model comparison
Model execution
• Once you’ve chosen a model you’ll need to implement it in code.
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
These packages support several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. As you can see in the following code, it’s fairly easy to use linear regression with StatsModels
or Scikit-learn
• Doing this yourself would require much more effort even for the simple techniques. The following
listing shows the execution of a linear prediction model.
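The original listing is not reproduced here; the following is a minimal sketch of what executing a linear prediction model might look like with StatsModels and Scikit-learn (the generated data are invented):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Invented data: the target depends linearly on two predictors plus noise
np.random.seed(0)
X = np.random.random((100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.normal(0, 0.1, 100)

# StatsModels: ordinary least squares with an explicit constant term
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)

# Scikit-learn: the same model through the estimator API
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)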
Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and average the squared errors over all predictions: MSE = Σ(y − y')² / n.
Above figure compares the performance of two models to predict the order size from the price. The first model
is size = 3 * price and the second model is size = 10.
• To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
• Once the model is trained, we predict the values for the other 20% of the observations, for which we already know the true value, and calculate the model error with an error measure.
• Then we choose the model with the lowest error. In this example we chose model 1 because it has the
lowest total error.
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Presentation and automation
• Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
• This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s sufficient
that you implement only the model scoring; other times you might build an application that automatically
updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science
process is where your soft skills will be most useful, and yes, they’re extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings, determining the probable
break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
• Recommendations: Determining which products are likely to be sold together, generating
recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
• Grouping: Separating customers or events into cluster of related items, analyzing and predicting
affinities
Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and consider ways that data can be
utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
• What are you looking for? What types of relationships are you trying to find?
• Does the problem you are trying to solve reflect the policies or processes of the business?
• Do you want to make predictions from the data mining model, or just look for interesting patterns and
associations?
• Which outcome or attribute do you want to try to predict?
• What kind of data do you have and what kind of information is in each column? If there are multiple
tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to
make the data usable?
• How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the
business?
Preparing Data
• The second step in the data mining process is to consolidate and clean the data that was identified in the
Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or may contain inconsistencies
such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating missing values, but about finding
hidden correlations in the data, identifying sources of data that are the most accurate, and determining
which columns are the most appropriate for use in analysis
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
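A short sketch of such an exploration step with pandas, using invented values:

import pandas as pd

sales = pd.Series([120, 135, 128, 400, 131, 126, 129])
print(sales.min(), sales.max())    # extreme values can reveal unrepresentative data
print(sales.mean(), sales.std())   # mean and standard deviation summarize the distribution
print(sales.describe())            # count, quartiles, and other distribution values at once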
Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you process it.
When you process the mining structure, SQL Server Analysis Services generates aggregates and other statistical
information that can be used for analysis. This information can be used by any mining model that is based on
the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test how well the model performs.
Also, when you build a model, you typically create multiple models with different configurations and test all
models to see which yields the best results for your problem and your data.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidations.
Although a data warehouse and a traditional database share some similarities, they are not the same thing. The main difference is that in a database, data is collected for multiple transactional purposes, whereas in a data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed for big analytical queries.
Data warehousing integrates data and information collected from various sources into one comprehensive database. For example, a data warehouse might combine customer information from an organization’s point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential information about employees, salary information, and so on. Businesses use such components of a data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in vast
volumes of data and devising innovative strategies for increased sales and profits.
Data Mart
A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit. Every department of a business has a central repository or data mart to store data. The data from the data mart is stored in the operational data store (ODS) periodically. The ODS then sends the data to the enterprise data warehouse (EDW), where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
• Setting the research goal—Defining the what, the why, and the how of your project in a project charter.
• Retrieving data—Finding and getting access to data needed in your project. This data is either found
within the company or retrieved from a third party.
• Data preparation—Checking and remediating data errors, enriching the data with data from other data
sources, and transforming it into a suitable format for your models.
• Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
• Data modeling—Using machine learning and statistical techniques to achieve your project goal.
• Presentation and automation—Presenting your results to the stakeholders and industrializing your
analysis process for repetitive reuse and integration with other tools.
Unit – II
DESCRIBING DATA
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing Data with Averages
- Describing Variability - Normal Distributions and Standard (z) Scores
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
• The weights can be described not only as quantitative data but also as observations for a quantitative
variable, since the various weights take on different numerical values.
• By the same token, the replies can be described as observations for a qualitative variable, since the
replies to the Facebook profile question take on different values of either Yes or No.
• Given this perspective, any single observation can be described as a constant, since it takes on only one
value.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite
number of potential values between any two points. Generally, you measure them using a scale.
Examples of continuous variables include weight, height, length, time, and temperature.
Other examples include durations, such as the reaction times of grade school children to a fire alarm, and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Independent Variable
An independent variable is the treatment manipulated by the investigator in an experiment. The impartial creation of distinct groups, which differ only in terms of the independent variable, has a most
desirable consequence. Once the data have been collected, any difference between the groups can be interpreted
as being caused by the independent variable.
Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a dependent
variable. In an experimental setting, the dependent variable is measured, counted, or recorded by the
investigator.
• The dependent variable (DV) is what you want to use the model to explain or predict. The values of this
variable depend on other variables.
• It’s also known as the response variable, outcome variable, and left-hand variable. Graphs place
dependent variables on the vertical, or Y, axis.
• A dependent variable is exactly what it sounds like: something that depends on other factors.
For example, a blood sugar test depends on what food you ate, at what time you ate it, and so on.
Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it
represents an outcome: the data produced by the experiment.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable.
Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different
conditions.
Grouped Data
When observations are sorted into classes of more than one value, according to their frequency of occurrence, the result is referred to as a frequency distribution for grouped data (shown in Table 2.2).
• The general structure of this frequency distribution is that the data are grouped into class intervals with 10 possible values each.
• The frequency ( f ) column shows the frequency of observations in each class and, at the bottom, the
total number of observations in all classes.
OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of
the neighboring co-existing values in a data graph or dataset you're working with.
Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually lack
decimal points. A proportion always varies between 0 and 1, whereas a percentage always varies between
0 percent and 100 percent.
To convert the relative frequencies, multiply each proportion by 100; that is, move the decimal point two
places to the right.
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages
To obtain this cumulative percentage, the cumulative frequency of the class should be divided by the total
frequency of the entire distribution.
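A small sketch that computes frequencies, relative frequencies, percentages, and cumulative percentages with pandas; the scores are invented:

import pandas as pd

scores = pd.Series([3, 1, 2, 2, 3, 3, 4, 1, 2, 3])
freq = scores.value_counts().sort_index()               # frequency (f) of each class

relative = freq / freq.sum()                            # proportions between 0 and 1
percent = relative * 100                                # move the decimal point two places
cumulative_percent = freq.cumsum() / freq.sum() * 100   # cumulative f / total f * 100

print(pd.DataFrame({"f": freq, "%": percent, "cum %": cumulative_percent}))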
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative
percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or
smaller values than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights constitute
80 percent of the entire distribution.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. And
data can often be described even more vividly by converting frequency distributions into graphs.
Figure: Histogram
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may be
constructed directly from frequency distributions.
Stem and Leaf Displays
Arrange a column of numbers, the stems, beginning with 13 (representing the 130s) and ending with 24
(representing the 240s). Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
For example
Enter each raw score into the stem and leaf display. As suggested by the shaded coding in Table 2.9, the first
raw score of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of 193 reappears as a leaf of 3 on
a stem of 19, and the third raw score of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each raw
score reappears as a leaf on its appropriate stem.
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important characteristic
of a frequency distribution is its shape. Below figure shows some of the more typical shapes for smoothed
frequency polygons (which ignore the inevitable irregularities of real data).
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
Popular sayings include “Numbers don’t lie, but statisticians do” and “There are three kinds of lies—lies, damned lies, and statistics.”
MODE
The mode reflects the value of the most frequently occurring score.
In other words, a mode is defined as the value that has the highest frequency in a given set of values. It is the value that appears the most number of times.
Example:
In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has appeared in the set twice.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
• When there are two modes in a data set, then the set is called bimodal
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the given set.
• When there are three modes in a data set, then the set is called trimodal
For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8
• When there are four or more modes in a data set, then the set is called multimodal
Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode
of the given set of data.
It can be seen that 2 wickets were taken by the bowler frequently in different matches. Hence, the mode of the
given data is 2.
MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.
• If the total number of observations n is odd, the median is the ((n+1)/2)th ordered value; if n is even, the median is the average of the (n/2)th and ((n/2)+1)th ordered values.
Example 1:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution: n = 15 (odd).
When we put those numbers in order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
The median is the ((15+1)/2)th = 8th value, so the median is 24.
Example 2:
Find the median of the following:
9,7,2,11,18,12,6,4
Solution: n = 8 (even).
When we put those numbers in order we have:
2, 4, 6, 7, 9, 11, 12, 18
The median is the average of the 4th and 5th values: (7 + 9)/2 = 8.
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total
number of numbers.
Types of means
• Sample mean
• Population mean
Sample Mean
The sample mean is a central tendency measure: the arithmetic average computed from samples or random values taken from the population. It is evaluated as the sum of all the sampled values divided by the number of values: x̄ = Σx / n.
Population Mean
The population mean is the sum of all values in the given data/population divided by the total number of values in the given data/population: μ = ΣX / N.
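A brief sketch using Python's statistics module to compute these averages for an invented set of scores:

import statistics

scores = [2, 4, 5, 5, 6, 7]
print(statistics.mean(scores))    # 4.833... (sum of scores / number of scores)
print(statistics.median(scores))  # 5 (average of the two middle values for an even count)
print(statistics.mode(scores))    # 5 (most frequent value)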
Describing Variability
RANGE
The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the highest and lowest values. For example,
if the given data set is {2,5,8,10,3}, then the range will be 10 – 2 = 8.
Example 1: Find the range of the given observations: 32, 41, 28, 54, 35, 26, 23, 33, 38, 40.
Solution: Range = largest value − smallest value = 54 − 23 = 31.
VARIANCE
Variance is a measure of how data points differ from the mean. A variance is a measure of how far a set of data
(numbers) are spread out from their mean (average) value. Formula
Variance = (standard deviation)² = σ² = Σ(x − μ)² / n
To find the mean, the values of all scores must be added and then divided by the total number of scores.
Example
X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution
n = 10
sum(x) = 5 + 8 + 6 + 10 + 12 + 9 + 11 + 10 + 12 + 7 = 90
Mean: μ = sum(x) / n = 90 / 10 = 9
Deviations from the mean:
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4 = 54
σ² = Σ(x − μ)² / n = 54 / 10 = 5.4
STANDARD DEVIATION
The standard deviation, the square root of the mean of all squared deviations from the mean, that is,
Standard deviation = √variance
Standard Deviation: a rough measure of the average (or standard) amount by which scores deviate on either side of their mean.
“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this formula by remembering the following three steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)².
3. Add all of the squared deviation scores: SS = Σ(X − μ)².
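A quick check of the worked variance example above using NumPy (a brief sketch; np.var and np.std are standard NumPy functions):

import numpy as np

x = np.array([5, 8, 6, 10, 12, 9, 11, 10, 12, 7])
print(x.mean())        # 9.0
print(x.var())         # 5.4 (population variance, dividing by n)
print(x.std())         # 2.32... (square root of the variance)
print(x.var(ddof=1))   # 6.0 (sample variance, dividing by n - 1 degrees of freedom)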
Formula
Degree of freedom df = n-1
Example
Consider a data set that consists of five positive integers whose sum must be a multiple of 6. Four of the values are selected freely, say 3, 8, 5, and 4.
The sum of these four values is 20, so the fifth integer is no longer free to vary; it must be chosen to make the total divisible by 6 (for example, 10, giving a total of 30).
Degrees of Freedom (df): the number of values free to vary, given one or more mathematical restrictions, such as the restriction imposed by the sum of squares term in the numerator of the formulas for s² and s. In fact, we can use degrees of freedom to rewrite the formulas for the sample variance and standard deviation:
s² = SS / df = SS / (n − 1) and s = √(SS / df)
INTERQUARTILE RANGE (IQR)
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically,
the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th
percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent)
have been trimmed from the original set of scores. Since most distributions are spread more widely in their
extremities than their middle, the IQR tends to be less than half the size of the range.
Simply, The IQR describes the middle 50% of values when ordered from lowest to highest. To find the
interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These
values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
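A minimal sketch of computing the IQR with NumPy for an invented list of scores:

import numpy as np

scores = np.array([2, 4, 4, 5, 7, 8, 9, 10, 12, 15])
q1, q3 = np.percentile(scores, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # range of the middle 50 percent
print(q1, q3, iqr)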
z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how
many standard deviations a score is above or below the mean of its distribution.
A z score can be defined as a measure of the number of standard deviations by which a score is below or above
the mean of a distribution. In other words, it is used to determine the distance of a score from the mean. If the
z score is positive it indicates that the score is above the mean. If it is negative then the score will be below the
mean. However, if the z score is 0 it denotes that the data point is the same as the mean.
To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation units (by dividing by its standard deviation):
z = (X − μ) / σ
Where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the normal
distribution of the original scores. Since identical units of measurement appear in both the numerator and
denominator of the ratio for z, the original units of measurement cancel each other and the z score emerges as
a unit-free or standardized number, often referred to as a standard score.
Although there is an infinite number of different normal curves, each with its own mean and standard deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
FINDING PROPORTIONS
Finding Proportions for One Score
• Sketch a normal curve and shade in the target area. Plan your solution according to the normal table, then convert X to z.
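A rough sketch of this procedure using SciPy instead of the printed normal table; the mean, standard deviation, and score are invented for illustration:

from scipy.stats import norm

mu, sigma = 100, 15        # invented mean and standard deviation
X = 120                    # the single score of interest

z = (X - mu) / sigma       # convert X to z
print(z)                   # 1.33...
print(norm.cdf(z))         # proportion of the area below the score
print(1 - norm.cdf(z))     # proportion of the area above the score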
FINDING SCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find
the unknown proportion (of area) associated with some known score or pair of known scores
Now we will concentrate on the opposite type of normal curve problem for which Table A must be
consulted to find the unknown score or scores associated with some known proportion.
This type of problem requires that we reverse our use of Table A by entering proportions in columns B, C, B′, or C′ and finding z scores listed in columns A or A′.
It’s often helpful to visualize the target score as splitting the total area into two sectors—one to the left of
(below) the target score and one to the right of (above) the target score
When converting z scores to original scores, you will probably find it more efficient to use the following equation:
X = μ + (z)(σ)
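A brief sketch of the reverse problem with SciPy: start from a known proportion, look up z, and convert it back to an original score (the distribution parameters are invented):

from scipy.stats import norm

mu, sigma = 100, 15     # invented mean and standard deviation
p = 0.90                # known proportion of the area below the unknown target score

z = norm.ppf(p)         # inverse of the normal table lookup
X = mu + z * sigma      # convert the z score back to an original score
print(z, X)             # about 1.28 and 119.2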
Points to Remember
1. range = largest value – smallest value in a list
2. class interval = range / desired no of classes
3. relative frequency = frequency (f) / Σf
4. Cumulative frequency - add to the frequency of each class the sum of the frequencies of all
classes ranked below it.
5. Cumulative percentage = (cumulative f / total f) * 100
6. Histograms
7. Construction of frequency polygon
8. Stem and leaf display
9. Mode - The value of the most frequent score.
10. For an odd number of terms, Median = ((n+1)/2)th term/observation. For an even number of terms, Median = ½ [(n/2)th term + ((n/2)+1)th term].
11. Mean = sum of all scores / number of scores
15. z score: z = (X − μ) / σ
Unit – III
DESCRIBING RELATIONSHIPS
Correlation – Scatter plots – correlation coefficient for quantitative data – computational formula for correlation
coefficient – Regression – regression line – least squares regression line – Standard error of estimate –
interpretation of r2 – multiple regression equations – regression towards the mean
Correlation
Correlation refers to a process for establishing the relationships between two variables. You learned a way to get
a general idea about whether or not two variables are related, is to plot them on a “scatter plot”. While there are
many measures of association for variables which are measured at the ordinal or higher level of measurement,
correlation is the most commonly used approach.
Types of Correlation
• Positive Correlation – when the values of the two variables move in the same direction so that an
increase/decrease in the value of one variable is followed by an increase/decrease in the value of the other
variable.
• Negative Correlation – when the values of the two variables move in the opposite direction so that an
increase/decrease in the value of one variable is followed by decrease/increase in the value of the other
variable.
• No Correlation – when there is no linear dependence or no relation between the two variables.
SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all pairs of scores. In other words
Scatter plots are the graphs that present the relationship between two variables in a data-set. It represents data
points on a two-dimensional plane or on a Cartesian system.
The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that has a slope from the lower left to the upper right, as in panel A of below figure reflects a positive
relationship.
A dot cluster that has a slope from the upper left to the lower right, as in panel B of below figure reflects a negative
relationship.
A dot cluster that lacks any apparent slope, as in panel C of below figure reflects little or no relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between
two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear
relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in
below figure, and therefore reflects a curvilinear relationship.
CORRELATION COEFFICIENT FOR QUANTITATIVE DATA
The Pearson correlation coefficient, r, describes both the direction and the strength of the linear relationship between two quantitative variables.
Properties of r
• The correlation coefficient is scaled so that it is always between -1 and +1.
• When r is close to 0 this means that there is little relationship between the variables and the farther away
from 0 r is, in either the positive or negative direction, the greater the relationship between the two
variables.
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
• A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign
indicates a negative relationship
The correlation coefficient is computed as r = SPxy / √(SSx · SSy), where the two sum of squares terms in the denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)², and the sum of products term in the numerator is defined as SPxy = Σ(X − X̄)(Y − Ȳ).
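A small sketch that computes r both from the definition above and with NumPy's built-in function; the paired scores used here are the same simple values used in the standard error example later in this unit:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

sp_xy = np.sum((x - x.mean()) * (y - y.mean()))   # sum of products, SPxy
ss_x = np.sum((x - x.mean()) ** 2)                # sum of squares for x
ss_y = np.sum((y - y.mean()) ** 2)                # sum of squares for y
r = sp_xy / np.sqrt(ss_x * ss_y)

print(r)
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in function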
REGRESSION
A regression is a statistical technique that relates a dependent variable to one or more independent
(explanatory) variables. A regression model is able to show whether changes observed in the dependent variable
are associated with changes in one or more of the explanatory variables.
Regression captures the correlation between variables observed in a data set, and quantifies whether those
correlations are statistically significant or not.
A Regression Line
a regression line is a line that best describes the behaviour of a set of data. In other words, it’s a line that best fits
the trend of a given data.
Types of regression
The two basic types of regression are:
Simple linear regression
Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y.
Multiple linear regression
Multiple linear regression uses two or more independent variables to predict the outcome.
Predictive Errors
Prediction error refers to the difference between the predicted values made by some model and the actual
values.
Formula (least squares line y' = bx + a):
Slope: b = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]
Intercept: a = [Σy − b Σx] / N
Example
Fit a least squares regression line to the following data:
x: 2, 3, 5, 7, 9
y: 4, 5, 7, 10, 15
Step 1: For each (x, y) pair calculate x² and xy:
x   y    x²   xy
2   4    4    8
3   5    9    15
5   7    25   35
7   10   49   70
9   15   81   135
Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):
Σx = 26, Σy = 41, Σx² = 168, Σxy = 263
Step 3: Calculate the slope b:
b = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = (1315 − 1066) / (840 − 676)
  = 249 / 164
  = 1.5183
Step 4: Calculate the intercept a:
a = (Σy − b Σx) / N = (41 − 1.5183 × 26) / 5 ≈ 0.305
Step 5: Write the regression line y' = bx + a:
y' = 1.518x + 0.305
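A quick numerical check of this example with NumPy's least squares fit (a sketch added for illustration):

import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

b, a = np.polyfit(x, y, 1)   # slope and intercept of the least squares line
print(b, a)                  # approximately 1.518 and 0.305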
STANDARD ERROR OF ESTIMATE
Example
Calculate the standard error of estimate for the given X and Y values: X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5.
Solution
Create five columns labeled x, y, y', y − y', (y − y')², with N = 5. Here Σx = 15, Σy = 20, Σx² = 55 and Σxy = 66.
b = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]
  = (5 × 66 − 15 × 20) / (5 × 55 − 15²)
  = (330 − 300) / (275 − 225)
  = 30 / 50 = 0.6
a = (Σy − b Σx) / N
  = (20 − 0.6 × 15) / 5
  = (20 − 9) / 5
  = 11 / 5 = 2.2
So the regression line is y' = 0.6x + 2.2. Using it to fill in the remaining columns:
x   y   y'    y − y'   (y − y')²
1   2   2.8   −0.8     0.64
2   4   3.4    0.6     0.36
3   5   4.0    1.0     1.00
4   4   4.6   −0.6     0.36
5   5   5.2   −0.2     0.04
Σ(y − y')² = 2.4
Standard error of estimate: s(y|x) = √[Σ(y − y')² / (n − 2)] = √(2.4 / 3) = 0.894
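A short sketch that reproduces this calculation in Python:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

b, a = np.polyfit(x, y, 1)           # 0.6 and 2.2
y_pred = b * x + a                   # predicted values y'
ss_res = np.sum((y - y_pred) ** 2)   # sum of squared prediction errors (2.4)
see = np.sqrt(ss_res / (len(x) - 2))
print(see)                           # about 0.894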
INTERPRETATION OF r²
R-Squared (R² or the coefficient of determination) is a statistical measure in a regression model that
determines the proportion of variance in the dependent variable that can be explained by the independent variable.
In other words, r-squared shows how well the data fit the regression model (the goodness of fit).
R-squared can take any value between 0 and 1. Although the statistical measure provides some useful insights regarding the regression model, the user should not rely only on this measure when assessing a statistical model.
In addition, it does not indicate the correctness of the regression model. Therefore, the user should always
draw conclusions about the model by analyzing r-squared together with the other variables in a statistical model.
The most common interpretation of r-squared is how well the regression model explains observed data.
Example:
A researcher decides to study students’ performance at a school over a period of time. He observed that as lectures moved online, the performance of students started to decline. The dependent variable “decrease in performance” is explained by various independent variables such as “lack of attention, more internet addiction, neglecting studies” and much more.
REGRESSION TOWARD THE MEAN
Regression toward the mean refers to the tendency for extreme observations to be followed by less extreme ones.
Example
A military commander has two units return, one with 20% casualties and another with 50% casualties. He praises
the first and berates the second. The next time, the two units return with the opposite results. From this experience,
he “learns” that praise weakens performance and berating increases performance.
Unit – IV
Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy
indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data
– missing data – Hierarchical indexing – combining datasets – aggregation and grouping – pivot tables
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers.
NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and
data operations as the arrays grow larger in size.
Example
import numpy as np

np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)          # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
Array Indexing and Slicing
• Accessing single elements: index into an array with square brackets, for example x[0] for the first element and x[-1] for the last.
• Accessing subarrays (slicing): x[start:stop:step]
start – starting array index
stop – array index to stop at (this last value itself is not included)
step – step between the elements taken from start to stop
The defaults are start=0, stop=size of dimension, step=1.
Example
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
When a negative step is used, the defaults for start and stop are swapped. This becomes a convenient way to reverse an array:
x[::-1]   # all elements, reversed
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
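A few additional illustrative slices, not from the original notes, assuming the same array x (expected outputs shown as comments):

import numpy as np

x = np.arange(10)
x[0]      # 0, first element
x[-1]     # 9, last element
x[:5]     # array([0, 1, 2, 3, 4]), first five elements
x[4:7]    # array([4, 5, 6]), middle subarray
x[::2]    # array([0, 2, 4, 6, 8]), every other element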
Reshaping of Arrays
The most flexible way of doing this is with the reshape() method. For example, if you want to put the numbers
1 through 9 in a 3×3 grid, you can do the following
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
Concatenation of arrays
Concatenation of arrays uses np.concatenate, which takes a list of arrays as its first argument:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
Two-dimensional arrays can be concatenated as well:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid])           # concatenate along the first axis
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
np.concatenate([grid, grid], axis=1)   # concatenate along the second axis
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
np.vstack([x, grid])
array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
y = np.array([[99],
[99]])
np.hstack([grid, y])
array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit and np.vsplit are similar:
grid = np.arange(16).reshape((4, 4))
grid
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11],
       [12, 13, 14, 15]])
left, right = np.hsplit(grid, [2])
print(left)
[[ 0 1]
 [ 4 5]
 [ 8 9]
 [12 13]]
print(right)
[[ 2 3]
 [ 6 7]
 [10 11]
 [14 15]]
Computations on NumPy arrays: universal functions (ufuncs)
Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated
operations on values in NumPy arrays. Ufuncs are extremely flexible—before we saw an operation between a
scalar and an array, but we can also operate between two arrays
Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which operate on
two inputs. We’ll see examples of both these types of functions here.
Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The standard addition, subtraction,
multiplication, and division can all be used.
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also understands Python’s built-in absolute
value function.
• np.abs()
• np.absolute()
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
The corresponding NumPy ufunc is np.absolute, which is also available under the alias np.abs np.absolute(x)
array([2, 1, 0, 1, 2])
np.abs(x)
array([2, 1, 0, 1, 2])
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the
trigonometric functions.
• np.sin()
• np.cos()
• np.tan()
Inverse trigonometric functions:
• np.arcsin()
• np.arccos()
• np.arctan()
The inverses of the exponentials, the logarithms, are also available. The basic np.log gives the natural logarithm; base-2 and base-10 logarithms are available as well:
• np.log(x) - is a mathematical function that helps user to calculate Natural logarithm of x where x belongs
to all the input array elements
• np.log2(x) - to calculate Base-2 logarithm of x
• np.log10(x) - to calculate Base-10 logarithm of x
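A brief sketch exercising a few of these ufuncs:

import numpy as np

theta = np.linspace(0, np.pi, 3)
print(np.sin(theta))             # element-wise sine
print(np.arcsin([0, 0.5, 1]))    # inverse sine

x = np.array([1, 2, 4, 10])
print(np.log(x))                 # natural logarithm
print(np.log2(x))                # base-2 logarithm
print(np.log10(x))               # base-10 logarithm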
Specialized ufuncs
NumPy has many more ufuncs available like
• Hyperbolic trig functions,
• Bitwise arithmetic,
• Comparison operators,
More specialized and obscure ufuncs is the submodule scipy.special. If you want to compute some obscure
mathematical function on your data, chances are it is implemented in scipy.special.
• Gamma function
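For example, the gamma function lives in scipy.special; a minimal sketch, assuming SciPy is installed:

import numpy as np
from scipy import special

x = np.array([1, 5, 10])
print(special.gamma(x))     # gamma(n) = (n - 1)! for positive integers
print(special.gammaln(x))   # log of the gamma function, useful for large values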
Aggregates
To reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly
applies a given operation to the elements of an array until only a single result remains.
x = np.arange(1, 6)
np.add.reduce(x)
15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements
np.multiply.reduce(x)
120
If we’d like to store all the intermediate results of the computation, we can instead use
Accumulate np.add.accumulate(x)
array([ 1, 3, 6, 10, 15])
Outer products ufunc can compute the output of all pairs of two different inputs using the outer method. This
allows you, in one line, to do things like create a multiplication table.
x = np.arange(1, 6)
np.multiply.outer(x, x)
array([[ 1, 2, 3, 4, 5],
[ 2, 4, 6, 8, 10],
[ 3, 6, 9, 12, 15],
[ 4, 8, 12, 16, 20],
[ 5, 10, 15, 20, 25]])
Example
x = [1, 2, 3, 4]
np.min(x)
1
np.max(x)
4
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column.
By default, each NumPy aggregation function will return the aggregate over the entire array. ie. If we use the
np.sum() it will calculates the sum of all elements of the array.
Example
M = np.random.random((3, 4))
print(M)
M.sum()
6.0850555667307118
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed.
The axis normally takes either 0 or 1. if the axis = 0 then it runs along with columns, if axis =1 it runs along with
rows.
Example
We can find the minimum value within each column by specifying axis=0
M.min(axis=0)
array([ 0.66859307, 0.03783739, 0.19544769, 0.06682827])
Similarly, we can find the maximum value within each row M.max(axis=1)
array([ 0.8967576 , 0.99196818, 0.6687194 ])
Broadcasting
For arrays of the same size, binary operations are performed on an element-by-element basis:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
Broadcasting allows these types of binary operations to be performed on arrays of different sizes. For example, we can just as easily add a scalar to an array:
a + 5
array([5, 6, 7])
We can think of this as an operation that stretches or duplicates the value 5 into the array [5, 5, 5], and adds the
results. The advantage of NumPy’s broadcasting is that this duplication of values does not actually take place.
We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional
array to a two-dimensional array.
Example
M = np.ones((3, 3))
M
array([ [ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
M + a
array([[ 1., 2., 3.],
       [ 1., 2., 3.],
       [ 1., 2., 3.]])
Here the one-dimensional array a is stretched, or broadcast, across the second dimension in order to match the shape of M. In more complicated cases both arrays are stretched to a common shape, and the result is a two-dimensional array (see Broadcasting example 2 below).
The light boxes in the figure represent the broadcasted values: this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays.
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions
is padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
Broadcasting example 1
Let’s look at adding a two-dimensional array to a one-dimensional array:
M = np.ones((2, 3))
a = np.arange(3)
Let's consider an operation on these two arrays. The shapes of the arrays are:
M.shape = (2, 3)
a.shape = (3,)
We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:
M.shape -> (2, 3)
a.shape -> (1, 3)
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
M.shape -> (2, 3)
a.shape -> (2, 3)
The shapes match, and we see that the final shape will be (2, 3):
M + a
array([[ 1., 2., 3.],
[ 1., 2., 3.]])
Broadcasting example 2
Let's take a look at an example where both arrays need to be broadcast:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
The shapes of the arrays are:
a.shape = (3, 1)
b.shape = (3,)
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
Because the result matches, these shapes are compatible. We can see this here:
a + b
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Comparison operators as ufuncs
NumPy also implements comparison operators such as < and > as element-wise ufuncs:
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
array([ True, True, False, False, False], dtype=bool)
x > 3 # greater than
array([False, False, False, True, True], dtype=bool)
x != 3 # not equal
array([ True, True, False, True, True], dtype=bool)
x == 3 # equal
array([False, False, True, False, False], dtype=bool)
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape. Here is a two-dimensional example:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 6
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
The result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
Boolean operators
Operator Equivalent ufunc
& np.bitwise_and
| np.bitwise_or
^ np.bitwise_xor
~ np.bitwise_not
Example
x < 5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
Masking operation
To select these values from the array, we can simply index on this Boolean array; this is known as a masking
operation.
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all
the values in positions at which the mask array is True.
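A short sketch tying comparisons, Boolean operators, and masking together (my own example, reusing the same random array as above):
import numpy as np

rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))

# count entries smaller than 6
print(np.count_nonzero(x < 6))
# are all values in each row smaller than 8?
print(np.all(x < 8, axis=1))
# combine two conditions with & and use the result as a mask
mask = (x > 2) & (x < 7)
print(x[mask])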
Fancy Indexing
Fancy indexing is like the simple indexing we’ve already seen, but we pass arrays of indices in place of
single scalars. This allows us to very quickly access and modify complicated subsets of an array’s values.
Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at
once.
Types of fancy indexing.
• Indexing / accessing more values
• Array of indices
• In multi dimensional
• Standard indexing
Indexing / accessing more values
Consider an array of ten random values:
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Suppose we want to access three different elements; we could do it like this:
[x[3], x[7], x[2]]
[71, 86, 14]
Array of indices
We can pass a single list or array of indices to obtain the same result.
ind = [3, 7, 4]
x[ind]
array([71, 86, 60])
In multi dimensional
Fancy indexing also works in multiple dimensions. Consider the following array:
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Standard indexing
Like with standard indexing, the first index refers to the row, and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
array ([ 2, 5, 11])
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes we’ve seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
• Combine fancy and simple indices
X[2, [2, 0, 1]]
array([10, 8, 9])
• Combine fancy indexing with slicing
X[1:, [2, 0, 1]]
array([[ 6, 4, 5],
[10, 8, 9]])
• Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
Modifying values with fancy indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. For example:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
[ 0 99 99 3 99 5 6 7 99 9]
x[i] -= 10
print(x)
[ 0 89 89 3 89 5 6 7 89 9]
Using at()
For repeated indices, use the at() method of ufuncs, which applies the operation in place at the given indices, once per occurrence:
x = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]
np.add.at(x, i, 1)
print(x)
[ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
Sorting Arrays
Sorting in NumPy: np.sort and np.argsort
Python has built-in sort and sorted functions that work with lists, but we won't discuss them here because NumPy's np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O[N log N] quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.
To return a sorted version of an array without modifying the input, use np.sort; to instead return the indices of the sorted elements, use np.argsort:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
array([1, 2, 3, 4, 5])
i = np.argsort(x)
print(i)
[1 0 3 2 4]
Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a
multidimensional array using the axis argument. For example
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
[2 6 7 4 3 7]
[7 2 5 4 1 7]
[5 1 4 0 9 5]]
np.sort(X, axis=0)
array([[2, 1, 4, 0, 1, 5],
[5, 2, 5, 4, 3, 7],
[6, 3, 7, 4, 6, 7],
[7, 6, 7, 4, 9, 9]])
np.sort(X, axis=1)
array([[3, 4, 6, 6, 7, 9],
[2, 3, 4, 6, 7, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 5, 9]])
Partial sorts: partitioning
Sometimes we are not interested in sorting the entire array, but simply want to find the K smallest values; NumPy provides this with np.partition, which takes an array and a number K:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.
Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array; here the first two slots in each row contain that row's smallest values:
np.partition(X, 2, axis=1)
array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])
Structured Arrays
This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage
for compound, heterogeneous data.
NumPy data types
Character Description Example
'b' Byte np.dtype('b')
'i' Signed integer np.dtype('i4') == np.int32
'u' Unsigned integer np.dtype('u1') == np.uint8
'f' Floating point np.dtype('f8') == np.float64
'c' Complex floating point np.dtype('c16') == np.complex128
'S', 'a' string np.dtype('S5')
'U' Unicode string np.dtype('U') == np.str_
'V' Raw data (void) np.dtype('V') == np.void
Consider if we have several categories of data on a number of people (say, name, age, and weight), and we’d like
to store these values for use in a Python program. It would be possible to store these in three separate arrays.
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
Using a structured array with a compound dtype, all three lists can be stored in a single array (a construction sketch follows the dictionary method below); its contents print as:
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
Boolean masking on such an array lets us, for example, select the names where age is under 30:
array(['Alice', 'Doug'], dtype='<U10')
Dictionary method
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
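A minimal sketch, assuming the name/age/weight lists above, of building and querying the structured array whose outputs are shown (the variable name data is my own choice):
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# create an empty structured array with the compound dtype, then fill the fields
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

# fields can be accessed by name, and Boolean masking works as usual,
# e.g. names of people whose age is under 30
print(data[data['age'] < 30]['name'])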
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns
are identified with labels rather than simple integer indices.
Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures. The three fundamental Pandas data structures are the Series, DataFrame, and Index.
The Pandas Series object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
• Finding values
The values are simply a familiar NumPy array:
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
Data can be accessed by the associated index:
data[1]
0.5
data[1:3]
1 0.50
2 0.75
dtype: float64
Series as generalized NumPy array
Whereas the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values. This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type; if we wish, we can use strings as an index.
Strings as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
Noncontiguous or nonsequential indices
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
sub1 = {'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89}
mark = pd.Series(sub1)
mark
sai 90
ram 85
kasim 92
tamil 89
dtype: int64
Array-style slicing
mark['sai':'kasim']
sai 90
ram 85
kasim 92
dtype: int64
Constructing Series objects
• Data can be a list or NumPy array, in which case the index defaults to an integer sequence:
pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64
• A scalar is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
• Data can be a dictionary, in which case the index defaults to the sorted dictionary keys:
pd.Series({2:'a', 1:'b', 3:'c'})
1 b
2 a
3 c
dtype: object
• The index can be explicitly set if a different result is preferred:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
3 c
2 a
dtype: object
We can use a dictionary to construct a single two-dimensional object containing this information (sub2 holds the FDS marks):
sub2 = {'sai': 91, 'ram': 95, 'kasim': 89, 'tamil': 90}
result = pd.DataFrame({'DS': sub1, 'FDS': sub2})
result
DS FDS
sai 90 91
ram 85 95
kasim 92 89
tamil 89 90
result['DS']
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64
Note
In a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return the
first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather
than generalized arrays, though both ways of looking at the situation can be useful.
• From a single Series object: a DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series, for example the DS marks alone:
DS
sai 90
ram 85
kasim 92
tamil 89
• From a list of dicts: any list of dictionaries can be made into a DataFrame.
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
• From a dictionary of Series objects, as with result above:
DS FDS
sai 90 91
ram 85 95
kasim 92 89
tamil 89 90
• From a two-dimensional NumPy array, with specified column and index names:
pd.DataFrame(np.random.rand(3, 2), columns=['food', 'water'], index=['a', 'b', 'c'])
       food     water
a  0.865257  0.213169
b  0.442759  0.108267
c  0.047110  0.905718
• From a NumPy structured array:
A = np.zeros(3, dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
The Pandas Index object
The Index object can be thought of as an immutable array; it supports indexing and slicing like an array:
ind = pd.Index([2, 3, 5, 7, 11])
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
• Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
We can examine the keys/indices and values using dictionary-like expressions and methods:
i. 'a' in data
True
ii. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
iii. list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can even be modified with a dictionary-like syntax; we can extend the Series by assigning to a new index value:
data['e'] = 1.25
data
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
Slicing
• Slicing by explicit index (the final index is included):
data['a':'c']
a 0.25
b 0.50
c 0.75
dtype: float64
• Slicing by implicit integer index (the final index is excluded):
data[0:2]
a 0.25
b 0.50
dtype: float64
Masking
data[(data > 0.3) & (data < 0.8)]
b 0.50
c 0.75
dtype: float64
Fancy indexing
data[['a', 'e']]
a 0.25
e 1.25
dtype: float64
Indexers: loc and iloc
Consider a Series with an explicit integer index:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
loc - the loc attribute allows indexing and slicing that always references the explicit index.
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
dtype: object
iloc - The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
dtype: object
ix - ix is a hybrid of the two, and for Series objects is equivalent to standard [ ]-based indexing. (Note that ix has been deprecated and removed in recent Pandas versions.)
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing
of the column name.
result['DS']
DS
sai 90
ram 85
kasim 92
tamil 89
Equivalently, we can use attribute-style access with column names that are strings:
result.DS
DS
sai 90
ram 85
kasim 92
tamil 89
result.DS is result['DS']
True
Modify the object
Like with the Series objects this dictionary-style syntax can also be used to modify the object, in this case to add
a new column:
result['TOTAL'] = result['DS'] + result['FDS']
result
DS FDS TOTAL
sai 90 91 181
ram 85 95 180
kasim 92 89 181
tamil 89 90 179
Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the
underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index
and column labels are maintained in the result
• loc
result.loc[:'ram', :'FDS']
DS FDS
sai 90 91
ram 85 95
• iloc
result.iloc[:2, :2]
DS FDS
sai 90 91
ram 85 95
• ix
result.ix[:2, :'FDS']
DS FDS
sai 90 91
ram 85 95
Masking and fancy indexing can be combined in the loc indexer; for example, selecting just the DS and FDS columns for students whose DS mark is above 89:
result.loc[result['DS'] > 89, ['DS', 'FDS']]
DS FDS
sai 90 91
kasim 92 89
• Modifying values
Indexing conventions may also be used to set or modify values; this is done in the standard way that
you might be accustomed to from working with NumPy.
result.iloc[1,1] =70
DS FDS TOTAL
sai 90 91 181
ram 85 70 180
kasim 92 89 181
tamil 89 90 179
result['sai':'kasim']
DS FDS TOTAL
sai 90 91 181
ram 85 70 180
kasim 92 89 181
Such slices can also refer to rows by number rather than by index:
result[1:3]
DS FDS TOTAL
ram 85 70 180
kasim 92 89 181
Direct masking operations are also interpreted row-wise rather than column-wise:
result[result['DS'] > 89]
DS FDS TOTAL
sai 90 91 181
kasim 92 89 181
For binary operations such as addition and multiplication, Pandas will automatically align indices when passing
the objects to the ufunc.
Here we are going to see how the universal functions are working in series and DataFrames by
• Index preservation • Index alignment
Index Preservation
Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.
We can use all arithmetic and special universal functions as in NumPy on pandas. In outputs the index will
preserved (maintained) as shown below.
For Series
x = pd.Series([1, 2, 3, 4])
x
0 1
1 2
2 3
3 4
dtype: int64
For DataFrames
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
columns=['a', 'b', 'c', 'd'])
df
   a  b  c  d
0  1  4  1  4
1  8  4  0  4
2  7  7  7  2
If we apply a NumPy ufunc to one of these objects, the result is another Pandas object with the index preserved; for example, applying np.exp to a Series of integers:
0 8103.083928
1 54.598150
2 403.428793
3 20.085537
dtype: float64
Index Alignment
Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data, as we'll see.
Index alignment in Series: suppose we are combining two different data sources; the indices will be aligned accordingly.
x = pd.Series([2, 4, 6], index=[1, 3, 5])
y = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])
x + y
1 3.0
2 NaN
3 9.0
4 NaN
5 NaN
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we could determine using standard Python set arithmetic on these indices.
Any item for which one or the other does not have an entry is marked with NaN, or “Not a Number,” which is
how Pandas marks as missing data.
x.add(y, fill_value=0)
1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64
Index alignment in DataFrames: a similar type of alignment takes place for both columns and indices when performing operations on DataFrames.
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A
   A   B
0  1  11
1  5   1
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6
A+B
A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are
sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any desired
fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A:
fill = A.stack().mean()
A.add(B, fill_value=fill)
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
When you are performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.
A = rng.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
A - A[0]
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing
integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a
missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point
specification.
NumPy supports fourteen basic integer types once you account for available precisions, signedness, and
endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an
unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new
fork of the NumPy package.
Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
If we create an array containing the Python None object, NumPy falls back to an object array:
vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
NaN, on the other hand, is a special floating-point value, so an array containing it keeps a native floating-point dtype:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
1 + np.nan
nan
0 * np.nan
nan
x = pd.Series(range(2), dtype=int)
x
0 0
1 1
dtype: int64
x[0] = None
x
0 NaN
1 1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a
NaN value.
Detecting null values: isnull() and notnull()
Consider the following Series:
data = pd.Series([1, np.nan, 'hello', None])
isnull()
data.isnull()
0 False
1 True
2 False
3 True
dtype: bool
notnull()
data.notnull()
0 True
1 False
2 True
3 False
dtype: bool
dropna()
data.dropna()
0 1
2 hello
dtype: object
For a DataFrame, there are more options. Consider the following DataFrame:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
df.dropna()
     0    1  2
1  2.0  3.0  5
df.dropna(axis='columns')
   2
0  2
1  5
2  6
Rows or columns having all null values
You can also specify how='all', which will only drop rows/columns that are all null values.
df[3] = np.nan
df
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
df.dropna(axis='columns', how='all')
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
For finer-grained control, the thresh parameter specifies a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
     0    1  2    3
1  2.0  3.0  5  NaN
Filling null values
Consider the following Series:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
Fill with a single value
We can fill NA entries with a single value, such as zero:
data.fillna(0)
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
Fill with previous value
We can specify a forward-fill to propagate the previous value forward:
data.fillna(method='ffill')
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
Fill with next value
We can specify a back-fill to propagate the next values backward:
data.fillna(method='bfill')
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
Hierarchical Indexing
Up to this point we’ve been focused primarily on one-dimensional and twodimensional data, stored in Pandas
Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional
data—that is, data indexed by more than one or two keys.
Pandas does provide Panel and Panel4D objects that natively handle three-dimensional and four-dimensional data, but a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series
and two-dimensional DataFrame objects.
Here we’ll explore the direct creation of MultiIndex objects; considerations around indexing, slicing, and
computing statistics across multiply indexed data; and useful routines for converting between simple and
hierarchically indexed representations of your data.
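The multiply indexed Series pop used throughout the following examples is never constructed in these notes; a minimal sketch consistent with the state-population values shown below is:
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)
print(pop)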
With this multiply indexed Series we can use a partial slice to access all data for which the second index is 2010:
pop[:, 2010]
state
California 37253956
New York 19378102
Texas 25145561
dtype: int64
MultiIndex as extra dimension
We could easily have stored the same data using a simple DataFrame with index and column labels. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
pop_df = pop.unstack()
pop_df
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
Each extra level in a MultiIndex represents an extra dimension of data; for example, we might also want to track the population under 18 for each state and year, which with a MultiIndex is as easy as adding another column:
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
pop_df
total under18
California 2000 33871648 9267089
2010 37253956 9284094
New York 2000 18976457 4687374
2010 19378102 4318033
Texas 2000 20851820 5906301
2010 25145561 6879014
Universal functions
All the ufuncs and other functionality work with hierarchical indices.
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
2000 2010
California 0.273594 0.249211
New York 0.247010 0.222831
Texas 0.283251 0.273568
Methods of Multi Index Creation
To construct a multiply indexed Series or DataFrame, simply pass a list of two or more index arrays to the constructor:
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('New York', 2000): 18976457,
('New York', 2010): 19378102,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561}
pd.Series(data)
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Indexing and Slicing a MultiIndex
• Indexing with multiple terms
We can access single elements by indexing with multiple terms:
pop['California', 2000]
33871648
• Partial indexing
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index:
pop['California']
year
2000 33871648
2010 37253956
dtype: int64
• Partial slicing
Partial slicing is available as well, as long as the MultiIndex is sorted.
pop.loc['California':'New York']
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
• Sorted indices
With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first
index
pop[:, 2000]
state
California 33871648
New York 18976457
Texas 20851820
dtype: int64
• Selection based on Boolean masks
Other types of indexing and selection work as well; for example, selection based on a Boolean mask:
pop[pop > 22000000]
state year
California 2000 33871648
2010 37253956
Texas 2010 25145561
dtype: int64
• Selection based on fancy indexing
pop[['California', 'Texas']]
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
Rearranging Multi-Indices
We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.
• Sorted and unsorted indices
We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted (a sketch of creating such data follows). Pandas provides a number of convenience routines to perform this type of sorting; examples are the sort_index() and sortlevel() methods of the DataFrame. We'll use the simplest, sort_index(), here:
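The Series data shown next is not constructed in these notes; a sketch of how such unsorted multiply indexed data might be created (the random values will of course differ from the ones printed) is:
import numpy as np
import pandas as pd

# 'a', 'c', 'b' is deliberately out of order, so the index is not lexicographically sorted
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
# a slice such as data['a':'b'] raises an error until the index is sorted,
# e.g. with data = data.sort_index()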
data = data.sort_index()
data
char  int
a     1      0.003001
      2      0.164974
b     1      0.001693
      2      0.526226
c     1      0.741650
      2      0.569264
dtype: float64
With the index sorted in this way, partial slicing will work as expected:
data['a':'b']
char  int
a     1      0.003001
      2      0.164974
b     1      0.001693
      2      0.526226
dtype: float64
• Stacking and unstacking indices
It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
pop.unstack(level=0)
pop.unstack(level=1)
The opposite of unstack() is stack(), which here can be used to recover the original series:
pop.unstack().stack()
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
• Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with
the reset_index method. Calling this on the population dictionary will result in a DataFrame with a state and year
column holding the information that was formerly in the index. For clarity, we can optionally specify the name
of the data for the column representation.
pop_flat = pop.reset_index(name='population')
pop_flat
Simple concatenation with pd.concat
pd.concat() can be used for a simple concatenation of Series or DataFrame objects:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices,
even if the result will have duplicate indices! Consider this simple example.
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make the indices of y duplicate those of x
print(x)
    A   B
0  A0  B0
1  A1  B1
print(y)
    A   B
0  A2  B2
1  A3  B3
print(pd.concat([x, y]))
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3
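The helper make_df used above is not defined in these notes; a minimal version consistent with the outputs shown (an assumption on my part) would be:
import pandas as pd

def make_df(cols, ind):
    # quickly build an example DataFrame whose cells combine column letter and row index
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

# e.g. make_df('AB', [0, 1]) has cells A0, A1 in column A and B0, B1 in column B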
The same is true for DataFrames; concatenating two DataFrames with non-overlapping indices simply stacks them:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1)
    A   B
1  A1  B1
2  A2  B2
print(df2)
    A   B
3  A3  B3
4  A4  B4
print(pd.concat([df1, df2]))
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
Categories of Joins
• One-to-one joins
• Many-to-one joins
• Many-to-many joins
One-to-one joins
The simplest type of merge is the one-to-one join. Consider the following two DataFrames:
print(df1)
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
print(df2)
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014
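The notes only show the printed frames; a construction consistent with them (an assumption, since the original definitions are not included) would be:
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})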
To combine this information into a single DataFrame, we can use the pd.merge() function:
df3 = pd.merge(df1, df2)
df3
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4)
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both
the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most
clear with a concrete example.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
pd.merge(df1, df5)
Aggregation and Grouping
For a Pandas Series, aggregates return a single value. Consider the following Series of random values:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
Sum
ser.sum()
2.8119254917081569
Mean
ser.mean()
0.56238509834163142
The same operations can also be performed on a DataFrame.
GroupBy: Split, Apply, Combine
The GroupBy operation proceeds in three steps:
• The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the
individual groups.
• The combine step merges the results of these operations into an output array.
Example
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
Column indexing.
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:
df = pd.read_csv('D:\iris.csv')
df.groupby('variety')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023BAADE84C0>
Dispatch methods.
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through
and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of
DataFrames to perform a set of aggregations that describe each group in the data.
Example
df.groupby('variety')['petal.length'].describe().unstack()
variety
count Setosa 50.000000
Versicolor 50.000000
Virginica 50.000000
mean Setosa 1.462000
Versicolor 4.260000
Virginica 5.552000
std Setosa 0.173664
Versicolor 0.469911
Virginica 0.551895
min Setosa 1.000000
Versicolor 3.000000
Virginica 4.500000
25% Setosa 1.400000
Versicolor 4.000000
Virginica 5.100000
50% Setosa 1.500000
Versicolor 4.350000
Virginica 5.550000
75% Setosa 1.575000
Versicolor 4.600000
Virginica 5.875000
max Setosa 1.900000
Versicolor 5.100000
Virginica 6.900000
dtype: float64
Filtering.
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep
all groups in which the standard deviation is larger than some critical value.
The filter() function should return a Boolean value specifying whether the group passes the filtering.
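A minimal sketch of filter() (my own example; the DataFrame with data1/data2 columns used in the notes is not defined here, so this builds a small stand-in):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

def filter_func(x):
    # keep only groups whose data2 standard deviation exceeds 4
    return x['data2'].std() > 4

print(df.groupby('key').filter(filter_func))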
Transformation.
While aggregation must return a reduced version of the data, transformation can return some transformed version
of the full data to recombine. For such a transformation, the output is the same shape as the input. A common
example is to center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
data1 data2
0 -1.5 1.0
1 -1.5 -3.5
2 -1.5 -3.0
3 1.5 -1.0
4 1.5 3.5
5 1.5 3.0
Pivot Tables
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
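A minimal pivot-table sketch on a small made-up DataFrame (the medical dataset behind the table below is not included in the notes, so the column names and values here are hypothetical):
import pandas as pd

df = pd.DataFrame({'age': [23, 23, 45, 45, 61],
                   'class': ['tested_negative', 'tested_positive',
                             'tested_negative', 'tested_positive',
                             'tested_negative'],
                   'visits': [1, 2, 7, 8, 6]})

# mean number of visits for each age, split by test result
print(df.pivot_table('visits', index='age', columns='class', aggfunc='mean'))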
For example, a pivot table of test outcomes (columns) against age (rows) might look like this:
class  tested_negative  tested_positive
age
63     5.500000         NaN
28     3.440000         2.000000
61     7.000000         4.000000
69     5.000000         NaN
45     7.285714         7.375000
62     6.500000         1.000000
53     2.000000         6.250000
68     8.000000         NaN
23     1.516129         1.857143
52     13.000000        3.428571
Short assignment
linestyle='-'   # solid
linestyle='--'  # dashed
linestyle='-.'  # dashdot
linestyle=':'   # dotted
• linestyle and color codes can be combined into a single non-keyword argument to the plt.plot() function:
plt.plot(x, x + 0, '-g')   # solid green
plt.plot(x, x + 1, '--c')  # dashed cyan
plt.plot(x, x + 2, '-.k')  # dashdot black
plt.plot(x, x + 3, ':r')   # dotted red
Axes Limits
• The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods
Example
plt.xlim(10, 0)
plt.ylim(1.2, -1.2)
(Passing the limits in reverse order, as here, displays that axis in reverse.)
• The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
• Aspect ratio equal is used to represent one unit in x is equal to one unit in y. plt.axis('equal')
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title  - plt.title()
Labels - plt.xlabel(), plt.ylabel()
Legend - plt.legend()
Example programs
Line color
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x))
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse')  # all HTML color names supported
Line style
import matplotlib.pyplot as plt
import numpy as np fig = plt.figure() ax =
plt.axes() x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid') plt.plot(x,
x + 1, linestyle='dashed') plt.plot(x, x + 2,
linestyle='dashdot') plt.plot(x, x + 3,
linestyle='dotted'); # For short, you can use
the following codes:
plt.plot(x, x + 4, linestyle='-') # solid plt.plot(x,
x + 5, linestyle='--') # dashed plt.plot(x, x + 6,
linestyle='-.') # dashdot plt.plot(x, x + 7,
linestyle=':'); # dotted
Example
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2)
Example programs
Scatter plot with edge color, face color, size, and width of marker (scatter plot with line), as sketched below.
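A minimal scatter-plot sketch along those lines (my own example; the data and parameter values are arbitrary):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis', edgecolor='gray', linewidth=1)
plt.colorbar()  # show the color scale
plt.show()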
Visualizing Errors
For any scientific measurement, accurate accounting for
errors is nearly as important, if not more important, than
accurate reporting of the number itself. For example,
imagine that I am using some astrophysical observations to
estimate the Hubble Constant, the local measurement of the
expansion rate of the Universe.
In visualization of data and results, showing these errors
effectively can make a plot convey much more complete
information.
Types of errors
• Basic Errorbars
• Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call:
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-whitegrid')

x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k')
• Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as the
shorthand used in plt.plot()
• In addition to these basic options, the errorbar function has many options to fine tune the outputs. Using these
additional options you can easily customize the aesthetics of your errorbar plot.
Continuous Errors
• In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not have
a built-in convenience routine for this type of application, it’s relatively easy to combine primitives like
plt.plot and plt.fill_between for a useful result.
• Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a method
of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty.
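A minimal sketch of the plt.plot + plt.fill_between pattern for a continuous error band. Instead of the Scikit-Learn GPR mentioned above, this uses a simple moving-average "fit" with an assumed constant uncertainty, purely for illustration:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
y_noisy = np.sin(x) + 0.3 * np.random.randn(len(x))

# stand-in model: smooth the noisy data with a running mean
window = 15
kernel = np.ones(window) / window
y_fit = np.convolve(y_noisy, kernel, mode='same')
err = 0.3 * np.ones_like(x)          # assumed 1-sigma uncertainty

plt.plot(x, y_noisy, '.', color='gray', alpha=0.5, label='data')
plt.plot(x, y_fit, '-', color='C0', label='fit')
plt.fill_between(x, y_fit - 2 * err, y_fit + 2 * err,
                 color='C0', alpha=0.2, label='95% band')
plt.legend()
plt.show()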
Density and contour plots
Matplotlib provides three functions for displaying three-dimensional data in two dimensions:
• plt.contour for contour plots,
• plt.contourf for filled contour plots, and
• plt.imshow for showing images.
• Notice that by default when a single color is used, negative values are represented by dashed lines, and
positive values by solid lines.
• Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
• We’ll also specify that we want more lines to be drawn—20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
• One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather than
continuous, which is not always what is desired.
• You could remedy this by setting the number of contours to a very high number, but this results in a rather
inefficient plot: Matplotlib must render a new polygon for each step in the level.
• A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of
data as an image.
Parameters
• plt.hist( ) is used to plot histogram. The hist() function will use an array of numbers to create a histogram, the
array is sent into the function as an argument.
• bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted as
a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called
"intervals", "classes", or "buckets".
• normed - Histogram normalization is a technique to distribute the frequencies of the histogram over a wider range than the current range. (In newer Matplotlib versions this parameter has been replaced by density.)
• x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence of arrays
which are not required to be of the same length.
• histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional The type of histogram to draw.
• 'bar' is a traditional bar-type histogram. If multiple data are given the bars are arranged side by side.
• 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
• 'step' generates a lineplot that is by default unfilled.
• 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'
• align - {'left', 'mid', 'right'}, optional. Controls how the histogram is plotted:
• 'left': bars are centered on the left bin edges.
• 'mid': bars are centered between the bin edges.
• 'right': bars are centered on the right bin edges. Default is 'mid'.
• orientation - {'horizontal', 'vertical'}, optional
If 'horizontal', barh will be used for bar-type histograms and the bottom kwarg will be the left edges.
• color - color or array_like of colors or None, optional
Color spec or sequence of color specs, one per dataset. Default (None) uses the standard line color sequence.
Default is None
• label - str or None, optional. Default is None
Other parameter
• **kwargs - Patch properties, it allows us to pass a
variable number of keyword arguments to a python
function. ** denotes this type of function.
The hist() function has many options to tune both the calculation and the display; here’s an example of a more
customized histogram.
plt.hist(data, bins=30, alpha=0.5,histtype='stepfilled', color='steelblue',edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination of
histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of several
distributions
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw how
to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the legend in
Matplotlib.
Number of columns - we can use the ncol command to specify the number of columns in the legend:
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
Multiple legends
It is only possible to create a single legend for the entire plot. If you
try to create a second legend using plt.legend() or ax.legend(), it will
simply override the first one. We can work around this by creating a
new legend artist from scratch, and then using the lower-level
ax.add_artist() method to manually add the second artist to the plot
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np

x = np.linspace(0, 10, 1000)
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig
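The snippet above is incomplete (fig and ax are never created). A minimal self-contained sketch of the second-legend technique described above, building a Legend artist by hand and adding it with ax.add_artist (my own illustration, with hypothetical line labels):
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.legend import Legend

fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)
lines = []
styles = ['-', '--', '-.', ':']
for i in range(4):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), styles[i], color='black')

# first legend: handled by the usual ax.legend machinery
ax.legend(lines[:2], ['line A', 'line B'], loc='upper right', frameon=False)

# second legend: create a Legend artist manually and add it to the axes
leg = Legend(ax, lines[2:], ['line C', 'line D'], loc='lower right', frameon=False)
ax.add_artist(leg)
plt.show()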
Color Bars
In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot. For continuous
labels based on the color of points, lines, or regions, a labeled color bar can be a great tool. The simplest colorbar can
be created with the plt.colorbar() function.
Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to
represent discrete values. The easiest way to do this is to use the
plt.cm.get_cmap() function, and pass the name of a suitable colormap
along with the number of desired bins.
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
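The image I used above is not defined in the notes; a minimal self-contained version of the discrete-colorbar example, assuming a simple 2D test image, would be:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
I = np.sin(x) * np.cos(x[:, np.newaxis])   # assumed test image

plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1)
plt.show()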
Subplots
• Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure.
• These subplots might be insets, grids of plots, or other more complicated layouts.
• We'll explore four routines for creating subplots in Matplotlib:
• plt.axes: Subplots by Hand
• plt.subplot: Simple Grids of Subplots
• plt.subplots: The Whole Grid in One Go
• plt.GridSpec: More Complicated Arrangements
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4], xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4], ylim=(-1.2, 1.2))
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x))
• We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper panel (at position 0.5) matches the top of the lower panel (at position 0.1 + 0.4).
• If the axis value is changed in the second plot, the two plots become separated from each other, for example:
ax2 = fig.add_axes([0.1, 0.01, 0.8, 0.4])
For example, a gridspec for a grid of two rows and three columns with some specified width and height space
looks like this:
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the beginning
of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The transAxes
coordinates give the location from the bottom-left corner of the axes (here the white box) as a fraction of the axes
size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the others
remain stationary.
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page.
• Here we'll show a three-dimensional contour diagram of a three-dimensional sinusoidal function:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
Sometimes the default viewing angle is not optimal, in which case we can
use the view_init method to set the elevation and azimuthal angles.
ax.view_init(60, 35)
fig
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis',
                edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
• For some applications, the evenly sampled grids required by the preceding routines are
overly restrictive and inconvenient.
• In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
Map Projections
The Basemap package implements several dozen such projections, all referenced by a short format code. Here we’ll
briefly demonstrate some of the more common ones.
• Cylindrical projections
• Pseudo-cylindrical projections
• Perspective projections
• Conic projections
Cylindrical projection
• The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude
are mapped to horizontal and vertical lines, respectively.
• This type of mapping represents equatorial regions quite well, but results in extreme distortions near the poles.
• The spacing of latitude lines varies between different cylindrical projections, leading to different conservation
properties, and different distortion near the poles.
• Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area
(projection='cea') projections.
• The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of the lower-
left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180)
draw_map(m)
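The helper draw_map() used in these Basemap examples is not defined in the notes; a minimal version (an assumption, drawing a shaded-relief background plus a latitude/longitude grid) might be:
import numpy as np

def draw_map(m, scale=0.2):
    # project a shaded relief image onto the map
    m.shadedrelief(scale=scale)
    # draw dotted latitude and longitude grid lines every 30 degrees
    m.drawparallels(np.arange(-90, 90, 30), dashes=[1, 2], color='gray')
    m.drawmeridians(np.arange(-180, 180, 30), dashes=[1, 2], color='gray')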
Pseudo-cylindrical projections
• Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical; this can give better properties near the poles of the projection.
• The Mollweide projection (projection='moll') is one common example of this, in which all meridians are
elliptical arcs
• It is constructed so as to preserve area across the map: though there are distortions near the poles, the area of small patches reflects the true area.
• Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson (projection='robin') projections.
• The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0) for the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
            lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
• Perspective projections are constructed using a particular choice of perspective point, similar to if you
photographed the Earth from a particular point in space (a point which, for some projections, technically lies
within the Earth!).
• One common example is the orthographic projection (projection='ortho'), which shows one side of the globe
as seen from a viewer at a very long distance.
• Thus, it can show only half the globe at a time.
• Other perspective-based projections include the
gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
• These are often the most useful for showing small portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=50, lon_0=0)
draw_map(m)
Conic projections
• A conic projection projects the map onto a single cone, which is then unrolled.
• This can lead to very good local properties, but regions far from the focus point of the cone may become very
distorted.
• One example of this is the Lambert conformal conic projection (projection='lcc').
• It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by
lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing outside
of them.
• Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area (projection='aea') projections.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            lon_0=0, lat_0=50, lat_1=45, lat_2=55,
            width=1.6E7, height=1.2E7)
draw_map(m)
• Whole-globe images
bluemarble() - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
• If we pass the full two-dimensional dataset to kdeplot, we will get a two-dimensional visualization of the
data.
• We can see the joint distribution and the marginal distributions together using sns.jointplot.
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for
exploring correlations between multidimensional data, when you’d like to plot all pairs of values against each other.
We'll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', size=2.5)
(In newer Seaborn versions the size argument is named height.)
Faceted histograms
• Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this extremely
simple.
• We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on various
indicator data
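A minimal faceted-histogram sketch with Seaborn's FacetGrid, assuming the built-in tips dataset described above (the notes include no code for it):
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
grid = sns.FacetGrid(tips, row='sex', col='time', margin_titles=True)
grid.map(plt.hist, 'total_bill', bins=15, color='steelblue')
plt.show()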
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a parameter
within bins defined by any other parameter.
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between different
datasets, along with the associated marginal distributions.
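A short sketch of sns.jointplot on the same assumed tips dataset, showing a joint distribution together with its marginals:
import seaborn as sns

tips = sns.load_dataset('tips')
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')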
Bar plots
Time series can be plotted with sns.factorplot (renamed sns.catplot in newer Seaborn versions).