MACHINE LEARNING MASTERY

THE BEGINNER’S GUIDE TO
Data Science

A Journey from Data to Insight with Statistical Techniques

Vinod Chugani

This is Just a Sample
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to
apply ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure that the information within this book was accurate at
the time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written
permission from the author.
Credits
Founder: Jason Brownlee
Lead Editor: Adrian Tam
Author: Vinod Chugani
Technical Reviewers: Yoyo Chan and Kanwal Mehreen
Copyright
The Beginner’s Guide to Data Science
© 2024 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.00
Preface
Data science involves applying scientific techniques to extract insights from data, essentially
telling a story through data analysis. It involves utilizing sophisticated methods like machine
learning models to identify patterns or validate hypotheses. However, cultivating a data
science mindset is the most fundamental and crucial aspect.
This mindset does not rely on complex models; even basic statistical techniques are
sufficient to illustrate a data science workflow. This book operates on this premise: without
delving into advanced machine learning, it teaches how to approach data and substantiate
hypotheses. The essence of data science lies in constructing a compelling narrative supported
by empirical evidence.
In this guide, you will be led through each step, from data manipulation in Python to
employing fundamental statistical functions in NumPy and SciPy to draw insights from the
data. Whether you are new to data science or seeking to enhance your skills, this guide
requires minimal prerequisites yet instills the mindset of an experienced data scientist.
Introduction
Welcome to The Beginner’s Guide to Data Science. This book is a primer to data science but
does not involve any advanced machine learning models. The advanced models are deliberately
avoided because you should focus on what data science is about: telling stories.
Data science can be difficult because there are unlimited tools you can use as long as they
can help you tell a story. As a beginner, you can get lost because you see a lot of statistical
tests, models, equations, and plots being used, and you may feel like you are jumping around.
However, you will find data science is easier to grasp once you focus on the objective rather
than the tool.
Book Organization
This book is in two parts. The first part is about data wrangling, particularly with the pandas
library. This is useful because you often need to process the dataset to create derived data or
filter for useful data. This part includes the following chapters:
1. Revealing the Invisible: Visualizing Missing Values in Ames Housing
2. Exploring Dictionaries, Classifying Variables, and Imputing Data
3. Beyond SQL: Transforming Real Estate Data into Actionable Insights with Pandas
4. Harmonizing Data: A Symphony of Segmenting, Concatenating, Pivoting, and
Merging
These chapters guide you in using Python and the pandas library to manipulate a dataset. This
is a useful and important skill because you should operate on the dataset as a whole rather than
process each value individually. With pandas, you can filter the data or compute an
average with just one line of code.
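For example, a short sketch of such one-line operations might look like this (the exact column
names, such as YrSold and SalePrice, are assumptions for illustration):
import pandas as pd

Ames = pd.read_csv('Ames.csv')
# One line to filter: keep only the homes sold in 2009 or later (assumed column name)
recent_sales = Ames[Ames['YrSold'] >= 2009]
# One line to aggregate: the average sale price across all homes (assumed column name)
print(Ames['SalePrice'].mean())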
The second part of the book assumes you can process the data readily. Then, you will
be shown how to obtain information from the raw data to make a statement.
This part includes the following chapters:
5. Decoding Data: An Introduction to Descriptive Statistics
6. From Data to Map: Visualizing Ames House Prices with Python
7. Feature Relationships 101: Lessons from the Ames Housing Data
8. Mastering Pair Plots for Visualization and Hypothesis Creation
9. Inferential Insights: How Confidence Intervals Illuminate the Ames Real Estate Market
10. Testing Assumptions in Real Estate: A Dive into Hypothesis Testing
11. Garage or Not? Housing Insights Through the Chi-Squared Test
12. Leveraging ANOVA and Kruskal-Wallis Tests to Analyze the Impact of the Great
Recession on Housing Prices
13. Spotting the Exception: Classical Methods for Outlier Detection in Data Science
14. Skewness Be Gone: Transformative Tricks for Data Scientists
15. Finding Value with Data: The Cohesive Force Behind Luxury Real Estate Decisions
All these chapters are based on the same dataset, but each asks a different question. This is
not a simple question like what is the average price of a house, but one like whether having
a garage makes a house more expensive. To answer such a question, you must find the right
mathematical tool, apply the appropriate Python function, and interpret the result correctly.
To tell a story about the data, you focus on the objective, and there can be multiple ways
to achieve that. Therefore, you should consider the methods and models used in each chapter
as examples rather than the only golden rule for the task. The method shown may not be the
best one for a task, but it is a simple one, so you can learn the thought process easily.
Appendix B outlines the tools you might find useful. In Appendix C, you will find a
summary of how you should apply the thought process to a data science project. You may
read the chapters in any order, as your needs or interests motivate you. To get the most from
this book, you should also attempt to improve the results, try a different function or model,
apply the method to a similar but different problem, and so on. You are welcome to share
your findings with us at [email protected].
Next
Let’s dive in. Next up is Part I, where you will take a tour of the pandas library in Python
to wrangle data, an essential skill you will use throughout this book.
1
Revealing the Invisible:
Visualizing Missing Values in
Ames Housing
The digital age has ushered in an era where data-driven decision-making is pivotal in various
domains, real estate being a prime example. Comprehensive datasets, like the Ames housing
dataset, offer a treasure trove for data enthusiasts. Through meticulous exploration and
analysis of such datasets, one can uncover patterns, gain insights, and make informed decisions.
Starting from this chapter, you will embark on a captivating journey through the intricate
lanes of Ames properties, focusing primarily on data science techniques.
Let’s get started.
Overview
This chapter is divided into three parts; they are:
- The Ames Properties Dataset
- Loading & Sizing Up the Dataset
- Uncovering & Visualizing Missing Values
The Ames Housing Dataset was envisioned as a modern alternative to the older Boston
Housing Dataset. Covering residential sales in Ames, Iowa between 2006 and 2010, it presents
a diverse array of variables, setting the stage for advanced regression techniques.
This time frame is particularly significant in U.S. history. The period leading up to
2007–2008 saw the dramatic inflation of housing prices, fueled by speculative frenzy and
subprime mortgages. This culminated in the devastating collapse of the housing bubble
in late 2007, an event vividly captured in narratives like “The Big Short.” The aftermath
of this collapse rippled across the nation, leading to the Great Recession. Housing prices
plummeted, foreclosures skyrocketed, and many Americans found themselves underwater on
their mortgages. The Ames dataset provides a glimpse into this turbulent period, capturing
property sales during national economic upheaval.
Dataset Dimensions. Before diving into intricate analyses, it’s essential to familiarize yourself
with the basic structure and data types of the dataset. This step provides a roadmap for
subsequent exploration and ensures you tailor your analyses based on the nature of data.
With the Python environment in place, let’s load and gauge the dataset in terms of rows
(individual properties) and columns (attributes of these properties).
import pandas as pd
# Load the dataset and print its shape (rows, columns)
Ames = pd.read_csv('Ames.csv')
print(Ames.shape)
This prints:
(2579, 85)
The dataset comprises 2579 properties described across 85 attributes.
Data Types. Recognizing the data type of each attribute helps shape our analysis approach.
Numerical attributes might be summarized using measures like mean or median, while mode
(the most frequently seen value) is apt for categorical attributes.
import pandas as pd
Ames = pd.read_csv('Ames.csv')
# Count how many columns there are of each data type
print(Ames.dtypes.value_counts())
This shows:
object 44
int64 27
float64 14
dtype: int64
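To illustrate the kind of summaries mentioned above, a short sketch follows; it uses the
SalePrice and SaleCondition columns, which appear in the data type listing shown below:
import pandas as pd

Ames = pd.read_csv('Ames.csv')
# A numerical attribute summarized by its mean and median
print(Ames['SalePrice'].mean(), Ames['SalePrice'].median())
# A categorical attribute summarized by its mode (most frequent value)
print(Ames['SaleCondition'].mode()[0])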
The Data Dictionary. Each dataset usually comes with a data dictionary to describe each of
its features. It specifies the meaning, possible values, and even the logic behind its collection.
Whether you’re deciphering the meaning behind an unfamiliar feature or discerning the
significance of particular values, the data dictionary serves as a comprehensive guide. It
bridges the gap between raw data and actionable insights, ensuring that the analyses and
decisions are well-informed.
import pandas as pd
Ames = pd.read_csv('Ames.csv')
# View a few datatypes from the dataset (first and last 5 features)
data_types = Ames.dtypes
print(data_types)
This shows, for example, Ground Living Area and Sale Price are numerical (int64) data types,
while Sale Condition (object) is a categorical data type:
PID int64
GrLivArea int64
SalePrice int64
MSSubClass int64
MSZoning object
...
SaleCondition object
GeoRefNo float64
Prop_Addr object
Latitude float64
Longitude float64
Length: 85, dtype: object
and you can match that with the data dictionary, an excerpt of which follows:
Alloca   Allocation - two linked properties with separate deeds, typically
         condo with a garage unit
Family   Sale between family members
Partial  Home was not completed when last assessed (associated with New Homes)
Output 1.4: Excerpt from the data dictionary showing that “SaleCondition” has string
values and “SalePrice” is numerical
NaN or None? In pandas, the isnull() function is used to detect missing values in a
DataFrame or Series. Specifically, it identifies the following types of missing data:
- np.nan (Not a Number), often used to denote missing numerical data
- None, which is a built-in object in Python to denote the absence of a value or a null
value
Both nan and NaN are just different ways to refer to np.nan in NumPy, and isnull() will
identify them as missing values. Here is a quick example.
import pandas as pd
import numpy as np
# A small DataFrame with np.nan and None values to illustrate isnull()
df = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': ['a', 'b', None, 'd', 'e'],
                   'C': [np.nan] * 5, 'D': [1, 2, 3, 4, 5]})
missing_data = df.isnull().sum()
print(df)
print()
print(missing_data)
This prints:
A B C D
0 1.0 a NaN 1
1 2.0 b NaN 2
2 NaN None NaN 3
3 4.0 d NaN 4
4 5.0 e NaN 5
A 1
B 1
C 5
D 0
dtype: int64
Visualizing Missing Values. When it comes to visualizing missing data, tools like missingno,
matplotlib, and seaborn come in handy. By sorting the features based on the percentage of
missing values and placing them into a DataFrame, you can easily rank the features most
affected by missing data.
import pandas as pd

Ames = pd.read_csv('Ames.csv')
missing_data = Ames.isnull().sum()
missing_percentage = (missing_data / len(Ames)) * 100
# Combining the counts and percentages into a DataFrame for better visualization
missing_info = pd.DataFrame({'Missing Values': missing_data,
                             'Percentage': missing_percentage})
print(missing_info.sort_values(by='Percentage', ascending=False))
This shows each feature’s count and percentage of missing values, ranked from most to least affected.
The missingno package facilitates a swift, graphical representation of missing data. The white
lines or gaps in the visualization denote missing values. However, it will only accommodate
up to 50 labeled variables. Past that range, labels begin to overlap or become unreadable,
and by default large displays omit them.
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
Ames = pd.read_csv('Ames.csv')
msno.matrix(Ames, sparkline=False, fontsize=20)
plt.show()
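If a dataset has more than 50 columns, one simple workaround, sketched here, is to pass only
the columns that actually contain missing values:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

Ames = pd.read_csv('Ames.csv')
# Keep only the columns that have at least one missing value before plotting
cols_with_na = Ames.columns[Ames.isnull().any()]
msno.matrix(Ames[cols_with_na], sparkline=False, fontsize=20)
plt.show()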
You can get a bar chart using msno.bar to show the 15 features with the least count of
non-missing values, i.e., those with the most missing values.
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

Ames = pd.read_csv('Ames.csv')
missing_data = Ames.isnull().sum()
missing_percentage = (missing_data / len(Ames)) * 100
# Combining the counts and percentages into a DataFrame for better visualization
missing_info = pd.DataFrame({'Missing Values': missing_data,
                             'Percentage': missing_percentage})
# Bar chart of the 15 features with the most missing values
top_15_columns = missing_info.nlargest(15, 'Percentage').index
msno.bar(Ames[top_15_columns], figsize=(10, 6), fontsize=12)
plt.show()
The illustration above shows that Pool Quality, Miscellaneous Feature, and the type of Alley
access to the property are the three features with the most missing values, since their bars
are the shortest.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Ames = pd.read_csv('Ames.csv')
missing_data = Ames.isnull().sum()
missing_percentage = (missing_data / len(Ames)) * 100
missing_info = pd.DataFrame({'Missing Values': missing_data,
                             'Percentage': missing_percentage})
# Filter to show only the top 15 columns with the most missing values
top_15_missing_info = missing_info.nlargest(15, 'Percentage')

# Horizontal bar plot of the percentage of missing values per feature
plt.figure(figsize=(10, 6))
sns.barplot(x=top_15_missing_info['Percentage'], y=top_15_missing_info.index,
            color='steelblue')
plt.xlabel('Percentage of Missing Values')
plt.ylabel('Feature')
plt.show()
Listing 1.8: Showing top columns with the highest percentage of missing values
Figure 1.3: Using seaborn horizontal bar plots to visualize missing data.
A horizontal bar plot using seaborn lists the features with the most missing values along the
vertical axis, adding both readability and aesthetic value.
Handling missing values is more than just a technical requirement; it’s a significant step
that can influence the quality of your machine learning models. Understanding and visualizing
these missing values are the first steps in this intricate dance.
1.4 Further Reading
Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of
Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf
Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
1.5 Summary
In this chapter, you embarked on an exploration of the Ames Properties dataset, a
comprehensive collection of housing data tailored for data science applications.
Specifically, you learned:
- About the context of the Ames dataset, including the pioneers and academic
importance behind it.
- How to extract dataset dimensions, data types, and missing values.
- How to use packages like missingno, Matplotlib, and seaborn to quickly visualize
your missing data.
As you learn about missing data, you will see how you can fill in the missing values in the
next chapter.
4
Harmonizing Data: A Symphony
of Segmenting, Concatenating,
Pivoting, and Merging
In a data science project, the data you collect is often not in the shape that you want it to be.
Often you will need to create derived features, aggregate subsets of data into a summarized
form, or select a portion of the data according to some complex logic. This is not a hypothetical
situation. In a project big or small, the data you obtained at the first step is very likely far
from ideal.
As a data scientist, you must be adept at shaping the data into the right form to make
your subsequent steps easier. In the following, you will learn how to slice and dice the dataset
in pandas, as well as reassemble it into a very different form that makes the useful data more
pronounced, so that analysis becomes easier.
Let’s get started.
Overview
This chapter is divided into two parts; they are:
- Segmenting and Concatenating: Choreographing with Pandas
- Pivoting and Merging: Dancing with Pandas
import pandas as pd

Ames = pd.read_csv('Ames.csv')

# Split SalePrice at its quartiles and label each home with a price segment
quantiles = Ames['SalePrice'].quantile([0.25, 0.5, 0.75])

def categorize_by_price(row):
    if row['SalePrice'] <= quantiles.iloc[0]:
        return 'Low'
    elif row['SalePrice'] <= quantiles.iloc[1]:
        return 'Medium'
    elif row['SalePrice'] <= quantiles.iloc[2]:
        return 'High'
    else:
        return 'Premium'

Ames['Price_Category'] = Ames.apply(categorize_by_price, axis=1)
print(Ames[['SalePrice', 'Price_Category']])
By executing the above code, you have enriched your dataset with a new column entitled
“Price_Category.” Here’s a glimpse of the output you’ve obtained:
SalePrice Price_Category
0 126000 Low
1 139500 Medium
2 124900 Low
3 114000 Low
4 227000 Premium
... ... ...
2574 121000 Low
2575 139600 Medium
2576 145000 Medium
2577 217500 Premium
2578 215000 Premium
Visualizing the distribution of construction years within each price category will
help you understand at a glance the historical trends in property construction as they relate
to pricing.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def categorize_by_price(row):
    if row['SalePrice'] <= quantiles.iloc[0]:
        return 'Low'
    elif row['SalePrice'] <= quantiles.iloc[1]:
        return 'Medium'
    elif row['SalePrice'] <= quantiles.iloc[2]:
        return 'High'
    else:
        return 'Premium'

Ames = pd.read_csv('Ames.csv')
quantiles = Ames['SalePrice'].quantile([0.25, 0.5, 0.75])
Ames['Price_Category'] = Ames.apply(categorize_by_price, axis=1)

# Create a figure
plt.figure(figsize=(10, 6))
# Plot the ECDF of the construction year for each price category
# ('YearBuilt' is assumed to be the column holding the year of construction)
sns.ecdfplot(data=Ames, x='YearBuilt', hue='Price_Category')
plt.show()
Below is the ECDF plot, which provides a visual summary of the data you’ve categorized.
An ECDF, or Empirical Cumulative Distribution Function, is a statistical tool used to describe
the distribution of data points in a dataset. It represents the proportion or percentage of data
points that fall below or at a certain value. Essentially, it gives you a way to visualize the
distribution of data points across different values, providing insights into the shape, spread,
and central tendency of the data. ECDF plots are particularly useful because they allow for
easy comparison between different datasets. Notice how the curves for each price category
give you a narrative of housing trends over the years.
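To make this concrete, here is a small sketch that computes an ECDF by hand: after sorting
the values, the ECDF at each value is simply the fraction of observations at or below it (the
column name YearBuilt is assumed, as above):
import numpy as np
import pandas as pd

Ames = pd.read_csv('Ames.csv')
years = np.sort(Ames['YearBuilt'].to_numpy())   # assumed column name

# ECDF: for each sorted value, the proportion of observations at or below it
ecdf = np.arange(1, len(years) + 1) / len(years)

# Equivalently, the ECDF evaluated at a single point, e.g. the year 1950
print(np.mean(years <= 1950))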
From the plot, it is evident that lower and medium-priced homes have a higher frequency of
being built in earlier years, while high and premium-priced homes tend to be of more recent
construction.
...
# Stacking Low and Medium categories into an "affordable_homes" DataFrame
affordable_homes = pd.concat([low_priced_homes, medium_priced_homes])
Through this, you can compare and analyze the characteristics that differentiate more accessible
homes from their expensive counterparts. Should the two DataFrames you pass to concat()
have different columns, the resulting DataFrame will have the superset of columns from both,
with NaN filled in for the columns that a row’s original DataFrame did not have.
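Here is a tiny standalone sketch of this behavior using two made-up toy DataFrames rather
than the housing data:
import pandas as pd

# Two toy DataFrames with only partially overlapping columns
left = pd.DataFrame({'SalePrice': [126000, 139500], 'GrLivArea': [1100, 1300]})
right = pd.DataFrame({'SalePrice': [227000], 'Fireplaces': [2]})

# The result has the superset of columns; missing cells are filled with NaN
combined = pd.concat([left, right], ignore_index=True)
print(combined)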
...
# Creating pivot tables with both mean living area and home count
aggfunc = {'GrLivArea': 'mean', 'Fireplaces': 'count'}
pivot_affordable = affordable_homes.pivot_table(index='Fireplaces', aggfunc=aggfunc)
pivot_luxury = luxury_homes.pivot_table(index='Fireplaces', aggfunc=aggfunc)

# Rename the aggregated columns and the index for readability
rename_rules = {'Fireplaces': 'HmCount', 'GrLivArea': 'AvLivArea'}
pivot_affordable.rename(columns=rename_rules, inplace=True)
pivot_affordable.index.name = 'Fire'
pivot_luxury.rename(columns=rename_rules, inplace=True)
pivot_luxury.index.name = 'Fire'
Listing 4.4: Using pivot table to summarize the number of homes and the average
area
With these pivot tables, you can now easily visualize and compare how features like fireplaces
correlate with the living area and how frequently they occur within each segment. The first
pivot table was crafted from the “affordable” homes DataFrame and demonstrates that most
properties within this grouping do not have any fireplaces.
HmCount AvLivArea
Fire
0 931 1159.050483
1 323 1296.808050
2 38 1379.947368
The second pivot table, which was derived from the “luxury” homes DataFrame, illustrates
that properties within this subset have a range of zero to four fireplaces, with one fireplace
being the most common.
HmCount AvLivArea
Fire
0 310 1560.987097
1 808 1805.243812
2 157 1998.248408
3 11 2088.090909
4 1 2646.000000
With the creation of the pivot tables, you’ve distilled the data into a form that’s ripe for the
next analytical step—melding these insights using Pandas.merge() to see how these features
interplay across the broader market.
The pivot table above is the simplest one. A more advanced version allows you to
specify not only the index but also the columns argument. The idea is similar: you pick
two columns, passing one as index and the other as columns; the distinct values of these two
columns form the rows and columns of a matrix, and each cell holds the aggregate specified
by the aggfunc argument.
You can consider the following example, which produces a similar result as above:
...
pivot = Ames.pivot_table(index="Fireplaces",
columns="Price_Category",
aggfunc={'GrLivArea':'mean', 'Fireplaces':'count'})
print(pivot)
This prints:
Fireplaces GrLivArea
Price_Category High Low Medium Premium High Low Medium Premium
Fireplaces
0 228.0 520.0 411.0 82.0 1511.912281 1081.496154 1257.172749 1697.439024
1 357.0 116.0 207.0 451.0 1580.644258 1184.112069 1359.961353 1983.031042
2 52.0 9.0 29.0 105.0 1627.384615 1184.888889 1440.482759 2181.914286
3 5.0 NaN NaN 6.0 1834.600000 NaN NaN 2299.333333
4 NaN NaN NaN 1.0 NaN NaN NaN 2646.000000
Output 4.4: The number of homes and average living area by the number of fireplaces
and price category
You can see the result is the same by comparing, for example, the counts of low- and
medium-priced homes with zero fireplaces, 520 and 411 respectively, which sum to the 931 you
obtained previously. The second-level columns are labeled Low, Medium, High, and
Premium because you specified “Price_Category” as the columns argument in pivot_table(). The
keys of the dictionary passed to the aggfunc argument give the top-level columns.
...
pivot_outer_join = pd.merge(pivot_affordable, pivot_luxury, on='Fire', how='outer',
suffixes=('_aff', '_lux')).fillna(0)
print(pivot_outer_join)
Output 4.5: Count of homes and average living area by number of fireplaces
In this case, the outer join functions similarly to a right join, capturing every distinct category
of fireplaces present across both market segments. It is interesting to note that there are no
properties within the affordable price range that have 3 or 4 fireplaces. You need to specify
two strings for the suffixes argument because the “HmCount” and “AvLivArea” columns exist
in both DataFrames pivot_affordable and pivot_luxury. You see “HmCount_aff” is zero for 3
and 4 fireplaces because those rows exist only in pivot_luxury; the outer join creates placeholder
rows for them, and fillna(0) replaces the resulting NaN values with zero.
Next, you can use the inner join, focusing on the intersection where affordable and luxury
homes share the same number of fireplaces. This approach highlights the core similarities
between the two segments.
...
pivot_inner_join = pd.merge(pivot_affordable, pivot_luxury, on='Fire', how='inner',
suffixes=('_aff', '_lux'))
print(pivot_inner_join)
Interestingly, in this context, the inner join mirrors the functionality of a left join, showcasing
categories present in both datasets. You do not see the rows corresponding to 3 and 4
fireplaces because it is the result of an inner join, and there are no such rows in the DataFrame
pivot_affordable.
Lastly, a cross join allows you to examine every possible combination of affordable and
luxury home attributes, offering a comprehensive view of how different features interact across
the entire dataset. The result is sometimes called the Cartesian product of rows from the two
DataFrames.
...
# Resetting index to display cross join
pivot_affordable.reset_index(inplace=True)
pivot_luxury.reset_index(inplace=True)
# Every pairing of rows from the two segments (the Cartesian product)
pivot_cross_join = pd.merge(pivot_affordable, pivot_luxury, how='cross',
                            suffixes=('_aff', '_lux'))
print(pivot_cross_join)
The result demonstrates the cross join but does not provide any special insight in the context
of this dataset.
4.3 Further Reading
Online
concat() method. pandas.
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
DataFrame.pivot_table() method. pandas.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html
merge() method. pandas.
https://pandas.pydata.org/docs/reference/api/pandas.merge.html
4.4 Summary
In this exploration of data harmonization techniques using Python and Pandas, you delved into
the intricacies of segmenting, concatenating, pivoting, and merging datasets. You divided the
dataset into meaningful segments based on price categories, visualized trends in construction
years, stacked segments into broader market categories with Pandas.concat(), and summarized
data points within each segment using pivot tables. By leveraging Pandas.merge() to compare
the segmented datasets through different types of merge operations (outer, inner, cross), you
unlocked the power of data integration and exploration. Armed with these techniques, data
scientists and analysts can navigate the complex landscape of data with confidence, uncovering
hidden patterns and extracting valuable insights that drive informed decision-making.
Specifically, you learned:
- How to divide datasets into meaningful segments based on price categories and visualize
trends in construction years.
- The use of Pandas.concat() to stack datasets and analyze broader market categories.
- The role of pivot tables in summarizing and analyzing data points within segments.
- How to leverage Pandas.merge() to compare segmented datasets and derive insights
from different types of merge operations (outer, inner, cross).
Starting from the next chapter, you will see examples of making a statement from data using
data science techniques. The first is the simplest: describing what you can see in the data at
the most superficial level.
11
Garage or Not? Housing Insights
Through the Chi-Squared Test
The chi-squared test for independence is a statistical procedure employed to assess the
relationship between two categorical variables—determining whether they are correlated or
independent. Exploring the visual appeal of a property and its impact on its valuation
is intriguing. But how often do you associate the appearance of a house with functional features
like a garage? With the chi-squared test, you can determine whether there exists a statistically
significant correlation between such features.
Let’s get started.
Overview
This chapter is divided into four parts; they are:
- Understanding the Chi-Squared Test
- How the Chi-Squared Test Works
- Unraveling the Correlation Between External Quality and Garage Presence
- Important Caveats
In this study, you will focus on the visual appeal of a house (categorized as “Great” or
“Average”) and its relation to the presence or absence of a garage. For the results of the
chi-squared test to be valid, the following conditions must be satisfied:
- Independence: The observations must be independent, meaning the occurrence of one
outcome shouldn’t affect another. Your dataset satisfies this as each entry represents
a distinct house.
- Large Sample Size: The dataset should be randomly sampled and sufficiently large
to be representative. Your data, sourced from Ames, Iowa, meets this criterion.
- Minimum Expected Frequency for Each Category: Every cell in the contingency table
should have an expected frequency of at least 5. This is vital for the reliability
of the test, as the chi-squared test relies on a large sample approximation. You will
demonstrate this condition below by creating and visualizing the expected frequencies.
import pandas as pd
from scipy.stats import chi2_contingency
Observed Frequencies:
No Garage With Garage
Average 121 1544
Great 8 906
Expected Frequencies:
No Garage With Garage
Average 83.3 1581.7
Great 45.7 868.3
3. Chi-squared Test:
- With the data aptly prepared, you constructed a contingency table to depict the
observed frequencies between the newly formed categories. They are the two tables
printed in the output.
- You then performed a chi-squared test on this contingency table using SciPy. The
p-value is printed and found much less than α (0.05). The extremely low p-value suggests
rejecting the null hypothesis, meaning there is a statistically significant relationship
between the external quality of a house and the presence of a garage in this dataset.
- A glance at the expected frequencies satisfies the third condition of a chi-squared test,
which requires a minimum expected frequency of 5 in each cell.
Through this analysis, you not only refined and simplified the data to make it more interpretable
but also provided statistical evidence of a correlation between two categorical variables of
interest.
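To recap the whole procedure in one place, here is a minimal, self-contained sketch. The
column names ExterQual and GarageType, and the grouping of quality ratings into “Great”
and “Average”, are assumptions for illustration and may differ from the exact steps used above.
import pandas as pd
from scipy.stats import chi2_contingency

Ames = pd.read_csv('Ames.csv')

# Assumed feature engineering: collapse external quality into two groups and
# derive garage presence from whether 'GarageType' is recorded
quality = Ames['ExterQual'].apply(lambda q: 'Great' if q in ('Ex', 'Gd') else 'Average')
garage = Ames['GarageType'].notna().map({True: 'With Garage', False: 'No Garage'})

# Contingency table of observed frequencies
observed = pd.crosstab(quality, garage)
print(observed)

# Chi-squared test for independence; also returns the expected frequencies
chi2, p_value, dof, expected = chi2_contingency(observed)
print(pd.DataFrame(expected, index=observed.index, columns=observed.columns).round(1))
print(f"p-value: {p_value}")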
11.4 Important Caveats
- No Causation: While the test can determine correlation, it doesn’t infer causation.
So, even though there’s a significant link between the external quality of a house and
its garage presence, you can’t conclude that one causes the other.
- Directionality: The test indicates a correlation but doesn’t specify its direction.
However, your data suggests that houses labeled as “Great” in terms of external
quality are more likely to have garages than those labeled as “Average”.
- Magnitude: The test doesn’t provide insights into the strength of the relationship. Other
metrics, like Cramér’s V, would be more informative in this regard (see the sketch after
this list).
- External Validity: Your conclusions are specific to the Ames dataset. Caution is
advised when generalizing these findings to other regions.
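As a sketch of how such an effect-size measure could be computed, continuing from the chi2
and observed variables in the previous sketch, Cramér’s V is derived from the chi-squared
statistic as follows:
...
import numpy as np

# Cramér's V: effect size for a chi-squared test, using chi2 and observed from above
n = observed.to_numpy().sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")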
11.5 Further Reading
Online
H. B. Berman. Chi-square Test for Independence. Stat Trek.
https://stattrek.com/chi-square-test/independence
Chi-square test. Wikipedia.
https://en.wikipedia.org/wiki/Chi-squared_test
scipy.stats.chi2_contingency API. SciPy.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
11.6 Summary
In this chapter, you delved into the chi-squared test and its application on the Ames housing
dataset. You discovered a significant correlation between the external quality of a house and
the presence of a garage.
Specifically, you learned:
- The fundamentals and practicality of the chi-squared test.
- The chi-squared test revealed a significant correlation between the external quality
of a house and the presence of a garage in the Ames dataset. Houses with a “Great”
external quality rating showed a higher likelihood of having a garage when compared
to those with an “Average” rating, a trend that was statistically significant.
- The vital caveats and limitations of the chi-squared test.
In the next chapter, you will learn about ANOVA, which can be considered as an extension
of the t-test.