Feature Engineering - Python Data Science Handbook
This is an excerpt from the Python Data Science Handbook (http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; Jupyter notebooks are available on GitHub (https://github.com/jakevdp/PythonDataScienceHandbook). The text is released under the CC-BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the MIT license (https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by buying the book (http://shop.oreilly.com/product/0636920034919.do)!
Feature Engineering
The previous sections outline the fundamental ideas of machine learning, but
all of the examples assume that you have numerical data in a tidy,
[n_samples, n_features] format. In the real world, data rarely comes in such
a form. With this in mind, one of the more important steps in using machine
learning in practice is feature engineering: that is, taking whatever information
you have about your problem and turning it into numbers that you can use to
build your feature matrix.
# Categorical Features
One common type of non-numerical data is categorical data. For example,
imagine you are exploring some data on housing prices, and along with
numerical features like "price" and "rooms", you also have "neighborhood"
information. For example, your data might look something like this:
In [1]: data = [
{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
{'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
{'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
{'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
You might be tempted to encode this data with a straightforward numerical mapping, for example {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}.
It turns out that this is not generally a useful approach in Scikit-Learn: the
package's models make the fundamental assumption that numerical features
reflect algebraic quantities. Thus such a mapping would imply, for example,
that Queen Anne < Fremont < Wallingford, or even that Wallingford - Queen
Anne = Fremont, which (niche demographic jokes aside) does not make much
sense.
In this case, one proven technique is to use one-hot encoding, which effectively
creates extra columns indicating the presence or absence of a category with a
value of 1 or 0, respectively. When your data comes as a list of dictionaries,
Scikit-Learn's DictVectorizer will do this for you:
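A minimal sketch of the DictVectorizer approach (here with sparse=False and dtype=int so the result displays as a dense integer array):

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
# array([[     0,      1,      0, 850000,      4],
#        [     1,      0,      0, 700000,      3],
#        [     0,      0,      1, 650000,      3],
#        [     1,      0,      0, 600000,      2]])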
Notice that the 'neighborhood' column has been expanded into three separate
columns, representing the three neighborhood labels, and that each row has a
1 in the column associated with its neighborhood. With these categorical
features thus encoded, you can proceed as normal with fitting a Scikit-Learn
model.
To see the meaning of each column, you can inspect the feature names:
In [4]: vec.get_feature_names()
Out[4]: ['neighborhood=Fremont',
'neighborhood=Queen Anne',
'neighborhood=Wallingford',
'price',
'rooms']
There is one clear disadvantage of this approach: if your category has many
possible values, this can greatly increase the size of your dataset. However,
because the encoded data contains mostly zeros, a sparse output can be a very
efficient solution:
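A minimal sketch, reusing data and the DictVectorizer import from above:

vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)   # returns a SciPy sparse matrix rather than a dense array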
Many (though not yet all) of the Scikit-Learn estimators accept such sparse
inputs when fitting and evaluating models.
sklearn.preprocessing.OneHotEncoder and
sklearn.feature_extraction.FeatureHasher are two additional tools that
Scikit-Learn includes to support this type of encoding.
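For instance, a rough sketch of the OneHotEncoder route (encoding just the neighborhood strings from the data above; FeatureHasher follows the same fit/transform pattern) might look like this:

from sklearn.preprocessing import OneHotEncoder

# Pull out the categorical column as a 2D list of strings
neighborhoods = [[d['neighborhood']] for d in data]
enc = OneHotEncoder()                              # sparse output by default
print(enc.fit_transform(neighborhoods).toarray())  # one column per neighborhood
print(enc.categories_)                             # the learned category labels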
# Text Features
Another common need in feature engineering is to convert text to a set of
representative numerical values. For example, most automatic mining of social
media data relies on some form of encoding the text as numbers. One of the
simplest methods of encoding data is by word counts: you take each snippet of
text, count the occurrences of each word within it, and put the results in a table.
For example, consider the following set of three phrases:

from sklearn.feature_extraction.text import CountVectorizer

sample = ['problem of evil', 'evil queen', 'horizon problem']
vec = CountVectorizer()
X = vec.fit_transform(sample)
X
The result is a sparse matrix recording the number of times each word appears;
it is easier to inspect if we convert this to a DataFrame with labeled columns:
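A short sketch of that conversion using Pandas:

import pandas as pd

pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
# (in scikit-learn >= 1.0, use vec.get_feature_names_out() instead)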
   evil  horizon  of  problem  queen
0     1        0   1        1      0
1     1        0   0        0      1
2     0        1   0        1      0
There are some issues with this approach, however: the raw word counts lead
to features which put too much weight on words that appear very frequently,
and this can be sub-optimal in some classification algorithms. One approach to
fix this is known as term frequency-inverse document frequency (TF–IDF), which
weights the word counts by a measure of how often they appear in the
documents. The syntax for computing these features is similar to the previous
example:
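A sketch, assuming the same sample list and Pandas import as above:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())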
# Image Features
Another common need is to suitably encode images for machine learning
analysis. The simplest approach is what we used for the digits data in
Introducing Scikit-Learn (05.02-introducing-scikit-learn.html): simply using the
pixel values themselves. But depending on the application, such approaches
may not be optimal.
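As a quick sketch of the raw-pixel approach, the digits dataset bundled with Scikit-Learn stores each image as an 8x8 grid of pixel values, which is simply flattened into a 64-element feature vector:

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8): the raw 8x8 pixel grids
print(digits.data.shape)     # (1797, 64): each image flattened into 64 pixel features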
# Derived Features
Another useful type of feature is one that is mathematically derived from some
input features. We saw an example of this in Hyperparameters and Model
Validation (05.03-hyperparameters-and-model-validation.html) when we
constructed polynomial features from our input data. We saw that we could
convert a linear regression into a polynomial regression not by changing the
model, but by transforming the input! This is sometimes known as basis
function regression, and is explored further in In Depth: Linear Regression
(05.06-linear-regression.html).
For example, this data clearly cannot be well described by a straight line:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y);
Still, we can fit a line to the data using LinearRegression and get the optimal
result:
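A sketch of that straight-line fit, reshaping x into the [n_samples, n_features] form that Scikit-Learn expects:

from sklearn.linear_model import LinearRegression

X = x[:, np.newaxis]                 # column-vector form expected by Scikit-Learn
model = LinearRegression().fit(X, y)
yfit = model.predict(X)

plt.scatter(x, y)
plt.plot(x, yfit);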
It's clear that we need a more sophisticated model to describe the relationship
between x and y.
One approach to this is to transform the data, adding extra columns of features
to drive more flexibility in the model. For example, we can add polynomial
features to the data this way:
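A sketch using Scikit-Learn's PolynomialFeatures transformer (degree 3, without the constant bias column):

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)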
[[ 1. 1. 1.]
[ 2. 4. 8.]
[ 3. 9. 27.]
[ 4. 16. 64.]
[ 5. 25. 125.]]
The derived feature matrix has one column representing x, and a second
column representing x², and a third column representing x³. Computing a
linear regression on this expanded input gives a much closer fit to our data:
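For instance, a sketch reusing X2 and y from the cells above:

model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)

plt.scatter(x, y)
plt.plot(x, yfit);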
This idea of improving a model not by changing the model, but by transforming
the inputs, is fundamental to many of the more powerful machine learning
methods. We explore this idea further in In Depth: Linear Regression (05.06-
linear-regression.html) in the context of basis function regression. More
generally, this is one motivational path to the powerful set of techniques known
as kernel methods, which we will explore in In-Depth: Support Vector Machines
(05.07-support-vector-machines.html).
# Imputation of Missing Data

Another common need in feature engineering is the handling of missing data, which often appears as NaN entries in the feature matrix. When applying a typical machine learning model to such data, we will need to
first replace such missing data with some appropriate fill value. This is known
as imputation of missing values, and strategies range from simple (e.g.,
replacing missing values with the mean of the column) to sophisticated (e.g.,
using matrix completion or a robust model to handle such data).
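For example, a minimal sketch of mean imputation with Scikit-Learn's SimpleImputer (the small dataset here is illustrative, with values chosen to be consistent with the output shown later in this section):

from numpy import nan
from sklearn.impute import SimpleImputer   # named Imputer in older Scikit-Learn releases

X = np.array([[nan, 0,   3],
              [3,   7,   9],
              [3,   5,   2],
              [4, nan,   6],
              [8,   8,   1]])
y = np.array([14, 16, -1, 8, -5])

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2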
We see that in the resulting data, the two missing values have been replaced
with the mean of the remaining values in the column. This imputed data can
then be fed directly into, for example, a LinearRegression estimator:
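For instance, a sketch reusing the imputed matrix X2 and target y from above:

model = LinearRegression().fit(X2, y)
model.predict(X2)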
# Feature Pipelines
With any of the preceding examples, it can quickly become tedious to do the
transformations by hand, especially if you wish to string together multiple
steps. For example, we might want a processing pipeline that looks something
like this:

1. Impute missing values using the mean
2. Transform features to quadratic
3. Fit a linear regression

To streamline this type of processing pipeline, Scikit-Learn provides a pipeline object, which can be constructed with the make_pipeline convenience function:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer   # Imputer(strategy='mean') in older releases

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
This pipeline looks and acts like a standard Scikit-Learn object, and will apply
all the specified steps to any input data.
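A sketch of fitting this pipeline to the data with missing values from above and comparing the targets to the predictions on the training data:

model.fit(X, y)                 # X contains the NaN entries defined earlier
print(y)
print(model.predict(X))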
[14 16 -1 8 -5]
[ 14. 16. -1. 8. -5.]
All the steps of the model are applied automatically. Notice that for the
simplicity of this demonstration, we've applied the model to the data it was
trained on; this is why it was able to perfectly predict the result (refer back to
Hyperparameters and Model Validation (05.03-hyperparameters-and-model-
validation.html) for further discussion of this).
For some examples of Scikit-Learn pipelines in action, see the following section
on naive Bayes classification, as well as In Depth: Linear Regression (05.06-
linear-regression.html), and In-Depth: Support Vector Machines (05.07-support-
vector-machines.html).