Section 1 - Introduction To Regression

The document outlines exercises related to regression analysis using the Advertisement dataset, including simple data plotting, kNN regression, and finding the best k value. It provides step-by-step instructions for creating scatter plots, applying the kNN algorithm, and evaluating model performance using mean squared error. Additionally, it includes hints and references to relevant documentation for various functions used in the exercises.

Uploaded by

oracledba1963
Copyright
© All Rights Reserved

Section 1 - Introduction to Regression

Exercise: Simple Data Plotting


The aim of this exercise is to plot TV advertising budget against sales using the Advertisement dataset.
The result should look similar to the reference graph provided with the exercise.

Instructions:
Read the Advertisement data and view the top rows of the dataframe to get an understanding
of the data and the columns.
Select the first 7 observations and the columns TV and sales to make a new dataframe.
Create a scatter plot of TV budget vs sales from the new dataframe.
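
The three steps above can be sketched as follows. This is only a minimal illustration, not the graded solution: the small table stands in for the result of pd.read_csv(), and its values, the column names TV, radio and sales, and the axis label text are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for pd.read_csv(<advertisement file>); values and column names are illustrative.
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8],
    "radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6],
    "sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6],
})

# View the top rows to get a feel for the columns.
print(df.head())

# First 7 observations, TV and sales columns only.
df_new = df.iloc[0:7][["TV", "sales"]]

# Scatter plot of TV budget vs sales with labelled axes and a title.
plt.scatter(df_new["TV"], df_new["sales"])
plt.xlabel("TV budget")
plt.ylabel("Sales")
plt.title("TV budget vs sales (first 7 observations)")
```

In an interactive session, plt.show() would display the figure; the Agg backend line can be dropped there.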

Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.

pd.read_csv(filename)

Returns a pandas dataframe containing the data and labels from the file

df.iloc[]

Returns the subset of the dataframe selected by the integer row (and optionally column) positions passed as the argument

df.head()

Returns the first 5 rows of the dataframe with the column names

plt.scatter()

A scatter plot of y vs. x with varying marker size and/or color

plt.xlabel()

This is used to specify the text to be displayed as the label for the x-axis

plt.ylabel()

This is used to specify the text to be displayed as the label for the y-axis

plt.title()

This is used to specify the title to be displayed for the plot

Note: This exercise is auto-graded and you can try multiple attempts. See the Programming
Assignments tab in the New to edX? section of the Introduction for more information about
Automated edTests.

Exercise: Simple kNN Regression

The goal of this exercise is to re-create the plots given below. These graphs also appear in the
lecture.

Instructions:
Part 1: kNN by hand for k=1

Read the Advertisement data.
Get a subset of the data from row 5 to row 13.
Apply the kNN algorithm by hand and plot the first graph as given above.
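
One possible way to apply kNN by hand is sketched below. It is an illustration under stated assumptions, not the scaffold's code: the x and y values stand in for the TV and sales columns of the selected rows, the helper knn_predict_by_hand is a hypothetical name, and the 50-point prediction grid is an arbitrary choice.

```python
import numpy as np

# Illustrative stand-in for the TV (x) and sales (y) values of the selected rows.
x_true = np.array([8.7, 57.5, 120.2, 8.6, 199.8, 66.1, 214.7, 23.8])
y_true = np.array([7.2, 11.8, 13.2, 4.8, 10.6, 8.6, 17.4, 9.2])


def knn_predict_by_hand(x_query, x_train, y_train, k=1):
    """Average the responses of the k training points closest to x_query."""
    distances = np.abs(x_train - x_query)   # 1-D distance to every training point
    nearest = np.argsort(distances)[:k]     # indices of the k smallest distances
    return y_train[nearest].mean()


# Predict on an evenly spaced grid covering the observed TV budgets.
x_grid = np.linspace(np.min(x_true), np.max(x_true), 50)
y_pred = np.zeros(len(x_grid))
for i, x_query in enumerate(x_grid):
    y_pred[i] = knn_predict_by_hand(x_query, x_true, y_true, k=1)
```

With k=1 the prediction at any training point is that point's own response, which is why the fitted curve looks like a step function passing through every observation. Plotting x_grid against y_pred with plt.plot() gives that curve.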

Part 2: Using sklearn package

Read the Advertisement dataset.
Split the data into train and test sets using the train_test_split() function.
Set k_list to the possible k values, ranging from 1 to 70.
For each value of k in k_list:
    Use sklearn's KNeighborsRegressor() to fit the train data.
    Predict on the test data.
Use the helper code to get the second plot above for k=1,10,70.
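
The fit-and-predict loop of Part 2 might look like the sketch below. The data here is synthetic (generated from a made-up linear trend plus noise) so that the block runs on its own; the train_size and random_state values are likewise assumptions, and the scaffold's own settings take precedence.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the TV (predictor) and sales (response) columns.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200).reshape(-1, 1)   # sklearn expects a 2-D predictor array
y = 7 + 0.05 * x.ravel() + rng.normal(0, 2, size=200)

# Assumed split settings; use the values given in the scaffold when marking.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, random_state=42
)

# k values 1, 2, ..., 70 as integers.
k_list = np.linspace(1, 70, 70, dtype=int)

predictions = {}
for k in k_list:
    model = KNeighborsRegressor(n_neighbors=int(k))
    model.fit(x_train, y_train)
    predictions[int(k)] = model.predict(x_test)
```

Note that k cannot exceed the number of training points, so the training set must hold at least 70 observations for the largest k to be valid.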

Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.

np.argsort()

Returns the indices that would sort an array.

df.iloc[]

Returns the subset of the dataframe selected by the integer row (and optionally column) positions passed as the argument.

plt.plot()

Plot y versus x as lines and/or markers.

df.values

Returns a Numpy representation of the DataFrame.

df.idxmin()

Returns index of the first occurrence of minimum over requested axis.

np.min()

Returns the minimum along a given axis.

np.max()

Returns the maximum along a given axis.


model.fit()

Fit the k-nearest neighbors regressor from the training dataset.

model.predict()

Predict the target for the provided data.

np.zeros()

Returns a new array of given shape and type, filled with zeros.

train_test_split(X,y)

Split arrays or matrices into random train and test subsets.

np.linspace()

Returns evenly spaced numbers over a specified interval.

KNeighborsRegressor(n_neighbors=k_value)

Regression based on k-nearest neighbors.

Note: This exercise is auto-graded, so remember to set all the parameters to the values
mentioned in the scaffold before marking. See the Programming Assignments tab in the New to
edX? section of the Introduction for more information about Automated edTests.

Exercise: Finding the Best k in kNN Regression

The goal here is to find the value of k of the best performing model based on the test MSE.

Instructions:
Read the data into a Pandas dataframe object.
Select the sales column as the response variable and TV budget column as the predictor
variable.
Make a train-test split using sklearn.model_selection.train_test_split.
Create a list of integer k values using numpy.linspace (it returns floats by default, so cast the values to integers).
For each value of k:
    Fit a kNN regression on the train set.
    Calculate the MSE on the test set and store it.
Plot the test MSE values for each k.
Find the k value associated with the lowest test MSE.
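
A minimal sketch of these steps follows. As before, the data is synthetic and the split settings are assumptions chosen only so the block runs on its own; the dictionary name mse_by_k is also hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the TV (predictor) and sales (response) columns.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200).reshape(-1, 1)
y = 7 + 0.05 * x.ravel() + rng.normal(0, 2, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, random_state=42
)

# Integer k values from 1 to 70.
k_list = np.linspace(1, 70, 70, dtype=int)

# Store the test MSE for each k, keyed by k.
mse_by_k = {}
for k in k_list:
    model = KNeighborsRegressor(n_neighbors=int(k))
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse_by_k[int(k)] = mean_squared_error(y_test, y_pred)

# Plot the test MSE against k.
plt.plot(list(mse_by_k.keys()), list(mse_by_k.values()), marker="o")
plt.xlabel("k")
plt.ylabel("Test MSE")
plt.title("Test MSE for each k")

# The best k is the key whose stored MSE is smallest.
best_k = min(mse_by_k, key=mse_by_k.get)
print(f"Best k: {best_k}, test MSE: {mse_by_k[best_k]:.3f}")
```

The best k found this way depends on the particular train-test split, so a different random_state may give a different answer.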

Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.

train_test_split(X,y)

Split arrays or matrices into random train and test subsets.

np.linspace()

Returns evenly spaced numbers over a specified interval.

KNeighborsRegressor(n_neighbors=k_value)

Regression based on k-nearest neighbors.

model.predict()

Predict the target for the provided data.

mean_squared_error()

Computes the mean squared error regression loss.

dict.keys()

Returns a view object that displays a list of all the keys in the dictionary.

dict.values()

Returns a view object that displays all the values in the dictionary.

plt.plot()

Plot y versus x as lines and/or markers.

dict.items()

Returns a view object that displays the dictionary's (key, value) tuple pairs.
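
The dictionary methods above return views rather than lists, which matters when plotting or searching the stored MSE values. The short sketch below illustrates this with a hypothetical mse_by_k dictionary whose values are made up.

```python
# Hypothetical test MSE values keyed by k (made-up numbers for illustration).
mse_by_k = {1: 12.5, 10: 9.8, 70: 15.1}

# keys() returns a live view: it reflects entries added later.
keys_view = mse_by_k.keys()
mse_by_k[30] = 10.4
keys_after_update = list(keys_view)

# Convert views to lists when a function needs a plain sequence.
k_values = list(mse_by_k.keys())
mse_values = list(mse_by_k.values())

# items() yields (key, value) pairs, so the lowest MSE and its k can be found together.
best_k, best_mse = min(mse_by_k.items(), key=lambda pair: pair[1])
```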

Note: This exercise is auto-graded and you can try multiple attempts. See the Programming
Assignments tab in the New to edX? section of the Introduction for more information about
Automated edTests.
