Section 1 - Introduction to Regression
Exercise: Simple Data Plotting
The aim of this exercise is to plot TV ad budget against sales using the Advertisement dataset.
The plot should look similar to the graph given below.
Instructions:
Read the Advertisement data and view the top rows of the dataframe to get an understanding
of the data and the columns.
Select the first 7 observations and the columns TV and sales to make a new data frame.
Create a scatter plot of TV budget vs sales from the new data frame.
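The steps above could be sketched roughly as follows. This is only an illustration: the name of the data file is not given here, so the dataframe is built from invented values standing in for the result of pd.read_csv, and the radio column is a hypothetical extra column. Only the TV and sales column names come from the instructions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented stand-in for the dataframe that pd.read_csv(...) would return.
df = pd.DataFrame({
    "TV": [10.0, 40.0, 25.0, 90.0, 60.0, 120.0, 75.0, 150.0, 30.0, 110.0],
    "radio": [5.0, 12.0, 8.0, 20.0, 15.0, 30.0, 18.0, 35.0, 9.0, 25.0],
    "sales": [3.0, 6.5, 5.0, 11.0, 8.0, 14.5, 9.5, 17.0, 5.5, 13.0],
})

# View the top rows (head() shows the first 5 by default).
print(df.head())

# First 7 observations, keeping only the TV and sales columns.
df_new = df.iloc[0:7][["TV", "sales"]]

# Scatter plot of TV budget (x-axis) against sales (y-axis).
plt.scatter(df_new["TV"], df_new["sales"])
plt.xlabel("TV budget")
plt.ylabel("Sales")
plt.title("TV budget vs sales")
```

In a notebook the figure is displayed automatically; in a script, a call to plt.show() would be needed at the end.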
Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.
pd.read_csv(filename)
Returns a pandas dataframe containing the data and column labels read from the file
df.iloc[]
Returns the subset of the dataframe at the integer row and column positions passed as the argument
df.head()
Returns the first 5 rows of the dataframe with the column names
plt.scatter()
A scatter plot of y vs. x with varying marker size and/or color
plt.xlabel()
This is used to specify the text to be displayed as the label for the x-axis
plt.ylabel()
This is used to specify the text to be displayed as the label for the y-axis
plt.title()
This is used to specify the title to be displayed for the plot
Note: This exercise is auto-graded and you can try multiple attempts. See the Programming
Assignments tab in the New to edX? section of the Introduction for more information about
Automated edTests.
Exercise: Simple kNN Regression
The goal of this exercise is to re-create the plots given below. You will have seen these
graphs in the lecture as well.
Instructions:
Part 1: kNN by hand for k=1
Read the Advertisement data.
Get a subset of the data from row 5 to row 13.
Apply the kNN algorithm by hand and plot the first graph as given above.
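One possible way to carry out kNN by hand for k=1 is sketched below. The arrays are invented values standing in for the TV and sales columns of the subset, and find_nearest is a hypothetical helper name, not necessarily what the exercise scaffold uses. For each query point, the prediction is simply the response of the single closest observation.

```python
import numpy as np

# Invented (x, y) pairs standing in for the TV and sales columns of the subset.
x_true = np.array([8.0, 20.0, 57.0, 66.0, 120.0, 200.0, 215.0])
y_true = np.array([5.0, 7.0, 12.0, 9.0, 13.0, 11.0, 17.0])


def find_nearest(array, value):
    """Return the index and value of the element of `array` closest to `value`."""
    idx = np.abs(array - value).argmin()
    return idx, array[idx]


# Evenly spaced query points spanning the observed x range.
x_grid = np.linspace(np.min(x_true), np.max(x_true), 50)
y_pred = np.zeros(len(x_grid))

# k=1: each prediction is the y value of the single nearest observation.
for i, xi in enumerate(x_grid):
    idx, _ = find_nearest(x_true, xi)
    y_pred[i] = y_true[idx]
```

Plotting y_pred against x_grid with plt.plot(), together with the original points, gives a step-like curve that passes through every observation.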
Part 2: Using the sklearn package
Read the Advertisement dataset.
Split the data into train and test sets using the train_test_split() function.
Set k_list as the possible k values ranging from 1 to 70.
For each value of k in k_list:
Use sklearn KNeighborsRegressor() to fit train data.
Predict on the test data.
Use the helper code to get the second plot above for k=1,10,70.
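The fit-and-predict loop in Part 2 might look something like the sketch below. The data are randomly generated stand-ins for the TV and sales columns, the split fraction and random_state are arbitrary choices rather than the values required by the scaffold, and only the three k values shown in the second plot are used to keep the example short.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Invented data standing in for the TV (predictor) and sales (response) columns.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=(200, 1))          # sklearn expects a 2-D predictor array
y = 7 + 0.05 * x[:, 0] + rng.normal(0, 2, size=200)

# Arbitrary split settings, chosen only for this illustration.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, random_state=42
)

k_list = [1, 10, 70]
predictions = {}

for k in k_list:
    # n_neighbors must be an integer no larger than the number of training rows.
    model = KNeighborsRegressor(n_neighbors=int(k))
    model.fit(x_train, y_train)
    predictions[k] = model.predict(x_test)
```

Small k follows the training points closely, while large k averages over many neighbors and gives a much flatter prediction.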
Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.
np.argsort()
Returns the indices that would sort an array.
df.iloc[]
Returns the subset of the dataframe at the integer row and column positions passed as the argument.
plt.plot( )
Plot y versus x as lines and/or markers.
df.values
Returns a Numpy representation of the DataFrame.
df.idxmin()
Returns index of the first occurrence of minimum over requested axis.
np.min()
Returns the minimum along a given axis.
np.max()
Returns the maximum along a given axis.
model.fit( )
Fit the k-nearest neighbors regressor from the training dataset.
model.predict( )
Predict the target for the provided data.
np.zeros()
Returns a new array of given shape and type, filled with zeros.
train_test_split(X,y)
Split arrays or matrices into random train and test subsets.
np.linspace()
Returns evenly spaced numbers over a specified interval.
KNeighborsRegressor(n_neighbors=k_value)
Regression based on k-nearest neighbors.
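As a small illustration of the dataframe hints above, the sketch below shows how df.iloc[] selects by integer position for rows, columns, or both, and how df.values and idxmin() behave. The dataframe is invented. Note that the end of an iloc slice is exclusive, so whether row 13 itself belongs in the subset depends on the exercise scaffold.

```python
import pandas as pd

# Invented data: 15 rows with TV = 10, 20, ..., 150.
tv = [float(v) for v in range(10, 160, 10)]
df = pd.DataFrame({"TV": tv, "sales": [5 + 0.1 * v for v in tv]})

rows = df.iloc[5:13]            # row positions 5..12 (end is exclusive), all columns
cols = df.iloc[:, 0:1]          # all rows, first column only
block = df.iloc[5:13, [0, 1]]   # row positions 5..12, columns 0 and 1

arr = rows.values               # NumPy array with one row per observation

# Index label of the observation whose TV value is closest to 97.
nearest_label = (rows["TV"] - 97.0).abs().idxmin()
```

Here idxmin() returns the index label of the row, not its position within the subset, which matters once the dataframe has been sliced.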
Note: This exercise is auto-graded, hence please remember to set all the parameters to the values
mentioned in the scaffold before marking. See the Programming Assignments tab in the New to
edX? section of the Introduction for more information about Automated edTests.
Exercise: Finding the Best k in kNN Regression
The goal here is to find the value of k that gives the best-performing model, as measured by test MSE.
Instructions:
Read the data into a Pandas dataframe object.
Select the sales column as the response variable and TV budget column as the predictor
variable.
Make a train-test split using sklearn.model_selection.train_test_split.
Create a list of integer k values using numpy.linspace (it returns floats, so cast the values to int).
For each value of k:
Fit a kNN regression on train set.
Calculate MSE on test set and store it.
Plot the test MSE values for each k.
Find the k value associated with the lowest test MSE.
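The whole procedure could be sketched as follows. The data are randomly generated stand-ins for the TV and sales columns, knn_dict is a hypothetical name for the dictionary of results, and the split settings and range of k are arbitrary choices rather than the values set in the scaffold.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Invented data standing in for the TV (predictor) and sales (response) columns.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=(200, 1))
y = 7 + 0.05 * x[:, 0] + rng.normal(0, 2, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, random_state=66
)

# np.linspace returns floats, so cast to int before using the values as k.
k_list = np.linspace(1, 70, 70).astype(int)

# Map each k to the test MSE of the model fitted with that k.
knn_dict = {}
for k in k_list:
    model = KNeighborsRegressor(n_neighbors=int(k))
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    knn_dict[int(k)] = mean_squared_error(y_test, y_pred)

# Plot test MSE against k.
plt.plot(list(knn_dict.keys()), list(knn_dict.values()), "k.-")
plt.xlabel("k")
plt.ylabel("Test MSE")

# The key with the smallest stored MSE is the best k for this split.
best_k = min(knn_dict, key=knn_dict.get)
print(best_k, knn_dict[best_k])
```

The best k found this way depends on the particular train-test split, so a different random_state may give a different answer.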
Hints:
The following are direct links to documentation for some of the functions used in this
exercise. As always, if you are unsure how to use a function, refer to its documentation.
train_test_split(X,y)
Split arrays or matrices into random train and test subsets.
np.linspace()
Returns evenly spaced numbers over a specified interval.
KNeighborsRegressor(n_neighbors=k_value)
Regression based on k-nearest neighbors.
model.predict()
Predict the target for the provided data.
mean_squared_error()
Computes the mean squared error regression loss.
dict.keys()
Returns a view object that displays a list of all the keys in the dictionary.
dict.values()
Returns a view object of all the values in the dictionary.
plt.plot()
Plot y versus x as lines and/or markers.
dict.items()
Returns a view object of the dictionary's (key, value) tuple pairs.
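The dictionary methods above can be illustrated with a short sketch. The MSE numbers are invented and mse_by_k is a hypothetical name. In Python 3 these methods return view objects, so wrap them in list() where an actual list is needed, for example when passing them to plt.plot().

```python
# Invented test MSE values keyed by k.
mse_by_k = {1: 14.2, 10: 9.8, 70: 12.5}

keys = mse_by_k.keys()      # view of the keys
values = mse_by_k.values()  # view of the values
items = mse_by_k.items()    # view of (key, value) pairs

# Convert the views to lists when a list is required.
k_values = list(keys)
mse_values = list(values)

# Pick the (k, MSE) pair with the smallest MSE.
best_k, best_mse = min(items, key=lambda pair: pair[1])
print(best_k, best_mse)
```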
Note: This exercise is auto-graded and you can try multiple attempts. See the Programming
Assignments tab in the New to edX? section of the Introduction for more information about
Automated edTests.