Instructions
1. Read the instructions in each question carefully.
2. A Jupyter notebook along with output for each cell is expected.
3. Assignments submitted using other Python IDEs will not be considered for
grading.
4. Use appropriate labels for all visualizations.
5. Upload the [Link] file along with the notebook when required.
6. If a dataset link has expired, search for the same dataset in any online
repository and use it.
Question 1
1. Import the dataset from [Link]
data/review_polarity.[Link] .
2. Split the data into training and testing sets. Use 10-fold cross-validation.
3. Extract features using TF-IDF and display the features.
4. Model the classifier using GaussianNB, BernoulliNB and MultinomialNB and
train the classifiers.
5. Compute the accuracy and confusion matrix for each model.
6. Create an output .csv file containing the actual test-set values of Y (column
name: Actual) and the predictions of Y (column name: Predicted).
Question 2
Consider the diabetes data ([Link]), whose response variable indicates whether a
person has diabetes (a value of 1 means the person has diabetes).
1. Import the dataset from [Link]
database.
2. Identify the columns with missing values (1 point). Fill the missing values
with the mean for numerical attributes and the mode for categorical attributes.
3. Extract X as all columns except the last column and Y as the last column.
4. Visualize the dataset.
5. Split the data into a training set and a testing set. Perform 10-fold
cross-validation.
6. Train a logistic regression model for the dataset.
7. Display the coefficients and form the logistic regression equation.
8. Compute the accuracy and confusion matrix.
9. Plot the decision boundary.
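The steps above can be sketched roughly as follows, assuming scikit-learn, pandas, NumPy, and matplotlib are available. The synthetic table, its column names (`Glucose`, `BMI`, `Outcome`), the injected missing values, the 75/25 split, and the output image name are all placeholders for illustration. Only two features are used so that the decision boundary can be drawn as a line; with the full dataset, two columns would have to be chosen (or the data projected) before plotting.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical stand-in for the diabetes table: two numeric features and a
# 0/1 outcome in the last column.
rng = np.random.default_rng(0)
half = 100
df = pd.DataFrame({
    "Glucose": np.concatenate([rng.normal(100, 15, half), rng.normal(150, 15, half)]),
    "BMI": np.concatenate([rng.normal(27, 4, half), rng.normal(34, 4, half)]),
    "Outcome": np.repeat([0, 1], half),
})
df.loc[[3, 50, 120], "Glucose"] = np.nan  # injected gaps, for illustration only

# Step 2: find columns with missing values; fill with mean (numeric) or mode.
missing = df.columns[df.isna().any()].tolist()
print("Columns with missing values:", missing)
for col in missing:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Step 3: X is every column except the last, Y is the last column.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Step 5: train/test split (25% test size is an assumption) and 10-fold CV.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Mean 10-fold CV accuracy:", cv_scores.mean())

# Steps 6 and 7: fit the model and write out the fitted equation.
clf.fit(X_train, y_train)
b0 = clf.intercept_[0]
b1, b2 = clf.coef_[0]
print(f"log(p / (1 - p)) = {b0:.3f} + {b1:.3f}*Glucose + {b2:.3f}*BMI")

# Step 8: accuracy and confusion matrix on the test set.
predicted = clf.predict(X_test)
acc = accuracy_score(y_test, predicted)
cm = confusion_matrix(y_test, predicted, labels=[0, 1])
print("Test accuracy:", acc)
print(cm)

# Step 9: the boundary is where b0 + b1*x1 + b2*x2 = 0, i.e. p = 0.5.
x1 = np.linspace(X["Glucose"].min(), X["Glucose"].max(), 100)
x2 = -(b0 + b1 * x1) / b2
fig, ax = plt.subplots()
ax.scatter(X_test["Glucose"], X_test["BMI"], c=y_test)
ax.plot(x1, x2, label="Decision boundary (p = 0.5)")
ax.set_xlabel("Glucose")
ax.set_ylabel("BMI")
ax.set_title("Logistic regression decision boundary")
ax.legend()
fig.savefig("decision_boundary.png")  # hypothetical file name
```

In a notebook, `plt.show()` could be used in place of `fig.savefig(...)`. Here the missing values are filled before the split, following the order of the steps; fitting the imputation on the training set alone would be the stricter choice.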