Untitled12.ipynb - Colab https://colab.research.google.com/drive/1y_9mSzRweIDs2RQA9Y...
Size of the House (sqft)
Number of Bedrooms
Age of the House (years)
Distance to the City Center
Color of the Front Door
# Import necessary libraries
import pandas as pd
# Create a small sample dataset
data = pd.DataFrame({
'Size (sqft)': [1500, 1600, 1700, 1800, 1900],
'Bedrooms': [3, 3, 4, 4, 5],
'Age of House (years)': [10, 15, 20, 25, 30],
'Distance to City Center (km)': [5, 4, 6, 3, 2],
'Front Door Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Price': [300000, 320000, 350000, 370000, 400000]
})
# Display the data
data
   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km)  Front Door Color   Price
0         1500         3                    10                             5               Red  300000
1         1600         3                    15                             4              Blue  320000
2         1700         4                    20                             6             Green  350000
3         1800         4                    25                             3              Blue  370000
4         1900         5                    30                             2               Red  400000
# Drop the "Front Door Color" column as it doesn't add value to our prediction
data = data.drop(columns=['Front Door Color'])
# Display the modified data
data
   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km)   Price
0         1500         3                    10                             5  300000
1         1600         3                    15                             4  320000
2         1700         4                    20                             6  350000
3         1800         4                    25                             3  370000
4         1900         5                    30                             2  400000
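One quick way to sanity-check the decision to drop Front Door Color is to compare the average price per color. The following is only a minimal sketch on the five sample rows, which are far too few to prove anything; the names houses and mean_price_by_color are introduced here for illustration.

```python
import pandas as pd

# Same five sample rows as in the dataset, restricted to the two columns of interest
houses = pd.DataFrame({
    'Front Door Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Price': [300000, 320000, 350000, 370000, 400000],
})

# Average price for each door color
mean_price_by_color = houses.groupby('Front Door Color')['Price'].mean()
print(mean_price_by_color)
```

On this toy data the three group means differ by at most 5,000 while prices span 100,000, which is consistent with the column carrying little signal.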
170: This is the mean (average) of the distribution. The generated numbers will center around 170.
10: This is the standard deviation (spread) of the distribution. It determines how much the numbers vary around the mean.
100: This is the number of values to generate. In this case, it will create an array of 100 random numbers.
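The three arguments can be checked empirically. This is a small sketch, assuming a local generator from np.random.default_rng (a substitution for the global np.random.seed used in the notebook); the names samples, sample_mean and sample_std are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed value is arbitrary

# Arguments: mean, standard deviation, number of values
samples = rng.normal(170, 10, 100)

sample_mean = samples.mean()
sample_std = samples.std()

print(samples.shape)                      # one value per requested draw
print(round(sample_mean, 2), round(sample_std, 2))
```

With only 100 draws the sample mean and standard deviation should land near 170 and 10 but will not match them exactly.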
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Seed for reproducibility
np.random.seed(0)
# Create a simple dataset for height and weight
height = np.random.normal(170, 10, 100) # Average height in cm
weight = height * 0.5 + np.random.normal(0, 5, 100) # Weight dependent on height
# Combine data into a DataFrame
data = pd.DataFrame({'Height': height, 'Weight': weight})
# Plot the data
plt.figure(figsize=(8, 6))
plt.scatter(data['Height'], data['Weight'], color='b', alpha=0.7)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Scatter plot of Height vs. Weight')
plt.show()
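The scatter plot shows the trend visually; a least-squares line gives a rough numeric check that the simulated weight really depends on height with a slope near 0.5. This is a sketch under the same seed and parameters as the cell above; np.polyfit is one of several ways to fit the line.

```python
import numpy as np

np.random.seed(0)
height = np.random.normal(170, 10, 100)
weight = height * 0.5 + np.random.normal(0, 5, 100)

# Fit weight = slope * height + intercept by least squares
slope, intercept = np.polyfit(height, weight, 1)

# Pearson correlation between the two variables
correlation = np.corrcoef(height, weight)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.2f}, correlation={correlation:.3f}")
```

Because noise with standard deviation 5 is added, the fitted slope is only an estimate of the 0.5 used to generate the data.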
# Apply PCA
from sklearn.decomposition import PCA  # PCA must be imported before use
pca = PCA(n_components=1)
data_transformed = pca.fit_transform(data)
# Print the transformed data (principal component)
print("Transformed Data (1 Principal Component):\n", data_transformed[:10])
# Plot the data in 1D (principal component)
plt.scatter(data_transformed, np.zeros_like(data_transformed))
plt.xlabel("Principal Component 1")
plt.title("Data after PCA (1D)")
plt.show()
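Once PCA is imported from sklearn.decomposition, a natural follow-up is to ask how much of the variance the single component keeps. The sketch below regenerates the height and weight data with the same seed and reads explained_variance_ratio_ and components_ from the fitted model; the names hw, projected, retained and direction are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(0)
height = np.random.normal(170, 10, 100)
weight = height * 0.5 + np.random.normal(0, 5, 100)
hw = pd.DataFrame({'Height': height, 'Weight': weight})

pca = PCA(n_components=1)
projected = pca.fit_transform(hw)

# Fraction of the total variance captured by the single component
retained = pca.explained_variance_ratio_[0]

# Direction of the component in (Height, Weight) space
direction = pca.components_[0]

print(projected.shape)
print("variance retained:", retained)
print("component direction:", direction)
```

PCA centers the data but does not scale it, so the column with the larger variance (Height here) tends to dominate the component unless the features are standardized first.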
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
# Sample documents
documents = [
"The cat is on the table",
"The dog is under the table",
"Cats and dogs are friends",
"Dogs run and cats jump"
]
# Convert text data into a count matrix (rows = documents, columns = terms)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
# Print the term-document matrix
print("Original Term-Document Matrix:\n", X.toarray())
print("\nFeature Names:", vectorizer.get_feature_names_out())
Original Term-Document Matrix:
[[0 0 1 0 0 0 0 1 0 1 0 1 2 0]
[0 0 0 0 1 0 0 1 0 0 0 1 2 1]
[1 1 0 1 0 1 1 0 0 0 0 0 0 0]
[1 0 0 1 0 1 0 0 1 0 1 0 0 0]]
Feature Names: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'friends' 'is' 'jump' 'on' 'run'
'table' 'the' 'under']
# Apply Truncated SVD
svd = TruncatedSVD(n_components=2) # Reducing to 2 components
X_reduced = svd.fit_transform(X)
# Print the reduced matrix
print("Reduced Term-Document Matrix:\n", X_reduced)
# Plot the reduced representation of documents
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Document Representation after SVD")
plt.show()
Reduced Term-Document Matrix:
[[ 2.64575131e+00 3.92045402e-18]
[ 2.64575131e+00 1.01846971e-15]
[-7.34164894e-16 2.00000000e+00]
[-7.35324166e-16 2.00000000e+00]]
Hours Studied
Attendance Rate
Participation in Class
Previous Grades
Extra-Curricular Activities
Start with No Features.
Evaluate Each Feature Individually to see which one gives the best prediction.
Add the Best Feature to the model.
Evaluate the Remaining Features with the selected feature(s).
Repeat until adding more features doesn’t significantly improve the model.
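The steps above can also be run with scikit-learn's built-in SequentialFeatureSelector, as an alternative to a hand-written loop. This is a minimal sketch on the student dataset used in this notebook; n_features_to_select=2 is an arbitrary choice for illustration, and chosen is an illustrative name.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76],
})
X = df.drop('Final_Grade', axis=1)
y = df['Final_Grade']

# Greedy forward selection scored by 3-fold cross-validation
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,  # arbitrary stopping point for this sketch
    direction='forward',
    cv=3,
)
selector.fit(X, y)

# get_support() returns a boolean mask over the columns of X
chosen = list(X.columns[selector.get_support()])
print(chosen)
```

Note that get_support() reports which features were kept, not the order in which they were added.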
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Sample dataset: Predicting Final Grade based on various factors
data = {
'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
}
# Convert the data to a DataFrame
df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1) # Features (input data)
y = df['Final_Grade'] # Target variable (what we want to predict)
# Initialize variables for forward selection
remaining_features = list(X.columns) # Start with all features available for selection
selected_features = [] # Start with an empty set of selected features
model = LinearRegression() # Initialize a linear regression model
# Forward Selection Process
print("Forward Feature Selection Process:")
# Keep adding the best remaining feature until none are left (no stopping threshold is applied here)
while remaining_features:
    scores = {}  # Dictionary to store scores for each feature
    # Test adding each feature not yet selected
    for feature in remaining_features:
        # Temporarily add the current feature to the selected set
        current_features = selected_features + [feature]
        # Use only the selected features in the model
        X_subset = X[current_features]
        # Evaluate model performance using cross-validation and calculate average score
        score = cross_val_score(model, X_subset, y, cv=3).mean()
        # Store the score of this model in the dictionary
        scores[feature] = score
    # Find the feature that gives the best improvement in model performance
    best_feature = max(scores, key=scores.get)  # Select the feature with the highest score
    # Add the best feature to the selected features list
    selected_features.append(best_feature)
    # Remove the best feature from the remaining features list
    remaining_features.remove(best_feature)
    # Print the selected feature and its score for tracking
    print(f"Selected Feature: {best_feature}, Score: {scores[best_feature]:.4f}")
# Print the order in which features were selected
print("\nSelected Features in Order:", selected_features)
Forward Feature Selection Process:
Selected Feature: Hours_Studied, Score: 0.9679
Selected Feature: Extra_Curricular, Score: 0.9619
Selected Feature: Participation, Score: 0.9374
Selected Feature: Attendance_Rate, Score: 0.9163
Selected Feature: Previous_Grades, Score: -1.0925
Selected Features in Order: ['Hours_Studied', 'Extra_Curricular', 'Participation', 'Attendance_Rate', 'Previous_Grades']
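In the recorded run the score falls from 0.9679 to 0.9619 when a second feature is added and keeps falling afterwards, so the last step of the procedure ("repeat until adding more features doesn't significantly improve the model") would suggest stopping early. The sketch below shows one possible stopping rule; the function forward_select and its min_improvement threshold are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76],
})
X = df.drop('Final_Grade', axis=1)
y = df['Final_Grade']


def forward_select(X, y, min_improvement=0.01, cv=3):
    """Greedy forward selection that stops when the best candidate improves
    the mean cross-validated score by less than min_improvement."""
    model = LinearRegression()
    remaining = list(X.columns)
    selected = []
    best_score = float('-inf')  # so the first feature is always accepted
    while remaining:
        # Score every candidate together with the features already selected
        scores = {
            feature: cross_val_score(model, X[selected + [feature]], y, cv=cv).mean()
            for feature in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_score < min_improvement:
            break  # no meaningful gain, stop adding features
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


selected, best_score = forward_select(X, y)
print("Selected:", selected)
print(f"Score: {best_score:.4f}")
```

The threshold of 0.01 is arbitrary; with only ten rows and three folds the scores are noisy, so any such cutoff should be treated as a rough heuristic.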
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Sample dataset: Predicting Final Grade based on various factors
data = {
'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
}
# Convert the data to a DataFrame
df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1) # Features (input data)
y = df['Final_Grade'] # Target variable (what we want to predict)
remaining_features = list(X.columns) # Start with all features available for selection
selected_features = [] # Start with an empty set of selected features
model = LinearRegression() # Initialize a linear regression model
while remaining_features: starts a loop that runs as long as there are still features we haven’t selected.
scores = {} initializes an empty dictionary to keep track of model scores for each feature we test.
Inner Loop: Testing Each Feature:
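for feature in remaining_features: loops over each feature we haven’t selected yet.
current_features = selected_features + [feature] builds a temporary list containing the already selected features plus the current one; selected_features itself is not modified.
X_subset = X[current_features] creates X_subset, which includes only the columns for those features.
score = cross_val_score(model, X_subset, y, cv=3).mean() evaluates the model’s performance using 3-fold cross-validation and averages the fold scores (R² by default for LinearRegression).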
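print("Forward Feature Selection Process:")
while remaining_features:
    scores = {}  # Dictionary to store scores for each feature
    # Test adding each feature not yet selected
    for feature in remaining_features:
        # Temporarily add the current feature to the selected set
        current_features = selected_features + [feature]
        # Use only the selected features in the model
        X_subset = X[current_features]
        # Evaluate model performance using cross-validation and calculate average score
        score = cross_val_score(model, X_subset, y, cv=3).mean()
        # Store the score of this model in the dictionary
        scores[feature] = score
    # Select the feature with the highest score and move it to the selected list
    # (without this step the loop would never terminate)
    best_feature = max(scores, key=scores.get)
    selected_features.append(best_feature)
    remaining_features.remove(best_feature)
    print(f"Selected Feature: {best_feature}, Score: {scores[best_feature]:.4f}")
print("\nSelected Features in Order:", selected_features)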
import numpy as np
# Set the random seed for reproducibility
np.random.seed(0)
# Simulate flipping a coin 100 times (0 = Tails, 1 = Heads)
flips = np.random.choice([0, 1], size=100) # 0 for Tails, 1 for Heads
# Count the number of Heads and Tails
heads_count = np.sum(flips == 1)
tails_count = np.sum(flips == 0)
# Calculate empirical probabilities
p_heads = heads_count / len(flips)
p_tails = tails_count / len(flips)
print(f"Empirical Probability of Heads: {p_heads}")
print(f"Empirical Probability of Tails: {p_tails}")
np.random.choice([0, 1], size=100) simulates flipping a coin 100 times. Here, 0 represents Tails and 1 represents Heads.
We count how many times 1 (Heads) and 0 (Tails) occur in the array.
By dividing the count of each outcome by the total number of flips, we get the empirical probability of that outcome.
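An empirical probability is only an estimate, and it tends to move closer to the true value of 0.5 as the number of flips grows. The sketch below illustrates this with three sample sizes; it assumes np.random.default_rng and rng.integers in place of np.random.choice, and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed value is arbitrary

estimates = {}
for n_flips in (100, 10_000, 1_000_000):
    flips = rng.integers(0, 2, size=n_flips)  # 0 = Tails, 1 = Heads
    estimates[n_flips] = flips.mean()         # fraction of Heads
    print(f"{n_flips:>9} flips: empirical P(Heads) = {estimates[n_flips]:.4f}")
```

Taking the mean of a 0/1 array is equivalent to dividing the count of Heads by the total number of flips.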
import numpy as np
# Data for study time (hours) and test scores
study_time = [2, 3, 4, 5, 6]
test_scores = [60, 65, 70, 75, 80]
# Calculate correlation
correlation = np.corrcoef(study_time, test_scores)[0, 1]
print("Correlation between study time and test scores:", correlation)
Correlation between study time and test scores: 1.0
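The value of 1.0 arises because every extra hour of study adds exactly 5 points, so the data lie on a straight line. As a rough check, the Pearson coefficient can be computed by hand from the deviations around each mean and compared with np.corrcoef; the names dx, dy, r_manual and r_numpy are illustrative.

```python
import numpy as np

study_time = np.array([2, 3, 4, 5, 6], dtype=float)
test_scores = np.array([60, 65, 70, 75, 80], dtype=float)

# Deviations of each value from its mean
dx = study_time - study_time.mean()
dy = test_scores - test_scores.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_numpy = np.corrcoef(study_time, test_scores)[0, 1]

print(r_manual, r_numpy)
```

A correlation of 1.0 says the relationship is perfectly linear in this sample; it does not by itself show that study time causes higher scores.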