Feature Selection in Python

The document outlines a series of data analysis tasks using Python, including creating datasets, applying statistical methods, and visualizing results. It covers topics such as feature selection for predicting final grades, generating random datasets for height and weight, and applying dimensionality reduction techniques like PCA and SVD. The document also includes code snippets for implementing these analyses in a Jupyter notebook environment.

Uploaded by

vamsikopparthi84


Untitled12.ipynb - Colab https://colab.research.google.com/drive/1y_9mSzRweIDs2RQA9Y...

Size of the House (sqft)
Number of Bedrooms
Age of the House (years)
Distance to the City Center
Color of the Front Door

# Import necessary libraries
import pandas as pd

# Create a small sample dataset
data = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 1900],
    'Bedrooms': [3, 3, 4, 4, 5],
    'Age of House (years)': [10, 15, 20, 25, 30],
    'Distance to City Center (km)': [5, 4, 6, 3, 2],
    'Front Door Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Price': [300000, 320000, 350000, 370000, 400000]
})

# Display the data
data

   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km)  Front Door Color   Price
0         1500         3                    10                             5               Red  300000
1         1600         3                    15                             4              Blue  320000
2         1700         4                    20                             6             Green  350000
3         1800         4                    25                             3              Blue  370000
4         1900         5                    30                             2               Red  400000

# Drop the "Front Door Color" column as it doesn't add value to our prediction
data = data.drop(columns=['Front Door Color'])


# Display the modified data

data

   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km)   Price
0         1500         3                    10                             5  300000
1         1600         3                    15                             4  320000
2         1700         4                    20                             6  350000
3         1800         4                    25                             3  370000
4         1900         5                    30                             2  400000
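Dropping the door color is a judgment call. One rough way to sanity-check such a decision, sketched here on the same five-row sample, is to compare the average price per door color and to look at how strongly each numeric feature correlates with Price. The names `houses`, `price_by_color` and `corr_with_price` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Same five-row sample as in the notebook, rebuilt so this sketch runs on its own
houses = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 1900],
    'Bedrooms': [3, 3, 4, 4, 5],
    'Age of House (years)': [10, 15, 20, 25, 30],
    'Distance to City Center (km)': [5, 4, 6, 3, 2],
    'Front Door Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Price': [300000, 320000, 350000, 370000, 400000]
})

# Mean price per door color: near-identical means suggest the color carries little signal
price_by_color = houses.groupby('Front Door Color')['Price'].mean()
print(price_by_color)

# Pearson correlation of each numeric feature with the target
numeric = houses.drop(columns=['Front Door Color'])
corr_with_price = numeric.drop(columns=['Price']).corrwith(numeric['Price'])
print(corr_with_price)
```

With only five rows this is an illustration of the check, not evidence about real housing data.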

170: This is the mean (average) of the distribution. The generated numbers will center around 170.
10: This is the standard deviation (spread) of the distribution. It determines how much the numbers
100: This is the number of values to generate. In this case, it will create an array of 100 random n
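As a quick check of what these three arguments mean, the sketch below draws a larger sample and compares its mean and standard deviation with the requested values. It uses `np.random.default_rng`, which is an assumption of this example; the notebook itself relies on `np.random.seed` and `np.random.normal`.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative generator, not the notebook's seed call

# loc = mean, scale = standard deviation, size = number of values
sample = rng.normal(loc=170, scale=10, size=10_000)

sample_mean = sample.mean()
sample_std = sample.std()
print(f"mean ~ {sample_mean:.2f}, std ~ {sample_std:.2f}, count = {sample.size}")
```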

# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seed for reproducibility

np.random.seed(0)

# Create a simple dataset for height and weight

height = np.random.normal(170, 10, 100) # Average height in cm

weight = height * 0.5 + np.random.normal(0, 5, 100) # Weight dependent on height

# Combine data into a DataFrame

data = pd.DataFrame({'Height': height, 'Weight': weight})

# Plot the data

plt.figure(figsize=(8, 6))
plt.scatter(data['Height'], data['Weight'], color='b', alpha=0.7)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Scatter plot of Height vs. Weight')
plt.show()

# Import PCA before using it
from sklearn.decomposition import PCA

# Apply PCA
pca = PCA(n_components=1)
data_transformed = pca.fit_transform(data)

# Print the transformed data (principal component)
print("Transformed Data (1 Principal Component):\n", data_transformed[:10])

# Plot the data in 1D (principal component)
plt.scatter(data_transformed, np.zeros_like(data_transformed))
plt.xlabel("Principal Component 1")
plt.title("Data after PCA (1D)")
plt.show()
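A minimal sketch of how one might inspect a one-component PCA on the height and weight data, with the `PCA` import included. It rebuilds the dataset with the same seed so it runs on its own; `data_reconstructed` and `reconstruction_error` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Rebuild the height/weight dataset exactly as in the notebook
np.random.seed(0)
height = np.random.normal(170, 10, 100)
weight = height * 0.5 + np.random.normal(0, 5, 100)
data = pd.DataFrame({'Height': height, 'Weight': weight})

pca = PCA(n_components=1)
data_transformed = pca.fit_transform(data)

# Direction of the principal component in (Height, Weight) space
print("Component:", pca.components_[0])

# Fraction of the total variance kept by the single component
print("Explained variance ratio:", pca.explained_variance_ratio_[0])

# Map the 1D representation back to 2D to see how much information was lost
data_reconstructed = pca.inverse_transform(data_transformed)
reconstruction_error = np.mean((data.values - data_reconstructed) ** 2)
print("Mean squared reconstruction error:", reconstruction_error)
```

Because weight is generated as a noisy multiple of height, a single component should be expected to retain most of the variance.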

from sklearn.decomposition import TruncatedSVD

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
"The cat is on the table",
"The dog is under the table",
"Cats and dogs are friends",
"Dogs run and cats jump"
]

# Convert text data into a term-document count matrix (rows are documents, columns are terms)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Print the term-document matrix

print("Original Term-Document Matrix:\n", X.toarray())
print("\nFeature Names:", vectorizer.get_feature_names_out())

Original Term-Document Matrix:

[[0 0 1 0 0 0 0 1 0 1 0 1 2 0]
[0 0 0 0 1 0 0 1 0 0 0 1 2 1]
[1 1 0 1 0 1 1 0 0 0 0 0 0 0]
[1 0 0 1 0 1 0 0 1 0 1 0 0 0]]

Feature Names: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'friends' 'is' 'jump' 'on' 'run'
'table' 'the' 'under']

# Apply Truncated SVD

svd = TruncatedSVD(n_components=2) # Reducing to 2 components
X_reduced = svd.fit_transform(X)

# Print the reduced matrix

print("Reduced Term-Document Matrix:\n", X_reduced)


# Plot the reduced representation of documents

plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Document Representation after SVD")
plt.show()

Reduced Term-Document Matrix:

[[ 2.64575131e+00 3.92045402e-18]
[ 2.64575131e+00 1.01846971e-15]
[-7.34164894e-16 2.00000000e+00]
[-7.35324166e-16 2.00000000e+00]]
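The reduced coordinates can be cross-checked with a plain NumPy SVD. The sketch below copies the count matrix from the printed output and computes the document coordinates as the left singular vectors scaled by the singular values, which is what `TruncatedSVD.fit_transform` approximates. The names `X_dense` and `doc_coords` are introduced here for illustration.

```python
import numpy as np

# Count matrix copied from the printed output (rows = documents, columns = terms)
X_dense = np.array([
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 0],
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 2, 1],
    [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
], dtype=float)

# Full (thin) SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_dense, full_matrices=False)

# Document coordinates on the two largest singular directions
doc_coords = U[:, :2] * s[:2]

print("Singular values:", s)
print("Document coordinates (signs may differ from TruncatedSVD):")
print(np.round(doc_coords, 4))
```

The first two documents share words only with each other, and likewise the last two, so each pair lands on its own component: about 2.6458 (the square root of 7) for the first pair and 2.0 for the second, up to sign.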

Hours Studied
Attendance Rate
Participation in Class
Previous Grades
Extra-Curricular Activities


Start with No Features.

Evaluate Each Feature Individually to see which one gives the best prediction.
Add the Best Feature to the model.
Evaluate the Remaining Features with the selected feature(s).
Repeat until adding more features doesn’t significantly improve the model.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Sample dataset: Predicting Final Grade based on various factors

data = {
'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
}

# Convert the data to a DataFrame

df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1) # Features (input data)
y = df['Final_Grade'] # Target variable (what we want to predict)

# Initialize variables for forward selection

remaining_features = list(X.columns) # Start with all features available for selection
selected_features = [] # Start with an empty set of selected features
model = LinearRegression() # Initialize a linear regression model

# Forward Selection Process

print("Forward Feature Selection Process:")

# Keep adding the best remaining feature until every feature has been ranked
# (this loop has no early-stopping check, so all features end up selected)
while remaining_features:
    scores = {}  # Dictionary to store scores for each feature

    # Test adding each feature not yet selected
    for feature in remaining_features:
        # Temporarily add the current feature to the selected set
        current_features = selected_features + [feature]

        # Use only the selected features in the model
        X_subset = X[current_features]

        # Evaluate model performance using cross-validation and calculate average score
        # (for LinearRegression the default score is R^2)
        score = cross_val_score(model, X_subset, y, cv=3).mean()

        # Store the score of this model in the dictionary
        scores[feature] = score

    # Find the feature whose addition gives the highest cross-validated score
    best_feature = max(scores, key=scores.get)

    # Add the best feature to the selected features list
    selected_features.append(best_feature)

    # Remove the best feature from the remaining features list
    remaining_features.remove(best_feature)

    # Print the selected feature and its score for tracking
    print(f"Selected Feature: {best_feature}, Score: {scores[best_feature]:.4f}")

# Print the order in which features were selected
print("\nSelected Features in Order:", selected_features)

Forward Feature Selection Process:

Selected Feature: Hours_Studied, Score: 0.9679
Selected Feature: Extra_Curricular, Score: 0.9619
Selected Feature: Participation, Score: 0.9374
Selected Feature: Attendance_Rate, Score: 0.9163
Selected Feature: Previous_Grades, Score: -1.0925

Selected Features in Order: ['Hours_Studied', 'Extra_Curricular', 'Participation', 'Atten
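The printed scores fall as features are added, and the last step is negative, which is the situation the "repeat until adding more features doesn't significantly improve the model" step is meant to catch. A minimal sketch of a variant with that stopping rule follows. The function name `forward_select` and the `min_improvement` threshold of 0.01 are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
})
X = df.drop('Final_Grade', axis=1)
y = df['Final_Grade']


def forward_select(X, y, min_improvement=0.01, cv=3):
    """Greedy forward selection that stops once the CV score stops improving.

    min_improvement is a hypothetical threshold chosen for illustration.
    """
    model = LinearRegression()
    remaining = list(X.columns)
    selected = []
    best_score = float('-inf')

    while remaining:
        # Mean cross-validated R^2 for each candidate added to the current set
        scores = {
            f: cross_val_score(model, X[selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        candidate = max(scores, key=scores.get)

        # Stop when the best candidate does not improve the score enough
        if scores[candidate] - best_score < min_improvement:
            break

        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]

    return selected, best_score


selected, score = forward_select(X, y)
print("Selected:", selected, "CV R^2:", round(score, 4))
```

Judging by the scores printed above (0.9679 for the first feature, then 0.9619), this variant would be expected to stop after `Hours_Studied`, though the exact behavior depends on the threshold chosen.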

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


# Sample dataset: Predicting Final Grade based on various factors

# Convert the data to a DataFrame

df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1) # Features (input data)
y = df['Final_Grade'] # Target variable (what we want to predict)

remaining_features = list(X.columns) # Start with all features available for selection

selected_features = [] # Start with an empty set of selected features
model = LinearRegression() # Initialize a linear regression model

while remaining_features:: Start a loop that runs as long as there are still features we haven’t

scores = {}: Initialize an empty dictionary to keep track of model scores for each feature we test.

Inner Loop: Testing Each Feature:


for feature in remaining_features:: Loop over each feature we haven’t selected yet.

current_features = selected_features + [feature]: Add the current feature to the selected list t

X_subset = X[current_features]: Create X_subset, which includes only the selected features.

score = cross_val_score(model, X_subset, y, cv=3).mean(): Evaluate the model’s performance

print("Forward Feature Selection Process:")

while remaining_features:
    scores = {}  # Dictionary to store scores for each feature

    # Test adding each feature not yet selected
    for feature in remaining_features:
        # Temporarily add the current feature to the selected set
        current_features = selected_features + [feature]

        # Use only the selected features in the model
        X_subset = X[current_features]

        # Evaluate model performance using cross-validation and calculate average score
        score = cross_val_score(model, X_subset, y, cv=3).mean()

        # Store the score of this model in the dictionary
        scores[feature] = score

    # (Best-feature selection and removal continue here, as in the full listing above.)

print("\nSelected Features in Order:", selected_features)
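For comparison, scikit-learn ships a ready-made greedy selector, `SequentialFeatureSelector`, which performs the same kind of forward search. The sketch below is a swapped-in library alternative rather than the notebook's own method, and the choice of `n_features_to_select=2` is an arbitrary assumption for the example.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
})
X = df.drop('Final_Grade', axis=1)
y = df['Final_Grade']

# Forward search scored by 3-fold cross-validation, keeping two features (arbitrary choice)
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='forward',
    cv=3,
)
sfs.fit(X, y)

# Boolean mask of kept columns, converted to column names
chosen = list(X.columns[sfs.get_support()])
print("Features kept by SequentialFeatureSelector:", chosen)
```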

import numpy as np

# Set the random seed for reproducibility

np.random.seed(0)


# Simulate flipping a coin 100 times (0 = Tails, 1 = Heads)

flips = np.random.choice([0, 1], size=100) # 0 for Tails, 1 for Heads

# Count the number of Heads and Tails

heads_count = np.sum(flips == 1)
tails_count = np.sum(flips == 0)

# Calculate empirical probabilities

p_heads = heads_count / len(flips)
p_tails = tails_count / len(flips)

print(f"Empirical Probability of Heads: {p_heads}")

print(f"Empirical Probability of Tails: {p_tails}")

np.random.choice([0, 1], size=100) simulates flipping a coin 100 times. Here, 0 represents Tails and

We count how many times 1 (Heads) and 0 (Tails) occur in the array.

By dividing the count of each outcome by the total number of flips, we get the empirical probability
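To illustrate how the empirical probability relates to the true value of 0.5, the sketch below repeats the experiment with increasing numbers of flips. It uses `np.random.default_rng` with an arbitrary seed, which is an assumption of this example rather than the notebook's setup.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

for n in (10, 100, 1_000, 100_000):
    flips = rng.integers(0, 2, size=n)  # 0 = Tails, 1 = Heads
    p_heads = flips.mean()              # fraction of ones = empirical P(Heads)
    print(f"{n:>7} flips: empirical P(Heads) = {p_heads:.4f}")
```

With few flips the estimate can sit noticeably away from 0.5; with many flips it tends to settle close to it.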

import numpy as np

# Data for study time (hours) and test scores

study_time = [2, 3, 4, 5, 6]
test_scores = [60, 65, 70, 75, 80]

# Calculate correlation

correlation = np.corrcoef(study_time, test_scores)[0, 1]

print("Correlation between study time and test scores:", correlation)

Correlation between study time and test scores: 1.0
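The value of 1.0 follows from the data lying exactly on a straight line. A short sketch, computing the Pearson coefficient from its definition and fitting a line with `np.polyfit`, makes this visible. The variable names `dx`, `dy`, `r`, `slope` and `intercept` are introduced here for illustration.

```python
import numpy as np

study_time = np.array([2, 3, 4, 5, 6], dtype=float)
test_scores = np.array([60, 65, 70, 75, 80], dtype=float)

# Deviations from the mean
dx = study_time - study_time.mean()
dy = test_scores - test_scores.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print("Pearson r computed by hand:", r)

# Least-squares line through the points
slope, intercept = np.polyfit(study_time, test_scores, 1)
print(f"Best-fit line: score = {slope:.1f} * hours + {intercept:.1f}")
```

Each extra hour adds exactly 5 points here, so the relationship is perfectly linear; real data would rarely give a coefficient of exactly 1.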
