PCA
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
data = fetch_openml("mnist_784", version=1)
x, y = data.data, data.target
plt.imshow(x.iloc[50].values.reshape(28, 28), cmap="gray")
plt.show()
pca = PCA(n_components=50)
x_reduced = pca.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(
    x_reduced, y, random_state=42, test_size=0.2
)
lg = LogisticRegression(max_iter=1000, solver="lbfgs", multi_class="multinomial")
lg.fit(x_train, y_train)
predicted = lg.predict(x_test)
accuracy = accuracy_score(y_test, predicted)
print(accuracy)
This code demonstrates how to perform handwritten digit classification
using the MNIST dataset with Principal Component Analysis (PCA) for
dimensionality reduction and Logistic Regression for classification. Here's
a detailed breakdown:
1. Importing Libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
fetch_openml: Used to download datasets from OpenML
train_test_split: For splitting data into training and test sets
LogisticRegression: The classification model we'll use
accuracy_score: To evaluate model performance
PCA: For dimensionality reduction
matplotlib.pyplot: For visualizing the digits
2. Loading the MNIST Dataset
data = fetch_openml("mnist_784", version=1)
x, y = data.data, data.target
MNIST contains 70,000 handwritten digit images (0-9)
Each image is 28×28 pixels, flattened into a 784-dimensional feature vector
x contains the pixel data (intensities from 0 to 255)
y contains the digit labels, which fetch_openml returns as strings ("0" to "9") rather than integers
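Because the labels arrive as strings, it can be worth checking shapes and types right after loading. The sketch below uses a small hypothetical stand-in (`x_demo`, `y_demo`) instead of the real download so that it runs offline; the same checks should apply to the real `x` and `y`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.data / data.target (not the real MNIST):
# 5 fake "images" of 784 pixels each, with labels stored as strings.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.integers(0, 256, size=(5, 784)))
y_demo = pd.Series(["5", "0", "4", "1", "9"], dtype="category")

# Sanity checks that are also worth running on the real x and y.
print(x_demo.shape)   # (number of images, 784)
print(y_demo.dtype)   # category (string labels), not an integer type

# Convert the labels to integers when numeric labels are needed.
y_int = y_demo.astype(int)
print(y_int.tolist())
```

scikit-learn classifiers accept string labels directly, so the conversion is only needed for code that does arithmetic on the labels.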
3. Visualizing a Sample Digit
plt.imshow(x.iloc[50].values.reshape(28, 28), cmap="gray")
plt.show()
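Selects the image at index 50 (the 51st image, since indexing starts at 0); x.iloc works here because recent scikit-learn versions return the pixels as a pandas DataFrame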
Reshapes the 784-length vector back to 28×28
Displays it in grayscale
This helps verify the data is loaded correctly
4. Dimensionality Reduction with PCA
pca = PCA(n_components=50)
x_reduced = pca.fit_transform(x)
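PCA projects the 784 pixel features onto 50 principal components, the directions of greatest variance in the data
This speeds up training while keeping most of the variance; the exact share retained can be read from pca.explained_variance_ratio_
The reduced dataset (x_reduced) now has 50 features per image instead of 784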
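How much variance a given number of components keeps can be measured rather than assumed. The sketch below is a minimal illustration that uses scikit-learn's bundled `load_digits` data (8×8 images, 64 features) as an offline stand-in for MNIST; the component count of 20 is an arbitrary choice for this smaller dataset, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: load_digits stands in for MNIST so the example needs no download.
X_small, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=20)
X_small_reduced = pca_demo.fit_transform(X_small)

# Cumulative share of the total variance kept by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"variance kept by 20 components: {cumulative[-1]:.3f}")

# A float between 0 and 1 asks PCA to keep just enough components
# to reach that share of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full")
pca_95.fit(X_small)
print("components needed for 95% of the variance:", pca_95.n_components_)
```

The same two checks can be run on the MNIST `pca` object to see what share of the variance the 50 components actually retain.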
5. Splitting Data into Train/Test Sets
x_train, x_test, y_train, y_test = train_test_split(
    x_reduced, y, random_state=42, test_size=0.2
)
80% of data (56,000 samples) for training
20% of data (14,000 samples) for testing
random_state=42 ensures reproducible splits
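One caveat about the order of steps: the listing fits PCA on all 70,000 images and splits afterwards, so the test images influence the projection. A common alternative is to split the raw pixels first and let a pipeline fit PCA on the training rows only. The sketch below illustrates that ordering on the bundled `load_digits` data as an offline stand-in for MNIST; the names `X_tr`, `X_te` and `model`, the `stratify` argument and the parameter values are illustrative additions, not part of the original code.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: load_digits (8x8 images) stands in for MNIST.
X_small, y_small = load_digits(return_X_y=True)

# Split the raw pixels first; stratify keeps the class proportions
# the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42, stratify=y_small
)

# fit() learns the PCA projection from the training rows only;
# score() reuses that projection on the test rows.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=5000))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

On a dataset as large as MNIST the difference in measured accuracy is likely to be small, but the pipeline form keeps the test set strictly unseen.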
6. Training Logistic Regression Model
lg = LogisticRegression(max_iter=1000, solver="lbfgs", multi_class="multinomial")
lg.fit(x_train, y_train)
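Uses multinomial (softmax) logistic regression for the 10-class problem
max_iter=1000: Allows up to 1000 solver iterations; the default of 100 is often too few for the solver to converge on this data
solver="lbfgs": The default solver, which supports the multinomial loss
multi_class="multinomial": Fits one softmax model over all 10 classes instead of ten one-vs-rest models; this parameter is deprecated as of scikit-learn 1.5 and can be omitted on recent versions, where lbfgs already behaves this way for multi-class targets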
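To make the "multinomial" behavior concrete, the sketch below inspects the class probabilities that a fitted model produces. It uses the bundled `load_digits` data as an offline stand-in for MNIST and fits on all rows purely for illustration; the name `clf` and the division by 16 (the maximum pixel value in that dataset) are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Assumption: load_digits stands in for MNIST; its pixels range from 0 to 16.
X_small, y_small = load_digits(return_X_y=True)
X_scaled = X_small / 16.0

# multi_class is left out: for a multi-class target, lbfgs fits a single
# softmax model over all classes by default.
clf = LogisticRegression(max_iter=5000, solver="lbfgs")
clf.fit(X_scaled, y_small)

proba = clf.predict_proba(X_scaled[:3])
print(proba.shape)         # one row per sample, one column per class
print(proba.sum(axis=1))   # each row of probabilities sums to 1

# predict() returns the class with the highest probability.
top_class = clf.classes_[proba.argmax(axis=1)]
print(top_class, clf.predict(X_scaled[:3]))
```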
7. Making Predictions and Evaluating Accuracy
predicted = lg.predict(x_test)
accuracy = accuracy_score(y_test, predicted)
print(accuracy)
Predicts labels for the test set
Compares predictions to true labels
Prints the accuracy score (fraction of correct predictions)
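Accuracy is a single number, so it hides which digits get confused with which. A confusion matrix breaks the same comparison down by class. The sketch below uses small made-up label arrays (`y_true_demo`, `y_pred_demo`) rather than real model output, stored as strings like the MNIST targets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = np.array(["0", "1", "2", "2", "1", "0", "2", "1"])
y_pred_demo = np.array(["0", "1", "2", "1", "1", "0", "2", "2"])

acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["0", "1", "2"])

print(acc)  # fraction of positions where prediction and label match
print(cm)   # rows are true classes, columns are predicted classes

# Accuracy is the diagonal (correct predictions) divided by the total count.
print(np.trace(cm) / cm.sum())
```

With the real model, the same calls would be `confusion_matrix(y_test, predicted)` and `accuracy_score(y_test, predicted)`.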
Expected Output
The code will:
1. Display a sample digit image
2. Print an accuracy score of roughly 0.91-0.92 (91-92%) on the test set; the exact figure can vary slightly across scikit-learn versions
Key Concepts Illustrated
1. Dimensionality Reduction: PCA cuts computation time while giving up little accuracy
2. Multi-class Classification: Logistic regression extends to more than two classes through the multinomial (softmax) formulation
3. Model Evaluation: A held-out test set estimates how well the model generalizes to unseen images
4. Image Data Handling: Images are stored and processed as flattened pixel vectors
The accuracy could potentially be improved by:
Using more PCA components (at the cost of speed)
Trying more complex models like neural networks
Performing hyperparameter tuning
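Two of these suggestions, the number of PCA components and hyperparameter tuning, can be explored together with a cross-validated grid search. The sketch below is one possible setup, run on the bundled `load_digits` data as an offline stand-in for MNIST; the grid values, the step names `pca` and `clf`, and the 3-fold setting are arbitrary illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: load_digits stands in for MNIST.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42, stratify=y_small
)

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Illustrative grid: number of components and regularization strength C.
param_grid = {
    "pca__n_components": [10, 20, 40],
    "clf__C": [0.1, 1.0],
}

# Each combination is scored by 3-fold cross-validation on the training set.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_te, y_te):.3f}")
```

On the full MNIST data the same search would take considerably longer, so a smaller grid or a subsample of the training set may be a practical starting point.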
Image Processing Code
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
# Specify the image path
image_path = "C:/Users/GeeKs/Desktop/dagi/pictures/dagi.jpg"
# Open the image using PIL
image = Image.open(image_path)
# Resize the image to 28x28
resized_image = image.resize((28, 28))
# Convert the image to grayscale
grayscale_image = resized_image.convert("L")
# Convert to numpy array
image_array = np.array(grayscale_image)
# Create a pandas DataFrame from the 2D array
image_df = pd.DataFrame(image_array)
# Display the original and resized images
plt.title("Original Image")
plt.imshow(image)
plt.show()
plt.title("Grayscale Resized Image (28x28)")
plt.imshow(grayscale_image, cmap="gray")
plt.show()
This code demonstrates how to load, preprocess, and visualize an image
using Python's PIL (Pillow), NumPy, pandas, and matplotlib libraries. The
code prepares an image for potential machine learning applications (like
digit classification) by converting it to the MNIST dataset format (28×28
grayscale).
Full Code Breakdown
1. Importing Required Libraries
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
pandas (pd): For creating DataFrames (though not strictly
necessary for this operation)
numpy (np): For array operations and conversions
PIL.Image: From Pillow library, for image loading and processing
matplotlib.pyplot (plt): For image visualization
2. Specifying Image Path
image_path = "C:/Users/GeeKs/Desktop/dagi/pictures/dagi.jpg"
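Defines the path to the image file; one assignment is enough, and a chained form such as image_path = image_path = "..." is redundant, though harmless
Uses forward slashes, which Python accepts on Windows; a raw string (prefix r) with single backslashes, or doubled backslashes, would also work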
3. Loading the Image
image = Image.open(image_path)
Opens the image file using PIL's Image.open()
Creates an Image object that can be manipulated
4. Resizing the Image
resized_image = image.resize((28, 28))
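Resizes the image to exactly 28×28 pixels (the MNIST image size); the aspect ratio is not preserved, so a non-square image is stretched
Uses bicubic resampling by default in Pillow 7.0 and later (earlier versions defaulted to nearest-neighbor)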
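The resampling filter can also be chosen explicitly, and padding is one way to avoid stretching a non-square image. The sketch below uses a generated solid-color image (`demo`) as a hypothetical stand-in for the photo so that it runs without a file; the filter choices are examples, not recommendations.

```python
from PIL import Image, ImageOps

# Hypothetical stand-in for the loaded photo: a 200x100 solid-color image.
demo = Image.new("RGB", (200, 100), color=(200, 30, 30))

# Explicit resampling filters; both calls stretch the image to 28x28.
nearest = demo.resize((28, 28), resample=Image.NEAREST)
lanczos = demo.resize((28, 28), resample=Image.LANCZOS)
print(nearest.size, lanczos.size)

# ImageOps.pad scales the image to fit inside 28x28 while keeping its
# aspect ratio, then fills the leftover border with the given color.
padded = ImageOps.pad(demo, (28, 28), color=(0, 0, 0))
print(padded.size)
```

For a handwritten digit photographed on a wide or tall canvas, the padded version is likely to look closer to the original shape than the stretched one.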
5. Converting to Grayscale
grayscale_image = resized_image.convert("L")
Converts the color image to 8-bit grayscale (mode "L")
Each pixel will have values 0-255 (black to white)
6. Converting to NumPy Array
image_array = np.array(grayscale_image)
Converts the PIL Image object to a NumPy array
Creates a 28×28 2D array of type uint8, where each element is a pixel intensity from 0 to 255
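A classifier trained on flattened MNIST rows expects one row of 784 values rather than a 28×28 grid. The sketch below shows that reshaping step on a generated gray image (`gray_demo`), a hypothetical stand-in for `grayscale_image`.

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for grayscale_image: a uniform mid-gray 28x28 image.
gray_demo = Image.new("L", (28, 28), color=128)

arr = np.array(gray_demo)
print(arr.shape, arr.dtype)   # (28, 28) uint8

# Flatten row by row into a single sample with 784 features,
# the layout a model trained on flattened 28x28 images expects.
flat = arr.reshape(1, -1)
print(flat.shape)             # (1, 784)

# Reshaping back recovers the original 2D image.
restored = flat.reshape(28, 28)
print(np.array_equal(arr, restored))
```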
7. Creating a pandas DataFrame (Optional)
image_df = pd.DataFrame(image_array)
Converts the NumPy array to a pandas DataFrame
This step might be useful if you need tabular manipulation of pixel
data
Not strictly necessary for most image processing pipelines
8. Visualizing the Images
plt.title("Original Image")
plt.imshow(image)
plt.show()
plt.title("Grayscale Resized Image (28x28)")
plt.imshow(grayscale_image, cmap="gray")
plt.show()
First block:
o Shows the original color image with a title
o Calls plt.imshow() without a cmap argument; a colormap is not needed because RGB images are drawn with their own colors
Second block:
o Shows the processed grayscale image
o Uses "gray" colormap for proper grayscale display
Both use plt.show() to render the figures
Expected Output
When you run this code as a script, you'll see two figures, one after the other (each plt.show() call waits until its window is closed):
1. The original image in full color at its original size
2. The processed version as a 28×28 grayscale image
Key Processing Steps
1. Resizing: Standardizes the image dimensions to match common ML datasets
2. Grayscale Conversion: Reduces color information to single-channel intensity
3. Array Conversion: Prepares the image for numerical processing
4. Visualization: Lets you compare the processed result with the original
Potential Use Cases
This preprocessing pipeline is particularly useful for:
Preparing custom images for MNIST-style digit classification
Creating input for neural networks that expect 28×28 grayscale
images
Image processing workflows that require standardized input sizes
Possible Improvements
1. Normalization: Scale pixel values to the range 0-1 by dividing by 255, if the model was trained on data scaled the same way
2. Inversion: MNIST digits are white on a black background, so a dark digit on light paper should be inverted
3. Error Handling: Wrap the file operations in try/except
4. Binarization: Optionally threshold the image to pure black and white
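The four improvements can be combined in one helper. The sketch below is one possible arrangement, not a definitive implementation: the function name `load_as_mnist_like`, its parameters and the 0.5 threshold are hypothetical, and the demo writes a generated white image to a temporary file instead of reading the original photo.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def load_as_mnist_like(path, invert=True, threshold=None):
    """Return a 28x28 float32 array in [0, 1] (hypothetical helper)."""
    try:
        with Image.open(path) as img:
            gray = img.convert("L").resize((28, 28))
    except OSError as exc:
        # Covers missing files and files Pillow cannot read as images.
        raise ValueError(f"could not load image {path!r}: {exc}") from exc

    pixels = np.asarray(gray, dtype=np.float32) / 255.0   # normalization
    if invert:
        pixels = 1.0 - pixels                             # light-on-dark, like MNIST
    if threshold is not None:
        pixels = (pixels >= threshold).astype(np.float32) # binarization
    return pixels


# Demo on a generated all-white image saved to a temporary file.
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.png")
Image.new("RGB", (60, 40), color=(255, 255, 255)).save(tmp_path)

result = load_as_mnist_like(tmp_path, threshold=0.5)
print(result.shape, result.min(), result.max())
```

If the classifier was trained on raw 0-255 pixel values, as in the PCA listing, the division by 255 should either be skipped or applied to the training data as well, so that both sides use the same scale.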