ML Lab Manual

The document gives an overview of Python programming for statistical analysis, focusing on central tendency measures (mean, median, mode) and measures of dispersion (variance, standard deviation) using the standard-library modules statistics and math and the external libraries NumPy and SciPy. It includes example code, explanations of key functions, and answers to common questions about these libraries and their applications in data science and machine learning. It also introduces the Pandas and Matplotlib libraries and two regression programs.

Program-1.

Write a Python program to compute measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation).

Ans:

Source Code:

import statistics

def compute_measures(data):
    # Central tendency
    mean_value = statistics.mean(data)
    median_value = statistics.median(data)
    mode_value = statistics.mode(data)
    # Dispersion (sample variance and sample standard deviation)
    variance_value = statistics.variance(data)
    std_deviation_value = statistics.stdev(data)

    print(f"Mean: {mean_value}")
    print(f"Median: {median_value}")
    print(f"Mode: {mode_value}")
    print(f"Variance: {variance_value}")
    print(f"Standard Deviation: {std_deviation_value}")

# Example dataset
data = [10, 20, 20, 30, 40, 50, 50, 50, 60, 70]
compute_measures(data)

OUTPUT:
Mean: 40
Median: 45.0
Mode: 50
Variance: 377.77777777777777
Standard Deviation: 19.436506316151
VIVA VOCE QUESTIONS & ANSWERS:

1. What are the central tendency measures, and why are they important?

Answer:
Central tendency measures indicate the center or typical value of a dataset. The three main measures are:

• Mean: The arithmetic average of the dataset.
• Median: The middle value when data is sorted.
• Mode: The most frequently occurring value.

These measures help summarize data and make comparisons across datasets.

2. What are variance and standard deviation, and how do they differ?

Answer:
Both variance and standard deviation measure how spread out the data is.

• Variance is the average squared deviation from the mean, so it is expressed in squared units.
• Standard Deviation is the square root of the variance, representing dispersion in the same unit as the data.

Standard deviation is reported more often because sharing the data's unit makes it easier to interpret.

3. How do you calculate the mean, median, and mode in Python?

Answer:
Python’s statistics module provides built-in functions:

import statistics

data = [10, 20, 30, 40, 50, 50]

mean_value = statistics.mean(data) # Computes Mean


median_value = statistics.median(data) # Computes Median
mode_value = statistics.mode(data) # Computes Mode

print(mean_value, median_value, mode_value)

4. What happens if there is more than one mode in the dataset?

Answer:
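The statistics.mode() function returns a single mode. In Python 3.8 and later, if several values tie for the highest count, it returns the first one encountered; in earlier versions it raised StatisticsError. To get all modes, use statistics.multimode() (added in Python 3.8).
Example:

import statistics

data = [10, 20, 20, 30, 30, 40]
modes = statistics.multimode(data) # Returns [20, 30]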

5. Why do we use (N-1) instead of N for sample variance?

Answer:
For sample variance, we divide by N−1 instead of N to correct for bias in estimating the population variance. This is called Bessel's correction, and it makes the sample variance an unbiased estimator of the population variance.

For population variance, we divide by N.
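The difference between the two denominators can be checked by hand. The sketch below is a minimal illustration using only the standard library; the dataset and variable names are chosen for illustration and are not part of the lab program.

```python
import statistics

data = [10, 20, 30, 40, 50]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
squared_devs = [(x - mean) ** 2 for x in data]

population_var = sum(squared_devs) / n        # divide by N
sample_var = sum(squared_devs) / (n - 1)      # divide by N - 1 (Bessel's correction)

# The standard library exposes both conventions
print("Population variance:", population_var, statistics.pvariance(data))
print("Sample variance:", sample_var, statistics.variance(data))
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.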

6. How do you calculate variance and standard deviation in Python?

Answer:
Using the statistics module:

import statistics

data = [10, 20, 30, 40, 50]


variance_value = statistics.variance(data) # Computes Variance
std_dev_value = statistics.stdev(data) # Computes Standard Deviation

print(variance_value, std_dev_value)

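variance() returns the sample variance (the sum of squared deviations from the mean divided by N−1), while stdev() returns its square root. For the population versions, use pvariance() and pstdev().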

7. How can you visualize central tendency and dispersion in Python?

Answer:
We can use Matplotlib or Seaborn to visualize data distributions.
Example:

import statistics
import matplotlib.pyplot as plt
import seaborn as sns

data = [10, 20, 30, 40, 50, 50, 60, 70]

sns.histplot(data, kde=True)
plt.axvline(statistics.mean(data), color='red', label="Mean")
plt.axvline(statistics.median(data), color='blue', linestyle="dashed", label="Median")
plt.legend()
plt.show()

This histogram shows the mean (red) and median (blue dashed) over the data distribution.
Program-2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy

Study of Python Basic Libraries: Statistics, Math, NumPy, and SciPy


Python provides several built-in and external libraries for mathematical and statistical computations. In this study, we explore four key libraries:

1. Statistics Module (statistics)
2. Math Module (math)
3. NumPy Library (numpy)
4. SciPy Library (scipy)

1. Statistics Module (statistics)

The statistics module is a part of Python’s standard library and provides functions for basic statistical
analysis.

Key Functions and Their Usage

Function       | Description                             | Example
---------------|-----------------------------------------|---------------------------------------
mean(data)     | Returns the arithmetic mean (average)   | statistics.mean([1, 2, 3]) → 2
median(data)   | Returns the middle value of sorted data | statistics.median([1, 2, 3, 4]) → 2.5
mode(data)     | Returns the most frequent value         | statistics.mode([1, 1, 2, 3]) → 1
variance(data) | Returns the sample variance             | statistics.variance([1, 2, 3]) → 1
stdev(data)    | Returns the sample standard deviation   | statistics.stdev([1, 2, 3]) → 1.0

Example Code:
import statistics as stats

data = [10, 20, 30, 40, 50, 50]

print("Mean:", stats.mean(data)) # 33.33


print("Median:", stats.median(data)) # 35.0
print("Mode:", stats.mode(data)) # 50
print("Variance:", stats.variance(data)) # 266.67
print("Standard Deviation:", stats.stdev(data)) # 16.33

2. Math Module (math)

The math module provides mathematical functions like logarithms, trigonometry, and factorials.

Key Functions and Their Usage


Function               | Description                                      | Example
-----------------------|--------------------------------------------------|-----------------------------------
sqrt(x)                | Returns the square root of x                     | math.sqrt(25) → 5.0
factorial(x)           | Returns x! (factorial of x)                      | math.factorial(5) → 120
log(x, base)           | Returns the logarithm of x to the given base     | math.log(8, 2) → 3.0
sin(x), cos(x), tan(x) | Trigonometric functions (x in radians)           | math.sin(math.radians(90)) → 1.0
gcd(a, b)              | Returns the greatest common divisor of a and b   | math.gcd(48, 18) → 6

Example Code:
import math

print("Square Root of 25:", math.sqrt(25)) # 5.0


print("Factorial of 5:", math.factorial(5)) # 120
print("Logarithm base 10 of 100:", math.log10(100)) # 2.0
print("Sine of 90 degrees:", math.sin(math.radians(90))) # 1.0
print("GCD of 48 and 18:", math.gcd(48, 18)) # 6

3. NumPy Library (numpy)

NumPy is a powerful library for numerical computations, especially for handling large arrays and matrices
efficiently.

Key Functions and Their Usage

Function              | Description                                                  | Example
----------------------|--------------------------------------------------------------|------------------------------------
np.array([elements])  | Creates an array                                             | np.array([1, 2, 3])
np.mean(arr)          | Computes the mean                                            | np.mean([1, 2, 3]) → 2.0
np.median(arr)        | Computes the median                                          | np.median([1, 2, 3]) → 2.0
np.std(arr)           | Computes the population standard deviation (ddof=0 default)  | np.std([1, 2, 3]) → 0.816
np.var(arr)           | Computes the population variance (ddof=0 default)            | np.var([1, 2, 3]) → 0.667
np.percentile(arr, q) | Computes the q-th percentile                                 | np.percentile([1, 2, 3], 50) → 2.0

Example Code:
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print("Mean:", np.mean(arr)) # 30.0


print("Median:", np.median(arr)) # 30.0
print("Standard Deviation:", np.std(arr)) # 14.14
print("Variance:", np.var(arr)) # 200.0
print("25th Percentile:", np.percentile(arr, 25)) # 20.0
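NumPy and the statistics module use different default denominators, which is why np.std() above gives 14.14 while statistics.stdev() on the same data gives a larger value. The following sketch, with illustrative variable names, shows how the ddof argument of np.std() switches between the two conventions.

```python
import statistics
import numpy as np

data = [10, 20, 30, 40, 50]
arr = np.array(data)

pop_std = np.std(arr)             # default ddof=0: divides by N
sample_std = np.std(arr, ddof=1)  # ddof=1: divides by N - 1

print("Population std:", round(float(pop_std), 2))
print("Sample std:", round(float(sample_std), 2))

# Each NumPy result matches the corresponding statistics function
print(np.isclose(pop_std, statistics.pstdev(data)))
print(np.isclose(sample_std, statistics.stdev(data)))
```

When comparing results across libraries, it helps to state explicitly which convention was used.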

4. SciPy Library (scipy)

SciPy builds on NumPy and provides additional scientific computing tools, including statistics, optimization,
and linear algebra.

Key Functions and Their Usage

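Function                 | Description                            | Example
-------------------------|----------------------------------------|---------------------------------------------------
stats.mode(data)         | Computes the mode                      | stats.mode([1, 2, 2, 3]) → mode 2, count 2
stats.iqr(data)          | Computes the interquartile range       | stats.iqr([1, 2, 3, 4, 5]) → 2.0
stats.zscore(data)       | Computes the Z-score of each value     | stats.zscore([10, 20, 30]) → approx. [-1.22, 0.0, 1.22]
scipy.linalg.det(matrix) | Computes the determinant of a matrix   | scipy.linalg.det([[1, 2], [3, 4]]) → approx. -2.0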

Example Code:
from scipy import stats
import numpy as np

data = np.array([10, 20, 30, 40, 50, 50])

print("Mode:", stats.mode(data)) # ModeResult with mode 50 and count 2
print("Interquartile Range:", stats.iqr(data)) # 25.0
print("Z-scores:", stats.zscore(data)) # approx. [-1.57, -0.89, -0.22, 0.45, 1.12, 1.12]
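The interquartile range and Z-scores can also be reproduced with plain NumPy, which makes the underlying formulas visible. This is a sketch of the definitions, not SciPy's internal code; it assumes the default population standard deviation (ddof=0), which is also the default of stats.zscore().

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 50])

# IQR = 75th percentile minus 25th percentile (linear interpolation)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Z-score = (value - mean) / standard deviation, with ddof=0
z_scores = (data - data.mean()) / data.std()

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Z-scores:", np.round(z_scores, 2))
```

By construction, the Z-scores have mean 0 and standard deviation 1.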

Summary Table
Library    | Purpose                                  | Example Functions
-----------|------------------------------------------|--------------------------------------
statistics | Basic statistical analysis               | mean(), median(), mode(), variance()
math       | Mathematical operations                  | sqrt(), factorial(), log(), sin()
numpy      | Numerical computations, array operations | mean(), std(), percentile()
scipy      | Advanced scientific computing            | stats.mode(), stats.iqr(), stats.zscore()

Each of these libraries plays a crucial role in data science, engineering, and scientific computing.

VIVA VOCE QUESTIONS AND ANSWERS

1. What is the difference between the math module and the statistics module in Python?

Answer:

• The math module provides basic mathematical functions such as logarithms, trigonometry, and power functions.
• The statistics module is specifically used for statistical computations like mean, median, mode, variance, and standard deviation.

Example:

import math
print(math.sqrt(16)) # 4.0

import statistics
print(statistics.mean([1, 2, 3, 4, 5])) # 3

2. What are the advantages of using NumPy over Python lists for numerical computations?

Answer:

• Speed: NumPy arrays are usually much faster than lists for numerical work because elements share one data type, are stored contiguously, and are processed by compiled loops.
• Memory Efficiency: NumPy arrays typically consume less memory than Python lists holding the same numbers.
• Vectorized Operations: NumPy performs operations on entire arrays without explicit Python loops.
• Built-in Functions: Supports advanced mathematical operations like linear algebra and Fourier transforms.

Example:

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr * 2) # Vectorized operation: [2 4 6 8]
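To make the contrast with lists concrete, the sketch below (with illustrative names) doubles the same values with a Python loop and with a vectorized NumPy expression, and also shows that the * operator means something different for a plain list.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Plain Python: an explicit loop (here a list comprehension) is required
doubled_loop = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element at once
doubled_vec = np.array(values) * 2

# Pitfall: multiplying a list repeats it instead of doing arithmetic
repeated = values * 2

print(doubled_loop)
print(doubled_vec)
print(repeated)
```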

3. How do you calculate the mean and standard deviation using NumPy?

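Answer:

• Use numpy.mean() for the mean and numpy.std() for the standard deviation. By default numpy.std() computes the population standard deviation (ddof=0); pass ddof=1 for the sample standard deviation.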

import numpy as np
data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data)) # 30.0
print("Standard Deviation:", np.std(data)) # 14.14

4. What is SciPy, and how does it differ from NumPy?

Answer:

• SciPy is built on top of NumPy and provides additional scientific computing functionalities like integration, optimization, and interpolation.
• NumPy focuses on efficient array operations, whereas SciPy extends NumPy to handle more advanced mathematical problems.

Example:

from scipy import linalg


import numpy as np
A = np.array([[1, 2], [3, 4]])
print(linalg.inv(A)) # Inverse of matrix A

5. How do you generate random numbers using NumPy?

Answer:

Use numpy.random module to generate random numbers.

import numpy as np
print(np.random.random()) # Random float between 0 and 1
print(np.random.randint(1, 10)) # Random integer between 1 and 9
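For reproducible experiments, NumPy also offers a Generator interface created with np.random.default_rng(). The sketch below is one possible usage; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

sample_float = rng.random()           # float in [0, 1)
sample_int = rng.integers(1, 10)      # integer from 1 to 9 (upper bound excluded)
sample_array = rng.normal(loc=0.0, scale=1.0, size=5)  # 5 draws from N(0, 1)

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(seed=42)
same_float = rng_again.random()

print(sample_float, sample_int)
print(sample_array)
print(sample_float == same_float)
```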

6. What is the difference between mode() in statistics and SciPy’s stats.mode()?

Answer:

• statistics.mode() works on any iterable of hashable values (numbers or strings) and returns the most common element.
• scipy.stats.mode() works on numeric arrays, can operate along an axis of a multidimensional array, and returns both the mode and its count.

Example:

import statistics
import scipy.stats as stats
data = [1, 2, 2, 3, 3, 3, 4]
print(statistics.mode(data)) # Output: 3

import numpy as np
arr = np.array([1, 2, 2, 3, 3, 3, 4])
print(stats.mode(arr)) # ModeResult with mode 3 and count 3 (exact formatting depends on the SciPy version)

7. How do you perform integration using SciPy?

Answer:

Use scipy.integrate.quad() to integrate a function.

from scipy.integrate import quad
import numpy as np

def f(x):
    return np.sin(x)

result, _ = quad(f, 0, np.pi) # Integrates sin(x) from 0 to π
print(result) # Output: 2.0 (up to floating-point error)

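To see what numerical integration does, the same integral can be approximated with the composite trapezoidal rule. This is a teaching sketch using only the standard library and a hypothetical helper named trapezoid; it is not the adaptive algorithm that quad() uses.

```python
import math

def trapezoid(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] using n equal-width trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))      # the two endpoints count half
    for i in range(1, n):
        total += f(a + i * h)        # interior points count fully
    return total * h

approx = trapezoid(math.sin, 0.0, math.pi)
print(approx)  # close to the exact value 2
```

Increasing n reduces the error, at the cost of more function evaluations.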
8. How can you compute the determinant of a matrix using NumPy and SciPy?

Answer:

Use numpy.linalg.det() or scipy.linalg.det().

import numpy as np
from scipy import linalg

A = np.array([[3, 2], [1, 4]])


print(np.linalg.det(A)) # Using NumPy
print(linalg.det(A)) # Using SciPy

9. How do you find the roots of a quadratic equation using SciPy?

Answer:

Use numpy.roots() or scipy.optimize to find the roots.

import numpy as np
coefficients = [1, -3, 2] # x² - 3x + 2 = 0
print(np.roots(coefficients)) # Output: [2. 1.]
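For a quadratic specifically, the roots can also be computed directly from the quadratic formula. The sketch below uses the standard-library cmath module so that complex roots are handled too; the function name quadratic_roots is hypothetical.

```python
import cmath

def quadratic_roots(a, b, c):
    """Return both roots of a*x^2 + b*x + c = 0 using the quadratic formula."""
    if a == 0:
        raise ValueError("Coefficient a must be non-zero for a quadratic equation")
    sqrt_disc = cmath.sqrt(b * b - 4 * a * c)   # complex square root of the discriminant
    return (-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a)

print(quadratic_roots(1, -3, 2))  # x^2 - 3x + 2 = 0 has roots 2 and 1
print(quadratic_roots(1, 0, 1))   # x^2 + 1 = 0 has roots i and -i
```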

10. How can you calculate a normal distribution PDF using SciPy?

Answer:

Use scipy.stats.norm.pdf() for the probability density function.

import scipy.stats as stats


print(stats.norm.pdf(0, loc=0, scale=1)) # PDF at x=0 for standard normal distribution
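The value returned by norm.pdf() follows from the normal density formula f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²)). The sketch below implements that formula with the math module and cross-checks it against statistics.NormalDist (Python 3.8+); the function name normal_pdf is hypothetical.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution with mean mu and std sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

print(normal_pdf(0))                    # about 0.3989 for the standard normal
print(NormalDist(mu=0, sigma=1).pdf(0)) # same value from the standard library
```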

These questions cover both conceptual and practical knowledge of Statistics, Math, NumPy, and SciPy in Python.
Program 3: Study of Python Libraries for ML application such as Pandas and Matplotlib

Study of Pandas - Python Library for Machine Learning Applications


Introduction

Pandas is a powerful and widely used Python library for data manipulation, preprocessing, and analysis in
Machine Learning (ML). It provides flexible data structures such as Series and DataFrame, which make
handling large datasets easier.

Why Use Pandas for ML?

✅ Efficient Data Handling: Supports large datasets.


✅ Data Cleaning & Preprocessing: Handles missing values, duplicates, and filtering.
✅ Integration: Works well with NumPy, Matplotlib, Scikit-learn.
✅ Feature Engineering: Aggregation, transformation, and encoding.
✅ Data Input/Output: Reads and writes from CSV, Excel, SQL, JSON.

Installing Pandas
pip install pandas

Pandas Data Structures

1. Series - One-Dimensional Data Structure

A Series is like a column in a table or an array with labeled indexes.

Example: Creating a Pandas Series


import pandas as pd

data = [10, 20, 30, 40]


series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)

Output:

A 10
B 20
C 30
D 40
dtype: int64

2. DataFrame - Two-Dimensional Data Structure

A DataFrame is like a spreadsheet with rows and columns.

Example: Creating a Pandas DataFrame


import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

Basic Operations in Pandas for ML

1. Reading and Writing Data

• Read CSV File

df = pd.read_csv('data.csv')

• Write DataFrame to CSV

df.to_csv('output.csv', index=False)

2. Data Selection and Filtering

• Select a Column

df['Name']

• Select Multiple Columns

df[['Name', 'Age']]

• Filter Rows Based on Condition

df[df['Age'] > 28]

3. Handling Missing Data

• Remove missing values

df.dropna()

• Fill missing values with a default value

df.fillna(0)
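Both dropna() and fillna() return a new DataFrame and leave the original unchanged unless the result is assigned. The sketch below uses a small made-up DataFrame (df_missing is an illustrative name) to show counting missing values, dropping them, and filling them with each column's mean.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with one missing value per column
df_missing = pd.DataFrame({
    "Age": [25, np.nan, 35],
    "Salary": [50000, 60000, np.nan],
})

missing_counts = df_missing.isna().sum()           # missing values per column
dropped = df_missing.dropna()                      # keeps only complete rows
filled = df_missing.fillna(df_missing.mean())      # fill with each column's mean

print(missing_counts)
print(dropped)
print(filled)
```

Filling with the mean is only one option; the right strategy depends on the data and the model.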
4. Grouping and Aggregation

• Group by Column and Compute Mean

df.groupby('Age')['Salary'].mean()
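Grouping is most informative when the key is a categorical column. The sketch below uses a made-up employees table with a hypothetical Department column to show a single aggregate and several aggregates at once.

```python
import pandas as pd

# Made-up data: the Department column is an assumption for illustration
employees = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [40000, 60000, 80000, 50000],
})

# One aggregate per group
mean_salary = employees.groupby("Department")["Salary"].mean()

# Several aggregates per group in one call
summary = employees.groupby("Department")["Salary"].agg(["mean", "max", "count"])

print(mean_salary)
print(summary)
```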

5. Merging and Joining DataFrames

• Merge Two DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})

merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
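When the two tables do not share exactly the same keys, the how argument of pd.merge() decides which rows survive. The sketch below uses made-up tables whose IDs only partly overlap to compare inner, left, and outer joins.

```python
import pandas as pd

names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
salaries = pd.DataFrame({"ID": [2, 3, 4], "Salary": [2000, 3000, 4000]})

inner = pd.merge(names, salaries, on="ID")               # default: only IDs in both tables
left = pd.merge(names, salaries, on="ID", how="left")    # all IDs from names; missing Salary is NaN
outer = pd.merge(names, salaries, on="ID", how="outer")  # all IDs from either table

print(inner)
print(left)
print(outer)
```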

Pandas in Machine Learning Applications

1. Exploratory Data Analysis (EDA)

EDA is usually the first step before preprocessing data for ML models.

• Histograms help understand the distribution of features.
• Scatter plots help reveal relationships between features.

Example:

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('sample_data.csv')

# Histogram of a feature
plt.hist(df['Feature1'], bins=20, color='blue', edgecolor='black')
plt.title("Feature Distribution")
plt.xlabel("Feature1")
plt.ylabel("Count")
plt.show()

2. Feature Engineering with Pandas

Feature engineering involves creating new features or modifying existing ones for ML models.

Example: Creating New Features


df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
print(df.head())
3. Encoding Categorical Data

ML models require numerical data, so categorical features must be encoded.

Example: One-Hot Encoding


df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
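A small, self-contained illustration of what one-hot encoding produces is sketched below. The people table and its Gender values are made up; with drop_first=True the first category (alphabetically) is dropped and becomes the implicit baseline.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["Female", "Male", "Male"],
})

# "Female" is dropped as the baseline; "Gender_Male" flags the other category
encoded = pd.get_dummies(people, columns=["Gender"], drop_first=True)

print(encoded)
```

Dropping one category avoids a redundant column, since its value is implied by the others.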

4. Normalization & Scaling

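Feature scaling puts features on a comparable range, which helps models that are sensitive to feature magnitude, such as distance-based and gradient-based methods.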

Example: Min-Max Scaling


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Salary']] = scaler.fit_transform(df[['Salary']])
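Min-max scaling maps each value to (x − min) / (max − min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula directly with Pandas on made-up salaries; it illustrates the formula rather than scikit-learn's implementation.

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 70000], name="Salary")

# (x - min) / (max - min): smallest value maps to 0, largest to 1
scaled = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(scaled.tolist())
```

In a real pipeline the minimum and maximum should come from the training set only and then be reused for the test set.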

5. Train-Test Split for ML Models

Splitting the dataset into training and testing sets is essential for ML models.

Example: Splitting Data


from sklearn.model_selection import train_test_split

X = df[['Age', 'Salary']]
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
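Conceptually, a train-test split shuffles the row indices and cuts them into two disjoint groups. The sketch below shows that idea with NumPy on ten made-up samples; it is a simplified illustration and not scikit-learn's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed for reproducibility
n_samples = 10
test_size = 0.2

indices = rng.permutation(n_samples)          # shuffled row indices 0..9
n_test = int(round(test_size * n_samples))    # 2 of the 10 rows go to the test set

test_idx = indices[:n_test]
train_idx = indices[n_test:]

print("Train rows:", sorted(train_idx.tolist()))
print("Test rows:", sorted(test_idx.tolist()))
```

Every row lands in exactly one of the two sets, so the model is evaluated on data it has not seen.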

Pandas Integration with Scikit-learn

Pandas is often used with Scikit-learn for building ML models.

Example: Training a Simple ML Model


from sklearn.linear_model import LogisticRegression

# Create Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
predictions = model.predict(X_test)

Conclusion

• Pandas is an essential library for data preprocessing, analysis, and visualization in ML applications.
• It helps in cleaning, transforming, and preparing data for ML models.
• It works seamlessly with Scikit-learn, Matplotlib, and NumPy to build ML pipelines.

Study of Matplotlib - Python Library for Machine Learning Applications
Introduction

Matplotlib is a powerful data visualization library in Python. It is widely used in Machine Learning (ML),
Data Science, and Exploratory Data Analysis (EDA) to create a variety of graphs, charts, and plots.

Key Features of Matplotlib

✔️ Supports various types of plots: Line, Bar, Scatter, Histogram, Pie, etc.
✔️ Highly customizable: Colors, labels, styles, grids, and legends.
✔️ Supports interactive and animated visualizations.
✔️ Works well with NumPy, Pandas, and Seaborn.
✔️ Can generate plots for Jupyter Notebooks and GUI applications.

Installing Matplotlib
pip install matplotlib

Basic Structure of Matplotlib

Matplotlib mainly consists of the following components:

• Figure: The entire plotting area.
• Axes: The individual plots inside the figure.
• Plot elements: Lines, bars, markers, etc.

Example of Basic Matplotlib Usage:

import matplotlib.pyplot as plt

# Creating Data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 50]

# Plotting the Data


plt.plot(x, y, marker='o', linestyle='--', color='b', label="Line Plot")

# Adding Labels and Title


plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Line Plot")
plt.legend()

# Display the Plot


plt.show()
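The same plot can be written with explicit Figure and Axes objects, which maps directly onto the components listed earlier. The sketch below is one way to do it; the Agg backend and the in-memory buffer are assumptions made so the example runs without a display, and in an interactive session plt.show() would be used instead.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 50]

fig, ax = plt.subplots(figsize=(6, 4))   # one Figure containing one Axes
line, = ax.plot(x, y, marker="o", linestyle="--", color="b", label="Line Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Basic Line Plot")
ax.legend()

# Render the figure to an in-memory PNG instead of opening a window
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Working with ax methods instead of plt functions becomes more useful once a figure holds several subplots.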
Types of Plots in Matplotlib

1. Line Plot (Trend Analysis)

Used for time series data and trends in Machine Learning.

import numpy as np

x = np.linspace(0, 10, 100)


y = np.sin(x)

plt.plot(x, y, color="r", linestyle="-")


plt.title("Sine Wave")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
plt.show()

2. Scatter Plot (Correlation Analysis)

Used in ML for visualizing relationships between two variables.

import numpy as np

# Generating random data


x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y, color='g', marker='o')


plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

3. Bar Chart (Category Comparison)

Used in classification problems to compare categories.

categories = ['A', 'B', 'C', 'D']


values = [10, 20, 30, 40]

plt.bar(categories, values, color=['red', 'blue', 'green', 'purple'])


plt.title("Bar Chart Example")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()

4. Histogram (Data Distribution)

Used for feature distribution analysis in ML.

data = np.random.randn(1000)

plt.hist(data, bins=30, color='blue', edgecolor='black')


plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

5. Pie Chart (Category Proportion)

Used in categorical data analysis.

labels = ['Class A', 'Class B', 'Class C', 'Class D']


sizes = [30, 20, 40, 10]
colors = ['gold', 'lightcoral', 'lightskyblue', 'lightgreen']

plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)


plt.title("Pie Chart Example")
plt.show()

6. Subplots (Multiple Plots in One Figure)

Used to visualize multiple variables in ML models.

fig, ax = plt.subplots(2, 2, figsize=(8, 6))

# Line Plot
ax[0, 0].plot(x, y, 'r')
ax[0, 0].set_title("Line Plot")

# Scatter Plot
ax[0, 1].scatter(x, y, color='g')
ax[0, 1].set_title("Scatter Plot")

# Bar Chart
ax[1, 0].bar(categories, values, color='b')
ax[1, 0].set_title("Bar Chart")

# Histogram
ax[1, 1].hist(data, bins=20, color='purple')
ax[1, 1].set_title("Histogram")

plt.tight_layout()
plt.show()

Matplotlib in Machine Learning Applications

1. Exploratory Data Analysis (EDA)

EDA is usually the first step before preprocessing data for ML models.

• Histograms help understand the distribution of features.
• Scatter plots help reveal relationships between features.

Example:

import pandas as pd
# Load dataset
df = pd.read_csv('sample_data.csv')

# Histogram of a feature
plt.hist(df['Feature1'], bins=20, color='blue', edgecolor='black')
plt.title("Feature Distribution")
plt.xlabel("Feature1")
plt.ylabel("Count")
plt.show()

2. Model Performance Visualization

Plotting loss and accuracy during ML model training.

epochs = [1, 2, 3, 4, 5]
train_loss = [0.9, 0.7, 0.5, 0.3, 0.2]
val_loss = [1.0, 0.8, 0.6, 0.4, 0.3]

plt.plot(epochs, train_loss, 'r', label="Train Loss")


plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.show()

Conclusion

• Matplotlib is an essential library for visualizing data in ML applications.
• It supports EDA and the visual evaluation of model performance.
• It works seamlessly with Pandas, NumPy, and Seaborn for better ML workflows.

Program 4: Write a Python program to implement Simple Linear Regression

Regression

Regression is a statistical technique used in machine learning to model and analyze the relationship between
dependent (target) and independent (predictor) variables. The primary objective of regression is to predict a
continuous outcome based on input features.

There are different types of regression techniques, but the most common one is Linear Regression.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) # Feature variable
Y = 4 + 3 * X + np.random.randn(100, 1) # Target variable with some noise

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, Y_train)

# Make predictions
Y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f"Coefficients: {model.coef_[0][0]}")
print(f"Intercept: {model.intercept_[0]}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

# Plot results
plt.scatter(X_test, Y_test, color='blue', label='Actual data')
plt.plot(X_test, Y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title("Simple Linear Regression")
plt.show()

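For a single feature, the least-squares line has a closed form: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The sketch below computes it with NumPy on synthetic data generated the same way as in the program (true slope 3, true intercept 4); the generator seed and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)                     # arbitrary seed
x = 2 * rng.random(100)                             # feature values in [0, 2)
y = 4 + 3 * x + rng.standard_normal(100)            # target with Gaussian noise

x_mean, y_mean = x.mean(), y.mean()

# Closed-form ordinary least squares for one feature
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print("Slope:", slope)
print("Intercept:", intercept)
```

The estimates should land near 3 and 4, differing only because of the added noise.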
Program 5: Implementation of Multiple Linear Regression for House Price Prediction using sklearn
Multiple Linear Regression:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
X1 = 2 * np.random.rand(100, 1) # First feature
X2 = 3 * np.random.rand(100, 1) # Second feature
Y = 4 + 3 * X1 + 2 * X2 + np.random.randn(100, 1) # Target variable with noise

# Combine features into a single matrix
X = np.hstack((X1, X2))

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create and train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, Y_train)

# Make predictions
Y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f"Coefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

# Plot actual vs predicted values
plt.scatter(Y_test, Y_pred, color='blue', label='Predicted vs Actual')

# Diagonal reference line: points on it are predicted perfectly
y_min, y_max = Y_test.min(), Y_test.max()
plt.plot([y_min, y_max], [y_min, y_max], color='red', linestyle='--', label='Perfect Fit')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.legend()
plt.title("Multiple Linear Regression - Actual vs Predicted")
plt.show()

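With several features, the least-squares coefficients can be obtained by solving a linear system on a design matrix that includes a column of ones for the intercept. The sketch below uses np.linalg.lstsq on synthetic data shaped like the program's (true intercept 4, coefficients 3 and 2); the seed and names are illustrative, and this is a sketch of the idea rather than scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(42)                           # arbitrary seed
X = np.column_stack([2 * rng.random(100), 3 * rng.random(100)])
y = 4 + 3 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(100)

# Design matrix: a leading column of ones models the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefficients = beta[0], beta[1:]

residuals = y - X_design @ beta

print("Intercept:", intercept)
print("Coefficients:", coefficients)
```

At the least-squares solution the residuals are orthogonal to every column of the design matrix.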
Study of Python Library for ML application such as Pandas



Pandas - Python Library for Data Manipulation and Analysis

Introduction to Pandas:

Pandas is an open-source Python library used for data manipulation, analysis, and preprocessing. It provides
fast, flexible, and powerful data structures like Series and DataFrame to work with structured data efficiently.
Pandas is widely used in data science, machine learning, financial analysis, and big data processing.

Key Features of Pandas

1. Data Structures:
o Series: A one-dimensional labeled array.
o DataFrame: A two-dimensional, tabular data structure (like a spreadsheet).
o Panel (Removed): Formerly used for three-dimensional data; it was deprecated and then removed from Pandas, so a DataFrame with a MultiIndex is used instead.
2. Data Handling:
o Read and write from CSV, Excel, JSON, SQL, HTML, and more.
o Load large datasets and perform fast operations.
3. Data Cleaning and Transformation:
o Handle missing values (dropna(), fillna()).
o Remove duplicates (drop_duplicates()).
o Convert data types (astype()).
4. Filtering and Indexing:
o Select rows and columns using labels (loc[]) or positions (iloc[]).
o Apply boolean conditions for filtering.
5. Aggregation and Grouping:
o Use groupby() for grouping data and computing aggregate statistics.
o Perform pivot table operations (pivot_table()).
6. Data Visualization:
o Built-in support for plotting graphs using Matplotlib (df.plot()).
7. Time Series Analysis:
o Handle and manipulate datetime objects.
o Perform resampling, shifting, and rolling window calculations.
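The time series operations named in the list above can be sketched on a small made-up daily series. The dates and values below are illustrative only.

```python
import pandas as pd

# Six consecutive days of made-up measurements
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 11, 15, 14, 18], index=dates)

shifted = ts.shift(1)                      # previous day's value (first entry is NaN)
rolling_mean = ts.rolling(window=3).mean() # 3-day moving average
two_day_sum = ts.resample("2D").sum()      # totals over 2-day bins

print(shifted)
print(rolling_mean)
print(two_day_sum)
```

Shifted and rolling values are common building blocks for lag features in ML on time series.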

Pandas Data Structures

1. Series - One-Dimensional Data Structure

A Series is similar to a list or an array, but with labels (index).

Example: Creating a Pandas Series


import pandas as pd

data = [10, 20, 30, 40]


series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)

Output:

a 10
b 20
c 30
d 40
dtype: int64

2. DataFrame - Two-Dimensional Data Structure

A DataFrame is a table-like structure with rows and columns, similar to an Excel spreadsheet.

Example: Creating a Pandas DataFrame


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

Basic Operations in Pandas

1. Reading and Writing Data

• Read CSV File

df = pd.read_csv('data.csv')

• Write DataFrame to CSV

df.to_csv('output.csv', index=False)

2. Data Selection and Filtering

• Select a Column

df['Name']

• Select Multiple Columns

df[['Name', 'Age']]

• Filter Rows Based on Condition

df[df['Age'] > 28]
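Label-based selection with loc[] and position-based selection with iloc[] can be compared on the same small table. The sketch below gives the rows string labels (an assumption for illustration) so the difference between labels and positions is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"],
     "Age": [25, 30, 35],
     "Salary": [50000, 60000, 70000]},
    index=["a", "b", "c"],   # row labels chosen for illustration
)

age_by_label = df.loc["b", "Age"]      # row label "b", column label "Age"
age_by_position = df.iloc[1, 1]        # second row, second column

# Boolean filter combined with column selection
older = df.loc[df["Age"] > 28, ["Name", "Salary"]]

print(age_by_label, age_by_position)
print(older)
```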

3. Handling Missing Data

• Remove missing values

df.dropna()

• Fill missing values with a default value

df.fillna(0)

4. Grouping and Aggregation

• Group by Column and Compute Mean

df.groupby('Age')['Salary'].mean()

5. Merging and Joining DataFrames

• Merge Two DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

Data Visualization in Pandas

Pandas provides built-in support for visualization using Matplotlib.

Example: Plotting a Line Chart


import matplotlib.pyplot as plt

df.plot(x='Age', y='Salary', kind='line')


plt.show()

Conclusion

• Pandas is a crucial library for data preprocessing, analysis, and visualization.
• It helps in handling structured data efficiently for machine learning and data science applications.
• Its integration with NumPy, Matplotlib, and Scikit-learn makes it a preferred choice for data analysis and ML pipelines.
