Program-1.
Write a Python program to compute central tendency measures (mean, median, mode) and measures of
dispersion (variance, standard deviation).
Source Code:
import statistics
def compute_measures(data):
    mean_value = statistics.mean(data)
    median_value = statistics.median(data)
    mode_value = statistics.mode(data)
    variance_value = statistics.variance(data)
    std_deviation_value = statistics.stdev(data)
    print(f"Mean: {mean_value}")
    print(f"Median: {median_value}")
    print(f"Mode: {mode_value}")
    print(f"Variance: {variance_value}")
    print(f"Standard Deviation: {std_deviation_value}")
# Example dataset
data = [10, 20, 20, 30, 40, 50, 50, 50, 60, 70]
compute_measures(data)
OUTPUT:
Mean: 40
Median: 45.0
Mode: 50
Variance: 377.77777777777777
Standard Deviation: 19.436506316151
VIVA VOCE QUESTIONS & ANSWERS:
1. What are the central tendency measures, and why are they important?
Answer:
Central tendency measures indicate the center or typical value of a dataset. The three main measures are:
Mean: The arithmetic average of the dataset.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value.
These measures help summarize data and make comparisons across datasets.
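The three measures can disagree noticeably when the data contains an outlier. The following is a small sketch, using made-up salary figures that are not part of the original exercise, showing that the mean is pulled by one extreme value while the median and mode barely move.

```python
import statistics

# Hypothetical salaries (in thousands); 500 is a deliberate outlier.
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [500]

mean_before = statistics.mean(salaries)           # 33.4
mean_after = statistics.mean(with_outlier)        # about 111.17
median_before = statistics.median(salaries)       # 32
median_after = statistics.median(with_outlier)    # 33.5
mode_before = statistics.mode(salaries)           # 32
mode_after = statistics.mode(with_outlier)        # 32

print(f"Mean:   {mean_before} -> {mean_after:.2f}")
print(f"Median: {median_before} -> {median_after}")
print(f"Mode:   {mode_before} -> {mode_after}")
```

This is why the median is often preferred for skewed data such as incomes.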
2. What are variance and standard deviation, and how do they differ?
Answer:
Both variance and standard deviation measure how spread out the data is.
Variance is the average squared deviation from the mean.
Standard Deviation is the square root of the variance, representing dispersion in the same unit as the
data.
Standard deviation is more commonly used because it is easier to interpret.
3. How do you calculate the mean, median, and mode in Python?
Answer:
Python’s statistics module provides built-in functions:
import statistics
data = [10, 20, 30, 40, 50, 50]
mean_value = statistics.mean(data) # Computes Mean
median_value = statistics.median(data) # Computes Median
mode_value = statistics.mode(data) # Computes Mode
print(mean_value, median_value, mode_value)
4. What happens if there is more than one mode in the dataset?
Answer:
The statistics.mode() function returns a single mode. If multiple modes exist, Python 3.8 and later return the
first one encountered in the data; earlier versions raised StatisticsError. Use statistics.multimode() to return
all modes.
Example:
data = [10, 20, 20, 30, 30, 40]
modes = statistics.multimode(data) # Returns [20, 30]
5. Why do we use (N-1) instead of N for sample variance?
Answer:
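For sample variance, we divide by N−1 instead of N to correct for bias in estimating the population variance.
This is called Bessel's correction. Because deviations are measured from the sample mean rather than the true
population mean, dividing by N would underestimate the spread on average; dividing by N−1 makes the
sample variance an unbiased estimator of the population variance.
For population variance, we divide by N.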
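The two divisors can be checked by hand. This is a minimal sketch on a small illustrative list, comparing the manual computation with statistics.pvariance() (divides by N) and statistics.variance() (divides by N−1).

```python
import statistics

data = [10, 20, 30, 40, 50]
n = len(data)
mean = sum(data) / n                                       # 30.0
squared_deviations = sum((x - mean) ** 2 for x in data)    # 1000.0

population_variance = squared_deviations / n        # divide by N     -> 200.0
sample_variance = squared_deviations / (n - 1)      # divide by N - 1 -> 250.0

# The standard library agrees with the hand computation.
assert population_variance == statistics.pvariance(data)
assert sample_variance == statistics.variance(data)

print(population_variance, sample_variance)   # 200.0 250.0
```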
6. How do you calculate variance and standard deviation in Python?
Answer:
Using the statistics module:
import statistics
data = [10, 20, 30, 40, 50]
variance_value = statistics.variance(data) # Computes Variance
std_dev_value = statistics.stdev(data) # Computes Standard Deviation
print(variance_value, std_dev_value)
variance() returns the sample variance (the sum of squared deviations from the mean divided by N−1), while
stdev() returns its square root.
7. How can you visualize central tendency and dispersion in Python?
Answer:
We can use Matplotlib or Seaborn to visualize data distributions.
Example:
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
data = [10, 20, 30, 40, 50, 50, 60, 70]
sns.histplot(data, kde=True)
plt.axvline(statistics.mean(data), color='red', label="Mean")
plt.axvline(statistics.median(data), color='blue', linestyle="dashed", label="Median")
plt.legend()
plt.show()
This histogram shows the mean (red) and median (blue dashed) over the data distribution.
Program-2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
Python provides several built-in and external libraries for mathematical and statistical computations. In this
detailed study, we will explore four key libraries:
1. Statistics Module (statistics)
2. Math Module (math)
3. NumPy Library (numpy)
4. SciPy Library (scipy)
1. Statistics Module (statistics)
The statistics module is a part of Python’s standard library and provides functions for basic statistical
analysis.
Key Functions and Their Usage
Function | Description | Example
mean(data) | Returns the arithmetic mean (average) | statistics.mean([1, 2, 3]) → 2
median(data) | Returns the middle value of sorted data | statistics.median([1, 2, 3, 4]) → 2.5
mode(data) | Returns the most frequent value | statistics.mode([1, 1, 2, 3]) → 1
variance(data) | Returns the sample variance | statistics.variance([1, 2, 3]) → 1
stdev(data) | Returns the sample standard deviation | statistics.stdev([1, 2, 3]) → 1.0
Example Code:
import statistics as stats
data = [10, 20, 30, 40, 50, 50]
print("Mean:", stats.mean(data)) # 33.33
print("Median:", stats.median(data)) # 35.0
print("Mode:", stats.mode(data)) # 50
print("Variance:", stats.variance(data)) # 266.67
print("Standard Deviation:", stats.stdev(data)) # 16.33
2. Math Module (math)
The math module provides mathematical functions like logarithms, trigonometry, and factorials.
Key Functions and Their Usage
Function | Description | Example
sqrt(x) | Returns the square root of x | math.sqrt(25) → 5.0
factorial(x) | Returns x! (factorial of x) | math.factorial(5) → 120
log(x, base) | Returns the logarithm of x to the given base | math.log(8, 2) → 3.0
sin(x), cos(x), tan(x) | Trigonometric functions (x in radians) | math.sin(math.radians(90)) → 1.0
gcd(a, b) | Returns the greatest common divisor of a and b | math.gcd(48, 18) → 6
Example Code:
import math
print("Square Root of 25:", math.sqrt(25)) # 5.0
print("Factorial of 5:", math.factorial(5)) # 120
print("Logarithm base 10 of 100:", math.log10(100)) # 2.0
print("Sine of 90 degrees:", math.sin(math.radians(90))) # 1.0
print("GCD of 48 and 18:", math.gcd(48, 18)) # 6
3. NumPy Library (numpy)
NumPy is a powerful library for numerical computations, especially for handling large arrays and matrices
efficiently.
Key Functions and Their Usage
Function | Description | Example
np.array([elements]) | Creates an array | np.array([1, 2, 3])
np.mean(arr) | Computes the mean | np.mean([1, 2, 3]) → 2.0
np.median(arr) | Computes the median | np.median([1, 2, 3]) → 2.0
np.std(arr) | Computes the standard deviation (population by default, ddof=0) | np.std([1, 2, 3]) → 0.816
np.var(arr) | Computes the variance (population by default, ddof=0) | np.var([1, 2, 3]) → 0.667
np.percentile(arr, q) | Computes the q-th percentile | np.percentile([1, 2, 3], 50) → 2.0
Example Code:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(arr)) # 30.0
print("Median:", np.median(arr)) # 30.0
print("Standard Deviation:", np.std(arr)) # 14.14
print("Variance:", np.var(arr)) # 200.0
print("25th Percentile:", np.percentile(arr, 25)) # 20.0
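Note that np.std() and np.var() divide by N by default, whereas statistics.stdev() and statistics.variance() divide by N−1. The sketch below, on the same five values, shows how the ddof argument switches NumPy between the two conventions.

```python
import math
import statistics
import numpy as np

data = [10, 20, 30, 40, 50]

population_std = np.std(data)          # ddof=0 (default): divides by N
sample_std = np.std(data, ddof=1)      # ddof=1: divides by N - 1

print(round(float(population_std), 2))   # 14.14
print(round(float(sample_std), 2))       # 15.81

# Each NumPy result matches the corresponding statistics function.
assert math.isclose(population_std, statistics.pstdev(data))
assert math.isclose(sample_std, statistics.stdev(data))
```

When comparing results between the two libraries, check which convention each call uses before concluding that one of them is wrong.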
4. SciPy Library (scipy)
SciPy builds on NumPy and provides additional scientific computing tools, including statistics, optimization,
and linear algebra.
Key Functions and Their Usage
Function | Description | Example
stats.mode(data) | Computes the mode | stats.mode([1, 2, 2, 3]) → Mode: 2
stats.iqr(data) | Computes the interquartile range | stats.iqr([1, 2, 3, 4, 5]) → 2.0
stats.zscore(data) | Computes the Z-score | stats.zscore([10, 20, 30])
scipy.linalg.det(matrix) | Computes the determinant of a matrix | det([[1,2],[3,4]])
Example Code:
from scipy import stats
import numpy as np
data = np.array([10, 20, 30, 40, 50, 50])
print("Mode:", stats.mode(data)) # ModeResult with mode 50 and count 2
print("Interquartile Range:", stats.iqr(data)) # 25.0
print("Z-scores:", stats.zscore(data)) # approx. [-1.57, -0.89, -0.22, 0.45, 1.12, 1.12]
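A z-score is (x − mean) / standard deviation. As a rough check that needs only NumPy, the sketch below computes z-scores by hand for the same data, using the population standard deviation (ddof=0), which is also the default for scipy.stats.zscore.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 50])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z_scores = (data - data.mean()) / data.std()

rounded = np.round(z_scores, 2)
print(rounded)   # approximately -1.57, -0.89, -0.22, 0.45, 1.12, 1.12

# Standardized data always has mean 0 and standard deviation 1.
print(np.isclose(z_scores.mean(), 0.0), np.isclose(z_scores.std(), 1.0))
```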
Summary Table
Library | Purpose | Example Functions
statistics | Basic statistical analysis | mean(), median(), mode(), variance()
math | Mathematical operations | sqrt(), factorial(), log(), sin()
numpy | Numerical computations, array operations | mean(), std(), percentile()
scipy | Advanced scientific computing | stats.mode(), stats.iqr(), zscore()
Each of these libraries plays a crucial role in data science, engineering, and scientific computing.
VIVA VOCE QUESTIONS AND ANSWERS
1. What is the difference between the math module and the statistics module in Python?
Answer:
The math module provides basic mathematical functions such as logarithms, trigonometry, and power
functions.
The statistics module is specifically used for statistical computations like mean, median, mode, variance,
and standard deviation.
Example:
import math
print(math.sqrt(16)) # 4.0
import statistics
print(statistics.mean([1, 2, 3, 4, 5])) # 3.0
2. What are the advantages of using NumPy over Python lists for numerical computations?
Answer:
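Speed: NumPy arrays are usually faster than lists for numerical work because elements of a single data type are
stored in a contiguous block and operations run in compiled code.
Memory Efficiency: NumPy arrays store raw values rather than separate Python objects, so they typically
consume less memory than equivalent Python lists.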
Vectorized Operations: NumPy performs operations on entire arrays without using loops.
Built-in Functions: Supports advanced mathematical operations like linear algebra and Fourier transforms.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr * 2) # Vectorized operation: [2 4 6 8]
3. How do you calculate the mean and standard deviation using NumPy?
Answer:
Use numpy.mean() for mean and numpy.std() for standard deviation.
import numpy as np
data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data)) # 30.0
print("Standard Deviation:", np.std(data)) # 14.14
4. What is SciPy, and how does it differ from NumPy?
Answer:
SciPy is built on top of NumPy and provides additional scientific computing functionalities like integration,
optimization, and interpolation.
NumPy focuses on efficient array operations, whereas SciPy extends NumPy to handle more advanced
mathematical problems.
Example:
from scipy import linalg
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(linalg.inv(A)) # Inverse of matrix A
5. How do you generate random numbers using NumPy?
Answer:
Use numpy.random module to generate random numbers.
import numpy as np
print(np.random.random()) # Random float between 0 and 1
print(np.random.randint(1, 10)) # Random integer between 1 and 9
6. What is the difference between mode() in statistics and SciPy’s stats.mode()?
Answer:
statistics.mode() works on any sequence of hashable values (numbers or strings) and returns only the most
common element.
scipy.stats.mode() works on NumPy arrays, can operate along an axis of a multidimensional array, and returns
both the mode and its count.
Example:
import statistics
import scipy.stats as stats
data = [1, 2, 2, 3, 3, 3, 4]
print(statistics.mode(data)) # Output: 3
import numpy as np
arr = np.array([1, 2, 2, 3, 3, 3, 4])
print(stats.mode(arr)) # ModeResult with mode 3 and count 3 (exact repr depends on the SciPy version)
7. How do you perform integration using SciPy?
Answer:
Use scipy.integrate.quad() to integrate a function.
from scipy.integrate import quad
import numpy as np
def f(x):
    return np.sin(x)
result, _ = quad(f, 0, np.pi) # Integrates sin(x) from 0 to π
print(result) # Output: 2.0
8. How can you compute the determinant of a matrix using NumPy and SciPy?
Answer:
Use numpy.linalg.det() or scipy.linalg.det().
import numpy as np
from scipy import linalg
A = np.array([[3, 2], [1, 4]])
print(np.linalg.det(A)) # Using NumPy
print(linalg.det(A)) # Using SciPy
9. How do you find the roots of a quadratic equation using SciPy?
Answer:
Use numpy.roots() or scipy.optimize to find the roots.
import numpy as np
coefficients = [1, -3, 2] # x² - 3x + 2 = 0
print(np.roots(coefficients)) # Output: [2. 1.]
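Since the question asks about SciPy, here is one possible sketch using scipy.optimize.brentq, a bracketing root finder. It needs an interval on which the function changes sign; the intervals below were picked by hand for this particular quadratic and are an assumption of the example.

```python
from scipy.optimize import brentq

def f(x):
    """Quadratic x^2 - 3x + 2, whose roots are 1 and 2."""
    return x**2 - 3*x + 2

# brentq requires f(a) and f(b) to have opposite signs.
root_low = brentq(f, 0.0, 1.5)    # f(0) = 2 > 0,      f(1.5) = -0.25 < 0
root_high = brentq(f, 1.5, 3.0)   # f(1.5) = -0.25 < 0, f(3) = 2 > 0

print(round(root_low, 6), round(root_high, 6))   # 1.0 2.0
```

For polynomials, np.roots() is simpler because it returns all roots at once; a bracketing method such as brentq is more useful for general nonlinear functions.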
10. How can you calculate a normal distribution PDF using SciPy?
Answer:
Use scipy.stats.norm.pdf() for the probability density function.
import scipy.stats as stats
print(stats.norm.pdf(0, loc=0, scale=1)) # PDF at x=0 for standard normal distribution
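The value returned by norm.pdf() follows from the normal density formula f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)). As a cross-check that uses only the standard library, the sketch below evaluates the formula directly; normal_pdf is an illustrative helper name, not a library function.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

value = normal_pdf(0)
print(round(value, 4))   # 0.3989

# The standard library's NormalDist gives the same density.
assert math.isclose(value, NormalDist(0, 1).pdf(0))
```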
Program 3: Study of Python Libraries for ML application such as Pandas and Matplotlib
Study of Pandas - Python Library for Machine Learning Applications
Introduction
Pandas is a powerful and widely used Python library for data manipulation, preprocessing, and analysis in
Machine Learning (ML). It provides flexible data structures such as Series and DataFrame, which make
handling large datasets easier.
Why Use Pandas for ML?
✅ Efficient Data Handling: Supports large datasets.
✅ Data Cleaning & Preprocessing: Handles missing values, duplicates, and filtering.
✅ Integration: Works well with NumPy, Matplotlib, Scikit-learn.
✅ Feature Engineering: Aggregation, transformation, and encoding.
✅ Data Input/Output: Reads and writes from CSV, Excel, SQL, JSON.
Installing Pandas
pip install pandas
Pandas Data Structures
1. Series - One-Dimensional Data Structure
A Series is like a column in a table or an array with labeled indexes.
Example: Creating a Pandas Series
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)
Output:
A 10
B 20
C 30
D 40
dtype: int64
2. DataFrame - Two-Dimensional Data Structure
A DataFrame is like a spreadsheet with rows and columns.
Example: Creating a Pandas DataFrame
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Basic Operations in Pandas for ML
1. Reading and Writing Data
Read CSV File
df = pd.read_csv('data.csv')
Write DataFrame to CSV
df.to_csv('output.csv', index=False)
2. Data Selection and Filtering
Select a Column
df['Name']
Select Multiple Columns
df[['Name', 'Age']]
Filter Rows Based on Condition
df[df['Age'] > 28]
3. Handling Missing Data
Remove missing values
df.dropna()
Fill missing values with a default value
df.fillna(0)
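Both dropna() and fillna() return a new DataFrame and leave the original unchanged unless the result is assigned back. The following sketch uses a small made-up table (the missing values are inserted for illustration) and fills each column with its own statistic instead of a single default.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Salary.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'Salary': [50000, 60000, np.nan],
})

print(df.isna().sum())      # number of missing values per column

dropped = df.dropna()       # keeps only rows with no missing values
filled = df.fillna({'Age': df['Age'].mean(),            # mean of 25 and 35
                    'Salary': df['Salary'].median()})   # median of 50000 and 60000

print(len(dropped))                 # 1 (only Alice's row is complete)
print(filled['Age'].tolist())       # [25.0, 30.0, 35.0]
print(filled['Salary'].tolist())    # [50000.0, 60000.0, 55000.0]
```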
4. Grouping and Aggregation
Group by Column and Compute Mean
df.groupby('Age')['Salary'].mean()
5. Merging and Joining DataFrames
Merge Two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
Pandas in Machine Learning Applications
1. Exploratory Data Analysis (EDA)
EDA is the first step in data preprocessing for ML models.
Histograms help understand the distribution of features.
Scatter plots help detect correlations.
Example:
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv('sample_data.csv')
# Histogram of a feature
plt.hist(df['Feature1'], bins=20, color='blue', edgecolor='black')
plt.title("Feature Distribution")
plt.xlabel("Feature1")
plt.ylabel("Count")
plt.show()
2. Feature Engineering with Pandas
Feature engineering involves creating new features or modifying existing ones for ML models.
Example: Creating New Features
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
print(df.head())
3. Encoding Categorical Data
ML models require numerical data, so categorical features must be encoded.
Example: One-Hot Encoding
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
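To see what the call produces, here is a self-contained sketch on a tiny made-up table. With drop_first=True the first category (in sorted order) is dropped, so a two-valued column becomes a single indicator column.

```python
import pandas as pd

# Hypothetical data; the 'Gender' column is assumed for illustration.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Gender': ['F', 'M', 'M']})

encoded = pd.get_dummies(df, columns=['Gender'], drop_first=True)

print(encoded.columns.tolist())                    # ['Name', 'Gender_M']
# The indicator dtype differs between pandas versions, so cast to int for display.
print(encoded['Gender_M'].astype(int).tolist())    # [0, 1, 1]
```

Dropping one category avoids a redundant column, since Gender_F can be inferred from Gender_M.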
4. Normalization & Scaling
Feature scaling puts features on a comparable range, which helps distance-based and gradient-based ML
models train and perform well.
Example: Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Salary']] = scaler.fit_transform(df[['Salary']])
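Min-max scaling applies x' = (x − min) / (max − min) to each column, which maps the values into the range 0 to 1 (the default range of MinMaxScaler). A minimal sketch of the same computation in plain Pandas, on the Salary values used earlier:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [50000, 60000, 70000]})

# x' = (x - min) / (max - min)
salary = df['Salary']
df['Salary_scaled'] = (salary - salary.min()) / (salary.max() - salary.min())

print(df['Salary_scaled'].tolist())   # [0.0, 0.5, 1.0]
```

In a real pipeline the minimum and maximum should be learned from the training set only and then reused on the test set, which is what fitting a scaler on the training data and calling transform() on the test data achieves.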
5. Train-Test Split for ML Models
Splitting the dataset into training and testing sets is essential for ML models.
Example: Splitting Data
from sklearn.model_selection import train_test_split
X = df[['Age', 'Salary']]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Pandas Integration with Scikit-learn
Pandas is often used with Scikit-learn for building ML models.
Example: Training a Simple ML Model
from sklearn.linear_model import LogisticRegression
# Create Model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predictions
predictions = model.predict(X_test)
Conclusion
Pandas is an essential library for data preprocessing, analysis, and visualization in ML applications.
It helps in cleaning, transforming, and preparing data for ML models.
Works seamlessly with Scikit-learn, Matplotlib, and NumPy to build ML pipelines.
Study of Matplotlib - Python Library for Machine Learning Applications
Introduction
Matplotlib is a powerful data visualization library in Python. It is widely used in Machine Learning (ML),
Data Science, and Exploratory Data Analysis (EDA) to create a variety of graphs, charts, and plots.
Key Features of Matplotlib
✔️ Supports various types of plots: Line, Bar, Scatter, Histogram, Pie, etc.
✔️ Highly customizable: Colors, labels, styles, grids, and legends.
✔️ Supports interactive and animated visualizations.
✔️ Works well with NumPy, Pandas, and Seaborn.
✔️ Can generate plots for Jupyter Notebooks and GUI applications.
Installing Matplotlib
pip install matplotlib
Basic Structure of Matplotlib
Matplotlib mainly consists of the following components:
Figure: The entire plotting area.
Axes: The individual plots inside the figure.
Plot elements: Lines, bars, markers, etc.
Example of Basic Matplotlib Usage:
import matplotlib.pyplot as plt
# Creating Data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 50]
# Plotting the Data
plt.plot(x, y, marker='o', linestyle='--', color='b', label="Line Plot")
# Adding Labels and Title
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Line Plot")
plt.legend()
# Display the Plot
plt.show()
Types of Plots in Matplotlib
1. Line Plot (Trend Analysis)
Used for time series data and trends in Machine Learning.
import numpy as np
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, color="r", linestyle="-")
plt.title("Sine Wave")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
plt.show()
2. Scatter Plot (Correlation Analysis)
Used in ML for visualizing relationships between two variables.
import numpy as np
# Generating random data
x = np.random.rand(50)
y = np.random.rand(50)
plt.scatter(x, y, color='g', marker='o')
plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
3. Bar Chart (Category Comparison)
Used in classification problems to compare categories.
categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]
plt.bar(categories, values, color=['red', 'blue', 'green', 'purple'])
plt.title("Bar Chart Example")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
4. Histogram (Data Distribution)
Used for feature distribution analysis in ML.
data = np.random.randn(1000)
plt.hist(data, bins=30, color='blue', edgecolor='black')
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
5. Pie Chart (Category Proportion)
Used in categorical data analysis.
labels = ['Class A', 'Class B', 'Class C', 'Class D']
sizes = [30, 20, 40, 10]
colors = ['gold', 'lightcoral', 'lightskyblue', 'lightgreen']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)
plt.title("Pie Chart Example")
plt.show()
6. Subplots (Multiple Plots in One Figure)
Used to visualize multiple variables in ML models.
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
# Line Plot
ax[0, 0].plot(x, y, 'r')
ax[0, 0].set_title("Line Plot")
# Scatter Plot
ax[0, 1].scatter(x, y, color='g')
ax[0, 1].set_title("Scatter Plot")
# Bar Chart
ax[1, 0].bar(categories, values, color='b')
ax[1, 0].set_title("Bar Chart")
# Histogram
ax[1, 1].hist(data, bins=20, color='purple')
ax[1, 1].set_title("Histogram")
plt.tight_layout()
plt.show()
Matplotlib in Machine Learning Applications
1. Exploratory Data Analysis (EDA)
EDA is the first step in data preprocessing for ML models.
Histograms help understand the distribution of features.
Scatter plots help detect correlations.
Example:
import pandas as pd
# Load dataset
df = pd.read_csv('sample_data.csv')
# Histogram of a feature
plt.hist(df['Feature1'], bins=20, color='blue', edgecolor='black')
plt.title("Feature Distribution")
plt.xlabel("Feature1")
plt.ylabel("Count")
plt.show()
2. Model Performance Visualization
Plotting loss and accuracy during ML model training.
epochs = [1, 2, 3, 4, 5]
train_loss = [0.9, 0.7, 0.5, 0.3, 0.2]
val_loss = [1.0, 0.8, 0.6, 0.4, 0.3]
plt.plot(epochs, train_loss, 'r', label="Train Loss")
plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.show()
Conclusion
Matplotlib is an essential library for visualizing data in ML applications.
It supports EDA and the visual evaluation of model performance.
Works seamlessly with Pandas, NumPy, and Seaborn for better ML workflows.
Program 4: Write a Python program to implement Simple Linear Regression
Regression
Regression is a statistical technique used in machine learning to model and analyze the relationship between
dependent (target) and independent (predictor) variables. The primary objective of regression is to predict a
continuous outcome based on input features.
There are different types of regression techniques, but the most common one is Linear Regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) # Feature variable
Y = 4 + 3 * X + np.random.randn(100, 1) # Target variable with some noise
# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, Y_train)
# Make predictions
Y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print(f"Coefficients: {model.coef_[0][0]}")
print(f"Intercept: {model.intercept_[0]}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
# Plot results
plt.scatter(X_test, Y_test, color='blue', label='Actual data')
plt.plot(X_test, Y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title("Simple Linear Regression")
plt.show()
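For a single feature, the least-squares line has a closed form: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope · x̄. The sketch below computes both with NumPy alone. It follows the same recipe as the program above (y = 4 + 3x + noise) but draws its own random numbers with np.random.default_rng, so the fitted values will differ slightly from the scikit-learn run.

```python
import numpy as np

# Synthetic data: y = 4 + 3x + noise (different random draws than the program above).
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.standard_normal(100)

# Closed-form ordinary least squares for one feature.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 solves the same least-squares problem.
slope_check, intercept_check = np.polyfit(x, y, 1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(np.isclose(slope, slope_check), np.isclose(intercept, intercept_check))
```

The estimates should land near the true values 3 and 4, with the gap caused by the added noise.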
Program 5: Implementation of Multiple Linear Regression for House Price Prediction using sklearn
Multiple Linear Regression:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate synthetic data
np.random.seed(42)
X1 = 2 * np.random.rand(100, 1) # First feature
X2 = 3 * np.random.rand(100, 1) # Second feature
Y = 4 + 3 * X1 + 2 * X2 + np.random.randn(100, 1) # Target variable with noise
# Combine features into a single matrix
X = np.hstack((X1, X2))
# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Create and train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, Y_train)
# Make predictions
Y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print(f"Coefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
# Plot actual vs predicted values
plt.scatter(Y_test, Y_pred, color='blue', label='Predicted vs Actual')
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], color='red', linestyle='--', label='Perfect Fit')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.legend()
plt.title("Multiple Linear Regression - Actual vs Predicted")
plt.show()
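With several features, the least-squares coefficients can be obtained by solving a linear system on a design matrix that includes a column of ones for the intercept. The sketch below uses np.linalg.lstsq on made-up, noise-free data (an assumption chosen so that the known coefficients 4, 3, and 2 are recovered exactly); it illustrates the idea and is not a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                  # two hypothetical features
y = 4 + 3 * X[:, 0] + 2 * X[:, 1]         # noise-free target for a clean check

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])
beta, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta, 6))   # approximately [4. 3. 2.]: intercept, then one weight per feature
```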
Study of Python Library for ML application such as Pandas
Pandas - Python Library for Data Manipulation and Analysis
Introduction to Pandas:
Pandas is an open-source Python library used for data manipulation, analysis, and preprocessing. It provides
fast, flexible, and powerful data structures like Series and DataFrame to work with structured data efficiently.
Pandas is widely used in data science, machine learning, financial analysis, and big data processing.
Key Features of Pandas
1. Data Structures:
   - Series: A one-dimensional labeled array.
   - DataFrame: A two-dimensional, tabular data structure (like a spreadsheet).
   - Panel: A three-dimensional structure that was deprecated and has been removed from current pandas
     releases; three-dimensional data is now usually handled with a MultiIndex DataFrame.
2. Data Handling:
   - Read and write from CSV, Excel, JSON, SQL, HTML, and more.
   - Load large datasets and perform fast operations.
3. Data Cleaning and Transformation:
   - Handle missing values (dropna(), fillna()).
   - Remove duplicates (drop_duplicates()).
   - Convert data types (astype()).
4. Filtering and Indexing:
   - Select rows and columns using labels (loc[]) or positions (iloc[]).
   - Apply boolean conditions for filtering.
5. Aggregation and Grouping:
   - Use groupby() for grouping data and computing aggregate statistics.
   - Perform pivot table operations (pivot_table()).
6. Data Visualization:
   - Built-in support for plotting graphs using Matplotlib (df.plot()).
7. Time Series Analysis:
   - Handle and manipulate datetime objects.
   - Perform resampling, shifting, and rolling window calculations.
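Several of the features listed above can be tried on one small table. The sketch below uses made-up employee data (the Dept column and the row labels are assumptions for illustration) to contrast loc[] with iloc[], apply a boolean filter, and aggregate with groupby() and pivot_table().

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Dept': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000],
}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['b', 'Salary']      # label-based: row 'b', column 'Salary'
by_position = df.iloc[1, 2]           # position-based: second row, third column
it_only = df[df['Dept'] == 'IT']      # boolean filter keeps Bob and Charlie

mean_by_dept = df.groupby('Dept')['Salary'].mean()
pivot = df.pivot_table(values='Salary', index='Dept', aggfunc='mean')

print(by_label, by_position)          # 60000 60000
print(it_only['Name'].tolist())       # ['Bob', 'Charlie']
print(mean_by_dept.to_dict())         # {'HR': 52500.0, 'IT': 65000.0}
```

Here groupby() and pivot_table() give the same per-department means; pivot_table() becomes more convenient when a second grouping key is spread across columns.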
Pandas Data Structures
1. Series - One-Dimensional Data Structure
A Series is similar to a list or an array, but with labels (index).
Example: Creating a Pandas Series
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
Output:
a 10
b 20
c 30
d 40
dtype: int64
2. DataFrame - Two-Dimensional Data Structure
A DataFrame is a table-like structure with rows and columns, similar to an Excel spreadsheet.
Example: Creating a Pandas DataFrame
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Basic Operations in Pandas
1. Reading and Writing Data
Read CSV File
df = pd.read_csv('data.csv')
Write DataFrame to CSV
df.to_csv('output.csv', index=False)
2. Data Selection and Filtering
Select a Column
df['Name']
Select Multiple Columns
df[['Name', 'Age']]
Filter Rows Based on Condition
df[df['Age'] > 28]
3. Handling Missing Data
Remove missing values
df.dropna()
Fill missing values with a default value
df.fillna(0)
4. Grouping and Aggregation
Group by Column and Compute Mean
df.groupby('Age')['Salary'].mean()
5. Merging and Joining DataFrames
Merge Two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [1000, 2000, 3000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
Data Visualization in Pandas
Pandas provides built-in support for visualization using Matplotlib.
Example: Plotting a Line Chart
import matplotlib.pyplot as plt
df.plot(x='Age', y='Salary', kind='line')
plt.show()
Conclusion
Pandas is a crucial library for data preprocessing, analysis, and visualization.
It helps in handling structured data efficiently for machine learning and data science applications.
Its integration with NumPy, Matplotlib, and Scikit-learn makes it a preferred choice for data analysis and ML
pipelines.