Data science
Vikas College Of Arts, Science and Commerce Page 1
INDEX
Sr No    Title    Date    Sign
1 Introduction to Excel
2 Data Frames and Basic Data Pre-processing
3 Feature Scaling and Dummification
4 Hypothesis Testing
5 ANOVA (Analysis of Variance)
6 Regression and Its Types
7 Logistic Regression and Decision Tree
8 K-Means Clustering
9 Principal Component Analysis (PCA)
10 Data Visualization and Storytelling
PRACTICAL 1
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.
Steps
Step 1: Select the data range, then go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value, for example 2000, and click OK.
Step 3: To add data bars, go to Conditional Formatting > Data Bars > Solid Fill.
B. Create a pivot table to analyse and summarize data.
Steps
Step 1: Select the entire table and go to the Insert tab > PivotChart > PivotChart.
Step 2: Select “New Worksheet” in the Create PivotChart window.
Step 3: Drag the required fields into the area boxes at the bottom of the field list.
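For comparison, the same kind of row/column/value summary can be sketched in Python with pandas. This is only an illustration and not part of the Excel steps; the sales table and its column names (Region, Product, Amount) are hypothetical.

```python
import pandas as pd

# Hypothetical sales records standing in for the worksheet table
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Product": ["Pen", "Book", "Pen", "Book", "Pen"],
    "Amount": [1200, 2500, 800, 3100, 600],
})

# Rows = Region, Columns = Product, Values = sum of Amount
summary = pd.pivot_table(
    sales,
    index="Region",
    columns="Product",
    values="Amount",
    aggfunc="sum",
    fill_value=0,
)
print(summary)
```

Here index, columns, and values play roughly the roles of the row, column, and value areas of the field list.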
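C. Use VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and type the following formula.
=VLOOKUP(B3, B3:D3, 1, TRUE)
The arguments are, in order: the lookup value, the table range (VLOOKUP searches its first column), the index of the column to return, and TRUE for an approximate match (use FALSE for an exact match). To look up values in another worksheet, prefix the range with the sheet name, for example Sheet2!B3:D3 (a hypothetical sheet name).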
D. Perform what-if analysis using Goal Seek to determine input values for desired output.
Steps
Step 1: On the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in the Set cell, To value, and By changing cell fields, then click OK.
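Goal Seek searches for the input value that makes a formula cell reach a desired result. A rough Python sketch of that idea follows, using bisection as a stand-in for Excel's own search (which may work differently). The names goal_seek and total_sales, and all the numbers, are hypothetical.

```python
def goal_seek(formula, target, low, high, tol=1e-9, max_iter=200):
    """Return x in [low, high] with formula(x) close to target (bisection)."""
    f_low = formula(low) - target
    f_high = formula(high) - target
    if f_low == 0:
        return low
    if f_high == 0:
        return high
    if f_low * f_high > 0:
        raise ValueError("target is not bracketed by [low, high]")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = formula(mid) - target
        if abs(f_mid) < tol:
            return mid
        # Keep the half-interval where the sign change (the solution) lies
        if f_low * f_mid < 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return (low + high) / 2


# Hypothetical worksheet formula: total = price * quantity * (1 - discount)
def total_sales(price, quantity=50, discount=0.10):
    return price * quantity * (1 - discount)


# "Set cell" = total_sales, "To value" = 9000, "By changing cell" = price
price_needed = goal_seek(total_sales, target=9000, low=0, high=1000)
print(round(price_needed, 2))
```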
PRACTICAL 2
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a data frame.
B. Perform basic data pre-processing tasks such as handling missing values and outliers.
Code:
import pandas as pd
# Reading CSV file into DataFrame
df = pd.read_csv("[Link]")
print("Our dataset:")
print(df)
# Reading JSON file into DataFrame
data = pd.read_json("[Link]")
print(data)
# Displaying the first 10 rows of the DataFrame
print(df.head(10))
# Filling missing values with 0
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)
# Dropping rows with any missing values
print("Dataset after dropping NA values:")
df.dropna(inplace=True)
print(df)
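The listing above covers missing values; outliers can be handled in a similar spirit. One common approach, sketched here on the assumption that the interquartile-range (IQR) rule suits the data, flags values lying more than 1.5 × IQR outside the quartiles. The age column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier
df = pd.DataFrame({"age": [22, 25, 27, 24, 26, 23, 95]})

# Quartiles and interquartile range
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: inspect or drop the rows outside the fences
outliers = df[(df["age"] < lower) | (df["age"] > upper)]
cleaned = df[df["age"].between(lower, upper)]

# Option 2: keep every row but cap extreme values at the fences
capped = df["age"].clip(lower, upper)

print(outliers)
print(cleaned)
```

Whether to drop or cap depends on whether the extreme values are errors or genuine observations.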
C. Manipulate and transform data using functions like filtering, sorting, and grouping.
Code:
import pandas as pd
# Reading CSV file into DataFrame
df = pd.read_csv("[Link]")
# Filtering data based on a condition (e.g., age greater than 25)
filtered_df = df[df["age"] > 25]
# Sorting data based on a column (e.g., sorting by age in descending order)
sorted_df = df.sort_values(by="age", ascending=False)
# Grouping data based on a column and applying an aggregation function (e.g., finding the average age per city)
grouped_df = df.groupby("city").agg({"age": "mean"})
# Displaying the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
# Displaying the sorted DataFrame
print("\nSorted DataFrame:")
print(sorted_df)
# Displaying the grouped DataFrame
print("\nGrouped DataFrame:")
print(grouped_df)
PRACTICAL 3
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and normalization to numerical
features.
Code:
# Standardization and normalization
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
print("Printing the first few rows")
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"D:\TYCS\Data Science\[Link]")
print(df.head())
print("Max values")
# Column-wise maximum absolute value
max_vals = np.max(np.abs(df), axis=0)
print(max_vals)
print((df - max_vals) / max_vals)
print("Normalization")
# Normalizer rescales each row (sample) to unit norm
scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
print("Standardization")
# StandardScaler rescales each column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
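Normalizer rescales each row to unit length. When "normalization" means rescaling each column to the range [0, 1], MinMaxScaler is the usual choice. The sketch below contrasts it with StandardScaler on a small table; the height and weight columns and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 90.0],
})

# Min-max normalization: (x - min) / (max - min), column by column
minmax_df = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: (x - mean) / std, column by column
standard_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(minmax_df)
print(standard_df)
```

After min-max scaling every column spans 0 to 1; after standardization every column has mean 0 and unit (population) standard deviation.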
B. Perform feature Dummification to convert categorical variables into numerical
representations.
Code:
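import pandas as pd
data = pd.read_csv("[Link]")
# Select the categorical (object-typed) columns
categorical_features = data.select_dtypes(include="object")
# One-hot encode them into dummy (indicator) columns
dummies = pd.get_dummies(categorical_features)
# Append the dummy columns and drop the original categorical columns
data = pd.concat([data, dummies], axis=1)
data.drop(columns=categorical_features.columns, inplace=True)
data.to_csv("[Link]")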
Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test)
# t-test
import numpy as np
import scipy.stats as stats
np.random.seed(42)
# Simulated exam scores for two independent groups
scoreA = np.random.normal(loc=70, scale=10, size=30)
scoreB = np.random.normal(loc=75, scale=10, size=30)
# Independent two-sample t-test
t_stat, pvalue = stats.ttest_ind(scoreA, scoreB)
print(f"T-Statistic: {t_stat}\nP-Value: {pvalue}")
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant difference in exam scores.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in exam scores.")
Output:
# Chi-square test
import numpy as np
import scipy.stats as stats
# Observed 2x2 contingency table (gender vs. job satisfaction)
observed_data = np.array([[25, 15], [20, 40]])
chi2, pvalue, dof, expected = stats.chi2_contingency(observed_data)
print(
    f"Chi-Square Statistic: {chi2}\nP-value: {pvalue}\n"
    f"Degrees of Freedom: {dof}\nExpected frequency:\n{expected}"
)
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant association between gender and job satisfaction.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between gender and job satisfaction.")
Output:
Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
from scipy.stats import f_oneway
# Define sample data for each group
group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]
# One-way ANOVA: tests whether all group means are equal
f_statistic, p_value = f_oneway(group1, group2, group3)
print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("P-value:", p_value)
alpha = 0.05
if p_value < alpha:
    print(
        "Reject null hypothesis: There are significant differences between the means of the groups."
    )
else:
    print(
        "Fail to reject null hypothesis: There are no significant differences between the means of the groups."
    )
Output:-
Practical 6
Regression and its Types.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Independent variable (predictor)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
# Dependent variable (response)
y = np.array([[7], [9], [11], [13], [15], [17], [19], [21], [23], [25]])
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Simple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)  # Fitting the model
# Coefficients
print("Intercept:", model.intercept_[0])
print("Coefficient:", model.coef_[0][0])
# Predictions
y_pred = model.predict(X_test)
# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plotting the regression line
plt.scatter(X_test, y_test, color="blue")
plt.plot(X_test, y_pred, color="red")
plt.title("Simple Linear Regression")
plt.xlabel("Independent Variable (X)")
plt.ylabel("Dependent Variable (y)")
plt.show()
Output:
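Simple linear regression is only one type of regression. As a hedged sketch of another, polynomial regression expands the predictor into powers and then fits an ordinary linear model to them. The data below are hypothetical (generated from y = 2x^2 + 3), and the pipeline is one possible way to set this up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data that follow y = 2x^2 + 3 exactly
X = np.arange(1, 11).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 3

# Straight-line fit for comparison
linear_model = LinearRegression().fit(X, y)

# Polynomial regression: expand X to [x, x^2], then fit a linear model
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X, y)

print("Linear R-squared:", linear_model.score(X, y))
print("Polynomial R-squared:", poly_model.score(X, y))
print("Prediction at x = 11:", poly_model.predict([[11]])[0])
```

On curved data like this the degree-2 model should score a higher R-squared than the straight line, because the added x^2 term matches the shape of the data.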
Practical 7
Logistic Regression and Decision Tree
# This listing clusters synthetic data with K-Means and selects k by silhouette score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)
# Determine the optimal number of clusters using the silhouette score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
    score = silhouette_score(X, kmeans.labels_)
    silhouette_scores.append(score)
# Plot the silhouette scores
plt.plot(range(2, 11), silhouette_scores, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Optimal Number of Clusters")
plt.show()
# Choose the optimal number of clusters based on the silhouette score
# (index 0 corresponds to k = 2, hence the + 2)
optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
# Apply K-Means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=0).fit(X)
# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=50, alpha=0.7)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    s=200,
    c="red",
    marker="X",
    label="Centroids",
)
plt.title("K-Means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
# Analyze the cluster characteristics
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")
Output:
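The title of this practical names logistic regression and a decision tree. A minimal sketch of those two classifiers follows, assuming the Iris dataset bundled with scikit-learn as example data; the split ratio, max_iter, and max_depth values are illustrative choices rather than prescribed settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the example data and hold out 30% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Logistic regression classifier
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Decision tree classifier, kept shallow to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Compare accuracy on the held-out test set
log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))
tree_acc = accuracy_score(y_test, tree.predict(X_test))
print("Logistic Regression accuracy:", log_reg_acc)
print("Decision Tree accuracy:", tree_acc)
```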
Practical 8
K-Means clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv("[Link]")
# Display the first few rows of the dataset
print(data.head())
# Define categorical and continuous features
categorical_features = ["Channel", "Region"]
continuous_features = [
    "Fresh",
    "Milk",
    "Grocery",
    "Frozen",
    "Detergents_Paper",
    "Delicassen",
]
# Descriptive statistics for continuous features
print(data[continuous_features].describe())
# Convert categorical features into dummy variables
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
# Display the first few rows of the updated dataset
print(data.head())
# Normalize the data to the range [0, 1]
mms = MinMaxScaler()
data_transformed = mms.fit_transform(data)
# Calculate the sum of squared distances for different values of k
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    # random_state fixed so the elbow curve is reproducible
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)
# Plot the elbow method graph
plt.plot(K, sum_of_squared_distances, "bx-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Sum of Squared Distances")
plt.title("Elbow Method for Optimal k")
plt.show()
Output:
Practical 9
Principal Component Analysis (PCA)
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Perform PCA
pca = PCA(n_components=2)  # Specify the number of components (dimensions)
X_r = pca.fit_transform(X)
# Create a DataFrame for visualization
df = pd.DataFrame(data=X_r, columns=['PC1', 'PC2'])
df['target'] = y
# Plot the data
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(
        df.loc[df['target'] == i, 'PC1'],
        df.loc[df['target'] == i, 'PC2'],
        color=color, alpha=.8, lw=lw, label=target_name,
    )
plt.title('PCA of IRIS dataset')
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Output:
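A two-component projection is only useful if those components retain most of the variance. As a small sketch on the same Iris data, the fitted PCA object exposes explained_variance_ratio_, which reports the share of total variance captured by each component.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Fit a two-component PCA on the unscaled Iris features
X = load_iris().data
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print("Explained variance ratio:", ratios)
print("Total variance retained:", ratios.sum())
```

For this dataset the first component carries the large majority of the variance, which is why the 2-D scatter plot separates the species reasonably well.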
Practical 10
Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
# Assume '[Link]' contains your dataset
df = pd.read_csv("[Link]")
# Perform data analysis
# Example: Calculate summary statistics
summary_stats = df.describe()
# Create meaningful visualizations
# Example: Plot a histogram of a numerical variable
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x="numerical_variable", bins=20, kde=True)
plt.title("Histogram of Numerical Variable")
plt.xlabel("Numerical Variable")
plt.ylabel("Frequency")
plt.show()
# Example: Plot a bar chart of a categorical variable
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x="categorical_variable", palette="viridis")
plt.title("Bar Chart of Categorical Variable")
plt.xlabel("Categories")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
# Present findings and insights in a clear and concise manner
# Example: Use Markdown to format text for presentation
print("# Data Analysis and Visualization Report\n")
print("## Summary Statistics:\n")
print(summary_stats)
print("\n## Insights:\n")
print(
    "- The histogram shows that the distribution of the numerical variable is approximately normal."
)
print(
    "- The bar chart indicates that category A is the most frequent in the categorical variable."
)
print(
    "- The scatterplot suggests a positive correlation between numerical variables 1 and 2, "
    "with different categories showing distinct patterns.\n"
)
Output: