Big Data Analysis

EX.NO:01
INSTALLATION OF HADOOP ON UBUNTU
DATE:21.07.25

AIM:

To Install Hadoop in pseudo-distributed mode on Ubuntu.

STEPS:

1. Prerequisites
 Java Development Kit (JDK): Hadoop requires Java to run. Ensure you have a compatible JDK installed (e.g., OpenJDK 8 or 11); a sample install command is given at the end of this step. You can check your Java version with java -version.
 OpenSSH: Used for secure shell access, crucial for Hadoop's inter-process communication. Install it
with:
bash
sudo apt-get install openssh-server openssh-client
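
If a compatible JDK is not already present, it can be installed from the Ubuntu repositories, for example (assuming OpenJDK 8; substitute openjdk-11-jdk if you prefer Java 11):
bash
sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version   # confirm the installation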

2. Set up the Hadoop user and passwordless SSH


 Create a new user (recommended): It's best to create a dedicated user and group for Hadoop
installations.
bash
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo usermod -aG sudo hduser
 Configure passwordless SSH: This allows Hadoop daemons to communicate with each other without
constantly prompting for passwords.
bash
su - hduser
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost # Verify passwordless login (type 'yes' if prompted)
exit # Exit the hduser session

3. Download and install Hadoop


 Download Hadoop: Choose a stable Hadoop release from the Apache Hadoop project page. You can
use wget to download the tarball (e.g., for Hadoop 3.4).
bash
cd /usr/local
sudo wget [Hadoop download URL]
 Extract and configure: Extract the downloaded file and make necessary ownership changes.
bash
sudo tar -xzvf hadoop-3.x.x.tar.gz
sudo mv hadoop-3.x.x hadoop
sudo chown -R hduser:hadoop hadoop
4. Configure Hadoop environment variables
 Edit ~/.bashrc: Add the following lines to your ~/.bashrc file (replace /usr/lib/jvm/java-8-openjdk-
amd64 with your actual JAVA_HOME path) and then source the file.
bash
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Adjust based on your Java version
bash
source ~/.bashrc
 Edit hadoop-env.sh: Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set
the JAVA_HOME variable by uncommenting the line and providing the path to your JDK installation.
bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Adjust based on your Java version

5. Configure Hadoop files


Edit the following XML files located in $HADOOP_HOME/etc/hadoop:
 core-site.xml: Define the NameNode URL and a temporary directory for MapReduce processing.
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
Remember to create the /app/hadoop/tmp directory and set ownership for the hduser (see the example command after the hdfs-site.xml settings below).
 hdfs-site.xml: Configure DataNode directory and replication factor.
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/app/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/app/hadoop/hdfs/datanode</value>
</property>
</configuration>
Create the specified namenode and datanode directories and assign ownership to hduser.
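
The directories referenced in core-site.xml and hdfs-site.xml above do not exist by default; one way to create them and hand them over to hduser (run as a sudo-capable user) is:
bash
sudo mkdir -p /app/hadoop/tmp /app/hadoop/hdfs/namenode /app/hadoop/hdfs/datanode
sudo chown -R hduser:hadoop /app/hadoop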
 mapred-site.xml: Configure MapReduce to use YARN. (If this file doesn't exist, rename mapred-
site.xml.template to mapred-site.xml and then edit it).
xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
 yarn-site.xml: Configure ResourceManager and NodeManager.
xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>

6. Format NameNode and start Hadoop daemons


 Format the HDFS NameNode: This initializes the file system. Only do this once during the initial
setup.
bash
hdfs namenode -format
 Start HDFS and YARN daemons:
bash
start-dfs.sh
start-yarn.sh

7. Verify installation
 Use jps to check if all Hadoop daemons are running (NameNode, DataNode, SecondaryNameNode,
ResourceManager, NodeManager).
 Access the Hadoop ResourceManager web interface in your browser by visiting http://localhost:8088.
You should be able to monitor the cluster and view nodes.
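As an additional check, you can exercise HDFS itself as hduser; for example (the target path below is arbitrary):
bash
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hduser
hdfs dfs -ls /user/hduser
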
Important notes
 The pseudo-distributed mode is suitable for learning, testing, and development.
 For large-scale data processing in production environments, a fully-distributed Hadoop cluster with
multiple nodes is necessary.
 Remember to stop the Hadoop daemons using stop-dfs.sh and stop-yarn.sh before shutting down
your system to avoid potential errors on subsequent startups. You can also use stop-all.sh to stop both
HDFS and YARN daemons.

RESULT:

Installation of Hadoop in pseudo-distributed mode is done successfully.


EX.NO:02
BASIC MAPREDUCE PROGRAM
DATE:21.07.25

AIM:
To implement MapReduce programs for word count and matrix multiplication.

WORD COUNT PROGRAM:

MAPPER.PY:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for i in words:
        print(f"{i}\t1")

REDUCER.PY:

#!/usr/bin/env python3
import sys

cword=None
ccount=0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)

    if word == cword:
        ccount += count
    else:
        if cword:
            print(f"{cword}\t{ccount}")
        cword = word
        ccount = count

if cword:
    print(f"{cword}\t{ccount}")

OUTPUT:
MATRIX MULTIPLICATION PROGRAM:

MAPPER.PY:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    parts = line.strip().split(",")
    if len(parts) != 4:
        continue
    matrix, i, j, value = parts
    i, j, value = int(i), int(j), int(value)

    if matrix == "A":
        for col in range(2):
            print(f"{i},{col}\tA,{j},{value}")
    elif matrix == "B":
        for row in range(2):
            print(f"{row},{j}\tB,{i},{value}")

REDUCER.PY:

#!/usr/bin/env python3
import sys
from collections import defaultdict

current_key = None
a_values = defaultdict(int)
b_values = defaultdict(int)

for line in sys.stdin:
    key, value = line.strip().split("\t")
    i, j = map(int, key.strip().split(","))
    tag, k, val = value.strip().split(",")
    k = int(k)
    val = int(val)

    if current_key != (i, j):
        if current_key:
            total = 0
            for index in a_values:
                total += a_values[index] * b_values.get(index, 0)
            print(f"{current_key[0]},{current_key[1]}\t{total}")

        current_key = (i, j)
        a_values = defaultdict(int)
        b_values = defaultdict(int)

    if tag == "A":
        a_values[k] = val
    elif tag == "B":
        b_values[k] = val

if current_key:
    total = 0
    for index in a_values:
        total += a_values[index] * b_values.get(index, 0)
    print(f"{current_key[0]},{current_key[1]}\t{total}")

OUTPUT:

RESULT:

MapReduce programs for word count and matrix multiplication are implemented successfully.
EX.NO:03
STATISTICAL METHODS FOR LARGE DATASET
DATE:29.07.25

AIM:

To compute descriptive statistics – mean, median, mode, standard deviation on a large dataset.

DATASET USED: IRIS DATASET

CODE:
MAPPER.PY:
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
features = data['data']
feature_names = data['feature_names']

for col_index, col_name in enumerate(feature_names):
    for row in features:
        value = float(row[col_index])
        key = col_name.replace(" ", "_")
        print(f"{key}\t1\t{value}\t{value**2}\t{value}\t{value}")

REDUCER.PY:
import sys
import math
from collections import defaultdict

stats = defaultdict(lambda: {
    "count": 0,
    "sum": 0.0,
    "sum_sq": 0.0,
    "min": float('inf'),
    "max": float('-inf')
})

for line in sys.stdin:
    try:
        key, count, sum_val, sum_sq, min_v, max_v = line.strip().split('\t')
        count = int(count)
        sum_val = float(sum_val)
        sum_sq = float(sum_sq)
        min_v = float(min_v)
        max_v = float(max_v)

        stats[key]["count"] += count
        stats[key]["sum"] += sum_val
        stats[key]["sum_sq"] += sum_sq
        stats[key]["min"] = min(stats[key]["min"], min_v)
        stats[key]["max"] = max(stats[key]["max"], max_v)
    except:
        continue

for key, stat in stats.items():
    count = stat["count"]
    total_sum = stat["sum"]
    sum_sq = stat["sum_sq"]
    min_val = stat["min"]
    max_val = stat["max"]

    mean = total_sum / count
    variance = (sum_sq / count) - (mean ** 2)
    std_dev = math.sqrt(variance)
    range_val = max_val - min_val

    print(f"\n--- {key.replace('_', ' ').title()} ---")
    print(f"Count:\t{count}")
    print(f"Sum:\t{total_sum:.2f}")
    print(f"Mean:\t{mean:.2f}")
    print(f"Min:\t{min_val:.2f}")
    print(f"Max:\t{max_val:.2f}")
    print(f"Range:\t{range_val:.2f}")
    print(f"Variance:\t{variance:.2f}")
    print(f"Standard Deviation:\t{std_dev:.2f}")

OUTPUT:

RESULT:

MapReduce program for computing descriptive statistics (mean, median, mode, standard deviation) is
executed successfully.
EX.NO:04
VISUALIZATION OF LARGE DATASET
DATE:29.07.25

AIM:

To plot visualizations [Histogram and Box plot] for a large dataset.

DATASET USED: IPL DATASET

CODE:

MAPPER.PY:

import sys
import csv

reader = csv.reader(sys.stdin)
next(reader) # Skip header

for row in reader:
    try:
        print(f"4s\t{int(row[4])}")
        print(f"6s\t{int(row[5])}")
    except:
        continue

REDUCER.PY:

import sys

data = {"4s": [], "6s": []}

for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key in data:
        data[key].append(int(value))

for key in data:
    print(f"{key}\t{','.join(map(str, data[key]))}")

VISUALIZE.PY:

import pandas as pd
import matplotlib.pyplot as plt

# Read data from files with headers


fours = pd.read_csv("fours.txt")["4s"].astype(float)
sixes = pd.read_csv("sixes.txt")["6s"].astype(float)
# Calculate IQR for 4s
iqr_fours = fours.quantile(0.75) - fours.quantile(0.25)
# Calculate IQR for 6s
iqr_sixes = sixes.quantile(0.75) - sixes.quantile(0.25)

# Print IQR values


print(f"IQR for 4s: {iqr_fours}")
print(f"IQR for 6s: {iqr_sixes}")

# Boxplot
plt.figure(figsize=(10, 5))
plt.boxplot([fours, sixes], labels=['4s', '6s'])
plt.title("Boxplot of 4s and 6s")
plt.xlabel("Shots")
plt.ylabel("Count")
plt.grid(True)
plt.savefig("boxplot.png")
plt.show()
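
The aim also calls for a histogram; one way to add it alongside the boxplot in VISUALIZE.PY is:

# Histogram of 4s and 6s
plt.figure(figsize=(10, 5))
plt.hist([fours, sixes], bins=15, label=['4s', '6s'], edgecolor='black')
plt.title("Histogram of 4s and 6s")
plt.xlabel("Count per record")
plt.ylabel("Frequency")
plt.legend()
plt.grid(True)
plt.savefig("histogram.png")
plt.show()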

OUTPUT:
RESULT:

MapReduce program for visualizing the histogram and box plot of a large dataset is executed successfully.
EX.NO:05
CORRELATION ANALYSIS OF LARGE DATASET
DATE:05.08.25

AIM:

To plot a correlation matrix for a large multivariate dataset.

DATASET USED: Wine Dataset

CODE:
MAPPER.PY:

from sklearn.datasets import load_wine


wine = load_wine()
features = wine.data
feature_names = wine.feature_names

for i, feature in enumerate(feature_names):
    for row in features:
        print(f"{feature}\t{row[i]}")

REDUCER.PY:

import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from collections import defaultdict

data = defaultdict(list)

for line in sys.stdin:
    try:
        key, value = line.strip().split('\t')
        data[key].append(float(value))
    except:
        continue

df = pd.DataFrame(dict(data))

corr_df = df.corr()

plt.figure(figsize=(8,6))
sb.heatmap(corr_df, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
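
Because the mapper generates its records directly from scikit-learn's wine loader, the whole job can be exercised locally with a single pipe (a display is needed for plt.show(); alternatively save the figure with plt.savefig):

python3 mapper.py | python3 reducer.py
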
OUTPUT:

RESULT:

MapReduce program for plotting a correlation matrix for a large multivariate dataset is executed
successfully.
EX.NO:06
CLUSTERING ANALYSIS OF LARGE DATASET
DATE:05.08.25

AIM:

To perform clustering analysis on a large multivariate dataset.

DATASET USED: Wine Dataset

CODE:

MAPPER.PY:

from sklearn.datasets import load_wine

wine = load_wine()
features = wine.data
feature_names = wine.feature_names

for i, feature in enumerate(feature_names):
    for row in features:
        print(f"{feature}\t{row[i]}")

REDUCER.PY:

import sys
import pandas as pd
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

data = defaultdict(list)

for line in sys.stdin:
    try:
        key, value = line.strip().split('\t')
        data[key].append(float(value))
    except:
        continue

df = pd.DataFrame(dict(data))

k=3
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(df)

for idx, cluster_id in enumerate(clusters):
    print(f"Sample_{idx}\tCluster_{cluster_id}")

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=clusters, cmap='viridis', s=50)
plt.title("KMeans Clusters of Wine Dataset (PCA-reduced)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()
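
Here k is fixed at 3, matching the three cultivars in the wine data. If you want to check that choice, a silhouette score can be computed for a few candidate values of k by appending something like the following to REDUCER.PY (df and KMeans are already in scope there):

from sklearn.metrics import silhouette_score

# Compare cluster quality for a few candidate values of k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(df)
    print(f"k={k}: silhouette = {silhouette_score(df, labels):.3f}")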

OUTPUT:
RESULT:

MapReduce program for performing clustering analysis on a large multivariate dataset is executed
successfully.
EX.NO:07
CLASSIFICATION ANALYSIS OF LARGE DATASET
DATE:12.08.25

AIM:
To perform classification of a large multi-variate dataset into two or more classes.

DATASET USED: Breast_cancer Dataset

CODE:
MAPPER.PY:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data.data
labels = data.target
n_samples, n_features = features.shape

for i in range(n_samples):
    feature_str = ','.join(map(str, features[i]))
    label = labels[i]
    print(f"Sample_{i}\t{feature_str},{label}")

REDUCER.PY:

import sys
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

sample_ids = []
X = []
y = []

for line in sys.stdin:
    try:
        sample_id, values = line.strip().split('\t')
        *features_str, label_str = values.split(',')
        features = list(map(float, features_str))
        label = int(label_str)
        sample_ids.append(sample_id)
        X.append(features)
        y.append(label)
    except Exception:
        continue

X_df = pd.DataFrame(X)
y_series = pd.Series(y)
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X_df, y_series, sample_ids, test_size=0.3, random_state=42
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Output predictions for test samples to stdout


for sample_id, pred_label in zip(ids_test, y_pred):
    print(f"{sample_id}\tClass_{pred_label}")

# Print evaluation metrics to stderr


print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}", file=sys.stderr)
print("Classification Report:", file=sys.stderr)
print(classification_report(y_test, y_pred), file=sys.stderr)
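
Because the reducer writes predictions to stdout and the evaluation metrics to stderr, a local run can keep the two separate; for example (file names are illustrative):

python3 mapper.py | python3 reducer.py > predictions.txt 2> metrics.txt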

OUTPUT:

RESULT:

MapReduce program for performing classification of a large multivariate dataset into two or more
classes is executed successfully.
EX.NO:08
PYSPARK PREPROCESSING AND DATA VISUALIZATION
DATE:21.08.25

AIM:

To plot box plots and histograms of all the numerical variables and show a statistical description of a
large dataset in Apache PySpark.

DATASET USED: TITANIC DATASET

CODE:

# Databricks notebook source


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan
from pyspark.sql.types import FloatType, DoubleType
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

# Load Titanic dataset from Unity Catalog Volume


df = spark.read.csv("dbfs:/Volumes/bda_1/default/titanic/titanic.csv",
header=True, inferSchema=True)

# Show first few rows


df.show(5)

# Count missing values for each column


missing = df.select([
    count(
        when(
            col(c).isNull() |
            (isnan(col(c)) if df.schema[c].dataType in [FloatType(), DoubleType()] else False) |
            ((col(c) == "") if df.schema[c].dataType.simpleString() == "string" else False),
            c
        )
    ).alias(c)
    for c in df.columns
])

display(missing)
# Summary statistics
print("Summary statistics:")
df.describe().show()

# Convert 'Age' column to Pandas for visualization


age_pd = df.select("Age").dropna().toPandas()

# Histogram of Age
plt.figure(figsize=(8,4))
plt.hist(age_pd['Age'], bins=20, color="skyblue", edgecolor="black")
plt.title("Histogram of Passenger Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# Boxplot of Age
plt.figure(figsize=(6,3))
plt.boxplot(age_pd['Age'], vert=False)
plt.title("Boxplot of Passenger Age")
plt.xlabel("Age")
plt.show()
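
The aim asks for plots of all the numerical variables; the Age-specific code above can be generalized by iterating over the numeric columns reported by df.dtypes, along these lines:

# Histogram and boxplot for every numeric column
numeric_cols = [name for name, dtype in df.dtypes if dtype in ("int", "bigint", "float", "double")]
for col_name in numeric_cols:
    col_pd = df.select(col_name).dropna().toPandas()[col_name]

    plt.figure(figsize=(8, 4))
    plt.hist(col_pd, bins=20, color="skyblue", edgecolor="black")
    plt.title(f"Histogram of {col_name}")
    plt.show()

    plt.figure(figsize=(6, 3))
    plt.boxplot(col_pd, vert=False)
    plt.title(f"Boxplot of {col_name}")
    plt.show()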

OUTPUT:

Missing values:

Summary statistics:
RESULT:

PySpark program for plotting box plots and histograms of all the numerical variables and showing a
statistical description of a large dataset is executed successfully.
EX.NO:09
PYSPARK CLASSIFICATION AND REGRESSION
DATE:28.08.25

AIM:

To perform classification and regression on a large dataset using Apache PySpark.

DATASET USED: WINE DATASET

CLASSIFICATION:

CODE:

from sklearn.datasets import load_wine


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Load wine dataset


wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Test Accuracy: {accuracy:.4f}\n")


print("Classification Report:")
print(classification_report(y_test, y_pred))
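
The code above uses scikit-learn; since the aim specifies Apache PySpark, an equivalent classification with Spark MLlib might look like the sketch below (feature columns are renamed to f0..f12 for simplicity; this is shown only as an illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.datasets import load_wine
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Build a Spark DataFrame from the wine data
wine = load_wine()
cols = [f"f{i}" for i in range(wine.data.shape[1])]
pdf = pd.DataFrame(wine.data, columns=cols)
pdf["label"] = wine.target
sdf = spark.createDataFrame(pdf)

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=cols, outputCol="features")
train, test = assembler.transform(sdf).randomSplit([0.7, 0.3], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label", maxIter=200).fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print(f"Test Accuracy: {evaluator.evaluate(predictions):.4f}")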

OUTPUT:
REGRESSION:

DATASET USED: DIABETES DATASET

CODE:

from sklearn.datasets import load_diabetes


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load diabetes dataset (regression)


diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate


y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse:.4f}")


print(f"Test R2 Score: {r2:.4f}\n")

# Plot predicted vs actual values


import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, edgecolor='k', alpha=0.7)


plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Linear Regression: Actual vs Predicted")
plt.show()
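
Likewise, the regression above uses scikit-learn; a Spark MLlib sketch of the same task (again only an illustration) could look like:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from sklearn.datasets import load_diabetes
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Build a Spark DataFrame from the diabetes data
diabetes = load_diabetes()
pdf = pd.DataFrame(diabetes.data, columns=list(diabetes.feature_names))
pdf["label"] = diabetes.target
sdf = spark.createDataFrame(pdf)

assembler = VectorAssembler(inputCols=list(diabetes.feature_names), outputCol="features")
train, test = assembler.transform(sdf).randomSplit([0.7, 0.3], seed=42)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(test)

mse = RegressionEvaluator(labelCol="label", metricName="mse").evaluate(predictions)
r2 = RegressionEvaluator(labelCol="label", metricName="r2").evaluate(predictions)
print(f"Test MSE: {mse:.4f}")
print(f"Test R2 Score: {r2:.4f}")
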
OUTPUT:

RESULT:

PySpark program for performing classification and regression on a large dataset using Apache Spark
is executed successfully.
