Big Data Analysis

EX.NO:01
INSTALLATION OF HADOOP ON UBUNTU
DATE:21.07.25

AIM:

To Install Hadoop in pseudo-distributed mode on Ubuntu.

STEPS:

1. Prerequisites
 Java Development Kit (JDK): Hadoop requires Java to run. Ensure you have a compatible JDK installed (e.g., OpenJDK 8 or 11); a sample install command is given at the end of this step. You can check your Java version with java -version.
 OpenSSH: Used for secure shell access, crucial for Hadoop's inter-process communication. Install it
with:
bash
sudo apt-get install openssh-server openssh-client
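
If a compatible JDK is not already present, it can be installed from the Ubuntu repositories, for example (assuming OpenJDK 8; substitute openjdk-11-jdk if you prefer Java 11):
bash
sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version   # confirm the installation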

2. Set up the Hadoop user and passwordless SSH


 Create a new user (recommended): It's best to create a dedicated user and group for Hadoop
installations.
bash
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo usermod -aG sudo hduser
 Configure passwordless SSH: This allows Hadoop daemons to communicate with each other without
constantly prompting for passwords.
bash
su - hduser
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost # Verify passwordless login (type 'yes' if prompted)
exit # Exit the hduser session

3. Download and install Hadoop


 Download Hadoop: Choose a stable Hadoop release from the Apache Hadoop project page. You can
use wget to download the tarball (e.g., for Hadoop 3.4).
bash
cd /usr/local
sudo wget [Hadoop download URL]
 Extract and configure: Extract the downloaded file and make necessary ownership changes.
bash
sudo tar -xzvf hadoop-3.x.x.tar.gz
sudo mv hadoop-3.x.x hadoop
sudo chown -R hduser:hadoop hadoop
4. Configure Hadoop environment variables
 Edit ~/.bashrc: Add the following lines to your ~/.bashrc file (replace /usr/lib/jvm/java-8-openjdk-
amd64 with your actual JAVA_HOME path) and then source the file.
bash
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Adjust based on your Java version
bash
source ~/.bashrc
 Edit hadoop-env.sh: Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set
the JAVA_HOME variable by uncommenting the line and providing the path to your JDK installation.
bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Adjust based on your Java version

5. Configure Hadoop files


Edit the following XML files located in $HADOOP_HOME/etc/hadoop:
 core-site.xml: Define the NameNode URL and a temporary directory for MapReduce processing.
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
Remember to create the /app/hadoop/tmp directory and set ownership for the hduser (see the example command after the hdfs-site.xml settings below).
 hdfs-site.xml: Configure DataNode directory and replication factor.
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/app/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/app/hadoop/hdfs/datanode</value>
</property>
</configuration>
Create the specified namenode and datanode directories and assign ownership to hduser.
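
The directories referenced in core-site.xml and hdfs-site.xml above do not exist by default; one way to create them and hand them over to hduser (run as a sudo-capable user) is:
bash
sudo mkdir -p /app/hadoop/tmp /app/hadoop/hdfs/namenode /app/hadoop/hdfs/datanode
sudo chown -R hduser:hadoop /app/hadoop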
 mapred-site.xml: Configure MapReduce to use YARN. (If this file doesn't exist, rename mapred-
site.xml.template to mapred-site.xml and then edit it).
xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
 yarn-site.xml: Configure ResourceManager and NodeManager.
xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>

6. Format NameNode and start Hadoop daemons


 Format the HDFS NameNode: This initializes the file system. Only do this once during the initial
setup.
bash
hdfs namenode -format
 Start HDFS and YARN daemons:
bash
start-dfs.sh
start-yarn.sh

7. Verify installation
 Use jps to check if all Hadoop daemons are running (NameNode, DataNode, SecondaryNameNode,
ResourceManager, NodeManager).
 Access the Hadoop ResourceManager web interface in your browser by visiting http://localhost:8088.
You should be able to monitor the cluster and view nodes.
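As an additional check, you can exercise HDFS itself as hduser; for example (the target path below is arbitrary):
bash
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hduser
hdfs dfs -ls /user/hduser
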
Important notes
 The pseudo-distributed mode is suitable for learning, testing, and development.
 For large-scale data processing in production environments, a fully-distributed Hadoop cluster with
multiple nodes is necessary.
 Remember to stop the Hadoop daemons using stop-dfs.sh and stop-yarn.sh before shutting down
your system to avoid potential errors on subsequent startups. You can also use stop-all.sh to stop both
HDFS and YARN daemons.

RESULT:

Installation of Hadoop in pseudo-distributed mode is done successfully.


EX.NO:02
BASIC MAPREDUCE PROGRAM
DATE:21.07.25

AIM:
To implement MapReduce programs for word count and matrix multiplication.

WORD COUNT PROGRAM:

MAPPER.PY:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for i in words:
        print(f"{i}\t1")

REDUCER.PY:

#!/usr/bin/env python3
import sys

cword=None
ccount=0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)

    if word == cword:
        ccount += count
    else:
        if cword:
            print(f"{cword}\t{ccount}")
        cword = word
        ccount = count

if cword:
    print(f"{cword}\t{ccount}")

OUTPUT:
MATRIX MULTIPLICATION PROGRAM:

MAPPER.PY:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    parts = line.strip().split(",")
    if len(parts) != 4:
        continue
    matrix, i, j, value = parts
    i, j, value = int(i), int(j), int(value)

    if matrix == "A":
        for col in range(2):
            print(f"{i},{col}\tA,{j},{value}")
    elif matrix == "B":
        for row in range(2):
            print(f"{row},{j}\tB,{i},{value}")

REDUCER.PY:

#!/usr/bin/env python3
import sys
from collections import defaultdict

current_key = None
a_values = defaultdict(int)
b_values = defaultdict(int)

for line in sys.stdin:
    key, value = line.strip().split("\t")
    i, j = map(int, key.strip().split(","))
    tag, k, val = value.strip().split(",")
    k = int(k)
    val = int(val)

    if current_key != (i, j):
        if current_key:
            total = 0
            for index in a_values:
                total += a_values[index] * b_values.get(index, 0)
            print(f"{current_key[0]},{current_key[1]}\t{total}")

        current_key = (i, j)
        a_values = defaultdict(int)
        b_values = defaultdict(int)

    if tag == "A":
        a_values[k] = val
    elif tag == "B":
        b_values[k] = val

if current_key:
    total = 0
    for index in a_values:
        total += a_values[index] * b_values.get(index, 0)
    print(f"{current_key[0]},{current_key[1]}\t{total}")

OUTPUT:

RESULT:

MapReduce programs for word count and matrix multiplication are implemented successfully.
EX.NO:03
STATISTICAL METHODS FOR LARGE DATASET
DATE:29.07.25

AIM:

To compute descriptive statistics – mean, median, mode, standard deviation on a large dataset.

DATASET USED: IRIS DATASET

CODE:
MAPPER.PY:
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
features = data['data']
feature_names = data['feature_names']

for col_index, col_name in enumerate(feature_names):
    for row in features:
        value = float(row[col_index])
        key = col_name.replace(" ", "_")
        print(f"{key}\t1\t{value}\t{value**2}\t{value}\t{value}")

REDUCER.PY:
import sys
import math
from collections import defaultdict

stats = defaultdict(lambda: {
    "count": 0,
    "sum": 0.0,
    "sum_sq": 0.0,
    "min": float('inf'),
    "max": float('-inf')
})

for line in sys.stdin:
    try:
        key, count, sum_val, sum_sq, min_v, max_v = line.strip().split('\t')
        count = int(count)
        sum_val = float(sum_val)
        sum_sq = float(sum_sq)
        min_v = float(min_v)
        max_v = float(max_v)

        stats[key]["count"] += count
        stats[key]["sum"] += sum_val
        stats[key]["sum_sq"] += sum_sq
        stats[key]["min"] = min(stats[key]["min"], min_v)
        stats[key]["max"] = max(stats[key]["max"], max_v)
    except:
        continue

for key, stat in stats.items():
    count = stat["count"]
    total_sum = stat["sum"]
    sum_sq = stat["sum_sq"]
    min_val = stat["min"]
    max_val = stat["max"]

    mean = total_sum / count
    variance = (sum_sq / count) - (mean ** 2)
    std_dev = math.sqrt(variance)
    range_val = max_val - min_val

    print(f"\n--- {key.replace('_', ' ').title()} ---")
    print(f"Count:\t{count}")
    print(f"Sum:\t{total_sum:.2f}")
    print(f"Mean:\t{mean:.2f}")
    print(f"Min:\t{min_val:.2f}")
    print(f"Max:\t{max_val:.2f}")
    print(f"Range:\t{range_val:.2f}")
    print(f"Variance:\t{variance:.2f}")
    print(f"Standard Deviation:\t{std_dev:.2f}")

OUTPUT:

RESULT:

MapReduce program for computing descriptive statistics (mean, median, mode, standard deviation) is
executed successfully.
EX.NO:04
VISUALIZATION OF LARGE DATASET
DATE:29.07.25

AIM:

To plot visualizations [Histogram and Box plot] for a large dataset.

DATASET USED: IPL DATASET

CODE:

MAPPER.PY:

import sys
import csv

reader = csv.reader(sys.stdin)
next(reader) # Skip header

for row in reader:
    try:
        print(f"4s\t{int(row[4])}")
        print(f"6s\t{int(row[5])}")
    except:
        continue

REDUCER.PY:

import sys

data = {"4s": [], "6s": []}

for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key in data:
        data[key].append(int(value))

for key in data:
    print(f"{key}\t{','.join(map(str, data[key]))}")

VISUALIZE.PY:

import pandas as pd
import matplotlib.pyplot as plt

# Read data from files with headers


fours = pd.read_csv("fours.txt")["4s"].astype(float)
sixes = pd.read_csv("sixes.txt")["6s"].astype(float)
# Calculate IQR for 4s
iqr_fours = fours.quantile(0.75) - fours.quantile(0.25)
# Calculate IQR for 6s
iqr_sixes = sixes.quantile(0.75) - sixes.quantile(0.25)

# Print IQR values


print(f"IQR for 4s: {iqr_fours}")
print(f"IQR for 6s: {iqr_sixes}")

# Boxplot
plt.figure(figsize=(10, 5))
plt.boxplot([fours, sixes], labels=['4s', '6s'])
plt.title("Boxplot of 4s and 6s")
plt.xlabel("Shots")
plt.ylabel("Count")
plt.grid(True)
plt.savefig("boxplot.png")
plt.show()
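
The aim also calls for a histogram; one way to add it alongside the boxplot in VISUALIZE.PY is:

# Histogram of 4s and 6s
plt.figure(figsize=(10, 5))
plt.hist([fours, sixes], bins=15, label=['4s', '6s'], edgecolor='black')
plt.title("Histogram of 4s and 6s")
plt.xlabel("Count per record")
plt.ylabel("Frequency")
plt.legend()
plt.grid(True)
plt.savefig("histogram.png")
plt.show()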

OUTPUT:
RESULT:

MapReduce program for visualizing the histogram and box plot of a large dataset is executed successfully.
EX.NO:05
CORRELATION ANALYSIS OF LARGE DATASET
DATE:05.08.25

AIM:

To plot a correlation matrix for a large multivariate dataset.

DATASET USED: Wine Dataset

CODE:
MAPPER.PY:

from sklearn.datasets import load_wine


wine = load_wine()
features = wine.data
feature_names = wine.feature_names

for i, feature in enumerate(feature_names):
    for row in features:
        print(f"{feature}\t{row[i]}")

REDUCER.PY:

import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from collections import defaultdict

data = defaultdict(list)

for line in sys.stdin:
    try:
        key, value = line.strip().split('\t')
        data[key].append(float(value))
    except:
        continue

df = pd.DataFrame(dict(data))

corr_df = df.corr()

plt.figure(figsize=(8,6))
sb.heatmap(corr_df, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
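
Because the mapper generates its records directly from scikit-learn's wine loader, the whole job can be exercised locally with a single pipe (a display is needed for plt.show(); alternatively save the figure with plt.savefig):

python3 mapper.py | python3 reducer.py
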
OUTPUT:

RESULT:

MapReduce program for plotting a correlation matrix for a large multivariate dataset is executed
successfully.
EX.NO:06
CLUSTERING ANALYSIS OF LARGE DATASET
DATE:05.08.25

AIM:

To perform clustering analysis on a large multivariate dataset.

DATASET USED: Wine Dataset

CODE:

MAPPER.PY:

from sklearn.datasets import load_wine

wine = load_wine()
features = wine.data
feature_names = wine.feature_names

for i, feature in enumerate(feature_names):
    for row in features:
        print(f"{feature}\t{row[i]}")

REDUCER.PY:

import sys
import pandas as pd
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

data = defaultdict(list)

for line in sys.stdin:
    try:
        key, value = line.strip().split('\t')
        data[key].append(float(value))
    except:
        continue

df = pd.DataFrame(dict(data))

k=3
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(df)

for idx, cluster_id in enumerate(clusters):
    print(f"Sample_{idx}\tCluster_{cluster_id}")

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=clusters, cmap='viridis', s=50)
plt.title("KMeans Clusters of Wine Dataset (PCA-reduced)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()
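
Here k is fixed at 3, matching the three cultivars in the wine data. If you want to check that choice, a silhouette score can be computed for a few candidate values of k by appending something like the following to REDUCER.PY (df and KMeans are already in scope there):

from sklearn.metrics import silhouette_score

# Compare cluster quality for a few candidate values of k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(df)
    print(f"k={k}: silhouette = {silhouette_score(df, labels):.3f}")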

OUTPUT:
RESULT:

MapReduce program for performing clustering analysis on a large multivariate dataset is executed
successfully.
EX.NO:07
CLASSIFICATION ANALYSIS OF LARGE DATASET
DATE:12.08.25

AIM:
To perform classification of a large multi-variate dataset into two or more classes.

DATASET USED: Breast_cancer Dataset

CODE:
MAPPER.PY:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data.data
labels = data.target
n_samples, n_features = features.shape

for i in range(n_samples):
    feature_str = ','.join(map(str, features[i]))
    label = labels[i]
    print(f"Sample_{i}\t{feature_str},{label}")

REDUCER.PY:

import sys
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

sample_ids = []
X = []
y = []

for line in sys.stdin:
    try:
        sample_id, values = line.strip().split('\t')
        *features_str, label_str = values.split(',')
        features = list(map(float, features_str))
        label = int(label_str)
        sample_ids.append(sample_id)
        X.append(features)
        y.append(label)
    except Exception:
        continue

X_df = pd.DataFrame(X)
y_series = pd.Series(y)
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X_df, y_series, sample_ids, test_size=0.3, random_state=42
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Output predictions for test samples to stdout


for sample_id, pred_label in zip(ids_test, y_pred):
    print(f"{sample_id}\tClass_{pred_label}")

# Print evaluation metrics to stderr


print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}", file=sys.stderr)
print("Classification Report:", file=sys.stderr)
print(classification_report(y_test, y_pred), file=sys.stderr)
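
Because the reducer writes predictions to stdout and the evaluation metrics to stderr, a local run can keep the two separate; for example (file names are illustrative):

python3 mapper.py | python3 reducer.py > predictions.txt 2> metrics.txt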

OUTPUT:

RESULT:

MapReduce program for performing classification of a large multivariate dataset into two or more
classes is executed successfully.
EX.NO:08
PYSPARK PREPROCESSING AND DATA VISUALIZATION
DATE:21.08.25

AIM:

To plot box plots and histograms of all the numerical variables and show a statistical description of a
large dataset in Apache PySpark.

DATASET USED: TITANIC DATASET

CODE:

# Databricks notebook source


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan
from pyspark.sql.types import FloatType, DoubleType
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

# Load Titanic dataset from Unity Catalog Volume


df = spark.read.csv("dbfs:/Volumes/bda_1/default/titanic/titanic.csv",
header=True, inferSchema=True)

# Show first few rows


df.show(5)

# Count missing values for each column


missing = df.select([
    count(
        when(
            col(c).isNull() |
            (isnan(col(c)) if df.schema[c].dataType in [FloatType(), DoubleType()] else False) |
            ((col(c) == "") if df.schema[c].dataType.simpleString() == "string" else False),
            c
        )
    ).alias(c)
    for c in df.columns
])

display(missing)
# Summary statistics
print("Summary statistics:")
df.describe().show()

# Convert 'Age' column to Pandas for visualization


age_pd = df.select("Age").dropna().toPandas()

# Histogram of Age
plt.figure(figsize=(8,4))
plt.hist(age_pd['Age'], bins=20, color="skyblue", edgecolor="black")
plt.title("Histogram of Passenger Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# Boxplot of Age
plt.figure(figsize=(6,3))
plt.boxplot(age_pd['Age'], vert=False)
plt.title("Boxplot of Passenger Age")
plt.xlabel("Age")
plt.show()
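
The aim asks for plots of all the numerical variables; the Age-specific code above can be generalized by iterating over the numeric columns reported by df.dtypes, along these lines:

# Histogram and boxplot for every numeric column
numeric_cols = [name for name, dtype in df.dtypes if dtype in ("int", "bigint", "float", "double")]
for col_name in numeric_cols:
    col_pd = df.select(col_name).dropna().toPandas()[col_name]

    plt.figure(figsize=(8, 4))
    plt.hist(col_pd, bins=20, color="skyblue", edgecolor="black")
    plt.title(f"Histogram of {col_name}")
    plt.show()

    plt.figure(figsize=(6, 3))
    plt.boxplot(col_pd, vert=False)
    plt.title(f"Boxplot of {col_name}")
    plt.show()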

OUTPUT:

Missing values:

Summary statistics:
RESULT:

PySpark program for plotting box plots and histograms of all the numerical variables and showing a
statistical description of a large dataset is executed successfully.
EX.NO:09
PYSPARK CLASSIFICATION AND REGRESSION
DATE:28.08.25

AIM:

To perform classification and regression on a large dataset using Apache PySpark.

DATASET USED: WINE DATASET

CLASSIFICATION:

CODE:

from sklearn.datasets import load_wine


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Load wine dataset


wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model


model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Test Accuracy: {accuracy:.4f}\n")


print("Classification Report:")
print(classification_report(y_test, y_pred))
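
The code above uses scikit-learn; since the aim specifies Apache PySpark, an equivalent classification with Spark MLlib might look like the sketch below (feature columns are renamed to f0..f12 for simplicity; this is shown only as an illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.datasets import load_wine
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Build a Spark DataFrame from the wine data
wine = load_wine()
cols = [f"f{i}" for i in range(wine.data.shape[1])]
pdf = pd.DataFrame(wine.data, columns=cols)
pdf["label"] = wine.target
sdf = spark.createDataFrame(pdf)

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=cols, outputCol="features")
train, test = assembler.transform(sdf).randomSplit([0.7, 0.3], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label", maxIter=200).fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print(f"Test Accuracy: {evaluator.evaluate(predictions):.4f}")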

OUTPUT:
REGRESSION:

DATASET USED: DIABETES DATASET

CODE:

from sklearn.datasets import load_diabetes


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load diabetes dataset (regression)


diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate


y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse:.4f}")


print(f"Test R2 Score: {r2:.4f}\n")

# Plot predicted vs actual values


import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, edgecolor='k', alpha=0.7)


plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Linear Regression: Actual vs Predicted")
plt.show()
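
Likewise, the regression above uses scikit-learn; a Spark MLlib sketch of the same task (again only an illustration) could look like:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from sklearn.datasets import load_diabetes
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Build a Spark DataFrame from the diabetes data
diabetes = load_diabetes()
pdf = pd.DataFrame(diabetes.data, columns=list(diabetes.feature_names))
pdf["label"] = diabetes.target
sdf = spark.createDataFrame(pdf)

assembler = VectorAssembler(inputCols=list(diabetes.feature_names), outputCol="features")
train, test = assembler.transform(sdf).randomSplit([0.7, 0.3], seed=42)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(test)

mse = RegressionEvaluator(labelCol="label", metricName="mse").evaluate(predictions)
r2 = RegressionEvaluator(labelCol="label", metricName="r2").evaluate(predictions)
print(f"Test MSE: {mse:.4f}")
print(f"Test R2 Score: {r2:.4f}")
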
OUTPUT:

RESULT:

PySpark program for performing classification and regression on a large dataset using Apache Spark
is executed successfully.
