
Savitribai Phule Pune University

Modern Education Society’s Wadia College of Engineering, Pune
19, Bund Garden, V.K. Joag Path, Pune – 411001.

ACCREDITED BY NBA AND NAAC WITH ’A++’ GRADE

DEPARTMENT OF COMPUTER ENGINEERING

MINI PROJECT REPORT

SUBMITTED BY
Ms. Khushi Ranjitsing Rajput (55)
Ms. Anushka Jaywant Kharade (56)
Ms. Aditi Pankaj Pawar (58)

(Academic Year: 2023-2024)


Savitribai Phule Pune University
Modern Education Society’s Wadia College of Engineering, Pune
19, Bund Garden, V.K. Joag Path, Pune – 411001.

ACCREDITED BY NBA AND NAAC WITH ’A++’ GRADE

DEPARTMENT OF COMPUTER ENGINEERING

Certificate

This is to certify that the “Mini Project Report” submitted by Khushi
Ranjitsing Rajput (55), Anushka Jaywant Kharade (56), and Aditi Pankaj
Pawar (58) is work done by them and submitted during the 2023-24 academic year, in
partial fulfillment of the requirements for the award of the degree of BACHELOR
OF ENGINEERING in COMPUTER ENGINEERING, at MES Wadia
College of Engineering, Pune.

Prof. A. D. Dhawale (Dr. (Mrs.) N. F. Shaikh)


Supervisor Head of Department

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to our HOD, Dr. (Mrs.) N. F. Shaikh,
for guiding us through the mini project on cloud computing and for her continuous
support and encouragement during the process.
Special thanks to my subject teacher and mentor, Prof. A. D. Dhawale, for his
invaluable mentorship, patience, and expertise during my mini project. Additionally, I
appreciate my entire team for their collaborative spirit.

Ms. Khushi Ranjitsing Rajput (F21111066)


T.E. Computer



Contents

1 INTRODUCTION 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 PROBLEM STATEMENT 3
2.1 Project Problem Statements . . . . . . . . . . . . . . . . . . . . . . . . 3

3 SYSTEM ANALYSIS 4
3.1 Architecture of Hadoop with Focus on HDFS: . . . . . . . . . . . . . . 4
3.2 Key Features of HDFS: . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3 Hardware Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.4 Software Requirement Specifications . . . . . . . . . . . . . . . . . . . . 6

4 METHODOLOGY 7

5 RESULTS 11

6 CONCLUSION 13

7 REFERENCES 14

List of Figures

3.1 HDFS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.1 Implementation step 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


4.2 Implementation step 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Terminal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

5.1 Result shown in the browser . . . . . . . . . . . . . . . . . . . . . . 11

Abstract

This report presents the implementation and evaluation of a file management system
using Hadoop Distributed File System (HDFS) within the context of beginner-level
cloud computing. The project aimed to demonstrate the scalability, fault tolerance,
and efficiency of HDFS in handling basic file management tasks, including file creation,
retrieval, and deletion. Through practical experimentation and testing, the system’s
performance and reliability were rigorously evaluated, showcasing its suitability for
cloud computing projects of varying complexities. The results highlight the system’s
ability to efficiently manage data across distributed clusters, maintain data integrity,
and provide a user-friendly interface for executing file management operations. Furthermore,
the report discusses the significance of implementing a file management system
using HDFS in the context of cloud computing, emphasizing its role in streamlining
data storage, retrieval, and processing tasks, and fostering innovation in data
management practices. Lastly, potential future enhancements and extensions to the
project are discussed, offering insights into opportunities for further exploration and
improvement in cloud-based file management and data processing.
Chapter 1

INTRODUCTION

1.1 Introduction
Cloud computing has revolutionized technology by providing on-demand access to
computing services over the internet, transforming how businesses and individuals utilize
computational resources. This shift offers unparalleled flexibility, scalability, and
accessibility globally. Cloud computing’s significance lies in its unmatched scalability,
cost-effectiveness, and accessibility, allowing organizations to dynamically adjust
resources, optimize IT budgets, and collaborate seamlessly from anywhere. Its reliable
infrastructure ensures high availability and fault tolerance, while democratizing access
to advanced technologies like AI and big data analytics, driving innovation and growth
in the digital era.
The project focuses on implementing a File Management System using Hadoop
Distributed File System (HDFS). In distributed systems like Hadoop, efficient file
management is crucial for handling large volumes of data across multiple nodes. HDFS
provides a scalable and fault-tolerant solution for storing and managing files in such
environments. This project explores the significance of effective file management in
distributed systems and demonstrates how HDFS enables seamless storage, retrieval, and
manipulation of data, laying the foundation for robust data processing and analytics
workflows.


1.2 Motivation
The project on ’File Management System using HDFS’ is motivated by the escalating
importance of big data and the necessity for scalable file storage solutions to handle
the massive volumes of data generated in today’s digital era. Traditional file systems
often struggle to manage such extensive datasets, prompting the exploration of HDFS
(Hadoop Distributed File System) as a distributed file storage solution tailored for
big data workloads. HDFS, with its fault-tolerant architecture and scalability
features, offers a platform capable of storing and processing vast amounts of data across
distributed clusters, aligning with the evolving data storage and processing needs of
modern organizations.
Hadoop and HDFS play a crucial role in managing large data volumes across
distributed clusters by enabling parallel processing of massive datasets on commodity
hardware. The scalability and cost-effectiveness of Hadoop and HDFS make them
essential tools for organizations dealing with growing data requirements, allowing for
efficient storage, management, and analysis of extensive datasets from diverse sources like
IoT devices and social media platforms. These technologies have become indispensable
for organizations seeking to leverage big data analytics for informed decision-making
and driving innovation in a data-driven world.



Chapter 2

PROBLEM STATEMENT

2.1 Project Problem Statements


The project aims to address the challenges faced by traditional file management systems
in distributed environments, particularly in handling big data. These challenges include
limited scalability, performance bottlenecks, and a lack of fault tolerance, all of which
hinder efficient data management and processing across distributed clusters. Recognizing
the need for a robust and scalable file management system capable of handling big
data, the project explores the implementation of HDFS (Hadoop Distributed File System)
as a solution to these challenges, enabling organizations to store, manage, and analyze
large datasets in distributed environments with enhanced scalability, fault tolerance,
and performance.

Chapter 3

SYSTEM ANALYSIS

3.1 Architecture of Hadoop with Focus on HDFS:


• Master-Slave Architecture: Hadoop follows a master-slave architecture where
there is one master node (NameNode) and multiple slave nodes (DataNodes).
The NameNode manages the file system namespace and regulates access to files
by clients. DataNodes store the actual data blocks of files and execute read and
write requests.

• NameNode: Acts as the central metadata repository for the file system. Stores
information about the directory tree structure, file permissions, and the mapping
of data blocks to DataNodes. Handles client requests for file system operations
such as opening, closing, and renaming files.

• DataNodes: Store and manage the actual data blocks of files in the distributed file
system. Report their status to the NameNode periodically, providing information
about available storage capacity and health.
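As an illustration of this master-slave interaction, the state of a running cluster can be inspected from the command line. The two commands below are a minimal sketch, assuming the daemons are already running and that /1.txt is an example file that exists in HDFS:

hdfs dfsadmin -report                          # ask the NameNode to list every registered DataNode with its capacity, used space, and status
hdfs fsck /1.txt -files -blocks -locations     # trace which blocks make up the file and which DataNodes hold each replica

The first command reports the cluster-wide view maintained by the NameNode; the second exposes the block-to-DataNode mapping described above.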

Figure 3.1: HDFS Architecture


3.2 Key Features of HDFS:


1. Fault Tolerance: HDFS achieves fault tolerance through data replication. Data
blocks are replicated across multiple DataNodes, typically three replicas by
default. If a DataNode fails, the replicas hosted on other DataNodes ensure data
availability and durability.

2. Scalability: HDFS is designed to scale horizontally to accommodate growing data
volumes. New DataNodes can be added to the cluster to increase storage capacity
and throughput. The distributed nature of HDFS allows it to handle petabytes
or even exabytes of data seamlessly.

3. Data Locality: Data locality is a fundamental principle of HDFS that aims to
optimize data processing performance. HDFS tries to execute computations on
the nodes where the data resides to minimize data movement across the network.
By co-locating computation with data, HDFS reduces network congestion and
improves processing efficiency.

4. High Throughput: HDFS is optimized for streaming data access patterns, making
it suitable for applications that require high throughput. It prioritizes sequential
reads and writes over random access, making it ideal for large-scale data
processing tasks such as batch processing and data warehousing.

5. Data Integrity: HDFS ensures data integrity through checksums and periodic
data integrity checks. Checksums are used to verify data consistency during reads
and writes. Periodic data integrity checks detect and correct data corruption
issues proactively.

6. Compression: HDFS supports data compression to reduce storage requirements
and improve data transfer efficiency. It provides built-in compression codecs such
as Gzip, Snappy, and LZO, allowing users to compress data transparently.

These key features make HDFS a robust and reliable distributed file system, well-
suited for storing and processing large-scale data sets in Hadoop clusters.
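To make the replication-based fault tolerance described above concrete, the replication factor of an individual file can be changed and verified from the HDFS shell. The commands below are an illustrative sketch, assuming a running cluster; /1.txt is an example path and the factor 2 is arbitrary:

hadoop fs -setrep -w 2 /1.txt        # set the file's replication factor to 2 and wait until the replicas are in place
hdfs fsck /1.txt -files -blocks      # confirm the replication count reported for each block

The cluster-wide default replication factor is governed by the dfs.replication property in hdfs-site.xml, which defaults to 3.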

3.3 Hardware Infrastructure


• Servers with high processing power and memory.

• Large-scale distributed storage (HDDs or SSDs).

• Networking Equipment: Gigabit Ethernet or higher-speed networking; network
switches for inter-node communication.

• Rack Infrastructure: Rack-mounted servers for space efficiency; proper cable
management and PDUs.

• Cluster management software and monitoring systems.

• Redundant hardware components and failover mechanisms.

• Off-site backup storage and replication.


3.4 Software Requirement Specifications


• Apache Hadoop

• Linux-based operating system

• Java Development Kit (JDK)

• Hadoop Command-Line Interface (CLI)

• Hadoop Distributed File System (HDFS)

• Hadoop Common

• Additional Tools and Libraries:

1. Apache Hive
2. Apache HBase
3. Apache Spark
4. Apache Pig
5. Apache Sqoop
6. Apache Flume
7. Apache Oozie
8. Apache Mahout
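Before attempting any file management operations, the installed software stack can be verified from a terminal. The checks below are a suggested sketch rather than a mandated procedure:

java -version       # confirm that the JDK is installed and on the PATH
hadoop version      # print the installed Hadoop release
jps                 # list running JVM processes; NameNode and DataNode should appear once the daemons are started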



Chapter 4

METHODOLOGY

The methodology for implementing file management tasks in Hadoop involves a series
of structured steps to leverage the Hadoop ecosystem effectively. The process begins
by starting the Hadoop daemons and initializing the local host, setting
the groundwork for subsequent operations. Accessing the Hadoop Distributed File
System (HDFS) through a web browser interface provides a centralized platform for file
management actions. From there, utilizing command-line interfaces becomes pivotal for
executing various operations. Creating files and directories is achieved through specific
commands, such as ’hadoop fs -touchz’ for files and ’hadoop fs -mkdir’ for directories,
allowing users to establish and organize data structures within the distributed file
system seamlessly.
1. Start the Hadoop Daemons and Initialize the Local Host: Begin by launching the Hadoop
daemons and initializing the local host on your workstation so that the NameNode and DataNode services are running.
2. Access HDFS System Files: Open a web browser and navigate to ”localhost:9870”
to access the Hadoop Distributed File System (HDFS) through the Hadoop web
interface. Within the utilities section, select ”Browse” to explore system files and
directories stored in HDFS.
3. Creating Files: Open the command prompt. Utilize the following commands to
create files:
Example: hadoop fs -touchz /1.txt
Additional example: hadoop fs -touchz /docx
4. Creating Directories: Use the following command to create directories:
hadoop fs -mkdir /mydir
5. Creating Subdirectories: Employ the command below to create subdirectories
within existing directories: hadoop fs -mkdir /mydir/dir1
6. Retrieving Files: To copy a file from HDFS to the local file system, use:
hadoop fs -get /1.txt <local destination path>
Conversely, to upload a file from the local file system into HDFS, use:
hadoop fs -put <local file path> <HDFS destination path>
7. Deleting Files: To delete a file from HDFS, execute:
hadoop fs -rm /1.txt
8. Deleting Directories: For deleting directories:
Utilize the command: hadoop fs -rm -r /directoryname
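Taken together, the steps above can be reproduced as a single command-line session. The listing below is an illustrative sketch rather than a transcript of the actual run; the file name /1.txt, the directory /mydir, and the local path /tmp are example values:

start-dfs.sh                              # start the NameNode and DataNode daemons
hadoop fs -touchz /1.txt                  # create an empty file in HDFS
hadoop fs -mkdir /mydir                   # create a directory
hadoop fs -mkdir /mydir/dir1              # create a subdirectory
hadoop fs -ls /                           # list the HDFS root to confirm the objects exist
hadoop fs -get /1.txt /tmp/1.txt          # copy the file from HDFS to the local file system
hadoop fs -put /tmp/1.txt /mydir/1.txt    # upload the local copy back into HDFS
hadoop fs -rm /1.txt                      # delete a file from HDFS
hadoop fs -rm -r /mydir                   # delete a directory and its contents
stop-dfs.sh                               # stop the daemons when finished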


Figure 4.1: Implementation step 1


Figure 4.2: Implementation step 2

Following these steps ensured effective utilization of Hadoop for file management
tasks, enabling the creation, deletion, and retrieval of files and directories within the
HDFS environment.
Moreover, the methodology underscores the importance of clarity and precision in
command execution to avoid errors and streamline operations effectively. Retrieving
files from HDFS involves employing ’hadoop fs -get’ or ’hadoop fs -put’ commands,
facilitating seamless data transfer between local and distributed file systems. Similarly,
deletion operations for files and directories are executed with the ’hadoop fs -rm’
command, ensuring efficient management of data resources within the Hadoop ecosystem.
By adhering to this structured methodology, users can harness the power of Hadoop
for file management tasks, enabling efficient creation, retrieval, and deletion of files
and directories, thereby facilitating robust data management practices in distributed
computing environments.


Figure 4.3: Terminal



Chapter 5

RESULTS

The project’s results demonstrate the successful implementation of a file management
system using Hadoop Distributed File System (HDFS) for our cloud computing subject.
Through practical experimentation and testing, we showcased the system’s ability to
handle basic file management tasks efficiently within a Hadoop cluster environment.
The system exhibited notable scalability, allowing it to manage varying workloads and
data volumes effectively, showcasing its suitability for beginner-level cloud computing
projects.

Figure 5.1: Result shown in the browser

Furthermore, the fault tolerance mechanisms inherent in HDFS were demonstrated,
ensuring data integrity and availability even under simulated failure scenarios. The
simplicity and ease of use of the system’s command-line interface (CLI) were evident,
providing us with a straightforward means of executing file management operations
using familiar commands. Overall, the project’s results underscore our grasp of fundamental
cloud computing concepts and our ability to apply them in practical scenarios,
setting a strong foundation for our future endeavors in the field.



Chapter 6

CONCLUSION

In our project, we successfully implemented a file management system using Hadoop
Distributed File System (HDFS), showcasing its scalability, fault tolerance, and
efficiency within a beginner-level cloud computing context. Through rigorous testing,
we demonstrated the system’s ability to handle basic file management tasks, such as
creating, retrieving, and deleting files and directories, while maintaining consistent
performance. The system’s scalability was evident as it efficiently managed varying
workloads and data volumes, showcasing its suitability for cloud computing projects of
varying complexities. Additionally, the fault tolerance mechanisms inherent in HDFS
ensured data integrity and availability, even under simulated failure scenarios,
highlighting the reliability of the system in handling adverse conditions. The simplicity
and ease of use of the system’s command-line interface (CLI) provided a user-friendly
means of executing file management operations, underscoring our grasp of fundamental
cloud computing concepts and our ability to apply them effectively.
Implementing a file management system using HDFS is significant in the context
of cloud computing as it provides a scalable, reliable, and efficient solution for
managing large volumes of data across distributed clusters. This not only streamlines data
storage and retrieval processes but also lays the foundation for more advanced data
processing and analytics tasks. By leveraging HDFS, organizations can unlock new
possibilities for collaboration, innovation, and insights, driving efficiency and
competitiveness in the digital era. Furthermore, integrating fault tolerance mechanisms ensures
data reliability and continuity, enhancing the resilience of cloud-based applications and
services. Moving forward, potential enhancements to the project could include
exploring the integration of additional Hadoop ecosystem components, optimizing the system
for specific use cases, and incorporating advanced security features to address evolving
data privacy and security concerns in cloud environments.

Chapter 7

REFERENCES

[1]https://aws.amazon.com/what-is/hadoop/.

[2]https://www.simplilearn.com/tutorials/hadoop-tutorial

[3]https://www.youtube.com/watch?v=UcU8XiqL7MA

[4]https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs.

[5]https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/

[6]https://www.youtube.com/watch?v=JcRqicngvrA

[7]https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCom
