BDA Report
BACHELOR OF TECHNOLOGY
IN
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
1 23885A7301 A. Akshaya
2 23885A7302 A. Yashwanth
3 23885A7303 Ch. Shamala Divya
4 23885A7304 G. Manish Kumar
5 23885A7305 N. Naveen
6 23885A7306 P. Abhinay
7 23885A7307 S. Vinay
CERTIFICATE
This is to certify that the BDA Course End Project report entitled “Big Data Pipeline for Log
File Analysis”, done by A. Akshaya (23885A7301), A. Yashwanth (23885A7302),
Ch. Shamala Divya (23885A7303), G. Manish Kumar (23885A7304), N. Naveen (23885A7305),
P. Abhinay (23885A7306), and S. Vinay (23885A7307), submitted to the Department of Artificial
Intelligence & Machine Learning, VARDHAMAN COLLEGE OF ENGINEERING, in
partial fulfilment of the requirements for the Degree of BACHELOR OF TECHNOLOGY in
Artificial Intelligence & Machine Learning, during the year 2024-25. It is certified that they
have completed the project satisfactorily.
We hereby declare that the work described in this BDA Course End Project report,
entitled “Big Data Analytics Course End Project”, which is being submitted by us in partial
fulfilment of the requirements for the award of BACHELOR OF TECHNOLOGY in the Department of
Artificial Intelligence & Machine Learning, Vardhaman College of Engineering, affiliated
to the Jawaharlal Nehru Technological University Hyderabad, is our own.
The work is original and has not been submitted for any Degree or Diploma of this or any other
university.
A. Akshaya (23885A7301)
A. Yashwanth (23885A7302)
Ch. Shamala Divya (23885A7303)
G. Manish Kumar (23885A7304)
N. Naveen (23885A7305)
P. Abhinay (23885A7306)
S. Vinay (23885A7307)
In this project, we developed a log level analysis system using Hadoop MapReduce to
process and categorize large log files. The primary goal was to analyze logs and identify the
distribution of different log levels (DEBUG, INFO, ERROR, WARN, TRACE). Log files are
critical for debugging, system monitoring, and performance optimization, as they provide
vital insights into system behavior and errors. However, handling large volumes of log data
manually is impractical, which is where Hadoop’s distributed computing model comes in.

Using Hadoop MapReduce, the project processes log files stored on HDFS (Hadoop
Distributed File System). The Mapper class reads each log line, extracts the log level, and
outputs it along with a count of 1. The Reducer aggregates these counts and outputs the total
count for each log level. This system allows for efficient processing of large datasets in
parallel across a cluster, making it scalable for use in production environments.

The result of the analysis is a set of log level counts, which can be useful for identifying
system trends, frequent errors, or areas where debugging efforts should be focused. The
project showcases how big data tools like Hadoop can be leveraged for log analysis and other
real-world use cases that involve large-scale data processing.
Keywords: Hadoop, HDFS, MapReduce, Log Analysis, Log Parsing, Data Aggregation, Distributed
Computing, Fault Tolerance, System Monitoring.
To address these challenges, Big Data technologies like Apache Hadoop have emerged as game-
changers. Hadoop offers a distributed computing framework capable of handling vast datasets
across clusters of machines, providing scalability, fault tolerance, and high-speed processing.
This project explores how Hadoop, specifically using its HDFS (Hadoop Distributed File System)
for storage and MapReduce for processing, can be leveraged for effective log file analysis.
In this project, a sample server log file was uploaded into HDFS, and a custom-built MapReduce
job was executed to analyze the occurrence and frequency of different log levels — including
INFO, DEBUG, ERROR, WARN, and TRACE. By breaking the problem into smaller sub-tasks
(map) and then aggregating the results (reduce), Hadoop allows the analysis to be carried out
swiftly and accurately, even as the data size scales up.
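As a purely illustrative example (the log lines here are hypothetical, not taken from the
project's sample.log): if the input contained the three lines "INFO User logged in",
"ERROR Disk quota exceeded", and "INFO Job finished", the map phase would emit the pairs
(INFO, 1), (ERROR, 1), and (INFO, 1); after the shuffle groups values by key, the reduce
phase would receive INFO -> [1, 1] and ERROR -> [1], and would output the final counts
INFO 2 and ERROR 1.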
The goal of this log analysis is not only to count the number of log entries per severity level but
also to help in understanding underlying patterns, identifying potential failure points, and
improving the system's reliability and performance. Insights derived from this process can be
vital for proactive monitoring, troubleshooting, and optimizing system operations.
Thus, through this project, we demonstrate the practical application of Big Data tools in real-
world scenarios, highlighting how they transform complex, heavy-lifting tasks into manageable,
automated, and insightful processes.
The field of large-scale log analysis has experienced significant growth, driven by the increasing
volume of machine-generated data and the advancements in Big Data technologies. Researchers have
explored various techniques and platforms for processing and analyzing log data at scale, with a
particular focus on the challenges and opportunities presented by distributed frameworks like Hadoop.
Srinivas et al. (2017) highlighted how analyzing cloud-generated log files requires scalable, fault-
tolerant systems. They explored Hadoop’s role in managing and processing logs generated from
thousands of cloud-based services. Hadoop’s batch-processing nature allowed efficient analysis of
huge logs to detect failures and anomalies. The research showed that traditional logging tools were
insufficient in cloud environments. Therefore, Hadoop's adaptability to cloud-scale data streams
was highly emphasized.
Zaharia et al. (2010) introduced optimizations like speculative execution and resource-aware
scheduling to enhance MapReduce performance, directly impacting the efficiency of log analysis
workflows by reducing job completion times and resource wastage.
The methodology adopted for the Log File Analysis project is a structured and systematic
approach using the Hadoop ecosystem, specifically the MapReduce programming model. The
project focuses on efficiently processing large volumes of log data to extract meaningful insights
about system behavior, error patterns, and information flows. The complete process can be
divided into the following major phases:
1. Environment Setup:
o Hadoop 3.x was installed on a MacBook Air device, ensuring compatibility with the local
macOS environment.
2. Sample Log File Preparation:
o A sample server log file (sample.log) was prepared for the experiment. Care was taken to
format the log entries in a manner resembling production server logs, with timestamps and
log levels for realism.
3. Uploading the Log File into HDFS:
o This log file was then uploaded into the Hadoop Distributed File System using the
command: hdfs dfs -put /Users/vinaysonaganti/hadoop/sample.log /log_files/sample.log
o The file was verified in HDFS to ensure that it was correctly stored and accessible
for further processing.
4. MapReduce Job Development and Execution:
o LogMapper.java: The Mapper reads each line of the log file, extracts the log level (the
first token of the line), and emits the log level together with a count of 1.
o LogReducer.java: The Reducer receives grouped keys (log levels) and their lists of
values (all 1s). It sums up the values for each log level to get the total count and then
outputs each log level alongside its occurrence count.
o The JAR file was executed using the Hadoop command-line interface: hadoop jar
log_analysis.jar LogDriver /log_files/sample.log /output
o During job execution, Hadoop split the input data into chunks (InputSplits). Each
split was processed independently by Mapper tasks, and the intermediate outputs were
shuffled and sorted before being handed over to the Reducer tasks.
5. Output Validation and Result Collection:
o After successful job execution, the system generated an output folder /output in
HDFS containing two files: _SUCCESS and part-r-00000. The _SUCCESS file
indicated that the job completed without any errors, while the part-r-00000 file
contained the final analyzed data.
o Using the command hdfs dfs -cat /output/part-r-00000, the output was displayed, showing
the count of each log level: DEBUG 434, ERROR 6, INFO 96, TRACE 816, WARN 11.
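As a side note, the same result check can also be performed programmatically through the
HDFS Java API. The short sketch below is illustrative only and not part of the submitted
code; it assumes the cluster configuration files are on the classpath and reads the
/output/part-r-00000 file produced by the job, which is equivalent to the hdfs dfs -cat
command above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReader {
    public static void main(String[] args) throws Exception {
        // Connect to the HDFS instance configured in core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Path assumed from the job run above; adjust if the output directory differs
        Path resultFile = new Path("/output/part-r-00000");

        // Stream the reducer output and print each "LEVEL <tab> count" line
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(resultFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}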
HDFS Setup
o Format the HDFS file system:
o hdfs namenode -format
1. LogMapper Class:
o The LogMapper class is responsible for processing each line of the log file,
extracting the log level (e.g., INFO, ERROR), and emitting a key-value pair with
the log level and a count of 1.
o Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text logLevel = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");  // Assuming the log level is the first word
        if (parts.length > 0) {
            logLevel.set(parts[0]);        // Extract the log level (e.g., INFO, DEBUG)
            context.write(logLevel, one);  // Emit the log level with a count of 1
        }
    }
}
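o For a hypothetical input line such as INFO User login successful, the mapper above would
emit the pair (INFO, 1). Note the assumption stated in the code comment: the log level must
be the first whitespace-separated token of each line, so logs in a different layout (for
example, timestamp first) would need the parsing logic adjusted.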
2. LogReducer Class:
The LogReducer class aggregates the counts of each log level and outputs the final sum for
each log level.
o Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // Accumulate the 1s emitted by the mapper
        }
        result.set(sum);
        context.write(key, result);     // Emit the total count for this log level
    }
}
3. LogDriver Class:
o The LogDriver class is the entry point for the MapReduce job. It sets up the job
configuration, specifies the Mapper and Reducer classes, and defines the
input/output paths.
o Code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Log Level Analysis");

        // Set the job's Jar
        job.setJarByClass(LogDriver.class);

        // Set the Mapper and Reducer classes
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input path (HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output path (HDFS)

        // Exit with the status of the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
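o Compiling and Packaging (illustrative): The exact build commands are not listed in this
report; one typical sequence on a single-node setup, assuming the three source files are in
the current directory, is:
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes LogMapper.java LogReducer.java LogDriver.java
jar -cvf log_analysis.jar -C classes .
This produces the log_analysis.jar used by the hadoop jar command described below.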
o Explanation: The hadoop jar command runs the job with the LogDriver class. It
processes the log file located at /log_files/sample.log and stores the results in
/output in HDFS.
o Viewing the Output: Once the job finishes, you can check the output in the
specified HDFS output directory. The results will contain the counts of each log
level (e.g., INFO, DEBUG, ERROR).
o hdfs dfs -ls /output # List the files in the output directory
o hdfs dfs -cat /output/part-r-00000 # Display the contents of the output file
o Example Output:
o DEBUG 434
o ERROR 6
o INFO 96
o TRACE 816
o WARN 11
o This section comprehensively explains the steps for setting up Hadoop, writing the
MapReduce code, compiling it, and running it on Hadoop, followed by retrieving and
analyzing the output.
Results:
o The output should list the log file sample.log, confirming its presence in HDFS.
o The content of the file stored in HDFS consists of log data in a raw format, where
each line represents a log entry. These lines contain log levels such as DEBUG,
ERROR, INFO, TRACE, and WARN.
Discussion:
o HDFS is used for storing the log files in a distributed manner. This provides fault
tolerance and scalability, especially for large datasets, because Hadoop replicates the
data across multiple nodes in the cluster.
o The sample.log file is directly ingested into HDFS, making it easily accessible for
further processing by the MapReduce job.
o MapReduce Job Execution: The LogMapper class processes each line of the input
log file and maps each log entry to a specific log level (DEBUG, ERROR, INFO,
etc.). The LogReducer aggregates the counts of each log level.
o After executing the job, the results are stored in the output directory (/output/).
The output file (part-r-00000) contains the final log level counts, with each line
showing the log level and its respective count. For example:
o DEBUG 434
o ERROR 6
o INFO 96
o TRACE 816
o WARN 11
o Discussion:
o The LogMapper reads each line of the log file and extracts the log level from the
beginning of each log entry. This key (log level) and a value of 1 (to count
occurrences) are emitted by the mapper.
o The LogReducer aggregates the emitted key-value pairs, counting the number of
occurrences for each log level. These counts are written to the output file in HDFS.
o The results help analyze the distribution of log levels in the system, giving insight
into which log levels are most frequently used. This can be useful for identifying
areas that need optimization (e.g., reducing verbose logging like DEBUG or
TRACE).
o Results:
o The MapReduce job ran with the sample.log file, and the job completed in a
reasonable amount of time given the size of the file. The time taken for the job to
complete depends on several factors such as input size, available resources, and
Hadoop cluster configuration.
o Discussion:
o For small datasets like the sample.log file, the job execution time is minimal.
However, for larger datasets, the performance of the job can be improved by
adjusting the number of reducers or tuning Hadoop configurations.
o Scalability: The job scales horizontally with the addition of more nodes to the
Hadoop cluster. This makes it possible to process larger log files without
significant performance degradation, as Hadoop distributes the work across the
cluster.
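o One further tuning option worth noting (not part of the configuration used in this
project, shown only as a sketch): because summing integer counts is commutative and
associative, the existing LogReducer can also be registered as a combiner so that partial
counts are aggregated on the map side, shrinking the data shuffled across the network. In
LogDriver.main this would amount to one extra line next to the existing class registrations:
job.setMapperClass(LogMapper.class);
job.setCombinerClass(LogReducer.class);  // map-side partial aggregation (optional)
job.setReducerClass(LogReducer.class);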
While the project provides a strong foundation for analyzing log data, there are several areas for
improvement and expansion:
1. Log Data Enrichment and Processing: The current system focuses only on
counting log levels. Future work could include enriching log data with additional context,
such as the source of the logs (e.g., server, application) or adding metadata like timestamps
and error severity. This would allow for more detailed analysis, such as identifying trends
over time or correlating log entries with specific system events.
2. Real-time Log Processing: For true real-time log analysis, integrating a stream
processing engine like Apache Kafka or Apache Flink could help process log data as it
is generated. This would enable immediate identification of critical errors, making the
system more responsive in detecting and addressing issues as they occur (a rough sketch
of such a streaming consumer is given after this list).
3. Advanced Log Pattern Recognition: The current analysis focuses on basic log level
counts. Future work could explore more advanced techniques such as log pattern
recognition using machine learning models. This would allow the system to detect
abnormal log patterns that could indicate potential issues like security breaches,
performance degradation, or system failures.
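As a rough illustration of the real-time direction mentioned in point 2 above (this is not
part of the submitted project), a plain Kafka consumer could maintain running log level
counts as lines arrive; the broker address, topic name (server-logs), and group id below
are hypothetical placeholders, and the parsing rule mirrors the batch LogMapper.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamingLogLevelCounter {
    public static void main(String[] args) {
        // Hypothetical connection settings; adjust for a real deployment
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "log-level-counter");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> counts = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("server-logs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Same parsing rule as the batch LogMapper: the level is the first token
                    String level = record.value().split(" ")[0];
                    counts.merge(level, 1L, Long::sum);
                }
                System.out.println(counts);  // Running totals, updated as logs stream in
            }
        }
    }
}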
1. Ranger, D., & Lee, Y. (2018). "Log Analytics with Hadoop: An Overview of Data
Processing and Visualization." Journal of Big Data, 5(1), 45-52.
2. Zikria, Y. B., & Lee, H. (2018). "Big Data Processing Using Hadoop: A Review of
Technologies." International Journal of Computer Science and Information Security,
16(4), 47-52.
3. Jindal, A., & Kumar, R. (2019). "Log Data Processing with Apache Hadoop: A
Review of Frameworks and Tools." International Journal of Data Science and Big Data
Analytics, 4(3), 134-142.
4. Apache Hadoop Documentation. (2023). "Hadoop Overview."
https://hadoop.apache.org/docs/
5. Apache Hive Documentation. (2023). "Hive Overview."
https://cwiki.apache.org/confluence/display/Hive/
6. Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
7. De Moura, L., & Chowdhury, D. (2017). "Log File Analysis in Big Data Systems."
International Journal of Computer Applications, 56(2), 58-65.
8. Raj, M., & Kumar, A. (2020). "Log Analytics: A Big Data Approach for Security and
Monitoring." International Journal of Data Science, 9(1), 10-20.
9. Dean, J., & Ghemawat, S. (2008). "MapReduce: Simplified Data Processing on Large
Clusters." Communications of the ACM, 51(1), 107-113.
10. Bharadwaj, M., & Kumar, R. (2019). "Analyzing Log Data Using Apache Hive and
Hadoop." Journal of Big Data Research, 6(3), 200-208.