BDA Report
BACHELOR OF TECHNOLOGY
IN
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
1 23885A7301 A. Akshaya
2 23885A7302 A. Yashwanth
3 23885A7303 Ch. Shamala Divya
4 23885A7304 G. Manish Kumar
5 23885A7305 N. Naveen
6 23885A7306 P. Abhinay
7 23885A7307 S. Vinay
CERTIFICATE
This is to certify that the BDA Course End Project report entitled “Big Data Pipeline for Log
File Analysis”, done by A. Akshaya (23885A7301), A. Yashwanth (23885A7302),
Ch. Shamala Divya (23885A7303), G. Manish Kumar (23885A7304), N. Naveen (23885A7305),
P. Abhinay (23885A7306), and S. Vinay (23885A7307), submitted to the Department of Artificial
Intelligence & Machine Learning, VARDHAMAN COLLEGE OF ENGINEERING, in
partial fulfilment of the requirements for the Degree of BACHELOR OF TECHNOLOGY in
Artificial Intelligence & Machine Learning, during the year 2024-25. It is certified that they
have completed the project satisfactorily.
We hereby declare that the work described in this BDA Course End Project report,
entitled “Big Data Analytics Course End Project”, which is being submitted by us in partial
fulfilment of the requirements for the award of BACHELOR OF TECHNOLOGY in the Department of
Artificial Intelligence & Machine Learning, Vardhaman College of Engineering, affiliated
to the Jawaharlal Nehru Technological University Hyderabad, is our own.
The work is original and has not been submitted for any Degree or Diploma of this or any other
university.
A. Akshaya (23885A7301)
A. Yashwanth (23885A7302)
Ch. Shamala Divya (23885A7303)
G. Manish Kumar (23885A7304)
N. Naveen (23885A7305)
P. Abhinay (23885A7306)
S. Vinay (23885A7307)
In this project, we developed a log level analysis system using Hadoop MapReduce to
process and categorize large log files. The primary goal was to analyze logs and identify the
distribution of different log levels (DEBUG, INFO, ERROR, WARN, TRACE). Log files are
critical for debugging, system monitoring, and performance optimization, as they provide
vital insights into system behavior and errors. However, handling large volumes of log data
manually is impractical, which is where Hadoop’s distributed computing model comes in.

Using Hadoop MapReduce, the project processes log files stored on HDFS (Hadoop
Distributed File System). The Mapper class reads each log line, extracts the log level, and
outputs it along with a count of 1. The Reducer aggregates these counts and outputs the total
count for each log level. This system allows for efficient processing of large datasets in
parallel across a cluster, making it scalable for use in production environments.

The result of the analysis is a set of log level counts, which can be useful for identifying
system trends, frequent errors, or areas where debugging efforts should be focused. The
project showcases how big data tools like Hadoop can be leveraged for log analysis and other
real-world use cases that involve large-scale data processing.
Keywords: Hadoop, HDFS, MapReduce, Log Analysis, Log Parsing, Data Aggregation, Distributed
Computing, Fault Tolerance, System Monitoring.
To address these challenges, Big Data technologies like Apache Hadoop have emerged as game-
changers. Hadoop offers a distributed computing framework capable of handling vast datasets
across clusters of machines, providing scalability, fault tolerance, and high-speed processing.
This project explores how Hadoop, specifically using its HDFS (Hadoop Distributed File System)
for storage and MapReduce for processing, can be leveraged for effective log file analysis.
In this project, a sample server log file was uploaded into HDFS, and a custom-built MapReduce
job was executed to analyze the occurrence and frequency of different log levels — including
INFO, DEBUG, ERROR, WARN, and TRACE. By breaking the problem into smaller sub-tasks
(map) and then aggregating the results (reduce), Hadoop allows the analysis to be carried out
swiftly and accurately, even as the data size scales up.
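As a purely illustrative example (the log lines here are hypothetical, not taken from the
project's sample.log): if the input contained the three lines "INFO User logged in",
"ERROR Disk quota exceeded", and "INFO Job finished", the map phase would emit the pairs
(INFO, 1), (ERROR, 1), and (INFO, 1); after the shuffle groups values by key, the reduce
phase would receive INFO -> [1, 1] and ERROR -> [1], and would output the final counts
INFO 2 and ERROR 1.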
The goal of this log analysis is not only to count the number of log entries per severity level but
also to help in understanding underlying patterns, identifying potential failure points, and
improving the system's reliability and performance. Insights derived from this process can be
vital for proactive monitoring, troubleshooting, and optimizing system operations.
Thus, through this project, we demonstrate the practical application of Big Data tools in real-
world scenarios, highlighting how they transform complex, heavy-lifting tasks into manageable,
automated, and insightful processes.
The field of large-scale log analysis has experienced significant growth, driven by the increasing
volume of machine-generated data and the advancements in Big Data technologies. Researchers have
explored various techniques and platforms for processing and analyzing log data at scale, with a
particular focus on the challenges and opportunities presented by distributed frameworks like Hadoop.
Srinivas et al. (2017) highlighted how analyzing cloud-generated log files requires scalable, fault-
tolerant systems. They explored Hadoop’s role in managing and processing logs generated from
thousands of cloud-based services. Hadoop’s batch-processing nature allowed efficient analysis of
huge logs to detect failures and anomalies. The research showed that traditional logging tools were
insufficient in cloud environments. Therefore, Hadoop's adaptability to cloud-scale data streams
was highly emphasized.
Zaharia et al. (2010) introduced optimizations like speculative execution and resource-aware
scheduling to enhance MapReduce performance, directly impacting the efficiency of log analysis
workflows by reducing job completion times and resource wastage.
The methodology adopted for the Log File Analysis project is a structured and systematic
approach using the Hadoop ecosystem, specifically the MapReduce programming model. The
project focuses on efficiently processing large volumes of log data to extract meaningful insights
about system behavior, error patterns, and information flows. The complete process can be
divided into the following major phases:
1. Environment Setup:
o Hadoop 3.x was installed on a MacBook Air device, ensuring compatibility with the local
macOS environment.
2. Sample Log File Preparation:
o A sample server log file (sample.log) was prepared for the experiment. Care was taken to
format the log entries in a manner resembling production server logs, with timestamps and
log levels for realism.
3. Uploading the Log File into HDFS:
o This log file was then uploaded into the Hadoop Distributed File System using the
command: hdfs dfs -put /Users/vinaysonaganti/hadoop/sample.log /log_files/sample.log
o The file was verified in HDFS to ensure that it was correctly stored and accessible
for further processing.
4. MapReduce Job Development and Execution:
o LogMapper.java: The Mapper reads each line of the log file, extracts the log level (the
first token of the line), and emits the log level together with a count of 1.
o LogReducer.java: The Reducer receives grouped keys (log levels) and their lists of
values (all 1s). It sums up the values for each log level to get the total count and then
outputs each log level alongside its occurrence count.
o The JAR file was executed using the Hadoop command-line interface: hadoop jar
log_analysis.jar LogDriver /log_files/sample.log /output
o During job execution, Hadoop split the input data into chunks (InputSplits). Each
split was processed independently by Mapper tasks, and the intermediate outputs were
shuffled and sorted before being handed over to the Reducer tasks.
5. Output Validation and Result Collection:
o After successful job execution, the system generated an output folder /output in
HDFS containing two files: _SUCCESS and part-r-00000. The _SUCCESS file
indicated that the job completed without any errors, while the part-r-00000 file
contained the final analyzed data.
o Using the command hdfs dfs -cat /output/part-r-00000, the output was displayed, showing
the count of each log level: DEBUG 434, ERROR 6, INFO 96, TRACE 816, WARN 11.
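As a side note, the same result check can also be performed programmatically through the
HDFS Java API. The short sketch below is illustrative only and not part of the submitted
code; it assumes the cluster configuration files are on the classpath and reads the
/output/part-r-00000 file produced by the job, which is equivalent to the hdfs dfs -cat
command above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReader {
    public static void main(String[] args) throws Exception {
        // Connect to the HDFS instance configured in core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Path assumed from the job run above; adjust if the output directory differs
        Path resultFile = new Path("/output/part-r-00000");

        // Stream the reducer output and print each "LEVEL <tab> count" line
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(resultFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}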
HDFS Setup
o Format the HDFS file system:
o hdfs namenode -format
1. LogMapper Class:
o The LogMapper class is responsible for processing each line of the log file,
extracting the log level (e.g., INFO, ERROR), and emitting a key-value pair with
the log level and a count of 1.
o Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text logLevel = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");  // Assuming the log level is the first word
        if (parts.length > 0) {
            logLevel.set(parts[0]);        // Extract the log level (e.g., INFO, DEBUG)
            context.write(logLevel, one);  // Emit the log level with a count of 1
        }
    }
}
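o For a hypothetical input line such as INFO User login successful, the mapper above would
emit the pair (INFO, 1). Note the assumption stated in the code comment: the log level must
be the first whitespace-separated token of each line, so logs in a different layout (for
example, timestamp first) would need the parsing logic adjusted.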
2. LogReducer Class:
The LogReducer class aggregates the counts of each log level and outputs the final sum for
each log level.
o Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // Accumulate the 1s emitted by the mapper
        }
        result.set(sum);
        context.write(key, result);     // Emit the total count for this log level
    }
}
3. LogDriver Class:
o The LogDriver class is the entry point for the MapReduce job. It sets up the job
configuration, specifies the Mapper and Reducer classes, and defines the
input/output paths.
o Code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Log Level Analysis");

        // Set the job's Jar
        job.setJarByClass(LogDriver.class);

        // Set the Mapper and Reducer classes
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input path (HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output path (HDFS)

        // Exit with the status of the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
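o Compiling and Packaging (illustrative): The exact build commands are not listed in this
report; one typical sequence on a single-node setup, assuming the three source files are in
the current directory, is:
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes LogMapper.java LogReducer.java LogDriver.java
jar -cvf log_analysis.jar -C classes .
This produces the log_analysis.jar used by the hadoop jar command described below.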
o Explanation: The hadoop jar command runs the job with the LogDriver class. It
processes the log file located at /log_files/sample.log and stores the results in
/output in HDFS.
o Viewing the Output: Once the job finishes, you can check the output in the
specified HDFS output directory. The results will contain the counts of each log
level (e.g., INFO, DEBUG, ERROR).
o hdfs dfs -ls /output # List the files in the output directory
o hdfs dfs -cat /output/part-r-00000 # Display the contents of the output file
o Example Output:
o DEBUG 434
o ERROR 6
o INFO 96
o TRACE 816
o WARN 11
o This section comprehensively explains the steps for setting up Hadoop, writing the
MapReduce code, compiling it, and running it on Hadoop, followed by retrieving and
analyzing the output.
Results:
o The output should list the log file sample.log, confirming its presence in HDFS.
o The content of the file stored in HDFS consists of log data in a raw format, where
each line represents a log entry. These lines contain log levels such as DEBUG,
ERROR, INFO, TRACE, and WARN.
Discussion:
o HDFS is used for storing the log files in a distributed manner. This provides fault
tolerance and scalability, especially for large datasets, because Hadoop replicates the
data across multiple nodes in the cluster.
o The sample.log file is directly ingested into HDFS, making it easily accessible for
further processing by the MapReduce job.
o MapReduce Job Execution: The LogMapper class processes each line of the input
log file and maps each log entry to a specific log level (DEBUG, ERROR, INFO,
etc.). The LogReducer aggregates the counts of each log level.
o After executing the job, the results are stored in the output directory (/output/).
The output file (part-r-00000) contains the final log level counts, with each line
showing the log level and its respective count. For example:
o DEBUG 434
o ERROR 6
o INFO 96
o TRACE 816
o WARN 11
o Discussion:
o The LogMapper reads each line of the log file and extracts the log level from the
beginning of each log entry. This key (log level) and a value of 1 (to count
occurrences) are emitted by the mapper.
o The LogReducer aggregates the emitted key-value pairs, counting the number of
occurrences for each log level. These counts are written to the output file in HDFS.
o The results help analyze the distribution of log levels in the system, giving insight
into which log levels are most frequently used. This can be useful for identifying
areas that need optimization (e.g., reducing verbose logging like DEBUG or
TRACE).
o Results:
o The MapReduce job ran with the sample.log file, and the job completed in a
reasonable amount of time given the size of the file. The time taken for the job to
complete depends on several factors such as input size, available resources, and
Hadoop cluster configuration.
o Discussion:
o For small datasets like the sample.log file, the job execution time is minimal.
However, for larger datasets, the performance of the job can be improved by
adjusting the number of reducers or tuning Hadoop configurations.
o Scalability: The job scales horizontally with the addition of more nodes to the
Hadoop cluster. This makes it possible to process larger log files without
significant performance degradation, as Hadoop distributes the work across the
cluster.
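o One further tuning option worth noting (not part of the configuration used in this
project, shown only as a sketch): because summing integer counts is commutative and
associative, the existing LogReducer can also be registered as a combiner so that partial
counts are aggregated on the map side, shrinking the data shuffled across the network. In
LogDriver.main this would amount to one extra line next to the existing class registrations:
job.setMapperClass(LogMapper.class);
job.setCombinerClass(LogReducer.class);  // map-side partial aggregation (optional)
job.setReducerClass(LogReducer.class);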
While the project provides a strong foundation for analyzing log data, there are several areas for
improvement and expansion:
1. Log Data Enrichment and Processing: The current system focuses only on
counting log levels. Future work could include enriching log data with additional context,
such as the source of the logs (e.g., server, application) or adding metadata like timestamps
and error severity. This would allow for more detailed analysis, such as identifying trends
over time or correlating log entries with specific system events.
2. Real-time Log Processing: For true real-time log analysis, integrating a stream
processing engine like Apache Kafka or Apache Flink could help process log data as it
is generated. This would enable immediate identification of critical errors, making the
system more responsive in detecting and addressing issues as they occur (a rough sketch
of such a streaming consumer is given after this list).
3. Advanced Log Pattern Recognition: The current analysis focuses on basic log level
counts. Future work could explore more advanced techniques such as log pattern
recognition using machine learning models. This would allow the system to detect
abnormal log patterns that could indicate potential issues like security breaches,
performance degradation, or system failures.
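As a rough illustration of the real-time direction mentioned in point 2 above (this is not
part of the submitted project), a plain Kafka consumer could maintain running log level
counts as lines arrive; the broker address, topic name (server-logs), and group id below
are hypothetical placeholders, and the parsing rule mirrors the batch LogMapper.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamingLogLevelCounter {
    public static void main(String[] args) {
        // Hypothetical connection settings; adjust for a real deployment
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "log-level-counter");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> counts = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("server-logs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Same parsing rule as the batch LogMapper: the level is the first token
                    String level = record.value().split(" ")[0];
                    counts.merge(level, 1L, Long::sum);
                }
                System.out.println(counts);  // Running totals, updated as logs stream in
            }
        }
    }
}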
1. Ranger, D., & Lee, Y. (2018). "Log Analytics with Hadoop: An Overview of Data
Processing and Visualization." Journal of Big Data, 5(1), 45-52.
2. Zikria, Y. B., & Lee, H. (2018). "Big Data Processing Using Hadoop: A Review of
Technologies." International Journal of Computer Science and Information Security,
16(4), 47-52.
3. Jindal, A., & Kumar, R. (2019). "Log Data Processing with Apache Hadoop: A
Review of Frameworks and Tools." International Journal of Data Science and Big Data
Analytics, 4(3), 134-142.
4. Apache Hadoop Documentation. (2023). "Hadoop Overview."
https://hadoop.apache.org/docs/
5. Apache Hive Documentation. (2023). "Hive Overview."
https://cwiki.apache.org/confluence/display/Hive/
6. Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
7. De Moura, L., & Chowdhury, D. (2017). "Log File Analysis in Big Data Systems."
International Journal of Computer Applications, 56(2), 58-65.
8. Raj, M., & Kumar, A. (2020). "Log Analytics: A Big Data Approach for Security and
Monitoring." International Journal of Data Science, 9(1), 10-20.
9. Dean, J., & Ghemawat, S. (2008). "MapReduce: Simplified Data Processing on Large
Clusters." Communications of the ACM, 51(1), 107-113.
10. Bharadwaj, M., & Kumar, R. (2019). "Analyzing Log Data Using Apache Hive and
Hadoop." Journal of Big Data Research, 6(3), 200-208.