Ibda Course File

The document outlines the course structure for 'Introduction to Big Data Analytics' (Course Code: 2012PE05) for B.Tech students in the Department of Information Technology for the academic year 2024-2025. It includes course objectives, outcomes, specific outcomes, and a detailed syllabus covering topics such as Hadoop, MapReduce, and data visualization tools. The document also emphasizes the department's vision and mission to empower students, particularly women, in the field of Information Technology.

DEPARTMENT OF INFORMATION TECHNOLOGY

COURSE FILE
ACADEMIC YEAR: 2024-2025

Course Title INTRODUCTION TO BIG DATA ANALYTICS

Course Code 2012PE05

Program B.Tech.

Year & Semester IV Year II Sem

Course Type PROFESSIONAL ELECTIVE- 1

Regulation R20

Course Structure
Theory: Lecture 3, Tutorials 0, Credits 3
Practical: Laboratory 0, Credits 3

Course Coordinator Mr. SUVARNA SUNIL KUMAR

Faculty In-Charge HOD


DEPARTMENT OF INFORMATION TECHNOLOGY

INSTITUTE VISION

• Visualizing a great future for the intelligentsia by imparting state-of-the-art
technologies in the field of Engineering and Technology for the bright future and
prosperity of the students.

• To offer world-class training to the promising Engineers.


DEPARTMENT OF INFORMATION TECHNOLOGY

INSTITUTE MISSION

• To nurture a high level of Decency, Dignity and Discipline in women to attain high
intellectual abilities.
• To produce employable students at National and International levels through effective
training programmes.
• To create a pleasant academic environment for generating high-level learning attitudes.
DEPARTMENT OF INFORMATION TECHNOLOGY

DEPARTMENT VISION

To empower women in the field of Information Technology through quality education,
nurturing them into globally competent professionals with strong technical skills, ethical
values, and leadership qualities, ready to meet the challenges of the evolving IT industry.
DEPARTMENT OF INFORMATION TECHNOLOGY
DEPARTMENT MISSION

❖ M1: To offer a high-quality education that integrates cutting-edge technologies,
nurtures creativity and analytical skills, and shapes ethical, globally competitive
professionals.
❖ M2: To develop leadership qualities and enhance employability through hands-on
training, industry collaboration, and research in emerging technologies,
preparing women to address the dynamic challenges of the IT sector.
❖ M3: To impart technological education with a strong emphasis on dignity, decency
and discipline to develop professional engineers who are both technically
competent and socially responsible.
DEPARTMENT OF INFORMATION TECHNOLOGY
PROGRAMME EDUCATIONAL OBJECTIVES (PEO)

PEO 1: Technical Proficiency and Innovation. Graduates will develop a solid foundation in Information
Technology, employing modern tools and innovative methodologies to effectively solve industry
challenges.
PEO 2: Leadership and Professional Excellence. Graduates will demonstrate leadership abilities, effective
teamwork, and ethical practices, enabling them to achieve career success and contribute to the
global IT sector.
PEO 3: Lifelong Learning and Societal Impact. Graduates will engage in continuous learning, adapt to
technological advancements, and apply their skills to positively influence society, with a special
emphasis on empowering students in the field of Information Technology.
DEPARTMENT OF INFORMATION TECHNOLOGY
PROGRAMME SPECIFIC OUTCOMES (PSOs)

PSO 1: Problem Solving and Application Development. Graduates will be able to analyze real-world
problems and design efficient IT solutions, applying programming, database management, and
software development methodologies.
PSO 2: Modern Tool Usage and Technological Adaptability. Graduates will be proficient in using modern IT
tools and technologies, while continuously adapting to emerging trends to enhance system
development and deployment.
PSO 3: Professional Ethics and Societal Contributions. Graduates will uphold professional ethics,
effectively contribute to team-based projects, and apply IT solutions to address societal needs,
with a focus on women’s empowerment and community development.
DEPARTMENT OF INFORMATION TECHNOLOGY
SYLLABUS
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
B.Tech. IV Year II Sem    L T P C: 3 0 0 3

COURSE OBJECTIVES:
➢ Gain an understanding of what constitutes "Big Data" and the key characteristics
(e.g., volume, velocity, variety, and veracity).
➢ Learn about the challenges of working with big data and how these challenges
differ from traditional data analysis.
➢ Introduction to popular big data tools and platforms like Hadoop, Apache Spark,
and NoSQL databases.
➢ Learn how to set up, configure, and work with these technologies to process and
analyze big data.
➢ Learn various data analysis techniques including descriptive, diagnostic, predictive,
and prescriptive analytics.
➢ Explore how to visualize big data to derive actionable insights using tools like
Tableau, Power BI, or Python visualization libraries.

COURSE OUTCOMES:
➢ Demonstrate a clear understanding of the characteristics and challenges of big data
(volume, velocity, variety, and veracity).
➢ Identify the key differences between traditional data analysis and big data analytics.
➢ Gain hands-on experience with essential big data tools and technologies, such as
Hadoop, Apache Spark, NoSQL databases (e.g., MongoDB), and data storage
solutions (e.g., HDFS).
➢ Be able to configure and use big data tools to process large datasets efficiently.
➢ Effectively store, manage, and retrieve big data using appropriate data storage
systems and distributed computing frameworks.
➢ Apply data processing techniques such as MapReduce, Spark, and batch processing
to analyze large datasets.
UNIT-I INTRODUCTION TO BIG DATA Introduction to Big data: Overview,
Characteristics of Data, Evolution of Big Data, Definition of Big Data, Challenges
with Big Data. Big data analytics: Classification of Analytics, Importance and
challenges of big data, Terminologies, Data storage and analysis.

UNIT-II HADOOP TECHNOLOGY Introduction to Hadoop: A brief history of
Hadoop, Conventional approach versus Hadoop, Introduction to the Hadoop
Ecosystem, Processing data with Hadoop, Hadoop distributors, Use case,
Challenges in Hadoop.

UNIT-III HADOOP FILE SYSTEM Introduction to Hadoop distributed file
system (HDFS): Overview, Design of HDFS, Concepts, Basic File Systems vs.
Hadoop File Systems, Local File System, File-Based Data Structures, Sequential
File, Map File. The Java Interface: Library Classes, reading data from a Hadoop
URL, reading data using the file system API, storing data.

UNIT-IV FUNDAMENTALS OF MAPREDUCE Introduction to MapReduce:
Its framework, Features of MapReduce, Its working, Analyze MapReduce
functions, MapReduce techniques to optimize jobs, Uses, Controlling input formats
in MapReduce execution, Different phases in MapReduce, Applications.
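The phases this unit lists (map, shuffle/sort, reduce) can be sketched in plain Python as a single-process stand-in for a Hadoop word-count job; the function names and sample documents below are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group intermediate pairs by key (the word)
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in grouped}

docs = ["big data needs big tools", "data tools"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts: {"big": 2, "data": 2, "needs": 1, "tools": 2}
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle over the network; the control flow, however, is exactly this pipeline.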
UNIT-V BIG DATA PLATFORMS Sqoop, Cassandra, MongoDB, Hive, Pig,
Storm, Flink, Apache.

TEXT BOOK:
1. Seema Acharya, Subhashini Chellappan, “Big Data and
Analytics”, Wiley Publications, First Edition, 2015.
REFERENCE BOOKS:
1. Judith Hurwitz, Alan Nugent, Fern Halper, Marcia
Kaufman, “Big Data For Dummies”, John Wiley &
Sons, Inc., 2013.
2. Tom White, “Hadoop: The Definitive Guide”, O’Reilly
Publications, Fourth Edition, 2015.
3. Dirk Deroos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Rafael Coss,
DEPARTMENT OF INFORMATION TECHNOLOGY
ACADEMIC CALENDAR
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
COURSE OBJECTIVES
➢ Gain an understanding of what constitutes "Big Data" and the key characteristics
(e.g., volume, velocity, variety, and veracity).
➢ Learn about the challenges of working with big data and how these challenges
differ from traditional data analysis.
➢ Introduction to popular big data tools and platforms like Hadoop, Apache Spark,
and NoSQL databases.
➢ Learn how to set up, configure, and work with these technologies to process and
analyze big data.
➢ Learn various data analysis techniques including descriptive, diagnostic, predictive,
and prescriptive analytics.
➢ Explore how to visualize big data to derive actionable insights using tools like
Tableau, Power BI, or Python visualization libraries.
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
COURSE OUTCOMES

➢ Demonstrate a clear understanding of the characteristics and challenges of big data
(volume, velocity, variety, and veracity).
➢ Identify the key differences between traditional data analysis and big data analytics.
➢ Gain hands-on experience with essential big data tools and technologies, such as
Hadoop, Apache Spark, NoSQL databases (e.g., MongoDB), and data storage
solutions (e.g., HDFS).
➢ Be able to configure and use big data tools to process large datasets efficiently.
➢ Effectively store, manage, and retrieve big data using appropriate data storage
systems and distributed computing frameworks.
➢ Apply data processing techniques such as MapReduce, Spark, and batch processing
to analyze large datasets.
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
PROGRAMME SPECIFIC OUTCOMES (PSOs)

PSO 1: Students will gain hands-on experience with industry-standard tools and
technologies used to manage, process, and analyze large-scale data. They will
become proficient in using distributed computing frameworks like Apache Hadoop
and Apache Spark for big data processing.
PSO 2: Students will develop the ability to clean, preprocess, and transform raw data
into a usable format for analysis and machine learning. They will master data
cleaning techniques to handle missing values, remove duplicates, and deal with
outliers.
PSO 3: Students will gain expertise in exploring large datasets using statistical
methods to identify patterns, trends, and anomalies. They will also learn how to
present these insights using data visualization techniques.
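A minimal sketch of the cleaning steps named in PSO 2 (removing duplicates, imputing missing values, flagging outliers), assuming a simple list-of-dicts record layout; the field names and the median-absolute-deviation outlier rule are illustrative choices, not a prescribed pipeline:

```python
import statistics

def clean(records):
    # Remove exact duplicate records while preserving order
    seen, deduped = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(dict(r))
    # Impute missing 'value' fields with the mean of the observed values
    observed = [r["value"] for r in deduped if r["value"] is not None]
    mean = statistics.mean(observed)
    for r in deduped:
        if r["value"] is None:
            r["value"] = mean
    return deduped

def is_outlier(x, values, k=3.0):
    # Median-absolute-deviation rule: robust even when the
    # outlier itself is included in the reference values
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return abs(x - med) > k * mad

records = [{"id": 1, "value": 10.0},
           {"id": 1, "value": 10.0},   # duplicate, dropped
           {"id": 2, "value": None},   # missing, imputed with mean 12.0
           {"id": 3, "value": 14.0}]
cleaned = clean(records)
```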
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
COURSE OUTCOMES MAPPING WITH POs AND PSOs:

PROGRAM OUTCOMES (PO) AND PROGRAM SPECIFIC OUTCOMES (PSO)

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3

CO1 2 2 2 1 3 2 3

CO2 2 2 3 2 1 3 2 3

CO3 2 2 3 2 1 3 2 3

CO4 2 2 3 2 3 1 3 3 3

CO5 2 2 3 2 1 3 2 3 3

CO6 2 2 3 2 1 3 3

Avg. 2 2 3 3 2 3 1 3 2 3 3

3 – High
2 - Medium
1 - Low
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
APPLICATIONS OF EACH UNIT
UNIT-I
1. Healthcare & Medicine, 2. Finance & Banking, 3. E-commerce & Retail, 4. Social Media & Digital Marketing, 5.
Transportation & Traffic Management, 6. Education & Research, 7. Entertainment & Media, 8. Cybersecurity & Fraud Prevention,
9. Manufacturing & Industry.
UNIT-II
1. Healthcare & Life Sciences, 2. Finance & Banking, 3. E-commerce & Retail,4. Social Media & Digital Marketing, 5.
Government & Smart Cities , 6. Telecommunications, 7. Manufacturing & Industry 4.0 , 8. Cybersecurity & Fraud Detection. 9.
Media & Entertainment , 10. Education & Research.
UNIT-III
1. Big Data Storage & Processing, 2. Data Warehousing & ETL Pipelines, 3. Healthcare & Bioinformatics, 4. Finance &
Banking, 5. Social Media & Web Analytics, 6. Telecommunications & IoT, 7. E-commerce & Retail, 8. Cybersecurity &
Log Management, 9. Media & Entertainment, 10. Scientific Research & Smart Cities.

UNIT-IV
1. Data Processing & Analytics, 2. Search Engines (Google, Bing, Yahoo, etc.), 3. Social Media & Sentiment Analysis
4. E-commerce & Retail, 5. Healthcare & Bioinformatics , 6. Finance & Banking, 7. Telecommunications & IoT
8. Cybersecurity & Threat Detection, 9. Scientific Research & Climate Modeling, 10. Media & Entertainment

UNIT-V
1. Hadoop Distributed File System (HDFS)

📌 Application: Distributed storage of large datasets

🔹 Industries: Data warehousing, cloud storage, and archival systems

🔹 Examples:

• Storing and managing massive datasets in organizations like Facebook, Yahoo, and LinkedIn
• Handling petabytes of genomic data for medical research
• Storing and retrieving financial transactions for fraud analysis

2. MapReduce
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
APPLICATIONS OF EACH UNIT
Application: Distributed data processing using parallel computing
🔹 Industries: Search engines, finance, and big data analytics

🔹 Examples:
• Google's PageRank algorithm for web indexing
• Fraud detection in banking by analyzing transaction patterns
• Analyzing social media trends on platforms like Twitter and Facebook

3. Apache Hive
Application: Data warehousing and SQL-like querying on big data
🔹 Industries: Business intelligence, e-commerce, and financial analytics

🔹 Examples:
• Querying large-scale sales data in retail businesses like Amazon and Walmart
• Processing financial reports and stock market analysis
• Ad-hoc querying for insights in customer data
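The SQL-like ad-hoc querying Hive provides can be previewed with Python's built-in sqlite3 module standing in for a Hive warehouse; the table, columns, and data below are illustrative, and a production deployment would run HiveQL over tables stored in HDFS:

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a Hive warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 50.0)])

# Ad-hoc aggregation: the kind of workload Hive serves over big data
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
# rows: [("north", 170.0), ("south", 80.0)]
```

The point of Hive is that this same declarative query scales to petabytes, because it is compiled into distributed jobs rather than executed on one machine.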

4. Apache Pig

Application: High-level data processing for ETL (Extract, Transform, Load)

🔹 Industries: Telecommunications, social media analytics, and sensor data processing

🔹 Examples:
• Analyzing call detail records (CDR) in telecom companies
• Processing raw logs from web servers to analyze website traffic
• Transforming sensor data from IoT devices for predictive maintenance

5. Apache HBase

Application: Real-time NoSQL database for large-scale applications


🔹 Industries: Internet of Things (IoT), real-time analytics, and fraud detection

🔹 Examples:
• Storing and retrieving data from billions of social media posts
• Managing real-time sensor data in IoT applications
• Processing real-time customer transactions in e-commerce
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
APPLICATIONS OF EACH UNIT
6. Apache Kafka

📌 Application: Real-time data streaming and messaging

🔹 Industries: Stock market analysis, IoT, and fraud detection

🔹 Examples:

• Streaming stock market data for algorithmic trading


• Processing IoT sensor data in real-time for predictive maintenance
• Detecting fraudulent transactions by analyzing live banking data
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
PROJECT RELATED TO SUBJECT
1. Sentiment Analysis on Social Media Data

📌 Objective: Analyze social media posts (e.g., tweets, Facebook comments) to determine public sentiment on a
specific topic (e.g., a political event, a product launch).
🔹 Data Source: Twitter API, Kaggle sentiment datasets
🔹 Technologies: Hadoop, Apache Spark, Python (NLTK, TextBlob), Apache Hive
🔹 Key Features:

• Collect and preprocess social media data


• Apply Natural Language Processing (NLP) for sentiment detection
• Classify sentiments as Positive, Negative, or Neutral
• Visualize results using dashboards (Tableau, Power BI)
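The classification step of this project can be sketched with a tiny lexicon-based scorer; a real implementation would use NLTK's VADER or TextBlob as the project lists, and the word lists and post texts here are purely illustrative:

```python
# Tiny illustrative lexicon; a real project would use NLTK or TextBlob
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def classify(post):
    # Score = (# positive words) - (# negative words)
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"
```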

2. Customer Purchase Behavior Analytics in E-Commerce

📌 Objective: Analyze shopping trends and recommend products based on customer behavior.
🔹 Data Source: Amazon product review dataset (Kaggle), e-commerce transaction records
🔹 Technologies: Hadoop, Spark MLlib, Python, Apache Hive
🔹 Key Features:

• Process customer purchase data


• Use recommendation algorithms (Collaborative Filtering, Content-based Filtering)
• Predict best-selling products and optimize stock inventory
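The collaborative-filtering step above can be sketched as cosine similarity over user rating vectors; in practice Spark MLlib would be used at scale, and the ratings matrix and item names below are illustrative assumptions:

```python
import math

def cosine(u, v):
    # Cosine similarity between two rating vectors (0 means unrated)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, others, items):
    # Score each item the target user has not rated by the
    # similarity-weighted ratings of the other users
    scores = {}
    for i, item in enumerate(items):
        if target[i] == 0:
            scores[item] = sum(cosine(target, o) * o[i] for o in others)
    return max(scores, key=scores.get)

items = ["A", "B", "C"]
target = [5, 0, 0]                     # rated A highly, B and C unrated
others = [[5, 4, 1], [4, 5, 1], [1, 0, 5]]
best = recommend(target, others, items)
# Users similar to the target liked B, so B is recommended
```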

3. Real-Time Fraud Detection in Banking Transactions

📌 Objective: Detect fraudulent financial transactions using machine learning and big data analytics.
🔹 Data Source: Bank transaction datasets (Kaggle), financial transaction logs
🔹 Technologies: Hadoop, Apache Spark Streaming, Kafka, Python (Scikit-learn), HBase
🔹 Key Features:

• Analyze customer transaction patterns


• Detect anomalies using ML models
• Generate real-time alerts for suspicious transactions
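The anomaly-detection step can be sketched with a simple z-score check; production systems would use ML models over streaming features as the project describes, and the threshold and transaction amounts here are illustrative:

```python
import statistics

def flag_anomalies(amounts, threshold=3.0):
    # Flag transactions whose z-score exceeds the threshold
    mean = statistics.mean(amounts)
    sd = statistics.pstdev(amounts)
    return [a for a in amounts if sd and abs(a - mean) / sd > threshold]

# Twenty routine transactions around 50, plus one suspicious 5000
amounts = [48, 52, 50, 49, 51] * 4 + [5000]
suspicious = flag_anomalies(amounts)
# suspicious: [5000]
```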
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
PROJECT RELATED TO SUBJECT
4. Healthcare Data Analytics for Disease Prediction
📌 Objective: Predict diseases (e.g., diabetes, heart disease) based on patient records.
🔹 Data Source: Public healthcare datasets (WHO, CDC, Kaggle)
🔹 Technologies: Hadoop, Spark, Python (Scikit-learn, TensorFlow), Apache Hive
🔹 Key Features:

• Process and analyze large-scale patient records


• Identify patterns in medical conditions
• Use ML models for disease prediction

5. Traffic Flow Prediction and Congestion Analysis

📌 Objective: Predict and analyze traffic congestion using GPS and real-time data.
🔹 Data Source: Google Maps API, open traffic datasets
🔹 Technologies: Hadoop, Spark, Kafka, Python, NoSQL (MongoDB)
🔹 Key Features:

• Collect real-time traffic data


• Analyze congestion patterns and peak traffic hours
• Suggest alternative routes using predictive models

6. Crime Rate Prediction Using Big Data

📌 Objective: Predict crime hotspots based on historical crime data and external factors.
🔹 Data Source: FBI crime datasets, Kaggle datasets, open government data
🔹 Technologies: Hadoop, Spark MLlib, Python, Apache Hive
🔹 Key Features:

• Analyze past crime records based on location and time


• Identify high-crime areas
• Use ML models to predict future crime trends
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
PROJECT RELATED TO SUBJECT
7. Energy Consumption Analytics Using IoT Data

📌 Objective: Analyze smart meter data to predict and optimize energy consumption.
🔹 Data Source: IoT-based smart meter datasets
🔹 Technologies: Hadoop, Spark Streaming, Kafka, NoSQL (MongoDB), Python
🔹 Key Features:

• Collect real-time energy consumption data


• Identify peak usage times
• Provide recommendations for energy savings
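Identifying peak usage times, as listed above, reduces to aggregating meter readings by hour; a sketch with illustrative (hour, kWh) pairs standing in for a real smart-meter stream:

```python
from collections import defaultdict

def peak_hours(readings, top=2):
    # readings: iterable of (hour, kWh) pairs from smart meters
    totals = defaultdict(float)
    for hour, kwh in readings:
        totals[hour] += kwh
    # Hours ranked by total consumption, highest first
    return sorted(totals, key=totals.get, reverse=True)[:top]

readings = [(8, 1.2), (9, 3.5), (9, 2.0), (18, 4.0), (18, 3.0), (23, 0.5)]
peaks = peak_hours(readings)
# peaks: [18, 9] — evening peak first, then morning
```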

8. Airline Customer Satisfaction Analytics

📌 Objective: Analyze customer feedback and complaints to improve airline services.


🔹 Data Source: Airline review datasets (Kaggle), customer feedback forms
🔹 Technologies: Hadoop, Apache Spark, Python (NLP), Apache Hive
🔹 Key Features:

• Process and analyze customer reviews


• Categorize sentiments into positive, neutral, or negative
• Suggest improvements for airline services

9. Climate Change and Weather Pattern Analysis

📌 Objective: Analyze global temperature and climate data to detect patterns.


🔹 Data Source: NASA, NOAA, Kaggle climate datasets
🔹 Technologies: Hadoop, Apache Spark, Python, Power BI
🔹 Key Features:

• Process large-scale climate datasets


• Identify trends in temperature rise and CO₂ emissions
• Predict future climate changes using ML models
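The trend-identification step can be sketched as an ordinary least-squares slope over yearly values (pure Python; the sample series is illustrative, not real climate data):

```python
def trend_slope(years, temps):
    # Ordinary least-squares slope: change per year
    n = len(years)
    mx = sum(years) / n
    my = sum(temps) / n
    num = sum((x - mx) * (y - my) for x, y in zip(years, temps))
    den = sum((x - mx) ** 2 for x in years)
    return num / den

years = [2000, 2001, 2002, 2003, 2004]
temps = [14.0, 14.1, 14.2, 14.3, 14.4]   # illustrative values
slope = trend_slope(years, temps)        # about 0.1 per year
```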

10. Customer Churn Prediction in Telecom Industry


📌 Objective: Predict customer churn based on call records and usage patterns.
🔹 Data Source: Telecom datasets (Kaggle), real-world CDR data
🔹 Technologies: Hadoop, Apache Spark, Python (Scikit-learn), Apache Pig
🔹 Key Features:
• Process customer call and usage data
• Identify factors leading to customer churn
• Use ML models to predict at-risk customers
DEPARTMENT OF INFORMATION TECHNOLOGY
INTRODUCTION TO BIGDATA ANALYTICS (2012PE05)
SUPPORTING DOCUMENTS PPTS
DEPARTMENT: IT | COURSE: IV B.TECH II SEM | SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
UNIT-I,II,III | R20 REGULATION | ACADEMIC YEAR: 2024-2025

ASSIGNMENT SHEET-I
SET-1
1. Could you elucidate the intricacies encompassed within the deployment journey of a
Big Data solution?

2. In the academic context, could you delve into the intricate reasoning that drives the adoption of
Hadoop as a pivotal component within the expansive domain of Big Data Analytics?

3. As an esteemed academician within the university, could you exemplify the practical application
of Big Data Analytics and elucidate its profound significance in contemporary domains?

4. As a distinguished academician within the university, could you undertake an in-depth
examination and deconstruction of the myriad methodologies employed in unlocking the latent
potential inherent in the realm of Big Data analytics?

5. As an esteemed scholar in the field, could you expound upon a plethora of fundamental
attributes intricately woven into the complex fabric of the Hadoop framework?

SET-2
1. As a scholar in the field, could you delve into the intricate details encapsulating the architectural
features of Hadoop?
2. As an academician within the scholarly realm, could you delineate the disparities between
RDBMS and Hadoop?

3. In the capacity of an academic professional, could you elaborate on how Big Data Analytics
exemplifies its pivotal role?

4. As an esteemed academic, could you elaborate on the intricate subject of Big Data Modelling?
5. Could you delve into the multifaceted intricacies surrounding Apache Spark from a scholarly
perspective?
SET-3

1. As an academician, could you delineate the distinctions between Hadoop and Spark?
2. Could you elaborate on the concepts of HDFS blocks and InputSplits?
3. Elucidate the intricate web of components and frameworks comprising the Hadoop Ecosystem.
4. Enumerate the constraints and drawbacks associated with Hadoop.
5. Could you delve into the intricacies surrounding the distributed cache paradigm within the domain of Big Data
Analytics?

SET-4

1. Could you elaborate on the diverse array of configuration files utilized within the Hadoop ecosystem?
2. As part of your academic assignment, please analyze and expound upon the multifaceted attributes
that epitomize Big Data Analytics
3. Elucidate the intricate components comprising Apache HBase's Region Server architecture,
highlighting their respective functionalities and interdependencies.
4. In the academic context, your task is to delve into the complexities surrounding the elucidation of the
"V's" that delineate the landscape of Big Data Analytics, exploring the nuances inherent in each
dimension: volume, velocity, variety, veracity, and value
5. In the academic domain, your assignment is to undertake a comprehensive exploration of the
multifaceted components comprising Apache HBase, delving into their intricate functionalities,
interconnections, and contributions to the overarching architecture of the system.

SET-5:

1. Within the academic realm, your scholarly task is to embark on a detailed exploration delineating the
intricacies surrounding Apache Spark, encompassing its multifaceted architecture, core functionalities,
and pivotal role within the domain of big data processing and analytics.
2. Underlying rationales driving the adoption of Hadoop in the domain of Big Data

Analytics?
3. Could you outline the methodologies for utilizing Big Data Analytics?
4. Enumerate the attributes of Apache Spark?
5. Could you elucidate the sequential processes entailed within the realm of Big Data Analytics?
SET-6:
1. Could you enumerate a selection of optimal methodologies adhered to in the domain of Big Data
Analytics?
2. Compose a comprehensive analysis detailing five distinctive features inherent to the practice of Big
Data Analytics, emphasizing their significance and implications within the field.
3. Could you explicate the concept of the various "V's" in the context of data analytics, elucidating their
respective roles and contributions to the overarching framework, with a focus on their implications for
data management and processing strategies?
4. Could you provide a detailed analysis of the intricate interplay and components comprising the
Hadoop Ecosystem, elucidating its structural framework, constituent technologies, and their respective
functionalities, with an emphasis on their collective impact on distributed data processing and analytics?
5. Could you provide a comprehensive analysis delineating the operational dynamics and significance of
distributed cache within the realm of Big Data Analytics, focusing on its role in enhancing data
processing efficiency, mitigating latency, and optimizing resource utilization, while also exploring its
integration within distributed computing frameworks?

FACULTY HOD
DEPARTMENT: IT | COURSE: IV B.TECH II SEM | SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
UNIT-I,II,III | R20 REGULATION | ACADEMIC YEAR: 2024-2025

ASSIGNMENT SHEET-II
SET-1
1. Provide an overview of Hadoop and discuss its methodology for managing extensive data processing tasks.
2. Explore the nuanced roles and significance of the Map phase in the context of the MapReduce framework,
emphasizing its primary functions and contributions to distributed data processing.
3. Identify and discuss two distinctive attributes inherent to the MapReduce framework, elucidating their
significance in distributed data processing environments.
4. Explore the fundamental operational characteristics of Sqoop, highlighting its role and impact in facilitating
data integration processes between relational databases and Hadoop ecosystems.
5. Investigate the core principles and structural foundations of the document-oriented data model employed by
MongoDB, emphasizing its architectural aspects and implications for database design and management
strategies.

SET-2
1. Identify and discuss a selection of prominent vendors offering Hadoop distributions within the current market
landscape, highlighting their key contributions and market positioning in the realm of big data technologies.
2. Analyze the intricate process by which MapReduce decomposes extensive computational tasks into
manageable, granular components, elucidating the mechanisms underlying the subdivision of labor within
distributed computing frameworks.
3. Conduct a comprehensive examination of the distinct characteristics and operational disparities between the
Map and Reduce functions within the MapReduce paradigm, delving into their respective roles, methodologies,
and contributions to distributed data processing workflows.
4. Investigate the operational mechanisms and strategic approaches employed by Sqoop to facilitate seamless
data interchange between Hadoop environments and relational databases, emphasizing the intricacies of data
migration and synchronization processes across disparate data storage systems.
5. Write a MongoDB query to retrieve documents that match specific criteria.
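For the MongoDB query question above, a find() call in pymongo/mongo-shell style looks like `db.customers.find({"age": {"$gt": 30}, "city": "Hyderabad"})`. The pure-Python stand-in below evaluates the same filter semantics against a list of dicts, so the matching behaviour can be seen without a running MongoDB server; the collection and field names are illustrative:

```python
def matches(doc, query):
    # Evaluate a MongoDB-style filter document against one record
    for field, cond in query.items():
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 30}
            for op, val in cond.items():
                if op == "$gt" and not doc.get(field, 0) > val:
                    return False
                if op == "$lt" and not doc.get(field, 0) < val:
                    return False
        elif doc.get(field) != cond:        # plain equality form
            return False
    return True

def find(collection, query):
    # Counterpart of collection.find(query): return all matching documents
    return [d for d in collection if matches(d, query)]

customers = [{"name": "Asha", "age": 34, "city": "Hyderabad"},
             {"name": "Ravi", "age": 28, "city": "Hyderabad"},
             {"name": "Meena", "age": 40, "city": "Chennai"}]
result = find(customers, {"age": {"$gt": 30}, "city": "Hyderabad"})
# result: only Asha matches both conditions
```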
SET-3

1. Examine the ways in which Hadoop distributors contribute to the optimization and enrichment of the Hadoop
ecosystem's capabilities, emphasizing their role in augmenting the platform's functionalities and expanding its
utility within diverse data processing environments.
2. Illustrate a hypothetical situation demonstrating the strategic advantage of implementing a Combiner
function in MapReduce, emphasizing its practical utility in optimizing data processing workflows within
distributed computing environments.
3. Present a case study showcasing a specific input format utilized in MapReduce, highlighting its functional
importance and operational impact on data processing efficiencies within distributed computing frameworks.
4. Outline a step-by-step procedure showcasing the integration of data from a MySQL database into Hadoop
through Sqoop, emphasizing practical methodologies and techniques employed in transferring data between
relational databases and distributed computing environments.
5. Develop a comprehensive set of instructions to establish a Hive table and populate it with data sourced
externally, focusing on the practical implementation steps involved in configuring data storage and retrieval
within a distributed computing environment.
SET-4
1. Evaluate the suitability of employing Hadoop for data processing tasks across diverse scenarios, considering
factors such as data volume, complexity, and performance requirements to formulate informed
recommendations regarding its applicability within specific contexts.
2. Investigate the fault tolerance mechanisms within MapReduce's job execution framework, considering its
distributed nature and complex interdependencies. Challenge students to critically evaluate the effectiveness of
these mechanisms through real-world case studies or simulations, culminating in proposals for advancing fault
tolerance in similar distributed computing environments.
3. Examine how different partitioning strategies affect MapReduce job optimization. Analyze their impact on
performance and resource utilization. Assign students to compare various partitioning approaches and explore
advanced techniques for further optimization.
4. Analyze Sqoop alongside other data transfer tools in the Hadoop ecosystem. Explore their similarities,
differences, strengths, and weaknesses. Assign students to conduct in-depth evaluations of each tool's features,
performance benchmarks, and compatibility with different data sources. Encourage them to present their
findings through comparative studies and propose recommendations for selecting the most suitable tool for
specific data transfer requirements.
5. Investigate how Pig streamlines data processing tasks within the Hadoop ecosystem. Explore its role in
abstracting complex operations and facilitating efficient data manipulation. Assign students to analyze real-
world use cases where Pig's scripting language simplifies data processing workflows, and challenge them to
propose advanced optimization techniques or integration strategies to enhance its functionality further.

SET-5:
1. Enumerate the potential obstacles encountered when handling varied datasets using Hadoop.
2. Explore a scenario where the implementation of a Hadoop distributor demonstrably enhanced the efficiency
of data processing.
3. Evaluate the scalability of MapReduce and its importance in big data processing.
4. Compare the pros and cons of employing MapReduce in data warehousing versus machine learning
applications.
5. List the primary characteristics of the Cassandra NoSQL database.
6. Detail the roles of spouts and bolts within the Storm architecture.
SET-6:
1. Demonstrate the file-based data structures in a sequential file system.
2. Outline the applications of Hive and how it helps.
3. Compare the Pig and Storm tools.
4. Extract the features and applications of Flink and Apache.
5. Summarize the benefits of Big Data resources in IT.
6. Describe the steps involved in support-vector-based inference methodology.
7. What are sampling and sampling distributions? Give a detailed analysis.
8. Define Arcing classifiers and Bagging predictors in detail.

FACULTY HOD
DEPARTMENT: IT | COURSE: IV B.TECH II SEM | SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS

UNIT-II | R20 REGULATION | 2024-2025

TUTORIAL SHEET-II
1. Compare and contrast the Conventional approach with Hadoop, emphasizing the
strengths and weaknesses of each approach in handling large-scale data processing
tasks.
2. Analyze the significance of various components within the Hadoop Ecosystem, such as
HDFS, MapReduce, and YARN, in addressing specific challenges associated with big
data processing.
3. Analyze the fundamental principles behind processing data with Hadoop, focusing on
the MapReduce programming model and how it facilitates parallel and distributed
computing.
4. Explore the role of Hadoop distributors in the ecosystem, discussing how different
distributors contribute to the adoption and implementation of Hadoop in various
enterprise environments.
5. Identify and analyze challenges associated with Hadoop implementation, addressing
issues such as data security, complexity in configuration, and optimizing performance in
different deployment scenarios.
6. Compare and contrast the different core methods of a Reducer.

7. Elucidate HDFS and YARN and extract their respective components.

8. Outline some of the main configuration files used in Hadoop
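The MapReduce questions above (notably question 3) can be grounded with a single-process Python sketch of the programming model. This is an illustrative simulation only, not the Hadoop Java API; all function names are our own choices:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle/sort: group all intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate (here, sum) the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["big"])  # "big" appears three times across the two lines
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the sequential sketch only shows the data flow between the three phases.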

Signature of Faculty Signature of HOD


SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
COURSE: IV B.TECH II SEM
DEPARTMENT: IT

UNIT – V R20 REGULATION 2024-2025

TUTORIAL SHEET-V
1. Analyze the challenges and benefits associated with using Sqoop for importing and
exporting data in a big data environment.
2. Compare Cassandra with traditional relational databases, highlighting scenarios where
Cassandra excels in terms of performance and data handling.
3. Formulate the use cases where MongoDB is particularly suitable, considering factors
such as flexibility, scalability, and ease of development.
4. Explore the purpose of Apache Pig in the context of big data processing. How does Pig
simplify the development of complex data processing tasks in comparison to raw
MapReduce?
5. Identify the significance of Apache projects in the big data ecosystem and their impact
on the evolution of data processing technologies.
6. Illustrate the commands you can use to start and stop all the Hadoop daemons at one time.
7. Elaborate on what makes the HDFS environment fault-tolerant.
8. Elucidate ZooKeeper and illustrate the benefits of using ZooKeeper.
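Question 6 asks for the daemon-control commands. As a hedged reminder, these are the standard helper scripts shipped in $HADOOP_HOME/sbin of a typical Hadoop installation (exact paths depend on the install):

```shell
# Start or stop every daemon (HDFS + YARN) at once — deprecated but still shipped:
start-all.sh
stop-all.sh

# The recommended equivalents start each layer separately:
start-dfs.sh && start-yarn.sh   # NameNode, DataNodes, ResourceManager, NodeManagers
stop-yarn.sh && stop-dfs.sh
```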

Signature of Faculty Signature of HOD


SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
COURSE: IV B.TECH II SEM
DEPARTMENT: IT

UNIT - I R20 REGULATION 2024-2025

TUTORIAL SHEET-I
1. Can you trace the evolution of big data and highlight the major milestones that have
contributed to its current significance in the realm of information technology?
2. Analyze the challenges associated with big data, considering factors such as volume,
velocity, variety, and complexity. How do these challenges impact data management
and analysis?
3. Classify the different types of analytics used in big data analytics, and provide insights
into how each type contributes to extracting meaningful insights from large datasets.
4. Explore the terminologies commonly associated with big data analytics, elucidating
their meanings and contextual relevance within the analytics process.
5. Examine the methods and techniques employed in big data storage and analysis, with a
focus on how these approaches facilitate efficient processing and extraction of valuable
information.
6. Elaborate on how Hadoop and Big Data are correlated.
7. Examine the data management tools used with Edge Nodes in Hadoop.

Signature of Faculty Signature of HOD


SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
COURSE: IV B.TECH II SEM
DEPARTMENT: IT

UNIT - IV R20 REGULATION 2024-2025

TUTORIAL SHEET-IV
1. Can you delineate the critical stages involved in the MapReduce process and highlight
their respective functions?
2. How can continued functionality be ensured, and what measures contribute to
maintaining high system reliability in the face of node failures?
3. Give the purpose of GroupingByKey in distributed data processing. How does it
enhance the efficiency of data manipulation?
4. Could you expound upon the role played by FileInputFormat in Hadoop's data
processing, emphasizing its significance in managing and processing large-scale
datasets?
5. Identify the control mechanism for executing MapReduce tasks using InputFormat in
Hadoop, and how does it influence the overall execution flow?
6. Justify what makes an HDFS environment fault-tolerant.
7. Factorize Hadoop YARN and describe its main components.

8. Elucidate the MapReduce architecture and its working in Hadoop.

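For question 3, the shuffle's group-by-key step can be sketched in pure Python. Hadoop sorts mapper output by key so that equal keys become adjacent and can be streamed to a single reducer call; the sort-then-group pattern below mirrors that behavior (the key/value data is invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as a mapper might emit them.
pairs = [("in", 1), ("hdfs", 1), ("in", 1), ("yarn", 1), ("in", 1)]

pairs.sort(key=itemgetter(0))                 # sort phase: equal keys adjacent
grouped = {key: [v for _, v in group]         # group phase: one list per key
           for key, group in groupby(pairs, key=itemgetter(0))}

print(grouped["in"])  # all three values for key "in" arrive together
```

The efficiency gain the question asks about comes from this locality: each reducer receives every value for its keys in one contiguous stream, so aggregation needs no random lookups.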

Signature of Faculty Signature of HOD


SUBJECT: INTRODUCTION TO BIG DATA ANALYTICS
COURSE: IV B.TECH II SEM
DEPARTMENT: IT

UNIT - III R20 REGULATION 2024-2025

TUTORIAL SHEET-III
1. Analyze key concepts associated with HDFS, such as block storage, fault tolerance, and
data replication. How do these concepts contribute to the reliability and scalability of
HDFS?
2. Discuss scenarios where traditional file systems might be more appropriate and
instances where Hadoop File Systems offer distinct advantages.
3. Analyze the role of the Local File System in Hadoop, highlighting its function and how
it interacts with Hadoop's distributed architecture.
4. Examine the Java Interface in Hadoop, focusing on key library classes and their
functions. How does the Java Interface facilitate interaction with HDFS for data
processing tasks?
5. Assess best practices for efficient data storage in HDFS, considering factors like data
compression, partitioning, and the impact on overall performance.
6. Name the most popular data management tools used with Edge Nodes in Hadoop.

7. Review the different file formats that can be used in Hadoop.

8. Compare and contrast Active and Passive Namenodes.
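For question 1, a toy Python model of block storage and replication shows why losing a single DataNode leaves every block readable. The block size, node names, and placement policy below are invented for illustration (real HDFS uses 128 MB blocks by default and a rack-aware placement policy):

```python
BLOCK_SIZE = 4            # bytes, for illustration only
REPLICATION = 3           # HDFS's default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    """Round-robin each block's replicas across distinct DataNodes."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
placement = place_blocks(blocks)

# Fault tolerance: even if "dn1" fails, every block keeps >= 2 live replicas,
# and the NameNode would re-replicate to restore the target factor.
assert all(len([n for n in nodes if n != "dn1"]) >= 2
           for nodes in placement.values())
```

This is the core of the reliability argument the question targets: replication trades storage for availability, and block-level granularity lets recovery proceed in parallel across the cluster.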

Signature of Faculty Signature of HOD
