
LePic
Big Data One Shot
Unit-01+02+03+04+05

Unit-01

Ques → Explain the 5 Vs of Big Data in detail. How do they define the scope and complexity of modern data systems?

Ans:-) 1. Volume
→ Refers to the massive amount of data generated every second from various sources such as social media, sensors, transactions, machines, etc.
→ Organizations deal with terabytes to petabytes of data that traditional systems cannot manage efficiently.
→ High volume requires scalable storage and distributed processing systems like Hadoop and Spark.

2. Velocity
→ Describes the speed at which data is generated, collected, and processed.
→ Real-time or near-real-time processing is crucial for timely decision-making (e.g., fraud detection, recommendation engines).
→ Technologies like Kafka, Flink, and Storm are used to handle high-velocity data streams.

3. Variety
→ Represents the different forms of data: structured (databases), semi-structured (XML, JSON), and unstructured (videos, images, audio).
→ Handling diverse data types increases the complexity of storage, integration, and analysis.
→ Requires flexible systems capable of handling all formats.

4. Veracity
→ Relates to the trustworthiness and accuracy of data.
→ Data may be inconsistent, incomplete, or noisy, affecting the quality of insights.
→ Data cleansing and validation mechanisms are essential to improve reliability.

5. Value
→ Denotes the usefulness of data in making informed decisions.
→ The main goal is to extract meaningful insights that add business value.
→ Techniques like data mining, analytics, and machine learning help unlock this value.

Together, the 5 Vs define the scope and complexity of modern data systems: Volume and Velocity determine the scale of storage and processing required, Variety and Veracity drive the effort needed for integration and data quality, and Value decides whether that effort is worthwhile.

Ques → What are the major types of digital data? Classify and explain with examples from real-world applications.

Ans:-) 1. Structured Data
→ Organized data that resides in fixed fields within records or files, typically in relational databases.
→ Easily searchable using SQL-based queries.
→ Examples:
– Bank transaction records (amount, date, account number)
– Employee details in an HR system (name, ID, department)
– Sensor data stored in rows and columns

2. Semi-Structured Data
→ Data that does not follow a strict tabular format but has a clear structure through tags or markers.
→ Can be stored in NoSQL databases and parsed using specific parsers.
→ Examples:
– JSON and XML files used in web APIs
– Email (with fields like To, From, Subject, Body)
– Logs from servers or IoT devices

3. Unstructured Data
→ Data without any predefined structure or format, making it difficult to process and analyze directly.
→ Requires advanced tools for storage, processing, and analysis (e.g., NLP, image recognition).
→ Examples:
– Social media posts (text, images, videos)
– Multimedia files (MP3, MP4, JPEG)
– Documents and PDFs containing mixed content

Ques → Compare and contrast conventional data systems with Big Data platforms. Why are traditional systems inadequate for today's data?

Ans:-)
→ Data Volume: Conventional systems handle GBs to TBs of data; Big Data platforms are capable of handling TBs to PBs and beyond.
→ Data Variety: Conventional systems handle primarily structured data; Big Data platforms handle structured, semi-structured, and unstructured data.
→ Data Velocity: Conventional systems process data in batches (slower); Big Data platforms support real-time and streaming data processing.
→ Scalability: Conventional systems offer limited vertical scalability (adding more power to a single server); Big Data platforms scale horizontally (adding more machines to the cluster).
→ Storage Architecture: Conventional systems use centralized storage with an RDBMS; Big Data platforms use distributed storage (e.g., HDFS, NoSQL databases).
→ Processing Model: Conventional systems use the relational model with SQL; Big Data platforms use distributed computing (e.g., MapReduce, Spark).
→ Cost: Conventional systems are expensive for large-scale data due to high hardware/software cost; Big Data platforms are cost-effective using commodity hardware and open-source tools.
→ Fault Tolerance: Conventional systems offer minimal fault tolerance, and a system failure often disrupts processing; Big Data platforms achieve high fault tolerance through replication and failover mechanisms.
→ Flexibility: Conventional systems require a rigid schema design; Big Data platforms offer schema-less or flexible schema support.

Traditional systems are therefore inadequate for today's data because they cannot scale out economically, cannot handle diverse formats, and cannot process high-velocity streams in real time.

Ques → Describe the architecture of a Big Data system. Highlight the role of each component and how they interact in a data pipeline.

Ans:-)
Component – Role and Function

1. Data Sources
→ Origin of data such as web servers, sensors, applications, logs, social media.
→ Send both real-time and historical data for processing.

2. Data Storage
→ Stores large volumes of raw input data (structured, semi-structured, unstructured).
→ Supports batch processing later. Common systems: HDFS, cloud storage, NoSQL DBs.

3. Batch Processing
→ Processes large volumes of data at once (non-real time).
→ Used for historical trend analysis, data cleaning, and aggregation.
→ Outputs are written to the Analytical Data Store.

4. Real-time Message Ingestion
→ Captures and ingests streaming data from live sources.
→ Tools like Kafka, Flume, or NiFi are used here.

5. Stream Processing
→ Processes real-time data for instant insights and decisions.
→ Detects events or anomalies as data flows in (e.g., fraud detection).
→ Feeds processed data into the Analytical Data Store.

6. Analytical Data Store
→ Central repository for storing processed data from both batch and stream paths.
→ Optimized for querying, analytics, and business intelligence tools.

7. Analytics and Reporting
→ Performs data analysis, visualizations, dashboards, and reporting.
→ Tools: Tableau, Power BI, Apache Superset, etc.

8. Orchestration
→ Manages workflows, scheduling, and coordination across all components.
→ Ensures smooth and timely execution of tasks.
→ Tools: Apache Airflow, Oozie.

Interaction in the Pipeline

→ Data Sources send input to two parallel paths:
→ One path stores data for Batch Processing.
→ The other ingests data in real time via Message Ingestion.
→ Batch Processing and Stream Processing both send processed data to a shared Analytical Data Store.
→ Final data is retrieved from the Analytical Store by Analytics and Reporting tools for visualization.
→ Orchestration ensures every step in this flow is scheduled, monitored, and managed properly.

Ques → Discuss the ethical challenges and privacy issues related to Big Data. How can compliance and auditing features be integrated into Big Data frameworks?

Ans:-) Ethical Challenges and Privacy Issues in Big Data

1. Data Privacy Violation
→ Big Data often collects personal and sensitive information (e.g., health, financial, location data) without explicit consent.
→ There is a risk of misuse, re-identification, and exposure of individuals' private data.

2. Informed Consent
→ Users are often unaware of how their data is being collected, stored, and analyzed.
→ Consent forms are either vague or buried in long privacy policies, violating ethical transparency.

3. Data Ownership and Control
→ Unclear policies on who owns the data: users, data collectors, or third-party processors.
→ Lack of control by users over their own data can lead to exploitation.

4. Data Discrimination and Bias
→ Algorithms trained on biased data may lead to unfair outcomes (e.g., in hiring, credit scoring).
→ Ethical concerns arise when Big Data reinforces social inequalities.

5. Surveillance and Monitoring
→ Governments or corporations can use Big Data for continuous tracking, leading to mass surveillance.
→ Raises concerns around freedom, autonomy, and civil rights.

6. Data Security Risks
→ Large datasets are prime targets for cyberattacks, breaches, and leaks.
→ Poor security practices can compromise the privacy of millions.

Integrating Compliance and Auditing in Big Data Frameworks

1. Data Governance Policies
→ Implement strict data governance models defining access, retention, and usage policies.
→ Define roles and responsibilities for data custodianship.

2. Regulatory Compliance
→ Align data processing practices with laws like GDPR, HIPAA, CCPA, etc.
→ Include consent tracking, right-to-be-forgotten, and data minimization in system design.

3. Data Auditing Mechanisms
→ Maintain logs of data access, processing, and sharing.
→ Use automated auditing tools to trace who accessed what data, when, and why.

4. Role-Based Access Control (RBAC)
→ Enforce fine-grained access policies, allowing only authorized users to access specific datasets.
→ Combine with encryption and anonymization to protect sensitive information.

5. Data Anonymization and Masking
→ Apply techniques to obscure identifiable information while preserving analytical value.
→ Reduces the risk of data re-identification.

6. Encryption and Secure Storage
→ Ensure all data at rest and in transit is encrypted.
→ Use secure storage systems and key management practices.

7. Transparency and Reporting
→ Provide clear disclosures on data usage and privacy practices.
→ Offer users access to their own data and summaries of how it's being used.

Ques → Big Data is often referred to as a disruptive innovation. Trace the history of Big Data evolution and explain the technological and business drivers behind its rise.

Ans:-) Evolution of Big Data

1. Pre-Big Data Era (1960s–1990s)
→ Data was processed using traditional RDBMS (Relational Database Management Systems).
→ Systems were built for structured data with limited volume and slower generation rates.
→ Technologies: COBOL, IBM mainframes, and early SQL databases.

2. Internet and Web Era (1990s–2005)
→ The rise of the internet led to rapid data generation (web pages, emails, e-commerce logs).
→ Organizations began collecting user data, leading to data warehouses and business intelligence tools.
→ Technologies: Oracle, MySQL, Teradata, ETL tools.

3. Emergence of Big Data (2005–2010)
→ Explosion of unstructured data from social media, sensors, videos, and mobile apps.
→ Google introduced MapReduce and GFS (Google File System) to handle web-scale data.
→ Apache Hadoop (2006) adopted these ideas, making Big Data processing open-source and scalable.

4. Modern Big Data Era (2010–Present)
→ Real-time data analytics became important (IoT, fintech, e-health).
→ Technologies like Apache Spark, Kafka, NoSQL, and cloud platforms gained popularity.
→ Machine learning, AI, and predictive analytics became central to Big Data ecosystems.

Technological Drivers Behind Big Data Rise

1. Storage Advancements
→ Cheap and scalable storage (e.g., HDFS, Amazon S3) made it feasible to store massive datasets.

2. Distributed Computing
→ Technologies like Hadoop and Spark enabled parallel processing across clusters of machines.

3. Cloud Computing
→ Platforms like AWS, Google Cloud, and Azure made Big Data infrastructure more accessible and cost-effective.

4. NoSQL Databases
→ Support for semi-structured and unstructured data (e.g., MongoDB, Cassandra).

5. Open-source Ecosystem
→ Rapid development due to community-driven projects like Hadoop, Hive, Pig, and Flink.

6. Real-Time Processing
→ Tools like Apache Kafka and Storm allowed handling streaming data for instant decision-making.

Business Drivers Behind Big Data Rise

1. Customer-Centric Strategies
→ Companies use Big Data to understand consumer behavior, personalize marketing, and improve user experience.

2. Competitive Advantage
→ Data-driven insights help firms innovate faster and outperform competitors.

3. Operational Efficiency
→ Optimization of supply chains, manufacturing, and resource management using data analytics.

4. Risk Management
→ Fraud detection, predictive maintenance, and financial risk modeling using Big Data tools.

5. Monetization of Data
→ Businesses increasingly see data as a valuable asset that can be sold, traded, or used to generate revenue.

Ques → Differentiate between analysis and reporting in the context of Big Data. Why is analysis considered more critical in intelligent systems?

Ans:-) Difference Between Analysis and Reporting in Big Data

→ Definition: Analysis is the in-depth examination of data to discover patterns and trends; reporting is the summarizing of data to present facts in a readable format.
→ Purpose: Analysis aims to extract insights, make predictions, and support decisions; reporting aims to communicate historical data and performance.
→ Nature: Analysis is exploratory, diagnostic, predictive, and prescriptive; reporting is descriptive and historical in nature.
→ Tools Used: Analysis uses machine learning, data mining, and statistical models; reporting uses dashboards and BI tools (Tableau, Power BI).
→ Output: Analysis produces actionable insights, recommendations, and forecasts; reporting produces tables, charts, graphs, and summary reports.
→ User: Analysis is done by data scientists, analysts, and decision-makers; reporting is consumed by managers, business users, and stakeholders.
→ Automation: Analysis involves advanced algorithms and dynamic processing; reporting is often automated but static in nature.
→ Complexity: Analysis is high complexity and requires deep understanding of data and tools; reporting is low to moderate complexity and focuses on presentation of data.

Why Analysis Is More Critical in Intelligent Systems

1. Decision-Making Support
→ Intelligent systems rely on analysis to make data-driven decisions in real time (e.g., autonomous vehicles, fraud detection).

2. Predictive and Adaptive Behavior
→ Analysis helps systems learn from past data and adapt to new situations using predictive models.

3. Enables Automation
→ Intelligent automation (like recommendation engines or chatbots) depends on continuous analysis of user data.

4. Drives Personalization
→ Systems analyze user preferences and behaviors to provide tailored experiences.

5. Core to AI and Machine Learning
→ Analysis is the foundation of training models, detecting patterns, and improving over time.
Ques → List any five Big Data platforms.

Ans:-) List of Five Big Data Platforms

➤ Apache Hadoop
➤ Apache Spark
➤ Apache Flink
➤ Google BigQuery
➤ Amazon EMR (Elastic MapReduce)

Ques → Write any two industry examples for Big Data.

Ans:-) Industry Examples of Big Data

➤ Healthcare Industry
→ Big Data is used to analyze patient records, treatment histories, and diagnostic reports to improve healthcare outcomes and personalize treatment plans.

➤ Retail Industry
→ Big Data helps in customer behavior analysis, inventory management, and targeted marketing by analyzing large volumes of sales and customer interaction data.

Unit-02

Ques → What is Hadoop? Explain its history and the components of the Hadoop ecosystem.
Ans:-)

Hadoop:-

→ Hadoop is an open-source framework developed by Apache for storing and processing large datasets in a distributed computing environment.
→ It enables reliable, scalable, and efficient data storage and processing using commodity hardware.
→ Hadoop uses a distributed file system (HDFS) and a processing model called MapReduce.

History of Hadoop

→ Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
→ It was inspired by Google's File System (GFS) and the MapReduce programming model.
→ Initially part of the Apache Nutch project (web crawler).
→ Yahoo! contributed Hadoop to the Apache Software Foundation in 2008.
→ First stable release: Hadoop 1.0.0 in December 2011.

Components of the Hadoop Ecosystem

Core Components:

→ HDFS (Hadoop Distributed File System):
Stores large files across multiple machines with replication and fault tolerance.

→ MapReduce:
A programming model for processing large datasets in parallel across a Hadoop cluster.

→ YARN (Yet Another Resource Negotiator):
Manages and schedules resources across the Hadoop cluster.

→ Hadoop Common:
Provides shared libraries and utilities used by other Hadoop components.

Supporting Ecosystem Tools:

→ Hive:
Data warehousing tool that allows SQL-like queries on large datasets.

→ Pig:
A high-level scripting language (Pig Latin) for analyzing large data sets.

→ HBase:
A column-oriented NoSQL database built on top of HDFS.

→ Sqoop:
Used for transferring data between Hadoop and relational databases.

→ Flume:
Collects and transports large amounts of log or streaming data into HDFS.

→ Oozie:
Workflow scheduler to manage Hadoop jobs.

→ Zookeeper:
A coordination service for managing distributed applications.

→ Mahout:
A library for scalable machine learning algorithms on Hadoop.

Ques → Describe the Hadoop Distributed File System (HDFS). What are its core features and how does it store data across nodes?

Ans:-)

Hadoop Distributed File System (HDFS)

→ HDFS stands for Hadoop Distributed File System, the primary storage system of Hadoop.
→ It is designed to store very large files reliably across multiple machines in a distributed manner.
→ It follows a master-slave architecture and is highly fault-tolerant.

Core Features of HDFS

→ Distributed Storage:
Stores data across multiple machines to ensure scalability and parallel access.

→ Fault Tolerance:
Data is automatically replicated across multiple nodes (default: 3 copies) to prevent data loss during hardware failure.

→ High Throughput:
Optimized for large data sets with batch processing, providing high data access speed.

→ Scalability:
Can scale out by adding more machines without changing the data or applications.

→ Write-Once, Read-Many Model:
Files are written once and read multiple times, which simplifies data consistency and replication.

→ Data Locality Optimization:
Moves computation close to the data to reduce network congestion and increase speed.

→ Support for Large Files:
Efficiently handles files of size in GBs or TBs.

How HDFS Stores Data Across Nodes

1. Architecture:

→ NameNode (Master):
Stores metadata (file names, block locations, permissions) but not the actual data.
→ DataNodes (Slaves):
Store the actual data blocks. They report to the NameNode periodically.

2. Data Storage Process:

→ When a file is uploaded to HDFS, it is split into blocks (default block size: 128 MB or 64 MB).
→ Each block is replicated (default: 3 copies) and distributed to different DataNodes for reliability.
→ The NameNode maintains the record of which block is stored on which DataNode.
→ If a DataNode fails, the NameNode ensures the lost replica is recreated on another DataNode.

3. Accessing Data:

→ When a user wants to read a file, the client contacts the NameNode to get the block locations.
→ Then the client directly reads the data from the respective DataNodes.

Ques → Explain the basic working of the MapReduce framework with the help of a suitable example.
Ans:-)

MapReduce

→ MapReduce is a programming model and processing engine in Hadoop for processing large-scale data in a parallel and distributed manner.
→ It consists of two main phases: Map and Reduce.

Basic Working of MapReduce

1. Input Splitting:
→ Input data is split into smaller chunks (blocks), which are processed in parallel.

2. Mapping Phase:
→ The Mapper function processes each split and produces intermediate key-value pairs.

3. Shuffling and Sorting:
→ The framework groups all values based on their keys and sends them to the Reducers.

4. Reducing Phase:
→ The Reducer processes each key and its list of values and generates the final output.

Example: Word Count Program

Problem: Count the frequency of each word in a large text file.

Step 1: Input File

Input.txt:

Hello world
Hello Hadoop

Step 2: Map Phase

→ The Mapper reads line by line and breaks it into words.

Map Output (Key, Value):

("Hello", 1)
("world", 1)
("Hello", 1)
("Hadoop", 1)

Step 3: Shuffle and Sort

→ Group values by keys:

Grouped:

("Hadoop", [1])
("Hello", [1, 1])
("world", [1])

Step 4: Reduce Phase

→ Sum the values for each key:

Reduce Output:

("Hadoop", 1)
("Hello", 2)
("world", 1)

Final Output (Word Count Result):

Hadoop 1
Hello 2
world 1
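
The same word-count logic can also be expressed in code. Below is a minimal sketch in Scala using Spark's API, which follows the same map → shuffle → reduce pattern described above (the classic Hadoop implementation would instead use Mapper and Reducer classes). The input path and the local[*] master are assumptions for a single-machine test run.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is only for a single-machine test; on a cluster the master comes from spark-submit
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("Input.txt")        // assumed path to the two-line file above
    val counts = lines
      .flatMap(_.split("\\s+"))                 // Map: emit each word
      .map(word => (word, 1))                   // Map: produce (word, 1) pairs
      .reduceByKey(_ + _)                       // Shuffle + Reduce: sum counts per word

    counts.collect().foreach { case (w, n) => println(s"$w $n") }
    spark.stop()
  }
}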

Ques → What is shuffle and sort in MapReduce? Why is it a critical phase in the job lifecycle?

Ans:-)
What is Shuffle and Sort in MapReduce?

→ Shuffle and Sort is an intermediate phase in the MapReduce framework that occurs between the Map and Reduce phases.
→ It is automatically handled by the Hadoop framework and is responsible for transferring, sorting, and grouping the intermediate data output by Mappers before it reaches the Reducers.

Working of Shuffle and Sort

→ After the Map phase, the output is a series of unsorted key-value pairs.
→ The Shuffle process sends these pairs from the Mappers to the appropriate Reducers based on the key.
→ During this transfer, all values for the same key are grouped together.
→ Simultaneously, the data is sorted by key, so the Reducer receives keys in sorted order.
→ This sorted and grouped data is then passed to the Reduce function.

Why Shuffle and Sort is a Critical Phase

→ Ensures Correctness:
The Reducer expects all values for a given key to arrive together; shuffle and sort guarantees this.

→ Handles Large-Scale Data Movement:
It efficiently manages the transfer of large volumes of intermediate data across the network.

→ Optimizes Performance:
Sorting reduces the complexity of the Reducer's job by providing already ordered data.

→ Enables Scalability:
It allows Hadoop to process massive data sets across distributed nodes effectively.

→ Supports Fault Tolerance:
The framework handles retries and failures during shuffle without affecting the final result.
Ques → Compare and contrast Hadoop Streaming and Hadoop Pipes. In what scenarios is each preferred?

Ans:-)

→ Definition: Hadoop Streaming uses standard input/output (stdin/stdout) so that programs written in any language (e.g., Python, Perl, Ruby) can be used in MapReduce jobs; Hadoop Pipes is a C++ interface for writing MapReduce programs for Hadoop.
→ Language Support: Streaming supports any language that supports stdin and stdout; Pipes supports C++ only.
→ Interface Type: Streaming uses text-based communication over stdin/stdout; Pipes uses a binary protocol over sockets.
→ Ease of Use: Streaming is easy to write and debug; Pipes is more complex to implement and debug.
→ Performance: Streaming may be slower due to text parsing; Pipes is faster due to the binary protocol.
→ Integration with Hadoop: In Streaming, an external script/program interacts with Hadoop; Pipes has native integration with the Hadoop core.
→ Use Case Simplicity: Streaming is ideal for quick prototyping or using existing scripts; Pipes is best for performance-critical C++ applications.

When to Prefer Each

→ Hadoop Streaming is preferred when:
→ You want to use existing scripts written in languages like Python, Ruby, or Perl.
→ You need to prototype or test ideas quickly.
→ You are working on text-based data and ease of development is more important than speed.

→ Hadoop Pipes is preferred when:
→ You want to write high-performance MapReduce code in C++.
→ Your application is compute-intensive and requires faster data handling.
→ You need tighter integration with the Hadoop ecosystem.
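
For reference, a Hadoop Streaming job is launched through the hadoop-streaming jar; a rough sketch is shown below. The jar path varies by distribution, and mapper.py / reducer.py are hypothetical executable scripts that read lines from stdin and write tab-separated key-value pairs to stdout.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input /user/data/input \
  -output /user/data/output \
  -mapper mapper.py \
  -reducer reducer.py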

Ques → Differentiate "Scale up and Scale out". Explain with an example how Hadoop uses the Scale out feature to improve performance.

Ans:-)

Difference Between Scale Up and Scale Out

→ Definition: Scale Up means increasing the capacity of a single machine; Scale Out means adding more machines/nodes to the system.
→ Hardware Dependency: Scale Up requires powerful (high-end) hardware; Scale Out uses commodity (low-cost) hardware.
→ Cost: Scale Up is expensive due to high-end systems; Scale Out is cost-effective as more inexpensive nodes are added.
→ Performance: Scale Up is limited by the capability of a single machine; Scale Out gives improved performance through parallel processing.
→ Failure Impact: With Scale Up, a single point of failure can affect the system; with Scale Out, distributed systems reduce the failure impact.
→ Example: Scale Up – upgrading the RAM or CPU of a server; Scale Out – adding more nodes to a Hadoop cluster.

Example to Understand the Concept

→ Suppose you have a data processing job that takes 10 hours on one server.
→ Scale Up: You upgrade the server's RAM and CPU. Now the job finishes in 6 hours.
→ Scale Out: You add 3 more similar machines and distribute the task. Now the job finishes in 2.5 hours due to parallel processing.

How Hadoop Uses Scale Out to Improve Performance

→ Hadoop is designed based on the Scale Out model.
→ It splits data into blocks and stores them across multiple nodes in the cluster using HDFS.
→ Each node processes a part of the data simultaneously using the MapReduce model.
→ As more data is added, instead of upgrading machines, Hadoop can add more nodes to the cluster to maintain performance.
→ This parallel and distributed architecture enables Hadoop to handle petabytes of data efficiently, ensuring both scalability and fault tolerance.

Unit-03

Ques - Explain the design and architecture of HDFS.

Ans:-)

1. Master Node (Resource Manager)

➤ The Master Node is responsible for managing resources and scheduling tasks across the cluster.
➤ The Resource Manager component inside it handles:
→ Allocation of system resources to various applications.
→ Monitoring cluster utilization and load balancing.

2. Slave Nodes (Node Managers)

➤ Each Slave Node runs a Node Manager, which is responsible for:
→ Monitoring resource usage (CPU, memory) on that node.
→ Reporting node health and availability to the Resource Manager.
→ Managing containers where actual task execution happens.

➤ Slave Nodes are the execution layer, where MapReduce jobs or other computation tasks run.

3. Communication

➤ The Resource Manager communicates with all Node Managers to:
→ Assign jobs
→ Track task progress
→ Handle failures and reassign resources as needed

Relation with HDFS

Although the diagram emphasizes YARN resource management, in HDFS:

➤ The NameNode (typically part of the Master) manages metadata about file storage.
➤ DataNodes (typically part of Slave Nodes) store the actual data blocks.

In practice, Node Managers and DataNodes often run on the same machines to support both computation and storage.

Ques - What are the challenges and benefits of using HDFS in big data environments?

Ans:-) Benefits of Using HDFS

➤ Scalability
→ HDFS can store and process petabytes of data by scaling across multiple machines.

➤ Fault Tolerance
→ Data is replicated across different nodes to ensure availability in case of failure.

➤ Cost-Effective Storage
→ Uses commodity hardware, reducing infrastructure costs.

➤ High Throughput
→ Optimized for batch processing and large datasets with sequential read/write operations.

➤ Data Locality Optimization
→ Moves computation close to where the data resides, reducing network traffic.

Challenges of Using HDFS

➤ Not Suitable for Small Files
→ HDFS is optimized for large files; managing a large number of small files is inefficient.

➤ High Latency for Random Access
→ Not ideal for real-time or low-latency access since it is designed for batch processing.

➤ Complexity in Management
→ Requires skilled administrators for setup, configuration, and monitoring.

➤ Security Concerns
→ By default, lacks strong authentication and encryption without additional configuration (like Kerberos).

➤ Single Point of Failure (NameNode)
→ If the NameNode fails (without HA configuration), the entire system is disrupted.

Ques - Discuss the role of Flume and Sqoop in data ingestion. How do they work with HDFS?

Ans:-) Apache Flume

➤ Purpose
→ Flume is designed for collecting, aggregating, and moving large volumes of log or event data from various sources to HDFS.

➤ How It Works
→ Flume uses a data flow structure with Source → Channel → Sink.
→ The Source collects data (e.g., from web servers),
→ The Channel temporarily stores it (like a buffer),
→ The Sink delivers the data to HDFS or other destinations.

➤ Integration with HDFS
→ The Flume sink writes data directly into HDFS using a specified directory path and format (e.g., text, sequence file).

Apache Sqoop

➤ Purpose
→ Sqoop is used for efficiently transferring structured data between relational databases (MySQL, Oracle, etc.) and HDFS.

➤ How It Works
→ Sqoop generates MapReduce jobs that import/export data in parallel.
→ For import, it extracts data from the RDBMS and writes it into HDFS in formats like Avro, Parquet, or text.
→ For export, it takes HDFS data and pushes it back to a database table.

➤ Integration with HDFS
→ Sqoop directly writes the imported data into HDFS directories.
→ It also supports importing into Hive tables and HBase.
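
As a rough sketch of what a Sqoop import looks like (the JDBC URL, database, and table names here are made-up placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/data/orders \
  --num-mappers 4

--num-mappers controls how many parallel map tasks Sqoop launches, which is what makes the transfer scale.
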
Ques - What are the various file-based data structures and serialization formats supported in Hadoop?

Ans:-) File-Based Data Structures

➤ Text Files
→ Simple flat files containing plain text data; used for small or basic datasets.

➤ Sequence Files
→ Binary files storing key-value pairs; useful for intermediate MapReduce data.

➤ MapFiles
→ Sorted SequenceFiles with an index for fast lookups.

➤ Avro Files
→ Row-based storage format; supports schema evolution and compact serialization.

➤ Parquet Files
→ Columnar storage format; optimized for complex data and read-heavy workloads.

Serialization Formats

➤ Writable
→ Default Hadoop serialization format for key-value pairs in MapReduce.

➤ Avro
→ Supports rich data structures, compact format, and schema definition in JSON.

➤ Protocol Buffers
→ Language-neutral format developed by Google; efficient and extensible.

➤ Thrift
→ Cross-language serialization and RPC framework used for data serialization.

➤ JSON and XML
→ Human-readable formats; used for configuration, logs, and lightweight data exchange.

Ques - Explain the steps involved in setting up and configuring a secure Hadoop cluster.

Ans:-) Steps Involved in Setting Up and Configuring a Secure Hadoop Cluster

1. Install and Configure Hadoop
➤ Download and install Hadoop on all nodes
➤ Configure environment variables (e.g., HADOOP_HOME, JAVA_HOME)
➤ Edit core-site.xml, hdfs-site.xml, yarn-site.xml for cluster setup

2. Set Up SSH and Networking
➤ Enable passwordless SSH between all nodes
➤ Configure hostnames and /etc/hosts for name resolution
➤ Test connectivity between nodes

3. Enable HDFS Permissions and User Authentication
➤ Enable file-level permissions in hdfs-site.xml
➤ Create Hadoop users and groups
➤ Set up directory ownership and access control

4. Configure Kerberos Authentication
➤ Install and configure the Kerberos server (KDC)
➤ Generate keytabs for Hadoop services and users
➤ Modify Hadoop configuration files to use Kerberos

5. Enable Data Encryption
➤ Enable data encryption at rest using Transparent Data Encryption (TDE)
➤ Enable data encryption in transit using HTTPS or SASL

6. Set Up Service-Level Authorization
➤ Enable service-level authorization in core-site.xml
➤ Define policies in hadoop-policy.xml

7. Configure Audit Logging and Monitoring
➤ Enable audit logs for HDFS and other components
➤ Integrate with monitoring tools like Ambari or Prometheus

Ques - Examine how a client reads and writes data in HDFS.

Ans:-) 1. Write Operation in HDFS

➤ Step 1: Client Request
→ The client contacts the NameNode to request permission to write a file.

➤ Step 2: Block Allocation
→ The NameNode checks the namespace and responds with the addresses of DataNodes for each block of the file.

➤ Step 3: Data Streaming
→ The client splits the file into blocks and starts sending the data to the first DataNode in the pipeline.

➤ Step 4: Data Pipelining
→ Each DataNode forwards the block to the next DataNode (replicas are created in the process).

➤ Step 5: Acknowledgment
→ Once all DataNodes have written the block, acknowledgments are sent back to the client.

2. Read Operation in HDFS

➤ Step 1: Client Request
→ The client sends a request to the NameNode for reading a file.

➤ Step 2: Metadata Response
→ The NameNode responds with the locations of the DataNodes containing the blocks of the requested file.

➤ Step 3: Data Fetching
→ The client directly contacts the nearest DataNode to read the data blocks.

➤ Step 4: Data Assembly
→ Blocks are fetched and reassembled by the client to form the complete file.
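
From an application's point of view, this NameNode/DataNode interaction is hidden behind Hadoop's FileSystem API. A minimal Scala sketch is shown below; the NameNode address and file path are assumptions, and in a real deployment fs.defaultFS would normally come from core-site.xml.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientDemo extends App {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://namenode:9000")   // assumed NameNode address

  val fs = FileSystem.get(conf)

  // Write: the client gets block/DataNode assignments from the NameNode, then streams to the pipeline
  val out = fs.create(new Path("/demo/hello.txt"))
  out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8))
  out.close()

  // Read: the client gets block locations from the NameNode, then reads directly from DataNodes
  val in = fs.open(new Path("/demo/hello.txt"))
  println(scala.io.Source.fromInputStream(in).mkString)
  in.close()

  fs.close()
}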

Unit-04

Ques → Explain the architecture of YARN and its role in Hadoop.

Ans:-)

YARN

➤ YARN stands for Yet Another Resource Negotiator.
➤ It is the resource management layer of the Hadoop ecosystem, introduced in Hadoop 2.x to improve scalability and performance.
➤ It decouples resource management and job scheduling from the MapReduce programming model.

Core Components of YARN Architecture

1. ResourceManager (RM)
➤ Central authority that manages cluster resources and schedules applications.
➤ Consists of two main components:
➤ Scheduler – Allocates resources based on constraints like memory and CPU.
➤ ApplicationManager – Manages job submissions and negotiates containers with NodeManagers.

2. NodeManager (NM)
➤ Runs on each node in the cluster.
➤ Responsible for monitoring resources and container lifecycle management on the node.
➤ Reports node and container status to the ResourceManager.

3. ApplicationMaster (AM)
➤ Created for each application/job.
➤ Handles execution, task scheduling, and coordination of its specific job.
➤ Requests resources from the ResourceManager and interacts with NodeManagers.

4. Containers
➤ Logical units of resource allocation (e.g., memory + CPU).
➤ Actual tasks of an application run inside containers on different nodes.

Role of YARN in Hadoop

➤ Resource Management
➤ Efficient allocation of resources (CPU, memory) across the cluster.

➤ Multi-Framework Support
➤ Allows Hadoop to run multiple processing engines (e.g., MapReduce, Apache Spark, Tez, Flink) on the same cluster.

➤ Improved Scalability
➤ Decentralized application management supports thousands of concurrent applications.

➤ Better Utilization
➤ Fine-grained and dynamic allocation improves cluster utilization.

➤ Fault Tolerance
➤ Isolated application execution enables recovery without affecting others.

Ques → What are NoSQL databases? Discuss the key differences between NoSQL and traditional RDBMS.

Ans:-)

NoSQL Databases

➤ NoSQL (Not Only SQL) databases are non-relational databases designed to handle large volumes of unstructured, semi-structured, or structured data.
➤ They provide high scalability, flexibility in data modeling, and are well-suited for distributed architectures and Big Data applications.
➤ NoSQL databases can store data in various formats like key-value pairs, documents, wide-columns, or graphs.
➤ Examples: MongoDB, Cassandra, Redis, CouchDB, Neo4j

Key Differences Between NoSQL and RDBMS

→ Data Model: RDBMS stores structured data with predefined schemas (tables, rows); NoSQL offers flexible schemas (document, key-value, graph, column family).
→ Schema: RDBMS has a fixed schema, and changes require altering the database structure; NoSQL has a dynamic schema that can change structure without affecting others.
→ Scalability: RDBMS scales vertically (scale-up by adding power to a single server); NoSQL scales horizontally (scale-out by adding more servers).
→ Query Language: RDBMS uses SQL (Structured Query Language); NoSQL varies (MongoDB uses BSON/JSON-like queries, Cassandra uses CQL).
→ Joins Support: RDBMS supports joins between multiple tables; NoSQL generally avoids joins to maintain performance.
→ ACID Compliance: RDBMS offers strong ACID (Atomicity, Consistency, Isolation, Durability); NoSQL may relax ACID for better performance and availability.
→ Performance with Big Data: RDBMS can struggle with huge datasets; NoSQL is designed for high performance at web-scale.
→ Examples: RDBMS – MySQL, PostgreSQL, Oracle, SQL Server; NoSQL – MongoDB, Cassandra, Redis, CouchDB, HBase.

Ques → Write a detailed note on MongoDB document operations.

Ans:-)

MongoDB Document Operations

MongoDB stores data in the form of documents, which are similar to JSON objects but use the BSON (Binary JSON) format internally. Document operations refer to actions that can be performed on these documents within a MongoDB collection.

1. Insert Operations

➤ insertOne()
➤ Inserts a single document into a collection.
Example:

db.students.insertOne({ name: "Alice", age: 22, course: "B.Tech" })

➤ insertMany()
➤ Inserts multiple documents at once.
Example:

db.students.insertMany([
  { name: "Bob", age: 23 },
  { name: "Carol", age: 21 }
])

2. Read Operations (Queries)

➤ find()
➤ Returns all documents that match a query.
Example:

db.students.find({ age: 22 })

➤ findOne()
➤ Returns the first document that matches a query.
Example:

db.students.findOne({ name: "Alice" })

➤ Query Operators
➤ Use operators like $gt, $lt, $eq, etc.
Example:

db.students.find({ age: { $gt: 21 } })

3. Update Operations

➤ updateOne()
➤ Updates the first document that matches the query.
Example:

db.students.updateOne(
  { name: "Alice" },
  { $set: { course: "M.Tech" } }
)

➤ updateMany()
➤ Updates all matching documents.
Example:

db.students.updateMany(
  { course: "B.Tech" },
  { $set: { status: "active" } }
)

➤ replaceOne()
➤ Replaces the entire document with a new one.
Example:

db.students.replaceOne(
  { name: "Bob" },
  { name: "Bob", age: 25, course: "MBA" }
)

4. Delete Operations

➤ deleteOne()
➤ Deletes the first document that matches the query.
Example:

db.students.deleteOne({ name: "Carol" })

➤ deleteMany()
➤ Deletes all documents matching the query.
Example:

db.students.deleteMany({ status: "inactive" })

5. Additional Useful Operations

➤ countDocuments()
➤ Returns the count of documents that match a query.
Example:

db.students.countDocuments({ course: "B.Tech" })

➤ sort()
➤ Sorts documents based on one or more fields.
Example:

db.students.find().sort({ age: -1 })  // Descending order

➤ limit()
➤ Limits the number of documents returned.
Example:

db.students.find().limit(2)

Ques → Explain the anatomy of a Spark job run.

Ans:-) Anatomy of a Spark Job Run

1. Driver Program Initialization
➤ The application starts with the Driver Program, which contains the main() function and defines transformations and actions on RDDs or DataFrames.

2. SparkContext Creation
➤ The Driver creates a SparkContext, which establishes a connection with the Cluster Manager (like YARN, Mesos, or Standalone).

3. Resource Request from Cluster Manager
➤ The SparkContext requests resources from the cluster manager to launch Executor processes on worker nodes.

4. Executors Launch
➤ Executors are launched on worker nodes. They are responsible for executing tasks and storing data (caching if needed).

5. Task Scheduling by DAG Scheduler
➤ The DAG Scheduler builds a Directed Acyclic Graph (DAG) of stages based on the lineage of transformations.

6. Stage Division by Task Scheduler
➤ The DAG is broken into stages, where each stage contains tasks based on data partitions.

7. Task Execution by Executors
➤ Tasks are sent to the executors.
➤ Transformations are lazily evaluated and executed only when an action is called.

8. Data Shuffling
➤ Between stages, data may need to be shuffled (redistributed across nodes) for operations like groupByKey or reduceByKey.

9. Job Completion
➤ After all tasks finish, the results are returned to the Driver or written to external storage (like HDFS, S3, DB).

10. Clean-Up
➤ Spark releases resources (executors) and the job ends.
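
A minimal Scala driver program that exercises these steps is sketched below. It is meant to be launched with spark-submit (which supplies the master URL); the numbers and the 8-partition count are arbitrary assumptions, chosen only to make the stage/task structure visible.

import org.apache.spark.sql.SparkSession

object JobAnatomyDemo {
  def main(args: Array[String]): Unit = {
    // Driver starts here; creating the SparkSession connects to the cluster manager (master set by spark-submit)
    val spark = SparkSession.builder().appName("JobAnatomyDemo").getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: only the lineage (DAG) is recorded at this point
    val nums     = sc.parallelize(1 to 1000000, 8)                  // 8 partitions => 8 tasks per stage
    val squares  = nums.map(n => n.toLong * n)
    val byParity = squares.map(n => (n % 2, n)).reduceByKey(_ + _)  // reduceByKey introduces a shuffle boundary

    // The action triggers the DAG scheduler: stages are built and tasks are sent to executors
    byParity.collect().foreach(println)

    spark.stop()  // releases executors and ends the job
  }
}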

Ques → Compare and contrast Hadoop MapReduce v1 and YARN (MRv2). How does YARN improve over MRv1?

Ans:-)

→ Architecture: MRv1 is monolithic, with a JobTracker and TaskTrackers; MRv2 is modular, with a ResourceManager, NodeManagers, and per-application ApplicationMasters.
→ Job Management: In MRv1 a single JobTracker manages both resources and job scheduling; in MRv2 the ResourceManager and ApplicationMaster are separate.
→ Scalability: MRv1 is limited due to the single JobTracker; MRv2 is highly scalable due to its distributed architecture.
→ Resource Utilization: MRv1 is inefficient and tied to MapReduce tasks only; MRv2 is efficient and supports multiple types of applications (e.g., Spark).
→ Fault Tolerance: In MRv1, failure of the JobTracker affects the whole system; in MRv2, failures are isolated and components can recover independently.
→ Support for Other Models: MRv1 only supports MapReduce; MRv2 supports MapReduce, Spark, Tez, Flink, etc.
→ Resource Scheduling: MRv1 uses coarse-grained, static allocation; MRv2 uses fine-grained, dynamic resource allocation.

How YARN Improves Over MRv1

➤ Separation of Concerns
→ JobTracker responsibilities are split into the ResourceManager (cluster resource management) and the ApplicationMaster (application execution), improving manageability.

➤ Better Scalability
→ MRv1's JobTracker became a bottleneck in large clusters. YARN decentralizes job scheduling, allowing thousands of jobs to run concurrently.

➤ Multi-Framework Support
→ YARN supports non-MapReduce processing models like Apache Spark, Tez, and Flink, enabling more flexibility and modern workloads.

➤ Improved Resource Utilization
→ YARN provides fine-grained control over resources (CPU, memory), allowing better utilization and reduced waste.

➤ Fault Isolation and Recovery
→ Failures in one job's ApplicationMaster do not affect others, unlike MRv1 where a JobTracker failure could crash the whole system.

Ques → What are the key components of the Hadoop ecosystem?

Ans:-)

Key Components of the Hadoop Ecosystem

➤ HDFS (Hadoop Distributed File System)
Stores large volumes of data across multiple machines with replication for fault tolerance.

➤ MapReduce
A programming model for processing large data sets in parallel using map and reduce functions.

➤ YARN (Yet Another Resource Negotiator)
Manages resources and schedules tasks across the Hadoop cluster.

➤ Hive
A data warehouse system built on Hadoop that allows querying and managing large datasets using a SQL-like language called HiveQL.

➤ Pig
A platform for analyzing large datasets using a high-level scripting language called Pig Latin.

➤ HBase
A NoSQL database built on HDFS that provides real-time read/write access to large datasets.

➤ Sqoop
Used to transfer bulk data between Hadoop and structured data stores like RDBMS.

➤ Flume
Used for collecting, aggregating, and moving large amounts of log data into HDFS.

➤ Oozie
A workflow scheduler system to manage Hadoop jobs.

➤ Zookeeper
A coordination service for distributed applications, ensuring synchronization across nodes.

➤ Mahout
A machine learning library for building scalable ML applications on top of Hadoop.

➤ Ambari
A web-based tool for provisioning, managing, and monitoring Hadoop clusters.
Ques → Discuss Scala's functional programming features with examples.

Ans:-)

Scala's Functional Programming Features

➤ First-Class and Higher-Order Functions
Functions are treated as values and can be assigned to variables, passed as parameters, or returned from other functions.
Example:

val add = (x: Int, y: Int) => x + y

def applyFunc(f: (Int, Int) => Int, a: Int, b: Int): Int = f(a, b)

println(applyFunc(add, 5, 3)) // Output: 8

➤ Immutable Data
Scala encourages using immutable variables (using val) and immutable collections to avoid side effects.
Example:

val list = List(1, 2, 3)

// list(0) = 10 // Not allowed, List is immutable

val newList = list.map(_ * 2)

println(newList) // Output: List(2, 4, 6)

➤ Pure Functions
Functions that always produce the same output for the same input and have no side effects.
Example:

def square(x: Int): Int = x * x

println(square(4)) // Output: 16

➤ Pattern Matching
A powerful feature to match data structures and decompose them. Similar to switch-case but more expressive.
Example:

def describe(x: Any): String = x match {
  case 0 => "zero"
  case s: String => s"String of length ${s.length}"
  case _ => "something else"
}

println(describe("hello")) // Output: String of length 5

➤ Anonymous Functions (Lambdas)
Functions without a name, often used as arguments to higher-order functions.
Example:

val nums = List(1, 2, 3, 4)

val doubled = nums.map(x => x * 2)

println(doubled) // Output: List(2, 4, 6, 8)

➤ Closures
Functions that capture variables from their environment.
Example:

var factor = 3

val multiplier = (x: Int) => x * factor

println(multiplier(4)) // Output: 12

➤ Lazy Evaluation
Expressions are evaluated only when needed. Useful to improve performance and handle infinite data structures.
Example:

lazy val x = { println("Evaluated"); 42 }

println("Before accessing x")

println(x) // "Evaluated" is printed here, then 42

Unit-05

Ques → Differentiate between Map-Reduce, PIG and HIVE.

Ans:-)

→ Definition: MapReduce is a low-level programming model for processing large data; Pig is a high-level scripting platform using Pig Latin; Hive is a data warehousing tool with a SQL-like interface.
→ Language Used: MapReduce – Java, Python, C++; Pig – Pig Latin; Hive – HiveQL (SQL-like).
→ Ease of Use: MapReduce is complex and verbose; Pig is easier than MapReduce (scripting based); Hive is the easiest, user-friendly for SQL users.
→ Execution Engine: MapReduce uses the native MapReduce engine; Pig converts Pig Latin to MapReduce; Hive converts HiveQL to MapReduce / Tez / Spark.
→ Use Cases: MapReduce – custom and complex data processing; Pig – ETL (Extract, Transform, Load); Hive – data querying, summarization, and reporting.
→ Schema Support: MapReduce – manual schema management; Pig – semi-structured, optional schema; Hive – strictly schema-based like an RDBMS.
Ques → Explore the various execution models of PIG.

Ans:-) 1. Local Mode

➤ Execution Environment
→ Pig scripts run on a single local machine.
→ Data is read from and written to the local file system.

➤ Use Case
→ Ideal for testing, development, and small datasets.

➤ Command to Run → pig -x local

2. MapReduce Mode (Default Mode)

➤ Execution Environment
→ Pig scripts are translated into MapReduce jobs and run on a Hadoop cluster.
→ Data is processed using HDFS.

➤ Use Case
→ Best for large-scale production environments.

➤ Command to Run → pig -x mapreduce

3. Tez Mode

➤ Execution Environment
→ Pig scripts are converted to tasks that run using the Apache Tez execution engine.
→ Tez is faster and more efficient than traditional MapReduce.

➤ Use Case
→ Used for improved performance in iterative and interactive Pig queries.

➤ Command to Run → pig -x tez

4. Spark Mode (Experimental)

➤ Execution Environment
→ Allows Pig scripts to execute on the Apache Spark engine.

➤ Use Case
→ Useful for users already using Spark and exploring faster execution.

➤ Status
→ Still experimental and less stable than MapReduce and Tez.

5. Grunt Shell (Interactive Mode)

➤ Execution Environment
→ Interactive command-line shell where users can write and execute Pig Latin statements step-by-step.

➤ Use Case
→ Helpful for debugging, testing, and learning Pig syntax.

➤ Command to Start → pig

Ques → Design and explain the detailed architecture of HIVE.

Ans:-)

Architecture of Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows users to query and analyze large datasets stored in HDFS using a SQL-like language called HiveQL.

1. Hive Client

➤ Thrift, JDBC, and ODBC Applications
→ Users interact with Hive using different interfaces.
→ These clients send queries to HiveServer2.

➤ JDBC/ODBC/Beeline Clients
→ Allow connections from Java/ODBC-based applications.
→ Beeline is the command-line interface for Hive.

2. Hive Services

➤ HiveServer2
→ Manages connections and handles requests from multiple clients.
→ Provides a secure and multi-user environment.

➤ Driver
→ Acts like a controller, managing the execution flow of HiveQL statements.

➤ Compiler
→ Parses the query and converts it into a logical plan.

➤ Optimizer
→ Optimizes the query plan (e.g., rearranging joins, filtering early).

➤ Metastore
→ Stores metadata about tables, columns, data types, partitions, etc.
→ Uses a relational database like MySQL or Derby.

3. Processing & Resource Management

➤ MapReduce/YARN
→ Hive converts queries into MapReduce jobs (or Tez/Spark in newer setups).
→ YARN handles job scheduling and resource management.

4. Distributed Storage

➤ HDFS (Hadoop Distributed File System)
→ The actual data is stored in HDFS.
→ Hive only stores metadata; it queries the data in HDFS.

Working Flow

➤ Step 1: Query Submission
→ A HiveQL query is submitted via a client (e.g., Beeline, JDBC, Thrift).

➤ Step 2: HiveServer2 Receives the Query
→ Passes it to the Driver component for processing.

➤ Step 3: Compilation
→ The Compiler checks syntax, performs semantic analysis, and creates a query plan.

➤ Step 4: Optimization
→ The Optimizer improves the query plan (e.g., filters early, reduces shuffle).

➤ Step 5: Execution Plan Generation
→ The final physical plan is generated (as MapReduce, Tez, or Spark jobs).

➤ Step 6: Metadata Access
→ The Driver consults the Metastore for schema and partition information.

➤ Step 7: Job Execution via YARN
→ Jobs are submitted to the YARN resource manager for execution on the cluster.

➤ Step 8: Data Retrieval from HDFS
→ Input data is fetched from HDFS, processed, and results are written back to HDFS or sent to the client.

➤ Step 9: Result Return
→ The Driver collects the final output and sends it to the client.
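
To make the client side of this flow concrete, here is a minimal Scala sketch that submits a HiveQL query to HiveServer2 over JDBC. The host name, credentials, and the students table are assumptions (the table matches the HiveQL examples later in these notes).

import java.sql.DriverManager

object HiveJdbcDemo extends App {
  // Assumes HiveServer2 is listening on the default port 10000 with simple authentication
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hiveuser", "")

  val stmt = conn.createStatement()
  // HiveServer2 -> Driver -> Compiler/Optimizer -> YARN jobs; the result set streams back to this client
  val rs = stmt.executeQuery("SELECT marks, COUNT(*) FROM students GROUP BY marks")
  while (rs.next()) {
    println(s"${rs.getInt(1)}\t${rs.getLong(2)}")
  }

  rs.close(); stmt.close(); conn.close()
}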

Ques → What are the key features of HBase and how does it differ from RDBMS?

Ans:-)

Key Features of HBase

➤ Column-Oriented Storage
➤ Stores data in column families rather than rows, which is efficient for read/write operations on large datasets.

➤ Built on HDFS
➤ Uses the Hadoop Distributed File System (HDFS) as its storage layer, allowing it to handle massive volumes of data.

➤ Scalability
➤ Horizontally scalable across commodity hardware; new nodes can be added easily without downtime.

➤ Real-Time Read/Write Access
➤ Unlike HDFS, which is optimized for batch processing, HBase allows real-time random read/write access.

➤ No Fixed Schema
➤ Schema-less design where each row can have a different set of columns, offering high flexibility.

➤ Automatic Sharding
➤ Data is automatically split into regions, and each region is served by a RegionServer for load balancing.

➤ Strong Consistency
➤ Provides strong consistency on reads and writes per row.
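
A brief Scala sketch of the random read/write access described above, using the standard HBase client API. The students table and the info column family are hypothetical, and the Zookeeper quorum is assumed to be supplied via hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo extends App {
  val conf = HBaseConfiguration.create()            // reads hbase-site.xml from the classpath
  val connection = ConnectionFactory.createConnection(conf)
  val table = connection.getTable(TableName.valueOf("students"))   // hypothetical table with column family "info"

  // Write one cell: row key, column family, qualifier, value
  val put = new Put(Bytes.toBytes("row1"))
  put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
  table.put(put)

  // Random read of the same row (real-time access, unlike batch-oriented HDFS reads)
  val result = table.get(new Get(Bytes.toBytes("row1")))
  println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

  table.close()
  connection.close()
}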

Differences Between HBase and RDBMS

→ Data Model: HBase is column-family based (NoSQL); an RDBMS is table-based with rows and columns (relational).
→ Schema: HBase has a flexible schema with dynamic columns; an RDBMS has a fixed schema with predefined columns.
→ Query Language: HBase has no SQL and uses Java APIs or REST/Thrift; an RDBMS uses SQL (Structured Query Language).
→ Transactions: HBase has no multi-row ACID transactions; an RDBMS is fully ACID compliant with transaction support.
→ Scalability: HBase is horizontally scalable; an RDBMS typically scales vertically.
→ Use Case: HBase suits big data, random read/write, and sparse data; an RDBMS suits structured data with relationships and complex queries.
→ Joins Support: HBase does not support joins; an RDBMS does.
→ Indexing: HBase has limited indexing (manual creation); an RDBMS has automatic and optimized indexing.
→ Performance Goal: HBase targets high throughput and low latency for large datasets; an RDBMS targets data integrity and complex queries.

Ques → Explain HiveQL with examples.

Ans:-) ➤ HiveQL (HQL) is the query language of Apache Hive, designed to manage and query structured data stored in the Hadoop Distributed File System (HDFS).

➤ It is similar to SQL, making it easy for database professionals to analyze big data using familiar syntax.

Basic HiveQL Syntax and Examples

1. Creating a Table

CREATE TABLE students (
  id INT,
  name STRING,
  marks INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

2. Loading Data into a Table

LOAD DATA LOCAL INPATH '/home/user/students.csv' INTO TABLE students;

3. Selecting Data

SELECT * FROM students;

4. Using WHERE Clause

SELECT name, marks FROM students WHERE marks > 70;

5. Grouping and Aggregation

SELECT COUNT(*) FROM students;
SELECT AVG(marks) FROM students;

6. Group By

SELECT marks, COUNT(*) FROM students GROUP BY marks;

7. Join Example

SELECT s.id, s.name, m.subject
FROM students s
JOIN marks_table m ON (s.id = m.student_id);

HiveQL vs Traditional SQL

→ Execution Engine: HiveQL converts queries to MapReduce/Tez/Spark; traditional SQL executes directly on the RDBMS.
→ Speed: HiveQL is slower due to batch processing; traditional SQL is fast due to indexing and transaction support.
→ Schema Handling: HiveQL uses schema-on-read; traditional SQL uses schema-on-write.

Ques → Discuss Zookeeper in detail.

Ans:-)

➤ Apache Zookeeper is a centralized service used for maintaining configuration information, naming, synchronization, and group services in distributed systems.
➤ It is a coordination service that allows distributed applications to work together by providing reliable and consistent services.

Key Features

➤ High Availability
→ Ensures that configuration and coordination data are always available across nodes.

➤ Consistency
→ All clients see the same view of the system, even during updates.

➤ Reliability
→ Uses replication and logging to ensure data durability and recovery.

➤ Fast Reads
→ Read operations are very fast as they are served from memory.

Architecture

➤ Client-Server Model
→ Clients (applications) interact with the Zookeeper service to read or update data.

➤ Zookeeper Ensemble
→ A group of servers (usually odd in number, like 3, 5, or 7) forms the Zookeeper cluster.
→ One acts as the Leader, and the others are Followers.

➤ ZNodes
→ The data in Zookeeper is stored in a hierarchical tree-like structure, similar to a file system.
→ Each node is called a ZNode, and it can hold data and children.
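
As an illustration of working with ZNodes, here is a minimal Scala sketch using Apache Curator, a commonly used Zookeeper client library (Curator is not mentioned above, so treat the library choice, the connect string, and the ZNode path as assumptions).

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object ZookeeperDemo extends App {
  // The connect string would normally list all ensemble members, e.g. "zk1:2181,zk2:2181,zk3:2181"
  val client = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // Create a ZNode in the hierarchical namespace and store a small piece of configuration data in it
  client.create().creatingParentsIfNeeded().forPath("/app/config", "batch.size=100".getBytes)

  // Read it back; every client connected to the ensemble sees the same value
  println(new String(client.getData.forPath("/app/config")))

  client.close()
}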

Core Services

➤ Naming Service
→ Helps identify nodes in a cluster using unique names (ZNode paths).

➤ Configuration Management
→ Stores configuration files that can be accessed or updated by clients.

➤ Synchronization
→ Helps manage locks, barriers, and other synchronization primitives in distributed applications.

➤ Leader Election
→ Helps elect a master node among a group of nodes, which is critical for systems like Hadoop.

Use Cases in the Hadoop Ecosystem

➤ HBase
→ Uses Zookeeper for region server coordination, failover, and leader election.

➤ Kafka
→ Uses Zookeeper for broker coordination, topic configuration, and consumer offsets.

➤ YARN (Hadoop)
→ Can use Zookeeper for High Availability of the ResourceManager.

Limitations

➤ Not suitable for large-scale data storage.
➤ Designed for coordination and small configuration data, not heavy workloads.

Ques → Discuss the different types of data that can be handled with HIVE.

Ans:-) 1. Structured Data
→ Data that follows a strict schema (rows and columns).
→ Example: Tables from an RDBMS like MySQL, Oracle.
→ Handled efficiently using HiveQL for querying and analysis.

2. Semi-Structured Data
→ Data that does not follow a strict schema but still has some structure.
→ Example: JSON, XML, Avro files.
→ Hive supports parsing and querying semi-structured formats using SerDe (Serializer/Deserializer).

3. Unstructured Data
→ Data without a predefined model or organization.
→ Example: Text files, logs, images, videos.
→ Hive can process and query log/text files using custom input formats or external scripts.

4. Time-Series or Event Data
→ Logs or records with time-stamped events (e.g., clickstreams, sensor data).
→ Suitable for partitioning by date/time fields to improve performance.
→ Common in web analytics and monitoring applications.

5. Flat Files (Delimited Files)
→ CSV, TSV, and other delimited files.
→ Hive can directly load and query such files using simple table definitions.

Ques → Describe schema.

Ans:-) ➤ A schema is the structure or blueprint that defines the organization of data in a dataset or database.
→ It describes the fields, their names, data types, and relationships.

Types of Schema

➤ 1. Structured Schema
→ Predefined schema with a fixed format (e.g., RDBMS).
→ Example: Tables with specific columns like ID (int), Name (string), Age (int).

➤ 2. Semi-Structured Schema
→ Partially defined schema that allows flexibility.
→ Example: JSON, XML.

➤ 3. Schema-on-Write
→ Schema is applied when data is written to storage (e.g., RDBMS, Hive).
→ Data must match the schema upfront.

➤ 4. Schema-on-Read
→ Schema is applied when data is read or queried (common in Hadoop).
→ Example: Hive or Pig accessing raw log files.

Schema in the Hadoop Ecosystem

➤ In Hive, a schema is defined using HiveQL during table creation.
➤ In Pig, a schema is optional but can be defined using the AS clause in LOAD statements.
➤ In HBase, a schema defines column families but allows dynamic columns within them.

Importance of Schema

➤ Helps in data validation and consistency.
➤ Enables efficient querying and data processing.
➤ Assists in data interpretation and metadata management.

Thank you for Watching!!!!

For Notes visit -> http://lepic.mzelo.com
Join Our Telegram Channel !!!