Big Data Hadoop MCQ Question

This document provides a collection of multiple-choice questions and answers on the big data tools Sqoop, Hive, Pig, HBase, MapReduce, and Spark. Sqoop transfers data between relational databases and Hadoop; key functions include full and incremental loads from relational sources into HDFS and exports from HDFS back to relational databases. Hive projects structure onto data stored in HDFS and supports ad-hoc queries in a SQL-like language called HiveQL; covered concepts include metastores, the Beeline interface, data distribution with DISTRIBUTE BY and CLUSTER BY, SerDe properties, and ACID transactions.


==================================================================================

======

Big data Hadoop MCQ Question & Answers - Sqoop 1.4.6

==================================================================================
======

1) What is Sqoop ?

a) Sqoop means SQL to Hadoop

b) Sqoop is an Export & Import Tool

c) Sqoop extracts RDBMS data to Hadoop

d) All the above

2) Explain JDBC driver role in Sqoop

a) JDBC driver is a Java API created by different database vendors

b) JDBC driver is used as an interface to interact with RDBMS

c) JDBC driver is different from Sqoop connector

d) All the above

3) What is the role of Sqoop connector ?

a) Sqoop connectors are optimized to export & import from RDBMS

b) Sqoop connectors are used to execute SQL dialect supported by various databases

c) Sqoop connectors are pluggable piece that is used to fetch metadata

d) All the above

4) When to use --target-dir while importing data ?

a) To store the data in --hive-table

b) To store the data in a specific directory in HDFS

c) To store the data in a specific directory in local

d) None of the above

5) When to use --warehouse-dir while importing data?

a) To store all the data in a parent directory, sqoop jobs use --warehouse-dir

b) To store all the data under a parent directory with the same name as the table

c) To store all the data in an HDFS location

d) All the above

6) How can you import only a subset of rows from a table?

a) By using CONDITIONS parameter

b) By using WHERE clause in the sqoop import statement

c) By using -m1 parameter

d) None of the above
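
For illustration, a minimal sketch of a --where import (the connection string, table, and filter are hypothetical):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user \
      --table orders \
      --where "order_date >= '2020-01-01'" \
      --target-dir /user/hadoop/orders_subset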

7) Is it possible to import data from two or more tables using Sqoop ?

a) JDBC driver directly pulls multiple tables

b) Not possible to combine multiple tables

c) SQL Join can be used under --query parameter

d) None of the above
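
A sketch of a free-form --query import joining two hypothetical tables; Sqoop requires the literal $CONDITIONS token in the WHERE clause, plus --split-by when more than one mapper is used:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --query 'SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON o.cust_id = c.id WHERE $CONDITIONS' \
      --split-by o.id \
      --target-dir /user/hadoop/orders_customers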

8) How can we slice the data to be imported to multiple parallel tasks?

a) By using --split-by parameter , data will be imported into multiple chunks to be run in
parallel

b) Parallel tasks are not possible in Sqoop

c) Possible while importing into local directory (non-HDFS)

d) All the above

9) What is meant by full load ?

a) Full Load is not possible

b) Entire data can be imported by using --table parameter

c) Sqoop can use only incremental import

d) None of the above

10) What are the different incremental load or import options available in Sqoop ?

a) Incremental append
b) Incremental last modified

c) Both a & b

d) None of the above
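
A minimal sketch of both modes (connection and column names are hypothetical):

    # append: import rows whose id is greater than the saved last-value
    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --incremental append --check-column id --last-value 1000

    # lastmodified: import rows changed after the saved timestamp
    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --incremental lastmodified --check-column updated_at \
      --last-value "2020-01-01 00:00:00"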

11) In which scenario --direct option should be avoided ?

a) While importing to binary data formats like SequenceFile

b) While importing to ORC file formats

c) While importing to Parquet file format with compression codec

d) All the above

12) What is an --options-file & When it is used ?

a) An options-file is a text file where each line identifies an option in the order that it
appears

b) An options-file is used for repetitive execution of the same script

c) An options-file is used to specify all the command line values in a file

d) All the above
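
A sketch of the mechanics (file name and contents are hypothetical); each option sits on its own line, and remaining arguments can still be passed on the command line:

    # contents of import_opts.txt
    import
    --connect
    jdbc:mysql://dbhost/sales
    --username
    etl_user

    # reuse the file for different tables
    sqoop --options-file import_opts.txt --table orders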

13) How do you run a Sqoop Job ?

a) sqoop job --create <job name>

b) sqoop job --show <job name>

c) sqoop job --exec <job name>

d) None of the above
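
A sketch of the saved-job lifecycle with a hypothetical job name:

    sqoop job --create daily_orders -- import \
      --connect jdbc:mysql://dbhost/sales --table orders \
      --incremental append --check-column id --last-value 0
    sqoop job --list                  # view saved jobs
    sqoop job --show daily_orders     # view a job's definition
    sqoop job --exec daily_orders     # run the saved job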

14) Explain the use of --boundary-query

a) To improve performance of the --split-by having Min & Max values

b) To fetch minimum and maximum values of the --split-by column using the --boundary-
query parameter

c) It returns exactly one row with exactly two columns: the first column holds the lower bound and the second column the upper bound

d) All the above
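
A sketch with a hypothetical table; the supplied query must return the two bounding values for the --split-by column:

    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --split-by id \
      --boundary-query "SELECT MIN(id), MAX(id) FROM orders WHERE id > 0"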

15) Name the file formats supported by Sqoop


a) Avro

b) Parquet

c) ORC

d) SequenceFile

e) All the above

16) What is the default compression-codec in Sqoop

a) Snappy

b) Gzip

c) Lzo

d) None of the above

17) The Sqoop job option stores the last-value in a file

a) True

b) False

18) Is it possible to view the list of saved Sqoop Jobs ?

a) By using sqoop job --list

b) Not possible

c) Only the last & recently executed jobs

d) None of the above

19) Can Sqoop handle character large objects (CLOB) & binary large objects (BLOB) ?

a) Yes

b) No

20) Is it possible to import into HCatalog tables directly ?

a) Yes

b) No

21) Is it possible to import a complete database ?


a) By using import-all-tables

b) Not possible

22) What option is used to check the data received in HDFS against the RDBMS ?

a) By using --validate parameter

b) Not possible

23) What are the common delimiters and escape character in sqoop ?

a) Field delimiter, a comma (,)

b) Line delimiter, a newline (\n)

c) Escape character, a backslash (\)

d) All the above

24) Is it possible to transfer HDFS data into RDBMS table ?

a) No , Not possible

b) Yes, By using export parameter
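
A minimal sketch of an export (connection, table, and directory are hypothetical); the RDBMS table must already exist:

    sqoop export --connect jdbc:mysql://dbhost/sales \
      --table orders_summary \
      --export-dir /user/hadoop/orders_summary \
      --input-fields-terminated-by ','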

25) What is the advantage of --batch option ?

a) Improves performance

b) Used mostly in export to RDBMS for grouping data

c) To work with multiple JDBC statements together

d) All the above

26) Identify the wrong statement

a) Sqoop generates input-split

b) Sqoop can use only Map task

c) Sqoop output is generated by Reducer task

d) All the above

27) Identify the wrong statement

a) Sqoop uses multiple mapper

b) Sqoop uses the default HDFS directory to store data

c) Sqoop maintains a metastore

d) None of the above

28) Identify the correct statement

a) Sqoop can import CSV file

b) Sqoop can import JSON file

c) Sqoop can import RDBMS data

d) All the above

29) Identify the correct statement

a) Sqoop can update the data using last-modified option in incremental append

b) Sqoop can use incremental append using last-change option

c) Sqoop can use incremental append only on date column basis

d) None of the above

30) Identify the wrong statement

a) Sqoop can directly import data into a hive-table

b) Sqoop can directly import data into an hbase-table

c) Sqoop can directly import data into hdfs path

d) Sqoop can directly import data into compressed format

==================================================================================
===========================================================================

Answer to Sqoop MCQ : -

----------------------

1) d 2) d 3) d 4) b 5) d 6) b 7) c 8) a 9) b 10) c

11) a 12) d 13) c 14) d 15) e 16) b 17) a 18) a 19) a 20) a
21) a 22) a 23) d 24) b 25) d 26) c 27) b 28) c 29) a 30) d

===========================================================================<--- END
of Sqoop MCQ --->========================================================

==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
===========================================================================

==================================================================================
======

Big data Hadoop MCQ Question & Answers - Hive v 1.2.1

==================================================================================
======

1) What is Hive ?

a) Hive is Data Warehouse software built on top of Hadoop

b) Hive is an ETL tool

c) Hive stores structured data in table format in RDBMS

d) None of the above

2) Name the various metastores available in Hive ?

a) Embedded

b) Local

c) Remote

d) All the above

3) What is Beeline ?

a) Beeline is an ETL tool

b) Beeline is the command-line interface of HiveServer2

c) Beeline is a database

d) All the above

4) When do you use the DISTRIBUTE BY clause ?

a) To summarize all data collectively

b) To store data outside HDFS

c) To distribute data on a specific key (column) basis to the same reducer

d) None of the above

5) When do you use the CLUSTER BY clause ?

a) Cluster by is a combination of Sort by & Distribute by

b) Cluster by first sorts the rows on multiple reducers

c) Cluster by then distributes the data on a specific key basis to the same reducer

d) All the above
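
A HiveQL sketch (table and column names hypothetical); the two statements below are equivalent:

    -- distribute rows to reducers by store_id, sorting within each reducer
    SELECT * FROM sales DISTRIBUTE BY store_id SORT BY store_id;

    -- shorthand for the same thing
    SELECT * FROM sales CLUSTER BY store_id;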

6) What is a SERDE property in Hive ?

a) SERDE means Serialization and Deserialization

b) Hive uses SerDe interface for Input & Output

c) SerDe reads Hive table and writes back data to HDFS in any of the custom format

d) All the above

7) What is ACID property and usage in Hive ?

a) ACID stands for Atomicity, Consistency, Isolation and Durability

b) ACID provides various row-level transactions - insert, update, delete

c) To enable the transaction feature on a hive table, TBLPROPERTIES 'transactional'='true' must be given

d) All the above
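
A sketch of an ACID-enabled table definition; in this Hive version transactional tables must also be bucketed and stored as ORC (names are hypothetical):

    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');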

8) Is it possible to execute UPDATE & DELETE Command in Hive ?

a) Yes , only by using ORC format

b) No

9) What is the use of Explain command ?

a) Explain command provides execution plan for a HQL query

b) Query will be converted into a sequence of Map & Reduce stages

c) Plan displays dependencies between stages with metadata associated

d) All the above

10) What is the difference between Partitioning & Bucketing in Hive ?

a) Both are intended for performance enhancement

b) Partition creates a folder whereas Bucket uses a hash function and stores data in a file

c) Bucket is stored as a file; partitioning will be slow on high-cardinality columns

d) All the above

11) Is it possible to drop database and also tables under it automatically ? Why ?

a) No. Hive database cannot be dropped when it is not empty - when tables are available

b) Hive Database is by default in RESTRICT mode

c) To forcefully delete the database along with tables, use the CASCADE option: DROP DATABASE <database name> CASCADE

d) All the above

12) Where Hive schema is stored ?

a) Hive stores the schema (table names, column names, data types) in the Hive Metastore

b) In case of an Embedded Metastore, the schema is stored in the default Derby database

c) In case of a Local/Remote Metastore, the schema is stored in an external RDBMS

d) All the above

13) What is meant by Remote metastore in Hive ?

a) Remote metastore runs independently in a separate JVM process

b) Remote metastore integrates with the Hive server running in a separate JVM

c) Remote metastore uses the JDBC driver & Thrift network service to connect with the metastore_DB in an external RDBMS

d) All the above


14) What is the use of vectorization ?

a) Vectorization improves performance and also reduces CPU utilization drastically

b) Instead of processing one row at a time, process the batch of rows (1024 rows)

c) Converts the column data into vectorized Array

d) All the above

15) Name the complex data types in Hive ?

a) Struct

b) Map

c) Array

d) All the above

16) What is the use of MapJoin in Hive & how it is implemented?

a) Improves performance

b) Smaller table will be loaded into memory to join with the data in each DataNode

c) Reducer will not be running as it is set to zero. Map output will be generated

d) All the above

17) Identify the correct statement of Metastore

a) Metastore is stored in HDFS as file

b) Metastore is stored in Hive as table

c) Metastore is a repository where hive metadata/schema is stored in a database

d) None of the above

18) Identify the correct statement about where Hive stores data

a) Hive data is stored in an RDBMS by default

b) Hive data is stored in an HDFS location by default

c) Hive data is stored in a local location by default

d) All the above

19) Is a multiline query possible in Hive ?


a) True

b) False

20) Is it possible to create VIEW in Hive ?

a) True

b) False

21) Identify the correct statement of Hive

a) Default field delimiter is '\001'

b) Default line delimiter is '\n'

c) Default file format is TextInputFormat

d) All the above

22) Identify the wrong statement of Hive

a) Hive is schema-on-write

b) Hive data storage location is '/hive' by default

c) Hive data is stored in local location by default

d) All the above

23) The Name 'Hive' is derived from the term ...

a) BeeHive

b) HiveBee

c) HoneyBee

d) None of the above

25) Hive was initially developed by

a) Google

b) Apache

c) Facebook

d) None of the above


26) Hive supported file formats are ...

a) Parquet

b) ORC

c) Avro

d) All the above

27) Identify the correct statement of ORDER BY command

a) Order by sorts all data together

b) Order by uses only one reducer

c) Order by is time consuming

d) All the above

28) Identify the wrong statement of SORT BY command

a) Sort by uses more than one reducer

b) Sort by is fast compared to order by

c) Sort by indexes the rows first then the columns

d) None of the above

29) Alter table command is used to change table properties ?

a) True

b) False

30) Alter partition command is used to change partition table properties

a) True

b) False

31) Is it possible to index hive tables ?

a) No

b) Yes

32) Enabling 'non-strict' mode helps to utilize dynamic partitions


a) True

b) False

33) Dynamic partition helps improving the query performance

a) True

b) False

34) Hive table displays NULL values when there is an incorrect data type

a) True

b) False

35) Identify the wrong statement about MANAGED tables

a) The default table in Hive is a Managed table

b) External applications cannot access a Managed table

c) When a Managed table is dropped, data & schema also get deleted

d) Managed table data location cannot be changed

36) Identify the correct statement of External table

a) When External table is dropped, schema will be dropped, data remains intact

b) External application can access External table

c) External table data location can be customized

d) All the above

37) Is a Hive index stored in a file format ?

a) True

b) False

38) When will you use 'explode' option in Hive ?

a) Used to extract individual elements of an Array data type in complex data

b) Used with the LATERAL VIEW option to extract data

c) Used as a Built-in Table-Generating Function (UDTF) to separate elements as a table


d) All the above
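
A HiveQL sketch where items is a hypothetical array<string> column; explode emits one row per array element:

    SELECT id, item
    FROM orders
    LATERAL VIEW explode(items) item_table AS item;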

39) Identify the correct statement of the 'insert' option

a) Insert into table ... appends data into an existing table

b) Insert overwrite table ... replaces the current data in the table

c) Insert overwrite directory '/PATH/' writes the data into a file

d) All the above

40) Identify the correct statement of Hive Architecture ..

a) Metastore acts as an interface to integrate Hive table, Schema & HDFS data

b) JDBC driver receives HQL queries

c) Compiler parses the HQL query and Optimizer provides the workflow

d) Executor runs the task as Map & Reduce

e) All the above

41) Does Hive support Online-Transaction-Processing (OLTP) systems ?

a) Yes

b) No

42) Is it possible to run Hive MapReduce in Non-HDFS mode ?

a) No

b) yes, by enabling set [Link]=TRUE

43) Is there any alternative to MapReduce process in Hive ?

a) No

b) Yes, by enabling set [Link]=tez;

44) What are the various ways Hive processes data ?

a) Table

b) Partition

c) Bucketing
d) All the above

45) Name the various performance tuning options in Hive ?

a) Partitioning & Bucketing

b) Distribute by & Cluster by

c) Vectorization & Cost Based Optimization

d) File formats ORC, Parquet & Avro and compression Snappy, Gzip

e) All the above

46) What is the advantage of HCatalog ?

a) HCatalog is used to share Hive data structures with external applications

b) HCatalog is not compatible with Hive

c) HCatalog improves performance

d) None of the above

47) Is it possible to load data into VIEW ?

a) No

b) Yes

48) What is the use of VIEW in Hive ?

a) A view is a logical entity; it does not store data physically

b) A view helps to implement complex queries in a reusable form

c) View allows query to be saved & treated like a table

d) All the above

49) How are security & authorization enforced in Hive ?

a) Grant & Revoke option to users

b) Sentry in Cloudera

c) Ranger in Hortonworks

d) All the above


50) What is the default SerDe format in Hive ?

a) AvroSerDe

b) LazySimpleSerDe

c) CSVSerDe

d) ParquetSerDE

==================================================================================
===========================================================================

Answer to Hive MCQ : -

----------------------

1) a 2) d 3) b 4) c 5) d 6) d 7) d 8) a 9) d 10) d

11) d 12) d 13) d 14) d 15) d 16) d 17) c 18) b 19) a 20) a

21) d 22) d 23) a 24) a 25) c 26) d 27) d 28) d 29) a 30) a

31) b 32) a 33) a 34) a 35) d 36) d 37) b 38) d 39) d 40) e

41) b 42) b 43) b 44) d 45) e 46) a 47) a 48) d 49) d 50) b

===========================================================================<--- END
of Hive MCQ --->========================================================

==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
===========================================================================
==================================================================================
======

Apache PIG MCQ Question & Answers

==================================================================================
======

Pig Script - MCQ Question - 1 on Pig Introduction

1. Pig operates in mainly how many modes?

a) Two

b) Three

c) Four

d) Five

2. Point out the correct statement.

a) You can run Pig in either mode using the “pig” command

b) You can run Pig in batch mode using the Grunt shell

c) You can run Pig in interactive mode using the FS shell

d) None of the mentioned

3. You can run Pig in batch mode using __________

a) Pig shell command

b) Pig scripts

c) Pig options

d) All of the mentioned

4. Pig Latin statements are generally organized in one of the following ways?
a) A LOAD statement to read data from the file system

b) A series of “transformation” statements to process the data

c) A DUMP statement to view results or a STORE statement to save the results

d) All of the mentioned
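
A minimal Pig Latin sketch of that organization (path and schema are hypothetical):

    raw = LOAD '/data/access_log' USING PigStorage('\t')
          AS (user:chararray, bytes:int);   -- read data from the file system
    big = FILTER raw BY bytes > 1024;       -- a "transformation" statement
    DUMP big;                               -- or: STORE big INTO '/data/big_requests';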

5. Point out the wrong statement.

a) To run Pig in local mode, you need access to a single machine

b) The DISPLAY operator will display the results to your terminal screen

c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation

d) All of the mentioned

6. Which of the following function is used to read data in PIG?

a) WRITE

b) READ

c) LOAD

d) None of the mentioned

7. You can run Pig in interactive mode using the ______ shell.

a) Grunt

b) FS

c) HDFS

d) None of the mentioned

8. Which of the following is the default mode?

a) Mapreduce

b) Tez

c) Local
d) All of the mentioned

9. Which of the following will run pig in local mode?

a) $ pig -x local …

b) $ pig -x tez_local …

c) $ pig …

d) None of the mentioned

10.$ pig -x tez_local … will enable ________ mode in Pig.

a) Mapreduce

b) Tez

c) Local

d) None of the mentioned

==================================================================================
===========================================================================

Answer to PIG Introduction 1 : -

---------------------------------

1) a 2) a 3) b 4) d 5) b 6) c 7) a 8) a 9) a 10) d

==================================================================================
=================================================

Pig Script MCQ Question -2 on Pig Latin

1._________ operator is used to review the schema of a relation.


a) DUMP

b) DESCRIBE

c) STORE

d) EXPLAIN

2. Point out the correct statement.

a) During the testing phase of your implementation, you can use LOAD to display results to your
terminal screen

b) You can view outer relations as well as relations defined in a nested FOREACH statement

c) Hadoop properties are interpreted by Pig

d) None of the mentioned

3. Which of the following operator is used to view the map reduce execution plans?

a) DUMP

b) DESCRIBE

c) STORE

d) EXPLAIN

4. ___________ operator is used to view the step-by-step execution of a series of statements.

a) ILLUSTRATE

b) DESCRIBE

c) STORE

d) EXPLAIN

5. Point out the wrong statement.

a) ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin
statements

b) ILLUSTRATE is based on an example generator


c) Several new private classes make it harder for external tools such as Oozie to integrate with Pig
statistics

d) None of the mentioned

6. __________ is a framework for collecting and storing script-level statistics for Pig Latin.

a) Pig Stats

b) PStatistics

c) Pig Statistics

d) None of the mentioned

7. The ________ class mimics the behavior of the Main class but gives users a statistics object back.

a) PigRun

b) PigRunner

c) RunnerPig

d) None of the mentioned

8. ___________ is a simple xUnit framework that enables you to easily test your Pig scripts.

a) PigUnit

b) PigXUnit

c) PigUnitX

d) All of the mentioned

9. Which of the following will compile the Pigunit?

a) $pig_trunk ant pigunit-jar

b) $pig_tr ant pigunit-jar

c) $pig_ ant pigunit-jar

d) None of the mentioned


10. PigUnit runs in Pig’s _______ mode by default.

a) local

b) tez

c) mapreduce

d) none of the mentioned

==================================================================================
===========================================================================

Answer to PIG Latin 2 : -

---------------------------------

1) b 2) b 3) d 4) a 5) c 6) c 7) b 8) b 9) a 10) a

==================================================================================
=================================================


==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
==========================================================================

==================================================================================
======

Big data Hadoop MCQ Question & Answers - HBASE v 0.98

==================================================================================
======

1) What is HBase ?

a) HBase is Hadoop Database built on top of HDFS

b) HBase is a NoSQL DB horizontally scalable

c) HBase is Column-Oriented , Key-Value store DB

d) All the above


2) What are the Key components of HBase ?

a) Region Server

b) HMaster

c) Zookeeper

d) All the above

3) What are the advantages of HBase ?

a) HBase is consistent and provides fast row-lookup to access data from large hash tables on top of HDFS

b) HBase provides low-latency random access to real-time read and write

c) HBase handles high volume data, Schema less, tables are replicated for failover

d) HBase is flexible which stores structured, unstructured or semi-structured data

e) All the above

4) How does HBase satisfy the CAP theorem ?

a) Availability & Consistency

b) Consistency & Partition Tolerance

c) Availability & Partition Tolerance

d) None of the above

5) What is meant by NoSQL DB ?

a) NoSQL means Not only SQL

b) NoSQL databases store data in a format like RDBMS tables

c) NoSQL databases can store relationship data

d) All the above

6) How is a Columnar Database different from an RDBMS ?

a) RDBMS stores data by row which is suited for OLTP

b) Columnar DB stores data by column based which is suited for OLAP

c) Columnar DB is fast in accessing & processing data


d) All the above

7) What is the use of Column-Family in HBase ?

a) In Hbase, Column-family is used to group columns

b) Column-family wise data is stored in HFile permanently in disk

c) In case data is not going to be retrieved frequently, a separate Column-Family can be created

d) All the above

8) What is the role of Zookeeper in HBASE ?

a) HBase uses ZooKeeper as a distributed coordinated service to maintain server state

b) ZooKeeper shares the information of Region servers and the active state of HMaster

c) HBase client gets the META table information from Region Server with the help of
ZooKeeper

d) All the above

9) What is the role of WAL ?

a) HBase first writes the data to the WAL - write-ahead log

b) In case of any failure of a Region, HMaster uses the WAL to recover data and restore the same in the Region

c) WAL is a log on disk that sequentially appends data

d) All the above

10) What is the role of memstore in HBase ?

a) HBase uses the Memstore as an in-memory Key-Value store to update data

b) Each Column-family will create a separate memstore

c) Once the threshold exceeds, data will be flushed and stored as HFile or StoreFile in disk

d) All the above

11) What is Bloom filter in HBase and explain its advantages ?

a) Bloom filters are a probabilistic dataset to find whether an element is "there (false
positive) or not (false negative)"
b) Bloom Filter maintains an in-memory reference structure by which it seeks an element
(Row) in a specific StoreFile(HFile)

c) Bloom filter is memory efficient, reducing disk reads, which improves performance

d) All the above

12) What are Regions & Region Server ?

a) HBase table is divided into one or more Regions, Default size is 1 GB

b) A Region contains a contiguous, sorted range of rows between a start key and an end key

c) A Region Server hosts a collection of Regions (tables) - approximately up to 1,000 regions per Region Server

d) All the above

13) Identify the wrong statement of HMaster ?

a) Whenever a client sends a read/write request (DDL/DML), HMaster receives the request
and forwards it to the corresponding Region server

b) HBase has One Active HMaster and multiple backup HMasters

c) In case of Region Server failure, HMaster organizes the data recovery from the WAL

d) All the above

14) Name few filters available in HBase ?

a) ValueFilter

b) PageFilter

c) RowFilter

d) All the above

15) Identify the correct statement of Versions

a) Versions is nothing but maintaining the history of a column, with a max of 5

b) HBase sorts the versions of a cell from newest to oldest (descending order)

c) Version contains a value and a timestamp

d) All the above


16) What are various Encoding used in HBase ?

a) Diff

b) Prefix

c) Fast_Diff

d) All the above

17) What are the data types available in HBase ?

a) HBase data types are String, Int, Long

b) HBase data type is byte array

c) Both a & b

d) All the above

18) Name the Scripting language used in HBase ?

a) HBase shell uses Java scripting language to manipulate data

b) HBase shell uses Ruby Script to manipulate data

c) HBase shell uses SQL script to manipulate data

d) None of the above

19) Is it possible to use SQL queries in HBase ?

a) No

b) Yes, by installing Phoenix software tool

20) Identify the correct statement of HBase origin

a) HBase is implemented from Google's Big Table

b) HBase was initially developed by Powerset

c) The Apache Software Foundation released HBase in 2010

d) All the above

21) What is Tombstone in HBase & How it is used ?

a) When Tombstone marker is set, deleted cells will become invisible

b) HBase deleted cells are actually removed during compactions.


c) Three types of Tombstones, namely: Family Delete Marker, Version Delete Marker, Column Delete Marker

d) All the above

22) Identify the correct statement of HBase storage

a) HBase uses the Key-Value store model.

b) In Key-Value, Row is Key, Value is collection of Column families

c) HBase scans the rows by using start row key and end row key

d) All the above

23) Identify the wrong statement of HBase Schema

a) HBase has a structured schema like an RDBMS

b) HBase is Schema-less and has no fixed schema

c) HBase Schema is pre-defined

d) All the above

24) What is correct about Wide Column Store ?

a) HBase does not support Wide Column store

b) HBase table provides Column family which groups collection of columns for data storage
based on row values on different cluster nodes

c) HBase is good for highly voluminous data

d) All the above

25) What is Sharding and how it implemented in HBase ?

a) Sharding is a mechanism of distributing or partitioning data across multiple systems

b) When HBase table grows high, Sharding option dynamically distributes it across cluster

c) HBase splitting and serving Regions from one Region Server to other Region Servers is
considered as auto sharding

d) All the above

26) What is the role of META-Table in HBase ?

a) META table provides mapping list of Regions belong to Region Servers


b) META table is used to find the Region for a given Table based on Row Key

c) META table is maintained as B-Tree with Key id and Value

d) All the above

27) What is the role of Row-Key in HBase

a) In HBase table, each row has a unique identifier known as Row-Key.

b) Row-Key is used for grouping cells logically and it ensures that all cells that have the same
Row-Keys are co-located on the same server

c) Row-Key is internally stored as a byte array.

d) All the above

28) What is the role of HFile ?

a) HFile contains a multi-layered index which allows HBase to seek to the data without
having to read the whole file

b) Key value pairs are stored in increasing order

c) HFile is loaded in the Block cache and indexes point by row key to the key-value data in 64KB “blocks”

d) All the above

29) What is meant by Column Qualifier ?

a) HBase always uses qualified columns

b) HBase Columns with pre-defined data types

c) Columns defined under Column family is called as Column Qualifier

d) None of the above

30) How replication is maintained in HBase ?

a) HDFS replicates the WAL and HFile blocks

b) HMaster replicates WAL & HFile

c) Region server replicates WAL & HFile

d) None of the above

31) What is the role of load balancer in HBase ?


a) In HBase,load balancer decides about the placement and movement of Regions across
RegionServers in a distributed manner

b) Load Balancer ensures that the region replicas are not co-hosted in the same Region
Servers and also in the same rack

c) Cluster wide load balancing will occur only when there are no Regions in transition and
according to a fixed period of a time using balanceCluster(Map).

d) All the above

32) What is the difference between HBase & HDFS

a) HDFS doesn’t provide fast lookup of records in a file,

b) HBase provides fast row-level lookup for a large table

c) HBase is random read & write whereas HDFS is sequential

d) All the above

33) What is the difference between HBase & Hive ?

a) Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key-
value store that runs on top of Hadoop.

b) Hive helps SQL savvy people to run MapReduce jobs where as HBase supports 4 primary
operations-put, get, scan and delete.

c) HBase is ideal for real-time querying of big data whereas Hive is an ideal choice for
analytical querying of data collected over point-in-time

d) All the above

34) Explain about the different catalog tables in HBase?

a) The two important catalog tables in HBase are ROOT and META.

b) ROOT table tracks where the META table is available

c) META table stores all the regions in the Region Server

d) All the above

35) What is MSLAB?

a) MSLAB stands for Memstore-Local Allocation Buffer.

b) MSLABs are buffers of fixed sizes containing Key-Value instances

c) MSLABs were introduced to overcome heap fragmentation issues in HBase


d) All the above

36) What is REST?

a) Rest stands for Representational State Transfer

b) REST defines the semantics so that the protocol can be used in a generic way to address
remote resources.

c) REST also provides support for different message formats, offering many choices for a
client application to communicate with the server.

d) All the above

37) What is TTL (Time to live) in Hbase?

a) TTL is a data retention technique

b) TTL uses the version of a cell which can be preserved till a specific time period.

c) TTL uses the timestamp; once it is reached, the specific version will be removed.

d) All the above

38) Is it possible to store HFile in different file formats other than HBase ?

a) No

b) Yes

39) What are the various compression formats HBase supports ?

a) Snappy

b) LZO

c) Gzip

d) All the above

40) What are the security & authentication options available in HBase

a) HBase implements User Authorization to grant users permissions for particular actions on
a specified set of data

b) HBase authorization uses RPC level which is based on the Simple Authentication and
Security Layer (SASL)

c) Kerberos is used as a network authentication protocol which provides secret-key cryptography
d) All the above

41) How HBase handles fail-over ?

a) When system is crashed, data in MemStore will be lost. HFile will be missing the lost data

b) WAL records all the changes and stores it as File system in disk

c) HMaster uses WAL for fail-over and retrieves data and restore it in Region

d) All the above

42) Is it possible to create an HBase table without a column family ?

a) No

b) Yes

43) What is the BlockCache ?

a) The “Block” in BlockCache is the smallest unit of data that HBase reads from HFile disk in a
single pass to maintain frequently used data

b) Block cache is used to keep most recently used data in JVM Heap apart from MemStore

c) HBase requires only looking up that block’s location in the index and retrieving it to avoid
disk read

d) All the above

44) Narrate the process of HBase reading data from different places & validation process before
returning the value ?

a) First, HBase checks the MemStore for any pending modifications. HFile contains a snapshot of the MemStore at the point it exceeds the threshold

b) Secondly,checks the BlockCache to see if the block containing this read/write has been
recently accessed.

c) Finally, the relevant HFiles on disk are accessed. HBase access all HFiles that is required for
the row in order to complete record

d) All the above

45) What is the Impact of "Turn off WAL on Puts" or What is the Deferred Log Flush in HBase ?

a) Puts store the data without the Write-Ahead-Log (WriteToWal=False), i.e. skipping HLog edits


b) In case of Deferred Log Flush, WAL edits are maintained only in the MemStore, but it improves performance due to reduced I/O

c) But incase of Region Server failure, data will be lost

d) All the above

46) What is Auto Flush in HBase ?

a) When the MemStore exceeds the threshold, data will be flushed to HFile on disk. By default AutoFlush is True

b) To improve performance, AutoFlush is set to False to enable Puts to store all the data in HTable one at a time to the Region Server

c) To explicitly flush the data, FlushCommits is used to store data in HTable

d) All the above

47) Where HBase permanently stores data ?

a) HBase directly stores in HDFS

b) HBase stores the data permanently in Store File or HFile on HDFS

c) HBase stores data in mem-store

d) None of the above

48) What is compaction and list its advantages ?

a) HBase automatically combines smaller HFiles and rewrites them into one HFile

b) Compaction reduces the number of storage files and I/O seek

c) Compaction improves performance

d) All the above

==================================================================================
===========================================================================

Answer to HBase MCQ : -

----------------------

1) d 2) d 3) e 4) b 5) a 6) d 7) d 8) d 9) d 10) d
11) d 12) d 13) d 14) d 15) d 16) d 17) b 18) b 19) b 20) d

21) d 22) d 23) a 24) b 25) d 26) d 27) d 28) d 29) c 30) b

31) d 32) d 33) d 34) d 35) d 36) d 37) d 38) d 39) d 40) c

41) d 42) a 43) d 44) d 45) d 46) d 47) b 48) d

===========================================================================<--- END
of HBase MCQ ---> =======================================================

==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
===========================================================================

==================================================================================
==================================================================================
==========

Big data Hadoop MCQ Question & Answers - HBASE v 0.98

==================================================================================
==================================================================================
==========

HBase - 1

1. HBase is a distributed ________ database built on top of the Hadoop file system.

a) Column-oriented

b) Row-oriented

c) Tuple-oriented

d) None of the mentioned


2. Point out the correct statement.

a) HDFS provides low latency access to single rows from billions of records (Random access)

b) HBase sits on top of the Hadoop File System and provides read and write access

c) HBase is a distributed file system suitable for storing large files

d) None of the mentioned

3. HBase is ________ defines only column families.

a) Row Oriented

b) Schema-less

c) Fixed Schema

d) All of the mentioned

4. Apache HBase is a non-relational database modeled after Google’s _________

a) BigTop

b) Bigtable

c) Scanner

d) FoundationDB

5. Point out the wrong statement.

a) HBase provides only sequential access to data

b) HBase provides high latency batch processing

c) HBase internally provides serialized access

d) All of the mentioned


6. The _________ Server assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.

a) Region

b) Master

c) Zookeeper

d) All of the mentioned

7. Which of the following command provides information about the user?

a) status

b) version

c) whoami

d) user

8. Which of the following command does not operate on tables?

a) enabled

b) disabled

c) drop

d) all of the mentioned

9. _________ command fetches the contents of a row or a cell.

a) select

b) get

c) put

d) none of the mentioned
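
A short hbase shell sketch of put and get (table, row key, and column are hypothetical):

    hbase> put 'users', 'row1', 'cf:name', 'Alice'     # write one cell
    hbase> get 'users', 'row1'                         # fetch the whole row
    hbase> get 'users', 'row1', {COLUMN => 'cf:name'}  # fetch a single cell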

10. HBaseAdmin and ____________ are the two important classes in this package that provide DDL
functionalities.

a) HTableDescriptor
b) HDescriptor

c) HTable

d) HTabDescriptor

==================================================================================
==================================================================================
=========

Answer to HBase MCQ : - 1

----------------------

1) a 2) b 3) b 4) b 5) c 6) b 7) c 8) b 9) b 10) a

==================================================================================
==================================================================================
==============

HBase - 2

1. The minimum number of row versions to keep is configured per column family via _____________

a) HBaseDecriptor

b) HTabDescriptor

c) HColumnDescriptor

d) All of the mentioned

2. Point out the correct statement.

a) The default for max versions is 1

b) It is recommended setting the number of max versions to an exceedingly high level

c) HBase does overwrite row values


d) All of the mentioned

3. HBase supports a ____________ interface via Put and Result.

a) “bytes-in/bytes-out”

b) “bytes-in”

c) “bytes-out”

d) none of the mentioned

4. One supported data type that deserves special mention are ____________

a) money

b) counters

c) smallint

d) tinyint

5. Point out the wrong statement.

a) Where time-ranges are very wide (e.g., year-long report) and where the data is
voluminous, summary tables are a common approach

b) Coprocessors act like RDBMS triggers

c) HBase does not currently support ‘constraints’ in traditional (SQL) database parlance

d) None of the mentioned

6. The _________ suffers from the monotonically increasing rowkey problem.

a) rowkey

b) columnkey

c) counterkey

d) all of the mentioned


7. __________ does re-write data and pack rows into columns for certain time-periods.

a) OpenTS

b) OpenTSDB

c) OpenTSD

d) OpenDB

8. Which command is used to disable all the tables matching the given regex?

a) remove all

b) drop all

c) disable_all

d) all of the mentioned

9. __________ command disables drops and recreates a table.

a) drop

b) truncate

c) delete

d) none of the mentioned

10. Correct and valid syntax for count command is ____________

a) count ‘<row number>’

b) count ‘<table name>’

c) count ‘<column name>’

d) none of the mentioned


==================================================================================
==================================================================================
=========

Answer to HBase MCQ : - 2

----------------------

1) c 2) a 3) a 4) b 5) c 6) a 7) b 8) c 9) b 10) b

==================================================================================
==================================================================================
==============

HBase - 3

1. _______ can change the maximum number of cells of a column family.

a) set

b) reset

c) alter

d) select

2. Point out the correct statement.

a) You can add a column family to a table using the method addColumn()

b) Using alter, you can also create a column family

c) Using disable-all, you can truncate a column family

d) None of the mentioned

3. Which of the following is not a table scope operator?

a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE

c) MAX_FILESIZE

d) All of the mentioned

4. You can delete a column family from a table using the method _________ of HBAseAdmin class.

a) delColumn()

b) removeColumn()

c) deleteColumn()

d) all of the mentioned

5. Point out the wrong statement.

a) To read data from an HBase table, use the get() method of the HTable class

b) You can retrieve data from the HBase table using the get() method of the HTable class

c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row
ids, or scan an entire table or a subset of rows

d) None of the mentioned

6. __________ class adds HBase configuration files to its object.

a) Configuration

b) Collector

c) Component

d) None of the mentioned

7. The ________ class provides the getValue() method to read the values from its instance.

a) Get

b) Result

c) Put
d) Value
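
A sketch tying Q5-Q7 together against the HBase 0.98 client API (table and column names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            HTable table = new HTable(conf, "users");
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
            table.close();
        }
    }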

8. ________ communicate with the client and handle data-related operations.

a) Master Server

b) Region Server

c) Htable

d) All of the mentioned

9. _________ is the main configuration file of HBase.

a) [Link]

b) [Link]

c) [Link]

d) none of the mentioned

10. HBase uses the _______ File System to store its data.

a) Hive

b) Imphala

c) Hadoop

d) Scala

==================================================================================
==================================================================================
===

Answer to HBase MCQ : - 3

----------------------

1) c 2) a 3) a 4) c 5) d 6) a 7) b 8) b 9) b 10) c
==================================================================================
==================================================================================
===

========================================================================<--- END of
HBase MCQ --->=====================================================================

==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
===========================================================================

==================================================================================
======

Big data Hadoop MCQ Question & Answers - MapReduce v 2.0

==================================================================================
======

1. How many tasks are in a MapReduce Job ?

a) 3

b) 1

c) 2

d) 4

2. What is the process layer(model) of Hadoop ?

a) HDFS

b) MapReduce

c) MapReduce & HDFS

d) All the above

3) What is the role of a custom input format ?

a) Provides a set of rules to run Map & Reduce tasks

b) Provides logical chunk to run input-split

c) Validates configuration file and availability of data

d) All the above


4) What is the role of Record reader ?

a) Reads each line of the input file as the new value and associated key is byte offset

b) Converts the source data as Key-Value pair

c) Provides input(key-value pair) to Mapper

d) All the above

5) What is the role of input-split ?

a) The data generated by Mapper is input-split

b) No of Map task is proportional to no of input-splits

c) It is the logical representation of data in blocks (128 mb)

d) All the above

6) What is the purpose of byte-offset value ?

a) The byte offset is the count of bytes starting at zero

b) Byte offset value is treated as Key while reading each line

c) Byte offset is calculated for each line address

d) All the above

7) What is the role of Job Tracker ?

a) Job Tracker is a service assigns MapReduce Jobs to specific nodes in Cluster

b) Job Tracker is a sub-task of Task Tracker

c) All the above

Note : In Hadoop version 1.0 it is the Job Tracker; in version 2.0, the Resource Manager

8) What is the role of Resource Manager ?

a) Resource Manager is responsible for resource management

b) Resource Manager uses the Scheduler to assign a container for the application

c) Resource Manager is a core component & single point of failure in YARN


d) All the above

9) What are the various Schedulers available in YARN ?

a) Fair

b) Capacity

c) None

d) Both a & b

10) What is the default file format in Mapreduce ?

a) Key-Value Inputformat

b) TextInputFormat

c) Parquet Format

d) None of the above

11) Name the methods available in Map Class

a) Setup

b) Run

c) Map

d) All the above

12) What is Combiner ?

a) Combiner is part of Reduce class

b) Combiner does a mini Reducer activity

c) Combiner groups Map's intermediate results

d) All the above

13) What is role of NodeManager ?

a) Responsible for launching & maintaining containers on Node

b) In version 1 it is called the Task Tracker

c) Regularly communicates with Resource Manager for Node status update


d) All the above

14) Name the methods available in Reducer Class

a) Sort

b) Shuffle

c) Reduce

d) All the above

15) What is distcache or distributed cache ?

a) Maintains all necessary read-only data/text files or archives, jars, etc

b) Copies the files from HDFS to Local disk (non-hdfs) to improve performance

c) Default size for distcache is 10 GB

d) All the above

16) What is the role of AppMaster ?

a) Resides in NodeManager & monitors container

b) Maintains the collection of submitted applications or MR Jobs

c) Responsible for negotiating resources with Resource Manager

d) All the above

17) Where Map process output is being stored ?

a) Stores in HDFS location of the output path specified

b) Intermediate result of map is stored in local disk to avoid replication

c) Stores in HDFS location of the input path specified

d) None of the above

18) What is the difference between block split and input-split ?

a) Block split is physical division of source file created by HDFS

b) Input split is a logical chunk of records created by MapReduce

c) Input split uses information about block location

d) All the above


19) What is container ?

a) Container is a logical entity in NodeManager

b) MR Jobs run in container with specific resources like CPU, RAM

c) AppMaster controls the capacity of the container

d) All the above

20) Name MapReduce Data types

a) IntWritable

b) FloatWritable

c) LongWritable

d) All the above

21) Reducer process starts only after 100% completion of Map process

a) True

b) False

22) How do you monitor MapReduce Job Status ?

a) hadoop job -list

b) hadoop job -status <job id>

c) use WebUI

d) All the above

23) Mapper process reads each row through _________ method.

a) map

b) reduce

c) mapper

d) reducer

24) Which option allows you to control which keys (and hence records) go to which Reducer, by implementing a custom one ?

a) Partitioner

b) OutputSplit

c) Reporter

d) All the above

25) Which option does the MR Job use to ____________ the progress and also set status messages ?

a) Partitioner

b) OutputSplit

c) Reporter

d) All the above

26) The MR Job uses ____________ option to set the number of reducers required

a) [Link](int)

b) [Link](int);

c) [Link](int)

d) All of the mentioned

Note:

-----

JobConf uses : - [Link] api (old version 1.0 , deprecated)

Job uses : - [Link] api (new,version 2.0)
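
A sketch of driver setup with the new (org.apache.hadoop.mapreduce) API; the job name is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word-count");
            job.setNumReduceTasks(2);  // 0 would make it a map-only job
            // mapper/reducer classes and input/output paths would be set here
        }
    }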

27) The MR Job groups Reducer inputs by key in the _________ stage.

a) sort

b) shuffle

c) reduce

d) none of the mentioned

28) The output of the reduce task is typically written to the FileSystem via _____________

a) [Link]
b) [Link]

c) [Link]

d) [Link]

29) Name the default Partitioner in MR Job ?

a) MergePartitioner

b) HashedPartitioner

c) HashPartitioner

d) None of the above

30) Is it possible to set the Reducer to Zero when reduce output is not required ?

a) True

b) False

31) The MR Job uses __________ to launch the application in the Hadoop framework .

a) JobCon

b) Job

c) JobConfiguration

d) All of the above

32) In an MR Job, the ________________ option is used as a threshold for the serialization buffers ?

a) [Link]

b) [Link]

c) [Link]

d) None of the above

33) Which option _________________ in an MR job enables the reuse of the JVM ?

a) [Link]

b) [Link]

c) [Link]

d) None of the above


34) The MR Job creates a local directory and localized cache to store intermediate results

a) True

b) False

35) The MR Job uses _____________ to distribute both jars and native libraries

a) DistributedLog

b) DistributedCache

c) DistributedJars

d) None of the mentioned

36) What is the secure mode ___________ used in Hadoop environment ?

a) SSL

b) Kerberos

c) SSH

d) None of the mentioned

37) ________ is the architectural center of Hadoop that allows multiple data processing engines.

a) YARN

b) Hive

c) Incubator

d) Chuckwa

38) The __________ is a framework-specific entity that negotiates resources with the ResourceManager.

a) NodeManager

b) ResourceManager

c) ApplicationMaster

d) All the above

39) Apache Hadoop YARN stands for _________


a) Yet Another Reserve Negotiator

b) Yet Another Resource Network

c) Yet Another Resource Negotiator

d) All the above

40) The ____________ is the ultimate authority that arbitrates resources among all the
applications in the system.

a) NodeManager

b) ResourceManager

c) ApplicationMaster

d) All of the mentioned

41) The __________ is responsible for allocating resources to the various running applications
subject to familiar constraints of capacities, queues etc.

a) Manager

b) Master

c) Scheduler

d) None of the mentioned

42) What is speculative execution ?

a) only one job

b) parallel job

c) not running a job

d) All the above

43) How do you view MapReduce Log ?

a) YARN Client log (/tmp/yarn_client.out)

b) Job Log (/var/log/hadoop-yarn)

c) Application Master log (application_1437631989065_0009.stdout)

d) All the above

44) How do you kill a running MR Job ?


a) hadoop job -kill <job id>

b) Map -kill <job id>

c) hdfs -kill <job id>

d) None of the above

==================================================================================
===========================================================================

Answer to MapReduce MCQ : -

----------------------

1) c 2) b 3) d 4) d 5) d 6) a 7) a 8) d 9) d 10) b

11) d 12) d 13) d 14) d 15) d 16) d 17) b 18) d 19) d 20) d

21) a 22) d 23) a 24) a 25) c 26) b 27) a 28) a 29) c 30) a

31) b 32) a 33) c 34) a 35) b 36) b 37) a 38) c 39) c 40) b

41) c 42) b 43) d 44) a

===========================================================================<--- END
of MapReduce MCQ --->========================================================

==================================================================================
===========================================================================

Compiled by : V. Vasu 91-9940156760


[Link]@[Link]

==================================================================================
===========================================================================

==================================================================================
======

Big data Spark MCQ Question & Answers - SPARK V 2.2.0


==================================================================================
======

Spark - 1

1. Spark was initially started by ____________ at UC Berkeley AMPLab in 2009.

a) Mahek Zaharia

b) Matei Zaharia

c) Doug Cutting

d) Stonebraker

2. Point out the correct statement.

a) RSS abstraction provides distributed task dispatching, scheduling, and basic I/O
functionalities

b) For cluster manager, Spark supports standalone Hadoop YARN

c) Hive SQL is a component on top of Spark Core

d) None of the mentioned

3. ____________ is a component on top of Spark Core.

a) Spark Streaming

b) Spark SQL

c) RDDs

d) All of the mentioned

4. Spark SQL provides a domain-specific language to manipulate ___________ in Scala, Java, or Python.

a) Spark Streaming

b) Spark SQL
c) RDDs

d) All of the mentioned
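
A sketch of that DSL entered in the Spark 2.2 shell, in Scala (the people.json path is hypothetical):

    // spark is the SparkSession provided by spark-shell
    val df = spark.read.json("people.json")
    df.filter($"age" > 21).select("name").show()   // DataFrame DSL instead of raw SQL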

5. Point out the wrong statement.

a) For distributed storage, Spark can interface with a wide variety, including Hadoop
Distributed File System (HDFS)

b) Spark also supports a pseudo-distributed mode, usually used only for development or
testing purposes

c) Spark has over 465 contributors in 2014

d) All of the mentioned

6. ______________ leverages Spark Core's fast scheduling capability to perform streaming analytics.

a) MLlib

b) Spark Streaming

c) GraphX

d) RDDs

7. ____________ is a distributed machine learning framework on top of Spark.

a) MLlib

b) Spark Streaming

c) GraphX

d) RDDs

8. ________ is a distributed graph processing framework on top of Spark.

a) MLlib

b) Spark Streaming

c) GraphX

d) All of the mentioned


9. GraphX provides an API for expressing graph computation that can model the __________
abstraction.

a) GaAdt

b) Spark Core

c) Pregel

d) None of the mentioned

10. Spark architecture is ___________ times as fast as Hadoop disk-based Apache Mahout and even
scales better than Vowpal Wabbit.

a) 10

b) 20

c) 50

d) 100

==================================================================================
==================================================================================
=========

Answer to Spark MCQ : - 1

----------------------

1) b 2) b 3) b 4) c 5) d 6) b 7) a 8) c 9) c 10) a

==================================================================================
==================================================================================
==============

Spark - 2

1. Users can easily run Spark on top of Amazon’s __________

a) Infosphere

b) EC2

c) EMR

d) None of the mentioned

2. Point out the correct statement.

a) Spark enables Apache Hive users to run their unmodified queries much faster

b) Spark interoperates only with Hadoop

c) Spark is a popular data warehouse solution running on top of Hadoop

d) None of the mentioned

3. Spark runs on top of ___________ a cluster manager system which provides efficient resource
isolation across distributed applications.

a) Mesjs

b) Mesos

c) Mesus

d) All of the mentioned

4. Which of the following can be used to launch Spark jobs inside MapReduce?

a) SIM

b) SIMR

c) SIR

d) RIS

5. Point out the wrong statement.


a) Spark is intended to replace the Hadoop stack

b) Spark was designed to read and write data from and to HDFS, as well as other storage
systems

c) Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can
simply run Spark on YARN

d) None of the mentioned

6. Which of the following language is not supported by Spark?

a) Java

b) Pascal

c) Scala

d) Python

7. Spark is packaged with higher level libraries, including support for _________ queries.

a) SQL

b) C

c) C++

d) None of the mentioned

8. Spark includes a collection of over ________ operators for transforming data and familiar data frame
APIs for manipulating semi-structured data.

a) 50

b) 60

c) 70

d) 80

9. Spark is engineered from the bottom-up for performance, running ___________ faster than
Hadoop by exploiting in memory computing and other optimizations.
a) 100x

b) 150x

c) 200x

d) None of the mentioned

10. Spark powers a stack of high-level tools including Spark SQL, MLlib for _________

a) regression models

b) statistics

c) machine learning

d) reproductive research

==================================================================================
==================================================================================
=

Answer to Spark MCQ : - 2

----------------------

1) b 2) a 3) b 4) b 5) a 6) b 7) a 8) d 9) a 10) c

==================================================================================
==================================================================================
===

=====================================================================<--- END of Spark MCQ --->=========================================================================


==================================================================================
======

Apache Kafka MCQ Question & Answers

==================================================================================
======

Kafka with Hadoop – 1

1. Kafka is comparable to traditional messaging systems such as _____________

a) Impala

b) ActiveMQ

c) BigTop

d) Zookeeper

2. Point out the correct statement.

a) The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of
real-time publish-subscribe feeds

b) Activity tracking is often very high volume as many activity messages are generated for each user
page view

c) Kafka is often used for operational monitoring data

d) All of the mentioned

3. Many people use Kafka as a replacement for a ___________ solution.

a) log aggregation

b) compaction

c) collection

d) all of the mentioned


4. _______________ is a style of application design where state changes are logged as a time-
ordered sequence of records.

a) Event sourcing

b) Commit Log

c) Stream Processing

d) None of the mentioned

5. Point out the wrong statement.

a) Kafka can serve as a kind of external commit-log for a distributed system

b) The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes
to restore their data

c) Kafka comes with a command line client that will take input from a file or from standard input and
send it out as messages to the Kafka cluster

d) All of the mentioned

6. Kafka uses __________ so you need to first start a ZooKeeper server if you don’t already have
one.

a) Impala

b) ActiveMQ

c) BigTop

d) Zookeeper

7. __________ is the node responsible for all reads and writes for the given partition.

a) replicas

b) leader

c) follower

d) isr
8. __________ is the subset of the replicas list that is currently alive and caught-up to the leader.

a) replicas

b) leader

c) follower

d) isr

9. Kafka uses key-value pairs in the ____________ file format for configuration.

a) RFC

b) Avro

c) Property

d) None of the mentioned

10. __________ is the amount of time to keep a log segment before it is deleted.

a) [Link]

b) [Link]

c) [Link]

d) [Link]

Answer to Kafka with Hadoop - 1 : -

-----------------------------------

1) b 2) d 3) a 4) a 5) d 6) d 7) b 8) d 9) c 10) b

==================================================================================
===========================================================================

Kafka with Hadoop – 2


1. __________ provides the functionality of a messaging system.

a) Oozie

b) Kafka

c) Lucene

d) BigTop

2. Point out the correct statement.

a) With kafka, more users, whether using SQL queries or BI applications, can interact with more data

b) A topic is a category or feed name to which messages are published

c) For each topic, the Kafka cluster maintains a partitioned log

d) None of the mentioned

3. Kafka maintains feeds of messages in categories called __________

a) topics

b) chunks

c) domains

d) messages
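
A minimal sketch, in Scala using the standard kafka-clients producer API, of publishing a message to a topic; the broker address and topic name are assumptions:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish one message to the (hypothetical) topic "user-activity".
    producer.send(new ProducerRecord[String, String]("user-activity", "user42", "page_view"))
    producer.close()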

4. Kafka is run as a cluster composed of one or more servers, each of which is called __________

a) cTakes

b) broker

c) test

d) none of the mentioned

5. Point out the wrong statement.

a) The Kafka cluster does not retain all published messages


b) A single Kafka broker can handle hundreds of megabytes of reads and writes per second from
thousands of clients

c) Kafka is designed to allow a single cluster to serve as the central data backbone for a large
organization

d) Messages are persisted on disk and replicated within the cluster to prevent data loss

6. Communication between the clients and the servers is done with a simple, high-performance,
language agnostic _________ protocol.

a) IP

b) TCP

c) SMTP

d) ICMP

7. The only metadata retained on a per-consumer basis is the position of the consumer in the log,
called __________

a) offset

b) partition

c) chunks

d) all of the mentioned

8. Each kafka partition has one server which acts as the _________

a) leaders

b) followers

c) staters

d) all of the mentioned

9. _________ has stronger ordering guarantees than a traditional messaging system.

a) Kafka

b) Slider

c) Suz

d) None of the mentioned

10. Kafka only provides a _________ order over messages within a partition.

a) partial

b) total

c) 30%

d) none of the mentioned

Answer to Kafka with Hadoop - 2 : -

-----------------------------------

1) b 2) b 3) a 4) b 5) a 6) b 7) a 8) a 9) a 10) b

==================================================================================
===========================================================================

Kafka with Hadoop – 3

1. Temporary znodes are called

a) temp nodes

b) fleeting nodes

c) ephemeral nodes

d) terminating nodes

2. The znodes that continue to exist even after the creator of the znode dies are called:

a) ephemeral nodes

b) persistent nodes

c) sequential nodes

d) pure nodes

3. What property of znodes can be used to order znodes?

a) counter property

b) sequential property

c) ascending property

d) hierarchical property

4. Which feature of znodes are used for leader election?

a) Ephmeral feature

b) Persistent feature

c) watch feature

d) sequential feature

5) What does ZooKeeper provide to programmatically handle znodes?

a) server hookups

b) client APIs

c) property files

d) Classes

6. Which is the configuration file for setting up ZooKeeper properties in Kafka?

a) [Link]

b) [Link]

c) [Link]

d) [Link]

7. Which command is used to start Kafka broker?


a) [Link]

b) [Link]

c) [Link]

d) [Link]

8. Which server should be started before starting Kafka server?

a) Kafka producer

b) Kafka consumer

c) Kafka topic

d) ZooKeeper server

9. A Kafka topic is set up with a replication factor of 5. Out of these, 2 nodes in the cluster have failed. Business users are concerned that they may lose messages. What do you tell them?

a) They need to stop sending messages till you bring up the 2 servers

b) They need to stop sending messages till you bring up at least one server

c) They can continue to send messages as there is fault tolerance of 4 server failures.

d) They can continue to send messages as you are keeping a tape back up of all the messages

10. A Kafka cluster has 20 nodes. There are 5 topics created, each with 6 partitions. How many broker processes will be running in total?

a) 20 processes, one on each node

b) 100 processes, one process for each topic on each node.

c) 30 processes, one process for each topic and partition.

d) 120 processes, one process for each partition on each node

11. You have tested that a Kafka cluster with five nodes is able to handle ten million messages per minute. Your input is likely to increase to twenty five million messages per minute. How many nodes, in total, should the cluster then have?
a) 15

b) 13

c) 8

d) 5
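
Worked out, assuming throughput scales linearly with node count: 10M msgs/min across 5 nodes is 2M msgs/min per node, so 25M msgs/min needs ceil(25 / 2) = 13 nodes in total, i.e. 8 nodes beyond the current 5.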

12. How many brokers will be marked as leaders for a partition?

a) Zero

b) Any number of brokers

c) One

d) All the brokers

13. Which of the following is guaranteed by Kafka?

a) A consumer instance gets the messages in the same order as they are produced.

b) A consumer instance is guaranteed to get all the messages produced.

c) No two consumer instances will get the same message

d) All consumer instances will get all the messages

14. The replication model in Kafka is

a) Quorum based

b) primary-backup method

c) multi-primary method

d) Journal based

15. When messages passes from producer to broker to consumer, the data modification is
minimized by using:

a) Message compression

b) Message sets

c) Binary message format

d) Partitions

16. What happens in Kafka, if a machine to be written data is down?

a) The write fails

b) The machine is bypassed till it comes back up

c) The write waits till the machine comes back up

d) Data is lost as a machine has failed

17. A common problem in a distributed system is

a) Complete shutdown

b) Partial failures

c) Network latency

d) Power failures

18. Which of the following best describes the relationship between ZooKeeper and partial failures?

a) ZooKeeper eliminates partial failures

b) ZooKeeper causes partial failures

c) ZooKeeper detects partial failures

d) ZooKeeper provides a mechanism for handling partial failures

19. Replication of data can result in improved fault tolerance. Which of the following is a
disadvantage of replication?

a) Inconsistent states

b) Loss of data

c) Deadlocks

d) Partial failures

20) ZooKeeper data model consists of:

a) A graph of znodes

b) A tree of znodes

c) A directed acyclic graph of znodes

d) A list of znodes


21) In Kafka, the commit log serves as an

a) internal log

b) external log

c) temporary log

d) No log

22) Is it possible to delete a topic when the broker is down ?

a) true

b) false

Note :

In these questions, "broker node" and "Kafka server" refer to the same process; each broker registers itself as a znode in ZooKeeper.

Answer to Kafka with Hadoop - 3 : -

-----------------------------------

1) c 2) b 3) b 4) d 5) b 6) b 7) c 8) d 9) c 10) a 11) b 12) c

13) c 14) a 15) c 16) c 17) b 18) d 19) a 20) b 21) b 22) b
===========================================================================<--- END of Kafka MCQ --->========================================================


==================================================================================
======

Big data Spark MCQ Question & Answers - SPARK V 2.2.0

==================================================================================
======

1) What is Apache Spark ?

a) Spark is an open-source framework

b) Spark is an in-memory data processing analytical engine

c) Spark is a fast, distributed cluster-computing framework

d) All the above

2) What are the various core components of Spark ?

a) Spark SQL

b) Spark Streaming

c) Spark MLlib

d) Spark GraphX

e) All the above

3) Which programming language was Spark itself developed in ?

a) Python
b) Java

c) Scala

d) None of the above

4) What is the role of Spark Driver program ?

a) The driver program runs the main() function of the application on the Master Node. There is one driver per application

b) Spark Driver contains various components – DAGScheduler, TaskScheduler, BackendScheduler and BlockManager

c) Driver stores the metadata of RDD and its partition details

d) Driver program splits user application into smaller execution units known as tasks

e) All the above

5) What is the role of Spark Context ?

a) Spark Context is the entry point to Spark functionality before version 2.0

b) The Spark driver program creates the Spark Context

c) Spark Context connects to a cluster manager, such as YARN, to launch applications on Worker Nodes as tasks

d) All the above

6) What is the use of Spark Conf ?

a) SparkConf is used to set configuration parameters as key-value pair for Spark Applications

b) SparkConf is required to pass Spark Context objects

c) SparkConf stores parameters like the application name and application execution location

d) All the above

7) What is the use of Spark Session ?

a) Spark Session is an entry point to Spark SQL

b) In Spark 2.0, Spark Session was built as the new entry point for launching applications that process structured datasets, viz. Dataframe & Dataset

c) Spark Session is a combination of SQLContext and HiveContext

d) All the above
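
A minimal sketch, in Scala, of creating a SparkSession as the Spark 2.x entry point; the app name, master URL and input path are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("session-demo")
      .master("local[*]")  // assumed local run; omit when submitting to a cluster
      .getOrCreate()

    val df = spark.read.json("/data/events.json")  // hypothetical input path
    df.printSchema()

The later sketches in this section reuse this SparkSession, named spark.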


8) Identify the correct statement of RDD

a) Resilient Distributed Dataset is the basic data structure of Spark

b) RDD is immutable and distributed as partitions across multiple nodes

c) RDD provides fault-tolerance, lazy evaluation and re-usability

d) All the above

9) Identify the correct statement of Transformation of Spark RDD

a) Transformation is a function that converts an RDD to a Dataframe

b) Transformation is an operation that generates a new RDD from the current RDD

c) Transformation is a function that converts a Dataset to an RDD

d) None of the above

10) What is Lazy evaluation and how it is used in Spark ?

a) Lazy Evaluation means that the execution will not take place until an action is triggered

b) Lazy Evaluation is associated with Spark Transformation

c) Transformations are lazy in nature, meaning nothing is computed at the moment we call such an operation on an RDD

d) All the above
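
A minimal sketch of lazy evaluation, reusing the SparkSession (spark) from the earlier sketch; the input path is hypothetical:

    val rdd = spark.sparkContext.textFile("/data/lines.txt")
    // Transformations: nothing runs yet, Spark only records the lineage.
    val errors = rdd.filter(_.contains("ERROR")).map(_.toUpperCase)
    // Action: triggers execution of the whole pipeline.
    val n = errors.count()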

11) What is Narrow Transformation ?

a) Computation takes place in a single partition of parent RDD

b) No shuffle across partitions. It is a pipelined process, grouped together as a single stage, without data movement

c) Narrow Transformation is a result of map(),filter()

d) All the above

12) What is Wide Transformation ?

a) Computation takes place in a many partitions of parent RDD

b) Data is Shuffled across multiple partitions in the Cluster

c) Wide transformation is a result of groupbyKey(), reducebyKey()

d) All the above


13) What is the difference between map() & flatMap() ?

a) map() does a one-to-one transformation

b) flatMap() does a one-to-many transformation

c) All the above
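
A minimal sketch of the difference, reusing the SparkSession (spark) from the earlier sketch:

    val lines = spark.sparkContext.parallelize(Seq("hello world", "hi"))
    val mapped = lines.map(_.split(" "))      // one-to-one: RDD[Array[String]] with 2 elements
    val flat   = lines.flatMap(_.split(" "))  // one-to-many: RDD[String] with 3 elements
    println(s"${mapped.count()} vs ${flat.count()}")  // 2 vs 3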

14) What is DAG and how it is used in Spark ?

a) Directed Acyclic Graph converts the logical execution plan to a physical execution plan. DAG provides a flow of execution

b) DAG in Spark is a set of Vertices (RDDs) and Edges (operations applied). DAG takes care of processing in Spark

c) DAG achieves fault-tolerance and replaces a lost RDD partition in the cluster with the help of lineage

d) All the above

15) Name the various storage levels available in Spark

a) MEMORY_ONLY,MEMORY_ONLY_2

b) MEMORY_AND_DISK, MEMORY_AND_DISK_2

c) MEMORY_AND_DISK_SER

d) All the above

16) Identify the wrong statement of Checkpoint ?

a) Checkpointing is method to achieve fault tolerance

b) A checkpointed RDD is stored in HDFS. Unlike cache, the checkpoint file is not deleted

c) Spark writes the data both to the cache and to the checkpointing directory

d) None of the above

17) What is the difference between dataframe & dataset ?

a) Both are named columns like table

b) Dataframe schema is checked at run-time whereas Dataset schema is checked at compile-time

c) Dataset is compatible with the Catalyst optimizer


d) All the above

18) What is the role of YARN and how it is used in Spark ?

a) YARN is a generic resource management framework for distributed environment

b) Spark application submitted to YARN which allocates the resources (executors) to Spark
tasks

c) YARN provides two deployment modes, YARN Cluster & YARN Client, to launch the driver program

d) All the above

19) What is an Action in Spark ?

a) Action is an operation that performs on RDD

b) An action is one of the ways of sending data from Executor to the Driver

c) Examples of Spark actions are sum(), count(), collect(), foreach(), etc.

d) All of the above

20) What is a Stage in Spark ?

a) Stage is a physical execution plan

b) Stage is set of parallel tasks runs on partitions

c) A Stage is a piece of a Spark job that is split into a number of tasks. The DAG maintains the execution of tasks (by ID) one by one

d) All the above

21) What is Lineage and how it is used in Spark ?

a) RDD Lineage is a graphical representation of all the parent RDDs and their dependencies

b) Lineage is used to achieve fault-tolerance. If any RDD partition is lost, the DAG rebuilds the lost RDD from its parent RDDs using lineage

c) All the above

22) What are shared variables and how are they used in Spark ?

a) Shared variables are needed by many functions & methods running in parallel across the cluster; shared variables are responsible for performance enhancement

b) Broadcast - used to cache the same set of values in memory on all worker nodes, i.e. outside the executors

c) Accumulator - used to implement counters and sums

d) All the above
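
A minimal sketch of both shared variables, reusing the SparkSession (spark) from the earlier sketch; the lookup table is made up for illustration:

    val sc = spark.sparkContext
    val lookup  = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
    val badRows = sc.longAccumulator("badRows")

    val names = sc.parallelize(Seq("IN", "US", "??")).map { code =>
      val name = lookup.value.getOrElse(code, "")  // read the broadcast copy on the worker
      if (name.isEmpty) badRows.add(1)             // count misses via the accumulator
      name
    }
    names.collect()          // action triggers the map, so the accumulator is populated
    println(badRows.value)   // 1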

23) What are the various file formats supported by Spark ?

a) CSV, JSON

b) Parquet, ORC

c) Avro, SequenceFile

d) All the above

24) Identify the correct statement of Block Manager ?

a) A BlockManager is a BlockDataManager, i.e. it manages the storage for blocks (block IDs) that can represent cached RDD partitions

b) A Block manager is responsible for cluster management

c) None of the above

25) What is Catalyst optimizer role in Dataframe & Dataset ?

a) Catalyst is a tree like representation which creates catalogue to track the Dataframe &
Dataset

b) Catalyst is a function which applies set of rules based and cost based optimization
technique to manipulate data

c) Catalyst optimization takes care of logical plan, physical plan and Java code generation

d) All the above

26) What is Lambda Architecture and how is it used in Spark ?

a) Lambda architecture is a data processing technique that is capable of handling fault-tolerance and scalability on high-volume data

b) Lambda architecture comprises a Batch Layer, a Speed Layer (also known as Stream Layer) and a Serving Layer.

c) Spark uses the Streaming layer to process data as and when it arrives (in memory), as new data (micro batches) with an incremental view, whereas the Batch layer provides master data (HDFS) processed once over all data

d) All the above

27) Identify the correct statement of Batch Layer

a) Batch Layer is intended to process the current dataset

b) Batch Layer manages the master dataset and pre-computes batch views. Processed once, read many

c) Batch Layer stores data in an incremental view

d) None of the above

28) Identify the correct statement of Speed Layer or Streaming Layer ?

a) Speed layer deals with historical data

b) Speed Layer deals with recently received data only, i.e. micro batches. Random read & write. Incremental computation

c) Speed Layer deals with multiple read and write

d) All the above

29) Identify the correct statement of Serving Layer in Spark ?

a) Serving layer is not part of the Lambda architecture

b) Serving layer is helpful in processing the speed layer

c) Serving layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis

d) None of the above

30) Which of the following organized data will have named column?

a) RDD

b) DataFrame

c) Both a & b

d) None of the above

31) Which serialization does RDD use by default ?

a) Java

b) Kryo

c) Avro

d) None of the above
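
A minimal sketch of switching from the default Java serializer to Kryo via SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      // The default is org.apache.spark.serializer.JavaSerializer;
      // Kryo is faster and more compact, but custom classes should be registered for best results.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")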

32) Identify the correct statement of SparkSQL

a) SparkSQL is one of the core components in Spark to process structured data

b) SparkSQL provides a programming abstraction called Dataframe and also runs SQL queries on a distributed dataset

c) SparkSQL provides a cost-based optimizer, columnar storage, and code generation to make queries run faster

d) All the above

33) Why is Spark preferred over Hadoop ?

a) Hadoop MapReduce uses disk I/O to read and write HDFS files, which is time-consuming

b) Spark uses in-memory processing and supports near-real-time stream processing

c) Spark applies the DAG model for executing multiple stages whereas Hadoop MapReduce handles only two tasks (Map & Reduce) per application

d) All the above

34) Name the various partitioners available in Spark and their usage ?

a) Hash Partitioner - provides partitioning based on a unique key and splits data uniformly across partitions

b) Range Partitioner - provides partitioning based on key ranges, so keys in the same range appear on the same machine

c) Custom Partitioner - provides a mechanism to adjust the size and number of partitions or
the partitioning scheme based on application need

d) All the above

35) What is the difference between caching & persistence in Spark ?

a) In Spark, cache() is the default storage level which use MEMORY_ONLY

b) In Spark, persist() option can use various storage levels

c) Persist storage options include MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_SER

d) All the above
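
A minimal sketch contrasting the two, reusing the SparkSession (spark) from the earlier sketch; the input path is hypothetical:

    import org.apache.spark.storage.StorageLevel

    val logs = spark.sparkContext.textFile("/data/lines.txt")
    logs.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
    val words = logs.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK_SER) // explicitly chosen storage level
    println(words.count())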


36) What is the advantage of using Cache ?

a) Caching is a cost-effective optimization technique

b) Cache enables iterative computation, i.e. reusing the results over multiple computations in multi-stage applications

c) Cache supports interactive use, saving results for upcoming stages so that we can reuse them

d) All the above

37) How is fault-tolerance achieved in Spark ?

a) The DAG maintains all the transformation details of parent RDDs and their dependencies in a logged graph - called lineage

b) In case of any lost partition in the cluster, the DAG re-creates the partition using the lineage graph

c) All of the above

38) Identify the role of Master Node of Spark ?

a) The machine on which the Spark Standalone Cluster Manager runs is called the Master
Node

b) On the Master, the driver program runs the main() function where the Spark context is created. The Master, with the help of the cluster manager, requests resources and makes those resources available to the Driver

c) There can be only one Master per cluster. Usually it happens to be the Resource Manager

d) All the above

39) Identify the role of Worker Node in Spark ?

a) Worker node is a Slave in Spark Master-Slave architecture

b) Worker node consists of processes that can run in parallel to perform the tasks scheduled
by the driver program.

c) All the above

40) Identify the role of Cluster Manager ?

a) Cluster Manager is the process responsible for monitoring the Worker nodes and
providing resources to those nodes upon request by the Master
b) Cluster Manager function is the YARN Resource Manager process for Spark applications
running on Hadoop clusters

c) Alternatively Spark Mesos also provides Cluster Manager support

d) All the above

41) Identify the correct statement for Executor in Spark

a) Spark Executors are the processes or CPU and memory resources allocated in slave nodes,
or Worker Node

b) Spark DAG tasks run on Executors as Stages

c) Each executor is dedicated to a specific application. The application uses JVM heap memory, set via --executor-memory, on the worker node

d) All the above

42) Name the various compression format used in Spark ?

a) Gzip

b) Lzo

c) Snappy

d) All the above

43) What is the use of createOrReplaceTempView ?

a) createOrReplaceTempView is used when you want to store a specific table for a particular Spark session

b) createOrReplaceTempView creates a lazily evaluated view, like a table, against which queries can be executed in SparkSQL

c) createOrReplaceTempView replaces any existing temporary view with the same name for the DataFrame

d) All the above
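
A minimal sketch, reusing the SparkSession (spark) from the earlier sketch; the view name is arbitrary:

    val df = spark.range(5).toDF("id")
    df.createOrReplaceTempView("ids")  // session-scoped, lazily evaluated view
    spark.sql("SELECT id FROM ids WHERE id > 2").show()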

44) Identify the correct statement of Spark Streaming ?

a) Spark Streaming is one of the core components of Spark, which provides an abstraction called discretized stream or DStream

b) Spark Streaming is fault-tolerant and scalable, with better load balancing & resource usage

c) Spark Streaming supports both batch layer & speed layer

d) All the above


45) Identify the correct statement of Spark-submit

a) Spark-Submit script is used to launch applications on a cluster

b) Spark-Submit provides various options for setting up dependencies, as well as different cluster managers and deploy modes

c) All the above

46) What are the various deployment-mode in Spark ?

a) Yarn Cluster

b) Yarn Client

c) Master

d) All the above

47) Identify the correct statement of DStream ?

a) A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data

b) DStreams can either be created from live data (such as data from HDFS, Kafka or Flume) or generated by transforming existing DStreams using operations such as map, window and reduceByKeyAndWindow

c) A DStream periodically generates an RDD (micro batch), either from live data or by transforming the RDD generated by a parent DStream.

d) All the above

48) Name the Various Schedulers used in Spark ?

a) DAG Scheduler - stage-oriented scheduling

b) Task Scheduler - task-level scheduling

c) All the above

49) Identify the correct statement of Shuffling in Spark ?

a) Shuffling is a data movement process in Spark

b) Shuffling is a process of repartitioning or redistributing data across partitions

c) Shuffling reduces the number of partitions in the cluster

d) None of the above

50) What is speculative execution and how is it used in Spark ?

a) Speculative execution is the process of launching a duplicate copy of a task to run in parallel

b) Speculative execution is a health-check procedure that identifies tasks running slowly on a worker node

c) Speculative execution does not stop the slow-running task

51) Name the difference between groupByKey & reduceByKey

a) groupByKey shuffles all the data, a slow process

b) reduceByKey shuffles only the sub-aggregated results from each partition

c) All the above
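
A minimal sketch of the contrast, reusing the SparkSession (spark) from the earlier sketch:

    val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    // groupByKey ships every value across the network before aggregating.
    val slow = pairs.groupByKey().mapValues(_.sum)
    // reduceByKey pre-aggregates within each partition, shuffling only partial sums.
    val fast = pairs.reduceByKey(_ + _)
    println(fast.collect().toMap)  // Map(a -> 3, b -> 3)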

52) Is it possible to have multiple SparkContext in a single JVM ?

a) No

b) Yes, by enabling [Link]=true

53) Is it possible to share an RDD between SparkContexts ?

a) No

b) Yes

54) Is it possible to stop the SparkContext after initiated ?

a) No

b) yes, by using [Link]()

55) What is Repartitioning?

a) The repartition method can be used to either increase or decrease the number of partitions in an RDD or DataFrame.

b) Repartition is a full shuffle operation; the whole data is taken out of the existing partitions and equally distributed into newly formed partitions

c) Repartition creates roughly equal-size partitions


d) All the above

56) What is the Coalesce transformation ?

a) The Coalesce method reduces the number of partitions in an RDD or DataFrame.

b) Coalesce avoids a full shuffle; instead of creating new partitions, it moves the data, using the Hash Partitioner (default), into existing partitions, which means it can only decrease the number of partitions

c) Coalesce can create unequal-size partitions

d) All the above
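
A minimal sketch of both operations, reusing the SparkSession (spark) from the earlier sketch:

    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)
    val more  = rdd.repartition(16)  // full shuffle; can increase or decrease the partition count
    val fewer = rdd.coalesce(2)      // no full shuffle; merges into existing partitions, decrease only
    println(s"${more.getNumPartitions}, ${fewer.getNumPartitions}")  // 16, 2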

57) What is data locality and how it is used in Spark ?

a) Data locality in simple terms means doing computation on the node where actual data
resides.

b) Spark driver, in Hadoop, contacts NameNode about the DataNodes for various blocks

c) Spark uses YARN Cluster manager to place the tasks alongside HDFS blocks

d) All the above

58) What are the various data locality and placement policy in Spark ?

a) PROCESS_LOCAL

b) NODE_LOCAL

c) RACK_LOCAL

d) NO_PREF , ANY

e) All the above

59) What are the various Security features in Spark ?

a) RPC Authentication

b) ACL Authentication

c) Kerberos Authentication

d) SSL Authentication

e) All the above

60) Is it possible to allocate dynamic resources ?


a) No, the default is false

b) Yes, by setting [Link] = true & [Link] = true

61) What is on-heap & off-heap memory and how it is used in Spark ?

a) On-heap memory is dynamically allocated memory used by the JVM, with GC enabled; fast

b) Off-heap memory is memory managed outside the JVM, storing objects serialized as byte arrays; no GC; slower

c) Spark's JVM by default uses on-heap memory to run applications. Off-heap is used when on-heap memory is exceeded

d) All the above

==================================================================================
===========================================================================

Answer to Spark MCQ : -

----------------------

1) d 2) e 3) c 4) e 5) a 6) d 7) d 8) d 9) b 10) a

11) d 12) d 13) c 14) d 15) d 16) d 17) d 18) d 19) d 20) d

21) c 22) d 23) d 24) a 25) d 26) d 27) b 28) c 29) c 30) b

31) a 32) d 33) d 34) d 35) d 36) d 37) c 38) d 39) c 40) d

41) d 42) d 43) d 44) d 45) c 46) d 47) d 48) c 49) b 50) c

51) c 52) b 53) a 54) b 55) d 56) d 57) d 58) d 59) e 60) b

61) d
===========================================================================<--- END of Spark MCQ --->========================================================


==================================================================================
======

Big data Hadoop MCQ Question & Answers - HDFS v 2.0

==================================================================================
======

1) What is the default block size in Hadoop Version 2.0 ?

a) 100 mb

b) 256 mb

c) 64 mb

d) 128 mb

2) What is the default replication factor in HDFS ?

a) 4

b) 2

c) 3

d) 1

3) NameNode & DataNode exchange communication signals through ...

a) Heart beat

b) HTTP

c) Active pulse

d) Hdfs signal
4) HDFS storage uses the actual file size, not the full block size, for storage

a) False

b) True

c) Sometimes

d) All true

5) What is the role of NameNode ?

a) Maintains data blocks storage

b) Maintains file system Namespace

c) Manages the entire cluster

d) Manages processing of data

6) What is Single point of failure ?

a) When DataNode is unavailable

b) When SecondaryNode is unavailable

c) When NameNode is unavailable

d) All the above

7) Information about NameNode is available in Configuration file

a) [Link]

b) [Link]

c) [Link]

d) [Link]

8) Which HDFS commands is used to copy a hdfs-file to another cluster ?

a) copyFromLocal

b) distcp

c) put

d) copyToLocal

9) What is Journal Node


a) Maintains FSimage

b) Maintains [Link]

c) Maintains data blocks

d) Maintains blockpool management

10) What is blockpool management ?

a) DataNode shares its data block information with the NameNode, held in memory

b) NameNode shares its information to DataNode

c) SecondaryNode shares its file system to DataNode

d) All the above

11) Hadoop framework is developed using ...

a) Python

b) Java

c) C ++

d) Scala

12) Where active Fsimage is stored ?

a) NameNode

b) SecondaryNode

c) DataNode

d) All the above

13) Is it possible to change the size of block size ?

a) No

b) Yes
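
A minimal sketch, in Scala against the Hadoop FileSystem API, of overriding the replication factor and block size for a single file; the path and values are hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())  // picks up core-site.xml / hdfs-site.xml
    // create(path, overwrite, bufferSize, replication, blockSize)
    val out = fs.create(new Path("/tmp/demo.txt"), true, 4096, 2.toShort, 64L * 1024 * 1024)
    out.writeBytes("block size and replication set per file\n")
    out.close()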

14) Is it possible to edit the HDFS File?

a) No

b) Yes
15) What is the Checkpointing process ?

a) Merging process of Edits log with FSimage

b) Merging process of block pool with Namespace

c) Merging process of FSimage temporarily

d) All the above

16) What is horizontal scalability ?

a) Adding the resources for Data Storage during runtime

b) Adding another NameNode

c) Adding another SecondaryNode

d) All the above

17) What is data locality or computation at data locality ?

a) Moving the data blocks to application locality

b) Moving the application to the data location

c) Moving DataNode to NameNode Location

d) All the above

18) What is fault-tolerance in HDFS ?

a) Copies of data blocks are stored on another DataNode to prevent data loss

b) Copy of Fsimage is stored in Datanode

c) Copy of Edits log stored in SecondaryNode

d) All the above

19) What is Fsimage ?

a) Information about HDFS file system metadata are stored in disk called the FsImage

b) Fsimage stores information about Filename, Blockid, DataNode location, etc

c) The in-memory Namespace & block pool information in the NameNode is stored permanently on disk as the FsImage file, in image format

d) All the above


20) Name the major distributors of Apache Hadoop

a) Cloudera

b) HortonWorks

c) MapR

d) All the above

21) What are the 'V's of Big Data characteristics ?

a) Volume

b) Velocity

c) Variety

d) Veracity

e) All the above

22) What is safemode in NameNode ?

a) When NameNode is shut down safely

b) When NameNode is in maintenance mode

c) When DataNode communicates with NameNode

d) All the above

23) What is the role of Zookeeper during NameNode High availability ?

a) Provides coordinating service & maintains configuration information

b) Starts the Standby NameNode when the Active NameNode is not responding

c) Uses Quorum Journal Manager to update Edits log

d) All the above

24) Explain the replication parallel pipeline process

a) Simultaneous storage of data blocks in parallel across multiple DataNodes

b) DataNode blockpool information shared to Namespace

c) Stores data blocks individually in the DataNode

d) None of the above


25) What is the role of Rack awareness program ?

a) Data high availability and reliability

b) Performance of the HDFS cluster

c) Network bandwidth

d) All of the above

26) What is meant by Dead Node ?

a) When NameNode is unavailable

b) When SecondaryNode is unavailable

c) When DataNode is not communicating to NameNode

d) All the above

27) Which is called as Master Node in Hadoop ?

a) DataNode

b) SecondaryNode

c) NameNode

d) All the above

28) Which HDFS shell command is used to copy an entire folder ?

a) CopyFromLocal

b) CopyToLocal

c) put

d) distcp

29) Which HDFS shell command is used to combine multiple HDFS files into one file on a local path ?

a) getMerge

b) put

c) distcp

d) copyToLocal
30) HDFS is ____________ processing ?

a) Parallel

b) Batch

c) Sequence

d) All the above

31) In which configuration file can you alter the replication factor & set the block size ?

a) [Link]

b) [Link]

c) hdfs-site.xml

d) [Link]

32) What is Namespace in NameNode ?

a) Namespace is tree-like metadata (directories, files, blocks) stored in NameNode memory

b) Namespace is a container which maintains all the filenames in a hierarchical structure

c) Namespace is an in-memory file system

d) All the above

33) What is split-brain scenario ?

a) In Hadoop High availabity, when both NameNodes are in Active state

b) Standby NameNode is unreachable

c) Active NameNode is unreachable

d) None of the above

34) What are the different data types that Hadoop can handle ?

a) structured

b) unstructured

c) semi-structured

d) All the above


35) What is meant by cheap commodity hardware in Hadoop ?

a) Very cheap hardware

b) Industry standard hardware

c) Discarded hardware

d) Low specifications Industry grade hardware

36) Hadoop architecture is ________________________

a) Client-Server architecture

b) Peer-to-Peer architecture

c) Master-Slave architecture

d) All the above

37) Name a few Hadoop distribution modes ?

a) Local mode

b) Pseudo-distributed mode

c) Fully distributed mode

d) All the above

38) Name the components of Hadoop frame work

a) HDFS

b) MapReduce

c) All the above

d) None of the above

39) SecondaryNode can be used as the Active NameNode in case of any failure of the NameNode

a) True

b) False

40) Salient features of Hadoop are ...

a) Horizontal Scalability

b) Distributed Computation
c) Fault tolerance

d) Moving the application to the data locality

e) All the above

41) What is block-pool ?

a) DataNodes report block information (block id & IP address) to the NameNode

b) Set of blocks are maintained in a single block pool id

c) DataNode sends periodic heartbeat signal and shares block pool report

d) All the above

42) What is metadata in Hadoop ?

a) HDFS metadata represents the HDFS directories and files in a tree structure

b) Metadata also maintains ownership, permissions, quotas, and replication factor.

c) At startup, the NameNode loads both FsImage & Edits log from disk into memory

d) All the above

43) Which HDFS commands is used to remove a hdfs folder ?

a) delete

b) rm

c) rm -r

d) expunge

44) What is Edits Log ?

a) The edits log is the transaction log of the NameNode

b) Maintains recent addition/modification/deletion information of FsImage

c) Edits log is stored in Journal Node in Hadoop version 2.0

d) All the above

45) Which HDFS commands is used to copy a file from one hdfs path to another hdfs path ?

a) cp

b) distcp

c) put

d) copyFromLocal

46) Which HDFS commands is used to create a empty file in hdfs ?

a) cp

b) touchz

c) copyFromLocal

d) copyToLocal

47) Which HDFS commands is used to clear all the files (empty) from trash ?

a) delete

b) rm

c) expunge

d) None of the above

48) Which HDFS commands is used to read the content of the files in hdfs ?

a) cat

b) read

c) view

d) None of the above

49) Is it possible to customize or change the hdfs block-size from default size ?

a) Yes

b) No

50) What is the importance of fsck command in HDFS ?

a) fsck command helps to check the health of hdfs file system

b) To view the missing block files, over-replicated files, under-replicated files

c) To view the corrupt block files

d) All the above


==================================================================================
===========================================================================

Answer to HDFS MCQ : -

----------------------

1) d 2) c 3) a 4) b 5) b 6) c 7) a 8) b 9) b 10) a

11) b 12) a 13) b 14) a 15) a 16) a 17) b 18) a 19) a 20) d

21) e 22) b 23) d 24) a 25) d 26) c 27) c 28) c 29) a 30) b

31) c 32) d 33) a 34) d 35) d 36) c 37) d 38) c 39) b 40) e

41) d 42) d 43) c 44) d 45) a 46) b 47) c 48) a 49) a 50) d

===========================================================================<--- END of HDFS MCQ --->========================================================


==================================================================================
======

Big data MCQ Question & Answers - Hive

==================================================================================
======
Hive - 1

1. Hive uses _________ for logging.

a) logj4

b) log4l

c) log4i

d) log4j

2. Point out the correct statement.

a) list FILE[S] <filepath>* executes a Hive query and prints results to standard output

b) <query string> executes a Hive query and prints results to standard output

c) <query> executes a Hive query and prints results to standard output

d) All of the mentioned

3. What does the [Link] setting in the following statement specify?

$HIVE_HOME/bin/hive --hiveconf [Link]=INFO,console

a) Log level

b) Log modes

c) Log source

d) All of the mentioned

4. HiveServer2 introduced in Hive 0.11 has a new CLI called __________

a) BeeLine

b) SqlLine

c) HiveLine

d) CLilLine
5. Point out the wrong statement.

a) There are four namespaces for variables in Hive

b) Custom variables can be created in a separate namespace with the define

c) Custom variables can also be created in a separate namespace with hivevar

d) None of the mentioned

6. HCatalog is installed with Hive, starting with Hive release is ___________

a) 0.10.0

b) 0.9.0

c) 0.11.0

d) 0.12.0

7. hiveconf variables are set as normal by using the following statement?

a) set -v x=myvalue

b) set x=myvalue

c) reset x=myvalue

d) none of the mentioned

8. Variable Substitution is disabled by using ___________

a) set [Link]=false;

b) set [Link]=false;

c) set [Link]=true;

d) all of the mentioned

9. _______ supports a new command shell Beeline that works with HiveServer2.

a) HiveServer2

b) HiveServer3

c) HiveServer4

d) None of the mentioned

10. In ______ mode HiveServer2 only accepts valid Thrift calls.

a) Remote

b) HTTP

c) Embedded

d) Interactive

==================================================================================
==================================================================================
=

Answer to Hive MCQ : -

----------------------

1) d 2) b 3) a 4) a 5) a 6) c 7) d 8) a 9) a 10) a

==================================================================================
==================================================================================
===

Hive - 2

1. Hive specific commands can be run from Beeline, when the Hive _______ driver is used.

a) ODBC

b) JDBC

c) ODBC-JDBC

d) All of the Mentioned


2. Point out the correct statement.

a) --helpusage display a usage message

b) The JDBC connection URL format has the prefix jdbc:hive:

c) Starting with Hive 0.14, there are improved SV output formats

d) None of the mentioned

3. _________ reduces the amount of informational messages displayed (true) or not (false).

a) --silent=[true/false]

b) --autosave=[true/false]

c) --force=[true/false]

d) All of the mentioned

4. Which of the following is used to set transaction isolation level?

a) --incremental=[true/false]

b) --isolation=LEVEL

c) --force=[true/false]

d) --truncateTable=[true/false]

5. Point out the wrong statement.

a) HiveServer2 has a new JDBC driver

b) CSV and TSV output formats are maintained for forward compatibility

c) HiveServer2 supports both embedded and remote access to HiveServer2

d) None of the mentioned

6. The ________ allows users to read or write Avro data as Hive tables.
a) AvroSerde

b) HiveSerde

c) SqlSerde

d) None of the mentioned

7. Starting in Hive _______ the Avro schema can be inferred from the Hive table schema.

a) 0.14

b) 0.12

c) 0.13

d) 0.11

8. The AvroSerde has been built and tested against Hive 0.9.1 and later, and uses Avro _______ as of
Hive 0.13 and 0.14.

a) 1.7.4

b) 1.7.2

c) 1.7.3

d) None of the mentioned

9. Which of the following data type is supported by Hive?

a) map

b) record

c) string

d) enum

10. Which of the following data type is converted to Array prior to Hive 0.12.0?

a) map

b) long

c) float

d) bytes

==================================================================================
==================================================================================
=

Answer to Hive MCQ : - 2

----------------------

1) b 2) c 3) a 4) b 5) b 6) a 7) a 8) d 9) d 10) d

==================================================================================
==================================================================================
===

Hive - 3

1. Avro-backed tables can simply be created by using _________ in a DDL statement.

a) “STORED AS AVRO”

b) “STORED AS HIVE”

c) “STORED AS AVROHIVE”

d) “STORED AS SERDE”

2. Point out the correct statement.

a) Avro Fixed type should be defined in Hive as lists of tiny ints

b) Avro Bytes type should be defined in Hive as lists of tiny ints

c) Avro Enum type should be defined in Hive as strings

d) All of the mentioned


3. Types that may be null must be defined as a ______ of that type and Null within Avro.

a) Union

b) Intersection

c) Set

d) All of the mentioned

4. The files that are written by the _______ job are valid Avro files.

a) Avro

b) Map Reduce

c) Hive

d) All of the mentioned

5. Point out the wrong statement.

a) To create an Avro-backed table, specify the serde as [Link]

b) Avro-backed tables can be created in Hive using AvroSerDe

c) The AvroSerde cannot serialize any Hive table to Avro files

d) None of the mentioned

6. Use ________ and embed the schema in the create statement.

a) [Link]

b) [Link]

c) [Link]

d) all of the mentioned

7. _______ is interpolated into the quotes to correctly handle spaces within the schema.

a) $SCHEMA
b) $ROW

c) $SCHEMASPACES

d) $NAMESPACES

8. To force Hive to be more verbose, it can be started with ___________

a) hive --hiveconf [Link]=INFO,console

b) hive --hiveconf [Link]=INFO,console

c) hive --hiveconf [Link]=INFOVALUE,console

d) All of the mentioned

9. ________ was designed to overcome limitations of the other Hive file formats.

a) ORC

b) OPC

c) ODC

d) None of the mentioned

10. An ORC file contains groups of row data called __________

a) postscript

b) stripes

c) script

d) none of the mentioned

==================================================================================
==================================================================================
=

Answer to Hive MCQ : - 3

----------------------
1) a 2) b 3) a 4) c 5) c 6) a 7) a 8) a 9) a 10) b

==================================================================================
==================================================================================
===

Hive – 4

1. Serialization of string columns uses a ________ to form unique column values.

a) Footer

b) STRIPES

c) Dictionary

d) Index

2. Point out the correct statement.

a) The Avro file dump utility analyzes ORC files

b) Streams are compressed using a codec, which is specified as a table property for all
streams in that table

c) The ODC file dump utility analyzes ORC files

d) All of the mentioned

3. _______ is a lossless data compression library that favors speed over compression ratio.

a) LOZ

b) LZO

c) OLZ

d) All of the mentioned

4. Which of the following will prefix the query string with parameters?

a) SET [Link]=false

b) SET [Link]=false

c) SET [Link]=true

d) All of the mentioned

5. Point out the wrong statement.

a) TIMESTAMP is only available starting with Hive 0.10.0

b) DECIMAL introduced in Hive 0.11.0 with a precision of 38 digits

c) Hive 0.13.0 introduced user definable precision and scale

d) All of the mentioned

6. Integral literals are assumed to be _________ by default.

a) SMALLINT

b) INT

c) BIGINT

d) TINYINT

7. Hive uses _____ style escaping within the strings.

a) C

b) Java

c) Python

d) Scala

8. Which of the following statement will create a column with varchar datatype?

a) CREATE TABLE foo (bar CHAR(10))

b) CREATE TABLE foo (bar VARCHAR(10))

c) CREATE TABLE foo (bar CHARVARYING(10))

d) All of the mentioned


9. _________ will overwrite any existing data in the table or partition.

a) INSERT WRITE

b) INSERT OVERWRITE

c) INSERT INTO

d) None of the mentioned

10. Hive does not support literals for ______ types.

a) Scalar

b) Complex

c) INT

d) CHAR

==================================================================================
==================================================================================
=

Answer to Hive MCQ : - 4

----------------------

1) c 2) b 3) b 4) a 5) b 6) b 7) a 8) b 9) b 10) b

==================================================================================
==================================================================================
=

===========================================================================<--- END of Hive MCQ --->========================================================

==================================================================================
===========================================================================
Compiled by : V. Vasu 91-9940156760
[Link]@[Link]

==================================================================================
===========================================================================
