Big Data Hadoop MCQ Questions
==================================================================================
1) What is Sqoop ?
b) Sqoop connectors are used to execute SQL dialect supported by various databases
a) To store all the data in a parent directory, Sqoop jobs use --warehouse-dir
b) To store all the data under the parent directory with the same name as the table
c) To store all the data in HDFS location
a) By using the --split-by parameter, data will be imported in multiple chunks that run in parallel (see the sketch below)
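For reference, a minimal sketch of a parallel import using --split-by together with --warehouse-dir; the JDBC URL, table and column names below are hypothetical placeholders, not part of the question set.

    # --split-by picks the column used to cut the data into ranges;
    # --num-mappers controls how many chunks are imported in parallel;
    # --warehouse-dir is the parent directory (data lands under .../orders).
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 4 \
      --warehouse-dir /user/etl/warehouse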
10) What are the different incremental load or import options available in Sqoop? (see the sketch below)
a) Incremental append
b) Incremental last modified
c) Both a & b
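A hedged sketch of the two incremental modes; the connection string, table, columns and last values are hypothetical.

    # Incremental append: import only rows whose check column is greater than --last-value
    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --incremental append --check-column order_id --last-value 1000

    # Incremental lastmodified: import rows changed after the given timestamp
    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --incremental lastmodified --check-column updated_at \
      --last-value "2020-01-01 00:00:00" --merge-key order_id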
a) An options-file is a text file where each line identifies an option in the order that it
appears
b) To fetch minimum and maximum values of the --split-by column using the --boundary-
query parameter
c) It returns exactly one row with exactly two columns: the first column holds the lower bound and the second column the upper bound (see the sketch below)
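A small sketch of --boundary-query (the query, table and column names are hypothetical); it must return one row with two columns, the lower and upper bounds used for splitting.

    sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
      --split-by order_id \
      --boundary-query "SELECT MIN(order_id), MAX(order_id) FROM orders WHERE region = 'EU'"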
b) Parquet
c) ORC
d) Sequential
a) Snappy
b) Gzip
c) Lzo
a) True
b) False
b) Not possible
19) Can Sqoop handle character large objects (CLOB) & binary large objects (BLOB)?
a) Yes
b) No
a) Yes
b) No
b) Not possible
22) What option is used to validate the data received in HDFS against the RDBMS?
b) Not possible
23) What are the common delimiters and escape characters in Sqoop?
a) No , Not possible
a) Improves performance
a) Sqoop can update the data using last-modified option in incremental append
==================================================================================
===========================================================================
----------------------
1) d 2) d 3) d 4) b 5) d 6) b 7) c 8) a 9) b 10) c
11) a 12) d 13) c 14) d 15) e 16) b 17) a 18) a 19) a 20) a
21) a 22) a 23) d 24) b 25) d 26) c 27) b 28) c 29) a 30) d
===========================<--- END of Sqoop MCQ --->===========================
==================================================================================
1) What is Hive ?
a) Embedded
b) Local
c) Remote
3) What is Beeline ?
c) Beeline is a database
d) All the above
c) CLUSTER BY distributes the data to the same reducer on the basis of specific keys
c) SerDe reads a Hive table and writes data back to HDFS in any custom format
b) No
9) What is the use of Explain command ?
b) Partitioning creates folders, whereas bucketing applies a hash function and stores the data in files (see the sketch below)
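A hedged DDL sketch of the difference, run through Beeline here (the table, columns and connection URL are hypothetical): each partition value becomes a directory under the table path, while buckets hash the rows of each partition into a fixed number of files.

    beeline -u jdbc:hive2://localhost:10000 -e "
      CREATE TABLE sales (id BIGINT, amount DOUBLE)
      PARTITIONED BY (sale_date STRING)   -- one folder per sale_date value
      CLUSTERED BY (id) INTO 8 BUCKETS    -- rows hashed on id into 8 files per partition
      STORED AS ORC;"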
11) Is it possible to drop database and also tables under it automatically ? Why ?
a) No. A Hive database cannot be dropped while it is not empty, i.e. while tables still exist in it
c) To forcefully delete the database along with its tables, use the CASCADE option: DROP DATABASE <database name> CASCADE
a) Hive stores the schema (table names, column names, data types) in the Hive Metastore
c) A remote metastore uses a JDBC driver & Thrift network service to connect with the metastore DB in an external RDBMS
a) Vectorization improves performance and also reduces the CPU utilization drastically
b) Instead of processing one row at a time, it processes batches of rows (1024 rows); see the sketch below
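As a quick illustration, vectorized execution is normally toggled per session with the settings below (a sketch, assuming Hive 0.13 or later and an ORC-backed table; the connection URL and table are hypothetical).

    beeline -u jdbc:hive2://localhost:10000 -e "
      SET hive.vectorized.execution.enabled = true;          -- map-side vectorization
      SET hive.vectorized.execution.reduce.enabled = true;   -- reduce-side vectorization
      SELECT COUNT(*) FROM sales;"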
a) Struct
b) Map
c) Array
a) Improves performance
b) Smaller table will be loaded into memory to join with the data in each DataNode
c) Reducer will not be running as it is set to zero. Map output will be generated
18) Identify the wrong statement about where Hive reads the data from
b) False
a) True
b) False
a) BeeHive
b) HiveBee
c) HoneyBee
a) Google
b) Apache
c) Facebook
a) Parquet
b) ORC
c) Avro
a) True
b) False
a) True
b) False
a) No
b) Yes
b) False
a) True
b) False
34) A Hive table displays NULL values when there is an incorrect data type
a) True
b) False
c) When a managed table is dropped, both data & schema get deleted
a) When an external table is dropped, the schema is dropped but the data remains intact
a) True
b) False
a) The Metastore acts as an interface to integrate Hive tables, schema & HDFS data
a) True
b) No
a) No
a) No
a) Table
b) Partition
c) Bucketing
d) All the above
a) No
b) Yes
b) Sentry in Cloudera
c) Ranger in Hortonworks
a) AvroSerDe
b) LazySimpleSerDe
c) CSVSerDe
d) ParquetSerDE
==================================================================================
===========================================================================
----------------------
1) a 2) d 3) b 4) c 5) d 6) d 7) d 8) a 9) d 10) d
11) d 12) d 13) d 14) d 15) d 16) d 17) c 18) b 19) a 20) a
21) d 22) d 23) a 24) a 25) c 26) d 27) d 28) d 29) a 30) a
31) b 32) a 33) a 34) a 35) d 36) d 37) b 38) d 39) d 40) e
41) b 42) b 43) b 44) d 45) e 46) a 47) a 48) d 49) d 50) b
===========================<--- END of Hive MCQ --->===========================
==================================================================================
a) Two
b) Three
c) Four
d) Five
a) You can run Pig in either mode using the “pig” command
b) You can run Pig in batch mode using the Grunt shell
b) Pig scripts
c) Pig options
4. Pig Latin statements are generally organized in which of the following ways?
a) A LOAD statement to read data from the file system
b) The DISPLAY operator will display the results to your terminal screen
c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation
a) WRITE
b) READ
c) LOAD
7. You can run Pig in interactive mode using the ______ shell.
a) Grunt
b) FS
c) HDFS
a) Mapreduce
b) Tez
c) Local
d) All of the mentioned
a) $ pig -x local …
b) $ pig -x tez_local …
c) $ pig …
a) Mapreduce
b) Tez
c) Local
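For reference, a hedged sketch of launching the same (hypothetical) script in the different execution modes.

    pig -x local myscript.pig       # local mode: single JVM, local filesystem
    pig -x tez_local myscript.pig   # local Tez mode
    pig myscript.pig                # default: mapreduce mode on the Hadoop cluster
    pig -x tez myscript.pig         # Tez mode on the cluster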
==================================================================================
===========================================================================
---------------------------------
1) a 2) a 3) b 4) d 5) b 6) c 7) a 8) a 9) a 10) d
==================================================================================
=================================================
b) DESCRIBE
c) STORE
d) EXPLAIN
a) During the testing phase of your implementation, you can use LOAD to display results to your
terminal screen
b) You can view outer relations as well as relations defined in a nested FOREACH statement
3. Which of the following operators is used to view the MapReduce execution plans?
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN
a) ILLUSTRATE
b) DESCRIBE
c) STORE
d) EXPLAIN
a) ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin
statements
6. __________ is a framework for collecting and storing script-level statistics for Pig Latin.
a) Pig Stats
b) PStatistics
c) Pig Statistics
7. The ________ class mimics the behavior of the Main class but gives users a statistics object back.
a) PigRun
b) PigRunner
c) RunnerPig
8. ___________ is a simple xUnit framework that enables you to easily test your Pig scripts.
a) PigUnit
b) PigXUnit
c) PigUnitX
a) local
b) tez
c) mapreduce
==================================================================================
===========================================================================
---------------------------------
1) b 2) b 3) d 4) a 5) c 6) c 7) b 8) b 9) a 10) a
==================================================================================
=================================================
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN
b) You can view outer relations as well as relations defined in a nested FOREACH statement
3. Which of the following operators is used to view the MapReduce execution plans?
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN
a) ILLUSTRATE
b) DESCRIBE
c) STORE
d) EXPLAIN
a) ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin
statements
c) Several new private classes make it harder for external tools such as Oozie to integrate with Pig
statistics
6. __________ is a framework for collecting and storing script-level statistics for Pig Latin.
a) Pig Stats
b) PStatistics
c) Pig Statistics
7. The ________ class mimics the behavior of the Main class but gives users a statistics object back.
a) PigRun
b) PigRunner
c) RunnerPig
8. ___________ is a simple xUnit framework that enables you to easily test your Pig scripts.
a) PigUnit
b) PigXUnit
c) PigUnitX
a) local
b) tez
c) mapreduce
---------------------------------
1) b 2) b 3) d 4) a 5) c 6) c 7) b 8) b 9) a 10) a
==================================================================================
1) What is HBase ?
a) Region Server
b) HMaster
c) Zookeeper
a) HBase is consistent and provides fast row-lookup to access data from large hash tables on top of HDFS
c) HBase handles high volume data, Schema less, tables are replicated for failover
b) ZooKeeper shares the information of Region servers and the active state of HMaster
c) HBase client gets the META table information from Region Server with the help of
ZooKeeper
b) In case of any failure to any Region, HMaster is used to recover data from WAL and
restore the same in Region
c) Once the threshold exceeds, data will be flushed and stored as HFile or StoreFile in disk
a) A Bloom filter is a probabilistic data structure used to test whether an element is possibly present (false positives allowed) or definitely absent (no false negatives)
b) A Bloom filter maintains an in-memory reference structure by which it seeks an element (row) in a specific StoreFile (HFile)
c) A Bloom filter is memory efficient and reduces disk reads, which improves performance
b) A Region contains a contiguous, sorted range of rows between a start key and an end key
c) A Region Server hosts a collection of Regions (tables), approximately up to 1,000 regions per Region Server
a) Whenever a client sends a read/write request (DDL/DML), HMaster receives the request
and forwards it to the corresponding Region server
c) In case of a Region Server failure, HMaster organizes the data recovery from the WAL
a) ValueFilter
b) PageFilter
c) RowFilter
b) HBase sorts the versions of a cell from newest to oldest (descending order)
a) Diff
b) Prefix
c) Fast_Diff
c) Both a & b
a) No
c) HBase scans the rows by using a start row key and an end row key (see the sketch below)
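A small sketch of such a range scan driven from the shell (the table name and row keys are hypothetical); the stop row is exclusive.

    echo "scan 'orders', {STARTROW => 'row100', STOPROW => 'row200'}" | hbase shell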
b) An HBase table provides column families, which group collections of columns for data storage based on row values across different cluster nodes
b) When an HBase table grows large, the sharding option dynamically distributes it across the cluster
c) HBase splitting Regions and serving them from one Region Server to other Region Servers is considered auto-sharding
b) Row-Key is used for grouping cells logically and it ensures that all cells that have the same
Row-Keys are co-located on the same server
a) HFile contains a multi-layered index which allows HBase to seek to the data without
having to read the whole file
c) The HFile is loaded in the Block cache and indexes point by row key to the key-value data in 64KB “blocks”
b) Load Balancer ensures that the region replicas are not co-hosted in the same Region
Servers and also in the same rack
c) Cluster-wide load balancing occurs only when there are no Regions in transition, on a fixed period of time, using balanceCluster(Map).
a) Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key-
value store that runs on top of Hadoop.
b) Hive helps SQL savvy people to run MapReduce jobs where as HBase supports 4 primary
operations-put, get, scan and delete.
c) HBase is ideal for real-time querying of big data whereas Hive is an ideal choice for
analytical querying of data collected over point-in-time
a) The two important catalog tables in HBase are ROOT and META.
b) REST defines the semantics so that the protocol can be used in a generic way to address
remote resources.
c) REST also provides support for different message formats, offering many choices for a
client application to communicate with the server.
b) TTL uses the version of a cell which can be preserved till a specific time period.
c) TTL uses the cell timestamp; versions older than the specified time period are removed.
38) Is it possible to store HFiles in different file formats other than HBase?
a) No
b) Yes
a) Snappy
b) LZO
c) Gzip
a) HBase implements User Authorization to grant users permissions for particular actions on
a specified set of data
b) HBase authorization uses RPC level which is based on the Simple Authentication and
Security Layer (SASL)
a) When the system crashes, data in the MemStore is lost, and the HFile will be missing the lost data
b) WAL records all the changes and stores it as File system in disk
c) HMaster uses WAL for fail-over and retrieves data and restore it in Region
a) No
b) Yes
a) The “Block” in BlockCache is the smallest unit of data that HBase reads from HFile disk in a
single pass to maintain frequently used data
b) Block cache is used to keep most recently used data in JVM Heap apart from MemStore
c) HBase requires only looking up that block’s location in the index and retrieving it to avoid
disk read
44) Narrate the process of HBase reading data from different places & validation process before
returning the value ?
a) First, HBase checks the MemStore for any pending modifications; the HFile contains a snapshot of the MemStore at the point it exceeded the threshold
b) Secondly, it checks the BlockCache to see if the block containing this row has been recently accessed
c) Finally, the relevant HFiles on disk are accessed; HBase accesses all HFiles required for the row in order to complete the record
45) What is the Impact of "Turn off WAL on Puts" or What is the Deferred Log Flush in HBase ?
a) When the MemStore exceeds the threshold, data is flushed to an HFile on disk. By default AutoFlush is true
b) To improve performance, AutoFlush is set to false so that Puts are buffered in the HTable client and sent to the Region Server in batches rather than one at a time
a) HBase automatically combines smaller HFiles and rewrites them into one HFile
==================================================================================
===========================================================================
----------------------
1) d 2) d 3) e 4) b 5) a 6) d 7) d 8) d 9) d 10) d
11) d 12) d 13) d 14) d 15) d 16) d 17) b 18) b 19) b 20) d
21) d 22) d 23) a 24) b 25) d 26) d 27) d 28) d 29) c 30) b
31) d 32) d 33) d 34) d 35) d 36) d 37) d 38) d 39) d 40) c
===========================<--- END of HBase MCQ --->===========================
==================================================================================
HBase - 1
1. HBase is a distributed ________ database built on top of the Hadoop file system.
a) Column-oriented
b) Row-oriented
c) Tuple-oriented
a) HDFS provides low latency access to single rows from billions of records (Random access)
b) HBase sits on top of the Hadoop File System and provides read and write access
a) Row Oriented
b) Schema-less
c) Fixed Schema
a) BigTop
b) Bigtable
c) Scanner
d) FoundationDB
a) Region
b) Master
c) Zookeeper
a) status
b) version
c) whoami
d) user
a) enabled
b) disabled
c) drop
a) select
b) get
c) put
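A hedged sketch of the shell commands referred to above; the table, column family and values are hypothetical.

hbase shell <<'EOF'
status                                 # cluster status
version                                # HBase version
create 't1', 'cf'                      # table with one column family
put 't1', 'row1', 'cf:col1', 'value1'  # write one cell
get 't1', 'row1'                       # read it back
disable 't1'
drop 't1'
EOF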
10. HBaseAdmin and ____________ are the two important classes in this package that provide DDL
functionalities.
a) HTableDescriptor
b) HDescriptor
c) HTable
d) HTabDescriptor
==================================================================================
==================================================================================
=========
----------------------
1) a 2) b 3) b 4) b 5) c 6) b 7) c 8) b 9) b 10) a
==================================================================================
==================================================================================
==============
HBase - 2
1. The minimum number of row versions to keep is configured per column family via _____________
a) HBaseDecriptor
b) HTabDescriptor
c) HColumnDescriptor
a) “bytes-in/bytes-out”
b) “bytes-in”
c) “bytes-out”
4. One supported data type that deserves special mention is ____________
a) money
b) counters
c) smallint
d) tinyint
a) Where time-ranges are very wide (e.g., year-long report) and where the data is
voluminous, summary tables are a common approach
c) HBase does not currently support ‘constraints’ in traditional (SQL) database parlance
a) rowkey
b) columnkey
c) counterkey
a) OpenTS
b) OpenTSDB
c) OpenTSD
d) OpenDB
8. Which command is used to disable all the tables matching the given regex?
a) remove all
b) drop all
c) disable_all
a) drop
b) truncate
c) delete
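A hedged shell sketch of these maintenance commands (the table names and regex are hypothetical); disable_all asks for confirmation, which is why a 'y' is piped in.

echo -e "disable_all 'test.*'\ny" | hbase shell   # disable every table whose name matches the regex
echo "truncate 't1'" | hbase shell                # disables, drops and recreates 't1', removing all rows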
----------------------
1) c 2) a 3) a 4) b 5) c 6) a 7) b 8) c 9) b 10) b
==================================================================================
==================================================================================
==============
HBase - 3
a) set
b) reset
c) alter
d) select
a) You can add a column family to a table using the method addColumn()
a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE
c) MAX_FILESIZE
4. You can delete a column family from a table using the method _________ of the HBaseAdmin class.
a) delColumn()
b) removeColumn()
c) deleteColumn()
a) To read data from an HBase table, use the get() method of the HTable class
b) You can retrieve data from the HBase table using the get() method of the HTable class
c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row
ids, or scan an entire table or a subset of rows
a) Configuration
b) Collector
c) Component
7. The ________ class provides the getValue() method to read the values from its instance.
a) Get
b) Result
c) Put
d) Value
a) Master Server
b) Region Server
c) Htable
a) [Link]
b) [Link]
c) [Link]
10. HBase uses the _______ File System to store its data.
a) Hive
b) Imphala
c) Hadoop
d) Scala
==================================================================================
==================================================================================
===
----------------------
1) c 2) a 3) a 4) c 5) d 6) a 7) b 8) b 9) b 10) c
==================================================================================
==================================================================================
===
===========================<--- END of HBase MCQ --->===========================
==================================================================================
a) 3
b) 1
c) 2
d) 4
a) HDFS
b) MapReduce
a) Reads each line of the input file as the value, and the associated key is the byte offset
Note: In Hadoop version 1.0 it is the JobTracker and in version 2.0 the ResourceManager
a) Fair
b) Capacity
c) None
d) Both a & b
a) Key-Value Inputformat
b) TextInputFormat
c) Parquet Format
a) Setup
b) Run
c) Map
a) Sort
b) Shuffle
c) Reduce
b) Copies the files from HDFS to Local disk (non-hdfs) to improve performance
a) IntWrittable
b) FloatWritable
c) LongWritable
21) Reducer process starts only after 100% completion of Map process
a) True
b) False
c) use WebUI
a) map
b) reduce
c) mapper
d) reducer
24) Which option allows you to control which keys (and hence records) go to which Reducer by implementing a custom one?
a) Partitioner
b) OutputSplit
c) Reporter
25) Which option does the MR job use to ____________ the progress and also set status messages?
a) Partitioner
b) OutputSplit
c) Reporter
26) The MR Job uses ____________ option to set the number of reducers required
a) [Link](int)
b) [Link](int);
c) [Link](int)
Note:
-----
27) The MR job groups Reducer inputs by key in the _________ stage.
a) sort
b) shuffle
c) reduce
28) The output of the reduce task is typically written to the FileSystem via _____________
a) [Link]
b) [Link]
c) [Link]
d) [Link]
a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
30) Is it possible to set the number of Reducers to zero when reduce output is not required? (see the sketch below)
a) True
b) False
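A hedged sketch of forcing a map-only job from the command line (the jar, class and paths are hypothetical, and the driver is assumed to use ToolRunner so that -D is honoured); with zero reducers the map output is written directly to HDFS.

    hadoop jar my-app.jar com.example.MyJob \
      -D mapreduce.job.reduces=0 \
      /user/etl/input /user/etl/output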
31) The MR Job uses __________ to launch the application in the Hadoop framework .
a) JobCon
b) Job
c) JobConfiguration
32) In an MR job, the ________________ option is used as a threshold for the serialization buffers?
a) [Link]
b) [Link]
c) [Link]
a) [Link]
b) [Link]
c) [Link]
a) True
b) False
35) The MR job uses _____________ to distribute both jars and native libraries
a) DistributedLog
b) DistributedCache
c) DistributedJars
a) SSL
b) Kerberos
c) SSH
37) ________ is the architectural center of Hadoop that allows multiple data processing engines.
a) YARN
b) Hive
c) Incubator
d) Chuckwa
a) NodeManager
b) ResourceManager
c) ApplicationMaster
40) The ____________ is the ultimate authority that arbitrates resources among all the
applications in the system.
a) NodeManager
b) ResourceManager
c) ApplicationMaster
41) The __________ is responsible for allocating resources to the various running applications
subject to familiar constraints of capacities, queues etc.
a) Manager
b) Master
c) Scheduler
b) parallel job
==================================================================================
===========================================================================
----------------------
1) c 2) b 3) d 4) d 5) d 6) a 7) a 8) d 9) d 10) b
11) d 12) d 13) d 14) d 15) d 16) d 17) b 18) d 19) d 20) d
21) a 22) d 23) a 24) a 25) c 26) b 27) a 28) a 29) c 30) a
31) b 32) a 33) c 34) a 35) b 36) b 37) a 38) c 39) c 40) c
===========================<--- END of MapReduce & YARN MCQ --->===========================
==================================================================================
Spark - 1
a) Mahek Zaharia
b) Matei Zaharia
c) Doug Cutting
d) Stonebraker
a) RSS abstraction provides distributed task dispatching, scheduling, and basic I/O
functionalities
a) Spark Streaming
b) Spark SQL
c) RDDs
a) Spark Streaming
b) Spark SQL
c) RDDs
a) For distributed storage, Spark can interface with a wide variety, including Hadoop
Distributed File System (HDFS)
b) Spark also supports a pseudo-distributed mode, usually used only for development or
testing purposes
6. ______________ leverages Spark Core fast scheduling capability to perform streaming analytics.
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
a) MLlib
b) Spark Streaming
c) GraphX
a) GaAdt
b) Spark Core
c) Pregel
10. Spark architecture is ___________ times as fast as Hadoop disk-based Apache Mahout and even
scales better than Vowpal Wabbit.
a) 10
b) 20
c) 50
d) 100
==================================================================================
==================================================================================
=========
----------------------
1) b 2) b 3) b 4) c 5) d 6) b 7) a 8) c 9) c 10) a
==================================================================================
==================================================================================
==============
This set of Hadoop Questions for campus interviews focuses on “Spark with Hadoop – 2”.
1. Users can easily run Spark on top of Amazon’s __________
a) Infosphere
b) EC2
c) EMR
a) Spark enables Apache Hive users to run their unmodified queries much faster
3. Spark runs on top of ___________ a cluster manager system which provides efficient resource
isolation across distributed applications.
a) Mesjs
b) Mesos
c) Mesus
4. Which of the following can be used to launch Spark jobs inside MapReduce?
a) SIM
b) SIMR
c) SIR
d) RIS
b) Spark was designed to read and write data from and to HDFS, as well as other storage
systems
c) Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can
simply run Spark on YARN
a) Java
b) Pascal
c) Scala
d) Python
7. Spark is packaged with higher level libraries, including support for _________ queries.
a) SQL
b) C
c) C++
8. Spark includes a collection over ________ operators for transforming data and familiar data frame
APIs for manipulating semi-structured data.
a) 50
b) 60
c) 70
d) 80
9. Spark is engineered from the bottom-up for performance, running ___________ faster than
Hadoop by exploiting in memory computing and other optimizations.
a) 100x
b) 150x
c) 200x
10. Spark powers a stack of high-level tools including Spark SQL, MLlib for _________
a) regression models
b) statistics
c) machine learning
d) reproductive research
==================================================================================
==================================================================================
=
----------------------
1) b 2) a 3) b 4) b 5) a 6) b 7) a 8) d 9) a 10) c
==================================================================================
==================================================================================
===
===========================<--- END of Spark MCQ --->===========================
==================================================================================
a) Impala
b) ActiveMQ
c) BigTop
d) Zookeeper
a) The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of
real-time publish-subscribe feeds
b) Activity tracking is often very high volume as many activity messages are generated for each user
page view
a) log aggregation
b) compaction
c) collection
a) Event sourcing
b) Commit Log
c) Stream Processing
b) The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes
to restore their data
c) Kafka comes with a command line client that will take input from a file or from standard input and
send it out as messages to the Kafka cluster
6. Kafka uses __________ so you need to first start a ZooKeeper server if you don’t already have
one.
a) Impala
b) ActiveMQ
c) BigTop
d) Zookeeper
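A hedged quick-start sketch using the scripts shipped with the Kafka distribution (paths and ports are the shipped defaults; older releases use --zookeeper where newer ones use --bootstrap-server).

    bin/zookeeper-server-start.sh config/zookeeper.properties    # start ZooKeeper first
    bin/kafka-server-start.sh config/server.properties           # then start the Kafka broker
    bin/kafka-topics.sh --create --topic test --partitions 3 --replication-factor 1 \
      --zookeeper localhost:2181
    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test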
7. __________ is the node responsible for all reads and writes for the given partition.
a) replicas
b) leader
c) follower
d) isr
8. __________ is the subset of the replicas list that is currently alive and caught-up to the leader.
a) replicas
b) leader
c) follower
d) isr
9. Kafka uses key-value pairs in the ____________ file format for configuration.
a) RFC
b) Avro
c) Property
10. __________ is the amount of time to keep a log segment before it is deleted.
a) [Link]
b) [Link]
c) [Link]
d) [Link]
-----------------------------------
1) b 2) d 3) a 4) a 5) d 6) d 7) b 8) d 9) c 10) b
==================================================================================
===========================================================================
a) Oozie
b) Kafka
c) Lucene
d) BigTop
a) With kafka, more users, whether using SQL queries or BI applications, can interact with more data
a) topics
b) chunks
c) domains
d) messages
4. Kafka is run as a cluster comprised of one or more servers each of which is called __________
a) cTakes
b) broker
c) test
c) Kafka is designed to allow a single cluster to serve as the central data backbone for a large
organization
d) Messages are persisted on disk and replicated within the cluster to prevent data loss
6. Communication between the clients and the servers is done with a simple, high-performance,
language agnostic _________ protocol.
a) IP
b) TCP
c) SMTP
d) ICMP
7. The only metadata retained on a per-consumer basis is the position of the consumer in the log,
called __________
a) offset
b) partition
c) chunks
8. Each kafka partition has one server which acts as the _________
a) leaders
b) followers
c) staters
a) kafka
b) Slider
c) Suz
10. Kafka only provides a _________ order over messages within a partition.
a) partial
b) total
c) 30%
-----------------------------------
1) b 2) b 3) a 4) b 5) a 6) b 7) a 8) a 9) a 10) b
==================================================================================
===========================================================================
a) temp nodes
b) fleeting nodes
c) ephemeral nodes
d) terminating nodes
2. The znodes that continue to exist even after the creator of the znode dies are called:
a) ephemeral nodes
b) persistent nodes
c) sequential nodes
d) pure nodes
a) counter property
b) sequential property
c) ascending property
d) hierarchical property
a) Ephmeral feature
b) Persistent feature
c) watch feature
d) sequential feature
a) server hookups
b) client APIs
c) property files
d) Classes
a) [Link]
b) [Link]
c) [Link]
d) [Link]
b) [Link]
c) [Link]
d) [Link]
a) Kafka producer
b) Kafka consumer
c) Kafka topic
d) ZooKeeper server
9. A Kafka topic is setup with a replication factor of 5. Out of these, 2 nodes in the cluster have
failed. Business users are concerned that they may lose messages. What do you tell them?
a) They need to stop sending messages till you bring up the 2 servers
b) They need to stop sending messages till you bring up at least one server
c) They can continue to send messages as there is fault tolerance of 4 server failures.
d) They can continue to send messages as you are keeping a tape back up of all the messages
10. A kafka cluster has 20 nodes. There are 5 topics created, each with 6 partitions. How many total
number of broker processes will be running?
11. You have tested that a Kafka cluster with five nodes is able to handle ten million messages per
minute. Your input is likely to increase to twenty five million messages per minute. How many more
nodes should be added to the cluster?
a) 15
b) 13
c) 8
d) 5
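A hedged worked check of the arithmetic in question 11: assuming throughput scales roughly linearly with node count, 25M / 10M x 5 nodes = 12.5, rounded up to 13 nodes in total, which is 8 more than the existing 5.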
a) Zero
c) One
a) A consumer instance gets the messages in the same order as they are produced.
a) Quorum based
b) primary-backup method
c) multi-primary method
d) Journal based
15. When messages pass from producer to broker to consumer, data modification is minimized by using:
a) Message compression
b) Message sets
d) Partitions
a) Complete shutdown
b) Partial failures
c) Network latency
d) Power failures
18. Which of the following best describes the relationship between ZooKeeper and partial failures?
19. Replication of data can result in improved fault tolerance. Which of the following is a
disadvantage of replication?
a) Inconsistent states
b) Loss of data
c) Deadlocks
d) Partial failures
a) A graph of znodes
b) A tree of znodes
d) A list of znodes
a) Complete shutdown
b) Partial failures
c) Network latency
d) Power failures
22. Which of the following best describes the relationship between ZooKeeper and partial failures?
23. Replication of data can result in improved fault tolerance. Which of the following is a
disadvantage of replication?
a) Inconsistent states
b) Loss of data
c) Deadlocks
d) Partial failures
a) A graph of znodes
b) A tree of znodes
d) A list of znodes
a) internal log
b) external log
c) temporary log
d) No log
a) true
b) false
Note:
Answers to Kafka 3:
-----------------------------------
13) c 14) a 15) c 16) c 17) b 18) b 19) c 20) b 21) b 22) d 23) a 24) b
25) b 26) b
===========================<--- END of Kafka MCQ --->===========================
==================================================================================
a) Spark SQL
b) Spark Streaming
c) Spark MLlib
d) Spark GraphX
a) Python
b) Java
c) Scala
a) The driver program runs the main () function of the application in the Master Node. One
driver per application
d) Driver program splits user application into smaller execution units known as tasks
a) SparkConf is used to set configuration parameters as key-value pair for Spark Applications
b) In Spark 2.0, SparkSession was introduced as a new entry point to launch applications that process structured datasets, viz. DataFrame & Dataset
a) Lazy Evaluation means that the execution will not take place until an action is triggered
c) Transformations are lazy in nature, meaning that when we call some operation on an RDD, it is not executed immediately
a) The Directed Acyclic Graph converts the logical execution plan to a physical execution plan. The DAG provides a flow of execution
b) A DAG in Spark is a set of vertices (RDDs) and edges (operations applied). The DAG takes care of processing in Spark
c) The DAG achieves fault tolerance and replaces a lost RDD partition in the cluster with the help of lineage
a) MEMORY_ONLY,MEMORY_ONLY_2
b) MEMORY_AND_DISK, MEMORY_AND_DISK_2
c) MEMORY_AND_DISK_SER
b) A checkpointed RDD gets stored in HDFS. Unlike cache, the checkpoint file is not deleted
c) Spark performs the process at both the cache & the checkpointing directory
b) A Spark application is submitted to YARN, which allocates the resources (executors) to the Spark tasks
c) YARN provides two types of deployment modes, yarn-cluster & yarn-client, to launch the driver program
b) An action is one of the ways of sending data from Executor to the Driver
c) A stage is a part of a Spark job that is split into a number of tasks. The DAG maintains the execution of the tasks (IDs) one by one
a) RDD lineage is a graphical representation of all the parent RDDs and their dependencies
b) Lineage is used to achieve fault tolerance. If any RDD partition is lost, the DAG recreates the lost RDD from the parent RDD using lineage
a) Shared variables are required to be used by many functions & methods in parallel across the cluster. Shared variables are responsible for performance enhancement
b) Broadcast - used to cache the same set of values in memory on all worker nodes, i.e. outside the executor
a) CSV, JSON
b) Parquet, ORC
c) Avro, SequenceFile
a) Catalyst is a tree-like representation which creates a catalogue to track the DataFrame & Dataset
b) Catalyst applies a set of rule-based and cost-based optimization techniques to manipulate data
c) Catalyst optimization takes care of logical plan, physical plan and Java code generation
b) The Lambda architecture comprises the Batch Layer, the Speed Layer (also known as the Stream layer) and the Serving Layer
c) Spark uses the Streaming layer to process data as and when it arrives (in memory), producing new data (micro batches) with an incremental view, whereas the Batch layer recomputes views over the complete dataset
b) The Batch Layer manages the master dataset and pre-computes the batch views. Processed once, read many
b) The Speed Layer deals with recently received data only, i.e. micro batches. Random read & write. Incremental computation
c) The Serving Layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis
30) Which of the following organized data will have named column?
a) RDD
b) DataFrame
c) Both a & b
a) Java
b) Kryo
c) Avro
b) Spark SQL provides a programming abstraction called DataFrame and also runs SQL queries on a distributed dataset
a) Hadoop MapReduce uses disk I/O to read and write HDFS files - a time-consuming process
c) Spark applies DAG model for executing multiple stages whereas Hadoop MapReduce
handles only two tasks(Map & Reduce) in application
34) Name the various partitioners available in Spark and their usage?
a) Hash Partitioner - provides partitioning based on a unique key and splits data uniformly across the various partitions
b) Range Partitioner - partitions based on key ranges so that keys within the same range appear on the same machine
c) Custom Partitioner - provides a mechanism to adjust the size and number of partitions or
the partitioning scheme based on application need
b) Caching supports iterative computations, meaning that results are reused over multiple computations in multi-stage applications
c) Caching supports interactive use by saving results for upcoming stages so that we can reuse them
a) DAG maintains all the transformation details of parent RDD and its dependencies in a
logged Graph - called lineage
b) In case of any lost partition in the cluster, DAG re-creates partition using lineage graph
a) The machine on which the Spark Standalone Cluster Manager runs is called the Master
Node
b) In the Master, the driver program runs the main() function where the SparkContext is created. The Master, with the help of the cluster manager, launches the executors for the application
c) There can be only one Master per cluster. Usually it happens to be the Resource Manager
b) Worker node consists of processes that can run in parallel to perform the tasks scheduled
by the driver program.
a) Cluster Manager is the process responsible for monitoring the Worker nodes and
providing resources to those nodes upon request by the Master
b) Cluster Manager function is the YARN Resource Manager process for Spark applications
running on Hadoop clusters
a) Spark executors are the processes, or CPU and memory resources, allocated on the slave nodes (Worker Nodes)
c) Each executor is dedicated to running a specific application. The application uses JVM heap memory, set via --executor-memory, on the worker node
a) Gzip
b) Lzo
c) Snappy
a) createOrReplaceTempView is used when you want to store a specific table for a particular Spark session
b) createOrReplaceTempView creates a lazily evaluated view, like a table, against which queries can be executed in Spark SQL
a) Spark Streaming is one of the core components of Spark, which provides an abstraction called a discretized stream or DStream
b) Spark Streaming is fault-tolerant and scalable & better load balancing & resource usage
a) Yarn Cluster
b) Yarn Client
c) Master
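A hedged sketch of the two YARN deployment modes from the command line (the application file and resource sizes are hypothetical).

    spark-submit --master yarn --deploy-mode cluster \
      --num-executors 4 --executor-memory 4g my_app.py   # driver runs inside the cluster, in the ApplicationMaster
    spark-submit --master yarn --deploy-mode client \
      --num-executors 4 --executor-memory 4g my_app.py   # driver runs on the machine that submitted the job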
b) DStreams can either be created from live data (such as data from HDFS, Kafka or Flume) or they can be generated by transforming existing DStreams
c) A DStream periodically generates an RDD (micro batch), either from live data or by transforming the RDD generated by a parent DStream.
b) Speculative execution is a health-check procedure to identify tasks running slower on a worker node
a) No
a) No
b) Yes
a) No
a) The repartition method can be used to either increase or decrease the number of partitions in an RDD or DataFrame.
b) Repartition is a full shuffle operation: the whole data is taken out of the existing partitions and equally distributed into the newly formed partitions
b) Coalesce avoids a full shuffle; instead of creating new partitions, it shuffles the data using the Hash Partitioner (default) and adjusts it into the existing partitions
a) Data locality in simple terms means doing computation on the node where actual data
resides.
b) Spark driver, in Hadoop, contacts NameNode about the DataNodes for various blocks
c) Spark uses YARN Cluster manager to place the tasks alongside HDFS blocks
58) What are the various data locality and placement policies in Spark?
a) PROCESS_LOCAL
b) NODE_LOCAL
c) RACK_LOCAL
d) NO_PREF, ANY
a) RPC Authentication
b) ACL Authentication
c) Kerberos Authentication
d) SSL Authentication
61) What is on-heap & off-heap memory and how is it used in Spark?
a) On-heap memory is dynamically allocated memory used by the JVM, with GC enabled; fast
b) Off-heap memory is memory managed outside the JVM heap, used to store serialised objects as byte arrays; no GC enabled, slower
c) The Spark JVM by default uses on-heap memory to run an application. Off-heap is used when on-heap memory is exceeded
==================================================================================
===========================================================================
----------------------
1) d 2) d 3) d 4) e 5) a 6) d 7) d 8) d 9) b 10) a
11) d 12) d 13) c 14) d 15) d 16) d 17) d 18) d 19) d 20) d
21) c 22) d 23) d 24) a 25) d 26) d 27) b 28) c 29) c 30) b
31) a 32) d 33) d 34) d 35) d 36) d 37) c 38) d 39) c 40) d
41) d 42) d 43) d 44) d 45) c 46) d 47) d 48) c 49) b 50) c
51) c 52) b 53) a 54) b 55) d 56) d 57) d 58) d 59) e 60) b
61) d
===========================<--- END of Spark MCQ --->===========================
==================================================================================
a) 100 MB
b) 256 MB
c) 64 MB
d) 128 MB
a) 4
b) 2
c) 3
d) 1
a) Heart beat
b) HTTP
c) Active pulse
d) Hdfs signal
4) HDFS Storage uses actual file size as storage
a) False
b) True
c) Sometimes
d) All true
a) [Link]
b) [Link]
c) [Link]
d) [Link]
a) copyFromLocal
b) distcp
c) put
d) copyToLocal
b) Maintains [Link]
a) Python
b) Java
c) C ++
d) Scala
a) NameNode
b) SecondaryNode
c) DataNode
a) No
b) Yes
a) No
b) Yes
15) What is the Checkpointing process?
a) Copy of Data blocks are stored in another DataNode to prevent data loss
a) Information about the HDFS file system metadata is stored on disk in a file called the FsImage
c) The in-memory Namespace & Block pool information in the NameNode is stored permanently on disk as an FsImage file in image format
a) Cloudera
b) HortonWorks
c) MapR
a) Volume
b) Velocity
c) Variety
d) Veracity
c) Network bandwidth
a) DataNode
b) SecondaryNode
c) NameNode
28) In HDFS shell commands, what command is used to copy an entire folder?
a) CopyFromLocal
b) CopyToLocal
c) put
d) distcp
29) In HDFS shell commands, which command is used to combine multiple HDFS files into one file on a local path? (see the sketch below)
a) getMerge
b) put
c) distcp
d) copyToLocal
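A hedged sketch of the copy and merge commands above (all paths are hypothetical).

    hdfs dfs -put /tmp/localdir /user/data/                  # copy a local folder into HDFS
    hdfs dfs -getmerge /user/data/output/part-* merged.txt   # concatenate HDFS files into one local file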
30) HDFS is ____________ processing ?
a) Parallel
b) Batch
c) Sequence
31) In which configuration file can you alter the replication factor & set the block size? (see the sketch below)
a) [Link]
b) [Link]
c) hdfs-site.xml
d) [Link]
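The defaults live in hdfs-site.xml (dfs.replication, dfs.blocksize), but for reference they can also be overridden per command, as in this hedged sketch with hypothetical paths.

    hdfs dfs -setrep -w 2 /user/data/file.csv                      # change the replication of an existing file
    hdfs dfs -D dfs.blocksize=268435456 -put big.csv /user/data/   # write a file with a 256 MB block size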
34) What are the different data types that Hadoop can handle?
a) structured
b) un-structured
c) semi-structured
c) Discarded hardware
a) Client-Server architecture
b) Peer-to-Peer architecture
c) Master-Slave architecture
a) Local mode
a) HDFS
b) MapReduce
39) The SecondaryNameNode can be used as the active NameNode in case of any failure of the NameNode
a) True
b) False
a) Horizontal Scalability
b) Distributed Computation
c) Fault tolerance
c) DataNode sends periodic heartbeat signal and shares block pool report
a) HDFS metadata represents the HDFS directories and files in a tree structure
c) At start-up time, the NameNode loads both the FsImage & Edits log from disk into memory
a) delete
b) rm
c) rm -r
d) expunge
45) Which HDFS command is used to copy a file from one HDFS path to another HDFS path?
a) cp
b) distcp
c) put
d) copyFromLocal
a) cp
b) touchz
c) copyFromLocal
d) copyToLocal
47) Which HDFS command is used to clear all the files (empty) from the trash?
a) delete
b) rm
c) expunge
48) Which HDFS command is used to read the content of files in HDFS? (see the sketch below)
a) cat
b) read
c) view
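A hedged sketch of the commands behind questions 44-48 (all paths are hypothetical).

    hdfs dfs -rm -r /user/data/old           # remove a directory recursively (goes to trash if enabled)
    hdfs dfs -cp /user/data/a /user/data/b   # copy within HDFS
    hdfs dfs -touchz /user/data/empty.txt    # create a zero-length file
    hdfs dfs -expunge                        # empty the trash
    hdfs dfs -cat /user/data/b/part-00000    # print file contents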
49) Is it possible to customize or change the HDFS block size from the default size?
a) Yes
b) No
----------------------
1) d 2) c 3) a 4) b 5) b 6) c 7) a 8) b 9) b 10) a
11) b 12) a 13) b 14) a 15) a 16) a 17) b 18) a 19) a 20) d
21) e 22) b 23) d 24) a 25) d 26) c 27) c 28) c 29) a 30) b
31) c 32) d 33) a 34) d 35) d 36) c 37) d 38) c 39) b 40) e
41) d 42) d 43) c 44) d 45) a 46) b 47) c 48) a 49) a 50) d
===========================<--- END of HDFS MCQ --->===========================
==================================================================================
Hive - 1
a) logj4
b) log4l
c) log4i
d) log4j
a) list FILE[S] <filepath>* executes a Hive query and prints results to standard output
b) <query string> executes a Hive query and prints results to standard output
a) Log level
b) Log modes
c) Log source
a) BeeLine
b) SqlLine
c) HiveLine
d) CLilLine
5. Point out the wrong statement.
a) 0.10.0
b) 0.9.0
c) 0.11.0
d) 0.12.0
a) set -v x=myvalue
b) set x=myvalue
c) reset x=myvalue
a) set [Link]=false;
b) set [Link]=false;
c) set [Link]=true;
9. _______ supports a new command shell Beeline that works with HiveServer2.
a) HiveServer2
b) HiveServer3
c) HiveServer4
a) Remote
b) HTTP
c) Embedded
d) Interactive
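For reference, a hedged sketch of connecting Beeline to HiveServer2 over the binary (remote) and HTTP transports; host, ports and paths are illustrative defaults.

    beeline -u "jdbc:hive2://hs2host:10000/default" -n hive                                 # binary / remote mode
    beeline -u "jdbc:hive2://hs2host:10001/default;transportMode=http;httpPath=cliservice"  # HTTP mode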
==================================================================================
==================================================================================
=
----------------------
1) d 2) b 3) a 4) a 5) a 6) c 7) d 8) a 9) a 10) a
==================================================================================
==================================================================================
===
Hive - 2
1. Hive specific commands can be run from Beeline, when the Hive _______ driver is used.
a) ODBC
b) JDBC
c) ODBC-JDBC
3. _________ reduces the amount of informational messages displayed (true) or not (false).
a) –silent=[true/false]
b) –autosave=[true/false]
c) –force=[true/false]
a) –incremental=[true/false]
b) –isolation=LEVEL
c) –force=[true/false]
d) –truncateTable=[true/false]
b) CSV and TSV output formats are maintained for forward compatibility
6. The ________ allows users to read or write Avro data as Hive tables.
a) AvroSerde
b) HiveSerde
c) SqlSerde
7. Starting in Hive _______ the Avro schema can be inferred from the Hive table schema.
a) 0.14
b) 0.12
c) 0.13
d) 0.11
8. The AvroSerde has been built and tested against Hive 0.9.1 and later, and uses Avro _______ as of
Hive 0.13 and 0.14.
a) 1.7.4
b) 1.7.2
c) 1.7.3
a) map
b) record
c) string
d) enum
10. Which of the following data type is converted to Array prior to Hive 0.12.0?
a) map
b) long
c) float
d) bytes
==================================================================================
==================================================================================
=
----------------------
1) b 2) c 3) a 4) b 5) b 6) a 7) a 8) d 9) d 10) d
==================================================================================
==================================================================================
===
Hive - 3
a) “STORED AS AVRO”
b) “STORED AS HIVE”
c) “STORED AS AVROHIVE”
d) “STORED AS SERDE”
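As a quick illustration of the "STORED AS AVRO" clause above, a hedged sketch (the table and columns are hypothetical, assuming Hive 0.14+ where the Avro schema is inferred from the table schema).

    beeline -u "jdbc:hive2://hs2host:10000/default" -e "
      CREATE TABLE events_avro (id BIGINT, name STRING, ts TIMESTAMP)
      STORED AS AVRO;"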
a) Union
b) Intersection
c) Set
4. The files that are written by the _______ job are valid Avro files.
a) Avro
b) Map Reduce
c) Hive
a) [Link]
b) [Link]
c) [Link]
7. _______ is interpolated into the quotes to correctly handle spaces within the schema.
a) $SCHEMA
b) $ROW
c) $SCHEMASPACES
d) $NAMESPACES
9. ________ was designed to overcome limitations of the other Hive file formats.
a) ORC
b) OPC
c) ODC
a) postscript
b) stripes
c) script
==================================================================================
==================================================================================
=
----------------------
1) a 2) b 3) a 4) c 5) c 6) a 7) a 8) a 9) a 10) b
==================================================================================
==================================================================================
===
Hive – 4
a) Footer
b) STRIPES
c) Dictionary
d) Index
b) Streams are compressed using a codec, which is specified as a table property for all
streams in that table
3. _______ is a lossless data compression library that favors speed over compression ratio.
a) LOZ
b) LZO
c) OLZ
4. Which of the following will prefix the query string with parameters?
a) SET [Link]=false
b) SET [Link]=false
c) SET [Link]=true
a) SMALL INT
b) INT
c) BIG INT
d) TINY INT
a) C
b) Java
c) Python
d) Scala
8. Which of the following statement will create a column with varchar datatype?
a) INSERT WRITE
b) INSERT OVERWRITE
c) INSERT INTO
a) Scalar
b) Complex
c) INT
d) CHAR
==================================================================================
==================================================================================
=
----------------------
1) c 2) b 3) a 4) a 5) b 6) b 7) a 8) b 9) c 10) b
==================================================================================
==================================================================================
=
===========================<--- END of Hive MCQ --->===========================
==================================================================================
===========================================================================
Compiled by : V. Vasu 91-9940156760
[Link]@[Link]
==================================================================================
===========================================================================