
CS8091-BIG DATA ANALYTICS– QUESTION BANK / NOTES


UNIT V NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION
NoSQL Databases : Schema-less Models: Increasing Flexibility for Data Manipulation-Key Value
Stores- Document Stores - Tabular Stores - Object Data Stores - Graph Databases- Hive - Sharding
- Hbase – Analyzing big data with twitter - Big data for E-Commerce-Big data for blogs - Review
of Basic Data Analytic Methods using R.
1. NOSQL:
• Most hardware and software appliances support standard approaches to SQL-based relational
database management systems (RDBMSs). Software appliances often bundle their execution engines with
the RDBMS and utilities for creating the database structures and for bulk data loading.
• Some algorithms will not be able to consume data from traditional RDBMSs and will be acutely
dependent on alternative means of data management.
• The term “NoSQL” may convey two different connotations—one implying that the data management
system is not an SQL-compliant one, while the more accepted implication is that the term means “Not
only SQL,” suggesting environments that combine traditional SQL (or SQL-like query languages) with
alternative means of querying and access.
2. “SCHEMA-LESS MODELS”: INCREASING FLEXIBILITY FOR DATA MANIPULATION:
• NoSQL databases reduce the dependence on more formal database administration. They may benefit
both the application developer and the end-user analysts when their interactive analyses are not throttled
by the need to cast each query in terms of a relational table-based environment.
• Different NoSQL frameworks are optimized for different types of analyses.
• For example, some are implemented as keyvalue stores, which nicely align to certain big data
programming models, while another emerging model is a graph database, in which a graph abstraction is
implemented to embed both semantics and connectivity within its structure.
• the general concepts for NoSQL include schema less modeling in which the semantics of the data are
embedded within a flexible connectivity and storage model; this provides for automatic distribution of
data and elasticity with respect to the use of computing, storage, and network bandwidth.
• This doesn’t force specific binding of data to be persistently stored in particular physical locations.
• NoSQL databases also provide for integrated data caching that helps reduce data access latency and speed
performance.
• The loosening of the relational structure allows different models to be adapted to specific types of
analyses.
• The models themselves do not necessarily impose any validity rules; this introduces risks associated with
ungoverned data management activities such as inadvertent inconsistent data replication, reinterpretation
of semantics, and currency and timeliness issues.
3. KEYVALUE STORES:
• A key value store is a schema-less model in which values (or sets of values, or even more complex entity
objects) are associated with distinct character strings called keys.
• Programmers may see similarity with the data structure known as a hash table. Other alternative NoSQL
data stores are variations on the keyvalue theme, which lends a degree of credibility to the model.
• consider the data subset represented in Table 1. The key is the name of the automobile make, while the
value is a list of names of models associated with that automobile make.

Table 1 Example Data Represented in a Key-Value Store:

• the key value store does not impose any constraints about data typing or data structure—the value
associated with the key is the value, and it is up to the consuming business applications to assert
expectations about the data values and their semantics and interpretation. This demonstrates the schema-
less property of the model.
• The core operations performed on a keyvalue store include the following (a small R sketch follows the list):
o Get(key), which returns the value associated with the provided key.
o Put(key, value), which associates the value with the key.
o Multi-get(key1, key2,.., keyN), which returns the list of values associated with the
list of keys.
o Delete(key), which removes the entry for the key from the data store.
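• A minimal sketch of these operations in R (the helper names are illustrative, not the API of any particular NoSQL product). An R environment behaves like a hash table, so it stands in for the key-value store:

kv <- new.env(hash = TRUE)                       # the "store"
put <- function(store, key, value) {
  assign(key, value, envir = store)
}
get_value <- function(store, key) {              # named get_value to avoid masking base get()
  if (exists(key, envir = store)) get(key, envir = store) else NULL
}
multi_get <- function(store, ...) {
  lapply(c(...), function(k) get_value(store, k))
}
delete <- function(store, key) {
  if (exists(key, envir = store)) rm(list = key, envir = store)
}

# Hypothetical data in the spirit of Table 1: make -> list of model names
put(kv, "MakeA", c("Model1", "Model2"))
put(kv, "MakeB", c("Model3"))
get_value(kv, "MakeA")            # "Model1" "Model2"
multi_get(kv, "MakeA", "MakeB")   # both value lists
delete(kv, "MakeB")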
• One critical characteristic of a keyvalue store is uniqueness of the key.
• if you want to associate multiple values with a single key, you need to consider the representations of the
objects and how they are associated with the key. For example, you may want to associate a list of
attributes with a single key, which may suggest that the value stored with the key is yet another keyvalue
store object itself.
• Key value stores are essentially very long, and presumably thin tables.
• The table’s rows can be sorted by the key value to simplify finding the key during a query. Alternatively,
the keys can be hashed using a hash function that maps the key to a particular location (sometimes called
a “bucket”) in the table.
• Additional supporting data structures and algorithms (such as bit vectors and bloom filters) can be used to
even determine whether the key exists in the data set at all.
• The representation can grow indefinitely, which makes it good for storing large amounts of data that can
be accessed relatively quickly, as well as environments requiring incremental appends of data.
• Examples include capturing system transaction logs, managing profile data about individuals, or
maintaining access counts for millions of unique web page URLs.
• The simplicity of the representation allows massive amounts of indexed data values to be appended to the
same keyvalue table, which can then be sharded, or distributed across the storage nodes.
• Under the right conditions, the table is distributed in a way that is aligned with the way the keys are
organized, so that the hashing function that is used to determine where any specific key exists in the table
can also be used to determine which node holds that key’s bucket.
Drawbacks of key value pairs:
o the model will not inherently provide any kind of traditional database capabilities
(such as atomicity of transactions, or consistency when multiple transactions are
executed simultaneously).
o as the model grows, maintaining unique values as keys may become more difficult,
requiring the introduction of some complexity in generating character strings that
will remain unique among a myriad of keys.
4. DOCUMENT STORES:
• A document store is similar to a keyvalue store in that stored objects are associated with character string keys.
• The difference is that the values being stored, which are referred to as “documents,” provide some
structure and encoding of the managed data.

• There are different common encodings, including XML (Extensible Markup Language), JSON (JavaScript
Object Notation), BSON (which is a binary encoding of JSON objects), or other means of
serializing data.

Fig.1 Example of document store.


• In Fig.1 we have some examples of documents stored in association with the names of specific retail
locations. Note that while the three examples all represent locations, the representative models differ.
• The document representation embeds the model so that the meanings of the document values can be
inferred by the application.
• One of the differences between a key value store and a document store is that while the former requires
the use of a key to retrieve data, the latter often provides a means (either through a programming API or
using a query language) for querying the data based on the contents.
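• A small sketch of that difference in R (jsonlite is a real CRAN package; the documents and field names are invented for illustration). Each "document" carries its own structure, and we can query on the contents rather than only by key:

library(jsonlite)
docs <- list(
  store1 = fromJSON('{"city":"Chennai","zip":"600001","phone":"044-1234"}'),
  store2 = fromJSON('{"city":"Chennai","area":"T. Nagar"}'),          # different fields
  store3 = fromJSON('{"location":{"city":"Madurai"},"open":true}')    # nested model
)
# Content-based query: documents that mention the city "Chennai", wherever it appears
Filter(function(d) identical(d$city, "Chennai") ||
                   identical(d$location$city, "Chennai"), docs)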
5. TABULAR STORES:
• Tabular, or table-based stores are largely descended from Google’s original Bigtable design to manage
structured data.
• The HBase model is an example of a Hadoop-related NoSQL data management system that evolved from
Bigtable.
• The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table that is indexed by
a row key (that is used in a fashion that is similar to the key value and document stores), a column key
that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the
time at which the row’s column value was stored.
• As an example, various attributes of a web page can be associated with the web page’s URL: the HTML
content of the page, URLs of other web pages that link to this web page, and the author of the content.
Columns in a Bigtable model are grouped together as “families,” and the timestamps enable management
of multiple versions of an object. The timestamp can be used to maintain history—each time the content
changes, new column affiliations can be created with the timestamp of when the content was downloaded.
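• An illustrative sketch in R (not the Bigtable or HBase API) of this three-dimensional map, using the web-page example above; the URL, column names, and values are invented:

cells <- data.frame(
  row_key   = rep("com.example/index", 3),
  column    = c("contents:html", "anchor:news.example", "contents:html"),
  timestamp = as.POSIXct(c("2020-01-01", "2020-01-01", "2020-02-01")),
  value     = c("<html>v1</html>", "Example link", "<html>v2</html>"),
  stringsAsFactors = FALSE
)
# The current version of a cell is the entry with the latest timestamp for a given
# (row key, column key) pair; older entries remain available as history.
latest <- function(df, rk, col) {
  hits <- df[df$row_key == rk & df$column == col, ]
  hits$value[which.max(hits$timestamp)]
}
latest(cells, "com.example/index", "contents:html")   # "<html>v2</html>"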
6. OBJECT DATA STORES:
• Object data stores and object databases seem to bridge the worlds of schema-less data management and
the traditional relational models.
• Approaches to object databases can be similar to document stores except that document stores
explicitly serialize the object so the data values are stored as strings, while object databases maintain the
object structures as they are bound to object-oriented programming languages such as C++, Objective-C,
Java, and Smalltalk.
• Object database management systems are more likely to provide traditional ACID (atomicity, consistency,
isolation, and durability) compliance—characteristics that are bound to database reliability. Object
databases are not relational databases and are not queried using SQL.
7. GRAPH DATABASES:
• Graph databases provide a model of representing individual entities and numerous kinds of relationships
that connect those entities.

• It employs the graph abstraction for representing connectivity, consisting of a collection of vertices
(which are also referred to as nodes or points) that represent the modeled entities, connected by edges
(which are also referred to as links, connections, or relationships) that capture the way that two entities
are related. Graph analytics performed on graph data stores are somewhat different from the more
frequently used querying and reporting. (Refer Unit 4.)
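• A brief sketch using R’s igraph package (a real CRAN library; the entities and relationships are invented) of storing and querying such a graph:

library(igraph)
# Vertices are entities; edges capture how two entities are related.
g <- graph_from_literal(Alice - Bob, Bob - Carol, Alice - Carol, Carol - Dave)
degree(g)                                              # connectivity of each vertex
shortest_paths(g, from = "Alice", to = "Dave")$vpath   # a simple traversal query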
8. HIVE:
• One of the often-noted issues with MapReduce is that although it provides a methodology for developing
and executing applications that use massive amounts of data, it is not more than that. And while the data
can be managed within files using HDFS, many business applications expect representations of data in
structured database tables.
• That was the motivation for the development of Hive, which is a “data warehouse system for Hadoop that
facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop
compatible file systems.”
• Hive is specifically engineered for data warehouse querying and reporting and is not intended for use
within transaction processing systems that require real-time query execution or transaction semantics for
consistency at the row level.
• Hive is layered on top of the file system and execution framework for Hadoop and enables applications
and users to organize data in a structured data warehouse and therefore query the data using a query
language called HiveQL that is similar to SQL (the standard Structured Query Language used for most
modern relational database management systems).
• The Hive system provides tools for extracting/ transforming/loading data (ETL) into a variety of different
data formats. And because the data warehouse system is built on top of Hadoop, it enables native access
to the MapReduce model, allowing programmers to develop custom Map and Reduce functions that can
be directly integrated into HiveQL queries. Hive provides scalability and extensibility for batch-style
queries for reporting over large datasets that are typically being expanded while relying on the
fault-tolerant aspects of the underlying Hadoop execution model.
9. SHARDING:
• Sharding is a method of splitting and storing a single logical dataset in multiple databases. By distributing
the data among multiple machines, a cluster of database systems can store a larger dataset and handle
additional requests. Sharding is necessary if a dataset is too large to be stored in a single database.
Moreover, many sharding strategies allow additional machines to be added. Sharding allows a database
cluster to scale along with its data and traffic growth.
• Sharding is also referred to as horizontal partitioning. The distinction of horizontal vs. vertical comes
from the traditional tabular view of a database. A database can be split vertically — storing different
tables & columns in a separate database, or horizontally — storing rows of the same table in multiple
database nodes (Fig.2).

Fig.2 An illustrated example of vertical and horizontal partitioning



• Vertical partitioning is very domain specific. You draw a logical split within your application data,
storing them in different databases. It is almost always implemented at the application level — a piece of
code routing reads and writes to a designated database.
• In contrast, sharding splits a homogeneous type of data into multiple databases. You can see that such an
algorithm is easily generalizable. That’s why sharding can be implemented at either the application
or database level. In many databases, sharding is a first-class concept, and the database knows how to
store and retrieve data within a cluster. Many modern databases are natively sharded. Cassandra,
HBase, HDFS, and MongoDB are popular distributed databases. Notable examples of non-sharded
modern databases are SQLite, Redis (spec in progress), Memcached, and ZooKeeper.
• Shard or Partition Key is a portion of the primary key which determines how data should be distributed. A
partition key allows you to retrieve and modify data efficiently by routing operations to the correct
database. Entries with the same partition key are stored in the same node. A logical shard is a collection
of data sharing the same partition key. A database node, sometimes referred to as a physical shard, contains
multiple logical shards.
Categories of Sharding:
Case 1: Algorithmic Sharding:
• Algorithmically sharded databases use a sharding function (partition_key) -> database_id to locate data.
• Reads are performed within a single database as long as a partition key is given.
• Algorithmic sharding distributes data by its sharding function only.(Fig.3)
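• A minimal sketch of such a sharding function in R (illustrative only; digest is a real CRAN hashing package, and the key format is invented):

library(digest)
shard_for <- function(partition_key, n_shards = 4) {
  # Hash the key, take the first 7 hex digits, and reduce modulo the shard count.
  h <- strtoi(substr(digest(partition_key, algo = "md5"), 1, 7), base = 16L)
  h %% n_shards                    # -> database_id in 0..(n_shards - 1)
}
shard_for("user:42")     # the same key always routes to the same database
shard_for("user:1001")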

Fig. 3 An algorithmically sharded database, with a simple sharding function


Case 2— Dynamic Sharding:
• In dynamic sharding, an external locator service determines the location of entries. It can be
implemented in multiple ways. If the cardinality of partition keys is relatively low, the locator can be
assigned per individual key. Otherwise, a single locator can address a range of partition keys.(Fig.4)
• To read and write data, clients need to consult the locator service first. Operation by primary key becomes
fairly trivial. Other queries also become efficient depending on the structure of locators. In the example of
range-based partition keys, range queries are efficient because the locator service reduces the number of
candidate databases. Queries without a partition key will need to search all databases.
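• A sketch of a range-based locator in R (the key ranges and database names are invented; a real locator would be an external service, not an in-process table):

locator_tbl <- data.frame(lower = c(0, 1000, 5000),
                          upper = c(999, 4999, 9999),
                          db    = c("database_1", "database_2", "database_3"),
                          stringsAsFactors = FALSE)
locate <- function(partition_key) {
  locator_tbl$db[partition_key >= locator_tbl$lower & partition_key <= locator_tbl$upper]
}
locate(1234)   # "database_2": the client consults the locator before reading or writing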

Fig.4 A dynamic sharding scheme using range based partitioning.


Case 3 – Entity Groups:
• Store related entities in the same partition to provide additional capabilities within a single partition.
Specifically:
o Queries within a single physical shard are efficient.
o Stronger consistency semantics can be achieved within a shard.
• In this case, data needs to be stored in multiple partitions to support efficient reads. For example, chat
messages between two users may be stored twice — partitioned by both senders and recipients. All
messages sent or received by a given user are stored in a single partition. In general, many-to-many
relationships between partitions may need to be duplicated. (Fig.5)
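• A sketch in R of the chat-message example (illustrative only): each message is written to both the sender’s and the recipient’s partition, so either user’s history can be served from a single shard:

partitions <- list()    # one element per logical shard, keyed by user
send_message <- function(parts, from, to, text) {
  msg <- list(from = from, to = to, text = text)
  parts[[from]] <- append(parts[[from]], list(msg))   # sender's partition
  parts[[to]]   <- append(parts[[to]], list(msg))     # duplicate in recipient's partition
  parts
}
partitions <- send_message(partitions, "userA", "userB", "hello")
length(partitions[["userB"]])   # all of userB's messages live in one partition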

Fig.5 Entity Groups partitions all related tables together


Case 4 –Hierarchical keys & Column-oriented databases:
• Column-oriented databases are an extension of key-value stores. They add the expressiveness of entity
groups with a hierarchical primary key. A primary key is composed of a pair (row key, column key).
Entries with the same partition key are stored together. Range queries on columns limited to a single
partition are efficient. That’s why a column key is referred to as a range key in DynamoDB. (Fig.6)

Fig.6 Column-oriented databases partition its data by row keys.



Disadvantages:
• Increased complexity of SQL - Increased bugs because the developers have to write more complicated
SQL to handle sharding logic.
• Sharding introduces complexity - The sharding software that partitions, balances, coordinates, and
ensures integrity can fail.
• Single point of failure - Corruption of one shard due to network/hardware/systems problems causes
failure of the entire table.
• Failover servers more complex - Failover servers must themselves have copies of the fleets of database
shards.
• Backups more complex - Database backups of the individual shards must be coordinated with the
backups of the other shards.
• Operational complexity added - Adding/removing indexes, adding/deleting columns, and modifying the
schema become much more difficult.
10. HBASE:
• HBase is another example of a nonrelational data management environment that distributes massive
datasets over the underlying Hadoop framework. HBase is derived from Google’s BigTable and is a
column-oriented data layout that, when layered on top of Hadoop, provides a fault-tolerant method for
storing and manipulating large data tables.
• Data stored in a columnar layout is amenable to compression, which increases the amount of data that can
be represented while decreasing the actual storage footprint. In addition, HBase supports in-memory
execution.
• HBase is not a relational database, and it does not support SQL queries. There are some basic operations
for HBase: Get (which accesses a specific row in the table), Put (which stores or updates a row in the table),
Scan (which iterates over a collection of rows in the table), and Delete (which removes a row from the
table). Because it can be used to organize datasets, coupled with the performance provided by the
columnar orientation, HBase is a reasonable alternative as a persistent storage paradigm when
running MapReduce applications.
• HBase architecture has 3 main components: HMaster, Region Server, and Zookeeper (Fig.7).

Fig.7 – Architecture of HBase


• HMaster:
The implementation of the Master Server in HBase is HMaster. It is a process that assigns regions
to Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances
present in the cluster. In a distributed environment, the Master runs several background threads. HMaster has
many features like controlling load balancing, failover, etc.

• Region Server:
HBase tables are divided horizontally by row key range into Regions. Regions are the basic building
elements of an HBase cluster; they hold the distributed portions of tables and are comprised of column
families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster. The Region Server
is responsible for handling, managing, and executing read and write HBase operations on its set of
regions. The default size of a region is 256 MB.
• Zookeeper: –
It is like a coordinator in HBase. It provides services like maintaining configuration information, naming,
providing distributed synchronization, server failure notification etc. Clients communicate with region
servers via zookeeper.
Advantages of HBase –
• Can store large data sets
• Database can be shared
• Cost-effective from gigabytes to petabytes
• High availability through failover and replication
Disadvantages of HBase –
• No support SQL structure
• No transaction support
• Sorted only on key
• Memory issues on the cluster
Comparison between HBase and HDFS:
• HBase provides low latency access while HDFS provide high latency operations.
• HBase supports random read and write while HDFS supports Write once Read Many times.
• HBase is accessed through shell commands, Java API, REST, Avro or Thrift API while HDFS is
accessed through MapReduce jobs.
Note – HBase is extensively used for online analytical operations; for example, in banking applications it
can be used for real-time data updates in ATM machines.
11. ANALYZING BIG DATA WITH TWITTER :
• This section describes the overall framework for capturing and analyzing tweets streamed in real time. As
a first part, real-time tweets are collected from Twitter. These tweets are retrieved from Twitter using
Twitter streaming API as shown in Fig.8. This semi-structured twitter data is given as input to the PIG
module as well as the HIVE module which will convert nested JSON data into a structured form that is
suitable for analysis.

Fig.8. System model for capturing and analyzing the tweets.


a. Finding recent trends:
• A trend is a subject of many posts on social media for a short duration of time. Finding recent trends means
processing the huge amount of data collected over the needed period of time. Algorithm 1 outlines how to
find popular hashtags.
i. Finding popular hashtags using Apache Pig:
• To find popular hashtags of given tweets, the tweets are loaded into Apache Pig module, wherein these
tweets are passed through a series of Pig scripts for finding popular hashtags. Following are the steps to
determine the popular hashtags in tweets:
a. Loading the Twitter data on Pig:
• This streamed Twitter data is in JSON format and consists of map data types, that is, data with key and
value pairs. For analysis, the tweets stored in HDFS are loaded into the Pig module. To load the Twitter data,
we used the elephant-bird JsonLoader jar files, which support loading tweets in JSON format.
Algorithm 1: Finding popular hashtags
Data: dataset := corpus of tweets
Result: popular hashtag
Load tweets from HDFS into the Hadoop ecosystem module
for each tweet in module
    feature = extract(id, hashtag text)
end
for each feature
    count_id = Count(id ∈ hashtag text)
end
popular_hashtag = max(count_id)
(b) Feature extraction:
• This step is called preprocessing where Twitter messages containing many fields such as id, text, entities,
language, time zone, etc. are looked at. To find famous hashtags, we have extracted tweet id and entities
fields where the entity field has a member hashtag. This member is used for further analysis along with
tweet id.

(c) Extract hashtags:


• Each hashtag object contains two fields: they are text and indices where text field contains the hashtag. So
to find famous hashtags, we have extracted text field. The output of this phase is hashtag followed by the
tweet id.
For example, GST; 910449715870818304
(d) Counting hashtags:
• After performing all the above steps, we get hashtags and tweet ids. To find popular hashtags, we first
group the relation with respect to hashtag; next we count the number of times each hashtag appeared.
Hashtags that have appeared the highest number of times are categorized as famous hashtags or recent
trends.
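• The same counting idea, expressed in R rather than Pig (the data below are invented; the real pipeline operates on the extracted id/hashtag pairs described above):

hashtags <- data.frame(id  = c(1, 2, 3, 4),
                       tag = c("GST", "GST", "Elections", "GST"),
                       stringsAsFactors = FALSE)
counts <- sort(table(hashtags$tag), decreasing = TRUE)
counts             # how many times each hashtag appeared
names(counts)[1]   # the most popular hashtag, i.e., the recent trend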
ii. Finding recent trends using Apache Hive:
• Recent trends from real-time tweets can also be found using Hive queries. Since tweets collected from
twitter are in JSON format, we have to use JSON input format to load the tweets into Hive. We have used
Cloudera Hive JsonSerDe for this purpose.
• This jar file has to be present in Hive to process the data. This jar file can be added using following
command.
add jar <path to jar file>;
Following steps are performed to find the recent trend:
(a) Loading and Feature extraction:
• The tweets collected from Twitter are stored in HDFS. In order to work with data stored in HDFS using
HiveQL, first an external table is created, which creates the table definition in the Hive metastore.
Fig.9 shows the query used to create the Twitter table. This query not only creates a schema to store the
tweets, but also extracts required fields like id and entities.

Fig.9 Query to create a table in Hive.


(b) Extracting Hashtags:
• In order to extract actual hashtags from entities, we created another table which contains the id and the list of
hashtags. Since multiple hashtags are present in one tweet, we used a UDTF (User Defined Table-generating
Function) to extract each hashtag onto a new row. The outcome of this phase is id and hashtag.
(c) Counting hashtag:
• After performing all the above steps, we have the id and hashtag text. A Hive query is written to count the
hashtags.
iii. Sentiment analysis using Apache Pig:
• Each word in a tweet is first assigned a rating from a sentiment dictionary. The “group by” operation is then
performed on id to group all the words belonging to one tweet, after which an average operation is performed
on the ratings given to each word in the tweet. Based on the average rating, tweets are classified into positive
and negative.
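• The same grouping-and-averaging step, sketched in R rather than Pig (the word ratings are invented for illustration):

words <- data.frame(id     = c(1, 1, 1, 2, 2),
                    rating = c(3, 2, 1, -2, -3))     # one row per (tweet id, word rating)
avg <- aggregate(rating ~ id, data = words, FUN = mean)
avg$sentiment <- ifelse(avg$rating >= 0, "positive", "negative")
avg   # tweet 1 is classified positive, tweet 2 negative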
12. BIG DATA FOR E-COMMERCE:
• Big data is an extensive collection of both offline and online data. It focuses on being a productive source
of analysis and sustained discovery for evaluating past trends and performance development with greater
insights to gain higher customer satisfaction. Incorporating big data in the e-commerce industry will
allow businesses to gain access to significantly larger amounts of data in order to convert the growth into
revenue, streamline operation processes, and gain more customers.
• Here’s how big data solutions can help the e-commerce industry to flourish:
(a) Customer Shopping Experience
• With the help of big data, e-commerce businesses can identify meaningful customer behaviour patterns,
purchase histories, browsing interests, etc. Therefore, online businesses have the opportunity to re-target
the buyers by displaying or recommending products that they are interested in. Big Data can help drive
more customers to online stores thereby positively impacting the overall ROI .
(b) Customer Satisfaction
• Poor quality shopping services impact the shopping experience of the customers as well as the reputation
of the e-commerce service providers. Owing to the advent of big data, the businesses have all the vital
information that help in providing better solutions to shoppers needs. This further allows the stores to
understand their customers better and built a lasting relationship with them.
(c) Real-Time Analytics
• Big data allows real-time analytics, also referred to as “real-time business intelligence” to gain valuable
customer insights. Data on buyer’s demographics and their journey to a particular e-commerce store
provides a lot of insights that will help in personalizing the shopper’s experience and generate more
revenue. Additionally, by analysing the traffic amount and online transaction, these insights also help in
making an effective business strategy.
Big Data Solutions:
i. Recommendation Engine:
• It is an absolute must-have for an e-commerce company. Online store owners can hardly find a better tool
for cross-selling and up-selling. For this, it is necessary to tune up the analytical system so that it could
analyze all the actions of a particular customer: product pages they visited with the time spent there,
products they liked, added into their carts and finally bought/abandoned etc.
• The system can also compare the behavior pattern of a certain visitor to those of the other visitors.
• The result is splendid: the analytical system works autonomously, analyzes, and recommends the products
that a visitor may like.
• More than that, the system constantly learns on the analyzed patterns and becomes even more precise
over time.
ii. Personalized Shopping Experience:
• Creating personalized shopping experience is a key to successful e-commerce marketing. To do it, a
company should be able to react to their customer’s actions properly and in real time.
• This becomes possible with big data tools that analyze all customer activities in an e-shop and create a
picture of customer behavior patterns.
iii. Everything in the cart is tracked:
• A customer has put a gown, a pair of shoes and a clutch to her shopping cart, but decided to abandon it
for some reason.
• The analytical system knows that this customer is valuable – she shops frequently and buys a lot.
• By reacting immediately and offering a coupon for a 5% discount on the shoes, the company may
encourage the customer to finish the purchase.
• A customer bought a winter coat two weeks ago and visited some product pages with winter gloves,
scarfs and hats at that time.
• It is likely that the customer will be happy to receive a personal email that advertises a new collection of
winter accessories and/or announces a 10% discount on them, which may encourage him to choose your offer
among multiple similar options.

iv. Voice of the customer:


• Big data can help optimize the product portfolio of an e-commerce retailer. To do this, add sentiment
analysis to the standard approach of analyzing products and brands by their sales value, volume,
revenues, number of orders etc.
• Sentiment analysis is the evaluation of comments that the customers left about different products and
brands.
• The analytical system automatically identifies whether each comment is positive or negative.
v. Dynamic Pricing:
• Big data can help e-commerce retailers keep in line with their pricing strategy. The concept of dynamic
pricing implies setting price rules, monitoring competitors and adjusting prices in real time.
vi. Demand Forecasting:
• E-commerce retailers can significantly improve demand forecasting by creating customer profiles and
looking at their customer’s behavior – when they prefer to shop, how many items they usually purchase,
which products they buy etc.
• It has a positive effect on retailer’s internal processes such as avoiding out-of-stocks, optimizing supply
chain and warehouse.
13. BIG DATA FOR BLOGS:
Introduction:
• Social media is defined as web-based and mobile-based Internet applications that allow the creation,
access and exchange of user-generated content that is ubiquitously accessible.
• Besides social networking media (e.g., Twitter and Facebook), for convenience, we will also use the term
‘social media’ to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically
yielding unstructured text and accessible through the web.
• Social media is especially important for research into computational social science that investigates
questions using quantitative techniques (e.g., computational statistics, machine learning and complexity)
and so-called big data for data mining and simulation modeling.
• This has led to numerous data services, tools and analytics platforms. However, this easy availability of
social media data for academic research may change significantly due to commercial pressures.
Terminology:
• We start with definitions of some of the key techniques related to analyzing unstructured textual data:
• Natural language processing—(NLP) is a field of computer science, artificial intelligence and
linguistics concerned with the interactions between computers and human (natural) languages.
Specifically, it is the process of a computer extracting meaningful information from natural language
input and/or producing natural language output.
• News analytics—the measurement of the various qualitative and quantitative attributes of textual
(unstructured data) news stories. Some of these attributes are: sentiment, relevance and novelty.
• Opinion mining—opinion mining (sentiment mining, opinion/sentiment extraction) is the area of
research that attempts to make automatic systems to determine human opinion from text written in natural
language.
• Scraping—collecting online data from social media and other Web sites in the form of unstructured text
and also known as site scraping, web harvesting and web data extraction.
• Sentiment analysis—sentiment analysis refers to the application of natural language processing,
computational linguistics and text analytics to identify and extract subjective information in source
materials.
• Text analytics—involves information retrieval (IR), lexical analysis to study word frequency
distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques
including link and association analysis, visualization and predictive analytics.
Research challenges:

• Scraping—although social media data is accessible through APIs, due to the commercial value of the
data, most of the major sources such as Facebook and Google are making it increasingly difficult for
academics to obtain comprehensive access to their ‘raw’ data;
• Data cleansing—cleaning unstructured textual data (e.g., normalizing text), especially high-frequency
streamed real-time data, still presents numerous problems and research challenges.
• Holistic data sources—researchers are increasingly bringing together and combining novel data sources:
social media data, real-time market & customer data and geospatial data for analysis.
• Data protection—once you have created a ‘big data’ resource, the data needs to be secured, ownership
and IP issues resolved, and users provided with different levels of access;
• Data analytics—sophisticated analysis of social media data for opinion mining (e.g., sentiment analysis)
still raises a myriad of challenges due to foreign languages, foreign words, slang, spelling errors and the
natural evolving of language.
• Analytics dashboards—many social media platforms require users to write APIs to access feeds or
program analytics models in a programming language, such as Java. While reasonable for computer
scientists, these skills are typically beyond most (social science) researchers. Non-programming
interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data, for example,
configuring APIs, merging social media feeds, combining holistic sources and developing analytical
models.
• Data visualization—the visual representation of data, in which information that has been abstracted is
presented in some schematic form with the goal of communicating it clearly and effectively through graphical
means. Given the magnitude of the data involved, visualization is becoming increasingly important.
14. REVIEW OF BASIC DATA ANALYTIC METHODS USING R:
a. Introduction to R:
• R is a programming language and software framework for statistical analysis and graphics.
• The following R code illustrates a typical analytical situation in which a dataset is imported, the contents
of the dataset are examined, and some model building tasks are executed.

• In the scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the
form of a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file. This
dataset is stored in the R variable sales using the assignment operator <-.
• Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded
properly as well as to become familiar with the data. In the example, the head() function, by default,
displays the first six records of sales.

• The summary() function provides some descriptive statistics, such as the mean and median, for each data
column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are
provided. Because the gender column contains two possible characters, an “F” (female) or “M” (male),
the summary() function provides the count of each character’s occurrence.
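• A sketch of the code just described (the file path is the one given later in these notes; the exact column layout of the CSV is assumed):

sales <- read.csv("c:/data/yearly_sales.csv")   # import the 10,000-customer dataset
head(sales)      # display the first six records
summary(sales)   # descriptive statistics for each column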

• the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the
annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The
resulting plot is shown in Fig.10.
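• A sketch of that call (the plot title is illustrative):

plot(sales$num_of_orders, sales$sales_total,
     main = "Number of Orders vs. Total Sales")   # produces the scatterplot in Fig.10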

Fig.10 Graphically examining the data


• Each point corresponds to the number of orders and the total sales for each customer. The plot indicates
that the annual sales are proportional to the number of orders placed. Although the observed relationship
between these two variables is not purely linear, the analyst decided to apply linear regression using the
lm() function as a first step in the modeling process.

• The resulting intercept and slope values are –154.1 and 166.2, respectively, for the fitted linear equation.
However, results stores considerably more information, which can be examined with the summary()
function. Details on the contents of results are examined by applying the attributes() function.
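• A sketch of the model-building step described above:

results <- lm(sales$sales_total ~ sales$num_of_orders)   # regress sales_total on num_of_orders
results               # prints the fitted intercept and slope
summary(results)      # fuller details: residuals, coefficients, R-squared, ...
attributes(results)   # names of everything stored inside the lm object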

• The summary() function is an example of a generic function. A generic function is a group of functions
sharing the same name but behaving differently depending on the number and the type of arguments they
receive.
• In the final portion of the example, the following R code uses the generic function hist() to generate a
histogram (Fig.11) of the residuals stored in results.
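• A sketch of that code (the breaks value and title are illustrative):

hist(results$residuals, breaks = 800,
     main = "Histogram of Residuals")   # produces Fig.11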

Fig.11 Evidence of large residuals


i. R Graphical User Interfaces:
• R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux or the
interactive versions of scripting languages such as Python. Popular GUIs include the R commander,
Rattle, and RStudio.
• R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata
file using the save.image() function. An existing .Rdata file can be loaded using the load()
function. Tools such as RStudio prompt the user for whether the developer wants to save the workspace
connections prior to exiting the GUI.
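• For example (the file name is hypothetical):

save.image("unit5_workspace.RData")   # saves all objects in the current workspace
load("unit5_workspace.RData")         # restores them in a later session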
ii. Data Import and Export:
• The dataset was imported into R using the read.csv() function as in the following code.
sales <- read.csv("c:/data/yearly_sales.csv")
• To simplify the import of multiple files with long path names, the setwd() function can be used to set the
working directory for the subsequent import and export operations, as shown in the following R code.
setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")
• Other import functions include read.table() and read.delim(), which are intended to import other common
file types such as TXT. These functions can also be used to import the yearly_sales.csv file, as the
following code illustrates.
sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")
sales_delim <- read.delim("yearly_sales.csv", sep=",")
• Sometimes it is necessary to read data from a database management system (DBMS). R packages such as
DBI and RODBC are available for this purpose. These packages provide database interfaces for
communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal
Greenplum. The following R code demonstrates how to install the RODBC package with the
install.packages() function. The library() function loads the package into the R workspace. Finally, a connector
(conn) is initialized for connecting to a Pivotal Greenplum database training2 via open database
connectivity (ODBC) with user user. The training2 database must be defined either in the /etc/odbc.ini
configuration file or using the Administrative Tools under the Windows Control Panel.
install.packages("RODBC")
library(RODBC)
conn <- odbcConnect("training2", uid="user", pwd="password")
• The connector needs to be present to submit a SQL query to an ODBC database by using the sqlQuery()
function from the RODBC package. The following R code retrieves specific columns from the housing
table in which household income (hinc) is greater than $1,000,000.
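• A sketch of such a query using sqlQuery() from RODBC (the selected column names are assumed; only the housing table and the hinc filter come from the text above):

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                from housing
                                where hinc > 1000000")
head(housing_data)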

• Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying
the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG file,
adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating
standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available in R to save
plots in the desired format.
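• A sketch of the steps just described (the file name and plotted variable are illustrative):

jpeg(filename = "c:/data/sales_hist.jpeg")   # open a JPEG graphics device
hist(sales$num_of_orders)                    # add the histogram plot to the file
dev.off()                                    # close the device, completing the file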

iii. Attribute and Data Types:


• Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR). Nominal and
ordinal attributes are considered categorical attributes, whereas interval and ratio attributes are considered
numeric attributes. Table 2 distinguishes these four attribute types and shows the operations they support.
Table 2 : NOIR Attribute Types:

• Data of
one attribute type may be converted to another. For example, the quality of diamonds {Fair, Good, Very
Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a
defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as
{Infant, Adolescent, Adult, Senior}.
Numeric, Character, and Logical Data Types:
• R supports the use of numeric, character, and logical (Boolean) values. Examples of such variables are
given in the following R code.

• R provides several functions, such as class() and typeof(), to examine the characteristics of a given
variable. The class() function represents the abstract class of an object. The typeof() function determines
the way an object is stored in memory.
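• For example (the variable names and values are illustrative):

i <- 1                 # numeric
sport <- "football"    # character
flag <- TRUE           # logical
class(i)       # "numeric"
typeof(i)      # "double"  (numerics are stored as double precision by default)
class(sport)   # "character"
typeof(flag)   # "logical"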

Vectors:

• Vectors are a basic building block for data in R. Simple R variables are actually vectors. A vector can
only consist of values of the same class. Tests for vectors can be conducted using the is.vector()
function.
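• For example:

v <- c(2, 4, 6)             # a numeric vector
is.vector(v)                # TRUE
u <- c("red", "yellow")     # a vector may only hold values of one class
is.vector(u)                # TRUE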

Arrays and Matrices:


• The array() function can be used to restructure a vector as an array. A two-dimensional array is known as
a matrix. R provides the standard matrix operations such as addition, subtraction, and multiplication, as
well as the transpose function t() and the inverse matrix function matrix.inverse() included in the
matrixcalc package.
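• A short sketch (the values are illustrative):

m <- matrix(1:12, nrow = 3, ncol = 4)   # restructure a vector as a 3 x 4 matrix
t(m)                                    # transpose
sq <- matrix(c(2, 0, 0, 2), nrow = 2)
solve(sq)                               # inverse of a square matrix in base R
# matrixcalc::matrix.inverse(sq) gives the same result if matrixcalc is installed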
Data Frames:
• data frames provide a structure for storing and accessing several variables of possibly different data types.
Because of the flexibility to handle many data types, data frames are the preferred input format for many
of the modeling functions available in R.
• data frames are lists of variables of the same length. A subset of the data frame can be retrieved through
subsetting operators.
Lists:
• A list is a collection of objects that can be of various types, including other lists.
Contingency Tables:
• In R, table refers to a class of objects used to store the observed counts across the factors for a given
dataset. Such a table is commonly referred to as a contingency table and is the basis for performing a
statistical test on the independence of the factors used to build the table.
iv. Descriptive Statistics:
• The following code provides some common R functions that include descriptive statistics. In parentheses,
the comments describe the functions.
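• A sketch of such code (the column names are those used earlier for the sales dataset; the choice of columns is illustrative):

x <- sales$sales_total
mean(x)      # arithmetic mean
median(x)    # median
range(x)     # minimum and maximum
sd(x)        # standard deviation
var(x)       # variance
IQR(x)       # third quartile minus first quartile
my_range <- function(v) diff(range(v))                            # max minus min
apply(sales[, c("sales_total", "num_of_orders")], 2, my_range)    # same function over several columns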

• The IQR() function provides the difference between the third and the first quartiles. The function apply()
is useful when the same function is to be applied to several variables in a data frame.
• R code defines a function, my_range(), to compute the difference between the maximum and minimum
values returned by the range() function.
b. Exploratory Data Analysis:
• summary() can help analysts easily get an idea of the magnitude and range of the data, but other aspects
such as linear relationships and distributions are more difficult to see from descriptive statistics.
• A useful way to detect patterns and anomalies in the data is through the exploratory data analysis with
visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from
the numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a
scatterplot (Fig.12), which easily depicts the relationship between two variables.

Fig.12 A scatterplot can easily show if x and y share a relation


i. Visualizing a Single Variable:
• R has many functions available to examine a single variable. Some of these functions are listed in Table
3.
Table 3 : Example Functions for Visualizing a Single Variable:

Dotchart and Barplot:


• Dotchart and barplot portray continuous values with labels from a discrete variable. A dotchart can be
created in R with the function dotchart(x, label=…), where x is a numeric vector and label is a vector of
categorical labels for x. A barplot can be created with the barplot(height) function, where height
represents a vector or matrix. Fig.13 shows (a) a dotchart and (b) a barplot based on the mtcars dataset,
which includes the fuel consumption and 10 aspects of automobile design and performance of 32
automobiles. This dataset comes with the standard R distribution.

Fig.13 (a) Dotchart on the miles per gallon of cars and (b) Barplot on the distribution of car
cylinder counts
• The plots in Fig.13 can be produced with the following R code.
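• A sketch consistent with Fig.13 (titles and label sizes are illustrative):

data(mtcars)
dotchart(mtcars$mpg, labels = row.names(mtcars), cex = 0.7,
         main = "Miles per Gallon (MPG) of Car Models", xlab = "MPG")
barplot(table(mtcars$cyl),
        main = "Distribution of Car Cylinder Counts", xlab = "Number of Cylinders")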

Histogram and Density Plot:


• Fig.14 (a) includes a histogram of household income. The histogram shows a clear concentration of low
household incomes on the left and the long tail of the higher incomes on the right.

Fig.14 (a) Histogram and (b) Density plot of household income



• Fig.14(b) shows a density plot of the logarithm of household income values, which emphasizes the
distribution.
• The code to generate the two plots in Fig.14 is provided next. The rug() function creates a one-
dimensional density plot on the bottom of the graph to emphasize the distribution of the observation.
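• A sketch consistent with Fig.14 (the household income data are not included in these notes, so illustrative log-normal values are generated):

income <- rlnorm(4000, meanlog = log(40000), sdlog = log(5))   # synthetic household incomes
hist(income, breaks = 500, xlab = "Income", main = "Histogram of Income")
plot(density(log10(income), adjust = 0.5),
     main = "Distribution of Income (log10 scale)")
rug(log10(income))   # one-dimensional density plot along the x-axis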

ii. Examining Multiple Variables:


• The scatterplot in Fig.15 portrays the relationship of two variables: x and y. The red line shown on the
graph is the fitted line from the linear regression.

Fig.15 Examining two variables with regression

• The R code to produce Fig.15 is as follows. The runif(75,0,10) generates 75 numbers between 0 and 10
with random deviates, and the numbers conform to the uniform distribution. The rnorm(75,0,20)
generates 75 numbers that conform to the normal distribution, with the mean equal to 0 and the standard
deviation equal to 20. The points() function is a generic function that draws a sequence of points at the
specified coordinates. Parameter type=“l” tells the function to draw a solid line. The col parameter sets
the color of the line, where 2 represents the red color and 4 represents the blue color.
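• A sketch consistent with that description (the data-generating formula is illustrative):

x <- sort(runif(75, 0, 10))                          # 75 uniform random values between 0 and 10
y <- 200 + x^3 - 10 * x^2 + x + rnorm(75, 0, 20)     # nonlinear trend plus normal noise
lr <- lm(y ~ x)                                      # simple linear regression
plot(x, y, xlab = "x", ylab = "y")
points(x, lr$fitted.values, type = "l", col = 2)     # fitted line drawn in red (col = 2)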

Box-and-Whisker Plot:
• Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete variable.
The box-and-whisker plot in Fig.16 visualizes mean household incomes as a function of region in the
United States.

Fig.16 A box-and-whisker plot of mean household income and geographical region


• The “box” of the box-and-whisker shows the range that contains the central 50% of the data, and the line
inside the box is the location of the median value. The lower and upper hinges of the boxes correspond to
the first and third quartiles of the data.
• The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge. The
lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. IQR is the inter-
quartile range.
• The points outside the whiskers can be considered possible outliers.
Scatterplot Matrix:
• A scatterplot matrix shows many scatterplots in a compact, side-by-side fashion. The scatterplot matrix,
therefore, can visually represent multiple attributes of a dataset to explore their relationships, magnify
differences, and disclose hidden patterns.
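• For example, a scatterplot matrix can be drawn with pairs() on the built-in iris dataset (chosen only for illustration):

pairs(iris[, 1:4], col = iris$Species,
      main = "Scatterplot Matrix of the iris Measurements")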

c. Statistical Methods for Evaluation:


• Visualization is useful for data exploration and presentation, but statistics is crucial because it may exist
throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the initial data
exploration and data preparation, model building, evaluation of the final models, and assessment of how
the new models improve the situation when deployed in the field.
i. Hypothesis Testing:
• When comparing populations, such as testing or evaluating the difference of the means from two samples
of data (Fig.17), a common technique to assess the difference or the significance of the difference is
hypothesis testing.

Fig.17 Distributions of two samples of data

Table 4 includes some examples of null and alternative hypotheses that should be answered during the
analytic lifecycle.

Table 4 Example Null Hypotheses and Alternative Hypotheses:

• Once a model is built over the training data, it needs to be evaluated over the testing data to see if the
proposed model predicts better than the existing model currently being used. The null hypothesis is that
the proposed model does not predict better than the existing model.
• The alternative hypothesis is that the proposed model indeed predicts better than the existing model. In
accuracy forecast, the null model could be that the sales of the next month are the same as the prior
month. The hypothesis test needs to evaluate if the proposed model provides a better prediction.
• Take a recommendation engine as an example. The null hypothesis could be that the new algorithm does
not produce better recommendations than the current algorithm being deployed. The alternative
hypothesis is that the new algorithm produces better recommendations than the old algorithm.
• A common hypothesis test is to compare the means of two populations. Two such hypothesis tests are
discussed in the next section.
ii. Difference of Means:

Fig.18 Overlap of the two distributions



Student’s t-test:
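The equation referenced here is not reproduced in these notes; in standard notation, with sample means \bar{X}_1 and \bar{X}_2, sample variances S_1^2 and S_2^2, and sample sizes n_1 and n_2, Student's t-test uses the pooled statistic

T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2},

with n_1 + n_2 - 2 degrees of freedom, under the assumption that the two populations have equal variance.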

Welch’s t-test:
When the equal population variance assumption is not justified in performing Student’s t-test for the
difference of means, Welch’s t-test can be used based on T expressed in Equation.

The degrees of freedom for Welch’s t-test is defined in the below Equation.
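The referenced equations are not reproduced in these notes; in standard notation (same symbols as for Student's t-test above):

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}},
\qquad
\mathrm{df} = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}.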

• A confidence interval is an interval estimate of a population parameter or characteristic based on sample
data. A confidence interval is used to indicate the uncertainty of a point estimate.
iii. Wilcoxon Rank-Sum Test:
• A t-test represents a parametric test in that it makes assumptions about the population distributions from
which the samples are drawn. If the populations cannot be assumed or transformed to follow a normal
distribution, a nonparametric test can be used. The Wilcoxon rank-sum test is a nonparametric hypothesis
test that checks whether two populations are identically distributed.
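• In R, both t-tests and the rank-sum test are available directly (the data below are illustrative):

x <- rnorm(30, mean = 100, sd = 5)
y <- rnorm(30, mean = 105, sd = 8)
t.test(x, y, var.equal = TRUE)    # Student's t-test (assumes equal variances)
t.test(x, y)                      # Welch's t-test (the default; unequal variances)
wilcox.test(x, y)                 # Wilcoxon rank-sum test (nonparametric)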

iv. Type I and Type II Errors:

Table 5 lists the four possible states of a hypothesis test, including the two types of errors.
Table 5 Type I and Type II Error:

v. ANOVA:
ANOVA is a generalization of the hypothesis testing of the difference of two population means. ANOVA
tests if any of the population means differ from the other population means. The null hypothesis of
ANOVA is that all the population means are equal. The alternative hypothesis is that at least one pair of
the population means is not equal. In other words,
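in standard notation (reconstructing the equation referenced here):

H_0: \mu_1 = \mu_2 = \cdots = \mu_n
\qquad \text{versus} \qquad
H_A: \mu_i \neq \mu_j \text{ for at least one pair } (i, j).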

• The F-test statistic in ANOVA can be thought of as a measure of how different the means are relative to
the variability within each group. The larger the observed F-test statistic, the greater the likelihood that
the differences between the means are due to something other than chance alone. The F-test statistic is
used to test the hypothesis that the observed effects are not due to chance—that is, if the means are
significantly different from one another.
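• A one-way ANOVA can be run in R with aov() (illustrative data: three offer groups, one of which has a shifted mean):

offers <- sample(c("offer1", "offer2", "nopromo"), 500, replace = TRUE)
purchase <- rnorm(500, mean = 80, sd = 30) + 20 * (offers == "offer1")
model <- aov(purchase ~ offers)
summary(model)   # reports the F statistic and its p-value for equality of the group means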

PART –A (2 Marks)
1. What is NoSQL?
NoSQL is an approach to database design that can accommodate a wide variety of data models,
including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL,"
is an alternative to traditional relational databases in which data is placed in tables and data schema is
carefully designed before the database is built. NoSQL databases are especially useful for working with
large sets of distributed data.
2. What are the advantages of NoSQL data systems?
NoSQL data systems hold out the promise of greater flexibility in database management while reducing
the dependence on more formal database administration. NoSQL databases have more relaxed
modeling constraints, which may benefit both the application developer and the end-user analysts when
their interactive analyses are not throttled by the need to cast each query in terms of a relational table-
based environment. NoSQL databases also provide for integrated data caching that helps reduce data
access latency and speed performance.
3. What is schema-less model and what are its advantages?
Schema less modeling is a modeling scheme in which the semantics of the data are embedded within a
flexible connectivity and storage model which provides for automatic distribution of data and elasticity
with respect to the use of computing, storage, and network bandwidth in ways that don’t force specific
binding of data to be persistently stored in particular physical locations.

4. What is a key value store?


Key value store is a relatively simple type of NoSQL data store. It is a schema-less model in which
values (or sets of values, or even more complex entity objects) are associated with distinct character
strings called keys.
5. What are the various core operations performed on a key value store?
a. Get(key), which returns the value associated with the provided key.
b. Put(key, value), which associates the value with the key.
c. Multi-get(key1, key2,.., keyN), which returns the list of values associated with the list of keys.
d. Delete(key), which removes the entry for the key from the data store.
6. What are the drawbacks of key value pair?
a. The model will not inherently provide any kind of traditional database capabilities (such as atomicity
of transactions, or consistency when multiple transactions are executed simultaneously)—those
capabilities must be provided by the application itself.
b. As the model grows, maintaining unique values as keys may become more difficult, requiring the
introduction of some complexity in generating character strings that will remain unique among a
myriad of keys.
7. What is a document store?
A document store is similar to a key value store in that stored objects are associated (and therefore
accessed via) character string keys. The difference is that the values being stored, which are referred to
as “documents,” provide some structure and encoding of the managed data.
8. Name the various encodings utilized in document store for managing data.
XML (Extensible Markup Language), JSON (Java Script Object Notation), BSON (which is a binary
encoding of JSON objects), or other means of serializing data (i.e., packaging up the potentially
linearizing data values associated with a data record or object).
9. Distinguish between document store and key value store.
While the key value store requires the use of a key to retrieve data, the document store often provides a
means (either through a programming API or using a query language) for querying the data based on
the contents. Because the approaches used for encoding the documents embed the object metadata, one
can use methods for querying by example.
10. What are tabular stores? Give an example.
Tabular, or table-based stores are largely descended from Google’s original Bigtable design to manage
structured data. The HBase model is an example of a Hadoop-related NoSQL data management system
that evolved from Bigtable.
11. Write short notes on the Bigtable NoSQL model.
The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table
that is indexed by a row key (that is used in a fashion that is similar to the key value and document
stores), a column key that indicates the specific attribute for which a data value is stored, and a
timestamp that may refer to the time at which the row’s column value was stored.
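A rough R sketch of this (row key, column key, timestamp) -> value map, flattened into a data frame purely for illustration (all keys and values are hypothetical):

   cells <- data.frame(
     row_key    = c("row1", "row1", "row2"),
     column_key = c("contents:html", "anchor:ref", "contents:html"),
     timestamp  = c("2021-06-01", "2021-06-01", "2021-06-02"),
     value      = c("<html>...</html>", "home page", "<html>...</html>"),
     stringsAsFactors = FALSE
   )
   # a cell is addressed by its row key, column key, and timestamp
   subset(cells, row_key == "row1" & column_key == "contents:html")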
12. What are ACID properties?
Atomicity, Consistency, Isolation, and Durability.
13. Write short notes on object data stores.
Object data stores and object databases seem to bridge the worlds of schema-less data management and
the traditional relational models. Approaches to object databases can be similar to document stores
except that document stores explicitly serialize the object so the data values are stored as strings,
while object databases maintain the object structures as they are bound to object-oriented programming
languages such as C++, Objective-C, Java, and Smalltalk.
14. Write short notes on graph databases.
Graph databases provide a model of representing individual entities and numerous kinds of
relationships that connect those entities. More precisely, it employs the graph abstraction for
representing connectivity, consisting of a collection of vertices (which are also referred to as nodes or
points) that represent the modeled entities, connected by edges (which are also referred to as links,
connections, or relationships) that capture the way that two entities are related. Graph analytics
performed on graph data stores are somewhat different from the more frequently used querying and
reporting.
15. Mention the two key criteria for which a NoSQL data management environment is engineered.
a. Fast accessibility b. Scalability for volume.
16. What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries,
and the analysis of large datasets stored in Hadoop-compatible file systems.
17. What is the purpose of Hive?
Hive is specifically engineered for data warehouse querying and reporting and is not intended for use
within transaction processing systems that require real-time query execution or transaction semantics
for consistency at the row level.
18. What is HBase?
HBase is a nonrelational data management environment that distributes massive datasets over the
underlying Hadoop framework. HBase is derived from Google’s BigTable and is a column-oriented
data layout that, when layered on top of Hadoop, provides a fault-tolerant method for storing and
manipulating large data tables.
19. Mention the basic operations of HBase.
a. Get (which accesses a specific row in the table),
b. Put (which stores or updates a row in the table),
c. Scan (which iterates over a collection of rows in the table), and
d. Delete (which removes a row from the table)
20. What is R?
R is a programming language and software framework for statistical analysis and graphics. Available
for use under the GNU General Public License, R software and installation instructions can be obtained
via the Comprehensive R Archive Network (CRAN).
21. Mention the purpose of the read.csv(), head(), summary(), and lm() functions.
The read.csv() function is used to import a CSV file into a data frame. The head() function displays the
first six records of a dataset. The summary() function provides descriptive statistics, such as the mean
and median, for each data column. The lm() function is used for linear regression.
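A short usage sketch, assuming a hypothetical file yearly_sales.csv with columns sales_total and num_of_orders:

   sales <- read.csv("yearly_sales.csv")                       # import the CSV file
   head(sales)                                                 # first six records
   summary(sales)                                              # mean, median, quartiles per column
   results <- lm(sales_total ~ num_of_orders, data = sales)    # fit a linear regression
   summary(results)                                            # coefficients, R-squared, p-values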
22. What is a generic function? Give an example.
A generic function is a group of functions sharing the same name but behaving differently depending
on the number and the type of arguments they receive. Ex. summary () function.
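For instance, summary() dispatches to a different method depending on the class of its argument:

   summary(c(2, 4, 6, 8))                    # numeric vector: min, quartiles, mean, max
   summary(factor(c("yes", "no", "yes")))    # factor: a count for each level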
23. Mention some R graphical user interfaces.
a. [Link], b. R commander c. Rattle d. RStudio
24. Name the four highlighted window panes of RStudio graphical user interface.
a. Scripts: Serves as an area to write and save R code
b. Workspace: Lists the datasets and variables in the R environment
c. Plots: Displays the plots generated by the R code and provides a straightforward mechanism to
export the plots
d. Console: Provides a history of the executed R code and the output.
25. What is the purpose of the save.image() and load() functions?
R allows one to save the workspace environment, including variables and loaded libraries, into an
.RData file using the save.image() function. An existing .RData file can be loaded using the load()
function.
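A brief sketch (the file name is illustrative):

   save.image("unit5_session.RData")   # write all workspace variables to an .RData file
   rm(list = ls())                     # clear the current workspace
   load("unit5_session.RData")         # restore the saved variables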
26. What are the various functions used for importing a dataset in R?
a. read.table() function
b. read.csv() function
c. read.delim() function
d. read.csv2() function
e. read.delim2() function
27. What are the various functions used for exporting the R dataset to an external file?
a. write.table(), b. write.csv(), and c. write.csv2()
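A short usage sketch of the import and export functions above (file names are illustrative):

   df1 <- read.table("survey.txt", header = TRUE, sep = "\t")   # tab-delimited text file
   df2 <- read.csv("survey.csv")                                # comma-separated values
   write.csv(df2, file = "survey_out.csv", row.names = FALSE)   # export a data frame to CSV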
28. State the purpose of the R packages DBI and RODBC.
Sometimes it is necessary to read data from a database management system (DBMS). R packages such
as DBI and RODBC are used for this purpose. These packages provide database interfaces for
communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal
Greenplum.
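A hedged sketch using RODBC, assuming the package is installed and that "training2" is an ODBC data source name configured on the machine; the table name and credentials are hypothetical:

   library(RODBC)                                        # assumes the RODBC package is installed
   conn <- odbcConnect("training2", uid = "user", pwd = "password")
   housing <- sqlQuery(conn, "SELECT * FROM housing")    # run SQL in the DBMS, return a data frame
   odbcClose(conn)                                       # release the connection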
29. What are the four major categories of attributes in R?
a. Nominal b. Ordinal c. Interval d. Ratio
30. What are the various data types supported by R?
a. Numeric b. Character c. Logical data types.
31. What is the purpose of class and typeof functions?
The class() function represents the abstract class of an object. The typeof() function determines the
way an object is stored in memory.
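A quick illustration of the difference:

   i <- 5L
   class(i);  typeof(i)        # "integer"  "integer"
   x <- 5
   class(x);  typeof(x)        # "numeric"  "double"  (numerics are stored as doubles)
   flag <- TRUE
   class(flag); typeof(flag)   # "logical"  "logical"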
32. What is a vector in R?
Vectors are a basic building block for data in R. Simple R variables are actually vectors. A vector can
only consist of values of the same class. Tests for vectors can be conducted using the is.vector()
function.
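For example:

   v <- c(1, 3, 5, 7)          # a numeric vector
   is.vector(v)                # TRUE
   w <- c(1, "a", TRUE)        # mixed input is coerced to a single class
   class(w)                    # "character"
   is.vector(w)                # TRUE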
33. What is the purpose of data frame?
Data frames provide a structure for storing and accessing several variables of possibly different data
types.
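A minimal sketch (the variable names and values are illustrative):

   df <- data.frame(name   = c("Asha", "Vikram"),
                    age    = c(34, 29),
                    member = c(TRUE, FALSE))
   str(df)       # three variables of different types stored together
   df$age        # access one variable (column) by name
   df[1, ]       # access one observation (row)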
34. What are contingency tables in R?
In R, table refers to a class of objects used to store the observed counts across the factors for a given
dataset. Such a table is commonly referred to as a contingency table and is the basis for performing a
statistical test on the independence of the factors used to build the table.
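A small sketch building a contingency table from two hypothetical factors and testing their independence:

   gender  <- factor(c("F", "M", "F", "F", "M", "M"))
   channel <- factor(c("web", "web", "store", "store", "web", "store"))
   ct <- table(gender, channel)    # observed counts across the two factors
   ct
   chisq.test(ct)                  # test of independence based on the contingency table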
35. Mention some functions for visualizing a single variable in R.
a. plot(data) b. barplot(data ) c. dotchart(data ) d. hist(data) e. plot(density(data)) f. stem(data) g.
rug(data)
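A quick sketch of several of these, using the built-in mtcars dataset purely for illustration:

   x <- mtcars$mpg        # a single numeric variable
   plot(x)                # values plotted against their index
   hist(x)                # histogram
   plot(density(x))       # density plot
   rug(x)                 # adds a rug of data points to the current plot
   dotchart(x)            # dot chart
   stem(x)                # stem-and-leaf display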
36. What is hypothesis testing?
It is a common technique to assess the difference or the significance of the difference of the means
from two samples of data. The basic concept of hypothesis testing is to form an assertion and test it
with data. When performing hypothesis tests, the common assumption is that there is no difference
between two samples. This assumption is used as the default position for building the test or conducting
a scientific experiment. Statisticians refer to this as the null hypothesis (H0). The alternative hypothesis
(HA) is that there is a difference between two samples.
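A minimal sketch of a two-sample t-test on simulated data (the samples are illustrative):

   set.seed(1)                           # make the illustration reproducible
   x <- rnorm(40, mean = 100, sd = 5)    # sample 1 (simulated)
   y <- rnorm(40, mean = 103, sd = 5)    # sample 2 (simulated)
   t.test(x, y)          # tests H0: equal means against HA: the means differ
   t.test(x)$conf.int    # 95% confidence interval for the mean of x (see Q.37)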
37. What is confidence interval?
A confidence interval is an interval estimate of a population parameter or characteristic based on
sample data. A confidence interval is used to indicate the uncertainty of a point estimate. If x is the
estimate of some unknown population mean μ, the confidence interval provides an idea of how close x
is to the unknown μ.
38. What is Wilcoxon rank sum test?
The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks whether two populations are
identically distributed.
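A brief sketch on two hypothetical samples:

   a <- c(12, 15, 11, 19, 14, 16)
   b <- c(22, 18, 25, 17, 20, 24)
   wilcox.test(a, b)     # H0: the two populations are identically distributed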
39. What is ANOVA?
ANOVA is a generalization of the hypothesis testing of the difference of two population means.
ANOVA tests if any of the population means differ from the other population means. The null
hypothesis of ANOVA is that all the population means are equal. The alternative hypothesis is that at
least one pair of the population means is not equal.
40. Mention the types of ANOVA.
a. One-way ANOVA b. Two-way ANOVA c. Multivariate ANOVA
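A minimal one-way ANOVA sketch on hypothetical data with three groups:

   scores <- data.frame(
     value = c(5.1, 4.9, 5.3, 6.2, 6.0, 6.4, 7.1, 7.3, 6.9),
     group = factor(rep(c("A", "B", "C"), each = 3))
   )
   fit <- aov(value ~ group, data = scores)    # one-way ANOVA
   summary(fit)                                # F statistic and p-value for H0: all means equal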

CS8091-BIG DATA ANALYTICS – QUESTION BANK


1. Discuss in detail about key-value pairs. (Pg. 1-2)
2. Explain in detail about NoSQL databases and schema-less models. (Pg. 1)
3. Elaborate document stores in detail. (Pg. 2-3)
4. What do you know about tabular stores, object data stores and graph databases? Explain in
detail. (Pg. 3-4)
5. Elaborate in detail about the architecture of HBase. (Pg. 7-8)
6. What is Hive? Explain in detail. (Pg. 4)
7. Illustrate in detail about the graphical user interface, data import and export operations in
R. (Pg. 15-16)
8. Examine the attributes and data types in R. (Pg. 17-18)
9. Compare and contrast visualizing a single variable and multiple variables in R. (Pg. 19-22)
10. Explain in detail about the Box-and-Whisker plot. (Pg. 22)
11. Elaborate in detail about statistical methods for evaluation. (Pg. 23-27)
12. Discuss in detail about sharding. (Pg. 4-7)
13. Explain in detail about analyzing big data with Twitter. (Pg. 8-10)
14. Explain in detail about big data for E-Commerce. (Pg. 10-12)
15. Explain in detail about big data for blogs. (Pg. 12-13)
16. Make a review of basic data analytic methods using R. (Pg. 13-27)