CS8091 / Big Data Analytics
III Year / VI Semester
UNIT V NOSQL DATA MANAGEMENT
FOR BIG DATA AND VISUALIZATION
NoSQL Databases : Schema-less Models:
Increasing Flexibility for Data Manipulation-Key
Value Stores- Document Stores - Tabular Stores -
Object Data Stores - Graph Databases - Hive -
Sharding - HBase - Analyzing big data with Twitter
- Big data for E-Commerce - Big data for blogs -
Review of Basic Data Analytic Methods using R.
NoSQL
Most hardware and software appliances support
standard approaches built around SQL-based
relational database management systems
(RDBMSs).
Software appliances often bundle their execution
engines with the RDBMS and utilities for creating
the database structures and for bulk data loading.
NoSQL
The availability of a high-performance, elastic
distributed data environment enables creative
algorithms to exploit variant modes of data
management in different ways.
Data management frameworks are bundled
under the term “NoSQL databases”.
NoSQL
NoSQL stands for "Not only SQL."
These systems combine traditional SQL (or SQL-like
query languages) with alternative means of querying
and access.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
NoSQL data systems hold out the promise of
greater flexibility in database management
while reducing the dependence on more
formal database administration.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types:
Key Value stores: align to certain big data
programming models
Graph Database: a graph abstraction is
implemented to embed both semantics and
connectivity within its structure.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
NoSQL databases also provide for integrated
data caching that helps reduce data access
latency and speed performance.
The loosening of the relational structure is
intended to allow different models to be
adapted to specific types of analyses.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores:
Values (or sets of values, or even more complex
entity objects) are associated with distinct
character strings called keys.
Programmers may see similarity with the data
structure known as a hash table.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores:
Key: BMW
Value: {"1-Series", "3-Series", "5-Series", "5-Series GT",
"7-Series", "X3", "X5", "X6", "Z4"}
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores:
The key is the name of the automobile make,
while the value is a list of names of models
associated with that automobile make.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Operations:
Get(key), which returns the value associated with the
provided key.
Put(key, value), which associates the value with the key.
Multi-get(key1, key2,.., keyN), which returns the list of
values associated with the list of keys.
Delete(key), which removes the entry for the key from the
data store
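As a rough illustration (not tied to any particular key-value product), these operations can be sketched in R using an environment as an in-memory hash table; the function names simply mirror the operations listed above.

store <- new.env(hash = TRUE)                      # in-memory key-value store (illustrative)
Put <- function(key, value) assign(key, value, envir = store)
Get <- function(key) {
  if (exists(key, envir = store, inherits = FALSE)) get(key, envir = store) else NULL
}
MultiGet <- function(...) lapply(c(...), Get)      # returns the values for a list of keys
Delete <- function(key) {
  if (exists(key, envir = store, inherits = FALSE)) rm(list = key, envir = store)
}

Put("BMW", c("1-Series", "3-Series", "X5"))        # associate a value with the key
Get("BMW")                                          # retrieve the value for the key
Delete("BMW")                                       # remove the entry for the key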
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Characteristics:
Uniqueness of the key - to find the values you are
looking for, you must use the exact key.
In this data management approach, if you want to
associate multiple values with a single key, you need
to consider the representations of the objects and how
they are associated with the key.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Characteristics:
Key-value stores are essentially very long, and likely thin,
tables (there are not many columns associated with each row).
The table’s rows can be sorted by the key value to simplify
finding the key during a query.
The keys can be hashed using a hash function that maps
the key to a particular location (sometimes called a
“bucket”) in the table.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Characteristics:
The representation can grow indefinitely, which
makes it good for storing large amounts of data that
can be accessed relatively quickly, as well as
environments requiring incremental appends of data.
Examples include capturing system transaction logs,
managing profile data about individuals.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Characteristics:
The simplicity of the representation allows massive
amounts of indexed data values to be appended to the
same key value table, which can then be sharded, or
distributed across the storage nodes.
Under the right conditions, the table is distributed in a
way that is aligned with the way the keys are
organized.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores:
While key value pairs are very useful for both
storing the results of analytical algorithms (such as
phrase counts among massive numbers of
documents) and for producing those results for
reports, the model does pose some potential
drawbacks.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Key-Value Stores – Drawbacks:
One is that the model will not inherently provide any kind of
traditional database capabilities (such as atomicity of
transactions, or consistency when multiple transactions
are executed simultaneously); those capabilities must
be provided by the application itself.
Another is that as the model grows, maintaining
unique values as keys may become more difficult.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Document Stores:
A document store is similar to a key value store in that
stored objects are associated (and therefore accessed via)
character string keys.
The difference is that the values being stored, which are
referred to as “documents,” provide some structure and
encoding of the managed data.
Common encodings include XML, JSON, and BSON.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Document Stores – Example:
{StoreName: "Retail Store #34", {Street: "1203 O ST",
City: "Lincoln", State: "NE", ZIP: "68508"}}
{StoreName: "Retail Store #65", {MallLocation: "Westfield
Wheaton", City: "Wheaton", State: "IL"}}
{StoreName: "Retail Store #102", {Latitude: "40.748328",
Longitude: "-73.985560"}}
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Document Stores:
The document representation embeds the model
so that the meanings of the document values can
be inferred by the application.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Document Stores:
One of the differences between a key-value store
and a document store is that while the former
requires the use of a key to retrieve data, the latter
often provides a means (either through a
programming API or using a query language) for
querying the data based on the contents.
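A hedged sketch of this idea in R: each document is a nested list keyed by an identifier, so it can be fetched by key (as in a key-value store) or filtered by its contents; the layout and field names below are purely illustrative.

docs <- list(
  store34  = list(StoreName = "Retail Store #34",  City = "Lincoln", State = "NE"),
  store65  = list(StoreName = "Retail Store #65",  City = "Wheaton", State = "IL"),
  store102 = list(StoreName = "Retail Store #102", Latitude = 40.748328, Longitude = -73.985560)
)
docs[["store34"]]                                   # key-based lookup
Filter(function(d) identical(d$State, "IL"), docs)  # content-based query: documents located in IL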
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Tabular Stores:
Tabular, or table-based stores are largely derived
from Google’s original Bigtable design to manage
structured data.
HBase, a Hadoop-related NoSQL data management
system that evolved from Bigtable, is one example of this model.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Tabular Stores:
The Bigtable NoSQL model allows sparse data to
be stored in a three-dimensional table that is
indexed by a row key, a column key that indicates
the specific attribute for which a data value is
stored, and a timestamp that may refer to the time
at which the row’s column value was stored.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Tabular Stores:
As an example, various attributes of a web page
can be associated with the web page’s URL:
the HTML content of the page,
URLs of other web pages that link to this web page, and
the author of the content.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Tabular Stores:
Columns in a Bigtable model are grouped together as
“families,” and the timestamps enable management of
multiple versions of an object.
The timestamp can be used to maintain history—each
time the content changes, new column attachments can
be created with the timestamp of when the content was
downloaded.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Object Data Stores:
Object databases can be similar to document
stores, except that document stores explicitly
serialize the object so the data values are stored as
strings, while object databases maintain the object
structures as they are bound to object-oriented
programming languages.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Object Data Stores:
Object database management systems are more likely
to provide traditional ACID (atomicity, consistency,
isolation, and durability) compliance—characteristics
that are bound to database reliability. Object databases
are not relational databases and are not queried using
SQL.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
Types: Graph Databases:
Graph databases provide a model of representing
individual entities and numerous kinds of relationships
that connect those entities.
A graph consists of a collection of vertices that represent
the modeled entities, connected by edges that capture
the way two entities are related.
Hive
Apache Hive enables users to process data
without explicitly writing MapReduce code.
One key difference from Pig is that the Hive
language, HiveQL (Hive Query Language),
resembles Structured Query Language (SQL)
rather than a scripting language.
Hive
A Hive table structure consists of rows and columns.
The rows typically correspond to some record, transaction,
or particular entity (for example, customer) detail.
The values of the corresponding columns represent the
various attributes or characteristics for each row.
Hadoop and its ecosystem are used to apply some structure
to unstructured data.
Hive
Therefore, if a table structure is an appropriate
way to view the restructured data, Hive may
be a good tool to use.
Additionally, a user may consider using Hive
if the user has experience with SQL and the
data is already in HDFS.
Hive
Another consideration in using Hive may be how
data will be updated or added to the Hive tables.
If data will simply be added to a table
periodically, Hive works well, but if there is a
need to update data in place, it may be beneficial
to consider another tool, such as HBase.
Hive
A Hive query is first translated into a MapReduce
job, which is then submitted to the Hadoop
cluster.
Thus, the execution of the query has to compete
for resources with any other submitted job.
Hive is intended for batch processing.
Hive
Hive is appropriate when:
Data easily fits into a table structure.
Data is already in HDFS.
Developers are comfortable with SQL programming
and queries.
There is a desire to partition datasets based on time.
Batch processing is acceptable.
Hive
HiveQL Basics:
From the command prompt, a user enters the
interactive Hive environment by simply entering
hive:
$ hive
hive>
Hive
HiveQL Basics:
From this environment, a user can define new tables,
query them, or summarize their contents.
hive> create table customer (cust_id bigint,
first_name string, last_name string, email_address
string) row format delimited fields terminated by '\t';
Hive
HiveQL Basics:
A HiveQL query is executed to count the number of records
in the newly created table, customer.
Because the table is currently empty, the query returns a
result of zero as the last line of the provided output.
The query is converted and run as a MapReduce job,
which results in one map task and one reduce task being
executed.
Hive
HiveQL Basics:
hive> select count(*) from customer;
When querying large tables, Hive scales better
than most conventional database queries.
Hive
HiveQL Basics:
To load the customer table with the contents of HDFS
file, customer.txt, it is only necessary to provide the
HDFS directory path to the file.
hive> load data inpath '/user/customer.txt' into table
customer;
hive> select * from customer limit 3;
Hive – Use Cases
Exploratory or ad-hoc analysis of HDFS data:
Data can be queried, transformed, and exported to
analytical tools, such as R.
Extracts or data feeds to reporting systems,
dashboards, or data repositories such as HBase:
Hive queries can be scheduled to provide such
periodic feeds.
Hive – Use Cases
Combining external structured data with data
already residing in HDFS:
The data from an RDBMS can be periodically
added to Hive tables for querying with existing
data in HDFS.
Sharding
Sharding is a database architecture pattern related to
horizontal partitioning.
Horizontal partitioning is the practice of separating one
table's rows into multiple different tables, known as partitions.
Each partition has the same schema and columns, but
entirely different rows.
The data held in each partition is unique and independent of
the data held in the other partitions.
Sharding
Sharding
In a vertically-partitioned table, entire
columns are separated out and put into new,
distinct tables.
The data held within one vertical partition is
independent from the data in all the others, and
each holds both distinct rows and columns.
Sharding
Horizontal or Range Based Sharding:
In this case, the data is split based on the value
ranges that are inherent in each entity.
For example, if you store the contact info for your
online customers, you might choose to store the info
for customers whose last name starts with A-H on one
shard, while storing the rest on another shard.
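A minimal sketch of this assignment rule in R, assuming just two shards split at the letter H (the split point and shard names are assumptions for illustration):

range_shard <- function(last_name) {
  first_letter <- toupper(substr(last_name, 1, 1))
  if (first_letter %in% LETTERS[1:8]) "shard_A_to_H" else "shard_I_to_Z"
}
range_shard("Anderson")   # "shard_A_to_H"
range_shard("Smith")      # "shard_I_to_Z"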
Sharding
Horizontal or Range Based Sharding:
Original table:
ID | Name | Mail ID
1  | A    | [email protected]
2  | B    | [email protected]
3  | C    | [email protected]
4  | D    | [email protected]
After horizontal sharding, shard 1 holds rows 1 (A) and 2 (B), and
shard 2 holds rows 3 (C) and 4 (D); both shards keep the same
columns (ID, Name, Mail ID).
Sharding
Horizontal or Range Based Sharding:
Advantages:
Each shard also has the same schema as the original
database.
It works well for relatively non-static data; for
example, storing the contact info for students in a
college, because the data is unlikely to see huge churn.
Sharding
Horizontal or Range Based Sharding:
Disadvantages:
The disadvantage of this scheme is that the last
names of the customers may not be evenly distributed.
In that case, your first shard will be experiencing a
much heavier load than the second shard and can
become a system bottleneck.
Sharding
Vertical Sharding:
In this case, different features of an entity will be
placed in different shards on different machines.
Vertical Sharding:
Sharding
Original table:
ID | Name | Mail ID
1  | A    | [email protected]
2  | B    | [email protected]
3  | C    | [email protected]
4  | D    | [email protected]
After vertical sharding, one shard holds the (ID, Name) columns
for all rows and the other shard holds the (ID, Mail ID) columns
for all rows.
Sharding
Vertical Sharding – Benefits:
You can handle the critical part of your data differently
from the less critical part of your data and build
different replication and consistency models
around it.
Sharding
Vertical Sharding – Disadvantages:
It increases the development and operational
complexity of the system.
If your site/system experiences additional growth,
it may be necessary to further shard a feature-specific
database across multiple servers.
Sharding
Key or hash based sharding:
In this case, an entity has a value that can be
used as input to a hash function, and a resultant
hash value is generated. This hash value determines
which database server (shard) to use.
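A minimal sketch of hash-based shard selection in R; the toy hash (sum of character codes) and the shard count of 4 are assumptions for illustration, not a production scheme.

n_shards <- 4
hash_shard <- function(key, n = n_shards) {
  h <- sum(utf8ToInt(key))   # toy hash: sum of the key's character codes
  (h %% n) + 1               # shard numbers 1..n
}
hash_shard("customer-1001")
# Changing n remaps most keys, which is why elastic load balancing is hard with this scheme.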
Sharding
Key or hash based sharding:
The main drawback of this method is that elastic load
balancing (dynamically adding or removing database
servers) becomes very difficult and expensive: changing the
number of servers changes the hash mapping, so data must be
migrated. During the migration, a large number of requests
cannot be serviced, and you will incur downtime until the
migration completes.
Sharding
Directory based sharding:
Directory based shard partitioning involves placing a
lookup service in front of the sharded databases.
The lookup service knows the current partitioning scheme
and keeps a map of each entity and which database shard it
is stored on.
The lookup service is usually implemented as a web
service.
Sharding
Directory based sharding:
The client application first queries the lookup
service to figure out the shard (database partition)
on which the entity resides/should be placed.
Then it queries / updates the shard returned by the
lookup service.
Sharding
Directory based sharding:
Sharding
Directory based sharding – Steps:
Example: resharding from 4 shards to 10 shards.
1. Keep the modulo-4 hash function in the lookup service.
2. Determine the data placement based on the new hash
function, modulo 10.
3. Write a script to copy all the data based on step 2 into the six
new shards (and possibly onto the 4 existing shards). Note that
it does not delete any existing data on the 4 existing shards.
Sharding
Directory based sharding – Steps:
4. Once the copy is complete, change the hash function
to modulo 10 in the lookup service.
5. Run a cleanup script to purge unnecessary data from the 4
existing shards based on step 2, since the purged data now
exists on the other shards.
Sharding
Directory based sharding – Steps:
There are two practical considerations, which need to be solved on a
per-system basis:
While the migration is happening, the users might still be updating
their data. Options include putting the system in read-only mode or
placing new data in a separate server that is placed into correct shards
once migration is done.
The copy and cleanup scripts might have an effect on system
performance during the migration. It can be circumvented by using
system cloning and elastic load balancing - but both are expensive.
HBase
HBase is a distributed column-oriented database
built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
Apache HBase is capable of providing real-time
read and write access to datasets with billions of
rows and millions of columns.
HBase
HBase is a data model similar to Google's Bigtable,
designed to provide quick random access to huge amounts
of structured data.
It leverages the fault tolerance provided by the Hadoop File
System.
It is a part of the Hadoop ecosystem that provides random
real-time read/write access to data in the Hadoop File
System.
HBase
One can store the data in HDFS either directly
or through HBase.
Data consumer reads/accesses the data in
HDFS randomly using HBase.
HBase sits on top of the Hadoop File System
and provides read and write access.
HBase
Storage Mechanism in Hbase:
HBase is a column-oriented database and the tables in it are
sorted by row.
The table schema defines only column families, which are the
key value pairs.
A table can have multiple column families, and each column family
can have any number of columns.
Subsequent column values are stored contiguously on the disk.
Each cell value of the table has a timestamp.
HBase
Storage Mechanism in Hbase - In short, in an
HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
HBase
Architecture:
Tables are split into regions and are served by the
region servers.
Regions are vertically divided by column families
into “Stores”.
Stores are saved as files in HDFS.
HBase
Architecture:
HBase
Architecture – Master Server:
Assigns regions to the region servers
Handles load balancing of the regions across region
servers. It unloads the busy servers and shifts the
regions to less occupied servers.
Maintains the state of the cluster by negotiating the
load balancing.
HBase
Architecture – Regions:
Tables are split up and spread across the region
servers.
HBase
Architecture – Region Servers:
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of a region by following the region size
thresholds.
Writes are first held in the MemStore (cache memory); the data is
then transferred and saved in HFiles as blocks, and the
MemStore is flushed.
HBase
Architecture – Zoo Keeper:
It provides services like maintaining configuration
information, naming, providing distributed
synchronization, etc.
It keeps track of all the region servers in the HBase
cluster, tracking information such as how many
region servers there are and which region servers
hold which DataNode.
HBase
Architecture – Zoo Keeper:
Services:
Establishing client communication with region
servers.
Tracking server failure and network partitions.
Maintain Configuration Information.
Analyzing Big Data With Twitter
Twitter is an effective tool for a company to
get people excited about its product.
To engage users whose updates tend to
generate lots of retweets.
Twitter tracks retweet counts for all tweets
Analyzing Big Data With Twitter
A simple Search
To search for tweets containing the word
“Happy”.
index=twitter text=*Happy*
It searches only the Twitter index and finds all the
places where the word Happy is mentioned.
Analyzing Big Data With Twitter
Examining the Twitter Event:
_time: Splunk assigns a timestamp for every event. This is
done in UTC time format.
Contributors: The value for this field is null.
Retweeted_status: Other fields associated with a tweet.
The 140-character text field that most people consider to
be the tweet is actually a small part of the actual data
collected.
Analyzing Big Data With Twitter
Implied AND:
To search for all tweets that include both the text
“Happy” and the text “morning”,
index = twitter text = *Happy* text = *Morning*
Analyzing Big Data With Twitter
Need to specify OR
To obtain all events that mention either Happy or
morning
index = twitter text = *Happy* OR text =
*Morning*
Analyzing Big Data With Twitter
Finding Other words used:
To find out what other words are used in tweets about
Happy.
index = twitter text = *Happy* | makemv text | mvexpand
text | top 30 text
First searches for the word “Happy” in a text field, then
creates a multivalued field from the tweet, and then
expands it so that each word is treated as a separate piece.
Analyzing Big Data With Twitter
Finding Other words used:
Then it takes the top 30 words that it finds. To limit
the top words:
index=twitter text=*Happy* | makemv text |
mvexpand text | search NOT text="RT" AND NOT
text="a" AND NOT text="to" AND NOT text="the" |
top 30 text
Analyzing Big Data For E-Commerce
It allows businesses to gain access to
significantly larger amounts of data in order to:
convert growth into revenue
streamline operational processes
gain more customers
Analyzing Big Data For E-Commerce
Big Data Solutions:
Optimize Customer Shopping Experience
Customer behaviour pattern, purchase histories,
browsing interests
To re-target the buyers by displaying or recommending
products that they are interested in.
Analyzing Big Data For E-Commerce
Big Data Solutions:
Higher Customer Satisfaction
It helps stores understand their customers better and
build a lasting relationship with them.
Analyzing Big Data For E-Commerce
Big Data Solutions:
Streaming Analytics
to gain valuable customer insights.
The store's data provides a lot of insights that help in
personalizing the shopper's experience and generating
more revenue.
Analyzing Big Data For E-Commerce
Recommendation Engine:
Analyze all the actions of a particular customer
Product pages visited
products they liked
added into their carts
Finally bought / abandoned
Analyzing Big Data For E-Commerce
Recommendation Engine:
The system can also compare the behaviour
pattern of a certain visitor to those of the other
visitors.
Based on this analysis, it recommends the products that a
visitor may like.
Analyzing Big Data For E-Commerce
Personalized Shopping Experience:
A key to successful e-commerce marketing
To react to their customers' actions properly and
in real time.
Analyze all customer activities in an e-shop and
create a picture of the customer behavior pattern.
Analyzing Big Data For E-Commerce
Everything in the cart is tracked:
The system encourages the customer to finish the purchase
with discounts.
Example: A customer bought a winter coat two
weeks ago and visited some product pages with
winter gloves, scarfs and hats at that time.
Analyzing Big Data For E-Commerce
Voice of the customer:
To add sentiment analysis to the standard approach of
analyzing products and brands by their sales value,
volume, revenues, number of orders, etc.
Sentiment analysis is the evaluation of comments that
customers left about different products and brands.
Analyzing Big Data For E-Commerce
Voice of the customer:
The system identifies whether each comment is
positive or negative.
Positive Comments: Happy, Great, Recommend
or Satisfied
Negative Comments: Bad, Terrible
Analyzing Big Data For E-Commerce
Dynamic Pricing:
Setting price rules, monitoring competitors and
adjusting prices in real time.
Analyzing Big Data For E-Commerce
Demand Forecasting:
Creating customer profiles
looking at their customer’s behavior
how many items they usually purchase and which
products they buy.
To collect, analyze, and visualize the analysis results.
Retailers will also analyze external big data.
Analyzing Big Data For Blogs
A wiki is a collection of web pages
interconnected with each other through
internal links.
Wikipedia has more than a million such pages in
its English version, and versions now exist in
other languages too.
Analyzing Big Data For Blogs
End user working Mechanism:
Roles:
Reader
Writer
Admin
Analyzing Big Data For Blogs
End user working Mechanism:
To understand how a wiki community works:
Go to Wikipedia and find a topic that you know something about.
Search for and read the page about that topic on Wikipedia.
Edit the page and add/change a sentence or two in the article.
Simply click on "Edit this page."
Submit your change.
Analyzing Big Data For Blogs
Wikistats – Open Source UI – Community
Environment:
Public statistics website
To add context and motivate our editor community by
providing a set of metrics through which users can see
the impact of their contributions in the projects they
are a part of.
Analyzing Big Data For Blogs
Wikistats – Open Source UI – Community
Environment:
Wikistats 2 not only updates the website interface but
also provides new access to all edit data in an
analytics-friendly form.
Wikistats 2 computes metrics by extracting data from
MediaWiki databases, processing it, and re-storing it in an
analytics-friendly form so metrics can be extracted easily.
Analyzing Big Data For Blogs
Wikistats – Open Source UI – Community
Environment:
Wikistats 2 is a client-side-only single-page
application; this means that it does not have a
server component and can be served from
anywhere.
Analyzing Big Data For Blogs
Wikistats – Open Source UI – Community
Environment:
The dashboard: Metrics belong to one of three
main areas: reading, contributing and content.
The detail page: Breakdown selectors, time range
and granularity selectors.
Analyzing Big Data For Blogs
MediaWiki:
Collect and organize knowledge and make it available
to people.
The MediaWiki action API is a web service that allows
access to some wiki features like authentication, page
operations, and search. It can provide meta information
about the wiki and the logged-in user.
Analyzing Big Data For Blogs
MediaWiki – Uses:
Monitor a MediaWiki installation
Log into a wiki, access data, and post changes by
making HTTP requests to the web service.
Review of Basic Data Analytics
Methods using R
The study of the data in terms of basic
statistical measures and creation of graphs and
plots to visualize and identify relationships and
patterns.
Review of Basic Data Analytics
Methods using R
Introduction to R
R is a programming language and software
framework for statistical analysis and graphics.
read.csv() – imports a CSV file.
head() – displays the first six records of a file.
summary() – provides descriptive statistics, such
as the mean and median, for each data column.
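A short sketch of these calls; the file name sales.csv is a placeholder for whatever CSV file is being analyzed.

sales <- read.csv("sales.csv")   # import a CSV file into a data frame
head(sales)                      # display the first six records
summary(sales)                   # descriptive statistics (mean, median, quartiles) per column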
Review of Basic Data Analytics
Methods using R
R Graphical User Interface:
R software uses a command line interface.
To improve the ease of writing, executing and
debugging R code, several additional GUIs have
been written for R.
Example: RStudio, R Commander
Review of Basic Data Analytics
Methods using R
R Graphical User Interface – Panes:
Scripts
Workspace
Plots
Console
Review of Basic Data Analytics
Methods using R
Data Import and Export:
setwd() – sets the working directory for the
subsequent import and export operations.
Example: setwd("path of the directory")
read.table() and read.delim() – import other
common file types such as TXT.
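A sketch of import and export under an assumed working directory and file names; write.csv() is shown as the matching export call.

setwd("C:/data")                                             # working directory (path is illustrative)
sales <- read.table("sales.txt", header = TRUE, sep = "\t")  # tab-delimited TXT import
sales2 <- read.delim("sales.txt")                            # read.delim() defaults to tab-delimited with a header
write.csv(sales, "sales_export.csv", row.names = FALSE)      # export the data frame as CSV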
Review of Basic Data Analytics
Methods using R
Attribute and Data Type:
The characteristics or attributes provide the
qualitative and quantitative measures for each item
or subject of interest.
Review of Basic Data Analytics
Methods using R
Attribute and Data Type:
Categorical attributes: Nominal and Ordinal. Numeric attributes: Interval and Ratio.
Nominal – the values represent labels that distinguish one from another. Example: ZIP code.
Ordinal – the attributes imply a sequence. Example: academic grades.
Interval – the difference between two values is meaningful. Example: calendar dates.
Ratio – both the difference and the ratio of two values are meaningful. Example: age, length.
Review of Basic Data Analytics
Methods using R
Attribute and Data Type:
class() – represents the abstract class of an object.
typeof() – determines the way an object is stored in
memory.
is.data_type(object) – verifies whether an object is of a
certain data type.
as.data_type(object) – converts the data type of an object
to another.
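For example (output shown in comments):

i <- 1:5
class(i)         # "integer"  -- the abstract class of the object
typeof(i)        # "integer"  -- how the object is stored in memory
is.numeric(i)    # TRUE       -- verify the data type
as.character(i)  # "1" "2" "3" "4" "5"  -- convert to another data type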
Review of Basic Data Analytics
Methods using R
Predefined constants
pi
letters
LETTERS
month.name, month.abb
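For example:

pi             # 3.141593
letters[1:3]   # "a" "b" "c"
LETTERS[1:3]   # "A" "B" "C"
month.name[1]  # "January"
month.abb[1]   # "Jan"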
Review of Basic Data Analytics
Methods using R
Data Types
Logical – True or False
Integer – Set of all integers
Numeric – Set of all real numbers
Character – "a", "b"
Review of Basic Data Analytics
Methods using R
Basic object:
Vector – Ordered collection of same data types
List – Ordered collection of objects
Data Frame – Generic tabular object
Review of Basic Data Analytics
Methods using R
Basic object – Vectors:
An ordered collection of basic data types of given
length
All the elements of a vector must be of same data
type.
Example: x=c(1, 2, 3)
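A small illustration of vector behaviour (values chosen arbitrarily):

x <- c(1, 2, 3)   # numeric vector
x[2]              # access the second element: 2
length(x)         # 3
c("a", 1)         # mixing types coerces everything to one type: "a" "1"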
Review of Basic Data Analytics
Methods using R
Basic object – List:
A generic object consisting of an ordered
collection of objects.
A list could consist of a numeric vector, a logical
value, a matrix, and a complex vector.
Review of Basic Data Analytics
Methods using R
Basic object – List:
To access top-level components, use the double-bracket
operator "[[ ]]" (a single component) or "[ ]" (a sublist);
for lower/inner-level components, use "[ ]" together with "[[ ]]".
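A brief illustration with an arbitrary list:

lst <- list(nums = c(10, 20, 30), flag = TRUE, name = "example")
lst[[1]]           # the top-level component itself: 10 20 30
lst[1]             # a one-element list containing that component
lst[["nums"]][2]   # combine [[ ]] and [ ] to reach an inner element: 20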
Review of Basic Data Analytics
Methods using R
Basic object – Data Frame:
Used to store tabular data.
df[val1, val2] – row "val1", column "val2".
val1 and val2 can also be arrays of values, such as 1:2 or
c(1:2).
df[val2] – refers to column "val2" only.
Review of Basic Data Analytics
Methods using R
Basic object – Data Frame:
subset() – extracts a subset of the data based on conditions.
runif(75, 0, 10) – generates 75 random numbers
between 0 and 10.
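A brief illustration with a small, made-up data frame:

df <- data.frame(id = 1:4, score = runif(4, 0, 10))
df[1, 2]               # row 1, column 2
df[1:2, ]              # rows 1 and 2, all columns
df["score"]            # the "score" column only
subset(df, score > 5)  # rows meeting a condition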
Review of Basic Data Analytics
Methods using R
Visualizing a single variable:
plot(data) – suitable for low volume data
barplot(data) – Vertical or horizontal bars
dotchart(data) – dot plot
hist(data) - histogram
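A quick illustration using randomly generated data:

data <- runif(75, 0, 10)   # 75 random values between 0 and 10
plot(data)                 # point plot, fine for low-volume data
barplot(data)              # vertical bars
dotchart(data)             # dot plot
hist(data)                 # histogram of the distribution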
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
It is used during the initial data exploration and
data preparation, model building, evaluation of the
final models.
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
Hypothesis Testing:
To form a statement and test it with data.
When performing hypothesis tests, the common
assumption is that there is no difference between
two samples; this assumption is the null hypothesis (H0).
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
Hypothesis Testing – Applications:
Application: Accuracy forecast
  Null hypothesis: Model X does not predict better than the existing model.
  Alternative hypothesis: Model X predicts better than the existing model.
Application: Recommendation engine
  Null hypothesis: Algorithm Y does not produce better recommendations than the current algorithm being used.
  Alternative hypothesis: Algorithm Y produces better recommendations than the current algorithm being used.
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
Wilcoxon Rank-Sum Test:
It is a nonparametric hypothesis test that checks whether two
populations are identically distributed.
wilcox.test() – ranks the observations, determines the respective
rank sums corresponding to each population's sample, and then
determines the probability of rank sums of such magnitude
being observed, assuming that the population distributions are
identical.
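A minimal sketch with simulated samples (the data below is illustrative, not from the slides):

x <- rnorm(40, mean = 100, sd = 5)   # sample from population 1
y <- rnorm(40, mean = 105, sd = 5)   # sample from population 2
wilcox.test(x, y)                    # small p-value suggests the populations are not identically distributed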
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
Type I and Type II Errors:
Type I Error: The rejection of the null hypothesis when the
null hypothesis is TRUE. Probability is denoted by the
Greek letter α.
Type II Error: The acceptance of a null hypothesis when
the null hypothesis is FALSE. Probability is denoted by the
Greek letter β.
Review of Basic Data Analytics
Methods using R
Statistical Methods for Evaluation:
ANOVA:
Analysis of Variance.
ANOVA is a generalization of the hypothesis testing of the
difference of two population means.
The null hypothesis of ANOVA is that all the population means
are equal.
The alternative hypothesis is that at least one pair of the
population means is not equal.
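A minimal sketch with simulated data (the group labels, sizes, and means below are assumptions for illustration):

offers <- sample(c("offer1", "offer2", "nopromo"), size = 300, replace = TRUE)
purchase <- ifelse(offers == "offer1", rnorm(300, 80, 30),
            ifelse(offers == "offer2", rnorm(300, 85, 30),
                                       rnorm(300, 60, 30)))
model <- aov(purchase ~ as.factor(offers))  # null hypothesis: all group means are equal
summary(model)                              # a small p-value means at least one pair of means differs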