CS8091 / Big Data Analytics

III Year / VI Semester


UNIT V NOSQL DATA MANAGEMENT
FOR BIG DATA AND VISUALIZATION
NoSQL Databases: Schema-less Models: Increasing
Flexibility for Data Manipulation - Key Value Stores -
Document Stores - Tabular Stores - Object Data Stores -
Graph Databases - Hive - Sharding - HBase - Analyzing
Big Data with Twitter - Big Data for E-Commerce - Big
Data for Blogs - Review of Basic Data Analytic Methods
using R.
NoSQL

 Most hardware and software appliances support
standard, SQL-based relational database
management systems (RDBMSs).
 Software appliances often bundle their execution
engines with the RDBMS, along with utilities for
creating the database structures and for bulk data
loading.
NoSQL

 The availability of a high-performance, elastic,
distributed data environment enables creative
algorithms to exploit variant modes of data
management in different ways.
 Such data management frameworks are bundled
under the term “NoSQL databases”.
NoSQL

 NoSQL stands for “Not only SQL”.
 NoSQL systems combine traditional SQL (or
SQL-like query languages) with alternative means
of querying and access.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 NoSQL data systems hold out the promise of
greater flexibility in database management
while reducing the dependence on more
formal database administration.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types:
 Key Value stores: align to certain big data
programming models
 Graph Database: a graph abstraction is
implemented to embed both semantics and
connectivity within its structure.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 NoSQL databases also provide for integrated
data caching that helps reduce data access
latency and speed performance.
 The loosening of the relational structure is
intended to allow different models to be
adapted to specific types of analyses.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores:
 Values (or sets of values, or even more complex
entity objects) are associated with distinct
character strings called keys.
 Programmers may see similarity with the data
structure known as a hash table.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key-Value Stores – Example:

   Key:   BMW
   Value: {“1-Series”, “3-Series”, “5-Series”, “5-Series GT”,
           “7-Series”, “X3”, “X5”, “X6”, “Z4”}
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores:
 The key is the name of the automobile make,
while the value is a list of names of models
associated with that automobile make.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores - Operations:
 Get(key), which returns the value associated with the
provided key.
 Put(key, value), which associates the value with the key.
 Multi-get(key1, key2,.., keyN), which returns the list of
values associated with the list of keys.
 Delete(key), which removes the entry for the key from the
data store.
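 A minimal sketch of these four operations in R, using an environment as an in-memory key-value store (illustrative only; the kv_* function names are not from any particular NoSQL product):

   # In-memory key-value store sketched with an R environment (hash-backed).
   kv <- new.env(hash = TRUE)

   kv_put <- function(key, value) assign(key, value, envir = kv)   # Put(key, value)
   kv_get <- function(key) {                                       # Get(key)
     if (exists(key, envir = kv, inherits = FALSE)) get(key, envir = kv) else NULL
   }
   kv_multiget <- function(...) lapply(c(...), kv_get)             # Multi-get(key1, ..., keyN)
   kv_delete <- function(key) {                                    # Delete(key)
     if (exists(key, envir = kv, inherits = FALSE)) rm(list = key, envir = kv)
   }

   kv_put("BMW", c("1-Series", "3-Series", "X5"))
   kv_get("BMW")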
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores – Characteristics:
 Uniqueness of the key - to find the values you are
looking for, you must use the exact key.
 In this data management approach, if you want to
associate multiple values with a single key, you need
to consider the representations of the objects and how
they are associated with the key.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores – Characteristics:
 Key-value stores are essentially very long, and likely thin
tables.
 The table’s rows can be sorted by the key value to simplify
finding the key during a query.
 The keys can be hashed using a hash function that maps
the key to a particular location (sometimes called a
“bucket”) in the table.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores – Characteristics:
 The representation can grow indefinitely, which
makes it good for storing large amounts of data that
can be accessed relatively quickly, as well as
environments requiring incremental appends of data.
 Examples include capturing system transaction logs,
managing profile data about individuals.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores – Characteristics:
 The simplicity of the representation allows massive
amounts of indexed data values to be appended to the
same key value table, which can then be sharded, or
distributed across the storage nodes.
 Under the right conditions, the table is distributed in a
way that is aligned with the way the keys are
organized.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores:
 While key value pairs are very useful for both
storing the results of analytical algorithms (such as
phrase counts among massive numbers of
documents) and for producing those results for
reports, the model does pose some potential
drawbacks.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Key –Value Stores – Drawbacks:
 the model will not inherently provide any kind of
traditional database capabilities (such as atomicity of
transactions, or consistency when multiple transactions
are executed simultaneously)—those capabilities must
be provided by the application itself.
 Another is that as the model grows, maintaining
unique values as keys may become more difficult.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Document Stores:
 A document store is similar to a key value store in that
stored objects are associated (and therefore accessed
via) character string keys.
 The difference is that the values being stored, which
are referred to as “documents,” provide some structure
and encoding of the managed data.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Document Stores:
 Common encodings include XML and JSON.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Document Stores – Example:
   {StoreName: "Retail Store #34",
    {Street: "1203 O ST", City: "Lincoln", State: "NE", ZIP: "68508"}}
   {StoreName: "Retail Store #65",
    {MallLocation: "Westfield Wheaton", City: "Wheaton", State: "IL"}}
   {StoreName: "Retail Store #102",
    {Latitude: "40.748328", Longitude: "-73.985560"}}
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Document Stores:
 The document representation embeds the model
so that the meanings of the document values can
be inferred by the application.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Document Stores:
 One of the differences between a key-value store
and a document store is that while the former
requires the use of a key to retrieve data, the latter
often provides a means (either through a
programming API or using a query language) for
querying the data based on the contents.
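 A minimal sketch in R of such content-based access, with each document held as a named list (illustrative only; the field names follow the retail-store example above):

   # Each "document" is a named list; the collection is a list of documents.
   docs <- list(
     list(StoreName = "Retail Store #34", City = "Lincoln", State = "NE"),
     list(StoreName = "Retail Store #65", City = "Wheaton", State = "IL")
   )

   # Query by content rather than by key: find all stores in a given state.
   find_by_state <- function(collection, state) {
     Filter(function(d) !is.null(d$State) && d$State == state, collection)
   }

   find_by_state(docs, "IL")   # returns the Retail Store #65 document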
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Tabular Stores:
 Tabular, or table-based, stores are largely derived
from Google’s original Bigtable design for managing
structured data.
 HBase is a Hadoop-related NoSQL data
management system that evolved from Bigtable.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Tabular Stores:
 The Bigtable NoSQL model allows sparse data to
be stored in a three-dimensional table that is
indexed by a row key, a column key that indicates
the specific attribute for which a data value is
stored, and a timestamp that may refer to the time
at which the row’s column value was stored.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Tabular Stores:
 As an example, various attributes of a web page
can be associated with the web page’s URL:
 the HTML content of the page,
 URLs of other web pages that link to this web page, and
 the author of the content.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Tabular Stores:
 Columns in a Bigtable model are grouped together as
“families,” and the timestamps enable management of
multiple versions of an object.
 The timestamp can be used to maintain history—each
time the content changes, new column attachments can
be created with the timestamp of when the content was
downloaded.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Object Data Stores:
 Object databases can be similar to document
stores, except that document stores explicitly
serialize the object so the data values are stored as
strings, while object databases maintain the object
structures as they are bound to object-oriented
programming languages.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Object Data Stores:
 Object database management systems are more likely
to provide traditional ACID (atomicity, consistency,
isolation, and durability) compliance—characteristics
that are bound to database reliability. Object databases
are not relational databases and are not queried using
SQL.
Schema-Less Models: Increasing
Flexibility for Data Manipulation
 Types: Graph Databases:
 Graph databases provide a model for representing
individual entities and numerous kinds of relationships
that connect those entities.
 The model consists of a collection of vertices that represent
the modeled entities, connected by edges that capture
the way that two entities are related.
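 A minimal sketch in base R of the vertex/edge abstraction, using an edge list data frame (illustrative only; a real graph database adds indexing, traversal languages, and properties on vertices and edges):

   # Vertices are entities; edges capture how two entities are related.
   edges <- data.frame(
     from     = c("Alice", "Alice", "Bob"),
     to       = c("Bob",   "Acme",  "Acme"),
     relation = c("knows", "works_at", "works_at"),
     stringsAsFactors = FALSE
   )

   # A simple traversal: find every entity directly connected to a given vertex.
   neighbors <- function(edge_list, vertex) {
     unique(c(edge_list$to[edge_list$from == vertex],
              edge_list$from[edge_list$to == vertex]))
   }

   neighbors(edges, "Alice")   # "Bob" "Acme"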
Hive

 Apache Hive enables users to process data
without explicitly writing MapReduce code.
 One key difference from Pig is that the Hive
language, HiveQL (Hive Query Language),
resembles Structured Query Language (SQL)
rather than a scripting language.
Hive
 A Hive table structure consists of rows and columns.
 The rows typically correspond to some record, transaction,
or particular entity (for example, customer) detail.
 The values of the corresponding columns represent the
various attributes or characteristics for each row.
 Hadoop and its ecosystem are used to apply some structure
to unstructured data.
Hive

 Therefore, if a table structure is an appropriate


way to view the restructured data, Hive may
be a good tool to use.
 Additionally, a user may consider using Hive
if the user has experience with SQL and the
data is already in HDFS.
Hive

 Another consideration in using Hive may be how
data will be updated or added to the Hive tables.
 If data will simply be added to a table
periodically, Hive works well, but if there is a
need to update data in place, it may be beneficial
to consider another tool, such as HBase.
Hive

 A Hive query is first translated into a MapReduce


job, which is then submitted to the Hadoop
cluster.
 Thus, the execution of the query has to compete
for resources with any other submitted job.
 Hive is intended for batch processing.
Hive

 Hive is a good choice when:
 Data easily fits into a table structure.


 Data is already in HDFS.
 Developers are comfortable with SQL programming
and queries.
 There is a desire to partition datasets based on time.
 Batch processing is acceptable.
Hive

 HiveQL Basics:
 From the command prompt, a user enters the
interactive Hive environment by simply entering
hive:
 $ hive

 hive>
Hive

 HiveQL Basics:
 From this environment, a user can define new tables,
query them, or summarize their contents.
 hive> create table customer ( cust_id bigint,
first_name string, last_name string, email_address
string) row format delimited fields terminated by '\t';
Hive

 HiveQL Basics:
 HiveQL query is executed to count the number of records
in the newly created table, customer.
 Because the table is currently empty, the query returns a result of
zero as the last line of the provided output.
 The query is converted and run as a MapReduce job,
which results in one map task and one reduce task being
executed.
Hive

 HiveQL Basics:
 hive> select count(*) from customer;

 When querying large tables, Hive outperforms


and scales better than most conventional database
queries.
Hive

 HiveQL Basics:
 To load the customer table with the contents of HDFS
file, customer.txt, it is only necessary to provide the
HDFS directory path to the file.
 hive> load data inpath '/user/customer.txt' into table
customer;
 hive> select * from customer limit 3;
Hive – Use Cases

 Exploratory or ad-hoc analysis of HDFS data:


Data can be queried, transformed, and exported to
analytical tools, such as R.
 Extracts or data feeds to reporting systems,
dashboards, or data repositories such as HBase:
Hive queries can be scheduled to provide such
periodic feeds.
Hive – Use Cases

 Combining external structured data with data
already residing in HDFS:
 The data from an RDBMS can be periodically
added to Hive tables for querying with existing
data in HDFS.
Sharding
 Sharding is a database architecture pattern related to
horizontal partitioning: the practice of separating one
table’s rows into multiple different tables, known as
partitions.
 Each partition has the same schema and columns, but
entirely different rows.
 The data held in each partition is unique and independent of the data
held in other partitions.
Sharding

 In a vertically-partitioned table, entire
columns are separated out and put into new,
distinct tables.
 The data held within one vertical partition is
independent from the data in all the others, and
each holds both distinct rows and columns.
Sharding

 Horizontal or Range Based Sharding:


 In this case, the data is split based on the value
ranges that are inherent in each entity.
 For example, if you store the contact info for your
online customers, you might choose to store the info
for customers whose last name starts with A-H on one
shard, while storing the rest on another shard.
Sharding

 Horizontal or Range Based Sharding:


   Original table:
     ID  Name  Mail ID
     1   A     [email protected]
     2   B     [email protected]
     3   C     [email protected]
     4   D     [email protected]
   After horizontal sharding:
     Shard 1:
     ID  Name  Mail ID
     1   A     [email protected]
     2   B     [email protected]
     Shard 2:
     ID  Name  Mail ID
     3   C     [email protected]
     4   D     [email protected]
Sharding

 Horizontal or Range Based Sharding:


 Advantages:
 Each shard also has the same schema as the original
database.
 It works well for relatively non-static data, for
example storing the contact info for students in a
college, because the data is unlikely to see huge churn.
Sharding

 Horizontal or Range Based Sharding:


 Disadvantages:
 The disadvantage of this scheme is that the last
names of the customers may not be evenly distributed.
 In that case, your first shard will be experiencing a
much heavier load than the second shard and can
become a system bottleneck.
Sharding

 Vertical Sharding:
 In this case, different features of an entity will be
placed in different shards on different machines.
Sharding

   Original table:
     ID  Name  Mail ID
     1   A     [email protected]
     2   B     [email protected]
     3   C     [email protected]
     4   D     [email protected]
   After vertical sharding:
     Shard 1:
     ID  Name
     1   A
     2   B
     3   C
     4   D
     Shard 2:
     ID  Mail ID
     1   [email protected]
     2   [email protected]
     3   [email protected]
     4   [email protected]
Sharding

 Vertical Sharding – Benefits:


 You can handle the critical part of your data differently
from the not-so-critical part of your data and build
different replication and consistency models
around it.
Sharding

 Vertical Sharding – Disadvantages:


 It increases the development and operational
complexity of the system.
 If your site/system experiences additional growth,
then it may be necessary to further shard a feature-
specific database across multiple servers.
Sharding

 Key or hash based sharding:


 In this case, an entity has a value which can be
used as an input to a hash function, and a resultant
hash value is generated. This hash value determines
which database server (shard) to use.
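 A minimal sketch in R of this idea, using a simple (non-cryptographic) hash over the key's characters; the function name and the default of 4 shards are illustrative only:

   # Map a key (e.g. a customer ID) to one of n_shards servers.
   shard_for_key <- function(key, n_shards = 4) {
     h <- sum(utf8ToInt(as.character(key)))   # toy hash: sum of character codes
     h %% n_shards                            # shard index in 0..(n_shards - 1)
   }

   shard_for_key("customer-1042")                 # always routes this key to the same shard
   shard_for_key("customer-1042", n_shards = 10)  # a different shard count remaps the key

 Because the result depends on n_shards, changing the number of shards remaps most keys, which is exactly the elastic load balancing problem noted on the next slide.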
Sharding

 Key or hash based sharding:


 The main drawback of this method is that elastic load
balancing (dynamically adding/removing database
servers) becomes very difficult and expensive.
 A large number of requests cannot be serviced, and
you will incur downtime until the migration completes.
Sharding

 Directory based sharding:


 Directory based shard partitioning involves placing a
lookup service in front of the sharded databases.
 The lookup service knows the current partitioning scheme
and keeps a map of each entity and which database shard it
is stored on.
 The lookup service is usually implemented as a web
service.
Sharding

 Directory based sharding:


 The client application first queries the lookup
service to figure out the shard (database partition)
on which the entity resides/should be placed.
 Then it queries / updates the shard returned by the
lookup service.
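 A minimal sketch in R of such a lookup map (illustrative only; a production lookup service would be a separate, highly available component, often exposed as a web service as noted above):

   # Directory: an explicit map from entity key to shard, kept by the lookup service.
   directory <- new.env(hash = TRUE)

   register_entity <- function(key, shard) assign(key, shard, envir = directory)
   lookup_shard <- function(key) {
     if (exists(key, envir = directory, inherits = FALSE)) get(key, envir = directory) else NA
   }

   register_entity("customer-1042", "shard-3")
   lookup_shard("customer-1042")   # the client then queries/updates "shard-3"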
Sharding

 Directory based sharding – Steps:


 Example: re-sharding from 4 shards (hash modulo 4) to 10 shards (hash modulo 10):
 1. Keep the modulo-4 hash function in the lookup service.
 2. Determine the data placement based on the new hash
function, modulo 10.
 3. Write a script to copy all the data based on step 2 into the six
new shards, and possibly onto the 4 existing shards. Note that
it does not delete any existing data on the 4 existing shards.
Sharding

 Directory based sharding – Steps:


 4. Once the copy is complete, change the hash function
to modulo 10 in the lookup service.
 5. Run a cleanup script to purge unnecessary data from the 4
existing shards based on step 2, since the purged data now
exists on other shards.
Sharding
 Directory based sharding – Steps:
 There are two practical considerations which need to be solved on a
per-system basis:
 While the migration is happening, users might still be updating
their data. Options include putting the system in read-only mode or
placing new data in a separate server that is placed into the correct shards
once migration is done.
 The copy and cleanup scripts might have an effect on system
performance during the migration. This can be circumvented by using
system cloning and elastic load balancing, but both are expensive.
HBase

 HBase is a distributed, column-oriented database
built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
 Apache HBase is capable of providing real-time
read and write access to datasets with billions of
rows and millions of columns.
HBase
 HBase is a data model, similar to Google’s Bigtable,
designed to provide quick random access to huge amounts
of structured data.
 It leverages the fault tolerance provided by the Hadoop File
System.
 It is a part of the Hadoop ecosystem that provides random
real-time read/write access to data in the Hadoop File
System.
HBase

 One can store the data in HDFS either directly


or through HBase.
 Data consumer reads/accesses the data in
HDFS randomly using HBase.
 HBase sits on top of the Hadoop File System
and provides read and write access.
HBase
 Storage Mechanism in Hbase:
 HBase is a column-oriented database and the tables in it are
sorted by row.
 The table schema defines only column families, which are the
key-value pairs.
 A table can have multiple column families, and each column family
can have any number of columns.
 Subsequent column values are stored contiguously on the disk.
Each cell value of the table has a timestamp.
HBase

 Storage Mechanism in HBase – in short, in
HBase:
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key-value pairs.
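 A conceptual sketch of that hierarchy in R, using nested lists (illustrative only; this is not the HBase API, just a picture of the row → column family → column → timestamped-value model):

   # One HBase-style row: row key -> column family -> column -> timestamp -> value.
   demo_table <- list(
     "row-001" = list(                             # row key
       personal = list(                            # column family "personal"
         name = list("1700000000" = "John"),       # column -> timestamped cell value
         city = list("1700000000" = "Lincoln")
       ),
       contact = list(                             # column family "contact"
         email = list("1700000000" = "john@example.com")
       )
     )
   )

   # Random read access: fetch one cell by row key, column family, and column.
   demo_table[["row-001"]][["personal"]][["city"]]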


HBase

 Architecture:
 Tables are split into regions and are served by the
region servers.
 Regions are vertically divided by column families
into “Stores”.
 Stores are saved as files in HDFS.
HBase

 Architecture – Master Server:


 Assigns regions to the region servers
 Handles load balancing of the regions across region
servers. It unloads the busy servers and shifts the
regions to less occupied servers.
 Maintains the state of the cluster by negotiating the
load balancing.
HBase

 Architecture – Regions:
 Tables are split up and spread across the region
servers.
HBase
 Architecture – Region Server:
 Communicates with the client and handles data-related operations.
 Handles read and write requests for all the regions under it.
 Decides the size of the region by following the region size
thresholds.
 Memstore – the cache memory; incoming data is stored here first.
 The data is later transferred and saved in HFiles as blocks, and the
memstore is flushed.
HBase

 Architecture – Zoo Keeper:


 It provides services like maintaining configuration
information, naming, providing distributed
synchronization, etc.
 It keeps track of all the region servers in the HBase
cluster, including information such as how many
region servers there are and which region servers are
holding which DataNode.
HBase

 Architecture – Zoo Keeper:

 Services:
 Establishing client communication with region
servers.
 Tracking server failures and network partitions.
 Maintaining configuration information.
Analyzing Big Data With Twitter

 Twitter is an effective tool for a company to
get people excited about its product.
 A common goal is to engage users whose updates tend to
generate lots of retweets.
 Twitter tracks retweet counts for all tweets.
Analyzing Big Data With Twitter

 A Simple Search:
 To search for tweets containing the word
“Happy”:
 index=twitter text=*Happy*
 It searches only the twitter index and finds all the
events where the word “Happy” is mentioned.
Analyzing Big Data With Twitter

 Examining the Twitter Event:


 _time: Splunk assigns a timestamp for every event. This is
done in UTC time format.
 Contributors: The value for this field is null.
 Retweeted_status: Other fields associated with a tweet.
 The 140-character text field that most people consider to
be the tweet is actually a small part of the actual data
collected.
Analyzing Big Data With Twitter

 Implied AND:
 To search for all tweets that include both the text
“Happy” and the text “Morning”:
 index=twitter text=*Happy* text=*Morning*
Analyzing Big Data With Twitter

 Need to specify OR
 To obtain all events that mention either “Happy” or
“Morning”:
 index=twitter text=*Happy* OR text=*Morning*
Analyzing Big Data With Twitter

 Finding Other words used:


 To find out what other words are used in tweets about
“Happy”:
 index=twitter text=*Happy* | makemv text | mvexpand
text | top 30 text
 This first searches for the word “Happy” in the text field, then
creates a multivalued field from the tweet, and then
expands it so that each word is treated as a separate piece.
Analyzing Big Data With Twitter

 Finding Other words used:


 Then it takes the top 30 words that it finds. To limit
the top words:
 index=twitter text=*Happy* | makemv text |
mvexpand text | search NOT text="RT" AND NOT
text="a" AND NOT text="to" AND NOT text="the" |
top 30 text
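 The same word-count idea can be sketched in R once tweet text has been exported (illustrative only; the tweet texts are hard-coded here in place of an export from Splunk):

   # Count the most frequent words in a set of tweet texts.
   tweets <- c("Happy morning everyone", "So happy today", "Happy happy happy")
   words  <- tolower(unlist(strsplit(tweets, "\\s+")))     # split tweets into words
   words  <- words[!words %in% c("rt", "a", "to", "the")]  # drop common filler words
   head(sort(table(words), decreasing = TRUE), 30)         # top 30 words by frequency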
Analyzing Big Data For E-Commerce

 It allows businesses to gain access to
significantly larger amounts of data in order to:
 convert growth into revenue
 streamline operation processes
 gain more customers


Analyzing Big Data For E-Commerce

 Big Data Solutions:


 Optimize Customer Shopping Experience
 Customer behaviour pattern, purchase histories,
browsing interests
 To re-target the buyers by displaying or recommending
products that they are interested in.
Analyzing Big Data For E-Commerce

 Big Data Solutions:


 Higher Customer Satisfaction
 Big data helps stores understand their customers better
and build a lasting relationship with them.
Analyzing Big Data For E-Commerce

 Big Data Solutions:


 Streaming Analytics
 Used to gain valuable customer insights.
 Streaming store data provides a lot of insights that will help in
personalizing the shopper’s experience and generating
more revenue.
Analyzing Big Data For E-Commerce

 Recommendation Engine:
 Analyze all the actions of a particular customer:
 product pages visited
 products they liked
 products added into their carts
 products finally bought / abandoned


Analyzing Big Data For E-Commerce

 Recommendation Engine:
 The system can also compare the behaviour
pattern of a certain visitor to those of the other
visitors.
 Based on this analysis, it recommends the products that a
visitor may like.
Analyzing Big Data For E-Commerce

 Personalized Shopping Experience:


 A key to successful e-commerce marketing is
to react to customers’ actions properly and
in real time.
 Analyze all customer activities in an e-shop and
create a picture of customer behavior patterns.
Analyzing Big Data For E-Commerce

 Everything in the cart is tracked:


 System encourage customer to finish the purchase
with discounts.
 Example: A customer bought a winter coat two
weeks ago and visited some product pages with
winter gloves, scarfs and hats at that time.
Analyzing Big Data For E-Commerce

 Voice of the customer:


 To add sentiment analysis to the standard approach of
analyzing products and brands by their sales value,
volume, revenues, number of orders, etc.
 Sentiment analysis is the evaluation of comments that
customers left about different products and brands.
Analyzing Big Data For E-Commerce

 Voice of the customer:


 The system identifies whether each comment is
positive or negative.
 Positive Comments: Happy, Great, Recommend
or Satisfied
 Negative Comments: Bad, Terrible
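 A minimal sketch in R of this word-matching style of sentiment scoring (illustrative only; real sentiment analysis typically uses larger lexicons or trained models):

   # Classify a comment as positive, negative, or neutral by counting keyword hits.
   positive_words <- c("happy", "great", "recommend", "satisfied")
   negative_words <- c("bad", "terrible")

   score_comment <- function(comment) {
     w   <- tolower(unlist(strsplit(comment, "\\W+")))   # split into words
     pos <- sum(w %in% positive_words)
     neg <- sum(w %in% negative_words)
     if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
   }

   score_comment("Great product, would recommend!")   # "positive"
   score_comment("Terrible quality, really bad.")     # "negative"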
Analyzing Big Data For E-Commerce

 Dynamic Pricing:
 Setting price rules, monitoring competitors and
adjusting prices in real time.
Analyzing Big Data For E-Commerce

 Demand Forecasting:
 Creating customer profiles by
 looking at their customers’ behavior:
 how many items they usually purchase and which
products they buy.
 To collect, analyze, and visualize the analysis results.
 The retailer will also analyze external big data.
Analyzing Big Data For Blogs

 A wiki is nothing but a collection of web pages
interconnected with each other through
internal links.
 In Wikipedia, there are more than a million
pages like this in the English version, and nowadays
in other languages too.
Analyzing Big Data For Blogs

 End user working Mechanism:

 Roles:
 Reader

 Writer

 Admin
Analyzing Big Data For Blogs

 End user working Mechanism:


 To understand how a wiki community works:
 Go to Wikipedia and find a topic that you know something about.
 Search for and read the page about that topic on Wikipedia.
 Edit the page and add/change a sentence or two in the article.
Simply click on “Edit this page.”
 Submit your change.
Analyzing Big Data For Blogs

 Wikistats – Open Source UI – Community
Environment:
 Wikistats is Wikimedia’s public statistics website.
 Its goal is to add context and motivate the editor community by
providing a set of metrics through which users can see
the impact of their contributions in the projects they
are a part of.
Analyzing Big Data For Blogs

 Wikistats – Open Source UI – Community


Environment:
 Wikistats 2 not only updates the website interface but
also provides new access to all edit data in an
analytics-friendly form.
 Wikistats 2 computes metrics by extracting data from
MediaWiki databases, processing it, and re-storing it in an
analytics-friendly form so metrics can be extracted easily.
Analyzing Big Data For Blogs

 Wikistats – Open Source UI – Community


Environment:
 Wikistats 2 is a client-side-only single page
application; this means that it does not have a
server component and can be served from
anywhere.
Analyzing Big Data For Blogs

 Wikistats – Open Source UI – Community


Environment:
 The dashboard: Metrics belong to one of three
main areas: reading, contributing and content.
 The detail page: Breakdown selectors, time range
and granularity selectors.
Analyzing Big Data For Blogs

 MediaWiki:
 Collect and organize knowledge and make it available
to people.
 MediaWiki action API is a web service that allows
access to some wiki-features like authentication, page
operations and search. It can provide meta information
about the wiki and logged-in user.
Analyzing Big Data For Blogs

 MediaWiki – Uses:
 Monitor a MediaWiki installation.
 Log into a wiki, access data, and post changes by
making HTTP requests to the web service.
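 A minimal sketch in R of calling the MediaWiki action API over HTTP (a sketch assuming the httr and jsonlite packages are installed; the search term is arbitrary):

   library(httr)      # HTTP requests
   library(jsonlite)  # JSON parsing

   # Query the English Wikipedia action API: search for pages about "big data".
   resp <- GET("https://en.wikipedia.org/w/api.php",
               query = list(action = "query", list = "search",
                            srsearch = "big data", format = "json"))

   results <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
   results$query$search$title   # titles of matching pages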
Review of Basic Data Analytics
Methods using R
 The study of the data in terms of basic
statistical measures and creation of graphs and
plots to visualize and identify relationships and
patterns.
Review of Basic Data Analytics
Methods using R
 Introduction to R
 R is a programming language and software
framework for statistical analysis and graphics.
 read.csv() – imports a CSV file.
 head() – displays the first six records of a file.
 summary() – provides some descriptive statistics, such
as the mean and median, for each data column.
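 A short illustrative session using these functions (the file name sales.csv is a hypothetical example):

   # Import a CSV file and take a first look at it.
   sales <- read.csv("sales.csv")   # read the file into a data frame
   head(sales)                      # show the first six records
   summary(sales)                   # mean, median, quartiles, etc. per column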
Review of Basic Data Analytics
Methods using R
 R Graphical User Interface:
 R software uses a command line interface.

 To improve the ease of writing, executing and


debugging R code, several additional GUIs have
been written for R.
 Example: RStudio, R Commander
Review of Basic Data Analytics
Methods using R
 R Graphical User Interface – Panes:
 Scripts

 Workspace

 Plots

 Console
Review of Basic Data Analytics
Methods using R
 Data Import and Export:
 setwd() – sets the working directory for the
subsequent import and export operations.
 Example: setwd("path of the directory")
 read.table() and read.delim() – import other
common file types such as TXT.
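 For example (the directory and file names below are hypothetical):

   setwd("C:/data")                        # set the working directory
   orders <- read.delim("orders.txt")      # import a tab-delimited TXT file
   write.csv(orders, "orders_export.csv")  # export the data back out as CSV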
Review of Basic Data Analytics
Methods using R
 Attribute and Data Type:
 The characteristics or attributes provide the
qualitative and quantitative measures for each item
or subject of interest.
Review of Basic Data Analytics
Methods using R
 Attribute and Data Type:
              Categorical                              Numeric
              Nominal               Ordinal            Interval            Ratio
Definition    The values represent  Attributes imply   The difference      Both the difference
              labels that           a sequence.        between two         and the ratio of two
              distinguish one                          values is           values are
              from another.                            meaningful.         meaningful.
Example       ZIP Code              Academic Grades    Calendar Dates      Age, Length
Review of Basic Data Analytics
Methods using R
 Attribute and Data Type:
 class() – represents the abstract class of an object.
 typeof() – determines the way an object is stored in
memory
 is.data_type(object) – verify if object is of a certain
datatype
 as.data_type(object) – convert data type of object to
another.
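 A short illustrative session (output shown as comments):

   x <- 5.7
   class(x)        # "numeric"  - abstract class of the object
   typeof(x)       # "double"   - how the object is stored in memory
   is.numeric(x)   # TRUE       - verify the data type
   as.integer(x)   # 5          - convert to another data type (truncates)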
Review of Basic Data Analytics
Methods using R
 Predefined constants
 pi

 letters

 LETTERS

 month.name, month.abb
Review of Basic Data Analytics
Methods using R
 Data Types
 Logical – True or False

 Integer – Set of all integers

 Numeric – Set of all real numbers

 character – “a”, “b”


Review of Basic Data Analytics
Methods using R
 Basic object:
 Vector – Ordered collection of same data types

 List – Ordered collection of objects

 Data Frame – Generic tabular object


Review of Basic Data Analytics
Methods using R
 Basic object – Vectors:
 An ordered collection of basic data types of given
length
 All the elements of a vector must be of same data
type.
 Example: x=c(1, 2, 3)
Review of Basic Data Analytics
Methods using R
 Basic object – List:
 A generic object consisting of an ordered
collection of objects.
 A list could consist of a numeric vector, a logical
value, a matrix, and a complex vector.
Review of Basic Data Analytics
Methods using R
 Basic object – List:
 To access top-level components, use the double
bracket operator “[[ ]]” or “[ ]”; for lower / inner
level components, use “[ ]” along with “[[ ]]”.
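 A short illustrative example of these two access styles:

   v <- c(4.5, 7.2)                      # numeric vector
   l <- list(scores = v, passed = TRUE)  # list of heterogeneous objects

   l[["scores"]]      # double brackets: the component itself (a numeric vector)
   l["scores"]        # single brackets: a sub-list containing that component
   l[["scores"]][2]   # inner level: second element of the vector, 7.2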
Review of Basic Data Analytics
Methods using R
 Basic object – Data Frame:
 Used to store tabular data.

 df[val1, val2] – row “val1”, column “val2”
 val1, val2 can also be arrays of values like “1:2” or
“c(1:2)”
 df[val2] – refers to column “val2” only.
Review of Basic Data Analytics
Methods using R
 Basic object – Data Frame:

 subset():
 Extracts a subset of data based on conditions.
 runif(75, 0, 10) – generates 75 random numbers
uniformly distributed between 0 and 10.
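 A short illustrative example combining these (the column names are arbitrary):

   set.seed(42)                                # make the random draw reproducible
   df <- data.frame(id = 1:75,
                    score = runif(75, 0, 10))  # 75 uniform random values in [0, 10]

   df[1:2, ]              # first two rows, all columns
   df["score"]            # the "score" column only
   subset(df, score > 8)  # rows whose score exceeds 8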
Review of Basic Data Analytics
Methods using R
 Visualizing a single variable:
 plot(data) – suitable for low volume data

 barplot(data) – Vertical or horizontal bars

 dotchart(data) – dot plot

 hist(data) - histogram
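 A short illustrative example of these plotting functions on a small random sample:

   vals <- rnorm(100, mean = 50, sd = 10)   # 100 normally distributed values

   plot(vals)                        # scatter of values against index (low-volume data)
   barplot(table(round(vals, -1)))   # vertical bars of counts, binned to the nearest 10
   dotchart(head(vals, 20))          # dot plot of the first 20 values
   hist(vals)                        # histogram of the distribution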
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:
 It is used during the initial data exploration and
data preparation, model building, evaluation of the
final models.
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:

 Hypothesis Testing:
 To form a statement and test it with data.
 When performing hypothesis tests, the common
assumption is that there is no difference between
two samples – the Null Hypothesis (H0).
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:

 Hypothesis Testing – Applications:


Application             Null Hypothesis                      Alternative Hypothesis
Accuracy forecast       Model X does not predict better      Model X predicts better than
                        than the existing model.             the existing model.
Recommendation engine   Algorithm Y does not produce better  Algorithm Y produces better
                        recommendations than the current     recommendations than the
                        algorithm being used.                current algorithm being used.
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:
 Wilcoxon Rank-Sum Test:
 It is a nonparametric hypothesis test that checks whether two
populations are identically distributed.
 wilcox.test() – ranks the observations, determines the respective
rank-sums corresponding to each population’s sample, and then
determines the probability of rank-sums of such magnitude
being observed, assuming that the population distributions are
identical.
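 A short illustrative example of wilcox.test() on two made-up samples (e.g. order values from two customer groups):

   group_a <- c(12.1, 15.3, 9.8, 14.2, 11.7, 13.5)
   group_b <- c(18.4, 16.9, 20.1, 17.3, 19.8, 15.6)

   # Nonparametric test of whether the two populations are identically distributed.
   wilcox.test(group_a, group_b)
   # A small p-value suggests rejecting the null hypothesis of identical distributions.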
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:
 Type I and Type II Errors:
 Type I Error: The rejection of the null hypothesis when the
null hypothesis is TRUE. Probability is denoted by the
Greek letter α.
 Type II Error: The acceptance of a null hypothesis when
the null hypothesis is FALSE. Probability is denoted by the
Greek letter β.
Review of Basic Data Analytics
Methods using R
 Statistical Methods for Evaluation:
 ANOVA:
 Analysis of Variance.
 ANOVA is a generalization of the hypothesis testing of the
difference of two population means.
 The null hypothesis of ANOVA is that all the population means
are equal.
 The alternative hypothesis is that at least one pair of the
population means is not equal.
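 A short illustrative example using R's aov() function on made-up data with three groups (e.g. three store locations):

   values <- c(5.1, 4.8, 5.5, 6.9, 7.2, 6.8, 5.0, 5.3, 4.9)
   group  <- factor(rep(c("A", "B", "C"), each = 3))

   fit <- aov(values ~ group)   # one-way ANOVA: are all group means equal?
   summary(fit)                 # the F-test p-value addresses the null hypothesis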
