Data Handling and Analytics – Part I
Data is Precious
Introduction to Internet of Things
What is Data Handling
Data handling
Ensures that research data is stored, archived or disposed of in a safe and secure
manner during and after the conclusion of a research project
Includes the development of policies and procedures to manage data handled
electronically as well as through non‐electronic means.
In recent times, most data concerns relate to –
Big Data
Due to heavy traffic generated by IoT devices
Huge amount of data generated by the deployed sensors
What is Big Data
“Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of
data, by enabling the high-velocity capture, discovery, and/or analysis.”
[Report of International Data Corporation (IDC)]
“Big data shall mean the data of which the data volume, acquisition speed, or data
representation limits the capacity of using traditional relational methods to conduct
effective analysis or the data which may be effectively processed with important
horizontal zoom technologies.”
[National Institute of Standards and Technology (NIST)]
Types of Data
Structured data
Data that can be easily organized.
Usually stored in relational databases.
Structured Query Language (SQL) manages structured data in databases.
It accounts for only 20% of the total available data today in the world.
Unstructured data
Information that does not possess any pre‐defined model.
Traditional RDBMSs are unable to process unstructured data.
Enhances the ability to provide better insight into huge datasets.
It accounts for 80% of the total data available today in the world.
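To make the structured case above concrete, the following minimal Python sketch (illustrative only; the readings table, its columns, and the sample rows are invented for the example) stores a few rows with a fixed schema and queries them with SQL via the standard-library sqlite3 module.

```python
import sqlite3

# Structured data: rows with a fixed, pre-defined schema.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE readings (sensor_id TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 21.5), ("s2", 23.0), ("s1", 22.1)],
)

# SQL manages and queries the structured data directly.
for sensor_id, avg_temp in conn.execute(
    "SELECT sensor_id, AVG(temperature) FROM readings GROUP BY sensor_id"
):
    print(sensor_id, round(avg_temp, 2))
```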
Characteristics of Big Data
Big Data is characterized by 7 Vs –
Volume
Velocity
Variety
Variability
Veracity
Visualization
Value
Characteristics of Big Data (Contd.)
Volume
Quantity of data that is generated
Sources of data are added continuously
Example of volume ‐
30TB of images will be generated every night from the Large Synoptic Survey Telescope
(LSST)
72 hours of video are uploaded to YouTube every minute
Characteristics of Big Data (Contd.)
Velocity
Refers to the speed of generation of data
Data processing time is decreasing day by day in order to provide real‐time services
Older batch processing technology is unable to handle high velocity of data
Example of velocity –
140 million tweets per day on average (according to a survey conducted in 2011)
New York Stock Exchange captures 1TB of trade information during each trading
session
Characteristics of Big Data (Contd.)
Variety
Refers to the category to which the data belongs
No restriction over the input data formats
Data mostly unstructured or semi‐structured
Example of variety –
Pure text, images, audio, video, web, GPS data, sensor data, SMS, documents, PDFs, flash
etc.
Characteristics of Big Data (Contd.)
Variability
Refers to data whose meaning is constantly changing.
Meaning of the data depends on the context.
Data appear as an indecipherable mass without structure
Example:
Language processing, Hashtags, Geo‐spatial data, Multimedia, Sensor events
Veracity
Veracity refers to the biases, noise and abnormality in data.
It is important in programs that involve automated decision‐making, or feeding the data
into an unsupervised machine learning algorithm.
Veracity is not just about data quality; it is about data understandability.
Characteristics of Big Data (Contd.)
Visualization
Presentation of data in a pictorial or graphical format
Enables decision makers to see analytics presented visually
Identify new patterns
Value
It means extracting useful business information from scattered data.
Includes a large volume and variety of data
Easy to access and delivers quality analytics that enables informed decisions
Data Handling Technologies
Cloud computing
Essential characteristics according to NIST
On‐demand self‐service
Broad network access
Resource pooling
Rapid elasticity
Measured service
Basic service models provided by cloud computing
Infrastructure‐as‐a‐Service (IaaS)
Platform‐as‐a‐Service (PaaS)
Software‐as‐a‐Service (SaaS)
Data Handling Technologies (Contd.)
Internet of Things (IoT)
According to Techopedia, IoT “describes a future where everyday physical
objects will be connected to the internet and will be able to identify themselves
to other devices.”
Sensors embedded into various devices and machines and deployed into fields.
Sensors transmit sensed data to remote servers via Internet.
Continuous data acquisition from mobile equipment, transportation facilities,
public facilities, and home appliances
Data Handling Technologies (Contd.)
Data handling at data centers
Storing, managing, and organizing data.
Estimates and provides necessary processing capacity.
Provides sufficient network infrastructure.
Effectively manages energy consumption.
Replicates data to keep backup.
Develops business‐oriented strategic solutions from big data.
Helps business personnel to analyze existing data.
Discovers problems in business operations.
Flow of Data
Generation: Enterprise data, IoT data, Bio‐medical data, Other data
Acquisition: Data collection, Data transmission, Data pre‐processing
Storage: Hadoop, MapReduce, NoSQL databases
Analysis: Bloom filter, Parallel computing, Hashing and indexing
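The Analysis row above lists Bloom filters next to hashing and indexing. The following is a minimal, illustrative Python sketch of the idea, not a production implementation; the bit-array size, the number of hash functions, and the sample keys are arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("sensor-42")
print(bf.might_contain("sensor-42"))   # True
print(bf.might_contain("sensor-99"))   # False (with high probability)
```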
Data Sources
Enterprise data
Online trading and analysis data.
Production and inventory data.
Sales and other financial data.
IoT data
Data from industry, agriculture, traffic, and transportation.
Medical‐care data.
Data from public departments and families.
Bio‐medical data
Masses of data generated by gene sequencing.
Data from medical clinics and medical R&D.
Other fields
Fields such as computational biology, astronomy, nuclear research, etc.
Data Acquisition
Data collection
Log files or record files that are automatically generated by data sources to record
activities for further analysis.
Sensory data such as sound wave, voice, vibration, automobile, chemical, current,
weather, pressure, temperature etc.
Complex and varied data collected through mobile devices, e.g. geographical
location, 2D barcodes, pictures, videos, etc.
Data transmission
After collection, the data is transferred to a storage system for further processing and
analysis.
Data transmission can be categorized as – Inter‐DCN transmission and Intra‐DCN
transmission.
Data Acquisition (Contd.)
Data pre‐processing
Collected datasets suffer from noise, redundancy, inconsistency etc., thus, pre‐
processing of data is necessary.
Pre‐processing of relational data mainly follows – integration, cleaning, and
redundancy mitigation
Integration is combining data from various sources and provides users with a uniform
view of data.
Cleaning is identifying inaccurate, incomplete, or unreasonable data, and then
modifying or deleting such data.
Redundancy mitigation is eliminating data repetition through detection, filtering and
compression of data to avoid unnecessary transmission.
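A minimal Python sketch of the three pre-processing steps just described (integration, cleaning, and redundancy mitigation); the two record sources, their field names, and the plausibility limits are invented for illustration and do not come from the slides.

```python
# Records from two hypothetical sources, with some noisy/duplicate entries.
source_a = [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": None}]
source_b = [{"sensor": "s1", "temp": 21.5}, {"sensor": "s3", "temp": 19.8}]

# Integration: combine data from the various sources into one uniform view.
combined = source_a + source_b

# Cleaning: drop records that are incomplete or clearly unreasonable.
cleaned = [r for r in combined
           if r["temp"] is not None and -50.0 <= r["temp"] <= 60.0]

# Redundancy mitigation: filter out exact duplicates before transmission/storage.
seen, deduplicated = set(), []
for r in cleaned:
    key = (r["sensor"], r["temp"])
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)

print(deduplicated)   # [{'sensor': 's1', 'temp': 21.5}, {'sensor': 's3', 'temp': 19.8}]
```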
Data Storage
File system
Distributed file systems that store massive data and ensure – consistency, availability,
and fault tolerance of data.
GFS is a notable example of a distributed file system designed for large‐scale data,
though its performance is limited in the case of small files.
Hadoop Distributed File System (HDFS) and Kosmosfs are other notable distributed
file systems whose designs are derived from that of GFS.
Databases
Emergence of non‐relational (NoSQL) databases in order to deal with the
characteristics that big data possess.
Three main types of NoSQL databases – key‐value databases, column‐oriented databases,
and document‐oriented databases.
Data Handling Using Hadoop
Reliable, scalable, distributed data handling
What is Hadoop
Hadoop is a software framework for
distributed processing of large datasets
across large clusters of computers.
Hadoop is an open-source implementation of
Google's GFS and MapReduce.
Apache Hadoop's MapReduce and Hadoop
Distributed File System (HDFS)
components were originally derived from
Google's MapReduce and Google File
System (GFS), respectively.
Source: https://www.cloudnloud.com/hadoop-hdfs-operations/
Building Blocks of Hadoop
Hadoop Common
A module containing the utilities that support the other Hadoop components
Hadoop Distributed File System (HDFS)
Provides reliable data storage and access across the nodes
MapReduce
Framework for applications that process large datasets in parallel.
Yet Another Resource Negotiator (YARN)
Next‐generation MapReduce, which assigns CPU, memory and storage to applications
running on a Hadoop cluster.
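To make the MapReduce building block above concrete, here is a minimal, framework-free Python sketch of the classic word-count job. It only mimics the map, shuffle, and reduce phases in memory; a real Hadoop job would express the same steps through Hadoop's MapReduce APIs or Hadoop Streaming.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group values by key, then sum the counts per word.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data needs big storage", "data flows from sensors"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'needs': 1, 'storage': 1, 'flows': 1, 'from': 1, 'sensors': 1}
```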
Hadoop Distributed File System (HDFS)
Centralized node
Namenode
Maintains metadata info about files
Distributed node
Datanode
Store the actual data
Files are divided into blocks
Each block is replicated
Source: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Name and Data Nodes
Namenode
Stores filesystem metadata.
Maintains two in‐memory tables, to map the datanodes to the blocks, and vice versa
Datanode
Stores actual data
Data nodes can talk to each other to rebalance and replicate data
Data nodes update the namenode with the block information periodically
Before updating, the datanodes verify the checksums.
Job and Task Trackers
Job Tracker –
Runs with the Namenode
Receives the user’s job
Decides on how many tasks will run (number
of mappers)
Decides on where to run each mapper
(concept of locality)
Task Tracker –
Runs on each datanode
Receives the task from Job Tracker
Always in communication with the Job Tracker, reporting progress
Source: http://developeriq.in/articles/2015/aug/11/an-introduction-to-apache-hadoop-for-big-data/
Hadoop Master/Slave Architecture
Master‐slave shared‐nothing architecture
Master
Executes operations like opening, closing,
and renaming files and directories.
Determines the mapping of blocks to
Datanodes.
Slave
Serves read and write requests from the
file system’s clients.
Performs block creation, deletion, and
replication as instructed by the Namenode.
Source: http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html
Data Analytics – Part II
Data is Precious
What is Data Analytics
“Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software. Data analytics technologies and techniques are
widely used in commercial industries to enable organizations to make more‐
informed business decisions and by scientists and researchers to verify or disprove
scientific models, theories and hypotheses.”
[An admin's guide to AWS data management]
Types of Data Analysis
Two types of analysis
Qualitative Analysis
Deals with the analysis of data that is categorical in nature
Quantitative Analysis
Quantitative analysis refers to the process by which numerical data is analyzed
Qualitative Analysis
Data is not described through numerical values
Described by some sort of descriptive context such as text
Data can be gathered by many methods such as interviews, videos and audio
recordings, field notes
Data needs to be interpreted
The grouping of data into identifiable themes
Qualitative analysis can be summarized by three basic principles (Seidel, 1998):
Notice things
Collect things
Think about things
Quantitative Analysis
Quantitative analysis refers to the process by which numerical data is analyzed
Involves descriptive statistics such as mean, median, and standard deviation (see the
short sketch below)
The following are often involved with quantitative analysis:
Statistical models
Analysis of variance
Data dispersion
Analysis of relationships between variables
Contingence and correlation
Regression analysis
Statistical significance
Precision
Error limits
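A short sketch of the descriptive statistics mentioned above, computed with Python's standard-library statistics module on an invented sample.

```python
import statistics

sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3]

print(statistics.mean(sample))    # arithmetic mean
print(statistics.median(sample))  # middle value
print(statistics.stdev(sample))   # sample standard deviation
```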
Comparison
Qualitative data
Data is observed
Involves descriptions
Emphasis is on quality
Examples are color, smell, taste, etc.
Quantitative data
Data is measured
Involves numbers
Emphasis is on quantity
Examples are volume, weight, etc.
Advantages
Allows for the identification of important (and often mission‐critical) trends
Helps businesses identify performance problems that require some sort of action
Can be viewed in a visual manner, which leads to faster and better decisions
Better awareness regarding the habits of potential customers
It can provide a company with an edge over their competitors
Statistical models
A statistical model is defined as a mathematical equation that is formulated
in terms of relationships between variables.
A statistical model illustrates how a set of random variables is related to another
set of random variables.
A statistical model is represented as the ordered pair (X , P)
X denotes the set of all possible observations
P refers to the set of probability distributions on X
Statistical models (Contd.)
Statistical models are broadly categorized as
Complete models
Incomplete models
A complete model has the same number of variables as the number of
equations
An incomplete model does not have the same number of variables as the number
of equations
Statistical models (Contd.)
In order to build a statistical model
Data Gathering
Descriptive Methods
Thinking about Predictors
Building of model
Interpreting the Results
Analysis of variance
Analysis of Variance (ANOVA) is a parametric statistical technique used to compare
datasets.
ANOVA is best applied where more than 2 populations or samples are meant to be
compared.
To perform an ANOVA, we must have a continuous response variable and at least one
categorical factor (e.g. age group, gender, location) with two or more levels (e.g. Location 1, Location 2)
ANOVAs require data from approximately normally distributed populations
Analysis of variance (Contd.)
Properties to perform ANOVA –
Independence of case
The sample should be selected randomly
There should not be any pattern in the selection of the sample
Normality
Distribution of each group should be normal
Homogeneity
Variance between the groups should be the same (e.g. should not compare data from
cities with those from slums)
Analysis of variance (Contd.)
Analysis of variance (ANOVA) has three types:
One way analysis
One fixed factor (levels set by investigator). Factors: age, gender, etc.
Two way analysis
Two factor variables are involved
K‐way analysis
k factor variables are involved
Analysis of variance (Contd.)
Total Sum of square
In statistical data analysis, the total sum of squares (TSS or SST) is a quantity that
appears as part of a standard way of presenting results of such analyses. It is defined
as being the sum, over all observations, of the squared differences of each
observation from the overall mean.
F‐ratio
Helps to understand the ratio of variance between two data sets
The F ratio is approximately 1.0 when the null hypothesis is true and is greater than
1.0 when the null hypothesis is false.
Degree of freedom
Factors which have no effect on the variance
The number of degrees of freedom is the number of values in the final calculation of a
statistic that are free to vary.
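As a hedged illustration of these quantities, the sketch below computes the total sum of squares by hand and then obtains the F-ratio and p-value of a one-way ANOVA. It assumes SciPy is available; the three sample groups are invented.

```python
from scipy import stats

# Three made-up sample groups (e.g., a response measured at three locations).
group_1 = [23.1, 24.5, 22.8, 23.9]
group_2 = [25.2, 26.1, 24.8, 25.7]
group_3 = [23.5, 24.0, 23.2, 24.3]

# Total sum of squares: squared differences of every observation from the overall mean.
all_obs = group_1 + group_2 + group_3
grand_mean = sum(all_obs) / len(all_obs)
tss = sum((x - grand_mean) ** 2 for x in all_obs)
print("TSS:", round(tss, 3))

# One-way ANOVA: an F-ratio near 1 supports the null hypothesis,
# larger values count against it.
f_ratio, p_value = stats.f_oneway(group_1, group_2, group_3)
print("F:", round(f_ratio, 3), "p:", round(p_value, 4))
```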
Data dispersion
A measure of statistical dispersion is a nonnegative real number that is zero if all
the data are the same and increases as the data becomes more diverse.
Examples of dispersion measures:
Range
Average absolute deviation
Variance and Standard deviation
Data dispersion (Contd.)
Range
The range is calculated by simply taking the difference between the maximum and
minimum values in the data set.
Average absolute deviation
The average absolute deviation (or mean absolute deviation) of a data set is the average of the
absolute deviations from the mean.
Variance
Variance is the expectation of the squared deviation of a random variable from its mean
Standard deviation
Standard deviation (SD) is a measure that is used to quantify the amount of variation
or dispersion of a set of data values
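A minimal Python sketch of the four dispersion measures just listed, computed on an invented data set with the standard-library statistics module (the average absolute deviation is computed by hand, since the module does not provide it directly).

```python
import statistics

data = [12.0, 15.5, 11.2, 14.8, 13.1, 16.4]

data_range = max(data) - min(data)                  # Range
mean = statistics.mean(data)
mad = sum(abs(x - mean) for x in data) / len(data)  # Average absolute deviation
variance = statistics.variance(data)                # Sample variance
std_dev = statistics.stdev(data)                    # Sample standard deviation

print(data_range, round(mad, 3), round(variance, 3), round(std_dev, 3))
```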
Contingence and correlation
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a
type of table in a matrix format that displays the (multivariate) frequency
distribution of the variables.
Provides a basic picture of the interrelation between two variables
A crucial problem of multivariate statistics is finding (direct‐)dependence structure
underlying the variables contained in high‐dimensional contingency tables
Contingence and correlation (Contd.)
Correlation is a technique for investigating the relationship between two
quantitative, continuous variables
Pearson's correlation coefficient (r) is a measure of the strength of the association
between the two variables.
Correlations are useful because they can indicate a predictive relationship that can
be exploited in practice
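As a small illustration (the paired measurements below are invented), Pearson's r can be computed directly from its definition as the covariance of the two variables divided by the product of their standard deviations.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Sample covariance of x and y, then normalize by the two standard deviations.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
r = cov / (statistics.stdev(x) * statistics.stdev(y))

print(round(r, 4))   # close to +1: a strong positive linear association
```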
Regression analysis
In statistical modeling, regression analysis is a statistical process for estimating the
relationships among variables
Focuses on the relationship between a dependent variable and one or more
independent variables
Regression analysis estimates the conditional expectation of the dependent
variable given the independent variables
Regression analysis (Contd.)
The estimation target is a function of the independent variables called the
regression function
Characterize the variation of the dependent variable around the regression
function which can be described by a probability distribution
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning
Regression analysis is also used to understand which among the independent
variables are related to the dependent variable
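A hedged, minimal sketch of simple linear regression on invented data, assuming NumPy is available. The least-squares line estimates the conditional expectation of the dependent variable given the independent variable, and can then be used for prediction.

```python
import numpy as np

# Invented data: one independent variable x, one dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 6.2, 7.9, 10.1, 12.2])

# Ordinary least squares fit of a straight line y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))

# The regression function can then be used for prediction/forecasting.
x_new = 7.0
print("predicted y at x=7:", round(slope * x_new + intercept, 3))
```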
Statistical significance
Statistical significance is the likelihood that the difference in conversion rates
between a given variation and the baseline is not due to random chance
Statistical significance level reflects the risk tolerance and confidence level
There are two key variables that go into determining statistical significance:
Sample size
Effect size
Statistical significance (Contd.)
Sample size refers to the number of observations included in the experiment
The larger your sample size, the more confident you can be in the result of the
experiment (assuming that it is a randomized sample)
The effect size is just the standardized mean difference between the two groups
If a particular experiment is replicated, the different effect size estimates from each
study can easily be combined to give an overall best estimate of the effect size
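As an illustrative sketch (SciPy assumed available; both groups are invented), the code below computes the standardized mean difference between two groups, commonly reported as Cohen's d, and a two-sample t-test p-value as one way of judging significance.

```python
import statistics
from scipy import stats

baseline  = [0.112, 0.118, 0.109, 0.121, 0.115, 0.119]   # invented conversion-rate samples
variation = [0.128, 0.133, 0.126, 0.131, 0.127, 0.135]

# Effect size: standardized mean difference between the two groups
# (Cohen's d, using a pooled standard deviation for equal-sized groups).
mean_diff = statistics.mean(variation) - statistics.mean(baseline)
pooled_sd = ((statistics.variance(baseline) + statistics.variance(variation)) / 2) ** 0.5
cohens_d = mean_diff / pooled_sd
print("effect size (Cohen's d):", round(cohens_d, 2))

# Significance: two-sample t-test; a small p-value suggests the difference
# is unlikely to be due to random chance alone.
t_stat, p_value = stats.ttest_ind(variation, baseline)
print("p-value:", round(p_value, 4))
```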
Precision and Error limits
Precision refers to how close estimates from different samples are to each other
The standard error is a measure of precision
When the standard error is small, estimates from different samples will be close in
value and vice versa
Precision is inversely related to standard error
Precision and Error limits (Contd.)
The limits of error are the maximum overestimate and the maximum
underestimate from the combination of the sampling and the non‐sampling errors
The margin of error is defined as –
Limit of error = Critical value x Standard deviation of the statistic
Critical value: Determines the tolerance level of error.
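A minimal sketch of the formula above applied to a sample mean (the data are invented; 1.96 is used as the critical value, which corresponds to roughly 95% confidence under a normal approximation, an assumption not stated on the slide).

```python
import statistics

sample = [49.2, 51.1, 50.4, 48.9, 50.8, 49.7, 51.5, 50.1]

# Standard error of the mean: the standard deviation of the statistic (here, the mean).
standard_error = statistics.stdev(sample) / (len(sample) ** 0.5)

# Limit of error = critical value x standard deviation of the statistic.
critical_value = 1.96          # ~95% confidence under a normal approximation
limit_of_error = critical_value * standard_error

mean = statistics.mean(sample)
print(f"estimate: {mean:.2f} ± {limit_of_error:.2f}")
```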