04 - IoT - Unit 4 - Data Handling & Analytics

The document discusses data handling and analytics, emphasizing the importance of securely managing research data and the rise of big data due to IoT devices. It outlines the characteristics of big data, including volume, velocity, variety, and the need for advanced technologies like Hadoop for data processing. Additionally, it covers types of data analysis, distinguishing between qualitative and quantitative methods, and highlights the advantages of data analytics for informed decision-making.

Uploaded by

gunigantibhanu

Data Handling and Analytics – Part I

Data is Precious

Introduction to Internet of Things 1


What is Data Handling
 Data handling
  Ensures that research data is stored, archived or disposed of in a safe and secure
manner during and after the conclusion of a research project
 Includes the development of policies and procedures to manage data handled
electronically as well as through non‐electronic means.

  In recent days, most data handling concerns –


 Big Data
 Due to heavy traffic generated by IoT devices
 Huge amount of data generated by the deployed sensors



What is Big Data
 “Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of
data, by enabling the high-velocity capture, discovery, and/or analysis.”
[Report of International Data Corporation (IDC)]
 “Big data shall mean the data of which the data volume, acquisition speed, or data
representation limits the capacity of using traditional relational methods to conduct
effective analysis or the data which may be effectively processed with important
horizontal zoom technologies.”
[National Institute of Standards and Technology (NIST)]



Types of Data
 Structured data
 Data that can be easily organized.
 Usually stored in relational databases.
 Structured Query Language (SQL) manages structured data in databases.
  It is commonly estimated to account for only about 20% of the total data available in the world today.
  Unstructured data
  Information that does not possess any pre‐defined model.
  Traditional RDBMSs are unable to process unstructured data.
  Analyzing it enhances the ability to gain better insight into huge datasets.
  It is commonly estimated to account for about 80% of the total data available in the world today.



Characteristics of Big Data
 Big Data is characterized by 7 Vs –
 Volume
 Velocity
 Variety
 Variability
 Veracity
 Visualization
 Value



Characteristics of Big Data (Contd.)
 Volume
 Quantity of data that is generated
 Sources of data are added continuously
 Example of volume ‐
 30TB of images will be generated every night from the Large Synoptic Survey Telescope
(LSST)
 72 hours of video are uploaded to YouTube every minute



Characteristics of Big Data (Contd.)
 Velocity
 Refers to the speed of generation of data
  Data processing time is decreasing day‐by‐day in order to provide real‐time services
 Older batch processing technology is unable to handle high velocity of data
 Example of velocity –
 140 million tweets per day on average (according to a survey conducted in 2011)
 New York Stock Exchange captures 1TB of trade information during each trading
session



Characteristics of Big Data (Contd.)
 Variety
 Refers to the category to which the data belongs
 No restriction over the input data formats
 Data mostly unstructured or semi‐structured
 Example of variety –
 Pure text, images, audio, video, web, GPS data, sensor data, SMS, documents, PDFs, flash
etc.



Characteristics of Big Data (Contd.)
 Variability
 Refers to data whose meaning is constantly changing.
 Meaning of the data depends on the context.
 Data appear as an indecipherable mass without structure
 Example:
 Language processing, Hashtags, Geo‐spatial data, Multimedia, Sensor events
 Veracity
 Veracity refers to the biases, noise and abnormality in data.
 It is important in programs that involve automated decision‐making, or feeding the data
into an unsupervised machine learning algorithm.
 Veracity isn’t just about data quality, it’s about data understandability.



Characteristics of Big Data (Contd.)
 Visualization
 Presentation of data in a pictorial or graphical format
 Enables decision makers to see analytics presented visually
 Identify new patterns

 Value
 It means extracting useful business information from scattered data.
 Includes a large volume and variety of data
 Easy to access and delivers quality analytics that enables informed decisions



Data Handling Technologies
 Cloud computing
 Essential characteristics according to NIST
 On‐demand self service
 Broad network access
 Resource pooling
 Rapid elasticity
 Measured service
 Basic service models provided by cloud computing
 Infrastructure‐as‐a‐Service (IaaS)
 Platform‐as‐a‐Service (PaaS)
 Software‐as‐a‐Service (SaaS)



Data Handling Technologies (Contd.)
 Internet of Things (IoT)
 According to Techopedia, IoT “describes a future where every day physical
objects will be connected to the internet and will be able to identify themselves
to other devices.”
 Sensors embedded into various devices and machines and deployed into fields.
 Sensors transmit sensed data to remote servers via Internet.
 Continuous data acquisition from mobile equipment, transportation facilities,
public facilities, and home appliances



Data Handling Technologies (Contd.)
 Data handling at data centers
 Storing, managing, and organizing data.
 Estimates and provides necessary processing capacity.
 Provides sufficient network infrastructure.
 Effectively manages energy consumption.
 Replicates data to keep backup.
  Develops business‐oriented strategic solutions from big data.
 Helps business personnel to analyze existing data.
 Discovers problems in business operations.



Flow of Data

  Generation
  Enterprise data
  IoT data
  Bio‐medical data
  Other data
  Acquisition
  Data collection
  Data transportation
  Data pre‐processing
  Storage
  Hadoop
  MapReduce
  NoSQL databases
  Analysis
  Bloom filter
  Parallel computing
  Hashing and indexing



Data Sources
  Enterprise data
  Online trading and analysis data.
  Production and inventory data.
  Sales and other financial data.
  IoT data
  Data from industry, agriculture, traffic, transportation
  Medical‐care data
  Data from public departments, and families.
  Bio‐medical data
  Masses of data generated by gene sequencing.
  Data from medical clinics and medical R&Ds.
  Other fields
  Fields such as – computational biology, astronomy, nuclear research etc.



Data Acquisition
 Data collection
 Log files or record files that are automatically generated by data sources to record
activities for further analysis.
 Sensory data such as sound wave, voice, vibration, automobile, chemical, current,
weather, pressure, temperature etc.
  Complex and varied data collected through mobile devices. E.g. – geographical
location, 2D barcodes, pictures, videos etc.
 Data transmission
  After collection, the data is transferred to a storage system for further processing and
analysis.
  Data transmission can be categorized as – Inter‐DCN transmission and Intra‐DCN
transmission (DCN: data center network).



Data Acquisition (Contd.)
 Data pre‐processing
  Collected datasets suffer from noise, redundancy, inconsistency etc.; thus,
pre‐processing of data is necessary.
  Pre‐processing of relational data mainly follows – integration, cleaning, and
redundancy mitigation
  Integration is combining data from various sources and providing users with a uniform
view of data.
 Cleaning is identifying inaccurate, incomplete, or unreasonable data, and then
modifying or deleting such data.
 Redundancy mitigation is eliminating data repetition through detection, filtering and
compression of data to avoid unnecessary transmission.
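
The three pre‐processing steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the record fields (`sensor`, `ts`, `temp_c`), the valid temperature range, and the function names are assumptions made for the example and do not come from the slides.

```python
def integrate(*sources):
    """Integration: combine records from several sources into one uniform list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def clean(records, low=-40.0, high=85.0):
    """Cleaning: drop incomplete (None) or unreasonable (out-of-range) readings.

    The range limits are assumed values chosen for this demo.
    """
    return [
        r for r in records
        if r.get("temp_c") is not None and low <= r["temp_c"] <= high
    ]


def deduplicate(records):
    """Redundancy mitigation: keep the first record per (sensor, timestamp)."""
    seen = set()
    unique = []
    for r in records:
        key = (r["sensor"], r["ts"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Two hypothetical data sources with a missing value, an outlier and a duplicate.
field_a = [
    {"sensor": "s1", "ts": 1, "temp_c": 21.5},
    {"sensor": "s1", "ts": 2, "temp_c": None},
]
field_b = [
    {"sensor": "s1", "ts": 1, "temp_c": 21.5},
    {"sensor": "s2", "ts": 1, "temp_c": 999.0},
    {"sensor": "s2", "ts": 2, "temp_c": 19.0},
]

prepared = deduplicate(clean(integrate(field_a, field_b)))
```

Here cleaning simply deletes bad records; modifying them (for example, interpolating a missing value) is the other option the slide mentions.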



Data Storage
 File system
 Distributed file systems that store massive data and ensure – consistency, availability,
and fault tolerance of data.
  GFS is a notable example of a distributed file system that supports large‐scale
storage, though its performance is limited in the case of small files
  Hadoop Distributed File System (HDFS) and Kosmosfs are other notable file systems,
developed as open‐source systems modeled on the design of GFS.
 Databases
  Emergence of non‐relational databases (NoSQL) in order to deal with the
characteristics that big data possesses.
 Three main NoSQL databases – Key‐value databases, column‐oriented databases, and
document‐oriented databases.
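
As a rough illustration of how the three NoSQL data models differ, the sketch below represents each one with plain Python structures. The keys, field names and values are hypothetical, and real NoSQL systems add persistence, distribution and query languages that are not shown here.

```python
# Key-value model: an opaque value is looked up by a single key.
key_value_store = {"sensor:42:latest": "21.0"}

# Column-oriented model: all values of one column are stored together,
# which makes scans and aggregates over a single column cheap.
column_store = {
    "sensor_id": [42, 42, 43],
    "temp_c": [21.0, 22.0, 20.0],
}

# Document-oriented model: self-describing, possibly nested records
# that need not share the same set of fields.
document_store = [
    {"_id": 1, "sensor_id": 42, "reading": {"temp_c": 21.0, "unit": "C"}},
    {"_id": 2, "sensor_id": 43, "reading": {"temp_c": 20.0}, "tags": ["outdoor"]},
]


def average(column):
    """Aggregate over one column without touching the other columns."""
    return sum(column) / len(column)


avg_temp = average(column_store["temp_c"])
```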



Data Handling Using Hadoop
Reliable, scalable, distributed data handling



What is Hadoop

  Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers.
  Hadoop is an open‐source implementation of the designs behind Google's GFS and MapReduce.
  Apache Hadoop's MapReduce and Hadoop Distributed File System (HDFS)
components were originally derived from Google's MapReduce and Google File
System (GFS), respectively.
Source: https://www.cloudnloud.com/hadoop-hdfs-operations/



Building Blocks of Hadoop
 Hadoop Common
 A module containing the utilities that support the other Hadoop components
 Hadoop Distributed File System (HDFS)
 Provides reliable data storage and access across the nodes
 MapReduce
  Framework for applications that process large datasets in parallel.
 Yet Another Resource Negotiator (YARN)
  Resource‐management layer (sometimes called next‐generation MapReduce), which assigns CPU,
memory and storage to applications running on a Hadoop cluster.
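
To make the MapReduce idea concrete, here is a minimal single‐process word count written in the map, shuffle, reduce style. It is not Hadoop code and does not use the Hadoop API: the function names are invented for the example, and everything runs sequentially, whereas Hadoop would run the map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["sensor data sensor", "Data is precious"])
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over many machines.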



Hadoop Distributed File System (HDFS)
 Centralized node
 Namenode
 Maintains metadata info about files

 Distributed node
 Datanode
 Store the actual data
 Files are divided into blocks
 Each block is replicated
Source: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html



Name and Data Nodes
 Namenode
 Stores filesystem metadata.
 Maintains two in‐memory tables, to map the datanodes to the blocks, and vice versa

 Datanode
 Stores actual data
 Data nodes can talk to each other to rebalance and replicate data
 Data nodes update the namenode with the block information periodically
  Before updating, datanodes verify the checksums.
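
The bookkeeping described above (blocks, replicas, two mapping tables, checksums) can be imitated with a toy model. This is a simplified sketch and not how HDFS is implemented: the class name, the 4‐byte block size, the replication factor of 2, the round‐robin placement, and keeping checksums next to the metadata are all assumptions made to keep the example small.

```python
import zlib

BLOCK_SIZE = 4    # bytes; unrealistically small, assumed for the demo
REPLICATION = 2   # assumed replication factor for the demo


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNamenode:
    """Holds metadata only: which block lives on which datanodes."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_to_nodes = {}                              # block -> datanodes
        self.node_to_blocks = {n: [] for n in self.datanodes}  # datanode -> blocks
        self.checksums = {}                                   # block -> CRC32

    def store(self, filename, data):
        for index, block in enumerate(split_into_blocks(data)):
            block_id = f"{filename}#{index}"
            # Round-robin placement of replicas (an assumption, not HDFS policy).
            nodes = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(REPLICATION)
            ]
            self.block_to_nodes[block_id] = nodes
            for node in nodes:
                self.node_to_blocks[node].append(block_id)
            self.checksums[block_id] = zlib.crc32(block)

    def verify(self, block_id, block):
        """Check a block's bytes against the recorded checksum."""
        return zlib.crc32(block) == self.checksums[block_id]


namenode = ToyNamenode(["dn1", "dn2", "dn3"])
namenode.store("log.txt", b"0123456789")
```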



Job and Task Trackers
 Job Tracker –
 Runs with the Namenode
 Receives the user’s job
 Decides on how many tasks will run (number
of mappers)
 Decides on where to run each mapper
(concept of locality)
 Task Tracker –
 Runs on each datanode
 Receives the task from Job Tracker
  Always in communication with the Job
Tracker reporting progress
Source: http://developeriq.in/articles/2015/aug/11/an-introduction-to-apache-hadoop-for-big-data/



Hadoop Master/Slave Architecture
 Master‐slave shared‐nothing architecture
  Master (Namenode)
 Executes operations like opening, closing,
and renaming files and directories.
 Determines the mapping of blocks to
Datanodes.
  Slave (Datanode)
 Serves read and write requests from the
file system’s clients.
 Performs block creation, deletion, and
replication as instructed by the Namenode.
Source: http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html



Data Analytics – Part II
Data is Precious



What is Data Analytics
 “Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software. Data analytics technologies and techniques are
widely used in commercial industries to enable organizations to make more‐
informed business decisions and by scientists and researchers to verify or disprove
scientific models, theories and hypotheses.”
[An admin's guide to AWS data management]



Types of Data Analysis
 Two types of analysis
 Qualitative Analysis
 Deals with the analysis of data that is categorical in nature

 Quantitative Analysis
 Quantitative analysis refers to the process by which numerical data is analyzed



Qualitative Analysis
 Data is not described through numerical values
 Described by some sort of descriptive context such as text
 Data can be gathered by many methods such as interviews, videos and audio
recordings, field notes
 Data needs to be interpreted
 The grouping of data into identifiable themes
 Qualitative analysis can be summarized by three basic principles (Seidel, 1998):
 Notice things
 Collect things
 Think about things



Quantitative Analysis
 Quantitative analysis refers to the process by which numerical data is analyzed
  Involves descriptive statistics such as mean, median, standard deviation
 The following are often involved with quantitative analysis:

  Statistical models
  Analysis of variance
  Data dispersion
  Analysis of relationships between variables
  Contingence and correlation
  Regression analysis
  Statistical significance
  Precision
  Error limits



Comparison
Qualitative Data                          Quantitative Data
Data is observed                          Data is measured
Involves descriptions                     Involves numbers
Emphasis is on quality                    Emphasis is on quantity
Examples are color, smell, taste, etc.    Examples are volume, weight, etc.



Advantages
 Allows for the identification of important (and often mission‐critical) trends
 Helps businesses identify performance problems that require some sort of action
 Can be viewed in a visual manner, which leads to faster and better decisions
 Better awareness regarding the habits of potential customers
 It can provide a company with an edge over their competitors




Statistical models
  A statistical model is defined as a set of mathematical equations formulated
in the form of relationships between variables.
 A statistical model illustrates how a set of random variables is related to another
set of random variables.
 A statistical model is represented as the ordered pair (X , P)
 X denotes the set of all possible observations
 P refers to the set of probability distributions on X



Statistical models (Contd.)
 Statistical models are broadly categorized as
 Complete models
 Incomplete models

  A complete model has the same number of variables as equations
  An incomplete model does not have the same number of variables as equations



Statistical models (Contd.)
 In order to build a statistical model
 Data Gathering
 Descriptive Methods
 Thinking about Predictors
 Building of model
 Interpreting the Results



Analysis of variance
 Analysis of Variance (ANOVA) is a parametric statistical technique used to compare
datasets.
 ANOVA is best applied where more than 2 populations or samples are meant to be
compared.
 To perform an ANOVA, we must have a continuous response variable and at least one
categorical factor (e.g. age, gender) with two or more levels (e.g. Locations 1, 2)
 ANOVAs require data from approximately normally distributed populations



Analysis of variance (Contd.)
 Properties to perform ANOVA –
  Independence of cases
 The sample should be selected randomly
 There should not be any pattern in the selection of the sample
 Normality
 Distribution of each group should be normal
 Homogeneity
  Variance within each group should be approximately the same across groups (e.g. should
not compare data from cities with those from slums)



Analysis of variance (Contd.)
 Analysis of variance (ANOVA) has three types:
 One way analysis
 One fixed factor (levels set by investigator). Factors: age, gender, etc.
 Two way analysis
  Two factor variables
 K‐way analysis
 Factor variables are k



Analysis of variance (Contd.)
 Total Sum of square
 In statistical data analysis, the total sum of squares (TSS or SST) is a quantity that
appears as part of a standard way of presenting results of such analyses. It is defined
as being the sum, over all observations, of the squared differences of each
observation from the overall mean.
  F‐ratio
  The ratio of the variance between groups to the variance within groups
  The F ratio is approximately 1.0 when the null hypothesis is true and tends to be greater than
1.0 when the null hypothesis is false.
  Degrees of freedom
  In a one‐way ANOVA with k groups and N observations, there are k − 1 degrees of freedom
between groups and N − k within groups
 The number of degrees of freedom is the number of values in the final calculation of a
statistic that are free to vary.
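
The quantities on this slide can be computed by hand for a one‐way ANOVA. The sketch below is a minimal illustration: the function name and the three small example groups are made up, and the formulas are the standard one‐way definitions (total, between‐group and within‐group sums of squares, with k − 1 and N − k degrees of freedom).

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and the F-ratio."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    # Total sum of squares: squared distance of every observation
    # from the overall mean.
    sst = sum((x - grand_mean) ** 2 for x in all_values)

    # Between-group and within-group sums of squares.
    ssb = 0.0
    ssw = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ssb += len(g) * (group_mean - grand_mean) ** 2
        ssw += sum((x - group_mean) ** 2 for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ssb / df_between) / (ssw / df_within)
    return {
        "sst": sst,
        "ssb": ssb,
        "ssw": ssw,
        "df_between": df_between,
        "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Hypothetical readings from three locations.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For these groups the between‐group sum of squares is 26 and the within‐group sum is 6, so F = (26 / 2) / (6 / 6) = 13, well above 1.0. Deciding whether that is significant still requires comparing it with the F distribution for (2, 6) degrees of freedom, which this sketch does not do.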



Data dispersion
 A measure of statistical dispersion is a nonnegative real number that is zero if all
the data are the same and increases as the data becomes more diverse.

 Examples of dispersion measures:


 Range
 Average absolute deviation
 Variance and Standard deviation



Data dispersion (Contd.)
 Range
 The range is calculated by simply taking the difference between the maximum and
minimum values in the data set.
 Average absolute deviation
 The average absolute deviation (or mean absolute deviation) of a data set is the average of the
absolute deviations from the mean.
 Variance
 Variance is the expectation of the squared deviation of a random variable from its mean
 Standard deviation
 Standard deviation (SD) is a measure that is used to quantify the amount of variation
or dispersion of a set of data values
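
The four dispersion measures can be computed with Python's standard `statistics` module, as in the sketch below. The helper name and the sample values are illustrative, and the population forms (`pvariance`, `pstdev`) are an assumption; use `variance` and `stdev` instead when the data is a sample from a larger population.

```python
import statistics


def dispersion(values):
    """Compute range, mean absolute deviation, variance and standard deviation."""
    mean = statistics.fmean(values)
    return {
        "range": max(values) - min(values),
        "mean_abs_dev": sum(abs(x - mean) for x in values) / len(values),
        "variance": statistics.pvariance(values),   # population variance
        "std_dev": statistics.pstdev(values),       # population standard deviation
    }


measures = dispersion([2, 4, 4, 4, 5, 5, 7, 9])
constant = dispersion([3, 3, 3])
```

With identical values every measure is zero, which matches the definition of a dispersion measure given earlier.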



Contingence and correlation
 In statistics, a contingency table (also known as a cross tabulation or crosstab) is a
type of table in a matrix format that displays the (multivariate) frequency
distribution of the variables.

 Provides a basic picture of the interrelation between two variables

  A crucial problem of multivariate statistics is finding the (direct‐)dependence structure
underlying the variables contained in high‐dimensional contingency tables
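
A two‐variable contingency table is just a count of how often each pair of categories occurs together. The following sketch builds one with `collections.Counter`; the function name and the device and status observations are hypothetical.

```python
from collections import Counter


def contingency_table(rows, cols):
    """Cross-tabulate two equally long lists of categorical observations."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


device = ["sensor", "sensor", "camera", "camera", "sensor"]
status = ["ok", "fault", "ok", "ok", "ok"]

row_labels, col_labels, table = contingency_table(device, status)
```

Each cell of `table` holds the frequency for one (row label, column label) pair, so the row and column sums give the marginal distributions of the two variables.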



Contingence and correlation (Contd.)
 Correlation is a technique for investigating the relationship between two
quantitative, continuous variables

  Pearson's correlation coefficient (r) is a measure of the strength of the association
between the two variables.

 Correlations are useful because they can indicate a predictive relationship that can
be exploited in practice
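
Pearson's r can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The function name and the data below are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of products are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3], [3, 2, 1])
```

Values close to +1 or −1 indicate a strong linear association, values near 0 a weak one. The function divides by zero if either variable is constant, which this sketch does not guard against.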



Regression analysis
 In statistical modeling, regression analysis is a statistical process for estimating the
relationships among variables

  Focuses on the relationship between a dependent variable and one or more
independent variables

  Regression analysis estimates the conditional expectation of the dependent
variable given the independent variables



Regression analysis (Contd.)
 The estimation target is a function of the independent variables called the
regression function
  It also characterizes the variation of the dependent variable around the regression
function, which can be described by a probability distribution
 Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning
 Regression analysis is also used to understand which among the independent
variables are related to the dependent variable
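
As a minimal illustration, the sketch below fits a straight‐line regression function with one independent variable by ordinary least squares. Choosing a linear form and least squares is an assumption made for the example, since the slides do not commit to a particular regression method, and the function names and data are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted regression function at a new x."""
    return slope * x + intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
forecast = predict(4, slope, intercept)
```

The fitted function estimates the expected value of the dependent variable for a given x, which is what makes it usable for prediction and forecasting.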



Statistical significance
  A result is statistically significant when the observed difference (for example, in
conversion rates between a given variation and the baseline) is unlikely to be due to random chance

 Statistical significance level reflects the risk tolerance and confidence level

 There are two key variables that go into determining statistical significance:
 Sample size
 Effect size



Statistical significance (Contd.)
  Sample size refers to the number of observations in the experiment

 The larger your sample size, the more confident you can be in the result of the
experiment (assuming that it is a randomized sample)

 The effect size is just the standardized mean difference between the two groups

  If a particular experiment is replicated, the different effect size estimates from each
study can easily be combined to give an overall best estimate of the effect size
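
One common way to compute a standardized mean difference is Cohen's d, which divides the difference of the group means by a pooled standard deviation. Using Cohen's d here is an assumption, since the slide does not name a specific formula, and the function name and data are illustrative.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd


effect = cohens_d([2, 4, 6], [1, 3, 5])
```

Because the result is expressed in units of standard deviation, estimates from replicated experiments are on a common scale and can be combined.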



Precision and Error limits
 Precision refers to how close estimates from different samples are to each other

 The standard error is a measure of precision

 When the standard error is small, estimates from different samples will be close in
value and vice versa

 Precision is inversely related to standard error



Precision and Error limits (Contd.)
 The limits of error are the maximum overestimate and the maximum
underestimate from the combination of the sampling and the non‐sampling errors

  The margin of error is defined as –
  Margin of error = Critical value × Standard deviation of the statistic
 Critical value: Determines the tolerance level of error.
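
For the mean of a sample, the standard deviation of the statistic is the standard error, so the formula above can be sketched as follows. The default critical value of 1.96 (a 95% level under a normal approximation) and the sample values are assumptions made for the example.

```python
import math
import statistics


def margin_of_error(sample, critical_value=1.96):
    """Critical value times the standard error of the sample mean."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return critical_value * standard_error


sample = [2, 4, 4, 4, 5, 5, 7, 9]
moe = margin_of_error(sample)
interval = (statistics.fmean(sample) - moe, statistics.fmean(sample) + moe)
```

A larger critical value tolerates less risk of error and therefore widens the interval, while a larger sample shrinks the standard error and narrows it.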


