Data and Analytics for IoT
MODULE 4
As more and more devices are added to IoT networks, the data generated by these systems becomes overwhelming.
Traditional data management systems are simply unprepared for the demands of what has come to be known as “big data.”
The real value of IoT is not just in connecting things but
rather in the data produced by those things, the new services you
can enable via those connected things, and the business
insights that the data can reveal.
However, to be useful, the data needs to be handled in a
way that is organized and controlled.
Thus, a new approach to data analytics is needed for IoT.
An Introduction to Data Analytics for IoT
In the world of IoT, the creation of massive amounts of data from sensors is common and one of the biggest challenges, not only from a transport perspective but also from a data management standpoint.
Modern jet engines are fitted with thousands of sensors that generate a whopping 10 GB of data per second.
Analyzing this amount of data in the most efficient manner possible falls under the umbrella of data analytics.
Not all data is the same; it can be categorized and thus
analyzed in different ways.
Depending on how data is categorized, various data analytics
tools and processing methods can be applied.
Two important categorizations from an IoT
perspective are whether the data is structured or unstructured
and whether it is in motion or at rest.
Structured Versus Unstructured Data
Structured data and unstructured data are important classifications, as they typically require different toolsets from a data analytics perspective.
Structured data means that the data follows a model or
schema that defines how the data is represented or organized,
meaning it fits well with a traditional relational database
management system (RDBMS).
In many cases you will find structured data in a simple tabular form, for example, a spreadsheet where data occupies a specific cell and can be explicitly defined and referenced.
Structured data can be found in most computing systems and includes everything from banking transactions and invoices to computer log files and router configurations.
IoT sensor data often uses structured values, such as
temperature, pressure, humidity, and so on, which are
all sent in a known format.
Structured data is easily formatted, stored, queried, and processed.
Because of the highly organized format of structured data, a wide array of data analytics tools are readily available for processing this type of data, from custom scripts to commercial software like Microsoft Excel and Tableau.
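To make this concrete, here is a minimal Python sketch that stores a few sensor readings in an SQLite table and queries them; the table layout and values are invented for illustration, but they show how naturally structured IoT data maps onto an RDBMS.

    import sqlite3

    # In-memory database for the sketch; a real deployment would use a
    # persistent RDBMS.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE readings (
        sensor_id TEXT,     -- which smart object sent the value
        ts        INTEGER,  -- Unix timestamp of the reading
        temp_c    REAL      -- temperature in degrees Celsius
    )""")

    rows = [("engine-1", 1700000000, 88.5),
            ("engine-1", 1700000060, 89.1),
            ("engine-2", 1700000000, 72.3)]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

    # Because the schema is fixed, queries are simple and explicit.
    for sensor, avg in conn.execute(
            "SELECT sensor_id, AVG(temp_c) FROM readings GROUP BY sensor_id"):
        print(sensor, round(avg, 2))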
Unstructured data lacks a logical schema for understanding and decoding the data through traditional programming means.
Examples of this data type include text, speech, images,
and video.
As a general rule, any data that does not fit neatly into a predefined data model is classified as unstructured data.
According to some estimates, around 80% of a business’s
data is unstructured.
Because of this fact, data analytics methods that can be
applied to unstructured data, such as cognitive
computing and machine learning, are deservedly garnering
a lot of attention.
With machine learning applications, such as natural
language processing (NLP), you can decode
speech.
With image/facial recognition applications, you can extract critical information from still images and video.
Smart objects in IoT networks generate both
structured and unstructured data.
Structured data is more easily managed and processed due
to its well-defined organization.
On the other hand, unstructured data can be harder to deal with and typically requires very different analytics tools for processing the data.
Data in Motion Versus Data at Rest
Data in IoT networks is either in transit (“data in motion”)
or being held or stored (“data at rest”).
Examples of data in motion include traditional client/server exchanges, such as web browsing, file transfers, and email.
Data saved to a hard drive, storage array, or USB drive is
data at rest.
From an IoT perspective, the data from smart objects is
considered data in motion as it passes through the network en
route to its final destination.
This is often processed at the edge, using fog computing.
When data is processed at the edge, it may be filtered and deleted
or forwarded on for further processing and possible storage at a
fog node or in the data center.
Data does not come to rest at the edge.
When data arrives at the data center, it is possible to process it in real time, just as at the edge, while it is still in motion.
Tools with this sort of capability include Spark, Storm, and Flink.
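As a sketch of the edge filtering described above (a plain Python generator, not a real Spark/Storm/Flink job), the following code decides, reading by reading, whether data in motion is dropped at the edge or forwarded on; the threshold and record format are assumptions.

    # Minimal edge-filtering sketch: drop routine readings, forward anomalies.
    # The 90 C threshold and the record layout are illustrative assumptions.
    THRESHOLD_C = 90.0

    def edge_filter(stream):
        """Yield only readings worth forwarding to a fog node or data center."""
        for reading in stream:
            if reading["temp_c"] >= THRESHOLD_C:
                yield reading  # forward for further processing/storage
            # otherwise the reading is filtered (deleted) at the edge

    sensor_stream = [{"sensor_id": "engine-1", "temp_c": 88.0},
                     {"sensor_id": "engine-1", "temp_c": 93.2}]
    for r in edge_filter(sensor_stream):
        print("forwarding:", r)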
Data at rest in IoT networks can typically be found in IoT brokers or in some sort of storage array at the data center.
Hadoop not only helps with data processing but also with data storage.
IoT Data Analytics Overview
The true importance of IoT data from smart objects
is realized only when the analysis of the data leads to
actionable business intelligence and insights.
Data analysis is typically broken down by the types of results that are produced.
Types of Data Analysis Results
There are four types of data analysis results:
Descriptive:
Descriptive data analysis tells you what is happening,
either now or in the past.
For example, a thermometer in a truck engine
reports temperature values every second.
From a descriptive analysis perspective, you can pull this data at
any moment to gain insight into the current operating
condition of the truck engine.
If the temperature value is too high, then there may
be a cooling problem or the engine may be experiencing
too much load.
Diagnostic:
When you are interested in the “why,” diagnostic data
analysis
can provide the answer.
Continuing with the example of the temperature sensor in the
truck engine, you might wonder why the truck engine
failed.
Diagnostic analysis might show that the temperature
of the engine was too high, and the engine
overheated.
Applying diagnostic analysis across the data generated by a wide range of smart objects can provide a clear picture of why a problem or an event occurred.
Predictive:
Predictive analysis aims to foretell problems or
issues
before they occur.
For example, with historical values of temperatures for the
truck engine, predictive analysis could provide an
estimate on the remaining life of certain components
in the engine.
These components could then be proactively replaced before
failure occurs.
Or perhaps if temperature values of the truck engine start to
rise slowly over time, this could indicate the need for an oil
change or some other sort of engine cooling
maintenance.
Prescriptive:
Prescriptive analysis goes a step beyond predictive and
recommends
solutions for upcoming problems.
A prescriptive analysis of the temperature data from a truck engine might calculate various alternatives to cost-effectively maintain the truck.
These calculations could range from the cost necessary for more frequent
oil
changes and cooling maintenance to installing new cooling equipment on the
engine or upgrading to a lease on a model with a more powerful
engine.
Prescriptive analysis looks at a variety of factors and makes the appropriate recommendation.
Both predictive and prescriptive analyses are more resource intensive and increase complexity, but the value they provide is much greater than the value from descriptive and diagnostic analysis.
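To make the distinction between these result types concrete, here is a minimal Python sketch that performs a descriptive analysis (current and average temperature) and a naive predictive analysis (a straight-line trend extrapolated to a failure threshold) on the truck-engine example; the readings and the 105 C threshold are invented for illustration.

    # Invented readings, one sample per minute (illustration only).
    temps = [88.0, 88.4, 89.1, 89.9, 90.6, 91.5]

    # Descriptive: what is happening now (or happened in the past).
    print("current temp:", temps[-1])
    print("average temp:", round(sum(temps) / len(temps), 1))

    # Predictive (naive): fit a straight line through the readings and
    # extrapolate to a hypothetical 105 C failure threshold.
    n = len(temps)
    x_mean = (n - 1) / 2
    y_mean = sum(temps) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(temps))
             / sum((x - x_mean) ** 2 for x in range(n)))
    minutes_left = (105.0 - temps[-1]) / slope
    print("estimated minutes until 105 C:", round(minutes_left))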
IoT Data Analytics Challenges
Problems with using an RDBMS in IoT:
1. Scaling problems (performance issues, costly to resolve, require more hardware, architecture changes)
2. Volatility of data (changes in schema)
Machine Learning
ML is central to IoT.
Data collected by smart objects needs to be analyzed, and
intelligent actions need to be taken based on these
analyses.
Performing this kind of operation manually is almost impossible
(or very, very slow and inefficient).
Machines are needed to process information quickly and react instantly when thresholds are met.
Examples include advances in self-driving vehicles, abnormal pattern recognition in a crowd, and automated intelligent and machine-assisted decision systems.
Machine Learning Overview
Machine learning is, in fact, part of a larger set of technologies
commonly grouped under the term artificial intelligence
(AI).
AI includes any technology that allows a computing system to
mimic human intelligence using any technique, from
very advanced logic to basic “if-then-else” decision loops.
Any computer that uses rules to make decisions belongs to this group.
A simple example is an app that can help you
find your parked car.
A GPS reading of your position at regular intervals calculates
your speed.
A basic threshold system determines whether you are driving (for example, “if speed > 20 mph or 30 km/h, then the user is driving”).
When you park and disconnect from the car
Bluetooth system, the app simply records the location
when the disconnection happens.
This is where your car is parked.
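A minimal sketch of this rule-based (non-ML) logic follows; the callback names, the threshold, and the coordinates are hypothetical, and a real app would receive GPS and Bluetooth events from the phone's APIs.

    # Simple if-then-else rules: no learning involved.
    DRIVING_SPEED_KMH = 30  # threshold from the example above

    parked_location = None
    driving = False

    def on_gps_update(speed_kmh):
        """Hypothetical callback fired for each periodic GPS reading."""
        global driving
        if speed_kmh > DRIVING_SPEED_KMH:
            driving = True  # speed above threshold: the user is in the car

    def on_bluetooth_disconnect(current_location):
        """Hypothetical callback: the car's Bluetooth dropped, so we just parked."""
        global parked_location, driving
        if driving:
            parked_location = current_location
            driving = False

    on_gps_update(55)
    on_bluetooth_disconnect((48.8566, 2.3522))  # made-up coordinates
    print("your car is parked at:", parked_location)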
In more complex cases, static rules cannot simply be inserted into the program because they require parameters that can change or that are imperfectly understood.
A typical example is a dictation program that runs on a
computer.
The program is configured to recognize the audio pattern of each word in a dictionary, but it does not know your voice’s specifics: your accent, tone, speed, and so on.
You need to record a set of predetermined sentences to
help the tool match well-known words to the sounds
you make when you say the words.
This process is called machine learning.
ML is concerned with any process where the
computer needs to receive a set of data that is
processed to help perform a task with more
efficiency.
ML is a vast field but can be simply divided into two main categories: supervised and unsupervised learning.
Supervised Learning
In supervised learning, the machine is trained with input for
which there is a known correct answer.
For example, suppose that you are training a system to recognize
when there is a human in a mine tunnel.
A sensor equipped with a basic camera can capture shapes
and return them to a computing system that is responsible
for determining whether the shape is a human or
something else (such as a vehicle, a pile of ore, a rock, a piece of wood, and so on).
With supervised learning techniques, hundreds or thousands of images are fed into the machine, and each image is labeled (human or nonhuman in this case).
This is called the training set.
An algorithm is used to determine common parameters
and common differences between the images.
The comparison is usually done at the scale of the entire
image, or pixel by pixel.
Images are resized to have the same characteristics
(resolution, color depth, position of the central figure, and
so on), and each point is analyzed.
Each new image is compared to the set of known “good images,” and a deviation is calculated to determine how different the new image is from the average human image and, therefore, the probability that what is shown is a human figure. This process is called classification.
After training, the machine should be able to recognize human shapes.
Before real field deployments, the machine is usually tested with unlabeled pictures (this is called the validation or the test set, depending on the ML system used) to verify that the recognition level is at acceptable thresholds. If the machine does not reach the expected level of success, more training is needed.
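A minimal supervised-classification sketch follows, using scikit-learn; the two-number feature vectors stand in for real image features, and the values, labels, and classifier choice (LogisticRegression) are all assumptions for illustration.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Made-up two-number stand-ins for image features (e.g., shape height
    # and width after resizing); a real system would work at pixel scale.
    X = [[1.8, 0.4], [1.7, 0.5], [1.6, 0.4],   # human-like shapes
         [0.9, 2.0], [1.0, 2.2], [0.8, 1.9]]   # vehicle-like shapes
    y = [1, 1, 1, 0, 0, 0]                     # labels: 1 = human, 0 = other

    # Hold out part of the labeled data to validate the trained model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    print("validation accuracy:", clf.score(X_test, y_test))
    print("new shape is human?", bool(clf.predict([[1.75, 0.45]])[0]))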
In other cases, the learning process is not about classifying in two
or more categories but about finding a correct value.
For example, the speed of the flow of oil in a pipe is a
function of the size of the pipe, the viscosity of the oil, pressure, and a
few other factors.
When you train the machine with measured values, the machine
can predict the speed of the flow for a new, and unmeasured,
viscosity.
This process is called regression; regression predicts numeric values, whereas classification predicts categories.
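A matching regression sketch, again with scikit-learn; the pipe measurements and the choice of a plain linear model are invented for illustration.

    from sklearn.linear_model import LinearRegression

    # Invented training measurements: [pipe diameter (cm), viscosity (cP),
    # pressure (bar)] -> measured flow speed (m/s).
    X = [[10, 50, 2.0], [10, 80, 2.0], [15, 50, 2.5], [15, 80, 2.5]]
    y = [1.9, 1.4, 2.6, 2.1]

    model = LinearRegression().fit(X, y)

    # Predict the flow speed for a new, unmeasured viscosity (65 cP).
    print("predicted flow speed:", round(model.predict([[12, 65, 2.2]])[0], 2))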
Unsupervised Learning
In some cases, supervised learning is not the best method for a
machine to help with a human decision.
Suppose that you are processing IoT data from a
factory
manufacturing small engines.
You know that about 0.1% of the produced engines on average
need adjustments to prevent later defects, and your task is to
identify them before they get mounted into machines and shipped
away from the factory.
With hundreds of parts, it may be very difficult to detect the potential defects, and it is almost impossible to train a machine to recognize issues that may not be visible.
However, you can test each engine and record multiple
parameters, such as sound, pressure, temperature of key
parts, and so on.
Once data is recorded, you can graph these elements in relation to one another (for example, temperature as a function of pressure, or sound versus rotating speed over time).
You can then input this data into a computer and use
mathematical functions to find groups.
For example, you may decide to group the engines by the
sound they make at a given temperature.
A standard function to operate this grouping, K-means clustering,
finds the mean values for a group of engines (for example,
mean value for temperature, mean frequency for sound).
Grouping the engines this way can quickly reveal several types of
engines that all belong to the same category (for example, small
engine of chainsaw type, medium engine of lawnmower type).
All engines of the same type produce sounds and temperatures in
the same range as the other members of the same group.
There will occasionally be an engine in the group that
displays unusual characteristics (slightly out of
expected temperature or sound range).
This is the engine that you send for manual evaluation.
The computing process associated with this determination is
called unsupervised learning.
This type of learning is unsupervised because there is not a
“good” or “bad” answer known in advance.
It is the variation from a group behavior that allows the computer to learn that something is different.
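To close, here is a minimal unsupervised sketch of the engine-grouping idea, using scikit-learn's K-means; the engine measurements, the two-cluster choice, and the outlier rule (distance more than three standard deviations above the mean) are all invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented engine test data: [mean temperature (C), sound frequency (Hz)].
    rng = np.random.default_rng(0)
    small = rng.normal([60, 300], [2, 10], size=(50, 2))    # chainsaw-type
    medium = rng.normal([80, 150], [2, 10], size=(50, 2))   # lawnmower-type
    unusual = np.array([[80.0, 230.0]])                     # one odd engine
    engines = np.vstack([small, medium, unusual])

    # No labels are given: K-means finds the natural groups by itself.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(engines)

    # Flag engines unusually far from their group's mean for manual checks.
    centers = km.cluster_centers_[km.labels_]
    dist = np.linalg.norm(engines - centers, axis=1)
    threshold = dist.mean() + 3 * dist.std()
    print("engines to inspect:", np.where(dist > threshold)[0])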