UNIT 4
DATA ANALYTICS AND SUPPORTING SERVICES
Structured vs Unstructured Data and Data in Motion vs Data at Rest – Role of Machine
Learning – NoSQL Databases – Hadoop Ecosystem – Apache Kafka, Apache Spark – Edge
Streaming Analytics and Network Analytics – Xively Cloud for IoT, Python Web
Application Framework – Django – AWS for IoT – System Management with NETCONF-
YANG
4. Introduction to Data Analytics for IoT:
Data management is an important concern in IoT because sensors generate massive amounts
of data. A jet engine instrumented with sensors can generate about 10 GB of data per second;
over an eight-hour period a modern jet engine can produce around 500 TB, so a single
airplane can generate petabytes of data. Handling this massive amount of data is the task of
data analytics, which supports the IoT industries.
The sensors in IoT generate both structured and unstructured data. Structured data is
managed through a well-defined schema, whereas unstructured data is handled by specialized
analytical tools. Data in IoT is treated either as data in transit (data in motion) or as data at
rest. Data acquired from the IoT sensor objects is data in motion; it is processed by fog and
edge computing and then sent onward to the data center.
Data in motion:
Data that is actively moving from one location to another, for example data being transferred
between two networks.
Data at rest: Data that is not actively moving from device to device or network to network,
such as data stored on a hard drive, laptop, or flash drive (e.g., a USB stick).
Protecting sensitive data both in transit and at rest is essential for modern systems, as
intruders find ever more sophisticated ways to steal data. Tools such as Spark, Storm, and
Flink are used for analyzing data in motion (streaming data), while myriad tools exist for
processing structured data at rest; Hadoop supports both data storage and batch processing.
Machine learning is an important tool for IoT and data analytics. Machine learning, deep
learning, neural networks, and convolutional networks are related terms in this field.
Self-driving vehicles, for example, are able to make intelligent decisions while driving
thanks to advancements in machine learning.
Supervised learning:
Supervised learning involves a set of inputs and their corresponding outputs. The system is
trained on a set of labeled inputs called the training set; the algorithm learns from the
training set and then assigns new inputs to the appropriate classes. This process of assigning
inputs to distinct classes is called classification, and in classification the given inputs are
labeled. After training, testing is performed with unlabeled data, and classification should
produce the correct value. Classification and regression are considered the two important
approaches of supervised learning: classification predicts a discrete value, while regression
predicts a continuous value. A greater number of inputs, i.e., a larger dataset, generally
results in better training and higher accuracy.
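To make the training/testing split concrete, here is a minimal, hedged classification sketch using scikit-learn (the sensor readings and labels are made-up placeholders, not from the text):

    # Supervised classification: train on labeled data, test on held-out data.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Labeled inputs: [temperature, vibration] -> 0 = normal, 1 = faulty
    X = [[35, 0.2], [36, 0.3], [80, 0.9], [78, 1.1], [40, 0.25], [85, 1.0]]
    y = [0, 0, 1, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    clf = DecisionTreeClassifier().fit(X_train, y_train)  # training phase
    predictions = clf.predict(X_test)                     # testing phase
    print("accuracy:", accuracy_score(y_test, predictions))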
Unsupervised Learning:
When the given data is unlabeled and the task is to discover the different categories within
it, the approach is called unsupervised learning. The algorithm finds the different groups
present in the given unlabeled data. This grouping can be performed with K-means
clustering: the mean of each cluster is calculated and all similar data points are grouped
around it. The figure depicts three different clusters formed from a given set of unlabeled data.
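As an illustration (not from the original text), K-means clustering of unlabeled points with scikit-learn; the data values and cluster count are placeholders:

    # K-means groups unlabeled points around three computed means.
    import numpy as np
    from sklearn.cluster import KMeans

    data = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2],
                     [5.1, 4.9], [9.0, 9.1], [8.8, 9.3]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # the mean of each cluster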
Neural Networks:
Neural networks extend the machine learning approach so that the system can recognize and
differentiate patterns, mimicking the human brain. A network is formed from several layers,
namely an input layer, a first layer, higher layers, a top layer, and an output layer. Consider
how a system is trained to find a dog in a given set of labeled animal images. An unlabeled
image is fed to the pretrained network at the input layer. The first layer detects simple
shapes; the higher layers identify more complex structures (different features like a face or
an arm); the top layer distinguishes highly complex structures (differentiating the animal
categories); and the final output layer predicts the animal based on the training, giving the
final output with high accuracy. Neural networks are a major research focus and have been
used in various image processing applications. There are different kinds of neural networks,
namely artificial neural networks, convolutional neural networks, and recurrent neural
networks. Deep learning was developed further from this idea and consists of a larger
number of layers: the result of one layer is fed into the next, with fast processing in the
intermediate layers. Numerous applications nowadays rely on deep learning and neural
network approaches.
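As a hedged sketch of the layered idea, using scikit-learn's small MLPClassifier on a toy labeled image dataset rather than a full deep network:

    # A small feed-forward neural network classifying labeled digit images.
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # small labeled image dataset bundled with sklearn
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    # Two hidden layers stand in for the "first" and "higher" layers above.
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                        random_state=0)
    mlp.fit(X_train, y_train)
    print("test accuracy:", mlp.score(X_test, y_test))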
For every possible use case, it is necessary to choose the proper algorithm to obtain good
results when integrated with the IoT application. ML operations can be handled in two ways,
namely local learning and remote learning.
Local learning: the data is processed in the sensor node or fog node.
Remote learning: the data is collected and processed in a central cloud server.
ML for IoT in major domains: A weather sensor can report the pollution level in a city; a
street light can change its luminosity based on the local lighting conditions of the
environment. ML integrated with IoT is deployed in various applications. The following
actions are performed with sensors embedded in various places.
Monitoring: Sensors are used for monitoring the environment, for example a temperature
sensor. ML integrated with such a sensor can detect a failure condition.
Behavior control: For example, if the system detects a hot atmosphere in the environment,
ML may be used to control the system's behavior, inducing it to supply fresh, cool air to the
environment.
Operations optimization: Whereas behavior control focuses on corrective operation,
operations optimization aims at increased efficiency and optimized solutions.
Self-healing, self-optimizing: The system identifies a fault by itself and finds a corrective
action for the identified fault.
Predictive analytics: This kind of analytics predicts issues that are about to arise from a
fault in the system. Predictive analysis is done to improve the safety and maintenance of the
system; sensors embedded in machines can predict the faults that are going to occur with the
help of big data analytics.
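As a minimal, hedged sketch of this idea (the thresholds and readings are made-up placeholders, not from the text), a fault can be predicted by comparing recent sensor readings against a learned baseline:

    # Toy predictive-maintenance sketch: flag a machine when its recent
    # readings drift beyond a baseline learned from healthy runs.
    from statistics import mean, stdev

    baseline = [50.1, 49.8, 50.3, 50.0, 49.9]   # readings from healthy runs
    mu, sigma = mean(baseline), stdev(baseline)

    def predict_fault(window, k=3.0):
        # True if the window's average deviates more than k sigmas from baseline
        return abs(mean(window) - mu) > k * sigma

    recent = [53.2, 54.1, 55.0, 54.8]           # latest streaming readings
    if predict_fault(recent):
        print("maintenance alert: sensor trend outside normal range")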
Data management is done with big data technologies, and Hadoop is the backbone of various
big data applications. The data is collected, stored, manipulated, and analyzed. Big data has
three Vs:
Velocity: how fast the data is collected and processed. The Hadoop file system is used to
process the data collected by the sensor objects quickly.
Variety: the different kinds of data, structured, unstructured, and semi-structured, stored in
Hadoop. Data from sensors is an example of structured data; data from social media is
unstructured data.
Volume: the sheer amount of data, ranging from gigabytes to exabytes. Clusters of servers
are used for big deployments.
Types of Data Sources:
Machine data: data generated by the sensors embedded in IoT systems.
Transaction data: data obtained from transactions.
Social data: data obtained from social media such as Facebook and Twitter (huge amounts of
data are generated by social media).
Enterprise data: data from enterprises, structured in nature.
Industrial automation and control systems feed their data into relational databases and
historians. Examples of relational databases include Oracle and Microsoft SQL Server.
Historian databases store the time-series data recorded from sensors. There are newer
technologies for handling data management:
Massively Parallel Processing Databases
NoSQL Databases
Hadoop
Massively Parallel Processing Databases:
The data from enterprises is structured and is stored in relational databases; groups of these
relational databases together constitute data warehouses. MPP is a concept built on top of
relational data warehouses to speed up access and reduce query time. These systems process
data in parallel, which results in faster query processing; MPP databases are also termed
analytic databases. The figure shows the MPP shared-nothing architecture: a master node to
which all the other nodes are connected, where each node has its own processor, memory,
and storage. The whole process is optimized with the help of SQL. Fast processing is the
defining aspect of MPP.
NoSQL (“non SQL” or “not only SQL”) databases store data in a format other than
relational tables. Semi-structured and unstructured data are processed with NoSQL. NoSQL
databases come in several types, including document stores, key-value stores, wide-column
stores, and graph stores.
Document stores: hold semi-structured and unstructured documents (XML and JSON).
Key-value stores: store data as associative arrays, where each key is paired with a value.
Wide-column stores: store data in columns, where the set and format of columns can vary
from row to row.
Graph stores: describe the relationships between elements; well suited for natural language
processing and social media.
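To make these store types concrete, here is a small illustrative sketch of the same sensor reading in each style (all names and values are made up):

    # One reading represented in key-value, document, and graph styles.
    import json

    # Key-value store: an opaque value stored under a key
    kv_pair = ("sensor:42:latest", json.dumps({"temp": 21.5, "ts": 1700000000}))

    # Document store: a self-describing JSON document
    document = {"_id": "sensor-42", "type": "temperature",
                "readings": [{"temp": 21.5, "ts": 1700000000}]}

    # Graph store: two nodes connected by a relationship (edge)
    edge = {"from": "sensor-42", "rel": "INSTALLED_IN", "to": "building-7"}

    print(kv_pair, document, edge, sep="\n")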
Features of NoSQL
Non-relational
Schema-free
Simple API
Distributed
Column-based
1. Column-oriented databases work on columns and are based on the Bigtable paper by
Google. Every column is treated separately.
2. They deliver high performance on aggregation queries such as SUM, COUNT, AVG, and
MIN, since the data is readily available in a column.
3. Column-based NoSQL databases are widely used for data warehouses and business
intelligence.
4. HBase and Hypertable are examples of column-based databases.
Document-Oriented:
1. Document-oriented NoSQL databases store and retrieve data as key-value pairs, but the
value part is stored as a document in JSON or XML format.
2. Where a relational database holds rows and columns, a document database holds
documents with a JSON-like structure.
3. The document type is mostly used for CMS systems, blogging platforms, real-time
analytics, and e-commerce applications.
4. Amazon SimpleDB and MongoDB are popular document-oriented DBMSs.
Graph-Based
1. A graph database stores entities as well as the relations among those entities. An entity is
stored as a node and a relationship as an edge; an edge captures the relationship between
nodes. Every node and edge has a unique identifier.
2. Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature.
3. Graph databases are mostly used for social networks, logistics, and spatial data.
Some specific cases where NoSQL databases are a better choice than an RDBMS include the
following:
When there is a need to store large amounts of unstructured data with changing
schemas.
When you want to make the most of cloud computing and storage.
When you need to develop rapidly.
When a hybrid data environment is available.
4.6 Hadoop:
Hadoop is a more recent data management technology for processing large volumes of data.
The Hadoop system was initially developed to handle millions of websites and enable fast
search. Hadoop has two key elements (HDFS and MapReduce):
Hadoop Distributed File System (HDFS): the system for storing data across different nodes.
MapReduce: the processing engine, which divides a big task into small ones and runs them in
parallel for speed.
The figure depicts a Hadoop cluster; it includes the name nodes and the data nodes.
Name nodes: This node coordinates data adds, deletes, and reads on the HDFS system. The
NameNode takes requests from clients and maps each requested block to the available nodes,
and it instructs the data nodes when to perform replication.
Data nodes: These nodes store the data. The various blocks are distributed across the data
nodes, and the same block is replicated to one or more nodes per the replication policy. This
is done to ensure data redundancy.
Hadoop Architecture
• The Hadoop framework includes the following four modules:
• Hadoop Common: the Java libraries and utilities required by the other Hadoop
modules.
• Hadoop YARN: a framework for job scheduling and cluster resource
management.
• Hadoop Distributed File System (HDFS™): a distributed file system that provides
high-throughput access to application data.
• Hadoop MapReduce: a YARN-based system for parallel processing of large
data sets.
MAPREDUCE:
• Hadoop divides a job into two types of tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
The execution process is controlled by two types of entities:
Job Tracker: acts as the master, responsible for the complete execution of a submitted job.
Multiple Task Trackers: act as slaves, each performing part of the job.
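To illustrate the map/shuffle/reduce flow in miniature, here is a plain-Python word count (Hadoop would run these phases in parallel across task trackers; the input lines are placeholders):

    # MapReduce word count, single-process sketch of the distributed idea.
    from collections import defaultdict

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map: each line -> (word, 1) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group values by key
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group
    counts = {word: sum(vals) for word, vals in groups.items()}
    print(counts)  # {'the': 3, 'quick': 1, ...}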
APACHE KAFKA is a messaging system based on the distributed publisher-subscriber
model. It is a real-time event streaming system that delivers messages to stream processing
engines such as Spark Streaming or Storm. Numerous producers and consumers connect to
the Kafka cluster and exchange information through it: the producers generate the data and
the consumers read the data.
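A hedged producer/consumer sketch with the kafka-python package (the broker address and topic name are placeholders, and a running broker is assumed):

    # Producer publishes an event; consumer reads it back from the topic.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-readings", b'{"temp": 21.5}')
    producer.flush()

    consumer = KafkaConsumer("sensor-readings",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)
        break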
4.8 APACHE SPARK: Spark was introduced by the Apache Software Foundation to speed
up Hadoop's computational processing.
Apache Spark is an open-source distributed processing system used for big data workloads.
It utilizes in-memory caching: tasks run at high speed because the data is kept in fast memory
for read and write operations. It provides development APIs in Java, Scala, Python, and R.
Data is processed in real time; this real-time processing in the Apache Spark project is
termed Spark Streaming. Live streaming and messaging activities are performed on top of
Spark Core, which takes data from Kafka and divides it into small batches (micro-batches).
Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has
its own cluster management computation, it typically uses Hadoop for storage only.
COMPONENTS OF SPARK
Spark SQL
Spark SQL is a component that introduces a data abstraction called SchemaRDD.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, enabled by the
distributed memory-based Spark architecture.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. (User defined graphs)
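A minimal PySpark sketch showing the Spark SQL component in action (the data is a placeholder and pyspark is assumed to be installed):

    # Spark SQL over a tiny in-memory DataFrame.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iot-demo").getOrCreate()

    df = spark.createDataFrame(
        [("sensor-1", 21.5), ("sensor-2", 35.0), ("sensor-1", 22.1)],
        ["device", "temp"])
    df.createOrReplaceTempView("readings")

    # The aggregation is expressed in SQL and executed by Spark's engine.
    spark.sql("SELECT device, AVG(temp) AS avg_temp "
              "FROM readings GROUP BY device").show()
    spark.stop()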
EDGE STREAMING ANALYTICS:
In the era of information technology, every company relies on the cloud to store and retrieve
data and to market its business. IoT integrated with the cloud therefore plays a major role:
the data stored in the cloud is analyzed and various decisions are taken from it. In automobile
racing, the sensors in a car produce an enormous amount of data per second, amounting to
many gigabytes; similarly, weather forecasting involves large volumes of data generated by
numerous sensors.
Edge analytics is the collection, processing, and analysis of data at the edge of a network,
either at the sensor or near it. Retail, manufacturing, and transportation generate huge
volumes of data at the edge of the network. Edge analytics is data analytics performed in real
time, on site, where the data collection is happening. Edge analytics can be descriptive,
diagnostic, or predictive.
Big data refers to the unstructured data collected and stored in the cloud. Big data analytics
is performed on data-center data in the cloud as batch jobs. Edge streaming analytics, in
contrast, lets you analyze and monitor data streaming at the edges and make prediction
decisions wisely. In edge analytics the data is not analyzed at a single edge; it is analyzed
across distributed edge nodes, and each node has to communicate with the others. For
example, streaming analytics on traffic data gives a driver the information needed to take
important decisions. In short, big data analytics is performed on data at rest, while streaming
analytics is performed on data in motion.
Key values of edge streaming analytics:
Reducing data at the edge.
Analysis and response at the edge
Time sensitivity
Real-time data is analyzed by streaming analytics, which is performed in three stages:
1) Raw input data: data from sensors is given as input.
2) Analytics processing unit (APU): takes the data streams and processes them in time
windows, applying analytical functions.
3) Output streams: the output is communicated, for example using the MQTT messaging
protocol.
Filter: The APU filters out irrelevant data and keeps only the data needed for processing.
Transform: The extracted data is formatted for processing.
Time: Because the data flows in real time, a time frame must be imposed. If the data
fluctuates at different times, an average value is calculated over the fluctuating readings
within a given time interval.
Correlate: Data obtained from different sensors is combined into a single record. For
example, data coming from different instruments is finally combined into a single health
record for a patient; combining this real-time data with the patient's historical data gives
insight into the patient's current health condition. This process is called correlation.
Match patterns: Pattern matching aims at alerting the system when an emergency arises. For
example, a matched pattern may alert a nurse through an alarm notification. Machine
learning techniques are adopted to find the matching patterns; a toy sketch of these APU
stages follows.
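All thresholds and readings in this single-process sketch are illustrative:

    # Filter -> Transform -> Time window -> Match pattern, in miniature.
    raw_stream = [{"id": "hr", "val": 72}, {"id": "noise", "val": 1},
                  {"id": "hr", "val": 110}, {"id": "hr", "val": 118},
                  {"id": "hr", "val": 121}]

    relevant = [r for r in raw_stream if r["id"] == "hr"]   # Filter
    values = [float(r["val"]) for r in relevant]            # Transform

    window = values[-3:]                                    # Time window
    avg = sum(window) / len(window)

    if avg > 100:                                           # Match pattern
        print(f"alert: average heart rate {avg:.1f} exceeds threshold")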
Improve business intelligence: Edge analytics improves business intelligence by improving
basic operations, which in turn yields better efficiency.
Advantages
Reduced latency of data analytics.
Scalability of analytics.
The bandwidth needed to transmit all the data collected by thousands of edge
devices would otherwise grow exponentially with the number of devices; edge
analytics reduces overall expense by minimizing that bandwidth.
NETWORK ANALYTICS:
Flow analytics provides several capabilities in an IoT network:
Network traffic monitoring and profiling: lets you analyze the network by
monitoring its traffic and helps rectify problems.
Application traffic monitoring and profiling: monitoring of application traffic
carried by protocols such as MQTT, CoAP, and DNP3.
Capacity planning: analyzing the data over a period of time; this analysis helps
track traffic growth.
Security analysis: performed to detect attacks such as denial of service.
Accounting: software such as Cisco Jasper is used to monitor and account for the
flow of data.
Data warehousing and data mining: data stored in the warehouse is analyzed for
multiservice applications.
NetFlow server for collection and reporting: problems in the network are analyzed
by the server at the NetFlow final destination.
FNF (Flexible NetFlow) is installed on the routers and provides a view of the multiservice
traffic in the IoT network. LoRaWAN cannot perform NetFlow analysis, and MQTT can do
so only with the help of an IoT broker. Challenges arise if the network does not support flow
analytics, or if the additional bandwidth the system requires has to be reviewed.
XIVELY CLOUD FOR IoT:
Xively Python libraries are used to embed Python code against the Xively APIs, and a
Xively web interface is available for creating the front-end part. Xively works with different
programming-language platforms; HTTP, REST APIs, and MQTT are the protocols used
with Xively. All devices are connected to the Xively Cloud for real-time processing and
archiving to the cloud. IoT application developers can write the front end for their IoT
applications as per their requirements, and management of apps is very flexible with the
Xively cloud and its APIs. Xively is popular with companies that deal with IoT device
manufacturing and development; companies using Xively get secure device connectivity and
good data management capability.
Xively is an IoT cloud platform, “an enterprise platform for building, managing, and
deriving business value from connected products”. It is a cloud-based API with an SDK that
simplifies and shortens the development process. It supports several platforms, such as:
Android
Arduino
ARM mbed
C
Java, and many more.
Step 2: Developers can create the different devices for which an IoT app is to be built.
Templates are provided in the Xively web interface.
Step 3: A unique FEED_ID is allocated to each connected device; it identifies the data
stream of that device.
Step 4: IoT devices are assigned using the available APIs, with permissions to perform
create, update, delete, and read operations.
Step 5: Bidirectional channels are created once a device is connected to Xively. Each
channel is unique to the connected device.
Step 7: Xively APIs are used by IoT devices to create communication-enabled
products.
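As a hedged sketch, a device update could be published over MQTT with the paho-mqtt package; the host, credential, and topic below are hypothetical placeholders styled after a Xively feed/channel, not real Xively values:

    # Publish one channel reading to an MQTT-based IoT cloud platform.
    import json
    import paho.mqtt.client as mqtt

    client = mqtt.Client()
    client.username_pw_set("DEVICE_API_KEY", "")     # placeholder credential
    client.connect("broker.example.com", 1883)       # placeholder host

    payload = json.dumps({"channel": "temperature", "value": 21.5})
    client.publish("feeds/FEED_ID/channels/temperature", payload)
    client.disconnect()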
4.13 DJANGO:
Django is a web framework that helps us build better web apps faster and with less code. It
is a high-level Python web framework that encourages rapid development and clean design;
the resulting web applications are fast and perform well. Django focuses on automating as
much as possible and follows the DRY (Don't Repeat Yourself) principle. For developing an
e-commerce website, Django is an excellent choice, and the work gets done very fast. It is
free and open source.
Django architecture: The figure depicts a simple Django framework with templates and a
caching framework.
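A minimal, hedged Django sketch of a view and its URL route (a real project would be generated with django-admin startproject; the names here are illustrative):

    # views.py
    from django.http import JsonResponse

    def latest_reading(request):
        # In a real app this value would come from a model/database query.
        return JsonResponse({"device": "sensor-42", "temp": 21.5})

    # urls.py
    from django.urls import path
    # from myapp.views import latest_reading  # import depends on project layout

    urlpatterns = [
        path("api/latest/", latest_reading),
    ]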
AWS FOR IoT:
Billions of devices are found in homes, factories, oil wells, hospitals, cars, and thousands of
other places. Solutions are needed to connect them and to collect, store, and analyze their
device data. AWS IoT provides broad functionality, spanning the edge to the cloud, for
building IoT solutions across virtually any kind of device. Since AWS IoT integrates with AI
services, the devices become smarter. AWS IoT scales easily with the requirements of the
business, and it provides strong security features and preventive security policies that
respond immediately to all security-related issues.
Alexa Voice Service (AVS) Integration for AWS IoT
o This service brings Alexa Voice to any connected device. AVS for AWS
IoT reduces the cost and complexity of integrating Alexa.
o AVS for AWS IoT enables Alexa built-in functionality on MCUs, such as
the ARM Cortex-M class with less than 1 MB of embedded RAM. To do so,
AVS offloads memory and compute tasks to a virtual Alexa built-in device
in the cloud.
Device gateway
o This feature enables devices to communicate securely with AWS IoT.
Device provisioning service
o This feature allows us to provision devices using a template that describes
the resources required for a device: a thing, a certificate, and one or
more policies.
Message broker
o Devices publish and subscribe securely using the MQTT protocol, including
MQTT over WebSocket; an HTTP REST interface can also be used to publish.
Registry
o Register your devices and associate up to three custom attributes with each
one.
Rules engine
o Provides message processing and integration with other AWS services. An
SQL-based language is used to select data from message payloads, and
then process and send the data to other services, such as Amazon S3,
Amazon DynamoDB, and AWS Lambda.
Security and Identity service
o Provides shared responsibility for security in the AWS Cloud. The message
broker and rules engine use AWS security features to send data securely to
devices or other AWS services.
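As a hedged sketch, a message can be published to the AWS IoT message broker with boto3's iot-data client (the topic and region are placeholders, and AWS credentials must already be configured):

    # Publish a JSON payload to an AWS IoT Core MQTT topic.
    import json
    import boto3

    iot = boto3.client("iot-data", region_name="us-east-1")

    iot.publish(
        topic="factory/line1/temperature",  # placeholder MQTT topic
        qos=1,
        payload=json.dumps({"temp": 21.5}),
    )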
Fig. 4.34: Industrial IoT applications
AWS IoT customers are building industrial IoT applications for predictive quality and
maintenance and to remotely monitor operations.
AWS IoT customers are building connected home applications for home automation, home
security and monitoring, and home networking.
AWS IoT customers are building commercial applications for traffic monitoring, public
safety, and health monitoring.
SYSTEM MANAGEMENT WITH NETCONF-YANG:
NETCONF provides key capabilities such as:
Configuration transactions
Network-wide orchestrated activation
Network-level validation and rollback
Save and restore of configurations
Service provider and enterprise network teams are moving toward a service-oriented
approach to managing their networks. They are adopting the IETF's Network Configuration
Protocol (NETCONF) together with YANG, a data modeling language, to remove the time,
cost, and manual steps involved in network element configuration.
NETCONF is the standard for installing, manipulating, and deleting the configuration of
network devices, while YANG is used to model both the configuration and state data of
network elements. YANG structures data definitions into tree structures and provides many
modeling features, including an extensible type system, formal separation of state and
configuration data, and a variety of syntactic and semantic constraints. YANG data
definitions are contained in modules and provide a strong set of features for extensibility and
reuse.
YANG
YANG is a data modeling language used to model the configuration and state data
manipulated by the NETCONF protocol.
YANG modules contain the definitions of configuration data, state data, RPC calls,
and notifications.
A YANG module defines the data exchanged between the NETCONF client and
server.
A module comprises a number of 'leaf' nodes organized into a hierarchical tree
structure. Leaf nodes are specified using the 'leaf' or 'leaf-list' constructs and are
organized using 'container' or 'list' constructs. The figure depicts the leaf-node
structure (a hierarchical tree).
A YANG module can import definitions from other modules, and constraints can be
defined on the data nodes, e.g., allowed values.
YANG can model both configuration data and state data using the 'config' statement.
This YANG module is a YANG version of the toaster MIB.
The toaster YANG module begins with header information, followed by identity
declarations that define various bread types.
The leaf nodes ('toasterManufacturer', 'toasterModelNumber', and 'toasterStatus') are
defined in the 'toaster' container.
Each leaf-node definition has a type and, optionally, a description and a default value.
The module has two RPC definitions ('make-toast' and 'cancel-toast'). An abridged
sketch of such a module follows.
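For reference, this is simplified from the well-known toaster example; the identity declarations and descriptions mentioned above are omitted:

    module toaster {
      namespace "http://netconfcentral.org/ns/toaster";   // illustrative namespace
      prefix toast;

      container toaster {
        presence "Indicates the toaster service is available";
        leaf toasterManufacturer { type string; config false; }
        leaf toasterModelNumber  { type string; config false; }
        leaf toasterStatus {
          type enumeration { enum up; enum down; }
          config false;                                   // state data, not configuration
        }
      }

      rpc make-toast {
        input {
          leaf toasterDoneness { type uint32 { range "1 .. 10"; } default 5; }
        }
      }
      rpc cancel-toast;                                   // stops toast in progress
    }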