BDA Unit 1
Module -1
Introduction to Big Data
1.1 Need of Big Data
The rise in technology has led to the production and storage of voluminous amounts of data.
Earlier, megabytes (10^6 B) were used, but nowadays petabytes (10^15 B) are used for processing,
analysis, discovering new facts and generating new knowledge. Conventional systems for
storage, processing and analysis face challenges from the large growth in the volume of data, the
variety of data in various forms and formats, increasing complexity, faster generation of data and
the need for quick processing, analysis and usage.
Figure 1.1 shows data usage and growth. As size and complexity increase, the proportion of
unstructured data types also increases.
An example of a traditional tool for structured data storage and querying is an RDBMS. The volume,
velocity and variety (3Vs) of data require a number of programs and tools for analyzing
and processing at a very high speed.
1.2 Big Data
Data is information, usually in the form of facts or statistics, that one can analyze or use for further
calculations. Data is information that can be stored and used by a computer program. Data is information
presented in numbers, letters or other forms. Data is information from a series of observations,
measurements or facts, including behavioral observations.
Web Data
Web data is the data present on web servers (or enterprise servers) in the form of text, images,
videos, audios and multimedia files for web users. A user (client software) interacts with this
data. A client can access (pull) data as responses from a server. A server can also publish (push)
or post data to a client after the client registers a subscription. Internet applications including web sites,
web services, web portals, online business applications, emails, chats, tweets and social networks
provide and consume the web data.
Transactions processing which follows ACID rules (Atomicity, Consistency, Isolation and
Durability)
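The atomicity part of the ACID rules can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and a hypothetical `account` table; the table, the names and the amounts are illustrative and do not come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "A", "B", 30)       # both updates are committed
failed = transfer(conn, "A", "B", 500)  # debit violates the CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the failed transfer the credit to B has already run when the debit from A is rejected, and the rollback removes it again, so the database never shows a half-completed transfer.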
Examples of semi-structured data are XML and JSON documents. Semi-structured data
contain tags or other markers, which separate semantic elements and enforce hierarchies of
records and fields within the data. Semi-structured data do not conform to formal data
model structures, such as the relational database and table models.
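A short sketch of what this looks like in practice, using Python's standard json module. The document contents (student, courses, grades) are made up for illustration.

```python
import json

# Hypothetical semi-structured record: keys act as the tags that mark
# semantic elements, and nesting gives the hierarchy of records and fields.
doc = """
{
  "student": {
    "name": "Asha",
    "courses": [
      {"code": "BDA", "grade": "A"},
      {"code": "ML"}
    ]
  }
}
"""

record = json.loads(doc)

# Fields are reached by walking the hierarchy, not through a fixed table schema.
grades = []
for course in record["student"]["courses"]:
    # The second course has no "grade" field; a fixed table schema
    # would need a NULL in that column instead.
    grades.append((course["code"], course.get("grade", "not recorded")))
```

The two course entries carry different fields, which is the sense in which such data does not conform to a table model.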
Multi-structured data refers to data consisting of multiple formats of data, viz. structured,
semi-structured and/or unstructured data.
Unstructured Data
Unstructured data does not possess data features such as a table or a database.
Examples of unstructured data:
Mobile data: Text messages, chat messages, tweets, blogs and comments
Website content data: YouTube videos, browsing data, e-payments, web store data, user-
generated maps
Satellite images, atmospheric data, surveillance, traffic videos, images from Instagram,
Flickr (upload, access, organize, edit and share photos from any device from anywhere
in the world).
Big Data is high-volume, high-velocity and/or high-variety information that requires new
forms of processing for enhanced decision making, insight discovery and process
optimization.
"A collection of data sets so large or complex that traditional data processing applications
are inadequate." - Wikipedia
"Data of a very large size, typically to the extent that its manipulation and management
present significant logistical challenges." - Oxford English Dictionary
Big Data refers to data sets whose size is beyond the ability of typical database software
tools to capture, store, manage and analyze.
Veracity: quality of data captured, which can vary greatly, affecting its accurate analysis
Big Data can be classified on the basis of its characteristics that are used for designing data
architecture for processing and analytics.
Huge data volumes storage, data distribution, high-speed networks and high-
performance computing
Applications scheduling using open source, reliable, scalable, distributed file system,
distributed database, parallel and distributed computing systems, such as Hadoop or
Spark
Open source tools which are scalable, elastic and provide virtualized environment,
clusters of data nodes, task and thread management
Data management using NoSQL, document database, column-oriented database, graph
database and other forms of databases used as per the needs of the applications, and
in-memory data management using columnar or Parquet formats during program
execution
Data mining and analytics, data retrieval, data reporting, data visualization and
machine learning Big Data tools.
Vertical scalability means scaling up the given system’s resources and increasing the system’s
analytics, reporting and visualization capabilities. This is an additional way to solve problems
of greater complexity. Scaling up means designing the algorithm according to the
architecture that uses resources efficiently.
If x terabytes of data take time t to process, and the code size with increasing complexity increases
by a factor n, then scaling up means that processing takes time equal to, less than or much less than (n * t).
Horizontal scalability means increasing the number of systems working in coherence and
scaling out the workload. Processing different subsets of a large dataset in parallel deploys horizontal
scalability. Scaling out means using more resources and distributing the processing and
storage tasks in parallel. The easiest way to scale up and scale out execution of analytics
software is to implement it on a bigger machine with more CPUs for greater volume, velocity,
variety and complexity of data. The software will generally perform better on a bigger
machine.
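The idea of scaling out, splitting a dataset and distributing the processing in parallel, can be sketched as follows. This is a minimal illustration under stated assumptions: threads stand in for separate computing nodes, and the functions `partition` and `process` are hypothetical names introduced here. In CPython, threads do not speed up CPU-bound work, so the sketch shows the structure of the approach rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def process(chunk):
    """Stand-in for an analytics task that runs on one partition."""
    return sum(chunk)

data = list(range(1, 101))
chunks = partition(data, 4)

# Each worker handles one partition; the partial results are then combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, chunks))

total = sum(partials)
```

Adding more workers and partitions corresponds to scaling out, while giving a single worker more memory or faster CPUs corresponds to scaling up.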
A distributed computing model uses cloud, grid or clusters, which process and analyze big and
large datasets on distributed computing nodes connected by high-speed networks.
Big Data processing uses a parallel, scalable and no-sharing program model, such as
MapReduce, for computations on it.
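The MapReduce program model can be sketched in plain Python with a word count. This is not Hadoop code: the three functions below are illustrative stand-ins for the map, shuffle and reduce phases, and the input lines are made up.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

lines = ["big data needs big tools", "data tools scale out"]

# Each line could be mapped on a different node, since no state is shared.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can run on different nodes without sharing memory, which is what the no-sharing model refers to.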
Distributed computing on multiple nodes   Big data   Large data   Small to medium data
Distributed computing                     Yes        Yes          No
Parallel computing                        Yes        Yes          No
One of the best approaches for data processing is to perform parallel and distributed computing
in a cloud-computing environment.
Cloud resources can be Amazon Web Service (AWS) Elastic Compute Cloud (EC2), Microsoft
Azure or Apache CloudStack.
Cloud computing features include:
on-demand service,
resource pooling,
scalability,
accountability.
Cloud services can be accessed from anywhere and at any time through the Internet.
Cloud Services
There are three types of Cloud Services
Providing access to resources, such as hard disks, network connections, databases storage,
data center and virtual server spaces is Infrastructure as a Service (IaaS).
Some examples are Tata Communications, Amazon data centers and virtual servers.
Apache CloudStack is an open source software for deploying and managing a large network
of virtual machines, and offers public cloud services which provide highly scalable
Infrastructure as a Service (IaaS).
Platform as a Service
Software in the cloud supports and manages the services, storage, networking, deploying,
testing, collaborating, hosting and maintaining of applications.
Examples are Hadoop cloud services (IBM BigInsights, Microsoft Azure HDInsight, Oracle
Big Data Cloud Services).
Software as a Service
Software applications are hosted by a service provider and made available to customers
over the Internet.
Some examples are Google SQL, IBM BigSQL, Microsoft Polybase and Oracle Big Data
SQL.
Cluster Computing
Data analytics needs a number of sequential steps. The Big Data architecture design task
simplifies when using the logical layers approach. Figure 1.2 shows the logical layers and the
functions which are considered in Big Data architecture.
data-processing,
Logical layer 1 (L1) is for identifying data sources, which are external, internal or both.
Layer 2 (L2) is for data ingestion. Data ingestion means a process of absorbing
information, just like the process of absorbing nutrients and medications into
the body by eating or drinking them. Ingestion is the process of obtaining and importing
data for immediate use or transfer. Ingestion may be in batches or in real time using
preprocessing or semantics.
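The difference between ingestion in batches and ingestion in real time can be sketched as below. The data source `sensor_readings` and both ingestion functions are hypothetical and only illustrate the two modes; real ingestion tools add buffering, retries and persistence.

```python
from itertools import islice

def sensor_readings():
    """Hypothetical data source that yields one record at a time."""
    for i in range(7):
        yield {"id": i, "value": 20 + i}

def ingest_in_batches(source, batch_size):
    """Batch ingestion: collect records and hand them over in groups."""
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def ingest_in_real_time(source, handle):
    """Real-time ingestion: pass each record on as soon as it is generated."""
    for record in source:
        handle(record)

batches = list(ingest_in_batches(sensor_readings(), batch_size=3))

streamed = []
ingest_in_real_time(sensor_readings(), streamed.append)
```

With seven records and a batch size of three, the batch mode delivers groups of three, three and one record, while the real-time mode delivers seven individual records.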
Layer 1
Layer 2
Ingestion and ETL processes either in real time, which means store and use the data as
generated, or in batches.
Layer 3
Data storage using Hadoop distributed file system or NoSQL data stores—HBase,
Cassandra, MongoDB.
Layer 4
Data processing software such as MapReduce, Hive, Pig, Spark, Spark Mahout, Spark
Streaming
Layer 5
Data integration
Analytics (real time, near real time, scheduled batches), BPs, BIs, knowledge discovery
Data managing means enabling, controlling, protecting, delivering and enhancing the value of
data and information assets. Reports, analysis and visualizations need well-defined data.
Data management functions include:
2. Data governance, which includes establishing the processes for ensuring the availability,
usability, integrity, security and high-quality of data. The processes enable trustworthy
data availability for analytics, followed by the decision making at the enterprise.
5. Managing data security, data access control, deletion, privacy and security
9. Creation of reference and master data, and data control and supervision
11. Integrated data management, enterprise-ready data creation, fast access and analysis,
automation and simplification of operations on the data,
A source can be internal. Sources can be data repositories, such as database, relational
database, flat file, spreadsheet, mail server, web server, directory services, even text or
files such as comma-separated values (CSV) files. Source may be a data store for
applications
semi-structured
multi-structured or unstructured
Data source for ingestion, storage and processing can be a file, database or streaming
data.
The source may be on the same computer running a program or a networked computer
Structured data sources are SQL Server, MySQL, Microsoft Access database, Oracle
DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server.
The data need high velocity processing. Sources are from distributed file systems.
The sources are of file types, such as .txt (text file), .csv (comma separated values file).
Data may be in the form of key-value pairs, such as hash key-value pairs.
Data Sources - Sensors, Signals and GPS
The data sources can be sensors, sensor networks, signals from machines, devices,
controllers and intelligent edge nodes of different types in the industry M2M communication
and the GPS systems.
Sensors are electronic devices that sense the physical environment. Sensors are devices
which are used for measuring temperature, pressure, humidity, light intensity, traffic in
proximity, acceleration, locations, object(s) proximity, orientations and magnetic intensity,
and other physical states and parameters. Sensors play an active role in the automotive
industry.
RFIDs and their sensors play an active role in RFID based supply chain management, and
tracking parcels, goods and delivery.
High quality data means data that enables all the required operations, analysis, decisions,
planning and knowledge discovery to be done correctly. The five R's are as follows:
Relevancy,
recency,
range,
robustness,
reliability.
Data Integrity
Data integrity refers to the maintenance of consistency and accuracy in data over its usable
life. Software which stores, processes or retrieves the data should maintain the integrity of
data. Data should be incorruptible.
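One common way for software to detect that stored data has been corrupted is to keep a checksum with it. The notes do not name this technique, so the sketch below is an assumption added for illustration; it uses SHA-256 from Python's hashlib, and the record contents are made up.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"order_id=42,amount=100.00"
stored_digest = checksum(original)  # saved alongside the data

# Later, on retrieval: recompute the digest and compare.
retrieved_ok = b"order_id=42,amount=100.00"
retrieved_bad = b"order_id=42,amount=900.00"

is_intact = checksum(retrieved_ok) == stored_digest
is_corrupted = checksum(retrieved_bad) != stored_digest
```

A checksum only detects a change; it does not by itself keep related records consistent with each other, which databases handle through constraints and transactions.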
Data Noise
Outlier
Missing Value
Duplicate value
Data Noise
Noise in data refers to data giving additional meaningless information besides true
(actual/required) information.
Noise is random in character, which means frequency with which it occurs is variable over
time.
Outlier
An outlier in data refers to data which appears to not belong to the dataset, for example,
data that is outside an expected range.
Actual outliers need to be removed from the dataset, else the result will be affected by a
small or large amount.
Missing Values
Another factor affecting data quality is missing values. A missing value implies
data not appearing in the dataset.
Duplicate Values
Another factor affecting data quality is duplicate values. A duplicate value
implies the same data appearing two or more times in a dataset.
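The checks for outliers, missing values and duplicate values can be sketched on a small, made-up list of sensor records. The valid range is an assumption chosen for the example; in practice it comes from knowledge of the data source.

```python
# Hypothetical (record_id, temperature) pairs; None marks a missing value.
records = [(1, 21.5), (2, 22.0), (3, None), (2, 22.0), (4, 250.0), (5, 20.8)]

# Assumed expected range in degrees Celsius for this sensor.
LOW, HIGH = -40.0, 60.0

missing = [rid for rid, v in records if v is None]
outliers = [rid for rid, v in records if v is not None and not (LOW <= v <= HIGH)]

# A record is a duplicate if the same (id, value) pair was seen before.
seen, duplicates = set(), []
for rec in records:
    if rec in seen:
        duplicates.append(rec)
    else:
        seen.add(rec)

# Keep only the first copy of each valid, in-range record.
cleaned, kept = [], set()
for rec in records:
    rid, v = rec
    if v is None or not (LOW <= v <= HIGH) or rec in kept:
        continue
    kept.add(rec)
    cleaned.append(rec)
```

Here record 3 is reported as missing, record 4 as an outlier and the second copy of record 2 as a duplicate, leaving three clean records.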
ELT processing
Data Cleaning
Data cleaning is done before mining of data. Incomplete or irrelevant data may result in
misleading decisions.
Data cleaning tools help in refining and structuring data into usable data. Examples of such
tools are OpenRefine and DataCleaner.
Data Enrichment
"Data enrichment refers to operations or processes which refine, enhance or improve the
raw data."
Data editing refers to the process of reviewing and adjusting the acquired datasets.
Editing methods are (i) interactive, (ii) selective, (iii) automatic, (iv) aggregating and (v)
distribution.
Data Reduction
Data reduction enables the transformation of acquired information into an ordered, correct
and simplified form.
Data wrangling refers to the process of transforming and mapping the data, so that the results from
analytics are appropriate and valuable.
Mapping converts data into another format, which makes it valuable for analytics and data
visualizations.
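A minimal sketch of wrangling, assuming a small made-up CSV input: each row is transformed (whitespace trimmed, names capitalized, amounts converted to numbers) and then mapped into JSON using Python's standard csv and json modules.

```python
import csv
import io
import json

# Hypothetical raw input with inconsistent spacing and capitalization.
raw = "name,city,amount\nasha , Pune ,100\nRavi,Mumbai,250\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    rows.append({
        "name": row["name"].strip().title(),  # transform: tidy the text
        "city": row["city"].strip(),
        "amount": int(row["amount"]),         # transform: text to number
    })

# Map the cleaned rows into another format (JSON) for downstream tools.
as_json = json.dumps(rows, indent=2)
```

The same rows could equally be mapped to key-value pairs or another export format; JSON is used here only as one example.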
JavaScript Object Notation (JSON) as batches of object arrays or resource arrays
Key-value pairs
Hash-key-value pair
Cloud offers various services. These services can be accessed through a cloud client (client
application), such as a web browser, SQL or other client. Figure 1.4 shows data-store export from
machines, files, computers, web servers and web services. The data exports to clouds, such as
IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata Communications or Hadoop cloud
services.
Figure 1.4 Data store export from machines, files, computers, web servers and web
services
Google Cloud Platform provides a cloud service called BigQuery. Figure 1.5 shows the BigQuery
cloud service at Google Cloud Platform. The data exports from a table or partition schema, JSON,
CSV or Avro files from data sources after the pre-processing.
Figure 1.5 BigQuery cloud service at Google cloud platform
The data store first pre-processes data from machine and file data sources. Pre-processing transforms the
data into a table or partition schema, or into supported data formats, for example JSON, CSV and Avro.
The data then exports in compressed or uncompressed data formats.
SQL
An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing
(update, insert or append or delete) databases.
1. Create schema, which is a structure that contains descriptions of objects
(base tables, views, constraints) created by a user. The user can describe the data and
define the data in the database.
2. Create catalog, which consists of a set of schemas which describe the database.
3. Data Definition Language (DDL) for the commands which describe a database, and include
creating, altering and dropping of tables and establishing the constraints. A user can
create and drop databases and tables, establish foreign keys, and create views, stored
procedures and functions in the database.
4. Data Manipulation Language (DML) for commands that maintain and query the database.
A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
5. Data Control Language (DCL) for commands that control a database, and include
administering of privileges and committing. A user can set (grant, add or revoke)
permissions on tables, procedures and views.
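The DDL and DML commands listed above can be tried out with Python's built-in sqlite3 module. The `student` table and its rows are made up for illustration. SQLite does not implement the DCL commands GRANT and REVOKE, so DCL is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure (table and constraints).
conn.execute("""
    CREATE TABLE student (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        marks INTEGER
    )
""")

# DML: insert, update and query the data.
conn.executemany(
    "INSERT INTO student (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 74)],
)
conn.execute("UPDATE student SET marks = ? WHERE id = ?", (78, 2))
conn.commit()

rows = conn.execute("SELECT name, marks FROM student ORDER BY id").fetchall()

# DDL again: drop the table when it is no longer needed.
conn.execute("DROP TABLE student")
```

In a server RDBMS such as MySQL or Oracle, a DCL statement of the form GRANT SELECT ON student TO some_user would then control who may run the query.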
be 'location independent' which means the user is unaware of where the data is located,
and it is possible to move the data from one physical location to another without affecting
the user.
A columnar format in-memory allows faster data retrieval when only a few columns in a
table need to be selected during query processing or aggregation.
Online Analytical Processing (OLAP) in real-time transaction processing is fast when using
in-memory column format tables.
The CPU accesses all columns in a single instance of access to the memory in columnar
format in memory data-storage.
A row format in-memory allows much faster data processing during OLTP
Each row record has corresponding values in multiple columns and the on-line values store
at the consecutive memory addresses in row format.
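The contrast between the two in-memory layouts can be sketched with ordinary Python structures. This is only an illustration of the access patterns, using a made-up table; real in-memory databases work on contiguous typed memory rather than Python lists.

```python
# The same hypothetical table in two in-memory layouts.
row_store = [
    (1, "Asha", 120.0),
    (2, "Ravi", 80.0),
    (3, "Meena", 200.0),
]

column_store = {
    "id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "amount": [120.0, 80.0, 200.0],
}

# Analytical query (OLAP style): total of one column.
# The columnar layout reads only the "amount" list.
total_columnar = sum(column_store["amount"])

# The row layout has to visit every record to pick out the one field.
total_row = sum(record[2] for record in row_store)

# Transactional access (OLTP style): fetch one whole record.
# The row layout returns it in one step; the columnar layout
# must gather one value from each column.
record_row = row_store[1]
record_columnar = tuple(column_store[c][1] for c in ("id", "name", "amount"))
```

Both layouts give the same answers; the difference is how much data has to be touched for each kind of query.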
Enterprise data servers use data from several distributed sources which store data using
various technologies.
Enterprise data integration may also include integration with application(s), such as
analytics, visualization, reporting, business intelligence and knowledge discovery.
Following are some standardised business processes, as defined in the Oracle application-
integration architecture:
Figure 1.6 Steps 1 to 5 in Enterprise data integration and management with Big Data for high
performance computing using local and cloud resources for the analytics, applications and
services
NoSQL databases are considered to hold semi-structured data. A Big Data store uses NoSQL. NoSQL
stands for No SQL or Not Only SQL.
The stores do not integrate with applications using SQL. NoSQL is also used in cloud data stores.
Features of NoSQL are as follows:
It is a class of non-relational data storage systems with flexible data models and multiple
schemas.
Class consisting of ordered keys and semi-structured data storage systems [BigTable, Cassandra
(used in Facebook/Apache) and HBase].
Data written at one node can replicate at multiple nodes, therefore data storage is fault-tolerant.
May relax the ACID rules during the data store transactions.
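Two of these features, flexible schema and replication for fault tolerance, can be illustrated with a toy in-memory store. The classes `Node` and `ReplicatedStore` are hypothetical and written only for this sketch; they do not reflect how Cassandra, HBase or any other real NoSQL system places or replicates data.

```python
class Node:
    """One storage node holding schema-less key-value documents."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

class ReplicatedStore:
    """Toy store that writes every document to `replicas` nodes."""
    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def _targets(self, key):
        # Pick consecutive nodes, starting at a position derived from the key.
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key, document):
        for node in self._targets(key):
            node.data[key] = document

    def get(self, key):
        # Read from the first replica that is still available.
        for node in self._targets(key):
            if node.alive and key in node.data:
                return node.data[key]
        return None

nodes = [Node("n1"), Node("n2"), Node("n3")]
store = ReplicatedStore(nodes, replicas=2)

# Documents need not share the same fields (flexible schema).
store.put("user:1", {"name": "Asha", "city": "Pune"})
store.put("user:2", {"name": "Ravi", "followers": 1200})

# Simulate the failure of the first replica of user:1.
store._targets("user:1")[0].alive = False
after_failure = store.get("user:1")
```

The read still succeeds after one node fails because a second copy exists. The sketch ignores the consistency questions that arise when replicas disagree, which is where relaxing the ACID rules comes in.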
Figure 1.7 Coexistence of RDBMS for traditional server data, NoSQL and Hadoop, Spark
and compatible Big Data Clusters
1.6.3 Big Data Platform
A Big Data platform supports large datasets and volumes of data. The data is generated at a
higher velocity, in more varieties or in higher veracity. Managing Big Data requires large
resources of MPPs, cloud, parallel processing and specialized tools. A Big Data platform
should provision tools and services for:
3. reducing the complexity of multiple data sources and integration of applications into one
cohesive solution,
Data management, storage and analytics of Big data captured at the companies and
services require the following:
5. Massive parallelism
13. Big data sources: Data storages, data warehouse, Oracle Big Data, MongoDB NoSQL,
Cassandra NoSQL
14. Data sources: Sensors, Audit trail of Financial transactions data, external data such as Web,
Social Media, weather data, health records data.
Hadoop
A Big Data platform consists of Big Data storage(s), server(s) and data management and business
intelligence software. Storage can deploy Hadoop Distributed File System (HDFS) or NoSQL data
stores, such as HBase, MongoDB and Cassandra. HDFS is an open source storage system.
It is a scalable, self-managing and self-healing file system.
A stack consists of a set of software components and data store units. Applications,
machine learning algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud
service, such as Amazon EC2, Azure or a private cloud. The stack uses a cluster of high performance
machines.
Types             Examples
MapReduce         Hadoop, Apache Hive, Apache Pig, Cascading, Cascalog, mrjob (Python MapReduce library), Apache S4, MapR, Apple Acunu, Apache Flume, Apache Kafka
NoSQL Databases   MongoDB, Apache CouchDB, Apache Cassandra, Aerospike, Apache HBase, Hypertable
Processing        Spark, IBM BigSheets, PySpark, R, Yahoo! Pipes, Amazon Mechanical Turk, Datameer, Apache Solr/Lucene, ElasticSearch
Servers           Amazon EC2, S3, GoogleQuery, Google App Engine, AWS Elastic Beanstalk, Salesforce Heroku
Storage           Hadoop Distributed File System, Amazon S3, Mesos
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the
goal of discovering useful information, suggesting conclusions and supporting decision making.
Phases in analytics
Analytics has the following phases before deriving the new facts, providing business intelligence
and generating new knowledge.
3. Prescriptive analytics enables derivation of additional value and better
decisions for new option(s) to maximize profits.
4. Cognitive analytics enables derivation of additional value and better
decisions.
Figure 1.9 shows an overview of a reference model for analytics architecture. The figure also
shows on the right-hand side the Big Data file systems, machine learning algorithms and
query languages and usage of the Hadoop ecosystem
Berkeley Data Analytics Stack (BDAS)
Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and
resource management layers. These are listed below:
1. Applications, such as AMP-Genomics and Carat, run at the BDAS. The data processing software
component provides in-memory processing which processes the data efficiently across the
frameworks. AMP stands for Berkeley's Algorithms, Machines and People Laboratory.
3. Resource management software component provides for sharing the infrastructure across
various frameworks.
Figure 1.10 shows a four-layer architecture for the Big Data Stack that consists of Hadoop,
MapReduce, Spark core and SparkSQL, Streaming, R, GraphX, MLlib, Mahout, Arrow and Kafka.
1.7 Big Data Applications
Big Data in Marketing and Sales
Data are important for most aspects of marketing, sales and advertising. Customer Value (CV)
depends on three factors: quality, service and price. Big Data analytics deploys large volumes of
data to identify and derive intelligence about individuals using predictive models. The facts
enable marketing companies to decide what products to sell.
Big Data analytics enables fraud detection. Big Data usage has the following features for enabling
detection and prevention of frauds:
Fusing of existing data at an enterprise data warehouse with data from sources such as social
media, websites, blogs and e-mails, thus enriching existing data
Providing high volume data mining and new innovative applications, thus leading to new
business intelligence and knowledge discovery
The large volume and velocity of Big Data provide greater insights but also carry risks associated with the data
used. The data included may be erroneous, less accurate or far from reality. Analytics introduces new
errors due to such data.
Five data risks, described by Bernard Marr, are data security, data privacy breach, costs affecting
profits, bad analytics and bad data.
Financial institutions, such as banks, extend loans to industrial and household sectors. These
institutions in many countries face credit risks, mainly risks of (i) loan defaults and (ii) delays in the
return of interest and principal amounts. Financing institutions are keen to get insights into the
following:
4. Identifying types of employees (such as daily wage earners in construction sites) and
businesses (such as oil exploration) with greater risks
5. Anticipating liquidity issues (availability of money for further issue of credit and
rescheduling credit installments) over the years.
Big Data analytics in health care uses the following data sources: (i) clinical records, (ii) pharmacy
records, (iii) electronic medical records, (iv) diagnosis logs and notes and (v) additional data, such
as deviations from a person's usual activities, medical leaves from job, social interactions.
Healthcare analytics using Big Data can facilitate the following:
3. Preventing fraud, waste, abuse in the healthcare industry and reduce healthcare costs
(Examples of frauds are excessive or duplicate claims for clinical and hospital treatments.
Example of waste is unnecessary tests. Abuse means unnecessary use of medicines, such
as tonics and testing facilities.)
4. Improving outcomes
5. Monitoring patients in real time.
Big Data analytics deploys large volumes of data to identify and derive intelligence about
individuals using predictive models. Big Data driven approaches help in research in
medicine which can help patients.
Following are some findings:
Building the health profiles of individual patients and predictive models helps in diagnosing
better and offering better treatment.
Aggregating a large volume and variety of information from multiple sources, from
DNA, proteins and metabolites to cells, tissues, organs, organisms and ecosystems,
can enhance the understanding of the biology of diseases. Big Data creates patterns and
models by data mining, and helps in better understanding and research.
Deploying wearable devices data, which the devices record during active as well as inactive
periods, provides a better understanding of patient health and better risk profiling of the user for
certain diseases.
The impact of Big Data on the digital advertising industry is tremendous. The digital advertising
industry sends advertisements using SMS, e-mails, WhatsApp, LinkedIn, Facebook, Twitter and
other media.
Big Data captures data from multiple sources in large volume, velocity and variety, including
unstructured data, and enriches the structured data at the enterprise data warehouse. Big Data
real-time analytics reveal emerging trends and patterns, and give actionable insights for
facing competition from similar products. The data helps digital advertisers to discover
new relationships and less competitive regions and areas.
Success from advertisements depends on data collection, analysis and mining. The new insights
enable the personalization and targeting of online, social media and mobile
advertisements, called hyper-localized advertising.
Advertising on digital media needs optimization. Too much usage can also have a negative effect.
Phone calls, SMSs and e-mail-based advertisements can be a nuisance if sent without appropriate
research on the potential targets. Analytics helps in this direction. The usage of Big Data
after appropriate filtering and elimination is a crucial enabler of Big Data analytics, with
appropriate data, data forms and data handling in the right manner.