Unit 1 (Big Data Analytics)
What is big data
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is still data, just at enormous scale.
why big data
The term Big Data describes a large volume of data, both structured and unstructured. The term itself is quite new, but even before it was coined, companies had been dealing with large-scale data sets for decades, using spreadsheets, feedback forms, and graphs to track customer insights and trends. The only difference today is that we have the right tools and technical experts to gain the benefits of big data.
convergence of key trends
We live in a world of data. Everywhere we turn, it's there, carrying us into the future. Here are the key trends in the world of Big Data and Data Science in 2018.
Did you know the world of technology generates about 2.5 billion gigabytes of data daily? Seems astounding, doesn't it? But here's the twist: that figure comes from IBM, and it dates back to 2012.
If that figure amazed you, prepare to be shaken by a prediction published in an article by Forbes. It described data growing at a rate so astonishing it is borderline fiction.
This is what you can expect: around 1.7 megabytes of new data created every second for every person on this planet.
unstructured data
Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used by a computer program easily. Unstructured data is not organised in a pre-defined manner and has no pre-defined data model, so it is not a good fit for a mainstream relational database.
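To make the contrast concrete, here is a small illustrative Python sketch (all values invented) comparing a structured record, which a relational tool can query directly, with an unstructured one, which a program must first parse:

    # A structured record fits a fixed relational schema: every row has the
    # same columns, so SQL tools can store and query it directly.
    structured_order = {"order_id": 1001, "customer_id": 42, "amount": 19.99}

    # An unstructured record (e.g. a free-text support ticket) has no fixed,
    # machine-readable structure; a program must parse meaning out of raw text.
    unstructured_ticket = (
        "Hi, I ordered the blue kettle last Tuesday and it arrived cracked. "
        "Can I get a refund or a replacement? It cost around $20 I think."
    )

    # Even a simple question ("what was the amount?") is a direct lookup for
    # the structured record but requires text analysis for the unstructured one.
    print(structured_order["amount"])               # direct lookup: 19.99
    print("refund" in unstructured_ticket.lower())  # crude keyword search: True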
industry examples of big data
Below are six examples of big data being used across some of the main industries in the world.
1.) Retail Good customer service and strong customer relationships are vital in the retail industry, and one of the best ways to build and maintain them is through big data analysis. Retail companies need to understand the best techniques to market their products to their customers, the best process to manage transactions and the most efficient, strategic way to bring back lapsed customers in such a competitive industry.
2.) Banking Due to the amount of data streaming into banks from a wide variety of channels, the banking sector needs new means to manage big data. Like the retail industry and all others, it is important to build relationships, but banks must also minimise fraud and risk whilst maintaining compliance.
3.) Manufacturing Manufacturers can use big data to boost their productivity whilst also minimising
wastage and costs - processes which are welcomed in all sectors but vital within manufacturing. There
has been a large cultural shift by many manufacturers to embrace analytics in order to make speedier and more agile business decisions.
4.) Education Schools and colleges that use big data analysis can make a large positive difference to the education system, its employees and students. By analysing big data, schools gain the intelligence needed to implement a better system for evaluating and supporting teachers, to make sure students are progressing and to identify at-risk pupils.
5.) Government Government has large scope to change the community we live in as a whole when utilising big data, for example by dealing with traffic congestion, preventing crime, running agencies and managing utilities. Governments, however, need to address the issues of privacy and transparency.
6.) Health Care Health care is one industry where lives could be at stake if information isn't quick, accurate and, in some cases, transparent enough to satisfy strict industry regulations. When big data is analysed effectively, health care providers can uncover insights that lead to new cures and improve the lives of everyone.
web analytics
Web Analytics is the methodological study of online/offline patterns and trends. It is a technique that
you can employ to collect, measure, report, and analyze your website data. It is normally carried out to
analyze the performance of a website and optimize its web usage.
We use web analytics to track key metrics and analyze visitors’ activity and traffic flow. It is a tactical
approach to collect data and generate reports.
Importance of Web Analytics
We need Web Analytics to assess the success rate of a website and its associated business. Using Web
Analytics, we can −
Assess web content problems so that they can be rectified
Have a clear perspective of website trends
Monitor web traffic and user flow
Demonstrate goal acquisition
Figure out potential keywords
Identify segments for improvement
Find out referring sources
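As a minimal illustration of these metrics, the Python sketch below computes pageviews, unique visitors and referring sources from a toy visit log; the log entries and their layout are invented for the example:

    from collections import Counter

    # Toy visit log: (visitor_id, page, referrer) tuples standing in for the
    # data a web analytics tool would collect.
    visits = [
        ("u1", "/home",    "google"),
        ("u2", "/home",    "twitter"),
        ("u1", "/pricing", "direct"),
        ("u3", "/home",    "google"),
    ]

    pageviews = Counter(page for _, page, _ in visits)            # traffic per page
    unique_visitors = len({visitor for visitor, _, _ in visits})  # distinct users
    referrers = Counter(ref for _, _, ref in visits)              # referring sources

    print(pageviews.most_common(1))  # [('/home', 3)]
    print(unique_visitors)           # 3
    print(referrers)                 # Counter({'google': 2, 'twitter': 1, 'direct': 1})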
big data and marketing
Big data marketing is micro-marketing: it analyzes the consumption patterns, preferences, and information of customers and offers customized benefits to the people most likely to buy the products. Recently it has expanded from tangible products to intangibles such as finance, distribution, medicine, telecommunications, and insurance. It is also used to analyze the political tendencies, preferred pledges, etc. of voters.
fraud and big data
Big data fraud detection is a cutting-edge way to use consumer trends to detect and prevent suspicious
activity. Even subtle differences in a consumer’s purchases or credit activity can be automatically
analyzed and flagged as potential fraud. Using data analytics to detect fraud requires expert knowledge
and computer resources, but is easier than ever, due to improvements in programming languages and
server technology.
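The Python sketch below shows the core idea in its simplest form: flag a transaction that deviates sharply from a customer's usual spending. The amounts and the 3-sigma threshold are illustrative assumptions; production systems use far richer models:

    from statistics import mean, stdev

    history = [23.5, 19.0, 30.2, 25.7, 22.1, 28.4]  # customer's past purchases
    new_amount = 480.00                              # incoming transaction

    # How many standard deviations is the new amount from the usual spend?
    mu, sigma = mean(history), stdev(history)
    z = (new_amount - mu) / sigma

    if abs(z) > 3:  # common "3-sigma" rule of thumb
        print(f"flag for review: z-score {z:.1f}")
    else:
        print("looks normal")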
risk and big data
While it’s easy to get caught up in the opportunities big data offers, it’s not necessarily a cornucopia of
progress. If gathered, stored, or used wrongly, big data poses some serious dangers. However, the key
to overcoming these is to understand them. So let’s get ahead of the curve.
Broadly speaking, the risks of big data can be divided into four main categories: security issues, ethical
issues, the deliberate abuse of big data by malevolent players (e.g. organized crime), and unintentional
misuse.
credit risk management
Banks are essentially financial consultants to their customers, so there is an incentive to know as much about each customer as possible, including the risks associated with the customer's industry, business and management. With machine learning, one can find patterns in customer data that lead to more intelligent questions, anticipate an upcoming risk and, more importantly, mitigate it.
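As a hedged sketch of what "finding patterns with machine learning" can look like, the snippet below fits a logistic regression (using scikit-learn) to a toy set of customer features and scores a new applicant. The features and data are invented for illustration; a real credit model involves far richer data and regulatory controls:

    from sklearn.linear_model import LogisticRegression

    # Invented features: [income (k$/yr), debt-to-income ratio, late payments last year]
    X = [[55, 0.20, 0], [32, 0.55, 3], [78, 0.10, 0], [41, 0.45, 2], [29, 0.60, 4]]
    y = [0, 1, 0, 1, 1]  # 1 = defaulted, 0 = repaid

    model = LogisticRegression().fit(X, y)

    applicant = [[48, 0.35, 1]]
    # Estimated probability of default for the new applicant
    print(model.predict_proba(applicant)[0][1])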
However, jumping into Big Data for the sake of the technology is counterproductive. An institution needs to identify where it currently is and understand the journey it must take in order to make better decisions. If an institution currently uses mainframes and DOS terminals to handle all of its reporting, it is not wise to expect it to jump to sentiment analysis and advanced analytics in a month. The data journey can be started with that end in mind, but remember that it is a marathon, not a sprint.
big data and algorithmic trading
Robots are trading millions of stocks by the millisecond, every day. Thanks to ever-advancing algorithms,
machines buy and sell in worldwide financial markets at unimaginable speeds.
Robotic trading has only grown in recent years. The most popular form of algorithmic trading, high-
frequency trading (HFT), processes historical and real-time market data to know when to buy and sell at
lightning speed.
HFT trading volume grew by 164 percent between 2005 and 2009, according to the NYSE, and according to the Bank of England, HFTs traded 70 percent of the average daily share volume in US equities.
Thanks to big data, these algorithms are becoming smarter at trading billions of dollars every day: the more data the financial robots consume, the better they get at making quick, accurate and profitable investment decisions.
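To give a flavour of rule-based trading logic, here is a deliberately simple Python sketch of a moving-average crossover signal using pandas. The prices are invented and real HFT systems react far faster, but the decide-from-data pattern is the same in spirit:

    import pandas as pd

    prices = pd.Series([100, 101, 103, 102, 105, 107, 106, 109, 111, 110])

    fast = prices.rolling(3).mean()  # short-term trend
    slow = prices.rolling(5).mean()  # long-term trend

    # Rule: hold a long position whenever the fast average is above the slow one.
    signal = (fast > slow).astype(int)
    print(signal.to_list())  # 1 = hold a long position, 0 = stay out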
big data and healthcare
“Big data in healthcare” refers to the abundant health data amassed from numerous sources including
electronic health records (EHRs), medical imaging, genomic sequencing, payor records, pharmaceutical
research, wearables, and medical devices, to name a few. Three characteristics distinguish it from
traditional electronic medical and human health data used for decision-making: It is available in
extraordinarily high volume; it moves at high velocity and spans the health industry’s massive digital
universe; and, because it derives from many sources, it is highly variable in structure and nature. These three characteristics are known as the 3Vs of big data: volume, velocity, and variety.
big data in medicine
Big data is a massive amount of information on a given topic. Big data includes information that is
generated, stored, and analyzed on a vast scale — too vast to manage with traditional information
storage systems. In health care, the move to digitize records and the rapid improvement of medical
technologies have paved the way for big data to have a big impact in the field.
Many industries use big data to learn about their customers and tailor their products or services
accordingly. In health care, big data sources include patient medical records, hospital records, medical
exam results, and information collected by healthcare testing machines (such as those used to perform
electrocardiograms, also known as EKGs).
Biomedical research on public health also provides a large portion of the big data that, if properly
managed and analyzed, can serve as meaningful information for patients, doctors, administrators, and
researchers alike. For example, public health researchers can generate big data to predict and prepare
for future pandemics.
Types of patient-centered healthcare data also include:
Medical records
Dental records
Surgical records
Behavioral data (for example, a patient’s diet)
Biometrics (for example, a patient’s blood pressure)
Living conditions
advertising and big data
The digital advertising industry is evolving like never before. The ability to capture and analyze massive
amounts of structured and unstructured data is helping digital advertisers to discover new
relationships, spot emerging trends, and patterns, and gain actionable insights that lead to competitive
advantage. As a result, traditional advertising is shifting rapidly into the realm of personalized and highly
targeted online and mobile ads—the realm of data-driven marketing. Once dismissed as a “buzzword”,
big data is having a big impact on the digital advertising industry, and here are some reasons why.
Finding order among chaos
Real-time data analysis
More personalized and targeted ads
Hyper-localized advertising
Mergers and acquisitions
big data technologies
Hadoop
Hadoop is one of the best-known open-source software frameworks; it allows distributed processing of large sets of real-time data across clusters of computers with simple programming models. It scales from single servers to thousands of machines by detecting and handling failures at the application layer. There are five current projects available as modules— Hadoop Common, Hadoop Distributed File System, Hadoop YARN, Hadoop MapReduce, and Hadoop Ozone. The framework is written in Java and can process real-time data of any size and format. It is cost-effective and provides efficient service even in severely unfavorable conditions such as cyberattacks or machine crashes.
MongoDB
MongoDB is a document-oriented distributed database facilitating the data management of
unstructured or semi-structured real-time data for application developers. It is one of the most popular
open-source data analysis tools that is utilized to create the most innovative products and services in
the global market. It helps to store data in JSON-like documents that allow flexible and dynamic
schemas. There is a multi-cloud database service for MongoDB known as MongoDB Atlas that provides
top-notch automation and built-in practices to provide continuous availability, elastic scalability as well
as support with regulatory compliance. It also provides a powerful query language for aggregation, geo-
based search, text search, graph search, ad hoc queries, indexing, and many more facilities.
R
R is a programming language and environment for statistical computing and graphics, and another widely used Big Data technology. It provides a diverse range of functionalities to Big Data engineers, statisticians, etc.— linear modeling, non-linear modeling, classical statistical tests, time-series analysis and clustering, as well as graphical techniques. It is a well-designed platform with a wide range of mathematical symbols and formulae available. It facilitates effective data management and offers a large, coherent, integrated collection of tools for real-time data analytics.
Tableau
Tableau is a robust Big Data technology that can connect to several open-source databases, and it even provides a free public option for creating visualizations. This analytics platform offers various attractive features: sharing options, speed that supports extensive operations, integration with more than 250 applications and, most importantly, help in solving big real-time data analytics problems. It is one of the most powerful, secure and flexible end-to-end real-time data analytics platforms. Its product line includes Tableau Prep, Tableau Desktop, Tableau Server, Tableau Online and Tableau Mobile.
Cassandra
Cassandra is an open-source NoSQL database that turns multiple sets of real-time data into material for in-depth analysis. It offers linear scalability with proven fault-tolerance on both commodity hardware and cloud infrastructure. Cassandra ensures no data loss, and failed nodes can be replaced efficiently. It has been tested with replay, fuzz, property-based, fault-injection and performance tests to ensure reliability, and it powers critical deployments with enhanced performance and scalability in the cloud.
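As a minimal sketch of how that fault tolerance is configured in practice, the snippet below uses the DataStax Python driver (cassandra-driver) against an assumed local node. The keyspace and table names are invented, and the replication factor of 3 is the setting that lets the cluster survive failed nodes:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # assumed local Cassandra node
    session = cluster.connect()

    # Replicate every row to 3 nodes so reads and writes survive node failures.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute(
        "CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)"
    )

    session.execute(
        "INSERT INTO demo.events (id, payload) VALUES (uuid(), %s)",
        ("sensor reading",),
    )
    for row in session.execute("SELECT * FROM demo.events LIMIT 5"):
        print(row.id, row.payload)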
Qlik
Qlik provides transparent raw data integration efficiently with automatically aligned data association. It
helps Big Data analysts to detect the potential market trends by integrating embedded and predictive
analysis. It supports a full range of real-time data analytics with the Associative Engine and a governed
multi-cloud architecture. The Associative Engine delivers unlimited combinations of Big Data by indexing every relationship within the data, which helps surface in-depth insights for a better workflow. The Qlik family includes multiple products for the global market— Qlik Replicate, Qlik Compose, Qlik Gold Client, Qlik Enterprise Manager, Qlik Catalog, and Qlik Gold Client for Data Protection.
Splunk
Splunk aims to empower IT, DevOps, and other teams to transform their real-time data from any source at any time. This Big Data technology serves a diverse range of industries— aerospace, education, manufacturing, healthcare, retail, and many more. It helps to transform data into colorful reports, graphs, personalized dashboards, and other data visualizations.
ElasticSearch
ElasticSearch is an open-source database server used for full-text search and real-time data analytics, with an HTTP web interface and schema-free JSON documents. It is one of the best Big Data technologies thanks to its reliability, scalability and speed, and it offers analysts a platform highly optimized for language-based searches. It returns results rapidly by using inverted indices for full-text querying, BKD trees for numeric data, and a column store for real-time data analytics, and it can scale to handle enormous event volumes per second on a 300-node cluster.
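The sketch below indexes a document and runs a full-text query with the official Python client, assuming an Elasticsearch node on localhost; the index and field names are invented for illustration:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local node

    es.index(index="articles", id="1", document={
        "title": "Big Data Basics",
        "body":  "Hadoop and Spark process large datasets in parallel.",
    })
    es.indices.refresh(index="articles")  # make the document searchable immediately

    # The inverted index matches documents containing "spark" without
    # scanning every document's full text.
    hits = es.search(index="articles", query={"match": {"body": "spark"}})
    for hit in hits["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["title"])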
KNIME
KNIME or Konstanz Information Miner is another open-source real-time data analytics technology
written in Java. It consists of several functionalities— data visualization, selective execution of analysis
steps, detecting outcomes, interactive views as well as personalized data models. It also offers ETL
operations with a broad spectrum of integrated tools that are easy to install in the existing computer
systems.
RapidMiner
RapidMiner is a top-notch Big Data platform proficient in delivering transformational business insights to
various industries. It helps to upskill organizations with portability and extensibility. RapidMiner
provides an integrated environment for data preparation, deep learning, text mining as well as
predictive analytics. It is more popular among non-programmers and researchers due to its compatibility
with Apple, Android, Node.js, Flask, and many more. It also provides its own dataset collection and allows the user to load real-time data from the cloud, RDBMS, NoSQL stores, and so on.
introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
open source technologies
This article will give you the top 7 open source big data tools that do this job exceptionally well. These tools help in handling massive data sets and identifying patterns and trends.
So, if you are someone who is looking forward to becoming a part of the big data industry, equip
yourself with these big data tools.
1. Hadoop
Even if you are a beginner in this field, I am sure this is not the first time you’re reading about
Hadoop. It is recognized as one of the most popular big data tools to analyze large data sets as the
platform can send data to different servers. Another benefit of using Hadoop is that it can also run
on a cloud infrastructure.
This open-source software framework is put into use when the volume of data exceeds the available
memory. This big data tool is also ideal for data exploration, filtration, sampling, and
summarization. It consists of four parts:
Hadoop Distributed File System: this file system, commonly known as HDFS, is a distributed file system designed for very high aggregate bandwidth across the cluster.
MapReduce: a programming model for processing big data (a single-machine sketch of the pattern follows this list).
YARN: the platform that manages and schedules all of Hadoop's resources in its infrastructure.
Libraries: the common utilities that allow other modules to work efficiently with Hadoop.
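Here is the promised single-machine Python sketch of the MapReduce pattern: a word count expressed as a map phase, a shuffle that groups values by key, and a reduce phase. Hadoop runs the same logic in parallel over HDFS blocks on many nodes:

    from collections import defaultdict

    documents = ["big data tools", "big data big impact"]

    # Map phase: emit (key, value) pairs, here (word, 1).
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group values by key (Hadoop does this between the phases).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: combine each key's values into a final result.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'big': 3, 'data': 2, 'tools': 1, 'impact': 1}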
2. Apache Spark
The next big name in the industry among big data tools is Apache Spark, because this open source big data tool fills the gaps of Hadoop when it comes to data processing. It is the most preferred tool for data analysis over other types of programs due to its ability to keep large computations in memory, and it is capable of running the complicated algorithms that large data sets demand.
Proficient in handling both batch and real-time data, Apache Spark is flexible to work with HDFS as
well as OpenStack Swift or Apache Cassandra. Often used as an alternative to MapReduce, Spark
can run tasks 100x faster than Hadoop’s MapReduce.
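The classic word count below is a hedged PySpark sketch of the in-memory processing described above; "input.txt" is a placeholder path and a local PySpark installation is assumed:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    counts = (
        sc.textFile("input.txt")               # read lines (local file or HDFS)
          .flatMap(lambda line: line.split())  # map: line -> words
          .map(lambda word: (word, 1))         # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b)     # reduce: sum the 1s per word
    )
    # Intermediate results stay in memory between steps, which is where the
    # speedup over disk-based MapReduce comes from.
    print(counts.take(10))
    spark.stop()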
3. Cassandra
Apache Cassandra is one of the best big data tools for processing structured data sets. Open-sourced in 2008 and now maintained by the Apache Software Foundation, it is recognized as the best open source big data tool for scalability. This big data tool has proven fault-tolerance on cloud infrastructure and commodity hardware, which makes it well suited for critical big data uses.
Additionally, it offers features that few other relational or NoSQL databases can provide, including operational simplicity, availability across cloud zones, high performance, and continuous availability as a data source, to name a few. Apache Cassandra is used by giants like Twitter, Cisco, and Netflix.
4. MongoDB
MongoDB is an ideal modern alternative to traditional databases. A document-oriented database, it is the ideal choice for businesses that need fast, real-time data for instant decisions. One thing that sets it apart from traditional databases is that it uses documents and collections instead of rows and columns.
Thanks to its ability to store data in documents, it is very flexible and can be easily adapted by companies. It can store any data type, be it integers, strings, Booleans, arrays, or objects. MongoDB is easy to learn and provides support for multiple technologies and platforms.
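The short pymongo sketch below illustrates the document model: two records in the same collection can carry different fields, with no schema migration. A MongoDB server on localhost is assumed, and the database, collection and documents are invented:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    orders = client["shop"]["orders"]  # database and collection are created lazily

    orders.insert_many([
        {"customer": "Ana", "items": ["kettle"], "total": 19.99},
        {"customer": "Ben", "total": 5.00, "gift": True},  # different shape: fine
    ])

    # Query documents by field value; no fixed rows and columns required.
    for doc in orders.find({"total": {"$gt": 10}}):
        print(doc["customer"], doc["total"])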
5. HPCC
High Performance Computing Cluster, or HPCC, is a competitor of Hadoop in the big data market. It is one of the open source big data tools under the Apache 2.0 license. Developed by LexisNexis Risk Solutions, its public release was announced in 2011. It delivers a single platform, a single architecture, and a single programming language for data processing. If you are looking to accomplish big data tasks with minimal code, HPCC is your big data tool. It automatically optimizes code for parallel processing and provides enhanced performance. Its uniqueness lies in its lightweight core architecture, which ensures near real-time results without a large-scale development team.
6. Apache Storm
Apache Storm is a free, open-source big data computation system and one of the best big data tools for distributed, real-time, fault-tolerant processing. Benchmarked at processing one million 100-byte messages per second per node, it uses parallel calculations that run across a cluster of machines. Being open source, robust and flexible, it is preferred by medium and large-scale organizations. It guarantees data processing even if messages are lost or nodes of the cluster die.
7. Apache SAMOA
Scalable Advanced Massive Online Analysis (SAMOA) is an open source platform used for mining
big data streams with a special emphasis on machine learning enablement. It supports the Write
Once Run Anywhere (WORA) architecture that allows seamless integration of multiple distributed
stream processing engines into the framework. It allows the development of new machine learning
algorithms while avoiding the complexity of dealing with distributed stream processing engines like
Apache Storm, Flink, and Samza.
cloud and big data
You’ve likely heard the terms “Big Data” and “Cloud Computing” before. If you’re involved with cloud
application development, you may even have experience with them. The two go hand-in-hand, with
many public cloud services performing big data analytics.
With Software as a Service (SaaS) becoming increasingly popular, keeping up-to-date with cloud
infrastructure best practices and the types of data that can be stored in large quantities is crucial. We’ll
take a look at the differences between cloud computing and big data, the relationship between them,
and why the two are a perfect match, bringing us lots of new, innovative technologies, such as artificial
intelligence.
The Difference Between Big Data & Cloud Computing
Before discussing how the two go together, it’s important to form a clear distinction between “Big Data”
and “Cloud Computing”. Although they are technically different terms, they’re often seen together in
literature because they interact synergistically with one another.
Big Data: This simply refers to the very large sets of data that are output by a variety of programs. It can
refer to any of a large variety of types of data, and the data sets are usually far too large to peruse or
query on a regular computer.
Cloud Computing: This refers to the processing of anything, including Big Data Analytics, on the “cloud”.
The “cloud” is just a set of high-powered servers from one of many providers. They can often view and
query large data sets much more quickly than a standard computer could.
mobile business intelligence
Mobile business intelligence (mobile BI) refers to the ability to provide business and data analytics
services to mobile/handheld devices and/or remote users. MBI enables users with limited computing
capacity to use and receive the same or similar features, capabilities and processes as those found in a
desktop-based business intelligence software solution.
MBI works much like a standard BI software/solution but it is designed specifically for handheld users.
Typically, MBI requires a client end utility to be installed on mobile devices, which remotely/wirelessly
connect over the Internet or a mobile network to the primary business intelligence application server.
Upon connection, MBI users can perform queries, and request and receive data. Similarly, clientless MBI
solutions can be accessed through a cloud server that provides Software as a Service business
intelligence (SaaS BI).
Crowdsourcing analytics
Crowdsourcing and big data analytics together can enable organizations to exploit data for making informed business decisions. Crowdsourcing data is an effective way to seek the help of a large audience, usually through the internet, to gather information on how to solve the company's problems and to generate new ideas and innovations. The likely future of crowdsourcing: flexible crowdsourcing platforms will become easy to use and will be seamlessly integrated into learning processes; there will be interdisciplinary collaboration between researchers; crowdsourcing will become part of the non-formal educational system; and expert groups will emerge, since crowdsourcing seems a natural approach to handling big data. There are ample opportunities to build on the success of online media in training— Facebook, YouTube, Wikipedia— and the possibilities are boundless. These orient students towards ideas and collaboration, and give them usable information. To strengthen this process of change there is a need, beyond the actual devices and infrastructure, to acquire the cognitive and behavioural competencies that make online study productive and successful.
inter and trans firewall analytics
Over the last 100 years, supply chains have evolved to connect multiple companies and enable them to collaborate to create enormous value for the end-consumer, via concepts like CPFR, VMI, etc. Decision sciences will witness a similar trend as enterprises begin to collaborate on insights across the value chain. For instance, in the health care industry, rich consumer insights can be generated by collaborating on data and insights from the health insurance provider, the pharmacies delivering the drugs and the drug manufacturer. In fact, this is not necessarily limited to companies within the traditional demand-supply chain. There are instances where a retailer and a social media company can come together to share insights on consumer behaviour that benefit both players. Some of the more progressive companies will take this a step further and work on leveraging the large volumes of data outside the firewall, such as social data, location data, etc. In other words, it will not be very long before internal data and insights from within the enterprise firewall are no longer a differentiator. We call this trend the move from intra- to inter- and trans-firewall analytics. Yesterday companies were doing functional, silo-based analytics. Today they are doing intra-firewall analytics with data within the firewall. Tomorrow they will be collaborating on insights with other companies to do inter-firewall analytics, as well as leveraging the public domain to do trans-firewall analytics.