Module 2

Data is the foundational element for any Business Intelligence (BI), data science, or business analytics initiative, serving as the raw material that enables the creation of
information, insight, and knowledge. Historically, analytics models were built using
expert knowledge with minimal data, but in today’s data-driven era, data has become
essential. While once seen as a challenge to collect, store, and manage, data is now
viewed as one of the most valuable assets an organization can possess, offering deep
insights into customers, competitors, and business operations. Data can vary in size
(small or large), structure (structured or unstructured), and arrival mode (real-time or
batch), which are characteristics of what is often termed Big Data. The readiness of data
for analytics can be assessed along the following dimensions:

1. Data Source Reliability:

o Refers to the originality and trustworthiness of the data source.

o Data should come from the original creator to avoid misrepresentation during
transfers.

o Every movement increases the risk of errors, reducing data integrity and accuracy.

2. Data Content Accuracy:


o Ensures the data is correct and matches the intended purpose.

o Example: Contact info should reflect what the customer actually provided.

3. Data Accessibility:

o Measures how easily data can be accessed when needed.

o Becomes complex with distributed storage systems like data lakes or Hadoop.

4. Data Security and Privacy:

o Protects data from unauthorized access while ensuring availability to authorized users.

o Vital in sensitive sectors like healthcare (e.g., HIPAA compliance).

o Includes proper identification to ensure accurate record access.

5. Data Richness:

o The data should be comprehensive, covering all necessary variables.

o Rich data enables robust predictive and prescriptive analytics.

6. Data Consistency:

o Ensures data from multiple sources are correctly merged.

o Inconsistent merging can lead to mixing data from different entities (e.g., patients).

7. Data Currency / Timeliness:

o Data must be current and recorded near the time of the actual event.

o Prevents errors due to time delays or memory-based misreporting.

8. Data Granularity:

o Refers to the level of detail in the data.

o Fine-grained data is preferred for analytics; aggregated data lacks needed detail.

o Once data is aggregated, it cannot be broken down unless the original exists.

9. Data Validity:

o Describes whether data values match predefined acceptable ranges.

o Example: Gender values should be clearly defined (male, female, other, etc.); a simple check of this kind is sketched after this list.

10. Data Relevancy:

o Variables should be relevant to the specific study or model.

o Relevancy exists on a spectrum, not just relevant/irrelevant.

o Irrelevant data should be avoided as it can mislead analytics models.
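To make a few of these dimensions concrete (validity, content completeness, and currency), here is a minimal pandas sketch; the file name, column names, allowed values, and the 365-day threshold are illustrative assumptions rather than part of the original material.

import pandas as pd

df = pd.read_csv("customer_records.csv")   # illustrative file name

# Data validity: values must fall within a predefined acceptable set.
allowed_gender = {"male", "female", "other"}
invalid_gender = ~df["gender"].isin(allowed_gender)

# Data content accuracy/richness: flag records with missing contact details.
missing_contact = df["phone"].isna() | df["email"].isna()

# Data currency/timeliness: flag records not updated within the last year.
age_days = (pd.Timestamp.now() - pd.to_datetime(df["last_updated"])).dt.days
stale = age_days > 365

print("invalid gender values  :", invalid_gender.sum())
print("incomplete contact info:", missing_contact.sum())
print("stale records          :", stale.sum())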


Data (singular: datum) is a collection of facts from experiments, observations, transactions,
or experiences, and can take forms such as numbers, text, images, or audio. It represents
measurements of variables and is the lowest level of abstraction from which information and
knowledge are derived. Data can be classified as structured or unstructured. Unstructured or
semistructured data includes text, images, and audio, while structured data, which is usable
by data mining algorithms, is divided into categorical (nominal or ordinal) and numeric
(interval or ratio) types. These classifications are illustrated in Figure 3.2.

1. Categorical data consists of labels used to group variables into distinct classes, such as
race, sex, age group, and education level. While some variables can be expressed
numerically, they are often more useful when categorized. Also known as discrete data,
it represents a finite set of values with no continuity, and even numeric labels are
symbolic, not meant for calculations.
2. Nominal data consist of simple codes assigned to objects as labels; the codes themselves
are not measurements. For example, the variable marital status can be generally
categorized as (1) single, (2) married, and (3) divorced. Nominal data can be represented
with binomial values having two possible values (e.g., yes/no, true/false, good/bad) or
multinomial values having three or more possible values (e.g., brown/green/blue,
white/black/Latinx/Asian, single/married/divorced/legal-unions).
3. Ordinal data are labels assigned to objects or events that indicate their rank order, such
as credit scores (low, medium, high), age groups, or education levels. Some predictive
models, like ordinal logistic regression, use this ranking information to improve
classification accuracy.
4. Numeric data represent measurable values like age, number of children, income,
distance, and temperature. These values can be integers (whole numbers) or real
numbers (including fractions). Also called continuous data, numeric data allow for
infinite possible values within a range, unlike discrete data, which are finite and
countable.
5. Ratio data are measurements commonly used in physical sciences and engineering, such
as mass, length, time, energy, and electric charge. They are defined by a meaningful ratio
between values and have a true, nonarbitrary zero point. For example, the Kelvin
temperature scale’s zero (absolute zero) represents the absence of kinetic energy,
making it a true zero value.
6. Interval data are variables that can be measured on interval scales. A common example
of interval scale measurement is temperature on the Celsius scale. In this particular
scale, the unit of measurement is 1/100 of the difference between the melting
temperature and the boiling temperature of water at atmospheric pressure, and the zero
point is arbitrary rather than an absolute zero. A short code sketch of these data types follows.
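To make these distinctions concrete, here is a minimal pandas sketch showing how nominal, ordinal, and numeric (ratio) variables might be declared; the column names and values are made up for illustration.

import pandas as pd

# Illustrative records: marital status (nominal), credit risk (ordinal), income (numeric/ratio).
df = pd.DataFrame({
    "marital_status": ["single", "married", "divorced", "married"],
    "credit_risk": ["low", "high", "medium", "low"],
    "income": [42000.0, 58500.0, 37250.0, 61000.0],
})

# Nominal: labels with no inherent order; the codes are symbolic, not meant for arithmetic.
df["marital_status"] = df["marital_status"].astype("category")

# Ordinal: labels with a meaningful rank order (low < medium < high).
df["credit_risk"] = pd.Categorical(df["credit_risk"],
                                   categories=["low", "medium", "high"],
                                   ordered=True)

# Numeric (ratio): continuous values with a true zero point, suitable for arithmetic.
print(df.dtypes)
print(df["income"].mean())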

Raw real-world data is often dirty, misaligned, complex, and inaccurate, requiring a time-consuming
process called data preprocessing to prepare it for analytics. This phase
typically takes longer than the actual model building and assessment. Data preprocessing
involves collecting relevant data, selecting necessary records and variables, filtering out
unnecessary information, and integrating records from multiple sources—a process
known as data blending. The main phases, sketched in code after the list below, are:
1. Data Cleaning (Data Scrubbing):
o Identifying and handling missing values (imputing or ignoring based on context).
o Detecting and smoothing out noisy values or outliers.
o Addressing inconsistencies using domain knowledge or expert opinion.
2. Data Transformation:
o Normalizing data to reduce bias from variables with large numeric ranges.
o Discretizing or aggregating variables (e.g., converting numeric to categorical or
grouping nominal values).
o Creating new variables to simplify or enhance data (e.g., combining donor and
recipient blood types into a single match variable).
3. Data Reduction:
o Reducing the number of variables (dimensionality reduction) using techniques like
principal component analysis or expert consultation.
o Sampling records to manage large data sets, ensuring samples are representative
using random or stratified sampling.
o Balancing skewed data sets through oversampling or undersampling to improve
model accuracy.
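The three preprocessing steps above can be sketched with pandas and scikit-learn; the file name, the presence of an age column, the percentile cutoffs, and the sampling fraction are all illustrative assumptions rather than a prescribed recipe.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.read_csv("raw_records.csv")        # illustrative file name

# 1. Data cleaning: impute missing numeric values and cap extreme outliers.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for col in num_cols:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)

# 2. Data transformation: normalize numeric ranges and discretize age into groups.
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
df["age_group"] = pd.cut(df["age"], bins=[0, 0.33, 0.66, 1.0],
                         labels=["young", "middle", "senior"],
                         include_lowest=True)

# 3. Data reduction: project numeric variables onto a few principal components
#    and draw a representative random sample of the records.
components = PCA(n_components=3).fit_transform(df[num_cols])
sample = df.sample(frac=0.1, random_state=42)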
Big Data refers to the massive and rapidly growing volumes of data that exceed the
processing capabilities of traditional hardware and software. While the term originally
described huge datasets managed by organizations like Google or NASA, it is relative and
depends on an organization’s size and needs. Big Data includes both structured and
unstructured data from diverse sources such as web logs, sensors, social media, scientific
research, and more. The continuous growth in data volume—from terabytes to
exabytes—along with new technologies for storage and analysis, has made Big Data a
key driver for innovation and business insights. However, Big Data is more than just size;
it also involves variety, velocity, veracity, variability, and value, making it a complex and
evolving concept rather than just a buzzword.

Big Data is typically defined by three “V”s: volume, variety, velocity. In addition to these
three, we see some of the leading Big Data solution providers adding other “V”s, such as
veracity (IBM), variability (SAS), and value proposition.

Volume
Volume is the most prominent characteristic of Big Data, driven by factors like
transaction records, social media, sensors, RFID, and GPS data. While data storage used
to be a major challenge, advancements in technology and lower storage costs have
shifted the focus to identifying relevant data and extracting value from it. The definition
of "big" is relative and changes over time, with data volume increasing from 0.8
zettabytes (ZB) in 2009 to an expected 44 ZB in 2020. The rise of IoT and sensors may
push these numbers even higher, bringing both significant challenges and opportunities.

Variety
Variety in Big Data refers to the wide range of data formats—structured, semi-structured,
and unstructured—such as text, audio, video, emails, and sensor data. Around 80–85%
of organizational data is unstructured, making it challenging to analyze with traditional
tools, but still valuable for decision-making.

Velocity
Velocity in Big Data refers to the speed at which data is generated and processed. With
technologies like sensors and GPS, data must often be analyzed in real time. Quick
reaction is crucial, as data loses value over time. While many focus on analyzing stored
data ("at-rest analytics"), real-time processing ("in-motion analytics") is becoming
increasingly important and valuable in time-sensitive situations.

Veracity
Veracity in Big Data refers to the accuracy, quality, and trustworthiness of the data. Since
data can be inconsistent or unreliable, tools and techniques are used to improve its
quality and generate trustworthy insights.

Variability
Variability in Big Data refers to the inconsistent and unpredictable data flows, often
caused by trends, events, or seasonal spikes. These sudden surges, especially from social
media, make data management more complex and challenging.

Value Proposition
The true value of Big Data lies in its potential to reveal deeper patterns and insights than
small data, leading to better business decisions. Big Data enables advanced analytics
("big analytics") that go beyond simple tools, providing greater value. As the field
evolves, more characteristics may be added, but the value proposition of Big Data in
driving insights and decision-making remains essential.

Big Data alone holds no value unless organizations can analyze it to gain actionable
insights, which is where Big Data analytics becomes essential. Traditional data platforms
struggle with limitations in processing volume, integrating diverse and fast-moving data
sources, and handling unstructured formats that don’t fit predefined schemas. As data
velocity, variety, and volume increase, businesses must adopt new technologies—such as
real-time analytics, schema-on-demand systems, and scalable storage—to stay
competitive and extract meaningful value from their data.
The following are the most critical success factors for Big Data analytics:
1. A clear business need (alignment with the vision and the strategy). Business investments
ought to be made for the good of the business, not for the sake of mere technology
advancements. Therefore, the main driver for Big Data analytics should be the needs of the
business, at any level—strategic, tactical, and operations.
Example: An e-commerce company wants to reduce cart abandonment. Instead of just
building fancy dashboards, they use Big Data analytics to study customer clickstreams and
identify why people drop out before purchase.
2. Strong, committed sponsorship (executive champion). It is a well-known fact that if you
don’t have strong, committed executive sponsorship, it is difficult (if not impossible) to
succeed. If the scope is a single or a few analytical applications, the sponsorship can be at
the departmental level. However, if the target is enterprise wide organizational
transformation, which is often the case for Big Data initiatives, sponsorship needs to be at
the highest levels and organization wide.
Example: A hospital wants to implement predictive analytics for early disease detection. If
the hospital director supports it, budgets and staff can be allocated, ensuring success.
3. Alignment between the business and IT strategy. It is essential to make sure that the
analytics work is always supporting the business strategy, and not the other way around.
Example: If a retail chain’s business goal is to improve customer loyalty, IT should build
recommendation systems (using Big Data) that align with that, instead of wasting resources
on unrelated technologies.
4. A fact-based decision-making culture. In a fact-based decision-making culture, the numbers
rather than intuition, gut feeling, or supposition drive decision making.
To create a fact-based decision-making culture, senior management needs to:
• Recognize that some people can’t or won’t adjust
• Be a vocal supporter
• Stress that outdated methods must be discontinued
• Ask to see what analytics went into decisions
• Link incentives and compensation to desired behaviors
o Example: In a bank, loan approvals are no longer based on gut feeling of managers.
Instead, Big Data models analyze credit history, transactions, and spending patterns
to decide loan eligibility.
o Management enforces this by asking: “Show me the data behind your decision.”

5. A strong data infrastructure, traditionally built on data warehouses, is evolving in the Big
Data era with the addition of new technologies. To succeed, organizations must integrate
both old and new systems into a unified, efficient framework. As data size and complexity
grow, the demand for faster and more powerful analytics has led to the rise of
high-performance computing, which includes advanced techniques and platforms designed to
meet the intensive computational needs of Big Data.
Example:
• An airline company stores historical booking data in a warehouse.
• It also uses real-time flight sensor data from aircraft engines with Hadoop + Spark to predict
maintenance needs.
• By integrating both (old + new), the airline prevents engine failures and reduces delays.
This shows how old systems (warehouses) and new systems (Big Data platforms) must work
together.

As the size and complexity increase, the need for more efficient analytical systems is also
increasing. To keep up with the computational needs of Big Data, a number of new and
innovative computational techniques and platforms have been developed.
These techniques are collectively called high-performance computing, which includes the
following:

1. In-memory analytics: Solves complex problems in near real time with highly accurate
insights by allowing analytical computations and Big Data to be processed in-memory and
distributed across a dedicated set of nodes.
Example: Retailers analyzing millions of transactions in real time to adjust discounts instantly
during a sale.

2. In-database analytics: Speeds time to insights and enables better data governance by
performing data integration and analytic functions inside the database so you won’t have to
move or convert data repeatedly.
Example: A bank running fraud detection models directly inside its customer transaction
database → faster fraud alerts.

3. Grid computing: Promotes efficiency, lower cost, and better performance by processing
jobs in a shared, centrally managed pool of IT resources.
Example: A research lab analyzing genetic data by splitting tasks across many computers,
reducing processing time.

4. Appliances: Brings together hardware and software in a physical unit that is not only fast
but also scalable on an as-needed basis.
Example: A telecom company using a Big Data appliance to analyze call records and detect
network issues quickly.
When considering Big Data projects and architecture, being mindful of the following challenges
will make the journey to analytics competency a less stressful one.

Data volume: The ability to capture, store, and process a huge volume of data at an
acceptable speed so that the latest information is available to decision makers when they
need it.

Data integration: The ability to combine data that is not similar in structure or source and to
do so quickly and at a reasonable cost.

Processing capabilities: The ability to process data quickly, as it is captured. The traditional
way of collecting and processing data may not work. In many situations, data needs to be
analyzed as soon as it is captured to leverage the most value. (This is called stream analytics,
which will be covered later in this chapter.)

Data governance: The ability to keep up with the security, privacy, ownership, and quality
issues of Big Data. As the volume, variety (format and source), and velocity of data change,
so should the capabilities of governance practices

Skills availability: Big Data is being harnessed with new tools and is being looked at in
different ways. There is a shortage of people (often called data scientists) with skills to do the
job.

Solution cost: Because Big Data has opened up a world of possible business improvements, a
great deal of experimentation and discovery is taking place to determine the patterns that
matter and the insights that turn into value. To ensure a positive return on investment, it is
crucial to keep the cost of the solutions used to find that value under control.

Big Data analytics addresses key business challenges such as process efficiency, cost
reduction, customer experience, and risk management, with priorities varying by industry. It
also supports brand management, revenue growth, cross-selling, and up-selling.

• Process efficiency and cost reduction


• Brand management
• Revenue maximization, cross-selling, and up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities

There are a number of technologies for processing and analyzing Big Data, but most have
some common characteristics (Kelly, 2012). Namely, they take advantage of commodity
hardware to enable scale-out and parallel-processing techniques; employ nonrelational data
storage capabilities to process unstructured and semistructured data; and apply advanced
analytics and data visualization technology to Big Data to convey insights to end users. The
three Big Data technologies that stand out that most believe will transform the business
analytics and data management markets are Hadoop, MapReduce, and NoSQL.

Hadoop
Hadoop is an open-source tool that stores and processes large data by splitting it across
many cheap computers. Created at Yahoo! and inspired by Google’s MapReduce, it handles
huge datasets efficiently. Managed by Apache, it allows fast, parallel data processing.

How Does Hadoop Work?

A client accesses different types of data, including log files, social media, and internal
systems. Hadoop breaks this data into smaller parts and stores them across multiple
inexpensive machines using the Hadoop Distributed File System (HDFS). HDFS stores large
volumes of unstructured or semi-structured data and creates multiple copies of each data
part for reliability in case a machine fails. A Name Node manages the system, tracking where
data is stored and which nodes are active.
To analyze the data, the client runs a "Map" job using MapReduce, typically written in Java. A
Job Tracker sends this job to the right nodes based on guidance from the Name Node. Each
node processes its part of the data in parallel. Once processing is done, the client starts a
"Reduce" job to gather and combine the results from each node.
The final results are stored and can be used in analytical tools or transferred to databases or
data warehouses for deeper analysis or reporting. Data scientists can then explore the data
to find patterns, insights, or build applications.
MapReduce is a technique developed by Google to process large and complex data files by
dividing the work across many machines. It works by breaking tasks into small parts that run
in parallel on hundreds or even thousands of computers, which makes processing faster.
According to the original Google paper by Dean and Ghemawat, MapReduce is a programming
model, not a programming language. It helps programmers process large data sets without
needing deep knowledge of parallel or distributed systems.
In short, MapReduce lets developers handle big data efficiently by splitting work across many
computers and running tasks at the same time.
Here’s how it works:
1. Input split: The system breaks the input into smaller parts (splits). In real cases, there are
many splits.
2. Map phase: Each split is processed by a map function on a different machine. In the
illustrative example, the map function groups colored squares by color.
3. Shuffle and sort: The system collects and organizes the output from all the map functions.
4. Reduce phase: A reduce function adds up the number of squares for each color.

Though this example uses one reduce function, in practice, there can be more. To
improve speed, programmers can also create custom shuffle/sort logic or use a combiner
to reduce the amount of data transferred between steps.
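To make the four phases concrete, here is a minimal, single-machine Python sketch that mimics map, shuffle/sort, and reduce for the color-counting example; on a real Hadoop cluster the map and reduce functions would typically be written in Java and run in parallel across many nodes.

from collections import defaultdict
from itertools import chain

# Input split: the input is broken into smaller parts.
splits = [["red", "blue", "red"], ["green", "blue"], ["red", "green", "green"]]

# Map phase: each split is processed independently, emitting (color, 1) pairs.
def map_phase(split):
    return [(color, 1) for color in split]

mapped = [map_phase(s) for s in splits]    # would run in parallel on different machines

# Shuffle and sort: collect and group the intermediate output by key (color).
groups = defaultdict(list)
for color, count in chain.from_iterable(mapped):
    groups[color].append(count)

# Reduce phase: add up the counts for each color.
result = {color: sum(counts) for color, counts in groups.items()}
print(result)    # {'red': 3, 'blue': 2, 'green': 3}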

MapReduce helps organizations process and analyze large amounts of complex data. It’s
used in tasks like search indexing, graph and text analysis, machine learning, and data
transformation—jobs that are hard to do with standard SQL in relational databases.
Because MapReduce is procedural, skilled programmers find it easy to use. It also
handles parallel computing automatically, so developers don’t need to manage it
themselves.

While MapReduce is built for programmers, non-programmers can still benefit from
ready-made applications and libraries. There are both commercial and open-source
options. For example, Apache Mahout is an open-source library that uses MapReduce
for machine learning tasks like clustering and classification.

In addition to MapReduce, a Hadoop “stack” is made up of a number of components,
which include the following:

• Hadoop Distributed File System (HDFS): The default storage layer in any given Hadoop
cluster.

• Name Node: The node in a Hadoop cluster that provides the client information on where
in the cluster particular data is stored and if any nodes fail.

• Secondary Node: A backup to the Name Node, it periodically replicates and stores data
from the Name Node should it fail.

• Job Tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce
jobs or the processing of the data.

• Worker Nodes: The grunts of any Hadoop cluster, worker nodes store data and take
direction to process it from the Job Tracker.

Hive

Hive is a Hadoop-based data warehousing–like framework originally developed by
Facebook. It allows users to write queries in an SQL-like language called HiveQL, which
are then converted to MapReduce. This allows SQL programmers with no MapReduce
experience to use the warehouse and makes it easier to integrate with business intelligence
(BI) and visualization tools such as Microstrategy, Tableau, Revolutions Analytics, and
so forth.

Pig

Pig is a Hadoop-based query language developed by Yahoo! It is relatively easy to learn
and is adept at very deep, very long data pipelines (a limitation of SQL).

HBase

HBase is a nonrelational database that allows for low-latency, quick lookups in Hadoop.
It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts,
and deletes. eBay and Facebook use HBase heavily.
Flume

Flume is a framework for populating Hadoop with data. Agents are populated throughout
one’s IT infrastructure—inside Web servers, application servers, and mobile devices, for
example—to collect data and integrate it into Hadoop.

Oozie

Oozie is a workflow processing system that lets users define a series of jobs written in
multiple languages—such as MapReduce, Pig, and Hive—and then intelligently link them
to one another. Oozie allows users to specify, for example, that a particular query is only
to be initiated after specified previous jobs on which it relies for data are completed.

Ambari

Ambari is a Web-based set of tools for deploying, administering, and monitoring
Apache Hadoop clusters. Its development is being led by engineers from Hortonworks,
which includes Ambari in its Hortonworks Data Platform.

Avro

Avro is a data serialization system that allows for encoding the schema of Hadoop files. It
is adept at parsing data and performing remote procedure calls.

Mahout

Mahout is a data mining library. It takes the most popular data mining algorithms for
performing clustering, regression testing, and statistical modeling and implements them
using the MapReduce model.

Sqoop

Sqoop is a connectivity tool for moving data from non-Hadoop data stores—such as
relational databases and data warehouses—into Hadoop. It allows users to specify the
target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata,
or other relational databases to the target.

HCatalog

HCatalog is a centralized metadata management and sharing service for Apache Hadoop.
It allows for a unified view of all data in Hadoop clusters and allows diverse tools,
including Pig and Hive, to process any data elements without needing to know physically
where in the cluster the data is stored.

Hadoop allows organizations to process and analyze massive amounts of unstructured and
semi-structured data cost-effectively. It can scale to petabytes or exabytes, enabling full
data analysis instead of just samples. Data scientists benefit from its iterative analysis
capabilities, and it's easy to get started since Hadoop is free to download.

However, Hadoop is still maturing and requires skilled developers and data scientists to
manage. There's a shortage of such talent, and the open-source nature of Hadoop can lead
to version fragmentation (forking). It also lacks real-time processing since it's batch-
oriented.

To address this, Apache Spark was developed for faster, real-time data processing.
Despite challenges, Hadoop and related technologies are improving quickly thanks to
community contributions. Companies like Cloudera, Hortonworks, IBM, and Microsoft
are making enterprise-ready tools, while others are developing NoSQL systems for near
real-time insights alongside Hadoop and Spark.

Hadoop (or more appropriately Apache Hadoop), as described in more detail above, is one
of the first successfully developed and deployed, and best-known, frameworks for coping with
Big Data. It can handle not only very large sizes of data (i.e., volume) but also a wide
range of data types (i.e., variety) created at an unprecedented speed (i.e., velocity). Key
benefits of Hadoop include (1) handling Big Data with commodity hardware, (2)
preventing loss of data and information due to hardware failures through replication, (3)
scaling from a small cluster to a very large analytics system, and (4) enabling the
discovery of knowledge from Big Data cost-effectively and efficiently.

Apache Spark is an open-source Big Data system designed to be faster and more efficient
than Hadoop by using in-memory processing (RAM) instead of disk storage. It breaks big
tasks into smaller parts handled by many nodes. Spark supports fast SQL queries,
streaming, machine learning, and graph processing with easy-to-use APIs. While
Hadoop’s ecosystem includes HDFS, YARN, MapReduce, and Core, Spark’s ecosystem
has Spark Core, SQL, Streaming, MLlib, and GraphX. Spark offers a faster, unified
alternative to Hadoop.
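As a rough illustration of how Spark is used (assuming a local Spark installation; the file path and column names are hypothetical), a PySpark job that caches data in memory and runs a Spark SQL-style aggregation might look like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a cluster the master would point to YARN or similar.
spark = SparkSession.builder.appName("spark-sketch").master("local[*]").getOrCreate()

# Load a CSV of transactions; the path and columns are illustrative.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Keeping the DataFrame in memory, rather than writing intermediate results to disk,
# is the main source of Spark's speed advantage described above.
df.cache()

# Total spend per customer, computed in parallel across partitions.
totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_spent"))
totals.orderBy(F.desc("total_spent")).show(10)

spark.stop()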

• Performance: Spark is faster because it uses random access memory (RAM) instead of
reading and writing intermediate data to disks. In contrast, Hadoop stores data on multiple
sources and processes it in batches via MapReduce.

• Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data
processing. Spark runs at a higher cost because it relies on in-memory computations for
real-time data processing, which requires it to use high quantities of RAM to spin up
nodes.

• Parallel processing: Though both platforms process data in parallel in a distributed
environment, Hadoop is ideal for batch processing and linear data processing, whereas
Spark is ideal for real-time processing and processing live unstructured data streams.

• Scalability: When data volume rapidly grows, Hadoop quickly scales to accommodate
the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the
fault-tolerant HDFS for large volumes of data.
• Security: Spark enhances security with authentication via shared secret or event
logging, whereas Hadoop uses multiple authentication and access control methods.
Though, overall, Hadoop is more secure, Spark can integrate with Hadoop to reach a
higher security level.

• Analytics: Spark is the superior platform in this category because it includes MLlib,
which performs iterative in-memory ML computations. It also includes tools that perform
regression, classification, persistence, pipeline construction, evaluation, etc.

There are misconceptions when it comes to comparing Hadoop and Spark. Here are a few
of the most common ones:

• Spark replaced Hadoop. Although there is a partial truth to it, it is not entirely correct.
Both Hadoop and Spark have their respective use cases (as mentioned below).

• Hadoop is a database and Spark is an analytics engine. This is not true as both Hadoop
and Spark can be used as both data management and data analysis tools.

• Hadoop is cheaper. Although both Hadoop and Spark are free open-source frameworks,
professional installation and use of these tools are anything but free, requiring significant
investment in consultancy, data scientists, and appropriate hardware infrastructure.

• Spark is always an order of magnitude (up to 100 times) faster than Hadoop. That is not true.
Although for small data processing tasks Spark can perform 100 times faster than
Hadoop, its efficiency diminishes significantly for very large data processing tasks.

So, how do we decide when to use Hadoop and when to use Spark:

Use Hadoop for:

• Processing big data sets in environments where data size exceeds available memory

• Batch processing with tasks that exploit disk read and write operations

• Building data analysis infrastructure with a limited budget

• Completing jobs that are not time-sensitive

• Historical and archive data analysis

Use Spark for:

• Dealing with chains of parallel operations by using iterative algorithms

• Achieving quick results with in-memory computations

• Analyzing stream data in real time

• Graph-parallel processing to model data


• All ML applications

NoSQL

NoSQL databases, like Hadoop, handle large volumes of multistructured data but focus
on fast access to discrete data for Big Data applications, unlike relational databases which
struggle at scale. Often, NoSQL works with Hadoop—for example, HBase runs on
Hadoop’s HDFS to enable quick data lookups. However, most NoSQL databases sacrifice
ACID compliance for performance and scalability and currently lack mature management
tools. Efforts by open-source communities and vendors are improving these issues.
Popular NoSQL databases include HBase, Cassandra, MongoDB, Accumulo, Riak,
CouchDB, and DynamoDB.
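For illustration only (assuming a locally running MongoDB instance and the pymongo driver; the database and collection names are hypothetical), a low-latency, key-based lookup in a NoSQL document store looks like this:

from pymongo import MongoClient

# Connect to a local MongoDB instance (illustrative URI).
client = MongoClient("mongodb://localhost:27017")
db = client["retail"]                      # hypothetical database name

# Store a customer profile as a schemaless document.
db.customers.insert_one({"_id": "c123", "name": "Ada", "segment": "gold"})

# Fast lookup of a discrete record by key, without joins or a fixed schema.
print(db.customers.find_one({"_id": "c123"}))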

In recent years, many community-driven projects have emerged, often focused on
algorithms, programming languages like Python and R, and tools like KNIME and
Orange. Some of these projects aim to solve social and environmental issues and are
called “data for good” projects. They use public data and data science techniques to create
new insights and solutions that help the environment, large groups, and disadvantaged
communities. Table 3.5 shows some of the most well-known of these projects.

Big Data is defined not only by volume and variety but also by velocity—the rapid speed
at which data is generated and streamed for analysis. Traditional analytics, which work on
stored data, often lead to delayed or inaccurate decisions in fast-changing environments.
Therefore, real-time analysis of streaming data is critical for timely, relevant actions.
However, storing all generated data is increasingly impractical due to exploding data
volumes and limited storage capacity. This challenge gave rise to stream analytics—the
process of extracting actionable insights from continuous data flows (streams) without
permanent storage. Streams consist of tuples (data units), and analysis often involves
sliding windows of recent tuples to detect meaningful patterns quickly. Stream analytics
is gaining traction because of the need for rapid time-to-action and improved technology
to process data as it is created.
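A minimal sketch of the sliding-window idea follows, using hypothetical sensor readings and an arbitrary alert threshold; a production system would rely on a dedicated stream-processing engine rather than plain Python.

from collections import deque
from statistics import mean

WINDOW_SIZE = 5        # number of most recent tuples kept in the sliding window
THRESHOLD = 80.0       # hypothetical alert threshold

window = deque(maxlen=WINDOW_SIZE)   # oldest tuples fall off automatically

def on_new_tuple(reading):
    """Process each tuple as it arrives, without storing the full stream."""
    window.append(reading)
    if len(window) == WINDOW_SIZE and mean(window) > THRESHOLD:
        print(f"Alert: average of last {WINDOW_SIZE} readings is {mean(window):.1f}")

# Simulated stream of sensor readings (illustrative values).
for value in [72.0, 75.5, 79.0, 83.2, 85.1, 88.4, 70.2]:
    on_new_tuple(value)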

A key application is in the energy sector, particularly smart grids, where streaming data
from smart meters, sensors, and weather models enables real-time electricity demand and
production predictions. This allows optimized power distribution, handling of unexpected
demand spikes, and dynamic pricing adjustments, improving efficiency and customer
satisfaction.

Companies like Amazon and eBay collect and analyze every customer action on their
websites—page visits, product views, searches, and clicks—to maximize value from each
visit. By processing this real-time stream of data quickly, they can turn casual browsers
into buyers and even repeat shoppers. Even non-members start receiving personalized
product and bundle offers after just a few clicks. Advanced analytics continuously crunch
these massive clickstreams, along with data from thousands of others, to predict customer
interests—sometimes before customers themselves realize them—and create effective,
targeted offerings.

Telecommunication
Telecom companies generate huge volumes of call detail records (CDR), once used mainly
for billing, but now seen as a rich source of insights. By analyzing CDR with social
network analysis, they can prevent churn by identifying influencers, leaders, and
followers, since leaders shape customer perception positively or negatively. This helps in
managing customer bases, recruiting new members, and maximizing existing value.
When combined with social media sentiment data, CDR streams can assess marketing
campaign effectiveness, enabling quick responses to negative impacts or boosting
positive ones. Similar analysis of Internet protocol detail records allows telecoms offering
both services to optimize holistically, achieving major market gains.

Streams of Big Data greatly enhance crime prevention, law enforcement, and security by
enabling applications like real-time situational awareness, multimodal surveillance,
cybersecurity detection, legal wiretapping, video surveillance, and face recognition. In
enterprises, streaming analytics can also be applied to information assurance, helping
detect and prevent network intrusions, cyberattacks, and other malicious activities by
analyzing network logs and internet activity in real time.

With the rise of smart meters, power utilities now collect massive real-time data, moving
from monthly to 15-minute (or faster) readings. These meters and sensors send data to
control centers for real-time analysis, helping optimize supply chain decisions like
capacity, distribution, and energy trading based on usage and demand patterns. Utilities
can also integrate weather and environmental data to optimize renewable power
generation and improve demand forecasting across regions. Similar advantages extend to
other utilities such as water and natural gas.

Financial service companies use Big Data stream analysis for faster decisions,
competitive advantage, and regulatory oversight. By analyzing massive, fast-moving
trading data at very low latency across markets, they gain an edge in making split-second
buy/sell decisions for major financial gains. Stream analytics also enables real-time trade
monitoring to detect fraud and illegal activities.

Modern medical devices (e.g., ECGs, blood pressure, oxygen, sugar, and temperature
monitors) generate high-speed streaming diagnostic data that, when analyzed in real time,
can be life-saving. Stream analytics not only improves patient care and safety by
detecting anomalies quickly but also helps healthcare companies become more efficient
and competitive. Many hospitals are building futuristic systems that combine rapid data
from advanced devices with powerful computers to analyze multiple streams
simultaneously, enabling doctors to make faster and better decisions.

Governments aim to be more efficient with resources and effective in delivering services,
and Big Data streams play a key role. With e-government and social media, agencies now
have vast structured and unstructured data, enabling proactive decisions compared to
traditional reactive methods. Real-time analytics helps in disaster management
(snowstorms, hurricanes, wildfires) using radar and sensor data, as well as monitoring
water, air quality, and consumption to detect problems early. It is also applied in traffic
management, where data from cameras, GPS, and road sensors are used to adjust signals
and lanes to reduce congestion.

With growing use of business analytics, traditional statistical methods are gaining
renewed importance for evidence-based decision making. Statistics is central to
descriptive analytics, while some methods (e.g., regression, clustering, discriminant
analysis) also serve predictive analytics. Descriptive analytics has two branches:
statistics and OLAP (business intelligence using data cubes). Statistics includes
descriptive statistics (describing sample data) and inferential statistics (drawing
conclusions about populations). Descriptive statistics forms the foundation, while
regression is covered under inferential statistics.

Descriptive statistics describes the basic characteristics of data, usually one variable at a
time, using formulas and numerical summaries to reveal clear patterns. It only characterizes
the sample data without making inferences about the population. In business analytics, it is
crucial for presenting data meaningfully through aggregated numbers, tables, and graphs,
helping both decision makers and analysts. It also identifies data concentration, outliers, and
unusual distributions. Descriptive statistics methods are mainly classified into measures of
central tendency and measures of dispersion, which will be represented mathematically in
the next section.
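As a small numerical illustration (the sample values are made up), the basic measures of central tendency and dispersion can be computed as follows:

import statistics as st

# A small illustrative sample (e.g., customer ages).
sample = [23, 27, 27, 31, 35, 38, 41, 45, 52, 61]

# Measures of central tendency
print("mean  :", st.mean(sample))      # 38.0
print("median:", st.median(sample))    # 36.5
print("mode  :", st.mode(sample))      # 27

# Measures of dispersion
print("range :", max(sample) - min(sample))   # 38
print("var   :", st.variance(sample))         # sample variance (n - 1 denominator)
print("stdev :", st.stdev(sample))            # sample standard deviation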
