Big Data Study Material
Affiliated to Madurai Kamaraj University | Accredited with ‘A’ Grade by NAAC (3rd cycle)
Approved by UGC Under Section 2(f) Status | ISO 9001:2015 Certified Institution
Paravai, Madurai-625402
STUDY MATERIAL
Big Data Analytics
III B. Sc., (CS)
IV SEMESTER
2025-2026
UNIT – I
File-Based Systems
Limitations:
Complex structure
Rigid schema
Difficult for end users to query data
Key Concepts:
Examples:
Oracle
MySQL
Microsoft SQL Server
PostgreSQL
Advantages:
Easy to use
Structured schema
Powerful querying with SQL
Widespread adoption
Object-Oriented Databases
Object-Relational Databases
Why NoSQL?
Explosion of unstructured and semi-structured data (e.g., social media, sensor data, logs)
Need for horizontal scalability and high performance
Traditional RDBMS unable to handle Big Data effectively
Big Data refers to large volumes of data that cannot be processed using traditional methods due to their volume, velocity, and variety:
Key Technologies:
Big Data is not just about large volumes of data. It includes a set of characteristics and components that
define its nature and how it can be processed and analyzed effectively. These characteristics are commonly
known as the V's of Big Data.
1. Volume
Refers to the massive amount of data generated every second.
Data comes from various sources like:
o Social media posts
o IoT devices
o Business transactions
o Videos, images, logs, etc.
Measured in terabytes, petabytes, and beyond.
2. Velocity
3. Variety
Example: An e-commerce website deals with product data (structured), user reviews (semi-structured),
and product images (unstructured).
4. Veracity
Example: Sensor data may have errors or missing values, affecting the analysis outcome.
5. Value
The most important element — extracting useful insights and business value from Big Data.
Value comes through:
o Predictive analytics
o Business intelligence
o Improved decision-making
6. Variability
Data flow rates can vary greatly over time.
Some systems need to handle spikes (e.g., during sales or festivals).
7. Visualization
These systems are made up of several key components working together to handle the 5Vs of Big Data:
Volume, Velocity, Variety, Veracity, and Value.
1. Data Sources
These are the origins from where data is generated and collected.
Types:
2. Data Ingestion
The process of collecting and importing data into the Big Data system.
3. Data Storage
Stores large volumes of structured and unstructured data across distributed systems.
Components:
HDFS (Hadoop Distributed File System): Fault-tolerant and scalable storage
NoSQL Databases: MongoDB, Cassandra, HBase
Data Lakes: Store raw and processed data (e.g., AWS S3, Azure Data Lake)
4. Data Processing
Batch Processing: Process large data chunks (e.g., Hadoop MapReduce, Apache Spark)
Real-time Processing: Handle streaming data (e.g., Apache Storm, Apache Flink)
5. Data Analysis
Extracts insights and patterns from the processed data using analytics and machine learning.
Techniques:
Statistical analysis
Data mining
Predictive modeling
Machine learning algorithms
Tools:
6. Data Visualization
Tools:
Tableau
Power BI
Apache Superset
Grafana
D3.js
Secure
Compliant with regulations
Properly managed
Big Data Analytics is the process of examining, processing, and analyzing massive and varied data sets
— known as Big Data — to discover patterns, correlations, trends, and insights that can support
decision-making and strategic planning.
It involves applying advanced analytic techniques to very large, diverse data sets from various sources,
including social media, sensors, web logs, and transactional systems.
Descriptive Analytics: Analyzes past data to understand what happened (e.g., monthly sales reports)
Diagnostic Analytics: Examines data to understand why something happened (e.g., identifying reasons for a sales drop)
Predictive Analytics: Uses historical data to predict future outcomes (e.g., forecasting customer demand)
Prescriptive Analytics: Suggests actions to achieve desired outcomes (e.g., recommending price adjustments)
Processing Frameworks
Domain Applications
Data Analytics
Data Analytics is the science of examining raw data to find trends, draw conclusions, and support decision-
making. It involves the collection, transformation, analysis, and interpretation of data to gain useful insights.
Descriptive Analytics: Summarizes past data to understand what happened (e.g., monthly sales reports)
Diagnostic Analytics: Examines data to find out why something happened (e.g., investigating customer churn)
Predictive Analytics: Uses historical data to predict future outcomes (e.g., forecasting demand or sales)
1. Data Collection – Gather data from various sources (databases, sensors, websites, etc.)
2. Data Cleaning – Remove duplicates, fix errors, fill missing values
3. Data Transformation – Convert data into suitable formats or structures
4. Data Analysis – Use statistical or machine learning techniques to explore patterns
5. Data Visualization – Present findings in charts, dashboards, or graphs
6. Decision Making – Use insights to support business or scientific actions
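As an illustration of the six steps above, here is a minimal sketch in Python using pandas; the file name sales.csv and its columns are assumptions made purely for this example, not part of the original material.

```python
# Hypothetical end-to-end sketch of the analytics steps above (pandas and matplotlib assumed installed).
import pandas as pd

# 1. Data Collection: read raw data (file name and columns are assumed for illustration)
df = pd.read_csv("sales.csv")            # assumed columns: date, region, units, price

# 2. Data Cleaning: drop duplicates and fill missing numeric values
df = df.drop_duplicates()
df["units"] = df["units"].fillna(0)

# 3. Data Transformation: parse dates and derive a revenue column
df["date"] = pd.to_datetime(df["date"])
df["revenue"] = df["units"] * df["price"]

# 4. Data Analysis: aggregate revenue by month and region
monthly = df.groupby([df["date"].dt.to_period("M"), "region"])["revenue"].sum()

# 5. Data Visualization: a simple bar chart of the aggregated result
monthly.unstack("region").plot(kind="bar", title="Monthly revenue by region")

# 6. Decision Making: the table/chart supports decisions such as
# reallocating stock to the best-performing region.
print(monthly.head())
```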
Programming Languages
Visualization Tools
Tableau
Power BI
Google Data Studio
Data Management
Sector Applications
Predictive Maintenance
– Sensor data from equipment (vibration, temperature) triggers maintenance before costly
breakdowns.
Supply-Chain Visibility
– Track parts and shipments across tiers; optimize routing and warehouse operations via real-time
analytics.
Quality Control
– Image and sensor analytics detect defects on production lines at scale.
5. Telecommunications
Smart Grids
– Real-time consumption data from smart meters helps balance load, integrate renewables, and
reduce outages.
Oil & Gas Exploration
– Process seismic and geological data at scale to identify promising drilling sites.
Predictive Asset Management
– Monitor pipelines, turbines, and transformers to forecast failures and schedule maintenance.
UNIT II
Analytical Theory
Introduction about Classification Algorithms
What Is Classification?
2. Key Concepts
Linear Models (Logistic Regression, Linear Discriminant Analysis (LDA)): model the class probability P(y | x) with a linear decision boundary.
Tree-Based (Decision Trees (CART, ID3, C4.5), Random Forests): split the feature space via binary or multiway tests; ensemble methods average many trees.
Ensemble Methods (Boosting (AdaBoost, Gradient Boosting), Bagging): combine multiple “weak learners” to form a stronger overall model.
Neural Networks (Multilayer Perceptron, Deep Learning models): learn complex, non-linear decision boundaries via layers of interconnected nodes.
1. Data Preparation
o Collect and clean data, handle missing values
o Encode categorical features (one-hot, label encoding)
o Scale or normalize numerical features
2. Feature Selection / Engineering
o Choose or construct the most informative inputs
o Reduce dimensionality (PCA, LDA) if needed
3. Model Selection
o Pick candidate algorithms (e.g., logistic regression vs. SVM vs. tree)
o Set up cross-validation strategy
4. Training
o Fit model parameters on training data
o Tune hyperparameters (grid search, random search)
5. Evaluation
o Use metrics such as accuracy, precision, recall, F₁-score, ROC AUC
o Inspect confusion matrix to understand class-specific performance
6. Deployment & Monitoring
o Integrate the model into production
o Monitor performance drift and retrain as necessary
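A compact sketch of this workflow using scikit-learn (assumed available); the synthetic dataset and the hyperparameter grid are illustrative choices, not prescriptions.

```python
# Illustrative classification workflow with scikit-learn (library assumed available).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# 1-2. Data preparation / features: a synthetic dataset stands in for real, cleaned data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Model selection: scaling + logistic regression combined in one pipeline
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 4. Training with hyperparameter tuning via cross-validated grid search
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# 5. Evaluation: confusion matrix plus precision, recall, and F1 per class
y_pred = grid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 6. Deployment/monitoring would wrap grid.best_estimator_ and retrain as data drifts.
```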
5. Evaluation Metrics
6. Theoretical Foundations
Empirical Risk Minimization (ERM) guides model fitting: minimize average loss over training
samples.
Structural Risk Minimization (SRM) (in SVM) balances fitting the data vs. model complexity to
avoid overfitting.
Curse of Dimensionality affects instance-based and distance-based methods: high-dimensional
spaces dilute distance metrics.
Overfitting vs. Underfitting: Use regularization, pruning, or ensemble methods to balance bias and
variance.
Imbalanced Data: Employ resampling (SMOTE), cost-sensitive learning, or metric selection
beyond accuracy.
Feature Correlation: Some algorithms (Naïve Bayes) assume feature independence; correlated
features can degrade performance.
Scalability: For very large datasets, prefer scalable frameworks (e.g., distributed implementations of
Spark MLlib).
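For the imbalanced-data point above, one common remedy (a sketch, not the only option) is cost-sensitive learning via class weights; resampling with SMOTE from the imbalanced-learn package is an alternative not shown here.

```python
# Sketch: cost-sensitive learning for imbalanced classes with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# A 95%/5% class split imitates an imbalanced problem (values assumed for illustration)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes errors on the rare class more heavily
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Report precision/recall/F1 instead of plain accuracy, as recommended above
print(classification_report(y_te, clf.predict(X_te)))
```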
Regression Techniques
1. What Is Regression?
Regression is a branch of supervised learning where the goal is to model the relationship between one or
more independent variables (features) and a continuous dependent variable (target). It answers questions
like “How much?” or “What value?”.
Ridge Regression: Linear model with L₂ regularization to penalize large θ (reduce overfitting)
Lasso Regression: Linear model with L₁ regularization, promotes sparsity in θ (feature selection)
Elastic Net: Combines L₁ and L₂ penalties for a balance between Ridge and Lasso
Support Vector Regression: Finds a tube (ε-insensitive) around the regression line; uses the kernel trick for nonlinearity
Decision Tree Regression: Splits the feature space into regions and fits a constant value per leaf
Random Forest Regression: Ensemble of decision trees; averages their predictions
Gradient Boosting Regression: Sequentially builds trees to correct previous errors (e.g., XGBoost, LightGBM)
Neural Network Regression: Multi-layer perceptron with continuous output; captures complex nonlinear relationships
4. Theoretical Foundations
2. Regularization
o Ridge (L₂): adds a penalty λ Σ θⱼ² to the loss, shrinking coefficients toward zero
o Lasso (L₁): adds a penalty λ Σ |θⱼ| to the loss, driving some coefficients exactly to zero (feature selection)
Adjusted R²: Accounts for the number of predictors and penalizes adding irrelevant features; better for comparing models with different feature counts
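In symbols, the penalized least-squares objectives behind Ridge and Lasso can be written as follows; the notation (θ for coefficients, λ for penalty strength) is assumed here since the original formulas are not shown.

```latex
% Standard penalized least-squares forms (sketch, notation assumed)
\text{Ridge:}\quad \min_{\theta}\; \sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^{\top}\theta\bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \theta_j^{2}
\qquad
\text{Lasso:}\quad \min_{\theta}\; \sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^{\top}\theta\bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert\theta_j\rvert
```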
1. Data Preparation
o Clean missing values, outliers
o Encode categorical variables
o Scale/normalize features for methods sensitive to magnitude
2. Feature Engineering
o Create interaction or polynomial terms
o Feature selection via correlation analysis, Lasso, or tree-based importance
3. Model Selection & Training
o Choose baseline (e.g., linear regression)
o Use cross-validation to compare methods
o Tune hyperparameters (λ in Ridge/Lasso, tree depth, etc.)
4. Evaluation
o Compute metrics on validation/test data
o Analyze residuals to detect patterns or heteroscedasticity
5. Deployment & Monitoring
o Integrate model into production pipelines
o Retrain as data distributions evolve
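A brief scikit-learn sketch of this workflow, tuning the regularization strength (λ, called alpha in scikit-learn) of Ridge and Lasso by cross-validation; the synthetic data is purely illustrative.

```python
# Illustrative regression workflow: compare Ridge and Lasso with cross-validation.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

for name, model in [("Ridge", Ridge()), ("Lasso", Lasso(max_iter=10000))]:
    # Tune the regularization strength (alpha plays the role of lambda)
    search = GridSearchCV(model, {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)
    print(name, "best alpha:", search.best_params_["alpha"],
          "RMSE:", mean_squared_error(y_te, pred) ** 0.5,
          "R2:", r2_score(y_te, pred))
```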
1. Healthcare Analytics
Techniques:
o Predictive modeling (e.g., predicting disease outbreaks or patient readmission)
o Temporal data mining (e.g., patient monitoring over time)
o Natural Language Processing (NLP) for EHR (Electronic Health Records)
o Anomaly detection for medical fraud
Tools: Apache Spark, Hadoop with HL7 data formats, TensorFlow for deep learning
2. Finance & Banking Analytics
Techniques:
o Fraud detection using graph analytics and real-time stream analysis
o Risk modeling (e.g., credit scoring using logistic regression, random forests)
o Sentiment analysis from financial news and social media
o Time series analysis for stock prediction
Tools: Apache Flink, Kafka for real-time processing, SQL-on-Hadoop engines (Hive, Presto)
3. Retail & E-commerce Analytics
Techniques:
o Market basket analysis using association rule mining (Apriori, FP-Growth)
o Customer segmentation (K-means, DBSCAN)
o Recommendation engines (collaborative filtering, matrix factorization)
o A/B testing and conversion rate optimization
Tools: Spark MLlib, Amazon Redshift, Google BigQuery
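As one concrete example of the techniques above, customer segmentation with K-means might look like the following sketch; the two features and the cluster count are assumptions made only for illustration.

```python
# Sketch: K-means customer segmentation with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([[1200, 2], [300, 1], [5000, 8], [4500, 7], [200, 1], [1500, 3]])

# Scale features so both contribute equally to the distance metric
scaled = StandardScaler().fit_transform(customers)

# Assume three segments (e.g., low, mid, high value) purely for illustration
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print("Segment labels:", kmeans.labels_)
```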
4. Telecommunications
Techniques:
o Churn prediction using classification algorithms
o Network optimization with graph-based models
o Customer usage pattern mining
o Real-time analytics for call detail records (CDRs)
Tools: Apache Cassandra (for CDRs), Hadoop, Spark Streaming
5. Manufacturing & Industrial IoT Analytics
Techniques:
o Predictive maintenance (using time-series models like ARIMA, LSTM)
o Sensor data analysis (using edge analytics)
o Anomaly detection in machine operation
Tools: InfluxDB, TimescaleDB, Azure IoT Suite, Apache NiFi
6. Cybersecurity
Techniques:
o Intrusion detection using clustering and classification
o Log analysis with pattern matching
o Threat intelligence correlation with graph analytics
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Apache Metron
7. Smart Cities & Transportation Analytics
Techniques:
o Spatial analytics with GIS integration
o Traffic pattern prediction using streaming analytics
o Public safety data mining
Tools: PostGIS, Hadoop GIS, GeoMesa, Apache Storm
Text Analytics
Text Analytics in Database Analytics for Big Data involves extracting meaningful information from large
volumes of unstructured or semi-structured text data stored in databases or data lakes. It's a crucial
component across many domains (e.g., healthcare, finance, e-commerce), especially given that over 80% of
enterprise data is unstructured.
1. Text Preprocessing
2. Feature Extraction
4. Sentiment Analysis
5. Topic Modeling
Techniques like LDA (Latent Dirichlet Allocation) to uncover hidden themes in large text corpora.
6. Text Classification
7. Text Clustering
8. Information Retrieval
Search, query expansion, and ranking algorithms (TF-IDF, BM25, BERT-based models).
Core to search engines and enterprise knowledge systems.
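A minimal sketch tying together basic preprocessing, TF-IDF feature extraction, and text classification with scikit-learn; the tiny in-line corpus and its sentiment labels are invented for illustration.

```python
# Sketch: TF-IDF features + Naive Bayes text classification (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus with sentiment-style labels (1 = positive, 0 = negative), invented for the example
docs = ["great product, works well", "terrible quality, broke fast",
        "excellent support and fast delivery", "awful experience, never again"]
labels = [1, 0, 1, 0]

# TfidfVectorizer lower-cases and tokenizes the text (basic preprocessing) before weighting terms
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["fast delivery and great quality"]))  # expected: [1]
```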
Data Storage:
Real-time analysis refers to the processing and analysis of data as it is generated — with minimal
latency — to extract insights and trigger actions instantly or within seconds. It’s a critical capability for
applications where immediate decisions or reactions are needed, such as fraud detection, live dashboards,
IoT monitoring, and recommendation systems.
Component Function
Data Ingestion Capture data streams from multiple sources (logs, IoT, APIs)
Stream Processing Process data in-memory as it arrives
Storage Temporarily store incoming data for quick access
Analysis Engine Run transformations, models, or analytics in real time
Visualization Update dashboards or systems instantly
Action System Trigger alerts, decisions, or other automated actions
Common Technologies
Layer Examples
Data Streams Apache Kafka, Amazon Kinesis, Apache Pulsar
Stream Processing Apache Flink, Apache Spark Structured Streaming, Apache Storm
Message Brokers Kafka, RabbitMQ, MQTT
Storage (Low-latency) Redis, Cassandra, HBase, Elasticsearch
Query Engines Apache Druid, ClickHouse, Pinot (real-time OLAP)
Dashboards Grafana, Kibana, Superset
Event time is when the event occurred; processing time is when it was processed.
Crucial for accurate event ordering and late arrival handling
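To illustrate event-time handling and late arrivals, here is a sketch using Spark Structured Streaming (pyspark assumed available); the socket source, host/port, and window sizes are placeholders, not a recommended production setup.

```python
# Sketch: event-time windowing with a watermark in Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, to_timestamp, split

spark = SparkSession.builder.appName("EventTimeDemo").getOrCreate()

# Placeholder source: lines like "2025-01-01 10:00:03,sensor-1" arriving on a local socket
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

parts = split(col("value"), ",")
events = lines.select(to_timestamp(parts.getItem(0)).alias("event_time"),
                      parts.getItem(1).alias("sensor"))

# The watermark tolerates events arriving up to 10 minutes late (event time, not processing time)
counts = (events.withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("sensor"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```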
3. Anomaly Detection
Introduction
Real-time System
A Real-Time System is a type of computing system that is designed to process data and produce
responses within a strict time constraint — often in milliseconds or microseconds. These systems are built
not just to compute correctly, but also to compute on time.
Key Definition
A Real-Time System is one where the correctness of an operation depends not only on its logical result,
but also on the time at which the result is produced.
Characteristic Description
Deterministic behavior Must respond predictably and consistently under set limits
Time constraints Operates under hard, firm, or soft deadlines
Reliability & Availability Must be robust and available continuously (often 24/7)
Concurrency Handles multiple tasks simultaneously
Event-driven Often triggered by external events or inputs
Real-time systems are computer systems that must respond to inputs or events within a specified time
constraint. These systems are used in situations where delays in response can lead to system failure or
undesired consequences. Based on the strictness of the timing constraints, real-time systems are classified
into three main types:
Definition: In soft real-time systems, missing a deadline is undesirable but not catastrophic.
Characteristics:
o Occasional deadline misses are tolerable.
o System performance degrades gracefully rather than failing.
o Emphasis is on overall performance rather than individual task timing.
Examples:
o Video conferencing systems
o Online transaction systems
o Multimedia streaming
o Online gaming
Definition: In firm real-time systems, missing a deadline renders the result useless, but the system
itself does not fail.
Characteristics:
o Tasks that miss deadlines are discarded.
o No penalties for occasional deadline misses, but they should be minimized.
Examples:
o Automated stock trading systems
o Quality control systems in manufacturing
o Airline reservation systems
Real-time systems are designed to process data and provide responses within strict time constraints. Their
primary goal is not just to compute correctly, but to do so within a defined time frame. Below are the
key characteristics that define real-time systems:
1. Timeliness (Determinism)
Definition: The ability of the system to respond to events or inputs within a predetermined and
guaranteed time.
Importance: Missing a deadline can lead to failure, especially in hard real-time systems.
2. Predictability
Definition: The system's behavior and timing must be predictable, even under heavy loads.
Importance: Ensures consistent and reliable performance regardless of workload.
3. Reliability
Definition: The system should function correctly over a long period without failures.
Importance: Crucial in safety-critical applications like aerospace and medical systems.
4. Availability
Definition: The system should be available and operational at all required times.
Importance: Many real-time applications, such as traffic control, demand continuous uptime.
6. Concurrency
7. Minimal Latency
Definition: The time between receiving input and producing output should be as low as possible.
Importance: Critical for systems like real-time audio/video or emergency alert systems.
8. Priority Scheduling
Definition: Tasks are assigned priorities to ensure critical tasks are executed first.
Importance: Helps meet timing requirements by allowing urgent tasks to preempt less critical ones.
9. Resource Efficiency
10. Fault Tolerance
Definition: Ability to continue operating correctly even when some parts fail.
Importance: Essential in life-critical systems like medical devices or flight controllers.
Introduction
In the era of Big Data, organizations generate and consume vast volumes of data from diverse
sources like social media, sensors, logs, and IoT devices. Traditional batch processing methods are often
inadequate for time-sensitive data. This is where Real-Time Processing Systems come in—designed to
analyze and respond to data as it is generated, providing immediate insights and actions.
Real-time processing refers to the continuous input, processing, and output of data within a short,
guaranteed time frame. Unlike batch processing, which handles data in large volumes at scheduled intervals,
real-time systems work on streaming data—handling it event by event or record by record.
Immediate Insights: Enables quick decisions (e.g., fraud detection, alert systems).
Improved User Experience: Personalized recommendations and dynamic content delivery.
Operational Efficiency: Real-time processing streamlines operations and reduces response delays.
Data Integration
Definition:
Data Integration is the process of combining data from multiple disparate sources into a single, unified
view to ensure consistency, accessibility, and accuracy.
Key Functions:
Data Collection: Gather data from sources like databases, APIs, files, and real-time streams.
Data Cleaning: Remove inconsistencies, duplicates, and errors.
Data Transformation: Convert data into a standard format suitable for analysis (ETL – Extract,
Transform, Load).
Data Consolidation: Merge datasets into a single repository, such as a data warehouse or data lake.
Examples:
Integrating customer data from CRM, website logs, and social media.
Consolidating sales data from different regional branches.
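A small ETL-style sketch of data integration in Python (pandas assumed installed); the CRM CSV file, the web-log JSON file, and their columns are hypothetical names used only to make the steps concrete.

```python
# Sketch: extract-transform-load that unifies two hypothetical customer data sources.
import pandas as pd
import sqlite3

# Extract: customer records from a CRM export and events from web logs (files assumed)
crm = pd.read_csv("crm_customers.csv")             # assumed columns: customer_id, name, email
logs = pd.read_json("web_logs.json", lines=True)   # assumed columns: customer_id, page, ts

# Transform: clean duplicates, standardize keys, aggregate log activity per customer
crm = crm.drop_duplicates(subset="customer_id")
activity = logs.groupby("customer_id").size().rename("page_views")
unified = crm.merge(activity, on="customer_id", how="left").fillna({"page_views": 0})

# Load: write the unified view into a single repository (here a local SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    unified.to_sql("customer_360", conn, if_exists="replace", index=False)
```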
Data Analytics
Definition:
Data Analytics is the process of examining, interpreting, and visualizing data to discover meaningful
patterns, trends, correlations, and insights.
Types of Analytics:
1. Descriptive Analytics:
o Summarizes past data.
o Example: Monthly sales reports.
2. Diagnostic Analytics:
o Explains why something happened.
o Example: Drop in customer engagement analysis.
3. Predictive Analytics:
o Forecasts future outcomes using statistical models and machine learning.
o Example: Predicting product demand.
4. Prescriptive Analytics:
o Recommends actions based on predictive models.
o Example: Inventory optimization based on future demand.
Dependency: Data Integration is a prerequisite for effective analytics, while Data Analytics uses the integrated data to generate insights.
Hadoop is an open-source, distributed framework designed for storing and processing large volumes of
data across clusters of computers. It forms the backbone of many Big Data applications by enabling
reliable, scalable, and cost-effective data storage and analytics.
What is Hadoop?
2. MapReduce
4. Hadoop Common
Tool Purpose
A Real-Time System Architecture defines how the components of a real-time system are structured and
interact to ensure timely, predictable, and reliable responses to inputs or events. These systems are
engineered to meet strict timing constraints and are often used in safety-critical and mission-critical
applications such as avionics, automotive systems, industrial automation, and healthcare devices.
1. Sensor/Input Interface
Function: Captures data from the external environment (e.g., temperature, speed, pressure).
Example Devices: Cameras, sensors, microphones.
Role: Triggers events in the system that must be responded to immediately.
2. Processing Unit
Function: Executes real-time tasks, algorithms, and logic based on input data.
Needs:
o High performance for computation
o Deterministic behavior (i.e., predictable execution times)
4. Memory Management
5. Actuators/Output Interface
6. Communication Interface
Function: Facilitates data exchange between system components and external systems.
Examples:
o CAN bus (automotive)
o Ethernet/IP (industrial control)
o UART/SPI/I2C (embedded devices)
7. Clock/Timer
Function: Provides precise timing and synchronization for task scheduling and deadlines.
Essential For:
o Measuring task execution time
o Triggering periodic tasks
Architecture Type Description
Monolithic All tasks run in a single executable; simple but hard to scale
Layered Organizes the system into layers (e.g., hardware, kernel, application) for modularity
Microkernel (RTOS) Provides minimal kernel features; other services run in user space for stability and isolation
Distributed Real-time processing is spread across multiple connected systems (e.g., sensor networks)
Real-Time Data Analytics refers to the process of analyzing data as soon as it is generated or received,
enabling organizations to make immediate decisions and take timely actions. Unlike traditional (batch)
analytics, which processes data after storage, real-time analytics processes streaming data continuously,
providing insights in seconds or milliseconds.
Capturing live data from sources like IoT devices, logs, transactions, and sensors.
Processing and analyzing that data instantly.
Generating outputs like alerts, visualizations, or automated actions without delay.
Component Description
Data Ingestion Layer Collects real-time data using tools like Apache Kafka, Flume, or MQTT
Stream Processing Engine Processes data on the fly (e.g., Apache Spark Streaming, Apache Flink, Storm)
Storage Layer Stores data temporarily or permanently (e.g., Redis, Cassandra, Elasticsearch)
Technologies Used
1. Descriptive Analytics
o Shows what is happening now.
o Example: Current website visitor count.
2. Predictive Analytics
o Uses live data to forecast trends or issues.
o Example: Predicting machine failure in a factory.
3. Prescriptive Analytics
o Recommends real-time actions.
o Example: Automatically rerouting delivery trucks based on traffic data.
Use Cases
Faster decision-making
Early anomaly detection
Better customer engagement
Operational efficiency
Competitive advantage
Unit-3
The Big Data stack is typically divided into several layers, from data storage to data processing to analytics
and visualization. Here's a breakdown of the key components of a typical Big Data stack:
The first layer in the Big Data stack is responsible for storing massive volumes of data. Since traditional
databases (like relational databases) are not designed to handle the scale of Big Data, specialized storage
systems are used.
Technologies:
Hadoop Distributed File System (HDFS): A distributed file system that stores data across many
machines. It is highly scalable and fault-tolerant.
NoSQL Databases:
o MongoDB: A document-based NoSQL database that stores data in JSON-like format.
o Cassandra: A highly scalable column-family store designed for large, distributed data
environments.
o HBase: A NoSQL, column-oriented database built on top of HDFS.
Cloud Storage:
o Amazon S3: Object storage service that can scale to store terabytes or petabytes of data.
o Google Cloud Storage and Azure Blob Storage also offer scalable object storage solutions.
Technologies:
Apache Hadoop (MapReduce): A programming model and processing engine that divides tasks
into smaller sub-tasks, which are then executed in parallel across a cluster. It works well for batch
processing.
Apache Spark: A fast, in-memory data processing engine that supports both batch and real-time
processing (streaming). Spark is significantly faster than Hadoop MapReduce and supports
advanced analytics like machine learning and graph processing.
Apache Flink: A stream-processing framework designed for real-time analytics. It supports both
batch and stream processing and can handle stateful computations over unbounded data streams.
Apache Storm: A real-time, distributed processing system that allows for complex event processing
(CEP) in real time.
Apache Kafka: A distributed event streaming platform used to build real-time data pipelines and
streaming applications. It allows systems to publish, subscribe to, store, and process real-time data
streams.
Google Dataflow: A fully managed stream and batch processing service on Google Cloud that
allows you to execute pipelines in real-time or batch mode.
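As a concrete taste of this processing layer, here is a minimal PySpark batch job (a sketch; the HDFS input path is a placeholder and could equally be a local or S3 path).

```python
# Sketch: a word-count batch job with Apache Spark's Python API (pyspark assumed).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col, lower

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Placeholder path: could be HDFS (hdfs://...), S3 (s3a://...), or a local file
lines = spark.read.text("hdfs://namenode:9000/data/input.txt")

counts = (lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
          .where(col("word") != "")
          .groupBy("word").count()
          .orderBy(col("count").desc()))

counts.show(10)
spark.stop()
```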
The data integration layer focuses on extracting, transforming, and loading (ETL) data from different
sources into the system, or combining datasets for analysis. ETL tools automate the process of getting data
into a usable format.
Technologies:
Apache NiFi: A data integration tool that automates data flow between different systems. It is
designed to handle both batch and real-time data integration.
Talend: A leading ETL tool for integrating, transforming, and cleaning data.
Apache Airflow: A workflow orchestration tool that automates the scheduling and monitoring of
ETL tasks.
Informatica: A data integration platform used to manage the flow of data from multiple sources.
Once data is stored and processed, it needs to be analyzed. The data analytics layer includes tools for
querying, aggregating, and analyzing data to derive insights.
Technologies:
Apache Hive: A data warehouse built on top of Hadoop that allows you to query data using SQL-
like language (HiveQL). It’s a popular choice for batch processing and data warehousing.
Apache Impala: A high-performance SQL engine designed for real-time querying of data stored in
Hadoop, often used as an alternative to Hive for faster query processing.
Presto: A distributed SQL query engine that allows for fast querying across large datasets stored in
different data sources (including HDFS, Amazon S3, etc.).
Google BigQuery: A fully-managed, serverless data warehouse that enables real-time analytics
using SQL-like queries.
ClickHouse: A columnar database management system optimized for online analytical processing
(OLAP).
The machine learning (ML) and advanced analytics layer is used to build and deploy predictive models,
conduct statistical analysis, and apply algorithms to derive insights from Big Data.
Technologies:
Apache Mahout: A machine learning library built on top of Hadoop, primarily for large-scale data
mining and machine learning.
MLlib (Apache Spark): A scalable machine learning library built into Apache Spark, supporting
algorithms like regression, classification, and clustering.
TensorFlow: An open-source framework developed by Google for building and training machine
learning models.
Scikit-learn: A Python library for machine learning, including algorithms for classification,
regression, clustering, and dimensionality reduction.
H2O.ai: An open-source machine learning platform that includes tools for building and deploying
ML models at scale.
SageMaker (AWS): A fully managed service from Amazon Web Services for building, training, and
deploying machine learning models at scale.
The data visualization layer presents insights in an easily understandable format. BI tools allow users to
create dashboards, charts, and reports based on data analysis.
Technologies:
Tableau: A leading data visualization tool that allows users to create interactive visualizations from
Big Data sources.
Power BI: A Microsoft tool that integrates with various data sources, including Big Data platforms,
to create interactive reports and dashboards.
QlikView: A BI tool that provides a rich set of features for data exploration and visualization.
Apache Superset: An open-source data visualization platform built for modern data exploration.
Looker: A BI tool that allows you to create custom data reports and dashboards with a focus on data
exploration and business intelligence.
This layer ensures that Big Data is handled properly, with the appropriate security measures, compliance,
and governance in place.
Technologies:
Apache Atlas: A framework for governance and metadata management, which allows organizations
to manage data lineage, audit trails, and other governance-related tasks.
Apache Ranger: A framework to manage and enforce security policies across the Hadoop
ecosystem, including data access control.
Cloudera Navigator: A tool for managing and governing Big Data environments, including
metadata management and data lineage.
1. Resource Efficiency: Virtualization allows multiple virtual machines (VMs) to run on a single
physical server, optimizing resource use and reducing hardware costs.
2. Elastic Scaling: Virtual environments can scale up or down quickly based on workload demands,
ideal for the dynamic nature of Big Data applications.
3. High Availability: Virtualized systems ensure minimal downtime through features like VM
migration (e.g., VMware vMotion) and fault tolerance.
4. Cost Savings: Reduces the need for large physical infrastructures, making it cost-effective for
organizations to scale Big Data operations.
5. Simplified Testing and Development: Virtual machines can quickly replicate Big Data
environments for testing, ensuring flexibility and faster development cycles.
VMware: Popular virtualization platform for managing Big Data clusters, offering features like
vSphere and vMotion.
KVM (Kernel-based Virtual Machine): Open-source solution widely used for virtualizing Linux-
based Big Data applications.
OpenStack: Cloud platform that provides infrastructure-as-a-service (IaaS) for virtualizing and
scaling Big Data environments in private and hybrid clouds.
Docker and Kubernetes: Containerization technologies that work on virtualized infrastructure to
create lightweight, scalable environments for Big Data applications like Hadoop and Spark.
Hadoop: Virtualization helps manage Hadoop clusters by distributing data nodes and other
components across virtual machines. It simplifies provisioning, scaling, and resource management.
Spark: Spark clusters benefit from virtualization by scaling up/down based on data processing
requirements, improving performance and flexibility.
NoSQL Databases (Cassandra, MongoDB, etc.): Virtual machines enable better resource isolation,
replication, and scaling of NoSQL database clusters, ensuring efficient handling of Big Data
workloads.
Virtualization in Cloud-Based Big Data:
Cloud Platforms like AWS, Google Cloud, and Azure use virtualization to provide scalable Big
Data services such as EMR (Elastic MapReduce), BigQuery, and HDInsight.
Hybrid Cloud setups allow virtualized environments to seamlessly move between on-premises and
cloud infrastructures.
Performance Overhead: Virtualization can introduce some performance penalties due to the
abstraction layer between the hardware and the applications.
Storage Complexity: Managing vast amounts of distributed data across virtual machines requires
efficient storage solutions to prevent bottlenecks.
Resource Contention: Multiple virtual machines may compete for CPU, memory, and storage,
which can affect the performance of Big Data applications.
NoSQL (Not Only SQL) is a category of database systems designed to handle large volumes of
unstructured, semi-structured, or structured data. Unlike traditional relational databases (RDBMS),
NoSQL databases do not rely on fixed table schemas and typically avoid SQL as their primary query
language.
1. Schema-less: Data can be stored without defining a fixed structure (flexible schema).
2. Horizontal Scalability: Easily scales out by adding more servers.
3. High Performance: Optimized for read/write speeds, especially in big data and real-time web
applications.
4. Distributed Architecture: Designed for distributed computing and high availability.
5. Supports Large Volumes of Data: Can efficiently handle terabytes to petabytes of data.
1. Document Stores
o Store data in JSON, BSON, or XML format.
o Each document is a self-contained data unit.
o 🔹 Examples: MongoDB, CouchDB
2. Key-Value Stores
o Store data as key-value pairs.
o Extremely fast and scalable.
o 🔹 Examples: Redis, Amazon DynamoDB
3. Column-Family Stores
o Store data in columns rather than rows (like RDBMS).
o Suitable for large datasets with high read/write throughput.
o 🔹 Examples: Apache Cassandra, HBase
4. Graph Databases
o Store data as nodes and edges representing entities and their relationships.
o Ideal for complex relationship queries.
o 🔹 Examples: Neo4j, Amazon Neptune
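To make the key-value model above concrete, a short sketch with the redis-py client; a Redis server is assumed to be running locally on the default port, and the key names are invented.

```python
# Sketch: basic key-value operations against a local Redis server (redis-py assumed).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Keys are simple strings; values here are strings, but Redis also supports hashes, lists, and sets
r.set("user:1001:name", "Alice")
r.set("user:1001:last_login", "2025-01-15", ex=3600)  # optional expiry in seconds

print(r.get("user:1001:name"))   # b'Alice' (values come back as bytes by default)
r.delete("user:1001:last_login")
```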
Advantages of NoSQL:
Disadvantages of NoSQL:
CouchDB:
Apache CouchDB is an open-source NoSQL database that focuses on ease of use and scalability. It stores
data in a document-oriented format using JSON and offers a flexible, schema-less model.
1. Document Store:
o Data is stored as JSON documents.
o Each document has a unique ID and can contain nested data structures.
2. RESTful HTTP API:
o CouchDB uses HTTP for its API, making it easy to interact with via web protocols.
o You can perform CRUD operations (Create, Read, Update, Delete) through simple HTTP
requests.
3. ACID Properties:
o CouchDB ensures Atomicity, Consistency, Isolation, and Durability at the document level.
o Uses Multi-Version Concurrency Control (MVCC) for safe concurrent updates without
locking.
4. Replication and Synchronization:
o Supports multi-master replication, allowing databases to sync across different servers or
devices.
o Ideal for offline-first applications where data can be updated locally and synchronized later.
5. MapReduce Queries:
o Uses JavaScript-based MapReduce for querying and indexing data.
o Allows building complex queries and views.
6. Fault Tolerance:
o Designed for distributed use; can replicate data across unreliable networks and recover
gracefully from failures.
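Because CouchDB exposes a plain HTTP API, basic CRUD can be sketched with the requests library; the local URL, admin credentials, and database/document names below are assumptions for illustration.

```python
# Sketch: CRUD against CouchDB's RESTful HTTP API using the requests library.
import requests

BASE = "http://localhost:5984"      # assumed local CouchDB instance
AUTH = ("admin", "password")        # assumed credentials

# Create a database, then create a JSON document with an explicit id
requests.put(f"{BASE}/books", auth=AUTH)
requests.put(f"{BASE}/books/book-001", auth=AUTH,
             json={"title": "Big Data Basics", "year": 2025})

# Read the document back; CouchDB returns it with _id and _rev fields
doc = requests.get(f"{BASE}/books/book-001", auth=AUTH).json()
print(doc)

# Updates must carry the current revision (_rev) because of MVCC
doc["year"] = 2026
requests.put(f"{BASE}/books/book-001", auth=AUTH, json=doc)

# Deletes also need the latest revision
latest = requests.get(f"{BASE}/books/book-001", auth=AUTH).json()
requests.delete(f"{BASE}/books/book-001", auth=AUTH, params={"rev": latest["_rev"]})
```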
Advantages:
Limitations:
MongoDB:
MongoDB is a popular, open-source NoSQL document database designed for high performance, high
availability, and easy scalability. It uses a document-oriented data model, storing data in BSON (Binary
JSON) format. MongoDB is widely used in applications that require fast read/write performance, large-scale
data storage, and flexible data modeling.
1. Document-Oriented Storage:
o MongoDB stores data in documents (similar to JSON format), which are collections of key-
value pairs.
o Each document can have a different structure, offering flexibility and allowing for schema-
less designs.
2. BSON (Binary JSON):
o MongoDB uses BSON, an extended version of JSON, which supports additional data types
(like Date, Binary, ObjectId, etc.) that standard JSON does not support.
3. Flexible Schema:
o Unlike relational databases, MongoDB doesn’t enforce a fixed schema, making it easier to
modify the data structure over time.
o Collections in MongoDB are schema-free, so documents can have different fields or data
types.
4. High Performance:
o MongoDB is optimized for fast reads and writes, making it suitable for applications with
heavy data input/output (I/O) and real-time analytics.
o It supports in-memory storage for faster performance and includes indexing for efficient
querying.
5. Horizontal Scalability (Sharding):
o MongoDB supports horizontal scaling (sharding), which distributes data across multiple
servers to manage large volumes of data.
o Shards are individual databases that MongoDB can balance and replicate across nodes.
6. Replication:
o MongoDB supports replica sets, a group of primary and secondary databases that provide
automatic failover and data redundancy.
o Ensures high availability and fault tolerance, as if the primary server goes down, one of the
secondaries can take over.
7. Aggregation Framework:
o MongoDB provides a powerful aggregation framework to perform complex queries and
data transformations.
o Supports operations like grouping, filtering, sorting, and joining within collections.
8. Built-in Data Redundancy and Failover:
o Replica sets automatically ensure that data is available and fault-tolerant. If a primary node
fails, one of the secondaries is promoted to primary without any downtime.
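A brief sketch of these ideas with the pymongo driver; MongoDB is assumed to be running locally, and the database and collection names are invented for the example.

```python
# Sketch: document storage, flexible schema, and aggregation with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
products = client["shop_demo"]["products"]          # database and collection names are invented

# Documents in the same collection may have different fields (schema-less design)
products.insert_many([
    {"name": "Laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"name": "Mouse", "price": 25, "tags": ["wireless", "usb-c"]},
])

# Query with a filter and a sort
for doc in products.find({"price": {"$lt": 100}}).sort("price", 1):
    print(doc["name"], doc["price"])

# Aggregation framework: average price across all products
pipeline = [{"$group": {"_id": None, "avg_price": {"$avg": "$price"}}}]
print(list(products.aggregate(pipeline)))
```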
1. Real-Time Analytics:
o MongoDB handles large volumes of real-time data, making it ideal for applications that
require real-time analytics or data aggregation (e.g., IoT platforms, social media feeds).
2. Content Management Systems:
o MongoDB's flexible schema is great for managing and storing content that doesn't fit neatly
into relational tables, such as media files, blogs, or customer data.
3. Catalog and Inventory Management:
o Used in e-commerce or product catalog systems where different items may have different
attributes.
4. Mobile and Web Applications:
o Ideal for building web apps or mobile applications that need to handle a wide range of data
types or require rapid schema evolution.
5. Data Warehousing:
o Often used in data lakes or data warehousing solutions, where large amounts of unstructured
data need to be stored and processed.
Advantages of MongoDB:
1. Scalability:
o Horizontal scalability with sharding, which allows the database to distribute data across
multiple servers.
o Supports replica sets to ensure high availability and redundancy.
2. Performance:
o MongoDB is optimized for fast writes and reads with features like indexing, in-memory
storage, and automatic replication.
3. Flexibility:
o The schema-less nature of MongoDB makes it easy to adapt to changing requirements.
o Supports rich, nested data types and arrays, which are harder to model in traditional relational
databases.
4. Real-Time Data Handling:
o MongoDB supports real-time data ingestion and querying, making it well-suited for
applications that require immediate access to fresh data.
Disadvantages of MongoDB:
Hadoop Ecosystem:
The Hadoop Ecosystem refers to a set of tools, frameworks, and services that work together to process and
store large volumes of Big Data in a distributed computing environment. It is built around the core Hadoop
framework, which includes the Hadoop Distributed File System (HDFS) and MapReduce.
1. Scalability:
o Easily scales by adding more nodes to the cluster, making it suitable for handling petabytes
of data.
2. Cost-Effective:
o Uses commodity hardware to store and process data, which is more affordable than
traditional enterprise storage solutions.
3. Fault Tolerance:
o Data is replicated across multiple nodes in HDFS, ensuring availability even if some nodes
fail.
o Provides automatic failover for MapReduce jobs and other services.
4. Flexibility:
o Can handle both structured and unstructured data, making it ideal for a wide variety of use
cases (e.g., web logs, IoT data, social media data).
5. High Performance:
o Tools like Apache Spark offer real-time processing and much faster analytics compared to
traditional MapReduce.
6. Integration:
o Integrates easily with other tools and systems (like NoSQL databases, machine learning
frameworks, data lakes, etc.).
Big Data Analytics: Processing and analyzing huge datasets in real time for use cases like fraud
detection, recommendation engines, and social media sentiment analysis.
Data Warehousing: Storing and querying massive datasets for business intelligence and decision-
making.
Log Analysis: Processing and analyzing server and application logs for monitoring, troubleshooting,
and reporting.
Real-Time Streaming: Real-time data processing from IoT devices, sensors, and streaming data
sources.
Machine Learning: Building scalable machine learning models using Spark MLlib or Mahout.
HDFS is the primary storage system of the Hadoop ecosystem, designed for storing large datasets in a
distributed and fault-tolerant manner. It breaks large files into smaller blocks and distributes them across a
cluster of machines, providing scalability, reliability, and high throughput.
1. Distributed Storage:
o Large files are divided into smaller blocks (default size: 128 MB or 256 MB).
o These blocks are distributed across multiple machines (DataNodes) in the cluster.
2. Fault Tolerance:
o Data is replicated across multiple DataNodes (default replication factor = 3).
o If one node fails, the data remains available through the replicated blocks on other nodes.
3. Write-Once, Read-Many:
o Optimized for scenarios where data is written once and read many times (e.g., log files, large
datasets).
o It does not support frequent updates or random writes.
4. High Throughput:
o Optimized for high throughput rather than low latency, making it suitable for batch
processing and large data analytics tasks.
5. Data Locality:
o Hadoop tries to process data where it is stored, reducing network traffic and improving
performance by running MapReduce jobs close to data.
6. Scalability:
o It can scale horizontally by adding new DataNodes as data volume increases. HDFS can
store petabytes of data across thousands of machines.
7. Master-Slave Architecture:
o NameNode (Master): Manages metadata and file system namespace (file names, block
locations).
o DataNodes (Slaves): Store actual data blocks and handle read/write operations.
Components of HDFS:
1. NameNode (Master):
o Stores the metadata of the entire file system (e.g., file names, locations of blocks).
o Coordinates access to files and maintains the file system namespace.
2. DataNode (Slave):
o Stores actual data blocks.
o Each DataNode periodically sends heartbeat signals and block reports to the NameNode.
3. Secondary NameNode:
o Creates periodic checkpoints to reduce recovery time for the NameNode in case of failure.
o Does not serve as a backup for NameNode, but helps in managing its metadata logs.
Advantages of HDFS:
1. Fault Tolerance:
o Data replication ensures that the system can continue to function even if some DataNodes
fail.
2. Scalable:
o Can easily scale by adding more DataNodes to accommodate increasing data volumes.
3. Cost-Effective:
o Uses commodity hardware for storage, making it affordable compared to traditional storage
systems.
4. High Throughput:
o Well-suited for batch processing and big data analytics where throughput is more important
than latency.
Disadvantages of HDFS:
HBase
HBase is a distributed, scalable, NoSQL database built on top of Hadoop’s HDFS. It is designed to provide
real-time, random read/write access to large amounts of sparse data.
1. Column-Oriented Database:
o Stores data in column families rather than rows, allowing efficient reads and writes for
sparse datasets.
2. Built on HDFS:
o Uses Hadoop’s HDFS for reliable, distributed storage and fault tolerance.
3. Real-Time Access:
o Supports fast random reads and writes, unlike HDFS which is optimized for batch
processing.
4. Scalability:
o Can scale horizontally by adding more servers (RegionServers) to handle large datasets and
high throughput.
5. Automatic Sharding:
o Data is automatically split into regions and distributed across multiple servers.
6. Strong Consistency:
o Provides strong consistency for read and write operations.
7. No Fixed Schema:
o Supports flexible schema; columns can be added dynamically without predefined structure.
Core Components:
Use Cases:
Real-time analytics
Time-series data
Online applications requiring fast random access
Storing large sparse datasets like web logs or social media data
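One way to exercise these features from Python is the happybase client, which talks to HBase through its Thrift gateway; this is only a sketch, and the host, table name, and column family are assumptions (the Thrift server must be enabled and the table created beforehand).

```python
# Sketch: random reads/writes against HBase via the happybase Thrift client.
import happybase

connection = happybase.Connection("localhost")   # assumed Thrift gateway host
table = connection.table("web_logs")             # table and column family assumed to exist

# Row keys and column qualifiers are bytes; columns live inside a column family ("cf" here)
table.put(b"user1|2025-01-15", {b"cf:page": b"/home", b"cf:status": b"200"})

# Fast random read of a single row
print(table.row(b"user1|2025-01-15"))

# Scan a range of row keys (e.g., all events for user1)
for key, data in table.scan(row_prefix=b"user1|"):
    print(key, data)

connection.close()
```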
Advantages:
Disadvantages:
YARN is a core component of the Hadoop ecosystem that manages and schedules resources in a Hadoop
cluster. It was introduced in Hadoop 2.x to improve the resource management capabilities of Hadoop and to
overcome the limitations of the MapReduce framework.
1. Resource Management:
o YARN is responsible for allocating resources across various applications running on a
Hadoop cluster.
o It separates the resource management and job scheduling functions, unlike the older
MapReduce framework, where the ResourceManager and job execution were tightly
coupled.
2. Cluster Resource Scheduler:
o It allows multiple applications (MapReduce, Spark, Tez, etc.) to run on the same Hadoop
cluster by efficiently distributing resources to them.
3. Scalability:
o YARN enables Hadoop clusters to scale efficiently by distributing resources dynamically,
allowing the addition of more applications and users without significant overhead.
4. Multi-Tenancy:
o YARN supports multi-tenancy, meaning multiple applications or frameworks can run
simultaneously on the same cluster, efficiently utilizing resources without interference.
1. ResourceManager (RM):
o The ResourceManager is the master daemon that manages and allocates resources in the
cluster.
o It has two main components:
Scheduler: Allocates resources to applications based on scheduling policies.
ApplicationManager: Manages the lifecycle of applications, including job
submission and monitoring.
2. NodeManager (NM):
o The NodeManager is the worker daemon running on each node in the cluster. It monitors
the resource usage (CPU, memory, etc.) and reports it to the ResourceManager.
o It also manages the lifecycle of containers running on its node.
3. ApplicationMaster (AM):
o Each application submitted to the cluster has its own ApplicationMaster.
o The ApplicationMaster is responsible for the lifecycle of a single job. It negotiates resources
with the ResourceManager, monitors the application's progress, and handles failures.
4. Container:
o A container is the fundamental unit of resource allocation in YARN. It encapsulates the
necessary resources (memory, CPU) and the environment required to run a task.
5. JobHistoryServer:
o The JobHistoryServer stores the history of jobs that have been completed, including logs
and metrics. It allows users to track job performance after execution.
Advantages of YARN:
Disadvantages of YARN:
1. Complexity:
o The introduction of YARN adds complexity to the Hadoop ecosystem. It requires careful
configuration and management of multiple components like ResourceManager,
NodeManager, ApplicationMaster, and Scheduler.
2. Increased Overhead:
o The overhead of managing multiple frameworks and applications might increase, particularly
in cases where there are many smaller jobs running in the cluster.
3. Resource Fragmentation:
o In multi-tenant environments, resource fragmentation can occur, leading to inefficiencies if
not managed properly by the Scheduler.
Unit-4
High Dimensional Data refers to datasets with a large number of features (also called variables, attributes,
or dimensions) relative to the number of observations (data points). This is common in fields like genomics,
image processing, text analysis, and finance.
Key Concepts
1. Curse of Dimensionality:
o As the number of dimensions increases, the data becomes sparse, and distances between data
points become less meaningful.
o Algorithms that work well in low dimensions (e.g., k-NN, clustering) may perform poorly.
2. Overfitting:
o More features can lead to models that capture noise instead of patterns, especially if the
number of samples is small.
o Regularization techniques (like L1/L2 penalties) help control overfitting.
3. Feature Selection vs. Dimensionality Reduction:
o Feature Selection: Selects a subset of relevant features (e.g., using mutual information, chi-
squared tests).
o Dimensionality Reduction: Transforms the data into a lower-dimensional space (e.g., PCA,
t-SNE, UMAP).
1. Visualization Challenges:
o It’s hard to visualize more than 3D. Techniques like t-SNE and UMAP help to project high-
dimensional data into 2D or 3D.
2. Computational Complexity:
o High dimensions increase the computational load, especially for distance-based algorithms.
What is Dimensionality?
Dimensionality refers to the number of features or variables in a dataset. In simpler terms, it’s the number
of independent values or coordinates needed to represent data points in a space.
Example:
A dataset with:
o 1 feature → 1D (e.g., temperature)
o 2 features → 2D (e.g., temperature and humidity)
o 3 features → 3D (e.g., temperature, humidity, and pressure)
o 1000 features → High-dimensional data
Types of Dimensionality
Dimensionality Reduction
Dimensionality Reduction is the process of reducing the number of input variables or features in a dataset
while retaining as much meaningful information as possible.
This is especially important when working with high-dimensional data, where too many features can lead
to overfitting, increased computational cost, and difficulty in visualization.
Goals of Dimensionality Reduction
1. Feature Selection
Common Methods:
Method Description
Filter Methods: Use statistical tests (e.g., correlation, chi-square, ANOVA) to select features
Wrapper Methods: Use machine learning models to evaluate feature subsets (e.g., RFE - Recursive Feature Elimination)
Embedded Methods: Feature selection is built into the model (e.g., Lasso, tree-based models)
2. Feature Extraction
Linear Techniques:
PCA (Principal Component Analysis): Projects data onto the directions of maximum variance (unsupervised)
Non-linear Techniques:
Kernel PCA: Applies PCA in a kernel-induced feature space to capture non-linear structure
t-SNE / UMAP: Project high-dimensional data into 2D or 3D, mainly for visualization
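A short scikit-learn sketch contrasting linear PCA with Kernel PCA on synthetic high-dimensional data; the sample size, feature count, and kernel choice are illustrative assumptions.

```python
# Sketch: dimensionality reduction with PCA (linear) and Kernel PCA (non-linear).
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

# 200 samples with 50 features stands in for "high-dimensional" data
X, _ = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)

# PCA: keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print("PCA reduced shape:", X_pca.shape, "explained:", pca.explained_variance_ratio_.sum())

# Kernel PCA: project to 2D with an RBF kernel to capture non-linear structure / aid visualization
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
print("Kernel PCA shape:", X_kpca.shape)
```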
In the context of data science, machine learning, or software development, User Interface (UI) and
visualization are key aspects for improving user experience and data interpretation. The goal is to make
complex data or processes easily understandable and actionable. Let’s break down both concepts and how
they can work together.
A User Interface (UI) is how users interact with software or hardware. It involves the layout, design, and
interaction mechanisms that allow users to input data, navigate, and interact with the application.
2. Data Visualization
Data Visualization is the graphical representation of data. Visualizations help users understand trends,
patterns, and outliers by using charts, graphs, and other visual formats.
Bar Charts: Show categorical data comparisons (e.g., sales across different months).
Line Charts: Represent trends over time (e.g., stock prices, temperature change).
Pie Charts: Show proportions of a whole (e.g., market share of different brands).
Scatter Plots: Display relationships between two continuous variables (e.g., height vs. weight).
Heatmaps: Show data intensity with color gradients (e.g., correlation matrix).
Histograms: Show frequency distribution of a variable (e.g., age distribution).
Box Plots: Show the distribution and outliers in data.
Maps: Geospatial data visualization (e.g., locations of stores, weather patterns).
Word Clouds: Used in textual data analysis (e.g., most common words in a document).
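A few of the chart types above in a quick matplotlib sketch; the small data arrays are invented purely for illustration.

```python
# Sketch: bar chart, line chart, and histogram with matplotlib.
import matplotlib.pyplot as plt
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]                           # invented categorical comparison data
ages = np.random.default_rng(0).normal(35, 10, 500)   # invented distribution data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].bar(months, sales)                # bar chart: categorical comparison
axes[0].set_title("Sales by month")

axes[1].plot(months, sales, marker="o")   # line chart: trend over time
axes[1].set_title("Sales trend")

axes[2].hist(ages, bins=20)               # histogram: frequency distribution
axes[2].set_title("Age distribution")

plt.tight_layout()
plt.show()
```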
1. Choose the right chart: Select the visualization that best represents the data and the insights you
want to convey.
2. Simplify: Avoid clutter and unnecessary elements. Stick to the essentials.
3. Use color effectively: Use contrasting colors to highlight differences or trends, but avoid
overloading with too many colors.
4. Label properly: Ensure that axes, legends, and titles are clear and descriptive.
5. Context matters: Provide context or annotations to help users understand the significance of the
visualization.
When you combine UI and data visualization, you create interactive systems where users can explore data
and gain insights visually. Here are some ways UI and visualization work together:
Interactive Dashboards: Allow users to interact with graphs, filter data, and dynamically explore
visualizations.
Real-time Data Updates: Visualizations that reflect the most recent data as it updates.
User Controls: Elements like sliders, checkboxes, or dropdown menus to adjust the visualization
parameters (e.g., time range, data type).
Annotations: Features that allow users to add notes or insights to specific points in the data.
Exporting: Options to download charts or reports for further analysis.
1. Business Dashboards: Provide real-time KPIs (Key Performance Indicators), charts, and metrics for
managers to track business performance.
2. Data Exploration Tools: Tools like Tableau, Power BI, or Google Data Studio allow users to create,
filter, and modify visualizations.
3. Scientific Visualization Software: Tools for visualizing complex data, like biological datasets or
astronomical images, with features for 3D rendering and interaction.
4. Analytics Platforms: Machine learning platforms (e.g., Jupyter Notebooks, Google Colab) often
integrate visualizations (like Matplotlib or Seaborn) to make data analysis more intuitive.
React: JavaScript library for building interactive UIs, often used in combination with D3.js or
Chart.js for data visualizations.
Vue.js: Lightweight JavaScript framework for building reactive UIs.
Angular: A full-featured framework for building complex, single-page applications (SPAs) with
strong data binding features.
Bootstrap: Front-end framework for creating responsive designs and layouts.
Flask/Django (Python): Backend frameworks often paired with JavaScript (React/Vue) to serve
data to users.
D3.js: A powerful JavaScript library for creating interactive, complex, and highly customizable
visualizations.
Plotly: Interactive graphs for web applications; integrates with Python, R, and JavaScript.
Matplotlib & Seaborn: Python libraries for static, high-quality visualizations (commonly used in
data science).
Chart.js: Simple-to-use JavaScript library for creating responsive charts.
ggplot2: R library for creating elegant and complex visualizations.
Tableau/Power BI: Popular drag-and-drop tools for creating business intelligence visualizations and
dashboards.
1. Web Dashboards:
Example: A dashboard displaying sales data, with users able to filter by year, region, or product
category. Visualizations could include bar charts, line charts, and heatmaps.
2. Geospatial Visualizations:
Example: A map showing the locations of delivery trucks, with users able to zoom in/out and click
on markers to get more details.
3. Real-time Analytics:
Example: A monitoring system for website traffic with real-time line charts showing visitor counts
and user engagement.
Interact with the data: Filter, sort, zoom, and manipulate visualizations to uncover insights.
Make informed decisions: By simplifying complex data into easy-to-understand visual
representations.
Enhance accessibility: For non-expert users to understand and analyze data without needing deep
technical knowledge.
When designing a User Interface (UI) and implementing data visualizations, certain properties are
essential for ensuring that the system is intuitive, effective, and provides valuable insights. These properties
make the interface/user experience functional, engaging, and informative.
1. Desirable Properties of User Interfaces (UI)
1.1 Usability
Definition: The ease with which a user can learn and use the interface to achieve their goals.
Key Elements:
o Intuitive Navigation: Menus, buttons, and interactions should feel natural and easy to find.
o Consistency: Repeating design patterns across the app helps users know what to expect and
minimizes confusion.
o Clear Feedback: Provide immediate feedback on user actions (e.g., loading spinners,
tooltips, button states) to assure users their actions are being processed.
o Error Prevention and Recovery: Design with error prevention in mind, and provide helpful
error messages when things go wrong.
1.2 Responsiveness
Definition: The ability of the UI to adapt to different screen sizes, devices, and user actions.
Key Elements:
o Mobile Responsiveness: The UI adjusts seamlessly to mobile screens, tablets, and desktops.
o Real-Time Interactivity: Immediate feedback when users interact with UI elements (e.g.,
forms, buttons) or data updates.
1.3 Aesthetics
Definition: The visual appeal of the interface; it should be pleasing and engaging without being
overwhelming.
Key Elements:
o Visual Hierarchy: Important elements (buttons, primary actions) are visually distinct and
easy to identify.
o Color Scheme: Colors should not only be aesthetically pleasing but should also convey
meaning (e.g., red for errors, green for success).
o Minimalism: Avoid unnecessary elements that can clutter the interface. Every element
should have a clear purpose.
1.4 Accessibility
Definition: Designing the UI in a way that it is usable by people with various disabilities.
Key Elements:
o Keyboard Navigability: Ensure that users can navigate without a mouse (important for users
with motor disabilities).
o Screen Reader Support: Proper use of ARIA (Accessible Rich Internet Applications) tags
for visually impaired users.
o Color Blindness Consideration: Avoid using color alone to convey meaning (e.g., using
color + text or patterns).
1.5 Efficiency
Definition: The interface should let users accomplish their tasks quickly, in few steps and with minimal effort.
2. Desirable Properties of Data Visualizations
2.1 Clarity
Definition: The visualization should present data clearly and without ambiguity, making it easy for
users to interpret and understand the story behind the data.
Key Elements:
o Simple Visuals: Avoid overloading with excessive chart types or details. Stick to the
essentials.
o Proper Labeling: Axes, titles, legends, and units should be clearly labeled and easy to
understand.
o Logical Scale: Ensure that scales (e.g., axis ranges) are logical and appropriate for the data.
2.2 Accuracy
Definition: The visualization should represent the data correctly, without misleading the user.
Key Elements:
o Correct Axes Scaling: Ensure that axis scales are consistent and do not exaggerate trends
(e.g., avoid misleading bar charts with disproportionate axis intervals).
o Honest Representations: Avoid distorting the data or making misleading comparisons.
Ensure visual encoding accurately represents the data's magnitude or proportions.
2.3 Interactivity
Definition: The visualization should allow users to engage with the data, explore different aspects,
and discover deeper insights.
Key Elements:
o Tooltips and Hover Effects: Display additional data when users hover over elements for
more detailed insights (e.g., showing exact values on a bar in a bar chart).
o Zooming and Panning: Allow users to zoom in on specific parts of a chart (e.g., in a time-
series graph).
o Filtering: Users can filter data based on categories, time periods, or values.
2.4 Consistency
Definition: The data representation should be consistent across different views and charts.
Key Elements:
o Consistent Color Scheme: Use the same colors to represent the same categories or data
types across all visualizations.
o Uniform Layout: Data visualizations across pages or reports should follow the same visual
rules (e.g., same axis labels, scale ranges).
2.5 Engagement
Definition: The visualization should engage the user, sparking curiosity and facilitating exploration.
Key Elements:
o Interactive Features: Add elements that encourage users to explore, like drill-downs,
filtering options, or dynamic updates.
o Contextualization: Provide background information, tooltips, or data annotations to guide
users through the insights of the data.
2.6 Comparability
Definition: The visualization should enable users to compare data points effectively.
Key Elements:
o Side-by-Side Comparisons: Allow the user to compare similar categories, time periods, or
variables (e.g., using stacked bar charts or multiple line graphs).
o Clear Contrast: Ensure that different data series or categories stand out clearly from one
another (using contrasting colors, line styles, etc.).
2.7 Relevance
Definition: Only relevant data should be presented in the visualization to avoid overwhelming users.
Key Elements:
o Contextual Filters: Allow users to control which data is displayed (e.g., by date range,
categories).
o Focus on Key Metrics: Emphasize the data that is most important for the user's goals or
business needs.
When UI and data visualizations are combined into a single interface, the following properties become
essential to ensure the user experience is both functional and engaging:
Seamless Integration
Definition: The visualization should be integrated seamlessly into the UI without disrupting the
user's workflow.
Key Elements:
o Smooth Transitions: Ensure there are smooth transitions between different sections of the
app or dashboard.
o Context-Sensitive Actions: Provide users with actionable insights directly from the
visualizations (e.g., "Click to explore more").
Adaptability
Definition: The interface should adapt based on user needs or device capabilities.
Key Elements:
o Responsive Layouts: Visualizations and UI components should adjust for different screen
sizes (mobile, tablet, desktop).
o Personalization: Allow users to customize the interface and visualizations (e.g., sorting data,
setting display preferences).
Real-Time Feedback
Definition: The UI should provide real-time feedback as data changes, with dynamic visualizations
that reflect updates.
Key Elements:
o Live Data Feeds: Automatically update visualizations with new data without requiring page
refresh.
o Notifications: Inform users of important updates, changes, or anomalies in the data.
Visualization Techniques
Visualization techniques help to represent data graphically, enabling users to see trends, patterns, and
relationships that might not be obvious in raw data. Effective visualizations make complex data more
accessible and easier to understand, whether for analysis, decision-making, or communication. Below are
some of the most common and powerful visualization techniques, each suitable for specific types of data and
goals.
Line Chart
Purpose: Displays data points over a continuous range, typically used for time series data.
Ideal For: Showing trends over time, comparisons between multiple data series, or identifying
patterns like seasonality.
Variants:
o Single Line Chart: One data series over time.
o Multiple Line Chart: Multiple series plotted on the same graph for comparison.
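A quick base-R sketch of both variants; the monthly sales figures below are made up for illustration:
# Hypothetical monthly sales for two products
months  <- 1:12
sales_a <- c(10, 12, 15, 14, 18, 21, 25, 24, 22, 19, 17, 16)
sales_b <- c(8, 9, 11, 13, 12, 15, 18, 20, 21, 18, 15, 13)
# Single line chart
plot(months, sales_a, type = "l", col = "blue",
     xlab = "Month", ylab = "Sales", main = "Monthly Sales")
# Multiple line chart: add a second series for comparison
lines(months, sales_b, col = "red")
legend("topleft", legend = c("Product A", "Product B"),
       col = c("blue", "red"), lty = 1)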
1.4. Histogram
Purpose: Displays the frequency distribution of continuous data by dividing the data into bins
(intervals).
Ideal For: Showing the distribution of a single continuous variable (e.g., age distribution, income
ranges).
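A minimal sketch in base R, using simulated ages rather than real data:
# Simulate a continuous variable (ages) for illustration
set.seed(1)
ages <- round(rnorm(200, mean = 35, sd = 10))
# hist() divides the data into bins and plots the frequency distribution
hist(ages, breaks = 10, col = "lightblue",
     xlab = "Age", ylab = "Frequency", main = "Age Distribution")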
Scatter Plot
Purpose: Displays data points on a two-dimensional plane, showing the relationship between two
variables.
Ideal For: Investigating correlations or patterns between continuous variables (e.g., height vs.
weight, income vs. education level).
Area Chart
Purpose: Similar to a line chart, but the area below the line is filled with color.
Ideal For: Showing the cumulative value over time or the relative contributions of multiple series.
Bubble Chart
Purpose: Similar to a scatter plot, but with an additional dimension represented by the size of the
bubble.
Ideal For: Visualizing relationships between three variables, with the size of the bubbles
representing a third variable (e.g., market size by company, revenue vs. expenses).
3. Geospatial Visualization
Purpose: Displays data on a map, encoding values by location, colour, or marker size (e.g., the delivery-truck map described earlier).
Ideal For: Location-based data such as delivery routes, regional sales, or population density.
Dashboards
Purpose: Integrates multiple visualizations into a single interactive view, often with filters and
controls.
Ideal For: Business or operational dashboards that need to provide real-time data and interactivity
(e.g., performance metrics, sales reports).
Tools:
o Tableau
o Power BI
o Google Data Studio
Dynamic and Real-Time Visualizations
Purpose: Visualizations that change dynamically based on user inputs or changing data.
Ideal For: Time series data, real-time analytics, or simulations.
Examples:
o Stock market visualizations.
o Real-time sensor data (e.g., IoT systems).
Word Cloud
Purpose: Displays the frequency of words in a corpus, with word size proportional to frequency.
Ideal For: Text analysis (e.g., visualizing most frequent terms in a dataset of reviews or social media
posts).
Violin Plot
Purpose: Combines aspects of box plots and density plots to show the distribution of a dataset.
Ideal For: Visualizing the distribution of continuous data across multiple categories.
Gantt Chart
Purpose: Used in project management to show the timeline of tasks, including their start and end
dates.
Ideal For: Project planning and scheduling.
6. Time Series Visualizations
Time-Series Heatmap
Purpose: A time-series heatmap displays time-based data in a matrix format, with time periods on
one axis and data categories on the other.
Ideal For: Visualizing time-dependent patterns or cycles (e.g., website activity over the day of the
week).
Parallel Coordinates Plot
Purpose: Displays multi-dimensional data by plotting each data point as a line across multiple
vertical axes.
Ideal For: Visualizing patterns and correlations across multiple variables.
Principal Component Analysis (PCA) Plot
Purpose: A scatter plot showing the projection of high-dimensional data onto two or three principal
components.
Ideal For: Reducing dimensionality and identifying clusters or outliers in complex datasets.
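As a small illustration using R's built-in USArrests dataset:
# Project the data onto its principal components
pca <- prcomp(USArrests, scale. = TRUE)
# Scatter plot of the first two components; clusters and outliers become visible in 2D
plot(pca$x[, 1], pca$x[, 2],
     xlab = "PC1", ylab = "PC2", main = "PCA projection of USArrests")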
R overview:
R code that you write on one platform can easily be ported to another
without any issues. Cross-platform interoperability is an important
feature to have in today’s computing world.
Features of R:
R is free and open source, cross-platform, provides a wide range of statistical and graphical techniques, and can be extended with thousands of contributed packages.
How to install R (Windows):
1. Go to http://ftp.heanet.ie/mirrors/cran.r-project.org.
2. Under “Download and Install R”, click on the “Windows” link.
3. Under “Subdirectories”, click on the “base” link.
4. On the next page, you should see a link saying something like “Download R
3.4.3 for Windows” (or R X.X.X, where X.X.X gives the version of R, eg.
R 3.4.3). Click on this link.
5. You may be asked if you want to save or run a file “R-3.4.3-
win32.exe”. Choose “Save” and save the file on the Desktop.
Then double-click on the icon for the file to run it.
6. You will be asked what language to install it in - choose English.
7. The R Setup Wizard will appear in a window. Click “Next” at
the bottom of the R Setup wizard window.
8. The next page says “Information” at the top. Click “Next” again.
9. The next page says “Information” at the top. Click “Next” again.
10. The next page says “Select Destination Location” at the
top. By default, it will suggest to install R in “C:\Program Files”
on your computer.
11. Click “Next” at the bottom of the R Setup wizard window.
12. The next page says “Select components” at the top. Click “Next” again.
13. The next page says “Startup options” at the top. Click “Next” again.
14. The next page says “Select start menu folder” at the top. Click “Next”
again.
15. The next page says “Select additional tasks” at the top. Click “Next” again.
16. R should now be installed. This will take about a
minute. When R has finished, you will see “Completing the
R for Windows Setup Wizard” appear. Click “Finish”.
17. To start R, you can either follow step 18, or 19:
18. Check if there is an “R” icon on the desktop of the
computer that you are using. If so, double-click on the “R” icon
to start R. If you cannot find an “R” icon, try step 19 instead.
19. Click on the “Start” button at the bottom left of your
computer screen, and then choose “All programs”, and start R
by selecting “R” (or R X.X.X, where X.X.X gives the version of R,
eg. R 3.4.3) from the menu of programs.
20. The R console (a rectangle) should pop up:
How to install RStudio:
For Windows users, RStudio is available for Windows Vista and above.
1. Go to https://www.rstudio.com/products/rstudio/download/
2. Download the RStudio installer for Windows and run it.
3. Click Next... Next... Finish.
4. Installation is complete.
The RStudio window has four main panes:
1. R Console: This area shows the output of the code you run. You can also type code directly into the console, but code entered directly into the console cannot be traced later; this is where the R script comes into use.
2. R Script: As the name suggests, this is where you write your code. To run the code, select the line(s) and press Ctrl + Enter, or click the little 'Run' button at the top right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added, including data sets, variables, vectors, functions, etc. To check whether data has been loaded properly into R, always look at this area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. In this pane you can also browse packages and access R's embedded official documentation for help.
To install a package from CRAN, use:
install.packages("package name")
R - Basic Syntax:
You will type R commands into the R console in order to carry out analyses in R.
Once you have started R, you can start typing in commands, and the
results will be calculated immediately, for example:
Variables in R
Variables are used to store data whose value can be changed according to our need. The unique name given to a variable (and to functions and objects) is called an identifier.
Rules for identifiers:
1. Identifiers can be a combination of letters, digits, periods (.) and underscores (_).
2. They must start with a letter or a period. If they start with a period, the period cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.
Valid identifiers in R include: total, Sum, .fine.with.dot, this_is_acceptable, Number5
All variables (scalars, vectors, matrices, etc.) created by R are called objects.
In R, we assign values to variables using the arrow (<-, ->) and equals (=) operators.
For example, we can assign the value 2*3 to the variable x using the command:
> x <- 2*3
OR
> x = 2*3
OR
> 2*3 -> x
To view the contents of any R object, just type its name and the contents of that R object will be displayed:
> x
[1] 6
OR
> print(x)
[1] 6
Comments
Comments are like helping text in your R program; they are ignored by the interpreter while executing the program. A single-line comment is written using # at the beginning of the statement, as follows:
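For example:
# This is a comment: the interpreter ignores this line
x <- 10     # a comment can also follow a statement on the same line
print(x)
[1] 10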
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered.
Basic types of constant are numeric constants and character constants.
Numeric Constants
All numbers are numeric constants. By default they are stored as double; the suffix L creates an integer and the suffix i creates a complex number:
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers:
> 0xff
[1] 255
> 0XF + 1
[1] 16
Character Constants
> 'example'
[1] "example"
> typeof("5")
[1] "character"
Built-in Constants
Some of the built-in constants defined in R, along with their values, are shown in the example below.
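A few of R's built-in constants, inspected at the console:
> pi
[1] 3.141593
> LETTERS[1:5]
[1] "A" "B" "C" "D" "E"
> letters[1:5]
[1] "a" "b" "c" "d" "e"
> month.abb[1:4]
[1] "Jan" "Feb" "Mar" "Apr"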
When a character value is printed with print(), the quotes are printed by default. To avoid this, we can pass the argument quote = FALSE.
If there is more than one item, we can use the paste() or cat() function to concatenate the strings together, for example:
print(paste("Hi,", my.name, "next year you will be", my.age+1, "years old."))
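A runnable sketch of the above; the values of my.name and my.age are not given in the material, so the ones below are assumed for illustration:
my.name <- "Asha"   # assumed value, for illustration only
my.age  <- 20       # assumed value, for illustration only
print(my.name)                  # [1] "Asha"  (quotes printed by default)
print(my.name, quote = FALSE)   # [1] Asha   (quotes suppressed)
print(paste("Hi,", my.name, "next year you will be", my.age + 1, "years old."))
# [1] "Hi, Asha next year you will be 21 years old."
cat("Hi,", my.name, "next year you will be", my.age + 1, "years old.\n")
# Hi, Asha next year you will be 21 years old.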
Reserved words in R
The reserved words in R are: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_ and NA_character_. The list can be viewed by typing:
> ?reserved
R - Data Types:
There are several basic data types in R which are of frequent occurrence in
coding R calculations and programs. The variables are assigned with R-
Objects and the data type of the R-object becomes the data type of the
variable. There are many types of R-objects.
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data
types of these atomic vectors, also termed as six classes of vectors. The
other R-Objects are built upon the atomic vectors.
The atomic vector classes, with an example and a way to verify each, are:
Logical (e.g. TRUE, FALSE):
   v <- TRUE
   print(class(v))
   [1] "logical"
Numeric (e.g. 12.3, 5, 999):
   v <- 23.5
   print(class(v))
   [1] "numeric"
Integer (e.g. 2L, 34L):
   v <- 2L
   print(class(v))
   [1] "integer"
Complex (e.g. 3 + 2i):
   v <- 2+5i
   print(class(v))
   [1] "complex"
Character (e.g. 'a', "good", "TRUE", '23.4'):
   v <- "TRUE"
   print(class(v))
   [1] "character"
Raw ("Hello" is stored as 48 65 6c 6c 6f):
   v <- charToRaw("Hello")
   print(class(v))
   [1] "raw"
In R programming, the very basic data types are the R-objects called vectors
which hold elements of different classes as shown above.
Vectors
When you want to create vector with more than one element, you
should use c() function which means to combine the elements into a
vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result −
[1] "red"    "green"  "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it, like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3), 21.3, sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)  .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix() function.
# Create a matrix.
M <- matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions.
# Create an array.
a <- array(c('green','yellow'), dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1
     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"
, , 2
     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
Factors
Factors are the r-objects which are created using a vector. It stores the
vector along with the distinct values of the elements in the vector as labels.
The labels are always character irrespective of whether it is numeric or
character or Boolean etc. in the input vector. They are useful in statistical
modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor and the number of levels.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result −
[1] green  green  yellow red    red    red    green
Levels: green red yellow
[1] 3
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric, the second character, and the third logical. A data frame is a list of vectors of equal length.
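A short example of creating a data frame (the values are illustrative):
# Create a data frame: each column is a vector of equal length
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),   # character column
   height = c(152, 171.5, 165),            # numeric column
   weight = c(81, 93, 78),                 # numeric column
   age    = c(42, 38, 26)                  # numeric column
)
print(BMI)
#   gender height weight age
# 1   Male  152.0     81  42
# 2   Male  171.5     93  38
# 3 Female  165.0     78  26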
R - Operators:
Types of Operators
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
These operators act element-wise on vectors. With v <- c(2, 5.5, 6) and t <- c(8, 3, 4):
+ : Adds two vectors
   print(v + t)
   [1] 10.0  8.5 10.0
- : Subtracts the second vector from the first
   print(v - t)
   [1] -6.0  2.5  2.0
* : Multiplies both vectors
   print(v * t)
   [1] 16.0 16.5 24.0
/ : Divides the first vector by the second
   print(v / t)
   [1] 0.250000 1.833333 1.500000
%% : Gives the remainder of the first vector divided by the second
   print(v %% t)
   [1] 2.0 2.5 2.0
%/% : Gives the quotient of the division of the first vector by the second
   print(v %/% t)
   [1] 0 1 1
^ : Raises the first vector to the exponent of the second
   print(v ^ t)
   [1] 256.000 166.375 1296.000
Relational Operators
Each element of the first vector is compared with the corresponding element of the second vector; the result is a Boolean vector. With v <- c(2, 5.5, 6, 9) and t <- c(8, 2.5, 14, 9):
< : Checks if each element of the first vector is less than the corresponding element of the second vector
   print(v < t)
   [1] TRUE FALSE TRUE FALSE
== : Checks if each element of the first vector is equal to the corresponding element of the second vector
   print(v == t)
   [1] FALSE FALSE FALSE TRUE
<= : Checks if each element of the first vector is less than or equal to the corresponding element of the second vector
   print(v <= t)
   [1] TRUE FALSE TRUE TRUE
!= : Checks if each element of the first vector is unequal to the corresponding element of the second vector
   print(v != t)
   [1] TRUE TRUE TRUE FALSE
Logical Operators
Logical operators apply to vectors of type logical, numeric or complex; non-zero values are treated as TRUE and zero as FALSE.
& : Element-wise Logical AND. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE.
   v <- c(3, 1, TRUE, 2+3i)
   t <- c(4, 1, FALSE, 2+3i)
   print(v & t)
   [1] TRUE TRUE FALSE TRUE
| : Element-wise Logical OR. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
   v <- c(3, 0, TRUE, 2+2i)
   t <- c(4, 0, FALSE, 2+3i)
   print(v | t)
   [1] TRUE FALSE TRUE TRUE
&& : Logical AND. Takes the first element of both vectors and gives TRUE only if both are TRUE.
   v <- c(3, 0, TRUE, 2+2i)
   t <- c(1, 3, TRUE, 2+3i)
   print(v && t)
   [1] TRUE
|| : Logical OR. Takes the first element of both vectors and gives TRUE if at least one of them is TRUE.
   v <- c(0, 0, TRUE, 2+2i)
   t <- c(0, 3, TRUE, 2+3i)
   print(v || t)
   [1] FALSE
(Note: in recent versions of R, && and || require length-one operands; older versions silently used only the first elements, as shown here.)
Assignment Operators
<-, =, <<- : Called Left Assignment
   v1 <- c(3, 1, TRUE, 2+3i)
   v2 <<- c(3, 1, TRUE, 2+3i)
   v3 = c(3, 1, TRUE, 2+3i)
   print(v1)
   print(v2)
   print(v3)
   [1] 3+0i 1+0i 1+0i 2+3i
   [1] 3+0i 1+0i 1+0i 2+3i
   [1] 3+0i 1+0i 1+0i 2+3i
->, ->> : Called Right Assignment
   c(3, 1, TRUE, 2+3i) -> v1
   c(3, 1, TRUE, 2+3i) ->> v2
   print(v1)
   print(v2)
   [1] 3+0i 1+0i 1+0i 2+3i
   [1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.
: (Colon operator): Creates a series of numbers in sequence for a vector.
   v <- 2:8
   print(v)
   [1] 2 3 4 5 6 7 8
%in% : Used to identify whether an element belongs to a vector.
   v1 <- 8
   v2 <- 12
   t <- 1:10
   print(v1 %in% t)
   print(v2 %in% t)
   [1] TRUE
   [1] FALSE
R - Decision Making:
Decision-making structures require the programmer to specify one or more conditions to be evaluated, along with the statement(s) to be executed if the condition is true and, optionally, statement(s) to be executed if the condition is determined to be false.
R provides the following types of decision-making statements:
1. if statement
2. if...else statement
3. if...else if...else statement
4. switch statement
If Statement:
Syntax
if(boolean_expression) {
   // statement(s) will execute if the boolean expression is true.
}
Flow Diagram
Example
x <- 30L
if(is.integer(x)) {
   print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result −
[1] "X is an Integer"
If...Else Statement
Syntax
if(boolean_expression) {
// statement(s) will execute if the boolean expression is true.
} else {
// statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be executed, otherwise the else block of code will be executed.
Flow Diagram
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}
When the above code is compiled and executed, it produces the following result −
[1] "Truth is not found"
The comparison is case-sensitive: "Truth" (capital T) is not an element of the vector.
The if...else if...else Statement
When using if...else if...else statements, there are a few points to keep in mind:
An if can have zero or one else and it must come after any else if's.
An if can have zero to many else if's and they must come before the else.
Once an else if succeeds, none of the remaining else if's or else's will be
tested.
Syntax
if(boolean_expression 1) {
   // Executes when the boolean expression 1 is true.
} else if(boolean_expression 2) {
   // Executes when the boolean expression 2 is true.
} else if(boolean_expression 3) {
   // Executes when the boolean expression 3 is true.
} else {
   // Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
   print("No truth found")
}
When the above code is compiled and executed, it produces the following result −
[1] "truth is found the second time"
Switch Statement:
A switch statement tests a variable against a list of values (cases) and executes the statement matching the selected case, e.g. switch(expression, case1, case2, ...).
R - Loops:
For Loop
The general form is:
for (n in x) { ... }
It means that there will be one iteration of the loop for each component of the vector x, with n taking on the values of those components: in the first iteration, n = x[1]; in the second iteration, n = x[2]; and so on.
Example: Program to print the multiplication table of a number
# take input from the user
num = as.integer(readline(prompt = "Enter a number: "))
# use for loop to iterate 10 times
for(i in 1:10) {
   print(paste(num, 'x', i, '=', num*i))
}
While Loop
A while loop repeats a block of statements as long as its condition remains TRUE.
Example
> i <- 1
> while (i <= 10) i <- i + 4
> i
[1] 13
Example: Program to find the sum of the first n natural numbers using a while loop
# take input from the user
num = as.integer(readline(prompt = "Enter a number: "))
sum = 0
i = 1
while (i <= num) {
   sum = sum + i
   i = i + 1
}
print(paste("The sum is", sum))
Repeat Loop
A repeat loop executes its body again and again until a break statement is encountered.
Syntax:
repeat {
   statement
}
Example:
x <- 1
repeat {
   print(x)
   x = x + 1
   if (x == 6)
      break
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
UNIT V
R is widely used across many industries due to its strong capabilities in data
analysis and visualization. Some key applications include:
Data Analysis and Statistics: R is widely used for statistical analysis and
modeling with built-in functions and packages that simplify complex
computations.
Data Visualization: With libraries like ggplot2 and lattice, R enables
creation of detailed and customizable charts and graphs for effective
data presentation.
Data Cleaning and Preparation: R provides tools to import, clean, and
transform data from various sources, making it ready for analysis.
Machine Learning and Data Science: R supports machine learning
through packages such as caret, randomForest, and xgboost, helping
build predictive models.
Reporting and Reproducible Research: Tools like R Markdown
and knitr allow dynamic report generation and sharing of reproducible
data analyses.
INTERFACING WITH OTHER LANGUAGES
Interfacing allows you to combine R's statistical power with the strengths of other
languages, such as Python or C++, for improved performance and specialized libraries.
Rcpp (C++): The Rcpp package provides C++ classes and functions that offer seamless
integration of R and C++. It significantly simplifies passing data between R and C++ for
writing high-performance functions.
Other interfaces: Functions like .C() and .Fortran() exist for more direct interfacing with
compiled code, though Rcpp is often recommended for new projects.
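A minimal sketch of the Rcpp approach described above, assuming the Rcpp package and a C++ compiler are installed (the function sumC is purely illustrative):
library(Rcpp)
# Compile a small C++ function and make it callable from R in one step
cppFunction('
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) {
      total += x[i];
    }
    return total;
  }
')
sumC(c(1, 2, 3.5))   # [1] 6.5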
PARALLEL PROGRAMMING
1. Install Required Packages: R provides several packages for parallel computing, such
as parallel, snow, and doMC. Install these packages using the install.packages()
function.
2. Check Available Cores: R's parallel processing capabilities depend on the number of
CPU cores available. Use the detectCores() function to determine how many cores
your computer has.
3. Load the Parallel Package: Once the packages are installed, load the parallel
package into your R session using the library(parallel) function.
4. Initialize Parallel Processing: Use the parLapply() function to divide tasks into sub-
vectors and execute them in parallel.
5. Utilize Parallel Functions: R offers several functions for parallel computation,
including parLapply(), parSapply(), and mclapply(). You can leverage these to
perform parallelized operations on your data.
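A minimal sketch of steps 2-5 using the parallel package (parLapply() also works on Windows, whereas mclapply() is Unix-only):
library(parallel)
n_cores <- detectCores()                 # step 2: check available cores
cl <- makeCluster(max(1, n_cores - 1))   # step 4: initialize a cluster, leaving one core free
# step 5: run a task in parallel over a vector
squares <- parLapply(cl, 1:10, function(i) i^2)
stopCluster(cl)                          # always release the workers when finished
print(unlist(squares))
# [1]   1   4   9  16  25  36  49  64  81 100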
Basic statistics involves the collection, summarization, and interpretation of data. It uses measures to
describe a dataset's main features.
Descriptive statistics: Methods for summarizing and organizing data. Key concepts include:
> Measures of central tendency: The mean (average), median (middle value), and
mode (most frequent value).
> Measures of dispersion: The standard deviation and variance, which measure how
spread out the data is.
Descriptive statistics
Descriptive statistics summarize and organize the key features of a dataset, providing a clear overview of its characteristics. These methods help describe a collection of information by generating brief informational coefficients. Unlike inferential statistics, descriptive statistics focus only on the data at hand rather than making inferences about a larger population.
Measures of central tendency identify a single representative value that best describes the center
of a dataset. The three most common measures are:
Mean (average): The sum of all values in a dataset divided by the number of values. It is best used for symmetrical distributions but can be skewed by outliers.
o Formula for the population mean: μ = (Σx) / N, where Σx is the sum of all values and N is the number of values.
Median: The middle value of a dataset when arranged in ascending or descending order.
It is less affected by extreme outliers than the mean, making it a better measure for
skewed distributions.
o Calculation: For an odd number of observations, the median is the middle
value. For an even number, it is the average of the two middle values.
Mode: The value that appears most frequently in a dataset. A dataset can have one
mode (unimodal), more than one mode (multimodal), or no mode at all. The mode is
the only measure of central tendency that can be used with categorical (non-
numerical) data.
Measures of dispersion describe how spread out the values in a dataset are, giving a sense of
the data's variability.
Variance: Measures how far each number in a set is from the mean. It is calculated by averaging the squared differences from the mean.
o Formula for the population variance: σ² = Σ(x - μ)² / N, where x is each individual value, μ is the population mean, and N is the number of values.
Standard deviation: The square root of the variance, bringing the measure of spread
back into the original units of the data. A low standard deviation means the data points
are generally close to the mean, while a high standard deviation indicates a wider
spread.
o Formula for the population standard deviation: σ = √( Σ(x - μ)² / N )
Worked example: consider the dataset {3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10}.
Measures of central tendency:
Mean: The sum of the values is 74 and there are 11 values, so the mean is 74/11 ≈ 6.73.
Median: First, sort the data: {3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10}. The middle (6th) value is 7.
Mode: The number 8 appears most frequently, so the mode is 8.
Measures of dispersion:
Variance:
1. First, subtract the mean (6.73) from each value and square the result.
2. Sum the squared differences, which equals approximately 46.18.
3. Divide the sum by the number of values (11).
o Variance ≈ 46.18 / 11 ≈ 4.20
Standard deviation:
o Take the square root of the variance: √4.20 ≈ 2.05
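The same example can be checked in R. Note that stat_mode below is a small helper written for this sketch (R's built-in mode() reports the storage mode, not the most frequent value), and that var()/sd() use the sample (n - 1) formulas, so they are rescaled to the population versions used above:
x <- c(3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10)
mean(x)       # [1] 6.727273   (≈ 6.73)
median(x)     # [1] 7
stat_mode <- function(v) {            # helper: most frequent value
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(x)                          # [1] 8
n <- length(x)
var(x) * (n - 1) / n                  # population variance  ≈ 4.20
sqrt(var(x) * (n - 1) / n)            # population std. dev. ≈ 2.05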
Inferential statistics: Methods for making predictions or inferences about a larger population
based on a sample of data.
Inferential statistics
It uses a sample of data to make inferences and predictions about a larger population. This is
necessary when studying an entire population is too costly, time-consuming, or impractical. The
conclusions are based on probability theory and are subject to a degree of uncertainty, which is
quantified with confidence levels and margins of error.
Core components
A strong understanding of inferential statistics requires familiarity with key terms such as population, sample, parameter, and statistic.
Inferential statistics includes a variety of methods for analyzing data and making inferences:
Hypothesis testing
This is a formal process for testing a claim or assumption about a population. It involves these
steps:
1. State the hypotheses: Formulate a null hypothesis (H0) and an alternative hypothesis (H1). The null hypothesis states there is no effect or difference, while the alternative contradicts it.
2. Calculate a test statistic: The test determines how far your sample data deviates from
the null hypothesis.
3. Determine the p-value: This is the probability of observing results as extreme as
the sample data, assuming the null hypothesis is true. A small p-value (typically <
0.05) provides evidence to reject the null hypothesis.
4. Draw a conclusion: Based on the p-value, you either reject or fail to reject the
null hypothesis.
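A small sketch of a one-sample test in R; the scores below are simulated, so the exact p-value will depend on the data:
set.seed(42)
scores <- rnorm(30, mean = 53, sd = 8)     # simulated sample
# H0: mu = 50  vs  H1: mu > 50
result <- t.test(scores, mu = 50, alternative = "greater")
result$p.value    # reject H0 if this is below the chosen significance level (e.g. 0.05)
result$conf.int   # one-sided confidence interval for the mean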
Estimation
Instead of testing a hypothesis, estimation provides a likely value or range of values for
a population parameter.
Point estimation: Uses a single value from the sample data to estimate the population parameter. For example, using the sample mean (x̄) as the single best guess for the population mean (μ).
Interval estimation: Provides a range of values, known as a confidence interval, within
which the population parameter is likely to fall. For example, a "95% confidence
interval" indicates that if you repeat the sampling process, 95% of the calculated
intervals would contain the true population parameter.
Regression analysis
This technique examines the relationship between a dependent variable and one
or more independent variables.
It allows for predictions about an outcome variable based on the input of predictor variables.
For example, a business could use regression to predict future sales based on
advertising spending.
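For instance, with hypothetical advertising/sales data:
ads   <- c(10, 15, 20, 25, 30, 35, 40)   # advertising spend (thousands)
sales <- c(25, 31, 38, 44, 52, 57, 66)   # observed sales
model <- lm(sales ~ ads)                 # fit a simple linear regression
summary(model)                           # coefficients, R-squared, p-values
# Predict sales for a new advertising budget
predict(model, newdata = data.frame(ads = 45))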
Analysis of variance (ANOVA)
ANOVA is a test used to compare the means of three or more groups simultaneously to determine if a statistically significant difference exists between them. It extends the t-test, which is used for comparing only two groups.
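A minimal one-way ANOVA sketch in R, with made-up scores for three teaching methods:
scores <- c(70, 75, 72, 80, 85, 83, 60, 66, 63)
method <- factor(rep(c("A", "B", "C"), each = 3))
fit <- aov(scores ~ method)   # one-way ANOVA
summary(fit)                  # F statistic and p-value for the group effect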
o Hypothesis testing example: A news outlet might test the hypothesis that a candidate's support is higher than 50% (H0: μ = 0.50 vs. H1: μ > 0.50). If the poll's result is statistically significant, they could reject the null hypothesis and report that the candidate is likely in the lead.
LINEAR MODELS
Linear models are fundamental and powerful tools for big data analysis, but their application requires
specialized techniques to overcome computational challenges and interpret results. Standard linear
regression is not designed for the massive scale of big data, leading to the development of scalable and
distributed methods.
Applying a conventional linear model to a big data problem presents several challenges, including data that is too large to fit in a single machine's memory, high computational cost, and very high-dimensional feature spaces.
To adapt linear models for big data, researchers and developers have created more advanced techniques that address these limitations of scale and complexity.
Distributed linear regression: This approach partitions a massive dataset across multiple
machines in a network. Computations, such as finding the sums of squares and cross
products, are performed locally on each machine. The results are then aggregated to
compute the global model coefficients. Distributed frameworks like Apache Spark and
MapReduce enable this approach.
Regularization techniques: Methods like Lasso and Ridge regression are used to handle
high-dimensional data and multicollinearity. They penalize large or unnecessary
coefficients, which prevents overfitting and improves model stability and
interpretability.
Generalized linear models (GLMs): GLMs are a flexible extension of linear models that
accommodate dependent variables with non-normal distributions, such as count data
(Poisson regression) or binary outcomes (logistic regression).
Online and streaming algorithms: For datasets that are too large to store or that arrive in
real-time, online linear regression algorithms update the model's coefficients with each
new data point or batch, rather than training on the entire dataset at once.
Approximation algorithms: Researchers have developed algorithms, such as the Multiple-
Model Linear Regression (MMLR), that construct localized linear models on subsets of the
data. This provides high accuracy with a lower time complexity than traditional methods.
Kernel methods: While not strictly linear, methods like support vector machines (SVMs)
use kernel tricks to map data into higher-dimensional spaces where a linear boundary can
separate non-linear data. This allows linear techniques to be applied to non-linear
problems.
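As an illustration of the regularization techniques listed above, a sketch using the glmnet package (assumed to be installed; the data are simulated):
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)   # 100 rows, 50 predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(100)                 # only two predictors matter
# alpha = 1 gives Lasso, alpha = 0 gives Ridge; cross-validation chooses lambda
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")   # most coefficients are shrunk to exactly zero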
GENERAL LINEAR MODELS
Generalized linear models (GLMs) can be used in big data by adapting them with distributed processing techniques, such as divide and recombine (D&R) methods, to handle massive datasets beyond the memory capacity of a single machine. GLMs extend traditional linear models to accommodate non-normally distributed response variables and use link functions to model non-linear relationships. In big data, GLMs are applied to large-scale datasets to build and score models, with specialized algorithms enabling them to process extensive numbers of predictors and observations efficiently.
1. Flexibility: GLMs can model a wide range of relationships between the response
and predictor variables, including linear, logistic, Poisson, and exponential
relationships.
2. Model interpretability: GLMs provide a clear interpretation of the relationship
between the response and predictor variables, as well as the effect of each
predictor on the response.
3. Robustness: GLMs can be robust to outliers and other anomalies in the data, as
they allow for non-normal distributions of the response variable.
4. Scalability: GLMs can be used for large datasets and complex models, as they have
efficient algorithms for model fitting and prediction.
5. Ease of use: GLMs are relatively easy to understand and use, especially compared
to more complex models such as neural networks or decision trees.
6. Hypothesis testing: GLMs allow for hypothesis testing and statistical inference,
which can be useful in many applications where it's important to understand the
significance of relationships between variables.
7. Regularization: GLMs can be regularized to reduce overfitting and improve model
performance, using techniques such as Lasso, Ridge, or Elastic Net regression.
8. Model comparison: GLMs can be compared using information criteria such as AIC
or BIC, which can help to choose the best model among a set of alternatives.
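A small sketch of a GLM in R: a logistic regression on hypothetical churn data:
hours   <- c(1, 2, 3, 5, 8, 10, 12, 15, 18, 20)   # usage hours (illustrative)
churned <- c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0)        # 1 = churned, 0 = stayed
# Binomial family with the (default) logit link function
fit <- glm(churned ~ hours, family = binomial)
summary(fit)
# Predicted probability of churn at 6 hours of usage
predict(fit, newdata = data.frame(hours = 6), type = "response")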
NON-LINEAR MODEL
Non-linear models capture relationships that cannot be described by a straight line. Their key characteristics are:
Non-proportional relationships:
Changes in the dependent variable are not directly proportional to changes in the independent
variables.
Complex patterns:
They can capture curved, exponential, logarithmic, or interactive patterns in data that linear models
cannot.
Non-linear in parameters:
In some cases, the model's regression function is nonlinear with respect to the parameters being
estimated.
Iterative estimation:
Because the relationship isn't linear, an iterative algorithm is often needed to find the best-fitting
parameters for the model.
Examples
Population modeling: Modeling population growth where birth and death rates interact.
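A sketch of fitting a non-linear growth curve with nls(); the population figures are invented and the starting values are rough guesses that the iterative algorithm refines:
time <- 1:10
pop  <- c(52, 60, 72, 85, 103, 121, 148, 175, 212, 250)   # illustrative data
# Exponential growth model: parameters a and b are estimated iteratively
fit <- nls(pop ~ a * exp(b * time), start = list(a = 50, b = 0.1))
summary(fit)   # estimated a and b with standard errors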
TIME SERIES AND AUTOCORRELATION-BASED CLUSTERING
Autocorrelation-based clustering groups time series by analyzing their internal dependency structures using autocorrelation functions.
This method involves calculating the autocorrelation function (ACF) for each time series, which
captures how a series correlates with its lagged versions, and then using these ACF values as a
basis for clustering.
By comparing the ACF profiles, similar time series—those with similar internal patterns or
dependence structures—can be grouped together.
How it works
1. Calculate autocorrelation:
The first step is to compute the autocorrelation function for each time series in the dataset. The ACF measures the correlation between a time series and its past values at different "lags" (time shifts).
2. Extract ACF features:
The ACF for a single time series generates a series of correlation values for different lags. These values form a profile that represents the series' dependence structure, revealing patterns like trends and seasonality.
3. Cluster the ACF profiles:
The fixed-length vectors of ACF values are then grouped with a standard clustering algorithm (such as k-means or hierarchical clustering), so that series with similar dependence structures end up in the same cluster.
Autocorrelation is a powerful feature for clustering because it describes the underlying dynamic
behavior of a time series rather than just the raw data points.
Captures repeating patterns: Autocorrelation helps identify seasonality and repeating patterns
that might be hidden by noise. For example, a high autocorrelation at a 24-hour lag would reveal
a daily cycle in a time series of electricity usage.
Enables clustering of different lengths: Instead of using the time series' raw data points, which
may have different lengths, you can use the autocorrelation function (ACF). This transforms
each time series into a fixed-length vector of autocorrelation coefficients, which can then be
clustered using standard algorithms.
Creates more meaningful clusters: Clustering based on the ACF allows you to group series that
share similar dynamics or behaviors, even if their raw values differ. This is especially useful for
long or high-dimensional time series, where clustering based on raw values is computationally
expensive and less effective.
CLUSTERING METHODS THAT USE AUTOCORRELATION
Several time series clustering approaches incorporate autocorrelation, either directly as a feature or
indirectly as part of a distance measure.
Feature-based clustering
This method extracts descriptive features that represent the characteristics of a time series, then
uses these features as input for a traditional clustering algorithm like k-means or hierarchical
clustering.
How it works: You can represent each time series by its autocorrelation coefficients at various
lags. This results in a feature vector that captures its serial correlation. Other statistical features,
like trend, seasonality, and variance, can also be included.
Example: A dataset of store sales might be represented by a vector containing the autocorrelation
at a weekly lag (7) and a monthly lag (30). Stores with similar vectors would be grouped into the
same cluster, representing similar sales cycles.
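A sketch of this idea in R, using simulated series (two autocorrelated, two white-noise):
set.seed(7)
ts_list <- list(
  arima.sim(list(ar = 0.9),  n = 200),   # strongly autocorrelated
  arima.sim(list(ar = 0.85), n = 200),
  rnorm(200),                            # white noise
  rnorm(200)
)
# 1. Extract ACF features: autocorrelations at lags 1..10 for each series
acf_features <- t(sapply(ts_list, function(s) {
  acf(s, lag.max = 10, plot = FALSE)$acf[-1]   # drop lag 0 (always 1)
}))
# 2. Cluster the fixed-length feature vectors with k-means
clusters <- kmeans(acf_features, centers = 2)
clusters$cluster   # series with similar dynamics share a cluster label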
Correlation-based distance measures
Rather than using raw values, these methods compute the distance between time series based
on their correlation, emphasizing the similarity of their patterns and profiles.
Cross-correlation distance: The k-shape algorithm uses a normalized cross-correlation (NCC)
based distance measure. It finds the best alignment by shifting one series relative to the other
to maximize their correlation. This makes it robust to time shifts and amplitude scaling.
Generalized cross-correlation: More advanced methods can cluster multivariate time series by
comparing the cross-correlation functions between different variables over various lags. This
reveals hidden dependencies that traditional clustering may miss, especially in noisy
environments.
Autocorrelation-based fuzzy clustering
This technique uses a fuzzy c-means model, which assigns a membership degree for each time series to a cluster, rather than forcing a hard assignment.
How it works: This approach uses a dissimilarity measure that compares the autocorrelation
functions of time series. It is particularly useful for dealing with time series that change their
dynamics over time, allowing them to belong to different clusters with varying degrees of
membership.
Standard distance measures for time series clustering
Dynamic time warping (DTW): This elastic measure "warps" the time axis
to find the optimal alignment between two series. It is more robust to
shifts and variations in speed than the Euclidean distance, making it a
popular choice for shape-based clustering.
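A brief sketch of a DTW distance, assuming the dtw package is installed; the two sine waves are synthetic:
library(dtw)
a <- sin(seq(0, 2 * pi, length.out = 100))
b <- sin(seq(0, 2 * pi, length.out = 80) - 0.5)   # same shape, shifted and shorter
alignment <- dtw(a, b)        # warps the time axis to align the two series
alignment$distance            # DTW distance: smaller means more similar shapes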
Pre-processing: Before clustering, standardize or normalize the time series (e.g., using a z-score) to make them invariant to amplitude scaling and offset. If needed, remove trends and de-seasonalize the data.