Data Analytics Unit - I Data Analytics and Lifecycle


Data Analytics

Unit - I | Data Analysis & Lifecycle


By - Er. Monu Kumar
B.Tech(CSE), M.Tech(CSE), NET JRF, Ph.D(CSE)*
Introduction to Data Analytics
Introduction to Data Analytics & Data Analytics Lifecycle

What is Data Analytics?


● The process of examining raw data to draw conclusions about that
information.
● Helps in decision-making, identifying trends, and solving problems.
● Goes beyond traditional business intelligence by focusing on predictive and
prescriptive insights.
Top Analytical Skills to become a Data Analyst
Sources and Nature of Data

Sources of Data
● Transactional Data: Sales, purchases, orders.
● Interaction Data: Website clicks, social media likes, customer service
interactions.
● Sensor Data: IoT devices, smart cities, industrial sensors.
● Human-Generated Data: Emails, documents, surveys, audio recordings.
● Machine-Generated Data: Server logs, network traffic, application logs.

Nature of Data
● Can be static (historical) or streaming (real-time).
● Varies in volume, velocity, variety, veracity, and value (the 5 Vs of Big
Data).
● Can be internal (from within an organization) or external (from third
parties, public sources).
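The static versus streaming distinction above can be illustrated with a minimal sketch. The sensor readings and the `RunningMean` helper are hypothetical, introduced only for illustration: a static dataset can be summarized in one pass over all values, while a stream has to be summarized incrementally as each value arrives.

```python
from statistics import mean


def batch_mean(values):
    """Static (historical) data: the full dataset is available at once."""
    return mean(values)


class RunningMean:
    """Streaming data: update the mean one reading at a time, storing no history."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, x):
        self.count += 1
        # Incremental mean: shift the old mean toward the new reading.
        self.value += (x - self.value) / self.count
        return self.value


readings = [21.0, 22.5, 19.5, 23.0]  # hypothetical sensor readings

stream = RunningMean()
for reading in readings:  # readings arrive one by one
    stream.update(reading)

print(batch_mean(readings), stream.value)
```

Both approaches arrive at the same mean here; the streaming version simply never needs the whole dataset in memory.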
Classification of Data

1. Structured Data
● Definition: Highly organized, follows a fixed schema, easily stored and
accessed in relational databases.
● Examples: SQL databases, Excel spreadsheets, CRM systems.
● Characteristics: Rows and columns, predefined data types, easy to query
using SQL.
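As a minimal sketch of what "fixed schema" and "easy to query using SQL" mean in practice, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory database; the schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
)

# Every row must fit the predefined columns and types.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# A declarative SQL query aggregates the rows by region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```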

2. Semi-Structured Data
● Definition: Has some organizational properties, but does not conform to a
rigid relational database schema. Uses tags or markers to separate semantic
elements.
● Examples: XML, JSON, CSV, log files, NoSQL databases.
● Characteristics: Flexible schema, hierarchical structure, can contain nested
data.
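A short sketch of the flexible, nested structure described above, using Python's built-in `json` module. The customer records and the `flatten` helper are hypothetical: note that the second record carries fields the first one lacks, which a rigid relational schema would not allow without changes.

```python
import json

# Two records with different fields: the schema is flexible, not fixed.
raw = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ravi",
   "contact": {"email": "ravi@example.com", "phone": "555-0100"},
   "tags": ["vip"]}
]
"""


def flatten(record, prefix=""):
    """Turn nested objects into dotted column names, e.g. 'contact.email'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(record) for record in json.loads(raw)]
for row in rows:
    print(row)
```

Flattening is one common way to move semi-structured records toward a tabular form for analysis; it is shown here as one option, not the only one.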

3. Unstructured Data
● Definition: Has no predefined format or organization. Very difficult to
process and analyze using traditional methods.
● Examples: Text documents (Word, PDF), images, audio, video, emails,
social media posts.
● Characteristics: Raw, complex, requires advanced techniques (NLP, image
recognition) for analysis.
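To show why unstructured data needs extra processing, here is a deliberately simple sketch that extracts keywords from free text with the standard library. The review text and the stopword list are hypothetical, and word counting is only a crude first step, not a substitute for real NLP.

```python
import re
from collections import Counter

# Free text has no columns or fields; structure must be extracted first.
review = (
    "Great phone. The battery is great, but the camera is not great "
    "in low light."
)
STOPWORDS = {"the", "is", "but", "in", "not", "a"}  # assumed, minimal list

# Tokenize: lowercase the text and keep runs of letters.
tokens = re.findall(r"[a-z']+", review.lower())

# Count the remaining words to get a rough keyword profile.
keywords = Counter(token for token in tokens if token not in STOPWORDS)
top = keywords.most_common(1)

print(top)
```

The count reports "great" three times even though one mention is negated ("not great"), which is exactly the kind of meaning that simple methods miss and that techniques such as NLP are meant to capture.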
Characteristics of Data

(The 5 Vs of Big Data)


● Volume: The sheer amount of data generated and stored. (e.g., terabytes,
petabytes, zettabytes).
● Velocity: The speed at which data is generated, collected, and processed.
(e.g., real-time streaming data).
● Variety: The different forms and types of data (structured, semi-structured,
unstructured).
● Veracity: The quality, accuracy, and trustworthiness of the data. (e.g.,
dealing with noise, bias, abnormalities).
● Value: The potential to transform data into insights that lead to better
decision-making and business outcomes.
Introduction to Big Data Platform

What is Big Data?


Big Data refers to data sets so large and complex that traditional data
processing application software is inadequate to deal with them.

Big Data Platform Components


● Data Ingestion: Tools for collecting data from various sources (e.g., Apache
Kafka).
● Data Storage: Distributed file systems and NoSQL databases (e.g., HDFS,
Apache Cassandra, MongoDB).
● Data Processing: Frameworks for processing large datasets (e.g., Apache
Hadoop, Apache Spark).
● Data Analysis & Visualization: Tools for querying, analyzing, and
presenting insights (e.g., Apache Hive, Tableau, Power BI).
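The data processing component can be illustrated with the map, shuffle, and reduce pattern associated with Hadoop MapReduce. The sketch below runs on a single machine in plain Python and uses hypothetical log lines; real frameworks distribute the same steps across many nodes, which this example does not attempt.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical log lines, split into partitions as a cluster would store them.
partitions = [
    ["error disk full", "info job done"],
    ["error network down"],
]

mapped = [
    pair for part in partitions for line in part for pair in map_phase(line)
]
counts = reduce_phase(shuffle(mapped))

print(counts)
```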
Need of Data Analytics

● Better Decision Making: Data-driven insights lead to more informed and
strategic decisions.
● Identify Trends & Patterns: Uncover hidden correlations and trends in
large datasets.
● Optimize Processes: Improve efficiency and reduce costs by identifying
bottlenecks.
● Personalized Customer Experiences: Tailor products and services to
individual customer needs.
● Risk Management: Detect fraud, predict failures, and mitigate risks.
● Competitive Advantage: Gain an edge over competitors by understanding
market dynamics.
● Innovation: Drive new product development and service offerings.
Evolution of Analytic Scalability

● 1970s-1980s, Early Days: Batch processing on mainframes, simple
reporting.
● 1990s, Relational Databases: SQL for querying structured data, data
warehousing emerges.
● 2000s, Business Intelligence: OLAP, dashboards, reporting tools for
historical analysis.
● 2010s, Big Data Era: Distributed computing (Hadoop, Spark), NoSQL,
real-time analytics.
● 2020s, AI/ML Integration: Advanced predictive and prescriptive analytics,
automation, MLOps.
Analytic Process and Tools

Analytic Process (General Steps)


1. Define the Problem: Clearly state the business question.
2. Data Collection: Gather relevant data from various sources.
3. Data Cleaning & Preparation: Handle missing values, outliers, transform
data.
4. Data Exploration & Visualization: Understand data patterns, initial
insights.
5. Model Building: Apply statistical or machine learning models.
6. Model Evaluation: Assess model performance and accuracy.
7. Deployment & Monitoring: Put the model into production, track
performance.
8. Communication: Present findings to stakeholders.
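The steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the business question, the (ad spend, sales) records, the choice of a straight-line model, and the single held-out record used for evaluation.

```python
from statistics import mean

# 1. Define the problem (hypothetical): how do sales respond to ad spend?
# 2. Data collection: (ad_spend, sales) records; None marks a missing value.
raw = [(1.0, 3.1), (2.0, 4.9), (3.0, None), (4.0, 9.1), (5.0, 10.9), (6.0, 13.0)]

# 3. Data cleaning: drop records with a missing target.
clean = [(x, y) for x, y in raw if y is not None]

# 4. Data exploration: a first look at the typical values.
print("mean spend:", mean(x for x, _ in clean))

# Hold out the last record so the model is evaluated on unseen data.
train, test = clean[:-1], clean[-1:]


# 5. Model building: ordinary least squares fit of a straight line.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(train)

# 6. Model evaluation: mean absolute error on the held-out record.
mae = mean(abs((slope * x + intercept) - y) for x, y in test)

# 8. Communication: report the finding in plain terms.
print(f"each extra unit of spend adds about {slope:.2f} units of sales")
print(f"held-out error: {mae:.2f}")
```

Step 7 (deployment and monitoring) is omitted because it depends on the production environment rather than on the analysis itself.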

Common Analytic Tools


1. Programming Languages: Python (Pandas, NumPy, Scikit-learn), R.
2. Databases: SQL (MySQL, PostgreSQL), NoSQL (MongoDB, Cassandra).
3. Big Data Frameworks: Apache Hadoop, Apache Spark.
4. BI & Visualization Tools: Tableau, Power BI, Qlik Sense.
5. Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP).
Analysis vs Reporting

Reporting
● What: Summarizes past data to show "what happened."
● Focus: Historical data, descriptive statistics.
● Purpose: Provide insights into past performance, monitor KPIs.
● Tools: Dashboards, standard reports, spreadsheets.
● Example: Monthly sales report, website traffic summary.

Analysis
● What: Explores data to understand "why it happened" and "what will
happen."
● Focus: Diagnostic (why), Predictive (what will), Prescriptive (what to do).
● Purpose: Discover patterns, predict future outcomes, recommend actions.
● Tools: Statistical software, machine learning algorithms, advanced
visualization.
● Example: Predicting customer churn, identifying root causes of product
defects.
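The difference between reporting and analysis can be shown on the same small dataset. The sales figures below are hypothetical: the reporting step only summarizes what happened per month, while the analysis step breaks the change down by region to look for a cause.

```python
from collections import defaultdict

# Hypothetical sales records: (month, region, amount).
sales = [
    ("2024-01", "North", 100), ("2024-01", "South", 100),
    ("2024-02", "North", 105), ("2024-02", "South", 70),
]

# Reporting: "what happened" -> total sales per month.
monthly_total = defaultdict(int)
for month, _, amount in sales:
    monthly_total[month] += amount

# Analysis: "why it happened" -> month-over-month change per region.
by_region = defaultdict(dict)
for month, region, amount in sales:
    by_region[region][month] = amount

change = {
    region: months["2024-02"] - months["2024-01"]
    for region, months in by_region.items()
}
driver = min(change, key=change.get)  # region with the largest drop

print(dict(monthly_total))
print(change, "-> main driver of the decline:", driver)
```

The report shows that total sales fell from 200 to 175; only the analysis reveals that the South region accounts for the drop while the North grew slightly.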
Modern Data Analytic Tools

1. Programming Languages: Python (widely used for ML, data science), R
(statistical analysis).
2. Big Data Processing: Apache Spark (fast, general-purpose cluster
computing), Apache Flink (stream processing).
3. Cloud Data Warehouses: Snowflake, Google BigQuery, Amazon
Redshift (scalable, cloud-native).
4. Business Intelligence & Visualization: Tableau, Microsoft Power BI,
Looker (interactive dashboards).
5. Machine Learning Platforms: TensorFlow, PyTorch, Scikit-learn (for
building ML models).
6. Data Orchestration: Apache Airflow (workflow management).
7. Data Governance: Tools for data quality, security, and compliance.
Applications of Data Analytics

1. Healthcare: Disease prediction, personalized medicine, drug discovery.
2. Finance: Fraud detection, risk assessment, algorithmic trading.
3. Retail & E-commerce: Customer segmentation, recommendation systems,
inventory optimization.
4. Marketing: Targeted advertising, campaign optimization, customer lifetime
value prediction.
5. Manufacturing: Predictive maintenance, quality control, supply chain
optimization.
6. Smart Cities: Traffic management, energy efficiency, public safety.
7. Sports: Player performance analysis, fan engagement.
Data Analytics Lifecycle
Data Analytics Lifecycle: Need

Why a Lifecycle?
● Data analytics projects are complex and iterative.
● Provides a structured approach to ensure success.
● Helps in managing expectations and resources.
● Ensures reproducibility and maintainability of solutions.
● Facilitates collaboration among different team members.
● Minimizes risks and maximizes the value derived from data.
Key Roles for Successful Analytic Projects

● Business User/Domain Expert: Defines the problem, provides domain
knowledge.
● Project Manager: Oversees the project, manages timelines and resources.
● Data Scientist: Designs and builds analytical models, interprets results.
● Data Engineer: Builds and maintains data pipelines, manages data
infrastructure.
● Data Analyst: Explores data, creates reports, performs descriptive analysis.
● Database Administrator (DBA): Manages databases, ensures data
availability.
● IT Operations: Manages infrastructure, deploys solutions.
Various Phases of Data Analytics Lifecycle
Various Phases of Data Analytics Lifecycle—Phase 1: Discovery

Objective
● Understand the business context, problem, and objectives.
● Formulate the analytical problem as a testable hypothesis.
● Identify data sources and initial data requirements.
Activities
● Brainstorming with stakeholders.
● Reviewing existing literature and data.
● Developing a project charter.
● Defining success criteria.
Various Phases of Data Analytics Lifecycle—Phase 2: Data
Preparation
Objective
● Collect, clean, and transform data into a suitable format for analysis.
● Address data quality issues.
Activities
● Data Collection: Extracting data from various sources.
● Data Cleaning: Handling missing values, outliers, inconsistencies.
● Data Transformation: Normalization, aggregation, feature engineering.
● Data Integration: Combining data from different sources.
● Data Exploration: Initial statistical analysis and visualization to understand
data.
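The cleaning and transformation activities above can be sketched on a single column. The age values, the choice of median imputation, and the `MAX_AGE` cap are all assumptions made for illustration; the right treatment depends on the data and the domain.

```python
from statistics import median

# Hypothetical column: None is a missing value, 120 is a suspicious outlier.
ages = [25, 31, None, 40, 29, 120]

# Data cleaning: fill missing values with the median of the observed ones.
observed = [age for age in ages if age is not None]
fill = median(observed)
imputed = [fill if age is None else age for age in ages]

# Data cleaning: cap outliers at an assumed domain limit.
MAX_AGE = 100  # assumption, not taken from the source
capped = [min(age, MAX_AGE) for age in imputed]

# Data transformation: min-max normalization to the range [0, 1].
low, high = min(capped), max(capped)
normalized = [(age - low) / (high - low) for age in capped]

print(imputed)
print(capped)
print([round(value, 2) for value in normalized])
```

The median is used for imputation because, unlike the mean, it is not pulled upward by the outlier that is still present at that point.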
Various Phases of Data Analytics Lifecycle—Phase 3: Model
Planning
Objective
● Determine the analytical techniques and models to be used.
● Design the experimental setup.
Activities
● Reviewing available data and tools.
● Selecting appropriate algorithms (e.g., regression, classification, clustering).
● Developing a model strategy.
● Defining model evaluation metrics.
● Creating a plan for model training and testing.
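Defining evaluation metrics before any model is built can be as simple as agreeing on the formulas. The sketch below implements accuracy, precision, and recall for a binary classifier; the labels and predictions are hypothetical placeholders for a model that does not exist yet.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def evaluate(actual, predicted):
    """Return the metrics agreed on during model planning."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical ground truth and model output (1 = churned, 0 = stayed).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = evaluate(actual, predicted)
print(metrics)
```

Which metric matters most is a planning decision: recall is usually favored when missing a positive case is costly, precision when false alarms are costly.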
Various Phases of Data Analytics Lifecycle—Phase 4: Model
Building
Objective
● Develop and execute the analytical model.
● Train and refine the model.
Activities
● Feature Selection: Choosing relevant variables for the model.
● Model Training: Running algorithms on prepared data.
● Model Tuning: Optimizing model parameters for better performance.
● Model Validation: Testing the model's performance on unseen data.
● Iterative refinement based on results.
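Training, tuning, and validation can be sketched with a tiny k-nearest-neighbours classifier written in plain Python. The one-dimensional data, the labels, and the candidate values of k are hypothetical; k-NN is used only because it is short to write, not because the source prescribes it.

```python
from collections import Counter


def knn_predict(train, point, k):
    """Model: majority label among the k training rows closest to the point."""
    nearest = sorted(train, key=lambda row: abs(row[0] - point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k):
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Hypothetical (feature, label) rows; (3.4, "high") is a noisy training label.
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (3.4, "high"),
         (6.0, "high"), (7.0, "high"), (8.0, "high")]
validation = [(3.3, "low"), (1.5, "low"), (6.5, "high")]
test = [(2.5, "low"), (7.5, "high")]

# Model tuning: choose k by its accuracy on the validation set.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

# Model validation: check the tuned model once on data it has never seen.
test_accuracy = accuracy(train, test, best_k)

print(scores, best_k, test_accuracy)
```

With k = 1 the model copies the noisy label and misclassifies one validation row; a larger k smooths that out, which is the kind of improvement tuning is meant to find.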
Various Phases of Data Analytics Lifecycle—Phase 5:
Communicating Results
Objective
● Present the findings and insights to stakeholders effectively.
● Explain the model's implications and recommendations.
Activities
● Visualization: Creating compelling charts, graphs, and dashboards.
● Storytelling: Presenting insights in a clear, concise, and impactful narrative.
● Documentation: Preparing reports, presentations, and technical
documentation.
● Addressing questions and feedback from stakeholders.
Various Phases of Data Analytics Lifecycle—Phase 6:
Operationalization
Objective
● Deploy the analytical model or solution into a production environment.
● Ensure the solution is integrated into business processes.
Activities
● Deployment: Integrating the model into existing systems or creating new
applications.
● Monitoring: Tracking model performance, data drift, and system health.
● Maintenance: Updating models, retraining with new data, troubleshooting.
● Change Management: Ensuring adoption by end-users.
● Automating the analytical process where possible.
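Monitoring for data drift can start with a very simple check. The sketch below compares the mean of a live feature against its training baseline, measured in baseline standard deviations; the feature values, the `DRIFT_THRESHOLD`, and the `needs_retraining` helper are hypothetical, and production systems typically use more robust statistical tests.

```python
from statistics import mean, stdev

DRIFT_THRESHOLD = 2.0  # assumption: alert when the mean shifts by over 2 std devs


def drift_score(baseline, live):
    """Shift of the live mean from the baseline mean, in baseline std devs."""
    return abs(mean(live) - mean(baseline)) / stdev(baseline)


def needs_retraining(baseline, live, threshold=DRIFT_THRESHOLD):
    """Flag the model for maintenance when the input data has drifted."""
    return drift_score(baseline, live) > threshold


# Hypothetical values of one input feature.
baseline = [50, 52, 48, 51, 49, 50]   # seen during training
live_ok = [51, 49, 50, 52]            # production data, similar distribution
live_shifted = [60, 62, 59, 61]       # production data after a shift

print(needs_retraining(baseline, live_ok))
print(needs_retraining(baseline, live_shifted))
```

A check like this can be scheduled to run automatically, which ties the monitoring, maintenance, and automation activities together.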
