Unit I
Introduction: Data Science and Big Data
✔ Introduction to Data Science and Big Data
✔ Defining Data Science and Big Data
✔ Big Data examples
✔ Big Data infrastructure and challenges
✔ Shared everything and shared nothing architecture
✔ Big Data Processing Architectures
✔ Data Warehouse
✔ Re-Engineering the Data Warehouse
✔ Data explosion
✔ Big Data learning approaches
Introduction to Data Science and Big Data
▪ Data Science:
  • The process of examining where information can be obtained and how it can be converted into resources that are useful for business and IT
  • Mining huge quantities of structured and unstructured data to recognize patterns
▪ Big Data:
  • A huge amount of data, or a collection of large datasets, that cannot be processed using traditional methods
  • Difficult to store, maintain, and access
• Sources of Big Data
1. Stock Exchange
2. Social Media Data
3. Video sharing Portals
4. Search Engine Data
5. Transport data
6. Banking data
• Categories of Data
1. Structured data
2. Semi-structured data
3. Unstructured data
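As a rough illustration of the three categories above, the sketch below holds the same made-up customer record as structured, semi-structured, and unstructured data. All field names and values are hypothetical and chosen only for the example.

```python
import csv
import io
import json
import re

# Structured: fixed schema, every record has the same columns (like a table row).
structured_csv = "customer_id,amount,currency\n101,250.00,USD\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing keys, but fields can be nested or optional (JSON).
semi_structured = json.loads(
    '{"customer_id": 101, "tags": ["refund", "urgent"], "device": {"os": "android"}}'
)

# Unstructured: free text with no schema; information has to be extracted.
unstructured = "Customer 101 called about a refund of 250 USD and sounded upset."
amounts = re.findall(r"(\d+) USD", unstructured)

print(rows[0]["amount"])                # lookup by column name
print(semi_structured["device"]["os"])  # lookup by nested key
print(amounts)                          # values recovered by pattern matching
```

The further the data moves from a fixed schema, the more work is needed before it can be queried.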
• Examples of Big Data Application
1. Fraud Detection
2. IT log Analytics
3. Call Centre Analytics
4. Social Media Analysis
• Big Data Infrastructure & Challenges
1. Storage
2. Transformation
3. Speed or Throughput
4. Processing:
   i) CPU
   ii) Memory
   iii) Software
Shared Everything & Shared Nothing Architecture
[Figure: Shared Everything architecture vs. Shared Nothing architecture]
Data Warehouse
Data warehousing (DW) is the process of collecting and managing data from varied sources to
provide meaningful business insights.
• Data warehousing started in the late 1980s, when IBM workers Paul Murphy and
Barry Devlin developed the Business Data Warehouse
• Characteristics of Data Warehouse
1. Subject Oriented
2. Integrated
3. Time Variant
4. Non-Volatile
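The four characteristics above can be sketched in a toy in-memory store. This is only an illustration under stated assumptions: the class `SalesWarehouse`, the source names `pos` and `web`, and all field names are hypothetical, not part of any real warehouse product.

```python
from datetime import date

class SalesWarehouse:
    """Subject oriented: holds facts about one subject only (sales)."""

    def __init__(self):
        self._facts = []  # Non-volatile: facts are only appended, never updated or deleted

    def load(self, source, record, as_of):
        # Integrated: each source's naming and units are mapped to one common format.
        if source == "pos":      # hypothetical point-of-sale feed, amounts in cents
            amount = record["amt_cents"] / 100
        elif source == "web":    # hypothetical web-shop feed, amounts in dollars
            amount = record["total"]
        else:
            raise ValueError(f"unknown source: {source}")
        # Time variant: every fact carries the date it refers to, so history is kept.
        self._facts.append({"amount_usd": float(amount), "as_of": as_of, "source": source})

    def total_on(self, day):
        """Total sales recorded for a given day, across all sources."""
        return sum(f["amount_usd"] for f in self._facts if f["as_of"] == day)

warehouse = SalesWarehouse()
warehouse.load("pos", {"amt_cents": 1250}, date(2024, 1, 1))
warehouse.load("web", {"total": 7.5}, date(2024, 1, 1))
warehouse.load("web", {"total": 3.0}, date(2024, 1, 2))
print(warehouse.total_on(date(2024, 1, 1)))  # 20.0
```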
• Benefits of Data Warehousing
1. Saves time
2. Delivers enhanced business intelligence
3. Enhances data quality and consistency
4. Generates a high Return on Investment
5. Provides competitive advantage
6. Improves the decision-making process
7. Enhances customer service
• Limitations of Data Warehousing
1. Data Ownership
2. Inability to capture required data
3. Increased demands of the users
4. Long-duration project
5. Maintenance costs
6. Complexity of Integration
Big Data Learning Approaches
1. Machine Learning:
Machine learning is an application of artificial intelligence (AI) that gives systems the
ability to learn and improve automatically from experience without being explicitly programmed.
• The process of learning begins with observations or data, such as examples,
direct experience, or instruction. E.g., the Facebook News Feed.
• Supervised machine learning algorithms apply what has been learned in
the past to new data, using labelled examples to predict future events.
• Unsupervised machine learning algorithms are used when the information
used to train is neither classified nor labelled.
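The difference between the two approaches can be sketched with two tiny algorithms. The choice of 1-nearest-neighbour and k-means here is an assumption made for illustration; the data values and function names are hypothetical.

```python
# Supervised: labelled examples (feature, label) are used to predict labels for new data.
labelled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]

def predict_nearest(x, examples):
    """1-nearest-neighbour: return the label of the closest labelled example."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

# Unsupervised: there are no labels; the algorithm groups the data on its own.
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2, starting from the smallest and largest value."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

unlabelled = [1.0, 1.5, 8.0, 9.0, 1.2, 8.5]
clusters = two_means(unlabelled)

print(predict_nearest(2.0, labelled))  # small
print(clusters)                        # two groups found without any labels
```

In the supervised case the answer vocabulary ("small", "large") comes from the labels; in the unsupervised case the algorithm only reports which values belong together.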
Machine Learning System Model
1. Learning element (gets input from the user)
2. Knowledge base (database; accumulates more knowledge)
3. Performance element (performs the task, solves the problem)
4. Feedback element (gets two inputs: from the learning element and the standard system)
5. Standard system (a trained person or computer that produces the
correct output)
Big Data Processing Architecture
• There are 3 types of Big Data Processing Architecture available:
1. Lambda Architecture
2. Kappa Architecture
3. Zeta Architecture
Lambda Architecture
• Designed to manage huge amounts of data by using both batch and stream processing methods
• Tries to maintain throughput and fault tolerance
• 3 Layers in Lambda Architecture: batch layer, speed (stream) layer, and serving layer
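A minimal sketch of how the layers of a Lambda architecture (commonly named the batch, speed, and serving layers) can fit together, using page-view counting as a made-up workload. The class and method names are hypothetical; a real system would use distributed storage and processing engines instead of in-memory lists.

```python
from collections import Counter

class LambdaPipeline:
    def __init__(self):
        self.master = []                 # append-only master dataset
        self.batch_view = Counter()      # output of the batch layer
        self.realtime_view = Counter()   # output of the speed layer (events since last batch run)

    def ingest(self, page):
        """New events go to both the master dataset and the speed layer."""
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all data, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, page):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]

pipeline = LambdaPipeline()
for page in ["home", "home", "about"]:
    pipeline.ingest(page)
pipeline.run_batch()       # slow, complete recomputation
pipeline.ingest("home")    # arrives after the batch run, seen only by the speed layer
print(pipeline.query("home"))  # 3
```

Recomputing the batch view from the full master dataset is what gives the design its tolerance to errors in the faster, incremental path.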
Kappa Architecture
• Similar to the Lambda architecture, but the batch layer is removed
• All data passes through the stream layer
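A minimal sketch of the Kappa idea, assuming a replayable event log and a made-up page-view workload: there is one stream-processing path, and reprocessing means replaying the log through that same path. Class and method names are hypothetical.

```python
from collections import Counter

class KappaPipeline:
    def __init__(self):
        self.log = []          # append-only event log (stand-in for a replayable stream)
        self.view = Counter()  # the single view, maintained by stream processing

    def process(self, page):
        """The only processing logic: handle one event at a time."""
        self.view[page] += 1

    def ingest(self, page):
        self.log.append(page)
        self.process(page)

    def replay(self):
        """Rebuild the view by replaying the log through the same stream code."""
        self.view = Counter()
        for page in self.log:
            self.process(page)

kappa = KappaPipeline()
for page in ["home", "about", "home"]:
    kappa.ingest(page)
kappa.replay()
print(kappa.view["home"])  # 2
```

Because there is no separate batch job, only one piece of processing code has to be written and maintained.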
Zeta Architecture
• A high-level enterprise architecture that simplifies business processes and increases the speed of integrating data
• There are seven pluggable components, which reduce system-level complexity while increasing resource utilization and efficiency.
• Distributed File System - all applications read and write to a common, scalable solution,
which dramatically simplifies the system architecture.
• Real-time Data Storage - supports the need for high-speed business applications through
the use of real-time databases.
• Pluggable Compute Model / Execution Engine - delivers different processing engines
and models in order to meet the needs of diverse business applications and users in an
organization.
• Deployment / Container Management System - provides a standardized approach for
deploying software. All resource consumers are isolated and deployed in a standard way.
• Solution Architecture - focuses on solving specific business problems.
• Enterprise Applications - brings simplicity and reusability by delivering the components
necessary to realize all of the business goals defined for an application.
• Dynamic and Global Resource Management - allows dynamic allocation of resources
so that you can accommodate whatever task is the most important for that day.
Benefits of Zeta Architecture
• Reduced time and cost of deploying and maintaining applications
• Fewer moving parts, with simplifications such as using a distributed file system
• Less data movement and duplication - transforming and moving data around
is no longer required unless a specific use case calls for it
• Simplified testing, troubleshooting, and systems management
• Better resource utilization to lower data center costs