Data Set Types:
- Record Data
- Graphs and Networks
- Ordered Sets
- Spatial, Image, & Multimedia

Important Characteristics of Structured Data:
- Dimensionality
- Sparsity
- Resolution
- Distribution

Types of Attributes:
- Nominal (e.g., red, blue)
- Binary (true or false)
- Ordinal (e.g., junior, senior)
- Numeric
  o Interval (no true zero)
  o Ratio (has a true zero point)
- Discrete vs. Continuous

Central Tendency:
- Mean (average)
- Median (middle point)
- Mode (most frequent value)

Skew:
- Mode < Median < Mean (positive skew)
- Mean < Median < Mode (negative skew)
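To see the skew rule concretely, here is a minimal Python sketch (the sample values are made up):

    import statistics as st

    x = [1, 2, 2, 3, 4, 9, 15]                   # a long right tail
    print(st.mode(x), st.median(x), st.mean(x))  # 2 < 3 < 5.14... -> mode < median < mean: positive skew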
Data Preprocessing:
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation
- Data Discretization
Data Transformation Processing:
- Normalization (e.g., rescaling to 0-1)
  o Min-max
  o Z-score
  o By decimal scaling
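The three normalization schemes differ only in the scaling formula; a small NumPy sketch (the values are illustrative):

    import numpy as np

    x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    minmax = (x - x.min()) / (x.max() - x.min())     # min-max: rescale to [0, 1]
    zscore = (x - x.mean()) / x.std()                # z-score: mean 0, standard deviation 1
    j = np.floor(np.log10(np.abs(x).max())) + 1      # smallest j such that max|x| / 10**j < 1
    decimal = x / 10 ** j                            # decimal scaling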
Discretization (dividing continuous values into intervals):
- Binning
  o Equal-width
  o Equal-depth
- Histogram
- Clustering analysis (can also remove outliers)
- Decision-tree
- Correlation
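Equal-width binning splits the value range into intervals of the same size, while equal-depth (equal-frequency) binning puts roughly the same number of records in each bin. A minimal pandas sketch (the sample prices are made up):

    import pandas as pd

    prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

    equal_width = pd.cut(prices, bins=3)    # 3 intervals of equal length
    equal_depth = pd.qcut(prices, q=3)      # 3 bins with roughly equal counts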
Data Reduction methods:
- Regression
- Histogram, clustering, sampling
- Data cube aggregation
- Data compression

Dimensionality Reduction:
- Feature extraction
- Feature selection
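Sampling (numerosity reduction) and principal components analysis (feature extraction) are two common starting points; a rough NumPy sketch with made-up data sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                            # 1000 records, 10 numeric attributes

    sample = X[rng.choice(len(X), size=100, replace=False)]    # simple random sample of the rows

    Xc = X - X.mean(axis=0)                                    # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)          # principal components via SVD
    X_reduced = Xc @ Vt[:2].T                                  # project: 10 attributes -> 2 derived features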
Data Warehouse: historical data kept for analysis.

OLTP vs OLAP:
- OLTP – Online Transaction Processing: DBMS operations, query and transaction processing
- OLAP – Online Analytical Processing: data warehouse operations such as drilling, slicing, dicing, etc.

Data warehouse models:
- Enterprise warehouse – collects all information
- Data mart – selected information
- Virtual warehouse – views over operational databases

Extraction, Transformation, Loading (ETL):
- Data extraction
- Data cleaning
- Data transformation
- Load
- Refresh

Data Lake: centralized repository storing all structured and unstructured data; stores data as is.

Layers of Storage:
- Sandbox data layer
- Application data layer
- Cleansed data layer
- Standardized data layer
- Raw data layer

Types of Schemas:
- Star schema
- Snowflake schema
- Fact constellation

OLAP Operations:
- Roll up (drill-up): summarize data by climbing up the hierarchy
- Drill down: go from a higher level to a lower level
- Dice: pick specific values or ranges on several dimensions
- Pivot: rotate the cube, changing the order of the dimensions
- Slice: remove a specific dimension from the cube by fixing it to one value
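The cube operations map loosely onto grouped and filtered views of a fact table. A rough pandas analogy (table and column names are made up):

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2023, 2023, 2024, 2024],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "city":    ["Cebu", "Cebu", "Manila", "Manila"],
        "amount":  [100, 150, 200, 250],
    })

    roll_up = sales.groupby("year")["amount"].sum()                        # roll up: quarter -> year
    sliced  = sales[sales["city"] == "Cebu"]                               # slice: fix one dimension value
    diced   = sales[(sales["city"] == "Cebu") & (sales["year"] == 2023)]   # dice: values/ranges on several dimensions
    pivoted = sales.pivot_table(index="city", columns="quarter",
                                values="amount", aggfunc="sum")            # pivot: rotate the view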
OLAP Architectures:
- Relational OLAP (uses a DBMS for data management)
- Multidimensional OLAP (sparse array-based)
- Hybrid OLAP (flexible)
- Specialized SQL servers (for SQL queries over star/snowflake schemas)

XML – data interchange format
Data Cleaning:

Data Quality Requirements:
- Accuracy
- Completeness
- Uniqueness
- Timeliness
- Consistency
Problems:
- Unmeasurable
  o Accuracy and completeness are extremely difficult to measure
- Context independent
  o No accounting for what is important
- Incomplete
  o Interpretability, accessibility
- Vague
  o Conventional definitions provide no guidance
Data Quality Continuum:
- Data gathering
- Data delivery
- Data storage
- Data integration
- Data retrieval
- Data mining/analysis
Missing Value Imputation:
- Impute the mean, median, or other point estimates
- Use attribute relationships: regression, propensity score
- Regression method
- Arbitrary missing pattern: Markov chain
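Point-estimate imputation ignores the other attributes, while regression-based imputation exploits them; a minimal pandas/NumPy sketch (the columns are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age":    [23, 35, np.nan, 41, np.nan],
                       "income": [30, 50, 45, 70, 60]})

    df["age_mean"] = df["age"].fillna(df["age"].mean())                  # fill with a point estimate

    known = df.dropna(subset=["age"])                                    # fit age ~ income on complete rows
    slope, intercept = np.polyfit(known["income"], known["age"], deg=1)
    df["age_reg"] = df["age"].fillna(intercept + slope * df["income"])   # predict the missing ages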
Metadata – data about data

Full Cube vs. Iceberg Cube:
- Iceberg: only materializes the cells that satisfy a condition (e.g., a minimum support threshold)
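The iceberg condition acts like a HAVING clause over the aggregated cells. A rough pandas sketch of one cuboid, with a made-up minimum-support threshold:

    import pandas as pd

    facts = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
                          "city":  ["Cebu", "Cebu", "Cebu", "Manila", "Manila"],
                          "sales": [10, 20, 5, 7, 9]})

    cuboid  = facts.groupby(["month", "city"]).agg(count=("sales", "size"),
                                                   total=("sales", "sum"))
    iceberg = cuboid[cuboid["count"] >= 2]          # keep only cells that satisfy the condition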
How to handle noisy data:
- Binning
- Regression
- Clustering
- Combined human and computer inspection

Data Integration:
- Correlation analysis (measures the linear relationship between two attributes)
- Covariance: if positive, A and B tend to be above their averages together; if negative, when A is above its average, B tends to be below its average; if 0, there is no linear relationship (the attributes need not be independent)
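Covariance and correlation can be computed directly from the definitions above; a small NumPy sketch (the vectors are illustrative):

    import numpy as np

    a = np.array([2.0, 3.0, 5.0, 6.0])
    b = np.array([5.0, 8.0, 10.0, 13.0])

    cov  = ((a - a.mean()) * (b - b.mean())).mean()   # > 0 here: A and B rise above their averages together
    corr = cov / (a.std() * b.std())                  # correlation rescales covariance into [-1, 1]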
Data Reduction:
- Dimensionality Reduction
  o Wavelet transform
  o Principal Components Analysis
  o Feature subset selection
- Numerosity Reduction
  o Regression
  o Histogram
  o Data cube
- Data Compression

Knowledge Discovery Process:
- Data preparation
  o Data cleaning
  o Integration
  o Transformation
  o Selection
- Mining
- Pattern/model evaluation
- Knowledge presentation

Data Transformation Methods:
- Smoothing
- Attribute/feature construction
- Aggregation
- Normalization
- Discretization

Data Preprocessing – major tasks:
- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing/noisy values, outliers
- Data integration from multiple sources:
  o Entity identification problem
  o Remove redundancies
  o Detect inconsistencies
- Data reduction
  o Dimensionality reduction
  o Numerosity reduction
  o Data compression
- Data transformation and data discretization
  o Normalization
  o Concept hierarchy generation