
Module1 Reviewer

The document outlines various data set types, including record data, graphs, and ordered sets, along with important characteristics of structured data such as dimensionality and sparsity. It also covers data preprocessing techniques, data warehouses, OLAP operations, and data quality requirements, emphasizing the significance of data cleaning and integration. Additionally, it discusses methods for data transformation, reduction, and discretization, and the challenges associated with data quality and missing values.

Data Set Types:
- Record data
- Graphs and networks
- Ordered sets
- Spatial, image, and multimedia data

Important Characteristics of Structured Data:
- Dimensionality
- Sparsity
- Resolution
- Distribution

Types of Attributes:
- Nominal (e.g., red, blue)
- Binary (true or false)
- Ordinal (e.g., junior, senior)
- Numeric
  o Interval (no true zero point)
  o Ratio (has a true zero point)
- Discrete vs. continuous

Central Tendency:
- Mean (average)
- Median (middle point)
- Mode (most frequent value)

Skew:
- Positive skew: Mode < Median < Mean
- Negative skew: Mean < Median < Mode

Data Preprocessing:
- Data cleaning
- Data integration
- Data reduction
- Data transformation
- Data discretization

Data Transformation Processing:
- Normalization (e.g., rescaling to the 0–1 range)
  o Min-max
  o Z-score
  o Decimal scaling

Discretization (dividing continuous values into intervals):
- Binning
  o Equal-width
  o Equal-depth
- Histogram analysis
- Clustering analysis (can also remove outliers)
- Decision-tree analysis
- Correlation analysis

Data Reduction Methods:
- Regression
- Histogram, clustering, sampling
- Data cube aggregation
- Data compression

Dimensionality Reduction:
- Feature extraction
- Feature selection

Data Warehouse: historical data kept for analysis.

OLTP vs. OLAP:
- OLTP (Online Transaction Processing): day-to-day DBMS operations, query and transaction processing
- OLAP (Online Analytical Processing): data warehouse operations such as drilling, slicing, and dicing

Data Warehouse Models:
- Enterprise warehouse: collects all information
- Data mart: selected information only
- Virtual warehouse: views over operational databases

Extraction, Transformation, Loading (ETL):
- Data extraction
- Data cleaning
- Data transformation
- Load
- Refresh

Data Lake: a centralized repository storing all structured and unstructured data; data is stored as is.

Layers of Storage:
- Sandbox data layer
- Application data layer
- Cleansed data layer
- Standardized data layer
- Raw data layer

Types of Schemas:
- Star schema
- Snowflake schema
- Fact constellation

OLAP Operations:
- Roll up (drill-up): summarize data by climbing up a concept hierarchy
- Drill down: move from a higher level of summarization to a lower one
- Slice: fix one dimension to a single value, removing that dimension from the cube
- Dice: pick specific values or ranges on two or more dimensions
- Pivot: rotate the cube, changing the order of its dimensions

OLAP Architectures:
- Relational OLAP (ROLAP): uses a relational DBMS for data management
- Multidimensional OLAP (MOLAP): sparse array-based storage
- Hybrid OLAP (HOLAP): flexible combination of both
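The normalization methods listed under Data Transformation Processing (min-max, z-score, decimal scaling) can be sketched in plain Python. A minimal sketch; the function names are illustrative, not from any particular library:

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (sample) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest power that brings all |v| below 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10**j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

After min-max normalization the smallest value maps to the new minimum and the largest to the new maximum; after z-score normalization the values have mean 0.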
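The contrast between equal-width and equal-depth binning from the Discretization section can be shown with a short sketch (pure Python; the function names and the sample price list are illustrative):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # cap at k-1 so max(values) falls into the last interval
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Sort, then cut into k bins holding (roughly) the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    depth = len(values) / k
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / depth), k - 1)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))  # width 10: [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_depth_bins(prices, 3))  # three per bin: [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal-width bins have equal range but uneven counts; equal-depth bins have equal counts but uneven ranges.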
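The skew rule above (Mode < Median < Mean for positive skew) can be checked on a small sample; the income figures here are made up for illustration:

```python
from statistics import mean, median, mode

# right (positive) skew: a few large values pull the mean upward
incomes = [20, 22, 22, 25, 30, 40, 90]
print(mode(incomes), median(incomes), mean(incomes))
# mode 22 < median 25 < mean ~35.6, matching the positive-skew ordering
```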
- Specialized SQL servers: a further server type, supporting SQL queries over star/snowflake schemas

Metadata: data about data.

XML: a data interchange format.

Full Cube vs. Iceberg Cube: an iceberg cube materializes only the cells that satisfy a given condition (e.g., a minimum count threshold).

Data Cleaning:

Data Quality Requirements:
- Accuracy
- Completeness
- Uniqueness
- Timeliness
- Consistency

Problems:
- Unmeasurable: accuracy and completeness are extremely difficult to measure
- Context independent: no accounting for what is important
- Incomplete: interpretability and accessibility are left out
- Vague: conventional definitions provide no guidance

Data Quality Continuum:
- Data gathering
- Data delivery
- Data storage
- Data integration
- Data retrieval
- Data mining/analysis

Missing Value Imputation:
- Impute the mean, median, or another point estimate
- Use attribute relationships: regression, propensity score
- Regression method
- Arbitrary missing patterns: Markov chain methods

How to Handle Noisy Data:
- Binning
- Regression
- Clustering
- Combined human and computer inspection

Data Integration:
- Correlation analysis: measures the linear relationship between attributes
- Covariance: if positive, A and B tend to be above their averages together; if negative, when A is above its average, B tends to be below its average; if 0, A and B are uncorrelated (which does not by itself guarantee independence)

Data Reduction:
- Dimensionality reduction
  o Wavelet transform
  o Principal components analysis
  o Feature subset selection
- Numerosity reduction
  o Regression
  o Histogram
  o Data cube
- Data compression

Knowledge Discovery Process:
- Data preparation
  o Data cleaning
  o Integration
  o Transformation
  o Selection
- Mining
- Pattern/model evaluation
- Knowledge presentation

Data Transformation Methods:
- Smoothing
- Attribute/feature construction
- Aggregation
- Normalization
- Discretization

Summary:
- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing/noisy values, outliers
- Data integration from multiple sources:
  o Entity identification problem
  o Remove redundancies
  o Detect inconsistencies
- Data reduction:
  o Dimensionality reduction
  o Numerosity reduction
  o Data compression
- Data transformation and data discretization:
  o Normalization
  o Concept hierarchy generation
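The covariance sign rule under Data Integration can be checked numerically. A minimal sketch in plain Python; the helper names and the stock series are illustrative:

```python
from statistics import mean

def covariance(a, b):
    """Average product of deviations from the means (population form)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Covariance rescaled into [-1, 1] by the two standard deviations."""
    va = covariance(a, a)  # variance of a
    vb = covariance(b, b)  # variance of b
    return covariance(a, b) / (va * vb) ** 0.5

stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]  # tends to rise along with stock_a
print(covariance(stock_a, stock_b))   # ≈ 4.0 (positive: A and B move together)
print(correlation(stock_a, stock_b))  # close to +1: strong linear relationship
```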
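Mean/median imputation from the Missing Value Imputation list can be sketched as follows (pure Python; using `None` to mark a missing value is an assumption of this sketch):

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 29]
print(impute(ages, "mean"))    # missing entries become 27.5
print(impute(ages, "median"))  # missing entries become 28.0
```

Point-estimate imputation is simple but shrinks the variance of the attribute, which is why the notes also list regression and propensity-score approaches.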
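The iceberg-cube idea (materialize only the cells that satisfy a condition) can be illustrated with a group-by plus a threshold. A toy sketch with made-up sales tuples:

```python
from collections import Counter

# toy fact tuples: (city, product)
sales = [("Tokyo", "TV"), ("Tokyo", "TV"), ("Tokyo", "PC"),
         ("Osaka", "TV"), ("Osaka", "TV"), ("Osaka", "TV")]

min_support = 2  # iceberg condition: keep only cells with count >= 2

counts = Counter(sales)
iceberg = {cell: n for cell, n in counts.items() if n >= min_support}
print(iceberg)  # only the ("Tokyo", "TV") and ("Osaka", "TV") cells survive
```

A full cube would keep every aggregate cell, including ("Tokyo", "PC") with count 1; the iceberg cube prunes cells below the threshold.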
