0% found this document useful (0 votes)
41 views7 pages

Lecture Notes DWM

Uploaded by

gf1166679
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views7 pages

Lecture Notes DWM

Uploaded by

gf1166679
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat

Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

Lecture Notes

Q.1 Explain data warehouse features.


Ans. i. Data Warehouse is characterized by four key features:

1. Subject oriented Data


2. Integrated Data
3. Time Variant Data
4. Non-volatile Data

ii. Let us examine some of key defining features of data warehouse based on these definitions.
How is this data different from the data in any operational system?

 Subject oriented Data:


- In every industries data sets are organized around individual application to support
those particular operational system.
- In contrast, in the data warehouse data is stored by business subjects not by
applications. Business subject differs from enterprise to enterprise. The fig. below
distinguishes between how data is stored in operational system and in the data
warehouse.
- For example, claims is a critical business subject for an insurance company. Claims
under automobile insurance policies are processed in Auto Insurance Application.
- Similarly, claims data for worker’s compensation insurance is organized in the Workers
Company Insurance application.

 Integrated Data:
- For proper decision making, you need to pull together all relevant data from the various
applications.
- Fig. below illustrates the simple process of data integration for a banking
[Link] data fed into the subject area of account in the data warehouse comes
from three different operational applications.
- Before moving the data into the data warehouse, you have to go through a process of
transformation, consolidation and integration of source data.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

Here are some of item that would need standardization


- Naming Conventions
- Codes
- Data Attributes
- Measurements
 Time Variant Data
- For operational system, the stored data reflect current information because these
system support day to day current operations.
- A data warehouse, because of the very nature of its purpose has to contain historical
data not just current values.
- Every data structure in the data warehouse contains the time element.
- The time variant nature of the data in data warehouse
 Allows for analysis of the past.
 Relates information to present.
 Enables forecast for the future.

 Non-volatile Data
- As shown in fig. below, every business transaction does not update the operational
system database in real time.
- We add, change or delete data from operational system database as each transaction
happens but not usually update data in the data warehouse.
- You don’t delete the data in data warehouse in real time.
- The data in data warehouse is not as volatile as the data in an operational database is.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

Q.2 What is prediction? Explain about Linear regression method.


Ans. Prediction: Prediction within the context of linear regression refers to the process of
estimating
the value of dependent variable based on value of one and more independent
variable, assuming a linear relation exist between them. Consider example,
y = kx

Linear Regression Method for Prediction


 Modeling the Relationship:
1. Linear regression establishes a mathematical model that describes the relationship
between the dependent variable (the variable to be predicted, often denoted as ‘y’)
and the independent variable (the predictor variables, often denoted as ‘x’).
2. This linear relationship is represented by a linear equation. For simple linear regression
one independent variables:
Y=a+bx
3. For multiple linear regression multiple independent variables:
Y = a + b1x1 + b2x2 + ……….. + bnxn
4. In these equations ‘a’ represents the y intercept the value of when pole x variable are
zero. ‘b’ represents the slope the change in y curve unit change in x.

 Training the Model:


1. The linear regression model is “trained” using a dataset of known values for both the
independent and dependent variables.
2. During this training phase, statistical methods are used to determine the optimal values
for ‘a’ and ‘b’ (b1,b2……bn) that best bit the observed data, minimizing the difference
between the actual ‘y’ values and the predicted ‘y’ values.

 Making the Prediction:


1. Once the model is trained and the parameter (a& b) are determined. You can use the
model to predict the value of the dependent variables.
2. You simply plug the new ‘x’ variables into established linear equation, and the equation
will output the predicted ‘y’ value.
Example:
If you have historical data on advertising spend (independent variable) and corresponding
sales figures (dependent variable) linear regression can be used to find the linear
relationship between them.

Q.3 Explain with example any four OLAP operations.


Ans. A combination of multiple types of technologies is needed for building a data warehouse.
- The range is wide, data modelling, data extraction, data transformation, database
management system, control modules, alert system agents, query tools, analysis tools,
report writers and so on.
- There is no scarcity of vendors and products.
- These multivendor products have to cooperate and work together in your data warehouse.
- When you use database from one product/vendor, the query and reporter tool from another
vendor, and OLAP product from yet another vendor, these three products have no standard
method for exchanging data.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

OLAP (Online Analytical Processing) is a technology used in data warehousing to enable fast,
multi-dimensional analysis of large datasets.
1. It is a comparing technology that provides fast, consistent, and interactive access to data for
analysis.
2. Typically used in data warehouse but also can be used with data lakes.
3. The core of OLAP data model is multi-dimensional which allows users to explore data from
different dimensions (time, product, geography).
4. Pre-calculate and aggregate data to enable quick query responses.

How OLAP works with data warehouse:


- Data is extracted from various sources and loaded into data warehouse.
- The data warehouse is optimized for analytical queries not for real time transactions.
- OLAP serves then connected to the data warehouse and provide tools for users to query,
analyze and report on the data.
- Users can use OLAP tools to perform operations like drilling upto higher level of aggregation,
slicing and dicing data by different dimensions.

Key benefits of OLAP:


- Faster query performance.
- Improved business intelligence.
- Enhanced reporting capabilities.
- Support for complex queries.

Q.4 Explain different OLAP operations on multi-dimensional data.


Ans. OLAP (Online Analytical Processing) utilizes a multi-dimensional data model, often represented
as a cube, to enable fast and interactive analysis of data.
- The multi-dimensional structure, with dimensions like time, product, and location facilitates
quick access to summarized information and supports sophisticated business intelligence
and decision making.

Multi-dimensional Data Model OLAP cubes:


- At the heart of OLAP is the concept of multi-dimensional data cube, which is a data structure
designed for efficient storage and retrieval of data across multi-dimensions.

Dimensions:
- Dimensions represent the different perspective from which data can be analyzed. (e.g. time,
product, geography, customer)

Measures:
- Measures are the numerical values that are being analyzed within the cube. (e.g. sales,
figures, profit, margin)

How it works:
1. Data is pre-aggregated OLAP systems pre-calculate and store aggregated data across
different dimensions, allowing for rapid query performance.
2. Users interact with the cube user can “slice and dice” the data, down into details or rolling
upto higher level of summarization.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

3. Complex calculations are performed efficiently because data is pre-aggregated, OLAP system
can handle complex calculations and aggregations much faster than traditional relational
database.

Example:
- Imagine a sales database, An OLAP cube could represent with dimension like “Time” (years,
queries, month) “Product” (different product categories) and “Location” (region categories).

Q.5 Demonstrate with diagram Data Mining Architecture.


Ans. - Data mining is the process of discovering interesting and useful knowledge from large
amount of data stored in data base.
- Data mining is usually applied to data warehouse.
- Fig. shows the architecture of data mining system Database or Data Warehouse or other
information repository.
- This includes one or more databases, a data warehouse, or any other information repository.
- Data cleaning and data integration techniques have to be applied to the data before data
mining algorithm can be applied on it.

Database or Data Warehouse Serves:


- The database or data warehouse serves is used to fetch the relevant data on the user’s data
mining request.

Knowledge base:
- It contains the domain knowledge that is used to guide the search or used for evaluation of
the interestingness of resulting patterns.

Data Mining Engine:


- It consists of a set of functionalities for task such as characterization, association,
classification and cluster analysis and evolution analysis.

Pattern Evolution Model:


- The correspondents employ interestingness measures and other threshold values and
interacts with data mining engine so as to retrieve only the interesting results.

Graphical uses Interface (GUI):


- This module interacts with the users and data mining system thereby allowing the user to
use data mining system either by specifying the query or task like that of characterization
association, classification and cluster analysis.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

Q.6 Explain with example (1) Star Schema and (2) Snowflake Schema.
Ans. - In this section, we will use the fact and dimensional tables to prepare the logical design of the
data warehouse.
- It is a design technique to structure the business dimensions and metrics.
- Dimensional modelling is used for designing tables for a data warehouse.
- Once the fact and dimension tables have been formed, how these tables be arranged in the
dimensional model

The Star Schema:


- Just imagine a dimensional model with the fact table in the middle and dimension tables
arranged around the fact table.
- This model represents a star formation with the fact table at core and dimension table along
the spikes of star.
- This particular arrangement is thus called a star schema.
- Example: fig. shows simple star schema. It shows an order fact table in the middle and four
dimension table of customers, sales person, time and product.
Yadavrao Tasgaonkar Institute of Engingeineering & Technology, Karjat
Semester-V A.Y. 2025-26 Subject: Data Warehousing and Mining

The Snowflake Schema:


- It is a variation of the star schema model, in which some or all dimension tables are
normalized, thereby further splitting the data into additional tables.
- The resulting schema graph from a shape similar to a snowflake.
- Thus, snowflake schema is a more complex data warehouse model than a star schema.
- Example: Snowflake schemas are generally used when a dimension table become very big
and when a star schema can’t represent the complexity of a data structure.

You might also like