Data Mining Notes

Uploaded by

adityarsahu969
Copyright
© © All Rights Reserved

Q1. What is Data Warehousing? What are the differences between Data Warehousing and Data Mining?

A Data Warehouse is built to support management functions whereas data mining is used to
extract useful information and patterns from data. Data warehousing is the process of compiling
information into a data warehouse. The main purpose of data warehousing is to consolidate and
store large datasets from various sources for efficient retrieval and analysis, supporting reporting
and decision-making. Data mining focuses on analyzing data to discover patterns, trends, and
insights, while data warehousing focuses on storing and managing data in a centralized location.

What is Data Warehousing?


A data warehouse is a technology that aggregates structured data from one or more sources so that
it can be compared and analyzed; it is built for analysis rather than transaction processing. It is
designed to support management decision-making by providing a platform for data cleaning, data
integration, and data consolidation. A data warehouse contains subject-oriented, integrated,
time-variant, and non-volatile data. It consolidates data from many sources while ensuring data
quality, consistency, and accuracy, and it improves system performance by separating analytical
processing from transactional databases. Data flows into the warehouse from the various source
databases. The warehouse organizes this data into a schema that describes the layout and type of
the data, and query tools use the schema to analyze the data tables.
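
As a minimal sketch of how a schema describes the layout and type of data, and how a query tool then reads the tables through it, the following uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (dim_product, fact_sales), the columns, and the rows are illustrative assumptions, not part of these notes.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")

# The schema declares each table's layout and column types.
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    sale_year  INTEGER NOT NULL,
    amount     REAL NOT NULL
);
""")

# Hypothetical rows loaded from source systems.
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Electronics"), (2, "Furniture")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2023, 100.0), (2, 1, 2024, 150.0), (3, 2, 2024, 80.0)])

# An analytical query: total sales per category, read through the schema.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(s.amount)
    FROM fact_sales AS s
    JOIN dim_product AS p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(sales_by_category)
```

The query names only tables and columns that the schema declares, which is the sense in which query tools analyze the data tables using the schema.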

Figure: Data Warehousing process


Advantages of Data Warehousing

• The data warehouse's job is to make any form of corporate data easier to understand. Most of
the user's work is limited to supplying the raw data.

• The capacity to update continuously and frequently is the key benefit of this technology.
As a result, data warehouses suit organizations and entrepreneurs who want to stay current
with their target audience and customers.

• It makes data more accessible to businesses and organizations.

• A data warehouse holds a large volume of historical data that users can use to compare
different periods and trends in order to make predictions about the future.

Disadvantages of Data Warehousing

• There is a great risk of accumulating irrelevant and useless data. Data loss and erasure are
other potential issues.

• Data is gathered from various sources in a data warehouse, so it must be cleansed and
transformed. This can be a difficult task.

What is Data Mining?


It is the process of finding patterns and correlations within large data sets to identify
relationships between data. Data mining tools allow a business organization to predict customer
behavior. Data mining tools are used to build risk models and detect fraud. Data mining is used
in market analysis and management, fraud detection, corporate analysis, and risk management.

Figure: Data Mining process

Advantages of Data Mining

• Data mining aids in a variety of data analysis and sorting procedures. One of its best
applications is the detection and identification of undesired faults in a system, which
allows the resulting risks to be eliminated sooner.

• In comparison to other statistical data applications, data mining methods are both
cost-effective and efficient.

• Companies benefit from this analytical tool because it provides relevant and easily
accessible knowledge-based data.

Disadvantages of Data Mining

• Data mining isn't always 100 percent accurate, and if done incorrectly, it can lead to data
breaches.

• Organizations must devote a significant amount of resources to training and
implementation. Furthermore, data mining tools work in different ways depending on the
algorithms used to build them.

Difference Between Data Mining and Data Warehousing


| Basis of Comparison | Data Warehousing | Data Mining |
| --- | --- | --- |
| Definition | A data warehouse is a database system that is designed for analytical analysis instead of transactional work. | Data mining is the process of analyzing data patterns. |
| Process | Data is stored periodically. | Data is analyzed regularly. |
| Purpose | Data warehousing is the process of extracting and storing data to allow easier reporting. | Data mining is the use of pattern recognition logic to identify patterns. |
| Managing Authorities | Data warehousing is solely carried out by engineers. | Data mining is carried out by business users with the help of engineers. |
| Data Handling | Data warehousing is the process of pooling all relevant data together. | Data mining is considered as a process of extracting data from large data sets. |
| Functionality | Subject-oriented, integrated, time-varying and non-volatile constitute data warehouses. | AI, statistics, databases, and machine learning systems are all used in data mining technologies. |
| Task | Data warehousing is the process of extracting and storing data in order to make reporting more efficient. | Pattern recognition logic is used in data mining to find patterns. |
| Uses | It extracts data and stores it in an orderly format, making reporting easier and faster. | This procedure employs pattern recognition tools to aid in the identification of access patterns. |
| Examples | When a data warehouse is connected with operational business systems like CRM (Customer Relationship Management) systems, it adds value. | Data mining aids in the creation of suggestive patterns of key parameters. Customer purchasing behavior, items, and sales are examples. As a result, businesses will be able to make the required adjustments to their operations and production. |

Q2. What is the difference between OLAP and OLTP?

Ans. OLAP stands for On-Line Analytical Processing. It is used to analyze information from
multiple database systems at one time, for tasks such as sales analysis and forecasting, market
research, and budgeting. A data warehouse is an example of an OLAP system.

OLTP stands for On-Line Transaction Processing. It is used to maintain online transactions and
record integrity in multi-access environments. An OLTP system manages a very large number of
short online transactions; an ATM is an example.

| Sr. No. | Key | OLAP | OLTP |
| --- | --- | --- | --- |
| 1 | Basic | It is used for data analysis. | It is used to manage a very large number of short online transactions. |
| 2 | Database Type | It uses a data warehouse. | It uses a traditional DBMS. |
| 3 | Data Modification | It is mainly used for data reading. | It manages all insert, update, and delete transactions. |
| 4 | Response time | Processing is a little slow. | In milliseconds. |
| 5 | Normalization | Tables in an OLAP database are not normalized. | Tables in an OLTP database are normalized. |
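
The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the accounts and withdrawals tables, the withdraw helper, and the sample amounts are hypothetical, and a real deployment would keep the analytical query in a separate warehouse rather than in the transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, month TEXT, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 300.0)])
conn.commit()

def withdraw(account_id, month, amount):
    """OLTP-style operation: one short transaction that modifies a few rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                     (amount, account_id))
        conn.execute("INSERT INTO withdrawals VALUES (?, ?, ?)",
                     (account_id, month, amount))

# Many small transactions, as an ATM network would issue.
withdraw(1, "2024-01", 100.0)
withdraw(2, "2024-01", 50.0)
withdraw(1, "2024-02", 25.0)

# OLAP-style operation: a read-only query that aggregates many rows.
monthly_totals = conn.execute(
    "SELECT month, SUM(amount) FROM withdrawals GROUP BY month ORDER BY month"
).fetchall()
balance_1 = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = 1"
).fetchone()[0]

print(monthly_totals, balance_1)
```

Each withdraw call touches two rows and finishes quickly, while the monthly total scans the whole withdrawals table without changing it, matching rows 1 and 3 of the table.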

Q3. What are the characteristics of a Data Warehouse?

Ans. A data warehouse is a centralized repository for storing and managing large amounts of
data from various sources for analysis and reporting. It is optimized for fast querying and
analysis, enabling organizations to make informed decisions by providing a single source of truth
for data. Data warehousing typically involves transforming and integrating data from multiple
sources into a unified, organized, and consistent format. A data warehouse is easier to manage
when its users share a common way of describing the trends recorded under each subject. Its four
main characteristics are described below.

1. Subject-Oriented

A data warehouse is subject-oriented, meaning it focuses on specific themes like sales,
healthcare, marketing, or distribution, rather than day-to-day operations. It is designed to
collect and organize data related to a particular topic to support analysis and decision-making.
Unnecessary data is removed, making it easier to get clear and relevant insights for that subject.

Figure: Subject-oriented

2. Integrated
Integration in a data warehouse means combining data from different sources like mainframes
and relational databases into a consistent and reliable format. This involves using standard
naming conventions, formats, and codes so that data can be easily understood and analyzed.
Integration ensures that all related data is unified, allowing for more accurate and efficient
decision-making across different subject areas.

Figure: Data warehouse is integrated
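
A minimal sketch of integration, assuming two hypothetical source systems that record the same kind of fact with different field names, gender codes, and currency units. The field names, the code table, and the sample rows are all invented for illustration.

```python
# Two hypothetical source systems that encode the same facts differently.
mainframe_rows = [{"CUST_NM": "Asha Rao", "GENDER": "F", "AMT_PAISE": 125000}]
crm_rows = [{"customer_name": "Ravi Kumar", "sex": "male", "amount_rupees": 980.5}]

# One standard code per value, whatever the source called it.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}

def from_mainframe(row):
    """Map a mainframe record onto the warehouse's standard names and units."""
    return {
        "customer_name": row["CUST_NM"],
        "gender": GENDER_CODES[row["GENDER"].lower()],
        "amount_rupees": row["AMT_PAISE"] / 100,  # convert paise to the common unit
    }

def from_crm(row):
    """Map a CRM record onto the same standard names and units."""
    return {
        "customer_name": row["customer_name"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "amount_rupees": float(row["amount_rupees"]),
    }

integrated = ([from_mainframe(r) for r in mainframe_rows]
              + [from_crm(r) for r in crm_rows])

for record in integrated:
    print(record)
```

After the mapping, every record uses the same field names, the same gender codes, and the same unit, so records from both sources can be analyzed together.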

3. Time-Variant

Time-variance means that data in a data warehouse is stored over different time periods—such
as weekly, monthly, or yearly. Unlike operational systems, it keeps historical data for long-term
analysis. Once data is entered, it is not changed or updated, preserving the state of data at a
specific point in time. This allows users to analyze trends and changes over time.

Figure: Time-Variant

4. Non-Volatile

Non-volatility means that once data is stored in a data warehouse, it is not deleted or updated.
Instead, new data is added over time, keeping the historical records intact. The data is read-only
and refreshed at specific intervals, making it ideal for analyzing trends and long-term
performance.

Unlike operational systems, a data warehouse does not require transaction processing, recovery,
or concurrency control. Operations like insert, update, and delete used in day-to-day applications
are generally not performed here.

There are mainly two types of data operations in a data warehouse:

1. Data Loading: inserting bulk data from various sources.

2. Data Access: reading and analyzing the stored data.


Figure: Non-Volatile
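
The two operations can be illustrated with a toy append-only store. This is a sketch only: the AppendOnlyStore class, its load and read methods, and the sample stock figures are hypothetical and do not correspond to any particular warehouse product.

```python
from datetime import date

class AppendOnlyStore:
    """Toy warehouse table that supports only bulk loading and reading."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot_date, records):
        """Data loading: append a batch, stamping each row with its snapshot date."""
        for record in records:
            self._rows.append({"snapshot_date": snapshot_date, **record})

    def read(self, snapshot_date=None):
        """Data access: return copies of rows, optionally for one snapshot date."""
        return [dict(r) for r in self._rows
                if snapshot_date is None or r["snapshot_date"] == snapshot_date]

store = AppendOnlyStore()
store.load(date(2024, 1, 31), [{"product": "Laptop", "stock": 40}])
store.load(date(2024, 2, 29), [{"product": "Laptop", "stock": 25}])

# The February load did not overwrite January, so the history is still available.
history = [(r["snapshot_date"].isoformat(), r["stock"]) for r in store.read()]
print(history)
```

The class deliberately offers no update or delete method, and read returns copies, so earlier snapshots stay intact when new data arrives.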

Functions of Data warehouse


It serves as a collection of organized data, managed by different groups to support data retrieval.
It tracks high-transaction tables and helps define key data warehousing techniques and functions.

• Data Consolidation: Combines data from multiple sources into a single, consistent
repository.

• Data Cleaning: Removes errors, duplicates, and irrelevant information to ensure data
quality.

• Data Integration: Merges data from various sources into a unified format for accurate
analysis.

• Data Storage: Stores large volumes of historical data for easy and quick access.

• Data Transformation: Converts and standardizes data to ensure consistency and usability.

• Data Analysis: Enables deep data exploration and insight generation.

• Data Reporting: Supports dashboards and reports for stakeholders and departments.

• Data Mining: Identifies patterns and trends to aid in strategic decisions.

• Performance Optimization: Ensures fast querying and efficient data access.


Q3. What are the features of a data warehouse?
Ans. Data warehouse features are the core functionalities that enable efficient and effective data
management within a data warehouse. Some key data warehousing features include a centralized
repository, subject-oriented data, non-volatile data storage, data integration, and transformation.

A data warehouse can be described in several ways:

A data warehouse is a location where businesses store critical information holdings such as
client data, sales figures, employee data, and so on.

A data warehouse (DW) is a digital information system that links and unifies massive amounts of
data from numerous sources.

A data warehouse is a central server system that permits the storage, analysis, and interpretation
of data to aid in decision-making.

It is a storage area that houses structured data (database tables, Excel sheets) as well as semi-
structured data (XML files, webpages) for tracking and reporting.

The data warehouse is the heart of the BI system, designed for reporting and analysis of data.

It is a fusion of elements and technologies that facilitates the strategic application of data.

The table below compares a database with a data warehouse:

| Database | Data Warehouse |
| --- | --- |
| A database is a collection of data to organize information. It helps to access, retrieve, and manipulate information. | A data warehouse is a central server system that allows the storage, analysis, and interpretation of data to support decision-making. |
| Its purpose is to store the data. | Its purpose is to analyze the data. |
| Databases are used for operational tasks such as managing daily transactions and business procedures. | Data warehouses are used for strategic goals such as business decision-making and historical pattern analysis. |
| Tables and joins in a database are complex because of normalization. | Tables and joins in a data warehouse are easy because they are denormalized. |
| Databases are commonly utilized by both operational staff and application developers. | Data warehouses are commonly used by executives and business analysts. |
| To keep the database accurate and consistent, the data in it is updated on a regular basis. | Usually, static and historical data are present in data warehouses. |
| Databases are generally smaller in size than data warehouses. | When compared to databases, data warehouses are larger. |
| Designing is done using ER modeling methods. | Designing is done using data modeling methods. |
| It supports OLTP (Online Transaction Processing). | It supports OLAP (Online Analytical Processing). |
| A database keeps detailed data. | Data warehouses contain summarized data. |
| Example: MySQL, Oracle, etc. | Example: Google BigQuery, IBM Db2, etc. |

A data warehouse converts relational data and other data sources into multidimensional concepts
for analysis. Metadata is formed during this conversion to speed up queries and searches. On
top of this data layer is a semantic layer that organizes and maps complex data into familiar
business language such as ‘product’ or ‘customer’ so analysts can quickly build analyses without
knowing database table names. Finally, an analytics layer sits on top of the semantic layer,
allowing authorized users to access, visualize, and interpret data.
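
The idea of a semantic layer can be sketched as a small mapping from business terms to physical names. Everything here is an assumption made for illustration: the terse table name t_sls_01, its columns, the SEMANTIC_MODEL dictionary, and the analyze helper are hypothetical, and sqlite3 merely stands in for the data layer.

```python
import sqlite3

# Hypothetical physical table with terse, hard-to-remember names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_sls_01 (prd_nm TEXT, cust_nm TEXT, amt REAL)")
conn.executemany("INSERT INTO t_sls_01 VALUES (?, ?, ?)",
                 [("Laptop", "Asha", 900.0),
                  ("Phone", "Ravi", 400.0),
                  ("Laptop", "Ravi", 950.0)])

# Semantic layer: business terms mapped to physical columns and expressions.
SEMANTIC_MODEL = {
    "table": "t_sls_01",
    "dimensions": {"product": "prd_nm", "customer": "cust_nm"},
    "measures": {"sales": "SUM(amt)"},
}

def analyze(measure, by):
    """Build and run a query from business terms only.

    Unknown terms raise KeyError, so only names defined in the model
    ever reach the SQL text.
    """
    column = SEMANTIC_MODEL["dimensions"][by]
    expression = SEMANTIC_MODEL["measures"][measure]
    sql = (f"SELECT {column}, {expression} FROM {SEMANTIC_MODEL['table']} "
           f"GROUP BY {column} ORDER BY {column}")
    return conn.execute(sql).fetchall()

sales_by_product = analyze("sales", by="product")
sales_by_customer = analyze("sales", by="customer")
print(sales_by_product, sales_by_customer)
```

The analyst asks for "sales" by "product" and never sees t_sls_01 or prd_nm; an analytics layer would sit on top of calls like these to chart or tabulate the results.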

Some of the features of a Data Warehouse are listed below:

Integrated

Establishing a common unit of measurement for all related data in a data warehouse using data
from different databases is the process of integrating data. You must store data within it in a
simple and universally acceptable manner.

It must also be consistent in terms of nomenclature and layout. This type of application is useful
for analyzing big data.
Non-volatile

The data warehouse is also non-volatile, which means that past data is not erased. The
information is read-only and is refreshed only at routine intervals. This helps with statistical
evaluation of the data and with understanding what happened and when, without requiring any
complicated procedure.

Subject-oriented

Rather than company operations, a data warehouse typically provides information on a specific
topic (such as sales, inventory, or supply chain).

Persistent

Prior data is not deleted when new data is added, making it persistent and non-volatile. Data
from the past is kept for comparisons, pattern detection, and predictive analysis.

Types of Data Warehouse

1. Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse (EDW) is a centralized warehouse. It offers decision support
services throughout the organization. It provides a unified approach to data organization and
representation. It also allows you to categorize data by subject and grant access based on those
classifications.

2. Operational Data Store (ODS)

When neither a data warehouse nor an OLTP system can meet a firm’s information
requirements, an operational data store (ODS) is required. The data in an ODS is refreshed in
real time. As a result, it is widely used for routine tasks such as storing employee records.

3. Data Marts

A Data Mart is a subset of a data warehouse. It is designed for a specific business segment,
such as sales or finance. Data can also be gathered directly from sources and stored in an
independent data mart.

Q4. What are the advantages of a data warehouse?

Ans. Data warehouses offer numerous advantages, primarily focused on improving data-driven
decision-making. These benefits include consolidated and consistent data, enhanced analytics
capabilities, improved data quality, and streamlined data management, ultimately leading to
better business intelligence and a competitive edge.

Here's a more detailed look at the advantages:


1. Informed Decision Making:

• Data warehouses provide a unified view of information from various sources, enabling
comprehensive analysis and better-informed decisions.
• By consolidating data, businesses can identify trends, patterns, and insights that might be
missed when working with isolated data sources.
• Historical data analysis is facilitated, allowing for trend identification and informed
predictions.

2. Enhanced Analytics Capabilities:

• Data warehouses are designed to handle large volumes of data, making them ideal for
complex analytics and data mining.
• They support various analytical tools and techniques, including reporting, dashboards, and
machine learning, leading to deeper insights.
• This allows for more sophisticated analysis of business performance and customer
behavior.

3. Improved Data Quality and Consistency:

• Data warehouses cleanse, standardize, and integrate data from different sources, ensuring
a single source of truth.
• This eliminates data silos and improves data quality, which is crucial for reliable analysis
and decision-making.
• Consistent data across the organization reduces errors and inconsistencies in reporting and
analytics.

4. Streamlined Data Management:

• Centralized data storage simplifies data management and reduces the need to access
multiple systems.
• This leads to time and cost savings in data storage and management.
• Data warehouses also improve data accessibility and retrieval efficiency, making it easier
for users to find the information they need.

5. Competitive Advantage:

• By leveraging data-driven insights, businesses can optimize their strategies, improve
customer relationships, and gain a competitive edge.
• Data warehouses enable businesses to understand their customers better, personalize their
offerings, and develop targeted marketing campaigns.
• They also facilitate better supply chain management and overall operational efficiency.

6. Scalability:

• Data warehouses can be scaled to accommodate growing data volumes and business
needs, ensuring they remain relevant as the business evolves.
• This scalability is crucial for long-term data management and analysis.

7. Historical Data Analysis:

• Data warehouses store historical data, allowing businesses to track trends over time and
make informed decisions based on past performance.
• This historical data is valuable for forecasting, trend analysis, and understanding business
performance over extended periods.

In essence, data warehouses provide a robust platform for managing, analyzing, and leveraging
data to drive business growth, improve decision-making, and gain a competitive advantage.

Q5. What are the top-down and bottom-up approaches in a data warehouse?

Ans. When designing complex systems, choosing the right design approach is important. Two
fundamental design approaches are Top-Down Design and Bottom-Up Design. Each has its own
advantages, disadvantages, and use cases. The sections below explain the Top-Down Design
Model and the Bottom-Up Design Model, highlighting their differences, benefits, and practical
applications.

What is the Top-Down Design Model?


In the top-down model, an overview of the system is formulated without going into detail for
any part of it. Each part is then refined in more detail, and refined again, until the entire
specification is detailed enough to validate the model. If we look at a problem as a whole, it
may appear impossible to solve because it is so complicated; examples include writing a
university system program or writing a word processor. Complicated problems like these can be
solved using top-down design, also referred to as stepwise refinement.

How Does Top-Down Approach Work?

The Top-Down approach works by:

1. Breaking the problem into major components.

2. Refining each component into smaller subcomponents.

3. Continuing this process until every part is simple enough to implement.

This approach is particularly useful for solving complex problems, like developing a University
System Program or a Word Processor. By starting with a high-level design and gradually
focusing on smaller details, the complexity of the system becomes more manageable.
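
Stepwise refinement can be sketched in code by writing the top-level function first and filling in the parts it calls afterwards. The university example below is hypothetical: the function names, the sample students, and the pass mark of 40 are assumptions made only to show the order of refinement.

```python
def run_university_system(students, marks):
    """Top level: written first, in terms of parts that are refined later."""
    enrolled = enroll_students(students)
    results = compute_results(enrolled, marks)
    return format_report(results)

# Second level: each major component is refined once the overview is fixed.
def enroll_students(students):
    """Remove duplicate names and return them in alphabetical order."""
    return sorted(set(students))

def compute_results(enrolled, marks):
    """Grade every enrolled student; a missing mark counts as 0."""
    return {name: grade_for(marks.get(name, 0)) for name in enrolled}

def format_report(results):
    """Turn the results into printable lines."""
    return [f"{name}: {grade}" for name, grade in results.items()]

# Third level: a detail small enough to implement directly.
def grade_for(mark):
    """Assumed rule for illustration: 40 or more is a pass."""
    return "Pass" if mark >= 40 else "Fail"

report = run_university_system(["Ravi", "Asha", "Ravi"], {"Asha": 72, "Ravi": 35})
print(report)
```

Reading the file from top to bottom follows the three steps: the overview comes first, then the major components, then the smallest details.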

Advantages of the Top-Down Design Model

• Simplifies Complex Problems: Breaking problems into smaller parts helps us to identify
what needs to be done.

• Easy to Identify Requirements: At each step of refinement, new parts will become less
complex and therefore easier to solve.

• Promotes Reusability: Parts of the solution may turn out to be reusable.

• Collaboration-Friendly: Breaking problems into parts allows more than one person to
work on the solution.

What is Bottom-Up Design Model?


In contrast, the Bottom-Up Design Model starts by defining the system's individual parts
first. Once the individual components are detailed, they are integrated into larger modules. This
process continues until the system is fully integrated. The Bottom-Up approach is often used in
Object-Oriented Programming (OOP) languages like C++, Java, and Python, where individual
objects are identified and developed first.

How Does Bottom-Up Approach Work?

The Bottom-Up approach works by:

1. Identifying and specifying the smaller components (or objects).

2. Linking these smaller parts together to form larger components.

3. Continuously integrating them to complete the system.

This method focuses on creating well-defined and reusable low-level components before
deciding how to integrate them into higher-level systems.
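
A minimal sketch of the bottom-up order, using the same hypothetical university setting: small classes are defined first and only then assembled into a larger module. The class names, the capacity rule, and the sample data are illustrative assumptions.

```python
class Student:
    """Low-level component, written and tested first."""

    def __init__(self, name):
        self.name = name

class Course:
    """Second low-level component, independent of any larger system."""

    def __init__(self, title, capacity):
        self.title = title
        self.capacity = capacity
        self.students = []

    def enroll(self, student):
        """Add the student if a seat is free; report whether it worked."""
        if len(self.students) >= self.capacity:
            return False
        self.students.append(student)
        return True

class University:
    """Higher-level module assembled from the existing parts."""

    def __init__(self):
        self.courses = {}

    def add_course(self, course):
        self.courses[course.title] = course

    def enroll(self, student_name, course_title):
        return self.courses[course_title].enroll(Student(student_name))

university = University()
university.add_course(Course("Data Mining", capacity=1))
first = university.enroll("Asha", "Data Mining")
second = university.enroll("Ravi", "Data Mining")
print(first, second)
```

Student and Course can be reused or tested on their own, and University adds no logic beyond linking them, which reflects the three steps listed above.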

Advantages of the Bottom-Up Design Model

• Reusability of Low-Level Components: Decisions about reusable low-level utilities are
made early in the design.

• Focused Problem-Solving: Developers can focus on solving smaller and more isolated
problems first.

• Increased Modularity: The modular approach makes it easier to update individual
components without affecting the entire system.

Key Differences Between Top-Down and Bottom-Up Design Models

| S. No. | TOP DOWN APPROACH | BOTTOM UP APPROACH |
| --- | --- | --- |
| 1. | Focuses on breaking the problem into smaller, more manageable parts. | Solves smaller problems and integrates them into a complete system. |
| 2. | Mainly used in structured programming languages like COBOL, Fortran, C, etc. | Mainly used in object-oriented programming languages like C++, C#, Python. |
| 3. | Each part is programmed separately and therefore contains redundancy. | Redundancy is minimized by using data encapsulation and data hiding. |
| 4. | Communication is less among modules. | Modules must communicate to integrate the system. |
| 5. | It is used for debugging and module documentation. | It is basically used in testing. |
| 6. | Decomposition of the system occurs, breaking it into smaller components. | Composition of the system happens by combining low-level components into a higher-level structure. |
| 7. | The top function of the system might be hard to identify. | Sometimes we cannot build a program from the piece we have started with. |
| 8. | Implementation details can vary throughout the process. | Building a program can be difficult if modules are not assembled in a logical order. |
| 9. | Pros: Easier isolation of interface errors. It benefits in the case an error occurs towards the top of the program. Defects in design get detected early and can be corrected, as an early working module of the program is available. | Pros: Easy to create test conditions. Test results are easy to observe. It is suited if defects occur at the bottom of the program. |
| 10. | Cons: Difficulty in observing the output of a test case. Stub writing is quite crucial as it leads to setting of output parameters. When stubs are located far from the top-level module, choosing test cases and designing stubs become more challenging. | Cons: There is no representation of the working model once several modules have been constructed. There is no existence of the program as an entity without the addition of the last module. From a partially integrated system, test engineers cannot observe system-level functions; this is possible only with the installation of the top-level test driver. |


Q6. What is a Data Cube?

Ans.

A data cube is a multi-dimensional data structure for organizing data. A flat table arranges
values by row and column only, whereas a data cube can arrange them along any number of
dimensions.

Each cell in a data cube holds a value for one combination of dimension values, and cells can be
aggregated along a dimension to produce summary values. Data cubes are used by companies like
Google, Facebook, Twitter, and Amazon to handle vast amounts of data. These companies have
millions of users, so they need ways to store all this data and make it available for quick
retrieval.

Data cubes are used to store large amounts of related data. A single record or row in a database
table contains one piece of information. Conversely, a single record or row in a data cube
contains multiple pieces of information related to each other through their common attributes.

What is Data Cube?

A data cube is a multidimensional data structure that represents large amounts of data. It consists
of a set of measures, dimensions, and hierarchies, which are related to each other in a specific
way.

A measure is a numerical value that can be aggregated into groups. In a relational database, you
can create a table and define your measures as columns. In an OLAP database, you typically
have predefined measures such as Sales Amount or Profit (in thousands).

Data cubes support various operations. Here is an overview of some key data cube operations:

• Roll-up: This operation aggregates the data along a dimension, for example by adding up all
the data in a category and presenting it as a single record. It is like zooming out of the
cube and looking at the data from a broader perspective.
• Drill-Down: This is the reverse of roll-up. Users navigate deeper into a dimension
hierarchy; for instance, drilling down from a product category can reveal sales figures for
each individual product.
• Slicing: Users may sometimes want to focus on a single value of one dimension. In such
cases, they can filter the data to view only that subset. For example, slicing a sales data
cube on “Electronics” limits the view to sales of electronic products.
• Dicing: Users select a subset of values from each of two or more dimensions. This isolates
the part of the cube where those dimensions intersect.
• Pivoting: This means rotating the cube to view the data from a different perspective. For
example, swapping the product and region dimensions shifts the focus: instead of viewing
sales by product, the cube now displays sales by region.
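
The operations above can be sketched on a tiny cube held in a plain Python dictionary. This is an illustration under stated assumptions: the three dimensions (product, region, year), the sales figures, and the helper names roll_up, slice_, and dice are hypothetical. Pivoting is shown simply as regrouping by region instead of product, following the example in the text.

```python
from collections import defaultdict

# Cube cells keyed by (product, region, year); values are sales amounts.
cube = {
    ("Electronics", "North", 2023): 100,
    ("Electronics", "South", 2023): 150,
    ("Electronics", "North", 2024): 120,
    ("Furniture", "North", 2023): 80,
    ("Furniture", "South", 2024): 60,
}
DIMS = ("product", "region", "year")

def roll_up(cells, keep):
    """Aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)

def slice_(cells, dim, value):
    """Fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, **allowed):
    """Keep cells whose value on each named dimension is in the allowed set."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

by_product = roll_up(cube, keep=("product",))               # roll-up
by_product_year = roll_up(cube, keep=("product", "year"))   # drill-down: finer grain
electronics = slice_(cube, "product", "Electronics")        # slice
north_2023 = dice(cube, region={"North"}, year={2023})      # dice
by_region = roll_up(cube, keep=("region",))                 # pivot: region, not product

print(by_product)
print(by_region)
```

Drill-down reuses roll_up with one more dimension kept, which reflects that it is the same aggregation viewed at a finer level of detail.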

What Is a Real-World Example of a Data Cube?


Banks collect and analyze data on customer interactions with their various products and services.
This data-driven approach allows banks to offer personalized services and promotions, which
enhances customer satisfaction and optimizes business performance. Here is an example of how
banks collect and organize data:

Table 1: Banking Products

| Product Type | Description |
| --- | --- |
| Checking Accounts | Everyday banking and payment transactions |
| Credit Cards | Credit card offerings for various needs |
| Personal Loans | Loans for personal expenses |
| Mortgage Loans | Home loan products for buying properties |
| Business Accounts | Banking services for businesses |
