Data Mining Notes
A Data Warehouse is built to support management functions whereas data mining is used to
extract useful information and patterns from data. Data warehousing is the process of compiling
information into a data warehouse. The main purpose of data warehousing is to consolidate and
store large datasets from various sources for efficient retrieval and analysis, supporting reporting
and decision-making. Data mining focuses on analyzing data to discover patterns, trends, and
insights, while data warehousing focuses on storing and managing data in a centralized location.
The data warehouse's job is to make any form of corporate data easier to understand. The
majority of the user's job will consist of inputting raw data.
The capacity to update continuously and frequently is the key benefit of this technology.
As a result, data warehouses are perfect for organizations and entrepreneurs who want to
stay current with their target audience and customers.
A data warehouse holds a large volume of historical data that users can use to evaluate
different periods and trends in order to create predictions for the future.
There is a great risk of accumulating irrelevant and useless data. Data loss and erasure are
other potential issues.
Data is gathered from various sources in a data warehouse. Cleansing and transformation
of the data are required. This could be a difficult task.
Data mining aids in a variety of data analysis and sorting procedures. Detecting unwanted
faults in a system is one of its best applications, because it allows threats to be
eliminated sooner.
In comparison to other statistical data applications, data mining methods are both cost-
effective and efficient.
Companies can take advantage of this analytical tool by providing appropriate and easily
accessible knowledge-based data.
Detecting and identifying undesirable faults in a system is one of the most striking uses
of data mining.
Data mining isn't always 100 percent accurate, and if done incorrectly, it can lead to data
breaches.
OLTP stands for Online Transaction Processing. It is used for maintaining online
transactions and record integrity in multi-access environments. An OLTP system manages a
very large number of short online transactions, for example ATM withdrawals.
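As an illustration of a short OLTP-style transaction, here is a minimal sketch of an ATM withdrawal using Python's built-in `sqlite3` module. The `accounts` and `ledger` tables and the `withdraw` function are hypothetical names chosen for this example, not part of any real banking system.

```python
import sqlite3

def withdraw(conn: sqlite3.Connection, account_id: int, amount: int) -> bool:
    """Debit an account atomically; return False if the debit is refused."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient funds or unknown account")
            conn.execute(
                "INSERT INTO ledger (account_id, delta) VALUES (?, ?)",
                (account_id, -amount),
            )
        return True
    except ValueError:
        return False

# Hypothetical in-memory database with a single account.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE ledger (account_id INTEGER, delta INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

ok = withdraw(conn, 1, 30)        # succeeds
rejected = withdraw(conn, 1, 500) # refused, nothing is written
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Each call touches only a few rows and either completes fully or leaves no trace, which is the record integrity the OLTP description refers to.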
Ans. A data warehouse is a centralized repository for storing and managing large amounts of
data from various sources for analysis and reporting. It is optimized for fast querying and
analysis, enabling organizations to make informed decisions by providing a single source of truth
for data. Data warehousing typically involves transforming and integrating data from multiple
sources into a unified, organized, and consistent format. A data warehouse is easier to manage
when its users share a common way of describing the trends of interest, organized by subject.
1. Subject-Oriented
A data warehouse is subject-oriented, meaning it focuses on specific themes like sales,
healthcare, marketing, or distribution, rather than day-to-day operations. It is designed to
collect and organize data related to a particular topic to support analysis and decision-making.
Unnecessary data is removed, making it easier to get clear and relevant insights for that subject.
2. Integrated
Integration in a data warehouse means combining data from different sources like mainframes
and relational databases into a consistent and reliable format. This involves using standard
naming conventions, formats, and codes so that data can be easily understood and analyzed.
Integration ensures that all related data is unified, allowing for more accurate and efficient
decision-making across different subject areas.
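The integration step can be sketched in a few lines of Python. The two source layouts below (a mainframe extract and a CRM export), their field names, and the code table are all assumptions made up for illustration. The point is only that both sources end up with the same names, formats, and codes.

```python
from datetime import date, datetime

# Hypothetical records from two source systems with different conventions.
mainframe_rows = [{"CUST_NO": "0042", "SEX": "M", "DT": "20240105"}]
crm_rows = [{"customer_id": 42, "gender": "male", "created": "2024-01-05"}]

# One standard code per gender value, whatever the source used.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def from_mainframe(row: dict) -> dict:
    """Map the mainframe layout onto the warehouse's standard layout."""
    return {
        "customer_id": int(row["CUST_NO"]),
        "gender": GENDER_CODES[row["SEX"].lower()],
        "created": datetime.strptime(row["DT"], "%Y%m%d").date(),
    }

def from_crm(row: dict) -> dict:
    """Map the CRM layout onto the same standard layout."""
    return {
        "customer_id": int(row["customer_id"]),
        "gender": GENDER_CODES[row["gender"].lower()],
        "created": date.fromisoformat(row["created"]),
    }

integrated = [from_mainframe(r) for r in mainframe_rows] + [from_crm(r) for r in crm_rows]
```

After conversion, the two records describing the same customer are identical, so they can be compared or deduplicated directly.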
3. Time-Variant
Time-variance means that data in a data warehouse is stored over different time periods—such
as weekly, monthly, or yearly. Unlike operational systems, it keeps historical data for long-term
analysis. Once data is entered, it is not changed or updated, preserving the state of data at a
specific point in time. This allows users to analyze trends and changes over time.
4. Non-Volatile
Non-volatility means that once data is stored in a data warehouse, it is not deleted or updated.
Instead, new data is added over time, keeping the historical records intact. The data is read-only
and refreshed at specific intervals, making it ideal for analyzing trends and long-term
performance.
Unlike operational systems, a data warehouse does not require transaction processing, recovery,
or concurrency control. Operations like insert, update, and delete used in day-to-day applications
are generally not performed here.
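The time-variant and non-volatile properties can be illustrated together with an append-only store. This is a minimal sketch: `PriceSnapshot` and `SnapshotStore` are hypothetical names, and a real warehouse would use database tables rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)  # frozen: a stored row cannot be modified afterwards
class PriceSnapshot:
    product: str
    price: float
    as_of: date

class SnapshotStore:
    """Append-only store: rows are added per load, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: list = []

    def load(self, rows: list) -> None:
        """Periodic refresh: new snapshots are appended to the history."""
        self._rows.extend(rows)

    def price_as_of(self, product: str, when: date) -> Optional[float]:
        """Return the most recent price recorded on or before `when`."""
        candidates = [r for r in self._rows if r.product == product and r.as_of <= when]
        return max(candidates, key=lambda r: r.as_of).price if candidates else None

store = SnapshotStore()
store.load([PriceSnapshot("widget", 9.99, date(2024, 1, 1))])
store.load([PriceSnapshot("widget", 10.49, date(2024, 2, 1))])

january_price = store.price_as_of("widget", date(2024, 1, 15))
february_price = store.price_as_of("widget", date(2024, 2, 15))
```

Because the February load did not overwrite the January row, both points in time remain queryable.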
Data Consolidation: Combines data from multiple sources into a single, consistent
repository.
Data Cleaning: Removes errors, duplicates, and irrelevant information to ensure data
quality.
Data Integration: Merges data from various sources into a unified format for accurate
analysis.
Data Storage: Stores large volumes of historical data for easy and quick access.
Data Transformation: Converts and standardizes data to ensure consistency and usability.
Data Reporting: Supports dashboards and reports for stakeholders and departments.
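The functions listed above can be sketched as a small pipeline. The sample rows, field names, and function names are assumptions for illustration only. Each function corresponds to one of the listed functions.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems.
store_sales = [
    {"order_id": "A1", "region": "north", "amount": "120.50"},
    {"order_id": "A2", "region": "South ", "amount": "80"},
]
web_sales = [
    {"order_id": "A2", "region": "south", "amount": "80"},   # duplicate of A2
    {"order_id": "B7", "region": "north", "amount": "n/a"},  # invalid amount
    {"order_id": "B8", "region": "NORTH", "amount": "19.50"},
]

def consolidate(*sources):
    """Data consolidation: one combined list of raw rows."""
    return [row for source in sources for row in source]

def transform(row):
    """Data transformation: standardize region names and amount types."""
    return {
        "order_id": row["order_id"],
        "region": row["region"].strip().lower(),
        "amount": float(row["amount"]),  # raises ValueError on bad input
    }

def clean(rows):
    """Data cleaning: drop rows that fail to parse and duplicate order ids."""
    seen, result = set(), []
    for row in rows:
        try:
            fixed = transform(row)
        except ValueError:
            continue
        if fixed["order_id"] not in seen:
            seen.add(fixed["order_id"])
            result.append(fixed)
    return result

def report(rows):
    """Data reporting: total sales per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

warehouse = clean(consolidate(store_sales, web_sales))
totals = report(warehouse)
```

Of the five raw rows, three survive cleaning, and the report groups them under two consistently named regions.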
A data warehouse can be defined in several ways:
A data warehouse is a location where businesses store critical information holdings such as
client data, sales figures, employee data, and so on.
A data warehouse (DW) is a digital information system that links and unifies massive amounts
of data from numerous sources.
A data warehouse is a central server system that permits the storage, analysis, and interpretation
of data to aid in decision-making.
It is a storage area that houses structured data (database tables, Excel sheets) as well as semi-
structured data (XML files, webpages) for tracking and reporting.
The data warehouse is the heart of the BI system, designed for reporting and analysis of data.
It is a fusion of elements and technologies that facilitates the strategic application of data.
Database | Data Warehouse
A database is a collection of data organized so that information can be accessed, retrieved, and manipulated. | A data warehouse is a central server system that allows the storage, analysis, and interpretation of data to support decision-making.
Its purpose is to store the data. | Its purpose is to analyze the data.
Databases are used for operational tasks such as managing daily transactions and business procedures. | Data warehouses are used for strategic goals such as business decision-making and historical pattern analysis.
Tables and joins in a database are complex because of normalization. | Tables and joins in a data warehouse are easy because they are denormalized.
Databases are commonly utilized by both operational staff and application developers. | Data warehouses are commonly used by executives and business analysts.
To keep the database accurate and consistent, the data in it is updated on a regular basis. | Usually, static and historical data are present in data warehouses.
Databases are generally smaller in size than data warehouses. | When compared to databases, data warehouses are larger.
Designing is done using ER modeling methods. | Designing is done using data modeling methods.
It supports OLTP (Online Transaction Processing). | It supports OLAP (Online Analytical Processing).
A database keeps detailed data. | Data warehouses contain summarized data.
Example: MySQL, Oracle, etc. | Example: Google BigQuery, IBM Db2, etc.
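To make the OLAP side of the comparison above concrete, here is a minimal sketch of an analytical query over a small star-style layout, using Python's built-in `sqlite3` module. The table names `fact_sales` and `dim_product` and the sample figures are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: one fact table and one dimension table.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Electronics'), (2, 'Clothing');
    INSERT INTO fact_sales VALUES
        (1, 2023, 100.0), (1, 2024, 150.0), (2, 2024, 40.0);
""")

# OLAP-style question: total sales per category and year (summarized data).
summary = conn.execute("""
    SELECT p.category, f.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p USING (product_id)
    GROUP BY p.category, f.year
    ORDER BY p.category, f.year
""").fetchall()
```

The query needs a single join and returns aggregated rows rather than individual transactions, which matches the "summarized data" and "easy joins" rows of the comparison.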
A data warehouse converts relational data and other data sources into multidimensional concepts
for analysis. Metadata is created during this conversion to speed up queries and searches. On
top of this data layer is a semantic layer that organizes and maps complex data into familiar
business language such as ‘product’ or ‘customer’ so analysts can quickly build analyses without
knowing database table names. Finally, an analytics layer sits on top of the semantic layer,
allowing authorized users to access, visualize, and interpret data.
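The role of the semantic layer can be sketched as a simple lookup from business terms to physical names. The table and column names below (`dw_cust_dim`, `net_amt`, and so on) are hypothetical, chosen only to show why analysts do not need to know them.

```python
# Hypothetical mapping from business terms to physical tables and columns.
SEMANTIC_LAYER = {
    "customer": ("dw_cust_dim", "cust_nm"),
    "product": ("dw_prod_dim", "prod_desc"),
    "revenue": ("dw_sales_fct", "net_amt"),
}

def resolve(term: str) -> str:
    """Translate a business term into a qualified column name."""
    table, column = SEMANTIC_LAYER[term.lower()]
    return f"{table}.{column}"

def describe_query(measure: str, by: str) -> str:
    """Describe an aggregate in physical terms from two business terms."""
    return f"SUM({resolve(measure)}) grouped by {resolve(by)}"

description = describe_query("Revenue", "Product")
```

An analyst asks for "Revenue by Product", and the layer supplies the underlying names. The analytics layer would then run and visualize the resulting query.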
Integrated
Integrating data means establishing common units of measurement for all related data that a
data warehouse draws from different databases. Data must be stored in a simple and universally
accepted manner, and it must be consistent in nomenclature and layout. This consistency is
useful when analyzing big data.
Non-volatile
The data warehouse is also non-volatile, which means that past data is not erased. The
information is read-only and is refreshed only on a routine schedule. This supports statistical
evaluation of the data and an understanding of what happened and when. No other complicated
procedure is required.
Subject-oriented
Rather than covering day-to-day company operations, a data warehouse typically provides
information on a specific topic (such as sales, inventory, or supply chain).
Persistent
Prior data is not deleted when new data is added, making it persistent and non-volatile. Data
from the past is kept for comparisons, pattern detection, and predictive analysis.
When neither a data warehouse nor an OLTP system can meet a firm’s information
requirements, an operational data store (ODS) is required. The data in an ODS is
refreshed in real time. As a result, it is widely used for routine tasks such as storing
employee records.
3. Data Marts
Ans. Data warehouses offer numerous advantages, primarily focused on improving data-driven
decision-making. These benefits include consolidated and consistent data, enhanced analytics
capabilities, improved data quality, and streamlined data management, ultimately leading to
better business intelligence and a competitive edge.
1. Improved Decision-Making:
Data warehouses provide a unified view of information from various sources, enabling
comprehensive analysis and better-informed decisions.
By consolidating data, businesses can identify trends, patterns, and insights that might be
missed when working with isolated data sources.
Historical data analysis is facilitated, allowing for trend identification and informed
predictions.
2. Enhanced Analytics:
Data warehouses are designed to handle large volumes of data, making them ideal for
complex analytics and data mining.
They support various analytical tools and techniques, including reporting, dashboards, and
machine learning, leading to deeper insights.
This allows for more sophisticated analysis of business performance and customer
behavior.
3. Improved Data Quality and Consistency:
Data warehouses cleanse, standardize, and integrate data from different sources, ensuring
a single source of truth.
This eliminates data silos and improves data quality, which is crucial for reliable analysis
and decision-making.
Consistent data across the organization reduces errors and inconsistencies in reporting and
analytics.
4. Streamlined Data Management:
Centralized data storage simplifies data management and reduces the need to access
multiple systems.
This leads to time and cost savings in data storage and management.
Data warehouses also improve data accessibility and retrieval efficiency, making it easier
for users to find the information they need.
5. Competitive Advantage:
6. Scalability:
Data warehouses can be scaled to accommodate growing data volumes and business
needs, ensuring they remain relevant as the business evolves.
This scalability is crucial for long-term data management and analysis.
7. Historical Data Analysis:
Data warehouses store historical data, allowing businesses to track trends over time and
make informed decisions based on past performance.
This historical data is valuable for forecasting, trend analysis, and understanding business
performance over extended periods.
In essence, data warehouses provide a robust platform for managing, analyzing, and leveraging
data to drive business growth, improve decision-making, and gain a competitive advantage.
Ans. When designing complex systems, choosing the right approach for software development is
important. Two fundamental design approaches are Top-Down Design and Bottom-Up Design.
Each has its own advantages, disadvantages, and use cases. The notes below cover the
Top-Down Design Model and the Bottom-Up Design Model, highlighting their differences,
benefits, and practical applications.
The top-down approach is particularly useful for solving complex problems, like developing a
University System Program or a Word Processor. By starting with a high-level design and
gradually focusing on smaller details, the complexity of the system becomes more manageable.
Simplifies Complex Problems: Breaking problems into smaller parts helps us to identify
what needs to be done.
Easy to Identify Requirements: At each step of refinement, new parts will become less
complex and therefore easier to solve.
Promotes Reusability: Parts of the solution may turn out to be reusable.
Collaboration-Friendly: Breaking problems into parts allows more than one person to
solve the problem.
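Top-down refinement, as described above, can be sketched in Python. The transcript task is a hypothetical slice of a university system, and all function names are invented for this example. The top-level function is written first in terms of steps that are only filled in afterwards.

```python
def produce_transcript(student: dict) -> str:
    """Top level: written first, in terms of steps defined later."""
    grades = collect_grades(student)
    average = compute_average(grades)
    return format_transcript(student["name"], average)

# Each step is then refined separately, possibly by different people.
def collect_grades(student: dict) -> list:
    """Extract the numeric grades from a student record."""
    return list(student["grades"].values())

def compute_average(grades: list) -> float:
    """Average the grades, treating an empty list as 0.0."""
    return sum(grades) / len(grades) if grades else 0.0

def format_transcript(name: str, average: float) -> str:
    """Render one line of the transcript."""
    return f"{name}: average {average:.1f}"

transcript = produce_transcript({"name": "Ada", "grades": {"math": 90, "physics": 80}})
```

Each refined part is smaller and simpler than the whole, and `compute_average` is general enough to be reused elsewhere.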
The bottom-up method focuses on creating well-defined and reusable low-level components
before deciding how to integrate them into higher-level systems.
Focused Problem-Solving: Developers can focus on solving smaller and more isolated
problems first.
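The bottom-up approach can be sketched the other way around: small components are written and checked first, then combined. The word-counting task is a hypothetical piece of a word processor, and the function names are assumptions made for this example.

```python
def tokenize(text: str) -> list:
    """Low-level component: split text into lowercase words."""
    return text.lower().split()

def count_words(words: list) -> dict:
    """Low-level component: frequency table for a list of words."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The components are checked in isolation before any larger system exists.
assert tokenize("Save SAVE open") == ["save", "save", "open"]
assert count_words(["a", "b", "a"]) == {"a": 2, "b": 1}

def most_common_word(text: str) -> str:
    """Higher-level feature assembled from the tested components."""
    counts = count_words(tokenize(text))
    return max(counts, key=counts.get)

top_word = most_common_word("Save SAVE open")
```

The higher-level function is decided last. The same two components could just as well be reused for a different feature.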
Ans.
A data cube is a multi-dimensional data structure that stores the data in a tabular form. The data
cube can be used to store any information, from a single column to multiple columns or
dimensions.
Each cell in a data cube represents a value that can be calculated using other values stored in
other cells of the same dimension. Data cubes are used by companies like Google, Facebook,
Twitter, and Amazon to handle vast amounts of data. These companies have millions of users.
They need ways to store all this data and make it available for quick retrieval.
Data cubes are used to store large amounts of related data. A single record or row in a database
table contains one piece of information. Conversely, a single record or row in a data cube
contains multiple pieces of information related to each other through their common attributes.
A data cube is a multidimensional data structure that represents large amounts of data. It consists
of a set of measures, dimensions, and hierarchies, which are related to each other in a specific
way.
A measure is a numerical value that can be aggregated into groups. In a relational database, you
can create a table and define your measures as columns. In an OLAP database, you typically
have predefined measures such as Sales Amount or Profit (in thousands).
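A minimal sketch of how dimensions and a measure fit together, assuming a cube with three dimensions (product, region, year) and one measure (sales amount). The fact rows are invented sample data, and `build_cube` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, year, sales_amount).
facts = [
    ("Laptop", "East", 2024, 1200.0),
    ("Laptop", "West", 2024, 800.0),
    ("Phone", "East", 2024, 500.0),
    ("Laptop", "East", 2024, 300.0),
]

def build_cube(rows):
    """Aggregate the measure into one cell per (product, region, year)."""
    cube = defaultdict(float)
    for product, region, year, amount in rows:
        cube[(product, region, year)] += amount
    return dict(cube)

cube = build_cube(facts)
laptop_east = cube[("Laptop", "East", 2024)]
```

Each cell is addressed by one value from every dimension and holds the aggregated measure, so the two Laptop/East rows collapse into a single cell.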
Data cubes support various operations. Here is an overview of some key data cube operations:
Roll-up: This operation adds up all the data from a category and presents it as a singular
record. It is like zooming out of the cube and looking at the data from a broader
perspective.
Drill-Down: The reverse of roll-up. Users navigate deeper into a dimension hierarchy to see
more detail. For instance, drilling down from a product category reveals sales figures for
each individual product.
Slicing: Users may sometimes want to focus on a specific fact from a particular
dimension. In such cases, they can filter the data to view only that specific subset. For
example, slicing a sales data cube to focus on “Electronics” narrows the view. It limits
the data to sales related only to electronic products.
Dicing: Selecting a subset of values from two or more dimensions helps isolate specific
factors. The result is a smaller sub-cube, which lets users focus on exactly where the
chosen dimension values intersect.
Pivoting: This means rotating the cube to view the data from a different perspective. For
example, swapping the product and region dimensions shifts the focus: instead of viewing
sales by product, the cube now displays sales by region.
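The five operations above can be sketched over a small dictionary-based cube. This is an illustrative sketch in plain Python rather than how an OLAP engine implements them. The cube contents, the `DIMS` tuple, and the helper names are all assumptions.

```python
from collections import defaultdict

# Hypothetical cube cells: (category, product, region) -> sales amount.
cube = {
    ("Electronics", "Laptop", "East"): 1500.0,
    ("Electronics", "Laptop", "West"): 800.0,
    ("Electronics", "Phone", "East"): 500.0,
    ("Clothing", "Jacket", "East"): 200.0,
    ("Clothing", "Jacket", "West"): 100.0,
}
DIMS = ("category", "product", "region")

def aggregate(cells, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(float)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def dice(cells, **allowed):
    """Keep cells whose coordinates fall in the allowed set per dimension."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def slice_(cells, dim, value):
    """Slicing is dicing with a single value on one dimension."""
    return dice(cells, **{dim: {value}})

roll_up = aggregate(cube, ["category"])                # coarser level
drill_down = aggregate(cube, ["category", "product"])  # finer level
electronics = slice_(cube, "category", "Electronics")
east_laptops = dice(cube, product={"Laptop"}, region={"East"})
by_product_region = aggregate(cube, ["product", "region"])
by_region_product = aggregate(cube, ["region", "product"])  # pivot: axes swapped
```

Roll-up and drill-down differ only in how many levels of the hierarchy are kept, and pivoting changes the order of the axes without changing any value.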