A snowflake schema is a data modeling technique used in data
warehousing to represent data in a structured way that supports
querying large amounts of data. As in a star schema, the fact table is
located at the center of the schema, surrounded by the dimension
tables. In a snowflake schema, however, each dimension table is
normalized into multiple related tables, creating a hierarchical
structure that resembles a snowflake.
For example, in a sales data warehouse, the product dimension table
might be normalized into multiple related tables, such as product
category, product subcategory, and product details. Each of these tables
would be related to the product dimension table through a foreign
key relationship.
Example: Snowflake Schema
The Employee dimension table contains the attributes EmployeeID,
EmployeeName, DepartmentID, Region, and Territory. The DepartmentID
attribute links the Employee dimension table with the Department
dimension table, which provides details about each department, such as
its Name and Location. The Customer dimension table contains the
attributes CustomerID, CustomerName, Address, and CityID. The CityID
attribute links the Customer dimension table with the City dimension
table, which holds details about each city, such as city name, Zipcode,
State, and Country.
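The example above can be sketched with Python's built-in sqlite3 module.
This is only an illustration under stated assumptions: the dimension
tables and their columns follow the example, but the Sales fact table,
its Amount column, and all sample rows are hypothetical, since the notes
do not list them.

```python
import sqlite3

def build_snowflake(conn):
    """Create the snowflaked dimensions from the example plus a small fact table."""
    conn.executescript("""
        CREATE TABLE Department (
            DepartmentID INTEGER PRIMARY KEY, Name TEXT, Location TEXT);
        CREATE TABLE Employee (
            EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT,
            DepartmentID INTEGER REFERENCES Department(DepartmentID),
            Region TEXT, Territory TEXT);
        CREATE TABLE City (
            CityID INTEGER PRIMARY KEY, CityName TEXT, Zipcode TEXT,
            State TEXT, Country TEXT);
        CREATE TABLE Customer (
            CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, Address TEXT,
            CityID INTEGER REFERENCES City(CityID));
        -- Hypothetical fact table: the notes do not list its columns.
        CREATE TABLE Sales (
            SaleID INTEGER PRIMARY KEY,
            EmployeeID INTEGER REFERENCES Employee(EmployeeID),
            CustomerID INTEGER REFERENCES Customer(CustomerID),
            Amount REAL);
    """)
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO Department VALUES (?, ?, ?)",
                     [(1, "Retail", "Pune"), (2, "Online", "Delhi")])
    conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
                     [(10, "Asha", 1, "West", "T1"),
                      (11, "Ravi", 2, "North", "T2")])
    conn.executemany("INSERT INTO City VALUES (?, ?, ?, ?, ?)",
                     [(100, "Mumbai", "400001", "Maharashtra", "India"),
                      (101, "Austin", "73301", "Texas", "USA")])
    conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?)",
                     [(500, "Acme", "1 Main St", 100),
                      (501, "Globex", "2 Oak Ave", 101)])
    conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)",
                     [(1, 10, 500, 100.0), (2, 10, 501, 50.0),
                      (3, 11, 500, 25.0), (4, 10, 500, 10.0)])

def sales_by_department_and_country(conn):
    """Reaching Department and Country takes one extra join per dimension."""
    return conn.execute("""
        SELECT d.Name, ci.Country, SUM(s.Amount)
        FROM Sales s
        JOIN Employee e ON e.EmployeeID = s.EmployeeID
        JOIN Department d ON d.DepartmentID = e.DepartmentID
        JOIN Customer c ON c.CustomerID = s.CustomerID
        JOIN City ci ON ci.CityID = c.CityID
        GROUP BY d.Name, ci.Country
        ORDER BY d.Name, ci.Country
    """).fetchall()
```

The query needs four joins to reach attributes that a star schema would
keep directly in two dimension tables, which illustrates why a snowflake
schema trades query performance for less redundancy.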
Characteristics of Snowflake Schema
• The snowflake schema uses less disk space because dimension data is
normalized.
• It is easy to add a new dimension to the schema.
• Queries must join multiple tables, so performance is reduced.
• The dimension table consists of two or more sets of attributes that
define information at different grains.
• The sets of attributes of the same dimension table are populated by
different source systems.
Advantages of Snowflake Schema
• It provides structured (normalized) data, which reduces data
integrity problems.
• It uses less disk space because the data are highly structured.
Disadvantages of Snowflake Schema
1. Difficult navigation: because multiple tables are involved, it is
harder to browse through the content.
2. Time-consuming: queries need more joins to reach the required
attributes.
3. Reduced query performance.
Difference between KDD and Data Mining
| Parameter | KDD | Data Mining |
|---|---|---|
| Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets. |
| Objective | To find useful knowledge from data. | To extract useful information from data. |
| Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction. |
| Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding. |
| Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Data mining focus is on the discovery of patterns or relationships in data. |
| Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge. |
KDD:
Data mining is often used as a synonym for Knowledge Discovery in
Databases (KDD), although strictly it is one step of the KDD process.
Data mining is defined as the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns and decides the purpose of the model, using classification or
characterization.
Data mining is needed to extract useful information from large datasets
and to use it for prediction or better decision-making. Nowadays, data
mining is used almost everywhere a large amount of data is stored and
processed.
Examples: the banking sector, market basket analysis, and network
intrusion detection.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable
information from large datasets. The KDD process is iterative: it
usually requires multiple passes over the steps below to extract
accurate knowledge from the data. The KDD process includes the
following steps:
1. Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data
from the collection. It includes:
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random error or variance in a
measured value.
3. Cleaning with data discrepancy detection and data transformation
tools.
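As a minimal sketch of the first two cleaning tasks, the snippet below
fills missing values with the mean of the observed values and smooths
noisy values by bin means. Both are common textbook choices assumed here
for illustration; the notes do not prescribe a particular method, and
the function names are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed
```

For example, smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
replaces the first three values with 9, the next three with 22, and the
last three with 29.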
2. Data Integration
Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse). It is carried
out using data migration tools, data synchronization tools, and the ETL
(Extract-Transform-Load) process.
3. Data Selection
Data selection is defined as the process where data relevant to the
analysis is decided on and retrieved from the data collection. For this
we can use neural networks, decision trees, Naive Bayes, clustering,
and regression methods. Having chosen the technique, we then decide on
the strategy. This stage includes choosing a particular method for
searching patterns, which may involve multiple inducers. For example,
when weighing precision against understandability, the former is better
served by neural networks, while the latter is better served by
decision trees.
4. Data Transformation
Data transformation is defined as the process of transforming data into
the form required by the mining procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the
destination to capture transformations.
2. Code generation: creating the actual transformation program.
Techniques here include dimension reduction and attribute
transformation. This step can be essential for the success of the
entire KDD project, and it is typically very project-specific. In
business, we may need to account for effects beyond our control as well
as efforts and temporal issues, for example when studying the
accumulated effect of advertising. If we do not use the right
transformation at the start, we may obtain a surprising result that
hints at the transformation required in the next iteration. Thus, the
KDD process reflects upon itself and leads to an understanding of the
transformation required.
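The two-step idea of data mapping followed by a transformation program
can be sketched as below. The field names and the MAPPING table are
hypothetical and chosen only for illustration; real projects typically
use dedicated ETL tooling rather than a hand-written dictionary.

```python
# Data mapping: source field -> (destination field, conversion function).
# All field names here are made up for the example.
MAPPING = {
    "cust_name": ("CustomerName", str.strip),
    "amt": ("Amount", float),
}

def transform(row, mapping):
    """Transformation program: apply the mapping to one source row."""
    return {dest: convert(row[src]) for src, (dest, convert) in mapping.items()}
```

Keeping the mapping as data, separate from the code that applies it,
makes it easy to revise the transformation in a later KDD iteration.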
5. Data Mining
Data mining is defined as the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns and decides the purpose of the model, using classification or
characterization.
6. Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting
patterns that represent knowledge, based on given interestingness
measures. It finds the interestingness score of each pattern and uses
summarization and visualization to make the data understandable to the
user.
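The notes do not name a specific interestingness measure. As one common
choice for association rules, the sketch below computes support,
confidence, and lift over a list of transactions (each a set of items).
The function names and the sample data in the usage note are
assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support.
    Values above 1 suggest a positive association."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))
```

For the four baskets {bread, milk}, {bread, butter}, {bread, milk,
butter}, and {milk}, the rule bread -> milk has support 0.5, confidence
2/3, and lift 8/9, so a pattern evaluator using a lift threshold of 1
would not rank it as interesting.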
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and
can be used to make decisions.
Note: KDD is an iterative process in which evaluation measures can be
enhanced, mining can be refined, and new data can be integrated and
transformed in order to get different and more appropriate results.
Preprocessing of databases consists of data cleaning and data
integration.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and
knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-
consuming tasks and makes the data ready for analysis, which saves
time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can
help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities
by identifying patterns and anomalies in the data that may indicate
fraud.
5. Predictive modeling: KDD can be used to build predictive models
that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves
collecting and analyzing large amounts of data, which can include
sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires
specialized skills and knowledge to implement and interpret the
results.
3. Data quality: The KDD process depends heavily on the quality of the
data; if the data is not accurate or consistent, the results can be
misleading.
4. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.
5. Overfitting: The KDD process can lead to overfitting, which is a
common problem in machine learning where a model learns the
detail and noise in the training data to the extent that it negatively
impacts the performance of the model on new unseen data.
Difference Between Descriptive and Predictive Data Mining:
| S.No. | Comparison | Descriptive Data Mining | Predictive Data Mining |
|---|---|---|---|
| 1. | Basic | It determines what happened in the past by analyzing stored data. | It determines what can happen in the future with the help of past data analysis. |
| 2. | Preciseness | It provides accurate data. | It produces results that do not ensure accuracy. |
| 3. | Practical analysis methods | Standard reporting, query/drill down, and ad-hoc reporting. | Predictive modelling, forecasting, simulation, and alerts. |
| 4. | Require | It requires data aggregation and data mining. | It requires statistics and forecasting methods. |
| 5. | Type of approach | Reactive approach | Proactive approach |
| 6. | Describe | Describes the characteristics of the data in a target data set. | Carries out induction over the current and past data so that predictions can be made. |
| 7. | Methods (in general) | What happened? Where exactly is the problem? What is the frequency of the problem? | What will happen next? What is the outcome if these trends continue? What actions are required to be taken? |
Descriptive mining:
Descriptive mining is used to produce summaries such as correlations,
cross-tabulations, and frequencies. These techniques determine
similarities in the data and find existing patterns. Another
application of descriptive analysis is to discover interesting
subgroups within the bulk of the available data. This kind of analytics
emphasizes the summarization and transformation of data into
meaningful information for reporting and monitoring.
Examples of descriptive data mining include clustering, association rule
mining, and anomaly detection. Clustering involves grouping similar objects
together, while association rule mining involves identifying relationships
between different items in a dataset. Anomaly detection involves identifying
unusual patterns or outliers in the data.
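Frequency counts and cross-tabulation, two of the descriptive summaries
mentioned above, can be sketched in a few lines. The record layout
(dictionaries with "region" and "plan" fields) is a made-up example.

```python
from collections import Counter

def frequency(records, field):
    """Count how often each value of field occurs."""
    return Counter(r[field] for r in records)

def cross_tab(records, row_field, col_field):
    """Count how often each (row value, column value) pair occurs."""
    return Counter((r[row_field], r[col_field]) for r in records)

# Hypothetical records, for illustration only.
records = [
    {"region": "North", "plan": "prepaid"},
    {"region": "North", "plan": "postpaid"},
    {"region": "South", "plan": "prepaid"},
    {"region": "North", "plan": "prepaid"},
]
```

Here frequency(records, "region") counts three records for North and
one for South, and cross_tab(records, "region", "plan") shows that two
of the North records are prepaid. Both summaries describe what is
already in the data; neither makes a prediction.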
Predictive Data Mining:
The main goal of predictive mining is to say something about future
results rather than current behaviour. It uses supervised learning
functions to predict a target value. The methods that fall under this
category include classification, time-series analysis, and regression.
Predictive analysis requires modelling the data, and it works by using
some variables whose values are known to predict unknown or future
values of other variables.
Examples of predictive data mining include regression analysis, decision
trees, and neural networks. Regression analysis involves predicting a
continuous outcome variable based on one or more predictor variables.
Decision trees involve building a tree-like model to make predictions based
on a set of rules. Neural networks involve building a model based on the
structure of the human brain to make predictions.
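As a small illustration of regression analysis, the sketch below fits a
straight line to past observations by ordinary least squares and uses
it to predict a value that has not been seen. It assumes a single
predictor variable and at least two distinct x values; the function
names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept
```

Fitting the points (1, 3), (2, 5), (3, 7), (4, 9) gives a slope of 2 and
an intercept of 1, so the model predicts 11 for x = 5.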
Application of Data Mining in Healthcare:
Data mining has been used intensively and widely by numerous industries, and it is
becoming increasingly popular in healthcare. Data mining applications can greatly
benefit all parties involved in the healthcare industry. For example, data mining can
help the healthcare industry with fraud and abuse detection, customer relationship
management, effective patient care and best practices, and affordable healthcare
services. The large amounts of data generated by healthcare transactions are too
complex and voluminous to be processed and analyzed by conventional methods.
Data mining provides the framework and techniques to transform these data into
useful information for data-driven decision-making.
Treatment effectiveness:
Data mining applications can be used to assess the effectiveness of medical
treatments. By comparing and contrasting causes, symptoms, and courses of
treatment, data mining can show which course of action proves effective.
Healthcare management:
Data mining applications can be used to identify and track chronic illness states and
intensive care unit patients, decrease the number of hospital admissions, and support
healthcare management. Data mining can also be used to analyze massive data sets
and statistics to search for patterns that may indicate an attack by bio-terrorists.
Customer relationship management:
Customer and management interactions are crucial for any organization to achieve
its business goals. Customer relationship management is the primary approach to
managing interactions between commercial organizations, typically in the retail and
banking sectors, and their customers. It is equally important in the healthcare
context, where customer interactions may happen through call centers, billing
departments, and ambulatory care settings.
Fraud and abuse:
Data mining applications for fraud and abuse can focus on inappropriate or wrong
prescriptions and on fraudulent insurance and medical claims.
Role of Data Mining in the Telecommunication Industry:
The telecommunication industry is expanding rapidly. This creates a strong
demand for data mining to help understand the business involved, identify
telecommunication patterns, catch fraudulent activities, make better use of
resources, and improve the quality of service. The following are a few ways
in which data mining can improve telecommunication services:
Multidimensional analysis of telecommunication data −
Telecommunication data are intrinsically multidimensional, with dimensions
including calling-time, duration, location of the caller, location of the callee,
and type of call. The multidimensional analysis of such data can be used to
recognize and compare the data traffic, system workload, resource
management, customer group behavior, and profit. For instance, analysts in
the industry may wish to regularly view charts and graphs of calling
source, destination, volume, and time-of-day usage patterns.
Fraudulent pattern analysis and the identification of unusual
patterns − Fraudulent activity costs the telecommunication industry millions
of dollars per year. It is important to identify potentially fraudulent users
and their atypical usage patterns, and to detect attempts to gain fraudulent
entry into customer accounts.
It is also important to discover unusual patterns that may need special
attention, such as busy-hour frustrated call attempts, switch and route
congestion patterns, and periodic calls from automatic dial-out equipment
(like fax machines) that has been improperly programmed. Many of these
patterns can be found by multidimensional analysis, cluster analysis, and
outlier analysis.
Multidimensional association and sequential pattern analysis − The
discovery of association and sequential patterns in multidimensional analysis
can be used to promote telecommunication services.
Mobile telecommunication services − Mobile telecommunication, Web and
data services, and mobile computing are becoming increasingly integrated and
common in our work and life. A distinctive feature of mobile
telecommunication data is its association with spatiotemporal information,
so spatiotemporal data mining can become important for finding specific
patterns.
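The outlier analysis mentioned above for spotting atypical usage can be
sketched with a simple z-score rule: flag any value that lies more than
a chosen number of standard deviations from the mean. The threshold of
2.0 and the sample data in the usage note are assumptions for
illustration; production fraud detection uses considerably richer
models.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]
```

For daily call minutes of [10, 12, 11, 9, 10, 11, 95], only the last
day is flagged. A single extreme value also inflates the standard
deviation, so on short series a robust measure (for example one based
on the median) may be a better choice.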
The advantages of data mining software include the following:
• In marketing campaigns, mining techniques are used to understand
customers' needs and habits, so that businesses can target customers
with the brands and products they are likely to choose.
• It is an efficient, cost-effective solution compared to other data
science applications.
• It helps businesses make profitable production and operational
adjustments.
• Since data mining provides financial institutions with information on
loans and credit reports, a model built from historical customer
data can help distinguish good credit from bad.
• It also helps banks detect fraudulent credit card transactions, which
protects credit card owners.
• Data mining systems draw on all the information the system collects,
which makes it possible to predict future trends.
• It helps data scientists analyze enormous amounts of data quickly.
• Data scientists can use the information to detect fraud, build risk
models, and improve product safety.
• It helps data scientists quickly initiate automated predictions of
behaviors and trends and discover hidden patterns.
Deductive databases
• A deductive database system is a database system that can make
deductions (i.e., conclude additional facts) based on rules and
facts stored in the (deductive) database.
• Datalog is the language typically used to specify facts, rules,
and queries in deductive databases.
• Deductive databases have grown out of the desire to combine
logic programming with relational databases to construct
systems that support a powerful formalism and are still fast and
able to deal with very large datasets.
• Deductive databases are more expressive than relational
databases but less expressive than logic programming systems.
• A database system that includes capabilities to define (deductive)
rules, which can deduce or infer additional information from the
facts that are stored in the database, is called a deductive
database.
• Rules are specified through a declarative language: we specify
what to achieve rather than how to achieve it.
• The model used for deductive databases is related to logic
programming and the Prolog language.
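To make the idea of deducing additional facts concrete, the sketch
below evaluates the classic ancestor rules by naive forward chaining:
the rules are applied repeatedly until no new fact appears. It is
written in Python rather than Datalog, and the parent facts in the
usage note are made up; real deductive databases use far more efficient
evaluation strategies.

```python
def derive_ancestors(parent_facts):
    """Evaluate these Datalog-style rules to a fixpoint:
        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    parent_facts is a set of (parent, child) pairs."""
    ancestors = set(parent_facts)              # first rule
    while True:
        new_facts = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in ancestors
            if y == y2                         # join on the shared variable Y
        } - ancestors
        if not new_facts:                      # fixpoint: nothing new derived
            return ancestors
        ancestors |= new_facts
```

Given the stored facts parent(ann, bob), parent(bob, cat), and
parent(cat, dan), the function also derives ancestor(ann, cat),
ancestor(bob, dan), and ancestor(ann, dan), none of which is stored
explicitly.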
Mobile database
A mobile database is a database that can be connected to a mobile computing
device over a mobile (or wireless) network. Here the client and the server
communicate over wireless connections. Mobile computing is growing very
rapidly, and it has huge potential in the field of databases. Mobile
databases run on a variety of devices, such as Android-based and iOS-based
devices. Common examples of mobile databases are Couchbase Lite and
ObjectBox.
Features of Mobile database :
Here, we will discuss the features of the mobile database as follows.
• A cache is maintained to hold frequently used data and transactions
so that they are not lost due to connection failure.
• The growing use of laptops, mobile phones, and PDAs increases the
need for data to reside on the mobile system.
• Mobile databases are physically separate from the central database
server.
• Mobile databases reside on mobile devices.
• Mobile databases are capable of communicating with a central
database server or other mobile clients from remote sites.
• With the help of a mobile database, mobile users are able to keep
working when the wireless connection is poor or even non-existent
(disconnected).
• A mobile database is used to analyze and manipulate data on
mobile devices.
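The caching and disconnected-operation features above can be sketched
as a local store that queues writes while offline and replays them when
a connection returns. The MobileStore class is hypothetical, a plain
dictionary stands in for the central database server, and conflicts are
resolved by simple last-write-wins, which real products such as those
named above handle in more sophisticated ways.

```python
class MobileStore:
    """Toy on-device store with an offline write queue."""

    def __init__(self):
        self.local = {}     # data cached on the device
        self.pending = []   # writes made while disconnected

    def put(self, key, value, server=None):
        """Write locally; forward to the server if connected, else queue."""
        self.local[key] = value
        if server is None:              # disconnected
            self.pending.append((key, value))
        else:
            server[key] = value

    def sync(self, server):
        """Replay queued writes on the server and return how many were applied."""
        for key, value in self.pending:
            server[key] = value         # last write wins
        applied = len(self.pending)
        self.pending.clear()
        return applied
```

A user can keep calling put without a connection; once the device is
back online, a single sync call brings the central copy up to date.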
A mobile database typically involves three parties:
1. Fixed Hosts –
These perform the transaction and data management functions with
the help of database servers.
2. Mobile Units –
These are portable computers that move around a geographical
region covered by the cellular network that these units use to
communicate with base stations.
3. Base Stations –
These are two-way radio installations in fixed locations that pass
communication between the mobile units and the fixed hosts.
Limitations:
Here, we will discuss the limitations of mobile databases as follows.
• Wireless bandwidth is limited.
• Wireless communication speed is limited.
• Data access consumes battery power, which is limited on mobile
devices.
• It is less secure.
• It is hard to make theft-proof.
Multimedia database:
A multimedia database is a collection of interrelated multimedia data
that includes text, graphics (sketches, drawings), images, animations,
video, audio, etc., and holds vast amounts of multisource multimedia
data. The framework that manages different types of multimedia data,
which can be stored, delivered, and utilized in different ways, is known
as a multimedia database management system. There are three classes of
multimedia database: static media, dynamic media, and dimensional media.
Content of Multimedia Database management system :
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution,
encoding scheme etc. about the format of the media data after it
goes through the acquisition, processing and encoding phase.
3. Media keyword data – Keyword descriptions relating to the
generation of the data. It is also known as content descriptive data.
Example: date, time, and place of recording.
4. Media feature data – Content dependent data such as the
distribution of colors, kinds of texture and different shapes present
in data.
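The four kinds of content listed above can be pictured as the fields of
a single record. The sketch below is only an illustration: the
MediaRecord class, its field names, and the color_distribution helper
are hypothetical, and pixels are represented as plain color names
rather than real image data.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MediaRecord:
    media_data: bytes                                    # the actual media object
    media_format: dict = field(default_factory=dict)     # e.g. resolution, encoding
    media_keywords: dict = field(default_factory=dict)   # e.g. date, place of recording
    media_features: dict = field(default_factory=dict)   # content-dependent features

def color_distribution(pixels):
    """Media feature data: the fraction of pixels of each color."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}
```

For the pixels ["red", "red", "blue", "green"], color_distribution
returns 0.5 for red and 0.25 each for blue and green; such a result
would be stored under media_features, while the recording date would go
under media_keywords.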
Types of multimedia applications based on data management
characteristics are:
1. Repository applications – A large amount of multimedia data, as
well as metadata (media format data, media keyword data, media
feature data), is stored for retrieval purposes, e.g., repositories
of satellite images, engineering drawings, and radiology scanned
pictures.
2. Presentation applications – They involve delivery of multimedia
data subject to temporal constraints. Optimal viewing or listening
requires the DBMS to deliver data at a certain rate, offering a
quality of service above a certain threshold. Here data is
processed as it is delivered. Example: annotation of video and
audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves
executing a complex task by merging drawings and change
notifications. Example: intelligent healthcare network.
There are still many challenges to multimedia databases, some of which
are:
1. Modelling – Work in this area can improve both database and
information retrieval techniques; documents constitute a
specialized area and deserve special consideration.
2. Design – The conceptual, logical, and physical design of
multimedia databases has not yet been addressed fully, as
performance and tuning issues at each level are far more complex:
the data come in a variety of formats, such as JPEG, GIF, PNG, and
MPEG, which are not easy to convert from one form to another.
3. Storage – Storing a multimedia database on any standard disk
presents problems of representation, compression, mapping to
device hierarchies, archiving, and buffering during input-output
operations. In a DBMS, a "BLOB" (Binary Large Object) facility
allows untyped bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or
audio-video synchronization, physical limitations dominate. The
use of parallel processing may alleviate some problems, but such
techniques are not yet fully developed. Apart from this, multimedia
databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data such as images, video,
and audio, accessing data through queries opens up many issues,
such as efficient query formulation, query execution, and
optimization, which still need to be worked on.