DWDM QB

ANSWER ALL THE QUESTIONS

1. Define a data warehouse.


A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management's decision-making process.
2. Illustrate the role of sourcing, acquisition, cleanup, and transformation tools.
These tools are essential components of the ETL (Extract, Transform, Load) process:

Sourcing: Collects data from diverse operational sources (databases, flat files, APIs).

Acquisition: Moves data into staging areas or directly into the warehouse.

Cleanup: Detects and corrects errors, handles missing or inconsistent data.

Transformation: Converts data into a suitable format (normalization, aggregation) before loading it into the warehouse, ensuring quality, consistency, and integrity.
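A minimal ETL-style sketch of these roles in Python with pandas; the file name, column names, and SQLite target are assumptions made for illustration, not part of the original question.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a hypothetical operational source (file name and
# column names are assumptions for this sketch).
raw = pd.read_csv("sales_export.csv")        # assumed columns: order_id, order_date, amount, region

# Cleanup: remove duplicates, drop rows with missing amounts, normalize region codes.
clean = raw.drop_duplicates(subset="order_id").dropna(subset=["amount"])
clean["region"] = clean["region"].str.strip().str.upper()

# Transform: convert types and aggregate to the granularity of a daily sales fact table.
clean["sale_date"] = pd.to_datetime(clean["order_date"]).dt.date
daily_sales = (clean.groupby(["sale_date", "region"], as_index=False)["amount"]
                    .sum()
                    .rename(columns={"amount": "total_amount"}))

# Load: write the transformed data into the warehouse (SQLite used here as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    daily_sales.to_sql("fact_daily_sales", conn, if_exists="append", index=False)
```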

3. Define the approaches taken for the development of a data warehouse.


There are three main approaches:

● Top-Down Approach (Inmon): Build a central enterprise warehouse first, then create
data marts for departments. Ensures integration and consistency.
● Bottom-Up Approach (Kimball): Develop data marts for individual departments first,
then integrate them into a warehouse. Faster to implement and more flexible.
● Hybrid Approach: Combines both methods to leverage their strengths—fast
development and integrated architecture.

4. Write nine decision steps in the design of a data warehouse.


● Choose subject areas.
● Define the granularity of data.
● Choose dimensions and hierarchies.
● Select fact tables.
● Define attributes of dimensions.
● Establish measures (facts).
● Choose the duration of the database.
● Plan the frequency of data refresh.
● Determine summarization and aggregation levels.

5. Write the logical steps needed to build a data warehouse.


● Requirement analysis and business objectives identification.
● Select data sources.
● Design data models (Star, Snowflake).
● Extract, Transform, and Load (ETL) process setup.
● Data warehouse schema implementation.
● Data cleansing and validation.
● Loading data into the warehouse.
● Implement OLAP tools and BI reporting.
● Test and deploy.
6. Explain architecture of ORACLE as a parallel DBMS vendor.
Oracle's parallel DBMS architecture includes:

● Shared-disk architecture: All processors access the same storage.


● Parallel execution servers: Divide tasks like scans, joins, and sorts across multiple CPUs.
● Oracle Real Application Clusters (RAC): Multiple instances run on different servers but
access a single database.
● Partitioning and indexing: Speeds up data access by breaking large tables into smaller
pieces processed in parallel.
● Load balancing and fault tolerance are supported through Oracle Grid and RAC features.

7. Compare OLAP Vs. OLTP.

Feature | OLAP | OLTP
Purpose | Analysis and decision making | Day-to-day transactions
Data | Historical, summarized | Current, detailed
Operations | Complex queries, read-intensive | Simple transactions, read/write
Speed | Optimized for query performance | Optimized for fast inserts/updates
Schema | Star/Snowflake schema | ER model
Examples | Sales trends, forecasting | Banking, order processing


8. Define data mining.
Data mining is the process of discovering patterns, correlations, trends, or useful information
from large datasets using techniques from statistics, machine learning, and database systems. It
helps in predictive analysis and decision-making by transforming raw data into meaningful
insights.
9. Summarize the steps in the knowledge discovery process.
● Data selection: Identify relevant data from various sources.
● Data preprocessing: Clean and prepare the data (handling missing values, normalization).
● Data transformation: Convert data into suitable formats for mining.
● Data mining: Apply algorithms to extract patterns.
● Pattern evaluation: Identify truly interesting patterns.
● Knowledge presentation: Use visualization and reporting tools to present the findings.
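A minimal sketch of these knowledge discovery steps using scikit-learn; the Iris data set, the decision-tree model, and the parameter choices are illustrative assumptions, not prescribed by the process itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# 1-2. Data selection and preprocessing: pick a data set and split it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Data transformation: scale the attributes into a suitable format for mining.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4. Data mining: apply an algorithm (here, a shallow decision tree) to extract patterns.
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# 5. Pattern evaluation: check how well the discovered patterns generalize.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Knowledge presentation: a text rendering of the learned rules.
print(export_text(model, feature_names=load_iris().feature_names))
```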
10. Write the machine learning techniques.
● Supervised Learning (e.g., classification, regression)
● Unsupervised Learning (e.g., clustering, association)
● Semi-supervised Learning
● Reinforcement Learning
● Deep Learning (e.g., neural networks)
● Ensemble Methods (e.g., bagging, boosting)
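A brief hedged illustration of the contrast between the first two techniques, using synthetic data generated with scikit-learn; the data, models, and parameters are made up for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised learning: the labels y are known and are used to fit a classifier.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class of a new point:", clf.predict([[0.0, 0.0]]))

# Unsupervised learning: labels are ignored; structure is discovered as clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignment of the same point:", km.predict([[0.0, 0.0]]))
```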
ANSWER ANY THREE QUESTIONS
11. Sketch the data warehouse architecture and explain its components.
A Data Warehouse is a system that combines data from multiple sources, organizes it under a
single architecture, and helps organizations make better decisions. It simplifies data handling,
storage, and reporting, making analysis more efficient. Data Warehouse Architecture uses a
structured framework to manage and store data effectively.

There are two common approaches to constructing a data warehouse:

Top-Down Approach: This method starts with designing the overall data warehouse
architecture first and then creating individual data marts.
Bottom-Up Approach: In this method, data marts are built first to meet specific business needs,
and later integrated into a central data warehouse.
Before diving deep into these approaches, we will first discuss the components of data
warehouse architecture.

Components of Data Warehouse Architecture


A data warehouse architecture consists of several key components that work together to store,
manage, and analyze data.

External Sources: External sources are where data originates. These sources provide a variety
of data types, such as structured data (databases, spreadsheets), semi-structured data (XML,
JSON), and unstructured data (emails, images).
Staging Area: The staging area is a temporary space where raw data from external sources is
validated and prepared before entering the data warehouse. This process ensures that the data
is consistent and usable. To handle this preparation effectively, ETL (Extract, Transform,
Load) tools are used.
Extract (E): Pulls raw data from external sources.
Transform (T): Converts raw data into a standard, uniform format.
Load (L): Loads the transformed data into the data warehouse for further processing.
Data Warehouse: The data warehouse acts as the central repository for storing cleansed and
organized data. It contains metadata and raw data. The data warehouse serves as the
foundation for advanced analysis, reporting, and decision-making.
Data Marts: A data mart is a subset of a data warehouse that stores data for a specific team or
purpose, like sales or marketing. It helps users quickly access the information they need for
their work.
Data Mining: Data mining is the process of analyzing large datasets stored in the data
warehouse to uncover meaningful patterns, trends, and insights. The insights gained can
support decision-making, identify hidden opportunities, and improve operational efficiency.
Top-Down Approach
The Top-Down Approach, introduced by Bill Inmon, is a method for designing data
warehouses that starts by building a centralized, company-wide data warehouse. This central
repository acts as the single source of truth for managing and analyzing data across the
organization. It ensures data consistency and provides a strong foundation for decision-
making.
Working of Top-Down Approach
Central Data Warehouse: The process begins with creating a comprehensive data warehouse
where data from various sources is collected, integrated, and stored. This involves the ETL
(Extract, Transform, Load) process to clean and transform the data.
Specialized Data Marts: Once the central warehouse is established, smaller, department-
specific data marts (e.g., for finance or marketing) are built. These data marts pull information
from the main data warehouse, ensuring consistency across departments.
Fig.: Top-Down Approach

Advantages of Top-Down Approach


1. Consistent Dimensional View: Data marts are created directly from the central data
warehouse, ensuring a consistent dimensional view across all departments. This minimizes
discrepancies and aligns data reporting with a unified structure.

2. Improved Data Consistency: By sourcing all data marts from a single data warehouse, the
approach promotes standardization. This reduces the risk of errors and inconsistencies in
reporting, leading to more reliable business insights.

3. Easier Maintenance: Centralizing data management simplifies maintenance. Updates or changes made in the data warehouse automatically propagate to all connected data marts, reducing the effort and time required for upkeep.

4. Better Scalability: The approach is highly scalable, allowing organizations to add new data
marts seamlessly as their needs grow or evolve. This is particularly beneficial for businesses
experiencing rapid expansion or shifting demands.

5. Enhanced Governance: Centralized control of data ensures better governance. Organizations can manage data access, security, and quality from a single point, ensuring compliance with standards and regulations.

6. Reduced Data Duplication: Storing data only once in the central warehouse minimizes
duplication, saving storage space and reducing inconsistencies caused by redundant data.
7. Improved Reporting: A consistent view of data across all data marts enables more accurate
and timely reporting. This enhances decision-making and helps drive better business
outcomes.

8. Better Data Integration: With all data marts being sourced from a single warehouse,
integrating data from multiple sources becomes easier. This provides a more comprehensive
view of organizational data and improves overall analytics capabilities.

Disadvantages of Top-Down Approach


1. High Cost and Time-Consuming: The Top-Down Approach requires significant investment
in terms of cost, time, and resources. Designing, implementing, and maintaining a central data
warehouse and its associated data marts can be a lengthy and expensive process, making it
challenging for smaller organizations.

2. Complexity: Implementing and managing the Top-Down Approach can be complex, especially for large organizations with diverse and intricate data needs. The design and integration of a centralized system demand a high level of expertise and careful planning.

3. Lack of Flexibility: Since the data warehouse and data marts are designed in advance,
adapting to new or changing business requirements can be difficult. This lack of flexibility
may not suit organizations that require dynamic and agile data reporting capabilities.

4. Limited User Involvement: The Top-Down Approach is often led by IT departments, which
can result in limited involvement from business users. This may lead to data marts that fail to
address the specific needs of end-users, reducing their overall effectiveness.

5. Data Latency: When data is sourced from multiple systems, the Top-Down Approach may
introduce delays in data processing and availability. This latency can affect the timeliness and
accuracy of reporting and analysis.

6. Data Ownership Challenges: Centralizing data in the data warehouse can create ambiguity
around data ownership and responsibilities. It may be unclear who is accountable for
maintaining and updating the data, leading to potential governance issues.

7. Integration Challenges: Integrating data from diverse sources with different formats or
structures can be difficult in the Top-Down Approach. These challenges may result in
inconsistencies and inaccuracies in the data warehouse.

8. Not Ideal for Smaller Organizations: Due to its high cost and resource requirements, the
Top-Down Approach is less suitable for smaller organizations or those with limited budgets
and simpler data needs.

Bottom-Up Approach
The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and
incremental path to designing data warehouses. Instead of starting with a central data
warehouse, it begins by building small, department-specific data marts that cater to the
immediate needs of individual teams, such as sales or finance. These data marts are later
integrated to form a larger, unified data warehouse.

Working of Bottom-Up Approach


Department-Specific Data Marts: The process starts with creating data marts for individual
departments or specific business functions. These data marts are designed to meet immediate
data analysis and reporting needs, allowing departments to gain quick insights.
Integration into a Data Warehouse: Over time, these data marts are connected and consolidated
to create a unified data warehouse. The integration ensures consistency and provides a
comprehensive view of the organization’s data.
Fig.: Bottom-Up Approach

Advantages of Bottom-Up Approach


1. Faster Report Generation: Since data marts are created first, reports can be generated
quickly, providing immediate value to the organization. This enables faster insights and
decision-making.

2. Incremental Development: This approach supports incremental development by allowing the creation of data marts one at a time. Organizations can achieve quick wins and gradually improve data reporting and analysis over time.

3. User Involvement: The Bottom-Up Approach encourages active involvement from business
users during the design and implementation process. Users can provide feedback on data marts
and reports, ensuring the solution meets their specific needs.

4. Flexibility: This approach is highly flexible, as data marts are designed based on the unique
requirements of specific business functions. It is particularly beneficial for organizations that
require dynamic and customizable reporting and analysis.

5. Faster Time to Value: With quicker implementation compared to the Top-Down Approach,
the Bottom-Up Approach delivers faster time to value. This is especially useful for smaller
organizations with limited resources or businesses looking for immediate results.

6. Reduced Risk: By creating and refining individual data marts before integrating them into a
larger data warehouse, this approach reduces the risk of failure. It also helps identify and
resolve data quality issues early in the process.
7. Scalability: The Bottom-Up Approach is scalable, allowing organizations to add new data
marts as needed. This makes it an ideal choice for businesses experiencing growth or
undergoing significant change.

8. Clarified Data Ownership: Each data mart is typically owned and managed by a specific
business unit, which helps clarify data ownership and accountability. This ensures data
accuracy, consistency, and proper usage across the organization.

9. Lower Cost and Time Investment: Compared to the Top-Down Approach, the Bottom-Up
Approach requires less upfront cost and time to design and implement. This makes it an
attractive option for organizations with budgetary or time constraints.

Disadvantage of Bottom-Up Approach


1. Inconsistent Dimensional View: Unlike the Top-Down Approach, the Bottom-Up Approach
may not provide a consistent dimensional view of data marts. This inconsistency can lead to
variations in reporting and analysis across departments.

2. Data Silos: This approach can result in the creation of data silos, where different business
units develop their own data marts independently. This lack of coordination may cause
redundancies, data inconsistencies, and difficulties in integrating data across the organization.

3. Integration Challenges: Integrating multiple data marts into a unified data warehouse can be
challenging. Differences in data structures, formats, and granularity may lead to issues with
data quality, accuracy, and consistency.

4. Duplication of Effort: In a Bottom-Up Approach, different business units may inadvertently duplicate efforts by creating data marts with overlapping or similar data. This can result in inefficiencies and increased costs in data management.

5. Lack of Enterprise-Wide View: Since data marts are typically designed to meet the needs of
specific departments, this approach may not provide a comprehensive, enterprise-wide view of
data. This limitation can hinder strategic decision-making and limit an organization’s ability to
analyze data holistically.

6. Complexity in Management: Managing and maintaining multiple data marts with varying
complexities and granularities can be more challenging compared to a centralized data
warehouse. This can lead to higher maintenance efforts and potential difficulties in ensuring
long-term scalability.

7. Risk of Inconsistency: The decentralized nature of the Bottom-Up Approach increases the
risk of data inconsistency. Differences in data structures and definitions across data marts can
make it difficult to compare or combine data, reducing the reliability of reports and analyses.

8. Limited Standardization: Without a central repository to enforce standardization, the Bottom-Up Approach may lack uniformity in data formats and definitions. This can complicate collaboration and integration across departments.
12. Types of OLAP Servers

We have four types of OLAP servers −

Relational OLAP (ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

Specialized SQL Servers

Relational OLAP

ROLAP servers are placed between relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses relational or extended-relational DBMS.

ROLAP includes the following −

Implementation of aggregation navigation logic.

Optimization for each DBMS back end.

Additional tools and services.

Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information, while the aggregations are stored separately in a MOLAP store.
Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations

Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.

Here is the list of OLAP operations −

Roll-up

Drill-down

Slice and dice

Pivot (rotate)

Roll-up

Roll-up performs aggregation on a data cube in any of the following ways −

By climbing up a concept hierarchy for a dimension

By dimension reduction

The following diagram illustrates how roll-up works.

Roll-up is performed by climbing up a concept hierarchy for the dimension location.

Initially the concept hierarchy was "street < city < province < country".

On rolling up, the data is aggregated by ascending the location hierarchy from the level of city
to the level of country.

After roll-up, the data is grouped by countries rather than cities.

When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −

By stepping down a concept hierarchy for a dimension


By introducing a new dimension.

The following diagram illustrates how drill-down works −

Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter &lt year."
On drilling down, the time dimension is descended from the level of quarter to the level of
month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
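A minimal sketch of roll-up and drill-down on a flattened fact table using pandas; the sales values and hierarchy levels below are made-up illustrations of the location and time hierarchies described above.

```python
import pandas as pd

# Illustrative fact data with location (city < country) and time (month < quarter) hierarchies.
sales = pd.DataFrame({
    "street":  ["Main St", "Main St", "King St", "Bay St"],
    "city":    ["Toronto", "Toronto", "Toronto", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "month":   ["2024-01", "2024-02", "2024-01", "2024-01"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "amount":  [100, 150, 200, 250],
})

# Roll-up: climb the location hierarchy from city to country (fewer, more aggregated rows).
by_country = sales.groupby(["country", "quarter"], as_index=False)["amount"].sum()

# Drill-down: step down the time hierarchy from quarter to month (more detailed rows).
by_month = sales.groupby(["city", "month"], as_index=False)["amount"].sum()

print(by_country)
print(by_month)
```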

Slice

The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.

Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.

(location = "Toronto" or "Vancouver")


(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
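A minimal pandas sketch of slice, dice, and pivot on the same kind of cube data; the amounts and dimension values are illustrative assumptions based on the criteria listed above.

```python
import pandas as pd

# Illustrative cube data flattened into a fact table.
cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver", "Chicago"],
    "time":     ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "item":     ["Mobile", "Modem", "Mobile", "Mobile", "Modem"],
    "amount":   [605, 825, 14, 400, 440],
})

# Slice: fix one dimension with a single criterion, time = "Q1".
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions to obtain a smaller sub-cube.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# Pivot (rotate): swap the presentation axes, e.g. location rows versus item columns.
pivoted = dice.pivot_table(index="location", columns="item", values="amount", aggfunc="sum")
print(slice_q1)
print(pivoted)
```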
OLAP vs OLTP

Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.

OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube: roll-up, drill-down, slice, dice, and pivot (rotate), as described above.
13. Describe data mining techniques, their issues, and applications.

Data Mining Techniques


Some of the techniques used in the development of data mining methods are:
2.3.1 Statistics
The collection, analysis, interpretation and presentation of data can be studied by
applying statistics.
By applying a statistical model, the behaviour of objects in a target class is described.
Statistical methods are used to summarize and describe collections of data.
These are also used to verify data mining results.
A statistical hypothesis test uses experimental data to make statistical decisions.
2.3.2 Machine Learning
These techniques enable computers to learn complex patterns and make intelligent decisions on data automatically.
For example: automatic recognition of postal codes on mail from a set of examples.
It can be done in the following different ways:
1) Supervised learning :
It is a classification technique.
Based on the labeled examples in a training data set, it facilitates the learning of a classification model.
2) Unsupervised learning :
It is a clustering technique.
Classes (clusters) are discovered within the data.
As the data is not labeled, the meaning of the clusters may not be known.
For example: from a set of handwritten digit images, clusters can be formed for the digits 0 to 9.
3) Semi supervised learning :
It uses both labeled and unlabeled data in learning process.
There are different approaches. In one, unlabeled examples are considered as
boundary elements. In another, labeled and unlabeled examples are considered as
positive and negative examples.

4) Active learning :
In this approach, the user is asked to label selected examples.
Based on this knowledge from human users, the model quality is optimized.
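A minimal sketch of semi-supervised learning with scikit-learn's self-training wrapper; the digits data set, the 90% unlabeled split, and the SVC parameters are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)

# Pretend most labels are unknown: unlabeled examples are marked with -1.
rng = np.random.RandomState(0)
y_partial = y.copy()
unlabeled = rng.rand(len(y)) < 0.9
y_partial[unlabeled] = -1

# Self-training uses the labeled examples first, then pseudo-labels confident unlabeled ones.
base = SVC(probability=True, gamma=0.001)
model = SelfTrainingClassifier(base).fit(X, y_partial)

# Evaluate on the examples whose true labels were hidden during training.
print("Accuracy on unlabeled portion:", model.score(X[unlabeled], y[unlabeled]))
```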
2.3.3 Database Systems and Data Warehouses
The aim is to create, maintain, and use databases for organizations and end users.
Various techniques like query languages, query processing and optimization
methods, data storage and indexing and accessing methods are used to facilitate
this.
A data warehouse integrates data from various sources and timeframes.
2.3.4 Information Retrieval
Documents or information from the documents are searched by this technique.
The system assumes that -
1) The data to be searched is unstructured.
2) The queries are formed by keywords.
A language model is a probability function used to generate the bag of words in a document.
The topic in text documents can also be modeled as probability distribution over
vocabulary known as topic model.

Fig. 2.3.1: Semi-supervised learning, showing positive, negative, and unlabeled examples, noise/outliers, and the decision boundary with and without the unlabeled examples.

2.4 Major Issues in Data Mining


Major issues in data mining can be categorized in five groups.
1. Mining methodology
2. User interaction
3. Efficiency and scalability
4. Diversity of data types
5. Data mining and society
2.4.1 Mining Methodology
Mining methodologies are required for investigation of new kinds of knowledge,
mining in multidimensional space, etc.
Some of the issues faced by these methodologies are data uncertainty, noise and
incompleteness.
Various aspects of mining methodologies are :
1. Mining various and new kinds of knowledge :
The issues are faced due to diversity of applications.
2. Mining knowledge in multidimensional space :
If data sets are large, interesting patterns can be found in combinations of dimensions at different abstraction levels.
Multidimensional data mining techniques are used to mine such data.
3. Data mining - an interdisciplinary effort :
For interdisciplinary fields like natural language text mining, mining methods need to be combined with methods of information retrieval and natural language processing.
4. Boosting the power of discovery in a networked environment :
The objects may reside in an interconnected environment like the web. Semantic links between multiple data objects are advantageous in mining.
5. Handling uncertainty, noise or incompleteness of data :
The mining process faces issues due to noise, errors, exceptions, etc.
6. Pattern evaluation and pattern guided mining :
The techniques are required to know the interesting patterns among all the
patterns.

2.4.2 User Interaction


User is important factor of data mining process.
The areas of research include -
1. Interactive mining.
2. Incorporation of background knowledge.
3. Adhoc mining and data mining query languages.
4. Presentation and visualization of data mining results.
1. Interactive mining :
Flexible user interfaces are necessary.
The interface should provide facility to the user to first sample set of data, explore
its characteristics and estimate mining results.
2. Incorporation of background knowledge :
Incorporating background knowledge into the discovery process is necessary for pattern evaluation.
3. Adhoc data mining and data mining query languages :
Query languages like SQL allow users to pose ad hoc queries.
This facilitates specifying domain knowledge, the knowledge to be mined, conditions and constraints, etc.
4. Presentation and visualization of data mining results :
The presentation of data mining results is very important for easy understanding
of the results.
2.4.3 Efficiency and Scalability
The factors need to be considered while comparing mining algorithms are :
1. Efficiency and scalability of data mining algorithms :
The running time of a mining algorithm must be small and predictable.
The criteria to be considered for mining algorithms are: efficiency, scalability, performance, optimization, and the ability to execute in real time.
2. Parallel, distributed and incremental mining algorithms :
Due to the humongous size of data sets, there is a need for such algorithms.
Cloud computing and cluster computing can facilitate parallel data mining.
2.4.4 Diversity of Database Types
Database types include :
1. Handling complex types of data :
The data objects can vary from simple to temporal, biological sequences, sensor
data, social network data, etc.

To handle such a varied data effective and efficient data mining tools are
needed.
2. Mining dynamic, networked and global data repositories :
The knowledge discovery from structured, semi-structured or unstructured
interconnected data is very challenging.
Web mining, multisource data mining techniques are needed to handle this data.
2.4.5 Data Mining and Society
There are many issues which need to be addressed while using data mining in day-to-day life.
Some of these are :
1. Social impact of data mining
2. Privacy preserving data mining
3. Invisible data mining
2.5 Applications of Data Mining
Data mining has many applications in various domains.
In this section we are going to discuss two successful applications of data mining
1. Business intelligence
2. Search engines
2.5.1 Business Intelligence
The term Business Intelligence (BI) refers to technologies, applications and
practices for the collection, integration, analysis and presentation of business
information.
The purpose of BI is to support better business decision making.
Data mining is required in BI to perform effective market analysis, compare
customer feedback on similar products to discover strengths and weakness of their
competitors to retain highly valuable customers and to make smart business
decisions.
Data warehousing and multidimensional data mining are used in online analytical processing (OLAP) tools.
Classification and prediction techniques are used in predictive analytics.
Clustering is used in customer relationship management.

Web Search Engines


A web search engine or internet search engine is a software system that is
designed to carry out web search, which means to search the world wide web in a
systematic way for particular information specified in a textual web search query.
User query results are returned as a list of hits.
The hits consist of web pages, images and other types of files.
Different data mining techniques are used extensively in web search engines.
Crawling, indexing and searching are some of them.
Challenges faced when applying data mining here are:
Handling the humongous amount of data generated daily.
Using computer clouds, consisting of thousands or hundreds of thousands of computers, to run data mining methods on large distributed data sets.
Dealing with online data: a query classifier needs to be built to handle queries on predefined categories.
Handling context-aware queries: in context-aware search, the engine tries to find the context of a query using the user's profile to give customized answers in a very small amount of time.
Most queries are asked only once, which is challenging for data mining methods.

14. Explain all the methods of statistical description of data.


Statistical Description of Data
For data preprocessing to be successful it is essential to have an overall picture of
the data.
Statistical descriptions are used to
● identify properties of the data
● highlight important data values
● Identify noise or outliers
For practical utilization various techniques are used to show statistical data in the
graphical / visual presentation.
Measuring the Central Tendency of data
● The measures of central tendency provide us with statistical information about a set
of data.
● The four primary measurements that we use are the mean, median, mode, and midrange.
● Each one of these measurements can provide us with information about our data set.
● This information can then be used to describe how the values in the set are related.
● Let's consider a set of items as the data for understanding the central tendency of data.
The table shows the quantity and cost of each item.

Mean
The first measure is the mean, which is the average.
To calculate the mean, add together all of the numbers in your data set.
Then divide that sum by the number of addends.
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X. The mean of this data is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Arithmetic Mean or the Weighted Average
Sometimes each value $x_i$ in a set may be associated with a weight $w_i$ for $i = 1, \ldots, N$.
The weights reflect the significance, importance, or occurrence frequency attached to their respective values.
The weighted arithmetic mean or the weighted average is calculated as
$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$
For the given data set of items, the quantity of each item is considered as its weight w, and the weighted arithmetic mean is calculated accordingly.
To offset the effect caused by a small number of extreme values one can use
trimmed mean.
The trimmed mean is obtained after chopping off values at the high and low
extremes.
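A small sketch of these measures with NumPy and SciPy; the cost and quantity values are hypothetical, not taken from the table referred to above.

```python
import numpy as np
from scipy import stats

cost = np.array([40, 43, 47, 52, 56, 60, 74, 75, 78, 115, 117, 120])  # hypothetical item costs
qty  = np.array([3, 2, 4, 5, 6, 4, 2, 3, 1, 1, 2, 2])                 # hypothetical quantities used as weights

print("Mean:", cost.mean())
print("Weighted mean:", np.average(cost, weights=qty))
# Trimmed mean: chop off 10% of the values at each extreme before averaging.
print("Trimmed mean (10%):", stats.trim_mean(cost, proportiontocut=0.10))
```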
Median
Another measure of central tendency is the median, which is the middle number when the data are listed in order from least to greatest.
For skewed (asymmetric) data a better measure of the center of data is the median.
It is the middle value in a set of ordered data values.
The median is the value that separates the higher half of a data set from the lower
half
To calculate median
Suppose that a given data set of N values for an attribute X is sorted in
increasing order.
If N is odd then the median is the middle value of the ordered set.
If N is even then the median is not unique; it is the two middlemost values and
any value in between.
If X is a numeric attribute in this case by convention the median is taken as the
average of the two middlemost values.
For the sample data set given, we can find the median of the 'Cost of items', which has a total of 12 values. Since 12 is an even number, the median is taken as the average of the two middlemost values, i.e., the average of 52 and 56. Hence the median is 54.
When the data set is large, the median is expensive to compute, so the following interpolation formula is used to approximate it:
$$\text{median} \approx L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
where
$L_1$ is the lower boundary of the median interval,
$N$ is the number of values in the entire data set,
$(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval,
$\text{freq}_{\text{median}}$ is the frequency of the median interval, and
$\text{width}$ is the width of the median interval.
Mode
The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes.
It is possible for the greatest frequency to correspond to several different values,
which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal.
In general, a data set with two or more modes is multimodal. At the other extreme,
if each data value occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$

The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set:
$$\text{midrange} = \frac{\max(\text{data}) + \min(\text{data})}{2}$$

In a unimodal frequency curve with perfect symmetric data distribution, the mean,
median, and mode are all at the same center value as shown in Fig. 2.7.1
Data in most real applications are not symmetric. They may instead be either
positively skewed, where the mode occurs at a value that is smaller than the
median, or negatively skewed, where the mode occurs at a value greater than the
median (Refer Fig. 2.7.1).
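A short Python sketch computing the central tendency measures discussed above; the cost values are hypothetical and were chosen only so that the median comes out to 54, matching the worked example.

```python
import statistics as st

cost = [40, 43, 47, 48, 52, 52, 56, 60, 74, 78, 115, 120]   # hypothetical, sorted item costs

mean     = st.mean(cost)
median   = st.median(cost)        # average of the two middle values (52 and 56) for an even count
modes    = st.multimode(cost)     # may return more than one value for multimodal data
midrange = (max(cost) + min(cost)) / 2

print(mean, median, modes, midrange)
```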
Measuring the Dispersion of Data

The spread of dispersion of data can be measured statistically.


The measures include range, quantiles, quartiles, percentiles, and the interquartile
range.
The five-number summary, which can be displayed as a boxplot, is useful in
identifying outliers.
Variance and standard deviation also indicate the spread of a data distribution.
Range
Let x1, x2 , ..., xN be a set of observations for some numeric attribute, X.
The range of the set is the difference between the largest (max()) and smallest
(min()) values.
Quartiles
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
Suppose that the data for attribute X are sorted in increasing numeric order. Imagine
that we can pick certain data points so as to split the data distribution into
equal-size consecutive sets. These data points are called quantiles. Refer Fig. 2.7.2
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q - k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q - 1 q-quantiles.
The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.

Fig. 2.7.1: Mean, median, and mode of (a) symmetric, (b) positively skewed, and (c) negatively skewed data.

The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
Interquartile Range
The quartiles give an indication of a distribution’s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th percentile.
It cuts off the lowest 25 % of the data. The third quartile, denoted by Q3, is the 75th
percentile - it cuts off the lowest 75 % (or highest 25 %) of the data.
The second quartile is the 50th percentile. As the median, it gives the center of the
data distribution.
The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as IQR = Q3 - Q1.
Outliers
In statistics, an outlier is a data point that differs significantly from other observations or patterns (refer Fig. 2.7.3).
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.
Before abnormal observations can be singled out, it is necessary to characterize
normal observations.
An outlier may be due to variability in the measurement or it may indicate
experimental error; the latter are sometimes excluded from the data set.
An outlier can cause serious problems in statistical analyses.
A common rule of thumb for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile.
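A minimal NumPy sketch of the IQR and the 1.5 × IQR rule of thumb just described; the data values are hypothetical.

```python
import numpy as np

data = np.array([40, 43, 47, 48, 52, 52, 56, 60, 74, 78, 115, 120])   # hypothetical costs

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences for suspected outliers

outliers = data[(data < lower) | (data > upper)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Suspected outliers:", outliers)
```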
Five-Number Summary
Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data.
A fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary.
The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum.
Boxplots
Boxplots are a popular way of visualizing a distribution.
A boxplot incorporates the five-number summary as follows :
Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations. (Refer Fig. 2.7.4)
Boxplots can be computed in O(n log n) time. Approximate boxplots can be computed in linear or sublinear time depending on the quality guarantee required.
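A minimal matplotlib sketch of a boxplot built on the five-number summary; the unit-price values are made up.

```python
import matplotlib.pyplot as plt

data = [40, 43, 47, 48, 52, 52, 56, 60, 74, 78, 115, 120]   # hypothetical unit prices

# whis=(0, 100) makes the whiskers reach the minimum and maximum, as in the
# description above; the default (1.5 x IQR) would instead flag points beyond
# the whiskers as suspected outliers.
plt.boxplot(data, whis=(0, 100))
plt.ylabel("Unit price")
plt.show()
```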

Variance and Standard Deviation


Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of values.
The variance of N observations, $x_1, x_2, \ldots, x_N$, for a numeric attribute X is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
and the standard deviation $\sigma$ is the square root of the variance.
Fig. 2.7.4: Boxplot for the unit price data for items sold at four branches of a shop.
$\sigma$ measures spread about the mean and should be considered only when the mean is chosen as the measure of center.
$\sigma = 0$ only when there is no spread, that is, when all observations have the same value. Otherwise, $\sigma > 0$.
The standard deviation is a good indicator of the spread of a data set.
The computation of the variance and standard deviation is scalable in large databases.
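A short NumPy sketch of these measures; the data values are hypothetical.

```python
import numpy as np

data = np.array([40, 43, 47, 48, 52, 52, 56, 60, 74, 78, 115, 120])   # hypothetical costs

variance = np.var(data)   # population variance, dividing by N as in the formula above
std_dev  = np.std(data)   # sigma, the square root of the variance
print(variance, std_dev)
```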
Graphic Displays of Basic Statistical Descriptions of Data
Graphs are helpful for the visual inspection of data, which is useful for data
preprocessing.
The first three of the displays discussed below (quantile plots, quantile-quantile plots, and histograms) show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate distributions (i.e., involving two attributes).
Quantile Plots
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
First, it displays all of the data for the given attribute (allowing the user to assess
both the overall behavior and unusual occurrences).
Second, it plots quantile information
Let $x_i$, for $i = 1, \ldots, N$, be the data sorted in increasing order, so that $x_1$ is the smallest observation and $x_N$ is the largest, for some ordinal or numeric attribute X.
Each observation $x_i$ is paired with a percentage $f_i$, which indicates that approximately $f_i \times 100\%$ of the data are below the value $x_i$.

Note that the 0.25 percentile corresponds to quartile Q1, the 0.50 percentile is the
median, and the 0.75 percentile is Q3.

These numbers increase in equal steps of 1/N, ranging from $\frac{1}{2N}$ (which is slightly above 0) to $1 - \frac{1}{2N}$ (which is slightly below 1), i.e., $f_i = \frac{i - 0.5}{N}$.
On a quantile plot, $x_i$ is graphed against $f_i$. This allows us to compare different distributions based on their quantiles.
distributions based on their quantiles.


Quantile - quantile Plots
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
It is a powerful visualization tool in that it allows the user to view whether there is
a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different branch locations (refer Table 2.7.2).
Unit price | Count of items sold
40 | 275
43 | 300
47 | 250
... | ...
74 | 360
75 | 515
78 | 540
... | ...
115 | 320
117 | 270
120 | 350

Table 2.7.2: Unit price and count of items sold at a shop

Let x1, ..., xN be the data from the first branch, and y1,..., yM be the data from the
second, where each data set is sorted in increasing order.
If M = N (i.e., the number of points in each set is the same), then we simply plot yi
against xi , where yi and xi are both (i - 0.5)/N quantiles of their respective data
sets.

If M < N (i.e., the second branch has fewer observations than the first), there can be
only M points on the q-q plot. Here, yi is the (i - 0.5)/M quantile of the y data,
which is plotted against the (i - 0.5)/M quantile of the x data. This computation
typically involves interpolation.
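A small sketch of a two-sample q-q plot following the recipe above; the two "branch" samples are randomly generated stand-ins, not real unit-price data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.sort(rng.normal(70, 15, 200))   # unit prices at branch 1 (made-up data)
y = np.sort(rng.normal(75, 12, 150))   # unit prices at branch 2 (made-up data)

# Use the smaller sample's quantile fractions, (i - 0.5)/M, and interpolate the
# corresponding quantiles of the other sample, as described above.
m = min(len(x), len(y))
f = (np.arange(1, m + 1) - 0.5) / m
qx, qy = np.quantile(x, f), np.quantile(y, f)

plt.scatter(qx, qy, s=10)
plt.plot([qx.min(), qx.max()], [qx.min(), qx.max()], color="gray")   # reference line y = x
plt.xlabel("Branch 1 quantiles")
plt.ylabel("Branch 2 quantiles")
plt.show()
```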

Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles.
Plotting histograms is a graphical method for summarizing the distribution of a
given attribute, X.
If X is nominal, such as automobile model or item type, then a pole or vertical bar is
drawn for each known value of X.
The height of the bar indicates the frequency (i.e., count) of that X value. The
resulting graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred.
The range of values for X is partitioned into disjoint consecutive subranges.
The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.
The range of a bucket is known as the width.
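A minimal matplotlib histogram sketch for a numeric attribute; the data are randomly generated for illustration, and the choice of 10 equal-width buckets is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
unit_price = rng.normal(80, 20, 500)              # made-up numeric attribute values

plt.hist(unit_price, bins=10, edgecolor="black")  # 10 equal-width buckets (bins)
plt.xlabel("Unit price")
plt.ylabel("Frequency (count)")
plt.show()
```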
Scatter Plots
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.

To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane. (Refer Fig. 2.7.8)
The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation
relationships.
Two attributes, X and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated).
If the plotted points pattern slopes from lower left to upper right, this means
that the values of X increase as the values of Y increase, suggesting a positive
correlation Fig 2.7.9 (a).

If the pattern of plotted points slopes from upper left to lower right, the values
of X increase as the values of Y decrease, suggesting a negative correlation
Fig. 2.7.9 (b).
A line of best fit can be drawn to study the correlation between the variables.
Fig. 2.7.10 shows three cases in which there is no correlation relationship between the two attributes in each of the given data sets.
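A brief sketch of a scatter plot for two numeric attributes, with the sample correlation coefficient as a numeric summary; the data are synthetic and constructed to be positively correlated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(50, 10, 200)             # made-up attribute X
y = 0.8 * x + rng.normal(0, 5, 200)     # Y constructed to be positively correlated with X

plt.scatter(x, y, s=10)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

# The sample correlation coefficient summarizes the trend seen in the scatter plot.
print("Correlation:", np.corrcoef(x, y)[0, 1])
```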

Conclusion
Basic data descriptions (e.g., measures of central tendency and measures of
dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and
scatter plots) provide valuable insight into the overall behavior of your data. By
identification of noise and outliers, these techniques are useful for data cleaning.
