Unit V
Data Mining on complex data and applications
Syllabus Contents
• Algorithms for mining of Spatial Data
• Multimedia Data
• Text Data
• Data mining applications
• Social Impacts of Data Mining
• Trends in Data Mining
Spatial Data Mining
• A spatial database stores a large amount of space-related data, such as maps,
preprocessed remote sensing or medical imaging data, and VLSI chip layout
data.
• Spatial databases carry topological and/or distance information, usually
organized by sophisticated, multidimensional spatial indexing structures that
are accessed by spatial data access methods. They often require spatial
reasoning, geometric computation, and spatial knowledge representation
techniques.
• Spatial data mining refers to the extraction of knowledge, spatial relationships,
or other interesting patterns not explicitly stored in spatial databases.
• Such mining demands an integration of data mining with spatial database
technologies.
• It can be used for understanding spatial data, discovering spatial relationships
and relationships between spatial and nonspatial data, constructing spatial
knowledge bases, reorganizing spatial databases, and optimizing spatial
queries.
Statistical Techniques for Spatial Data Mining
• Statistical spatial data analysis has been a popular approach to analyzing
spatial data and exploring geographic information.
• The term geostatistics is often associated with continuous geographic
space, whereas the term spatial statistics is often associated with
discrete space.
• In a statistical model that handles nonspatial data, one usually assumes
statistical independence among different portions of data.
• However, unlike traditional data sets, spatially distributed data do not
satisfy this independence assumption because, in reality, spatial
objects are often interrelated, or more exactly spatially co-located, in the
sense that the closer two objects are located, the more likely they are to
share similar properties.
Statistical Techniques for Spatial Data Mining
• Such a property of close interdependency across nearby space leads to
the notion of spatial autocorrelation.
• Based on this notion, spatial statistical modeling methods have been
developed with good success.
• Spatial data mining will further develop spatial statistical analysis
methods and extend them for huge amounts of spatial data, with more
emphasis on efficiency, scalability, cooperation with database and data
warehouse systems, improved user interaction, and the discovery of new
types of knowledge.
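One common way to quantify spatial autocorrelation is Moran's I, a standard statistic that the slides do not name explicitly. The sketch below is only illustrative: the function name `morans_i`, the four-location example, and the binary neighbor weights are assumptions chosen for the demonstration.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over every pair of locations.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

# Four locations along a line; adjacent locations are neighbors (weight 1).
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], line_weights)       # neighbors are similar
alternating = morans_i([1.0, 4.0, 1.0, 4.0], line_weights)  # neighbors are dissimilar
print(smooth, alternating)
```

A positive value (the smoothly varying case) indicates that nearby locations tend to have similar values, while a negative value (the alternating case) indicates that neighbors tend to differ.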
Spatial Data Cube Construction and Spatial OLAP
Mining Spatial Association and Co-location Patterns
• Since spatial association mining needs to evaluate multiple spatial
relationships among a large number of spatial objects, the process could be
quite costly.
• An interesting mining optimization method called progressive refinement
can be adopted in spatial association analysis.
• The method first mines large data sets roughly using a fast algorithm and
then improves the quality of mining in a pruned data set using a more
expensive algorithm.
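Progressive refinement can be sketched as a filter-and-refine search for "close to" object pairs. This is a minimal illustration under stated assumptions, not the method of any particular system: the rough pass uses a grid-cell test, the precise pass uses exact Euclidean distance, and the function name `close_pairs` and the sample objects are hypothetical.

```python
import math
from itertools import combinations

def close_pairs(points, max_dist):
    """Return (candidate pairs, refined pairs) for the predicate 'within max_dist'."""
    # Step 1 (rough, cheap): map each point to a grid cell of side max_dist.
    # Two points within max_dist always fall in the same or adjacent cells,
    # so this pass prunes pairs but never discards a true answer.
    cells = {name: (math.floor(x / max_dist), math.floor(y / max_dist))
             for name, (x, y) in points.items()}
    candidates = [
        (a, b) for a, b in combinations(sorted(points), 2)
        if abs(cells[a][0] - cells[b][0]) <= 1
        and abs(cells[a][1] - cells[b][1]) <= 1
    ]
    # Step 2 (precise, expensive): exact distance only on surviving candidates.
    refined = [(a, b) for a, b in candidates
               if math.dist(points[a], points[b]) <= max_dist]
    return candidates, refined

points = {
    "school": (1.0, 1.0),
    "park": (1.5, 1.2),
    "mall": (2.9, 2.9),
    "lake": (9.0, 9.0),
}
candidates, refined = close_pairs(points, max_dist=2.0)
print(candidates)
print(refined)
```

The rough pass keeps three candidate pairs and drops every pair involving the distant lake; the exact pass then confirms only the park and school pair. A real system would typically obtain the candidates from a spatial index rather than by enumerating all pairs.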
Spatial Clustering Methods
Spatial data clustering identifies clusters, or densely populated regions, according to
some distance measurement in a large, multidimensional data set.
Spatial clustering methods were thoroughly studied in Unit IV since cluster analysis
usually considers spatial data clustering in examples and applications.
Spatial Classification and Spatial Trend Analysis
Mining Raster Databases
Multimedia Data Mining
• A multimedia database system stores and manages a large collection of
multimedia data, such as audio, video, image, graphics, speech, text,
document, and hypertext data, which contain text, text markups, and
linkages.
• Multimedia database systems are increasingly common owing to the
popular use of audio-video equipment, digital cameras, CD-ROMs, and the
Internet.
• Typical multimedia database systems include NASA’s EOS (Earth
Observation System), various kinds of image and audio-video databases, and
Internet databases.
Multimedia Techniques
• Similarity Search in Multimedia Data
• Multidimensional Analysis of Multimedia Data
• Classification and Prediction Analysis of Multimedia Data
• Mining Associations in Multimedia Data
• Audio and Video Data Mining
Similarity Search in Multimedia Data
• For similarity searching in multimedia data, we consider two main families of
multimedia indexing and retrieval systems:
• (1) description-based retrieval systems, which build indices and perform
object retrieval based on image descriptions, such as keywords, captions,
size, and time of creation;
• (2) content-based retrieval systems, which support retrieval based on the
image content, such as color histogram, texture, pattern, image topology,
and the shape of objects and their layouts and locations within the image.
Similarity Search in Multimedia Data
• Description-based retrieval is labor-intensive if performed manually. If
automated, the results are typically of poor quality.
• For example, the assignment of keywords to images can be a tricky and
arbitrary task.
• Recent development of Web-based image clustering and classification
methods has improved the quality of description-based Web image
retrieval, because the text surrounding an image, as well as Web linkage
information, can be used to extract proper descriptions and to group images
that describe a similar theme.
Similarity Search in Multimedia Data
• In a content-based image retrieval system, there are often two kinds of queries:
image-sample-based queries and image feature specification queries.
• Image-sample-based queries find all of the images that are similar to the given
image sample.
• This search compares the feature vector (or signature) extracted from the sample
with the feature vectors of images that have already been extracted and indexed
in the image database.
• Based on this comparison, images that are close to the sample image are
returned.
• Image feature specification queries specify or sketch image features like color,
texture, or shape, which are translated into a feature vector to be matched with
the feature vectors of the images in the database.
• Content-based retrieval has wide applications, including medical diagnosis,
weather prediction, TV production, Web search engines for images, and
e-commerce.
• Some systems, such as QBIC (Query By Image Content), support both
sample-based and image feature specification queries.
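An image-sample-based query can be sketched as a nearest-neighbor search over pre-extracted feature vectors. The example below is a minimal sketch: the three-component signatures (mean red, green, and blue), the file names, and the function `query_by_sample` are all hypothetical, and Euclidean distance is just one possible similarity measure.

```python
import math

def query_by_sample(sample_vector, index, k=2):
    """Return the k image ids whose stored feature vectors are closest to the sample."""
    ranked = sorted(index,
                    key=lambda image_id: math.dist(sample_vector, index[image_id]))
    return ranked[:k]

# Hypothetical pre-extracted signatures: (mean red, mean green, mean blue) in [0, 1].
image_index = {
    "sunset.jpg": (0.9, 0.4, 0.1),
    "forest.jpg": (0.1, 0.7, 0.2),
    "beach.jpg": (0.8, 0.6, 0.3),
    "ocean.jpg": (0.1, 0.3, 0.9),
}
matches = query_by_sample((0.85, 0.45, 0.15), image_index, k=2)
print(matches)
```

A feature specification query would work the same way, except that the query vector is built from the user's sketch or chosen feature values instead of being extracted from a sample image.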
Similarity Search in Multimedia Data
• Several approaches have been proposed and studied for similarity-based
retrieval in image databases, based on image signature:
• Color histogram–based signature: In this approach, the signature of an
image includes color histograms based on the color composition of an image
regardless of its scale or orientation.
• Multifeature composed signature: In this approach, the signature of an
image includes a composition of multiple features: color histogram, shape,
image topology, and texture.
• Wavelet-based signature: This approach uses the dominant wavelet
coefficients of an image as its signature.
• Wavelet-based signature with region-based granularity: In this approach,
the computation and comparison of signatures are at the granularity of
regions, not the entire image.
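The color histogram-based signature can be illustrated with a small sketch. The assumptions here are that an image is a flat list of (r, g, b) pixels, that each channel is quantized into two bins, and that histogram intersection is the similarity measure; the function names are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized color histogram of an image given as (r, g, b) pixels in 0..255."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    # Dividing by the pixel count makes the signature independent of image size.
    return [count / len(pixels) for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color composition."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red, blue = (250, 10, 10), (10, 10, 250)
image_a = [red, red, red, blue]
image_a_rearranged = [blue, red, red, red]   # same pixels, different layout
image_a_scaled = image_a * 4                 # same composition, more pixels
image_b = [blue, blue, blue, red]

sig_a = color_histogram(image_a)
same_layout_free = histogram_intersection(sig_a, color_histogram(image_a_rearranged))
same_scale_free = histogram_intersection(sig_a, color_histogram(image_a_scaled))
different = histogram_intersection(sig_a, color_histogram(image_b))
print(same_layout_free, same_scale_free, different)
```

Because the signature only records color composition, rearranging or enlarging the image leaves the similarity at 1.0, while an image with different color proportions scores lower. The same property means two visually different images with similar color mixes can look alike to this signature, which is one motivation for the multifeature and region-based approaches.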
Multidimensional Analysis of Multimedia Data
• A multimedia data cube can contain additional dimensions and measures for
multimedia information, such as color, texture, and shape.
• A multimedia data cube can have many dimensions.
• The following are some examples: the size of the image or video in bytes;
the width and height of the frames (or pictures), constituting two
dimensions; the date on which the image or video was created (or last
modified); the format type of the image or video; the frame sequence
duration in seconds; the image or video Internet domain; the Internet
domain of pages referencing the image or video (parent URL); the keywords;
a color dimension; an edge-orientation dimension; and so on.
• Concept hierarchies for many numerical dimensions may be automatically
defined. For other dimensions, such as for Internet domains or color,
predefined hierarchies may be used.
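A toy version of such a cube can be sketched by counting images over combinations of dimensions. Everything specific in this example is an assumption made for illustration: the sample records, the byte thresholds used to derive the size hierarchy, and the helper names `size_bucket` and `roll_up`.

```python
from collections import Counter

def size_bucket(num_bytes):
    """Derived concept-hierarchy level for the numerical size dimension (thresholds assumed)."""
    if num_bytes < 100_000:
        return "small"
    if num_bytes < 1_000_000:
        return "medium"
    return "large"

def roll_up(images, dimensions):
    """Count images for each combination of values of the requested dimensions."""
    return Counter(tuple(img[d] for d in dimensions) for img in images)

images = [
    {"format": "jpeg", "bytes": 40_000, "domain": "edu"},
    {"format": "jpeg", "bytes": 350_000, "domain": "com"},
    {"format": "png", "bytes": 2_500_000, "domain": "com"},
    {"format": "jpeg", "bytes": 60_000, "domain": "com"},
]
for img in images:
    img["size"] = size_bucket(img["bytes"])

by_format_size = roll_up(images, ("format", "size"))  # finer cuboid
by_format = roll_up(images, ("format",))              # rolled up over size
print(by_format_size)
print(by_format)
```

Dropping a dimension from the tuple corresponds to a roll-up, and adding one corresponds to a drill-down; a real multimedia cube would also carry visual dimensions such as color and edge orientation.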
Multidimensional Analysis of Multimedia Data
• The construction of a multimedia data cube will facilitate multidimensional
analysis of multimedia data primarily based on visual content, and the
mining of multiple kinds of knowledge, including summarization,
comparison, classification, association, and clustering.
Classification and Prediction Analysis of Multimedia Data
• Classification and predictive modeling have been used for mining
multimedia data, especially in scientific research, such as astronomy,
seismology, and geoscientific research.
Classification and Prediction Analysis of Multimedia Data
• Data preprocessing is important when mining image data and can include
data cleaning, data transformation, and feature extraction.
• Aside from standard methods used in pattern recognition, such as edge
detection and Hough transformations, techniques can be explored, such as
the decomposition of images to eigenvectors or the adoption of probabilistic
models to deal with uncertainty.
• Since the image data are often in huge volumes and may require substantial
processing power, parallel and distributed processing are useful.
• Image data mining classification and clustering are closely linked to image
analysis and scientific data mining, and thus many image analysis techniques
and scientific data analysis methods can be applied to image data mining.
Mining Associations in Multimedia Data
Audio and Video Data Mining
• Besides still images, an enormous amount of audiovisual information
is becoming available in digital form, in digital archives, on the World Wide
Web, in broadcast data streams, and in personal and professional databases.
• There are great demands for effective content-based retrieval and data
mining methods for audio and video data.
• Typical examples include searching for and editing particular
video clips in a TV studio, detecting suspicious persons or scenes in
surveillance videos, searching for particular events in a personal multimedia
repository such as MyLifeBits, discovering patterns and outliers in weather
radar recordings, and finding a particular melody or tune in your MP3 audio
album.
Audio and Video Data Mining
• To facilitate the recording, search, and analysis of audio and video
information from multimedia data, industry and standardization committees
have made great strides toward developing a set of standards for multimedia
information description and compression.
• For example, MPEG-k (developed by MPEG: Moving Picture Experts Group)
and JPEG are typical video and image compression schemes. MPEG-7,
formally named “Multimedia Content Description Interface,” is a
standard for describing the multimedia content data.
• It supports some degree of interpretation of the information meaning,
which can be passed onto, or accessed by, a device or a computer.
• The audiovisual data description in MPEG-7 includes still pictures, video,
graphics, audio, speech, three-dimensional models, and information about
how these data elements are combined in the multimedia presentation.
Audio and Video Data Mining
• It is unrealistic to treat a video clip as a long sequence of individual still
pictures and analyze each picture since there are too many pictures, and
most adjacent images could be rather similar.
• In order to capture the story or event structure of a video, it is better to
treat each video clip as a collection of actions and events in time and first
temporally segment them into video shots.
• A shot is a group of frames or pictures where the video content from one
frame to the adjacent ones does not change abruptly.
• Moreover, the most representative frame in a video shot is considered
the key frame of the shot.
• Each key frame can be analyzed using the image feature extraction and
analysis methods studied above in the content-based image retrieval.
Audio and Video Data Mining
• The sequence of key frames will then be used to define the sequence of
the events happening in the video clip.
• Thus, the detection of shots and the extraction of key frames from video
clips become the essential tasks in video processing and mining.
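Shot detection and key-frame extraction can be sketched as follows. This is a simplified illustration under assumptions: each frame is already reduced to a small signature vector (for example, a coarse color histogram), a shot boundary is declared when consecutive signatures differ by more than a fixed threshold, and the key frame is taken to be the frame closest to the shot's average signature. The function names and the threshold value are hypothetical.

```python
def l1_distance(h1, h2):
    """Sum of absolute differences between two frame signatures."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def segment_shots(frame_signatures, threshold):
    """Split frame indices into shots wherever consecutive signatures change abruptly."""
    if not frame_signatures:
        return []
    shots, current = [], [0]
    for i in range(1, len(frame_signatures)):
        if l1_distance(frame_signatures[i - 1], frame_signatures[i]) > threshold:
            shots.append(current)   # abrupt change: close the current shot
            current = []
        current.append(i)
    shots.append(current)
    return shots

def key_frame(shot, frame_signatures):
    """Pick the frame whose signature is closest to the shot's average signature."""
    dims = len(frame_signatures[shot[0]])
    mean = [sum(frame_signatures[i][d] for i in shot) / len(shot) for d in range(dims)]
    return min(shot, key=lambda i: l1_distance(frame_signatures[i], mean))

frames = [
    (0.9, 0.1), (0.8, 0.2), (0.7, 0.3),   # slowly changing scene
    (0.1, 0.9), (0.2, 0.8), (0.3, 0.7),   # abrupt cut to a new scene
]
shots = segment_shots(frames, threshold=0.5)
key_frames = [key_frame(shot, frames) for shot in shots]
print(shots, key_frames)
```

Only the selected key frames then need to go through the more expensive image feature extraction, which is what makes video analysis tractable compared with processing every frame.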
Text Mining
• Most previous studies of data mining have focused on structured
data, such as relational, transactional, and data warehouse data.
• However, in reality, a substantial portion of the available information
is stored in text databases (or document databases), which consist of
large collections of documents from various sources, such as news
articles, research papers, books, digital libraries, e-mail messages, and
Web pages.
• Data stored in most text databases are semistructured data in that
they are neither completely unstructured nor completely structured.
Text Data Analysis and Information Retrieval
• Information retrieval is concerned with the organization and retrieval
of information from a large number of text-based documents.
• Since information retrieval and database systems each handle
different kinds of data, some database system problems are usually
not present in information retrieval systems, such as concurrency
control, recovery, transaction management, and update.
• Also, some common information retrieval problems are usually not
encountered in traditional database systems, such as unstructured
documents, approximate search based on keywords, and the notion
of relevance.
Text Data Analysis and Information Retrieval
• There exist many information retrieval systems, such as on-line library
catalog systems, on-line document management systems, and the
more recently developed Web search engines.
• A typical information retrieval problem is to locate relevant
documents in a document collection based on a user’s query, which is
often some keywords describing an information need, although it
could also be an example relevant document.
• In such a search problem, a user takes the initiative to “pull” the
relevant information out from the collection; this is most appropriate
when a user has some ad hoc (i.e., short-term) information need,
such as finding information to buy a used car.
Text Data Analysis and Information Retrieval
• When a user has a long-term information need (e.g., a researcher’s
interests), a retrieval system may also take the initiative to “push” any
newly arrived information item to a user if the item is judged as being
relevant to the user’s information need.
• Such an information access process is called information filtering, and
the corresponding systems are often called filtering systems or
recommender systems.
Basic Measures for Text Retrieval: Precision and Recall
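Using the standard definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both for a single query; the document ids and the function name `precision_recall` are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query given sets of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved   # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d7"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(precision, recall)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2). Retrieving more documents usually raises recall at the cost of precision.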
Text Retrieval Methods
Data Mining Applications
• Data Mining for Financial Data Analysis
• Data Mining for the Retail Industry
• Data Mining for the Telecommunication Industry
• Data Mining for Biological Data Analysis
• Data Mining for Intrusion Detection
• Data Mining in Other Scientific Applications
Data Mining for Financial Data Analysis
Data Mining for the Retail Industry
Data Mining for the Telecommunication Industry
Data Mining for Biological Data Analysis
Data Mining for Intrusion Detection
Data Mining in Other Scientific Applications
Challenges brought about by emerging scientific
applications of data mining, such as the following:
Social Impacts of Data Mining
• For most of us, data mining is part of our daily lives, although we may
often be unaware of its presence.
• Such “ubiquitous and invisible” data mining affects everyday things, from
the products stocked at our local supermarket, to the ads we see
while surfing the Internet, to crime prevention.
• Data mining can offer the individual many benefits by improving
customer service and satisfaction, and lifestyle, in general.
• However, it also has serious implications regarding one’s right to
privacy and data security.
Ubiquitous and Invisible Data Mining
• Data mining has shaped the on-line shopping experience. Many
shoppers routinely turn to on-line stores to purchase books, music,
movies, and toys.
• Amazon.com was at the forefront of using such a personalized, data
mining–based approach as a marketing strategy.
• CEO and founder Jeff Bezos had observed that in traditional brick-
and-mortar stores, the hardest part is getting the customer into the
store.
• Once the customer is there, she is likely to buy something, since the
cost of going to another store is high.
• Therefore, the marketing for brick-and-mortar stores tends to
emphasize drawing customers in, rather than the actual in-store
customer experience.
Ubiquitous and Invisible Data Mining
• Many companies increasingly use data mining for customer
relationship management (CRM), which helps provide more
customized, personal service addressing individual customers’ needs,
in lieu of mass marketing.
• By studying browsing and purchasing patterns on Web stores,
companies can tailor advertisements and promotions to customer
profiles, so that customers are less likely to be annoyed with
unwanted mass mailings or junk mail.
• These actions can result in substantial cost savings for companies.
• The customers further benefit in that they are more likely to be
notified of offers that are actually of interest, resulting in less waste of
personal time and greater satisfaction.
Ubiquitous and Invisible Data Mining
• Data mining has greatly influenced the ways in which people use
computers, search for information, and work.
• Suppose that you are sitting at your computer and have just logged onto
the Internet. Chances are, you have a personalized portal, that is, the initial
Web page displayed by your Internet service provider is designed to have a
look and feel that reflects your personal interests.
• Yahoo (www.yahoo.com) was the first to introduce this concept.
• Usage logs from MyYahoo are mined to provide Yahoo with valuable
information regarding an individual’s Web usage habits, enabling Yahoo to
provide personalized content.
• This, in turn, has contributed to Yahoo’s consistent ranking as one of the
top Web search providers for years, according to Advertising Age’s BtoB
magazine’s Media Power 50 (www.btobonline.com), which recognizes the 50 most
powerful and targeted business-to-business advertising outlets each year.
Ubiquitous and Invisible Data Mining
• You decide to type in some keywords for a topic of interest. Google returns
a list of websites on your topic of interest, mined and organized by
PageRank.
• Unlike earlier search engines, which concentrated solely on Web content
when returning the pages relevant to a query, PageRank measures the
importance of a page using structural link information from the Web graph.
• It is the core of Google’s Web mining technology.
• While you are viewing the results of your Google query, various ads pop up
relating to your query.
• Google’s strategy of tailoring advertising to match the user’s interests is
successful—it has increased the clicks for the companies involved by four
to five times.
• This also makes you happier, because you are less likely to be pestered with
irrelevant ads.
• Google was named a top-10 advertising venue by Media Power 50.
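The idea behind PageRank, scoring a page by the link structure pointing to it, can be sketched with the simplified textbook power-iteration form. This is not Google's production system; the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page example graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical Web graph: every link target also appears as a key.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest score because three pages link to it, while page D, which nothing links to, keeps only the baseline share. The score depends on who links to a page, not on the page's own text.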
Ubiquitous and Invisible Data Mining
• Finally, data mining can contribute toward our health and well-being.
• Several pharmaceutical companies use data mining software to analyze data
when developing drugs and to find associations between patients, drugs, and
outcomes. It is also being used to detect beneficial side effects of drugs.
• Data mining can also be used to keep our streets safe.
• The data mining system Clementine from SPSS is being used by police
departments to identify key patterns in crime data.
• It has also been used by police to detect unsolved crimes that may have been
committed by the same criminal.
• Many police departments around the world are using data mining software for
crime prevention, such as the Dutch police’s use of Data Detective
(www.sentient.nl) to find patterns in criminal databases.
• Such discoveries can contribute toward controlling crime.
Data Mining, Privacy, and Data Security
Trends in Data Mining