Papers by Florent Masseglia
Mining Data Streams for Frequent Sequences Extraction
In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are particularly linked to new kinds of data that can be considered as complex data. One typical kind of such data is known as data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered
Sequential Pattern Mining: A Survey on Issues and Approaches
Florent Masseglia AxIS Research Group INRIA Sophia Antipolis BP 93 06902 Sophia Antipolis Cedex France Phone number: (33) 4 92 38 50 67 Fax number: (33) 4 92 38 77 55 Email address: [email protected] ... Maguelonne Teisseire LIRMM Polytech - ...
Lecture Notes in Computer Science, 2000
Proceedings. 11th International Symposium on Temporal Representation and Reasoning, 2004. TIME 2004., 2004
In this paper we consider the problem of discovering sequential patterns by handling time constraints. While sequential patterns could be seen as temporal relationships between facts embedded in the database, generalized sequential patterns aim at providing the end user with a more flexible handling of the transactions embedded in the database. We propose a new efficient algorithm, called GTC (Graph for Time Constraints), for mining such patterns in very large databases. It is based on the idea that handling time constraints in the earlier stage of the algorithm can be highly beneficial since it minimizes computational costs by preprocessing data sequences. Our test shows that the proposed algorithm performs significantly faster than a state-of-the-art sequence mining algorithm.
ACM sigweb Newsletter, 1999
With the growing popularity of the World Wide Web (Web), large volumes of data such as user address or URL requested are gathered automatically by Web servers and collected in access log files. Discovering relationships and global patterns that exist in ...

20th International Conference on Advanced Information Networking and Applications - Volume 1 (AINA'06), 2006
With the huge number of information sources available on the Internet, Peer-to-Peer (P2P) systems offer a novel kind of system architecture providing the large-scale community with applications for file sharing, distributed file systems, distributed computing, messaging and real-time communication. P2P applications also provide a good infrastructure for data- and compute-intensive operations such as data mining. In this paper we propose a new approach for improving resource searching in a dynamic and distributed database such as an unstructured P2P system. This approach takes advantage of data mining techniques. By using a genetic-inspired algorithm, we propose to extract patterns or relationships occurring in a large number of nodes. Such knowledge is very useful for suggesting frequently downloaded or requested files to the user, according to the behavior of the majority. It may also be useful in order to avoid extra bandwidth consumption.
Lecture Notes in Computer Science, 1999
Large volumes of data such as user address or URL requested are gathered automatically by Web servers and collected in access log files. Analysis of server access data can provide significant and useful information for performance enhancement, and restructuring a Web site for increased effectiveness. In this paper, we propose an integrated system (WebTool) for mining user patterns and association rules from one or more Web servers and pay particular attention to the handling of time constraints. Once interesting patterns are discovered, we illustrate how they can be used to customize the server hypertext organization dynamically.
TED and EVA: Expressing temporal tendencies among quantitative variables using fuzzy sequential patterns
2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), 2008
Temporal data can be handled in many ways for discovering specific knowledge. Sequential pattern mining is one of these relevant approaches when dealing with temporally annotated data. It allows discovering frequent sequences embedded in the records. In the access data of a commercial Web site, one may, for instance, discover that “5% of the users request the page register.php 3

Proceedings of the 18th international conference on World wide web - WWW '09, 2009
Detection of web attacks is an important issue in the current defense-in-depth security framework. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering method. A detection model based on the framework is able to learn online and deal with "concept drift" in web audit data streams. Str-DBSCAN, our extension of DBSCAN [1] to streaming data, and StrAP are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to normal behavior changes while identifying attacks through dynamic clustering of the streaming data. A very large set of real HTTP log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrate its effectiveness.
Lecture Notes in Computer Science, 2000
Lecture Notes in Computer Science, 2006
This article presents an original supervised classification technique for XML documents which is based on structure only. Each XML document is viewed as an ordered labeled tree, represented by its tags only. Our method has three steps. After a cleaning step, we characterize each predefined cluster in terms of frequent structural subsequences. Then we classify the XML documents based on the mined patterns of each cluster.
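The classification step described above can be illustrated with a minimal sketch. It assumes that a document is reduced to its tags in document order and that a class is described by a list of tag subsequences; the scoring rule (count of matched patterns) and the names `tag_sequence`, `contains_subsequence`, `classify` and `class_patterns` are hypothetical choices for illustration, not the method of the article.

```python
import xml.etree.ElementTree as ET


def tag_sequence(xml_text):
    """Return the element tags of a document in document (preorder) order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


def contains_subsequence(sequence, pattern):
    """True if the tags of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `tag in remaining` consumes the iterator up to the first match,
    # so later tags can only match later positions.
    return all(tag in remaining for tag in pattern)


def classify(xml_text, class_patterns):
    """Assumed scoring rule: pick the class with the most matched patterns."""
    seq = tag_sequence(xml_text)
    scores = {
        label: sum(contains_subsequence(seq, p) for p in patterns)
        for label, patterns in class_patterns.items()
    }
    return max(scores, key=scores.get)


# Hypothetical patterns, standing in for mined frequent subsequences.
class_patterns = {
    "book": [["book", "chapter"], ["title", "author"]],
    "article": [["article", "abstract"]],
}
doc = "<book><title/><author/><chapter><title/></chapter></book>"
label = classify(doc, class_patterns)
```

In this sketch the patterns are given by hand; in the approach described above they would be mined from the documents of each predefined cluster.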
Fast and Exact Mining of Probabilistic Data Streams
Lecture Notes in Computer Science, 2013
Lecture Notes in Computer Science, 1998
Lecture Notes in Computer Science, 2010
Change detection in satellite image time series is an important domain with various applications in land study. Most previous works proposed to perform this detection by studying two images and analysing their differences. However, those methods do not exploit the whole set of images that is available today and they do not propose a description of the detected changes. We propose a sequential pattern mining approach for these image time series with two important features. First, our proposal allows for the analysis of all the images in the series and each image can be considered from multiple points of view. Second, our technique is specifically designed towards image time series where the changes are not the most frequent patterns that can be discovered. Our experiments show the relevance of our approach and the significance of our patterns.

International journal of neural systems, 2011
Satellite Image Time Series (SITS) provide us with precious information on land cover evolution. By studying these series of images we can both understand the changes of specific areas and discover global phenomena that spread over larger areas. Changes that occur throughout the sensing time can spread over very long periods and may have different start and end times depending on the location, which complicates the mining and the analysis of series of images. This work focuses on frequent sequential pattern mining (FSPM) methods, since this family of methods fits the above-mentioned issues. This family of methods consists of finding the most frequent evolution behaviors, and is able to extract long-term changes as well as short-term ones, whenever the change starts and ends. However, applying FSPM methods to SITS implies confronting two main challenges, related to the characteristics of SITS and the domain's constraints. First, satellite images associate multi...

Discovering Highly Informative Feature Set over High Dimensions
2012 IEEE 24th International Conference on Tools with Artificial Intelligence, 2012
For many textual collections, the number of features is often overly large. These features can be very redundant; it is therefore desirable to have a small, succinct, yet highly informative collection of features that describes the key characteristics of a dataset. Information theory is one such tool for us to obtain this feature collection. With this paper, we mainly contribute to the improvement of efficiency for the process of selecting the most informative feature set over high-dimensional unlabeled data. We propose a heuristic theory for informative feature set selection from high-dimensional data. Moreover, we design data structures that enable us to compute the entropies of the candidate feature sets efficiently. We also develop a simple pruning strategy that eliminates the hopeless candidates at each forward selection step. We test our method through experiments on real-world data sets, showing that our proposal is very efficient.
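As an illustration of entropy-driven forward selection, the sketch below greedily adds the feature that most increases the joint entropy of the selected set. It is a naive baseline under stated assumptions: the data is a list of rows of discrete values, the function names `joint_entropy` and `forward_select` are hypothetical, and neither the data structures nor the pruning strategy mentioned in the abstract are reproduced here.

```python
from collections import Counter
from math import log2


def joint_entropy(rows, features):
    """Shannon entropy (in bits) of the joint distribution of the given columns."""
    counts = Counter(tuple(row[f] for f in features) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def forward_select(rows, k):
    """Greedily pick up to k features, each time maximizing joint entropy."""
    n_features = len(rows[0])
    selected, current = [], 0.0
    for _ in range(k):
        best_f, best_h = None, current
        for f in range(n_features):
            if f in selected:
                continue
            h = joint_entropy(rows, selected + [f])
            if h > best_h:
                best_f, best_h = f, h
        if best_f is None:  # no remaining feature adds information
            break
        selected.append(best_f)
        current = best_h
    return selected, current


# Toy data: column 1 duplicates column 0, column 2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
selected, entropy_bits = forward_select(rows, 2)
```

On this toy data the redundant column 1 is skipped, because adding it to column 0 does not raise the joint entropy, whereas column 2 does. Each step recomputes entropies from scratch, which is exactly the cost that dedicated data structures and pruning would aim to reduce.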

Modeling and Clustering Users with Evolving Profiles in Usage Streams
2012 19th International Symposium on Temporal Representation and Reasoning, 2012
Today, there is an increasing need for data stream mining technology to discover important patterns on the fly. Existing data stream models and algorithms commonly assume that users' records or profiles in data streams will not be updated or revised once they arrive. Nevertheless, in various applications such as Web usage, the records/profiles of the users can evolve over time. This kind of streaming data evolves in two forms: the streaming of tuples or transactions, as in the case of traditional data streams, and, more importantly, the evolution of user records/profiles inside the streams. Such data streams make modeling and clustering difficult when exploring users' behaviors. In this paper, we propose three models to summarize this kind of data stream: the batch model, the Evolving Objects (EO) model and the Dynamic Data Stream (DDS) model. Through creating, updating and deleting user profiles, these models summarize the behaviors of each user as a profile object. Based upon these models, clustering algorithms are employed to discover interesting user groups from the profile objects. We have evaluated all the proposed models on a large real-world data set, showing that the DDS model summarizes the data streams with evolving tuples more efficiently and effectively, and provides a better basis for clustering users than the other two models.

ABS: The Anti Bouncing Model for Usage Data Streams
2010 IEEE International Conference on Data Mining, 2010
Usage data mining is an important research area with applications in various fields. However, usage data is usually considered streaming, due to its high volumes and rates. Because of these characteristics, we only have access, at any point in time, to a small fraction of the stream. When the data is observed through such a limited window, it is challenging to give a reliable description of the recent usage data. We study the important consequences of these constraints, through the “bounce rate” problem and the clustering of usage data streams. Then, we propose the ABS (Anti-Bouncing Stream) model which combines the advantages of previous models but discards their drawbacks. First, under the same resource constraints as existing models in the literature, ABS can better model the recent data. Second, owing to its simple but effective management approach, the data in ABS is available at any time for analysis. We demonstrate its superiority through a theoretical study and experiments on two real-world data sets.
Techniques de généralisation des urls pour l’analyse des usages du web
Abstract. Usage analysis of a Web site based on pattern extraction is often limited by the low support of these patterns. This is mainly due to the great diversity of pages and behaviors. It is nevertheless possible to group most of the ...
Mining Sequential Patterns with Time Constraints: Reducing the Combinations