Introduction to Formal Concept Analysis and Its Applications in Information Retrieval and Related Fields
Dmitry I. Ignatov
1 Introduction
as well as several related fields, in which the objects are the scientific papers and the
attributes are the relevant terms available in the title, keywords and abstract of
the papers. You can see an example of such a visualisation in Figure 1 for papers
published between 2003 and 2009. We developed a toolset with a central FCA
component that we used to index the papers with a thesaurus containing terms
related to FCA research and to generate the lattices. The tutorial also contains
a partial overview of the papers on using FCA in Information Retrieval with a
focus on visualisation.
Fig. 1. The lattice diagram representing a collection of 702 papers on FCA, including 103 papers on FCA and IR (2003-2009).
– How can FCA support IR activities including but not limited to query analysis, document representation, text classification and clustering, social network mining, access to semantic web data, and ontology engineering?
– How can FCA be extended to address a wider range of IR activities, possibly
including new retrieval tasks?
2 Introduction to FCA
Even though many disciplines can be dated back to Aristotle's time, closer prolegomena of FCA can be found, for example, in the Logic of Port-Royal (1662) [8], an old philosophical concept logic, where a concept was treated as a pair of its extent and its intent (yet without a formal mathematical apparatus).
Being a part of lattice theory, concept lattices are deeply rooted in the works of Dedekind, Birkhoff [9] (Galois connections and "polarities"), and Ore [10] (Galois connections), and, later, of Barbut & Monjardet [11] (treillis de Galois, i.e. Galois lattices).
In fact, the underlying structure, the Galois connection, has a strong impact on Data Analysis [12,13,14,15].
In this section, we mainly reproduce basic definitions from Ganter & Wille's book on Formal Concept Analysis [3]. One can also find good introductory material, more focused on partial orders and lattices, in the book by Davey and Priestley [16]. An IR-oriented reader may also find the following books interesting and helpful [15,17].
1 http://bit.ly/RuSSIR2014FCAtut
There were several good tutorials with notes in the past, for example, a basic one [18] and a more theoretical one with algorithmic aspects [19].
We also refer the reader to some online materials that might be suitable for self-study purposes 2,3,4.
A short section summary:
A binary relation R on a set P is a partial order if for all a, b, c ∈ P:
1. aRa (reflexivity)
2. aRb and a ≠ b ⟹ not bRa (antisymmetry)
3. aRb and bRc ⟹ aRc (transitivity)
Every finite poset (P, ≤) can be depicted as a line diagram (many authors call it a Hasse diagram). Elements of P are represented by small circles in the plane. If a ≤ b, the circle corresponding to a is depicted lower than the circle corresponding to b, and, when a is a lower neighbour of b, the two circles are connected by a line segment. One can check whether a ≤ b by verifying that there is an ascending path from a to b in the diagram.
[Table: the cross-table of the relation ≤ on P = {a, b, c, d, e}. Pictures: the graph of P; the line diagram of P.]
For A = {x, y} we write x ∧ y for inf A and x ∨ y for sup A. Infimum and supremum are also called meet and join.
Example 2. In Fig. 2 there are the line diagrams of the poset P, which is not a lattice, and the lattice L. It is interesting that P has its largest and smallest elements, 1P and 0P; the pair of its elements s and t has its infimum, s ∧ t = 0P, but there is no supremum for it. In fact, {s, t} does not have a smallest element in the set of all its upper bounds.
[Left diagram: 1P, q, p, t, s, 0P = s ∧ t. Right diagram: 1L, v ∨ w, v, w, v ∧ w, 0L.]
Fig. 2. The line diagrams of the order which is not a lattice (left) and the order which is a lattice (right)
Fig. 3. The line diagram of the concept lattice for the context of geometric figures; its top concept is ({1, 2, 3, 4}, ∅) and its bottom concept is (∅, M)
Definition 12. For every two formal concepts (A1, B1) and (A2, B2) of a certain formal context, their greatest common subconcept is defined as follows:

(A1, B1) ∧ (A2, B2) = (A1 ∩ A2, (B1 ∪ B2)″).

Dually, their least common superconcept is (A1, B1) ∨ (A2, B2) = ((A1 ∪ A2)″, B1 ∩ B2).
The Basic Theorem of Formal Concept Analysis below defines not only the supremum and infimum of arbitrary sets of concepts; it also answers the question whether concept lattices have any special properties. In fact, the answer is "no": every concept lattice is a complete lattice and, conversely, every complete lattice is isomorphic to the concept lattice of some formal context. That is, one can compose a formal context with objects G, attributes M and binary relation I ⊆ G × M such that the original complete lattice is isomorphic to B(G, M, I). The theorem does not say how such a context can be built, but rather describes all possibilities to do this.
∨_{j∈J} (Aj, Bj) = ((∪_{j∈J} Aj)″, ∩_{j∈J} Bj).
An interested reader may refer to Ganter & Wille's book on FCA [3] for further details and examples.
One can obtain the whole set of concepts of a particular context K simply by definition, i.e. it is enough to enumerate all subsets of objects A ⊆ G (or attributes B ⊆ M) and apply the derivation operators to them. For example, for the context from Example 3 and the empty set of objects, A = ∅, one obtains A′ = ∅′ = {a, b, c, d} = M, and, applying (·)′ a second time, A″ = M′ = ∅. Thus, the resulting concept is (A″, A′) = (∅, M).
1. B(G, M, I) := ∅.
2. For every subset A ⊆ G, add (A″, A′) to B(G, M, I).
3. Return B(G, M, I).
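To make this concrete, below is a minimal Python sketch of the naïve procedure; the helper names are ours, and the incidence of the geometric figures context is reconstructed from the derivations used later in the text.

from itertools import combinations

# The context of geometric figures: objects 1..4, attributes a..d
# (incidence reconstructed from the derivations in this tutorial).
G = [1, 2, 3, 4]
M = ['a', 'b', 'c', 'd']
I = {(1, 'a'), (1, 'd'), (2, 'a'), (2, 'c'),
     (3, 'b'), (3, 'c'), (4, 'b'), (4, 'c'), (4, 'd')}

def attr_prime(A):
    """A' : all attributes shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def obj_prime(B):
    """B' : all objects possessing every attribute in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def naive_concepts():
    """Enumerate every A ⊆ G and collect the distinct concepts (A'', A')."""
    concepts = set()
    for k in range(len(G) + 1):
        for A in combinations(G, k):
            B = attr_prime(A)                 # A'
            concepts.add((obj_prime(B), B))   # (A'', A')
    return concepts

for extent, intent in sorted(naive_concepts(), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))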
Since the total number of formal concepts is equal to 2^min(|G|,|M|) in the worst case, this naïve approach is quite inefficient even for small contexts. However, let us assume for now that we know how to find concepts, and we are going to build the line diagram of a concept lattice.
1. Draw a rather small circle for each formal concept such that the circle for a concept is always depicted higher than all circles for its subconcepts.
2. Connect each circle with the circles of its lower neighbors.
Definition 14. Let (G, M, I) be a formal context, then for each object g ∈ G there is the object concept ({g}″, {g}′) and for each attribute m ∈ M the attribute concept is given by ({m}′, {m}″).
So, once one has finished drawing a line diagram for some concept lattice, it is possible to label the diagram with attribute names: one needs to attach the attribute m to the circle representing the concept ({m}′, {m}″). Similarly for labeling by object names: one needs to attach each object g to the circle representing the concept ({g}″, {g}′). An example of such reduced labeling is given in Fig. 5.
Fig. 5. The line diagram of the concept lattice for the context of geometric figures with reduced labeling
the current concept is generated for the first time (canonicity test). Thus, Ganter's NextClosure algorithm does not refer to the list of generated concepts and uses little storage space.
Since the extent of a concept defines its intent in a unique way, to obtain
the set of all formal concepts, it is enough to find closures either of subsets of
objects or subsets of attributes.
We assume that there is a linear order (<) on G. The algorithm starts by examining the set consisting of the object maximal with respect to < (max(G)), and finishes when the canonically generated closure is equal to G. Let A be the currently examined subset of G. The generation of A″ is considered canonical if A″ \ A contains no g < max(A). If the generation of A″ is canonical (and A″ is not equal to G), the next set to be examined is obtained from A″ as follows:

A″ ∪ {g} \ {h | h ∈ A″ ∧ g < h}, where g = max({h | h ∈ G \ A″}).

Otherwise, the set examined at the next step is obtained from A in a similar way, but the added object must be less (w.r.t. <) than the maximal object in A:
Algorithm 1 NextClosure
Input: K = (G, M, I) is a context
Output: L is the concept set
1: L := ∅, A := ∅, g := max(G)
2: while A ≠ G do
3: A := A ∪ {g} \ {h | h ∈ A ∧ g < h}
4: if {h | h ∈ A″ \ A ∧ h < g} = ∅ then
5: L := L ∪ {(A″, A′)}
6: g := max({h | h ∈ G \ A″})
7: A := A″
8: else
9: g := max({h | h ∈ G \ A ∧ h < g})
10: end if
11: end while
12: return L
The NextClosure algorithm produces the set of all concepts in time O(|G|²|M||L|) and has polynomial delay O(|G|²|M|).
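For comparison, here is a compact Python rendition of NextClosure in its standard textbook form, iterating over closed sets of objects; this is a sketch under our naming, not the author's reference implementation.

G = [1, 2, 3, 4]
M = ['a', 'b', 'c', 'd']
I = {(1, 'a'), (1, 'd'), (2, 'a'), (2, 'c'),
     (3, 'b'), (3, 'c'), (4, 'b'), (4, 'c'), (4, 'd')}

def attr_prime(A):
    return {m for m in M if all((g, m) in I for g in A)}

def closure(A):
    """A'': the objects sharing all attributes common to A."""
    B = attr_prime(A)
    return {g for g in G if all((g, m) in I for m in B)}

def next_closure(A):
    """The lectically next closed set of objects after A, or None."""
    for g in sorted(G, reverse=True):
        if g in A:
            A = A - {g}
        else:
            B = closure(A | {g})
            # canonicity test: the closure may not add objects smaller than g
            if not any(h < g and h not in A for h in B):
                return B
    return None

A = closure(set())
while A is not None:
    print(sorted(A), sorted(attr_prime(A)))   # the concept (A'', A')
    A = next_closure(A)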
We provide a simple recursive version of CbO. The algorithm generates concepts according to the lectic (lexicographic) order on the subsets of G (concepts whose extents are lectically smaller are generated first). By definition, A is lectically less than B if A ⊆ B, or B ⊄ A and min((A ∪ B) \ (A ∩ B)) ∈ A. Note that the NextClosure algorithm computes concepts in a different lectic order: A is lectically less than B if min((A ∪ B) \ (A ∩ B)) ∈ B. The order in which concepts are generated by CbO is beneficial when the line diagram is constructed: the first generation of a concept is always canonical, which makes it possible to find a concept in the tree and to draw the appropriate diagram edges. The NextClosure-like lectic order allows binary search, which is helpful when the diagram graph has to be generated after the generation of all concepts.
The time complexity of Close by One (CbO) is O(|G|²|M||L|), and its polynomial delay is O(|G|³|M|).
The generation protocol of CbO in a tree-like form is given in Fig. 6. Each closed set of objects (extent) can be read from the tree by following the path from the root to the corresponding node. The square bracket ] means that the first prime operator has been applied after the addition of the lectically next object g to the set A of the parent node, and the bracket ) shows which objects have been added after the application of the second prime operator, i.e. between ] and ) one can find (A ∪ {g})″ \ (A ∪ {g}). A non-canonical generation can be identified by simply checking whether there is an object between ] and ) that is less than g w.r.t. <.
Algorithm 3 Process(A, g, (C, D)) with C = A″ and D = A′, where < is the lexical order on object names
Input: K = (G, M, I) is a context
Output: L is the concept set
1: if {h | h ∈ C \ A ∧ h < g} = ∅ then
2: L := L ∪ {(C, D)}
3: for all f ∈ {h | h ∈ G \ A ∧ g < h} do
4: Z := C ∪ {f}
5: Y := D ∩ {f}′
6: X := Y′
7: Process(Z, f, (X, Y))
8: end for
9: end if
One can note that the traversal of the generation tree is done in a depth-first search manner.
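A Python sketch mirroring Algorithm 3, with the usual pruning of non-canonical branches (the context is defined inline; names are ours):

G = [1, 2, 3, 4]
M = ['a', 'b', 'c', 'd']
I = {(1, 'a'), (1, 'd'), (2, 'a'), (2, 'c'),
     (3, 'b'), (3, 'c'), (4, 'b'), (4, 'c'), (4, 'd')}

def a_prime(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def o_prime(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def cbo():
    L = []
    def process(A, g, C, D):
        # C = A'', D = A'; canonical iff no object smaller than g entered C
        if any(h < g for h in C - A):
            return                          # prune the non-canonical branch
        L.append((C, D))
        for f in sorted(h for h in set(G) - set(C) if h > g):
            Z = C | {f}
            Y = D & a_prime({f})            # (C ∪ {f})'
            X = o_prime(Y)                  # (C ∪ {f})''
            process(Z, f, X, Y)
    root = frozenset()                      # start from the closure of ∅
    process(root, 0, o_prime(a_prime(root)), a_prime(root))
    return L

for C, D in cbo():
    print(sorted(C), sorted(D))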
Fig. 6. The tree of the CbO protocol for the context of geometric figures. Non-canonical generations (e.g. 3]4) and 4]3)) are drawn in boxes.
After the inception of the first batch algorithms, the broadened FCA inventory has come to include efficient incremental algorithms [21] and distributed versions of NextClosure and CbO for MapReduce [22,23].
m(G) ⊆ Gm. The objects of a scale are called scale values, the attributes are called scale attributes.
Nominal scale is defined by the context (Wm , Wm , =).
This type of scaling is suitable for binary representation of nominal (cate-
gorical) attributes like color. For the context of university subjects, the subjects
can be scaled by nominal scaling as below.
 =     Math  CS  DM
Math    ×
CS            ×
DM                ×
A particular case of nominal scaling is the so-called dichotomic scaling, which is suitable for attributes with two mutually exclusive values like "yes" and "no". In our example, the attribute Gender can be scaled in this way.

    M   F
M   ×
F       ×
The ordinal scale for the attribute Mark:

     ≤7  ≤8  ≤9  ≤10
7    ×   ×   ×   ×
8        ×   ×   ×
9            ×   ×
10               ×

The ordinal scale for the attribute Age:

     ≤19  ≤20  ≤21
19   ×    ×    ×
20        ×    ×
21             ×
     ≤7  ≤8  ≤9  ≤10  ≥7  ≥8  ≥9  ≥10
7    ×   ×   ×   ×    ×
8        ×   ×   ×    ×   ×
9            ×   ×    ×   ×   ×
10               ×    ×   ×   ×   ×
≠    1   2   3   4
1        ×   ×   ×
2    ×       ×   ×
3    ×   ×       ×
4    ×   ×   ×
In fact, this type of context gives rise to 2^n formal concepts and can be used for testing purposes.
The resulting scaled (or plain) context for our university subjects example is below. Note that the Mark attribute is scaled by the interordinal scale.

     M  F  ≤19 ≤20 ≤21  Math CS DM  ≤7 ≤8 ≤9 ≤10  ≥7 ≥8 ≥9 ≥10
1    ×     ×   ×   ×    ×               ×  ×  ×   ×  ×
2       ×      ×   ×         ×             ×  ×   ×  ×  ×
3       ×  ×   ×   ×    ×           ×   ×  ×  ×   ×
4    ×         ×   ×         ×                ×   ×  ×  ×  ×
5       ×          ×            ×          ×  ×   ×  ×  ×
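The scaling above is mechanical, so a small Python sketch may be helpful; the raw student data are taken from the running university subjects example, and all helper names are ours.

students = {
    1: ('M', 19, 8, 'Math'),
    2: ('F', 20, 9, 'CS'),
    3: ('F', 19, 7, 'Math'),
    4: ('M', 20, 10, 'CS'),
    5: ('F', 21, 9, 'DM'),
}

def scale(students):
    """Dichotomic scaling of Gender, nominal of Subject, ordinal of Age,
    interordinal of Mark; returns object -> set of plain attributes."""
    plain = {}
    for g, (gender, age, mark, subject) in students.items():
        attrs = {gender, subject}
        attrs |= {f'Age<={a}' for a in (19, 20, 21) if age <= a}
        attrs |= {f'Mark<={m}' for m in (7, 8, 9, 10) if mark <= m}
        attrs |= {f'Mark>={m}' for m in (7, 8, 9, 10) if mark >= m}
        plain[g] = attrs
    return plain

for g, attrs in scale(students).items():
    print(g, sorted(attrs))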
X → Y and Y ∪ Z → W imply X ∪ Z → W (pseudotransitivity).
An inference axiom is a rule that states if certain implications are valid in
the context, then certain other implications are valid.
Example 5. Let us check that the first and second Armstrong axioms hold for implications over attributes.
Since X′ ⊆ X′, it is always true that X → X.
For the second rule we have X′ ⊆ Y′. Applying property 4 from Proposition 1 we have (X ∪ Z)′ = X′ ∩ Z′. Since X′ ∩ Z′ ⊆ X′, we obtain (X ∪ Z)′ = X′ ∩ Z′ ⊆ Y′. This implies X ∪ Z → Y.
Example 6. For the context of geometric figures one may check that b is a minimal nontrivial generator for bc. The set ab is a minimal nontrivial generator for abcd, while abc, abd, and acd are its non-minimal nontrivial generators.
Exercise 8. For the context of geometric figures find all minimal generators and
obtain its generator implication cover.
B      B′      B″      Is B a pseudo-intent?
∅ 1234 ∅ No, it’s not.
a 12 a No, it’s not.
b 34 bc Yes, it is.
c 234 c No, it’s not.
d 14 d No, it’s not.
ab ∅ abcd No, it’s not.
ac 2 ac No, it’s not.
ad 1 ad No, it’s not.
bc 34 bc No, it’s not.
bd 4 bcd No, it’s not.
cd 4 bcd Yes, it is.
abc ∅ abcd Yes, it is.
abd ∅ abcd No, it’s not.
acd ∅ abcd No, it’s not.
bcd 4 bcd No, it’s not.
abcd ∅ abcd No, it’s not.
Example 7. Let us find all pseudo-intents for the context of geometric figures. We build a table (Table 3) with B, B′ and B″; it is clear that no closed set is a pseudo-intent, by definition. Since we have to check the containment of already generated pseudo-intents in a candidate recursively, we should start with the smallest possible set, i.e. ∅.
Thus, {b} is the first non-closed set in our table, and the second part of the pseudo-intent definition is fulfilled trivially: there is no other pseudo-intent contained in {b}. So, the whole set of pseudo-intents is {b, cd, abc}.
Exercise 9. Write down the Duquenne-Guigues base for the context of geometric figures. Using the Armstrong rules and the obtained Duquenne-Guigues base, deduce the rest of the implications of the original context.
For a recent efficient algorithm for finding the Duquenne-Guigues base, see [27].
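For checking such deductions, implications can be applied mechanically; below is a small Python sketch computing the closure of an attribute set under a set of implications (the base shown follows from the pseudo-intents found above).

def closure(X, implications):
    """The smallest superset of X respecting every implication A -> B."""
    X, changed = set(X), True
    while changed:
        changed = False
        for A, B in implications:
            if A <= X and not B <= X:
                X |= B
                changed = True
    return X

def follows(A, B, implications):
    """A -> B follows from the implications iff B lies inside closure(A)."""
    return set(B) <= closure(A, implications)

# Duquenne-Guigues base of the geometric figures context
# (each pseudo-intent P implies P'' \ P):
dg_base = [({'b'}, {'c'}), ({'c', 'd'}, {'b'}), ({'a', 'b', 'c'}, {'d'})]
print(follows({'a', 'b'}, {'c', 'd'}, dg_base))   # True: ab -> abcd holds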
Example 8. For the example given in Table 2 the following functional dependencies hold: Age → Subject, Subject → Age, Mark → Gender.
The first two functional dependencies may make sense since students of the same year may study the same subjects. However, the last one says that Gender is functionally dependent on Mark; it looks like a pure coincidence caused by the small size of the dataset.
The reduction of functional dependencies to implications is as follows:

Proposition 3. For a many-valued context (G, M, W, I), one defines the context K_N := (P2(G), M, I_N), where P2(G) is the set of all pairs of different objects from G and I_N is defined by

({g, h}, m) ∈ I_N :⇔ m(g) = m(h).
        Gender  Age  Mark  Subject
{1,2}
{1,3}           ×          ×
{1,4}   ×
{1,5}
{2,3}   ×
{2,4}           ×          ×
{2,5}   ×            ×
{3,4}
{3,5}   ×
{4,5}
One may check that the following implications hold: Age → Subject, Subject → Age, Mark → Gender; these are exactly the functional dependencies that we saw in Example 8.
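A Python sketch of this reduction (the object data are reconstructed from the running example; names are ours):

from itertools import combinations

rows = {
    1: {'Gender': 'M', 'Age': 19, 'Mark': 8, 'Subject': 'Math'},
    2: {'Gender': 'F', 'Age': 20, 'Mark': 9, 'Subject': 'CS'},
    3: {'Gender': 'F', 'Age': 19, 'Mark': 7, 'Subject': 'Math'},
    4: {'Gender': 'M', 'Age': 20, 'Mark': 10, 'Subject': 'CS'},
    5: {'Gender': 'F', 'Age': 21, 'Mark': 9, 'Subject': 'DM'},
}

def fd_context(rows):
    """K_N: the pair {g, h} gets attribute m iff m(g) = m(h)."""
    attrs = next(iter(rows.values())).keys()
    return {(g, h): {m for m in attrs if rows[g][m] == rows[h][m]}
            for g, h in combinations(sorted(rows), 2)}

for pair, common in fd_context(rows).items():
    print(pair, sorted(common))
# The implications of K_N are exactly the functional dependencies
# of the original many-valued context, e.g. Age -> Subject.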
Example 10. To fulfill the reduction in the opposite direction, one may build the corresponding many-valued context in the following way:
1. Replace all "×" by 0s.
2. In each row, replace the empty cells by the row number (numbering starts from 1).
3. Add a new row filled with 0s.
    a  b  c  d
1   0  1  1  0
2   0  2  0  2
3   3  0  0  3
4   4  0  0  0
5   0  0  0  0
Exercise 10. Check that the functional dependencies of the obtained many-valued context coincide with the implications of the context of geometric figures.
– Software for FCA: Concept Explorer, Lattice Miner, ToscanaJ, Galicia, FCART, etc.
– Exercises.
– Context editing (tab-separated and csv input file formats are supported as well);
– Line diagram drawing (with export as image snapshots and even as text files with node positions, edges and attribute names, but vector-based formats are not supported);
– Finding the Duquenne-Guigues base of implications;
– Finding the base of association rules that are valid in a formal context;
– Performing attribute exploration.
It is important to note that the resulting diagram is not static, and one may perform exploratory analysis in an interactive manner, selecting interesting nodes, moving them, etc. In Fig. 7, the line diagram of the concept lattice of the interordinal scale for the attribute Mark drawn by ConExp is shown. See more details in [31].
There is an attempt to reincarnate ConExp 6 with modern open software tools.
5 http://conexp.sourceforge.net/
6 https://github.com/fcatools/conexp-ng/wiki
Fig. 7. The line diagram of the concept lattice for the interordinal scale of student
marks drawn by ConExp.
Fig. 8. The nested line diagram for the two one-attribute subcontexts of the context of university subjects. The outer diagram is for the Gender attribute, and the inner one is for Subject.
Lattice Miner. This is another attempt to bring basic FCA functionality and several specific features to the FCA community 10 [36].
The initial objective of the tool was "to focus on visualization mechanisms for the representation of concept lattices, including nested line diagrams" 11. Thus, its most interesting feature is multi-level nested line diagrams, which can help to explore comparatively large lattices.
10 http://sourceforge.net/projects/lattice-miner/
11 https://en.wikipedia.org/wiki/Lattice_Miner
Fig. 9. The line diagram of the concept lattice for the context of university subjects drawn by Galicia. [Its nodes are labeled with intents and extents, e.g. I = {Gender=M}, E = {obj_0, obj_3}; I = {Age=20, Subject=CS}, E = {obj_1, obj_3}; I = {Age=19, Subject=Math}, E = {obj_0, obj_2}; I = {Gender=F}, E = {obj_1, obj_2, obj_4}; plus the five object concepts of obj_0, ..., obj_4.]
FCART. Many different tools have been created, and some of the projects are not developing anymore while the software is still available; an interested reader can refer to Uta Priss's webpage to find dozens of tools 14. However, new challenges such as handling large heterogeneous datasets (large text collections, social networks and media, etc.) are coming, and the community, which put a lot of effort into the development of truly cross-platform and open software, needs a new wave of tools that adopt modern technologies and formats.

12 http://fcastone.sourceforge.net/
13 https://code.google.com/p/openfca/
14 http://www.fcahome.org.uk/fcasoftware.html
Inspired by the successful application of FCA-based technologies in text mining for the criminology domain [37], a tool named Formal Concept Analysis Research Toolbox (FCART) is being developed in the Laboratory for Intelligent Systems and Structural Analysis.
FCART follows a methodology from [38] to formalise iterative ontology-
driven data analysis process and to implement several basic principles:
1. Iterative process of data analysis using ontology-driven queries and interac-
tive artifacts such as concept lattice, clusters, etc.
2. Separation of processes of data querying (from various data sources), data
preprocessing (via local immutable snapshots), data analysis (in interactive
visualizers of immutable analytic artifacts), and results presentation (in a
report editor).
3. Three-level extendability: settings customisation for data access components,
query builders, solvers and visualizers; writing scripts or macros; developing
components (add-ins).
4. Explicit definition of analytic artifacts and their types, which enables in-
tegrity of session data and links artifacts for the end-users.
5. Availability of integrated performance estimation tools.
6. Integrated documentation for software tools and methods of data analysis.
Originally, it was yet another FCA-based "integrated environment for knowledge and data engineers with a set of research tools based on Formal Concept Analysis" [39,40], additionally featuring work with unstructured data (including texts with various metadata) and Pattern Structures [41]. In its current distributed version, FCART consists of the following parts:
1. AuthServer for authentication and authorisation.
2. Intermediate Data Storage (IDS) for storage and preprocessing of big datasets.
3. Thick Client for interactive data processing and visualisation in integrated
graphical multi-document user interface.
4. Web-based solvers for implementing independent resource-intensive compu-
tations.
The workflow is shown in Fig. 10.
The main questions are the following: Does the product have only technological advantages, or does it really have a fruitful methodology? Can it become open in addition to being extendable? Can it finally handle big volumes of heterogeneous data in a way suitable for an FCART analyst? The answers to these questions seem to be the forthcoming challenging steps.
CryptoLatt. This tool 15 was developed to help students and researchers from neighbouring domains (e.g., Data Mining) to recognise cryptomorphisms in lattice-based problems, i.e. to realise that a particular problem in one domain is "isomorphic" to some other one in terms of lattice theory [42]. Thus, one of the well-known
15 http://www.cs.unic.ac.cy/florent/software.htm
4 Interaction between the user and external data storages (except local files) goes through the IDS.
AuthServer and Intermediate Data Storage. The IDS is responsible for storing and preprocessing chunks of data from various data sources in a common format (JSON collections). In the current release the IDS provides a Web API and uses AuthServer for OAuth 2.0 authorization. There are many data storages which contain petabytes of collected data; for analyzing such Big Data, an analyst cannot use any of the tools mentioned above. FCART provides different scenarios for working with local and external data storages. In the previous release of FCART, an analyst could work only with quite small external datasets because all data from external storages were converted into JSON files and saved into the IDS. In the current release, we have improved the strategies for working with external data. Now an analyst can choose between loading all data from an external data storage into the IDS and accessing external data in chunks using a paging mechanism. An FCART analyst should specify the way of accessing external data on the property page of the data generator (EDQD).
– Frequent Itemset Mining and Association Rules: FCA did it even earlier [43,44]
– Multimodal clustering (biclustering and triclustering) [45,46,47]
– FCA in Classification: JSM-method, version spaces∗, and decision trees∗ [48]
– Pattern Structures for data with complex descriptions [49,50]
– FCA-based Boolean Matrix Factorisation [51]
– Educational Data Mining case study [52]
– Exercises with JSM-method in QuDA (Qualitative Data Analysis): solving
classification task [53]
Data Mining. The original problem for association rule mining is market basket analysis. In the early 90s, when the level of technology made it possible to store large amounts of transactions of purchased items, companies started their attempts to use these data to facilitate their typical business decisions concerning "what to put on sale, how to design coupons, how to place merchandise on shelves in order to maximize the profit" [55]. So, this market basket analysis problem was first formalised in [55] as the task of finding items frequently bought together, in the form of rules "if a customer buys items A, (s)he also buys items B". One of the first and rather efficient algorithms of that period, namely Apriori, was proposed in [43]. From the very beginning these rules were tolerant to some number of exceptions; they were not strict, unlike implications in FCA. However, several years before, in [44], Michael Luxenburger had introduced partial implications, motivated by a more general problem statement, "a generalisation of the theory of implications between attributes to partial implications", since "in data analysis the user is not only interested in (global) implications, but also in 'implications with a few exceptions'". The author proposed a theoretical treatment of the problem in terms of Formal Concept Analysis, guided by the idea of a characterisation of "sets of partial implications which arise from real data" and "a possibility of an 'exploration' of partial implications by a computer". In addition, he proposed a minimal base of partial implications, known as the Luxenburger base of association rules.
supp(A → B) = |(A ∪ B)′| / |G|.

conf(A → B) = |(A ∪ B)′| / |A′|.

The value conf(A → B) shows which proportion of the objects possessing A also contain A ∪ B. Confidence is often given in %.
     Cakes  Chips  Müsli  Milk  Beer
c1    ×                          ×
c2           ×      ×      ×
c3    ×             ×      ×     ×
c4           ×      ×      ×     ×
c5    ×      ×      ×            ×
The main task of association rule mining is formulated as follows: find all association rules of a context whose support and confidence are greater than the predefined thresholds, min-confidence and min-support, denoted min_conf and min_supp, respectively [55].
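Both measures are straightforward to compute; below is a Python sketch for the customers context, with the incidence as reconstructed in the table above (so the concrete numbers are only as reliable as that reconstruction):

D = [
    {'Cakes', 'Beer'},                      # c1
    {'Chips', 'Müsli', 'Milk'},             # c2
    {'Cakes', 'Müsli', 'Milk', 'Beer'},     # c3
    {'Chips', 'Müsli', 'Milk', 'Beer'},     # c4
    {'Cakes', 'Chips', 'Müsli', 'Beer'},    # c5
]

def supp(A, D):
    """The share of transactions containing all items of A."""
    return sum(A <= t for t in D) / len(D)

def conf(A, B, D):
    """The share of transactions with A that also contain B."""
    return sum(A | B <= t for t in D) / sum(A <= t for t in D)

print(supp({'Müsli', 'Beer'}, D))     # 0.6
print(conf({'Cakes'}, {'Beer'}, D))   # 1.0, i.e. an implication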
Proposition 5. (Association rules and implications)
Let K be a context; then its association rules under the conditions min_supp = 0% and min_conf = 100% are the implications of the same context.
Sometimes an association rule is written as A −(c, s)→ B, where c and s are the confidence and the support of the given rule.
The two main steps of association rule mining are given below:
1. Find all frequent itemsets, i.e. the itemsets whose support is not less than min_supp.
2. From every found frequent itemset, generate the association rules whose confidence is not less than min_conf.
The first step is the most expensive; the second one is rather trivial.
The well-known Apriori algorithm for frequent itemset mining [43] uses the antimonotonicity property to ease frequent itemset enumeration.

Property 1. (Antimonotonicity) For all A, B ⊆ M: A ⊆ B ⇒ supp(B) ≤ supp(A).
This property implies the following facts:
– The larger the itemset, the smaller its support (or its support remains the same);
– The support of any itemset is not greater than the minimal support of any of its subsets;
– An itemset of size n can be frequent only if all its (n − 1)-subsets are frequent.
Algorithm 5 AprioriGen(Fi)
Input: Fi is the set of frequent i-itemsets
Output: Ci+1 is the set of candidate (i + 1)-itemsets
1: insert into Ci+1 {union step}
2: select p[1], p[2], . . . , p[i], q[i]
3: from Fi.p, Fi.q
4: where p[1] = q[1], . . . , p[i − 1] = q[i − 1], p[i] < q[i]
5: for all c ∈ Ci+1 do {elimination step}
6: S ← the set of all i-subsets of c
7: for all s ∈ S do
8: if s ∉ Fi then
9: Ci+1 ← Ci+1 \ {c}
10: end if
11: end for
12: end for
13: return Ci+1
Example 12. Union and elimination steps of AprioriGen for a certain context:
– The set of frequent 3-itemsets: F3 = {{a, b, c}, {a, b, d}, {a, c, d}, {a, c, e}, {b, c, d}}.
– The set of candidate 4-itemsets (union step): C4 = {{a, b, c, d}, {a, c, d, e}}.
– The remaining candidate set is C4 = {{a, b, c, d}}, since {a, c, d, e} is eliminated because {c, d, e} ∉ F3 (elimination step).
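Both steps translate directly into Python; the sketch below reproduces Example 12 (function names are ours):

def apriori_gen(F):
    """Candidate (i+1)-itemsets from the set F of frequent i-itemsets."""
    F = {frozenset(s) for s in F}
    ordered = sorted(tuple(sorted(s)) for s in F)
    # union step: join pairs agreeing everywhere except the last item
    cands = {frozenset(p) | {q[-1]}
             for p in ordered for q in ordered
             if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # elimination step: keep c only if all its i-subsets are frequent
    return {c for c in cands if all(c - {x} in F for x in c)}

F3 = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'c', 'd'},
      {'a', 'c', 'e'}, {'b', 'c', 'd'}]
print(apriori_gen(F3))   # {frozenset({'a', 'b', 'c', 'd'})}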
conf(f → F \ f) = supp(F)/supp(f) ≥ min_conf, where f ⊂ F.

Property 2. The confidence conf(f → F \ f) = supp(F)/supp(f) is minimal when supp(f) is maximal.
Exercise 14. Find all frequent itemsets for the customers context with the Apriori algorithm and min_supp = 1/3.
One may check that the lattices, whose diagrams are depicted in Fig. 11, are
isomorphic.
[The left diagram shows the closed itemsets with their supports: ∅(5), c(4), e(4), bc(3), ce(3), ae(3), cd(3), bcd(2), bce(2), ace(2), cde(2), abcde(0).]
Fig. 11. The line diagrams of the lattice of closed itemsets (left, with support given in parentheses) and the concept lattice for the customers context (right, with reduced labeling)
The set of all frequent concepts of the context K for the threshold min_supp is also known as the "iceberg concept lattice" [59]; mathematically, it corresponds to an order filter of the concept lattice. However, the idea of using the size of a concept's extent or intent (or even their various combinations) as a concept quality measure is not new in FCA [60].
Of course, the application domain is not restricted to market basket analysis; for instance, the line diagram built in ConExp (Fig. 12) shows the 25 largest concepts of the context of HSE website visitors in terms of the news websites they visited in 2006.
For real datasets, association rule mining usually results in a large number of rules. However, not all rules are necessary to present the information. A similar compact representation can be used here: one can represent all valid association rules by a subset called a base. For example, the Luxenburger base is a set of association rules in the form
Fig. 12. The line diagram of the 25 largest concepts for the context of HSE web users
The rest of the rules and their support and confidence can be derived by some calculus, which is usually not evident from the base definition.
Exercise 15. 1. Find the Luxenburger base for the customers context with min_supp = 1/3 and min_conf = 1/2. 2. Check whether Concept Explorer generates the association rule base (for the same context) that consists of the Duquenne-Guigues base and the Luxenburger base.
One of the first algorithms explicitly designed to compute frequent closed itemsets is Close [56]. Inspired by Apriori, it traverses the database in a level-wise manner and generates frequent closed itemsets by computing the closures of all minimal generators. See a more detailed survey in [61].
It is interesting that, after years of co-existence, one of the Apriori authors has started to apply FCA in text mining [62]. FCA is also included in textbooks on data mining; see Chapters 8 and 9 in [63].
Another interesting subdomain of frequent pattern mining, where lattice-based methods are successfully used, is so-called sequential pattern mining [64,41].
For example, consider a bicluster definition from [70]. The Bi-Max algorithm described in [70] constructs inclusion-maximal biclusters defined as follows:
Definition 27. Given m genes, n situations and a binary table e such that e_ij = 1 (gene i is active in situation j) or e_ij = 0 (gene i is not active in situation j) for all i ∈ [1, m] and j ∈ [1, n], the pair (G, C) ∈ 2^{1,...,m} × 2^{1,...,n} is called an inclusion-maximal bicluster if and only if (1) ∀i ∈ G, j ∈ C: e_ij = 1 and (2) ∄(G1, C1) ∈ 2^{1,...,m} × 2^{1,...,n} with (a) ∀i1 ∈ G1, ∀j1 ∈ C1: e_{i1 j1} = 1 and (b) G ⊆ G1 ∧ C ⊆ C1 ∧ (G1, C1) ≠ (G, C).
Proposition 8. For every pair (G, C), G ⊆ H, C ⊆ S the following two state-
ments are equivalent.
1. (G, C) is an inclusion-maximal bicluster of the table e;
2. (G, C) is a formal concept of the context (H, S, E).
Proposition 9. 1. 0 ≤ ρ ≤ 1.
2. An OA-bicluster (m′, g′) is a formal concept iff ρ = 1.
3. If (m′, g′) is an OA-bicluster, then (g″, g′) ≤ (m′, m″).

Exercise 17. a. Check that properties 1 and 2 from Proposition 9 follow directly from the definitions. b. Use the antimonotonicity of (·)′ to prove 3.
In Fig. 13 one can see the structure of the OA-bicluster for a particular pair (g, m) ∈ I of a certain context (G, M, I). In general, only the regions (g″, g′) and (m′, m″) are full of non-empty pairs, i.e. have maximal density ρ = 1, since they are object and attribute formal concepts, respectively. The several black cells indicate the non-empty pairs which one may find in such a bicluster. It is quite clear that the density parameter ρ is a natural bicluster quality measure showing how many non-empty pairs the bicluster contains.
[Fig. 13: the structure of the OA-bicluster, with labeled regions g′, m′, g″, m″.]
Proposition 10. The constraint ρ(A, B) ≥ ρ_min is neither monotonic nor anti-monotonic w.r.t. the ⊑ relation.
g1 ×××××
g2 ××××
g3 × ××
g4 × ××
g5 × ××
The number of OA-biclusters of a context can be much smaller than the number of formal concepts (which may be exponential in |G| + |M|), as stated by the following propositions.

Proposition 12. For a given formal context K = (G, M, I) and ρ_min = 0, the largest number of OA-biclusters is equal to |I|, and all OA-biclusters can be generated in time O(|I| · (|G| + |M|)).

Proposition 13. For a given formal context K = (G, M, I) and ρ_min > 0, the largest number of OA-biclusters is equal to |I|, and all OA-biclusters can be generated in time O(|I| · |G| · |M|).
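A direct Python sketch of this generation scheme: one OA-bicluster (m′, g′) per pair (g, m) ∈ I, with duplicates removed and the density filter applied (names are ours; the geometric figures context serves as toy data):

def oa_biclusters(G, M, I, rho_min=0.0):
    """Yield the distinct OA-biclusters (m', g') with density >= rho_min."""
    g_prime = {g: frozenset(m for m in M if (g, m) in I) for g in G}
    m_prime = {m: frozenset(g for g in G if (g, m) in I) for m in M}
    seen = set()
    for g, m in I:
        ext, intent = m_prime[m], g_prime[g]
        if (ext, intent) in seen:
            continue
        seen.add((ext, intent))
        rho = (sum((h, n) in I for h in ext for n in intent)
               / (len(ext) * len(intent)))
        if rho >= rho_min:
            yield ext, intent, rho

G = [1, 2, 3, 4]
M = ['a', 'b', 'c', 'd']
I = {(1, 'a'), (1, 'd'), (2, 'a'), (2, 'c'),
     (3, 'b'), (3, 'c'), (4, 'b'), (4, 'c'), (4, 'd')}
for ext, intent, rho in oa_biclusters(G, M, I, rho_min=0.5):
    print(sorted(ext), sorted(intent), round(rho, 2))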
Triadic FCA and triclustering. As we have mentioned, there are such data sources as folksonomies, for example, the bookmarking website for scientific literature Bibsonomy 18 [97]; the underlying structure includes triples (user, tag, bookmark) like the one in Fig. 14.
Fig. 14. An example of the Bibsonomy relation for three papers, five authors and five tags.
Example 13. For the Bibsonomy example, one of the triadic concepts is
ρ(T) := |I ∩ (X × Y × Z)| / (|X||Y||Z|).
Definition 31. The tricluster T is called dense iff its density is not less than
some predefined threshold, i.e. ρ(T ) ≥ ρmin .
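The density of a candidate tricluster follows directly from the definition; here is a minimal Python sketch with toy data of our own:

def density(X, Y, Z, I):
    """rho(T) = |I ∩ (X×Y×Z)| / (|X||Y||Z|) for a ternary relation I."""
    hits = sum((x, y, z) in I for x in X for y in Y for z in Z)
    return hits / (len(X) * len(Y) * len(Z))

def is_dense(X, Y, Z, I, rho_min):
    return density(X, Y, Z, I) >= rho_min

# toy example: two users, two tags, two bookmarks, three triples present
I = {('u1', 't1', 'b1'), ('u1', 't2', 'b2'), ('u2', 't1', 'b1')}
print(density({'u1', 'u2'}, {'t1', 't2'}, {'b1', 'b2'}, I))   # 3/8 = 0.375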
1. Evaluation criteria: The average density, the coverage, the diversity and the
number of triclusters, and the computation time and noise tolerance for the
algorithms.
2. Benchmark datasets: We use triadic datasets from publicly available internet
data as well as synthetic datasets with various noise models.
Example 14. Examples of Prime OAC triclusters with their density indication
for the IMDB context are given below:
1. 36%, {The Shawshank Redemption (1994), Cool Hand Luke (1967), Ameri-
can History X (1998), A Clockwork Orange (1971), The Green Mile (1999)},
{Prison, Murder, Friend, Shawshank, Banker}, {Crime, Drama}
2. 56.67%, {The Godfather: Part II (1974), The Usual Suspects (1995)}, {Cuba, New York, Business, 1920s, 1950s}, {Crime, Drama, Thriller}
3. 60%, {Toy Story (1995), Toy Story 2 (1999)}, {Jealousy, Toy, Spaceman,
Little Boy, Fight}, {Fantasy, Comedy, Animation, Family, Adventure}
the attribute space to improve the results of decision tree induction. Note that
FCA helps to perform feature selection via conceptual scaling and has quite ev-
ident relations with Rough Sets theory, a popular tool for feature selection in
classification [122]. [123] proposed Navigala, a navigation-based approach for su-
pervised classification, and applied it to noisy symbol recognition. Lattice-based
approaches were also successfully used for classification of data with complex
descriptions such as graphs or trees [75,124]. Moreover, in [125] (Chapter 4,
“Concept Learning”) FCA is suggested as an alternative learning framework.
There are three subcontexts of K = (G, M, I): K_ε := (G_ε, M, I_ε), ε ∈ {+, −, τ}, with the respective derivation operators (·)⁺, (·)⁻, and (·)^τ; the first two are used as the training sample.
To apply JSM-method in FCA terms we need to scale the given data. One
may use nominal scaling as below.
M F Y M i O HE Sp Se HS A L w w̄
g1 × × × × ×
g2 × × × × ×
g3 × × × × ×
g4 × × × × ×
g5 × × × × ×
g6 × × × × ×
g7 × × × × ×
Fig. 15. The line diagrams of the lattice of positive hypotheses (left) and the lattice of negative hypotheses (right).
Exercise 19. For the credit scoring context, classify all undetermined examples.
A□ := ⊓_{g∈A} δ(g) for A ⊆ G    (1)
Interval vectors as patterns. Let us call p-dimensional vectors of intervals interval vectors. In this case, for two interval vectors of the same dimension, e = ⟨[a_i, b_i]⟩_{i∈[1,p]} and f = ⟨[c_i, d_i]⟩_{i∈[1,p]}, we define the similarity operation component-wise, as the smallest interval containing both corresponding components, i.e.:

e ⊓ f = ⟨[min(a_i, c_i), max(b_i, d_i)]⟩_{i∈[1,p]}.
The Artist Ghost Casablanca Mamma Mia! Dogma Die Hard Leon
User1 4 4 5 0 0 0 0
User2 5 5 3 4 3 0 0
User3 0 0 0 4 4 0 0
User4 0 0 0 5 4 5 3
User5 0 0 0 0 0 5 5
User6 0 0 0 0 0 4 4
Each user in this table can be described by a vector of rating intervals. For example, δ(u1) = ⟨[4, 4], [4, 4], [5, 5], [0, 0], [0, 0], [0, 0], [0, 0]⟩. If some new user u likes the movie Leon, a movie recommender system would reply who else likes this movie by applying operator (2) to the vector with [4, 5] in the Leon component: its image is {u5, u6}. Moreover, the system would retrieve the movies that users 5 and 6 liked, hypothesizing that they have tastes similar to u. Thus, operator (1) results in d = {u5, u6}□ = ⟨[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [4, 5], [4, 5]⟩, suggesting that Die Hard is worth watching for the target user u.
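The following Python sketch reproduces this recommendation step, assuming ratings range over [0, 5] so that an unconstrained component is the interval [0, 5] (all names are ours):

from functools import reduce

ratings = {
    'u1': (4, 4, 5, 0, 0, 0, 0), 'u2': (5, 5, 3, 4, 3, 0, 0),
    'u3': (0, 0, 0, 4, 4, 0, 0), 'u4': (0, 0, 0, 5, 4, 5, 3),
    'u5': (0, 0, 0, 0, 0, 5, 5), 'u6': (0, 0, 0, 0, 0, 4, 4),
}
delta = {u: tuple((v, v) for v in r) for u, r in ratings.items()}   # δ(u)

def meet(d, e):
    """Component-wise similarity: the smallest interval covering both."""
    return tuple((min(a, c), max(b, d)) for (a, b), (c, d) in zip(d, e))

def box_pattern(A):
    """Operator (1): the meet of the descriptions of all users in A."""
    return reduce(meet, (delta[u] for u in A))

def box_objects(d):
    """Operator (2): users whose description fits inside every interval of d."""
    return {u for u, e in delta.items()
            if all(lo <= a and b <= hi for (lo, hi), (a, b) in zip(d, e))}

q = tuple([(0, 5)] * 6 + [(4, 5)])   # "likes Leon", other movies unconstrained
users = box_objects(q)               # {'u5', 'u6'}
print(box_pattern(users))            # ..., (4, 5), (4, 5): Die Hard suggested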
Exercise 20. 1. Compose a small program, e.g. in Python, that enumerates all pattern concepts from the movie recommender example directly by definition, or adapt CbO to this end. 2. If 1 is not feasible, consider the subtable of the first four users and the first four movies from the movie recommender example. Find all pattern concepts by the definition. Build the line diagram of the pattern concept lattice.
However, Pattern Structures are not the only attempt to fit FCA to data with descriptions more complex than Boolean ones. During the past years, the research on extending FCA theory to cope with imprecise and incomplete information has made significant progress. The underlying model is a so-called fuzzy concept lattice; there are several definitions of such a lattice, but the basic assumption usually is that an object may possess attributes to some degree [136]. For example, in sociological studies the representation of age requires special care: a person cannot be treated as a truly adult one on the first day when his/her age exceeds the threshold of 18 years (moreover, for formal reasons this age may differ between countries). However, this is usually the case when we deal with nominal scaling; even ordinal scaling may lead to information loss because of the chosen granularity level. So, we need a flexible measure of being an adult and a teenager at the same time, and it might be a degree lying in the [0, 1] interval for each such attribute. This imprecision or roughness can alternatively be characterised in terms of rough sets [137]. An interested reader is invited to follow the survey on Fuzzy and Rough FCA in [138]. The correspondence between Pattern Structures and Fuzzy FCA can be found in [139].
Even this tiny example shows that the algorithm has identified three factors that significantly reduce the dimensionality of the data.
Fig. 16. Other programmes to which entrants of the Applied Mathematics and Informatics programme also applied. [Node labels include Mathematics, Sociology, Statistics, Software Engineering, Public Administration, Economics, World Economy, Business Informatics, Management, and Logistics, with the corresponding numbers of applicants.]
Fig. 17. The "efficient" choice of entrants to the Applied Mathematics and Informatics programme. [Node labels include "- Other -", Software Engineering, Business Informatics, Economics, Mathematics, Management, Logistics, and the combined nodes "+ Applied Mathematics and Informatics", "+ Business Informatics", "+ Software Engineering", with percentages.]
The label "- Other -" means that the entrant cancelled his or her application, preferring another university or deciding not to study this year at all.
Together with the diagram in Fig. 16, this diagram provides us with more precise knowledge about the preferences of entrants to the Applied Mathematics and Informatics programme. More than two thirds of the entrants who successfully apply to the Applied Math programme nevertheless prefer to study at another university, whereas just 18 percent of the successful applicants then become students of the Applied Mathematics and Informatics programme. Exactly 5 percent prefer to study Software Engineering, and 5 percent of the entrants who choose Applied Mathematics and Informatics also successfully applied to Software Engineering; this can be interpreted as equality of the entrants' preferences concerning these two programmes. Additionally, 5 percent prefer Business Informatics, and only two percent of the entrants who prefer Applied Mathematics and Informatics also successfully apply to Business Informatics; therefore, in the pair Business Informatics and Applied Mathematics and Informatics, the latter is less preferable for entrants.
Here we should note that the sum of the percentages of the nodes whose labels contain a plus sign and of the node "- Other -" must equal 100%; however, here it does not, because we excluded some nodes during filtering.
We built diagrams of "efficient" choice for every programme. The analysis of these diagrams helps us to recognise some relations between programmes in terms of entrants' preferences. For example, some programmes are in most cases rather a backup than an actual preference. Some programmes are close to each other by subject of study, and these relations are also expressed by the diagrams. With the help of formalised survey data we found some possible factors of entrants' choice among particular programmes. This knowledge can help our university to
QuDA was developed in the early 2000s as "a software environment for those who want to learn Data Mining by doing" at the Intellectics group of the Darmstadt University of Technology [149,150,53]. It includes various techniques, such as association rule mining, decision trees and rule-based learning, JSM-reasoning (including various reasoning schemes [151]), Bayesian learning, and interesting subgroup discovery. It also provides the experimenter with error estimation and model selection tools, as well as several preprocessing and postprocessing utilities, including data cleansing tools, line diagrams, visualisation of attribute distributions, a convenient rule navigator, etc. It was mostly aimed to support scientific and teaching activities in the field of Machine Learning and Data Mining. However, since QuDA has an open architecture and supports the most common data formats as well as the Predictive Model Markup Language (PMML) 22, it can be easily integrated into a working Data Mining cycle. Originally, QuDA was an acronym for "Qualitative Data Analysis". Now, since it finally includes many quantitative methods integrated from WEKA 23, this name is a backronym 24, as it has lost its original meaning.
Exercise 21. Download QuDA 25. Refer to QuDA's manual [149] for details and prepare the credit scoring context in csv format for opening in the QuDA environment. Perform nominal scaling of attributes and apply the JSM classifier with the basic setup. Compare the obtained rules with the hypotheses obtained manually.
Exercise 22. For the zoo dataset available with QuDA (or some other dataset suitable for classification from the UCI ML repository 26), perform nominal scaling and a comparison of JSM-classification against all available methods 1) by splitting the data in an 80:20 training-to-test sample size ratio; 2) by 10-fold cross-validation. Compare learning curves and confusion matrices. Identify all examples not covered by the JSM-method. Change the scaling type for the attribute "number of legs". Reiterate the comparison and check which methods have improved their classification quality.
22 http://www.dmg.org/
23 http://www.cs.waikato.ac.nz/ml/weka/
24 http://en.wikipedia.org/wiki/Backronym
25 http://sourceforge.net/projects/quda/; its alternative compilation for RuSSIR 2014 is available at http://bit.ly/QuDA4RuSSIR2014
26 http://archive.ics.uci.edu/ml/datasets.html
Lattice-based models and FCA itself are not mainstream directions of modern IR; they have attracted numerous researchers because of their interpretability and human-centredness, but their intrinsic complexity is a serious challenge to making them work on a Web scale.
Thus, from early works on Information Retrieval it is known that the usage of a lattice as a search space requires the treatment of an enormous number of subsets of documents: 10^{310,100} for a collection of one million documents [152]. At that time it was rather natural in the library classification domain to consider documents and their categories, which may form requests as a combination of simple logical operations like AND, NOT, and OR [153]. Thus, Mooers considered transformations T: P → L, where P is the space of all possible document descriptors and L is the space of all possible document subsets [152]. Here, T retrieves the largest set of documents from L according to a query (prescription) from P.
In that period in Soviet Russia, the All-Union Institute for Scientific and Technical Information (VINITI) was organised to facilitate information interchange and fulfill the growing scientific needs in cataloging and processing of scientific publications. Around the mid 1960s, Yulii A. Shreider, one of the leading researchers of VINITI, considered the problem of automatic classification of documents and their retrieval by means of a model featuring a triple (M, L, f), where M is a set of documents, L is a set of attributes and f: M → 2^L maps each document to a set of attributes from L [154]. There, the similarity of two documents was defined via the non-emptiness of the intersection of their descriptions, f(d1) ∩ f(d2). In that paper, Shreider mentioned the relevance of lattices to problems of document classification and retrieval, where he also cited the work of Soergel [155] on this issue.
Thus, these two introduced mappings, T and f, highly resemble the conventional prime operators in FCA for the context of documents and their attributes (keywords, terms, descriptors) with the "document-term containment" relation. In the middle of the 80s, Godin et al. [156] proposed a lattice-based retrieval model for database browsing, where objects (documents, e.g. course syllabi) were described by the associated keywords. The resulting (in fact, concept) lattice was used for navigation by query modification using its generality/specificity relation.
In the 90s several FCA-based IR models and systems appeared; reviews can be found in [157,158]. Thus, in [157], Carpineto and Romano classified the main IR problems that can be solved by FCA means through a review of their own studies by the year 2005. Uta Priss described the state of the FCA for IR domain [158] as of the year 2004. Recently, a survey on FCA-based systems and methods for IR, including prospective affordances, was presented at the FCA for IR workshop at ECIR 2013 [159], and our colleagues, Codocedo and Napoli, taking the inspiration, are summarising the latest work on the topic in a separate forthcoming survey.
Below, we shortly overview our own study applying FCA-based IR methods to describe the state of the art in the FCA for IR field. The rest of the topics are covered in the subsequent subsections.
In [4], we visually represented the literature on FCA and IR using concept lat-
tices, in which the objects are the scientific papers and the attributes are the
relevant terms available in the title, keywords and abstract of the papers. We
developed an IR tool with a central FCA component that we use to index the
papers with a thesaurus containing terms related to FCA research and to gen-
erate the lattices. It helped us to zoom in and give an extensive overview of 103
papers published between 2003 and 2009 on using FCA in information retrieval.
Information Retrieval
web services
browsing
software
mining
FCA
P aper1 × × × ×
P aper2 × ××
P aper3 × × ×
P aper4 × × ×
P aper5 × ××
and anomalies in the resulting lattices. The layered thesaurus contains multiple
abstraction levels. The first and finest level of granularity contains the search
terms of which most are grouped together based on their semantic meaning to
form the term clusters at the second level of granularity. The papers downloaded
from the Web were converted to plain text and the abstract, title and keywords
were extracted. The open source tool Lucene27 was used to index the extracted
parts of the papers using the thesaurus. The result was a cross table describing
the relationships between the papers and the term clusters or research topics
from the thesaurus. This cross table was used as a basis to generate the lattices.
The most relevant scientific sources that were used in the search for primary studies contain the work published in those journals, conferences and workshops which are of recognized quality within the research community. These sources are: IEEE Computer Society, ACM Digital Library, ScienceDirect, SpringerLink, EBSCOhost, Google Scholar, and the conference repositories of the ICFCA, ICCS and CLA conferences. Other important sources such as DBLP or CiteSeer were not explicitly included since they are indexed by some of the mentioned sources (e.g. Google Scholar). In the selected sources we used various search terms including "Formal Concept Analysis", "FCA", "concept lattices", and "Information Retrieval". To identify the major categories for the literature survey we also took into account the number of citations of the FCA papers at CiteSeerX.
The efficient retrieval of relevant information is promoted by the FCA repre-
sentation that makes the inherent logical structure of the information transpar-
ent. FCA can be used for multiple purposes in IR [15,158]. First, FCA provides
an elegant language for IR modeling and is an interesting instrument for brows-
ing and automatic retrieval through document collections. Second, FCA can
also support query refinement, ranking and enrichment by external resources.
Because a document-term lattice structures the available information as clus-
ters of related documents which are partially ordered, lattices can be used to
make suggestions for query enlargement in cases where too few documents are
retrieved and for query refinement in cases where too many documents are re-
trieved. Third, lattices can be used for querying and navigation supporting rele-
vance feedback. An initial query corresponds to a start node in a document-term
lattice. Users can then navigate to related nodes. Further, queries are used to
“prune” a document-term lattice to help users focus their search (Carpineto et
al. 1996b). For many purposes, some extra facilities are needed, such as processing large document collections quickly, allowing more flexible matching operations, allowing ranked retrieval and giving contextual answers to user queries. In the past years many FCA researchers have also devoted attention to these issues.
86% of the papers on FCA and information retrieval are covered by the research topics in Fig. 6.1. Further, in our study, we intuitively introduced the process of transforming data repositories into browsable FCA representations and performing query expansion and refinement operations. Then we considered the 28% of papers on using FCA for the representation of and navigation in image, service, web, etc. document collections. Defining and processing complex queries
27 https://lucene.apache.org/core/
covered 6% of the papers and was described as well. The review of papers on contextual answers (6% of papers) and ranking of query results (6% of papers) concluded the case study.
Knowledge representation and browsing with FCA. In 28% of the 103 selected papers, FCA is used for browsing and navigation through document collections. In more than half of these papers (18% of the total number of papers), a combination of navigation and querying based on the FCA lattices is proposed. Annotation of documents and finding optimal document descriptors play an important role in effective information retrieval (9% of papers). All FCA-based approaches for information retrieval and browsing through large data repositories are based on the same underlying model. We first have the set D containing objects such as web pages, web services, images or other digitally available items. The set A of attributes can consist of terms, tags, descriptions, etc. These attributes can be related to certain objects through a relation I ⊆ D × A which indicates which terms, tags, etc. can be used to describe the data elements in D. This triple (D, A, I) is a formal context from which the concept lattice can be created.
Query result improvement with FCA. Search engines are increasingly being used by, amongst others, web users who have an information need. The intent of a concept corresponds to a query, and the extent contains the search results. A query q features a set of terms T, and the system returns the answer by evaluating
({d8, d9}, {t1, t2, t3, t4, t5})
Query enlargement, i.e. retrieving additional relevant web pages, can be performed by navigating to an upper neighbour of the current concept in the lattice, i.e. by removing a term from the query items. The user can navigate, for example, to a superconcept ((B_c ∪ {d})″, (B_c ∪ {d})′) by adding document d. The combination of subsequent refine and expand operations can be seen as navigation through the query space. Typically, navigation and querying are two completely separate processes, and the combination of both results in a more flexible and user-friendly method. These topics are investigated in 8% of the IR papers. See the comprehensive survey on query refinement in [165].
Concept lattice based ranking. 6% of the IR papers in our study are devoted to concept lattice based ranking.
Below we explain the concept lattice based ranking (CLR) proposed in [166] and compared with hierarchical-clustering-based (HCR) and best-first matching (BFR) rankings. The experiments with two public benchmark collections showed that CLR outperformed the two competing methods when the ranked documents did not match the query, and was comparable to BFR and better than HCR in the remaining cases.
Let (D, T, I) be a document context, where D is the set of documents, T is the set of their terms, and I ⊆ D × T. Consider the ordered set of all concepts (L(D, T, I), ≺) with the nearest neighbour relation ≺, i.e., for c1, c2 ∈ L(D, T, I), c1 ≺ c2 iff c1 is a lower or an upper neighbour of c2. Then the distance between concepts c1 and c2 is defined as the least natural number n such that

∃ c_{i0}, . . . , c_{in} ∈ L(D, T, I) with c1 = c_{i0} ≺ c_{i1} ≺ . . . ≺ c_{in} = c2.

For each query q and d1, d2 ∈ D, d1 is ranked higher than d2 if the distance between (d1″, d1′) and (q″, q′) is shorter than the distance between (d2″, d2′) and (q″, q′). It means that we need fewer query transformations to obtain q from d1 than from d2.
As the author of CLR admitted in [15], CLR is sensitive to the addition of new documents to the search collection. A further analysis of the usage of concept-neighbourhood-based similarity for IR needs is given in [84].
The resulting ranking yields p4 < p1 < p2 < p3 = p5. A curious reader may notice that concepts with the same rank lie on concentric circles around the query concept, at the same distance. Obviously, concepts from the same circle need a subsequent ranking, e.g. best-match ranking via the dot product of the document and query profiles based on term frequency.
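A sketch of CLR in Python, assuming the covering graph of the concept lattice is given as an adjacency dict mapping each concept to its upper and lower neighbours; documents are then ranked by breadth-first-search distance from the query concept (all names and the toy graph are ours):

from collections import deque

def distance(c1, c2, neighbours):
    """Length of a shortest path between two concepts in the covering graph."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        c, d = queue.popleft()
        if c == c2:
            return d
        for n in neighbours.get(c, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, d + 1))
    return None

def clr_rank(query_concept, doc_concepts, neighbours):
    """Sort the object concepts of documents by distance from the query concept."""
    return sorted(doc_concepts,
                  key=lambda dc: distance(query_concept, dc, neighbours))

# toy covering graph: q is the query concept, p1..p3 are document concepts
nbrs = {'q': ['p1', 'top'], 'p1': ['q', 'p2'], 'p2': ['p1'],
        'top': ['q', 'p3'], 'p3': ['top']}
print(clr_rank('q', ['p2', 'p1', 'p3'], nbrs))   # ['p1', 'p2', 'p3']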
Fig. 20. Concept lattice based ranking for the query "browsing, FCA"; the distance values are given inside the corresponding circles.
Fig. 22. An example of FooCA's web search interface. It processes the results of search queries to Yahoo or Google and organises them into an interactive cross-table.
Fig. 23. An example of SearchSleuth’s web interface. It processes the results of search
queries to Yahoo. Passing to more general (more specific) categories is done by clicking
−term (+term).
of a query. Cole et al. [175] discuss a document discovery tool named Concep-
tual Email Manager (CEM) which is based on FCA. The program allows users to
navigate through emails using a visual lattice. The paper also discusses how con-
ceptual ontologies can support traditional document retrieval systems and aid
knowledge discovery in document collections. The development of this software
is based on earlier research on retrieval of information from semi-structured texts
([176,177]). Building further on this work is the Mail-Sleuth software (Eklund
et al. [178]) which can be used to mine large email archives. Eklund et al. [179]
use FCA for displaying, searching and navigating through help content in a help
system. Stojanovic [180] presents an FCA-based method for query refinement
that provides a user with the queries that are “nearby” the given query. Their
approach for query space navigation was validated in the context of searching
medical abstracts. Stojanovic [180] presents the SMART system for navigation
through an online product catalog. The products in the database are described
by elements of an ontology and visualized with a lattice, in which users can nav-
igate from a very general product-attribute cluster containing a lot of products
to very specific clusters that seem to contain a few, but for the user highly rele-
vant products. Spyratos et al. [181] describe an approach for query tuning that
integrates navigation and querying into a single process. The FCA lattice serves
for navigation and the attributes for query formulation. Le Grand et al. [182]
present an IR method based on FCA in conjunction with semantics to provide
contextual answers to web queries. An overall lattice is built from tourism web
pages. Then, users formulate their query and the best-matching concepts are
returned, users may then navigate within the lattice by generalizing or refining
their query. Eklund et al. [183] present AnnotationSleuth to extend a standard
search and browsing interface to feature a conceptual neighborhood centered on
a formal concept derived from curatorial tags in a museum management system.
Cigarran et al. [184] focus on the automatic selection of noun phrases as document
descriptors to build an FCA-based IR system. Automatic attribute selection
is important when using FCA in a free-text document retrieval framework.
Optimal attributes as document descriptors should produce smaller, clearer and
more browsable concept lattices with better clustering features. Recio-Garcia et
al. [185] use FCA to perform semantic annotation of web pages with domain
ontologies. Similarity matching techniques from Case Based Reasoning can be
applied to retrieve these annotated pages as cases. Liu et al. [186] use FCA to
optimise a personal news search engine to help users obtain the news content
they need rapidly. The proposed technique combines the construction of user
background using FCA, the optimisation of query keywords based on the user’s
background and a new layout strategy of search results based on a “Concept
Tree”. Lungley et al. [187] use implicit user feedback for adapting the underly-
ing domain model of an intranet search system. FCA is used as an interactive
interface to identify query refinement terms which help achieve better document
descriptions and more browsable lattices.
Further, it was extended to work with document collections [192]. Since Camelis
uses lattice-based navigation and querying by formulas, it overcomes the
drawbacks of tree-like navigation imposed by current restrictions of file systems.
Recently, the previous studies of Eklund et al. [84] in organising navigation
through annotated collections of images in virtual museums resulted in an iPad
application that allows users to explore an art collection via semantically linked
pathways that are generated using Formal Concept Analysis34. In fact, navigation
in this application is organised by showing context and relationships among
objects in a museum collection.
domain expert in mining real-world enterprise applications and makes use of spe-
cific domain knowledge, including human intelligence and domain-specific con-
straints. Our approach was empirically validated at the Amsterdam-Amstelland
police to identify suspects and victims of human trafficking in 266,157 suspicious
activity reports. Based on guidelines of the Attorney Generals of the Netherlands,
we first defined multiple early warning indicators that were used to index the
police reports.
“It was a Mercedes GLK with license plate BL XXX. The car was driving
around in circles in a prostitution area. On the backseat of the car we noticed
two well dressed young girls. We asked for their identification papers but they
did not speak English nor Dutch. The driver of the car was in possession of
their papers and told us that they were on vacation in the Netherlands for two
weeks etc.”
(Report context: Report1–Report5 are cross-marked against early-warning indicators
such as “Bulgarian license plate for routine motor vehicle inspection”,
“Expensive cars”, “Prostitutes”, “Id-papers”, and “Vacation”.)
Our method based on FCA consists of four main types of analysis which are
carried out as follows:
pects in these networks, etc. With police officers we discussed and compared
various FCA-based visualisation methods of criminal networks.
In our investigations we also used the model that was developed by Bullens
and Van Horn [197] for the identification of loverboys who typically force girls of
Dutch nationality into prostitution. Loverboys use their love affair with a woman
to force her to work in prostitution. Forcing girls and women into prostitution
through a loverboy approach is seen as a special kind of human trafficking in the
Netherlands (article 250a of the code of criminal law). This model is a resource
used by the Amsterdam-Amstelland police during the training of police officers
on this topic. A typical loverboy approach consists of three main phases which
give rise to corresponding indicators:
1. Preparatory activities to recruit girls.
2. Forcing the girl into prostitution.
3. Keeping the girl in prostitution by emotional dependence or social isolation.
Fig. 24. Line diagram for the report context of loverboy suspect B
Investigating these reports shows that he frequently visits the red light dis-
trict and has strong relationships with other pimps. One of these pimps is the
suspect of another loverboy case. From the six observations where B was seen in
the red light district, four are violence-related, including the observation of H’s
suspicious burn wounds. The other violence-related observations are situations
of fights with customers who are unwilling to leave or pay. Such violence-related
observations are related to pimps who want to protect their prostitutes from
customers and competing gangs. Since prostitution is legal in the Netherlands,
each prostitute has the right to ask the police for protection. The violence
observations of the suspect strengthened the suspicion that B is the pimp of H.
Moreover, we found another girl R who was also a potential victim of him. These
indications were enough to create a summary report and send a request for using
further investigation techniques to the public prosecutor.
Using concept lattices, we revealed numerous previously unknown human trafficking
and loverboy suspects. In-depth investigation by the police confirmed their
involvement in illegal activities and resulted in actual arrests being made. This
approach was embedded into operational policing practice and is now successfully
used on a daily basis to cope with the vastly growing amount of unstructured
information.
1. D-miner algorithm for detecting large market sectors as concepts and our
biclustering algorithm;
2. Coron system for constructing association rules;
3. Construction of association metarules using morphological analysis;
4. Construction of association metarules using ontologies (thematic catalogs).
Detecting large market sectors with D-miner and OA-biclustering. The D-miner
algorithm [203] constructs the set of concepts satisfying given constraints on the
sizes of extents and intents (i.e. the intersection of icebergs and dual icebergs).
D-miner takes as input a context and two parameters, the minimal admissible
extent and intent sizes, and outputs a “band” of the concept lattice: all concepts
satisfying the constraints given by the parameter values (|intent| ≥ m and
|extent| ≥ n, where m, n ∈ N, see Table 6).
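For illustration, here is a naive sketch of such a “band” filter (our own closure-by-intersections enumeration on a toy context; the actual D-miner algorithm is more sophisticated):

context = {  # firm -> set of bought phrases (a toy stand-in for the advertising data)
    "f1": {"hotel", "spa", "flight"},
    "f2": {"hotel", "spa"},
    "f3": {"hotel", "flight"},
    "f4": {"spa", "massage"},
}
all_attrs = frozenset(a for v in context.values() for a in v)

def band(m, n):
    """All concepts with |intent| >= m and |extent| >= n."""
    intents = {all_attrs}                      # intent of the empty object set
    for oi in map(frozenset, context.values()):
        intents |= {i & oi for i in intents}   # close the intent set under intersection
    out = []
    for b in intents:
        a = {g for g, oi in context.items() if b <= oi}
        if len(b) >= m and len(a) >= n:
            out.append((sorted(a), sorted(b)))
    return out

print(band(m=2, n=2))  # e.g. (['f1','f2'], ['hotel','spa']) and (['f1','f3'], ['flight','hotel'])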
Example 20. We provide examples of two intents of formal concepts for the case
|L| = 53, where |L| is the number of formal concepts obtained by D-miner.
Hotel market.
{ angeles hotel los, atlanta hotel, baltimore hotel, dallas hotel, denver hotel, hotel
chicago, diego hotel san, francisco hotel san, hotel houston, hotel miami, hotel new or-
leans, hotel new york, hotel orlando, hotel philadelphia, hotel seattle, hotel vancouver}
Weight loss drug market.
{ adipex buy, adipex online, adipex order, adipex prescription, buy didrex,
buy ionamin, ionamin purchase, buy phentermine, didrex online, ionamin on-
line, ionamin order, online order phentermine, online phentermine, order phen-
termine, phentermine prescription, phentermine purchase }
Applying the biclustering algorithm to our data we obtained 87 OA-biclusters
(ρ = 0.85), far fewer than the number of concepts found by D-miner.
Expert interpretation of these biclusters implies that each market described by
formal concepts found by D-miner (where each market can be represented by
several formal concepts) corresponds to a bicluster among these 87. The number
of formal concepts generated by D-miner becomes feasible for human interpretation
only if there are no more than 20 firms and about 15 terms. For these thresholds
D-miner could find only large markets and ignored important average-size markets.
For our data these ignored markets were, e.g., the car and flower markets, which
were found using the biclustering approach.
Example 21. Flower market OA-bicluster.
({ 24, 130, 170, 260, 344, 415, 530, 614, 616, 867, 926, 1017, 1153, 1160, 1220, 1361,
1410, 1538, 1756, 1893 }, {’anniversary flower’, ’arrangement flower’, ’birthday flower’,
’bouquet flower’, ’buy flower’, ’buy flower online’, ’delivery flower’, ’flower fresh’, ’flower
gift’, ’flower line’, ’flower online’, ’flower online order’, ’flower online send’, ’flower online
shop’, ’flower rose’, ’flower send’, ’flower shop’, ’flower sympathy’, ’red rose’}), with
ρ ≈ 0.84
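For reference, the density ρ of an OA-bicluster can be computed directly from its definition: for a pair (g, m) ∈ I the bicluster is (m′, g′), and ρ is the fraction of crosses inside the rectangle m′ × g′. A small sketch (the toy incidence data and function names are ours):

context = {("f1", "flower"), ("f1", "rose"), ("f2", "flower"),
           ("f2", "rose"), ("f3", "flower")}  # toy firm-phrase incidence I

def prime_obj(g):    # g': attributes of object g
    return {m for (h, m) in context if h == g}

def prime_attr(m):   # m': objects having attribute m
    return {h for (h, n) in context if n == m}

def oa_bicluster(g, m):
    """The OA-bicluster (m', g') of a pair (g, m) in I and its density rho."""
    ext, intn = prime_attr(m), prime_obj(g)
    inside = sum((h, n) in context for h in ext for n in intn)
    return ext, intn, inside / (len(ext) * len(intn))

print(oa_bicluster("f1", "flower"))  # ({'f1','f2','f3'}, {'flower','rose'}, rho = 5/6)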
Recommendations based on association rules. Using the Coron system (see [204])
we construct the informative basis of association rules [205].
Example 22. Here are some examples of association rules:
– {e vitamin} → {c vitamin}, supp = 31 [1.55%] and conf = 0.86;
– {gift graduation} → {anniversary gift}, supp = 41 [2.05%] and conf = 0.82.
The value supp = 31 of the first rule means that 31 companies bought the phrases
“e vitamin” and “c vitamin”. The value conf = 0.861 means that 86.1% of the
companies that bought the phrase “e vitamin” also bought the phrase “c vitamin”.
To make recommendations for each particular company one may use the
approach proposed in [202]. For a company f we find all association rules whose
antecedent contains all the phrases bought by the company; then we
construct the set Tu of unique advertising phrases not bought by the company
f before. We order these phrases by decreasing confidence of the rules
in whose consequents the phrases occur. If buying a phrase is predicted by
multiple rules, we take the largest confidence.
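A sketch of this ranking step (the rules below are hypothetical; mining them is left to a tool such as Coron, and we match a rule when its antecedent is covered by the firm's phrases, which is one natural reading of the scheme in [202]):

rules = [  # (antecedent, consequent, confidence) -- illustrative values only
    ({"e vitamin"}, {"c vitamin"}, 0.86),
    ({"gift graduation"}, {"anniversary gift"}, 0.82),
    ({"e vitamin"}, {"multivitamin"}, 0.40),
]

def recommend(bought):
    """Phrases not yet bought, ordered by the best confidence of a rule predicting them."""
    best = {}
    for ante, cons, conf in rules:
        if ante <= bought:                   # the rule is applicable to this firm
            for phrase in cons - bought:     # consider only unseen phrases
                best[phrase] = max(best.get(phrase, 0.0), conf)
    return sorted(best, key=best.get, reverse=True)

print(recommend({"e vitamin"}))  # ['c vitamin', 'multivitamin']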
Metarules of the following forms also seem to be reasonable. First, one can
look for rules of the form t →_{FTS} ⋃_i s_i^{I_TS}, i.e., rules whose consequent
contains all terms containing at least one word with a stem common to a word
in the antecedent term. Obviously, constructing rules of this type may result in
the fusion of phrases related to different market sectors, e.g. “black jack” and
“black coat”. Second, we considered rules of the form t →_{FT} (⋃_i s_i)^{I_TS}, i.e.,
rules with the consequent whose set of stems is the same as the set of stems of
the antecedent term.
m1 m2 m3 m4 m5 m6 f1 f2 f3 f4 f5
u1 1 0 1 1 0 0 0 1 1 0 0
u2 1 0 1 1 0 0 1 0 0 1 0
u3 1 0 1 1 0 1 0 1 0 1 0
u4 1 0 1 1 0 0 1 0 0 1 0
u5 0 0 0 0 1 1 1 0 0 0 1
u6 1 0 1 1 0 0 0 1 1 0 0
u7 1 0 0 1 1 1 1 0 0 0 1
g1 1 0 1 1 1 1 0 0 0 0 0
g2 0 1 1 0 1 1 0 0 0 0 0
g3 1 0 0 1 0 0 0 0 0 0 0
where Ũ denotes a set of k users that are the most similar to user u, who
have rated item m. For example, the function aggr may have the following form
[209]:
r_{u,m} = ( Σ_{ũ∈Ũ} sim(ũ, u) · r_{ũ,m} ) / ( Σ_{ũ∈Ũ} sim(u, ũ) ).
The similarity measure between users u and ũ, sim(ũ, u), is essentially an inverse
distance measure and is used as a weight, i.e., the more similar users u and ũ
are, the more weight rating r_{ũ,m} will carry in the prediction of r_{u,m}.
Similarity between two users is based on their ratings of items that both
users have rated. There are several popular approaches: Pearson correlation,
cosine-based, and Hamming-based similarities.
We mainly use the cosine-based and normalised Hamming-based similarities.
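A minimal NumPy sketch of the aggregation formula above with cosine similarity (the naive k-nearest-neighbour selection and the toy ratings are our own choices):

import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict(R, u, m, k=2):
    """Weighted-average prediction of rating R[u, m] from the k most similar raters."""
    raters = [v for v in range(R.shape[0]) if v != u and R[v, m] > 0]
    top = sorted(((cosine(R[u], R[v]), v) for v in raters), reverse=True)[:k]
    den = sum(s for s, _ in top)
    return sum(s * R[v, m] for s, v in top) / den if den else 0.0

R = np.array([[5, 3, 0], [4, 3, 4], [5, 4, 5]], dtype=float)  # 0 = unknown rating
print(predict(R, u=0, m=2))  # approx. 4.5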
To apply this approach in the case of the FCA-based BMF recommender algorithm,
we simply take the user-factor matrices obtained after factorisation of the
initial data as input.
Example 27. For the input matrix in Table 10 one can find the following covering
factors:
I = P ∘ Q, where P is the binary user-factor matrix and Q is the binary
factor-item matrix of the covering factors (the full 0/1 matrices are omitted here).
However, in this case most components of the obtained user profile vectors
are zero, and thus we lose similarity information.
To smooth the loss effects we proposed the following weighted projection:
P̃_{uf} = (I_{u·} · Q_{f·}) / ||Q_{f·}||_1 = ( Σ_{v∈V} I_{uv} · Q_{fv} ) / ( Σ_{v∈V} Q_{fv} ),
where P̃_{uf} indicates the degree to which factor f covers user u, I_{u·} is a binary
vector describing the profile of user u, and Q_{f·} is a binary vector of items belonging
to factor f (the corresponding row of Q in decomposition eq. (3)). The coordinates
of the obtained projection vector lie within [0; 1].
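A one-line NumPy sketch of the weighted projection (the matrices are a toy stand-in; names follow the formula):

import numpy as np

I = np.array([[1, 0, 1, 1],    # binary user-item matrix (toy)
              [1, 1, 0, 0]])
Q = np.array([[1, 0, 1, 1],    # binary factor-item matrix from the BMF (toy)
              [1, 1, 0, 0]])

# P_tilde[u, f] = (I[u] . Q[f]) / ||Q[f]||_1; all values lie in [0, 1]
P_tilde = I @ Q.T / Q.sum(axis=1)
print(P_tilde)  # [[1.    0.5 ], [0.333 1.  ]]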
The proposed approach and the compared ones were implemented in C++35
and evaluated on the MovieLens-100k data set. This data set features 100,000 ratings
on a five-star scale, 1682 movies, contextual information about the movies (19 genres),
943 users (each of whom rated at least 20 movies), and demographic info for the
35
https://github.com/MaratAkhmatnurov/BMFCARS
users (gender, age, occupation, zip code (ignored)). The users were divided into
seven age groups: under 18, 18-25, 26-35, 36-45, 45-49, 50-55, 56+.
Five-star ratings are converted to the binary scale by the following rule:
Iij = 1 if Rij > 3, and Iij = 0 otherwise.
The scaled dataset is split into two sets according to the bimodal cross-validation
scheme [210]: the training set and the test set with a ratio of 80:20, and 20% of
the ratings in the test set are hidden.
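A sketch of the scaling and split steps (NumPy; random toy data, and the exact protocol of [210] may differ in details):

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(943, 1682))      # toy stand-in for the MovieLens ratings

I = (R > 3).astype(int)                       # five-star scale -> binary scale

users = rng.permutation(R.shape[0])
train_u, test_u = users[:754], users[754:]    # 80:20 user split
hidden = rng.random((len(test_u), R.shape[1])) < 0.2   # hide 20% of test ratings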
(Figure: MAE, Precision, Recall, and F-measure versus the number of neighbours
for BMF80+context, SVD85+context, and SVD85.)
In our previous study, with the original BMF-based scheme (without weighting),
we obtained comparable results in terms of MAE, Precision, and Recall [94,145].
sim(A, B) = P[min{π(F_A)} = min{π(F_B)}] = |F_A ∩ F_B| / |F_A ∪ F_B|.
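This identity is the min-wise hashing (minhash) property: for a random permutation π, the probability that two fingerprint sets have the same minimal image equals their Jaccard similarity. A toy estimator (our own code, not the production pipeline):

import random

def minhash_sim(fa, fb, universe, n_perm=200, seed=0):
    """Estimate Jaccard similarity by counting matching minima over random permutations."""
    rnd = random.Random(seed)
    items = sorted(universe)
    hits = 0
    for _ in range(n_perm):
        rnd.shuffle(items)
        rank = {x: i for i, x in enumerate(items)}    # pi: item -> position
        hits += min(fa, key=rank.get) == min(fb, key=rank.get)
    return hits / n_perm

fa, fb = {1, 2, 3}, {2, 3, 4}
print(minhash_sim(fa, fb, fa | fb))  # close to |{2,3}| / |{1,2,3,4}| = 0.5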
Further, we used FCA to define clusters of near-duplicate documents.
Let K_DF = (D, F, I ⊆ D × F) be a context of documents, where D is
a set of documents, F is a set of hash codes (fingerprints), and dIf means that
document d has fingerprint f.
For a subset of documents A ⊆ D, A′ describes their similarity in terms of
common fingerprints, and the closed set A′′ is a cluster of similar documents.
36
https://academy.yandex.ru/events/imat/grant2005/
37
http://romip.ru/en/collections/narod.html
To find all near-duplicate clusters, we need to enumerate all intents of the
context K_FD = (F, D, I ⊆ F × D) such that the corresponding common set of
fingerprints exceeds a threshold set by the user.
In fact, to this end we need to use nothing but frequent itemsets of documents.
A set of documents A ⊆ D is called k-frequent if |A′| > k, where k is a
parameter.
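A naive sketch of turning k-frequent document sets into clusters via closures (pair-seeded, with toy fingerprints; a real implementation would use an FIMI-style miner, see below):

docs = {  # document -> set of fingerprints (toy)
    "d1": {1, 2, 3, 4},
    "d2": {1, 2, 3, 5},
    "d3": {1, 2, 3, 6},
    "d4": {7, 8},
}

def closure(A, k):
    """A'' for k-frequent seeds: all documents sharing the common fingerprints of A."""
    common = set.intersection(*(docs[d] for d in A))
    if len(common) <= k:
        return None                              # the seed is not k-frequent
    return frozenset(d for d in docs if common <= docs[d])

clusters = {c for a in docs for b in docs if a < b
            for c in [closure({a, b}, k=2)] if c}
print(clusters)  # {frozenset({'d1', 'd2', 'd3'})}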
This unit outputs five values: 1) the number of duplicate pairs in the ROMIP
collection, 2) the number of duplicate pairs for our realisation, 3) the number
of unique duplicate pairs in the ROMIP collection, 4) the number of unique
duplicate pairs in our results, 5) the number of common pairs for the ROMIP
collection and our results.
In step 7, we used the leader in time efficiency, the algorithm FPmax* [216],
from the competition organised by the series of workshops on Frequent Itemset
Mining Implementations (FIMI)38.
can process very large datasets; however, to compare with Cluto (which is much
more time-consuming, as we show below) we took the collection narod.1.xml, which
contains 6941 documents.
Shingling parameters used in experiments were as follows: the number of
words in shingles was 10 and 20, the offset was always taken to be 1 (which means
that the initial set of shingles contained all possible contiguous word sequences
of a given length). The sizes of the resulting document images were taken in the
interval of 100 to 200 shingles. As frequency thresholds defining frequent closed sets
(i.e., the numbers of common shingles in document images from one cluster) we
experimentally studied different values in intervals whose maximal value
equals the number of shingles in the document image: for example, the
interval [85, 100] for document images with 100 shingles, the interval [135, 150]
for document images of size 150, etc. Obviously, choosing the maximal value of
an interval, we obtain clusters where document images coincide completely.
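For concreteness, a sketch of the shingling step (10-word shingles with offset 1; hashing and the min-wise selection of a fixed-size image are our assumptions about the implementation details):

import hashlib

def shingles(text, length=10):
    words = text.split()
    return [" ".join(words[i:i + length])
            for i in range(len(words) - length + 1)]   # offset 1: every position

def fingerprint(shingle):
    return int(hashlib.md5(shingle.encode()).hexdigest(), 16)

def document_image(text, length=10, size=100):
    """The document image: the `size` smallest shingle hashes."""
    return sorted({fingerprint(s) for s in shingles(text, length)})[:size]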
For parameters taking values in these intervals we studied the relation between
the resulting clusters of duplicates and the ROMIP collection of duplicates, which
consists of pairs of web documents considered to be near duplicates. Similarity
of each pair of documents in this list is based on an edit distance measure;
two documents were taken to be duplicates by the authors of this testbed if the
value of the edit distance measure exceeds the threshold 0.85. As we show below,
this definition of a duplicate is prone to errors; however, making a testbed by
manually marking duplicates in a large web document collection is hardly feasible.
Unfortunately, standard lists of near duplicates were missing at that period even for
standard corpora such as the TREC or Reuters collections [218]. For validating their
methods, researchers created ad hoc lists of duplicates using slightly transformed
documents from standard collections. Now the situation is drastically better; see,
for example, the workshop series on Plagiarism Analysis, Authorship Identification,
and Near-Duplicate Detection (PAN)40.
In our study, for each such pair we found an intent that contains both elements
of the pair, and vice versa: for each cluster of very similar documents
(i.e., for each corresponding closed set of documents with more than k common
description units) we took each pair of documents in the cluster and looked for
the corresponding pair in the ROMIP collection. As a result we obtained the
number of near-duplicate pairs common to our method and the ROMIP collection,
and the number of unique pairs of HSE duplicates (document pairs occurring
in a cluster of “very similar documents” and not occurring in the ROMIP
collection). The results of our experiments showed that the ROMIP collection
of duplicates, considered to be a benchmark, is far from being perfect.
First, we detected a large number of false duplicate pairs in this collection
caused by similar framing of documents. For example, the pages with the
information in Table 11 about historical personalities 1 and 2 were declared to
be near duplicates.
40
http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/
plagiarism-detection.html
However, these pages, as well as many other analogous false duplicate pairs in
the ROMIP collection, do not belong to the concept-based (maximal frequent)
clusters generated by our approach.
Second, in our study we also looked for false duplicate clusters in the ROMIP
collection caused by the transitive closure of the binary relation “X is a duplicate
of Y” (as in the typical definition of a document cluster in [215]). Since the
similarity relation is generally not transitive, the clusters formed by transitive
closure of the relation may contain absolutely dissimilar documents. Note that if
clusters are defined via maximal frequent itemsets (subsets of attributes), such
effects cannot occur, because documents in these clusters necessarily share
large itemsets (common subsets of attributes).
We analysed about 10,000 duplicate document pairs and found four rather
big false duplicate clusters of about 50-60 documents each. For further studies
on this collection see [219].
We briefly summarise the experimental results below:
Graphs and tables show that for 5000 clusters the output of ClusterRB has
almost the same F-measure (0.63) as FPmax* for threshold 150 (F1 = 0.61).
However, the computations took 4 hours for ClusterRB and half a second for
FPmax*.
We continued our developments of that time and implemented a GUI and a
duplicate-document generator (for a provided text collection) for testing purposes
[220]. The archive of these projects is freely available at Bitbucket41.
Later on, we proposed a prototype of a near-duplicate detection system for web-
shop owners. It is a typical situation for such online businesses to buy descriptions
41
https://bitbucket.org/dimanomachine/nearduplicatesarch
Table 12. Comparison of the obtained clusters in terms of pairs of near duplicate
documents
of their goods from so-called copywriters. A copywriter may cheat from time to
time and provide the owner with almost identical descriptions for different
items. In that study we demonstrated how FCA can be used for revealing and fast
online clustering of such duplicates in a real online perfume shop. Our results
were also applicable to near-duplicate detection in collections of R&D project
documents [221].
algorithm for folksonomic data by considering the input triadic data as an undirected
tripartite graph. The weights for each type of edge were assigned according
to the occurrences of the third entity, e.g. an edge {u, t} being weighted with
|{r ∈ R : (u, t, r) ∈ Y}|, the number of resources related to both u and t.
Formally, the weight spreading condition looks as follows:
Later on, they implemented (not only) all these features in the BibSonomy
system [97].
Moreover, during those studies they observed that search query logs naturally
form folksonomic data, (users, queries, resources), where the resources
are those clicked by a user after performing a query [223]. Predictably,
they coined the name logsonomy for this new data structure. When BibSonomy was
at an early stage, it faced the spam abuse problem, and in 2008 an ECML PKDD
discovery challenge47 addressed this problem. The year after, the challenge
problem48 was recommendations for BibSonomy, and it resulted in new fruitful
algorithms [224].
contents of the site. For example, interaction with members of each group may
be organized in a special manner. In the performed study we used an approach
based on formal concepts for constructing taxonomies of groups of web users.
For our experiments we have chosen four target websites: the site of the State
University Higher School of Economics, an e-shop of household equipment, the
site of a large bank, and the site of a car e-shop (the names of the last three
sites cannot be disclosed due to legal agreements).
Users of these sites are described by attributes that correspond to other sites,
either external (from three groups of sites: finance, media, education) or internal
(web-pages of the site). More precisely, initial “external” data consists of user
records each containing the user id, the time when the user first entered this site,
the time of his/her last visit, and the total number of sessions during the period
under consideration. An “internal” user record, on the other hand, is simply a
list of pages within the target website visited by a particular user.
By “external” and “internal” taxonomies we mean (parts of) concept lattices
for contexts with either “external” or “internal” attributes. For example, the
external context has the form Ke = (U, Se , Ie ), where U is the set of all users
of the target site, Se is the set of all sites from a sample (not including the
target one), the incidence relation Ie is given by all pairs (u, s) : u ∈ U, s ∈ Se ,
Introduction to Formal Concept Analysis and Its Applications in IR 81
such that user u visited site s. Analogously, the internal context is of the form
Ki = (U, Si , Ii ), where Si is the set of all own pages of the target site.
A concept of this context is a pair (A, B) such that A is a group of users all
of whom visited all sites from B.
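A small sketch of building such a context from a visit log and reading off one concept (toy data; the names are ours):

visits = [("u1", "bank.ru"), ("u1", "news.ru"), ("u2", "bank.ru"),
          ("u2", "news.ru"), ("u3", "edu.ru")]   # (user, visited site) pairs

U = {u for u, _ in visits}
Se = {s for _, s in visits}
Ie = set(visits)                                  # incidence: user u visited site s

def sites_of(users):   # A': sites visited by every user in A
    return {s for s in Se if all((u, s) in Ie for u in users)}

def users_of(sites):   # B': users who visited every site in B
    return {u for u in U if all((u, s) in Ie for u in sites)}

B = sites_of({"u1", "u2"})
print(users_of(B), B)  # the concept ({'u1', 'u2'}, {'bank.ru', 'news.ru'})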
As we have mentioned, one of the target websites was the site of our univer-
sity50 .
We received “external” data with the following fields for each user-site pair:
(user id, time of the first visit, time of the last visit, total number of
sessions during the period). “Internal” data have almost the same format
with an additional field url page, which corresponds to a particular visited page
of the target site.
The provided information was gathered from about 10,000 sites of the Russian
segment of the Internet (the .ru domain). Describing users in terms of the sites
they visited, we had to tackle the problem of dimensionality, since the resulting
concept lattices can be very large (exponential in the worst case in the number
of objects or attributes). To reduce the size of the input data we used the
following techniques.
For each user we selected only those sites that were visited more than
a certain number of times during the observation period. This gave us
information about the permanent interests of particular users. Each target web site was
considered in terms of sites of three groups: newspaper sites, financial sites, and
educational sites.
However, even after a large reduction of the input size, concept lattices can be very
large. For example, a context of size 4125 × 225 gave rise to a lattice with 57,329
concepts.
To choose interesting groups of users we employed the stability index of a concept,
defined in [226,227] and considered in [88] (in a slightly different form) as a tool for
constructing taxonomies. On the one hand, the stability index shows the independence
of an intent from particular objects of the extent (which may appear or not appear in
the context depending on random factors). On the other hand, the stability index of
a concept shows how much the extent of a concept differs from similar smaller
extents (if this difference is very small, then it is doubtful that the extent refers to a
“stable category”). For a detailed motivation of stability indices see [226,227,88].
For a concept (A, B), the stability index is defined as σ(A, B) = |{C ⊆ A :
C′ = B}| / 2^{|A|}; obviously, 0 ≤ σ(A, B) ≤ 1.
The stability index of a concept indicates how much the concept intent depends
on particular objects of the extent. A stable intent (with stability index
close to 1) is less sensitive to the removal of individual objects from the extent.
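A brute-force sketch of the stability computation under the definition above (exponential in |A|, so usable only for small extents; the toy context is ours):

from itertools import chain, combinations

context = {"g1": {"a", "b"}, "g2": {"a", "b"}, "g3": {"a", "c"}}  # toy object intents

def prime(objs):
    its = [context[g] for g in objs]
    return set.intersection(*its) if its else set().union(*context.values())

def stability(extent, intent):
    """sigma(A, B) = |{C subset of A : C' = B}| / 2^|A|."""
    subsets = chain.from_iterable(combinations(sorted(extent), r)
                                  for r in range(len(extent) + 1))
    good = sum(prime(set(c)) == intent for c in subsets)
    return good / 2 ** len(extent)

print(stability({"g1", "g2"}, {"a", "b"}))  # {g1}, {g2}, {g1,g2} qualify -> 3/4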
50
www.hse.ru
51
http://spigit.com/
52
http://www.brightidea.com/
53
http://www.innocentive.com/
54
http://www.imaginatik.com/
55
http://www.kaggle.com
56
http://witology.com/
57
http://www.wikivote.ru/
58
http://sberbank21.ru/
59
http://witology.com/en/clients_n_projects/3693/
stage). The last recommendation type is very important for stimulating users’
activity on the Witology platform during the stage of counteridea generation.
(The transport context: objects plane, amphibian car, catamaran, car, and
submarine; attributes “Can move by air”, “Can move by water”, “Can move by
surface”, and “Can move underwater”; crosses mark which transportation means
have which attributes.)
The main steps of attribute exploration, as a dialog between system A and
expert E for the transport context, are as follows:
Fig. 28. The taxonomy of transportation means as an example of non-tree-like
(multiple) inheritance
– Step 1. A Question: Is it true that, when an object has attribute “Can move
underwater”, it also has attribute “Can move by water”?
– Step 1. E Answer: Yes, it is. The expert knows that it is true for submarines
and there are no other types of underwater transport.
– Step 2. A Question: Is it true that, when an object has attributes “Can move
by air” and “Can move by water”, it also has attributes “Can move by surface”
and “Can move underwater”?
– Step 2. E Answer: No, it is not. There is a counterexample, {hydroplane}′ =
{air, water}.
– Step 3. A Question: Is it true that, when an object has attributes “Can move
by air”, “Can move by water”, and “Can move underwater”, it also has attribute
“Can move by surface”?
– Step 3. E Answer: Yes, it is: {air, water, underwater}′ = ∅.
– Steps 4, 5, 6 Trivial questions.
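The system's questions are implication checks over the current context; below is a sketch of the verification step (a toy context mirroring the dialog above):

context = {
    "plane": {"air"},
    "submarine": {"water", "underwater"},
    "hydroplane": {"air", "water"},   # the counterexample from Step 2
}

def holds(premise, conclusion):
    """Does every object having all premise attributes also have all conclusion attributes?"""
    violators = [g for g, attrs in context.items()
                 if premise <= attrs and not conclusion <= attrs]
    return (not violators), violators

print(holds({"underwater"}, {"water"}))                    # (True, [])  -- Step 1
print(holds({"air", "water"}, {"surface", "underwater"}))  # (False, ['hydroplane'])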
The resulting concept lattice can be considered as a non-tree-like taxonomy
of transportation means, since it allows multiple inheritance in the concept
hierarchy. If the expert supposes that not only objects but also attributes are
missing, then object exploration can be done in a similar manner, e.g. by the
same procedure on the transposed context.
Exercise 23. 1. Compare the concept lattices from the previous example before
starting and after completing the attribute exploration. Which new concept(s)
have we obtained? How can they be interpreted? 2. Perform attribute exploration
with Concept Explorer for a slightly modified context from [237]
          Mediterranean European Asian EU G7
France          ×           ×          ×  ×
Turkey          ×           ×      ×
Germany                     ×          ×  ×
Exercise 24. Build the concept lattice from the context of terms extracted from
texts (below, top). Find the transformation that resulted in the tree-like ontology
of terms (below, bottom).
          bookable rentable driveable rideable joinable
hotel        ×
apartment    ×        ×
car          ×        ×         ×
bike         ×        ×         ×        ×
excursion    ×                                    ×
trip         ×                                    ×

bookable
  hotel
  apartment
  excursion
  trip
  driveable
    car
    rideable
      bike
Another example where FCA can help is ontology merging: the authors
of [235] successfully tested their FCA-based merging approach on two text collections
from the tourism domain.
There is also a strong connection between Description Logics, ontologies, and
Formal Concept Analysis [237].
8 Conclusion
At the end of his invited talk at the “FCA meets IR” workshop in 2013, Prof.
Carpineto summarised the strengths and limitations of FCA for IR. It seems
evident that IR will increasingly rely on contextual knowledge and structured
data, and FCA can improve both query pre-processing and query post-processing
in modern IR systems. Among the mentioned technologies that could benefit
from FCA are query expansion, web search diversification, ontology-based
information retrieval, and querying and navigating RDF (there has been progress
to date [243]), among many others. However, the community needs to endeavour
(through theoretical advances and system engineering) to deploy a comprehensive
FCA-based tool for information retrieval and to integrate it with existing search
and indexing machinery, taking into account both the intrinsic complexity issues
and the problem of good feature generation.
Even in an extensive tutorial it is not possible to cover all models and applications
of Formal Concept Analysis. For example, concept lattices and their
applications in the social sciences, including Social Network Analysis, deserve a
special treatment. The groundwork was laid by Vincent Duquenne [85],
Linton Freeman [86] and their collaborators (see also [89] for our SNA-related
study). Another large and interesting domain is Software Engineering [244,245].
60
http://code.google.com/p/ontocomp/
61
http://www.co-ode.org/downloads/protege-x/
62
https://github.com/rjoberon/web-attribute-exploration
For these two and many other topics, we also refer the readers to the recent
surveys [5,6].
Overall, we hope that this introductory material with many examples and
exercises will help the reader not only to understand the basics of the theory,
but also, with this rich variety of tools and showcases at hand, to use FCA in
practice.
Acknowledgments. The author would like to thank all colleagues who have
made this tutorial possible: Jaume Baixeries, Pavel Braslavsky, Peter Becker,
Radim Belohlavek, Aliaksandr Birukou, Jean-Francois Boulicaut, Claudio Carpineto,
Florent Domenach, Fritjhof Dau, Vincent Duquenne, Bernhard Ganter, Katja
Hofmann, Robert Jaeshke, Evgenia Revne (Il’ina), Nikolay Karpov, Mehdy Kay-
toue, Sergei Kuznetsov, Rokia Missaoui, Elena Nenova, Engelbert Mephu Nguifo,
Alexei Neznanov, Lhouari Nourin, Bjoern Koester, Natalia Konstantinova, Amedeo
Napoli, Sergei Obiedkov, Jonas Poelmans, Nikita Romashkin, Paolo Rosso, Se-
bastian Rudolph, Alexander Tuzhilin, Pavel Serdyukov, Baris Serkaya, Dominik
Slezak, Marcin Szchuka, and, last but not least, the brave listeners. The author
would also like to commemorate Ilya Segalovich who inspired the author’s en-
thusiasm in Information Retrieval studies, by giving personal explanations of
near duplicate detection techniques in 2005, in particular.
Special thanks should go to my grandmother, Vera, who has been hosting me
in a peaceful countryside place, Prechistoe, during the last two weeks of the final
preparations.
The author was partially supported by the Russian Foundation for Basic Re-
search grants no. 13-07-00504 and 14-01-93960 and prepared the tutorial within
the project “Data mining based on applied ontologies and lattices of closed de-
scriptions” supported by the Basic Research Program of the National Research
University Higher School of Economics.
References
1. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval.
Cambridge University Press (2008)
2. Wille, R.: Restructuring lattice theory: An approach based on hierarchies of
concepts. In Rival, I., ed.: Ordered Sets. Volume 83 of NATO Advanced Study
Institutes Series. Springer Netherlands (1982) 445–470
3. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st
edn. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1999)
4. Poelmans, J., Ignatov, D.I., Viaene, S., Dedene, G., Kuznetsov, S.O.: Text mining
scientific papers: A survey on fca-based information retrieval research. In Perner,
P., ed.: ICDM. Volume 7377 of Lecture Notes in Computer Science., Springer
(2012) 273–287
5. Poelmans, J., Kuznetsov, S.O., Ignatov, D.I., Dedene, G.: Formal concept analysis
in knowledge processing: A survey on models and techniques. Expert Syst. Appl.
40(16) (2013) 6601–6623
6. Poelmans, J., Ignatov, D.I., Kuznetsov, S.O., Dedene, G.: Formal concept analysis
in knowledge processing: A survey on applications. Expert Syst. Appl. 40(16)
(2013) 6538–6560
7. Serdyukov, P., Braslavski, P., Kuznetsov, S.O., Kamps, J., Rüger, S.M., Agichtein,
E., Segalovich, I., Yilmaz, E., eds.: Advances in Information Retrieval - 35th
European Conference on IR Research, ECIR 2013, Moscow, Russia, March 24-27,
2013, Proceedings. Volume 7814 of Lecture Notes in Computer Science., Springer
(2013)
8. Arnauld, A., Nicole, P.: Logic or the Art of Thinking, translated by Jill V.
Buroker. Cambridge University Press, (1996)
9. Birkhoff, G.: Lattice Theory (Third ed.). Providence, R.I.: Amer. Math.Soc.
(1967)
10. Ore, O.: Galois connexions. Trans. Amer. Math. Soc. 55(3) (1944) 494–513
11. Barbut, M., Monjardet, B.: Ordre et classification. Hachette, Paris (1970)
12. Duquenne, V.: Latticial structures in data analysis. Theoretical Computer Science
217(2) (1999) 407 – 436 ORDAL’96.
13. Wolski, M.: Galois connections and data analysis. Fundam. Inform. 60(1-4)
(2004) 401–415
14. Kuznetsov, S.O.: Galois connections in data analysis: Contributions from the
soviet era and modern russian research. In: Formal Concept Analysis, Foundations
and Applications. (2005) 196–225
15. Carpineto, C., Romano, G.: Concept data analysis - theory and applications.
Wiley (2005)
16. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order. 2nd edition edn.
Cambridge University Press (2002)
17. Dominich, S.: The Modern Algebra of Information Retrieval. 1 edn. Springer
Publishing Company, Incorporated (2008)
18. Wolff, K.E.: A first course in formal concept analysis. How to understand line
diagrams. In Faulbaum, F., ed.: SoftStat’93: Advances in Statistical Software.
Volume 4. (1993) 429–438
19. Belohlávek, R.: Introduction to Formal Concept Analysis. Palacky University,
Olomouc. (2008)
20. Kuznetsov, S.O., Obiedkov, S.A.: Comparing performance of algorithms for gen-
erating concept lattices. J. Exp. Theor. Artif. Intell. 14(2-3) (2002) 189–216
21. Kourie, D.G., Obiedkov, S.A., Watson, B.W., van der Merwe, D.: An incremental
algorithm to construct a lattice of set intersections. Sci. Comput. Program. 74(3)
(2009) 128–142
22. Krajca, P., Vychodil, V.: Distributed algorithm for computing formal concepts
using map-reduce framework. In: N. Adams et al. (Eds.): IDA 2009. Volume
LNCS 5772. (2009) 333–344
23. Xu, B., de Frein, R., Robson, E., Foghlu, M.O.: Distributed formal concept
analysis algorithms based on an iterative mapreduce framework. In Domenach,
F., Ignatov, D., Poelmans, J., eds.: ICFCA 2012. Volume LNAI 7278. (2012)
292–308
24. Armstrong, W.: Dependency structures of data base relationships. Information
Processing (74) (1974) 580–583
25. Maier, D.: The Theory of Relational Databases. Computer Science Press (1983)
26. Guigues, J.L., Duquenne, V.: Familles minimales d’implications informatives
résultant d’un tableau de données binaires. Mathématiques et Sciences Humaines
95(1) (1986) 5–18 (in French)
27. Bazhanov, K., Obiedkov, S.A.: Optimizations in computing the duquenne-guigues
basis of implications. Ann. Math. Artif. Intell. 70(1-2) (2014) 5–24
28. Baixeries, J., Kaytoue, M., Napoli, A.: Characterization of database dependencies
with FCA and pattern structures. In: Analysis of Images, Social Networks and
Texts - Third International Conference, AIST 2014, Yekaterinburg, Russia, April
10-12, 2014, Revised Selected Papers. (2014) 3–14
29. Yevtushenko, S.A.: System of data analysis “Concept Explorer” (in Russian). In:
Proceedings of the 7th national conference on Artificial Intelligence KII-2000.
(2000) 127–134
30. Yevtushenko, S.: Computing and Visualizing Concept Lattices. PhD thesis, TU
Darmstadt, Fachbereich Informatik (2004)
31. Yevtushenko, S.A.: Concept Explorer. The User Guide. (September 12 2006)
32. Becker, P.: Numerical analysis in conceptual systems with toscanaj. In: Concept
Lattices, Second International Conference on Formal Concept Analysis, ICFCA
2004, Sydney, Australia, February 23-26, 2004, Proceedings. (2004) 96–103
33. Becker, P., Correia, J.H.: The toscanaj suite for implementing conceptual infor-
mation systems. In: Formal Concept Analysis, Foundations and Applications.
(2005) 324–348
34. Vogt, F., Wille, R.: TOSCANA - a graphical tool for analyzing and exploring
data. In: Graph Drawing, DIMACS International Workshop, GD ’94, Princeton,
New Jersey, USA, October 10-12, 1994, Proceedings. (1994) 226–233
35. Valtchev, P., Grosser, D., Roume, C., Hacene, M.R.: Galicia: an open platform
for lattices. In: A. de Moor B. Ganter, editor, Using Conceptual Structures:
Contributions to 11th Intl. Conference on Conceptual Structures. (2003) 241–254
36. Lahcen, B., Kwuida, L.: Lattice Miner: A tool for concept lattice construction
and exploration. In: Supplementary Proceedings of International Conference on
Formal Concept Analysis (ICFCA’10). (2010)
37. Poelmans, J., Elzinga, P., Ignatov, D.I., Kuznetsov, S.O.: Semi-automated knowl-
edge discovery: identifying and profiling human trafficking. Int. J. General Sys-
tems 41(8) (2012) 774–804
38. Poelmans, J., Elzinga, P., Neznanov, A., Viaene, S., Kuznetsov, S., Ignatov,
D., Dedene, G.: Concept relation discovery and innovation enabling technology
(cordiet). In: Proceedings of 1st International Workshop on Concept Discovery
in Unstructured Data. Volume 757 of CEUR Workshop proceedings. (2011)
39. Neznanov, A., Ilvovsky, D., Kuznetsov, S.O.: Fcart: A new fca-based system for
data analysis and knowledge discovery. In: Contributions to the 11th International
Conference on Formal Concept Analysis, TU Dresden (2013) 31–44
40. Neznanov, A., Parinov, A.: FCA analyst session and data access tools in FCART.
In: Artificial Intelligence: Methodology, Systems, and Applications - 16th Inter-
national Conference, AIMSA 2014, Varna, Bulgaria, September 11-13, 2014. Pro-
ceedings. (2014) 214–221
41. Buzmakov, A., Neznanov, A.: Practical computing with pattern structures in
FCART environment. In: Proceedings of the International Workshop ”What can
FCA do for Artificial Intelligence?” (FCA4AI at IJCAI 2013), Beijing, China,
August 5, 2013. (2013) 49–56
42. Domenach, F.: CryptoLat – a Pedagogical Software on Lattice Cryptomorphisms
and Lattice Properties. In: Proceedings of the Tenth International Conference
on Concept Lattices and Their Applications, La Rochelle, France, October 15-18,
2013. (2013) 93–103
43. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In Bocca, J.B., Jarke, M., Zaniolo, C., eds.: VLDB, Morgan Kaufmann
(1994) 487–499
64. Zaki, M.J.: Spade: An efficient algorithm for mining frequent sequences. Machine
Learning 42 (2001) 31–60
65. Vander Wal, T.: Folksonomy coinage and definition. (2007)
http://vanderwal.net/folksonomy.html (accessed on 12.03.2012).
66. Mirkin, B.: Mathematical Classification and Clustering. Kluwer, Dordrecht (1996)
67. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis:
A survey. IEEE/ACM Trans. Comput. Biology Bioinform. 1(1) (2004) 24–45
68. Eren, K., Deveci, M., Küçüktunç, O., Çatalyürek, Ü.V.: A comparative analysis
of biclustering algorithms for gene expression data. Briefings in Bioinformatics
(2012)
69. Besson, J., Robardet, C., Boulicaut, J.F., Rome, S.: Constraint-based concept
mining and its application to microarray data analysis. Intell. Data Anal. 9(1)
(2005) 59–82
70. Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., Zitzler, E.: Bicat: a biclus-
tering analysis toolbox. Bioinformatics 22(10) (2006) 1282–1283
71. Tarca, A.L., Carey, V.J., wen Chen, X., Romero, R., Drǎghici, S.: Machine learn-
ing and its applications to biology. PLoS Comput Biol 3(6) (June 2007) e116
72. Hanczar, B., Nadif, M.: Bagging for biclustering: Application to microarray data.
In: Machine Learning and Knowledge Discovery in Databases. Volume 6321 of
LNCS. Springer (2010) 490–505
73. Kaytoue, M., Kuznetsov, S.O., Napoli, A., Duplessis, S.: Mining gene expression
data with pattern structures in formal concept analysis. Inf. Sci. 181(10) (2011)
1989–2001
74. Blinova, V.G., Dobrynin, D.A., Finn, V.K., Kuznetsov, S.O., Pankratova, E.S.:
Toxicology analysis by means of the jsm-method. Bioinformatics 19(10) (2003)
1201–1207
75. Kuznetsov, S., Samokhin, M.: Learning closed sets of labeled graphs for chemical
applications. In: ILP 2005. Volume 3625 of LNCS (LNAI)., Springer (2005) 190–
208
76. DiMaggio, P.A., Subramani, A., Judson, R.S., Floudas, C.A.: A novel framework
for predicting in vivo toxicities from in vitro data using optimal methods for
dense and sparse matrix reordering and logistic regression. Toxicological Sciences
118(1) (2010) 251–265
77. Asses, Y., Buzmakov, A., Bourquard, T., Kuznetsov, S.O., Napoli, A.: A hybrid
classification approach based on FCA and emerging patterns – an application for
the classification of biological inhibitors. In: Proceedings of The 9th Int. Conf. on
Concept Lattices and Their Applications. (2012) 211–222
78. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph
partitioning. In: Proceedings of the seventh ACM SIGKDD international confer-
ence on Knowledge discovery and data mining. KDD ’01, New York, NY, USA,
ACM (2001) 269–274
79. Cimiano, P., Hotho, A., Staab, S.: Learning concept hierarchies from text corpora
using formal concept analysis. J. Artif. Intell. Res. (JAIR) 24 (2005) 305–339
80. Banerjee, A., Dhillon, I.S., Ghosh, J., Merugu, S., Modha, D.S.: A Generalized
Maximum Entropy Approach to Bregman Co-clustering and Matrix Approxima-
tion. Journal of Machine Learning Research 8 (2007) 1919–1986
81. Ignatov, D.I., Kuznetsov, S.O.: Frequent itemset mining for clustering near du-
plicate web documents. [248] 185–200
82. Carpineto, C., Michini, C., Nicolussi, R.: A concept lattice-based kernel for SVM
text classification. In: ICFCA 2009. Volume LNAI 5548., Springer (2009) 237–250
83. Koester, B.: Conceptual knowledge retrieval with fooca: Improving web search
engine results with contexts and concept hierarchies. In Perner, P., ed.: Industrial
Conference on Data Mining. Volume 4065 of Lecture Notes in Computer Science.,
Springer (2006) 176–190
84. Eklund, P.W., Ducrou, J., Dau, F.: Concept similarity and related categories in
information retrieval using formal concept analysis. Int. J. General Systems 41(8)
(2012) 826–846
85. Duquenne, V.: Lattice analysis and the representation of handicap associations.
Social Networks 18(3) (1996) 217–230
86. Freeman, L.C.: Cliques, Galois lattices, and the structure of human social groups.
Social Networks 18 (1996) 173–187
87. Latapy, M., Magnien, C., Vecchio, N.D.: Basic notions for the analysis of large
two-mode networks. Social Networks 30(1) (2008) 31–48
88. Roth, C., Obiedkov, S.A., Kourie, D.G.: On Succinct Representation of Knowl-
edge Community Taxonomies with Formal Concept Analysis. Int. J. Found. Com-
put. Sci. 19(2) (2008) 383–404
89. Gnatyshak, D., Ignatov, D.I., Semenov, A., Poelmans, J.: Gaining insight in social
networks with biclustering and triclustering. In: BIR. Volume 128 of Lecture Notes
in Business Information Processing., Springer (2012) 162–171
90. du Boucher-Ryan, P., Bridge, D.G.: Collaborative Recommending using Formal
Concept Analysis. Knowl.-Based Syst. 19(5) (2006) 309–315
91. Symeonidis, P., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y.: Nearest-
biclusters collaborative filtering based on constant and coherent values. Inf. Retr.
11(1) (2008) 51–75
92. Ignatov, D.I., Kuznetsov, S.O.: Concept-based Recommendations for Internet
Advertisement. In Belohlavek, R., Kuznetsov, S.O., eds.: Proc. CLA 2008. Volume
Vol. 433 of CEUR WS., Palack University, Olomouc, 2008 (2008) 157–166
93. Nanopoulos, A., Rafailidis, D., Symeonidis, P., Manolopoulos, Y.: Musicbox:
Personalized music recommendation based on cubic analysis of social tags. IEEE
Transactions on Audio, Speech & Language Processing 18(2) (2010) 407–412
94. Ignatov, D.I., Nenova, E., Konstantinova, N., Konstantinov, A.V.: Boolean Matrix
Factorisation for Collaborative Filtering: An FCA-Based Approach. In: Artificial
Intelligence: Methodology, Systems, and Applications. Volume 8722 of LNCS.
Springer (2014) 47–58
95. Ignatov, D.I.: Mathematical Models, Algorithms and Software Tools of Biclus-
tering Based on Closed Sets. PhD thesis, National Research University Higher
School of Economics (2010)
96. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J.: Concept-based biclustering for inter-
net advertisement. In: ICDM Workshops, IEEE Computer Society (2012) 123–130
97. Benz, D., Hotho, A., Jäschke, R., Krause, B., Mitzlaff, F., Schmitz, C., Stumme,
G.: The social bookmark and publication management system bibsonomy – A
platform for evaluating and demonstrating web 2.0 research. VLDB J. 19(6)
(2010) 849–875
98. Zhao, L., Zaki, M.J.: Tricluster: An effective algorithm for mining coherent clus-
ters in 3D microarray data. In Özcan, F., ed.: SIGMOD Conference, ACM (2005)
694–705
99. Li, A., Tuck, D.: An effective tri-clustering algorithm combining expression data
with gene regulation information. Gene regulation and systems biology 3 (2009)
49–64
100. Wille, R.: The basic theorem of Triadic Concept Analysis. Order 12 (1995)
149–158
101. Lehmann, F., Wille, R.: A triadic approach to formal concept analysis. In: Pro-
ceedings of the Third International Conference on Conceptual Structures: Appli-
cations, Implementation and Theory, London, UK, Springer-Verlag (1995) 32–43
102. Krolak-Schwerdt, S., Orlik, P., Ganter, B.: Tripat: a model for analyzing three-
mode binary data. In Bock, H.H., Lenski, W., Richter, M., eds.: Information
Systems and Data Analysis. Studies in Classification, Data Analysis, and Knowl-
edge Organization. Springer Berlin Heidelberg (1994) 298–307
103. Ji, L., Tan, K.L., Tung, A.K.H.: Mining frequent closed cubes in 3d datasets. In:
Proceedings of the 32nd international conference on Very large data bases. VLDB
’06, VLDB Endowment (2006) 811–822
104. Cerf, L., Besson, J., Robardet, C., Boulicaut, J.F.: Closed patterns meet n-ary
relations. ACM Trans. Knowl. Discov. Data 3 (March 2009) 3:1–3:36
105. Cerf, L., Besson, J., Nguyen, K.N., Boulicaut, J.F.: Closed and noise-tolerant
patterns in n-ary relations. Data Min. Knowl. Discov. 26(3) (2013) 574–619
106. Georgii, E., Tsuda, K., Schölkopf, B.: Multi-way set enumeration in weight ten-
sors. Machine Learning 82(2) (2011) 123–155
107. Spyropoulou, E., De Bie, T., Boley, M.: Interesting pattern mining in multi-
relational data. Data Mining and Knowledge Discovery 28(3) (2014) 808–849
108. Voutsadakis, G.: Polyadic concept analysis. Order 19(3) (2002) 295–304
109. Ignatov, D., Gnatyshak, D., Kuznetsov, S., Mirkin, B.: Triadic formal concept
analysis and triclustering: searching for optimal patterns. Machine Learning
(2015) 1–32
110. Mirkin, B., Kramarenko, A.V.: Approximate bicluster and tricluster boxes in the
analysis of binary data. [246] 248–256
111. Gnatyshak, D., Ignatov, D.I., Kuznetsov, S.O.: From triadic fca to triclustering:
Experimental comparison of some triclustering algorithms. [249] 249–260
112. Gnatyshak, D.V., Ignatov, D.I., Kuznetsov, S.O., Nourine, L.: A one-pass triclus-
tering approach: Is there any room for big data? In: CLA 2014. (2014)
113. Ganter, B., Kuznetsov, S.O.: Hypotheses and version spaces. In de Moor, A., Lex,
W., Ganter, B., eds.: ICCS. Volume 2746 of Lecture Notes in Computer Science.,
Springer (2003) 83–95
114. Belohlávek, R., Baets, B.D., Outrata, J., Vychodil, V.: Inducing decision trees
via concept lattices. Int. J. General Systems 38(4) (2009) 455–467
115. Carpineto, C., Romano, G.: Galois: An order-theoretic approach to conceptual
clustering. In: Proceeding of ICML93, Amherst. (1993) 33–40
116. Carpineto, C., Romano, G.: A lattice conceptual clustering system and its appli-
cation to browsing retrieval. Machine Learning Vol. 24 (1996) 95–122
117. Fu, H., Fu, H., Njiwoua, P., Nguifo, E.M.: A Comparative Study of FCA-Based
Supervised Classification Algorithms. In: 2nd Int. Conf. on Formal Concept Anal-
ysis, ICFCA 2004, Sydney, Australia, February 23-26, 2004, Proceedings. (2004)
313–320
118. Rudolph, S.: Using FCA for encoding closure operators into neural networks. In:
15th International Conference on Conceptual Structures, ICCS 2007, Sheffield,
UK, July 22-27, 2007, Proceedings. (2007) 321–332
119. Tsopzé, N., Nguifo, E.M., Tindo, G.: CLANN: concept lattice-based artificial
neural network for supervised classification. In: Proceedings of the 5th Int. Conf.
on Concept Lattices and Their Applications, CLA 2007. (2007)
120. Outrata, J.: Boolean factor analysis for data preprocessing in machine learning.
In: The Ninth International Conference on Machine Learning and Applications,
ICMLA 2010, Washington, DC, USA, 12-14 December 2010. (2010) 899–902
121. Belohlávek, R., Outrata, J., Trnecka, M.: Impact of boolean factorization as
preprocessing methods for classification of boolean data. Ann. Math. Artif. Intell.
72(1-2) (2014) 3–22
122. Ganter, B., Kuznetsov, S.O.: Scale coarsening as feature selection. In: Proceed-
ings of the 6th International Conference on Formal Concept Analysis. ICFCA’08,
Berlin, Heidelberg, Springer-Verlag (2008) 217–228
123. Visani, M., Bertet, K., Ogier, J.: Navigala: an original symbol classifier based on
navigation through a Galois lattice. IJPRAI 25(4) (2011) 449–473
124. Zaki, M.J., Aggarwal, C.C.: Xrules: An effective algorithm for structural classifi-
cation of XML data. Machine Learning 62(1-2) (2006) 137–170
125. Flach, P.: Machine Learning: The Art and Science of Algorithms That Make
Sense of Data. Cambridge University Press, New York, NY, USA (2012)
126. Finn, V.: On machine-oriented formalization of plausible reasoning in F. Bacon-
J.S. Mill style. Semiotika i Informatika (20) (1983) 35–101 (in Russian)
127. Kuznetsov, S.: JSM-method as a machine learning method. Itogi Nauki i
Tekhniki, ser. Informatika (15) (1991) 17–53 (in Russian)
128. Gusakova, S.: Paleography with JSM-method. Technical report, VINITI (2001)
129. Ganter, B., Kuznetsov, S.: Formalizing hypotheses with concepts. In Ganter,
B., Mineau, G., eds.: Conceptual Structures: Logical, Linguistic, and Computa-
tional Issues. Volume 1867 of Lecture Notes in Computer Science. Springer Berlin
Heidelberg (2000) 342–356
130. Zhuk, R., Ignatov, D.I., Konstantinova, N.: Concept learning from triadic data. In:
Proceedings of the Second International Conference on Information Technology
and Quantitative Management, ITQM 2014, National Research University Higher
School of Economics (HSE), Moscow, Russia, June 3-5, 2014. (2014) 928–938
131. Ignatov, D.I., Zhuk, R., Konstantinova, N.: Learning hypotheses from triadic la-
beled data. In: 2014 IEEE/WIC/ACM International Joint Conferences on Web
Intelligence (WI) and Intelligent Agent Technologies (IAT), Warsaw, Poland, Au-
gust 11-14, 2014 - Volume I. (2014) 474–480
132. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Concep-
tual Structures: Broadening the Base, 9th International Conference on Conceptual
Structures, ICCS 2001, Stanford, CA, USA, July 30-August 3, 2001, Proceedings.
(2001) 129–142
133. Buzmakov, A., Egho, E., Jay, N., Kuznetsov, S.O., Napoli, A., Raïssi, C.: On pro-
jections of sequential pattern structures (with an application on care trajectories).
[249] 199–208
134. Kuznetsov, S.O.: Scalable knowledge discovery in complex data with pattern
structures. In: Pattern Recognition and Machine Intelligence - 5th International
Conference, PReMI 2013, Kolkata, India, December 10-14, 2013. Proceedings.
(2013) 30–39
135. Strok, F., Galitsky, B., Ilvovsky, D., Kuznetsov, S.: Pattern structure projec-
tions for learning discourse structures. In Agre, G., Hitzler, P., Krisnadhi, A.,
Kuznetsov, S., eds.: Artificial Intelligence: Methodology, Systems, and Applica-
tions. Volume 8722 of Lecture Notes in Computer Science. Springer International
Publishing (2014) 254–260
136. Belohlávek, R.: What is a fuzzy concept lattice? II. [246] 19–26
137. Kent, R.E.: Rough concept analysis: A synthesis of rough sets and formal concept
analysis. Fundam. Inform. 27(2/3) (1996) 169–181
138. Poelmans, J., Ignatov, D.I., Kuznetsov, S.O., Dedene, G.: Fuzzy and rough formal
concept analysis: a survey. Int. J. General Systems 43(2) (2014) 105–134
Conference on Conceptual Structures, ICCS 2001, Stanford, CA, USA, July 30-
August 3, 2001, Proceedings. (2001) 319–332
177. Eklund, P.W., Cole, R.J.: A knowledge representation for information filtering
using formal concept analysis. Electron. Trans. Artif. Intell. 4(C) (2000) 51–61
178. Eklund, P.W., Ducrou, J., Brawn, P.: Concept lattices for information visualiza-
tion: Can novices read line-diagrams? [247] 57–73
179. Eklund, P.W., Wormuth, B.: Restructuring help systems using formal concept
analysis. In: Formal Concept Analysis, Third International Conference, ICFCA
2005, Lens, France, February 14-18, 2005, Proceedings. (2005) 129–144
180. Stojanovic, N.: On the query refinement in the ontology-based searching for
information. Inf. Syst. 30(7) (2005) 543–563
181. Spyratos, N., Meghini, C.: Preference-based query tuning through refine-
ment/enlargement in a formal context. In: Foundations of Information and Knowl-
edge Systems, 4th International Symposium, FoIKS 2006, Budapest, Hungary,
February 14-17, 2006, Proceedings. (2006) 278–293
182. Grand, B.L., Aufaure, M., Soto, M.: Semantic and conceptual context-aware
information retrieval. In: Advanced Internet Based Systems and Applications,
Second International Conference on Signal-Image Technology and Internet-Based
Systems, SITIS 2006, Hammamet, Tunisia, December 17-21, 2006, Revised Se-
lected Papers. (2006) 247–258
183. Eklund, P.W., Ducrou, J.: Navigation and annotation with formal concept analy-
sis. In: Knowledge Acquisition: Approaches, Algorithms and Applications, Pacific
Rim Knowledge Acquisition Workshop, PKAW 2008, Hanoi, Vietnam, December
15-16, 2008, Revised Selected Papers. (2008) 118–121
184. Cigarrán, J.M., Peñas, A., Gonzalo, J., Verdejo, F.: Automatic selection of noun
phrases as document descriptors in an FCA-based information retrieval system. In:
Formal Concept Analysis, Third International Conference, ICFCA 2005, Lens,
France, February 14-18, 2005, Proceedings. (2005) 49–63
185. Recio-García, J.A., Gómez-Martín, M.A., Díaz-Agudo, B., González-Calero, P.A.:
Improving annotation in the semantic web and case authoring in textual CBR.
In: Advances in Case-Based Reasoning, 8th European Conference, ECCBR 2006,
Fethiye, Turkey, September 4-7, 2006, Proceedings. (2006) 226–240
186. Liu, M., Shao, M., Zhang, W., Wu, C.: Reduction method for concept lattices
based on rough set theory and its application. Computers & Mathematics with
Applications 53(9) (2007) 1390–1410
187. Lungley, D., Kruschwitz, U.: Automatically maintained domain knowledge: Initial
findings. In: Advances in Information Retrieval, 31th European Conference on
IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009. Proceedings. (2009)
739–743
188. Ahmad, I., Jang, T.: Old fashion text-based image retrieval using FCA. In: ICIP
(3). (2003) 33–36
189. Ducrou, J., Vormbrock, B., Eklund, P.W.: FCA-based browsing and searching of
a collection of images. In: Conceptual Structures: Inspiration and Application,
14th International Conference on Conceptual Structures, ICCS 2006, Aalborg,
Denmark, July 16-21, 2006, Proceedings. (2006) 203–214
190. Ducrou, J.: DVDSleuth: A case study in applied formal concept analysis for nav-
igating web catalogs. In: Conceptual Structures: Knowledge Architectures for
Smart Applications, 15th International Conference on Conceptual Structures,
ICCS 2007, Sheffield, UK, July 22-27, 2007, Proceedings. (2007) 496–500
191. Amato, G., Meghini, C.: Faceted content-based image retrieval. In: 19th Interna-
tional Workshop on Database and Expert Systems Applications (DEXA 2008),
1-5 September 2008, Turin, Italy. (2008) 402–406
192. Ferré, S.: Camelis: a logical information system to organise and browse a collection
of documents. Int. J. General Systems 38(4) (2009) 379–403
193. Poelmans, J., Elzinga, P., Viaene, S., Dedene, G.: Formally analysing the concepts
of domestic violence. Expert Syst. Appl. 38(4) (2011) 3116–3130
194. Wolff, K.E.: States, transitions, and life tracks in temporal concept analysis. In:
Formal Concept Analysis, Foundations and Applications. (2005) 127–148
195. Elzinga, P., Poelmans, J., Viaene, S., Dedene, G., Morsing, S.: Terrorist threat
assessment with formal concept analysis. In: IEEE International Conference on
Intelligence and Security Informatics, ISI 2010, Vancouver, BC, Canada, May
23-26, 2010, Proceedings. (2010) 77–82
196. Elzinga, P., Wolff, K.E., Poelmans, J.: Analyzing chat conversations of pedophiles
with temporal relational semantic systems. In: 2012 European Intelligence and
Security Informatics Conference, EISIC 2012, Odense, Denmark, August 22-24,
2012. (2012) 242–249
197. Bullens, R., Van Horn, J.: Daad uit liefde: Gedwongen prostitutie van jonge
meisjes [An act of love: forced prostitution of young girls]. Justitiële Verkenningen
26(6) (2000) 25–41 (in Dutch).
198. Koester, B., Schmidt, S.: Information superiority via formal concept analysis. In
Argamon, S., Howard, N., eds.: Computational Methods for Counterterrorism.
Springer Berlin Heidelberg (2009) 143–171
199. Obiedkov, S.A., Kourie, D.G., Eloff, J.H.P.: Building access control models with
attribute exploration. Computers & Security 28(1-2) (2009) 2–7
200. Dau, F., Knechtel, M.: Access policy design supported by FCA methods. [248]
141–154
201. Zhukov, L.E.: Spectral clustering of large advertiser datasets. Technical report,
Overture R&D (April 2004)
202. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommendation
algorithms for e-commerce. In: ACM Conference on Electronic Commerce. (2000)
158–167
203. Besson, J., Robardet, C., Boulicaut, J.F., Rome, S.: Constraint-based bi-set min-
ing for biologically relevant pattern discovery in microarray data. Intelligent Data
Analysis journal 9(1) (2005) 59–82
204. Szathmary, L., Napoli, A.: CORON: A Framework for Levelwise Itemset Mining
Algorithms. In: Suppl. Proc. of ICFCA ’05, Lens, France. (Feb 2005) 110–113
205. Szathmary, L., Napoli, A., Kuznetsov, S.O.: ZART: A Multifunctional Itemset
Mining Algorithm. In: Proc. of the 5th Intl. Conf. on Concept Lattices and Their
Applications (CLA ’07), Montpellier, France (Oct 2007) 26–37
206. Crystal, D.: A dictionary of linguistics and phonetics. Third edn. Blackwell
Publishers, Oxford (1991)
207. Symeonidis, P., Ruxanda, M.M., Nanopoulos, A., Manolopoulos, Y.: Ternary
semantic analysis of social tags for personalized music recommendation. In Bello,
J.P., Chew, E., Turnbull, D., eds.: ISMIR. (2008) 219–224
208. Alqadah, F., Reddy, C., Hu, J., Alqadah, H.: Biclustering neighborhood-based
collaborative filtering method for top-n recommender systems. Knowledge and
Information Systems (2014) 1–17
209. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender sys-
tems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on
Knowl. and Data Eng. 17(6) (June 2005) 734–749
210. Ignatov, D.I., Poelmans, J., Dedene, G., Viaene, S.: A new cross-validation tech-
nique to evaluate quality of recommender systems. In Kundu, M., Mitra, S.,
Mazumdar, D., Pal, S., eds.: Perception and Machine Intelligence. Volume 7143
of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2012) 195–202
211. Brin, S., Davis, J., García-Molina, H.: Copy detection mechanisms for digital
documents. SIGMOD Rec. 24(2) (May 1995) 398–409
212. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of
the web. Computer Networks 29(8-13) (1997) 1157–1166
213. Ilyinsky, S., Kuzmin, M., Melkov, A., Segalovich, I.: An efficient method to detect
duplicates of web documents with the use of inverted index. In: Proc. 11th Int.
World Wide Web Conference (WWW’2002), Honolulu, Hawaii, USA, 7-11 May
2002, ACM (2002)
214. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise indepen-
dent permutations (extended abstract). In: Proceedings of the Thirtieth Annual
ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26,
1998. (1998) 327–336
215. Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Combinato-
rial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada,
June 21-23, 2000, Proceedings. (2000) 1–10
216. Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. In:
FIMI ’03, Frequent Itemset Mining Implementations, Proceedings of the ICDM
2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003,
Melbourne, Florida, USA. (2003)
217. Karypis, G.: CLUTO: a clustering toolkit. Technical Report 02-017, University
of Minnesota, Department of Computer Science, Minneapolis, MN 55455 (November
28, 2003)
218. Potthast, M., Stein, B.: New issues in near-duplicate detection. In: Data Analysis,
Machine Learning and Applications - Proceedings of the 31st Annual Conference
of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-Universität Freiburg,
March 7-9, 2007. (2007) 601–609
219. Zelenkov, Y.G., Segalovich, I.V.: Comparative analysis of near-duplicate detection
methods of web documents. In: Proc. 9th All-Russian Scientific Conference Digital
Libraries: Advanced Methods and Technologies, Digital Collections, Pereslavl-
Zalessky. (2007) 166–174 (in Russian).
220. Ignatov, D.I., Jánosi-Rancz, K.T., Kuznetsov, S.O.: Towards a framework for
near-duplicate detection in document collections based on closed sets of attributes.
Acta Universitatis Sapientiae. Informatica 1(2) (2009) 215–233
221. Ignatov, D., Kuznetsov, S., Lopatnikova, V., Selitskiy, I.: Development and
approbation of a near-duplicate detection system for collections of R&D documents.
Business Informatics (4) (2008) 21–28 (in Russian).
222. Ley, M.: DBLP - some lessons learned. PVLDB 2(2) (2009) 1493–1500
223. Benz, D., Hotho, A., Jäschke, R., Krause, B., Stumme, G.: Query logs as folk-
sonomies. Datenbank-Spektrum 10(1) (2010) 15–24
224. Doerfel, S., Jäschke, R.: An analysis of tag-recommender evaluation procedures.
In: Seventh ACM Conference on Recommender Systems, RecSys ’13, Hong Kong,
China, October 12-16, 2013. (2013) 343–346
225. Kuznetsov, S.O., Ignatov, D.I.: Concept stability for constructing taxonomies of
web-site users. In: Proc. Social Network Analysis and Conceptual Structures: Ex-
ploring Opportunities, S. Obiedkov, C. Roth (Eds.), Clermont-Ferrand (France),
February 16, 2007. (2007) 19–24
242. Jäschke, R., Rudolph, S.: Attribute exploration on the web. In Cellier, P., Distel,
F., Ganter, B., eds.: Contributions to the 11th International Conference on Formal
Concept Analysis, Technische Universität Dresden (May 2013) 19–34
243. Codocedo, V., Lykourentzou, I., Napoli, A.: A semantic approach to concept
lattice-based information retrieval. Ann. Math. Artif. Intell. 72(1-2) (2014) 169–
195
244. Tilley, T., Cole, R., Becker, P., Eklund, P.W.: A survey of formal concept anal-
ysis support for software engineering activities. In: Formal Concept Analysis,
Foundations and Applications. (2005) 250–271
245. Arévalo, G., Desnos, N., Huchard, M., Urtado, C., Vauttier, S.: Formal concept
analysis-based service classification to dynamically build efficient software com-
ponent directories. Int. J. General Systems 38(4) (2009) 427–453
246. Kuznetsov, S.O., Slezak, D., Hepting, D.H., Mirkin, B., eds.: Rough Sets, Fuzzy
Sets, Data Mining and Granular Computing - 13th International Conference,
RSFDGrC 2011, Moscow, Russia, June 25-27, 2011. Proceedings. Volume 6743 of
Lecture Notes in Computer Science., Springer (2011)
247. Eklund, P.W., ed.: Concept Lattices, Second International Conference on For-
mal Concept Analysis, ICFCA 2004, Sydney, Australia, February 23-26, 2004,
Proceedings. Volume 2961 of Lecture Notes in Computer Science., Springer (2004)
248. Rudolph, S., Dau, F., Kuznetsov, S.O., eds.: Conceptual Structures: Leveraging
Semantic Technologies, 17th International Conference on Conceptual Structures,
ICCS 2009, Moscow, Russia, July 26-31, 2009. Proceedings. Volume 5662 of
Lecture Notes in Computer Science., Springer (2009)
249. Ojeda-Aciego, M., Outrata, J., eds.: Proceedings of the Tenth International Con-
ference on Concept Lattices and Their Applications, La Rochelle, France, October
15-18, 2013. Volume 1062 of CEUR Workshop Proceedings., CEUR-WS.org (2013)
250. Pfeiffer, H.D., Ignatov, D.I., Poelmans, J., Gadiraju, N., eds.: Conceptual Struc-
tures for STEM Research and Education, 20th International Conference on Con-
ceptual Structures, ICCS 2013, Mumbai, India, January 10-12, 2013. Proceedings.
Volume 7735 of Lecture Notes in Computer Science., Springer (2013)
FCA has significantly influenced information retrieval by providing a structured, order-theoretic approach to modelling and navigating document collections. It supports query refinement, ranking, and enrichment by organising information as clusters of related documents, and FCA-based systems enable effective query expansion and reconfiguration. Examples include FCA-driven automated navigation, image retrieval, web search enhanced by dynamic lattice construction, and integration into systems such as CreChainDo and KAnavigator for specialised document navigation.
FCA facilitates the exploration of textual data in information retrieval by organising it into concept lattices, whose line diagrams give users an intuitive means of navigating and retrieving information. This structure supports query refinement, document representation, and clustering, letting users browse, search, and refine queries through a partial order of related documents and terms. FCA also supports ontology-based information retrieval by enabling query expansion and refinement, which can improve both the pre-processing and post-processing stages of retrieval systems. By structuring data into clusters, FCA helps reveal relationships within the data, enhancing users' ability to interpret and retrieve relevant information. In addition, FCA-based tools facilitate tasks such as text classification and clustering, further aiding the efficient management and retrieval of large textual collections.
FCA contributes to recommender systems by using concept lattices to organise and analyse user and item data, which helps in identifying patterns and associations. It supports the modelling of user preferences through Boolean matrix factorisation and user-based k-nearest-neighbour strategies, allowing systems to deliver more accurate recommendations. FCA also aids in enriching rating matrices with user and item context, such as demographic or categorical attributes, providing a more nuanced basis for recommendations. Its strength in capturing relationships and structure in data makes it applicable to techniques such as association rule mining for suggesting product features or items to users. Finally, FCA's native handling of binary relations helps simplify complex data and yields interpretable models, which is valuable in the recommender setting.
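As a minimal illustration of the neighbourhood-based flavour mentioned above, the following Python sketch recommends items from a binary user-item matrix via Jaccard-nearest neighbours. All user and item names are hypothetical toy data; this is a stand-in sketch, not the cited systems' actual method.

    ratings = {                       # toy binary user-item matrix
        "u1": {"i1", "i2"},
        "u2": {"i1", "i2", "i3"},
        "u3": {"i3"},
    }

    def jaccard(a, b):
        # similarity of two binary profiles
        return len(a & b) / len(a | b) if a | b else 0.0

    def recommend(user, k=1):
        # rank the other users by profile similarity and keep the k nearest
        neighbours = sorted((u for u in ratings if u != user),
                            key=lambda u: jaccard(ratings[user], ratings[u]),
                            reverse=True)[:k]
        # suggest items the neighbours have but the user does not
        return set().union(*(ratings[u] for u in neighbours)) - ratings[user]

    print(recommend("u1"))            # {'i3'}, contributed by the nearest neighbour u2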
Lattice-based methods in sequential pattern mining, particularly those employing FCA, offer significant advantages: through pattern structures and a similarity operator they can handle data descriptions richer than binary attributes, such as intervals, sequences, and graphs. Projections, lazy evaluation, and parallelisation reduce the computational burden, making these methods suitable for large datasets and Big Data settings. Concept lattices enable the identification and visualisation of frequent patterns and relationships within data, supporting tasks such as text classification, clustering, and data mining. Fuzzy concept lattices, in turn, allow FCA to manage data imprecision, addressing the scaling of nominal or ordinal data without information loss.
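For instance, the similarity operator of interval pattern structures takes two vectors of intervals to their componentwise convex hull. A minimal Python sketch with hypothetical numbers:

    def interval_meet(d1, d2):
        # meet of two interval descriptions: the componentwise convex hull
        return [(min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(d1, d2)]

    p1 = [(1, 2), (5, 5)]
    p2 = [(2, 4), (4, 6)]
    print(interval_meet(p1, p2))      # [(1, 4), (4, 6)]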
The Next Closure algorithm generates formal concepts systematically using the lectic (lexicographic) order on subsets of a linearly ordered base set, here the object set G. Starting from the closure of the empty set, it produces closed sets in lectic order, and a canonicity test guarantees that each closed set is generated exactly once. The linear order determines which element is tried next as a generator: the algorithm repeatedly looks for the next element whose addition to the current closed set yields a canonical closure, so no concept is skipped or repeated. As a consequence, the algorithm runs with polynomial delay and needs very little memory, since it never stores the list of already generated concepts or recomputes closures unnecessarily. The same linear order also supports binary search over the generated concepts when the covering relation of the lattice is computed, which helps in drawing the line diagram.
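A compact Python sketch may make the lectic-order mechanics concrete. The toy context below is an assumption for illustration, and the linear order is placed on attributes rather than objects (the algorithm is symmetric in G and M):

    G = {"g1", "g2", "g3"}                     # toy objects
    M_list = ["a", "b", "c"]                   # attributes with a fixed linear order
    I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}

    def prime_objects(A):
        # A' = attributes shared by every object in A
        return {m for m in M_list if all((g, m) in I for g in A)}

    def prime_attrs(B):
        # B' = objects having every attribute in B
        return {g for g in G if all((g, m) in I for m in B)}

    def closure(B):
        return prime_objects(prime_attrs(B))   # B'' is an intent

    def next_closure(B):
        # the lectically next intent after B, or None when B is the last one
        for j in range(len(M_list) - 1, -1, -1):
            m = M_list[j]
            if m in B:
                B = B - {m}
            else:
                candidate = closure(B | {m})
                # canonicity: nothing smaller than m may appear anew in the closure
                if all(n in B for n in M_list[:j] if n in candidate):
                    return candidate
        return None

    B = closure(set())                          # lectically first intent
    while B is not None:
        print(sorted(B))                        # each intent is printed exactly once
        B = next_closure(B)

On this context the loop prints the four intents {b}, {b, c}, {a, b}, {a, b, c} in lectic order.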
In a many-valued context, a conceptual scale for an attribute m defines a derived (scaled) context S_m whose objects, the scale values, are the values that m takes on the objects of the original context. The scale attributes become the attributes of the derived one-valued context and are assigned to each object g in G according to its value m(g). This transformation represents the many-valued data in a homogeneous one-valued context suitable for traditional FCA.
A many-valued context in FCA is given by a set of objects, a set of attributes, and a set of possible attribute values, with a ternary relation I ⊆ G × M × W, so that an attribute may take one of several possible values on each object. In contrast, a one-valued context is a binary relation between objects and attributes, where an attribute either holds for an object or does not. Many-valued contexts are handled in FCA by transforming them into one-valued contexts through conceptual scaling: for each attribute one chooses a scale, a one-valued context that represents the different values of the attribute as individual attributes. This conversion makes many-valued data compatible with standard concept lattice construction.
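A minimal sketch of nominal scaling, assuming a single hypothetical many-valued attribute "color": each value m(g) becomes its own binary scale attribute.

    many_valued = {"g1": "red", "g2": "green", "g3": "red"}   # m(g) for m = "color"

    values = sorted(set(many_valued.values()))                # the scale values
    scaled = {g: {"color=" + w: many_valued[g] == w for w in values}
              for g in many_valued}

    print(scaled["g1"])   # {'color=green': False, 'color=red': True}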
In a concept lattice, object concepts and attribute concepts are integral to identifying and labelling formal concepts. For each object g in the formal context (G, M, I), the object concept is (g'', g'), and for each attribute m, the attribute concept is (m', m''). These concepts generate the lattice: every formal concept can be written as (X'', X') for some subset X ⊆ G and as (Y', Y'') for some subset Y ⊆ M. They are also fundamental for drawing the line diagram, since concise labelling attaches each object name to its object concept and each attribute name to its attribute concept.
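Reusing the toy context and derivation operators from the Next Closure sketch above, object and attribute concepts follow directly from the definitions:

    g_prime = prime_objects({"g1"})             # g' = {'a', 'b'}
    print(prime_attrs(g_prime), g_prime)        # object concept ({'g1'}, {'a', 'b'})

    m_prime = prime_attrs({"b"})                # m' = {'g1', 'g2', 'g3'}
    print(m_prime, prime_objects(m_prime))      # attribute concept (G, {'b'})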
The Close by One (CbO) algorithm runs in O(|G|²|M||L|) time with polynomial delay O(|G|³|M|), performing a depth-first traversal of a tree of concept generations. Each node corresponds to a closed set of objects identified by the path from the root, and non-canonical generations are detected by simple containment checks. The Next Closure algorithm, by contrast, computes closures only for specific subsets and uses a canonicity test that never consults the list of already generated concepts, thereby minimising storage; it proceeds element by element along a fixed linear order, guaranteeing the novelty and non-redundancy of each generated closure. These differences in traversal and storage account for the algorithms' distinct efficiency profiles when enumerating formal concepts.
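For comparison, here is a recursive CbO sketch over the same toy context; the canonicity check discards non-canonical generations without consulting any list of previously generated concepts. The function name and the starting call are illustrative assumptions:

    def cbo(A, B, y):
        # emit the concept (A, B); y is the first attribute index allowed as a generator
        print(sorted(A), sorted(B))
        for j in range(y, len(M_list)):
            m = M_list[j]
            if m in B:
                continue
            A_new = A & prime_attrs({m})
            B_new = prime_objects(A_new)
            # canonicity: no attribute preceding m may appear in B_new unless already in B
            if all(n in B for n in M_list[:j] if n in B_new):
                cbo(A_new, B_new, j + 1)

    cbo(G, prime_objects(G), 0)   # start from the top concept (G, G')

On the toy context this emits each of the four concepts exactly once, in depth-first order.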
The iceberg concept lattice retains only the frequent concepts: those whose extent (the set of all objects sharing the concept's intent) contains at least a minimum number of objects. It is thus an order filter of the concept lattice, reducing the complete lattice to a manageable size determined by the chosen support threshold. The frequent closed itemsets form the nodes of the iceberg lattice, which is isomorphic to the lattice of all closed itemsets when the support threshold is set to zero. This construction is especially useful in data mining, where it focuses attention on the most significant patterns in large datasets.
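Continuing the toy example, an iceberg lattice is obtained by keeping only the intents whose extents meet a (hypothetical) absolute support threshold:

    min_support = 2                             # hypothetical absolute threshold

    def intents():
        B = closure(set())
        while B is not None:
            yield B
            B = next_closure(B)

    iceberg = [sorted(B) for B in intents() if len(prime_attrs(B)) >= min_support]
    print(iceberg)                              # [['b']]: only {b} covers >= 2 objects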