Generalized Vector Model
Classic models enforce independence of index terms. For the Vector model:

The set of term vectors {k1, k2, ..., kt} is linearly independent and forms a basis for the subspace of interest.

Frequently, this is interpreted as pairwise orthogonality: for all i != j, ki . kj = 0.
In 1985, Wong, Ziarko, and Wong proposed an interpretation in which the set of terms is linearly independent, but not pairwise orthogonal.
Key Idea:

In the generalized vector model, two index terms might be non-orthogonal; terms are represented in terms of smaller components (minterms). As before, let:

  wij be the weight associated with the pair [ki, dj]
  {k1, k2, ..., kt} be the set of all terms
If these weights are all binary, all patterns of occurrence of terms within documents can be represented by the 2^t minterms:

  m1 = (0, 0, ..., 0)
  m2 = (1, 0, ..., 0)
  m3 = (0, 1, ..., 0)
  m4 = (1, 1, ..., 0)
  m5 = (0, 0, 1, ..., 0)
  ...
  m_{2^t} = (1, 1, 1, ..., 1)

Here, the minterm m2 indicates documents in which solely the term k1 occurs.
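As an illustration (not part of the original model description), the mapping from a document's binary term-occurrence pattern to its minterm number can be sketched as follows; the function name `minterm_index` is ours:

```python
# Sketch: with t terms there are 2^t minterms. Under the numbering above
# (m1 = all zeros, m2 = (1,0,...,0), m3 = (0,1,...,0), m4 = (1,1,...,0), ...),
# pattern (g1, ..., gt) maps to m_r with r = 1 + sum_l g_l * 2^(l-1).

def minterm_index(pattern):
    """Return r such that the binary occurrence pattern corresponds to minterm m_r."""
    return 1 + sum(g << l for l, g in enumerate(pattern))

# Example with t = 3 terms:
assert minterm_index((0, 0, 0)) == 1  # m1: no term occurs
assert minterm_index((1, 0, 0)) == 2  # m2: solely k1 occurs
assert minterm_index((1, 1, 1)) == 8  # m8 = m_{2^t}: all terms co-occur
```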
Key Idea:

The basis for the generalized vector model is formed by a set of 2^t vectors defined over the set of minterms, as follows:

  m1 = (1, 0, 0, ..., 0, 0)
  m2 = (0, 1, 0, ..., 0, 0)
  m3 = (0, 0, 1, ..., 0, 0)
  ...
  m_{2^t} = (0, 0, 0, ..., 0, 1)

Notice that mi . mj = 0 for all i != j, i.e., these vectors are pairwise orthogonal.
Key Idea:

The minterm vectors are pairwise orthogonal. But this does not mean that the index terms are independent. For instance, the minterm m4 is given by:

  m4 = (1, 1, 0, ..., 0, 0)

This minterm indicates the occurrence of the terms k1 and k2 within the same document. If such a document exists in the collection, we say that the minterm m4 is active and that a dependency between these two terms is induced. The generalized vector model adopts as a basic foundation the notion that co-occurrence of terms within documents induces dependencies among them.
Forming the Term Vectors
The vector associated with the term ki is computed as:

  ki = ( sum_{r | gi(mr)=1} c_{i,r} mr ) / sqrt( sum_{r | gi(mr)=1} c_{i,r}^2 )

where

  c_{i,r} = sum_{dj | for all l, gl(dj)=gl(mr)} wij

The weight c_{i,r} associated with the pair [ki, mr] sums up the weights of the term ki in all the documents whose term occurrence pattern is given by mr. Notice that, for a collection of size N, only N minterms affect the ranking (and not 2^t).
Dependency between Index Terms
A degree of correlation between the terms ki and kj can now be computed as:

  ki . kj = sum_{r | gi(mr)=1 and gj(mr)=1} c_{i,r} * c_{j,r}

This degree of correlation sums up (in a weighted form) the dependencies between ki and kj induced by the documents in the collection (represented by the mr minterms).
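This formula is a sparse dot product over the minterms active for both terms. A sketch (the function name is ours; vectors are dicts keyed by minterm number, and the c values are left unnormalized as in the formula above):

```python
def term_correlation(c_i, c_j):
    """Unnormalized k_i . k_j: sum of c_{i,r} * c_{j,r} over minterms m_r
    active for both terms (keys present in both sparse dicts)."""
    return sum(v * c_j[r] for r, v in c_i.items() if r in c_j)

# With the worked example's values c_{1,r} = {m2: 3, m6: 2, m8: 1} and
# c_{3,r} = {m6: 1, m7: 5, m8: 4}, only m6 and m8 are shared:
c1 = {2: 3, 6: 2, 8: 1}
c3 = {6: 1, 7: 5, 8: 4}
print(term_correlation(c1, c3))  # 2*1 + 1*4 = 6
```

Terms that never co-occur share no active minterm, so their correlation is zero, recovering the orthogonality of the classic model in that case.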
The Generalized Vector Model: An Example

[Figure: Venn diagram of the terms k1, k2, k3 showing which of the documents d1-d7 contain each term]

Term weights wij:

      d1  d2  d3  d4  d5  d6  d7   q
  k1   2   1   0   2   1   0   0   1
  k2   0   0   1   0   2   2   5   2
  k3   1   0   3   0   4   2   0   3
Computation of c_{i,r}

Term weights wij:

      d1  d2  d3  d4  d5  d6  d7   q
  k1   2   1   0   2   1   0   0   1
  k2   0   0   1   0   2   2   5   2
  k3   1   0   3   0   4   2   0   3

Binary occurrence patterns:

      d1  d2  d3  d4  d5  d6  d7   q
  k1   1   1   0   1   1   0   0   1
  k2   0   0   1   0   1   1   1   1
  k3   1   0   1   0   1   1   0   1

Corresponding minterms:

  d1 = m6, d2 = m2, d3 = m7, d4 = m2, d5 = m8, d6 = m7, d7 = m3, q = m8

Applying c_{i,r} = sum_{dj | for all l, gl(dj)=gl(mr)} wij:

        m1  m2  m3  m4  m5  m6  m7  m8
  c1,r   0   3   0   0   0   2   0   1
  c2,r   0   0   5   0   0   0   3   2
  c3,r   0   0   0   0   0   1   5   4
Computation of Index Term Vectors
        m1  m2  m3  m4  m5  m6  m7  m8
  c1,r   0   3   0   0   0   2   0   1
  c2,r   0   0   5   0   0   0   3   2
  c3,r   0   0   0   0   0   1   5   4

  k1 = (3 m2 + 2 m6 + 1 m8) / sqrt(3^2 + 2^2 + 1^2)
  k2 = (5 m3 + 3 m7 + 2 m8) / sqrt(5^2 + 3^2 + 2^2)
  k3 = (1 m6 + 5 m7 + 4 m8) / sqrt(1^2 + 5^2 + 4^2)
Computation of Document Vectors
      d1  d2  d3  d4  d5  d6  d7   q
  k1   2   1   0   2   1   0   0   1
  k2   0   0   1   0   2   2   5   2
  k3   1   0   3   0   4   2   0   3

  d1 = 2 k1 + k3
  d2 = k1
  d3 = k2 + 3 k3
  d4 = 2 k1
  d5 = k1 + 2 k2 + 4 k3
  d6 = 2 k2 + 2 k3
  d7 = 5 k2
  q  = k1 + 2 k2 + 3 k3
Ranking Computation
  k1 = (3 m2 + 2 m6 + 1 m8) / sqrt(3^2 + 2^2 + 1^2)
  k2 = (5 m3 + 3 m7 + 2 m8) / sqrt(5^2 + 3^2 + 2^2)
  k3 = (1 m6 + 5 m7 + 4 m8) / sqrt(1^2 + 5^2 + 4^2)

  d1 = 2 k1 + k3
  d2 = k1
  d3 = k2 + 3 k3
  d4 = 2 k1
  d5 = k1 + 2 k2 + 4 k3
  d6 = 2 k2 + 2 k3
  d7 = 5 k2
  q  = k1 + 2 k2 + 3 k3

Expanding the term vectors into the orthogonal minterm basis, each document dj is ranked by the cosine of the angle between dj and q.
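The ranking step above can be sketched end to end; the helper names (`normalize`, `combine`, `cosine`) are ours, and the c_{i,r} values are those of the worked example:

```python
import math

def normalize(v):
    """Scale a sparse vector (dict keyed by minterm number) to unit length."""
    n = math.sqrt(sum(x * x for x in v.values()))
    return {r: x / n for r, x in v.items()}

# Term vectors k_i in the minterm basis, from the c_{i,r} table.
k = {
    'k1': normalize({2: 3, 6: 2, 8: 1}),
    'k2': normalize({3: 5, 7: 3, 8: 2}),
    'k3': normalize({6: 1, 7: 5, 8: 4}),
}

def combine(weights):
    """Expand d_j = sum_i w_ij k_i into the minterm basis."""
    v = {}
    for term, wt in weights.items():
        for r, x in k[term].items():
            v[r] = v.get(r, 0.0) + wt * x
    return v

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(x * v[r] for r, x in u.items() if r in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

q = combine({'k1': 1, 'k2': 2, 'k3': 3})
docs = {
    'd1': combine({'k1': 2, 'k3': 1}),
    'd2': combine({'k1': 1}),
    'd3': combine({'k2': 1, 'k3': 3}),
    'd4': combine({'k1': 2}),
    'd5': combine({'k1': 1, 'k2': 2, 'k3': 4}),
    'd6': combine({'k2': 2, 'k3': 2}),
    'd7': combine({'k2': 5}),
}
ranking = sorted(docs, key=lambda d: cosine(docs[d], q), reverse=True)
```

Because d5's weights (1, 2, 4) nearly match the query's (1, 2, 3), d5 comes out on top, while d2 and d4 (which contain only k1) still score above zero through the minterms k1 shares with the query terms.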
Conclusions
- The model considers correlations among index terms
- It is not clear in which situations it is superior to the standard Vector model
- Computation costs are higher
- The model does introduce interesting new ideas