04 Index Optimization

This chapter discusses text indexing, which involves converting unstructured texts into a list of words through tokenization, stemming, and stop word removal. It outlines the necessity of text indexing for efficient text processing and provides a detailed explanation of each step involved, including additional techniques for enhancing performance. The chapter also covers the implementation of these processes in Java and concludes with a summary and further discussions.


20 Taeho Jo

2 Text Indexing

This chapter is concerned with text indexing, which is the initial step of processing texts. Text indexing is the process of converting a text or texts into a list of words. The basic steps of text indexing are tokenization, stemming, and stop word removal. Tokenization is the process of splitting a text by white space, stemming is the process of converting each token into its root form, and stop word removal is the process of removing stop words from a text. Therefore, this chapter is intended to study text indexing as the initial step.
Keywords: Text Indexing, Tokenization, Stemming, Stop Word Removal,
Index Filtering, Index Expansion
This chapter is organized into five sections. In Section 2.1, we present the overview of text indexing, and we explain it functionally step by step in Section 2.2. In Section 2.3, we present and explain the implementation of text indexing in Java. In Section 2.4, we cover the further steps for reinforcing text indexing with respect to its performance and efficiency. In Section 2.5, we summarize this chapter and make further discussions.

2.1 Overview of Text Indexing

Text indexing is defined as the process of converting a text or texts into a list of words [70]. Since texts are essentially given in unstructured form, it is almost impossible for a computer program to process them directly in their raw form. In other words, text indexing means the process of segmenting a text, which consists of sentences, into its component words. The list of words is the result of indexing a text, and it becomes the input to the text representation which is covered in Chapter 3. Therefore, in this section, we provide the overview of text indexing before discussing it in detail.
Let us consider the necessity of text indexing [101]. A text is essentially unstructured data, unlike numerical data, so computer programs cannot process it in its raw form. It is impossible to apply numerical operations to texts, and it is not easy to encode a text into its own numerical value. A text is a long string, unlike a short string, so it is very difficult to give a text its own categorical value. Therefore, what is mentioned above is the reason we need to segment a text into words, which are short strings, as the process of text indexing.
The three basic steps of text indexing are illustrated in Figure 9. The first step is tokenization, which is the process of segmenting a text by white spaces or punctuation marks into tokens. The second step is stemming, which is the process of converting each token into its own root form by grammar rules. The last step is stop-word removal, which is the process of removing grammatical words such as articles, conjunctions, and prepositions.

Text Mining: Concepts, Implementation, and Big Data Challenge 21

Fig. 9 The Three Steps of Text Indexing

Tokenization is the prerequisite for the next two steps, but stemming and stop-word removal may be swapped with each other, depending on the given situation.
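As a sketch, the three basic steps can be composed into one pipeline. The class name `IndexingPipeline`, the tiny stemming table, and the stop-word set below are illustrative assumptions, not the book's actual class design, which is presented in Section 2.3.

```java
import java.util.*;

// A minimal sketch of the three-step indexing pipeline:
// tokenization -> stemming -> stop-word removal.
// The stemming table and stop-word set are toy examples.
public class IndexingPipeline {
    static final Map<String, String> STEMS = Map.of("apples", "apple", "eats", "eat");
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");

    // step 1: segment by white space and punctuation, lower-cased
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[\\s\\p{Punct}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // step 2: map each token to its root form if a rule exists
    static List<String> stem(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(STEMS.getOrDefault(t, t));
        return out;
    }

    // step 3: drop every token registered as a stop-word
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!STOP_WORDS.contains(t)) out.add(t);
        return out;
    }

    static List<String> index(String text) {
        return removeStopWords(stem(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(index("The boy eats the apples."));  // [boy, eat, apple]
    }
}
```

As the chapter notes, steps 2 and 3 may be swapped; in that case `removeStopWords` would simply be applied before `stem`.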
Additional steps may be considered for improving the performance and the efficiency of text mining and information retrieval tasks. We may attach POS (Part Of Speech) tagging, which is the process of grammatically classifying words into one of noun, verb, adjective, and so on, to the text indexing as an additional module. The weights of words may be computed as their importance degrees, and words with relatively lower weights may be removed after the three steps, which is called index filtering. We may take external words which are relevant to the strongly weighted words, using search engines, and add them to the results of indexing a text, which is called index expansion. We may consider the case of applying both of them together, which is called index optimization.
The occurrences and weights of words may depend strongly on the text length; overestimation and underestimation may be caused for long texts and short ones, respectively. As text length normalization, we may reject texts which are too short, consisting of only their titles, and select only part of long texts. We may use relative frequencies or weights, instead of absolute ones, by dividing the absolute ones by their maximum. We may consider applying the index filtering to long texts and the index expansion to short ones. Without the above actions, long texts may be retrieved more frequently for a given query in information retrieval tasks, and may have stronger influences in text mining tasks.

Fig. 10 Functional View of Tokenization

2.2 Steps of Text Indexing

This section is concerned with the three basic steps of text indexing. In Section 2.2.1, we explain tokenization as the first step, together with its examples. In Sections 2.2.2 and 2.2.3, we study stemming and stop-word removal, respectively. In Section 2.2.4, we cover schemes of computing the weights of words. Therefore, this section is intended to study the process of indexing a text or texts with the three basic steps and computing the word weights.

2.2.1 Tokenization

Tokenization is defined as the process of segmenting a text or texts into tokens by white space or punctuation marks. It is possible to apply tokenization to source code in C, C++, and Java [1], as well as to texts which are written in a natural language. However, the scope is restricted to only texts in this study, in spite of the possibility. Morphological analysis is required for tokenizing texts which are written in oriental languages: Chinese, Japanese, and Korean. So, here, omitting the morphological analysis, we explain the process of tokenizing texts which are written in English.
The functional view of tokenization is illustrated in Figure 10. A text is given as the input, and the list of tokens is generated as the output of the process. The text is segmented into tokens by white space or punctuation marks. As subsequent processing, the words which include special characters or numerical values are removed, and the tokens are changed into their lower-case characters. The list of tokens becomes the input of the next steps of text indexing: stemming or stop-word removal.
The process of tokenizing a text is shown in Figure 11. The given text is partitioned into tokens by white space, punctuation marks, and special characters. The words which include one or more special characters, such as “16%”, are removed. The first character of each sentence is given as an upper-case character, so it should be changed into lower case. Redundant words should be removed after the steps of text indexing.

Fig. 11 The Process of Tokenizing Text

Fig. 12 The Example of Tokenizing Text
An example of tokenizing a text is illustrated in Figure 12. In the example, the text consists of two sentences. The text is segmented into tokens by white space, as illustrated on the right side of Figure 12. The first word of each of the two sentences, “Text”, is converted into “text”, by the third tokenization step. In the example, none of the words contains a special character, so the second step is skipped.
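The tokenization described above can be sketched as follows. The class name `Tokenizer` and the sample sentence are illustrative assumptions; the book's own method, `tokenizeTextContent`, is implemented in Section 2.3.3.

```java
import java.util.*;

// A sketch of the tokenization step: segment by white space and
// punctuation marks, discard tokens that contain digits or special
// characters (such as "16%"), and lower-case the remaining tokens.
public class Tokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\p{Punct}]+")) {
            // keep only purely alphabetic tokens
            if (t.matches("[A-Za-z]+")) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Text indexing removes 16% of tokens. Text is unstructured."));
    }
}
```

Note that "16%" is split into "16" by the punctuation rule and then dropped by the alphabetic filter, matching the second tokenization step in Figure 11.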
Let us consider the process of tokenizing a text written in one of the oriental languages, Chinese or Japanese. The two languages do not use white space in their sentences, so it is impossible to tokenize a text by it. It is required to develop and attach a morphological analyzer, which segments the text depending on the grammatical rules, for tokenizing the text. Even though white space is used in Korean texts, the tokens which result from segmenting each of them are incomplete, so the morphological analyzer is also required for the further processing. Since English is currently used as the global language in our world, the scope is restricted to only texts which are written in English, in this study.

Fig. 13 Stemming Process

2.2.2 Stemming

Stemming refers to the process of mapping each token which is generated from the previous step into its own root form [70]. The stemming rules, which are association rules of tokens with their own root forms, are required for implementing it. Stemming is usually applicable to nouns, verbs, and adjectives, as shown in Figure 13. The list of root forms is generated as the output of this step. Therefore, in this subsection, we describe stemming, which is the second or third step of text indexing.
In the stemming, nouns which are given in their plural form are converted into their singular form, as shown in Figure 13. In order to convert a noun into its singular form, the character ‘s’ is removed from the noun in the regular cases. However, we need to consider some exceptional cases in the stemming process; for some nouns which end with the character ‘s’, such as ‘process’, the postfix ‘es’ should be removed instead of ‘s’, and some plural and singular forms are completely different from each other, as in the case of the words ‘child’ and ‘children’. Before stemming nouns, we need to classify words into noun or not by the POS tagging. Here, we use the association rules of each noun with its own plural forms for implementing the stemming.
Let us consider the case of stemming verbs into their root forms. The verbs which are given in their third-person singular form need to be converted into their root forms by removing the postfix ‘s’ or ‘es’. The verbs which are given as nouns or used in the present progressive form may be changed into their root forms by removing the postfix ‘ing’. If a verb is used in the past tense or as a participle, it is converted into its root form by removing the postfix ‘ed’, in the regular cases. Note that irregular verbs, whose root and varied forms are completely different, exist as exceptional cases.

Fig. 14 Hard vs Soft Stemming
Let us consider the case of stemming adjectives into their root forms. If an adjective is given as an adverb, it is changed into its root form by removing the postfix ‘ly’. If adjectives are given in their comparative degree, the postfix ‘er’ is removed for taking the root form. If adjectives are expressed in their superlative degree, the postfix ‘est’ is removed from each of them. However, the majority of adjectives are expressed in their comparative and superlative degrees by adding the words ‘more’ or ‘most’ before them, respectively.
The two versions of stemming rules are illustrated in Figure 14. Nouns may be derived from verbs by adding postfixes such as ‘ation’ and ‘ment’. The two words ‘assignment’ and ‘categorization’ are examples of this case. The noun ‘complexity’ is derived from the adjective ‘complex’. In the soft stemming, such words are used as root forms by themselves, whereas in the hard stemming, they are converted into their verbs or adjectives.
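Rule-based stemming with the two versions can be sketched as a lookup from a varied form to its root. The rule tables below are toy examples labeled as assumptions; a real rule set is loaded from a file, as in Section 2.3.2.

```java
import java.util.*;

// A sketch of rule-based stemming: each varied form is associated
// with its root form. The soft rules keep derived nouns as their own
// roots; the hard rules map them back to their verbs.
// Both tables are illustrative toy data.
public class RuleStemmer {
    static final Map<String, String> SOFT_RULES = Map.of(
        "categorizations", "categorization",
        "processes", "process",
        "children", "child");
    static final Map<String, String> HARD_RULES = Map.of(
        "categorizations", "categorize",
        "categorization", "categorize",
        "processes", "process",
        "children", "child");

    // a word without a matching rule is treated as already in root form
    static String stem(String word, Map<String, String> rules) {
        return rules.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("categorizations", SOFT_RULES)); // categorization
        System.out.println(stem("categorizations", HARD_RULES)); // categorize
    }
}
```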

2.2.3 Stop-Word Removal

Stop-word removal refers to the process of removing the stop-words from the list of tokens or stemmed words [70]. The stop-words are grammatical words which are irrelevant to the text contents, so they need to be removed for more efficiency. The stop-word list is loaded from a file, and the words which are registered in the list are removed. The stemming and the stop-word removal may be swapped; the stop-words may be removed before stemming the tokens. Therefore, in this subsection, we provide the detailed description of the stop-words and the stop-word removal.
A stop-word refers to a word which functions only grammatically and is irrelevant to the given text contents. The prepositions, such as ‘in’, ‘on’, ‘to’, and so on, typically belong to the stop-word group. The conjunctions, such as ‘and’, ‘or’, ‘but’, and ‘however’, also belong to the group. The definite article, ‘the’, and the indefinite articles, ‘a’ and ‘an’, are also very frequent stop-words. The stop-words occur dominantly in all texts in the collection; removing them improves the efficiency of processing texts very much.
Let us explain the process of removing the stop-words in the text indexing. The stop-word list is prepared as a file and is loaded from it. Each word which is registered in the list is removed. The remaining words after removing the stop-words are usually nouns, verbs, and adjectives. Instead of loading the stop-word list from a file, we may consider using a classifier which decides whether each word is a stop-word or not.
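The removal process above can be sketched with the stop-word list held in a set; the stop-word sample below is an illustrative assumption, and a real list would be loaded from a file as described.

```java
import java.util.*;

// A sketch of stop-word removal: every token that is registered
// in the stop-word set is dropped from the word list.
public class StopWordRemover {
    static List<String> remove(List<String> tokens, Set<String> stopWords) {
        List<String> remaining = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) remaining.add(t);
        }
        return remaining;
    }

    public static void main(String[] args) {
        // a tiny illustrative stop-word list
        Set<String> stopWords = new HashSet<>(List.of("the", "a", "an", "and", "in", "on", "to"));
        System.out.println(remove(List.of("the", "text", "in", "the", "corpus"), stopWords));
        // [text, corpus]
    }
}
```

A `HashSet` gives expected constant-time membership tests, which matters when the stop-word list is long.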
Some of the nouns, verbs, and adjectives among the remaining words may occur in other texts as well as in the current one. Such words are called common words, and they are not useful for identifying text contents, so they need to be removed like the stop-words. The TF-IDF (Term Frequency-Inverse Document Frequency) weight is the criterion for deciding whether each word is a common word or not, and it will be explained in Section 2.2.4. The remaining words after removing the stop-words are useful in a news article collection, but not all of them are so in technical or medical text collections. We may consider the further removal, depending on text lengths; it is applied to long texts.
Let us consider sentiment analysis as a kind of text classification; it is the process of classifying texts into one of the three categories: positive, neutral, and negative. We may need some of the stop-words for doing the task, as exceptional cases. For example, the stop-words such as ‘but’, ‘not’, and ‘however’ indicate a negative opinion. Not all of the stop-words are useful for identifying the sentimental style of texts; we need to decide whether each stop-word is useful or not. If implementing a text processing system for doing the sentiment analysis, the process of text indexing needs to be modified.

2.2.4 Term Weighting

Term weighting refers to the process of calculating and assigning the weight of each word as its importance degree. We may need the process for removing further words, as well as the stop-words, for further efficiency. The term frequency and the TF-IDF weight are the popular schemes of weighting words, so they will be described formally, with their mathematical equations. The term weights may be used as attribute values in encoding texts into numerical vectors, which will be covered in the next chapter. Therefore, in this subsection, we describe the schemes of weighting words: TF-IDF and its variants.
We may use the term frequency, which is the number of occurrences of each word in the given text, as the scheme of weighting words [79]. It is assumed that the stop-words, which occur most frequently in texts, are completely removed in advance. The words are weighted by counting their occurrences in the given text in this scheme. There are two kinds of term frequency: the absolute term frequency, as the occurrences of words, and the relative term frequency, as the ratio of their occurrences to the maximal ones. The relative term frequency is preferred to the absolute one, in order to avoid the overestimation and underestimation by the text lengths.
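The two term-frequency variants can be sketched as follows; the class name `TermFrequency` and the sample word list are illustrative assumptions.

```java
import java.util.*;

// A sketch of the two kinds of term frequency: the absolute frequency
// counts occurrences of each word, and the relative frequency divides
// each count by the maximal count in the text, which reduces the
// effect of text length.
public class TermFrequency {
    static Map<String, Integer> absolute(List<String> words) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : words) tf.merge(w, 1, Integer::sum);
        return tf;
    }

    static Map<String, Double> relative(List<String> words) {
        Map<String, Integer> tf = absolute(words);
        int max = Collections.max(tf.values());
        Map<String, Double> rtf = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet())
            rtf.put(e.getKey(), (double) e.getValue() / max);
        return rtf;
    }

    public static void main(String[] args) {
        List<String> words = List.of("text", "indexing", "text", "text", "word");
        System.out.println(absolute(words).get("text"));  // 3
        System.out.println(relative(words).get("word"));  // 1/3
    }
}
```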
The TF-IDF (Term Frequency-Inverse Document Frequency) is the most popular scheme of weighting words in the area of information retrieval, but it requires references to the entire text collection, which is called the corpus. The word weights which are computed by the TF-IDF scheme are proportional to the occurrences in the given text, but inversely proportional to those in the other texts. The TF-IDF weight, $w_{ij}$, of the word $t_i$ in the text $d_j$ is computed by equation (1) [102],

$$w_{ij} = \log \frac{N}{DF_i} \left(1 + \log TF_i\right) \quad (1)$$

where $N$ is the total number of texts in the collection, $DF_i$ is the number of texts which include the word $t_i$ in the collection, and $TF_i$ is the number of occurrences of the word $t_i$ in the text $d_j$. If the word occurs only one time in the text, its weight depends on the number of texts which include it in the corpus, assuming that $N$ is a constant. In encoding a text into a numerical vector, zero is assigned to a word which does not occur at all in the text.
We may derive some variants from the TF-IDF weight which is calculated by equation (1) [4]. By replacing the absolute term frequency with the relative one, equation (1) is modified into equation (2),

$$w_{ij} = \log \frac{N}{DF_i} \left(1 + \log \frac{TF_i}{TF_{max}}\right) \quad (2)$$

where $TF_{max}$ is the maximal number of occurrences in the text $d_j$. By modifying the inverse document frequency, $\log \frac{N}{DF_i}$, into $1 + \log \frac{N}{DF_i}$, the word weight is computed by equation (3),

$$w_{ij} = \left(1 + \log \frac{N}{DF_i}\right)\left(1 + \log \frac{TF_i}{TF_{max}}\right) \quad (3)$$
We may use $\log \frac{N - DF_i}{DF_i}$ as the probabilistic inverse document frequency, instead of $\log \frac{N}{DF_i}$ and $1 + \log \frac{N}{DF_i}$, so the words are weighted by equation (4),

$$w_{ij} = \left(1 + \log \frac{N - DF_i}{DF_i}\right)\left(1 + \log \frac{TF_i}{TF_{max}}\right) \quad (4)$$
In implementing the text indexing, we adopt equation (3) for computing the
word weights.
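Equation (3), which is adopted for the implementation, can be sketched directly from its definition. The corpus statistics in the example are illustrative, and the natural logarithm is an assumption, since the book does not fix the base.

```java
// A sketch of the TF-IDF variant in equation (3):
//   w_ij = (1 + log(N / DF_i)) * (1 + log(TF_i / TF_max)).
// n is the corpus size, df the document frequency of the word,
// tf its occurrences in the text, and tfMax the maximal occurrences.
public class TfIdf {
    static double weight(int n, int df, int tf, int tfMax) {
        if (tf == 0) return 0.0;  // absent words get zero weight
        return (1.0 + Math.log((double) n / df))
             * (1.0 + Math.log((double) tf / tfMax));
    }

    public static void main(String[] args) {
        // a word at the maximal frequency, occurring in 1 of 1000 texts
        System.out.println(weight(1000, 1, 5, 5));    // 1 + ln(1000)
        // a word occurring in every text carries little information
        System.out.println(weight(1000, 1000, 5, 5)); // 1.0
    }
}
```

Note how a rare word (small $DF_i$) is weighted far above a common word that occurs in every text, which is exactly the criterion used for detecting common words in Section 2.2.3.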
In addition to the text indexing and the term weighting, we may need the POS tagging. As mentioned in the previous section, the POS tagging refers to the process of classifying each word into one of the grammatical categories. Since the words around the given word are indicators, the HMM (Hidden Markov Model) tends to be used as the approach [69]. We may apply the POS tagging to the stemming, the stop-word removal, and the term weighting as an optional step. However, the POS tagging is omitted in implementing the text indexing program in Section 2.3.

2.3 Text Indexing: Implementation

This section is concerned with the process of implementing the text indexing program in Java. In Section 2.3.1, we define the classes which are involved in implementing the text indexing program. In Section 2.3.2, we mention the stemming rule as the data structure for implementing the stemming process. In Section 2.3.3, we implement the methods of the classes. Therefore, this section is intended to provide the guide for implementing the text indexing.

2.3.1 Class Definition

In this subsection, we present the classes which are involved in implementing the text indexing program. Since file access is the most frequent task in indexing a text, we need a class for doing it. Since a text is given as the input and the word list is generated as the output, we need to define the classes for processing a text and words. We also need to define the integrated class, which is called the API (Application Program Interface) class. In this subsection, we will explain each class definition.
In Figure 15, the class ‘FileString’ is defined. The property ‘fileName’ indicates the name of the file which we try to access, and the property ‘fileString’ indicates the stream which is loaded from and saved to the file; an object of the class ‘FileString’ is created with the file name, or with the file name and the file stream. The method ‘loadFileString’ loads the contents from the file as a string and assigns it to the property ‘fileString’. The method ‘saveFileString’ saves the value of the property ‘fileString’ into the file which is named by the property ‘fileName’. The method ‘getFileString’ gets the value which is assigned to the property ‘fileString’, and the method ‘setFileString’ mutates it.
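The role of the class ‘FileString’ can be sketched with `java.nio.file.Files`; this is an assumption of ours, since the book's own implementation (Figures 22 and 23) uses a different stream API with ‘readFully’ and ‘writeBytes’.

```java
import java.io.IOException;
import java.nio.file.*;

// A sketch of the 'FileString' role: hold a file name, and load and
// save the file content as one string. The java.nio calls here stand
// in for the stream API used in the book's figures.
public class FileString {
    private final String fileName;
    private String fileString;

    FileString(String fileName) { this.fileName = fileName; }

    // load the whole file content into the 'fileString' property
    void loadFileString() {
        try {
            fileString = Files.readString(Path.of(fileName));
        } catch (IOException e) {
            System.err.println("cannot load " + fileName);
        }
    }

    // save the 'fileString' property into the named file
    void saveFileString() {
        try {
            Files.writeString(Path.of(fileName), fileString);
        } catch (IOException e) {
            System.err.println("cannot save " + fileName);
        }
    }

    String getFileString() { return fileString; }
    void setFileString(String fileString) { this.fileString = fileString; }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("filestring", ".txt");
        FileString fs = new FileString(path.toString());
        fs.setFileString("a sample text");
        fs.saveFileString();
        fs.loadFileString();
        System.out.println(fs.getFileString()); // a sample text
        Files.delete(path);
    }
}
```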

Fig. 15 The Definition of Class: FileString

The class ‘Word’ is defined in Figure 16. The property ‘wordName’ indicates the word itself, and the property ‘wordWeight’ indicates its importance degree, in the class ‘Word’. An object of the class ‘Word’ is created with the value of the property ‘wordName’, and the property values are accessed and mutated by the methods whose prefixes are ‘get’ and ‘set’, respectively. The method ‘computeWordWeight’ computes the word weight and assigns it to the property ‘wordWeight’. The weight is computed by equation (3), and its implementation is omitted in this book.
The class ‘Text’ is defined in Figure 17. Each text is identified by its own file name, under the assumption that each text is given as a single file; the property ‘fileName’ becomes the text identifier. The property ‘textContent’ is the input of indexing the text, and the property ‘wordList’ is the output which consists of objects of the class ‘Word’. By the method ‘loadTextContent’, the text is loaded from the file, and it is indexed into the list of words by the method ‘indexTextContent’. The three methods whose access level is set to ‘private’ are never called externally, but are called in the method ‘indexTextContent’; we will explain their implementations in Section 2.3.3.
Fig. 16 The Definition of Class: Word

The class ‘TextIndexAPI’ is defined in Figure 18. The postfix ‘API’, which is attached to ‘TextIndex’, indicates the interface between the main program and the involved classes. An object of the class is created with the file name and the directory path which identifies the corpus location, and the method ‘indexText’ is invoked for indexing the text into the list of objects of the class ‘Word’ in the main program. The list of objects is converted into a stream for saving it into a file as the results. The final output of this program is given as a file where the list of words and their weights are saved.

2.3.2 Stemming Rule

This subsection is concerned with the stemming rule as the data structure for implementing the stemming process. A stemming rule is given as the association of a root form with its derivations, for each word. We add one more class, ‘StemmingRule’, to those which were mentioned in Section 2.3.1. We need to load the list of stemming rules from a file, initially. In this subsection, we define the class ‘StemmingRule’ and implement the involved methods.
Fig. 17 The Definition of Class: Text

The class ‘StemmingRule’ is defined in Figure 19. The property ‘rootForm’ indicates the root of each word, and the property ‘variedFormList’ indicates the list of varied forms. It is assumed that in the file, the root form and its varied forms are given as a line: “root varied1 varied2 ...”. An object is created line by line while reading the stream from the file; the line is segmented into tokens by the white space, and the first token is set as the root and the others are set as its varied forms. The method ‘isRegistered’ checks whether a word is registered in the list of varied forms, or not.
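The line format and the ‘isRegistered’ check described above can be sketched as follows; this is a simplified stand-in for the class definition in Figure 19, with the field modifiers as our assumptions.

```java
import java.util.*;

// A sketch of parsing a stemming-rule line of the form
// "root varied1 varied2 ...": the first token becomes the root form,
// and the remaining tokens become the varied-form list.
public class StemmingRule {
    final String rootForm;
    final List<String> variedFormList;

    StemmingRule(String line) {
        String[] tokens = line.trim().split("\\s+");
        rootForm = tokens[0];
        variedFormList = new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
    }

    // checks whether the word is registered as one of the varied forms
    boolean isRegistered(String word) {
        return variedFormList.contains(word);
    }

    public static void main(String[] args) {
        StemmingRule rule = new StemmingRule("child children");
        System.out.println(rule.rootForm);                  // child
        System.out.println(rule.isRegistered("children"));  // true
    }
}
```

In the full program, one such object would be created per line of the stemming-rule file.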
The method ‘stemWordList’ of the class ‘Text’ is implemented in Figure 20. The list of stemming rules is loaded from the file by calling the method ‘loadStemmingRuleList’. The objects of the class ‘StemmingRule’ are created line by line in the method ‘loadStemmingRuleList’. Each word is checked against each stemming rule as to whether it is registered as a varied form, and if so, the word is changed into its root form by getting it from the corresponding stemming rule. Redundant words are removed after stemming the words.
Fig. 18 The Definition of Class: TextIndexAPI

The method ‘isRegistered’ is implemented in Figure 21. The method belongs to the predicate methods, which have the prefix ‘is’ and check the program status by returning true or false. If it discovers a varied form which matches the value which is assigned to the property ‘wordName’, it returns true. The sequential search is adopted in the current implementation, but it needs to be replaced by more advanced search strategies, such as the binary search and the hashing, in the next version. When using the sequential search, it takes quadratic complexity, O(n²), for stemming each token, assuming n stemming rules and n varied forms per stemming rule.
Since it takes a very high cost for stemming a word, as mentioned above, we need to improve the search for the varied forms in each stemming rule and use one of the more advanced search strategies. We may sort the varied forms in each stemming rule, and replace the sequential search by the binary search. In each stemming rule, we may build a binary search tree or one of its variants, such as the AVL tree and the Red-Black tree. As an alternative strategy, we may implement the search for the varied forms by hashing, which takes only constant complexity. We also need to improve the search for the matching stemming rule, as well as the search for the varied forms.
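The hashing alternative can be sketched as follows: instead of scanning rules and varied forms sequentially, all varied forms are inserted into one hash map that associates each of them directly with its root. The class name and sample rules are illustrative assumptions.

```java
import java.util.*;

// A sketch of replacing the sequential search by hashing: one map
// from varied form to root covers both the search for the matching
// stemming rule and the search inside its varied-form list, so
// stemming a token takes expected constant time instead of O(n^2).
public class HashedStemmer {
    private final Map<String, String> variedToRoot = new HashMap<>();

    void addRule(String root, List<String> variedForms) {
        for (String v : variedForms) variedToRoot.put(v, root);
    }

    // a word with no registered varied form is returned unchanged
    String stem(String word) {
        return variedToRoot.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        HashedStemmer stemmer = new HashedStemmer();
        stemmer.addRule("child", List.of("children"));
        stemmer.addRule("process", List.of("processes", "processed", "processing"));
        System.out.println(stemmer.stem("processing")); // process
        System.out.println(stemmer.stem("text"));       // text
    }
}
```

The trade-off is memory: the map stores every varied form of every rule, but the lookup cost per token no longer grows with the number of rules.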

2.3.3 Method Implementations

This subsection is concerned with implementing the methods in the classes which were mentioned in Section 2.3.1. We already described the definition and methods of the class ‘StemmingRule’ in Section 2.3.2. This subsection focuses on the two methods ‘loadFileString’ and ‘saveFileString’ in the class ‘FileString’, and the three methods ‘tokenizeTextContent’, ‘removeStopWords’, and ‘indexTextContent’ in the class ‘Text’. Since the class ‘Word’ has methods with trivial implementations, we omit the explanation of them. Therefore, in this subsection, we explain the method implementations of the classes ‘FileString’ and ‘Text’.
Fig. 19 The Definition of Class: StemmingRule

We present the implementations of the methods ‘loadFileString’ and ‘saveFileString’ in Figures 22 and 23. The method ‘loadFileString’ takes what is stored in the given file, with the exception handling for opening it. The file is opened, its content is read as a string by invoking ‘readFully’, and it is assigned to the property ‘fileString’. The method ‘saveFileString’ saves the value of the property ‘fileString’ into the given file with the exception handling. In its implementation, the file is opened, and the property value is written by invoking the method ‘writeBytes’.
The method ‘tokenizeTextContent’ is implemented in Figure 24. By invoking the method ‘loadTextContent’, the text which is the input of text indexing is loaded from the file. The text which is given as a string is segmented into tokens by invoking the method ‘split’, and the string which is given as the argument of the method indicates the list of characters of the partition boundaries. Each token is converted into its lower case, and added to the property ‘wordList’. The token list is generated as the result of this step.
Fig. 20 The Implementation of Method: stemWordList

The method ‘removeStopWords’ is implemented in Figure 25. The list of stop-words is loaded from the file by invoking the method ‘loadStopWordList’. Each word in the property ‘wordList’ is checked as to whether it is registered in the stop-word list, and if so, it is removed from the property. The method ‘isRegistered’ in the class ‘StemmingRule’ checks the registration in the varied-form list, while the method ‘isRegistered’ in the class ‘Text’ does it in the stop-word list. By replacing the sequential search which is used in the current implementation with more advanced search strategies, we may improve the process of removing the stop-words.
In Figure 26, the method ‘indexTextContent’ is implemented. In its implementation, the methods whose access levels are set to private are involved. The string which is loaded from the file by invoking the method ‘loadTextContent’ is tokenized into tokens by the method ‘tokenizeTextContent’, the tokens are stemmed into their roots by invoking the method ‘stemWordList’, and among them, the stop-words are removed by invoking the method ‘removeStopWords’, following the steps of text indexing which were illustrated in Figure 9. The additional steps which are mentioned in Section 2.4 may be implemented by adding more methods to the class. Therefore, the implementation is intended to provide a prototype version of indexing a text or texts rather than a real version.

Fig. 21 The Implementation of Method: isRegistered

Fig. 22 The Implementation of Method: loadFileString

Fig. 23 The Implementation of Method: saveFileString

Fig. 24 The Implementation of Method: tokenizeTextContent

2.4 Additional Steps

This section is concerned with the additional steps to the ones which were covered in Sections 2.2 and 2.3. In Section 2.4.1, we describe the index filtering as the further removal of some words. In Section 2.4.2, we consider the index expansion, which is the addition of more relevant words to the list of words, as the opposite of the index filtering. In Section 2.4.3, we cover the index optimization as the composite of the index filtering and expansion. In this section, we describe the three further steps to the basic ones which were mentioned in the previous sections.

2.4.1 Index Filtering

In Figure 27, we illustrate the index filtering, which is the process of removing further words among the ones remaining after removing the stop-words. The list of words which results from indexing a text with the basic three steps is given as the input, and the words are called actual words, here. The actual words are weighted by one of the equations which were presented in Section 2.2.4, and only the words with higher weights are selected among them. The index filtering generates a shorter list of words as its output, and it is usually applicable to long texts. Therefore, in this subsection, we describe the process of removing some words further.
Fig. 25 The Implementation of Method: removeStopWords

Fig. 26 The Implementation of Method: indexTextContent

The weights of the words which are generated from indexing a text with the three basic steps are computed. As mentioned in the previous section, the weights are proportional to the absolute and relative frequencies of the words in the given text, and inversely proportional to the number of other texts which include the words in the corpus. The words are sorted by their weights, and the ones with lower weights are cut off. The index filtering means the process of removing some of the nouns, verbs, and adjectives in addition to the stop-words for more efficiency. Therefore, in the process, the longer list of words is given as the input, and the shorter list is generated as the output.

Fig. 27 Controlling Word List from Indexing Text

Let us consider the three schemes of selecting words by their weights.


The first scheme is rank-based selection, where the words are sorted by their
weights and a fixed number of words is selected among them. The second scheme
is score-based selection, where the words whose weights exceed a given
threshold are selected. The third scheme is a hybrid of the first two. There
is a trade-off between the two schemes: rank-based selection yields a constant
number of words but requires sorting them by their weights, whereas score-based
selection does not require sorting but yields a highly variable number of words.
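The rank-based and score-based selection schemes above can be sketched as follows; the class and method names are assumptions for illustration, not part of the book's implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of the two selection schemes for index filtering.
class IndexFiltering {

    // Rank-based selection: sort the words by weight in descending order
    // and keep the top k words, regardless of their absolute weights.
    static List<String> rankBased(Map<String, Double> weights, int k) {
        return weights.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Score-based selection: keep every word whose weight exceeds the
    // threshold; no sorting is needed, but the output size varies.
    static List<String> scoreBased(Map<String, Double> weights, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : weights.entrySet())
            if (e.getValue() > threshold) kept.add(e.getKey());
        return kept;
    }
}
```

The hybrid scheme would combine both, for example by applying the threshold first and then capping the result at a fixed rank.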
We may also consider the positions of words in the given text for the index
filtering, in addition to their weights. Most texts tend to place the
essential content in the first and last paragraphs. Even if two words have
identical weights, they may be discriminated from each other by their
positions in the given text: more importance should be given to the words in
the first and last paragraphs than to those in the middle ones. Grammatical
information about words, such as their grammatical categories and their roles
in the sentences, may also be considered, in addition to the positional
information.
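The position-based discrimination described above can be sketched as a simple weight adjustment; the class name and the boost factor of 1.5 are assumptions for illustration, not values from the text.

```java
// Hypothetical sketch: words in the first and last paragraphs receive a
// boost over those in the middle paragraphs, so that among words with
// identical base weights, the ones in essential positions rank higher.
class PositionalWeight {

    // paragraph is 0-based; totalParagraphs is the paragraph count of the text.
    static double adjust(double weight, int paragraph, int totalParagraphs) {
        boolean edge = (paragraph == 0) || (paragraph == totalParagraphs - 1);
        return edge ? weight * 1.5 : weight;   // boost factor is an assumption
    }
}
```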
Let us make some remarks on the index filtering. Before applying it as an
additional step, we need to consider the text length. We expect more
efficiency from applying the index filtering to long texts, by obtaining
more compact representations. Applying it to short texts, however, causes
information loss. We had better omit the index filtering for short texts,
in order to avoid this loss.

Fig. 28 Index Expansion

2.4.2 Index Expansion

The index expansion refers to the process of adding more relevant words
to those indexed from the text, and it is illustrated in Figure
28 [108]. The text is indexed into a list of words with the three basic steps.
External words that are relevant to the actual words, called virtual words,
are fetched through search engines. The external words are added to the
list of actual words, and a longer list is generated as the output. In this
subsection, we describe the index expansion process.
The collocation of words is the essential criterion for fetching associated
words from external sources. The collocation refers to the co-occurrence
of two or more words in the same text; the more two words collocate,
the stronger their semantic cohesion. The collocation of the two words, ti and tj, is
computed by equation (5),
col(ti, tj) = 2·DFij / (DFi + DFj)    (5)

where DFi and DFj are the numbers of texts in the corpus that include ti and
tj, respectively, and DFij is the number of texts that include both words.
By equation (5), the more texts that include both words, the higher their
collocation.
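The collocation measure of equation (5) can be sketched directly in code; the class and method names are assumptions for illustration.

```java
// Minimal sketch of equation (5). dfI and dfJ are the document frequencies
// of the two words, and dfIJ is the number of texts containing both.
class Collocation {

    static double collocation(int dfI, int dfJ, int dfIJ) {
        if (dfI + dfJ == 0) return 0.0;      // neither word occurs anywhere
        return (2.0 * dfIJ) / (dfI + dfJ);   // equation (5)
    }
}
```

The measure ranges from 0.0 (the words never co-occur) to 1.0 (the words always occur together).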
In Figure 28, the process of expanding the list of words is illustrated. The
given text is indexed into a list of words with the three basic steps. For
each word in the list, its collocations with other words in the external
sources are computed by equation (5), and the words with higher collocations
are fetched. The fetched words are added to the list of actual words, and the
list is expanded with these virtual words. For more efficiency, it is more
desirable to fetch external words that are relevant to only some of the
actual words, rather than to all of them.
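The expansion process above can be sketched as follows. This is a hypothetical illustration: the class and method names are assumptions, the joint document frequencies are passed in as a map keyed by "wordA|wordB", and in practice they would come from a corpus or search engine rather than an in-memory table.

```java
import java.util.*;

// Hypothetical sketch of index expansion: for each actual word, fetch the
// external candidate words whose collocation (equation (5)) exceeds a
// threshold, and append them as virtual words.
class IndexExpansion {

    static List<String> expand(List<String> actualWords,
                               Map<String, Integer> df,       // document frequency per word
                               Map<String, Integer> jointDf,  // key "a|b" -> texts with both
                               double threshold) {
        List<String> expanded = new ArrayList<>(actualWords);
        for (String a : actualWords) {
            for (String candidate : df.keySet()) {
                if (expanded.contains(candidate)) continue;  // already present
                int dfij = jointDf.getOrDefault(a + "|" + candidate, 0);
                double col = 2.0 * dfij / (df.getOrDefault(a, 0) + df.get(candidate));
                if (col > threshold) expanded.add(candidate); // add as virtual word
            }
        }
        return expanded;
    }
}
```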
An example of the index expansion is presented in Figure 29. The text
is indexed into the words 'computer', 'software', 'information', and 'data'. These
words are sent as queries, and their associated words are retrieved as
the virtual words. Both the actual and the virtual words are given as the text
surrogates through the index expansion. Note, however, that the process of
fetching relevant words from external sources is very expensive.
Let us consider the query expansion, which is the process of adding more
relevant words to a given query, as a task similar to the index expansion

Fig. 29 Example of Index Expansion

Fig. 30 Index Optimization

[97]. The query is given by a user, who expresses his or her informa-
tion needs unclearly and abstractly. Once the initial query is given,
its collocations with other words are computed, and highly collocated words
are fetched. References to user profiles may also be used for the
query expansion, in addition to the collocations. The query expansion
is thus intended to make the query clearer and more specific.

2.4.3 Index Optimization

The index optimization refers to the process of optimizing a list of words in
order to maximize the information retrieval performance, and it is illustrated
in Figure 30 [36]. It is viewed as the combination of the index filtering and
the index expansion, which were covered in Sections 2.4.1 and 2.4.2. In other
words, the index optimization comprises the removal of unnecessary words for
more efficiency and the addition of more relevant words for more reliability.
It may be interpreted as an instance of word classification, where each word
is classified into one of the three categories: 'expansion', 'inclusion', and
'removal'. In this subsection, we provide a detailed description of the index
optimization as the composite task of the index filtering and expansion.
In Sections 2.4.1 and 2.4.2, we already studied the two tasks, and we
need both of them for more efficiency and more reliability.

Fig. 31 View of Index Optimization into Classification Task

The index filtering is useful for long texts, for more efficiency, and the
index expansion is useful for short texts, for more reliability. The actual
words in the given text are classified into three groups: the words with
higher weights, which should be reinforced by adding more relevant words to
them; the words with medium weights, which should be retained; and the words
with lower weights, which should be removed. The index expansion causes more
unnecessary computations, and the index filtering causes information loss.
Therefore, we need the index optimization, which is applicable to any text
regardless of its length, for solving the problems of the two tasks.
In Figure 30, the index optimization process is presented. With the three
basic steps, the text is indexed into a list of words. A shorter list of words
is generated by the index filtering. The collocations of the words in the short
list with other words are computed, and their associated words are fetched
and added to the list. As illustrated in Figure 30, the index optimization is
viewed as the combination of the two tasks.
In 2015 and 2016, Jo interpreted the index optimization as a classifi-
cation task, as another view, illustrated in Figure 31 [36] [42]. The three
categories, 'expansion', 'inclusion', and 'removal', are predefined in advance.
Words labeled with one of the three categories are prepared as the sample
data, and the adopted machine learning algorithm is trained with them.
The words indexed from a text are then classified into one of the three
categories. To the words labeled 'expansion', their associated words are
added from the external sources; the words labeled 'inclusion' are retained;
and the words labeled 'removal' are excluded.
We need to consider some issues in applying machine learning algo-
rithms to the index optimization viewed as a classification task.
The first issue is how many sample words we need to prepare in advance
for robustness. The second issue is which machine learning algorithm we
use as the classifier. The third issue is how to define the domain scope
of the texts used as the source of sample words: a narrow scope yields
more reliability but less sufficiency of sample words, whereas a broad scope
yields more sufficiency but less reliability. The last issue is how to define
the attributes for deciding among the three categories.

2.5 Summary

This chapter is concerned with the process of indexing a text or texts into
a list of words. Three steps are basically needed for the text in-
dexing. The text indexing is implemented in Java to provide a guide for
developing text processing systems. The index filtering, expansion, and op-
timization are considered as further text indexing steps. In this section,
we summarize the entire content of this chapter.
The text indexing refers to the process of mapping a text or texts into a list
of words. It is needed for encoding texts into their structured forms. The list
of nouns, verbs, and adjectives is usually the output of the text indexing with
the three basic steps. We consider adding the index filtering, the index
expansion, and the index optimization to the basic steps for improving the
efficiency and the reliability. We also consider other kinds of information
about words, such as the text length, the grammatical categories of words,
and the word positions, for the additional tasks.
A text is indexed into a list of words with the three basic steps: tok-
enization, stemming, and stop word removal. The tokenization refers to the
process of segmenting a text simply by the white spaces and the punctuation
marks. The stemming refers to the process of changing each word into its
root form. The stop word removal means the process of removing the stop
words, which function only grammatically, regardless of the content.
The words are usually weighted by the TF-IDF scheme and its variants.
We implemented the text indexing program in Java, in order to
provide development guides. We defined the classes 'FileString', 'Stem-
mingRule', 'Word', 'Text', and 'TextIndexAPI'. The class 'StemmingRule'
defines the data structure that associates a root form with its varied
forms. We implemented the process of indexing a text by invoking the meth-
ods that correspond to the text indexing steps. We need to improve the
search algorithm used in the implementation for more scalability.
The index filtering, expansion, and optimization are considered as addi-
tional tasks for improving the efficiency and the reliability. The index filtering
refers to the process of removing nouns, verbs, and adjectives that are unimpor-
tant for identifying the content, for more efficiency. The index expansion refers
to the process of adding associated words, fetched from external
sources, to the original list of words. In the index optimization, important
words are selected through the index filtering and their associated words are
added through the index expansion; it is viewed as the combination of
both tasks. The index optimization may also be regarded as a classification
task, to which we may apply a machine learning algorithm.
