
Understanding Search Engine 21

Product: OpenText Content Server


Version: 21.2
Task / Topic: Indexing and Search
Audience: Administrators, Developers
Platform: All
Document ID: 500408
Updated: April 2021

White Paper
Understanding Search Engine 21
Patrick Pidduck, Director, Product Management


Foreword
Since 2009, I have had the honor of working with the extraordinary software
development teams at OpenText responsible for the OpenText Search Engine. Search
has always been a fundamental component of the OpenText Content Suite Platform,
and OpenText pioneered several key technologies that serve as the foundation for
modern search engines. Our team has built upon more than 25 years of search innovation and has contributed to external research initiatives such as TREC for many years.
OpenText knows search.
In the last few years, our customers have pushed scalability and reliability requirements
to new levels. The OpenText Search Engine has met these goals, and continues to
improve with each quarterly product update. Several billion documents in a single search index, unthinkable just a few years ago, are a reality today at customer sites.
This edition of the “Understanding Search” document covers the capabilities of Search
Engine 21. Search Engine 21 is the most recent version, superseding 16.2, 16, 10.5,
10.0, and versions reaching back to Content Server 9.7. We understand our enterprise customers' needs, and this latest search engine provides seamless upgrade paths from
all supported versions of Content Server. While protecting your existing investments,
we continue to add incredible new capabilities, such as efficient search methods
optimized for eDiscovery and Classification applications, enhanced backups, and
integrated performance monitoring.
This document would not be possible without the help of our resident search experts.
As always, you have my thanks: Alex and Alex, Ann, Annie, Christine, Daniel, Dave,
Dave, Dave, Hiral, Jody, Johan, Kyle, Laura, Mariana, Michelle, Mike, Ming, Parmis,
Paul, Ray, Rick, Riston, Ryan, Scott and Stephen.
Patrick.


Contents
Basics........................................................................................................................... 3
Overview ................................................................................................................. 3
Introduction ...................................................................................................... 3
Disclaimer ........................................................................................................ 3
Relative Strengths ............................................................................................ 4
Upgrade Migration: .................................................................................... 4
Transactional Capability: ........................................................................... 4
Metadata Updates: .................................................................................... 4
Search-Driven Update: .............................................................................. 4
Maintenance Commitment: ....................................................................... 4
Data Integrity: ............................................................................................ 4
Scaling: ...................................................................................................... 4
Advanced Queries: .................................................................................... 5
Related Components ....................................................................................... 5
Admin Server ............................................................................................. 5
Document Conversion Server ................................................................... 5
IPool Library .............................................................................................. 5
Content Server Search Administration ...................................................... 5
Query Languages ...................................................................................... 6
Remote Search.......................................................................................... 6
Backwards Compatibility .................................................................................. 6
Installation with Content Server ....................................................................... 6
Search Engine Components .................................................................................. 7
Update Distributor ............................................................................................ 8
Index Engines .................................................................................................. 8
Search Federator ............................................................................................. 8
Search Engines ................................................................................................ 9
Inter-Process Communication ................................................................................ 9
External Socket Connections ........................................................................... 9
Internal Socket Connections .......................................................................... 10
Search Federator Connections ...................................................................... 11
Search Queues........................................................................................ 11
Queue Servicing ...................................................................................... 11
Search Timeouts...................................................................................... 12
Testing Timeouts...................................................................................... 13
File System .................................................................................................... 13
Server Names ................................................................................................ 13
Partitions ............................................................................................................... 13
Basic Concepts .............................................................................................. 13
Large Object Partitions .................................................................................. 16
Regions and Metadata .............................................................................................. 18
Metadata Regions ................................................................................................ 18
Region Names ............................................................................................... 18
Nested Region Names ................................................................................... 19
DROP - Blocking Indexing of Regions ........................................................... 19
Removing Regions from the Index ................................................................ 20
Removing Empty Regions ............................................................................. 20
Renaming Regions ........................................................................................ 21
Merging Regions ............................................................................................ 21
Changing Region Types ................................................................................. 22
LONG Region Conversion ....................................................................... 23
Multiple Values in Regions ............................................................................. 23
Attributes in Text Regions .............................................................................. 24
Region Size Attribute ..................................................................................... 25
Metadata Region Types........................................................................................ 26
Key ................................................................................................................. 26
Text................................................................................................................. 27
Rank ............................................................................................................... 28
Integer ............................................................................................................ 28
Long Integer ................................................................................................... 28
Timestamp ..................................................................................................... 28
Enumerated List ............................................................................................. 30
Boolean .......................................................................................................... 30
Date................................................................................................................ 30
Currency......................................................................................................... 30
Date Time Pair ............................................................................................... 31
User Definition Triplet..................................................................................... 31
Aggregate-Text Regions ................................................................................ 31
CHAIN Regions .................................................................................................... 32
Text Metadata Storage ......................................................................................... 33
Configuring the Storage Modes ..................................................................... 35
Memory Storage (RAM) ................................................................................. 36
Disk Storage (Value Storage) ........................................................................ 36
Low Memory Mode ........................................................................................ 37
Merge File Storage ........................................................................................ 37
Retrieval Storage ........................................................................................... 38
Storage Mode Conversion ............................................................................. 38
Reserved Regions ................................................................................................ 39
OTData - Full Text Region ............................................................................. 39
OTMeta .......................................................................................................... 39
XML Text Regions .......................................................................................... 39
OTObject ........................................................................................................ 40
OTCheckSum ................................................................................................ 40
OTMetadataChecksum .................................................................................. 40
OTContentStatus ........................................................................................... 41
OTTextSize..................................................................................................... 43
OTContentLanguage ..................................................................................... 43
OTPartitionName ........................................................................................... 43
OTPartitionMode ............................................................................................ 44
OTIndexError ................................................................................................. 44
OTScore ......................................................................................................... 44
TimeStamp Regions....................................................................................... 45
OTObjectIndexTime................................................................................. 45
OTContentUpdateTime ........................................................................... 45
OTMetadataUpdateTime ......................................................................... 45
OTObjectUpdateTime .............................................................................. 46
_OTDomain .................................................................................................... 46
_OTShadow ................................................................................................... 46
Regions and Content Server ................................................................................ 46
MIME and File Types ..................................................................................... 47
Extracted Document Properties ..................................................................... 47
Workflow ........................................................................................................ 48
Categories and Attributes............................................................................... 48
Forms ............................................................................................................. 49
Custom Applications ...................................................................................... 49
Default Search Settings ........................................................................................ 49
Indexing and Query .................................................................................................. 50
Indexing ................................................................................................................ 50
Indexing using IPools ..................................................................................... 51
AddOrReplace ............................................................................................... 53
AddOrModify .................................................................................................. 54
Modify............................................................................................................. 54
Delete ............................................................................................................. 54
DeleteByQuery ............................................................................................... 54
ModifyByQuery .............................................................................................. 55
Transactional Indexing ................................................................................... 56
IPool Quarantine...................................................................................... 56
Query Interface ..................................................................................................... 56
Select Command ........................................................................................... 57
Set Cursor Command .................................................................................... 58
Get Results Command................................................................................... 59
Get Facets Command .................................................................................... 61
Date Facets .................................................................................................... 62
FileSize Facets .............................................................................................. 63
Expand Command ......................................................................................... 64
Hit Highlight Command .................................................................................. 65
Get Time......................................................................................................... 66
Set Command ................................................................................................ 66
Get Regions Command ................................................................................. 67
OTSQL Query Language...................................................................................... 69
SELECT Syntax ............................................................................................. 70
FACETS Statement........................................................................................ 71
WHERE Clause ............................................................................................. 71
WHERE Relationships ................................................................................... 72
WHERE Terms ............................................................................................... 73
WHERE Operators ......................................................................................... 73
Proximity - prox operator................................................................................ 76
Proximity - span operator ............................................................................... 77
Proximity – practical considerations .............................................................. 78
WHERE Regions ........................................................................................... 79
Priority Region Chains ................................................................................... 80
Minimum and Maximum Regions................................................................... 81
Any or All Regions .......................................................................................... 82
Regular Expressions ...................................................................................... 82
Relative Date Queries .................................................................................... 85
Matching Lists of Terms ................................................................................. 86
ORDEREDBY ................................................................................................ 88
ORDEREDBY Default ............................................................................. 89
ORDEREDBY Nothing ............................................................................ 89
ORDEREDBY Relevancy ........................................................................ 89
ORDEREDBY RankingExpression.......................................................... 89
ORDEREDBY Region ............................................................................. 89
ORDEREDBY Existence ......................................................................... 90
ORDEREDBY Rawcount ......................................................................... 90
ORDEREDBY Score[N] ........................................................................... 90
Performance Considerations for Sort Order ............................................ 90
Text Locale Sensitivity ............................................................................. 91
Facets ................................................................................................................... 91
Purpose of Facets .......................................................................................... 91
Requesting Facets ......................................................................................... 92
Facet Caching ................................................................................................ 92
Text Region Facets ........................................................................................ 93
Date Facets .................................................................................................... 93
FileSize Facets .............................................................................................. 94
Facet Security Considerations ....................................................................... 94
Facet Configuration Settings.......................................................................... 95
Reserving Facet Memory ..................................................................................... 96
Facet Performance Considerations ............................................................... 96
Protected Facets ............................................................................................ 97
Search Agent Scheduling............................................................................... 98
Interval Execution .................................................................................... 98
Search Agent Configuration ........................................................................... 99
Search Agent Query Syntax......................................................................... 100
New Search Agent Query Files .................................................................... 100
Search Agent iPools..................................................................................... 100
Performance Considerations ....................................................................... 102
Relevance Computation ..................................................................................... 102
Retrieving the Relevance Score .................................................................. 103
Components of Relevance........................................................................... 103
Date Ranking ............................................................................................... 104
Object Type Ranking .................................................................................... 105
Text Region Ranking .................................................................................... 106
Full Text Search Ranking ............................................................................. 106
Relative frequency ................................................................................. 106
Frequency.............................................................................................. 106
Commonality.......................................................................................... 106
Object Ranking ............................................................................................ 107
Relevance Boost Overview .......................................................................... 107
Query Boost ................................................................................................. 108
Date Boost ................................................................................................... 108
Integer Boost ................................................................................................ 109
Multiple Boost Values ................................................................................... 110
Query versus Date / Integer Boost .............................................................. 110
Content Server Relevance Tuning ..................................................................... 110
Date Relevance ........................................................................................... 111
Boosting Object Types ................................................................................. 111
Boosting Text Regions ................................................................................. 114
Default Search Regions ............................................................................... 114
Using Recommender ................................................................................... 115
User Context ................................................................................................ 115
Enforcing Relevancy .................................................................................... 115
Extended Query Concepts ..................................................................................... 117
Thesaurus ........................................................................................................... 117
Overview ...................................................................................................... 117
Thesaurus Files ........................................................................................... 117
Thesaurus Queries ...................................................................................... 118
Creating Thesaurus Files ............................................................................. 118
Content Server Considerations .................................................................... 119
Stemming ........................................................................................................... 120
English Stemming Rules .............................................................................. 121
French Stemming Rules .............................................................................. 121
Spanish Stemming Rules............................................................................. 122
Italian Stemming Rules ................................................................................ 122
German Stemming Rules............................................................................. 122
Alternative Stemming Algorithm ................................................................... 123
Content Server and Stemming..................................................................... 124
Phonetic Matching .............................................................................................. 124
Exact Substring Searching ................................................................................. 125
Configuration ................................................................................................ 125
Substring Performance ................................................................................ 126
Substring Variations ..................................................................................... 126
Included Tokenizers ..................................................................................... 127
Preserving other Query Features ................................................................ 128
Part Numbers and File Names ........................................................................... 128
Problem ........................................................................................................ 128
Like Operator ............................................................................................... 128
Like Defaults ................................................................................................ 129
Shadow Regions .......................................................................................... 129
Token Generation with Like.......................................................................... 130
Limitations .................................................................................................... 130
User Guidance ............................................................................................. 131
Email Domain Search ......................................................................................... 132
Text Operator - Similarity .................................................................................... 133
Top Words........................................................................................................... 135
Stop Words ......................................................................................................... 136
Advanced Feature Configuration .......................................................................... 137
Accumulator ........................................................................................................ 137
Accumulator Chunking ....................................................................................... 138
Reverse Dictionary ............................................................................................. 139
Transaction Logs ................................................................................................ 140
Protection ........................................................................................................... 141
Text Metadata Size ...................................................................................... 141
Text Metadata Values ................................................................................... 141
Incorrect Indexing of Thumbnail Commands ............................................... 141
Cleanup Thread .................................................................................................. 142
Merge Thread ..................................................................................................... 143
Merge Tokens ........................................................................................ 144
Too Many Sub-Indexes .......................................................................... 145
Tokenizer ............................................................................................................ 145
Language Support ....................................................................................... 145
Case Sensitivity ........................................................................................... 145
Standard Tokenizer Behavior ....................................................................... 146
Customizing the Tokenizer ........................................................................... 147
Tokenizer File Syntax ......................................................................................... 147
Tokenizer Character Mapping ...................................................................... 148
Latin Extended-A Character Set Mapping ............................................. 149
Arabic Characters ........................................................................................ 149
Complete List of Character Mappings ......................................................... 151
Tokenizer Ranges ........................................................................................ 151
Tokenizer Regular Expressions .......................................................................... 151
East Asian Characters ................................................................................. 152
Tokenizer Options ........................................................................................ 153
Testing Tokenizer Changes ................................................................................ 153
Sample Tokenizer ............................................................................................... 154
Metadata Tokenizers .......................................................................................... 155
Metadata Tokenizer Example 1.................................................................... 156
Metadata Tokenizer Example 2.................................................................... 156
Administration and Optimization .......................................................................... 159
Index Quality Queries ......................................................................................... 159
Index Error Counts ....................................................................................... 159
Content Quality Assessment ........................................................................ 159
Partition Sizes .............................................................................................. 159
Metadata Corruption .................................................................................... 159
Bad Format Detection .................................................................................. 159
Text Metadata Truncation............................................................................. 159
Text Value Truncation ................................................................................... 160
Search Result Caching ....................................................................................... 160
Query Time Analysis ........................................................................................... 160
Administration API .............................................................................................. 162
getstatustext ................................................................................................. 162
getstatuscode ............................................................................................... 167
registerWithRMIRegistry .............................................................................. 168
checkpoint .................................................................................................... 168
reloadSettings .............................................................................................. 168
getsystemvalue ............................................................................................ 168
addRegionsOrFields .................................................................................... 169
runSearchAgents ......................................................................................... 169
runSearchAgent ........................................................................................... 169
runSearchAgentOnUpdated......................................................................... 170
runSearchAgentsOnUpdated ....................................................................... 170
Server Optimization ............................................................................................ 170
Metadata Region Fragmentation ................................................................. 170
Partition Metadata Memory Sizing ............................................................... 171
Automatic Partition Modes ........................................................................... 173
Memory Usage Mode Switching............................................................ 173
Disk Usage Mode Switching .................................................................. 174
Selecting a Text Metadata Storage Mode .................................................... 175
High Ingestion Environments ....................................................................... 175
Update Distributor Bottlenecks .............................................................. 176
Operation Counts .................................................................................. 178
Percentage Times.................................................................................. 178
Backup Times ........................................................................................ 178
Agent Times........................................................................................... 178
NetIO Stats ............................................................................................ 178
Checkpoint Writing Thresholds.............................................................. 179
Index Batch Sizes .................................................................................. 180
Partition Biasing..................................................................................... 180
Parallel Checkpoints .............................................................................. 181
Testing for Object Ownership ................................................................ 182
Compressed Communications .............................................................. 183
Data Storage Optimization .................................................................... 183
Scanning Long Lists .............................................................................. 183
Ingestion versus Size ............................................................................ 184
Content Server Considerations ............................................................. 184
Ingestion Rate Case Study .......................................................................... 184
Re-Indexing .................................................................................................. 186
Optimize Regions to be Indexed .................................................................. 187
Selecting a Storage System......................................................................... 187
Measuring Network Quality .......................................................................... 189
Measuring Disk Performance....................................................................... 190
Checkpoint Compression ............................................................................. 191
Disk Configuration Settings.......................................................................... 191
Delayed Commit .................................................................................... 191
Chunk Size ............................................................................................ 192
Query Parallelism .................................................................................. 192
Throttling Indexing ................................................................................. 192
Small Read Cache................................................................................. 193
File Retries ............................................................................................ 193
Indexing Large Objects....................................................................................... 193
Servers with Multiple CPUs ................................................................................ 194
Virtual Machines........................................................................................... 194
Garbage Collection ...................................................................................... 195
File Monitoring ............................................................................................. 196
Virus Scanning ............................................................................................. 196
Thread Management.................................................................................... 196
Scalability ........................................................................................................... 196
Query Availability.......................................................................................... 196
Indexing High Availability ............................................................................. 198
Sizing a Search Grid........................................................................................... 199
Minimizing Metadata .................................................................................... 199
Metadata Types............................................................................................ 199
Hot Phrases and Summaries ....................................................................... 199
Partition RAM Size ....................................................................................... 199
Sample Data Point................................................................................. 200
Memory Use .......................................................................................... 200
Redundancy ................................................................................................. 200
Spare Capacity ............................................................................................ 200
Indexing Performance .................................................................................. 201
CPU Requirements ...................................................................................... 201
Maintenance ....................................................................................................... 202
Log Files ............................................................................................................. 202
Log Levels .................................................................................................... 203
Log File Management .................................................................................. 203
RMI Logging ................................................................................................. 204
Backup and Restore .................................................................................... 204
Application Level Index Verification ............................................................. 204
Purging a Partition Index.............................................................................. 204
Step 1 .................................................................................................... 204
Step 2 .................................................................................................... 205
Step 3 .................................................................................................... 205
Step 4 .................................................................................................... 205
Security Considerations...................................................................................... 205
Java Security Policy ..................................................................................... 206
Backup and Restore ........................................................................................... 207
Backup Feature – Method 1......................................................................... 207
Running Backups from Command Line................................................. 210
Restoring Partitions ............................................................................... 210
Backup – Method 2 ...................................................................................... 211
Backup Utilities – Method 3 ......................................................................... 211
Differential Backup ................................................................................ 211
Backup Process Overview .................................................................... 211
Sample [Link] File ................................................................................ 212
Sample Lang File................................................................................... 213
Sample [Link] File .......................................................................... 215
Running the Backup Utility .................................................................... 216
Restore Process – Method 3 ....................................................................... 216
Preparation ............................................................................................ 217
Analysis ................................................................................................. 217
Copy ...................................................................................................... 217
Validate .................................................................................................. 217
[Link] File....................................................................................... 217
Index and Configuration Files ............................................................................... 220
Index Files .......................................................................................................... 220
Signature File ............................................................................................... 221
Accumulator Log File ................................................................................... 221
Metadata Checkpoint Files .......................................................................... 221
Lock File ................................................................................................ 222
Control File ............................................................................................ 222
Top Words ............................................................................................. 222
Config File ............................................................................................. 222
Metalogs....................................................................................................... 222
Index Fragment Folders ............................................................................... 222
Core, Region and Other ........................................................................ 223
Index Files ............................................................................................. 223
Object Files............................................................................................ 223
Offset File .............................................................................................. 224
Skip File ................................................................................................. 224
Map File ................................................................................................. 224
Low Memory Metadata Files ........................................................................ 224
Metadata Merge Files .................................................................................. 224
Configuration Files.............................................................................................. 225
Search.ini ..................................................................................................... 225
Search.ini_override ...................................................................................... 225
[Link]..................................................................................................... 226
[Link] ............................................................................... 226
[Link].................................................................................... 227
Search.ini Summary....................................................................................... 228
General Section ........................................................................................... 228
Partition Section ........................................................................................... 229
DataFlow Section ......................................................................................... 229
Update Distributor Section ........................................................................... 236
Index Engine Section ................................................................................... 237
Search Federator Section ............................................................................ 238
Search Engine Section ................................................................................ 240
DiskRet Section ........................................................................................... 241
Search Agent Section .................................................................................. 241
Field Alias Section ........................................................................................ 241
Index Maker Section .................................................................................... 241
Reloadable Settings ........................................................................................... 242
Common Values ........................................................................................... 242
Search Engines .................................................................................................. 243
Update Distributor............................................................................................... 243
Tokenizer Mapping ............................................................................................. 244
Additional Information ............................................................................................ 256
Version History ................................................................................................... 256
Search Engine 10 ........................................................................................ 256
Search Engine 10 Update 1 ......................................................................... 256
Search Engine 10 Update 2 ......................................................................... 257
Search Engine 10 Update 3 ......................................................................... 257
Search Engine 10 Update 4 ......................................................................... 257
Search Engine 10 Update 5 ......................................................................... 258
Search Engine 10 Update 5 Release 2 ....................................................... 258
Search Engine 10 Update 6 ......................................................................... 258
Search Engine 10 Update 7 ......................................................................... 258
Search Engine 10 Update 8 ......................................................................... 258
Search Engine 10 Update 9 ......................................................................... 259
Search Engine 10 Update 10 ....................................................................... 259
Search Engine 10 Update 11 ....................................................................... 259
Search Engine 10 Update 12 ....................................................................... 259
Search Engine 10.5 ..................................................................................... 260
Search Engine 10.5 Update 2014-03 .......................................................... 260
Search Engine 10.5 Update 2014-03 R2 .................................................... 260
Search Engine 10.5 Update 2014-06 ......................................................... 260
Search Engine 10.5 Update 2014-09 ......................................................... 261
Search Engine 10.5 Update 2014-12 ......................................................... 261
Search Engine 10.5 Update 2015-03 ......................................................... 261
Search Engine 10.5 Update 2015-06 ......................................................... 261
Search Engine 10.5 Update 2015-09 ......................................................... 262
Search Engine 10.5 Update 2015-12 ......................................................... 262
Search Engine 16 Update 2016-03 ............................................................ 262
Search Engine 16.0.1 (June 2016) .............................................................. 262
Search Engine 16.0.2 (September 2016) .................................................... 263
Search Engine 16.0.3 (December 2016) ..................................................... 263
Search Engine 16.2.0 (March 2017) ............................................................ 263
Search Engine 16.2.1 (June 2017) .............................................................. 263
Search Engine 16.2.2 (September 2017) .................................................... 264
Search Engine 16.2.3 (December 2017) ..................................................... 264
Search Engine 16.2.4 (March 2018) ............................................................ 264
Search Engine 16.2.5 (June 2018) .............................................................. 264
Search Engine 16.2.6 (September 2018) .................................................... 265
Search Engine 16.2.7 (December 2018) ..................................................... 265
Search Engine 16.2.8 (March 2019) ............................................................ 265
Search Engine 16.2.9 (June 2019) .............................................................. 265
Search Engine 16.2.10 (September 2019) .................................................. 265
Search Engine 16.2.11 (December 2019) ................................................... 265
Search Engine 20.2 (March 2020) ............................................................... 266
Search Engine 20.3 (July 2020) .................................................................. 266
Search Engine 20.4 (October 2020) ............................................................ 266
Search Engine 21.1 (January 2021) ............................................................ 266
Search Engine 21.2 (April 2021) .................................................................. 267
Error Codes ............................................................................................................. 268
Update Distributor ........................................................................................ 268
Index Engine ................................................................................................ 269
Search Federator ......................................................................................... 269
Search Engine ............................................................................................. 270
Utilities ................................................................................................................ 270
General Syntax ............................................................................................ 270
Backup ......................................................................................................... 271
Restore......................................................................................................... 271
DumpKeys.................................................................................................... 271
VerifyIndex ................................................................................................... 272
RebuildIndex ................................................................................................ 274
LogInterleaver .............................................................................................. 274
[Link] ............................................................... 274
[Link] ....................................... 275
ProfileMetadata ............................................................................................ 275
[Link] ................................................................ 276
SearchClient................................................................................................. 277
Repair BaseOffset Errors ............................................................................. 278
Problem Illustration ...................................................................................... 278
Repair Option 1 ............................................................................................ 279
Repair Option 2 ............................................................................................ 280
New Base Offset Errors ...................................................................................... 283
Index of Terms ......................................................................................................... 284

Basics
This section is an overview of Search Engine 21, and introduces fundamental concepts
needed to understand some of the later topics.

Overview

Introduction
Search Engine 21 (“OTSE” – OpenText Search Engine) is the search engine provided
as part of OpenText Content Server. This document provides information about the
most common Search Engine 21 features and configuration, suitable for
administrators, application integrators and support staff tasked with maintaining and
tuning a search grid. If you are looking for information on the internal details of the
data structures and algorithms, you won’t find it here.
This document is based upon the features and capabilities of Search Engine 21.2,
which has a release date of April 2021.

Where possible, the discussion of OTSE is isolated from the larger context of Content Server in which it operates. However, there are instances where references to Content Server are necessary due to the tight integration between Content Server and OTSE. Paragraphs which are specific to Content Server are usually designated by means of the icon you see at the left.
Occasionally, items of particular interest will be highlighted by means of a sticky note icon, as seen here.

Disclaimer

DISCLAIMER:
This document is not official OpenText product
documentation. Any procedures or sample code are specific to the
scenarios presented in this White Paper, and are delivered as-is,
for educational purposes only. They are presented as a guide to
supplement official OpenText product documentation.
While efforts have been made to ensure correctness, the
information here is supplementary to the product documentation
and release notes.

Relative Strengths
There are many search engines available on the market, each of which has relative
merits. Search Engine 21 is a product of the ECM market space, developed by
OpenText, with a proven record as part of OpenText Content Server. This search
engine has been in active use and development for many years, and was previously
known by names such as “OT7” and “Search Engine 10”.
Because of the nature of OpenText ECM solutions, OTSE has a feature set oriented
towards enterprise-grade ECM applications. Some of the pertinent features which
make OTSE a preferred solution for these applications include:
Upgrade Migration:
As new features and capabilities are added, you are not required to re-index your data.
OTSE includes transparent conversion of older indexes to newer versions. Our
experience is that customers with large data sets often do not have the time or
infrastructure to re-index their data, so this is a key requirement.
Transactional Capability:
During indexing, objects are committed to the index in much the same way that
databases perform updates. If a catastrophic outage happens in the midst of a
transaction, the system can recover without data corruption. Additionally, logical
groups of objects for indexing can be treated as a single transaction, and the entire
transaction can be rolled back in the event that one object cannot be handled properly.
Metadata Updates:
The OpenText search technology has the ability to make in-place updates of some or
all of the metadata for an object. This represents a significant performance
improvement over search technology that must delete and add complete objects,
particularly for ECM applications where metadata may be changing frequently.
Search-Driven Update:
OTSE has the ability to perform bulk operations, such as modification and deletion, on
sets of data that match search criteria. This allows for very efficient index updates for
specific types of transactions.
Maintenance Commitment:
OpenText controls the code and release schedules. This way, we can ensure that our
ECM solutions customers will have a supported search solution throughout the life of
their ECM application.
Data Integrity:
OTSE contains a number of features that allow the quality, consistency and integrity of
the search index and the data to be assessed. These features give system
administrators the tools they need to ensure that mission critical applications are
operating within specification.
Scaling:
Not only can OTSE support very large indices (1 billion+ objects), it can be restructured
to add capacity, rebalance the distribution of objects across servers, switch portions
from read-write to update-only or read-only, and perform in-place addition or removal

of metadata fields. OTSE shelters applications from the complexity of tracking which
objects are indexed into each Search Engine.
Advanced Queries:
Customers engaged in Records Management, Discovery and Classification have
unique query features optimized for these applications. Examples include
searching for N of M terms in a document, searching for similar information, and
conditional term matching where sparse metadata exists.

Related Components
The scope of this document is constrained to the core OTSE components which are
located within the search JAR file ([Link]).
There are a number of other components of both the overall search solution and
Content Server which are strongly related to OTSE but are not covered in this
document. In some instances, because of the tight relationship with other components,
references may be made in this document to these other components. For a complete
understanding of the search technology, you may wish to also learn about the following
products and technologies:
Admin Server
The Admin Server is a middleware application which provides control, monitoring and
management of processes for Content Server. The Admin Server performs a number
of services, and is critical to the operation of the search grid when used with Content
Server. As a rule of thumb, there is generally one Admin Server installed on each
physical computer hosting OTSE components.
Document Conversion Server
DCS is a set of processes and services responsible for preparing data prior to indexing.
DCS performs tasks such as managing the data flows and IPools during ingestion,
extracting text and metadata from content, generating hot phrases and summaries,
performing language identification, and more. You should ensure that DCS is optimally
configured for use with your application before indexing objects.
IPool Library
Interchange Pools (IPools) are a mechanism for managing batch-oriented Data Flows
within Content Server. IPools are used to encapsulate data for indexing. OTSE uses
the Java Native Interface (JNI) to leverage OpenText libraries for reading and writing
IPools.
Content Server Search Administration
While most OTSE setup is managed using configuration files, in practice many of these
files are generated and controlled by Content Server. Many of the concepts and
settings described in this document have analogous settings within Content Server
Search Administration pages, and should be managed from those pages wherever
possible.

Query Languages
This document describes the search query language implemented by the OTSE. It is
common for applications to hide the OTSE query language and provide an alternative
query language to end users. The Content Server query language – LQL – is NOT
described in this document.
Remote Search
Content Server Remote Search currently uses code within OTSE to facilitate obtaining
search results from remote instances of Content Server.

Backwards Compatibility
OTSE is capable of reading all indexes and index configuration files from all released
versions of OpenText Search Engine 20, Search Engine 16.2, Search Engine 16,
Search Engine 10.5, Search Engine 10, and OT7. OT7 is the predecessor to SE10.0
that was part of Content Server 9.6 and 9.7. For most of these, an index conversion
will take place. The new index will not be readable by older versions of the search
engines.
Indexes created with OT6 are not directly readable. Search Engine 10 can be used to
convert an OT6 index to a format Search Engine 10 can use, which can then be
upgraded in a second step using OTSE. In practice, given the improvements and fixes
since OT6, you would be best advised to re-index extremely old data sets. You should
consult with OpenText Customer Support if you are considering a migration from these
older search indices.

Installation with Content Server


This update of OTSE is optimized to be run on the OpenJDK 11.x Java platform, as
installed with current OpenText Content Server updates.
If necessary, OTSE can be provided separately from Content Server, and applied as
an update for fixes to older versions of Content Server running within a Java 8 runtime
environment. OTSE itself is contained within a Java container called
“[Link]”.
There are many services and components to OTSE contained within the OTSEARCH
JAR file, which are differentiated at startup by means of command line parameters.
When deployed with Content Server, multiple copies of [Link] (or [Link] on
Windows) are made with distinct names in order to help differentiate the various Java
processes when using monitoring tools. If you do upgrade the version of Java used
with OTSE and Content Server, you must also remember to make new copies of
these Java wrapper programs. The names for the copies of the [Link] file used by
Content Server are:
[Link]
[Link]
[Link]
[Link]

[Link]
[Link]
[Link]
[Link]
[Link]
The [Link] file is specifically for Content Server Remote Search, and is
not a requirement for other OTSE installations.

Search Engine Components


OTSE is comprised of a number of logical components. These are logical components
because physically they are all located within the same program (contained within
the [Link] file), but started in a different mode of operation based upon
command line parameters. This section presents an overview of each component and
its purpose.

Update Distributor
The Update Distributor is the front end for indexing. The Update Distributor performs
the following tasks, not necessarily in this order:
• Monitors an input IPool directory to check for indexing requests.
• Reads IPools, unpacks the indexing requests.
• Breaks larger IPools into smaller batches if necessary.
• Determines which Index Engines should service an indexing request.
• Sends indexing requests to Index Engines.
• Rolls back transactions and sets aside the IPool message if indexing of an
object fails.
• Rebalances objects to a new Index Engine during update operations if a
partition is too full or retired.
• Manages which Index Engines can write Checkpoints.
• Grants merge tokens to Index Engines that have insufficient disk space.
• Controls the sequence of operations for Index Engines writing backups.

Index Engines
An Index Engine is responsible for adding, removing and updating objects in the search
index. The Index Engines accept requests from the Update Distributor, and update the
index as appropriate. Multiple Index Engines in a system are common, each one
representing a portion of the overall index known as a “partition”.
The search index itself is stored on disk. In operation, portions of the search index are
loaded into memory for performance reasons.
Index Engines are also responsible for tasks such as:
• Converting older versions of the index to newer formats.
• Converting metadata from one type to another.
• Converting metadata between different storage modes.
• Background operations to merge (compact) index files.
• Writing backup files and transaction logs.

Search Federator
The Search Federator is the entry point for search queries. The Search Federator
receives queries from Content Server, sends queries to Search Engines, gathers the
results from all Search Engines together, and responds to the Content Server with the
search results.
The Search Federator performs tasks such as:
• Maintaining the queues for search requests.

• Issuing search queries to the Search Engines.


• Gathering and sorting results from Search Engines.
• Removing duplicate entries from search results.
• Caching search results for long queries.
• Running scheduled Search Agents.

Search Engines
The Search Engines perform the heavy lifting for search queries. They are responsible
for performing searches on a single partition, computing relevance score, sorting
results, and retrieving metadata regions to return in a query. Every partition requires
a Search Engine to support queries.
The Search Engines keep an in-memory representation of key data that replicates the
memory in the Index Engines. The files on disk are shared with the Index Engines.
Search Engines read Checkpoint files at startup and incremental Metalog and
AccumLog files during operation to keep their view of the index data current. These
Metalog and AccumLog files are checked every 10 seconds by default, and any time a
search query is run.
Search Engines also perform tasks such as building facets, and computing position
information used for highlighting search results.

Inter-Process Communication
Each component of the search engine exposes APIs for a variety of purposes. This
section outlines the various communication methods used.

External Socket Connections


Each component of OTSE listens on a configurable port number for socket-level
communications. The port number is configured within the [Link] file. These
socket connections are used for tasks such as:
• Search queries
• Configuration queries
• Status monitoring
• Shutdown, restart and reload
• Backup and restore
Within Content Server, the administration pages allow you to set the IP Address and
Port Numbers used by each OTSE component.
The administrator must ensure that there are no port conflicts for components installed
on the same computer. Socket communication may also occur across computers in a

distributed system. You must ensure that socket communications are not blocked by
firewalls, switches or other networking elements.

Some customers have encountered intermittent problems that can
be traced back to the support of sockets. For example, the Windows
operating system has a configurable limit on the total number of
sockets that can be active, and reserves connections for several
minutes. You may need to adjust the maximum number of
connections upwards, and the reservation time downwards, within
the operating system.
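For illustration, on older Windows versions these limits have corresponded to two
registry values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.
The values shown here are examples only, and should be validated against
Microsoft documentation for your Windows version:

MaxUserPort=65534 (raises the upper bound of the dynamic port range)
TcpTimedWaitDelay=30 (shortens the TIME_WAIT reservation, in seconds)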

Internal Socket Connections


The primary socket communications of interest are from the Update Distributor to the
Index Engines, and from the Search Federator to the Search Engines.
The Update Distributor typically initiates transactions with a broadcast message to
determine which (if any) Index Engine owns an object. If an Index Engine responds,
then the index update request is directed to that specific Index Engine, otherwise the
Update Distributor selects a partition to receive a new item.
The Search Federator, on the other hand, typically broadcasts requests to all the
Search Engines, receives responses from all the partitions, then prepares a
consolidated response.
Socket connections consume system resources, which vary depending on the size of
the search grid. The majority of connections are consumed as listeners, with the single
largest number of possible connections allocated to the search engines, where each
possible simultaneous query on each search engine requires a thread.
For example, if you have 1 Admin Server, 1 Search Federator and 20 Search Engines
with 10 simultaneous queries possible, the peak resource consumption for
communication between the Search Federator and Search Engines is as follows:

Sockets:
• Threads: 210
• Connections: 200
• Ports: 21

The socket connections allocate and hold the threads and connections. Although this
uses the maximum number of resources, there are performance benefits and
predictability since it avoids allocation and re-use overhead that may exist within Java
or the operating system.

Search Federator Connections


Search Queues
A search queue is responsible for listening for search requests on a port, adding
requests to a queue, and allowing a defined number of requests to be executed
concurrently, each on a separate thread.
There are two search queues that may be used: the “normal” queue and the “low
priority” queue. The low priority queue was first introduced in version 16.2.6; prior
versions supported only a single queue. The motivation for the low priority queue is
based on usage patterns in Content Server. There are background programmatic
operations that perform searches, and there are interactive user searches. The
programmatic searches have the potential to consume all available search capacity,
blocking users from having access. The purpose of having two queues is to allow
specific search capacities to be independently reserved for background searches and
user searches.
Use of the low priority queue is optional. By convention, the “normal” queue is always used.
[SearchFederator_xxx]
SearchPort=8500
WorkerThreads=5
QueueSize=25

The low priority queue is disabled by default; a LowPrioritySearchPort value of -1 indicates that the queue is off:
[SearchFederator_xxx]
LowPrioritySearchPort=-1
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25
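To activate the queue, assign a valid, unused port number; the port shown below is
illustrative only:
[SearchFederator_xxx]
LowPrioritySearchPort=8501
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25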

Note that using the low priority queue requires an additional port. As a general
recommendation, small values (perhaps 2 or 3) should be used for the threads to
prevent the low priority searches from consuming too many resources.
Queue Servicing
There are three phases to servicing a request to the Search Federator.
Phase 1 – Content Server indicates a desire to start a search query by opening
a connection to the Search Federator. The connection is put on an operating
system / Java queue (not in the search code).
Phase 2 – a dedicated thread takes the connection from Java, and places it in
an internal queue. If the internal queue is full, the request is discarded and the
connection is closed.
Phase 3 – when a search worker thread becomes available, the connection is
removed from the queue and given to the worker. At this point, the worker
responds to Content Server to indicate it is ready to receive the search request,
and Content Server sends the search query for processing.

Note that in versions prior to 20.2, the process around Phase 1 and Phase 2 was
different – the pending requests were left on the operating system queue, and the
internal queue had an effective size of 1.
Search Timeouts
The Search Federator places a limit on how long it will wait for an application which
has opened a search transaction. If the application does not initiate a message in the
available time, then the Search Federator will close the connection and terminate the
transaction.
Keeping a connection and transaction open is expensive from a resource perspective,
and applications that leave connections open and idle can block search activity by
consuming all available threads from the search query pool.
There are two timeout values. The first is the time between the acknowledgement that
a worker is ready to receive a query and the arrival of the first message. This is
expected to be a short time, and the default is 10 seconds. The second is the time
between messages – for instance between consecutive “GET RESULTS” messages.
This is longer, with a default of 120 seconds. Both times can be adjusted or disabled
in the search.ini_override file. Bear in mind these are timeouts from the server
perspective – Content Server will also have timeout values from the client perspective.

NOTE: if you are testing search, or using an interactive client for
querying search, these timeout values (especially the initial
connection timeout) will likely be too short, and you may wish to
adjust the timeouts accordingly.

Within the [SearchFederator] section of the [Link] file, you may specify the time the
Federator will wait between a connection being created and the first command arriving
(10 second default):
FirstCommandReadTimeoutInMS=10000
Time the Search Federator will wait between commands (2 minute default):
SubsequentCommandReadTimeoutInMS=120000
In either case, the timeouts can be completely disabled with a value of 0.
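For example, a search.ini_override fragment that disables both timeouts for interactive
testing might look like this (a sketch; production systems should normally keep the
defaults):

[SearchFederator]
FirstCommandReadTimeoutInMS=0
SubsequentCommandReadTimeoutInMS=0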
The Search Federator also places a limit on how long it will wait for a response from a
Search Engine with an open search session. If the Search Engine does not reply within
the available time, then the Search Federator will terminate the search session. For
example, if the Search Federator has issued a “SELECT” to a Search Engine, it will
wait a limited amount of time for the reply. This timeout value, in the [DataFlow]
section of the [Link] file, has a default value of 2 minutes:
QueryTimeOutInMS=120000
The search session on a Search Engine will regularly ping the Search Federator to
ensure that it is still responding. If the Search Federator does not answer, then the
Search Engine will terminate its search session to recover resources. In addition, there
is a failsafe timeout which is the maximum time that a Search Engine will leave a
session active. In normal operation, even if the Search Federator fails, this is not

typically encountered. Located in the [DataFlow] section of the [Link] file, the
failsafe timeout value is 6 hours:
SessionTimeOutInMS=21600000
Testing Timeouts
In a test environment, search results are often completed too quickly to permit testing
of system behavior for long searches and search timeouts. For test purposes, there is
a configuration setting that will cause all searches to take at least a defined period of
time. In production environments, this value should be 0.
MinSearchTimeInMS=0
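For example, to force every query in a test grid to take at least five seconds while
exercising timeout behavior (the value is illustrative):
MinSearchTimeInMS=5000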

File System
The Index Engines communicate updates to the Search Engines using a shared file
system. At various times, files may be locked to ensure data integrity during updates.
It is important that the Search and Index Engines have accurate file information for this
to work correctly. Some file systems use aggressive caching techniques that can break
this communication method. The Microsoft SMB2 caching is one example, and it must
be disabled for correct operation of OTSE. Microsoft SMB3 reverts to using the SMB2
protocol in many situations, and so should also be avoided. You must disable SMB2
caching on the servers running the search processes and on the file server. Similarly,
Microsoft Distributed File System (DFS) is known to have unpredictable file locking
behavior and must not be used.
Some customers have also experienced locking issues with NFS, and have needed to
use the NOLOCK or NOAC parameter in their NFS configuration to ensure correct
operation.
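As an illustration only, an NFS mount using the NOAC option might be expressed in
/etc/fstab as follows; the server name, export path and mount point are hypothetical,
and the complete option list should be validated for your environment (the NOLOCK
option would appear in the same options list):

fileserver:/export/otindex /opt/opentext/index nfs rw,hard,noac 0 0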

Server Names
Java enforces strict adherence to the various IETF standards for URIs and server
naming conventions. RFC 952, RFC 2396 and RFC 2373 are examples. Some
operating systems allow server names that do not meet the criteria for these standards.
When this happens, OTSE will likely fail with exceptions at startup. One example we
have seen is violation of this rule in RFC 952: “The rightmost label of a domain name
consisting of two or more labels, begins with an alpha character”. This means a domain
name such as “zulu.server3.7up” is invalid because the “7” must instead be an alpha
character.

Partitions

Basic Concepts
The concept of partitions is central to how OTSE scales and manages search indexes.
A search index may be broken horizontally into a number of pieces. These pieces are
known as “partitions” in OTSE terminology. The sum of all the partitions together
represents the search index.

Splitting an index into partitions is needed for a number of possible reasons:


• For best query performance, some metadata can be stored in memory. There
are practical limits on the amount of memory that can or should be used by a
single Java process. Using partitions allows these limits to be overcome.
• OTSE can often provide better indexing or searching performance by allowing
operations to be distributed to multiple partitions. These partitions can be run
on separate physical or virtual computers or CPUs to improve performance.
• Indexing and searching are disk-intensive activities. By splitting an index into
partitions, the index can be distributed over multiple physical disks and I/O
connections, improving overall search performance.
Each partition is a self-contained subset of the search grid. Each has its own index
files, a Search Engine, and an Index Engine. The partitions are tied together by the
Update Distributor (for indexing) and by the Search Federator (for queries).
Each partition is relatively independent of the other partitions in the system during
indexing. If one partition is given an object to index, the other partitions are idle. The
Update Distributor can distribute the indexing load across multiple partitions. For
systems with high indexing volumes, using multiple partitions this way can help achieve
higher performance, since partitions can be indexing objects in parallel.
A search query normally is serviced by all partitions. Only partitions containing matches
to the query will return results. The Search Federator will blend results from multiple
partitions into a consolidated set of search results.

Update-Only Partitions
It is possible to place a partition in “Update-Only” mode. In this mode, the partition will
not accept new objects to index, but it will update existing objects or delete existing
objects. If a partition is marked as Update-Only, then the Update Distributor will not
send it new objects.
Update-Only behavior is a legacy feature inherited from OT7, and is still supported for
backwards compatibility. However, it is recommended that you do not use Update-
Only mode for future applications. In normal Read-Write mode, OTSE contains a
dynamic “soft” update-only feature which is generally superior. The use and
configuration of dynamic update-only mode is covered elsewhere in this document.
Beginning with Content Server 16, Update-Only mode is not available as a
configuration option from within Content Server.
The default storage mechanism for text metadata is independently configured for
Update-Only partitions. If your default configuration for Update-Only mode differs from
Read-Write mode, then the Index Engines will convert the index data structures the
first time they restart after the configuration is changed. This default configuration
setting is found in the [Link] file.

Read-Only Partitions
OTSE allows partitions to be placed in a “Read-Only” mode. In this mode, the partition
will respond to search queries, but will not process any indexing requests. Objects
cannot be added to the partition, removed or modified.
In operation, when started, the Index Engines for Read-Only partitions will shut down
once they have verified the index integrity. This means that fewer system resources
are being consumed. It also means that, since there is no Index Engine to respond to
the Update Distributor, a new instance of an object will be created in another partition
if you attempt to replace or update an object in a Read-Only partition.
You should only use Read-Only partitions in very specific cases. Customers will
occasionally get into trouble because they use Read-Only partitions when their
applications are still updating objects. This would happen in an application such as
Records Management – a “hold” is put on an object in a Read-Only partition, and a
duplicate entry is inadvertently created in another partition. Similarly, moving items to
another folder, updating classifications, updating category attributes and other
operations will cause this type of behavior. The search engines then respond to search
queries with multiple copies of objects.
The use of “Retired” mode for partitions avoids these issues, and should be considered
instead of Read-Only mode. Beginning with Content Server 16, Read-Only mode will
no longer be provided as a configuration option in the Content Server administration
interface.
Read-Only partitions also have a distinct default configuration for text metadata
storage in the [Link] file, and changing to or from Read-Only mode
may trigger data conversion on startup.

Retired Partitions
OTSE allows partitions to be placed in a “Retired” mode. This mode of operation is
intended for use when a partition is being replaced. The behavior is close to partitions
in Update-Only mode. It will not accept new items, but it will update existing objects or
delete existing objects. If a partition is marked as Retired, then the Update Distributor
will not send it new objects. The key difference is that when an object in a Retired
partition is re-indexed, it will be deleted from the Retired partition and added to a Read-
Write partition.
Support for Retired Partitions is new starting with Search Engine 10.5. Retired mode
is strongly preferred over Read-Only mode, since Retired mode avoids problems
related to creating duplicate copies of objects in the Index.
Retired partitions are also a key feature for merging many small partitions into a set of
larger partitions. This is typical for customers upgrading older systems that use RAM
mode, and are switching to Low Memory mode. In this case, approximately 65% of
the partitions can be marked as “Retired”, and incremental re-indexing of the Retired
partitions will move all the objects out of the Retired partitions. When empty,
the partitions can be removed from the search grid.
One common strategy for moving items from one partition to another is to place a
partition into Retired Mode, perform a search for all items in the Retired partition, add
them to a Collection, and re-index the Collection. This moves all the items that are re-

indexed from the Retired partition into other partitions. In practice, there are often
items left behind in the Retired partition after this is done. Typically, this is to be
expected. Occasionally, a Content Server object will be deleted but not removed from
the index. When this happens, it cannot be Collected. In other cases, the Extractor
may be set to re-index only recent versions of objects, and will not re-index older
versions. In some cases, when a document was deleted, an associated Rendition may
not have been removed from the index. If unsure about whether a re-indexed Retired
partition can be deleted, the OpenText customer support organization may be able to
provide some guidance.
Note that when objects are deleted from a partition, some of the data structures remain
in place. For example, a dictionary entry for a word may exist, even though no objects
now contain that word. It is normal for a retired partition that has had all objects
removed to show a small non-zero size. The search engine will also mark items as
deleted, but leave them in place until scheduled processes compact and refresh the
data – which may take days depending on the situation.

Read-Write Partitions
For completeness, the normal mode of operation for a partition is “Read-Write” mode.
In this mode, the partition will accept new objects, can delete objects and update
objects.
Read-Write partitions can be configured to automatically behave as Update-Only
partitions as they become full. More information on soft Update-Only configuration is
available in the optimization section.

Large Object Partitions


In typical applications, the full text of objects being indexed is truncated, commonly to 5
MB or 10 MB. In most cases, being able to search only the first 5 MB of text in objects
is sufficient. Note that this value applies to the actual text – a 100 MB PowerPoint deck
may only contain 20 KB of actual text.
If searching the complete text of very large objects is required, the configuration
settings can be changed to adjust the truncation size to arbitrarily large values.
However, significantly more memory will be needed for every Search Engine and Index
Engine to handle the very large objects. If 4 extra GB of memory are needed for 100
partitions, that’s 800GB of extra RAM (4 GB x (100 search engines + 100 Index
Engines)).
To address this, the Search Engine can reserve specific partitions for very large
objects. Only those specific partitions need additional memory. When an object is
presented for indexing, the Update Distributor will send very large objects to one of
these reserved partitions, and all other objects are sent to traditional partitions. For
more information on configuring sizes, refer to the section “Indexing Large Objects”.
To reserve a partition for large objects:
[Partition_xxx]
LargeObjectPartition=true

To set the size threshold for determining if an object should be sent to a large object
partition:
[DataFlow_yyyy]
ObjectSizeThresholdInBytes=1000000

Regions and Metadata


OTSE performance and tuning is strongly dependent upon how you configure, index
and query metadata. Everything you need to know about metadata configuration and
tuning should be here somewhere.

Metadata Regions
A region is OTSE terminology for a metadata field. Using a database analogy, you can
think of a region as being roughly equivalent to a column in a database. Understanding
and optimizing how metadata regions are defined and stored has a big impact on
performance, sizing, usability and search relevance. This section provides background
on the administration of regions to optimize the search experience.

Defining a Region
Regions are defined in the configuration file “[Link]”. This file is edited
to define the desired regions and their behaviors, and interpreted by the Index Engines
when they start. Currently, Content Server does not provide an interface for editing
and managing this file, so you must do this with a text editor.
Once a region is defined, it is recorded in the search index. Changing the definition for
an existing region in the [Link] file or attempting to index a metadata
value that is incompatible with the defined region type will usually result in an error. It
is possible to redefine the type for existing metadata regions in many cases as
explained under the heading “Changing Region Types”.

Region Names
There are limitations on the labels which can be used for a metadata region. The rules
for acceptable region names are approximately the same as the rules for valid XML
labels.
The simplified explanation is that almost any valid UTF-8 characters can be used in
the name, with some exceptions. White-space characters (various forms of spaces,
nulls and control characters) are not permitted. To remain compliant with XML naming
conventions, use of a hyphen ( “-” ), period ( “.” ), a number ( 0-9 ) or various diacritical
marks is discouraged as the first character.
The DCS filters often create region names from extracted document properties. In
some cases, DCS will strip white space and punctuation from the property names to
ensure that the region names are comprised of valid characters.
Region names are case sensitive. The region “author” is different from the region
“Author”.

Content Server is often not case sensitive with respect to naming
regions, and may derive region names from sources such as
Categories and Attributes, or workflow fields. This could potentially
lead to name collisions in search, so be alert to possible case
sensitivity issues when creating new regions within Content Server
applications.
Older versions of the search engine had less error checking on region names. It is
possible that some regions exist in legacy indexes that contain null characters. There
are configuration settings in the [Link] file that will instruct OTSE to report and
delete these incorrectly formed regions (set “RemoveRegionsWithNulls=true”),
which you should only need to use if there are null character errors reported when
trying to load an index.

Nested Region Names


Region names during indexing can be expressed in an “XML-like” nesting. If this
occurs, only the top level region is recognized, and the inner values are indexed
including the nesting tags. For example, if the following is presented for indexing:

<customerName>
<firstName>bob</firstName>
<lastName>smith</lastName>
</customerName>

Then the region “customerName” is indexed, and it will have the value:

<firstName>bob</firstName><lastName>smith</lastName>.
Within the definitions file, you can define hierarchy structures that should be ignored
and flattened when looking for regions to index. In the case above, by declaring
“customerName” as a nested region, the field customerName is ignored and the regions
firstName and lastName would be recognized and indexed. This is not intended to
handle arbitrarily complex nesting structures, but was designed to accommodate a few
specific instances in data presented for indexing by Content Server. In particular,
indexing of Workflow objects within Content Server prior to Content Server 10 SP2
Update 10 is the only known requirement for the use of nested region names. Using
the above example, a nested value is expressed within the definitions file like this:

NESTED customerName

DROP - Blocking Indexing of Regions


There is a special operator available for blocking regions from being indexed: the
DROP keyword. When a region is marked for dropping, no values will be indexed for
that region. The DROP operation can only be applied before data for the region is
indexed. Once there is data indexed for a region, DROP is no longer possible, and an
error will be written to the log files.
The DROP operator is “sticky”. Once a region is marked as DROP, this status is
remembered by the index. Deleting the DROP line from the definitions file will not re-
enable indexing for that region. For most applications, use of REMOVE is
recommended instead of DROP. In the definitions file:
DROP regionName

Removing Regions from the Index


The definitions file allows you to remove regions entirely from the index. The REMOVE
operator is used to delete the values and index for the named region. Be cautious
using this command, since there is no way to recover REMOVED data other than re-
indexing the values. REMOVE is an important operator for eliminating low-value
metadata that may be bloating your search index.
The REMOVE operator will also instruct the Index Engines to discard any indexing
requests for the named region, so the region will not be created. This is not sticky –
once the REMOVE entry is deleted from the definitions file, the Index Engine is free to
create and index this region.
The REMOVE operation has precedence over most of the “sticky” settings. NESTED
and DROP regions can be eliminated from the index using the REMOVE operator. To
eliminate a region from the index, in the definitions file:

REMOVE someRegionName

Special considerations exist for the compound region types DATETIME and USER.
USER regions must be removed together in the same way they were defined, with 3
regions removed:

REMOVE OTCreatedBy OTCreatedByFullName OTCreatedByName

DATETIME regions can also be removed in their entirety by specifying both regions:

REMOVE OTVerMDate OTVerMTime

There is a special case supported for removing the TIME portion of a DATETIME pair
to leave only the DATE field behind. Ensure that you also add a DATE field to prevent
conversion of the DATE field to TEXT. There is no method available to remove just the
date portion of a DATETIME field to leave the time intact.

REMOVE OTVerMTime
DATE OTVerMDate

Removing Empty Regions


By default, OTSE will automatically remove empty regions from the search index on
startup. If empty regions are detected and removed, this will trigger the creation of a
new Checkpoint, which will increase the startup time. Some applications in Content
Server create regions with temporary objects; when the objects are subsequently
deleted, the empty regions remain. This capability removes the administration “noise”
of empty regions. This feature can be disabled by adding the following entry in the
[Dataflow_] section of the [Link] file:

RemoveEmptyRegionsOnStartup=false

Renaming Regions
Consider the case where you need to change the name of a metadata field in Content
Server or a custom application. You are now confronted with the problem that data
which is already indexed is using an older name for the region.
OTSE provides a mechanism for handling these situations. Within the region
definitions file, you can rename an existing region like this:

RENAME oldRegionName newRegionName


Renaming of the region occurs at startup of the Index and Search Engines. If the new
region and the old region both already exist, then this represents a conflict, and the
startup will be aborted with an error message.
When a RENAME statement exists in the definitions file, it also affects new data being
indexed. If a region named ‘oldRegionName’ is presented for indexing, it will be
indexed instead as ‘newRegionName’.
If conversions for RENAME are required at startup, this will trigger the writing of
checkpoints.
RENAME works for regions of type enum, integer, long, Boolean, and timestamp.
RENAME also works for text regions with single values stored in RAM (not on disk).

Merging Regions
The merge capability of OTSE is similar to the RENAME capability, but is instead used
to combine two existing regions. Within the definitions file:

MERGE sourceRegion targetRegion


When the engines start, if a region named sourceRegion exists, it will be copied into a
region named targetRegion. Where a conflict exists, the targetRegion has
precedence, and the value in the sourceRegion will be discarded. After the merging
operation is complete, the sourceRegion is deleted.

It is important to note the ability of the MERGE operation to discard
data when a value exists in both the source and target regions. Use
caution.

Once an index is running, any new values for sourceRegion will instead be indexed
within the targetRegion.
If targetRegion does not exist, the effective behavior of the MERGE command is the
same as a RENAME command.
There are limitations. The MERGE operation is NOT capable of merging text metadata
values that contain attributes. For Content Server, this includes the OTName,
OTDescription and OTGUID regions. The attributes will be silently lost during the
merge operation. You must check to ensure that regions being merged do not
incorporate attributes.

If conversions are required for MERGE at startup, this will trigger writing new
checkpoint files.

Changing Region Types


Once a metadata region type definition is made, it is remembered in the search index.
If no explicit type definition was made, the region will have type TEXT. In theory, type
definitions should be made before indexing of objects occurs to ensure that optimal
type definitions are set. In practice, this often does not occur, and leads to a situation
in which metadata regions have type definitions that are incorrect. With Content
Server, it is common for metadata to be indexed as Text, even if it should be a Date,
Boolean or Integer.
It is possible to change the type definition for an existing search region under certain
circumstances. For a type conversion to succeed the format of the values for the target
region type must be compatible with the format of the values of the current type. For
example, attempts to convert TEXT regions to INTEGERS will work if the values are
“123”, but fail for values such as “Harold Smith”.
Assuming value compatibility, the following region type conversions are viable:

From \ To    Boolean   Integer   Long   Enum   Text   Date
Boolean         –          –       –      ✓      ✓      –
Integer         ✓          –       ✓      ✓      ✓      ✓
Long            ✓          ✓       –      ✓      ✓      ✓
Enum            ✓          ✓       ✓      –      ✓      ✓
Text            ✓          ✓       ✓      ✓      –      ✓
Date            –          ✓       ✓      ✓      ✓      –
TimeStamp       –          –       –      –      –      ✓

You cannot change the type of a Text region that has multiple values or uses
attribute/value pairs, since these concepts are only available for Text regions.
The procedure is as follows:
Edit the [Link] (or search.ini_override) file to include the following entry in the
[Dataflow] section: EnableRegionTypeConversionAsADate=YYYYMMDD, where
YYYYMMDD is today’s date. This informs OTSE that type conversion is allowable today.
This is a safety feature to prevent inadvertent region type conversion.
Edit the [Link] file to have the desired region type definitions.
Restart the search processes. On startup, the Index Engines will determine that a
conversion is required, and use the stored values to rebuild the metadata indexes for

the changed regions. This process may require several minutes per partition, longer if
many region types are being defined.
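As a sketch of the complete procedure, suppose an existing TEXT region named
MyPartNumber should become an INT region (the region name and date are
illustrative). In the search.ini_override file, [Dataflow] section:

EnableRegionTypeConversionAsADate=20210415

And in the region definitions file:

INT MyPartNumber

On the next restart, the Index Engines detect the changed definition and rebuild the
metadata index for MyPartNumber from the stored values.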
In the event that a given value cannot be converted, the failure is recorded in the log
files and the OTIndexError count for metadata errors is incremented for the affected
object in the index.
You are strongly encouraged to back up an index before converting region types and
ensure that conversion has succeeded, reverting to the backups if there are problems.
In the log files, each failed conversion has an entry along these lines:
Couldn't set field OTFilterMIMEType for object
DataId=254417&Version=1 to text/plain:

With a summary of errors for each converted region like:


Total number of errors setting field
OTFilterMIMEType=112610:

LONG Region Conversion


Older versions of the default [Link] file specified type INTEGER for a
number of Content Server fields, such as the DataID or ParentID. In current versions,
these are defined as type LONG to accommodate systems that exceed 2 billion
objects. With the old INTEGER definitions, values above 2 billion cannot be
represented and are lost.
The Search Engine will force conversion of some of these INTEGER regions to type
LONG during Index Engine startup if encountered. This conversion was introduced in
version 16.2.5 (June 2018). The list of regions to force to type LONG is defined in a
list in the [Link] file, which has a default value of:
FieldsToBeLongCSL=OTCreatedByGroupID, OTDataID, OTOwnerID,
OTParentID, OTUserGroupID, OTVerCreatedByGroupID,
OTWFManagerID, OTWFMapManagerID, OTWFMapTaskPerformerID,
OTWFMapTaskSubMapID, OTWFSubWorkMapID, OTWFTaskPerformerID

Multiple Values in Regions


A text region may be populated with multiple values. For example, your application
may have a region named “OfficeLocation”. If you are indexing a record for a customer
that had several locations, the indexing entry in the IPools might look something like
this:
<OTMeta>

<OfficeLocation>Chicago, Illinois</OfficeLocation>
<OfficeLocation>Toronto, Ontario</OfficeLocation>
<OfficeLocation>New York, New York</OfficeLocation>

</OTMeta>
This would create 3 separate values for the region OfficeLocation attached to this
object. A search for any of “Chicago”, “Ontario” or “New York” would match this object.

Similarly, if the region OfficeLocation is selected for retrieval, the results would return
all three values.
When updating values in regions, you cannot selectively update one specific value of
a multi-value region. If a new value is provided for OfficeLocation for this object, all 3
existing values would be replaced with the new data – which may be a single value or
multiple values.
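As a sketch, a where clause matching any one of the stored values uses the same
region syntax shown elsewhere in this document:

where [region "OfficeLocation"] "toronto"

An object matches if at least one of its OfficeLocation values contains the term.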

Attributes in Text Regions


OTSE allows the use of attributes with text regions. As an illustration, consider how
Content Server uses metadata attributes to support multi-language indexing and
searching. Within Content Server, multi-language regions are represented this way for
indexing:

<OTMeta>

<OTName lang="en">My red car</OTName>
<OTName lang="fr">Mon voiture rouge</OTName>

</OTMeta>
In addition to using the multiple value capabilities of OTSE, region attributes are used
by Content Server to tag each metadata value with attribute key/value pairs. In this
example, the key is “lang”, and the values are “en” and “fr”.
When constructing a search query, use of the region attributes is optional. A search
for “red car” or a search for “rouge” will find this object and return the values. When
values are returned, the attributes are included in the results only on request.
It is possible to construct a search query against regions that have specific region
attributes. If you only want to locate objects that contain the term “rouge” in the French
language value for OTName, the where clause would look like this:
where [region "OTName"][attribute "lang"="fr"] "rouge"

The query language has also been extended to permit sorting of results using an
attribute. Consider the case where there are values for both French and English, but
the user preference is French. Sorting based on the French values is therefore
desired. Within the “ORDEREDBY” portion of a SELECT statement, the SEQ keyword
is used to specify the attribute to be used for sort preferences:

SELECT ... ORDEREDBY REGION "OTName" SEQ "fr" ASC


In this example the results are sorted by the values within the OTName region which
have an attribute value of “fr”, in ascending order. Since there is no guarantee that the
desired attribute value exists for an object, the following rules are used:
• Use the specified attribute if it exists (in this example, “fr”);
• Otherwise, if the default attribute for this region exists, use it;
• Otherwise, use the attribute which is first alphabetically;

• If there are no attributes, then use the first value.


The concept of a default attribute is defined in the SystemDefaultSortLanguage
entry of the [Link] file. A list of regions for which default attributes should be used
is first defined, followed by the default attributes key/value pairs for each of these
regions. A priority list can be used if desired:
DefaultMetadataAttributeFieldNames="OTName","OTDescription"
DefaultMetadataAttributeFieldNames_OTName="lang"."en"
DefaultMetadataAttributeFieldNames_OTDescription="lang"."en","orig"."true"

NOTE: The INI entry is derived by appending _RegionName to
the base label DefaultMetadataAttributeFieldNames.

The use of attributes with text values for specifying language values is a relatively
simple example. You may index multiple attributes within a single region. You may
also have different attributes for each value. The following example illustrates this
concept for indexing:

<ProductName color="red" origin="china">Cartoon character glass</ProductName>
<ProductName color="blue" size="large">Inflatable Djinni</ProductName>

Within Content Server, attributes are used for multi-language regions
such as OTName and OTDescription. A multi-value region with
attributes is also used to index the object GUID, with the attributes
used to differentiate the object GUID from the version GUID.

Region Size Attribute


There is a reserved region attribute that can be used to specify the size of a region in
bytes: the otb (OpenText Bytes) attribute. This attribute can be supplied on any region,
and is used to prevent forgery or corruption of region data. If this attribute is present,
then the Index Engine requires that the size of the region match the provided otb
value, measured in bytes (not UTF-8 characters). If it does not match, then serious
data corruption or potential data injection is assumed, the metadata for the object
is discarded, and an OTContentStatus code is used to capture the error.
For example, if an attacker provided metadata in the Description field of an object that
looked like this:

Silly stuff</Description><fakeRegion>Certified
Paid</fakeRegion><Description>nothing to see here

Then this data could be wrapped in a legitimate Description region when extracted for
indexing, resulting in:
<Description>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>

Which effectively forges a value for fakeRegion. By using the otb attribute,
<Description otb=94>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>

The Index Engine would notice that the Description region ended after only 11 bytes
instead of 94 bytes, and would prevent the injection of the fakeRegion by flagging the
object metadata as unacceptable. Content Server first began using this otb protection
for regions generated by Document Conversion Server in September 2016, and for
regions provided by Content Server metadata in December 2016.
The otb attribute is never stored in the index. There is a [Link] setting that will
disable this capability, which will ignore the otb value. In the [Dataflow_] section:
IgnoreOTBAttribute=true

Metadata Region Types


This section contains a list of the basic data types supported in metadata regions, and
their syntax within the region definitions file [Link].
The general format of entry in the file is a keyword, whitespace, parameters.
Whitespace can be tab or space characters.
Most region definitions are sticky, so changing the definitions file for
an existing installed application will often generate errors. For
upgrades, replacing the [Link] file is therefore usually
not recommended. When you do upgrade, you should review
release notes to see if there are new regions from DCS or Content
Server that should be manually added to existing definitions files
BEFORE indexing new data.
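As a minimal sketch, a definitions file using several of the types described below
might contain entries such as these; apart from the OTObject key, the region names
are illustrative, and the correct set depends on your application:

KEY OTObject
TEXT OTName
INT OTSubType
LONG OTDataID
TIMESTAMP OTObjectDate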

Key
Each object in the index must have a unique identifier, or key. The KEY entry in the
region definitions file identifies which region will be used as this unique identifier. It is
of type text and may not have multiple values. Exactly one must be defined. The
default Key name is OTObject. During indexing, the Key is typically represented by
the entry OTURN within an IPool. To paraphrase, in a default Content Server

installation, the OTURN entry in an IPool is treated as the Key, and populates the
region OTObject.

KEY OTObject

Text
Text, or character strings. Text strings must be defined in UTF-8 encoding. Text strings
can potentially be very large. Because of this, many customers find that the available
space in their search index is consumed quickly by text regions. To help manage the
large potential sizes, there are several methods available for storing text metadata.
This is covered in a separate section.
Text values may contain spaces and special punctuation. When represented in the
input IPools, certain characters may need to be ‘escaped’ to allow them to be
expressed in the IPools. In general, this means placing a backslash (‘\’) character
before “greater than” and “less than” characters (‘<’ and ‘>’).
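For example (a sketch, with an illustrative region name), a value containing angle
brackets might be represented in an IPool as:

<OTComment>use \<b\> tags for bold</OTComment>

where the backslashes prevent the brackets from being interpreted as region markup.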
There are some features available for TEXT regions which are not available for other
data types, and these may affect the decision about which type of region is suitable for
a given metadata field. TEXT regions support multiple values for an object, and TEXT
regions also support attribute keys and values.
It is possible to index numeric information in a text region, but they
are indexed as strings. When using comparison operations – such
as greater than, less than, ranges and sorting – remember that
strings sort differently than numbers. Intuitively, you expect the
number 123 to be greater than the number 50. But text comparisons
consider 123 to be less than 50. For example, in a TEXT region, a
clause of WHERE [region "partnum"] range "100~200" will
match a value of 1245872. If numeric comparisons are important, a
TEXT region is not a good choice.

TEXT is the “default” type for a region which is indexed without an
entry in the definitions file. Put another way, TEXT metadata regions
are automatically and dynamically created during indexing whenever
a new region name is encountered. If your application allows
arbitrary creation of metadata regions, this may result in unexpected
growth of the search index.
In the definitions file:

TEXT textRegionName
There are default limits on the size and number of values you can place in a text region.
It is possible to configure these limits on a per-region basis. Size is expressed in
Kbytes. These parameters are optional. More details are available in the “Protection”
section of this document.

TEXT textRegionName maxValues=200 maxSize=250

Rank
The RANK type is a special case: its value is a modifier used in computing the relevance
of an object, boosting its position in the result list. For example, frequently used objects
may be given a rank of 50. The default is 0. Values in this region must be between 0
and 100 inclusive. Only one region may be defined with the RANK type. In the definitions
file:

RANK rankRegionName

Integer
An integer is a 32-bit signed value, which can represent an integer value between
-2,147,483,648 and 2,147,483,647. Integer values are stored in memory. Search
results can be sorted on an integer field. In the definitions file:
INT integerRegionName

Long Integer
A long integer is a 64 bit signed value, which can represent a number between
−9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 inclusive. LONG integer
values are stored in memory. Existing Integer fields in an index can be converted to
LONG Integer values by changing their definition. Search results can be sorted on a
LONG integer field. In the definitions file:
LONG longRegionName

Timestamp
A TIMESTAMP region encodes a date and time value. TIMESTAMP values are
expressed in a string format that is compatible with the standard ISO 8601 format. The
milliseconds and time zone are optional, but time up to the seconds is mandatory:
2011-10-21T14:24:17.354+05:00
2011-10-21T14:24:17
Where

2011 – 4 digit calendar year

10 – 2 digit calendar month

21 – 2 digit calendar day

T – separates date from time

14 – 2 digit hour in 24 hour format [00 to 23]

24 – 2 digit minute [00 to 59]

17 – 2 digit second [00 to 59]

354 – milliseconds [000 to 999]

+05:00 – optional time zone offset preceded by + or –

NOTE: 24 is not accepted for 12 midnight, use 00.

The time zone is always optional. If omitted, the local system time zone will be
assumed. The local system time zone is determined from the operating system, but
can also be explicitly set by means of a search INI file setting. Internally, timestamp
values are converted to UTC time before being indexed.
During search queries, lower significance time elements can be omitted. For instance,
the following will all be accepted:
2011-05-30T13:20:00
2011-05-30T13:20
2011-05-30-2:30
2011
If not fully specified, during indexing the earliest possible time for a value will be used.
For example:
2011-05
Would be interpreted as:
2011-05-01T00:00:00.000
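As a sketch of the query side, assuming the range operator shown earlier for TEXT
regions is also accepted against TIMESTAMP regions (ReviewDate is a hypothetical
region), a clause such as the following would match objects whose ReviewDate falls
between the two values, each expanded as described above:

[region "ReviewDate"] range "2011-05~2011-06"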

TIMESTAMP values are kept in memory, stored as 64 bit integers. In the definitions
file:
TIMESTAMP timestampRegionName
There are special behaviors for several reserved metadata regions that use
TIMESTAMP definitions for tracking the time when objects are indexed or modified.
See the section on Reserved Regions for more information.

Enumerated List
The enumerated type is ideal for metadata regions which will have one of a defined set
of values. For example, file type identifiers (Word, Excel, etc.) are members of a set
of file types. Enumerated lists use less memory than text if RAM storage is being used.
In the definitions file:
ENUM enumerableRegionName

Boolean
The BOOLEAN type is used for objects which can have a value of true or false. Fields
of type BOOLEAN use memory very efficiently. In order to accommodate the reality
that different applications represent BOOLEAN values in different ways, the indexing
processes will accept BOOLEAN values in any of the following alternate forms:
true false
yes no
1 0
on off
y n
t f
Boolean values are not case sensitive, so that False, FALSE and false are equivalent.
When retrieved, the values are always presented as true or false, regardless of which
form was used for indexing. If building a new indexing application, the use of true and
false is the preferred form.
BOOLEAN booleanRegionName

Date
A Date region accepts a string that represents a date in the form ‘YYYYMMDD’, where
YYYY is the year, MM the month, and DD the day. For example, 20130208 would
represent February 8th, 2013. Date values can be presented in search facets and
used in relevance scoring computations. This form of a Date matches the format for
dates used in Content Server. The date portion of a DateTime region is effectively a
Date region. The Date region type is first available in Search Engine 10 Update 10.
DATE dateRegionName

Currency
A region can be defined as a currency, a feature first available with Update 2015-09.
When so declared, the input data will be assumed to be in one of several common
forms that are used to represent currency values. The data is stored internally as a
long integer, with an implied 2 decimal digits. Character strings preceding or trailing
the currency value are discarded, which would typically be a symbol or a country
currency designation. Although some tolerance of poorly formed currency values is
built in, the expectation is that well formed data with 0 or 2 digits after the decimal will
be present. Examples of valid currency representations are:
$1,376,378   →  1376378.00
1456.87 AUD  →  1456.87
€ 8.447,75   →  8447.75
$ 4000US     →  4000.00

CURRENCY2 ListPrice

Date Time Pair


The DateTime definition is a special case for convenience in Content Server
applications. Content Server represents dates and times for most metadata regions
as integers. This type is a convenience function that declares the relationship between
a given date region and a time region. DATES for indexing must be an integer of the
form YYYYMMDD, and TIME values must be of the form HHMMSS, where HH is based
on a 24 hour clock. There is no time zone adjustment. Both are stored as integer
regions, and can be independently indexed and queried. This type is not
recommended for new applications. In practice, most Content Server applications only
care about the date, not the time. So creating a DATE field and discarding (REMOVE)
the time portion results in smaller index sizes. In the definitions file:
DATETIME dateRegionName timeRegionName

User Definition Triplet


The User type is a special case for convenience in Content Server applications.
Content Server often uses 3 alternate values to represent a user: a user ID – which is
an integer; a username – which is a text value; and a userFullName – also a text value.
This convenience function declares the triplet as types integer, text, text. Each region
can be separately indexed and queried. In the definitions file:
USER integerRegionName textRegionName textRegionName
This type is not recommended for new applications.

Aggregate-Text Regions
An AGGREGATE-TEXT region has a search index which is the sum of all the regions
it aggregates, but does not store a copy of the values. The values remain within the
original regions. Aggregation only applies to TEXT regions.
Judicious use of AGGREGATE-TEXT regions can improve search performance and
simplify the user experience. Searching many text regions is slower than searching
against an equivalent AGGREGATE-TEXT region. When the AGGREGATE-TEXT
feature is combined with the DISK_RET storage mode for text regions, a significant
reduction in the total memory used to store the index and metadata of the aggregate
is possible if not using Low Memory mode.
AGGREGATE-TEXT regions are constructed using the region definitions file.
Create an entry along these lines:
AGGREGATE-TEXT AggName OTCreatedBy,OTModifiedBy,OTDocAuthor

In this example, a new field is created, “AggName”. The values from the regions
named OTCreatedBy, OTModifiedBy and OTDocAuthor are all placed as separate
values into the AggName field.
There is a special case for defining aggregates, a trailing wildcard character.
AGGREGATE-TEXT DocProperties OTFileName,OTDoc*
This would place the OTFileName region and any text region that starts with OTDoc
into the DocProperties region.
Regions that match the wildcard pattern can be excluded by using an exclamation
mark instead of a comma as the preceding delimiter. The following illustrates excluding
two regions from a pattern match:
AGGREGATE-TEXT DocProperties
OTFileName,OTDoc*!OTDocAuthor!OTDocumentUserRating
The exclusions must be exactly specified: they must follow the wildcard entry, they
must match the wildcard pattern, and they must not themselves contain wildcards.
When the Index Engines start, if the AGGREGATE-TEXT configuration has been
changed, a one-time conversion of the index takes place. The Aggregate configuration
is then subsequently applied to new objects as they are indexed or updated.
Deleting the entry for an AGGREGATE-TEXT field within the region definitions file
does not cause the field to be deleted. The REMOVE command in the
definitions file must be used to remove an AGGREGATE-TEXT region.
REMOVING an AGGREGATE-TEXT region will delete the index for the region, but
does not eliminate the underlying regions that comprise the Aggregate.
If the definition of an AGGREGATE-TEXT field is edited to add or remove regions from
the list of regions which comprise an Aggregate, then when the Index Engines are next
started, the AGGREGATE-TEXT region will be rebuilt. This will take some time, and
results in a new checkpoint being written.
It is possible to combine AGGREGATE-TEXT with any text region storage mode. For
example, if Storage-Only mode (DISK_RET) is used, then only the Aggregate region
can be searched, but each component region can be retrieved.
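A minimal sketch of that combination, reusing the AggName example above (the
DISK_RET configuration syntax is described later under Retrieval Storage): the
aggregate is declared in the region definitions file, and the component regions are
listed for Storage-Only mode:

AGGREGATE-TEXT AggName OTCreatedBy,OTModifiedBy,OTDocAuthor

[DISK_RET]
RegionsOnReadWritePartitions=OTCreatedBy,OTModifiedBy,OTDocAuthor

With this arrangement, queries run against AggName, while the individual values of
the three component regions can still be retrieved for display.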

CHAIN Regions
The CHAIN definition can be used to define a synthetic region which is used for
constructing queries against lists of regions. The list is prioritized. The value of the
first region that is defined (not null) is used for evaluating the query. There is no
additional storage or index penalty since the definition is an instruction used at query
execution that directs how the CHAIN region should be evaluated.
CHAIN UserHandle UserID FacebookID TwitterID

A search for [region "UserHandle"] "bsmith" would be interpreted as:

If UserID is defined for the object:
    match the object if UserID = bsmith
Else if FacebookID is defined for the object:
    match the object if FacebookID = bsmith
Else:
    match the object if TwitterID = bsmith

CHAIN regions can be used with any region type. Using different region types within
a single CHAIN region is not recommended, since not all search operators are
consistently available or applied to all region types.
The [first "UserID","FacebookID","TwitterID"] syntax in a query is equivalent to a
CHAIN region for queries. However, when a CHAIN region is predefined, the value of
the CHAIN region can also be requested in the search results using the SELECT
statement.
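For instance, with the CHAIN definition above, the following two clauses are evaluated
identically; only the predefined CHAIN form also allows the UserHandle value to be
returned via SELECT:

[region "UserHandle"] "bsmith"
[first "UserID","FacebookID","TwitterID"] "bsmith"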

Text Metadata Storage


Text metadata regions usually comprise the bulk of the regions in a search index.
OTSE provides a number of alternate mechanisms for storing text regions. Each of
these alternatives has relative strengths and weaknesses, and the storage modes
should be selected to best meet the needs of your system. This section applies only
to text regions – other region types, such as integers or dates, are always stored in
memory.
Before describing each mode, it is useful to understand the requirements for storage.
Each text metadata region is comprised of an index and values. Value storage is used
to keep an exact copy of the text metadata, allowing it to be retrieved in search queries.
The values usually require significantly more space than the index.
Some modes of operation require that a copy of the index or values be kept in memory,
in addition to the persistent disk storage. Other modes are designed to use the disk
representation for searching. The tables below outline the common configurations for
Text metadata fields, and illustrate differences between search operations and how the
data is stored on disk.
Index Element Locations used during Search Operations

Mode                        | Integers, dates, times, etc. | Text Metadata Index         | Text Metadata Values    | Full Text Index
RAM                         | Memory                       | Memory                      | Memory                  | Merge fragments + AccumLog
DISK                        | Memory                       | Memory                      | Checkpoint + MetaLog    | Merge fragments + AccumLog
Low Memory (+DISK)          | Memory                       | MOD fragments + MODAccumLog | Checkpoint + MetaLog    | Merge fragments + AccumLog
Merge Files (+ Low Memory)  | Memory                       | MOD fragments + MODAccumLog | MODCheck + MODCheckLog  | Merge fragments + AccumLog

Persistent Storage of Index Elements

Mode                        | Integers, dates, times, etc. | Text Metadata Index         | Text Metadata Values    | Full Text Index
RAM                         | Checkpoint + MetaLog         | Checkpoint + MetaLog        | Checkpoint + MetaLog    | Merge fragments + AccumLog
DISK                        | Checkpoint + MetaLog         | Checkpoint + MetaLog        | Checkpoint + MetaLog    | Merge fragments + AccumLog
Low Memory (+DISK)          | Checkpoint + MetaLog         | MOD fragments + MODAccumLog | Checkpoint + MetaLog    | Merge fragments + AccumLog
Merge Files (+ Low Memory)  | Checkpoint + MetaLog         | MOD fragments + MODAccumLog | MODCheck + MODCheckLog  | Merge fragments + AccumLog

It is possible to change the text metadata storage modes for an existing index without
re-indexing the content. The Index Engines can perform any necessary storage mode
conversions when they are started.
Content Server exposes control over the storage modes in the search administration
pages. Beginning with Content Server 16, support of several legacy configuration
modes have been removed, forcing indexes to use DISK + Low Memory + Merge Files
as the proven best overall configuration. For most applications, the configuration file
settings described here will not need to be directly manipulated.

Configuring the Storage Modes


RAM versus DISK storage modes can be explicitly defined for a text region. If not
defined, then a default storage mode is used. Storage modes are specific to the
operating mode of a partition. The storage modes are defined in the storage-modes
configuration file, which looks like this:
[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM
Retired=DISK

[ReadWrite]
SomeRegionName=DISK
OtherRegionName=DISK_RET

[ReadOnly]
ImportantRegionName=RAM

[NoAdd]
HugeRegionName=DISK

[Retired]
HugeRegionName=DISK

The [General] section of this file specifies the default storage mode for text metadata.
The ‘NoAdd’ value is the setting for Update-Only partitions.
You can also specify storage modes for regions which differ from the default settings.
Each partition mode has a section, and a list of regions and their storage modes can
be provided. Note that Low Memory and Merge File storage modes require DISK
configuration as a pre-requisite.
The storage-modes configuration file is generated dynamically by the
administration interfaces within Content Server. Normally, you should
not edit this file.
Beginning with Content Server 16, RAM based storage, ReadOnly mode and NoAdd
mode are no longer available through the administrative interfaces.

Memory Storage (RAM)


In this configuration, the text index and values are stored on disk using the Checkpoint
system. A copy of the index and values is kept in memory for use when searching.
This provides the fastest operation when search results must be retrieved, since it
minimizes disk activity. Conversely, memory storage consumes the most memory in
partitions, and is often the limiting factor in how large a partition may be.
Memory storage is selected using the ‘RAM’ keyword in the storage-modes
configuration file. This mode of operation has been available for many years.

Disk Storage (Value Storage)


In this configuration, the index is stored on disk in the same manner as the Memory
mode above. The key difference is that a copy of the values for Text metadata regions
is not kept in memory (hence the name Value Storage). If values need to be retrieved,
they are read from Checkpoint files on disk. The index for the Text metadata is on disk,
with a copy in memory for search purposes.
Keyword searches are still fast because the index is in memory, but search queries
which need to examine the original data, such as phrase searches, are generally
slower. Retrieving values from disk for display is also slower. If you do not require the
fastest possible search performance, or for regions which are not commonly
searched and displayed, disk storage is a good choice. Disk storage mode is selected
in the storage-modes configuration file using the value “DISK”. This mode of operation
has been available for many years.
Indexing is somewhat slower in Disk storage mode relative to Memory storage. A
typical Content Server installation, which has hundreds of text metadata regions, will
typically see a 30% reduction in the indexing performance with Disk storage relative to
Memory storage. For example, in one of the OpenText test cases using a 4-partition
system performing a 1 million+ objects indexing test: 7 hours 24 minutes with Disk
mode versus 5 hours 9 minutes in RAM mode.

Low Memory Mode


Low Memory disk storage leverages the technology used to represent the full text index
to similarly store text metadata indexes. The text metadata values are stored in the
Checkpoint file, and the text metadata index and dictionary is encoded in files stored
on disk. The overall result is a 3 to 4 times increase in the number of typical Content
Server objects that can be managed by a search partition using the same amount of
memory.
The Low Memory mode for disk indexes was introduced in Content Server 10 Update 9.
Installations of Content Server 10.5 and later will default to Low Memory mode,
overriding the OTSE default of Value Storage mode.
Configuration of Low Memory mode requires DISK mode to be configured in the
storage-modes configuration file as a pre-requisite. Once DISK mode is defined, Low
Memory mode is enabled in the [DataFlow_] section:
MODDeflateMode=1

Switching between Value Storage and Low Memory disk modes will trigger a
conversion of the index format when the Index Engines are next started. Typically,
conversion of a partition should take less than 20 minutes. Value Storage mode is
backwards compatible with versions of Search Engine 10.0 back to Update 2. Low
Memory mode is new beginning with Update 9, and partitions in Low Memory mode
cannot be read by earlier versions of the Search Engine.

Merge File Storage


The Merge File storage method uses a dedicated set of files to persist the Text
metadata values. These operate much like the index files – using background merge
processes to consolidate recently changed values into larger compacted files.
Compared to the alternative of storing the Text metadata values in Checkpoint files,
this is a major advantage since the size of the Checkpoint files is significantly smaller.
This means that the time required to write Checkpoints is reduced, resulting in higher
potential indexing throughput.
The Index Engines support converting existing indexes into and out of Merge File
storage mode for text values when started. The conversion time is approximately the
time needed to start the search grid and write new checkpoints, plus possibly a few
minutes of conversion time.
DISK configuration in the storage-modes configuration file is a required prerequisite.
Use of Low Memory mode for Text Metadata index storage is strongly encouraged as a
prerequisite, since this is the tested variation. The configuration settings are located in
the [Dataflow_] section of the search INI file. By default, Merge File storage is disabled
for backwards compatibility. The key setting is:
MODCheckMode=0

The Merge File storage mode is first available in Content Server 10.5 Update 2015-03.

Retrieval Storage
This mode of storage is optimized for text metadata regions which need to be retrieved
and displayed, but do not need to be searchable. In this mode, the text values are
stored on disk within the Checkpoint file, and there is no dictionary or index at all. This
mode of operation is recommended for regions such as Hot Phrases and Summaries.
These regions do not need to be searchable since they are subsets of the full text
content (you can search the full body text instead). Typical ECM applications see a
savings of 25% of metadata memory using Retrieval Storage mode instead of Memory
Storage for these two fields.
Retrieval Storage mode can be configured using the value DISK_RET. For example,
in the search INI file:

[DataFlow_DFname0]
DiskRetSection=DISK_RET

[DISK_RET]
RegionsOnReadWritePartitions=OTSummary,OTHP
RegionsOnNoAddPartitions=OTSummary,OTHP
RegionsOnReadOnlyPartitions=OTSummary,OTHP

Storage Mode Conversion


When the engines are started, any changes to the storage modes are applied to the
existing index. This requires index conversion, and creation of new Checkpoint files.
This process adds time to the startup. How long? It depends; the size of the index,
the number of fields to convert, the CPU, memory and disk properties are all factors.
In an appropriately scaled hardware environment, this would typically be 10 minutes
per million items in a partition or less, but this time can vary widely.
In general, you can convert between storage modes with impunity. If you put a region
into Retrieval-Only mode and later discover that it needs to be searchable, simply
change the appropriate settings in the [Link] file, restart the search
grid, and everything is wonderful.

In practice, you cannot always convert between storage modes. If
you are close to the limit of available RAM for your partition, then
converting to a more RAM-intensive storage mode may result in the
partition exceeding the available memory. Converting from Low
Memory to Value Storage mode is one example. If you have memory
available, then simply increasing the memory limits can solve this.
Otherwise, you may need to use other tricks, such as rebalancing
partitions to make more room, or deleting or moving other less-
important regions to disk to make space available. If you are
uncertain about whether you will need to convert regions, using a
more conservative partition memory setting may be advisable in
order to ensure you have memory available for future metadata
region tuning.

Reserved Regions
There are a number of region names which are reserved by OTSE, and application
developers must be aware of the restrictions on their use. In most scenarios, the
Document Conversion Server is part of the indexing process, and DCS will also add a
number of metadata regions that are not described here.

OTData - Full Text Region


In some cases, the full text (or body of the content) can be considered to be a region.
The region name “OTData” is reserved for this purpose. A query constructed to look
for a term in the region OTData will search the full text body.

OTMeta
The OTMeta region is reserved for use in two ways. In the first case, the region
OTMeta is reserved to indicate the collection of all metadata regions defined in the
Default Metadata List. This list is described in the search INI file by the entry
DefaultMetadataFieldNamesCSL. A query against the OTMeta region will search
this entire list of regions. Where possible, this should be discouraged since searches
of this form may be relatively slow compared to searching in a specific region,
particularly if there are many regions included in the default search region list.
The second application is using OTMeta as the prefix for a region in a search query. A
query with a WHERE clause of [region "someRegion"] "term" is equivalent to
[region "OTMeta": "someRegion"] "term".

XML Text Regions


The full text search engine has the ability to treat indexed XML files as if they were
regions for query purposes. There is no type definition required, all data is considered
to be of type text. Consider the following XML fragment which gets indexed as part of
the text content:

<furniture>
<chairs>
4
<chairColor>red</chairColor>
</chairs>
</furniture>

You can construct a query to locate objects where the chair color is red. The WHERE
clause of the search query would look something like this:
[region "OTData":"furniture":"chairs":"chairColor"] "red"

The XML search capability does not require a complete XML path specification. The
following WHERE clauses would also match this result, but would potentially also
match other results that are less specific:
[region "OTData":"chairs":"chairColor"] "red"
[region "OTData":"chairs"] "red"

To be a candidate for XML search matching, the XML document must have been
assigned the value text/xml in the OTFilterMIMEType region, which is typically
the responsibility of the Document Conversion Server. The metadata region and the
value for allowing XML content search are configurable in the DataFlow section of the
search INI file:
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml

OTObject
Each index must specify a unique key region which functions as the master reference
identifier for an object. The region which represents the key is declared in the region
definitions file, but by convention and by default, the region OTObject is almost always
used as the key. During indexing, the unique key is defined in the OTURN entry for an
IPool object.
In practice, Content Server uses strings that begin with “DataId=” for the unique
identifier of managed objects. There are special cases in the code that rely on this form
of the OTObject field to determine when certain optimizations can be applied, such as
Bloom Filters for membership within a partition. If you are creating alternative or
custom unique object identifiers, ensure that the string “DataId” is not present in the
identifier to avoid unexpected behaviors.

OTCheckSum
This region contains a checksum for the full text content indexed for an object. The
value is generated by the Index Engines. Attempts to provide an OTCheckSum value
when indexing an object will increment the metadata error count for the object, and be
ignored. You can search and retrieve this region.
Internally, the Index Engines use this field to optimize re-indexing operations by
skipping content that is unchanged. This value is also used by index verification utilities
to verify that data has not been corrupted.

OTMetadataChecksum
This region has several purposes related to checksums for metadata. You cannot
index this region, but you can query against it and retrieve the values. Internally, this
value is used to verify the correctness of the metadata. Errors in the checksum
generally indicate severe hardware errors.

When a new object is indexed, a checksum of each metadata value is made. These
values are combined to create an aggregate checksum value, and the checksum is
stored in the region OTMetadataChecksum.
A background process is then scheduled which runs at a low priority. This process
traverses all objects in the index and recalculates the metadata checksum. If the
recalculated value does not match the stored value, a message is logged, and an error
code (-1) is placed in the OTMetadataChecksum region for that object. Applications
can find objects with metadata checksum errors by searching for a value of -1 in this
region.
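For example, a WHERE clause along these lines would locate objects flagged with
metadata checksum errors:

[region "OTMetadataChecksum"] "-1"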
If an existing index does NOT have checksums computed, then the background
process will populate checksum values. When objects are re-indexed, changes to the
metadata will be reflected in the new checksum. Transactional integrity for metadata
regions that were not changed is preserved.
There are configuration settings in the Index Engine section of the search INI file that
allow the feature to be ON, OFF or IDLE. When IDLE, new indexing operations will still
create checksums, but the background process will not be validating them. In the Index
Engine section of the search INI file, this entry controls the mode, where acceptable
values are ON, OFF and IDLE. Default value is OFF for backwards compatibility.
MetadataIntegrityMode=OFF (IDLE | ON)
By default, the engines will wake up once every two seconds and verify 100 objects:
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
Metadata regions stored on disk are excluded from this processing by default, since
disk files have other checksum validation mechanisms. It is possible to include
checksum validation for regions stored on disk, as indicated below, but the
processing is considerably slower in this mode:
TestMetadataIntegrityOnDisk=OFF (ON)

OTContentStatus
This region is used to record an indicator of the quality of the full text index for each
object. This data can assist applications with assessing the quality of the indexed data,
and taking corrective action when necessary. The status codes are roughly grouped
into 4 levels of severity – level 100, 200, 300 and 400 codes, where 100 level codes
indicate good indexed content, and level 400 codes represent significant problems with
the content.
Applications can provide a status code as part of the indexing process. If the Indexing
Engines encounter a more serious content quality condition (a higher number code)
then the higher value is used. In other words, the most serious code is recorded if
multiple status conditions exist.
The majority of the codes are generated within DCS. Based upon Content Server 16,
the defined codes are:

100 There is no content indexed, only metadata. This is expected behavior, since
no content was provided as part of the indexing request.

103 This is the value for a normal, successful extraction and indexing of a single
document, both text and metadata.

104 One or more metadata regions contained non-UTF8 data. The non-UTF8 bytes
were removed and best-attempt indexing of the region performed. This
behavior only exists when region forgery detection is disabled.

120 The full text content of the indexing request was correctly processed, and is
comprised of multiple objects. The metadata of only the top or parent object
was extracted. The full text content of all objects is concatenated together. An
example is when multiple documents within a single ZIP file are indexed.

125 There were multiple objects provided for indexing, but some of them were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. The metadata of only the top or parent object was extracted. The
full text content of all objects that were not discarded are concatenated
together. A typical example would be when a Word document and JPEG photo
are attached to an email object, and the JPEG was discarded as an excluded
file type.

130 There were one or more content objects provided for indexing, but all were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. There is no full text content.

150 During indexing, the statistical analyzer in the Index Engine identified that the
content has a relatively high degree of randomness. This is a warning, the data
was accepted and indexed.

300 During indexing, the text required more memory than is allowed by the
Accumulator memory settings that are currently configured. The text has been
truncated, and only the first portion of the text that fit in the available memory
has been indexed.

305 Multiple content objects were provided, and at least one but not all of them are
an unsupported file format. There is some full text content, but the content of
the unsupported files have not been indexed.

310 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects consists of an unsupported file
format.

320 Multiple content objects were provided, and at least one but not all of them
timed out while trying to extract the full text content. There is some full text
content, but the content of the objects which timed out have not been indexed.

360 Multiple content objects were provided, and at least one but not all of them
could not be read. There is some full text content, but the content of the objects
exhibiting read problems have not been indexed.

365 One or more content objects were provided, and the full text of at least one but
not all of them could be indexed. At least one of these objects was rejected
because of a serious internal or code error while preparing the content. This
error may or may not recur if you re-index this object.

401 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of
unsupported character encoding.

405 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because the
process timed out while trying to extract the full text content from a file.

406 Non-UTF8 data was found in metadata regions with region forgery detection
enabled. The metadata was discarded.

408 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of a
serious internal or code error while preparing the content. This error may or
may not recur if you re-index this object.

410 DCS was unable to read the contents of the IPool message or the file
containing the content. No full text content has been indexed.
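As a sketch (assuming the range operator shown earlier also applies to this region),
an administrator could locate objects with level 400 content problems using a WHERE
clause along these lines:

[region "OTContentStatus"] range "400~499"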

OTTextSize
This region captures the size of the indexed full text content in bytes. Note that for
many languages there may be fewer characters than bytes. Note also that this reflects
the size of the text extracted by DCS and filters, which can be significantly different from
the OTFileSize region defined by Content Server. This region should be declared as
type INTEGER. First available in update 21.1.

OTContentLanguage
This region is optionally generated by the Document Conversion Server. DCS can
assess the full text content of an object to determine the language in which the content
is written. The language code is then typically represented in this region.

OTPartitionName
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the name of the partition which contains
the object. In a search query, OTPartitionName supports equals and not equals, for
either an exact value or a specific list of values. Operations like regular
expressions or wildcards are not supported. This limited query set is intended to help
administrators with system management tasks, such as locating all the objects in a
given partition. In Content Server, partition names usually start with the text
“Partition_”.

OTPartitionMode
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the operating mode of the partition which
contains the object. In a search query, OTPartitionMode supports equals and not
equals, for either an exact value or a specific list of values. Operations like
regular expressions or wildcards are not supported. This limited query set is intended
to help administrators with system management tasks, such as locating all the objects
in a retired partition. The mode will be one of:

ReadWrite  Normal configuration, including partitions in rebalancing or soft update-only mode.

NoAdd      The partition is configured for updates only.

ReadOnly   The partition is configured for read-only mode.

Retired    The partition is configured in retired mode.

OTIndexError
This field is used to contain a count of metadata indexing errors associated with an
object. Metadata indexing errors occur for situations such as:
• An improperly formatted metadata object. A string value within an integer or
date field would be examples of this.
• An improperly formed region name.
• Attempts to provide values for reserved and protected region names.
For each such instance, the OTIndexError count region is incremented. Applications
providing objects for indexing may provide an initial value. For example, DCS may
have found that a date or integer value it attempted to extract was incorrect, and
therefore could determine that there is already a metadata error before the Index
Engine is provided with the object.
The error counts are incremental. Updates to objects which contain metadata errors
can cause this value to become artificially inflated. For example, if an object is added
with a date error, and then 10 updates include the same date error, then the error
count may be 11.
Applications can query and retrieve this field to help assess the quality of the search
index.

OTScore
This synthetic region usually contains the computed relevance score for a search result
as an integer value. With the default configurations, a relevance score is between 0
and 100. It is important to understand that the relevance score as computed does NOT
have any measurable correlation with the relevance of an object as assessed by a
user. These scores at best must be considered relative. For most applications,
displaying the OTScore (or computed relevance) is not normally appropriate.
Although a simple integer is presented in the OTScore, internally the relevance
differences between objects may be very small fractions. The sorting of objects
internally for relevance is based on the floating point value.

In hindsight, a better name for this region would have been
OTSortRegion. If the results are not ordered by relevance, this
region will not contain a relevance score, but will instead contain the
values which represent the sort key. If results are not sorted
(ORDEREDBY NOTHING) then OTScore will be populated with a
value of 1.

TimeStamp Regions
During indexing operations, the Index Engine can mark objects with the time that
objects are created or updated. This behavior is enabled by including the appropriate
definitions in the region definitions file as described below. When enabled, by
default these timestamps are added on all objects. If trying to minimize the index size,
you might want to add timestamps to only a subset of objects. For example, with
Content Server, you might want to add timestamps to only the Content Server “Index
Tracer” objects. For stamping only limited object types, ensure the TimeStamp fields
are defined in the region definitions file, and add the list of object types to the
[DataFlow_] section of the search INI file. Only objects that contain an OTSubType
value in the list will have the time stamp values added:
IndexTimestampOnlyCSL=147

OTObjectIndexTime
When an object is created, this field will be populated with the current time, as
determined by the system clock. This field has the type TIMESTAMP, and must
be declared in the region definitions file to function.
OTContentUpdateTime
When the text content of an object is updated, this value records the current time
for the update. Only actual changes to the content will trigger a change. If an object
is re-indexed, but the text content is identical, then this value will not be updated.
This region has the type TIMESTAMP, and must be declared in the
region definitions file to function.
The definition of “identical” is based upon the text as interpreted by the index
engine. Changes in the tokenizer or file format filters may result in the text being
declared “different”, even if the master object content is unchanged.
OTMetadataUpdateTime
This field records the time at which the metadata for an object was last modified.
If an object is re-indexed and no metadata changes, then this value is not updated.
This region has the type TIMESTAMP, and must be declared in the
region definitions file to function.

OTMetadataUpdateTime leverages the Metadata Integrity
Checksum feature. Metadata Integrity checking must be set to ON
or IDLE for OTMetadataUpdateTime to function.

OTObjectUpdateTime
This field is updated any time the metadata OR the content is changed. You should
normally not remove this field, since it is required for correct operation of Search
Agents.

_OTDomain
The searchable email domain feature generates synthetic regions by appending this
suffix to the email region name. For instance, if your region that contains email is
OTEmailSender, then the region OTEmailSender_OTDomain will be created to
support the email domain search capability.
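As a sketch, a query clause against the generated region might then look like the
following, assuming the feature is enabled for OTEmailSender (example.com is a
placeholder value):

[region "OTEmailSender_OTDomain"] "example.com"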

_OTShadow
Regions ending with the string _OTShadow are created when the LIKE operator is
configured. If the Content Server region OTName is configured for use with LIKE, then
the region OTName_OTShadow contains the extended indexing information required
by the LIKE feature.

Regions and Content Server


The purpose of indexing metadata into regions is to simplify the user task of locating
information. The quality of the search experience therefore depends on which Content
Server metadata is indexed, and which regions the user queries when looking for
objects. There is a clear tradeoff here – more metadata regions require a larger search
index and higher hardware expenditures.
Content Server is a very flexible platform, supporting a wide range of possible
applications that may include Document Management, Records Management, Data
Archival, Web Site Management, Workflow applications, Litigation Support, and many
other OpenText, custom and 3rd party solutions. The choice of which Content Server
metadata should be added to the index is therefore an important decision.
When shipped, Content Server has a default configuration for metadata indexing. For
many applications, the default configuration is acceptable, and often indexes more
regions than necessary. On the other hand, some applications such as eDiscovery
may have a higher expectation of searchable metadata than the default. Either way, it
is strongly recommended that an assessment of Content Server metadata indexing be
undertaken as part of installing Content Server.
Although this document is focused on OTSE, the choice of metadata to be indexed is
very important. Hence, we will briefly touch on Content Server metadata topics, with
the understanding that you will need to look elsewhere for details.

MIME and File Types


There are several regions typically used to identify the type of a file or object. There
can sometimes be confusion around the purposes and differences of these regions.
The OTLLMIMEType region is basic Content Server “system” metadata. The intent is
that Content Server has set the MIME type, typically based on browser properties or
file name extensions when the document is added to Content Server.
The OTFilterMIMEType region is added by the Document Conversion Server during
indexing, and is based on an assessment of a document by format filter technology,
usually the OpenText Document Filters.
Perhaps the most useful standard region is OTFileType. This region is added by the
Document Conversion Server, but uses a combination of file format analysis, MIME
types, OTSubType and file format extensions to provide better coverage. More
importantly, OTFileType by default has values that are more user friendly, such as
“Microsoft Word” or “Adobe PDF”. The disadvantage is that OTFileType was introduced
with Content Server 10.5, and indexes from older systems will need to re-index to apply
OTFileType values to an older index.

Extracted Document Properties


The Document Conversion Server (DCS) is responsible for extracting properties from
documents and transforming them into metadata regions prior to indexing. There are
a number of configuration settings that affect the number of document properties that
will be extracted.
The settings that have the biggest impact relate to extracted properties of Microsoft
Office documents, or EXIF / XMP data extracted from media files. In addition to a
number of standard pre-defined properties, users (or custom applications) have the
ability to add arbitrary properties to any document. If the DCS settings permit it, each
of these properties becomes a region in the search index. It is not uncommon for
customers with this feature enabled to have thousands of search regions defined this
way. These regions could represent a significant portion of the search index size and
memory requirements.
For new applications, the default DCS behavior is to extract a common subset of the
more useful standard properties for indexing, and discard the rest. This list of the
“useful” regions can be edited within DCS. Other configuration settings are available
to index all Microsoft Office document properties, or disable indexing any Microsoft
Office document properties, or to extract and index all EXIF/XMP metadata fields. Be
sure to review the DCS documentation for your version of Content Server, as the
control over extracted properties may vary based upon the version of Content Server
and the types of format filters being used.
Litigation support or eDiscovery applications may require all these regions to be
searchable. In these scenarios, you may also want to consider the use of
AGGREGATE-TEXT configuration in conjunction with DISK_RET storage modes to
make these values searchable with the minimum index sizing requirements.

NOTE: Legacy installations of Content Server often have indexing
of Microsoft Office document properties enabled. You may wish to
review these settings, and perhaps even remove some of the
existing Microsoft Office document properties from your current
index.

In the index, these types of regions are typically prefixed with OTDocXXXX or
OTXMP_XXXX. Be careful if you choose to remove these, since it is possible that
region names from other sources might match this naming convention. For example,
the Content Server ‘User Rating’ metadata fields OTDocSynopsis and
OTDocUserRating also have this form.

Workflow
Indexing of Workflow metadata from Content Server has been problematic historically,
but is considerably better since Content Server 10.0 Update 10.
Firstly, the default Workflow configuration indexes all the internal Workflow metadata
to the search engine. In most applications, many of these regions have no value for
user search. The default region definitions file has DROP or REMOVE instructions in
place to prevent this data from being indexed. If you need to make these metadata
fields searchable, edit the definitions file appropriately.

NOTE: Older Content Server systems defaulted to indexing all the
Workflow metadata as text regions. You may wish to consider
removing these regions or changing their type where possible.

The other aspect is Workflow Map attributes. These are presented as regions for
indexing in the form WFAttr_xxxx, where xxxx is text that represents the name of the
Workflow attribute. It is possible for a very large number of these WFAttr_ regions to
exist, especially in older versions of Content Server where the default setting was to
always index these regions. This increases the size of the index. If you do not need
to search on these fields, you might consider DROP or REMOVE in the definitions file.
If searching the aggregate value of these fields is sufficient, you might also want to
consider using AGGREGATE-TEXT for queries against these regions, in conjunction
with DISK_RET for storing the values.
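For instance, a definitions file entry along these lines, using the trailing wildcard syntax
described earlier (WFSearchable is a hypothetical region name), would aggregate all
Workflow attribute regions into a single searchable region:

AGGREGATE-TEXT WFSearchable WFAttr_*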

Categories and Attributes


For search indexing purposes, the metadata fields for Categories and Attributes are
presented to the Index Engines in the form Attr_1234567_12. Depending on the
Attribute type, this is sometimes also appended with an additional underscore
character and text.
Often, Category and Attribute data is comprised of defined values, which are optimally
represented within the search index as enumerated data types (ENUM within the
definitions file), or as integer values. If you want to optimize the search index to
minimize the memory consumed by metadata, you will need to modify the region
definitions file and restart the search grid BEFORE these values are indexed. Once
indexed, they will be marked as type ‘TEXT’, and cannot be changed short of removing
the entire region and re-indexing the objects, or using the region type conversion
features.
This is an optimization consideration only. Leaving the Category and Attribute values
as TEXT within the index does not affect feature availability, although differences in
behavior between integer and text values may be a concern.

Forms
Within Content Server 10, the Forms module permits users to create arbitrary labels
for form fields. The region names are generated directly from these labels.
Unfortunately, this can result in conflicts with other search regions in the index. It is
recommended that you enforce a business practice of prefixing all form names with a
unique value, such as OTForm_. This will provide two major benefits: it will minimize
the chance of name conflicts, and it allows use of AGGREGATE-TEXT regions to
improve search usability.
Content Server 10.5 or later will generate region names that follow a well defined
syntax, along the lines of OTForm_1234_5678. This change makes it much easier to
identify regions associated with forms, and simplifies selecting them for REMOVE or
aggregation purposes.

Custom Applications
It is common for OpenText customers to create their own solutions using Content
Server as a platform. Often, the considerations for metadata indexing and search are
overlooked. If you have custom applications that index metadata fields, you should
consider the impact on search index size and performance.
• Only index object subtypes that are of interest to users
• Only extract metadata fields that are useful for search
• Ensure that the region definition file has optimal configuration for each region
• Provide a unique prefix so that the custom metadata will not conflict with
other region names
• If appropriate, add the custom regions to the default Content Server search
regions.

Default Search Settings


Content Server ships with a set of default search regions and default search relevance
ranking settings. Review these defaults against your application requirements, and
change them as appropriate. These settings really do have an impact on relevance
computation and object findability. Refer to the section which describes the search
relevance computation for more information.

Indexing and Query


There are two fundamental tasks for any search engine – put data into it, and formulate
queries to find it. This section explores how OTSE exposes these features.

Indexing
Updating the search index is performed by preparing files containing indexing
commands in a defined location. The input files and structures are in the OpenText
“IPool” format. The Update Distributor watches for these files, and initiates indexing
when IPools arrive.
A single IPool may contain many indexing commands and objects. Updates to the
index from an IPool are only “committed” once all of the messages within the IPool are
successfully handled. If either the Update Distributor or one of the Index Engines is
unable to process a message, then the indexing process will halt and all the
changes from the IPool are rolled back when the Index Engines are restarted. This
behavior applies to serious indexing IPool errors, such as malformed IPool messages.
Objects that are too large, for example, are not IPool errors.

When a serious indexing problem occurs, one or more elements of
the indexing grid will have stopped with exceptions. The offending
IPool needs to be removed from the input queue; otherwise the
problem will simply repeat and recur when the indexing grid is
restarted. On the 3rd restart attempt, the offending IPool will be
moved to quarantine.

If multiple partitions exist for an index, the Update Distributor chooses which partition
will index an object. Some operations, such as Modify By Query, are broadcast to all
the Index Engines. Most operations are specific to a single partition, and the first step
in deciding which partition to use is to ask if any of the existing Index Engines already
have an entry with the same object identifier (the “Key” value). If one of the Index
Engines responds affirmatively, then the object is given to that Index Engine to add,
modify or remove.
If no partition already has the object, the Index Engine will make a selection based
upon the Read-Write or Update-Only mode of the partitions, and whether they are full.
Partitions which are in “Update-Only” or “Retired” mode are never given new objects
to index. Partitions which are in “Read-Only” mode do not have Index Engines running,
and are not given any indexing tasks.

The order of processing is not guaranteed within an IPool. Placing
multiple operations for the same object in a single IPool may
generate unexpected results. For example, when multiple types of
operations exist in a single IPool (adds, deletes and modifies), the
Update Distributor may batch similar operations together to obtain
performance improvements.

NOTE: As long as we are discussing IPools, some trivia: although
IPools look very much like XML, they aren’t quite XML. IPool
syntax evolved over the years at OpenText from earlier versions of
our search technology, which were developed by a gentleman
named Tim Bray, among others. Tim leveraged his OpenText
search and SGML experience to later guide the specification of
XML.

Indexing using IPools


Interchange Pools (IPools) are used for many purposes within Content Server, and can
contain many objects or operations. IPools are used as the mechanism for providing
input into the Update Distributor for indexing. The discussion of IPools in this section
is strictly limited to an overview of IPools for the purpose of indexing objects.
IPools are not typically constructed directly by an application. OpenText provides
linkable libraries that provide utilities for reading and writing IPools. These libraries are
used by applications creating IPools, and also used through the Java Native Interface
(JNI) by the Update Distributor to read the IPools. However, when diagnosing search
indexing issues, a basic understanding of the IPool structures can be useful.
An indexing object has the following basic form shown below. Only a single object is
displayed, although an IPool may contain many objects. Note that within an IPool no
white space (new lines or indentation) is provided for formatting – it has been added
here for readability.

<Object>
  <Entry>
    <Key>OTURN</Key>
    <Value>
      <Size>16</Size>
      <Raw>8273908620;ver=1</Raw>
    </Value>
  </Entry>
  <Entry>
    <Key>Operation</Key>
    <Value>
      <Size>12</Size>
      <Raw>AddOrReplace</Raw>
    </Value>
  </Entry>
  <Entry>
    <Key>MetaData</Key>
    <Value>
      <Size>187</Size>
      <Raw>
        <FileName>/MyContentInstances/[Link]</FileName>
        <ObjectTitle>Things that go bump</ObjectTitle>
        <OTName>Cars</OTName>
        <OTName lang="fr">Voitures</OTName>
        <OTCurrentVersion>true</OTCurrentVersion>
      </Raw>
    </Value>
  </Entry>
  <Entry>
    <Key>ContentReferenceTemp</Key>
    <Value>
      <Size>20</Size>
      <Raw>C:/dev/[Link]</Raw>
    </Value>
  </Entry>
  <Entry>
    <Key>Content</Key>
    <Value>
      <Size>28</Size>
      <Raw>full text to be indexed here</Raw>
    </Value>
  </Entry>
</Object>

The <Size> value reports the number of characters contained within a <Raw> section.
The <Raw> section contains the actual values. The <Raw> section can contain
arbitrary data expressed in UTF-8 encoding, and does not require character escaping
because the <Size> is known, although for metadata regions this data is expected to
be structured much like XML. The <Key> value specifies the top level purpose for
each entry, sometimes processed by DCS, sometimes by the Index Engines. This
object contains 5 entries – the OTURN, Operation, Metadata, and content referenced
in two different ways.
Every object to be indexed requires a unique identifier. For typical Content Server
applications, the unique identifier is provided in the region “OTURN”, as shown in this
example. The value for the OTURN is “8273908620;ver=1” – different Content Server
modules may provide OTURN values in different forms. Operations such as
ModifyByQuery would use a query “where clause” as the OTURN.
The Operation entry instructs the Index Engines how the object should be interpreted
as explained in the sections below.
The Metadata entry provides the region names and values to be indexed. In the
example above, metadata for the regions FileName, ObjectTitle, OTName and
OTCurrentVersion is provided. You can specify multiple values for one region. The
OTName region, for example, has two values, and one of them also uses the attribute
key/value feature of OTSE to specify that “Voitures” is the French language value.
The entry for ContentReferenceTemp is used to identify that the content data is located
at the specified file location. The IPool libraries would normally delete the file after
processing, since by convention ContentReferenceTemp is used when a temporary
copy of a file was made. A permanent copy can also be specified using
ContentReference as the key, which does not delete the original. IPools given to the
Index Engines normally should NOT have either ContentReferenceTemp or
ContentReference entries, since extraction and preprocessing of files should already
have occurred to extract the raw text data. These modes are common for earlier steps
in the DCS process.
The entry for Content in the example indicates that the data in question is contained
within the IPool, in the <Raw> section. This is the normal expected use case for IPools
being consumed by the Update Distributor. Note that this artificial example contains
both Content and ContentReferenceTemp entries; in practice, having both is atypical.
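
To make the structure concrete, the following Python sketch serializes one
AddOrReplace object into this entry syntax. This is a minimal illustration, not the
OpenText IPool libraries: the values are taken from the example above, and a real
IPool is written without formatting whitespace.

# Minimal sketch (not the OpenText IPool libraries) of serializing one
# indexing object into the entry syntax described above.

def ipool_entry(key, raw):
    # <Size> counts the characters in the <Raw> payload, which is why
    # no character escaping is required.
    return ("<Entry><Key>%s</Key><Value><Size>%d</Size>"
            "<Raw>%s</Raw></Value></Entry>" % (key, len(raw), raw))

def ipool_object(oturn, operation, metadata, content):
    entries = (ipool_entry("OTURN", oturn) +
               ipool_entry("Operation", operation) +
               ipool_entry("MetaData", metadata) +
               ipool_entry("Content", content))
    return "<Object>" + entries + "</Object>"

print(ipool_object(
    oturn="8273908620;ver=1",
    operation="AddOrReplace",
    metadata="<ObjectTitle>Things that go bump</ObjectTitle>",
    content="full text to be indexed here"))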

AddOrReplace
This is the primary indexing operation used to create new objects in the index. If the
object does not exist, it will be created. If an entry with the same OTURN exists in
either a Read-Write or Update-Only partition, then it will be completely replaced with
the new data, equivalent to a delete and add.
The AddOrReplace function distinguishes between content and metadata. If an object
already exists, and metadata only is provided, the existing full text content is retained.
However, the line between content and metadata is somewhat blurred. The DCS
processes will typically extract metadata from content and insert this metadata into
regions for indexing. There is a list of metadata regions which are therefore considered
to be “content”, and not replaced or deleted if content is not provided in a replace
operation.
The list of metadata considered to be content for this purpose is defined in the
[DataFlow_] section of the [Link] file by:

ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType
ExtraDCSStartsWithNames=OTDoc,OTCA.OTXMP_,OTCount_,OTMeta_
DCSStartsWithNameExemptions=OTDocumentUserComment,
OTDocumentUserExplanation
ExtrasWillOverride=false

The ExtrasWillOverride setting is used to disable this feature, which would cause the
regions to be deleted if content is not indexed in an AddOrReplace operation. The
DCSStartsWith entry is used to capture the dynamic regions that DCS extracts from
document properties.
The Exemptions list identifies regions that should not be treated as part of the full text
content, despite matching the DCS “starts with” pattern.
The AddOrReplace function can also trigger “rebalancing” operations. If the target
partition is Retired or has exceeded its rebalancing threshold, the Update Distributor
will instead delete the object from the partition where it currently resides, and redirect
the AddOrReplace operation to a partition with available space.


AddOrModify
The intended use of AddOrModify is to update selected metadata regions for an item
thought to already exist in the index. The AddOrModify function will update an existing
object, or create a new object if it does not already exist. When modifying an existing
object, only the provided content and metadata is updated. Any metadata regions that
already exist which are not specified in the AddOrModify command will be left intact.
There is no mechanism to delete a region which has already been defined for an object,
but you can delete the values by providing an empty string as the value for the region
("").
One potential downside of the AddOrModify operation is that if you selectively modify
metadata regions and the target object is not already correctly indexed, you will create
a new object that only has the metadata regions or content which was defined in the
modify operation. This will effectively create an object which only has partial data
indexed. If you provide all metadata region values in a modify operation, this situation
will not arise. New applications may want to consider using the “ModifyByQuery” or
“Modify” indexing operators instead of AddOrModify, since these do not create an
object if it does not already exist.

If you have “Read-Only” partitions and attempt to modify an object in a Read-Only
partition, this will create a duplicate object. This happens because Read-Only
partitions do not have Index Engines running. No Index Engine claims ownership of
the object, so it is assumed that the object does not exist, and it is created in another
partition.

Modify
The Modify operation is used to update specific metadata in an object. Unlike the
AddOrModify operation, Modify will never create a new object. If the OTURN specified
in a Modify operation does not exist, the transaction is simply discarded. Modify can
add new metadata, or replace existing metadata. Metadata for regions not included in
the IPool message are unaffected.

Delete
The Delete function will remove an object from the index, including both the metadata
and the content.
Note that if an object exists in multiple partitions, it will only be removed from the
partition to which the Update Distributor sent the Delete operation. This is a very rare
case, and would likely only arise if partitions were marked as Read-Only, then updates
to objects in the Read-Only partition were performed.

DeleteByQuery
The DeleteByQuery operator deletes objects which meet the provided search criteria.
A standard “WHERE” clause is provided in OTURN. This operator can be used to
delete many objects at once. Since the Update Distributor broadcasts the function to
all active partitions, duplicate objects can also be removed.
DeleteByQuery is particularly useful for applications that no longer track the
unique identifier for an object.

Some versions of Content Server have difficulty removing Renditions from the search
index, since the delete operation given to the indexing system happens after the
information about the Rendition is removed from the Content Server database. Using
DeleteByQuery, these objects can still be deleted from the index because they have
a unique pattern which can be located with a search.

Applications which need to perform bulk deletes on a project will also find this far more
efficient. Instead of issuing 25,432 delete requests, one for every object in a project, a
single DeleteByQuery operation with an OTURN of
[region "ProjectName"] "old project"
would delete all objects marked as belonging to the project in a single transaction.

ModifyByQuery
This operation is used to selectively modify the content or specific metadata regions
for objects in the index. The affected objects are specified by search parameters – a
valid “WHERE” clause within the OTURN entry of the IPool. If no objects match the
query, then no updates are performed. Every object in the index which matches the
query will have the provided regions updated. Other regions for objects are not
affected; for example, you could change the value in the region “CurrentVersion” to
“false” without modifying values in other regions.
The Update Distributor will send ModifyByQuery operations to every active partition.
To modify a specific known object, you can place an object ID in the OTURN field:
[region "OTURN"] "ObjectID=1833746;ver=3"
You can also quickly perform bulk operations, such as marking all the objects
associated with a specific project as “released”. The IPool would contain region values
such as:
<ProjectStatus>released</ProjectStatus>
The OTURN entry of the IPool would contain a ‘WHERE’ clause such as:
[region "ProjectName"] "Great Scott"
All objects with the value of “Great Scott” in a region labeled “ProjectName” will then
have their ProjectStatus region populated with the value “released”.
A value for a region cannot be completely removed, but it can be replaced with an
empty string by providing a region definition in the IPool that has an empty string:
<ProjectStatus></ProjectStatus>
The full text content of an object cannot be updated using ModifyByQuery.
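
Putting the pieces together, a complete ModifyByQuery object within an IPool might
look like the sketch below. The values are the ones used in this section; as in the
earlier example, the whitespace is added only for readability, and each <Size> counts
the characters of its <Raw> payload.

<Object>
<Entry>
<Key>OTURN</Key>
<Value>
<Size>36</Size>
<Raw>[region "ProjectName"] "Great Scott"</Raw>
</Value>
</Entry>
<Entry>
<Key>Operation</Key>
<Value>
<Size>13</Size>
<Raw>ModifyByQuery</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>39</Size>
<Raw><ProjectStatus>released</ProjectStatus></Raw>
</Value>
</Entry>
</Object>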


Transactional Indexing
The indexing process with OTSE is transactional in nature. This essentially means
that the indexing request is not deleted until the index updates have been committed
to disk.
Transactional indexing ensures that no indexing requests are lost in the event of a
power loss or similar problem while indexing is taking place.
OTSE treats all of the indexing requests within an input IPool as a single transaction.
The input IPool is not considered complete until every request in the IPool is serviced
and committed to disk. Only then is the IPool deleted.
There are performance considerations related to transactional indexing. The more
objects there are within an IPool indexing transaction, the more efficient the indexing
process is. This is because a new index fragment is created each time a transaction
completes. Many objects in a transaction therefore generate fewer new index
fragments, and use the disk bandwidth more efficiently.
The converse of this is the time to index. By collecting index updates and packaging
them into transactions, on low-load systems the average time for an object to be
indexed becomes somewhat longer. The majority of applications do not have a
requirement to minimize the lag time between an object update and the moment the
changes are reflected in the index, so placing large numbers of objects in the indexing
IPool is generally the best approach.
OTSE does not collect objects to create transactions. The number of objects in a
transaction is set by the upstream applications which are generating the indexing
updates. By default, Content Server 16 will attempt to package up to 1000 objects
within a single indexing transaction.
IPool Quarantine
In the event that an object in an IPool cannot be indexed because of severe errors, the
affected indexing component will halt. Upon restart, all of the indexing operations for
the IPool will be rolled back. Depending on the error code and configuration settings,
the Admin Server might automatically restart the component. If an IPool fails in this
way 3 times, it is moved into quarantine and the next IPool is processed. The
quarantine location is a sub-directory named \failure in the IPool input directory. If there
are too many quarantined items, the IPool libraries can be configured to either halt or
discard the oldest IPool. Quarantine behavior is configured in Content Server, not in
OTSE.

Query Interface
Queries to OTSE are submitted to the Search Federator over a socket connection
using a language known as OpenText Search Query Language (OTSQL). Applications
communicating directly with the Search Federator will need to understand and
implement this wire-level protocol exposed by the Search Federator. Content Server
implements this protocol, as does the Admin Server component of Content Server and
the search client built into OTSE.


Connection to the Search Federator requires knowledge of the computer IP address
and the port number on which the Search Federator is listening, which is configurable
within the [Link] file. The search client will need to establish a basic text socket to
engage in a query conversation, which is a generic network function available in most
programming languages. The OTSQL commands and responses described here are
conveyed across the socket connection.
A conversation with the Search Federator consists of opening a socket connection,
issuing commands, receiving responses, and closing the socket connection.
Managing the number of open connections can be important in optimizing the overall
resource use in OTSE. There are two settings: the number of queries that can be
simultaneously active (being serviced by the Search Engines); and the queue size
(maximum number of queries waiting for service). By default, the queue size is 25 and
the active query limit is 10. When the queue is full, the Search Federator simply does
not accept any additional socket connections.
A typical query conversation between an application and the Search Federator is:

open socket connection
set parameters
select
set cursor
get results
get results
get facets
hh
get time
close socket connection

Responses from the Search Federator are expressed in a clear text data stream which
explicitly includes data size information to allow parsing values without needing to
escape special characters.
The available commands are described below. The commands themselves are not
case sensitive, although parameters to the commands such as region names may be
case sensitive.
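
The short Python sketch below illustrates the shape of such a conversation. The host,
port and query are placeholders (the actual port is set in the configuration file), and
the response reading is deliberately naive; a production client must parse the
length-prefixed response format described in the following sections.

import socket

HOST, PORT = "searchhost", 5320   # hypothetical values; use your configured port

with socket.create_connection((HOST, PORT)) as sock:
    f = sock.makefile("rw", encoding="utf-8", newline="\n")
    for command in ('select "OTName" where [region "OTName"] stem "boat"',
                    "set cursor 1",
                    "get results 10",
                    "get time"):
        f.write(command + "\n")
        f.flush()
        # Naive read: responses shown in this document end with </OTResult>.
        while True:
            line = f.readline()
            print(line, end="")
            if "</OTResult>" in line:
                break
# Closing the socket ends the query conversation and releases the results.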

Select Command
The select command is used to initiate a query. This command is essentially the
OpenText “OTSTARTS” query language, which is described in more detail in the
OTSQL section of this document. The basic form is:

select SELECTLIST [FACETLIST] where QUERYTERMS [orderedby ORDER]
The SELECTLIST defines the metadata regions that should be retrieved in the results.
FACETLIST is optional and defines the facet information to be computed during the
query. QUERYTERMS contains the search regions, terms and operators, such as


(([region "OTName"] stem "happy" AND [region "OTModifiedDate"] range
"20110101~20110201") OR "exact string in the content").
The ORDEREDBY portion is optional – the default is to order by computed relevance,
which will include the QUERYTERMS. However, additional terms can be added to
relevance scoring, or the ordering can specify sorting based upon other regions. Note
that queries will run faster without an “ORDEREDBY” clause. If you do not care about
the order in which results are presented from the search engine, omitting this clause
can improve query performance.
The select command responds with the current cursor location and a count of the
number of results that match the query:

<OTResult>
Cursor 0
DocSetSize 1012
</OTResult>

Set Cursor Command


This command is used to set the start location for getting results. By default, the cursor
position is set to 1 (first result) after a select operation. It is also advanced
automatically when you get results to point to the next result. If you want to retrieve
results starting at result number 100, use this command:

set cursor 100


The Search Federator responds with an acknowledgement and the current cursor location:

<OTResult>
cursor 100
</OTResult>
The cursor is automatically advanced after a get results command, which means
that use of set cursor between get results is optional if you are retrieving
consecutive sets of results. It should also be noted that moving the cursor forward is
relatively efficient. Moving the cursor backwards internally requires a reset to the start
of the results and moving forward to the desired location. If you are performing multiple
get results operations, structuring them to move strictly forward through the results
is much faster. This observation is only true within a search transaction (between open
and close operations), and has no impact on distinct queries.
There is an alternative method for managing the cursor location. The general form of
a query is:
Select … where … orderedby … starting at N for M

where N is the number of the first desired result, and M is the number of results to
return in the Get Results command. The first result has a number of 0.


Select "OTObjectID" where "dogs" starting at 1000 for 250

would return results number 1000 through 1249 when Get Results is called. This
method is not generally used or recommended, and is noted here for completeness.
Using Set Cursor with Get Results is the recommended usage pattern.

Get Results Command


This command is used to retrieve search results after a select command. The results
for a query are retained by the Search Engines until the socket connection is closed.

get results count


The parameter count is an integer, and represents the number of results that should
be returned. If there are not enough results to fulfill the count, it will return as many as
possible, and provide the actual number of results in the response.
The returned results are based upon the sort order specified in the select command,
which by default is ordered by computed relevance. Note that internally the relevance
computation is a floating point value, even though it may be reported in OTScore as
an integer. This means that even though the user may perceive a relevance score of
59 for multiple objects, the Search Engines can discriminate between results that have
relevance scores of 0.58993 and 0.58991 and order them accordingly.
The response to get results is a count of the actual number of results returned, along
with a structure that contains all the values specified in the SELECTLIST parameter of
the select command.
A typical response is of this form:

<OTResult>
ROWS 4
ROW 0
COLUMN 0 "OTObject"
DATA 25
DataId=41280133&Version=1DATA END
COLUMN 1 "OTName"
DATA 29
Approval Handilist [Link] END
ROW 1
COLUMN 0
DATA 25
DataId=41280094&Version=1DATA END
COLUMN 1
DATA 18
P&L Jun to [Link] END
ROW 2
COLUMN 0
DATA 25
DataId=41280131&Version=1DATA END


COLUMN 1
DATA 0
DATA END
ROW 3
COLUMN 0
DATA 25
DataId=41280093&Version=1DATA END
COLUMN 1
DATA 10
Mar [Link] END
</OTResult>
In this example, there are 4 results, indicated by the “ROW” values. ROW values are
numbered starting at 0.
Each result contains 2 returned regions, identified by the COLUMN values. In the first
ROW, the COLUMN labels are provided. To save bandwidth, the COLUMN values are not
labeled in subsequent ROWS.
The COLUMN values are numbered starting at 0, in the same order in which the regions
were requested in the SELECT statement for the query. Note that the DataId= portion
of the COLUMN 0 results is typical of how Content Server provides the data for indexing,
this is not an artifact of the search technology.
If a value is not defined for a region, the region is still returned in the results with an
empty value. ROW 2 COLUMN 1 illustrates this case.
If ATTRIBUTES were requested in the select statement, then the requested attribute
information will be appended to the get results data. In the example below, the data
element for the region “TestSplit” has 3 values. The first value had one attribute, the
language (English), the second has two attributes, and the third value has no attributes
– indicated by the empty placeholder.

COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr"
translated="true"</><></>ATTRIBUTES END
If HIT LOCATIONS were requested in the select statement, the locations are added
to the results:

COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr" translated="true"</>
<></>ATTRIBUTES END
LOCATIONS 17
0 4 6 1; 2 10 7 3 LOCATIONS END


The triplets indicate that the first cell (start counting at 0) has a hit at location 4, length
6, matching term 1. The third cell (2) has a hit starting at character 10 with length 7,
matching query term 3.
If you are retrieving large numbers of search results, it can be more efficient to break
the operation into multiple get results operations. Typically, these “gulp” sizes are
optimal in the 500 to 2000 results range. The performance benefit of using an optimal
size is typically only about 10 percent, so this is not a critical adjustment.
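
As an illustration of consuming this format, the Python sketch below parses the body
of a get results response into rows of label/value pairs. It is a minimal sketch: it relies
on the DATA length prefix rather than escaping, assumes column labels appear on the
first row only, and assumes no value contains the text "ROW ".

import re

COLUMN_RE = re.compile(r'COLUMN (\d+)(?: "([^"]+)")?\nDATA (\d+)\n')

def parse_get_results(response):
    labels, rows = {}, []
    for chunk in response.split("ROW ")[1:]:       # one chunk per result row
        row = {}
        for m in COLUMN_RE.finditer(chunk):
            col, label, size = int(m.group(1)), m.group(2), int(m.group(3))
            if label:
                labels[col] = label                # labels are sent on ROW 0 only
            value = chunk[m.end():m.end() + size]  # exactly <size> characters
            row[labels.get(col, col)] = value
        rows.append(row)
    return rows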

Get Facets Command


If the SELECT command specified that facets should be computed, then a subsequent
GET FACETS command will retrieve the facets that were generated in the query.
There are no parameters; all the facet information that was requested is returned. The
response has the following form:
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "RegionName","RegionType"
FACETS facetLength
nFacets{+},{keyLength,key,count;}FACETS END
{COLUMN n … FACETS END}
</OTResult>
The facets follow the general structure of other search results, and thus include the
ROW and COLUMN constructs. Only ROW 0 is used, with each facet set represented
within a COLUMN. Column numbers start at 0.
The COLUMN line includes the RegionName and RegionType. The RegionName is
the same as the name of region for which a facet was requested in the SELECT
statement. The RegionType may be used by an application to optimize how the facets
should be interpreted. The RegionType will be one of:
Date
Integer
Text
UserLogin
UserName
Enum
FileSize
The next line contains the text FACETS with the facetLength value. This is the total
length in bytes of the facet data on the next line, which is terminated by the FACETS
END statement.
The next line contains the actual facet data. The first integer, nFacets, is the number
of key/value pairs that are included in the facet results for this column. The key/value
pairs are represented by data triplets of keyLength, key and count. The key is the text
of the value. The count is an integer. The keyLength is the number of bytes in the key
– using a length simplifies parsing.
Note that there is a special case for nFacets, where it may be appended with a plus
(+) character. This indicates that building of the facet data structures terminated
because of size restrictions. This means that there are facet values in the index for
this region that have not been considered in computing these facet results.
The facet data is terminated with the FACETS END text.
A simple example of output from a get facets command is included below. Note the
special case where a facet has no values, as illustrated in the COLUMN 1 values.
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "OTModifyDate","Date"
FACETS 45
3,9,d20120605,14;9,d20120528,4;9,d20120514,1;
FACETS END
COLUMN 1 "OTUserName","UserLogin"
FACETS 3
1,;FACETS END
</OTResult>
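
The length-prefixed triplets can be unpacked with a few lines of code. The Python
sketch below follows the grammar above; the handling of the empty special case
shown in COLUMN 1 is an assumption.

def parse_facet_data(data):
    # data looks like "3,9,d20120605,14;9,d20120528,4;9,d20120514,1;"
    head, _, rest = data.partition(",")
    truncated = head.endswith("+")        # '+' means facet building stopped early
    facets = []
    for _ in range(int(head.rstrip("+"))):
        if rest in (";", ""):             # assumed special case: region has no values
            break
        klen, _, rest = rest.partition(",")
        key, rest = rest[:int(klen)], rest[int(klen):]
        count, _, rest = rest[1:].partition(";")   # skip ',' before the count
        facets.append((key, int(count)))
    return truncated, facets

print(parse_facet_data("3,9,d20120605,14;9,d20120528,4;9,d20120514,1;"))
# (False, [('d20120605', 14), ('d20120528', 4), ('d20120514', 1)])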

Date Facets
Facets for regions that are defined as type DATE in the [Link] file have
a special presentation in the facet results.
Each date value is placed into buckets representing days, weeks, quarters, months
and years. Instead of the most frequent values being returned in facets, the most
recent values are returned instead. For most search-based applications, the
“recentness” of an object is a key consideration, and the implementation of date facets
reflects this requirement.
A single date value may be represented in multiple buckets. For example, if today is
July 1st 2012, an object with an OTCreateDate of June 30 2012 may be represented in
the facet values for yesterday, for this week, for last month, last quarter and this year.
Each date bucket type has a distinct naming convention to help parsers discriminate
between the buckets.
• Years have the form y2012. Years are aligned to the calendar. The current year
will include dates from the start of the year to today.
• Quarters have the form q201204, which represent the year and the month in which
the quarter starts. Quarters start in January, April, July and October. The current
quarter will include dates from the start of the quarter to today.
• Months have the form m201206, which represent the year and the month. Month
facets are aligned to the calendar month. The current month will include dates from
the start of the month to today.
• Weeks have the form w20120624, which represents the year, month and first day
of the week. Weeks are always aligned to start on Sundays. The current week will
include dates from the start of the week to today.


• Days have the form d20120630, which represents the year, month and day.
If the contents of a date bucket are empty (count of zero), then no result is returned for
that bucket.
Refer to the FACETS portion of the SELECT statement for information on requesting
the number of facet values for each of years, quarters, months, weeks and days.
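
For applications that need to map a date into these buckets client-side, the
conventions translate directly into code. A minimal Python sketch, assuming the
Sunday week alignment and calendar quarters described above:

from datetime import date, timedelta

def date_bucket_labels(d):
    quarter_start = ((d.month - 1) // 3) * 3 + 1            # Jan, Apr, Jul, Oct
    week_start = d - timedelta(days=(d.weekday() + 1) % 7)  # back up to Sunday
    return {"year":    "y%04d" % d.year,
            "quarter": "q%04d%02d" % (d.year, quarter_start),
            "month":   "m%04d%02d" % (d.year, d.month),
            "week":    "w" + week_start.strftime("%Y%m%d"),
            "day":     "d" + d.strftime("%Y%m%d")}

print(date_bucket_labels(date(2012, 6, 30)))
# {'year': 'y2012', 'quarter': 'q201204', 'month': 'm201206',
#  'week': 'w20120624', 'day': 'd20120630'}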

FileSize Facets
The [Link] file can be used to identify integer or long regions that should be treated
as FileSize facets. Size facets are optimized for values that represent file sizes.
Clearly, discrete file size facets are useless. File sizes have the property that they
range from zero bytes to gigabytes, but are psychologically thought of in geometric
sizes. The FileSize facet places integers into ranges that follow this geometric pattern.
The entire set of size buckets is returned, rather than only the most frequent counts.
Applications presenting facets may choose to combine these ranges into larger ranges.
The buckets for FileSize facets and the corresponding labels for those buckets are
captured in the table below:

Label       Integer Range
0b          0
1b          1
2b          2 to 4
5b          5 to 9
10b         10 to 19
20b         20 to 49
50b         50 to 99
100b        100 to 199
200b        200 to 499
500b        500 to 999
1k          1,000 to 1,999
2k          2,000 to 4,999
5k          5,000 to 9,999
10k         10,000 to 19,999
20k         20,000 to 49,999
50k         50,000 to 99,999
100k        100,000 to 199,999
200k        200,000 to 499,999
500k        500,000 to 999,999
1m          1,000,000 to 1,999,999
2m          2,000,000 to 4,999,999
5m          5,000,000 to 9,999,999
10m         10,000,000 to 19,999,999
20m         20,000,000 to 49,999,999
50m         50,000,000 to 99,999,999
100m        100,000,000 to 199,999,999
200m        200,000,000 to 499,999,999
500m        500,000,000 to 999,999,999
1g          1,000,000,000 to 1,999,999,999
2g          2,000,000,000 to 4,999,999,999
5g          5,000,000,000 to 9,999,999,999
10g         10,000,000,000 to 19,999,999,999
20g         20,000,000,000 to 49,999,999,999
50g         50,000,000,000 to 99,999,999,999
100g        100,000,000,000 to 199,999,999,999
big         >= 200,000,000,000
negative    < 0
undefined   No value for field

The list of integer regions to be presented as FileSize facets is within the [Link] file
in the [Dataflow_] section. The default regions shown here are tailored for typical
Content Server installations:
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
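
The 1-2-5 pattern of the table makes the bucket computation straightforward. A
Python sketch of the mapping, based on the table above (illustrative, not the OTSE
implementation):

def filesize_bucket(n):
    if n < 0:
        return "negative"
    if n >= 200_000_000_000:
        return "big"
    if n < 2:
        return "%db" % n                  # 0b and 1b are single-value buckets
    s = str(n)
    k = len(s) - 1                        # power of ten of the leading digit
    lead = int(s[0])
    low = (5 if lead >= 5 else 2 if lead >= 2 else 1) * 10 ** k
    group = k // 3                        # 0 = bytes, 1 = k, 2 = m, 3 = g
    return "%d%s" % (low // 10 ** (3 * group), "bkmg"[group])

print([filesize_bucket(n) for n in (0, 7, 1500, 6_000_000, 25_000_000_000)])
# ['0b', '5b', '1k', '5m', '20g']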

Expand Command
This command is used to determine the list of words that are used in a search query
for a given term expansion operation. Term expansions occur when features such as
stemming, regular expressions or a thesaurus are used in a term. The simple case of
stemming to match boat and boats is illustrated below.

> expand stem "boat"


<OTResult>


ROWS 2
ROW 0
COLUMN 0 "Data"
DATA 4
boatDATA END
ROW 1
COLUMN 0 "Data"
DATA 5
boatsDATA END
</OTResult>
The following operator examples also work:

> expand thesaurus "boat"
> expand regex "^boat.*"
> expand phonetic "boat"
> expand range "sa~sc"
> expand < "apples"
Some of these cases can generate a very large number of matches. For regular
expressions or left-truncation this operation is potentially very slow, and should be used
judiciously. It is possible to limit the result set by appending the maximum number of
desired results to the expand operator within square brackets. The default limit is 100
terms; the example below limits the result to 5 terms.

> expand[5] thesaurus "boat"


One possible application of the expand operation is to establish which terms should be
provided to the hit highlighting function.

Hit Highlight Command


The hh command is used to identify the characters within text that match the search
query. This is used by applications displaying search results that want to emphasize
the text that matches the query. The hh command is passed a block of text to be
analyzed and a list of terms to match. The output from hh is a list of start and end
positions of characters to be highlighted in the target text.
In the basic form, the hh command sequence has the following form:

> HH
> DATA 61
> The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the plain
> TERMS 2
> the
> spain falls

<OTResult>
HITS 3


0,3,0
52,3,0
24,17,1
</OTResult>
After the TERMS element, each keyword to be matched is entered on a separate line.
If there are multiple words in the line, it is considered to be a phrase to be matched.
This example requests hit highlighting for the terms “the” and “spain falls”.
The results consist of numeric triplets, where each triplet is of the form
POSITION,LENGTH,TERM. Both the position and the term numbering start at 0.
The hit highlighting code strips common HTML formatting characters out of the data.
In this example, the </Tag> is ignored when matching the phrase “spain falls”, although
these formatting tags are counted in the character positions.
You may need to use the EXPAND command to obtain a list of terms that should be
tested in hit highlighting.
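
A typical consumer turns these triplets into markup. A minimal Python sketch (the
<b> tag choice is arbitrary):

def apply_highlights(text, hits):
    # Apply hits from the highest position down so that earlier
    # offsets remain valid as the text grows.
    for pos, length, _term in sorted(hits, reverse=True):
        text = (text[:pos] + "<b>" + text[pos:pos + length] + "</b>" +
                text[pos + length:])
    return text

sample = "The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the plain"
print(apply_highlights(sample, [(0, 3, 0), (52, 3, 0), (24, 17, 1)]))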

Get Time
While a query is executing, detailed timing information for each element of the query
is tracked. The Get Time command will return this data, including total time, wait time,
execution time, and execution time broken down by each command execution within
the connection. To obtain accurate information about the entire search query, this
should be the last command executed before closing the connection.
<OTResult>
<TIME>
<ELAPSED>68638</ELAPSED>
<SELECT>21329</SELECT>
<GET RESULTS>610</GET RESULTS>
<GET FACETS>187</GET FACETS>
<HH>0</HH>
<GET STATS>31</GET STATS>
<EXECUTION>22157</EXECUTION>
<WAIT>46481</WAIT>
</TIME>
</OTResult>

Set Command
The set command is used to specify values for variables that apply to the subsequent
operations. The supported set operations include:

Set lexicon English
Set thesaurus Spanish
Set uniqueids true [maxNum]
The lexicon variable specifies the language preference for stemming. The thesaurus
variable identifies which thesaurus file should be used.


Set uniqueids true requests that the Search Federator remove duplicate results from
multiple Search Engines. The optional maxNum parameter is the upper limit on
performing de-duplication. If there are more results than maxNum, de-duplication does
not occur. De-duplication is generally not recommended, since it can negatively impact
query performance and increases the memory used by the Search Federator.
Duplicates of objects may exist if a partition was placed in read-only mode, and
subsequent attempts are made to modify an object managed by the read-only partition.
This causes a new instance of the object to be created in a read-write partition. De-
duplication is a method of last recourse if you have misused the read-only mode for
partitions.
The Set Lexicon and Set Thesaurus commands are usually the first operations in a
handshaking sequence for a search query. If one or more search engines are
unavailable, the return message is:

MESSAGE 2 401 "Search engine(s) not ready."


This can be used as a convenience for an application to try another Search Federator
in environments that wish to support automated failover for high availability. This does
not apply if RMI is being used between the Search Federator and Search Engines.

Get Regions Command


This command is not typically used in a search query. Instead it is used by an
application to discover the list of regions that exist in the search index. The first row
represents the titles for the columns in the result. Column 0 is the name of the region,
column 1 is labeled “Description” for historic compatibility reasons, but the data
returned in this region is always empty. After the title row, there will be one row for
every region defined in the index.
get regions

<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 0
DATA END
ROW 2
COLUMN 0
DATA 16
OTWFSubWorkMapIDDATA END
COLUMN 1
DATA 0
DATA END

</OTResult>
The Get Regions command can take an optional parameter, “types”.
get regions types
When the types parameter is present, this function will include the type definition for
the region in the response. This type definition can be used to provide optimized
interfaces for users (for example, integer comparisons instead of text modifiers). If
multiple partitions report different types, then the Search Federator will respond with
the value “inconsistent” as the type. Note that differences in region types for partitions
in Retired mode are allowed; the assessment of inconsistency is based only on
partitions that are not Retired. The possible types are: Integer, Long, Enum, Date, Text,
Boolean, Timestamp.
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "RegionType"
DATA 4
DateDATA END
COLUMN 2 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 4
EnumDATA END
COLUMN 2
DATA 0
DATA END

</OTResult>

Another optional parameter is facets:

get regions facets

When the facets parameter is present, the type definition of generated facets is
included in the response. Normally, the facet types are the same as the region types,
but the special handling of integers that represent file sizes is an exception, returning
the value ‘FileSize’.
get regions types facets

<OTResult>

ROW 98
COLUMN 0
DATA 12
OTObjectSizeDATA END
COLUMN 1
DATA 4
LongDATA END
COLUMN 2
DATA 8
FileSizeDATA END
COLUMN 3
DATA 0
DATA END

</OTResult>

OTSQL Query Language


The SELECT command supported by the Query Interface implements the OpenText
Search Query Language, also known as OTSQL. Within this language, a query
consists of a number of basic parts, all contained on a single line:

SELECT parameters
FACETS parameters
WHERE clauses
ORDEREDBY parameters
Content Server users do not directly use OTSQL. The Content Server search query
language is known as LQL (historically, the Livelink Query Language). LQL is similar
to OTSQL in most respects, but provides some convenience operators and generally
uses different keywords. LQL in Content Server represents only the subset of OTSQL
that defines the WHERE clauses. Some of the differences between LQL and OTSQL
include:

LQL                     OTSQL
termset                 termset
stemset                 stemset
near, qlnear            prox[10,f]
qlprox                  prox
term*                   right-truncation term
*term                   left-truncation term
t*er?                   regex ^t.+er.?$
qlregion                region
qlleft-truncation       left-truncation
qlright-truncation      right-truncation
qlthesaurus             thesaurus
qlstem                  stem
qlphonetic              phonetic
qlregex                 regex
qlrange                 range
qllike                  like
in                      in
any                     any
text                    text
” « » ‟ ″ “ „           "

SELECT Syntax
The SELECT section is used to specify which regions in the index should be included
in the returned results. The more regions that are requested, the longer the ‘get results’
operations will take, but this does not impact the query time.

SELECT "region1","region2","region3"
To return all of the regions use the * keyword. For a Content Server installation, this is
not recommended, since there may be hundreds of regions. Requesting the minimum
necessary regions is suggested for optimal performance.
If you want to return information about the key/value attributes within text regions, you
can use the ATTRIBUTES modifiers:

SELECT "OTName","OTObject" WITH ALL ATTRIBUTES


SELECT "OTName","OTObject" WITH ATTRIBUTE "lang"
SELECT "OTName","OTObject" WITH ATTRIBUTES "lang" "color"


When attributes are requested, the response in the get results command is modified
to append the attribute information (see the “get results” description for more
information). The primary usage for requesting attributes is to identify language tags
attached to values in multi-language applications. The attributes modifier is applied to
all the regions specified in the select list.
The select statement can also be modified to request hit locations within the results:

SELECT "OTName","OTObject" WITH HIT LOCATIONS


When requested, the hit locations will be appended to the get results response, with
ordered triplets indicating the query term hit character position, length and term which
matched. The hit locations will be returned for all selected regions when requested.
You can request both hit locations and attributes in a single select statement.

SELECT "OTName","OTObject" WITH HIT LOCATIONS WITH


ATTRIBUTE "lang"

FACETS Statement
The FACETS section specifies whether facets are desired, and if so, for which regions.
This is optional, with the default being no facets returned. Refer to the next major
section of this document entitled “Facets” for a complete description of the FACETS
statement.
Sample facet requests:
FACETS "regionX"[10],"regionY"
FACETS "OTCreateDate"[d100,m24]

The ‘get facets’ command is used to retrieve the results. See the commands section
for additional details.

WHERE Clause
The WHERE clause defines the rules by which an object satisfies the search query.
The basic form is:

WHERE <clause1> relationship <clause2> relationship <clause3>
A query determines which objects satisfy the search by means of search clauses. A
WHERE clause consists of a region, operator and term, although only the term is
mandatory.
The following are simple WHERE clauses:

where "red"
where "red riding hood"
where [region "name"] "red riding hood"
where [region "FileSize"] >= "1000" and [region "FileSize"]
< "10000"


WHERE Relationships
Each WHERE clause in a query is evaluated relative to other WHERE clauses by a
logical relationship. The supported relationships are:

AND        Requires both the left and right expression.

AND-NOT    Requires the left expression be true but the right expression be false.

OR         Requires that either the left expression or the right expression (or
           both) is satisfied.

XOR        The exclusive or operator requires that either the left expression or
           the right expression is satisfied, but not both.

SOR        The synonym OR operator matches terms in the same way that the OR
           operator does, but the relevance score is computed somewhat
           differently. In an OR operation, if both terms are satisfied, they
           both contribute to the relevance score. With a SOR operation, only
           the term with the highest contribution is added to the relevance.

PROX[distance,order]
           The proximity operator is an “AND” operation which requires that the
           left and right expressions be within “distance” words of each other.
           If order is present (T for true, F for false), the left expression
           must also precede the right expression. If no parameters are
           specified, the interpretation is PROX[10,F]; valid forms include
           PROX, PROX[10] and PROX[50,T].
           The PROX operator ONLY works with simple terms and phrases. It does
           not work in conjunction with expanded term sets (wildcards, regular
           expressions, stemming, etc.). Refer to the SPAN operator for more
           advanced proximity options.

Relationships are evaluated from left to right. Brackets can be used to clarify and
modify the order of evaluation of clauses. For example, using single letters a through
d to represent entire clauses:

where a or b and c and-not d

is interpreted by OTSE as:

where (((a or b) and c) and-not d)

Brackets can be used to change the order of evaluation:

where a or ((b and c) and-not d)

In an actual query, this might look something like:


where thesaurus "pyjamas" or (([region "color"] "pink" and
[region "pattern"] = "polka dots") and-not [region "theme"] stem "boxers")

WHERE Terms
The search terms in a WHERE clause should normally be enclosed in quotes.
Although there are some specific cases where the lack of quotes is tolerated, if you
are writing a query application, quotes are recommended in all cases.
The first form of a search term is the simple token. This is a value which is normally
expected to pass through the tokenizer and be recognized in its entirety as a single
token. All operators work on simple terms.
"hello"
"pottery123"
"3.1415926"
The second form is an exact phrase. Not all operators are compatible with phrases.
Phrases should normally only be used in string comparison operations.
"the quick brown fox"
"1334.8556/995-x"
You can also request that matches are only returned when the entire value is an exact
match for the phrase. For example, if there is a search region “ProjectName”, and
possible values are “Plan A” and “Plan A Extended”, searching for “Plan A” will match
both of these cases. Preceding the phrase with an equality operator ( = ) can
differentiate these, and match only the values that do not include the “Extended” term:
[region "ProjectName"] = "Plan A"
Finally, there is a special case for search terms, the * character (asterisk or star) or the
keyword all, with no quotation marks. This value is interpreted by the search engine
to match any object which has a value for the specified region. This will not match
objects if the region does not have a value defined for an object.
[region "name"] *
[region "name"] all

WHERE Operators
Each WHERE clause consists of a region specification, a comparison operation, and
a term. The region is optional; if missing, the default search region list is assumed.
The operation is optional; if absent, the clause matches any token within the region.
The following operators function with either simple tokens or phrases:

(no operator)    This is the default operation, where no operator is explicitly
                 provided. Matches any value within the region. For example, a
                 query for “York” will match a value of “New York”.


= Use of the equality operator will only match if the entire value is
identical to the term provided. “York” will not match “New York” but a
query for “New York” will.
!= Will match all values which exist and do not exactly match the term.

The next set of operators is available for use with integers, dates and text metadata
values. They are disabled by default for full text query, since comparison queries in
full text are generally misleading and perform very slowly, although this behavior can
be changed by setting AllowFullTextComparison=true in the [Link] file.
These operators also have special capabilities for Date regions described later.

<      Will match all values which exist and are less than the specified term.
       If a phrase is provided, only the first term in the phrase is used.

<=     Will match all values which exist and are less than or equal to the
       specified term. If a phrase is provided, only the first term in the
       phrase is used.

>      Will match all values which exist and are greater than the specified
       term. If a phrase is provided, only the first term in the phrase is used.

>=     Will match all values which exist and are greater than or equal to the
       specified term. If a phrase is provided, only the first term in the
       phrase is used.

Constructing a query of the form

[region "x"] > "20150621" and [region "x"] < "20160101"

is not efficient. To improve performance, the query syntax parser will attempt to identify
usage patterns where multiple comparisons are made to a single region, and convert
the query to the more efficient form of

[region "x"] range "20150621~20160101"

The following operators are designed for use with single tokens, not phrases. Some
limited phrase support is available with some of the operators as noted in the
explanations.


range "start~to"
    Will match any value between the start term and the end term, inclusive.
    Note that the start term must be less than the end term.

range "value1|value2|value3"
    The range operator can be provided with a list of terms or phrases. This
    is equivalent to value1 OR value2 OR value3. This operator matches any
    value in a region; it is not restricted to matching entire values.

thesaurus
    Will match the exact term or synonyms for the term using the currently
    defined thesaurus.

phonetic
    Will match phonetic equivalents for the term. If applied to a phrase,
    phonetic matching for each word in the phrase will be performed. Refer to
    the Phonetic matching section for more information.

regex
    Will interpret the term as a regular expression. Values which satisfy the
    regular expression match the term. Regular expressions apply only to a
    single token, and are more fully described later.

stem
    Will match values that meet the stemming rules. Refer to the Stemming
    section for more information. If stemming is applied to a phrase, then
    the last word in the phrase is stemmed.

right-truncation
    Right truncation matches terms which begin with the provided search term.
    The user would typically consider this as term*. If used with a phrase,
    then the last word in the phrase is truncated.

left-truncation
    Left truncation matches terms which end with the provided search term.
    The user would typically consider this to be of the form *term. This
    operator is valid only for single tokens.

like
    String matching optimized for part numbers and file names. Only valid
    with “Likable” regions.

any (term,"search phrase")
    Match any term or phrase in the list. Unlike the IN operator, partial
    matches within a metadata region are acceptable. Equivalent to
    (term SOR "search phrase").

in (term, "search phrase")
    Match any term or phrase in the list. Within a region, only matches
    complete values. Equivalent to (=term SOR ="search phrase").

not in (term, "search phrase")
    Excludes any objects containing the term or phrase. For regions,
    equivalent to (and-not [region "xx"] in (term,"search phrase")).

termset (N, term, term, "search phrase")
    Matches objects where the full text contains N or more of the terms and
    phrases. N% may also be used.

stemset (N, term, term, "search phrase")
    Matches objects where the full text contains N or more of the stems
    (singular/plural) of the terms and phrases. N% may also be used.

text (something to search)
    For large blocks of text, finds objects with similar common terms. Check
    the Advanced Concepts section for more details.

span (distance, query)
    Match query within distance number of terms.

NOTE: the behavior of comparison operations depends upon the type definition of
the region. Text string comparisons use a text sort, so that 2000 > 1000000 for
values stored in a text region.

The following examples illustrate usage of WHERE operators:


<= "100"
stem "flower"
range "250~300"
range "alice|bob|carol|dave"
left-truncation "ntext"
right-truncation "opent"
= "my fair lady"
in (car,auto,suv,"sport utility vehicle")
any (house,home, "place of residence")
text (there must be documents with similar information)
span (5, swamp and (gas or methane))

Proximity - prox operator


A common requirement is to find search terms that are near one another. The PROX
operator provides an easy way to locate two terms within a specified distance, with
optional ordering. For example


big prox[3,t] truck

will match “big truck” or “big red truck” but not “truck is big”. The second parameter is
a single letter indicating whether order needs to match: use a ‘t’ (true) or an ‘f’ (false).
In the example above, using ‘f’ would match “truck is big”.

Proximity - span operator


Many proximity requirements are complex, especially for discovery and privacy
applications. Consider searching for “Michael Smith”:

• Michael may also be known as Mike;
• his middle names are James T., but his name or initial is optional in the text;
• the last name was given verbally, and might have been Smithe, Smit, or Smyth.

The “span” operator allows more complex queries to be evaluated and tested to ensure
that the entire query falls within a defined number of search terms. You can thus
construct a search of the form:
span(4, (michael or mike) and (smith or smithe or smit or
smyth))

The first parameter of the span operator is the maximum distance between terms that
will satisfy the query. These fragments would meet the distance of 4 requirement:
Mike smith
A smith named Michael
Michael Herbert James Smit

This would not:


Mike never met Bob Smith

The span operator supports query fragments for any combination of AND, OR, and
nesting (brackets) for single search terms.
“space” and span(10, ((Yellow and sun) or (blue and moon))
and (earth or planet))

The span operator can be used with full text, but not with text metadata.
A span query is a relatively expensive operation and can be very expensive when used
with wildcards (left-truncation and right-truncation) or regular expressions. By default,
the engine is configured to disable support for these types of term expansions within
the span operator. If term expansion is enabled, the search engines will store
temporary working data on disk files during the evaluation of the span. Temporary files
are stored by each Search Engine in their corresponding index\tmp directory, and files
are named matchingWordsNNNNN and spanValuesNNNNN, where NNNNN is a
dynamically generated unique value. The temporary files are deleted when the query
completes, and also by the general purpose cleanup thread which runs from time to
time.


If abused, the span operator has the potential to require large amounts of disk space
and will take a long time to execute. There are a number of limits set by default in the
[Link] configuration file, which can be adjusted if more complex queries must be
run. When a limit is reached, the search will be terminated as unsuccessful. The limits
apply to a single partition (not the entire query for the entire index) and are located in
the [Dataflow_] section of the configuration file, with the defaults shown below.

SpanScanning=false
By default, use of term expansion (regex and wildcards) is not permitted with the span
operator. Set true to enable.

SpanMaxNumOfWords=20000
The upper limit on the number of terms that will be considered when wildcards and
regular expressions are expanded.

SpanMaxNumOfOffsets=1000000
Each term in the span expression may exist multiple times in documents. This file
stores the locations of the terms being evaluated. This is the upper limit for the number
of instances of matching terms.

SpanMaxTmpDirSizeInMB=1000
Limits the temporary disk space the partition can use for storing temporary data during
span operation evaluation.

SpanDiskModeSizeOfOr=30
The cost of executing a span is directly related to the number of “OR” operations in the
span query. This setting is an upper limit on the number of “OR” Boolean operators
that can be assessed.

Proximity – practical considerations


When using the prox or span operators, you may need to increase the distance to
accommodate pattern and tokenizer behavior. Keep in mind that the distance is
measured internally in the search engine by “tokens”, not by words.
In addition, if pattern insertion features of the Document Conversion Server are
enabled, unique tokens will be inserted into the full text at locations where phone
numbers, email addresses, hash tags or other items are detected.
Both the tokenization and pattern behaviors can increase the distance between words.
As a result, adding a small additional distance to the prox and span operators may be
needed to capture all the expected results.


WHERE Regions
A region is specified within square brackets with a region keyword, and enclosed in
quotation marks. The search term is likewise enclosed in quotation marks. There are
specific cases which are unambiguous and quotation marks are not required, but for
consistency your application should use quotation marks regularly. Region names are
case sensitive!
If the region portion of a WHERE clause is absent then the default search list is used
to determine the regions.
The following are examples of WHERE clauses using regions:
[region "OTNAME"] "cars"
[region "OTNAME"] all
[region "OTDate"] > "20100602"
[region "abc”] <= "string1"
Regions are grouped by OTSE into content and metadata regions, which are internally
represented by OTData and OTMeta. The representation of the “OTNAME” in the
example above is actually an abbreviated form of:
[region "OTMeta":"OTNAME"]
You can use OTMeta without a region name to examine all of the metadata regions.
However, this is relatively slow (depending on the number of regions) and in many
cases is not logical because of the different type definitions for regions.
You can also use OTMeta with some surrounding syntax to search within metadata
regions. For example, the clause:
[region "OTMeta"] "<someRegion>123 ABC</someRegion>"
will find the exact value ‘123 ABC’ within the region “someRegion”. This is a much
slower way to locate the value, but there may be special cases where matching a
phrase anchored to the start or end of a region is needed.
You can specify searching in the full text using the OTData region:
[region "OTData"] "looking for this"
If you have indexed XML content, you can also search within specific XML regions of
the full text content using the XML structure, refer the section on indexing XML data for
more information.
The WHERE clause can also be used to set restrictions on attribute/value tags for text
metadata. For example, to restrict a search to looking at French language values of
the OTName field, you might use the syntax:
[region "OTName"][attribute "lang"="fr"] "voiture"
This presumes that “lang” is the attribute name, and “fr” is the value for that attribute.
Multiple attribute fields are possible, which effectively operates as a Boolean “and”,
requiring that both attributes must match:
[region "OTName"][attribute "lang"="fr"][attribute
"size"="med"] all


Priority Region Chains


Certain types of search queries are very difficult to construct using Boolean operations.
To address one such case, OTSE supports a prioritized region evaluation method for
use with similar sparse regions. Consider document “creation” dates. A Content Server
object may or may not have dates from several possible sources: the source (disk)
creation date, the source (disk) modified date, an extracted date from Microsoft Office
document properties, and the date the object was added to Content Server. If the
source create date is defined, it is the best quality information and should be used in
evaluating the query. If it is not defined, then the source modified date should be used,
if defined. The next most reliable date is the Microsoft Office property, and as a last
resort the Content Server date is used only if none of the other date values exist for
an object.
These priority chains of related metadata regions can be easily specified using the
“first” region declaration in a WHERE clause. For example, to find all objects with the
“best” date earlier than 5 years ago…
[first "OTExternalCreateDate", "OTExternalModifyDate",
"OTDocCreatedDate", "OTCreateDate"] < "-5y"

This syntax can be used to dynamically define the regions and their priority as part of
the query. However, this approach does not allow the value that matched the query to
be returned. If retrieving of a priority value is necessary, then a synthetic region
declaration must be made in the [Link] file:
CHAIN GoodDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate OTCreateDate

A query can then be made using the pre-defined date, and the GoodDate field can also
be returned as a target of the SELECT:
[region "GoodDate"] < "-5y"

For those interested in trying to construct the equivalent query using standard Boolean
operators, an example is shown below. Note that using the ‘first’ feature is not only
more convenient, but the implementation is also more efficient. Internally, a new
operator performs the necessary logic with fewer operations; the query is not simply
converted to this Boolean equivalent:
[region "OTExternalCreateDate"] < "-5y" or ([region
"OTExternalCreateDate"] != all and ([region
"OTExternalModifyDate"] < "-5y" or ([region
"OTExternalModifyDate"] != all and ([region
"OTDocCreatedDate"] < "-5y" or ([region "OTDocCreatedDate"]
!= all and ([region "OTCreateDate"] < "-5y"))))))

The ‘first’ region method can be used with all region types and most operators.
However, search within a specific text metadata attribute value with the CHAIN / first
operator is not supported.

Minimum and Maximum Regions


Similar to the use of region chains, the search engine can be instructed to evaluate an
object based upon the minimum or maximum value of a set of regions. These can be
dynamically constructed as part of the query, as illustrated here:
[max "OTExternalCreateDate", "OTExternalModifyDate",
"OTDocCreatedDate", "OTCreateDate"] < "-5y"

[min "Attr1", "Attr2", "Attr3"] ="6"

The min and max operators skip regions for which an object has no value. For
example, if an object had only Attr2 defined in the example above, then that value
would automatically be used as the minimum. If none of the regions has a value,
the object does not match.
Min and max region assessments work for all data types, although not all operations
are supported. Supported operations include comparisons against a value (<,=, >,
etc.), basic term and phrase matching, IN, ranges, etc. However, operators that
expand to multiple elements are not available, such as termset, stemset, thesaurus,
wildcards and regular expressions.
For multi-value TEXT metadata regions, the smallest value in a set of values for a
region will be used when assessing a minimum region, and the largest value will be
used when assessing a maximum region.
In addition to specifying ad-hoc minimum and maximum region evaluations in a query,
a synthetic region may be defined as a convenience using the [Link]
file:
MIN SmallAttr Attr1 Attr2 Attr3
MAX BigDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate

A query could then be constructed using the predefined region:


[region "SmallAttr"] ="6"

A predefined region has the additional property that the tested value can also be
returned in a SELECT statement. Note that no additional storage or indexes are
created; this region definition is a directive to the query constructor. Both the dynamic
and predefined approaches execute identically.
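As a sketch, assuming the MAX definition above, such a query might look like:

SELECT "OTObject", "BigDate" WHERE [region "BigDate"] >= "-1y"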
As a point of interest, it is usually possible to construct an equivalent query using
standard Boolean logic, although the min and max forms are computationally more
efficient. The equivalent query is quite complex, and varies depending on the nature
of the comparison (greater than, equal, less than) and whether a minimum or maximum
is required. Where multi-value text is present, there is no Boolean logic equivalent. As
one example,
[min created,modified,embedded,record,system] >= "20150403"

Is equivalent to:

(([region created] >= "20150403" or [region created] != *) and


([region modified] >= "20150403" or [region modified] != *)
and ([region embedded] >= "20150403" or [region embedded] !=
*) and ([region record] >= "20150403" or [region record] != *)
and ([region system] >= "20150403" or [region system] != *)
and (([region created] >= "20150403") or ([region modified] >=
"20150403") or ([region embedded] >= "20150403") or ([region
record] >= "20150403") or ([region system] >= "20150403")))

Any or All Regions


To simplify constructing queries that need to find the same result in multiple regions,
the Any and All region specifications are available. This feature was first introduced in
the 16.2.3 (2017-12) update of the search engine.
The any region designation is a syntax shortcut for using the OR operator. The
convenience form:
[any "r1", "r2", "r3"] "bob"

Is equivalent to constructing this query using OR:


[region "r1"] "bob" or [region "r2"] "bob" or
[region "r3"] "bob"

Similarly, the all region designation is a syntax shortcut for using the AND operator.
The convenience form:
[all "r4", "r5", "r6"] "sue"

Is equivalent to constructing this query using AND:


[region "r4"] "sue" and [region "r5"] "sue" and
[region "r6"] "sue"

Regular Expressions
OTSE supports the use of regular expressions for matching tokens. A regular
expression is a pattern of characters. In the OTSE query language, a term preceded
by the operator regex is interpreted as a regular expression. Patterns are defined
using the following rules:

.       A period will match any single character.

[ ]     Square brackets are used to enclose a character set or range of
        characters. A range of characters consists of two characters separated
        by a hyphen, such as 0-9. The characters within a range are determined
        by their ordering in the UTF8 character set. Examples:
        [a-z] matches the letters of the alphabet

        [$#.!%] matches a number of punctuation symbols. Contrary to popular
        belief, this does not match obscene words.

[^ ]    The caret ^ symbol has special meaning. If it immediately follows the
        opening square bracket, then it is a negation of the range. For example:
        [^0-9x] matches any character except the digits 0 through 9 or the
        letter x.

[^]]    Within a range, the caret symbol is an escape character which allows
        matching a closing square bracket, hyphen or caret. This use of the
        caret allows these special characters to be used in a range. For
        example:
        [abc^]^-] matches any of the letters a, b or c, or the closing square
        bracket or the hyphen.

^       The caret symbol is an anchor used to denote the beginning of a word
        when used as the first character in the regular expression. For
        example, the pattern "^sp" will match spain or sporadic, but not
        hospital or wasp.

$       The dollar sign, when used as the last character in a pattern, denotes
        an anchor at the end of a word. For example, the pattern "sp$" will
        match wasp, but not spain or hospital.

*       The asterisk matches the smallest preceding range zero or more times.
        The preceding pattern may be a character or a range. For example, the
        regular expression "ad*" will match a, ad, add, addition.

+       The plus character matches the smallest preceding range one or more
        times. For example, "tr[eay]+" will match words like try, tree, trey,
        treayaaa or country. It will not match tr.

?       The question mark character matches the smallest preceding range
        exactly zero or one time. Reusing the previous example, "tr[eay]?"
        will match try or pictr. However, it will not match tree.

|       The vertical bar functions as an OR operation between patterns.
        "go|stay" will match cargo or stay. The range "[a-c]" could be
        represented as "a|b|c".

( )     Parentheses are used to group patterns together. This allows complex
        patterns to be constructed.
        "ho(us|m)e" will match both house and home.

\       The backslash character is used as an escape character to indicate
        that the following character should be interpreted literally, and not
        interpreted as an operation. Use a double \\ to match the \ character.
        "func\(a\)" matches func(a).
        "3\.14" matches 3.14, but not 3714 ("3.14" will match 3714).
        "folder\\subfolder" will match folder\subfolder.

Some additional examples:

"^l(uke|eia)"               Match words that start with luke or leia.

"^....s?$"                  Match four letter words, or five letter words that
                            end with the letter s.

"^en[a-z]+p[eaid]+$"        Not sure how you spell encyclopedia? Starts with
                            'en', has some letters, then a 'p', then some
                            combination of e, a, i and d. Mind you, this also
                            matches envelope.

"((0?[1-9])|(1[0-2])):[0-5][0-9]"
                            Find words that contain a string that might be a
                            time in 12 hour format, such as 1:30, 03:26, 12:59.

"^s(ch)?m[iy](th|dt|tt)e?$" Match words like smith, smyth, Schmidt, smitte.

"^ope.+ext$"                Matches the common user expectation of a wildcard
                            in the middle of a word: ope*ext.

Within a WHERE clause, the regex operator looks like this:


[region "Size"] regex "^(small|med)"
Regular expressions can be very expensive operations. In the worst case, the entire
internal dictionary may need to be examined to test every word as a potential match.
The most effective way to reduce the cost of finding candidate words is by anchoring
the start of the regular expression with a caret.
It is also important to make the expression as targeted as possible. If the regular
expression matches thousands of possible words, then the resulting search query will
have an effective “OR” operation of thousands of terms.

NOTE: the search index has typically normalized the indexed words to lower case
(see the section on the Tokenizer for details). Unless you are using a tokenizer that
preserves case, the use of upper case within a regular expression is normally not
appropriate.

Relative Date Queries


When searching within a region of type DATE or TIMESTAMP, there are special
operations available that simplify the creation of common relative date searches.
Relative date queries can use day, week, month, quarter or year comparisons,
represented by an integer immediately followed by the letter d, w, m, q or y. Positive
integers represent periods in the future; negative numbers are periods in the past. As
an example, "-1y" means previous year.
The current date determines the meaning of the current week, month, quarter or year,
which can be expressly used in the query with the integer 0. For example, the current
month can be represented by "0m".
Relative weeks, months, quarters or years are aligned to their calendar boundaries;
they are not shortcuts for 7, 30, 90 or 365 days. The first day of the week is determined
by the system locale, which is Sunday for most areas of the world. Calendar quarters
are used, comprised of three month periods starting in January, April, July and October.
Relative date queries are supported for comparison { < <= >= > }, but not for equality
(or inequality).
For illustration, assume that the [Link] file contains the following entry
that captures the date a contract ends:
DATE EndDate

The query syntax has the form:


[region "regionName"] comparator "rDate"
e.g.
[region "EndDate"] >= "-365d"
The following table illustrates how the relative date value is interpreted, assuming that
today is Monday 13 October 2014.

rDate       Meaning                                Effective Query

>= +1m      Next month or later                    >= 20141101
< 0d        Before today                           < 20141013
> -7d       More recent than 7 days ago            > 20141006
> -1w       Later than last week                   > 20141011
>= 0w       This week or later                     >= 20141012
>= -1y      Last year or later                     >= 20131013
>= -365d    Last 365 days or later                 >= 20131013
> -1y       After last year                        > 20131231
< -2y       Before previous 2 years                < 20120101
>= -1q      Last quarter or later (after July 1)   >= 20140701
> -1q       After last quarter                     > 20140930
<= 0q       Before end of this quarter             <= 20141231
> -16m      After June 2013                        > 20130630

If a TIMESTAMP region is used, the internal conversion is similar, but is expressed to
the millisecond level where necessary.

<= 0q       Before end of this quarter             <= 20141231T235959.999
> -7d       More recent than 7 days ago            > 20141006T235959.999
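
Relative comparisons can also be combined to express a window. For example,
assuming the EndDate region defined above, a sketch that finds contracts ending
between today and the end of next quarter:

[region "EndDate"] >= "0d" and [region "EndDate"] <= "+1q"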

Matching Lists of Terms


There are three query operations that are optimized for matching items in a list of
terms: IN, TERMSET, and STEMSET. These operations are valid only for the full text
(body) or text metadata regions.
The IN operator takes a list of simple terms or phrases, and is a more concise method
of matching items in a list than using OR operations. Consider the clause:
[region "lake"] in(superior,erie, "Lake of the Woods")

This is equivalent to:


[region "lake"] ="superior" SOR [region "lake"] ="erie" SOR
[region "lake"] ="Lake of the Woods"

Using the SOR operator ensures that multiple matches won’t rank the result higher.
Note the use of the = modifier; the IN operator will only match entire values in metadata
regions. The behavior in full text content is slightly different, in that the entire value
matching is no longer pertinent.
in(superior,erie, "Lake of the Woods")

In full text queries is equivalent to


"superior" SOR "erie" SOR "Lake of the Woods"

The TERMSET feature allows you to locate objects that have at least N matching
values from the provided list. For example, the clause:
termset(5,Water, river, lake, pond, stream, creek, rain,
rainfall, dam)

will match an object that contains 5 or more of the terms and phrases. This is a very
powerful construct for discovery and classification applications. There is no simple
equivalent representation. The example above could be expressed like…
SELECT ... WHERE
(stream AND pond AND lake AND river AND water) OR
(creek AND pond AND lake AND river AND water) OR
(creek AND stream AND lake AND river AND water) OR
(creek AND stream AND pond AND river AND water) OR
(creek AND stream AND pond AND lake AND water) OR
(creek AND stream AND pond AND lake AND river) OR
(rain AND pond AND lake AND river AND water) OR
(rain AND stream AND lake AND river AND water) OR …

Fully written out, this query is comprised of 126 lines with 629 operators. The
TERMSET operator is powerful, concise, and eliminates errors constructing complex
queries. The implementation of TERMSET and STEMSET is also internally optimized
for these cases. Queries may operate considerably faster with less memory using
TERMSET/STEMSET compared to executing the fully expanded equivalent queries
constructed of AND / OR terms.
The value of N can also be a percentage, meaning that it must match at least the
specified percentage of terms. 50% of 4 terms means that 2 or more matching terms
are needed. 51% means that 3 or more must match, since the percentage is a
minimum requirement. Using percentages is typically useful when there are longer
lists of candidate matching terms. These are equivalent:
Termset( 3, Water, river, lake, "duck pond", "stream")
Termset( 50%, Water, river, lake, "duck pond", "stream")

Negative values for N are interpreted to mean M-N as the threshold. For example, if
there are 10 terms, a value of -2 is equivalent to a value of 8 for N. It may be of interest
to note that at the endpoints for a list of N terms, TERMSET 1 is an effective OR, and
TERMSET N is an effective AND.
Termset (1, red, blue, green)  →  red OR blue OR green
Termset (3, red, blue, green)  →  red AND blue AND green
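
In a complete statement, TERMSET can be scoped to a region like any other operator.
As a sketch of a classification-style query (the term list is illustrative only;
OTData is the full text region described earlier):

SELECT "OTObject", "OTScore" WHERE [region "OTData"]
termset(50%, water, river, lake, pond, stream, creek)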

The STEMSET operator is similar to TERMSET, except that it matches stems of the
values (that is, singular and plural variations).
stemset(5, Water, river, lake, pond, stream, creek, rain,
rainfall, dam)

Would match an object that contains:

Water, rivers, ponds, stream, rain

Being singular/plural aware means that a document that had only the words:
Water, river, rivers, pond, ponds

will not match, since STEMSET considers the singular and plural forms of river and
pond to be the same term. This document therefore only has 3 matching terms, instead
of the desired 5. Essentially,
stemset(2,water,river,pond)

can be thought of as
((stem(water) and stem(river)) or (stem(water) and
stem(pond)) or (stem(river) and stem(pond)))

or, in a somewhat simplified form which doesn’t really cover all the variations of
stemming,
((water or waters) and (river or rivers)) or ((water or
waters) and (pond or ponds)) or ((river or rivers) and
(pond or ponds))

Unlike the IN operator, STEMSET and TERMSET are not constrained to matching only
full values in text metadata regions. The negation of these operators is possible using
NOT, and can be interpreted as follows:
(m or n) not termset(2,a,b,c)
  →  (m or n) and-not (termset(2,a,b,c))

[region "r"] not stemset(2,x,y,z)
  →  not ([region "r"] stemset(2,x,y,z))

The TERMSET and STEMSET operators were first introduced in version 16.0.1 (June
2016).

ORDEREDBY
The ORDEREDBY portion of a query is optional. Its purpose is to give you control over
how the search results should be sorted (ranked) and returned in the get results
command. If omitted from the query, the result ranking is sorted by the relevance score
in descending order. This means that the most “relevant” results are returned first.

Within Content Server, ordering of results is not available in the


Livelink Query Language (LQL). Content Server injects the
appropriate ORDEREDBY statements as needed depending upon
the way results are displayed.

ORDEREDBY takes parameters, with the first parameter determining whether
additional parameters are accepted. These are:
ORDEREDBY Default
This is the same as omitting the parameter entirely, and ranks results by relevance.
ORDEREDBY Nothing
This parameter identifies that no sorting of the search results takes place. This
provides both a memory and performance improvement, especially if retrieving large
sets of search results. The order returned by a specific Search Engine is repeatable.
Where multiple partitions exist, the overall order is not repeatable, since the Search
Federator will select results based on the order in which Search Engines completed
their individual searches.
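For example, a sketch of a bulk-retrieval query that skips sorting entirely (this
assumes the ORDEREDBY clause is appended after the WHERE clause):

SELECT "OTObject" WHERE [region "OTCreateDate"] > "-1y"
ORDEREDBY Nothing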
ORDEREDBY Relevancy
This is the default if the parameter is omitted. Results are ranked by relevance, with
the most relevant results first.
ORDEREDBY RankingExpression
The RankingExpression method allows you to extend the WHERE clauses to provide
additional parameters for evaluating relevance. This does not affect the objects
selected, only their relevance computation. For example:
ORDEREDBY RankingExpression ([region "size"] "small"
OR [region "color"]!= "green")
This would modify the relevance computation to favor objects which have the value
“small” in the region “size”, or any value except “green” in the “color” region. The same
rules that apply to WHERE clauses are used here.
Note that these values SUPPLEMENT the WHERE clauses, not replace them in the
scoring.
ORDEREDBY Region
The Region ordering allows you to sort the results by one or more specific fields.
ORDEREDBY REGION "OTCreateDate" ASC, "Author" DESC
This example sorts first by OTCreateDate in ascending order, and for objects with
identical OTCreateDate values they are further sorted by Author descending. Use of
ascending (ASC) or descending (DESC) is optional, with ascending being the default.
Regions are separated by commas.
If sorting on a multi-value text region, the first value (as provided during indexing) is
used as the sort key.
There are special syntax cases available for text regions which have multi-language
metadata, allowing you to specify which of the language values for the region should
be used in the sort.
ORDEREDBY REGION "OTName" SEQ "fr" DESC
The use of SEQ or SEQUENCE, followed by the language code, requests that the
value in the OTName region which has the key/value attribute pair of “lang”=”fr” be
used when sorting this region. If there is no French language value defined, then the
system default language for the value will be used. If there is also no system default

The Information Company™ 89


Understanding Search Engine 21

language, then the language with the smallest value is used, otherwise use the
standard “no attribute” sorting.
ORDEREDBY Existence
Rank the search results by the number of matching terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is not important, only the number of terms which exist in the document.
ORDEREDBY Rawcount
Rank the search results by the number of instances of terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is highly rated. The default scoring algorithm considers the number of times
a word appears, but it is only a modifier. Using Rawcount will make the number of
times words appear a major factor in the score.
ORDEREDBY Score[N]
Rank the search results using a combination of the ranking computation (global
settings) and boost values specified as parameters in the query. Refer to the
Relevance section of this document for details.
Performance Considerations for Sort Order
In some cases, the sorting requested for results can be a factor in search performance.
Sorting is performed in the search engines, and each search engine requires
temporary memory allocation and time to perform the sorting. For both time and
memory, the key variables are the type of sort, and the cursor position of the requested
results.
Orderedby Nothing is the fastest performer, and uses the least memory – since it skips
the sorting step entirely. If your application needs to gather all the results from a query,
the use of Nothing as the sort order is strongly recommended, especially if you are
dealing with large data sets. Sorting and retrieving 1 million results may require on the
order of 100 Mbytes of temporary memory. Sorting by Nothing will avoid this penalty.
Sorting by primitive data types such as floats (relevance), integers, or dates is the next
best performing configuration. Roughly speaking, primitive types require about 4
Mbytes of RAM for each 100,000 results the cursor is advanced.
Sorting by string values is slower and uses more memory. Performance may start to
become material moving the cursor past about 20,000 results. Memory requirement
varies depending on the lengths of the strings, but typically runs about 15 Mbytes of
temporary memory per 100,000 results the cursor is advanced.
Sorting on multiple fields is slower, and uses more memory. The performance
penalties are difficult to predict, since they depend on the numbers and types of sorts.
The order also matters – a sort on a number first, then on a string uses about 8 Mbytes
per 100,000 results the cursor advances. Reversing it to sort on the string first, then a
number, would use more memory than just a string sort.
What does it mean when we talk about advancing the cursor position? Regardless of
how many search results there are, if you are only retrieving the first few hundred, the
sort time and memory required will be low. However, if you want sorted results
numbers 99,900 to 100,000 – then the cursor must be advanced to at least position
100,000. The search engines must sort at least that number of results, requiring
significant resources. When asking for results 1 to 100, the search engines can
optimize their sorting implementation to focus on ensuring that just the minimum set
of values is properly sorted.
The memory resources required for sorting are per search engine, per concurrent
search query. If you want to support up to 10 concurrent queries, each asking for
100,000 results, then each search engine may need over 150 Mbytes of working space
available. In normal types of applications this pattern is rarely observed, and in practice
most applications use relatively small amounts of memory to retrieve less than 10,000
results from a few concurrent queries.
Text Locale Sensitivity
When ordering results by a text region, locale-sensitive sorting is used by default. As
a result, sorting can differ somewhat depending upon the locale. Locale-sensitive
collation generally groups accented characters near their unaccented equivalents.
Depending on the locale, multiple characters may be considered as a single logical
character, and some punctuation may be ignored.
The locale for a system is determined from the operating system by Java, and uses
the Java system properties user.language, user.country and user.variant. For
debugging, these values are logged during startup. In Java, the locale can be explicitly
set to override the system defaults with command line parameters. For example:
java -Duser.country=CA -Duser.language=fr …
Locale sensitive sorting was first added in 20.4, and can be disabled in the [Dataflow_]
section of the [Link] file by requesting the older behavior:
OrderedbyRegionOld=true

Facets

Purpose of Facets
Facets allow metadata statistics about a search query to be retrieved. For example, if
facets are built for the region “Author”, and there were 300 results, facets might supply
the following information from the “Author” region:
Mike 121
Alexandra 72
David 32
Michelle 21
Stephen 19
Alex 11
Paul 6
The interpretation would be that of the 300 results, 121 of them had the value “Mike”
in the “Author” region, 72 had the value “Alexandra”, and so forth. As an application
developer, you can present this information to the user to help them understand more
about their search results. It is also common to allow the user to “drill down” into the
results based on facets. For example, the user might determine they only want results
authored by Ferdinand. They select Ferdinand, which re-issues the same search, this
time with an additional clause in the query along the lines of AND [region
"Author"] "Ferdinand" (require “Ferdinand” in the region “Author”).

Requesting Facets
OTSE generates facet results when requested within the search queries. There are
no special configuration settings necessary to use facets, although optimization by
protecting commonly required facets may be a good idea. To request facets, in the
‘SELECT’ portion of the query, you add text along these lines:
SELECT "OTObject", "OTSummary" FACETS
"Author", "CreationDate" WHERE …
OTSE would then generate facets for two regions: Author and CreationDate. There is
no defined limit to the number of facets that can be requested for a query, but memory
or performance limitations will become a factor for large numbers of facets. The design
optimizations selected for OTSE are based on expectations of 100 or fewer distinct
facets in use at any time.
Once the query completes, you retrieve the results from the search engines with the
command:
GET FACETS
The output from the GET FACETS command is described in more detail in the Query
Interface section.
Like the search results, the facets for the query are retained until the query is
terminated or times out. Except for date facets, the values are returned sorted from
highest frequency to lowest frequency.
When facet values are returned, there are a couple of additional values provided. The
number of facet values identifies the total number of facet values found. The returned
count is the number of facet values actually returned, which is usually smaller. There
is also an overflow indicator, which identifies whether the number of facet values
exceeded the configurable limit – meaning that the facet results are not exact since
they are incomplete.
In most applications, a user is not interested in reviewing thousands of possible
metadata values in a facet. Usually, only the most common values are of interest. The
facets implementation allows you to place a limit on the number of values for each
facet you want to see. Using syntax such as:
SELECT "OTObject" FACETS "Author"[5], "DocType"[15]
This would return only the 5 highest frequency values in the field “Author” and the 15
highest frequency values in the field “DocType”. By default, the first 20 values are
returned. This default can be overridden by a configuration setting. You are strongly
advised to limit the number of values returned, especially with facets that may contain
arbitrary values, since they can potentially contain millions of values which would
significantly impact search performance.

Facet Caching
Facets data structures are built on demand. Once created for a given facet, the
structure is retained in memory so that subsequent queries using the facet are very
fast. In order to keep memory use constrained, there is a maximum number of facets
that the search engine will retain. If a query requests new facets that are not in memory
and the maximum number of facets is exceeded, then the search engine will delete the
facet structure that has not been used for the longest time. The default is to retain up
to 25 facet structures in memory. There is a 10 minute “safety margin” – meaning that
even if 25 facets are exceeded, a facet that was used in the last 10 minutes will not be
deleted. A facet that that is included in a query can also not be deleted. The limit is
therefore a guideline rather than an absolute maximum.
If your applications use more than 25 facets regularly, then search query performance
may suffer as facet data structures are regularly created and deleted. You can adjust
the number of facets to retain in memory in the [Dataflow_] section of the [Link]
file:
MaximumNumberOfCachedFacets=25

Text Region Facets


Use caution when requesting facets for fields that may contain arbitrary text values.
There may be very large numbers of values, which can result in poor performance for
search queries. As a minimum, ensure that you specify an upper limit on the number
of facet values you want to retrieve, which is described in more detail near the end of
the Facets section.
In building facets, the values of the text fields are examined to build the facets. If the
text regions for which facets are built are stored on disk, then the performance for
search will be impacted. You should consider using RAM storage for text regions for
which you expect to retrieve facets.
Text regions also support multiple values. In these cases, each value is separately
returned. If the region “DocType” for an object had multiple values (“ZIP”, “XML”,
“Word”), then the object is counted 3 times in the facets, once for each of the values.

Date Facets
Date facets represent a special case, which has been constructed specifically to
address a very common and important requirement, namely presenting facets that
represent the “recentness” of an object in the index. Date facets are not designed to
handle arbitrary dates or future dates.
If facets are requested for regions of type DATE, special handling occurs. Each day
within the supported time range is counted multiple times – as a day, within a week,
within a month, within a calendar quarter, and within a calendar year.
Date facets are not sorted by frequency. Instead they are ordered by recentness. If
you have requested facets for 8 months, you will always get the most recent 8 months
returned. When constructing a query for date facets, the syntax within the SELECT
statement is:
… FACETS "CreateDate"[d30,w0,m12,q0,y10] …

The facet counts are optionally specified as a letter followed by the number of facet
values desired, where:

d – number of days, including today
w – number of weeks starting on Sunday, including today
m – number of months, including the current month
q – number of calendar quarters (Jan, Apr, Jul, Oct), including the current quarter
y – number of calendar years, including the current year

The example above would request the last 30 days, the last 12 months, the last 10
years, and no facets for weeks or quarters. To obtain no values for a category, specify
zero. Omitting the category will result in the default number of values being returned.
If the count for a value is zero, then no facet value will be returned.
The default number of date values to be returned is defined in the [Link] file. In
the [DataFlow_] section:
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10

The values returned for date facets are formatted to easily identify their type and date
range.
Days: d20120126 (dYYYYMMDD) 26 Jan 2012
Weeks: w20120108 (wYYYYMMDD) week starting 8 Jan 2012
Months: m201202 (mYYYYMM) Feb 2012
Quarters: q201204 (qYYYYMM) quarter starting Apr 2012
Years: y2012 (yYYYY) year 2012

Date facets can only be built for dates where the day is within range of the
maximum number of facet values, per the settings described later. The default is
32767, or about 90 years.

FileSize Facets
Integer regions may be marked in the [Link] file to have their facets presented as
FileSize facets. This mode groups file sizes into a set of about 30 pre-defined ranges.
This mode ignores the number of facet values requested, and always returns a fixed number of
facet values representing the buckets (or ranges). Details of these facet values are
described in the get facets command section.

Facet Security Considerations


When writing an application that leverages search facets, you may need to consider
the security implications. In a typical application such as Content Server, search
results are post-processed to filter out results that a particular user is not entitled to
see. It is more difficult to do this with facet values.

For applications in which the security requirements are high, you must ensure that
facets which contain sensitive information are not made available to users without
suitable clearance. In many cases, it is considered acceptable to display facets which
do not contain sensitive data, such as file sizes, object types, or dates. It might also
be possible to achieve acceptable security by reducing the exactness of the object
counts – displaying a more generic frequency count (eg: 1 to 4 “bars”, or labels such
as “many” or “few”) instead of the precise counts from the search engine.
Ultimately, you will need to choose an appropriate tradeoff between the convenience
of an improved search experience and the risk that a user might glean harmful
information from facet values.

Facet Configuration Settings


There are a number of configuration settings in the [Link] file for facets. All settings
are located in the [DataFlow_] section of the [Link] file.
The expected number of facets is used to determine the initial amount of memory that
should be allocated when the facet data structure is created. This does not place an
upper limit on the number of facets that are possible, since the structure can grow. It
increases performance when the facet data structures are built.
ExpectedNumberOfFacets=8
The expected number of values per facet is used to determine the amount of memory
that should be allocated when a new facet data structure is created. This does not
place an upper limit on the number of values that are possible, since the structure can
grow. This is a minor optimization. Because of the bit-field representation used for
facet structures internally, this value should be a power of 2.
ExpectedNumberOfValuesPerFacet=16
The maximum facet value length represents the maximum length of a value that will
be considered for facet purposes. Longer strings are truncated, which means that the
facet system would treat the distinct values “I am a long facet value ending with 0123”
and “I am a long facet value ending with 4567” as identical values. This limit allows
control over memory used for facets.
MaximumFacetValueLength=32
Facets can be entirely disabled within the Search Engines by setting the following value
to false:
UseFacetDataStructure=true

The maximum number of values per facet sets the upper limit on how many distinct
facet values are possible. This limitation is present as a failsafe from abuse, and
presumes the typical facet application is intended for much smaller data sets.
Increasing this value will increase the amount of memory required to store facet
information. Because the internal data structures use bit-fields, the optimal setting for
this value is 1 less than a power of 2 (eg: 2**N - 1). It should be noted that multi-
value text fields consume a facet value for every combination of text values contained
in the field. For example, if the region “Colors” can contain combinations of “red”,
“blue”, “green” and “black”, then 15 combinations are possible and 15 of the facet
values could potentially be used. If you expect to create facets for regions that may
have many combinations (such as email distribution lists) then this number may need
to be very large, and you may be limited by usable memory.
MaximumNumberOfValuesPerFacet=32767
The number of desired facets is the default number for the “most common N” facet
values to be returned if the number of desired facets is not specified in the query. This
ini setting does not affect the special return values for Date type facets.
NumberOfDesiredFacetValues=20

Reserving Facet Memory


Each search engine allocates memory for facets from its general pool of available
RAM. For small data sets and small numbers of facets, there is often enough memory
available that the search engines can draw from the memory allocation without any
impact. However, as the facet data sets become larger, it is a good idea to increase
the Java memory limit (the -Xmx parameter on the startup command line).
The increase in memory depends on factors such as the number of facets, the number
of distinct values for a facet, and the size of the values. Roughly speaking, a 2 GB
partition would need twice the memory of a 1 GB partition. Facets with a small set of
possible values, such as OTFileType or a Classification, require relatively little memory.
Facets with large numbers of possible string values, such as the parent ID, keywords
or hot phrases, would approach the theoretical maximum memory requirement.
For reference, the approximate formula for the upper limit in bytes, excluding Java
object overhead, is:

MaxSizeInBytes = 1/8 * (MaxNumberOfObjects
                        * MaximumNumberOfCachedFacets
                        * log2(MaximumNumberOfValuesPerFacet))
               + (MaximumNumberOfCachedFacets
                  * MaximumNumberOfValuesPerFacet
                  * MaximumFacetValueLength)
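
As a rough worked illustration (all input values assumed), a partition of 5 million
objects with the defaults MaximumNumberOfCachedFacets=25,
MaximumNumberOfValuesPerFacet=32767 (15 bits) and MaximumFacetValueLength=32
gives an upper bound of approximately:

1/8 * (5,000,000 * 25 * 15) + (25 * 32767 * 32)
  ≈ 234 MB + 26 MB ≈ 260 MB

This is a theoretical ceiling, not a typical allocation.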

You are unlikely to need to resort to these calculations. In practice, for a 1 GB partition
in Low Memory mode (3 to 5 million objects) with 10 to 15 typical facets in use, memory
consumption by facets is usually less than 40 MB. Content Server uses this guideline
for its default setting.
To re-iterate: facet memory allocation is NOT an explicit setting. Simply increase the
Java heap size available on the command line. Content Server 16 exposes this
incremental memory allocation for search engines in its search admin pages.

Facet Performance Considerations


Computation of facet data increases the cost of a query. As with all performance
discussions, the list of variables and parameters makes providing solid performance
data difficult.
During normal operation, after the facet data structures are generated, computing facet
information for a single metadata region is relatively fast, typically less than 50
milliseconds. This time varies primarily depending on the number of search results,
since the facet values for every result need to be added together. If your typical queries
return many millions of results, average times would be closer to 1 second. As more
facets are requested for a query, these times are additive. Experience has been that
facet computation is not a material consideration for performance in most scenarios.
Conversely, initial generation of facet data structures can be relatively expensive. Each
potential metadata value must be examined, and a new facet value created or the data
structures updated if it already exists. The time to perform this task varies widely based
on the number of items in the partition, the data type, the number of possible unique
values, and for text metadata – whether the values are stored in memory or on disk.
For example, if there is an enumerated data type with less than 100 possible values in
a partition containing just 1 million items, generation of the facet data structures is likely
less than 1 second.
At the other extreme, generating facet data structures on a text region that has high
cardinality (e.g. 200,000 possible values, such as a folder location or keywords/hot
phrases), in a large partition containing 10 million items that is configured for storage
on disk will take considerably longer, potentially many minutes.
For larger systems in particular, limiting the use of facets for regions with high
cardinality may be necessary to meet performance objectives.

Protected Facets
As noted above, the time required to generate facet data structures can be material. In
addition to building search facets on demand, it is possible to specify facets that are
known to be commonly used. On startup, the data structures for these facets will be
built if they are not in the Checkpoint; they are excluded from facet recycling (never
destroyed); and they are optionally saved in the Checkpoint file for faster loading on
next startup. Content Server uses this feature. To build protected facets at startup, in
the [Link] file, specify the regions in the [Dataflow_] section:
PrecomputeFacetsCSL=region1,region2,region3

As an option, the protected facets may be stored in the Checkpoint file. This also
means a copy of the facet data is maintained in the Index Engines, which requires
additional memory. To enable persisting facets in the Checkpoints, in the [Dataflow_]
section of the [Link] file add:
PersistFacetDataStructure=true

When specifying protected regions, you should also ensure that the desired number of
cached facets is greater than or equal to the number of protected facets specified in
this list. The desired number represents the point at which the search engine will begin
recycling non-protected facets to make room for new facets requested in queries. In
addition, the maximum number of facets should be higher still. The maximum number
of facets is the limit, which may be higher than the desired number if there are many
facets requested in a single query. Beyond this maximum number, the facet requests
are discarded.
DesiredNumberOfCachedFacets=16
MaximumNumberOfCachedFacets=25

Search Agents
Search Agents are stored queries that are tested against new and changed objects as
part of the indexing process. The two most common uses of Search Agents are
staying up to date on topics of interest and assigning classifications.
The monitoring case is illustrated by the Content Server concept of Prospectors.
Consider a situation where you want to know everything about a particular customer.
You construct a query to match the name of the customer or a few of the known key
contacts at that customer. By adding this as a Prospector, you are notified any time
new data is indexed that matches this query.
For classification, you construct a set of queries that define a specific classification
profile. For example, if all customer service requests use a form that contains the text
“customer support ticket”, then this query is attached to the classification agent, and
any object containing this phrase is marked with the classification. By using many
queries, you can build a complete set of classification categories. One object may
match several possible queries, and be tagged with multiple classifications this way. In
Content Server, this is known as Intelligent Classification.
In operation, the queries to be tested against new data are contained in a file. Matches
to the search agent queries are placed in iPools which are monitored by the parent
application, typically Content Server.

Search Agent Scheduling


There are two ways that Search Agents can be run: with every indexing transaction, or
on an interval since the last run.
Transaction based execution is the default for backwards compatibility. Indexing
transactions are generally in the range of 500 items, so the overhead of running the
agent queries with each transaction can be very high.
Interval Execution
The preferred approach is to run the agents based on an interval. At the completion
of an indexing transaction, the time since the last agent run completed is checked, and
if sufficient time has elapsed, the agents are run. The metadata region
OTObjectUpdateTime is used to construct queries that match objects which have been
added or changed since the last agent execution. This mode of operation was first
introduced in version 16.2.11 and can dramatically improve search throughput when
agents are used.
To configure interval execution, set the "RunAgentsEveryTransaction" value to false,
and set the interval in milliseconds (the default is 5 minutes). As a rough guide, the
default interval setting will use about 10% of the compute resources for executing
search agents compared to transaction-based agents.
[UpdateDistributor_xxx]
RunAgentsEveryTransaction=false
RunAgentIntervalInMS=30000

If the interval is set to a value of -1, the agent execution will pause. There is no loss of
activity – when the interval is restored to a positive value, the agent queries will include
all objects that were indexed while paused. Pausing may be desirable if there is a
temporary need to maximize indexing performance.
The Update Distributor keeps track of the agent execution in files that are stored in a
subdirectory of the search index:
index/enterprise/controls

The files are named upDist.N and contain the timestamp for each of the last agent
runs, expressed as milliseconds since Jan 1, 1970 (the Unix Epoch). A sample file is
shown below.
UpDistVersion 1
SearchAgentBaseTimestamp 1571261889130 "MySA0"
SearchAgentBaseTimestamp 1571261889130 "MySA1"
EndOfUpDistState
The timestamp field used by default is the OTObjectUpdateTime. The field can be
changed, but there are currently no known scenarios where the default value should
not be used.
[Dataflow_xxx]
AgentTimestampField=OTObjectUpdateTime
When using interval agent execution, the Update Distributor timing summaries will
include the time spent running agent queries, identified with the label SAgents.

Search Agent Configuration


Search Agents are defined within the [Link] file. Multiple Search Agents are
possible. Each Search Agent has a separate section and an entry in the [Dataflow_]
section identifying the name of the agent. You must specify one query file per search
agent.
[Dataflow_]
SA0=agent1

[SearchAgent_agent1]
operation=OTProspector
readArea=d:\\locationpath
readIpool=334
queryFile=d:\\someDirectory\[Link]
The readArea and readIpool parameters specify the file path and directory name where
iPools with results from the Search Agent should be written. These are then consumed
by the controlling application.
The queryFile contains the search queries to be applied during indexing. You can have
many search queries within each queryFile.
The operation can be one of OTProspector or OTClassify. This value does not change
the operation of the search agents, but is recorded in the output iPools, and is used to
help the application (typically Content Server) determine how the iPool should be
processed.

Search Agent Query Syntax


A Search Agent query file uses the UTF-8 character set encoded in plain text form,
with no Byte Order Marker (BOM) at the start. The file has the following sample syntax:
SET LEXICON English
SET THESAURUS English
SET EXTRAFILTER ( != "<OTCurrentVersion>FALSE" ) and
( [ region "OTSubType" ] range
"136|557|144|749|751|0|145" )
SYNC "SomeIdentifier"
SELECT "OTObject", "OTScore" WHERE [REGION
"OTEmailSenderAddress"] "bsmith@[Link]"
get results score 1
SYNC "AnotherID"
SELECT "OTObject", "OTScore" WHERE [REGION "OTDate"] >
"20110201"
get results score 1
The SET and SELECT commands are identical to the SET and SELECT query
commands that would be issued to the Search Federator in regular searching. The
EXTRAFILTER section at the beginning is a shortcut representation for adding more
query terms to the WHERE clause of every SELECT statement.
The SYNC command basically echoes the string into the output iPool messages. It is
used by the application requesting the Search Agents to separate or identify the results
from each search query.
The special command “get results score X” requests that all results with a computed
relevance score greater or equal to X are returned. Given that the relevance score is
always between 1 and 100, the value of 1 here requests that all results be returned.

New Search Agent Query Files


Search Agent query files are assumed to have the file extension “.in”. The Update
Distributor consumes these files. In order to prevent contention in the event that an
application tries to modify a search agent query file during processing, a special file
naming convention is used.
The application requesting Search Agents should create the query file with the file
extension “.new”. When the Update Distributor next runs the Search Agents, it will look
for files with the .new extension, and rename them to .in files.
For example, assume that your application creates a Search Agent file named
[Link]. The Update Distributor will delete any existing [Link] file and rename
[Link] to [Link]. This approach allows Search Agent queries to be modified
without changing the [Link] file and restarting the Update Distributor.

Search Agent iPools


Search Agents generate iPools for external consumption in a specific form, illustrated
below with a fragment from an Intelligent Classification iPool from Content Server.
White space has been added for legibility.

<Object>
<Entry>
<Key>OPERATION</Key>
<Value>
<Size>10</Size>
<Raw>OTClassify</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>297</Size>
<Raw>
<SYNC1>2959</SYNC1>
<Q1N0>OTObject</Q1N0>
<Q1R0C0>
<OTObject>DataId=16412&Version=1</OTObject>
</Q1R0C0>
<Q1N1>OTScore</Q1N1>
<Q1R0C1>93</Q1R0C1>
<Q1R1C0>
<OTObject>DataId=16389&Version=1</OTObject>
</Q1R1C0>
<Q1R1C1>39</Q1R1C1>
<Q1R2C0>
<OTObject>DataId=16390&Version=1</OTObject>
</Q1R2C0>
<Q1R2C1>29</Q1R2C1>
</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>2178</Size>
<Raw>
<SYNC2>3276</SYNC2>
<Q2N0>OTObject</Q2N0>
<Q2R0C0>
<OTObject>DataId=16388&Version=0</OTObject>
</Q2R0C0>
<Q2N1>OTScore</Q2N1>
<Q2R0C1>71</Q2R0C1>
<Q2R1C0>
<OTObject>DataId=16398&Version=0</OTObject>
</Q2R1C0>
<Q2R1C1>71</Q2R1C1>
<Q2R2C0>
<OTObject>DataId=16409&Version=0</OTObject>

The Search Agent type, in this case OTClassify, is the first entry in the iPool. This
value is drawn from the [Link] file in the Search Agent configuration setting.

NOTE: Each section contains a SYNC value, which is the separator


specified in the search agent query file. Content Server uses these
SYNC values to match the search results to the originating query.

The search results themselves are presented with a naming convention that reflects a
QUERY, ROW, COLUMN numbering scheme. For instance, the value <Q2R0C1>
is used for Query 2, Row 0 (the first result), Column 1 (the second region in the select
clause). Likewise, the value <Q1N0> is used to label the name of Column 0 (the first
column) for Query 1 (in this case "OTObject"). Note that the names of the regions are
only provided in the first row for a given query.

Performance Considerations
Search Agents are not free. Although the Agents are only applied to newly added
objects, the frequency, complexity and number of queries run as agents can have a
noticeable impact on indexing performance. For applications with high indexing rates,
Search Agents may not be an appropriate feature.
If you require these types of features for high indexing volumes, you can consider
implementing your solution using standard search queries, serviced by the Search
Engines. By enabling the TIMESTAMP feature for objects, the exact indexing time of
objects can be determined, and a pure search application can provide similar features,
running on a scheduled interval.

Relevance Computation
Relevance is a measurement of how well actual search results meet the user
expectations for search result ranking. Relevance is a subjective area, based upon
user judgments and perception, and often requires experimentation and tuning to
optimize. This is one of the fundamental challenges with relevance tuning: if you
improve relevance for one type of user, you may well be reducing relevance for other
users who have different expectations.
Relevance is a method for determining how close to the top of the list a search result
should be placed. However, relevance has NO IMPACT on whether an object actually
satisfies a query. If a query matches 100,000 items, tuning relevance only affects the
ordering of the items, not which items are matched.
Search relevance is not entirely the responsibility of the search engine. Relevance
scoring is a function of many parameters, most of which are provided by the
application, such as Content Server. Tuning Content Server is also required to
optimize search relevance, but this document will focus more on the OTSE
contributions to relevance.
For typical users trying to find objects, relevance is an important consideration, and the
search results are usually presented sorted by the relevance score. However,
relevance is not a consideration for certain types of applications. For example, Legal
Discovery search is concerned with locating all objects, but does not care about the
order of presentation. Likewise, when using search to browse, results are often sorted
by date or object name.

Retrieving the Relevance Score


The search results will commonly return a relevance score value in the OTScore
region. Simply select OTScore as part of the search query to include the relevance
score in the results.
Some notes about this value are in order. Firstly, this score is NOT an assessment of
the relevance of the object that reflects user expectations. This is a relative value that
assists with ordering of results. In other words, a value of 100 does not mean a perfect
match for the query, and a relevance score value of 20 does not mean the object is
irrelevant. Because of this, displaying the relevance score to users may be misleading,
and is generally not recommended for casual users.
If you are writing an application that consumes search results, you should also be
aware that the OTScore field does not always contain a relevance score number. This
region contains the region values that reflect the requested sort order for results.
Often, this is relevance. But if you are sorting results by date or text regions, then the
OTScore region will not contain the relevance score.

Components of Relevance
There are two different types of computations that are applied to objects in the index
to determine their relevance. The first is “ranking”, which is a computation applied in
the same way on every search query. Ranking typically adjusts relevance by giving
higher weights to recently created objects, office documents, or known important
locations. Before Search Engine 16, Ranking was the only available relevance scoring
method, and ranking and relevance were often used interchangeably.
Beginning with Search Engine 16, a second type of relevance computation is available,
known as “boost”. Unlike Ranking, the Boosting parameters are dynamic, and are
provided on each query. This permits the application to add relevance adjustments
based on context, such as the user identity or current folder location.
The remainder of this section will cover the Ranking capabilities, with Boost features
detailed later. You can mix and match both Ranking and Boost, although each
additional relevance feature slightly increases the overall search query time.
In most cases, the ranking configuration is comprised of weights and regions. The
weights indicate how important the parameter is in scoring. Note that these weight
values are relative. Setting all the weights high is the same as setting all the weights
to a medium value. The difference in weights is ultimately what matters.

Some of the explanations below contain simplified versions of the equations used to
compute the scores. They are simplified to the extent that a number of additional
computations are performed to adjust the results from each computation to a
normalized range. The equations presented here are only intended to clarify the
impact that adjustments to the parameters make on the ranking computations.
Note: result scoring has been improved with OTSE 21.2. The relevance computation
no longer includes adjustments for a number of search clauses that have little meaning
for relevance, such as Boolean operations, and “not” clauses (not in(), not termset(),
not stemset(), etc.). Synonym-Or (“SOR”) has changed to score matching either/both
terms as a single value.

Date Ranking
The date an object was created or updated is typically an important aspect of
relevance, especially for a dynamic or social application. In these cases, users tend
to favor objects that are recent. Applications such as archival on the other hand
typically do not care about recentness, and different settings might be appropriate.
The date ranking parameter allows you to identify metadata regions which contain date
values that reflect the recentness of an object, and configure their scoring parameters.
Date ranking is computed using a decay rate from the current date. The decay rate is
one of the configurable values. Small values for decay rates will reduce the score of
older items more rapidly. A simplified approximation of the algorithm is:
Date Relevance = decay / (recentness + decay)
In practice, a very aggressive value that strongly favors recent objects would be a
decay rate of 20 days. Consider this chart of some representative values. The decimal
values in the body of the table represent the contribution to ranking, with higher values
representing higher ranking.

                AGE IN DAYS
DECAY RATE      3      10     30     60     180
10              0.77   0.50   0.25   0.14   0.05
20              0.87   0.67   0.40   0.25   0.10
30              0.91   0.75   0.50   0.33   0.14
50              0.94   0.83   0.63   0.45   0.22
100             0.97   0.91   0.77   0.63   0.36

Clearly, small values of decay rates generate small ranking contributions for older
items. Remember that the date ranking value is only one component of the ranking
score, and you also control the weight to be applied to this computed value.
The syntax for the date ranking configuration in the [Link] file is:
DateFieldRankers="dateRegion",decay,weight

For example, the following would use the last modified date on an object to compute
date ranking, with a moderately aggressive decay of 45 – but then make the overall
contribution of date to the ranking score small by giving it a weight of 2:
DateFieldRankers="OTModifiedDate",45,2

The date scoring algorithm supports multiple elements. For example, if you had two
different metadata regions that commonly contain important dates that reflect object
recentness, you can specify both, and each is independently computed and added to
the overall ranking score:
DateFieldRankers="OTCreateDate",45,50;"OTVerCDate",30,30

The DateFieldRankers setting is recorded in the [Link] file, and Content
Server exposes this configuration setting in the search administration pages.

Object Type Ranking


The ranking computation contains a relatively flexible Type Ranking mechanism for
selectively adjusting the ranking score if the contents of a region match a specific value.
Within Content Server, this capability is presented as a way to boost the score for
certain MIME types or object subtypes. In practice, this feature can be used in more
flexible ways if your application requires it.
The syntax for the object type ranking component looks like this:
TypeFieldRankers="region1",RW1:"textA",TWa:"textB",TWa:"textC",TWc;
"region2",RW2:"textD",TWd:"textE",TWe;
Where:
regionName is an ENUM or TEXT type region in which tests for the specified text will
occur
TW (TextWeight) is the relative weight for this text string within this region, and
should be an integer from 1 to 100.
Text is a string to check. If found, the associated text weight is used
RW is the region weight (importance) attached to Object Type Ranking relative to
other elements of the ranking computation.
Support for TEXT in addition to ENUM was first available starting with Search Engine
10.0 Update 11.
A simple example is shown below, which illustrates an application of this feature that is
not actually object type scoring. In this example, we examine a region that describes
the department which owns an object. If Finance owns it, attach a score of 40, and a
score of 30 if Sales owns it. Then set an overall weight of 33 for this test relative to the
other ranking components.
TypeFieldRankers="Department",33:"Finance",40:"Sales",30;

NOTE: These are adjustment values for ranking. Any objects which do not meet the
criteria for the adjustment have an effective score of 0 for this computation.
Content Server exposes this [Link] setting in the search administration pages.

Text Region Ranking


It is common that specific metadata regions will contain text that is deemed to be
particularly important when determining the relevance for keyword matching. Often,
these are metadata regions that would contain the name or description of an object.
OTSE allows you to identify the text regions that should be given extra weight when
assessing keyword hits. The syntax is:
ExtraWeightFieldRankers="region1",weight1;"region2",weight2
In order for this feature to work, the regions being adjusted must also be included in
the list of default search regions. Typically, extra weight would be given to fields such
as the file name or description of an object. This setting is found in the [Link] file,
and Content Server exposes this configuration setting in the search administration
pages.
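For example, a configuration giving extra weight to the name and description regions might look like this sketch (the weights shown are illustrative and should be tuned for your application):
ExtraWeightFieldRankers="OTName",60;"OTDComment",40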

Full Text Search Ranking


Even in the absence of any of the specific ranking variables, OTSE will compute the
rank of an object based upon the statistical distribution of the matching search terms.
The algorithm is quite complex, and varies depending on the types of operators in the
query. The default value for the relative weight of the Full Text component is 100. This
weight is applied to the larger of either the Full Text component, or the default search
regions component. The text ranking algorithm is roughly based on the industry
standard “BM25” formula. Some general guidelines:

Relative frequency
The relative ratio of matched search terms to the overall content size is a factor. The
higher this ratio, the higher the relevance. An obvious example… assume you search
for “combustible”. If document ROMEO has the word combustible 30 times in 1000
words (3%) and document JULIETTE has 50 instances of combustible in 2000 words
(2.5%), then document ROMEO will be ranked higher.
Frequency
The more often the search terms occur in the text for an object, the higher the ranking
score.
Commonality
The more common a search term is in the dictionary for this partition, the less weight
it is given in computing the text score. For example, with typical English language
data, if you search for keywords “the” AND “scooter” – the value given to matches for
“scooter” will be considerably higher than matches for “the”, since “the” is overly
common.

The full text search ranking algorithm is applied to the indexed content, plus any
metadata regions defined in the default search list. The relative weight of the full text
search is also configurable. Both values are specified in the [Link] file.
The default region search list is defined in the search INI file as:
DefaultMetadataFieldNamesCSL="OTName,OTDComment"
ExpressionWeight=100
Content Server exposes the list of default regions to search in the administration pages
for search, and the values are stored in the [Link] file. Remember to ensure that
any metadata text regions given an adjusted score are included in this default region
search list.

Object Ranking
The search ranking algorithm also allows external applications to provide ranking hints
for objects. In a defined metadata field, the application can provide a numeric ranking
score – an integer between 0 and 100. The search ranking algorithm can incorporate
this ranking value into the overall rank. You have the ability to set a ranking value for
each object, define the field to be used for object ranking, and assign an overall weight
to Object Ranking relative to other elements of the ranking algorithm. If there is no
Object Ranking value for an object, it gets a ranking adjustment of zero.
The Object Ranking settings are kept in the [Link] file. In the example below,
OTObjectScore is the metadata region that contains the ranking value, and 80 is the
relative weight attached to the Object Ranking component of the ranking calculation.
ObjectRankRanker="OTObjectScore",80
If you are developing applications around search, using the Object Ranking feature
can improve the overall user experience. Some of the common events used to modify
the ranking include tracking objects that are popular for download, objects placed in
particular “important” folders, how frequently objects are bookmarked, or other
situations which are appropriate to the application. As a developer, you also need to
remember to degrade the object ranking over time – an object which is important now
may well lose its relevance later.
One other observation for developers setting Object Ranking values: as described
elsewhere in this document, OTSE supports indexing select metadata regions for
objects. You do not need to re-index the entire object in order to set the Object Rank
value; using the ModifyByQuery indexing operation is usually a good choice. Re-
indexing the entire object each time a ranking value changes would likely have a
material negative impact on overall system performance – both on the application and
OTSE.
Within Content Server, the use of Object Ranking is a feature that is leveraged by the
Recommender module.

Relevance Boost Overview


Unlike ranking adjustments to relevance, boosting adjustments are specified in the
search query, and can differ with each query. The boost syntax varies depending on
the type of boost being requested. In operation, the ranking operation takes place first,
and results in an interim score in the range of 0 to 100. Boost operations are applied
later, and modify the ranking score to generate the final relevance score.
Relevance boosting is specified in the ORDEREDBY section of the search query:
SELECT … WHERE … ORDEREDBY SCORE[N] boost parameters

SCORE[N] identifies that boost adjusting is desired. N is a multiplier (in percent) of the
relevance computed in the ranking algorithm. Normally, N of 100 would be
recommended, which means that the ranking values are used without modification. If
N was 80, then the ranking values would be multiplied by 0.8 before final adjustments
from boosting. Setting the value of N to 0 would cause the ranking component of
relevance to be ignored (treated as 0).
There are three types of boost operations that may be applied: query (text), date and
integer. Boosting may allow the score to rise above 100, but never below 0.

Query Boost
This boost method is used to adjust the relevance based on whether an object matches
query clauses. For illustration, consider the following example…
SELECT "OTObject" where "animal" ORDEREDBY Score[100] "dog"
BOOST[-10] "cat" BOOST[+15] ("t-rex" and "evolution")
BOOST[+%40]

The query will match items containing the text “animal”. However, we are less
interested in objects that also contain the text “dog”, so 10 is subtracted from the
relevance score. The user likes cats, so if the result contains the text “cat”, then we
add 15 to the score. If the result contained both “dog” and “cat”, then the net
adjustment would be +5. The full text clauses do not need to be simple, as shown with
the dinosaur adjustment. The dinosaur adjustment also illustrates that the relevance
can be boosted by a relative percentage. The text clause can also specify text
metadata regions and include complex parameters…
SELECT "OTObject" where "accident" ORDEREDBY Score[100]
([region "model"] in ("ford","Toyota",”gm") and [region
"Date"] > "-2m") BOOST[+15]

Date Boost
This boost method is used to adjust the relevance based on how closely the value in a
Date region matches a target date. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Date,"region","target",range,adjust]

Date is a keyword, indicating the boost method.
Region is the metadata field in the search index that should be tested.
Target is the date we are comparing against.
Range is an integer number of days on either side of the target for which a boost
adjustment should be applied.
Adjust is an integer value that specifies the maximum adjustment to be applied if the
value in the region is an exact match for the target. The adjustment is reduced in a
linear fashion based on distance from the target.
An example is in order.
SELECT … ORDEREDBY Score[100]
BOOST[Date,"OTCreateDate","20140415",60,40]

This boost essentially states: Examine the value in OTCreateDate for each matching
search result. If the value is April 15 2014, then add 40 to the relevance score. If the
value in the OTCreateDate field is within 60 days of April 15, then add a pro-rated
value. For example, if the value in OTCreateDate was May 30 (45 days away), then
adjust the relevance score by 10 (which is 40 * (60-45)/60).
The intent of this type of boost is to help users find items based on dates. A typical
use case might be “I am trying to find a document that I think was issued June of 2000,
but maybe I am off by 6 months”. Any document in that +/- 6 month range gets a
boosted relevance, with a higher adjustment the closer to the target date.
Another common application would be adjusting for recentness, where the target date
is today, and all objects with dates within 90 days receive an adjustment.
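A sketch of such a recentness boost (the region, target date and adjustment values are illustrative; an application would normally generate the target as the current date at query time):
SELECT … ORDEREDBY Score[100]
BOOST[Date,"OTModifiedDate","20210401",90,25]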

Integer Boost
This boost method is designed to allow a range of values to be mapped to a relevance
contribution. For example, if there was a “usefulness” rating for a document on a scale
of 1 to 10, you could use that range to boost relevance on the objects. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"region",lower,upper,adjust]

Integer is a keyword, indicating the boost method.
Region is the metadata field in the search index that should be tested.
Lower is an integer representing the starting point of the range of interest. Values at
the lower end of the range receive small adjustments. Values lower than the lower
limit are ignored, and have no adjustment.
Upper is an integer representing the high end of the range of interest. Values near
the upper limit receive large adjustments. Values higher than the upper limit are
ignored, and have no adjustment.
Adjust is an integer value that specifies the maximum adjustment to be applied if
the value in the region is an exact match for the Upper value. Between Lower and
Upper, the adjustment is scaled proportionately.
To illustrate the concept:
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,30]

This boost essentially states: Items with a Popularity value greater than 100 and less
than or equal to 200 will receive a relevance boost of up to 30. A value of 200 gets the
maximum adjustment of 30. A value of 120 would get a boost of 6 [ =30*(120-
100)/(200-100) ].

Multiple Boost Values


Multiple boost values can be requested. Note that each boost computation will
increase the search query time. A query with multiple boost values might look like this:
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,10]
BOOST[Date,"SalesDate","20160115",16,5]
(in ("sales order","purchase order","PO") and [region
"salesrep"] = "doug") BOOST[-5]

Query versus Date / Integer Boost


You can use Date or Integer metadata regions in a Query Boost. For example, if you
simply wanted to boost the relevance of objects created on July 8 2015, you could use:
SELECT … ORDEREDBY Score[100] [region "Date"] "20150708"
BOOST[+15]

So why are there separate methods for Dates and Integers? The Date and Integer
boost features allow the boost adjustment to be varied depending on how close the
values are to a target, versus the all or nothing adjustment that occurs with Query
Boosting. If you have applications where getting close is useful, versus matching
exactly, the Date or Integer Boosting is superior.

Content Server Relevance Tuning


The section on “Relevance Computation” covers the generic principles available for
adjusting the search relevance scoring algorithm within OTSE. In this section, we
briefly look at some considerations for Content Server.
A small survey of Content Server customers revealed some interesting data. The
majority of Content Server installations are using the default relevance algorithm
settings. Content Server is a very flexible solution, used for a wide range of
applications. It is likely that many customers can improve their search experience by
understanding the effects of adjusting the relevance settings. This is particularly true
of customers that have upgraded from older versions of Content Server, and simply
bring forward their old configuration as part of the upgrade.
The first step is to consider your application and user expectations. In some cases,
search relevance won’t be an issue. For example, if you always sort results by date or
a metadata region, then search relevance scores are immaterial. If your primary
objective is building collections for eDiscovery applications, then gathering all search
results is far more important than which ones show up at the top of the list.

For most customers however, a review of their search expectations and some Content
Server 16 considerations are in order.

Date Relevance
This is usually an important factor. Content Server has many ‘Date’ fields, where the
date represents specific information. Consider some of the following:
Creation Date – usually refers to the date an object was added to the system. Often
this is a good value for relevance, but the creation date only refers to the first version.
Versioned objects which are updated will not change this date, which reduces its value
for these data types.
Version Creation Date – for versioned objects, such as documents, this is a good
choice. Each version of the object gets an updated version creation date. On the other
hand, many objects do not have the concept of a version creation date.
Modified Date – for some types of objects, such as folders, the modified date clearly
identifies when the folder has been created or updated. However, for other types of
objects, the modified date is too volatile. Depending upon other settings in Content
Server, the modified date may change for many reasons, and therefore does not reflect
the user expectation for when an object has truly changed.
Understanding which types of objects are most important in your application for search
relevance will help you determine which Content Server date values should be used
for date relevance scoring.
There are several other date fields in Content Server that may also be used. Review
the types of objects that are most important for your application, and choose dates that
best reflect creation or change that users would consider material to search relevance.
Recent experiments suggest that new default values for Content Server using both the
Creation Date and the Version Creation Date, with relatively high weights, may be a
good choice for typical document management and workflow applications.
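Using the DateFieldRankers syntax described earlier, such a configuration might look like this sketch (the decay rates and weights are illustrative starting points, not shipped defaults):
DateFieldRankers="OTCreateDate",60,50;"OTVerCDate",60,50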

Boosting Object Types


Historically, a feature known as Object Type Ranking has been used with defaults that
provide a boost to objects based upon their MIME types or their Content Server object
subtypes. Usually, this is used to boost typical “Office” document formats.
This is very easy to review and optimize for the types of content that you want to
emphasize in your system. If boosting Microsoft Office documents based upon their
MIME types continues to be important, there is an important consideration here.
Historically, there were very few MIME types used for Microsoft Office documents. With
the recent versions of Microsoft Office, this situation has changed. There are now
more than 20 MIME types officially used to represent Microsoft Office 2007 files alone.
The following chart is from the Microsoft TechNet web site.

File Extension  File Type                                   MIME Type
.docx           Word 2007 document                          application/vnd.openxmlformats-officedocument.wordprocessingml.document
.docm           Word 2007 macro-enabled document            application/vnd.ms-word.document.macroEnabled.12
.dotx           Word 2007 template                          application/vnd.openxmlformats-officedocument.wordprocessingml.template
.dotm           Word 2007 macro-enabled document template   application/vnd.ms-word.template.macroEnabled.12
.xlsx           Excel 2007 workbook                         application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.xlsm           Excel 2007 macro-enabled workbook           application/vnd.ms-excel.sheet.macroEnabled.12
.xltx           Excel 2007 template                         application/vnd.openxmlformats-officedocument.spreadsheetml.template
.xltm           Excel 2007 macro-enabled workbook template  application/vnd.ms-excel.template.macroEnabled.12
.xlsb           Excel 2007 binary workbook                  application/vnd.ms-excel.sheet.binary.macroEnabled.12
.xlam           Excel 2007 add-in                           application/vnd.ms-excel.addin.macroEnabled.12
.pptx           PowerPoint 2007 presentation                application/vnd.openxmlformats-officedocument.presentationml.presentation
.pptm           PowerPoint 2007 macro-enabled presentation  application/vnd.ms-powerpoint.presentation.macroEnabled.12
.ppsx           PowerPoint 2007 slide show                  application/vnd.openxmlformats-officedocument.presentationml.slideshow
.ppsm           PowerPoint 2007 macro-enabled slide show    application/vnd.ms-powerpoint.slideshow.macroEnabled.12
.potx           PowerPoint 2007 template                    application/vnd.openxmlformats-officedocument.presentationml.template
.potm           PowerPoint 2007 macro-enabled presentation template  application/vnd.ms-powerpoint.template.macroEnabled.12
.ppam           PowerPoint 2007 add-in                      application/vnd.ms-powerpoint.addin.macroEnabled.12
.sldx           PowerPoint 2007 slide                       application/vnd.openxmlformats-officedocument.presentationml.slide
.sldm           PowerPoint 2007 macro-enabled slide         application/vnd.ms-powerpoint.slide.macroEnabled.12
.one            OneNote 2007 section                        application/onenote
.onetoc2        OneNote 2007 TOC                            application/onenote
.onetmp         OneNote 2007 temporary file                 application/onenote
.onepkg         OneNote 2007 package                        application/onenote
.thmx           2007 Office system release theme            application/vnd.ms-officetheme

For new installations of Content Server, the use of MIME types and OTSubTypes for
Type Ranking is discouraged in favor of using OTFileType instead. OTFileType is
generated by the Document Conversion Server during indexing, and gives every object
a type such as “Microsoft Word”, “Adobe PDF” or “Audio”. This greatly simplifies
constructing the Type Rank, and improves accuracy.
Note that OTFileType was introduced in Content Server 10 Update 5, with some minor
tuning since then. If you have older data, then you may need to re-index the objects.
Details about the values for OTFileType are not included in this document. Some of
the more common values you may want to configure for Type Ranking using the
OTFileType region might be:
Word, Excel, PowerPoint, PDF, Folder, “Web Page”, Text, Audio, Video or Email.
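As an illustration, a Type Ranking configuration built on OTFileType might look like the following sketch (the weights are hypothetical and should reflect the content you want to emphasize):
TypeFieldRankers="OTFileType",50:"Word",40:"Excel",35:"PDF",30:"Email",15;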

Boosting Text Regions


For the majority of Content Server customers, boosting the name and description of
an object (OTName and OTDescription) are reasonable approaches. This basically
means that if the search keywords are in the name or description, the object gets
pushed higher in the rankings.
If users do not enter a name for the object, the file name of a document often becomes
the name of the object. However, the original file name may be different than the
managed object name. Because of this, you may also wish to consider adding the
actual file name to the boosted search regions. Be aware that this may “double boost”
objects where the file name is the same as the object name.
If you have a lot of content that is comprised of web pages, you may wish to add the
HTML keywords field (typically OTFilterKeywords) to the list of boosted text regions.
Remember to make sure that any regions that you boost here are also in the default
metadata region search list.
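A sketch of a configuration that also weights the file name (OTFileName is the typical region for the original file name; verify the region names in your own index):
ExtraWeightFieldRankers="OTName",50;"OTDComment",40;"OTFileName",30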

Default Search Regions


If you do not specify a region in which a term should be searched, the default behavior
is for OTSE to search within a list of regions. The relative rank of this component is
shared with the full text content weight. This is the default behavior that a typical user
will leverage in a basic search query; searching within specific regions is generally an
advanced search feature in most applications.
When using the default search regions, it is not necessary to find all the search terms
within a single region. For example, if the search term is blue butterflies in
the amazon, a potential match could have blue in the name, butterflies in the
description, and the remaining keywords in the body text.
Content Server ships with a number of default regions configured for relevance
searches. Default regions are searched if the user does not specify a region for a
WHERE clause. Default regions can simplify typical searches and improve relevance,
but each region added to the default list increases the time required to performance a
search. The choice of regions that should be included in the default search regions is
an important consideration when fine tuning search relevance. As a general rule, you
should try and include any regions that contain overview, name or descriptive values.
Taxonomy labels, if used, are another good [Link] should definitely review
these with an eye towards your expected use. Some examples…
Does the average user need to find workflow items? Perhaps workflow values
should be removed from the list.
If email messages are a key part of the managed content you wish to find, adding
the email sender, recipient or subject fields to the default search regions may be a
good idea (see the sketch following this list).
If you have added custom applications with descriptive metadata regions, you may
want to consider whether any of those regions should be in the default search
region list.

Are HTML pages a key part of your data? Consider adding the HTML keywords
region to the default search regions.
Some applications, such as eDiscovery, are biased towards searching all possible
regions. The challenge is this: more default search regions results in slower query
performance. For small numbers of regions, this is not an issue. For eDiscovery, with
thousands of potential Microsoft Office document properties, this performance
degradation can be material. The “Aggregate-Text” features of the search engine may
be helpful for these situations.

Using Recommender
Recommender is a feature of Content Server which monitors user activity, and
leverages the Object Ranking feature of the search engine to boost the relevance
scores of certain objects. Specifically, the feature of Recommender known as “Object
Ranker” is responsible for computing relevance adjustments and triggering the
appropriate indexing updates. You can review the use of Recommender in the Content
Server documentation.

User Context
Statistically, a user is more likely to be searching for objects that meet one or more of
these types of criteria…
• It is located in my personal work area;
• It was created by me;
• It is located in the folder in which I am currently working;
• It is located in a sub-folder of my current location;
• It is in a location where I was recently working;

OTSE has no knowledge of the user performing a search. Content Server, however,
is aware of the user identity and location. New to Content Server 16, the relevance
boost features allow user context to be incorporated in relevance computation. For
example, each query could specify that items with the current user in the “created by”
metadata fields are emphasized, or that objects in specific locations and folders have
their relevance score enhanced. You should review these configuration settings in
Content Server, and adjust them to reflect your expected user behaviors.
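A sketch of what such a context-aware query might look like (the region names OTCreatedBy and OTLocation and the ID values are illustrative; Content Server builds the real clauses from its own region names and the session's user and location):
SELECT … ORDEREDBY Score[100]
([region "OTCreatedBy"] = "jsmith") BOOST[+10]
([region "OTLocation"] = "2000") BOOST[+5]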

Enforcing Relevancy
Adding Ranking Expressions to a search query results in more work for the Search
Engines. If the default relevance computation is performed (based on the WHERE
clause), then no material penalty occurs since the values are already retrieved as part
of the query evaluation. The Search Engines have an optimization that will determine
if the Ranking Expression is the same as the WHERE clause, in which case the
Ranking Expression computation is skipped. In updates of Content Server prior to
December 2015, the Ranking Expression generated by Content Server differs from the
WHERE clause, which reduces query performance.

There is a configuration setting that will ignore the Ranking Expression and enforce
use of the default WHERE clause ranking. Effectively, this is the same as using
ORDEREDBY RELEVANCY in the query. For older updates of Content Server that
install the 2015-12 or later update, this setting can be used to achieve a modest search
query performance gain. In the [Link] file [Dataflow_] section, add:

ConvertREtoRelevancy=true

Extended Query Concepts


What do words like relevance, stemming and phonetic matching really mean? Key
search concepts and their implementation within OTSE are found within this section.

Thesaurus
OTSE has the ability to search not only for keywords, but for synonyms of keywords,
using a thesaurus system. This section of the document explores the use of a
thesaurus with OTSE.

Overview
Searching with a thesaurus specified allows a query to match synonyms of words. For
example, the English thesaurus might have an entry for house which includes “home”,
“residence” and “dwelling”. A search for the keyword “house” would also match any of
those words if the thesaurus is enabled.
The list of synonyms to be used is contained within a thesaurus file. You can have
many thesaurus files, and each query can specify which thesaurus file should be used.
In practice, this flexibility is generally used to select a thesaurus containing synonyms
for a particular language. OTSE ships with a number of standard thesaurus files:
English, French, German, Spanish, and Multilingual.
It is also possible to use a thesaurus to help find specialized words in specific
applications. For example, a medical thesaurus file could contain alternate names for
drugs, symptoms or other medical terminology. A custom corporate thesaurus could
contain synonyms for products, part numbers, customer names or departments.

Thesaurus Files
Thesaurus files should be placed in the “config” directory. They should follow a naming
convention of “[Link]”, where xxx defines the language and identifies the
thesaurus file as provided in the search query. By convention, OpenText default
thesaurus files are provided for English, French, German, Spanish and Euro
(multilingual) as follows:
[Link]
[Link]
[Link]
[Link]
[Link]
Thesaurus files are stored in a proprietary file format which is optimized for
performance and size. These files are created using a thesaurus builder utility, which
converts a thesaurus from the Princeton WordNet format to the OpenText thesaurus
format.

Thesaurus Queries
In order to leverage a thesaurus in a search query, you choose the thesaurus using
the “SET” command, and specify thesaurus use for a search term using the “thesaurus”
operator in the query select statement.
set thesaurus eng
select "OTName" where thesaurus "home"
The value for the language (in this case “eng”) must match the extension of the
thesaurus file. This is an optional statement. The default language setting for the
Thesaurus is English.
The “thesaurus” operator in the select statement only applies to simple single terms –
it cannot be combined with other features such as proximity, stemming, wildcards or
phrase search.

Creating Thesaurus Files


OTSE contains several utility functions that will create thesaurus files from different
types of data sources.
One supported format is the Princeton WordNet format, a well documented format for
representing thesaurus information. Thesauri for many languages and purposes are
available in this format, many of which are available at no cost. You can create or edit
a WordNet file to create a custom thesaurus, then convert it to OpenText format using
the utility.
The sample syntax here assumes that you are running from the <OTHOME>/jre/bin
directory. The command line for converting a WordNet thesaurus to OpenText format
is:
java -Xmx500M -classpath ../../bin/[Link]
[Link]
y:/sourceWordNet ../config/destThesaurus
The second supported format is the EuroWordNet format. This format is used to link
multiple thesaurus files together as a package. OTSE has two utilities which can build
a thesaurus from a EuroWordNet format. The first of these is used to extract thesaurus
information for a single language, with the form:
java -Xmx500M -classpath ../../bin/[Link]
[Link]
y:/eurowordnet/French/Text ../config/[Link]
The second utility is used to build a multilingual thesaurus, incorporating all the
available EuroWordNet languages into a single thesaurus. Note that this does not
allow you to search in one language and find synonyms in the other languages. The
syntax for building a multilingual thesaurus is:
java -Xmx500M -classpath ../../bin/[Link]
[Link]
aurus y:/eurowordnet/General/Multi ../config/[Link]
OTSE is also capable of generating a thesaurus file from a more generic XML file
representation. The syntax is:

java -Xmx500M -classpath ../../bin/[Link]
[Link] -name thesaurusname -infile inputxmlfile
OTSE can convert an existing thesaurus file to this XML format:
java -Xmx500M -classpath ../../bin/[Link]
[Link] -name thesaurusFileName -outfile xmloutputfile
The form of the XML file to generate or read in the thesaurus management utilities is
shown below. This example is limited to a single entry, the <Headword> section would
be repeated once per entry.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<OTThesaurus>
<Headword>
<Headword_Text>answer</Headword_Text>
<Meaning>
<Meaning_Text>noun meaning</Meaning_Text>
<PartOfSpeech>noun</PartOfSpeech>
<Synonym>response</Synonym>
<Synonym>reply</Synonym>
<Synonym>acknowledgement</Synonym>
<Synonym>riposte</Synonym>
<Synonym>return</Synonym>
<Synonym>retort</Synonym>
<Synonym>repartee</Synonym>
</Meaning>
<Meaning>
<Meaning_Text>verb meaning</Meaning_Text>
<PartOfSpeech>verb</PartOfSpeech>
<Synonym>respond</Synonym>
<Synonym>reply</Synonym>
<Synonym>rebut</Synonym>
<Synonym>retort</Synonym>
<Synonym>rejoin</Synonym>
<Synonym>écho</Synonym>
</Meaning>
</Headword>
</OTThesaurus>

Content Server Considerations


Within Content Server, the search thesaurus features are abstracted. Labels for
languages such as “English” and “French” are mapped to “eng” and “fre” using
configuration files. The search operator in LQL is “qlthesaurus” instead of “thesaurus”.
There is a separate configuration setting for the default thesaurus language. This
thesaurus configuration is covered in more detail in the Content Server administrator
documentation.

Stemming
Stemming is a method used to find words which have similar root forms, called “stems”.
The easiest way to explain stemming is by example.
The words flowers, flowering and flowered all have the same stem: flower. When
stemming is applied during a search, then a search for one of these words would match
any of these words.
The special terminology “stem” is used since the common element is not always a
word. For instance, for algorithmic reasons, the stem for “baby” might be “babi”, which
facilitates matching words such as babied or babies.
Stemming algorithms are not foolproof. In our example of “flower”, the stemming
algorithm might identify that “flow” is the stem – and try to find matches such as flows,
flowing or flowed. Stemming is a useful tool, but cannot always be relied upon to
behave as a user expects.
The concepts that make stemming possible are not applicable to all languages. In
general, Western European languages can use stemming, since plurals, tenses and
gender are typically formulated in terms of appending different endings to root forms
of words. Accordingly, the algorithms for stemming are different for each language.
There are many languages, such as East Asian languages, where the concept of
stemming does not apply.
Because of the language-specific aspects of stemming, a search engine has many
options available for how stemming should be implemented. One approach is to stem
words during indexing, and create an index of word stems. This can result in very fast
searches (since the stems are all pre-computed), but requires that you know the
language at index time. If only one language will ever be used, this is acceptable. In
multi-language environments, it is less useful. Some search implementations will
guess at the language during indexing and stem accordingly, which is statistically
useful but not always correct.
OTSE applies stemming rules at query time. This reduces the size of the index (since
word stems are not stored), but has a query performance penalty since the stems for
candidate words must be computed for each query.
The other key advantage of query-time stemming is that true multi-lingual stemming
can be used. Consider an index containing the following words:
Arrives (in English documents)
Arrivons (in French documents)
Arriva (in Spanish documents)
Each of these words might have the same stem (“Arriv”). By applying the stemming
algorithm at query time, the search system can differentiate between the English,
French and Spanish forms of the word based on the language preferences used for
stemming, since the English algorithms would not generate query expansions for the
words arrivons or arriva. This approach is not perfect, since in many cases similar
languages have common rules. For example, the French word “arriver” would match
the English stem for “Arrived”, since the postfix “er” is also common in the English
language.

OTSE supplies stemming rules for 5 languages: English, French, German, Spanish
and Italian. When building a search query, you request the stemming rules in the “SET”
command, using the language preference. To request a match for keyword stems, use
the “stem” operator on a keyword in the select statement:
SET language fre
select "OTName" where stem "arrive"
The stem operator does not work in conjunction with other operators, such as proximity,
wildcards and exact phrase searches.

English Stemming Rules


The basic objective for English stemming is to find singular and plural forms of words.
Additional rules prevent short words from being stemmed. A summary of the expansion
rules is outlined here:
-s rule:
    plural: add -s if doesn't end in -s
    singular: remove -s if doesn't end in -ss
-es rule:
    plural: add -es if ends with -s, -z, -x, -sh, -ch, -o, but not if already ends in -es
    singular: remove -es in these cases
    address common misspellings:
        replace -s with -es if ends with -zs, -xs, -shs, -chs, -os
        replace -es with -s if ends with -zes, -xes, -shes, -ches, -oes
-y rule:
    plural: replace -y with -ies if consonant before y, e.g. baby/babies
    singular: replace -ies with -y if consonant before -ies
suffix substitutions:
    English -f / -ves, e.g. wolf/wolves
    English -fe / -ves, e.g. knife/knives
    English -man / -men (with no length check), e.g. man/men, workman/workmen

French Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
drop leading "l'" or "d'" (e.g. "l'oiseau" and "d'oiseau"
are replaced with "oiseau")
apply the following suffix substitutions:
-au / -aux, e.g. bureau/bureaux, noyau/noyaux,
oiseau/oiseaux
-eu / -eux, e.g. cheveu/cheveux
-al / -aux, e.g. animal/animaux

-ou / -oux, e.g. bijou/bijoux


same -s rule as English, e.g. famille/familles
same -es rule as English, e.g. sandwich/sandwiches

Spanish Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
plural: add -es if word does not end in -e, e.g.
borrador/borradores, ley/leyes, tisú/tisúes
singular: remove -es if resulting word would not end in -e
suffix substitution: -z / -ces, e.g. voz/voces
same -s rule as English, e.g. libro/libros, pera/peras,
café/cafés, camping/campings

Italian Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
drop leading "l'" or "d'" (e.g. "l'amico" and "d'amico" are
replaced with "amico")
apply the following suffix substitutions:
-o / -i, e.g. gelato/gelati
-a / -e, e.g. casa/case
-e / -i, e.g. bicchiere/bicchieri
-co / -chi, e.g. casco/caschi
-go / -ghi, e.g. lago/laghi
-ica / -iche, e.g. amica/amiche
-ga / -ghe, e.g. paga/paghe
-cia / -ce, e.g. faccia/facce
-cio / -ci, e.g. bacio/baci
-zio / -zi, e.g. negozio/negozi
-gio / -gi, e.g. vantaggio/vantaggi

German Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which typically expands umlaut characters ( äpfel -> aepfel )
and expands the sharp S ( fußball -> fussball ). Expansion rules are:
add-umlaut rule to make plural:
i.e. to the last "a", "o", "u" or "au" (not including
"u" which are part of "au") add an umlaut if one was
not already there (i.e. replace with "ae", "oe", "ue"
or "aeu" (respectively))
e.g. Apfel/Äpfel, Boden/Böden
drop-umlaut rule to make singular:
    i.e. replace the last "ae", "oe" or "ue" with "a", "o" or "u" respectively
plural:
add -e or -en or -er if does not end with e or
e+consonant, add -n otherwise (except when already
ends with -n); also, when adding -e or -er, create
another variant of it with the add-umlaut rule
e.g. Hund/Hunde, Zeit/Zeiten, Kleid/Kleider,
Kugel/Kugeln, Gans/Gänse, Koch/Köche, Fluss/Flüsse,
Maus/Mäuse, Haus/Häuser
singular:
remove -e or -en or -er if the resulting word would
not end with e or e+consonant, remove -n otherwise
(except when ends with -nn); also, when removing -e
or -er, create another variant of it with the drop-
umlaut rule
suffix substitution:
-in / -innen, e.g. Lehrerin/Lehrerinnen
plural:
add -se if ends with -nis, and create another variant
of it with the add-umlaut rule, e.g.
Erlebnis/Erlebnisse
singular:
drop -se if ends with -nisse, and create another
variant of it with the drop-umlaut rule
same -s rule as English, e.g. Auto/Autos

Alternative Stemming Algorithm


There are two implementations of stemming available within OTSE. The default
implementation works by determining the stem of a word, then creating candidate
singular and plural forms to test against. For instance, if you search for “cover”, it forms
the stem “cover”, then tests for “cover” and “covers”. This implementation is optimized
for speed, and only expands the word list to common forms.
There is a second implementation which is much more rigorous and aggressive, which
tests each word in the dictionary to see if it is a possible match for the keywords. This
second implementation is considerably slower, and also matches variations such as
“coverous”, “coverly”, “covered95b”. For most customer applications, this more
aggressive form of stemming is not necessary or appropriate. This form of stemming
was the default in OT7.
The older, more aggressive form of stemming can be enabled in the [Link] file
within the [Dataflow_] section:
UseOldStemmingRules=true

Content Server and Stemming


Within Content Server, stemming for a term can be enabled in several ways. Firstly,
the administrator manages a global setting from the search administration pages that
enables stemming on all simple search bar keyword searches. Within the LQL
language, the prefix QLSTEM is used to specifically request stemming for a keyword.
In the advanced search pages, the modifier “relatedto” invokes stemming.
If stemming is enabled by default, it can usually be disabled in search bar queries by
adding quotation marks around each term. For example, search for “large” “size”.

Phonetic Matching
Phonetic matching, or “sounds like” algorithms, are used to match words that have
similarities when spoken aloud. There are many possible algorithms that can be used
for phonetic matching, and OTSE contains a phonetic matching algorithm which is a
variation of the classic US Government ‘Soundex’ algorithm.
Phonetic algorithms are primarily designed to help match surnames, particularly where
the names have been transcribed with potential errors. Matching surnames is of
particular interest for a number of reasons:
Many surnames were recorded as phonetic equivalents from other languages, often
with variations in spelling.
A name which sounds generally similar may in fact have different spelling, particularly
with language variations. Consider the dozens of variations of the name “Stephen”
that exist, including Steven, Steffen, Steffan, Stephan, Steafán, and Esteban.
There is no master dictionary that contains a “right” way to spell a surname, so it is
common for people hearing a name to write it as they think it should be spelled. Smith,
Smithe, and Smyth are all legitimate surnames – you cannot perform spelling
correction, since they are all correct.
In many applications, names are recorded over a poor quality phone connection, which
can introduce errors. I say ‘Pidduck’, the recipient hears and records ‘Pittock’.
All phonetic matching algorithms share some common attributes. They relax the rules
for matching search terms in certain ways. The result is more terms matching, but with
a decrease in accuracy. This decrease occurs because the algorithms can match
words which are clearly not related, despite having similar phonetic properties.
Matching “Schmidt” when querying for “Smith” makes sense. But you also need to be
prepared for false matches, such as finding “Country” when searching for “Ghandi”.
Phonetic matching is generally NOT recommended for general keyword searching. It
is intended for use with names, and works best when applied against a metadata
region which is known to contain names. Otherwise, the number of false positives will
almost certainly be frustrating to a user.
There is one phonetic matching algorithm within OTSE, a modified “Soundex”
implementation. This algorithm is optimized for English. However, the algorithm is
sufficiently generic that it does provide useful results for many Western European
languages. The phonetic matching does not work for non-European languages.

To request a phonetic match for a keyword in a query, use the modifier ‘phonetic’:
Select X where [region "UserName"] phonetic "smith"
A phonetic modifier can only be applied to a simple keyword, and cannot be combined
with other features such as proximity, wildcards, regular expressions or exact phrase
searches.
There are two dictionaries of terms within the search engine, the primary dictionary for
terms that are “typical” western language words (Western European characters, no
punctuation or numbers), and the secondary dictionary for everything else. Phonetic
matching searches only for terms that meet the criteria for inclusion in the primary
dictionary.

Exact Substring Searching


A particularly difficult use case for searching involves finding exact substrings within
text metadata regions. While regular expressions can find exact substrings, they have
two major restrictions: they are potentially very expensive (slow) and only work on a
single token. Starting with SE10.5 Update 2015-09, a new capability for efficient exact
substring matching has been added that addresses these limitations.
For example, if a metadata region VendorPart has a value such as:
“Vendor_Acme:SSU 876MJACF/24 3.5inchesus”

Users might be accustomed to working only with a subset of the complete value, and
expect to find matches using arbitrary substrings of the value, such as:
Acme:SSU 87
ACF/24
F/24 3.5inches

The traditional searches using tokens, regular expressions and Like operators are not
sufficient.

Configuration
The implementation of exact substring matching is configured on a per-region basis,
and is valid only for text metadata regions. A custom tokenizer ID is configured for the
region in the [Link] file; the custom tokenizer is specified in the
[Link] file; the custom tokenizer is constructed to encode the entire value using 4-
grams.
For example, in [Link] file [DataFlow_xxx] section:
RegExTokenizerFile2=c:/config/[Link]

In [Link]:
TEXT MyRegion FieldTokenizer=RegExTokenizerFile2

Additional details on configuring custom tokenizers are described in the Tokenizer
section of this document.

Note that there is an alternative mechanism available for specifying the entry in the
[Link] file. The [Link] file can be used to logically append lines to
the field definitions file at startup (the file is not actually modified). This alternative can
be used by Content Server to control the configuration, since Content Server does
write the [Link] file.
ExtraLLFieldDefinitionsLine0=TEXT MyRegion
FieldTokenizer=RegExTokenizerFile2

Re-indexing is not required. When the Index Engines are next started, a conversion
of the index for the region will be performed. You can apply or remove a custom
tokenizer this way for existing data.
By convention, tokenizers should be located in the config\tokenizers directory. Content
Server uses this location to present a list of available tokenizers to administrators.

Substring Performance
A region indexed for exact substring matching will require about 8 times as much space
for storing the index for that region. In a typical situation, with only a few regions
configured this way, the storage requirement difference will be minimal. Exact
substring configuration is only possible when the “Low Memory” mode configuration is
enabled for text metadata.
When a region is configured for exact substring matching, every query is equivalent to
having wildcards on either side of the query string. In the example above, a search for
"SSU 87" is effectively a search for "*SSU 87*". No other operators (comparisons,
regular expressions, etc.) are allowed with regions configured for exact substring
searches.
The exact substring is usually much faster than a regular expression because of the
way the indexing is performed. By way of example, assume the indexed value is:
abcdefghijk. Using 4-grams, the following tokens are added to the dictionary: abcd
bcde cdef defg efgh fghi ghij hijk. You want to search for cdefgh. The query engine will
first look for the first 4-gram… "cdef", which is fast because it is in the dictionary. It then
looks for all 4-grams starting with "gh**", and finds values with adjacent "cdef + gh**"
4-grams. While there may be a number of 4-grams in the region beginning with "gh",
this is much more efficient than scanning the entire dictionary with a regular expression
to find matches.

Substring Variations
The choice of Tokenizer determines the behavior of substring matching. The usual
suggested tokenizer would make the data case-insensitive, but otherwise leave all
other characters unchanged, including whitespace and punctuation.
Case sensitivity requires additional mappings in the tokenizer file. By default, the
tokenizer performs upper to lower case conversion. To preserve case sensitivity, add
a section to the start of a tokenizer file:
mappings {
0x41=0x41
0x42=0x42
0x43=0x43
…
}
Include a mapping to itself for every character that requires case preservation. Ensure
that suitable mappings for non-ASCII characters are included if those are important for
your application.
The other use case to be aware of is punctuation normalization or elimination.
Consider the example which includes ACF/24 in the value. If users are not expected
to use the slash character "/" correctly, there are a couple of variations that
may be used. Normalization would convert all (or a desired set) of punctuation to a
standard value, perhaps Underscore. The string would be indexed as if it had the
value:
“Vendor_Acme_SSU_876MJACF_24_3_5inchesus”

If the user searches for Acme-SSU or ACF:24, the engine would similarly convert the
queries to “Acme_SSU” and “ACF_24”, which would then match.
Similarly, elimination strips all whitespace and punctuation from index and query
values. The index is built from:
“VendorAcmeSSU876MJACF2435inchesus”

With elimination, the test queries “Acme-SSU” or “ACF:24” are handled as if they were
“AcmeSSU” or “ACF24”, again generating a match. Eliminating punctuation is
generally better at finding a match (since it also handles extraneous whitespace), but
is not as precise – potentially returning some false positives.

Included Tokenizers
Customizing a tokenizer can be a challenge. To facilitate substring matching, there are
3 tokenizers provided with OTSE that cover the most common exact substring
requirements, in addition to the default tokenizer.
[Link]
This tokenizer is case insensitive, but otherwise preserves all punctuation and
spaces.
[Link]
This tokenizer eliminates all punctuation and whitespace. The strings “[Link]
name” and “123-m&n_amE” are equivalent, being interpreted as “123myname” in
both queries and indexed values.
[Link]
This tokenizer treats email addresses in common forms as a single token. With
the traditional tokenizer, [Link]@[Link] would be 5 tokens, as the
punctuation would be interpreted as white space. The email address tokenizer
would leave the email address intact as a single token. Searching on a single token
for email is faster and more accurate than a phrase search for multiple tokens.

Preserving other Query Features


As noted, once a region is marked for use with exact substrings, you cannot use other
search methods on the region. If you need to have both substring and regular search,
consider using the AGGREGATE feature.
AGGREGATE data types are used to create a new searchable region from a set of
existing regions, building only an index and not duplicating value storage. If our text
region is “VendorPart”, then we can create an AGGREGATE for substring searches:
AGGREGATE-TEXT VendorSubstring VendorPart
FieldTokenizer=RegExTokenizerFile2
In this scenario, we can now perform regular searches against the original region
“VendorPart”, and searches against the region “VendorSubstring” will use the exact
substring searching technique.
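For example (an illustrative query against the aggregate region defined above; the plain term is implicitly treated as a substring match):
select "OTObject" where [region "VendorSubstring"] "ACF/24"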

Part Numbers and File Names


A new feature for OTSE is a set of techniques for optimizing search queries for fields
not normally constructed for human readability. Text metadata regions need to be
configured for this behavior, which is subsequently invoked using the “Like” operator:
[region "PartNumber"] like "widget14"

Problem
Part numbers and file names are primary examples. A human might describe a part
for a machine as: “the 14 centimeter widget that fits jx27 engine”. Instead, we create
names along the lines of “PN3004/widget-14JX27”. Search technology that is
trying to formulate tokens and patterns based on regular sentence structure and
grammar rules will struggle to match these types of values.
Similarly, we create file names such as "SalesForecast2013-europeFRANCE
Rene&Gina1.doc". With file uploads and Internet encoding, this can even inject
strings such as %20 or &amp; into the metadata values. Again, algorithms designed
to parse human language have difficulty succeeding with these metadata fields.

Like Operator
To accommodate these types of metadata search requirements, OTSE includes the
concept of a “Likable” region. If you have metadata that fits the problem profile, a list
of the appropriate metadata regions can be declared as Likable in the [Link] file:
OverTokenizedRegions=OTFileName,MyParts

This instructs the Index Engines to build a “shadow” region derived from the original
metadata region, but using a very different set of rules for interpreting the metadata
and building tokens. For example, the traditional indexed tokens for our sample part
number and file name values might be:
pn3004 widget 14jx27
salesforecast2013 europefrance rene gina1 doc

The tokens indexed in the shadow regions might be:


pn 3004 widget 14 jx 27
sales forecast salesforecast 2013 europe france
europefrance rene gina 1 doc

When a query using the like operator is processed, the query is also tokenized using
the alternate rules, and is tested against the shadow region instead of the original
region. In this case, the following queries would succeed that would typically fail using
normal human language tokenizing rules:
where [region "OTFileName"] like "gina 2013 sales forecast"
where [region "MyParts"] like "JX27 widget 3004"

If the like operator is requested for a region that does not support it, then the operator
is treated as an “AND” between the provided terms, and applied against the original
region instead of the shadow region.

Like Defaults
Since many users will not understand the requirement to specify the “like” operator
in a query, a configuration option is provided in the [Link] that allows the use of
Like as the default operator.
UseLikeForTheseRegions=OTName,OTFileName

If a query for a token or phrase is requested against one of these regions and there is
no explicit term operator provided, then Like will be assumed. This also works if the
region is in the list of default search regions. For example, the common Content Server
region OTName can be both a default search region and have the Like operator
applied by default. Note that Content Server can be configured to inject a default
operator (such as stem) into a query term, which would negate using like by default.
There is also a configuration setting that controls whether stemming should be used
when searching with Like queries. By default, this feature is active. If there is a term
component in a query that is 3 letters or longer, then either the singular or plural form
will match. To disable this feature and only match the entered values, in the [Dataflow_]
section of the [Link] file:
LikeUsesStemming=false
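As an illustration (hypothetical term), if OTName is in the list above and a query contains no explicit term operator, the two forms below would be processed identically:

where [region "OTName"] "SalesForecast2013"
where [region "OTName"] like "SalesForecast2013"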

Shadow Regions
The synthetic shadow regions built to support the Like operator have some properties
of interest. They are created when the Index Engines start based upon the [Link]
settings. This adds some time to startup, but allows the Like feature to be applied to
existing data sets without re-indexing. The shadow regions are saved on disk as part
of the index until removed from the list of over-tokenized regions, which also occurs on
Index Engine restart.
The shadow regions have the same names as their masters, appended by
_OTShadow. If the region OTName is configured as likable, then the synthetic region
OTName_OTShadow is created. These regions consume space in the index. Due to
the extra tokenization, the space requirements for shadow regions are higher than for
equivalent normal text regions.
The shadow regions will show up in lists of regions, and are also directly queryable or
retrievable. Although not the intended use, it is valid to perform other queries on these
regions.

Token Generation with Like


A key element of the Like behavior is the aggressive generation of tokens from a
metadata value. Unlike most other search operators which are algorithmically specific,
the Like behavior attempts to “think like a person”. The rules are more fluid, and are
subject to change over time as more useful cases are identified. Some of the current
tokenization rules include:
• Breaking tokens at transitions between letters and numbers. For example,
14red9 (14 red 9)
• Breaking tokens at punctuation. For example, [Link]-green (red blue
green).
• Breaking tokens at upper to lower case transitions, and also keeping the
conjoined token. For example, HIThat (hi that hit hat hithat)
• Breaking tokens at single character upper case transitions. For example,
MyHouse (my house)
• Breaking number strings at punctuation and retaining string. For example,
17,345 (17 345 17345).
• Removing leading zeros from number strings so that they match with or without
the zero. For example, 0078 ( 78 ).
• Converting URL encoded values to UTF8. For example go%2dfish (go-fish).
• Converting HTML encoded values to UTF8. For example my&nbsp;house (my
house).
• Identifying strings that appear to be URLs (may start with www or http) and
discarding parameters after a question mark.
• Truncating long strings to a maximum of about 256 characters.
Similar rules are applied to the query terms during a search, although usually only the
“best” interpretation of token splitting is used, instead of keeping the variations. In
some cases, alternatives will be used. For example, a search for BUCKShot would be
converted to a search for:
((bucks and hot) SOR (buck and shot) SOR (buckshot))
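A minimal Java sketch (not OpenText code) that approximates just two of these rules, breaking at letter/number transitions and at punctuation, to illustrate how the shadow tokens shown earlier could arise:

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of two of the Like tokenization rules. Case
// transitions, URL decoding and the other rules listed above are omitted.
public class LikeTokenSketch {

    public static List<String> tokens(String value) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character prev = null;
        for (char c : value.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                // Break when the character class flips between letter and digit.
                if (prev != null && Character.isDigit(c) != Character.isDigit(prev)) {
                    flush(out, current);
                }
                current.append(Character.toLowerCase(c));
                prev = c;
            } else {
                // Punctuation and whitespace terminate the current token.
                flush(out, current);
                prev = null;
            }
        }
        flush(out, current);
        return out;
    }

    private static void flush(List<String> out, StringBuilder sb) {
        if (sb.length() > 0) {
            out.add(sb.toString());
            sb.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Prints: [pn, 3004, widget, 14, jx, 27]
        System.out.println(tokens("PN3004/widget-14JX27"));
    }
}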

Limitations
Multi-value text regions have some limitations in behavior. The aggregate strings from
all the values are gathered together to create a single region that is tokenized for the
like operator. This means that there is no ability to combine the like operator with
specific values, such as might be expected when using attributes to represent
languages.
For example, a multi-value text region might have the English value “RedCar88” and
the French value “VoitureRouge88”. The like operator does not support examining
only one language. A search for “like RedVoiture” would match this object.
A second limitation is hit highlighting. Hit highlighting operations are not processed
using the like operation, which means a likely mismatch between tokens in the
original metadata value and the tokens matched in the shadow region during a query.
It is unclear what the correct operation should be, given the existence of the shadow
regions and one-to-many relationships between tokens and the original values. At this
time, hit highlighting ignores the like aspect of a query.
Most importantly, the like operator may generate many more search results than
expected. Due to the nature of part numbers and the behavior of the tokenization,
many small and common tokens can be generated. The like operator is biased
towards finding candidate search results, not towards filtering results to a most
probable match.
At this time there is no relevance adjustment based on the quality of the match in a
shadow region.

User Guidance
The description of the like operator so far may be good background into configuration
and applications, but does not provide much practical advice for an end user. The
normal warning that this guidance is not applicable in all situations applies.
Suggestions for a user trying to maximize success using a metadata region with the
like operator may include:
• Select fragments that appear to be logically distinct.
• Use spaces in place of punctuation.
• Do not enter a fragment of a longer numeric sequence as a search term.
• Do not enter a fragment of a text sequence as a search term.
• Do not use wild card operators.
• Using more terms or portions of the part number makes the query more precise.
An example using a fictitious part number string in a metadata region:

PN4556-WidgetRED01395b/v5.68.99 $2,867

The following queries would be successful:


4556 red
Widget 1395b
Pn v5.68.99
5 68 99

2867
867
On the other hand, these queries would fail:
idget [use widget]
2,86* [wildcards not permitted]
395 [fragment of 1395]

Email Domain Search


A common requirement in search discovery applications is finding email messages
sent to or from various companies. In the case of Content Server, the relevant
metadata regions contain lists of email addresses, possibly as multi-value addresses.
Searching for an email domain is not always reliable.
By way of illustration, assume the email region is OTSender. Perhaps various values
of OTSender include:
[Link]@[Link]
Ibm-rep@[Link]
[Link]@[Link]
[Link]@[Link]
other@[Link]

A search for [Link] might also find [Link]. A search expecting to find
smith in the domain might also find [Link]@acme. A search for the other-
acme domain might also find other@acme. In some cases, you could use exact
phrases to better constrain the queries, but this places a high knowledge burden on
the user. Beginning with SE10.5 Update 2014-09, capabilities exist to facilitate the
common email domain search case.
If the region OTSender is declared to be an email region, the Index Engine will
construct a new region named OTSender_OTDomain, and place the domain portions
of the email addresses in this new region. The original OTSender region remains
unaffected. The OTSender_OTDomain region can now be easily searched.
The email domain indexing process can handle multiple values for email addresses in
two ways. If there is a list of addresses in a single value, they will be split using some
simple pattern matching rules, typically comma or semicolon delimited. Multi-value
regions are also supported. In both cases, each distinct email domain will be
represented as a value in the _OTDomain region.
Where multiple identical email domain values exist for an object in the email region,
duplicates will be removed. This behavior is important given that many recipients of
an email message are often in the same organization or email domain.
In the [Link] file there are several configuration settings for tuning and enabling
email domain search capabilities. The main setting to enable or disable the feature is
a comma-separated list of text metadata regions that should be treated as email
regions. By default, this list is empty:
EmailDomainSourcesCSL=OTEmailSender,OTEmailRecipient

When you add or remove regions from the email domain list, the changes take effect
the next time the Index Engines are started. At startup, any new email domain regions
will be created and the values populated. This may add 10 or so minutes to the first
startup process. Likewise, if any regions were removed, they will be deleted from the
index at next startup.
Tuning of the behavior is possible with remaining configuration settings. By default,
_OTDomain is used as the suffix for the email domain regions, but this can be adjusted.
There is an upper limit on the number of distinct email domain values that will be
retained for a given email value, which defaults to 50. If you anticipate longer lists of
email domains, this value can be adjusted upwards. Finally, the separators used to
delimit an email domain can be defined. When indexing, a simple rule is used that text
after the @ symbol up to a separator character represents the email domain. The
separators are defined in the [Link] file, and default to comma, colon, semicolon,
various brackets and whitespace. The separator string must be compatible with a Java
regular expression.
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]

An example: if a multi-value email region for an indexed object has the values:
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>sue@[Link]</OTEmailSender>
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>sue@[Link]</OTEmailSender>

The OTEmailSender_OTDomain for that object will have effective values of:
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>

The same _OTDomain values would exist if a single value email region contains the
string:
<OTEmailSender>bob@[Link], bob@[Link][Robert]
sue@[Link];bob@[Link](“MightyBob”);
sue@[Link]</OTEmailSender>
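With the domain region in place, a discovery query for all mail sent from a given company (hypothetical domain shown) reduces to a simple region match:

where [region "OTEmailSender_OTDomain"] "acme.com"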

Text Operator - Similarity


The TEXT operator is designed to help locate objects given a large block of text. The
provided text is analyzed to generate a list of terms and phrases that are significant,
and the resulting list is used with either the TERMSET or STEMSET operator to
generate search results. This operator is ideal for similarity applications, typically
within classification and discovery.

Unlike other search operators, the user does not have direct control over the exact
behavior of the search query. A typical use case would be to copy a couple of
paragraphs from a document, and search using the TEXT operator to find documents
with similar information. The TEXT operator takes arbitrary text as the parameter,
excluding closing brackets and end of line characters.
To illustrate by example, perhaps the first few lines from Lewis Carroll’s “Alice in
Wonderland” are used:
text (Alice was beginning to get very tired of sitting by
her sister on the bank, and of having nothing to do: once
or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it,
‘and what is the use of a book,’ thought Alice ‘without
pictures or conversations?’ So she was considering in her
own mind (as well as she could, for the hot day made her
feel very sleepy and stupid, whether the pleasure of making
a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink
eyes ran close by her. There was nothing so very remarkable
in that; nor did Alice think it so very much out of the way
to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I
shall be late!’)

First observation: to be compatible with the text operator, the paragraph-end “CRLF”
characters and closing braces “)” in the source were removed.
The Text operator would then analyze the text, discarding short words and top words.
Statistical analysis would select notable words (and phrases). Although not in this text
example, overly long words or lists of numbers would be ignored. The resulting set of
8 to 15 terms would then be used internally with stemset, with an effective internal
query something like:
stemset(80%,alice,sister,book,”pictures or
conversations”,rabbit,considering,trouble,picking,pleasure,
sleepy,stupid,daisies)

This in turn would match all items that have 80% or more of those terms and phrases
in the full text of the object. In general, numbers are dropped from consideration in
TEXT queries. However, if the provided block of TEXT is relatively short (less than
about 250 characters), numbers will be included if necessary to meet the minimum
number of terms.
The TEXT operator has a number of configuration settings. See the Top Words section
below for more settings.
Performance degrades with more words used in stemset, while accuracy drops with
too few words. The upper limit on the number of terms and phrases to use is:
TextNumberOfWordsInSet=15

For accuracy, such as when trying to match exact documents, termset is the better
choice. Otherwise, stemset is used to find more objects by matching singular/plural
variations, but it runs slightly slower:
TextUseTermSet=true

The percentage of matches with termset and stemset can be adjusted. Low values
(e.g. 40%) find more objects with less similarity. Higher values, such as 80%, require
better matches with the source material:
TextPercentage=80

Top Words
The TEXT query operator is specifically designed to efficiently locate good quality
results when provided with large blocks of text. In this particular scenario, overly
common words are of little value, and need to be discarded. In OTSE, the Top Words
feature is used for this purpose.
Top Words are those words which are found within a large percentage of the
documents. For example, the OpenText corporate document management system has
the word OpenText in many documents, and hence it is eliminated from TEXT queries.
Top Words are determined based upon the percentage of objects containing a word.
For example, if more than 30% of objects contain the word ‘date’, then ‘date’ is added
to the Top Words list.
Top Words are computed independently for each search partition. Usually, more
partitions are added over a prolonged period. If the frequency of words changes over
time, then newer partitions will have slightly different Top Words than older partitions.
This also means that TEXT queries which eliminate Top Words might construct slightly
different queries on each partition.
The Top Words are first computed for a partition once it contains approximately 10,000
objects. On reaching 100,000 and 1,000,000 objects, the list is discarded and
recomputed. This helps to ensure that the Top Words properly reflect the contents of
the partition. The Top Words are stored in a file that is not human readable, and has
the name topwords.10000, with the number changing to reflect the size. If the
topwords.n file is missing, it will be generated during next startup or checkpoint write.
The threshold for selecting Top Words is a real number that should be between 0.01
and 0.99, representing the fraction of objects in the partition that contain the word. The
default value is 0.33 (33%); in some typical partitions larger than 1 million objects, we
found this generated a Top Words list of about 750 words. Larger fractions result in
fewer Top Words. In the [Dataflow_] section:
TextCutOff=0.33

If the Top Words features are not required, generation and use can be disabled by
setting:
TextAllowTopwordsBuild=false

Stop Words
Stop words are words which are considered too common to be relevant, or do not
convey any meaning, and are therefore stripped from search queries, or potentially not
even indexed. For English, a typical list of stop words would contain words such as:
a, about, above, after, again, against, all, am, an,
and, any, are, aren't, as, at, be, because, been,
before, being, below, between, both, but…
The potential advantage of stop words is a reduction in the size of the search index.
However, use of stop words introduces several limitations for search.
If stop words are applied at indexing time, certain types of queries become impossible.
A Shakespearean scholar could never find Hamlet’s soliloquy “to be or not to be”, since
all of those words are considered stop words, and would not be in the index.
Another reason to not apply stop words during indexing is the multi-lingual capability
of OTSE. The Spanish word “ante” is very common, so it should be a stop word, and
not indexed. However, in English, this is an uncommon word, so it clearly should be
indexed.
As a result, the search engine does not use stop words during indexing, nor are they
applied as a general rule during search queries. However, there is a closely related
capability known as Top Words that is used under special circumstances.

Advanced Feature Configuration


Occasionally, you may need to optimize the configuration settings for some very
complex parts of the search grid. This section provides some details about how these
parts of search work, what they do, and how you can adjust them to optimize search
for your application.

Accumulator
The Accumulator is an internal component of the Index Engines which is responsible
for gathering the tokens (or words) that are to be added to the full text search index. A
basic understanding of the Accumulator is useful when considering how to tune and
optimize an OTSE installation.
As objects are provided to the Index Engine, the full text objects are broken into words
using the Tokenizer, and added to the Accumulator. When the Accumulator is full, this
event triggers creation of a new full text search fragment. In a process known as
“dumping” the Accumulator, a fragment containing the objects stored within the
Accumulator is written to disk.
The transactional correctness of indexing is possible in part because of how the
Accumulator works. As objects are added to the accumulator, they are also written to
disk in the accumlog file. These files are monitored by the search engines to keep the
search index incrementally updated. When the Accumulator dumps, a new index
fragment is created, and the accumlog files are available for cleanup.
The size of the accumulator has an impact on system performance, and on the
maximum size of an object that can be indexed. A small Accumulator is forced to dump
frequently, which can reduce indexing performance. A large Accumulator consumes
more memory. The default size value for the Accumulator is 30 Mbytes (which is a
nominal allocation target – Java overhead results in the actual memory consumption
being higher), and can be set from within the Content Server search administration
pages, which sets the [Dataflow_] value in the [Link] file:
AccumulatorSizeInMBytes=30
If a single object is too large to fit within the Accumulator, it will be truncated –
discarding the excess text content. You cannot always predict whether an object will
exceed this size limit, since this is a measurement of internal memory use including
data structures, and not a measurement of the length of the strings being indexed.
The Accumulator will dump if it contains data and indexing has been idle. The idle time
before dumping is configurable:
DumpOnInactiveIntervalInMS=3600000
During indexing of an object, the accumulator also makes an assessment of the quality
of the data it is given to index. If the data is too “random” from a statistical perspective,
then the accumulator will reject it with a “BadObjectHeuristics” error. The randomness
configuration settings in the [Dataflow_] section are:
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0

MinDocSizeInTokens=16384
The heuristics are relatively lax, and essentially designed to protect the index
from situations where random data or binary data was provided. It is rare that these
values need to be adjusted, and some experimentation will be needed to find values that
meet special needs. There is a minimum object size (MinDocSizeInTokens, 16,384 by
default) below which these heuristics are not applied, since small objects would
otherwise fail the uniqueness requirement.
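As a rough illustration (assuming the ratio is simply unique tokens divided by total tokens), an object that is large enough for the heuristics to apply and contains 20,000 tokens of which 12,000 are unique would have a uniqueness ratio of 12,000 / 20,000 = 0.6, exceeding the Heuristic2 limit of 0.5 and failing the randomness check.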
There is one situation where this safety feature is known to occasionally discard good
objects. If a spreadsheet is indexed that contains lists of names, numbers and
addresses, the uniqueness of the tokens may be very high, and it may be rejected as
random.
A related configuration setting is an upper limit on the size of a single object. Objects
are truncated at this limit, meaning that only the first part of the object is indexed. Note
that this size limit is applied to the text given to the Index Engine, not the size of an
original document file. For example, a 15 MB Microsoft PowerPoint file might only
have a filtered size of 100 Kbytes. Conversely, an archive file (ZIP file) with a size of
1 MB might expand to more than 10 MB after filtering.
ContentTruncSizeInMBytes=10
From an indexing perspective, 10 Mbytes is a lot of information. For English language
documents, this would normally be more than 1 million words. By way of comparison,
this entire document in UTF8 form is well under 1 MByte.

Accumulator Chunking
Starting with Search Engine 10 Update 7, the Accumulator also has the ability to limit
the amount of memory consumed by “chunking” data during the indexing process.
Essentially, if the size of the accumulator exceeds a certain threshold, the input is
broken into smaller pieces, or chunks. Each chunk is separately prepared and written
to disk. When all the chunks are completed, a “merge” operation combines the chunks
into the index.
Chunking is a very disk-intensive process. When chunking occurs, there is a
noticeable impact on the indexing performance. Fortunately, chunking is only required
when indexing very large objects. Using the default settings, we noted while indexing
our own typical “document management” data set that chunking occurs with hundreds
of documents per million indexed, and showed an overall indexing performance hit of
about 15% in a development environment. If indexing performance must be optimized,
you can disable chunking or even reduce the Content Truncation size described above
to a small value (perhaps 1 MByte) such that chunking may never happen.
There are configuration settings in the [DataFlow_] section of the [Link] file for
tuning the chunking process. The number of bytes in an object before chunking will
occur has a default of 5 MBytes. The feature can be disabled with a large value, say
100,000,000.
AccumulatorBigDocumentThresholdInBytes=5000000

An additional amount of memory for related data such as the dictionary is reserved
as working space, expressed as a percentage of the Accumulator size (typically 30
Mbytes), with a default of 10 percent.
AccumulatorBigDocumentOverhead=10

As a result of this change, it is no longer possible to search within XML regions
in the body of text for large XML objects where chunking occurs. Chunking can be
disabled for XML documents by setting CompleteXML to true, but this will negate
the memory savings from chunking. The default is:
CompleteXML=false

Reverse Dictionary
The search engine maintains dictionaries of words in the index. The dictionary is
sorted to be efficient for matching words, and for matching portions of words where the
beginning of the word is known (right-truncation, such as washin*). However, for
matching terms that start with wildcards (left-truncation), the dictionary is not optimal.
The search engine can optionally store a second dictionary, known as the Reverse
Dictionary. This is a dictionary of each term spelled backwards. For instance, the term
“reverse” is stored as “esrever”. This Reverse Dictionary allows high performance
matching of terms that begin with a wildcard, and for certain types of regular
expressions that are right anchored (ending with a $).
There is an indexing performance penalty associated with building and maintaining the
Reverse Dictionary. The penalty varies due to many factors, but has been observed
to be over 10%. There is additional disk space required, typically about 1 GB for a
partition with 10 million objects. As far as memory is concerned, another Accumulator
instance is used which consumes about 30 MB of RAM in the default configuration,
and space of about 15 MB is required for term sorting. The Reverse Dictionary is
enabled with a setting in the [Dataflow] section of the [Link] file:
ReverseDictionary=true

By default, the Reverse Dictionary is disabled (false) to maintain backwards
compatibility – this feature was added in version 16.2.5 (June 2018). Existing indexes
will create (or destroy) the Reverse Dictionary during startup after this setting is changed.
The conversion is performed by the Index Engine, and the Search Engine will then
need to be restarted to apply the change during queries. Conversion to create the
Reverse Dictionary is relatively expensive, perhaps 30 minutes per partition. Data
does not need to be re-indexed.
Once enabled, the Reverse Dictionary also reserves memory for sorting results that
match the reverse dictionary during search queries. The default configuration allocates
space to sort up to about 100,000 terms per partition. If this number is exceeded,
performance is impacted. The value can be increased at a cost of about 15 MB per
100000 terms, with the [Dataflow] setting:
ReverseDictionaryScanningBufferWordEntries=100000

The Reverse Dictionary works with full text content and text metadata stored in “Low
Memory” mode. Older storage modes are not supported. The Reverse Dictionary is
not used with regions that are over-tokenized or configured for exact substring
matching.
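For example (hypothetical term), a left-truncated wildcard query such as *ington, or a right-anchored regular expression ending in $, would be resolved using the Reverse Dictionary: the term is reversed to notgni*, which becomes a right-truncation that the reversed dictionary can match efficiently.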

Transaction Logs
In the event that an index or partition is corrupted or destroyed, OTSE provides
Transaction Logs to help rebuild and recover indexes with the least amount of re-
indexing. Transaction Logs are generated by the Index Engines with a minimal record
of the indexing operations that have been applied. A fragment of a Transaction Log
looks like this:
2018-03-15T[Link]Z, replace - content, DataId=1009174&Version=1
2018-03-15T[Link]Z, add, DataId=1036021&Version=1
2018-03-15T[Link]Z, delete, DataId=1015932&Version=1
2018-03-15T[Link]Z, add, DataId=1036022&Version=1
2018-03-15T[Link]Z, add, DataId=1036023&Version=1
2018-03-15T[Link]Z, Start writing new checkpoint
2018-03-15T[Link]Z, Finish writing new checkpoint
2018-03-15T[Link]Z, add, DataId=834715&Version=1

If an index is corrupted, it can be restored from the most recent backup. The
Transaction Log can then be used to determine which Content Server objects should
be re-indexed or deleted to bring the backup copy of the index up to date, based on
the date/time of the operations since the date of the backup.
The transaction logs are set up to rotate 4 logs of size 100 MB each, which should
typically be able to record more than 50 million operations for a partition. At this time,
these values are not adjustable. In a typical system with regular backups, this should
be more than enough to recover all transactions. If your backups are less frequent,
you may wish to copy these logs on a regular basis.
Multiple copies of the Transaction Logs can be written. The idea here is that these
logs must survive a disk crash to be useful for recovery. If you are concerned about
system recovery, consider recording the Transaction Logs on two different physical
disks. In the [IndexEngine_] section of the [Link] file:
TransactionLogFile=c:\logs\p1\[Link],
f:\logs\[Link]
TransactionLogRequired=false

In this example, logs are written to two locations. By default, the list is empty, which
disables writing the Transaction logs. The Index Engine will append text to the
provided file name to differentiate between the rotating logs. A second setting dictates
whether a failure to write Transaction Logs should be considered a transaction failure,
or should be accepted and allow indexing to continue. By default, this is false –
meaning the Transaction Logs are “nice to have”.
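As a sketch of how these logs might be consumed during recovery (this is not a shipped tool; the file name and cutoff value are hypothetical, and ISO-8601 timestamps are assumed so that string comparison matches chronological order), a few lines of Java can collect the object identifiers touched since the last backup:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical recovery helper: lists the objects changed after a backup cutoff.
public class TransactionLogScan {
    public static void main(String[] args) throws IOException {
        String cutoff = "2018-03-15T00:00:00Z";      // date/time of the last backup
        Set<String> toReindex = new LinkedHashSet<>();
        Set<String> toDelete = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Paths.get("transactions.log"))) {
            // Expected shape: timestamp, operation, object identifier.
            String[] parts = line.split(",\\s*", 3);
            if (parts.length < 3 || parts[0].compareTo(cutoff) < 0) continue;
            if (parts[1].startsWith("delete")) {
                toDelete.add(parts[2]);
            } else {
                toReindex.add(parts[2]);             // add / replace variants
            }
        }
        // Note: an object added and later deleted will appear in both sets;
        // a real tool would reconcile the two lists.
        System.out.println("Re-index: " + toReindex);
        System.out.println("Delete:   " + toDelete);
    }
}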

Protection
Because Content Server is relatively open and allows many types of applications to be
built on top of it, the search system can be exposed to unexpected data and
applications. This section touches on some of the configurable protection features of
OTSE.

Text Metadata Size


Text metadata regions are optimized for relatively small and important bits of
information. We have seen situations where customers attempt to place megabyte
text values in a text field. While this works, it consumes significant memory and CPU
to process. There is a default maximum size of 256 Kbytes for text in a single region
for an object. In the [Dataflow_] section, MetadataValueSizeLimitInKBytes controls this
size, and any regions listed in MultiValueLimitExclusionCSL are exempt.
If this limit is exceeded, the metadata is truncated to the maximum length, and the
string OTIndexLengthOverflow is added to the end so that you can search for these
conditions, and the OTIndexError count is incremented.

Text Metadata Values


Text metadata regions support multiple values. There is a default limit to the number
of values that can be accepted. This is especially important since processing multi-
value text regions consumes considerable stack space. The default is 200 values, as
defined in the [Link] file by the MultiValueLimitDefault setting. Regions listed in
MultiValueLimitExclusionCSL are exempt, which by default are regions used by
Content Server email management:
OTEmailToAddress
OTEmailToFullName
OTEmailBCCAddress
OTEmailBCCFullName
OTEmailCCAddress
OTEmailCCFullName
OTEmailRecipientAddress
OTEmailRecipientFullName
OTEmailSenderAddress
OTEmailSenderFullName
If this limit is exceeded, additional metadata values are discarded and an additional
metadata value of <>OTIndexMultiValueOverflow</> is added to make this
condition searchable, and the OTIndexError count is incremented.

Incorrect Indexing of Thumbnail Commands


An issue was detected whereby objects that had no content were being given portions
of the input IPool from DCS that contained requests to generate thumbnails. This issue
was corrected in the 16.2.4 update. However, attempts to re-balance objects affected
by this problem will fail – no full text will be provided on re-index, so the Index Engine
will see it as a partial update and not permit a rebalance of the object. The full text
checksum for the affected objects is always 485363284. There is a configuration
setting to allow objects with this checksum to be treated as if they have no text:
EnableWeakContentCheck=true

Cleanup Thread
As the Index Engines update the index, they create new files and folders. The Search
Engines read these files to update their view of the index. Left alone, these files will
eventually fill the disk. The Cleanup Thread is the component of the Index Engine that
runs on a schedule to analyze the usage of the files, and delete those which are no
longer necessary.
A Cleanup Thread only examines and deletes files for a single partition; each Index
Engine therefore schedules a Cleanup Thread. The Cleanup Thread will delete
unused configuration files, as well as unused files listed in the configuration files, such
as accumlog, metalog, checkpoint and subindex fragment files. Search Engines keep
file handles open for config files currently in use, and this is the primary mechanism
used by the Cleanup Thread to determine if files can be deleted.
There is no specific process to monitor for the Cleanup Thread; it is part of the Index
Engine process. By default, the Cleanup Thread is scheduled to run every 10 minutes.
You can adjust the interval in the [Link] file [Dataflow_] section:
FileCleanupIntervalInMS=600000
The Cleanup Thread also has a secure delete capability, disabled by default.
SecureDelete=false

When set to true, the Cleanup Thread will perform multiple overwrites of files with
patterns and random information before deleting them, making them unreadable by
most disk forensic tools. This also makes the file delete process considerably slower,
and uses significant I/O bandwidth. Some additional notes on this feature:
• The US Government has updated their guidelines to require physical
destruction of disk drives for highest security situations.
• Overwriting files is ineffective with journaling file systems.
• The algorithm is designed for use with magnetic media, and may not provide
any additional security with Solid State Disks.
• Optimizations by Storage Array Network storage systems may defeat this
feature.
The Cleanup Thread code has been enhanced starting with Search Engine 10 Update
4 to delete unused fragments more aggressively. If for some reason you require the
previous behavior, it can be requested in the search INI file by setting
SubIndexCleanupMode=0. The default value is 1.

Merge Thread
The Merge Thread is a component of the Index Engine that consolidates full text index
fragments. As the Index Engines add or modify the index, they do not change the
existing files. Instead, they append new files, referred to as the “tail” fragments. The
Search Engines must search against all of the files that comprise the full text index.
As the number of files containing index fragments grows, the performance of search
queries deteriorates. The purpose of the Merge Thread is to combine fragments to
create fewer files that the Search Engines need to use, ensuring that query
performance remains high. Merging also reduces the overall size of the index on disk,
since deleted objects are simply “marked” as deleted in the tail fragments, and modified
objects will have multiple representations until they are merged.
The Merge Thread will create new full-text index fragment files and then communicate
with the Search Engine using the Control File regarding which files now comprise the
index. Once the Search Engine changes (locks the new files), the Cleanup Thread will
delete the older index files.
Merging is a disk-intensive process. The Merge Thread therefore tries to maintain a
balance between how frequently merges occur and how many index fragments exist.
In a typical index, there are frequent merges taking place within the tail index
fragments, which tend to be small and can be merged quickly. Eventually, older and
larger fragments must also be merged.
An optimal target for the number of fragments an index should have is about 5. In
practice, the number of smaller fragments can grow quite large depending upon the
characteristics of the index. As a safeguard, there is a configuration setting that places
an upper limit on the number of fragments that are permitted for a partition index, and
this will force merges to occur. Too many fragments can seriously affect query
performance due to the level of disk activity in a query and the number of file handles
needed.

[Figure: target size distribution of index fragments. Larger, older fragments change less frequently; the small tail fragments change most often.]
The Merge Thread configuration settings are located in the [Dataflow_] section of the
[Link] file:
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
CompactEveryNDays=30
NeighbouringIndexRatio=3
“Want Merges” would normally only be changed for debugging purposes. In most
installations, these settings do not need to be modified. One setting of note is the
Compact Every N Days value, which instructs the Merge Thread to make a more
aggressive attempt to merge indexes over the long term. This setting helps to merge
older index fragments which are relatively stable, and would otherwise not be
scheduled for compaction.
Merge Tokens
Merging fragments temporarily requires additional disk space, nominally the size of all
the fragments being merged. If the temporary disk space needed causes the partition
to exceed the configured maximum size of the partition, then the merge will fail. One
way to address this is to increase the configured allowable disk space. However,
increasing the disk space for every partition can be a costly approach to solving the
problem.
The better approach is to enable Merge Tokens. Merge Tokens are managed by the
Update Distributor, and can be granted on an as-needed basis to Index Engines that
do not have sufficient space to perform merges. If given a Merge Token, the Index
Engine will proceed to perform a merge even if this exceeds the configured maximum
disk space. If the largest index fragments are 20 GB, then 100 GB of temporary space
would suffice for 4 or 5 Merge Tokens. Relatively few Merge Tokens are needed. 3
tokens would likely suffice for 10 partitions, perhaps 10 tokens for 100 partitions.
The Merge Token capability was first added in Update 2015-03, and the default setting
is disabled for backwards compatibility. In the [UpdateDistributor_] section of the
[Link] file:
NumOfMergeTokens=0
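For example, a hypothetical grid with 10 partitions could enable a small pool of tokens by setting:

NumOfMergeTokens=3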
Too Many Sub-Indexes
Although OTSE has a typical target of merging down to 5 or so index fragments, there
are situations when this may not be possible. There is a maximum number of allowable
index fragments (or sub-indexes), which by default is 512. There have been scenarios,
usually due to odd disk file locking, where this limit has been reached or exceeded. In
this case, a Java exception will occur, logging a message along these lines:
MergeThr[Link]xception:Exception in
MergeThread:[Link]; 512

To recover from this, you can edit the [Dataflow_] section of the [Link] file to
increase the number of allowable sub-indexes (perhaps 600), and restart the affected
engines. Once recovered, the lower number should be restored, since running with
larger values has a potential negative performance impact.
MaximumSubIndexArraySize=512

Tokenizer
The Tokenizer is the module within OTSE that breaks the input data into tokens. A
token is the basic element that is indexed and can be searched. The Tokenization
process is applied to both the input data to be indexed, and the search query terms to
be searched.
There is a default standard Tokenizer (Tokenizer1) built into OTSE that applies to both
the full text and all search regions. The system supports adding new tokenizers that
can be applied to specific metadata regions. In addition, Tokenizer1 can be replaced
and customized, or can be used with a number of configuration options. Everything
that follows until the section entitled “Metadata Tokenizers” describes the use of the
default Tokenizer1.

Language Support
OTSE is based upon the Unicode character set, specifically using the UTF-8 encoding
method. This means that all indexing and query features can handle text from most
languages. If there are limitations in supported character sets, any necessary changes
would take place within the Tokenizer.

Case Sensitivity
By design, OTSE is not case sensitive. Text presented for indexing or terms provided
in a query are passed through the Tokenizer, which performs mapping to lower case.
This design decision provides a slight loss of potential feature capability in full text
search, but improves performance and reduces index size dramatically. Note that text
metadata values are stored in their original form, including accents and case, so that
retrieval of metadata has no accuracy loss. The mapping to lower case is not applied
to other aspects of the index, such as region names, which ARE case sensitive.

Standard Tokenizer Behavior


When dealing with English words, the Tokenizer has a simple task. Consider the input
sentence “how are you?” The Tokenizer will create 3 tokens:
how
are
you
To improve the find-ability of terms, the Tokenizer is used to normalize input data –
convert it to basic equivalent forms. This includes converting capitals to lower case,
removing accents, or converting certain characters to their common basic forms. So
the input string of “The café Boiñgo” generates the tokens:
the
cafe
boingo
The next set of problems the Tokenizer handles relates to white space and punctuation.
White space is the set of characters used to represent breaks between tokens. The
‘space’ character is one of these, but there are many more. Punctuation is generally
meaningless from a search perspective, so unless the punctuation is contained within
some specific patterns, the Tokenizer will normally ignore punctuation characters. In
practice, the Tokenizer works by searching for valid character patterns, rather than by
discarding whitespace and punctuation characters.
The Tokenizer handles a number of special cases. Consider the token “[Link]”.
Most likely, this should be treated as two words. However, the text “14.5587” is clearly
a number. The Tokenizer recognizes patterns in text and identifies certain special
cases. Where a number is concerned, the period will be kept and the text will index as
a single token. The regular expression matching handles this.
Numbers are the easy case. Languages such as English allow words to be broken
with hyphens, particularly at line breaks. Consider the text “con-tract”. Is this two
tokens, or one? Should the hyphen be removed or replaced with a space? What if the
string is “con- tract”, with a space after the hyphen? Again, using appropriate regular
expressions will determine whether this is one or two words.
Several Asian character sets also need special handling. Written languages such as
Japanese, Chinese and Korean do not use the same concept of word breaks that are
common in European languages. For these character sets, the Tokenizer will instead
create overlapping sets of “bigrams” – pairs of adjacent characters.
Finally, the Tokenizer can be used to identify special forms of strings, and keep them
intact. A common case is part numbers. If your business commonly uses part numbers
of the form “1145\hgbuut-4478”, then the Tokenizer can be enhanced to recognize this
as a special case, and keep a string in this form intact as a single token instead of
breaking it into 3 separate tokens.

Customizing the Tokenizer


Warning: changing the Tokenizer for an existing index can cause unexpected results.
During fragment merges and accumulator dump activities, the Index Engine verifies
that the tokens have not changed. If the new Tokenizer causes existing words to be
tokenized differently, those words will be dropped from the Index and the event
recorded in the log files.
If you have special indexing and search requirements, you can create a custom
Tokenizer file. When you provide a new Tokenizer file, it completely replaces the
internal Tokenizer. The Tokenizer file is read when the OTSE components are started,
and used to build an optimal finite state machine for parsing strings. This optimization
means that a custom Tokenizer will not have a material impact on the tokenizing
performance.
The location of the Tokenizer file may be specified in the [Link] file, and allows a
unique Tokenizer to be used with each dataflow. This is normally not recommended.
[DataFlow_foo03278X2099X12621X11463]
RegExTokenizerFile=tokenizerFileName
The default location for a custom Tokenizer is the “config” folder for the search engine,
with the file name [Link]. If placed here, then the same Tokenizer is
used for all dataflows.
A restart of the engines is required after a change to the Tokenizer. Depending on the
change, reindexing of some or all of the content may be desired.
Creating a new Tokenizer is not trivial, and errors in the Tokenizer will require you to
re-index data to correct it. OpenText can provide services to help you customize your
Tokenizer.
Step one in customizing the Tokenizer is to obtain a copy of the default Tokenizer as a
starting point for reference. You can obtain this file from OpenText customer support
– it is built in to OTSE and not shipped with the product.

Tokenizer File Syntax


The basic layout of the tokenizer file is:
#
# comment lines start with the number sign
#
[comm|nocomm]
mappings {
map_specifications
}
ranges {
range_specifications
}
words{
word_specifications
}

The comm|nocomm line is optional, and not recommended. This controls whether text
that meets the criteria for SGML or XML style comments should be retained or
discarded. The default value is nocomm (do not index comments). This line is
equivalent to setting the standard Tokenizer options in the [Link] file with a value
of TokenizerOptions=2.

Tokenizer Character Mapping


The mappings section is used to map UTF8 characters from one value to another. For
instance, an upper case A to lower case a, or accented characters to non-accented
characters. The mappings section does not completely replace the default character
mapping; it supplements or replaces the specific mappings defined in the section.
However, providing any character mappings will require a complete tokenizer file to be
specified, including range and words sections. In the event that no mapping exists for
a character, the value is passed unchanged.
Mapping of characters takes place AFTER the tokenization has occurred.
The simple example below would be used to convert:
• upper case A (hexadecimal 41) to lower case a (hexadecimal 61)
• ntilde (ñ – hexadecimal f1) to lower case n (hexadecimal 6e)
• character Æ (hexadecimal c6) to the two letters ae
• and drop the Unicode diacritical mark ` (combining grave accent, hexadecimal 300)
mappings {
0x41=0x61 0xf1=0x6e
0xc6=0x00650061 0x300=0x00
}

NOTE the special case for mapping one character to two characters: to use this
feature, you must express the two characters as a single 32-bit value, with leading
zeros, with the second character first.

Using the null character as the “to” value in a mapping is a special case. Null
characters are skipped during a subsequent Indexing step, so mapping a character to
0x00 will effectively drop it from the string. This may be useful for removing standalone
diacritical marks or punctuation such as the single quote mark from the word
“shouldn’t”.
The following table illustrates the default character mappings for many of the European
languages.

From                                    To
A-Z                                     a-z
À Á Â Ã Å à á ã å Ā ā Ă ă Ą ą           a
Ä Æ ä æ                                 ae
Ç ç Ć ć Ĉ ĉ Ċ ċ Č č                     c
Ď ď Đ đ                                 d
È É Ê Ë è é ê ë Ē                       e
Ì Í Î Ï ì í î ï                         i
Ð ð                                     ð
Ñ ñ                                     n
Ò Ó Ô Õ Ø ò ó ô õ ø                     o
Ö ö                                     oe
Ú Û ù ú û                               u
Ü ü                                     ue
Ý ý ÿ                                   y
Þ                                       Þ (Large Thorn)
þ                                       Þ (small Thorn)
ß                                       ss
Note: prior to Update 2014-12, upper and lower case Ø characters were mapped to a zero.

Latin Extended-A Character Set Mapping


The Latin Extended-A code page, also known as Unicode Code Page 1, has all
characters mapped to their nearest equivalent single ASCII character equivalents,
with the following exceptions:
• The upper and lower case IJ ligatures are mapped to the two letters I J.
• Upper and lower case Letter L with Middle Dot are preserved (Ŀ and ŀ).
• Upper and lower case Œ ligatures are converted to oe.
• Accented W and Y characters are preserved (Ŵ ŵ Ŷ ŷ Ÿ).
• The ſ character (small letter “long s”) is preserved.

Arabic Characters
There are special cases implemented for tokenization of Arabic character sets, which
improves the findability of Arabic words.
Step 1 is character mapping. The character mapping is extended to handle cases in
which multiple characters must be mapped as a group. These mappings are:

Mapped to 0627 ARABIC LETTER ALEF:
    0622 ARABIC LETTER ALEF WITH MADDA ABOVE
    0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
    0625 ARABIC LETTER ALEF WITH HAMZA BELOW
    0675 ARABIC LETTER HIGH HAMZA ALEF
Mapped to 0647 ARABIC LETTER HEH:
    0629 ARABIC LETTER TEH MARBUTA
Mapped to 064A ARABIC LETTER YEH:
    0649 ARABIC LETTER ALEF MAKSURA
    0626 ARABIC LETTER YEH WITH HAMZA ABOVE
    0678 ARABIC LETTER HIGH HAMZA YEH
Mapped to 0648 ARABIC LETTER WAW:
    0624 ARABIC LETTER WAW WITH HAMZA ABOVE
    0676 ARABIC LETTER HIGH HAMZA WAW
Mapped to 06C1 ARABIC LETTER HEH GOAL:
    06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
Mapped to 06C7 ARABIC LETTER U:
    0677 ARABIC LETTER U WITH HAMZA ABOVE
Mapped to 06D2 ARABIC LETTER YEH BARREE:
    06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
Mapped to 06D5 ARABIC LETTER AE:
    06C0 ARABIC LETTER HEH WITH YEH ABOVE

In addition, several hundred Presentation Form characters are mapped to their
equivalent non-Presentation Forms.

Step 2 is removal of the following Unicode diacritical marks:
0640 ARABIC TATWEEL
064B ARABIC FATHATAN
064C ARABIC DAMMATAN
064D ARABIC KASRATAN
064E ARABIC FATHA
064F ARABIC DAMMA
0650 ARABIC KASRA
0651 ARABIC SHADDA
0652 ARABIC SUKUN
0653 ARABIC MADDAH ABOVE
0654 ARABIC HAMZA ABOVE
0655 ARABIC HAMZA BELOW
0670 ARABIC LETTER SUPERSCRIPT ALEF
0674 ARABIC LETTER HIGH HAMZA

Step 3 is removal of WAW and ALEF-LAM prefixes, only if doing so leaves at least 2
characters remaining.
The final step is removal of HEH-ALEF and YEH-HEH suffixes, again only if at least 2
characters will remain in the token.

Note that Arabic tokenization was improved significantly starting with Update 2014-12.

Complete List of Character Mappings


For completeness, a table of all the character mappings performed by OTSE is
included in the Configuration Files section later in this document.

Tokenizer Ranges
Ranges define the primitive building blocks of characters, organizing them in logical
groups. Each range specification is comprised of Unicode characters and character
ranges, expressed in hexadecimal notation. For example, a range for the simple
numeric characters 0 through 9 would be:
number 0x30-0x39
In practice, there are multiple Unicode code points where numbers could be
represented, so a richer definition of a number might need to include Arabic numerals
(0x660-0x669), Devanagari numerals (0x966-0x96f) and similar representations from
other languages. You would probably also want to use the character mapping feature
to convert these all to the ASCII equivalents:
number 0x30-0x39 0x660-0x669 0x966-0x96f

Tokenizer Regular Expressions


The words section describes how word tokens are built from the range values. Each
definition is on a separate line, and is a regular expression using one or more ranges.
There should be no line breaks within a word specification. If the text matches the
regular expression, it is accepted as a token. A simple example:
currency?dash?number+(nseparators+number+)*
This regular expression is based upon the ranges currency, dash, number and
nseparators. Specifically the regular expression above indicates that the text is a word
if it meets these criteria:

May or may not start with currency – currency would be a list of symbols such as
$ ¥ £ or €.
May or may not start with a dash after the optional currency sign.

The Information Company™ 151


Understanding Search Engine 21

Has one or more numbers (0-9) following optional dash and currency.
Has zero or more sets of nseparators (, and .) and numbers following the first
number.
In general, the regular expressions are greedy – matching the longest possible string.
The following operations on ranges are supported, and are applied following the range:
?   Zero or one instance of the range
*   Zero or more instances of the range
+   One or more instances of the range
( ) Use brackets to clarify order of operations and create groups
|   Vertical bar is used as an “OR” operator between ranges
-   Token matching this pattern is not valid; advance start pointer one
character and continue

The Tokenizer begins at a specific character, and attempts to find the longest valid
regular expression match. Once found, it takes the matching value as a word,
advances to the character following the match, and repeats. If no match is found, it
advances one character and repeats.
In general, regular expressions that you construct should be relatively lax. In the
currency example above, for instance, we do not enforce 3 digits between commas.
Erring on the side of indexing information rather than rejecting it is a good guideline.
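As an unverified sketch, the part number case mentioned earlier (values of the form 1145\hgbuut-4478) could be kept intact as one token with a hypothetical backslash range and an additional word rule, reusing the alpha and number ranges from the sample tokenizer shown later in this section:

ranges {
    backslash 0x5c
    dash 0x2d
}
words {
    number+ backslash alpha+ dash number+
}

Any rule of this kind should be verified with the tokenizer testing procedure described below before re-indexing.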

East Asian Characters


Languages such as Chinese, Japanese and Korean do not generally split into tokens
like European languages do. There is a special mechanism available to group these
characters into pairs, called bi-grams.
By way of example, the string of 3 characters 信用卡 would be indexed as two
overlapping sets of bi-grams: 信用 and 用卡. This approach improves search quality
for these character sets, although the resulting index is somewhat larger than an index
built by treating each character as a unique token.
In the Tokenizer, the characters indexed in this way are expressed in a range, as
follows:
ranges{
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-
0xfa6a 0xfa70-0xfad9
0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-
0xff9f
}
In the words section, there is a reserved keyword that implements this bi-gram behavior
for the matched regular expression, _NGRAM2:
words{
_NGRAM2 gram2+

}
Bigram indexing is the default behavior for these languages. Older versions of the
Search Engine indexed each East Asian character as a separate token. There is a
configuration setting in the [Link] file (TokenizerOptions bit 128, described below) that can force use of the older method. This
may be useful if you have an older index that predates OTSE with significant East
Asian character content that you do not wish to re-index.

Tokenizer Options
If you are using the standard Tokenizer, the following options are available in
the [Dataflow_xxx] section of the [Link] file:
TokenizerOptions=128
The default value is 0 (no options set). The options are a bit field, and can be added
together to combine values. The bit field values are:

1 : a dash character “-” is counted as a standard character for words. The string
“red-bananas-26” would be indexed as a single token, instead of as the 3
consecutive tokens “red”, “bananas”, “26”.
2 : XML comments are indexed. By default, strings which fit the pattern for an
XML comment are stripped from the input. XML comments have the form
<!--any text in comment-->
4 : treat underscore characters “_” as separators. This would cause input such
as “My_house” to be indexed as two tokens, “my” and “house”. The default
would preserve this as a single token.
8 : special case handling to look for software version numbers of the form v2.0
and treat them as a single token.
16: treat the “at symbol” @ as a character in a word.
32: treat the Euro symbol as a character in a word.
128 : used to request the “older” method of indexing East Asian character strings
with each character as a separate token. The default indexes these strings as 2-
character “bi-grams”.
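For example, to keep dashes inside words (option 1) and enable version number
handling (option 8), add the bit values together. A sketch, using the section naming
shown above:

[Dataflow_xxx]
TokenizerOptions=9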

Testing Tokenizer Changes


This is an unsupported component of the [Link] file. If it does not work as
expected, OpenText has no obligation to correct it. This capability is documented
simply for convenience in the event that it may be useful if you are debugging Tokenizer
behavior.
There is a Java class function within the [Link] that can accept a tokenizer
file and a test file, and display the tokens that would be generated. An example in a
Linux environment:
cd ~
cp=$[Link]
java -cp $cp [Link] $@
testtok -inifile [Link] inputfile

testtok is the class for the test.


-inifile identifies that the tokenizer filename follows, in this case [Link].

inputfile is the name of the file containing the data you wish to tokenize.
If inputfile contains “THIS is a TEßT”, the output would be of the form:
|THIS|this
|is|is
|a|a
|TEßT|tesst
where the first value on each line is the token accepted by the regular expression
parser, and the second value is the result after the character mappings are applied.

Sample Tokenizer
The following sample tokenizer file is similar to the default implementation. Indented
lines have been wrapped to fit the available space. In practice, lines should not be
broken.

ranges {
alpha 0x30-0x39 0x41-0x5a 0x5f 0x61-0x7a 0xc0-0xd6
0xd8-0xf6 0xf8-0x131 0x134-0x13e 0x141-0x148
0x14a-0x173 0x179-0x17e 0x384-0x386 0x388-0x38a
0x38c 0x38e-0x3a1 0x3a3-0x3ce 0x400-0x45f 0x5d0-0x5ea
0xFF10-0xFF19 0xFF21-0xFF3a 0xFF41-0xFF5a
number 0x30-0x39
numin 0x2c-0x2e
currency 0x24 0xfdfc
numstart 0x2d
alphain 0x5f
tagstart 0x3c
colon 0x3a
tagend 0x3e
slash 0x2f
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-0xfa6a
0xfa70-0xfad9 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-0xff9f
arabic 0x621-0x63a 0x640-0x655 0x660-0x669 0x670-0x6d3
0x6f0-0x6f9 0x6fa-0x6fc 0xFB50-0xFD3D 0xFD50-0xFDFB
0xFE70-0xFEFC 0x6d5 0x66e 0x66f 0x6e5 0x6e6 0x6ee 0x6ef
0x6ff 0xFDFD
indic 0x900-0x939 0x93C-0x94E 0x950-0x955 0x958-0x972
0x979-0x97F 0xA8E0-0xA8FB 0xC01-0xC03 0xC05-0xC0C
0xC0E-0xC10 0xC12-0xC28 0xC2A-0xC33 0xC35-0xC39
0xC3D-0xC44 0xC46-0xC48 0xC4A-0xC4D 0xC55 0xC56
0xC58 0xC59 0xC60-0xC63 0xC66-0xC6F 0xC78-0xC7F
0xB82 0xB83 0xB85-0xB8A 0xB8E-0xB90 0xB92-0xB95
0xB99 0xB9A 0xB9C 0xB9E 0xB9F 0xBA3 0xBA4
0xBA8-0xBAA 0xBAE-0xBB9 0xBBE-0xBC2 0xBC6-0xBC8
0xBCA-0xBCD 0xBD0 0xBD7 0xBE6-0xBFA
}
words {
alpha+(alphain+alpha+)*
currency?numstart?number+(numin+number+)*
arabic+
onechar
indic+
_NGRAM2 gram2+
tagstart ( alpha+ (alphain+alpha+)* | arabic+ | onechar
|indic+ | gram2)+ (colon- ( alpha+ (alphain+alpha+)*
| arabic+ | onechar |indic+| gram2)+)? (slash tagend)?
tagstart slash ( alpha+ (alphain+alpha+)* | arabic+
| onechar |indic+ | gram2)+ (colon- ( alpha+
(alphain+alpha+)* | arabic+ | onechar|indic+
| gram2)+)? tagend
}

Metadata Tokenizers
The default configuration uses the full text tokenizer for text metadata regions. OTSE
supports the use of additional tokenizers for text metadata regions. There are 3
requirements to enable this: creating the tokenizer file; referencing the tokenizer file in
the [Link] file; and associating the tokenizer with a metadata region.
Adding or changing the tokenizer configuration for text metadata is possible. When
the search system is restarted, the text metadata stored values are used to rebuild the
text metadata index using the new tokenizer settings. This may require several hours
on large search grids. There are configuration settings that determine the behavior of
the rebuilding when the tokenizers are changed. The first setting is a failsafe to prevent
accidental conversion if the tokenizers are deleted or changed unintentionally. It
requires that today’s date be provided for the conversion to occur. Use the value “any”
to allow conversion any time the tokenizers are changed. The second setting
determines whether the conversion is applied to existing data, or only to new data.
Applying the change to new data only is usually not recommended, because mixing
tokenizations produces inconsistent results, so the default value is true. In the
[Dataflow_] section:
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true

The [Link] file is used to define where the search tokenizer files are located. In the
[Link] file, to add two metadata tokenizer files:

[Dataflow_]
RegExTokenizerFile2=c:/config/tokenizers/[Link]
RegExTokenizerFile3=c:/config/tokenizers/[Link]

Note that the additional tokenizer values start at the number 2. The first tokenizer entry
is always reserved for the full text tokenizer. The tokenizer definition files in this
example are located in the config/tokenizers directory, which is the conventional
location for tokenizer definition files.
The next step is to identify the text metadata regions which should use the enumerated
tokenizers. This is done as an optional extension to the text region definition in the
[Link] file:

TEXT OTPartNum FieldTokenizer=RegExTokenizerFile2
TEXT RegionX FieldTokenizer=RegExTokenizerFile3

The search engine would then apply the rules defined in [Link] to the region
OTPartNum, and the tokenizer rules in the file [Link] to RegionX. The
tokenizer files are constructed using the same rules as the default full text tokenizer.

Metadata Tokenizer Example 1


This relatively simple tokenizer uses the default character mappings (e.g. upper to
lower case). It does not replace punctuation with whitespace, and does not break
words into multiple tokens. Instead, the output is a single value preserving all
punctuation, but encoded using 4-grams except for dual-byte characters such as
Chinese, which are encoded using bi-grams. This approach to tokenization would be
appropriate for metadata regions that require efficient exact substring matching on the
unmodified values.
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}
words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}
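For instance, a part number value such as AB-12 would, after the default upper-to-lower
case mapping, be indexed as the overlapping 4-grams ab-1 and b-12, with the dash
preserved so that exact substring matches on the original value succeed.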

Metadata Tokenizer Example 2


This example differs from the previous example in one material way. All punctuation
and white space is mapped to a null character, leaving a dense set of characters. The
default conversion of ASCII to lower case still applies (not explicitly required). This
example also uses 3-gram encoding, which is useful in some exact substring matching
situations. An input of: So-me&vAL/ue her?e will be reduced to:
somevaluehere, then encoded with 3-grams (som ome mev eva val alu lue
ueh ehe her ere).
mappings {
0x9=0x0
0xa=0x0
0xb=0x0
0xc=0x0
0xd=0x0
0xe=0x0
0xf=0x0
0x10=0x0
0x11=0x0
0x12=0x0
0x13=0x0
0x14=0x0
0x15=0x0
0x16=0x0
0x17=0x0
0x18=0x0
0x19=0x0
0x1a=0x0
0x1b=0x0
0x1c=0x0
0x1d=0x0
0x1e=0x0
0x1f=0x0
0x20=0x0
0x21=0x0
0x22=0x0
0x23=0x0

< thousands of null mappings omitted>

0xfffb=0x0
0xfffc=0x0
0xfffd=0x0
}
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}
words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}

Administration and Optimization


This section covers everything from using the administration APIs through
maintenance, backups, scaling and performance optimization: how to get the most
out of your OTSE installation.

Index Quality Queries


There are a number of search queries that may be used to test the quality of the data
in the search index. The details of each feature are described elsewhere in this
document. As a convenience, these are summarized to provide a quick reference for
testing index quality.

Index Error Counts


Search on the OTIndexError region to identify objects which had invalid metadata
values. The larger the count, the more errors an object has.

Content Quality Assessment


Search on the OTContentStatus region to find objects where the content could not be
correctly or completely indexed. Presentation in a facet could make interpretation
easier.

Partition Sizes
Search for a partition name in the OTPartitionName region to get a count of the number
of objects stored in a given partition.

Metadata Corruption
Search for -1 in the region OTMetadataChecksum to identify if the metadata for any
objects are corrupt. This is only valid if the metadata checksum feature is enabled.

Bad Format Detection


Search for “unknown” in the OTFileType region. Use information about the results to
adjust the format recognition settings in the Document Conversion Server. You can
then collect and re-index the objects with “unknown” file types.
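Following the query fragment form shown later in this document (see “Testing for
Object Ownership”), such a query might take a form like the following; the exact
syntax depends on the application issuing the query:

[region "OTFileType"] "unknown"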

Text Metadata Truncation


Search for “OTIndexLengthOverflow” in the region OTMeta. This identifies metadata
that was too long and was truncated. Truncated metadata may indicate applications
abusing field sizes, or regions that need the limit adjusted.

Text Value Truncation


Search for “OTIndexMultiValueOverflow” in the region OTMeta. This identifies
metadata that had too many values, and the additional values were discarded. This
can isolate applications abusing the multi-value text region feature, or identify regions
that need the default value increased.

Search Result Caching


Search queries have the potential to run for very long times, especially if the results
are being retrieved in chunks, and the consuming application takes time to process
each chunk of search results. This interactive process of retrieving results can extend
the duration of a search query indefinitely. An example of this in Content Server might
be “Collect all search results”.
When queries are active for prolonged periods, they consume threads in the Search
Engines, potentially preventing other queries from running. In addition, while a search
query is active, the Search Engines will not update their index to ensure that the query
remains transactionally complete, which may eventually cause indexing operations to
stall.
To mitigate this possibility, the Search Federator has the ability to cache search results.
This behavior is triggered by the query duration. Once a query transaction is open
longer than a defined time, the Search Federator will pro-actively request all search
results from all the Search Engines. The results will be written to temporary disk
storage. The transactions with the Search Engines are then closed, and subsequent
requests to fetch results for the transaction are serviced by the Search Federator from
the temporary storage. Temporary files are deleted upon transaction completion or
startup of the Search Federator. The amount of space required depends on the
number of active transactions cached, number of search results in a transaction, and
amount of region data returned in the results. For typical applications, 1 GB of space
should be more than adequate.
Caching provides the highest value in scenarios that retrieve all the results from the
search engine.
There are two configuration controls for this feature, both in the [SearchFederator_]
portion of the [Link] file. The time before caching determines the time in
milliseconds at which the Search Federator will decide to begin caching the search
results. This time should generally be long enough that caching is not triggered every
time a query is slow. The other setting defines the disk storage location for the
temporary files. This should be the path to a working folder that the Search Federator
can access. This temporary file location must be defined to enable caching of search
results, and is empty (disabled) by default.
SearchResultCacheDirectory=G:\cache
TimeBeforeCachingResultsInMS=180000

Query Time Analysis

Query time and throughput varies based on many factors. The first step in optimizing
search query behavior is understanding how time is being consumed during search
queries. To help with this, the Search Federator keeps statistical information about
query performance, which is written to the Search Federator log once per hour. Using
this data, you can assess whether changes to the system or configuration are
improving or degrading search performance.
The data is written in tabular form, such that you can copy it and paste it into a
spreadsheet as Comma Separated Values to make analysis easier. The log entries
have this form, with leading time stamps and thread data omitted:

:Search Performance Summary for 12;00 on 2015-05-28:


:Time, Query Count, Elapsed, Execution, Wait, SELECT, RESULTS, FACETS, HH, STATS:
:12, 3, 50476, 19786, 30690, 18537, 1234, 0, 0, 15:
:11, 2, 64533, 38444, 26089, 38051, 299, 47, 0, 47:
:10, 0, 0, 0, 0, 0, 0, 0, 0, 0:
:9, 0, 0, 0, 0, 0, 0, 0, 0, 0:
… (values for up to 24 hours)…
:Days Ago, Query Count, Elapsed, Execution, Wait, SELECT, RESULTS, FACETS, HH, STATS:
:1, 18, 728909, 35687, 27098, 36709, 376, 36, 0, 45:
… (values for up to 14 days) …

Reading left to right, the values are:


Time: the reporting time. Time is written on the hour. For hourly entries, this is the
end of the hour period, values from 0 to 23. For days, this is the number of days ago
(e.g. 1 is yesterday). Hourly values are gathered into a daily value at midnight.
Query Count: the number of search queries processed in the time period.
Elapsed: the sum of total time for each query, from start to end of the transaction.
Divide by the number of queries to get the average.
Execution: total active time used by search engines and search federator to perform
the search.
Wait: time while transaction is open, but with no tasks for the search engines. Typically
while Content Server is permission trimming.
SELECT: portion of Execution time running the SELECT portion of the query. This is
where the matching results are computed.
RESULTS: time spent fetching search results, typically a result of GET operations.
FACETS: time spent retrieving facets. Note that facets are computed during SELECT,
so a portion of facet generation time is not included here.
HH: time spent computing hit highlighting information.
STATS: time spent retrieving query analysis statistics. As with facets, a portion of the
time generating STATS data is performed during the SELECT phase.
The accuracy of the timing is typically limited by the system timer. On Windows, these
typically have 15 or 16 ms resolutions. The times are in milliseconds, total for all
transactions. Divide by the number of transactions to obtain averages. The statistics
are not persisted between restarts, so the data starts at zero after every startup of the
search grid. This information is written when the log level is set to status level or higher.
Data on a given query is collected when the query completes, so queries that cross an
hour or day boundary are reported for the time when the query finished.
This data is also available on demand through the admin interface using the command:
getstatustext performance
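Since the field order is fixed, these rows are easy to post-process. A minimal sketch
in Python (the sample row is taken from the log excerpt above):

def summarize(row):
    # Row format: :Time, Query Count, Elapsed, Execution, Wait,
    #             SELECT, RESULTS, FACETS, HH, STATS:
    fields = [int(f) for f in row.strip().strip(":").split(",")]
    (time, count, elapsed, execution, wait,
     select, results, facets, hh, stats) = fields
    avg_ms = elapsed / count if count else 0.0
    return {"time": time, "queries": count, "avg_elapsed_ms": avg_ms}

print(summarize(":12, 3, 50476, 19786, 30690, 18537, 1234, 0, 0, 15:"))
# {'time': 12, 'queries': 3, 'avg_elapsed_ms': 16825.33...}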

Administration API
In addition to the socket-level interface that supports search queries, the search
components have a socket-level interface that supports a number of administration
tasks. Each component honors a different set of commands, and in some cases
components reply to the same command with different information. Commands that
make sense for an Index Engine may be irrelevant for the Search Federator.
This section outlines the most common commands and the components to which they
apply. The client making the requests is also responsible for establishing a socket
connection to the component. The configuration of the port numbers for the sockets is
controlled in the [Link] file.
You do not need to use this API for management and maintenance. Applications such
as Content Server leverage the Administration API to hide details of administration and
provide unified administration interfaces.
The examples below use a > (prompt) symbol to represent the command(s), followed
by the response. White space has been added in responses for readability.
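As a minimal sketch of such a client (the port number shown is a placeholder, and the
newline-terminated plain-text exchange is an assumption based on the examples that
follow; check your [Link] file for the actual port numbers):

import socket

def admin_command(host, port, command):
    # Connect to a search component's administration port,
    # send one command, and read back the response.
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall((command + "\n").encode("utf-8"))
        sock.settimeout(5)
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass   # assume the component has finished responding
    return b"".join(chunks).decode("utf-8")

# Example: ask a component whether it is ready (12 = Ready).
print(admin_command("localhost", 9801, "getstatuscode"))   # placeholder port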

stop
Stops the process as soon as possible. Applies to all processes.
> stop
true

getstatustext
In the Index Engine, this command returns information about uptime, memory use and
the number of index operations performed:
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_IEname0_STATUS>
<upTimeSeconds>303</upTimeSeconds>
<numberOfRequests>0</numberOfRequests>
</LIVELINK_IEname0_STATUS>
<MetadataUsagePercentage>1</MetadataUsagePercentage>
<ContentUsagePercentage>0</ContentUsagePercentage>
</stats>

In the Search Federator, getStatusText returns summary information about uptime and
requests. In addition, this call is used to obtain detailed information about the current
status of each metadata region. In the “moveable” section, each region defined in the
index is listed along with a status indicating whether it is moveable. The moveable
status essentially identifies text regions, which can be moved to other storage modes
(DISK versus RAM storage, for example).
There are sections for ReadWrite, NoAdd and ReadOnly. In these sections, every text
(moveable) region is listed. In this example, the partition is in Read-Write mode, so
the regions are listed in the ReadWrite section. For each text region, the response
provides an estimate of the memory currently used by the region, and of the memory
that would be used if the region were changed to other storage modes. Note that
these are ESTIMATES and should not be used to compute memory requirements
precisely.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_SFname0_STATUS>
<upTimeSeconds>363</upTimeSeconds>
<numberOfRequests>1</numberOfRequests>
</LIVELINK_SFname0_STATUS>
<partitionMemInfo>
<moveable>
<OTWFMapTaskDueDate>false</OTWFMapTaskDueDate>
<PHYSOBJDefaultLoc>false</PHYSOBJDefaultLoc>
<OTWFileName>true</OTWFileName>
</moveable>
<ReadWrite>
<OTSomeRegion>
<sizeInMemory>684</sizeInMemory>
<sizeInMemoryKB>0</sizeInMemoryKB>
<sizeOnDisk>0</sizeOnDisk>
<sizeOnDiskKB>0</sizeOnDiskKB>
<sizeOnRet>0</sizeOnRet>
<sizeOnRetKB>0</sizeOnRetKB>
</OTSomeRegion>
</ReadWrite>
<NoAdd>
</NoAdd>
<ReadOnly>
</ReadOnly>
</partitionMemInfo>
</stats>
In the Search Engines, this command is used to obtain basic uptime and number of
search queries performed since startup.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_SEname0_STATUS>
<upTimeSeconds>580</upTimeSeconds>
<numberOfRequests>1</numberOfRequests>
</LIVELINK_SEname0_STATUS>
</stats>
Within the Update Distributor, the getstatustext command is used to obtain uptime, and
statistics about the number of IPool messages processed.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_UpDist1_STATUS>
<upTimeSeconds>619</upTimeSeconds>
<requests>
<numberOfRequests>0</numberOfRequests>
</requests>
<IPoolTransactions>
<NumberCommitted>0</NumberCommitted>
<AverageTime>NaN</AverageTime>
<RunningStdDev>NaN</RunningStdDev>
<MaxTime>4.9E-324</MaxTime>
<MinTime>1.7976931348623157E308</MinTime>
</IPoolTransactions>
<ForcedCheckpoint>
<InForcedCheckpoint>STATUS</InForcedCheckpoint>
<TotalPartitionsToCheckpoint>X</TotalPartitionsToCheckpoint>
<PartitionsInCheckpoint>Y</PartitionsInCheckpoint>
<PartitionsFinishedCheckpoint>Z</PartitionsFinishedCheckpoint>
</ForcedCheckpoint>
</LIVELINK_UpDist1_STATUS>
</stats>
The ForcedCheckpoint section identifies how many partitions are busy writing
checkpoints. The possible values for STATUS are:
No Checkpoint Command
Checkpoint pending
Checkpoint in progress
If STATUS is not “Checkpoint in progress”, then X, Y and Z are 0. Otherwise, these
values represent the number of partitions in various stages of writing checkpoint files.

With the Search Federator, a variation of getstatustext can be used to retrieve data
about search query performance. The interpretation of the values is outlined in the
section entitled “Query Time Analysis”.
> getstatustext performance

<?xml version="1.0" encoding="UTF-8"?>

<performance>
<hours>
<hour>
<hourNumber>13</hourNumber>
<numQueries>1</numQueries>
<elapsed>71305</elapsed>
<execution>1149</execution>
<wait>70156</wait>
<SELECT>376</SELECT>
<RESULTS>773</RESULTS>
<FACETS>0</FACETS>
<HH>0</HH>
<STATS>0</STATS>
</hour>
<hour>
<hourNumber>12</hourNumber>
<numQueries>4</numQueries>
<elapsed>149954</elapsed>
<execution>100071</execution>
<wait>49883</wait>
<SELECT>99761</SELECT>
<RESULTS>201</RESULTS>
<FACETS>16</FACETS>
<HH>0</HH>
<STATS>93</STATS>
</hour>
</hours>
</performance>

Similarly, the Update Distributor can provide accumulated statistics about indexing
throughput and errors with “getstatustext performance”. First introduced in 20.4, the
output is in XML form and includes the same data that is written to the logs on an hourly
basis.
<?xml version="1.0" encoding="UTF-8"?>
<performance>
<hours>
<hour>
<hourNumber>8</hourNumber>
<AddOrReplace>0</AddOrReplace>
<AddOrModify>0</AddOrModify>
<Delete>0</Delete>
<DeleteByQuery>0</DeleteByQuery>
<ModifyByQuery>0</ModifyByQuery>
<Modify>0</Modify>

Starting with the 2015-09 update, a new option for getstatustext will return a subset of
information, faster. The “basic” variation reduces the time needed by Content Server
to display partition data. The subset of data was specifically selected to meet the
needs of the Content Server “partition map” administration page. When basic is used,
the status and size of partitions is retrieved from cached data, and only updated during
select indexing operations such as “end transaction”. While technically the information
could be slightly out of date, it is accurate enough for practical purposes. If there is no
cached data, then the slower method is used: querying each Index Engine for data.

> getstatustext basic

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<IEname0>
<status>12</status>
<MetadataUsagePercentage>11</MetadataUsagePercentage>
<ContentUsagePercentage>31</ContentUsagePercentage>
<Mode>ReadWrite</Mode>
<Behaviour>Normal</Behaviour>
</IEname0>
</stats>

For the Index Engines, there is new data in this response. Percentage full is presented
in two different ways, one for text metadata, and one for usage of the allocated space
on disk of the index. The Behaviour represents the “soft” modes of a read/write
partition (update only, rebalancing). Sample responses from the other search
processes are shown below, returning the same codes as a “getstatuscode” command.
<?xml version="1.0" encoding="UTF-8"?>
<stats>
<UpDist1>
<status>135</status>
</UpDist1>
</stats>

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<SFname0>
<status>12</status>
</SFname0>
</stats>

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<SEname0>
<status>12</status>
</SEname0>

</stats>

getstatuscode
This function is used to determine if a process is ready, in error, or starting up. Starting
up is generally the status while an index is being loaded.

> getstatuscode

12

getstatuscode response values:

All Processes
  10   Running, but not yet ready. Usually when loading an index.
 -11   An error condition exists
  12   Ready

Index Engine Codes from 301 to 500
 301   Looking for Update Distributor

Update Distributor Codes from 129 to 300
 131   Polling for transaction
 133   Done
 134   Waiting for partitions to be added (RMI mode)
 135   Waiting for index engines
 137   Contacting index engines

Search Engine Codes from 501 to 700
 501   Registering with Search Federator
 502   Waiting for initial index
 503   Initializing search engine index

Search Federator Codes from 701 to 900
 701   Waiting for search engine

registerWithRMIRegistry
For all processes, this command forces a reconnection with the RMI Registry, and
reloads the remote process dependencies. It is useful for resynchronizing after some
types of configuration changes without needing to restart the processes. If the search
grid is configured to not use RMI, this command is ignored.
> registerWithRMIRegistry

received ack

checkpoint
The checkpoint function is issued to the Update Distributor to force all partitions to write
a checkpoint file. This is especially useful as part of a graceful shutdown process. If
large metalogs are configured, the time to replay the metalogs during startup can take
a long time. Forcing checkpoints shortly before shutdown eliminates metalogs and can
dramatically improve startup time. After issuing the checkpoint command, the Update
Distributor waits for a number to be provided. The number is a percentage,
representing the threshold over which a checkpoint should be written. For example, if
a checkpoint is normally written when metalogs reach 200 Mbytes, a value of 10 means
that a checkpoint should be immediately forced if the metalog has reached 20 Mbytes
in size. The same logic applies for other checkpoint triggers, such as number of new
objects or number of objects modified. Any value other than an integer from 0 to 99
will simply abort the command.
> checkpoint
> 10

true

reloadSettings
This command applies to all processes. Some, but not all, of the [Link] settings
can be applied while the processes are running, and some can only be applied when
the processes first start. This command requests that the process reload settings. A
list of reloadable settings is included near the end of this document.
> reloadSettings

received ack

getsystemvalue
Used to obtain specific values from the Index Engine. Currently, there are only two
keys defined. ConversionProgressPercent will return the percentage complete when
an index conversion is taking place. A “ping” operation to check that the process is
responding is also available. This command is different from the others in that it
requires two separate submissions, the first being the command and the second being
the key.
> getsystemvalue
> marco

polo

> getsystemvalue
> ConversionProgressPercent

36

addRegionsOrFields
This command applies to the Update Distributor only, and can be used to dynamically
add a region definition. Once added to an index, regions are generally sticky. The
[Link] file is not updated, so note that using this command may cause
a drift between the index and the [Link] file. This discrepancy is not a
problem, but should be kept in mind in support situations.
The syntax requires exactly one TAB character after the type and before the region
name. This command waits for additional lines of definitions until an empty line is sent,
which terminates the input mode. The function returns true on completion.
> addRegionsOrFields
> text flip
> integer flop
>

true
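Because the single TAB separator is easy to lose when pasting into a terminal,
scripting this command can be more reliable. A sketch, using a hypothetical host and
port, with the line-based protocol described above:

import socket

# Each definition line is "<type>\t<region name>"; an empty line ends input.
payload = "addRegionsOrFields\ntext\tflip\ninteger\tflop\n\n"

with socket.create_connection(("localhost", 9601), timeout=30) as sock:  # placeholder port
    sock.sendall(payload.encode("utf-8"))
    print(sock.recv(4096).decode("utf-8"))   # expect "true"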

runSearchAgents
Update Distributor only. Instructs the Update Distributor to run all of the search agents
which are currently defined against the entire index. Results are sent to the search
agent IPool.
> runsearchagents

true

runSearchAgent
Update Distributor only. Instructs the Update Distributor to run a specific search agent.
The search agent named must be correctly defined in the [Link] file. Results are
sent to the search agent IPool. This command expects one line with the search agent
after the command.
> runsearchagent
> bob

true

runSearchAgentOnUpdated
Update Distributor only. Instructs the Update Distributor to run the specific search
agents listed. Time is based on the values in the upDist.N file, and the timestamp is
updated (see Search Agent Scheduling). Requests are added to a queue and may
require some time to complete. Results are sent to the search agent IPool.
> runsearchagentonupdated
> MyAgentName
> AnotherAgent

true

runSearchAgentsOnUpdated
Update Distributor only. Instructs the Update Distributor to run all the search agents.
Time is based on the values in upDist.N file, and the timestamp is updated (see Search
Agent Scheduling). Requests are added to a queue and may require some time to
complete. Results are sent to the search agent IPool.
> runsearchagentsonupdated

true

Server Optimization
There are many performance tuning parameters available with OTSE. There is no
single perfect configuration that meets all requirements. You can optimize for indexing
performance or query performance. There are tradeoffs between memory and
performance, and many external parameters can affect the OTSE behavior. In this
section we examine some of the most common options for system tuning. The focus
here is on administration and configuration tuning, not on application optimization.

Metadata Region Fragmentation


When metadata values are modified, fragmentation of the memory used to store the
metadata takes place. In a typical system, this fragmentation will slowly increase
the memory used to store metadata over a period of days or weeks.
To combat this, OTSE includes a metadata memory defragmentation capability. By
default, this is scheduled to run monthly or nightly, depending on the metadata storage
methods being used. For most applications, this will be sufficient to prevent any
material memory loss.
With Low Memory configuration, fragmentation is much less pronounced, and the
defragmentation impact is also smaller. Starting with the 16.0.3 update,
defragmentation is restricted to running only on the first Sunday of each month. If you
require nightly defragmentation, there is a setting in the [DataFlow_] section that can
enforce this:
DefragmentFirstSundayOfMonthOnly=0

If your use of OTSE includes high volumes of indexing and metadata updates, then
fragmentation may occur more quickly. You can consider modifying the configuration
settings to run the defragmentation several times per day. While defragmentation is
happening, there will be short periods, typically a few seconds at a time, where search
query performance is degraded. In practice, we find that Low Memory Mode without
daily defragmentation provides the best indexing throughput.
The tuning parameters typically do not require adjustment unless you are experiencing
extraordinary levels of memory fragmentation. Within the search.ini_override
file, in the [DataFlow] section, the following settings can be added to make adjustments
if necessary:
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
Defragmentation times can be a list in 24 hour format (for example, 2:30;14:30) to run
multiple times per day. Space is the maximum temporary memory to consume while
defragmenting in MB; the larger the value, the faster defragmentation runs – up to a
limit based on the size of the largest region. To completely disable defragmentation,
set the DefragmentMemoryOptions value to 0. Setting the options value to 1 is not
recommended – it enables aggressive defragmentation, whereby all regions are
defragmented without relinquishing control to allow searches while defragmentation
occurs.
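For example, a sketch of a twice-daily schedule in the search.ini_override file
(values illustrative and to be tuned to your environment):

[DataFlow]
DefragmentFirstSundayOfMonthOnly=0
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=20
DefragmentDailyTimes=2:30;14:30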
There are two other defragmentation settings that you will normally not need to adjust:
DefragmentMaxStaggerMinutes=60
DefragmentStaggerSeedToAppend=SEED
If you have multiple search partitions, each partition will randomly select a
defragmentation start time up to “MaxStaggerMinutes” after the specified daily
defragmentation time. The purpose of this is to distribute CPU load randomly if you
have many partitions. The SEED value is a string used to seed the random number,
and is available to change if for some reason the default string “SEED” produces start
times which cluster too tightly. It is unlikely you will need to provide an alternative
string.

Partition Metadata Memory Sizing


When you configure a partition, one of the key settings is the amount of memory that
should be reserved for metadata. As the partition accepts more objects, it consumes
this memory. The memory usage is typically measured as a percentage of the
allocated memory, and generally referred to as “percent full”.
If your system will manage millions of objects, the chances are good that
you will need multiple partitions. The number of partitions you require is based upon
many variables; the one we will consider here is the amount of memory that you
allocate to metadata in each partition.
Before delving too deeply into the alternatives, a note about 32 bit environments is in
order. Content Server 9.7.1 for Windows is deployed by default within a 32 bit Java
environment. The 32 bit environment places a restriction on the amount of memory
that a single process can consume to about 1.3 gigabytes. Once you factor out
memory needed for other purposes, the practical upper limit for memory that can be
reserved for metadata is about 1 gigabyte. Customers using Content Server on
Solaris, which uses a 64 bit JVM, have reported success using larger partition sizes,
up to 3 gigabytes.
Assuming a 64 bit Java environment, such as Content Server 10.5 or 16, you can set
the partition sizes larger. Because of the number of variables, there is no simple
optimal size which is always correct. For systems which cannot contain the entire
index within a single partition, larger partition sizes are synonymous with fewer
partitions. Here are some of the tradeoffs:

• The memory overhead for a partition is more or less constant, regardless of the
  partition size. Larger partitions are therefore more efficient in terms of memory
  use, which can reduce the overall cost of hardware.
• In operation, partitions engage in high levels of disk access. Typically, fewer
  partitions will result in more efficient use of the available disk bandwidth.
• During indexing, the Update Distributor will balance the load over the available
  index engines. If high indexing performance is a key requirement, more partitions
  may be preferable.
• For search queries returning small numbers of results (typical user searches),
  fewer partitions are more efficient. This is typical of most Content Server
  installations.
• Some specific types of queries are slow, and performance is based on the number
  of text values in the partition dictionary. Smaller partitions are therefore faster. If
  regular expression (complex pattern) queries on text values stored in memory are
  common for your application, then smaller partitions may be a better choice.
• A small partition would reserve about 1 gigabyte of RAM for metadata. A very
  large partition would be about 8 gigabytes in size. Experimenting with intermediate
  sizes before configuring a large partition is strongly recommended.

It is easy to make a small partition larger by changing the configuration, but making
a large partition smaller is more complex and may require some level of re-indexing.
Don’t use large partitions until you are confident that they are appropriate for your
environment.
With Low Memory mode, the number of items that can be stored in a partition is
considerably larger than when memory storage modes are configured. Putting aside
all the caveats about performance variations, for new systems that are expected to
become relatively large, our reference for development is:
• Low Memory Mode configuration
• 2 GB Partition Memory configuration
This configuration should handle up to 10 Million Content Server objects with
reasonable performance.
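Expressed in terms of the MaxMetadataSizeInMBytes setting shown later in this
document, the 2 GB reference configuration would correspond to a sketch like the
following (the section name suffix is illustrative):

[Dataflow_Enterprise]
MaxMetadataSizeInMBytes=2000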

Automatic Partition Modes


To minimize the intervention needed by an administrator to monitor and manage search
partitions, OTSE has the ability to automatically change the mode of operation as the
partition fills with data. There are three effective modes of operation for a read-write
partition: normal, update and rebalance. These modes are selected based on the size
of a partition as measured by metadata memory use.
Memory Usage Mode Switching
In normal operation, the partition will accept any operations. As the partition is filled,
eventually it will cross an “update only” threshold. Above this threshold, the partition
will not accept new objects for indexing, although it will continue to accept updates to
objects already indexed within the partition. If the percent full falls below this threshold,
the partition will once again accept new objects. This can happen if objects are deleted
or partition settings are changed.
If the partition continues to fill, eventually it will reach a “rebalance” threshold. In
rebalancing mode, updates to objects will cause them to be moved to other partitions,
as determined by the Update Distributor. Rebalancing continues until the partition falls
below the “stop rebalancing” percent full threshold. Rebalancing ensures that the
partition does not exceed the available memory, but it is an expensive process, and
should be considered a safety mechanism of last recourse. Reserving sufficient
‘update only’ memory will minimize the need for rebalancing.
The figure below illustrates the percent full memory usage and thresholds for read-
write partitions.

Currently, very conservative default values are used: 80% full for rebalancing and 77%
for the stop rebalancing threshold, which reflects the amount of memory typically used
by existing Content Server customers.
Selecting a suitable threshold for update-only mode requires a little more thought, and
depends upon your expected use of the search engine. The default value with Content
Server is a setting of 70%, which reserves 10% of the space for metadata changes.
Some considerations for adjusting this setting include:
• If your system has applications or custom modules known to add significant
new metadata to existing objects, you should allow more space for updates.
• Archival systems which rarely modify metadata can reduce the space
reserved for updates. Note that Content Server Records Management will
often update metadata when activities such as holds take place, even with
archive applications.
Note that these values are representative for traditional partitions with 1 GB of memory
for metadata. If you are using a larger partition, then reserving less space for updates
and rebalancing may be appropriate. The best practice is to periodically review the
percent full status of your partitions, and adjust the partition percent full thresholds
based upon your actual usage patterns.
The values in the [Link] file that define the various thresholds are:
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90

Disk Usage Mode Switching


The previous section describes automated mode selection based on memory used for
metadata. A similar capability exists for switching modes based on disk usage. The
method is identical to the metadata memory scenario, except that percent full is
measured relative to the amount of disk space used to store the index.
The amount of space needed to represent the index changes in size as metalogs are
consumed into checkpoints, or text index files are merged. Merge operations may
temporarily require twice the disk space used by the partition. This can be addressed
by keeping the maximum used space relatively low, or enabling Merge Tokens.
The maximum allowable disk usage for a partition is specified in MB, the thresholds
are set as percentages relative to this value. The default values are shown below.
MaxContentSizeInMBytes=50000
StartRebalancingAtContentPercentFull=95
StopRebalancingAtContentPercentFull=92
StopAddAtContentPercentFull=90
WarnAboutAddPercentContentFull=true
ContentPercentFullWarnThreshold=85

Selecting a Text Metadata Storage Mode


As described elsewhere in this document, there are several storage modes available
for text metadata, each with relative strengths. To summarize:
• Memory Storage (RAM) provides the fastest retrieval of metadata values, but
consumes the most memory.
• Value Storage (DISK) reduces the memory required by storing the Text metadata
values on disk, but keeps the Text metadata index in memory.
• Low Memory mode (DISK) moves the Text Metadata index to disk, dramatically
reducing the memory requirements.
• Merge File mode places the Text Metadata values in a separate set of files that
are merged in background processes. This mode is standard for Content Server
16.
• Retrieval Storage (DISK_RET) uses the least memory, storing the values on disk,
and discarding the index entirely, making the values non-searchable.
Use the [Link] file to choose optimal settings for your application. In
addition to allowing specification of each field individually, this file can also be used
to set a default storage mode to be applied unless otherwise indicated.
For the majority of applications, Disk Storage with Low Memory Mode and Merge File
mode enabled is probably the optimal setting, and is certainly the configuration that will
provide the highest possible search indexing throughput. Retrieval Storage is usually
indicated for the Hot Phrases and Summaries regions (OTHP and OTSummary).
Note that if you fill a partition in a low memory mode, you may not have enough space
later to convert to a higher memory usage mode. For example, if the partition memory
is 80% full with text regions in DISK mode, it is unlikely that you will be able to switch
the default setting to RAM mode unless some regions are removed or the partition size
is increased.

Content Server customers: remember you shouldn’t edit this file directly, since there
are administration pages within Content Server that allow you to manage these
settings, and Content Server will over-write the [Link] file.

High Ingestion Environments


Some applications are focused on making large amounts of data searchable and
consider the indexing performance to be the key factor to optimize. There are a
number of specific considerations for tuning a search grid in these situations.
The first recommendation is the use of Low Memory mode for text metadata storage,
since high ingestion rates drive large search grids, and Low Memory mode minimizes
the number of partitions that will be required. Secondly, use Merge File mode for Text
Metadata index storage, which reduces the Checkpoint size, since Checkpoint writes
consume a considerable portion of the total Index Engine time.

Update Distributor Bottlenecks


After every indexing batch transaction completes, the Update Distributor records
performance metrics in its log file if the log DebugLevel is set to info or status. This
information can shed light on where time is spent during the indexing process, and
help guide optimization. A typical performance record looks like this:

1398979250175:IPoolReadingThr[Link]Timing info (counts). Total time 175255285 ms.


Start Transaction 11106708 ms (5513). End Transaction 7486279 ms (5513). Checkpoint
114769662 ms (1639). Local Update 12225821 ms (23161). Global Update 16140505 ms
(23161). Idle 9216366 ms (264). IPool Reading 1077868 ms (28674). Batch Processing
43376 ms (23161). Start Transaction and Checkpoint 0 ms (0). :

The times are cumulative since the Update Distributor was started. Each entry has the
form:

Category N ms (count).

Update Distributor Categories:

Total Time Total uptime of the Update Distributor – this includes the start-up time
that is not included in any other category – hence it will be larger than
the sum of the other categories.

Start Transaction Time the Update Distributor spends waiting for the Index Engines to be
ready to start a transaction.

End Transaction Time the Update Distributor spends waiting for a transaction to end,
excluding time to write checkpoint files. Too much time in this category
may indicate an excessive amount of time is spent running search
agents (for Content Server, usually Intelligent Classification or
Prospectors).

Checkpoint Time the Update Distributor waits for the Index Engines to write
checkpoint files. Large percentages of time here suggest that
checkpoints are created too frequently, or the storage system is under-
powered. Metalog thresholds can be adjusted to reduce the frequency
of checkpoint writes.

Local Update Time the Update Distributor is working with the Index Engines to update
the search index. This is useful time. It is common for this value to
remain below 15% of the time even when a system is performing well.

Global Update Time in which the Update Distributor is interrogating the Index Engines
prior to initiating the local update steps. A typical purpose is to establish
which Index Engine should receive a given indexing operation. Long
times here may indicate that Update Distributor batch sizes are too
small.

Idle The amount of time the update distributor is idle – it has completed all
the indexing it can, and is waiting for new updates to ingest. A high
percentage of time idle indicates that OTSE has additional capacity. If
indexing is slow and there is sufficient idle time, the bottlenecks likely
exist upstream in the indexing process (DCS, Extractors or DataFlow
processes). Note that you should always have some idle time, since the
demand on indexing throughput is not constant.

IPool Reading The amount of time the Update Distributor spends reading indexing
instructions from the disk. In general, this should be relatively small
compared to measurements such as Local Updates. If not, it may
indicate poor disk performance for the disk hosting the input IPools.

Batch Processing The amount of time planning how to proceed with the local update. This
value should be very small as a percentage of global update time.

Start Transaction and Checkpoint
                 Older systems using RMI mode could not differentiate between time
                 spent writing checkpoints and time spent on starting a transaction.
                 Therefore on these systems those two operations are grouped into a
                 single category. A properly configured system should have a value of 0
                 in this field.

Parsing Time spent parsing metadata.

Backups Time spent performing backups.

Search Agents Time spent running search agent queries. Does not apply when
configured to use the older method of running agents after every index
transaction.

Network Problems The values NetIO1 through NetIO5 capture the number of times 1 to 5
retries were needed to read or write to network IO. The NetIOFailed
counts the number of times IO failed after 5 retries.

The Update Distributor also keeps statistical summaries of performance for up to 1
week. Once per hour, a summary of the data is written to the Update Distributor log
file. The first line of output is the hour interval for which the data was collected. Search
for “Index Performance Summary” to locate this data.
The second line is a list of comma-separated titles. Subsequent lines contain the data
points, up to 24 hourly values (the list resets at midnight) and up to 7 lines for daily
summaries. These lines can be copied and pasted into a spreadsheet as comma
separated values for easier readability, analysis and charting. Selected values in the
table include:
Operation Counts
The number of IPool messages processed for each of the operation types:
AddOrReplace, AddOrModify, Delete, DeleteByQuery, ModifyByQuery, and
Modify. A count of the number of IPools processed is also included.
Percentage Times
The time spent as a percentage performing various operations, per the Update
Distributor Categories table above. Idle time is a key measurement, indicating
whether the indexing system has sufficient capacity.
Backup Times
This information can be used to verify that backups are occurring, and ensure they
are completing in a reasonable time.
Agent Times
Agents are search queries run against data during indexing, generating IPools for
ingesting into Content Server. Content Server uses agents for Classification and
Prospectors. Too many agents, or complex/expensive agents, can materially
affect indexing throughput.
NetIO Stats
Keeps track of network retries and errors. Includes disk errors, most of which are
network attached. Non-zero numbers here can indicate hardware issues.

Checkpoint Writing Thresholds


In the Dataflow section of the [Link] file are settings that instruct the Index Engines
to create Checkpoint files when certain conditions are met. For Memory-based
partitions (not Low Memory mode) the default settings are that Checkpoints are
generated when the metalog grows to 16 MB, or when 5,000 objects are added, or
when 500 objects are modified.
[Dataflow sections]
MetaLogSizeDumpPointInBytes=16777216
MetaLogSizeDumpPointInObjects=5000
MetaLogSizeDumpPointInReplaceOps=500

Because the characteristics of Low Memory mode are different, these values can be
adjusted upwards significantly, perhaps to 100 MB, or 50,000 new objects or 10,000
objects modified. In order to maintain backwards compatibility and mixed mode
operation, OTSE has a separate set of Checkpoint Threshold configuration settings for
Low Memory Mode:
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000

Throughput normally increases with larger values because the number of times that
Checkpoints are created decreases. At the same time, this increases the likelihood
that many partitions will need to create checkpoint files at the same time. This may
place a high load on your disk system, and stall indexing for longer periods when
Checkpoint writes happen.
Larger values mean that more data is kept in the metalog and accumlog files instead
of in the Checkpoint. Larger metalog files require more time to consume during the
startup process for Index Engines or Search Engines. In most cases, this is a one-
time penalty and is acceptable.
When checkpoints are written, the Update Distributor writes lines to the log file that
indicate progress against each of the three configuration thresholds for each partition
that will write a checkpoint. Reviewing these lines can help you understand where
adjustments may be appropriate. The log lines look like this:

1399063311301:main:5:Added partition ZZZZ to checkpointing list:


1399063311301:main:5:with metalog size ratio 209715200/209715200=1.0:
1399063311301:main:5:with metalog object ratio 55321/50000:
1399063311301:main:5:with metalog replace operation ratio 0/5000:
When using Merge File storage mode, there are analogous settings that manage the
behavior of the background merge process:
MODCheckMode=0
MODCheckLogSizeDumpPointInBytes=536870912
MODCheckMergeThreadIntervalInMS=10000
MODCheckMergeMemoryOptions=0

Set the CheckMode to 1 to enable use of metadata Merge File mode. The LogSize
determines how large the CheckLog files may become before a merge operation is
triggered, and defaults to 512 MBytes. The MergeThreadInterval determines how often
the Index Engines check to see if a merge should be performed, with a default of 10
seconds. The MemoryOptions default is optimized to minimize memory use; setting this
value to 1 uses perhaps 100 MB of additional RAM per partition for a relatively small
performance increase while performing merge operations.
Index Batch Sizes
The Update Distributor breaks input IPools into smaller batches for delivery to Index
Engines. The default is a batch size of 100. For Low Memory mode, this can be higher,
perhaps 500. Since the batch size is distributed across all the Index Engines that are
currently accepting new objects, the batch size can be further increased if you have
many partitions. A guideline might be 500 + 50 per partition. Larger batches result in
less transaction overhead.
[Update Distributor section]
MaxItemsInUpdateBatch=500
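For example, applying the guideline above to a grid with 10 partitions gives
500 + 10 × 50 = 1000, so MaxItemsInUpdateBatch=1000.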

Note that the batch size is also limited by the number of items in an IPool. Often, the
default Content Server maximum size for IPools is about 1000, so this may also need
to be modified to take full advantage of increases in the Update Distributor batch size.
Starting with 20.3, batches are also split when the total size of the metadata plus text
in the objects to be indexed exceeds a defined threshold. The default is 10 MB, but
can be set higher if indexing large objects is common. This has been seen when
indexing email that has distribution lists with thousands of recipients. In the [Dataflow_]
section:
MaxBatchSizeInBytes=20000000

Prior to 20.3, the splitting of batches based on size used a different approach, where
the total size of the metadata of the objects in the batch could not exceed half of the
content truncation size (typically 5 MB).
There is another configuration setting that enables an optimization added in 16.2.2
related to how batches are handled. When processing ModifyByQuery or
DeleteByQuery operations, each request is sent to every Index Engine separately. In
practice, there are often many such contiguous operations in an IPool. The
optimization bundles these contiguous operations into a single communication to each
Index Engine, reducing the coordination overhead. By default, this optimization is
enabled, and can be controlled in the [DataFlow] section of the [Link] file:
GroupLocalUpdates=true

Partition Biasing
Research has shown that there is a strong correlation between the number of partitions
used for indexing and the typical indexing throughput rate. As expected, more
partitions improve parallel operation, and increases the throughput. However, the
transaction overhead per partition is relatively fixed, and the batch sizes become
fragmented into small batches when the operations are distributed to many partitions.

Depending on hardware, the optimal indexing throughput is usually in the range of 4
to 8 partitions.
To enable indexing in this optimal range for large search grids, there is a feature in
OTSE that restricts indexing of new objects to a specified number of partitions. For
example, you may have 12 partitions, but want to only fill 5 at a time for optimal
throughput. This is called partition biasing, and is set in the [Dataflow section]:
NumActivePartitions=5

The default value is 0, which disables partition biasing. Biasing only applies to new
objects being indexed. Updates to existing objects are always sent to the partition that
contains the object, regardless of biasing. For biasing purposes, a partition is
considered “full” when it reaches its “update only” percent full setting. The algorithm
for distributing new objects across active partitions is based upon sending objects with
approximately similar total sizes of full text and text metadata.
During an indexing performance test at HP labs in the summer of 2013, a brief test of
indexing throughput versus the number of partitions was performed. At the time, the
index contained about 46 million objects. There was plenty of spare CPU capacity,
and a very fast SAN was used for the index. In this particular test, the throughput
peaked around 12 partitions.

Parallel Checkpoints
Another indexing throughput setting is control over parallel Checkpoints. When a
partition completes an indexing batch, it checks whether the conditions for writing a
Checkpoint have been met. If so, all partitions are given the opportunity to write
Checkpoint files, the logic being that if at least one partition is stalled anyway, any
partition that might need to write a Checkpoint soon should do it now. However, when
large numbers of partitions write Checkpoints at once, you may saturate disk or CPU
capacity, causing dramatic performance degradation while the Checkpoints are being
written. The parallel Checkpoint control lets you specify the maximum number of
partitions that are allowed to write a Checkpoint at the same moment. If more need to
write Checkpoints, they must wait until a slot is freed up by a Checkpoint write
completing in another partition. You should only need to adjust this if thrashing due to
Checkpoint writing is suspected as a problem. Disabled by default; set in the
[Dataflow] section of the [Link] file:
MaximumParallelCheckpoints=8

Testing for Object Ownership


Beginning with 16.2.3 (December 2017), a new optimization is available for
ModifyByQuery and DeleteByQuery operations. For these operations, the Update
Distributor broadcasts the operation to every Index Engine. The Index Engines then
run the query to determine which objects match the query criteria.
The most common query criterion with Content Server is matching a single object ID,
having the form: [region "OTObject"] "DataId=445195828". The optimization
recognizes this specific form of query and, instead of running it, hashes the DataId
and tests it against a Bloom Filter. Only the partitions that pass the Bloom Filter test
go on to perform the query to match the DataId. This optimization has the highest
impact when there are many partitions.
The Bloom Filter is enabled by default, and typically requires about 8 MB of memory
for a partition that contains 10 million objects. Smaller partitions use less memory, and
the Bloom Filters are recomputed as partitions grow in size. When enabled, the Bloom
Filters have a minimum size of 2 MB, and maximum size of 128 MB. Bloom Filter data
is not persisted, which means that an Index Engine will typically require one or two
additional seconds per million objects to compute the Bloom Filter data during start up.
When objects are deleted from a partition, their corresponding bits in the Bloom Filter
are not removed. Eventually, if many objects are deleted, this will result in higher false
positive responses, thereby reducing the degree of optimization. A restart of the Index
Engine is needed to rebuild the Bloom Filter.
There are several configuration settings for tuning the behavior of Bloom Filters in the
[Dataflow] section of the [Link] file. In general, the defaults are designed to stay
below a false positive rate of about 3%. The LogPeriod determines how frequently
statistics about the performance of the Bloom Filter are written to the log file.
AutoAdjust is recommended; whether an adjustment is needed (resizing and
recomputing the Bloom Filter) is determined by the MinAddsBetweenRebuilds value.
If AutoAdjust is disabled, then you are responsible for setting the NumBits and number
of hash functions, which are ignored when AutoAdjust is enabled. The default settings
are good to about 10 million items. In that case, consulting public sources on the math
behind sizing and false positive rates is advisable.
LogPeriodOfDataIdQueries=1000
NumBitsInDataIdBloomFilter=67108864
NumDataIdHashFunctions=3
AutoAdjustDataIdBloomFilterSize=true
AutoAdjustDataIdBloomFilterMinAddsBetweenRebuilds=1048576
To completely disable Bloom Filters, AutoAdjust should be set to false, and the Number
of Hash Functions should be set to 0.
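
For reference, the standard Bloom Filter approximation can be used to sanity-check
the defaults above. A small Python example; the formula is public Bloom Filter math,
not anything OTSE-specific:

import math

def bloom_false_positive_rate(num_bits, num_hashes, num_items):
    # Standard approximation: p = (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

# Defaults above: 67,108,864 bits (8 MB) and 3 hash functions. About
# 8 million DataIds gives roughly a 2.7% false positive rate, consistent
# with the ~3% design target and the "good to about 10 million items"
# guidance.
print(bloom_false_positive_rate(67108864, 3, 8000000))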

A further optimization was added in version 20.4, in which a quick single-token search
for the data ID is performed to get a short list of objects, which are then tested for the
phrase match. This is considerably faster, since phrase searches are much slower
than single-token searches. The fast lookup can be disabled if necessary in the
[Dataflow_] section of the [Link] file:
DisableDataIdPhraseOpt=true

Compressed Communications
There is a configurable option in OTSE that allows the content data sent from the
Update Distributor to the Index Engines to be compressed. For systems which have
excess CPU capacity and slow networking to the Index Engines, enabling this option
can improve indexing throughput. Most systems do not have this performance profile,
so the feature is disabled by default. The threshold setting determines the minimum
size of full text content that needs to be present before the compression is triggered
for a specific object. Note that compression also requires additional memory. The
memory requirement varies based upon the maximum size of the text content, and for
a system with a content truncation size of 10 MB an Index Engine would consume
another 12 MB of RAM. In the [Dataflow_] section:
CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535
Data Storage Optimization
Regions are encoded for storage using a data structure that contains look-ahead
pointers which allow traversing the list quickly. Beginning with 21.2, this “skip list” has
been extended to support longer skips, allowing faster search and retrieval, but
requires a relatively small amount of additional memory. There are configuration
settings to enable and tune this behavior, with a default of “on” and skip values of 4096.
[IndexMaker]
UseLongSkips=true
LongSkipInterval=4096
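
Conceptually, skip pointers let a reader jump over runs of entries instead of scanning
one at a time. A simplified Python sketch of searching a sorted posting list with skip
pointers every LongSkipInterval entries (illustrative only; the on-disk format is internal
to OTSE):

def find_first_at_least(postings, target, interval=4096):
    # postings is a sorted list; skip pointers exist every `interval`
    # entries, so first advance in whole intervals...
    pos = 0
    while pos + interval < len(postings) and postings[pos + interval] <= target:
        pos += interval
    # ...then finish with a short linear scan inside one interval.
    while pos < len(postings) and postings[pos] < target:
        pos += 1
    return pos  # index of the first entry >= target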
Scanning Long Lists
There is a specific optimization available for updates to text metadata in partitions not
using Low Memory mode. Low Memory mode uses different data structures and does
not exhibit this behavior.
If metadata updates are applied to metadata values where many objects have the
same value, the update operation can be extremely slow. For example, the
“OTCurrentVersion” region may have 1 million objects with the value “true”. Updates
to this field would be very slow.
The optimization makes these updates fast, but requires additional memory. Because
many customers with this configuration have full partitions, they cannot tolerate extra
memory requirements, so the default is for the optimization to be disabled (a value of
0). The configuration setting specifies the distance between known synchronization
points in the data structure. Values of about 2000 perform well; values below 500
become memory-intensive. In the [Dataflow] section:
TextIndexSynchronizationPointGap=2000

Ingestion versus Size


When measuring the performance of search indexing, bear in mind that throughput
decreases as the number of objects in the partition increases. As data structures
become larger, extending and updating the index becomes slower. The single largest
contributor to the performance degradation is writing Checkpoints. A Checkpoint is a
complete snapshot of the search partition, and as the partition gets larger, the time to
create the Checkpoint increases. As a guideline, the indexing throughput as a partition
approaches 10 million items will be about 30% of the throughput experienced for the
first million items indexed.
Using achievable numbers with typical Content Server objects, indexing the first million
items in a partition may be possible in 6 hours. Indexing items 9 million to 10 million
in a partition may require 18 hours or more.
Content Server Considerations
In many scenarios, the bottleneck for indexing occurs upstream of the search engine.
The indexing process starts with the Extractor in Content Server, which feeds IPools
to DCS. DCS prepares the data, and creates IPools that feed the Update Distributor
component of the search engine.
The first constraint is typically the Document Conversion Server. Content Server
provides mechanisms to run multiple DCS instances in parallel, and the worker
processes that each DCS instance manages for operations such as format parsing or
thumbnail generation can also be scaled up. If DCS throughput is not the limiting
factor, then running multiple Extractors in parallel is also an option that can be
configured.

Ingestion Rate Case Study


To help assess the performance of the Low Memory Mode configuration for high
ingestion rates or large systems, Hewlett-Packard graciously agreed to provide time in
their labs for testing ingestion on one of their multi-CPU servers with a fast HP SAN as
the index and test data storage. There is a performance white paper available from
OpenText that provides details about the hardware configuration.
Given that limited time was available, the testing was pragmatic rather than rigorously
scientific. For example, we might change some parameters for a short time as the
index was growing to assess the impact, but we did not have time to back out the
changes and run the identical test again in multiple configurations. Regardless, the
results have provided significant insights. Note that this test is focused on the search
engine. The inputs were IPool messages, hence factors such as DCS, database or
Extractor performance are not considered.
One key objective was to determine how indexing performance degrades as the size
of the index grows. Another was to confirm that performance remains acceptable as
partitions are used to store more items – since the historic comfort zone for a search
partition is less than 2 million objects. Finally, we also ran concurrent search load
tests to confirm that searches on large indexes under heavy indexing load return
results in an acceptable time.
Indexing batches of about 2.6 million items each were used for most test runs. The
objects indexed are statistically generated as IPools to simulate typical email ingestion
scenarios, about 2 KB of metadata and 31 KB of full text content. Each batch added
a net increase of about 2 million new objects, although a mixture of metadata updates
and deletes were also included in each batch to simulate real-world behavior with
Content Server. The number of partitions was nominally 8, although variations were
tested. The results are summarized below, with commentary.

The test was seeded with an 8-partition index of about 14 million items. Initially, 12 to
16 partitions were enabled. After each batch of 2 million items was ingested, the
performance was reviewed and occasionally changes made to the configuration of
hardware or the index.
Below 50 million items in the index, an important observation is that the Update
Distributor does not appear to be a bottleneck, despite all data for all Index Engines
passing through the Update Distributor. We see many data points where the overall
throughput exceeds 100 items per second, which would be in the neighborhood of 8
million objects per day.
Once we had confirmed that performance with 16 partitions was relatively high, we
adjusted the number of partitions down to 8, to focus on building larger partitions in the
available lab time. As expected, the throughput with 8 partitions is significantly lower.
By the end of the test, the 8 partitions contained indexes of 10 million objects each. At
this size, the indexing throughput had decreased to just under 30 objects per second.
This is nearly 2 million objects per day, not including excess capacity for downtime or
spikes.
Some interesting data points:
• At about 94 million objects, we enabled more active partitions and observed
that much higher ingestion rates were still possible.
• Around the 30 million object mark, a faulty network card was replaced,
resulting in a material jump in performance.
• During one interval we duplicated the exact same test on the same hardware,
running concurrently. Our indexing tests were not fully engaging the capacity
of the HP hardware, generally staying below 30% CPU use. Doubling the
indexing load on the hardware resulted in dropping the throughput from about
40 to about 30 objects per second for the observed test, although we did
manage to get a peak CPU use above 60%. The duplicate concurrent test
had similar performance characteristics. It would appear that the HP
environment has capacity for a much larger index than we tested, or could also
be used for other purposes such as the Document Conversion Server.
• We disabled CPU hyper-threading for two runs, which reduced throughput
again from about 40 to 30 new objects per second. Lesson learned: leave
hyper-threading enabled for Intel CPUs.
What about searching? Search load tests from within Content Server were performed
concurrently while indexing was occurring. As expected, search became slower as the
index size increased. By test end, with 100 million items and indexing 40 objects per
second, simple keyword searches from the search bar averaged less than 3 seconds,
and advanced search queries about 6 seconds, including search facets. This is not
the search engine time, but the overall time including Content Server.
Does this ingestion case study have relevance for even larger systems? Yes. The
indexing throughput we measured is based on the number of “active” partitions, using
partition biasing. Eventually, you may have many more partitions, but by biasing
indexing to a limited subset, the indexing throughput can be modeled along the lines
seen in this example.
As a final note, this test was performed using Search Engine 10.0 Update 11. A
number of performance improvements, in particular for high ingestion rates, have
been implemented since this test was performed. Consider these data points to be
conservative.

Re-Indexing
Although OTSE has many features that provide upgrade capability and in-place data
correction, there are times when you may want to completely re-index your data set.
If you have a small index, re-indexing is fast and easy. For larger indexes, there are
some performance considerations.
It is faster to rebuild from an empty index than to re-index over existing data. There
are several reasons for this. Firstly, the checkpoint writing process slows down as the
index becomes larger, since there is more data to write to disk. When starting fresh,
the early checkpoint writing overhead is very small. Modifying values is also more
expensive than adding values – searching for existing values, removing them, and
adding new values to the structure is slower than simply adding data to a structure.
Another key factor is the metalog update rules. In particular, the default checkpoint
write threshold is lower for updates than it is for adding new items to the index. This is
a reasonable value during normal operation, but when a complete re-index is in
progress and all objects are being modified, this setting will result in a high checkpoint
overhead. A purge and re-index avoids this problem entirely. If re-indexing very large
data sets, increasing the checkpoint write threshold for replace operations may be a
useful strategy.

Optimize Regions to be Indexed


Don’t index metadata regions that are not needed.
Review the DCS documentation to ensure you have the right level of indexing enabled
for Microsoft Office Document properties. Chances are that unless eDiscovery for
Litigation Support is a key requirement, you can materially reduce your index size by
suppressing indexing of the extra document properties.
Examine your region definitions file ([Link]). Make sure that DROP or
REMOVE is applied for regions which have little value for your business case. Verify
that the most efficient region type is defined for the remaining regions.
If you aren’t sure, you can move a text field to DISK_RET mode to reduce memory,
making it non-searchable. If you later determine that you do indeed need the field to
be searchable, you can change its storage mode back to DISK or RAM and make it
searchable again.

Selecting a Storage System


OTSE is a disk-intensive application. The characteristics of your disk storage can have
a dramatic effect on the performance of your search grid. If you are familiar with
configuring databases, applying similar guidelines to setting up storage for the search
index will normally give you good results.
The search grid comprises dozens of active files per partition; a large search grid will
be reading and writing hundreds of files simultaneously.
Indexing creates new files and performs disk intensive merging of existing files. Both
the Index Engines and Search Engines perform independent operations on these files.
While there is no specific rule or guideline for what constitutes an appropriate disk
configuration, keep the following in mind.
If you don’t care about indexing and search performance, then don’t worry about
configuring a high performance disk system. If your data indexing rate will be low and
search queries don’t require fast response, then you can probably tolerate a low
performance storage system.
Each incremental search partition adds files that need to be managed and accessed.
Accessing many files always impacts query performance. The increased file access
for indexing will usually not be noticeable if you have low rates of indexing, perhaps
below a few typical updates per second (yes – per second. Depending upon the
situation, an Index Engine is capable of indexing 50 or more objects per second).
For maximizing indexing throughput, disk performance is a key parameter, since disk
I/O is usually the limiting factor. Using several sample test setups on similar (but not
identical) configurations in 2012, we measured indexing times with 4 partitions of:
• 390 minutes with a single good SCSI hard disk installed in the computer.
• 270 minutes attached to a lightly loaded storage array with a 10 GB network
connection, running on VMware ESX.
• 5000+ minutes attached to a busy NFS storage array shared with other
applications, with a 10 GB network connection, running on VMware ESX.
Read that last one again. You really can configure disk storage that will reduce the
performance of OTSE by a factor of 20 or more. Disk fragmentation also has an
impact. On Windows, we typically see a 20% indexing performance drop between a
pristine disk and one with 60% file fragmentation.
Note that the caching features of some SANs are too aggressive, and can report
incorrect information about file locking and update times.
Customers using basic Network Attached Storage such as file shares generally report
poor search performance. In general, storing the search index on a network file share
will give very poor results.
The incidence of network errors that customers experience when using either SAN or
NAS is surprisingly high. OTSE has relatively robust error detection and retries for
these cases, but failure of the search grid due to network errors is still possible. When
using any type of network storage for the index, monitoring the network for errors is a
good practice that may prevent a lot of frustration due to intermittent errors.

NOTE: Do not use Microsoft SMB2. The Microsoft SMB2 storage
system caches information in such a way that it does not accurately
report file locking and updates in a timely fashion, resulting in
incorrect behavior of OTSE.

NOTE: Apply Windows NTFS patches. When using SAN storage
with an NTFS file system and large search partitions, some
customers have hit Windows operating system limits for file
fragmentation. The Microsoft Knowledge Base article 967351
contains information about this limit and provides a patch that can
solve the problem for some Windows operating systems.

NOTE: Use Drive Mapping. Wherever possible, use drive
mapping instead of UNC paths for search index components. In
particular, customers have reported instability with Java accessing
drives on Network Attached Storage when UNC path names are
used.

A dedicated physical high performance disk system will usually outperform a network
attached disk system. However, a SAN with high bandwidth often has other benefits,
such as high availability, which make them attractive. If you are configuring a SAN for
use with search, treat the search engine like a database. The performance of the disk
system is almost always the limiting factor in performance.
Any type of network storage is acceptable for index backups. In fact, backing up the
index onto a different physical system is generally recommended.
Finally, a word about Solid State Disks (SSD). SSDs are gaining acceptance for high
performance enterprise storage. The characteristics of fast SSD are a good fit for
search engines. Given the large number of small random access reads that occur
when searching, SSD storage is an excellent choice for maximizing search query
performance. Indexing performance is not as dramatically affected, since the Index
Engines are generally optimized to read and write data in larger sequential blocks.
However, even with indexing, the highest indexing throughputs we have measured in
our labs occurred with local SSD storage for the index, around 1 million objects indexed
per hour. If you need to improve the query performance or indexing throughput,
investing in good SSD storage media for the index is likely the best hardware
investment you can make.

Measuring Network Quality


Some of the most difficult search issues to diagnose are due to errors in the
environment that affect reliability of network communications. Beginning with the
16.2.9 update, OTSE records network problems encountered by the Update Distributor
communicating with Index Engines, and by the Search Federator communicating with
Search Engines. These communications will retry up to 5 times if needed. OTSE counts
the number of retries needed to complete a communication, and the number of times
the communication failed despite retries. The counts are written to log files on an
hourly basis as an extension of the “Index Performance Summary” and “Search
Performance Summary”. In the log files, the column headings are NetIO1 through
NetIO5 for retries, and NetIOFailed for failures. The counts are also included in a
“getstatustext performance” query to the admin port.
Retries and failures indicate problems in the environment and may include unreliable
network cards, bad cables, port conflicts, or virus/port scanners.
By default, recording of the network quality metrics is enabled, and can be disabled in
the [Dataflow_] section of the configuration file by setting the value to false:
LogNetworkIOStatistics=true

Measuring Disk Performance


OTSE keeps rudimentary disk performance statistics that are intended to help identify
when an environment is not performing as expected. During operation, both the Index
Engines and Search Engines track the performance of selected disk operations.
Occasionally, the summary information is written to the log files. The data is cumulative
since startup.
For the Index Engines, the average access time plus histogram data is maintained for
Writes, Syncs, Seeks and Close operations. In a fast environment, the times should
ideally be in the 0-2 millisecond bucket. If there are counts recorded for long periods,
this is a strong indicator that there are performance problems with the storage system.
Disk IO Counters. Read Bytes 0. Write Bytes 154394096.:
Histogram of Disk Writes. Avg 0 ms (381/18979). 0-2 ms (18979).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Syncs. Avg 179 ms (37276/208). 0-2 ms (0). 3-
5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (37). 51-100 ms
(23). 101-200 ms (59). 201-500 ms (87). 501-Inf ms (2).:
Histogram of Disk Seeks. Avg 0 ms (1/78). 0-2 ms (78). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/2). 0-2 ms (2). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:

In addition to the times, the number of disk errors that occur and the number of retries
needed to succeed are recorded. If errors exist, an additional line of this form will be
written:
Disk IO Retries Needed. 1 (7). 2 (6). 3 (8). 4 (2). 5+ (22).
failed (17).
For example, this entry indicates that on 7 occasions, 1 error/retry was required. On
22 occasions 5 or more retries were attempted, and 17 times the disk I/O failed even
with retries.
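
To make the histogram lines easier to work with, note that the "Avg 179 ms
(37276/208)" form appears to be cumulative milliseconds over operation count
(37276/208 ≈ 179). A small hypothetical Python helper that reproduces the bucket
boundaries printed in these logs:

def bucket_label(latency_ms):
    # Buckets as printed in the OTSE histograms; times are whole ms.
    buckets = [(0, 2), (3, 5), (6, 10), (11, 20), (21, 50),
               (51, 100), (101, 200), (201, 500)]
    for low, high in buckets:
        if low <= latency_ms <= high:
            return "%d-%d ms" % (low, high)
    return "501-Inf ms"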
Similarly, the Search Engine reports performance for selected disk operations, writing
entries of this form:
Disk IO Counters. Read Bytes 112711231. Write Bytes 0.:
Histogram of Disk Reads. Avg 0 ms (122/12347). 0-2 ms (12347).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Seeks. Avg 0 ms (3/360). 0-2 ms (359). 3-5 ms
(1). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/1). 0-2 ms (1). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:

By default, reporting of this data is enabled and is written every 25 transactions. The
feature can be disabled and the frequency of reporting can be controlled in the
[Dataflow_] section of the [Link] file:
LogDiskIOTimings=true
LogDiskIOPeriod=25

Checkpoint Compression
There is an optional feature in OTSE that allows Checkpoint files to be compressed.
Checkpoint files can be large, over 1 GB as you exceed 1 million objects in a partition.
New Checkpoint files are written from time to time, usually by all partitions at once,
which can place a significant burden on the disk system.
The compression feature is disabled by default since, in a simple system with a single
spinning disk, compression makes Checkpoint writing CPU bound, and indexing
throughput may decrease by 10% to 15%. However, if you have a system which is
limited by disk bandwidth rather than CPU, then enabling Checkpoint compression
may be a good choice, and actually increase indexing performance. The compression
feature generally reduces the size of Checkpoint files by about 60%. Compression is
enabled in the [Dataflow_] section of the [Link] file:
UseCompressedCheckpoints=true

Disk Configuration Settings


OTSE has a number of configuration variables which can potentially change the
characteristics of disk usage. The default settings are normally appropriate, but
experimenting with some of these parameters may be needed depending on your disk
system. These values are configured in the [Link] file.
Delayed Commit
Some Storage Area Networks (SANs) with slow SYNC characteristics require a delay
between certain types of operations. While normally 0, this setting will insert a pause
between key disk operations that improves system stability in these cases:
DelayedCommitInMilliseconds=10

Chunk Size
Some storage systems are sensitive to the chunk size when reading or writing data.
The default is 8192. Although we normally recommend the Java default of 32768, this
parameter can be forced to a smaller maximum value if necessary in the [DataFlow]
section of the [Link] file:
IOChunkBufferSize=8192
Query Parallelism
The Search Federator asks each Search Engine to return results. There are two key
performance tuning values in this process in the [Link] file. The first is how
aggressive the Search Federator will be with respect to asking Search Engines to pre-
fetch results to keep the Search Federator result merging queue full. The default value
of 0 is used to pre-fetch as much as possible, measured in terms of Search Engine
result blocks. Setting this number higher will delay pre-fetching, which can reduce the
number of results fetched but introduces delays into result retrieval. For example, a
value of 3 will wait until a Search Engine has been asked for 3 blocks of results before
beginning to pre-fetch results.
MergeSortCacheThreshold=3
The other parameter is the number of results a Search Engine fetches each time the
Search Federator asks for a set of results. The default value is 50. Larger values are
more efficient when the typical query is for many results. Smaller values are more
efficient for typical relevance-driven queries. In general, if using the preload above, a
value of 20 to 50 is likely optimal, and reduces the potential load on the disk system.
MergeSortChunkSize=50
These values are multiplicative with the number of partitions. For example, if you have
8 partitions and a MergeSortChunkSize of 250, then the MINIMUM number of results
that the Search Engines together will provide to the Search Federator is 2000. Keeping
the MergeSortChunkSize value low for systems with many partitions is recommended.
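
The multiplicative effect is easy to check. A one-line Python illustration of the example
above:

def min_results_fetched(num_partitions, merge_sort_chunk_size):
    # Each Search Engine returns at least one chunk, so the Search
    # Federator receives at least this many results in total.
    return num_partitions * merge_sort_chunk_size

print(min_results_fetched(8, 250))  # 2000, as in the example above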
Throttling Indexing
In some environments, it may be the case that indexing operations are creating
metalogs faster than they can be consumed by the search engines. There is an upper
limit on how many unprocessed metalog files are acceptable, which can be adjusted if
necessary should Search Engines chronically lag behind the Index Engines. This can
happen in environments in which long-running search queries tie up the Search
Engines at the same time that high indexing rates are occurring. In some cases this
problem can be resolved by configuring Search Federator caching. When this limit is
reached the indexing updates will pause to allow the Search Engines to close the gap.
AllowedNumConfigs=200
In situations where queries are constantly running, it may be necessary to force a
pause in processing search queries in order to give the Search Engines an opportunity
to consume the index changes. There are two settings to control this: one specifies the
maximum time that queries are allowed to run continuously (thus blocking updates),
and the other the duration of the pause that is injected into searching. By default,
this feature is disabled.
[SearchFederator_xxx]
BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInMS=30000

Small Read Cache


The Search Engines have an optional feature to reserve memory for a disk read cache,
which can buffer recent small blocks read from the index during queries. Testing on a
typical index showed a reduction of read operations of up to 17%. If you enable this
feature, ensure that you do a before/after set of timing tests. While a benefit is typical,
some environments show a small performance degradation of a few percent in queries.
By default, this is disabled (set to 0); it is configured in the [Dataflow_] section of the
[Link] file. There is very little measured benefit to exceeding 10 MB per Search Engine
for this optimization.
SmallReadCacheDesiredMaximumSizeOfSmallReadCachesInMB=5
File Retries
Experience has shown that disk reads and writes are not always reliable, especially
when low-performance disk systems such as NAS or distributed file systems are in
use. To ensure correct operation in these environments, most file accesses will detect
errors and retry operations multiple times. The delay between retries is about 2
seconds times the attempt number, so for a retry count N, the total retry time is
N*(N+1) seconds (e.g. if N is 5, up to 30 seconds). In update 21.1, this setting was
extended to cover retries for reading the livelink.### files (aka [Link] files). Using
these types of disk environments is strongly discouraged; even when they operate
correctly, they can be extremely slow. The number of retries is adjustable, and
defaults to 5.
NumberOfFileRecoveryAttempts=5
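
The pacing described above can be sketched as follows; this is an illustration of the
retry arithmetic, not the OTSE file layer itself:

import time

def with_retries(operation, retries=5):  # cf. NumberOfFileRecoveryAttempts
    for attempt in range(retries + 1):   # one initial try plus N retries
        try:
            return operation()
        except OSError:
            if attempt == retries:
                raise
            time.sleep(2 * (attempt + 1))  # 2, 4, 6, ... seconds
    # Worst-case total wait: 2 * (1 + ... + N) = N * (N + 1) seconds.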

Indexing Large Objects


The default settings of OTSE and the Document Conversion Server are designed to
handle all normal document sizes. Text is typically truncated to 10 or 20 MB, which
accounts for all but the very largest of documents. This document, for example,
contains well under 1 MB of text. The text in very large documents is often of little
value, and because terms repeat, the first 10 MB usually yields the same matches.
Note that the
amount of text in a document is often very much smaller than the file size – for example,
a PowerPoint file might be 100 MB in size, but contain only 10 KB of actual text.
However, there are situations where all of the text in very large documents must be
indexed. In experiments, we have successfully indexed documents comprised of more
than 200 MB of text. In order to achieve this, the Engines will need to have significant
spare memory (gigabytes). This is effectively done by setting the metadata memory
size to a large value (say 6 GB) with a maximum allowable utilization of 30%. In one
experiment, we measured indexing success with available memory of approximately
(100MB RAM + 8MB RAM/1MB TEXT), in both index and search engines. For
example, a 200 MB text file succeeded with 1.7 GB of available memory for processing.
This experiment occurred before Update 16.2.3, where the worst case scenario would
require available RAM equivalent to 7x the size of the text. Beginning with 16.2.3, the
situation has improved, with a worst case RAM requirement of 3x the size of the text.
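
The pre-16.2.3 measurement above can be expressed as a quick estimator; a sketch
under the stated assumptions, not a sizing guarantee:

def required_ram_mb(text_size_mb):
    # Measured rule of thumb: 100 MB base plus 8 MB of RAM per 1 MB of
    # text, needed in both the Index Engine and the Search Engine.
    return 100 + 8 * text_size_mb

print(required_ram_mb(200))  # 1700 MB, about the 1.7 GB measured above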

The truncation size will also need to be adjusted upwards from about 10 MB to the
desired size, perhaps 210 MB. The timeouts for the Index Engines may also need to
be increased. Changes to settings in the Document Conversion Server will also be
required, including allocating more memory, adjusting truncation limits, and providing
much longer timeout values for processing formats.

Servers with Multiple CPUs


Large servers with multiple physical CPUs may require special consideration. Several
customers have experienced very slow operation with high-end expensive hardware,
which is counter-intuitive. Investigation has identified that systems with a Non-Uniform
Memory Access (NUMA) architecture need to be carefully configured. OTSE and the
Admin Servers do not have any special handling for execution on NUMA nodes. The
operating system tools are relied on for optimizing processor affinity. In most cases,
the default behavior of the operating system will allocate processes and threads such
that there is no problem in a system with multiple NUMA nodes.
In a NUMA system, memory is partitioned with fast access to one CPU, and much
slower access by the other CPUs. OTSE uses many threads for execution, and the
operating system could assign different threads for the same Search Partition to
different physical CPUs. Tasks undertaken by the threads on CPUs not attached to
the memory take about 5 times longer to execute, in part because of slower memory
access, but also because serial interconnects between the CPUs must be used to
synchronize caches.
One approach to resolving this issue is to use operating system tools to pin applications
to physical CPUs. In a Content Server environment, Search Engine processes are
started and ‘owned’ by an Admin Server. It may therefore be necessary to set the
affinity of an Admin Server and all of its attached processes to a single CPU. This in
turn may require changing the number of Admin Servers in use and allocating Search
Engine processes to the Admin Servers to meet your performance goals. In the
Content Server environment, the Document Conversion Server may likewise need to
be adjusted.
The tools used to analyze the allocation of applications to CPUs and to pin applications
to CPUs vary by operating system. You may wish to investigate the use of some of
the following operating system functions for optimizing execution on NUMA nodes:
Linux: taskset, numactl
Solaris: priocntl, pbind
Windows: start /NODE (may require hotfix to [Link])
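
For example, on Linux, a hypothetical invocation pinning an Admin Server (and the
processes it starts) to NUMA node 0 might look like the following; the actual start
command varies by installation:

numactl --cpunodebind=0 --membind=0 <command that starts the Admin Server>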
If you are running OTSE in a Virtual Environment, the VM tools will often have
processor and NUMA node affinity controls that may also be used to set node affinity.
Note that these considerations only apply to servers with multiple physical CPUs.
There is no scalability performance issue associated with many cores on a single CPU.

Virtual Machines
In principal, virtual machines should be indistinguishable from physical computers from
the perspective of the software. In practice, there are occasionally problems which
arise from running software in a virtual environment. OTSE is known to operate with
VMware ESX, Microsoft Hyper-V, and Solaris Zones. However, OpenText cannot
realistically test and certify every possible combination of hardware and virtual
environment, and there may be configurations of these virtual environments that
OpenText has not encountered which might be incompatible with the search grid.
The most important point is this: virtual machines do NOT reduce the size of the
hardware you need to successfully operate a search grid. If anything, operating a
search grid in a virtual environment will require MORE hardware to achieve the same
performance levels, when measured in terms of memory and CPU cores/speed.
For small installations of the search grid where performance issues are not a factor, a
virtual environment can be attractive. However, as your system increases in size to
require many partitions, be aware that a virtual environment may be more costly than
a physical environment for the search grid, which needs to be considered against VM
benefits such as simplified deployment and management. Consider a search engine
as being analogous to a database. For larger or performance-intensive database
applications, the database is often left on bare metal, even if the remainder of an
application is virtualized. The Search Engine has performance characteristics similar
to a database and it may make sense to leave the Search Engine on dedicated
hardware.
One example of a limitation we have seen is virtual machines in a Windows server
environment. In some cases, the I/O stack space is not sufficient once the extra VM
layers are introduced, and tuning of the Windows settings to increase I/O resources
becomes necessary.
As with most applications deployed in a virtual environment, the software runs slower.
The change in performance depends on many factors, but a 10% to 15% performance
penalty is not uncommon.
We have also seen instances in which the memory used by Java in a VM environment
is reported as much higher than the equivalent situation on bare hardware. In practice,
the actual memory in use is very similar, but the reported values can differ wildly. Often,
over a period of many hours, the reported VM memory will decline and converge on
memory consumption reported on a bare hardware environment.

Garbage Collection
The Java Virtual Machine will generally try to optimize the number of threads it
allocates to Garbage Collection. However, it is not always correct. For example, when
running in a Solaris Zones environment, the “SmartSharing” feature of Zones can
trigger the Java Garbage Collector to allocate very large numbers of threads and
memory resources, which in Zones may be manifested as Solaris Light Weight
Processes (LWPs).
If the number of threads on a system allocated to Garbage Collection seems unusually
large, you likely need to place a limit on the number of Garbage Collection threads,
which can be done by modifying the Java command line to add
-XX:ParallelGCThreads=N, where N is the maximum number of threads. Selecting N
may require experimentation, but values on the order of 8 are typical for a system with
8 partitions, and values over 16 may provide little or no incremental value.
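
For example, a Java command line with the limit applied might include the option as
shown; the remaining options and main class (elided here) are installation-specific:

java -XX:ParallelGCThreads=8 <existing JVM options and main class>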

File Monitoring
Some tools that monitor file systems can cause contention for file access. One known
example of this is Windows Explorer. If you browse to a folder used by SE 10.5 to
represent the search index using Windows Explorer, then you will likely cause file I/O
errors and a failure of the search system.

Virus Scanning
The performance impact of virus scanning applications on the search grid is
catastrophic because of the intense disk activity that the search grid performs. In some
cases, file lock contention can also cause failure or corruption of the index. You must
ensure that virus scanning applications are disabled on all search grid file I/O. The
search system only indexes data provided by other applications. If virus scanning is
necessary, then scanning the data as it is added to the controlling application (such as
Content Server) is the recommended approach.
Related to this, we see virus scanners now offering port scanning features as well.
Like virus scanners, we have found that port scanners can significantly reduce
performance or cause failure of the software.

Thread Management
OTSE makes extensive use of the multi-threading capabilities of Java. In general, this
leads to performance improvements when the CPUs have threads available. However,
for very large search grids with over 100 search partitions, the number of threads
requested by OTSE may exceed the default configuration values for specific operating
systems. Depending upon the operating system, it is usually possible to increase the
limits for the number of usable threads. This problem is less likely to occur when
running with socket connections instead of RMI connections.
Configuring an operating system to permit more threads for a single Java application
is beyond the scope of this document, and may also include tuning memory allocation
parameters for the JRE. The objective here is simply to make you aware that additional
system tuning outside the parameters of OTSE may be necessary.

Scalability
This section explores various approaches to scaling OTSE for performance or high
availability. OTSE does not incorporate specific scalability features. Instead, by
leveraging standard methods for system scalability with an understanding of how the
search grid functions, we can illustrate some typical approaches to search scalability.

Query Availability
The majority of customers that desire high availability are generally concerned with
search query performance and uptime. Usually, this is tackled by running parallel sets
of the Search Federators and Search Engines in ‘silos’, with a shared search index
stored on a high availability file system.

To obtain the benefit of high availability, the search silos should be located on separate
physical hardware in order to tolerate equipment failure.
Search queries are not stateless transactions; they consist of a sequence of operations
– open a connection, issue a query, fetch results, and close the connection. Because
of this, simple load balancing solutions cannot easily be used as a front end for multiple
search federators. Instead, the application issuing search queries should have the
ability to direct entire query sequences to the appropriate silo and Search Federator.
Content Server is one such application. If multiple silos are configured, search queries
will be issued to each one alternately. In the event that one silo stops responding,
Content Server will remove that target from the query rotation. Refer to the Content
Server search administration documentation for more information.
In this configuration, the Search Engines share access to a single search index. This
works because Search Engines are “read only” services which lock files that are in
use. All changes to the Search Index files are performed by the Index Engines. When
a Search Engine is using an index file, it keeps a file handle open – effectively locking
it. The Index Engines will not remove an index file until all Search Engines remove
their locks on a fragment. Because these locks are based on file handles in the
operating system, a Search Engine which crashes will not leave locks on files.
When Search Engines start, they load their status from the latest current checkpoint
and index files, and apply incremental changes from the accumlog and metalog files.
Because of this, no special steps are needed to ensure that Search Engines in each
silo are synchronized. They will automatically synchronize to the current version of
the index.
It is possible for an identical query sent to each silo at the same time to have minor
differences in the search results. The differences are rare, probably small, and short
lived – and would not be noticed or important for most applications. These potential
variances arise due to race conditions. The Search Engines in each silo update their
data autonomously. When an Index Engine updates the index files, perhaps adding or
modifying a number of objects, the Search Engines will independently detect the
change and update their data. For a short period of time, a given update to the search
index may be reflected in one of the search silos but not the other.
This approach to high availability for queries also allows many search grid maintenance
tasks to be performed on Search Federators or Search Engines without disrupting
search query availability. By stopping one silo, performing maintenance, restarting the
silo, and then repeating the process with the other silo, user queries are not impacted
throughout the process. Note that some administration tasks which change
fundamental configuration settings may not be possible without service interruption.
An additional benefit of parallel silos is search throughput. Since applications such as
Content Server can distribute the query load across multiple silos, the overall search
performance might be higher. This will not be the case if the hardware on which the
search index is stored is a performance bottleneck, particularly the disk which is shared
by each silo.
For correct operation, each silo must have identical configuration settings. If you have
hand-edited any of the configuration files, you must ensure this is properly reflected on
both silos.

Indexing High Availability


By its nature, search indexing is not typically a real-time application. Objects for
indexing are placed in IPools (which are queues), then prepared by DCS, and added
to the index in batch transactions. By definition, there is latency and delay in these
processes, which vary based on many factors, including the indexing throughput, the
size and types of objects being indexed, or the number of objects in an IPool.
Because of this, high availability for search indexing is not a requirement for most
customers. Search queries can be available even should the indexing operations be
down. Given the cost of adding duplicate equipment for high availability and the limited
benefits, it is rare that you would need to implement indexing high availability.
What is commonly required is a way to redeploy or recover in a reasonable time frame
if indexing dies. Configuring the indexing system on a virtual machine is one possible
approach to reducing the recovery and redeployment times for search indexing. In the
event that a system fails, the VM images for the indexing processes can be rapidly
deployed on other hardware.
Within Content Server, the “Admin Server” component is also available. The Admin
Server will monitor the indexing processes, and is capable of restarting them in the
event that unexpected errors occur.
If you absolutely must have true high availability for indexing, this must be implemented
using technologies external to the search grid, with a combination of configuration
settings and external clustering hardware or software. The general principle is that two
completely separate search grids are created, the indexing workflow is split and
duplicated, and the indexes are independently created and managed. This is an
exercise pursued using products such as Microsoft Cluster Server, and is beyond the
scope of this document.

Sizing a Search Grid


Determining how a hardware environment should be sized for search is not always a
simple task. There are many variables that can affect this. While there is no firm
formula for estimating hardware requirements, this section will examine some of the
common considerations for search index sizing. The bottom line – experimenting in a
test environment with your actual application and representative data is the only way
to make solid predictions about how search grid hardware should be sized.

Minimizing Metadata
Many Content Server applications index much more metadata than is actually used in
searches. Using the LLFieldDefinitions file to REMOVE metadata fields that will never
be used can minimize the RAM requirements.

Metadata Types
By default, metadata regions are created as type TEXT. Integer, ENUM, and Boolean
types are more efficient, and using the LLFieldDefinitions file to pre-configure types for
these regions can reduce the RAM requirements.

Hot Phrases and Summaries


New installations of Content Server should consider configuring the OTHP and
OTSummary regions as “Retrieve Only” regions, which can reduce RAM requirements
by 25% or more depending on the type of data you index.

Partition RAM Size


Each partition brings with it a relatively fixed overhead of several hundred MB of RAM.
Added to this is the memory that each partition uses to store metadata. The larger the
memory allocated to a partition, the fewer partitions are required, and thus the lower
the overall RAM requirement. However, simply using large partitions is not always the
correct approach, since the performance of a partition declines slightly as it grows in
size.
Content Server 9.7.1 installations on Windows or Linux use a 32-bit JRE, which
places an upper limit of about 1 GB on the partition RAM size. Note also that it is
relatively easy to allocate more RAM to partitions to make them larger, but difficult to
break a large partition into smaller ones. If you have performance bottlenecks in
indexing or searching, splitting the index into multiple partitions can sometimes
improve performance by leveraging the parallelism of multiple partitions, although this
may require additional CPU cores or disk bandwidth to leverage the parallel capabilities
for performance.
Sample Data Point
Our sample system is comprised of a relatively typical mix of Content Server data
types from a “document management” application, including some use of forms
and workflow. There are several hundred core metadata regions, and several
thousand lesser-value metadata regions from applications such as Workflow.
In “RAM” mode, without tuning, a default 1 GB partition holds about 1.5 million
objects. Using a 3 GB partition size, we measure 4.4 million objects using about
2.5 GB of RAM for metadata.
In “DISK” mode, with the same data we can index the same 4.4 million objects
using a little more than 1.8 GB of RAM for metadata, which is roughly a 2.5
GB partition.
In “Low Memory” disk mode, the same 4.4 million objects require about 700
MB of RAM, which can be done in a 1 GB partition. We extrapolate that a
2 GB partition in Low Memory mode can potentially handle up to 10 million indexed
objects from Content Server.
The general guideline using Low Memory mode with Content Server is that you can
expect a partition to accommodate 7 to 10 million typical Content Server objects with
reasonable performance using a 2 GB RAM partition size. The overall conservative
memory budget for such a partition is approximately 6 GB (2 GB RAM + 1 GB overhead
and Java for each of the Index Engine and Search Engine).
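
As a quick illustration of that budget, a sketch assuming the guideline numbers above:

def partition_budget_gb(metadata_ram_gb=2, overhead_gb=1):
    # Both the Index Engine and the Search Engine of a partition load
    # the metadata, and each carries roughly 1 GB of overhead and Java.
    return 2 * (metadata_ram_gb + overhead_gb)

print(partition_budget_gb())  # 6 GB, matching the guideline above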
Memory Use
When running a Java process, the amount of memory it may use is specified on
the command line. Java can be aggressive about consuming this memory. You
may be able to operate a partition with 1 GB of RAM, but if you made 8 GB of
memory available, Java may consume all of it. This memory use can be
misleading when analyzing resources used by a search partition.

Redundancy
If you are building a high availability system with failover capabilities, the hardware
must be suitably duplicated.

Spare Capacity
In the event that there are maintenance outages, or a requirement to re-index portions
of your data, you will need spare CPU capacity to handle this situation. Although OTSE
is a solid product, indexing problems can happen – generally incorrect configuration or
network/disk errors, although (perish the thought) there are occasionally bugs found.
Sizing the hardware to meet the bare minimum operating capacity won’t allow you any
headroom to recover from problems.

Indexing Performance
As with all sizing exercises, making predictions is fraught with danger. Ignoring the
peril, our anecdotal experience is that the Index Engines can ingest more than 1
Gigabyte of IPool data per hour.
A specific example on a computer that we frequently use for performance testing:
• Windows 2008 operating system, 2 Intel X5660 CPUs, 16 Gbytes RAM
• Update Distributor
• 4 Index Engines / partitions
• Partition metadata size of 1000 Mbytes
• Index stored on a single SCSI local hard disk
• Predominantly English data flow
This configuration consumes more than 4 GB per hour, comprising nearly 200,000
objects added or modified per hour. Usually, high performance indexing is limited by
disk I/O capacity. Refer to the Hard Drive Storage section for more information.
Beyond about 4 partitions, the performance of the Update Distributor becomes a factor,
and you may need to ensure that the disk read capability for the indexing IPools is
adequate.

CPU Requirements
There is no single rule for the number of physical CPUs needed for a search grid. Don’t
rely on hyper-threading – physical CPUs are key. The requirement is directly related to
your performance expectations. Some of the variables you should bear in mind are
outlined here.
Most customers optimize for cost and have low CPU counts. This means that search
works, but user satisfaction with performance may be low.
Active searches are CPU intensive. If good search time performance is expected, you
should have at least 1 CPU per search engine. This is especially true if multiple
concurrent searches will be running.
Searches are bursty in nature. CPUs will sit idle until a search request arrives, then
saturate the system. Administrators will tend to look at the average CPU use over time,
and claim that utilization is low, therefore no additional CPUs are needed. They are
wrong. Check to see if CPU utilization hits high levels during active searches, then
plan your CPUs based on load during that period.
Search Agents (Intelligent Classification, Prospectors) place an additional load on the
Search Engines. If you are using these features heavily, you may need to allow some
additional fractional CPU capacity. Search Agents run on a schedule, so they have no
impact most of the time, but a heavy potential impact when they run.
Indexing is expensive. If you need high indexing throughput, you should have at least
1 CPU per active partition, plus 0.25 CPU per inactive partition, plus 1 CPU for the
Update Distributor. With low indexing throughput requirements, 1 CPU for 4 Index
Engines may suffice.

In addition, spare capacity is needed on the Index Engines for events such as running
index backups, writing Checkpoints, and performing background merge operations.
These operations are designed to limit activity to a subset of partitions concurrently
(default about 6). You can choose degraded indexing during these periods or allocate
additional CPUs.
As an example, suppose you want good search performance with many searches
being run (including searches for background RM disposition and hold), and expect
hundreds of thousands of indexing additions and updates every day, on a medium to
large system with 40 partitions (perhaps 500 million items), configured with 6 active
partitions (the number of partitions that accept new data, write Checkpoints, and
merge concurrently):
1 CPU – Update Distributor
6 CPUs – Active Index Engines
8 CPUs – Update-only Index Engines
40 CPUs – Search Engines with fast response
Assuming indexing throughput can tolerate short slowdowns for background
operations, with no extras, over 50 CPUs is an appropriate size. Conversely, the same
system which can tolerate large backlogs for indexing (perhaps catching up in the
evenings) and is comfortable with users waiting 20 seconds on average for a search
can probably get by with 16 CPUs.
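
The example tallies up as simple arithmetic, using the ratios given earlier in this
section; a sketch, not a sizing formula:

def cpu_estimate(partitions=40, active=6):
    update_distributor = 1
    active_index_engines = active                # 1 CPU per active partition
    update_only = (partitions - active) * 0.25   # 0.25 CPU per update-only partition
    search_engines = partitions                  # 1 CPU per Search Engine
    return update_distributor + active_index_engines + update_only + search_engines

print(cpu_estimate())  # 55.5, i.e. "over 50 CPUs" for this example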

Maintenance
As with all sophisticated server software, there are a number of suggestions, best
practices and configurations that contribute to the long-term health and performance
of the system. This section outlines some of the considerations.

Log Files
Each OTSE component has the ability to generate log files. There are separate log
files for each instance of each component. The basic settings are:
Logfile=<SectionName>.log
RequestsPerLogFlush=1
IncludeConfigurationFilesInLogs=true
Logfile= specifies the path for logging (the file name is generated from the component
and the name of the partition). RequestsPerLogFlush specifies how many logging
events should be buffered before writing. A value of 1 is the least performant, but does
the best job of guaranteeing that logging occurs if something crashes unexpectedly.
At startup, information about the version of OTSE and the environment is recorded in
the form of copies of the main configuration files, and can be used to verify that the
correct versions of software are running. This can be disabled by setting
IncludeConfigurationFilesInLogs to false.


Log Levels
Each log file has a configurable level of detail. The log level for each component of
the search engine is separately configured in the search.ini file:
DebugLevel=0
The available log levels are:
0 – Lowest level; only "guaranteed logging" output occurs
1 – Severe errors are logged
2 – All error conditions are logged
3 – Warnings are logged
4 – Significant status information is logged
5 – Information level, most detail
If you are experiencing problems that require diagnosis, setting the log level to 5 is
recommended. You do not need to restart the search engine processes to change the
DebugLevel; it is a reloadable setting.

Log File Management


OTSE supports several methods for managing log files. For most installations, using
the rotating log file method is recommended. The rotating method cycles through a
fixed number of log files of a configurable size, ensuring that the space used for log
files is bounded, and also ensuring that the latest portions of the log files are available
if an error occurs. The rotating log file method also sets aside the startup portion of
the log file, since the startup information often provides valuable debugging
information. The rotating log file parameters in the search.ini file are:
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
These settings request that 25 log files of 25 MB each are retained and, in addition,
that the last 10 log files from startup of the component are retained, also of 25 MB
each.
The logging method to be used is set in the search INI file:
CreationStatus=0
Where the acceptable values are:
0 – Append new data to existing log file
1 – Replace the existing log file each time the component starts
2 – Create a new log file on startup, rename the old one to current date/time
3 – Log to console. Windows only – don’t use this
4 – Rolling log files
Value 3, log to console, should generally NOT be used. It is listed here for
completeness, but is not a production-grade implementation.
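Putting the pieces together, a rolling log file configuration (a sketch using only the settings described above) would look like this:

CreationStatus=4
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10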


RMI Logging
The RMI logging section determines how the RMI Registry component performs
logging. It is defined in the General section, and the behavior is similar to the
descriptions above; however, the names of the settings in the search.ini file are different:
RMILogFile ---> Logfile
RMILogTreatment ---> CreationStatus
RMILogLevel ---> DebugLevel

Backup and Restore


In many applications, the ability to search the index is considered essential. If an index
backup is not available and the index is destroyed, a complete re-indexing of the data
is needed. Depending on the size of your data, this may take an unacceptably long
time. In spite of this, many customers do not back up their search index on a regular
basis, which eventually leads to considerable pain when a hard disk fails. Best practice
is regular backup of the search index.
The “Backup and Restore” section of this document contains additional information on
the mechanics of managing index backups.

Application Level Index Verification


The verification tools available with OTSE can determine whether the index is self-
consistent. These built-in tools can verify that checksums are correct, file locations
and names are as expected, and other structural elements are intact. You should use
these tools any time you suspect the disk data may be corrupted.
These tools cannot verify whether the index contains the objects expected by the
application using the index. For this reason, applications should use the OTSE
features to implement a higher level of index verification.
Content Server, for example, provides an index verification feature within its search
administration pages. You should refer to the Content Server documentation for
details. This tool checks to ensure that the objects in the index match the objects
currently being managed, and that the indexed object quality is appropriate. The
Content Server Index Verification tool can also issue updates to the index to correct
discrepancies by adding, removing or updating objects, and it generates a status report
upon completion.

Purging a Partition Index


In the event that disk errors or other system events render the index files for a particular
search partition completely unusable, you may wish to reset the index to an empty
state. Be warned that this is a destructive process: the index will be lost. This
approach should only be taken as a last resort. It is a good idea to back up the files
before attempting this.
Step 1
Ensure that the Index Engine and Search Engine for the partition are stopped. In some
cases, the processes might have started even though the index is corrupted. For
example, if only the index offset files are corrupted, searching might still occur while
further indexing is prevented.
Step 2
Check the IndexDirectory= setting in the [Partition_xxxx] section of the
search.ini file to be certain which directory you should work in. Certain key files in the
index partition directory need to be preserved, and all other files in the directory
removed. The files that must be KEPT are:
Signature file (partition name with .txt extension, typically of the form
[Link])
ALL the .ini configuration files, which includes:
[Link]
Backup process definition files
Step 3
Create an empty file in the partition index directory named [Link]. At this
point the directory should have only the INI files, the signature file, and [Link].
Step 4
Start the Index Engine. It will create a new, empty search index.

Security Considerations
OTSE does not directly implement any application security measures. However, the
interfaces to the search components are well defined, and if necessary can be locked
down using standard computer and network security tools.
A quick checklist of security access points that should be considered if you are
contemplating securing access to OTSE and the index:
• Socket API ports
• RMI API ports
• Access to folders where OTSE stores the index on disk.
• Access to the configuration files – search.ini, search.ini_override,
[Link], [Link].
• Access to create indexing requests, written to an input IPool folder.
• Access to logging files or folders.
• Access to the search agents configuration file.
• Access to the search agents output IPool.
• Execute permissions for launching the application.
• Folders used in backup and restore operations.


• Java security policy


As mentioned in the performance tuning section, you should not run virus scanning
applications against the search index. The performance degradation in these cases
is severe. Virus scanning should be implemented upstream, where objects are added
to Content Server or the ECM application that is using the search technology.

Java Security Policy


To improve security, OTSE can leverage a Java Security Policy file. By default, no
policy file (or a permissive policy file) is provided. However, when present, the policy
file can be used to enforce restrictions such as which IP addresses can connect using
sockets. The policy files to be used are specified in the search.ini file, and are located in
the config directory. The features of the policy file are standard Java capabilities, and
are not documented here.
A typical policy file might look like this:

grant {
permission java.io.FilePermission "<<ALL FILES>>", "read, write, delete, execute";

permission [Link] "[Link]";
permission [Link] "[Link]";
permission [Link] "[Link]";
permission [Link] "[Link]";
permission java.lang.RuntimePermission "accessClassInPackage.*";

permission java.util.PropertyPermission "[Link]", "read";
permission java.util.PropertyPermission "[Link]", "read";
permission java.util.PropertyPermission "[Link]", "read, write";
permission java.util.PropertyPermission "[Link]", "read, write";
permission java.util.PropertyPermission "problemGenerator", "read";

permission java.lang.RuntimePermission "setIO";

permission java.util.PropertyPermission "[Link].*", "read, write";
permission [Link] "setDefaultAuthenticator", "read, write";
permission java.util.PropertyPermission "[Link]", "read, write";

permission java.net.SocketPermission "[Link]", "accept, connect, listen, resolve";
permission java.net.SocketPermission "[Link]", "accept, connect, listen, resolve";
};


The IP Whitelist capability is illustrated in the SocketPermission entries above. Both
IPv4 and IPv6 forms are accepted. This file is created by Content Server if the
administrator enables the IP Whitelist feature in the search administration pages.
The security policy file applies to the Update Distributor, Index Engines, Search
Federators and Search Engines. If RMI is used, the RMI Registry component
distributes the policy file to the other components. If socket communications are being
used instead of RMI, then each component loads the security policy file independently.
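For illustration, a whitelist entry restricting socket connections to a single IPv4 address and port range might look like this (the address and ports shown are hypothetical; the line uses standard Java SocketPermission syntax):

permission java.net.SocketPermission "10.0.0.15:8500-8600", "accept, connect, listen, resolve";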

Backup and Restore


Having a reasonably current backup of the search index is a key part of maintaining
your system. Index backups are instrumental in restoring service in the event of index
corruption or for disaster recovery.
The backup methods described here back up or restore only the index. You must ALSO
ensure that you have an appropriate copy of the supporting files available in the event
that these change between the time a backup is taken and a restore occurs. This
includes configuration files, such as search.ini, search.ini_override, [Link],
[Link] and any custom tokenizers, thesaurus or similar files. In most cases, these
will be unchanged, and some of these files are re-generated by Content Server;
however, making a copy of these files is strongly recommended.
There are three different approaches to backup and restore operations. The
recommended approach is to use the backup feature first made available in Content
Server 16.2.9 (June 2019), which is described first below.
The second approach is to stop the search system and use operating system file copy
utilities. The index is a collection of files, so this approach is relatively easy, but it has
the undesirable requirement that search and indexing are disabled for the duration. As
systems grow large (many TB of index), an outage for backups becomes material.
The third approach is to use the backup and restore utilities. These utilities support
both complete and differential backups, and have been in the product for many years.
They are superseded by the backup commands in Content Server 16.2.9, but remain
in the search engine to support older versions of Content Server (e.g. Content Server
16.0). This approach is being deprecated, since it is complex to understand and
manage, and most customers have abandoned its use.

Backup Feature – Method 1


The backup command is an instruction to the Update Distributor to create a complete
set of index backups. The Update Distributor will communicate with each Index
Engine, instructing the Index Engine to create a backup of its associated partition. In
a large search grid, the Update Distributor will manage the number of Index Engines
that are creating backups concurrently to ensure that CPU and disk capacity are not
abused.
The backup process does NOT require a search outage. Indexing and search
operations continue, subject to possible impacts of additional CPU and IO used by the
backup process. This method creates a complete backup of the grid. The backup
does not represent a single moment in time – each partition may have a different
capture time. The Index Transaction Logs can be used in conjunction with the backups
to reconstitute a current index from the backups.
There are several configuration settings that control the behavior of the backup
process. In the [UpdateDistributor_] section of the search.ini file:
BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false
The BackupParentDir field specifies where the backups should be written. This must
be a drive mapping that is visible to all the Admin servers running search indexing
processes. Within this directory, a sub-directory named with the time the backup starts
will be created, and within that directory each Index Engine will create a directory,
using the partition name, to store the index. You must have enough space available
to capture a complete copy of the index. The MaximumParallelBackups setting
determines how many Index Engines can be running backups concurrently. This
number should reflect the CPU and disk capacity of your system. The
BackupLabelPrefix is optional and can be used by a controlling application to help track
status. The ControlDirectory is optional, allowing you to override the default location
for control files used to manage the backup process. KeepOldControlFiles is included
for completeness and is generally reserved for running test scenarios. Except for
ControlDirectory, these settings can be reloaded (changed without a restart). However,
some of the settings are only used at the start of a backup, and best practice is to make
changes only when no backup is running.
The admin port on the Update Distributor will listen for and respond to the following
commands related to creating backups:
backup
backup pause
backup resume
backup cancel
getstatustext
Backup is used to start a new backup process. Cancel and pause will complete writing
backups for the partitions that have already been instructed to create backup files. This
may take several minutes, so status checks include a "pausing" status (note that some
partitions may still be writing their output even though the status is "paused"). Resume
will continue a paused backup. The response to backup commands is "true" if the
command has been accepted and acted upon, and "false" otherwise.
Getstatustext responses are extended to include information about backups. Status
includes: None; InProgress; Completed; Paused; Pausing; Cancelled; Failed. The
details about the backups are returned as XML elements in a getstatustext operation,
along these lines:

<BackupStatus>
<InBackup>InProgress</InBackup>
<BackupLabel>MyLabel_20190322_112519734</BackupLabel>
<TotalPartitionsToBackup>10</TotalPartitionsToBackup>
<PartitionsInBackup>4</PartitionsInBackup>
<PartitionsFinishedBackup>0</PartitionsFinishedBackup>
<BackupDir>C:\p4\search.ot7\main\obj\log\ot7testoutput\BackupGridTest_testBackup5199Ten4\backups\20190322_112519734</BackupDir>
<BackupMessage></BackupMessage>
</BackupStatus>

Up to 3 BackupStatus elements may exist, one each for “InProgress” (including Paused
or Pausing), “Cancelled” (including Failed) and “Completed”.
The Update Distributor persists the status and progress of backups in files named
upDist.#. A file named [Link] is used to track the current version of the [Link].
Because this data is persisted, any configured backup will resume and complete even
if the Update Distributor is stopped and started. By default, these files are stored in
the same directory that contains the Update Distributor log files. The file contents are
similar to this:
UpDistVersion 1
BackupStatus Completed
BackupTimestampString 20190325_143138107
BackupLabel MyLabel_20190325_143138107
BackupDir c:/temp/backups\20190325_143138107
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 1
EndOfBackupRecord ----------------------------------------
BackupStatus Cancelled
BackupTimestampString 20190325_162312142
BackupLabel MyLabel_20190325_162312142
BackupDir c:/temp/backups\20190325_162312142
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 0
EndOfBackupRecord ----------------------------------------
EndOfUpDistState
When a backup process completes successfully, a file named [Link] is
added to the backup location. This file is not required by OTSE for operation – its
presence makes it easier for administrators inspecting the file system to determine
whether the backup in that location is good. The file contains a summary of the backup
using the same syntax as the [Link] file, for example:
BackupStatus Completed
BackupTimestampString 20200528_123935489
BackupLabel SocketGridBase_BackupLabel_20200528_123935489
BackupDir C:\p4\search.ot7\main\obj\log\ot7testoutput\BackupGridTest_testBackup5179a\backups\20200528_123935489
NumPartitionsInThisBackup 2
NumBackupPartitionsCompleted 2
EndOfBackupRecord ----------------------------------------


Running Backups from Command Line


It is generally expected that backups will be triggered by Content Server. It is also
possible to use the search admin port to run backups from the command line. Before
performing these steps, you must ensure that the backup destination directory is set
correctly, which can be done from Content Server on the Update Distributor process
(Backup tab). Be aware that running backups from the command line may confuse
the status of running backups displayed in Content Server.

1 – Run the Admin Port client


Run the following command on the back-end server where the Update Distributor
process is running, so that localhost can be used. It is also important to run this from
the Content Server installation directory, so that the Java runtime shipped with Content
Server is used.
.\jre\bin\java -cp .\bin\[Link]
[Link] -host localhost -port 8503
The port is the admin port of the Update Distributor process, which is 8503 by default;
you can confirm this value on the Update Distributor-specific tab.
2 – Response is a prompt showing the available commands
3 – Type backup and press Enter to run the command
4 – To automate this you can use an input and output file
Run the following command to send the backup command from the [Link] file to the
Admin Port client (one single line):
.\jre\bin\java -cp .\bin\[Link]
[Link] -host localhost -port 9103 <
[Link] > [Link]
The [Link] file needs the following entries (avoid spaces or empty lines):
backup
quit
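The same input and output file technique can be used to poll backup progress. As a sketch (the file names are illustrative), an input file containing:

getstatustext
quit

can be redirected into the Admin Port client in the same way, and the output file will contain the BackupStatus XML described earlier.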

Restoring Partitions
When restoring an index, the search partition(s) being restored must first be stopped.
Use file copy to restore the entire contents of the partition backup, then start the Index
Engine and Search Engine. The Transaction Logs can then be used to identify missing
transactions and bring the index up to date. Be sure you have Transaction Logs enabled.
As a convenience, entries are written to the Transaction Logs to mark the point at which
a backup occurred. The backup markers in the Transaction Log have this form:
2018-06-11T[Link]Z, Backup started,
backupDir="c:/temp/backups\20180608_132859489/partition1",
label="MyLabel_20180608_132859489-partition1", config="livelink.27"


Backup – Method 2
Operating system file copy utilities can be used to back up the search index. All search
and index processes must be stopped for this approach to succeed. Ensure that the
entire contents of the index directories for each partition are copied.
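As a sketch (the paths shown are illustrative), a Windows administrator might copy a stopped partition with a command such as:

robocopy D:\OpenText\index\enterprise\index1 E:\backup\index1 /E

where the /E option copies all subdirectories, including the index fragment folders.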

Backup Utilities – Method 3


This method is no longer recommended. It remains in the product for backwards
compatibility, and is used in versions of Content Server up to 16.2.8. All backup and
restore descriptions in the remainder of this section relate to this method.
The backup utility has the ability to make a backup copy of an index while the index is
still in use. The backup utility also runs index verification checks before initiating a
backup, and again on the backed-up copy of the files. This is equivalent to a "Level 1"
plus a partial "Level 4" index verification, which is sufficient to ensure that the files are
not corrupted, although it does not test the internal correctness of the data. Note that
the verification of backups is a new feature with Update 3.
Both complete and differential (incremental) backup operations are available. Use of
differential backups is discouraged, since when restoring you must ensure that the
right sequence of complete and differential backups is applied. OpenText is
considering whether differential backups as a feature should be removed from
Content Server in a future version.
When used with older versions of Content Server, the backup and restore features are
accessible through the search administration pages of Content Server – and Content
Server looks after the difficult bits of setting up the INI files and running the utilities.
This backup method is not supported with current updates of Content Server.
Differential Backup
The very first backup ever performed must necessarily be a full backup.
Subsequent backups can be differential backups or full backups depending on your
preference. A differential backup differs from a full backup in that it only makes a copy
of files that have changed from the last backup that was performed. These files are:
• metaLog and accumLog: these change frequently. The backup always saves
these for both full and differential backups.
• checkpoint file: for some partitions this can be a large file (over a GB). It is
only copied if it has changed.
• sub-index fragments: new fragments are saved.
The differential backup reduces the amount of disk space required for the backup and
also reduces the time taken to make the backup. However, it makes the restore
process more complex, and requires that you have a complete trail of differential
backups available, traced back to a full backup.
Backup Process Overview
The backup and restore processes rely on special configuration files to control their
behavior and to record the status of the backups. As an administrator, you should
normally not modify these files. Content Server automatically generates these files as
needed for backups. This information is primarily for troubleshooting and as a starting
point for developers that are integrating index backup and restore into their
applications.
To run a full backup, a configuration file with the name ‘[Link]’ must first be created
and placed in each partition folder. For a differential backup, a file with the name
‘[Link]’ must be created.
The backup utility is then run, which performs the backup operation on a single
partition.
On completion, the backup data is contained in a target directory, called FULL for a
full backup and DIFFx for a differential backup (where x is the sequence number of
this differential backup relative to the baseline full backup). The backup process also
creates a file called '[Link]', with copies in the source and backup target partition
folders.
Sample [Link] File
Note that the [Link] file is identical except for its name. The file uses basic Windows
INI file syntax with a single section, [Backup]. Comments have been injected here for
explanatory purposes (lines starting with a # symbol); they should not exist in the
actual file. In practice, the only values you may want to change are the log file name
and log level.

[Backup]
# AutoNew requests that a new folder is created if it
# does not already exist.
AutoNewDir=True
DelConfig=FALSE

# DestDir identifies the folder where that backup should
# be placed.
DestDir=F:/backup/cs1064main01/ent

# These strings are identifiers that are required. Do
# not change these values.
DiffString=Differential
FullString=Full

# Index is the root location of the source index being backed up.
Index=F:/OpenText/cs1064main01/index/enterprise/index1

# Specify the names of regions that contain date and time values
# that can reasonably be expected to reflect object index dates.
IndexDateTag=OTCreateDate
IndexTimeTag=OTCreateTime

# Specify a name for the index. Can leave this as a constant.
IndexName=Livelink

# Label is a template for how the backup should be named.
# LangFile provides additional hints for formatting the Label.
Label=Enterprise_%m_%d_%Y_%T_58863
LangFile=F:/OpenText/cs1064main01/config/[Link]

# Specify logging level and locations for the backup process.
LogFileName=F:/OpenText/cs1064main01/index/enterprise/index1/starskyX9099X181[Link]
LogLevel=1

# Option – leave as COPY. Don't change this.
Option=COPY

# The location of the Content Server binaries, used to perform
# a search to obtain the most recent date and time.
OTBinPath=F:/OpenText/cs1064main01/bin
ScriptFileName=

# Specify whether a full or differential backup is to be performed.
# Values can be DIFF or FULL.
Type=DIFF

Sample Lang File


This file is used by the automatic file naming process in the backups to map numeric
date and time values to display forms. The default file uses English language
conventions for days and months. This is a convenience function; unless you simply
cannot accept English date structures in file names, you should most likely leave this
alone.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<labelstrings xml:lang="">
<days>
<day number="1" longformat="Sunday" shortformat="Sun" />
<day number="2" longformat="Monday" shortformat="Mon" />
<day number="3" longformat="Tuesday" shortformat="Tue" />
<day number="4" longformat="Wednesday" shortformat="Wed" />
<day number="5" longformat="Thursday" shortformat="Thu" />
<day number="6" longformat="Friday" shortformat="Fri" />
<day number="7" longformat="Saturday" shortformat="Sat" />
</days>
<months>
<month number="1" longformat="January" shortformat="Jan" />
<month number="2" longformat="February" shortformat="Feb" />
<month number="3" longformat="March" shortformat="Mar" />
<month number="4" longformat="April" shortformat="Apr" />
<month number="5" longformat="May" shortformat="May" />
<month number="6" longformat="June" shortformat="Jun" />
<month number="7" longformat="July" shortformat="Jul" />
<month number="8" longformat="August" shortformat="Aug" />
<month number="9" longformat="September" shortformat="Sep" />
<month number="10" longformat="October" shortformat="Oct" />
<month number="11" longformat="November" shortformat="Nov" />
<month number="12" longformat="December" shortformat="Dec" />
</months>
<era>
<era number="1" shortformat="BC" />
<era number="2" shortformat="AD" />
</era>
<timeperiods>
<timeperiod number="1" shortformat="AM" />
<timeperiod number="2" shortformat="PM" />
</timeperiods>
</labelstrings>

Related to this are the format codes that are used in the label string. The codes are:

Value   Description
%%      A percentage sign
%a      The three-character abbreviated weekday name (e.g., Mon, Tue, etc.)
%b      The three-character abbreviated month name (e.g., Jan, Feb, etc.)
%d      The two-digit day of the month, from 01 to 31
%j      The three-digit day of year, from 001 through 366
%m      The two-digit month, from 01 to 12
%p      AM or PM
%w      The one-digit weekday, from 1 through 7, where 1 = Sunday
%y      The two-digit year (e.g., 93)
%A      The full weekday name (e.g., Monday)
%B      The full month name (e.g., January)
%H      The two-digit hour on a 24-hour clock, from 00 to 23
%I      The two-digit hour, from 01 through 12
%M      The minutes past the hour, from 00 to 59
%P      AD or BC
%S      The seconds past the minute, from 00 to 59
%Y      The year, including the century (e.g., 1993)
%T      Replaced with the value of FullString or INCRString specified on the command line or in the backup [Link] file

Sample [Link] File


The [Link] file is created or updated after a backup operation completes. It is later
used by the restore utility. Its basic purpose is to record the status of the backup, the
files which are included in the backup, and checksum data that allows the backup and
restore operations to validate correct file copies.
This particular example is from the second differential backup of an index. Each
differential backup results in another [DIFFx] section. A full backup would contain only
the [FULL] section, with no [DIFFx] sections. Commentary and white space have been
added.

[General]
# 0 status is good, other values are error codes
Status=0

# The last differential backup number
DIFF=2

DiffString=Differential
FullString=Full

[FULL]
CheckPointSize=624
MetaLogNumber=51
MetaLogOffset=0
AccumLogNumber=39
AccumLogOffset=0
I1=61
I1Size=447
I2=66
I2Size=39
TotalIndexSize=1109
Label=Enterprise_04_08_2011_Full_58863
Date=20110408 145139
MetaLogChkSum=524293
AccumLogChkSum=524293
CheckPointChkSum=206517074
I1ChkSum=15804739427
I2ChkSum=11071697352
ConfigChkSum=1160933350
Success=0

[DIFF2]
CheckPointSize=665
MetaLogNumber=53
MetaLogOffset=0
AccumLogNumber=41
AccumLogOffset=11785068
I1=69
I1Size=9
TotalIndexSize=674
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 150047
MetaLogChkSum=524293
AccumLogChkSum=258080884
CheckPointChkSum=1282731032
I1ChkSum=9500506248
ConfigChkSum=624209792
Success=0

[DIFF1]
CheckPointSize=664
MetaLogNumber=52
MetaLogOffset=4732284
AccumLogNumber=39
AccumLogOffset=5824292
TotalIndexSize=664
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 145644
MetaLogChkSum=1542696885
AccumLogChkSum=238343344
CheckPointChkSum=3018112456
ConfigChkSum=389190926
Success=0
Running the Backup Utility
Once the [Link] or [Link] file is in place, the backup utility can be run. The utility is
contained within the Search Engine, and is documented in the Utilities section of this
document.

Restore Process – Method 3


The restore procedure is considerably more complex than the backup. In its simplest
form, restoring an index comprises the following stages:


Preparation
The partition to be restored is placed in a known location. A configuration file, called
[Link], is created which points to this location. The target directory needs to be
empty, which means moving or deleting any existing index.
Analysis
The [Link] file from the backup location is analyzed to determine which files and
folders are required to perform the restore operation. This information is written into
the [Link] file.
The controlling application then needs to prompt the administrator to stage the
necessary folders before proceeding. Content Server is one application which
performs this coordination.
Copy
In the copy phase, the files specified in the [Link] file are used as a guideline for
copying all the necessary files from that backup location to the search index. The copy
process takes place iteratively, with one differential backup folder processed on each
invocation, and the administrator staging needed files for the next copy operation. The
process is structured to support complex backup storage systems, where each backup
may have been placed in a tape archive.

Validate
The final step is validation, in which the restored index is checked for integrity.
These stages do not automatically happen one after the other. The administrator or
the controlling application needs to initiate the steps sequentially, after ensuring that
appropriate file preparation occurs.
The restore operation works on a single partition. Content Server provides a
mechanism to simplify the restore of the entire index, and prompts the administrator
to ensure the appropriate files and folders are available at each step. The syntax of
the restore utility is documented in the Utilities section of this document.

[Link] File
The [Link] file is used for each stage of the restore procedure, and modified after
each stage. This file is the mechanism for transporting process information from one
phase to the next.
Before first running the analyze stage, a [Link] file needs to be created that looks
like this:

[restore]
otbinpath=d:\opentext\bin
SourceDir=d:\llbackup\ent\incr18
destdir=d:\temprest
option=analyse


Once the analysis is complete, the [Link] file will have been updated with
information about files that will be copied, and should look like this, without the added
comments and white space:

[restore]
OTBinPath=d:\opentext\bin
BackupIndexName=livelink
LogFilename=[Link]
RestoreHistory=[Link]
BackupHistory=[Link]
DestDir=d:\temprest
SourceDir=d:\llbackup\ent\incr18
loglevel=1

# CurrentImage indicates which [IMAGE#] section of this INI
# file should be examined to retrieve the needed files.
CurrentImage=1
success=0

# The insert option identifies that copy will take place next.
option=insert

# TotalImage is the number of differential backups that will
# be copied.
TotalImage=4
LastObjectSize=110750
LastObjectDate=20010426

# Each image is all or part of a saved differential backup.
# Only 1 image is shown here.
[IMAGE1]
TotalIndexSize=12
Processed=No
Date=20010612 110624
Label=Enterprise_06_12_2001_Incremental
TotalFrag=5
Frag5Size=0.111492
Frag5CkSum=12169
Frag5=00097
Frag4Size=0.220095
Frag4CkSum=3698
Frag4=00096
Frag3Size=0.468937
Frag3CkSum=59855
Frag3=00095
Frag2Size=5.679858
Frag2CkSum=19250
Frag2=00094
Frag1Size=2.669858
Frag1CkSum=59557
Frag1=00087
Master=Yes
In operation, the administrator (or controlling application) is expected to examine the
IMAGE# section for the current image number, and mount the backup folder which has
the specified label and date. Once this is staged, the administrator edits the [Link] file
to change the option from "insert" to "copy", and runs the restore.
The restore utility will then copy the files from that one image, change the option back
to insert, and update the current image number; the process repeats until all the
IMAGE sections are processed.


Index and Configuration Files


This section provides a complete list of search.ini settings, along with background
information on how the partitions store the index on the file system.

Index Files
OTSE persists the search index on disk, in a specific hierarchy of folders and file
names. This section outlines each of the folders and files and their purpose. Below is
a typical listing for a search partition, for reference; it is described in detail below.
There is one such folder for each partition.

[Link]
accumlog.39
checkpoint.51
[Link]
[Link]
livelink.280
[Link]
metalog.51
topwords.100000
MODaccumlog.47
MODindex
\2
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\map
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\3
\\ same
61
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\map
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
62
\ same

Signature File
The first file in the list, [Link], is technically not part of the search
index, and is not required for search or indexing operations. Content Server adds this
file to allow the administration interfaces in Content Server to verify that related Search
Engines and Index Engines are referencing the same directories. If upgrades occur,
older server names may persist in this file; this is expected.

Accumulator Log File


This file (accumlog) contains the incremental updates to the full text index which have
occurred since the index file was last updated and written into an index fragment. This
file is managed by the Index Engines, and consumed by the Search Engines in the
normal course of operation. The accumlog enables rollback of partially completed
transactions.
The file is of the form accumlog.x, where x is a number that increments sequentially
each time the contents of the accumlog are committed to a new index fragment and a
new instance of the accumlog is created. The accumlog contains incremental adds
and deletes.

Metadata Checkpoint Files


Checkpoint files are of the form checkpoint.x, where x is an incrementing integer. A
Checkpoint contains a complete copy of the metadata for the partition, including the
values, the index and the dictionary. Checkpoints are managed by the Index Engines.
A new Checkpoint file is created when the size of incremental metadata changes
(metalogs) exceeds a configuration value, typically 16 Mbytes. New checkpoint files
are also created when index conversions are performed during Index Engine startup.
To ensure synchronization, checkpoint files are written at the same time by all partitions
in a system. The coordination of simultaneous checkpoint creation is directed by the
Update Distributor.
Upon startup, or upon resynchronization, Search Engines load their metadata image
from the checkpoint file, and then apply incremental changes from the metalogs.
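The 16 MByte threshold corresponds to the MetaLogSizeDumpPointInBytes setting listed in the Search.ini Summary later in this document, whose default is:

MetaLogSizeDumpPointInBytes=16777216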


It is possible for multiple checkpoint files to exist for a partition. Normally, this only
occurs for a short period, when a Search Engine is still using an older checkpoint file
after the Index Engine has created a new one. The Index Engines will reduce the
number of checkpoint files to one at the earliest safe opportunity.
Lock File
The Lock File is used by the Index Engine to indicate that this partition is in use. This
is a failsafe mechanism to ensure that multiple Index Engines will not attempt to use
the same data. In a properly configured system, this would not happen. The Lock file
provides additional insurance.
Control File
The Control File, named [Link], is used by the Index Engines to record the name
of the current Config file. The Search Engines read this file to obtain the name of the
current Config file. To ensure atomic reads and writes, both the Index Engine and
Search Engine will lock this file when accessing it.
Top Words
Optional file. Top Words are used to track which words in an index are candidates for
exclusion from TEXT queries because they are too common. The file is named
topwords.n, where n is one of 10000, 100000 or 1000000 – which reflects the number
of objects in the partition when the file was generated.
Config File
Named livelink.x, where x is an incrementing number. The config file contains detailed
information about the index fragments, working file offsets, file checksums, and other
parameters needed by the Index Engine and Search Engine to properly interpret the
index files.
A new Config file is written each time the Index Engine creates a new fragment or
generates a checkpoint. A Search Engine will place a non-exclusive lock on the Config
file which represents the accumlog and metalog files it is currently consuming.
The Index Engine will clean up older, unused Config files.

Metalogs
A metalog contains incremental updates to metadata. The Index Engine writes
updates to the metalogs, and occasionally creates a checkpoint file that rolls up all the
metalogs since the last checkpoint into a new checkpoint file.
Search engines consume updates from the metalog files to keep their copy of the
metadata current. When a metalog exceeds a configurable size, a new checkpoint is
created and a new metalog started. It is possible for multiple metalogs to exist for short
periods while the Search Engines consume older metalogs.

Index Fragment Folders


The full text content for a partition is broken into partition fragments. Each fragment is
contained within a numbered folder within the partition index folder. In the example at
the start of this section, these folders are labeled 61 and 62. Folder 61 is exploded to
show the files within.
A new Index Fragment is created when the Index Engine fills the accumulator and
‘dumps’ it to disk. The files within a fragment are never modified once written to disk.
The Index Engines occasionally merge fragments to consolidate them, creating new
larger fragments in the process, and allowing the smaller fragments to be deleted. A
cleanup task in the Index Engine will delete the older, smaller fragments once the
Search Engine stops referencing them.
In an optimal configuration, the merge process attempts to structure the fragments
such that the number approaches about 5 fragments, with geometrically related sizes.
For example, 1000 MB, 300 MB, 100 MB, 30 MB, 10 MB. In practice, the sizes will
vary from this pattern given the reality of the sizes available for merging and the
opportunistic scheduling of merges based on the indexing load. If the indexing load is
high and sustained, the opportunity for merges may be rare, and the number of Index
Fragments can become large. Large numbers of fragments are undesirable for query
performance, so there is a configuration setting in the [Link] file that places an
upper limit on the number of acceptable fragments, which will force merge activity,
stalling the indexing process if necessary.
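The fragment limits referred to here appear in the DataFlow section of the Search.ini Summary later in this document; the defaults are:

DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15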
Within the Index Fragment Folder, there are a number of files as described below.
Core, Region and Other
Examining the fragment folder, note that there are files of the same type but having the
prefixes core, region and other. These file sets are similar, but used for different data.
The ‘core’ files contain the full text search data for words which are comprised of the
basic ASCII character set (typically English).
The ‘region’ files contain the full text index for XML region names. These are special
cases that improve the performance of search for values within XML fields.
The ‘other’ files contain the full text index for all other words – those which are not
English and not XML tags.
The descriptions below for core files are also applicable to the files with ‘region’ and
‘other’ prefixes.
Index Files
The file [Link] contains the ‘dictionary’ of terms, plus pointers to the object id file.
As the dictionary grows large, multiple levels of dictionary pointers are created, so you
will often see [Link], [Link], and so forth. These higher numbered index
files contain references to the lower numbered files, with successively more accurate
data points. For instance, the [Link] file contains entries for every 16th dictionary
value. This hierarchy improves dictionary lookup time. This structure repeats until the
highest numbered index file is smaller than 1 MByte. This 1 MByte dictionary is kept
in memory to optimize performance.
Object Files
The file [Link] contains a list of all internal object IDs and pointers to the word
location lists in the offset file.


Offset File
The file [Link] contains the lists of word offsets. These word offsets indicate to
the search engine the relative position of a word within an indexed object.
Skip File
The file [Link] contains pointers to the offset file that allows the Search Engine
to quickly skip over large data sets.
Map File
The map file contains checksums that can be used to verify that the index fragment
files have not been corrupted. There is only one Map file per partition fragment.

Low Memory Metadata Files


The Low Memory mode of storing text region indices is available starting with Update
8, disabled by default. When configured, there are additional files in the index folder.
The purpose and functions of the Low Memory metadata files are the same as the
corresponding full text index files, except that they contain indices for text metadata
regions.
The Low Memory metadata index fragment subfolders are contained in the directory
called MODindex. There is also a MODaccumlog file, which is analogous to the
accumlog file.
If there are multiple metadata index fragment subfolders, then some of the subfolders
will also contain a file called [Link], which is used to identify objects with entries
in earlier fragments that have been modified.

Metadata Merge Files


The Merge File storage mode places the text metadata values in a MODcheck file,
with incremental changes in a MODcheckLog file. These operate much like the full text
index files, using background merge processes to consolidate recently changed
values into larger compacted files.
The files in the index are as follows:

MODCheck.x
This is the master file for the metadata values, and the target after a merge.
The value of x increments after each merge operation.
MODcheckLog.x
Changes to text values are recorded in this file until a merge operation occurs.
MODpremerge.x+1
MODptrs.x+1
Files containing pointers used for recovery and playback during startup.


It is possible that multiple versions (values of .x) of these files may exist, especially if
a Search Engine is lagging in accepting updates from the Index Engine, or multiple
Search Engines exist.
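As shown in the Search.ini Summary later in this document, the Merge File storage mode is enabled through the MODCheckMode setting in the DataFlow section (the default value of 0 leaves it disabled):

MODCheckMode=1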

Configuration Files
OTSE derives the bulk of its configuration settings from a number of files. In this
section, we review each of the files to convey the basic purpose of each.

Search.ini
Most settings for OTSE are contained within the search.ini file. There is one search.ini
file per Admin Server. In practice, this usually means one per physical computer,
although other permutations are possible.
When used with Content Server, the search.ini file is generated by Content Server.
Although Content Server may preserve some of the edits you might make to the
search.ini file, this is not guaranteed. In general, you should not edit this file. Most of
the entries are set by Content Server, and using the Content Server search
administration pages is the preferred method for interacting with this file.
If you must adjust values within a Content Server application, consider using the
search.ini_override file instead.
The search.ini file follows generally accepted conventions for the structure of a '.ini'
file. The file consists of several configuration sections. Where sections contain settings
for a particular partition, the section name will include the partition name. Refer to the
Search.ini Summary section of this document for detailed information on entries in the
search.ini file.

Search.ini_override
This file is specifically designed to supplement or override any values set in the
search.ini file. Because the search.ini file is controlled by Content Server, editing the
search.ini file does not ensure that your changes will be preserved.
The override file is optional. When present, it need contain only those configuration
settings which you want to take precedence over the default settings or the settings
within the search.ini file.
There is a special value that can be used in override settings, the DELETE_OVERRIDE
value. When this value is encountered, it means that the explicit value for the setting
in the search.ini file should be ignored, and the default value used instead.
For example, the default value for CompactEveryNDays is 30. If the search.ini file
contains the setting:

CompactEveryNDays=100


But the search.ini_override file contains:

CompactEveryNDays=DELETE_OVERRIDE
Then the default value of 30 will be used.
Note that the override file may need to be edited any time the partition configuration
changes. The most common situation is that when you create new partitions, you will
need to add corresponding sections to the override file.
If you use automatic partition creation (such as date based partition creation) within
Content Server, you may have difficulty keeping the override file current with newly
created partitions, and the override file might not be a good choice for this type of
deployment.
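For example (a sketch; the partition name is hypothetical, and the section name follows the [DataFlow_] naming convention shown in the Search.ini Summary below), a newly created partition named MyPartition might need a corresponding override section such as:

[DataFlow_MyPartition]
CompactEveryNDays=DELETE_OVERRIDE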

[Link]
This is an optional configuration file which is used to set the parameters for index
backup operations and record the status of the last backup operation. You should not
normally modify this file. Refer to the section on index backup for more information.

[Link]
This file defines the storage modes for text metadata regions, and should be located
in the partition directory. There is one [Link] file per partition.
Although each partition could have different settings, keeping them identical across
partitions is generally recommended, and within a Content Server environment this is
enforced. A [Link] file has the following form:

[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM

[ReadWrite]
someRegion1=DISK
someRegion2=RAM

[ReadOnly]
someRegion1=RAM
someRegion3=DISK

[NoAdd]
someRegion1=DISK_RET
someRegion2=RAM
The General section defines the default storage mode for a text metadata region. The
ReadWrite, ReadOnly and NoAdd sections allow control over the storage of specific
regions, and these take priority over the General section. The possible values are
DISK, RAM and DISK_RET. Refer to the section on text metadata storage for details.


Within Content Server, the [Link] file is created and managed by
Content Server, and should not be edited.

[Link]
The field definitions file has several purposes. Experience indicates that most
customers do not understand or modify this file, which is unfortunate, since significant
performance and memory use benefits may be possible by reviewing and editing this
file BEFORE indexing your content. Once an index has been created, it is not possible
to change some of the settings in this file without generating startup errors.
One function of the file is to establish the type for each metadata region to be indexed.
Each region is tagged with a type such as:
• INT
• LONG
• TEXT
• DATETIME
• TIMESTAMP
• USER
• CHAIN
• AGGREGATE-TEXT
A second purpose for the field definitions file is to provide metadata parsing hints for
nested metadata regions. Using the NESTED operative, the input IPool parser can
ignore outer tags and extract and index the inner region elements.
The field definitions file also provides instructions for special handling of certain region
types. This includes dropping, removing, renaming and merging metadata regions.
You can also use the aggregate feature to create a new region comprised of multiple
text regions.
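As a sketch only (these region names are invented, and the layout assumes the same type-then-region-name form as the ExtraLLFieldDefinitionsLine example shown below), typed entries in this file might look like:

TEXT OTSummary
LONG OTObjectSize
AGGREGATE-TEXT MyAggregateRegion SourceRegion1 SourceRegion2

Refer to the metadata regions section of this document for the authoritative syntax of each directive.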
One field definitions file is required per Admin server. As a general rule, each field
definitions file should be identical for partitions with different Admin servers.
Differences will result in inconsistent handling of regions between partitions.
Content Server does not edit, generate or manage this file. In general, changes to this
file must be made manually. There is one exception: the search.ini file has a special
setting for logically appending lines to the field definitions file. This allows limited
control over the definitions from Content Server. For example, if the search.ini file
contained these two lines:
ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber
Then at startup time, OTSE acts as if these lines existed at the end of the field
definitions file:
CHAIN MyID UserID TwitterID FacebookID
LONG OTBigNumber


Content Server usually ships with two versions of this file – a standard version, and
one for use with Enterprise Library Services. Which version to use is determined by a
setting in the search.ini file:

FieldModeDefinitions=[Link]
Detailed information about each of the functions and data types of the field mode
definitions file can be found in the section of this document which covers metadata
regions.

Search.ini Summary
This section gathers together most of the accessible configuration values that can be
used in the search.ini file or the search.ini_override file. There are a number of
additional values, used only for specific debugging or testing purposes, which are not
listed here. A number of these configuration values are covered in more detail in
relevant sections of this document.
Not all processes read all sections of the search.ini file. Content Server generates
search.ini files for each process, and typically includes only the values needed by that
process. Note that Content Server files do not include all of the entries, and default
settings are common.
Default values are displayed in this section wherever possible. Annotations in this
section are indicated with a // at the beginning of the line – this is not syntax supported
in an actual search.ini file; it is used here as a documentation device.
The settings in the INI file are applied when the processes start. Changes to this file
may require a restart of some or all of the search grid in order to take effect. Some of
these values can be re-applied to a running process without a restart, refer to the
“Reloadable Settings” section for a list.

General Section
This section is required for every [Link] file. The basic purpose is to share with all
components the configuration settings for the RMI Grid Registry and the Admin Server.
If RMI communication between grid components is not used, then the General Section
is ignored and not required.

[General]
AdminServerHostName=localhost

// RMI Registry
RMIRegistryPort=1099
RMIPolicyFile=[Link]
RMICodebase=../bin/[Link]
RMIAdminPort=8997

// RMI Grid Registry logging
RMILogFile=[Link]
RMILogTreatment=0
RMILogLevel=10

Partition Section
The Partition Section contains basic information about a partition, such as size,
memory usage preferences, and mode of operation. The section name must include
the partition name after the underscore.
[Partition_]
AllowedNumConfigs=500 (-1 = none)
AccumulatorSizeInMBytes=30
PartitionMode=ReadWrite | ReadOnly | NoAdd | Retired

// Size of index on disk
MaxContentSizeInMBytes=50000
StartRebalancingAtContentPercentFull=95
StopRebalancingAtContentPercentFull=92
StopAddAtContentPercentFull=90
WarnAboutAddPercentContentFull=true
ContentPercentFullWarnThreshold=85

// Metadata memory usage
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90

// Set true to reserve partition for large objects
LargeObjectPartition=false

// IE0, IE1, etc. This is REQUIRED
// For RMI, this is location of grid registry
// For sockets, this is location of Index Engine
IE#=//host:port/indexEngineName

DataFlow Section
The DataFlow section contains the majority of configuration settings relating to how
data should be processed. The partition name must be appended to the section name
after the underscore.

[DataFlow_]
FieldDefinitionFile=[Link]
FieldModeDefinitions=[Link]
QueryTimeOutInMS=120000
SessionTimeOutInMS=216000

The Information Company™ 229


Understanding Search Engine 21

StatsTriggerThreshold=200

LastModifiedFieldName=OTModifyDate

// Interval for reading metalog files


UpdatePollIntervalInMS=10000

// For tuning use of basestore. In general, don’t touch this.


// Max 200 values in a multivalue text field
// Max 256K of data in a text field
// Allow email management regions to exceed these limits
MultiValueOverflowBoundary=0
MultiValueLimitDefault=200
MetadataValueSizeLimitInKBytes=256
MultiValueLimitExclusionCSL=OTEmailToAddress,OTEmailToFullName,
OTEmailBCCAddress,OTEmailBCCFullName,
OTEmailCCAddress,OTEmailCCFullName,
OTEmailRecipientAddress,OTEmailRecipientFullName,
OTEmailSenderAddress,OTEmailSenderFullName

// Time zone obtained from OS by default, you can set e.g +5 for EST
TimestampTimeZone=

// Accumulator configuration
ContentTruncSizeInMBytes=10
DumpOnInactiveIntervalInMS=3600000
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
DumpToDiskOnStart=false
AccumulatorBigDocumentThresholdInBytes=5000000
AccumulatorBigDocumentOverhead=10
CompleteXML=false

// Configure the Reverse Dictionary


ReverseDictionary=false
ReverseDictionaryScanningBufferWordEntries=100000

// Tokenizer
RegExTokenizerFile=[Link]
RegExTokenizerFileX=c:/config/tokenizers/[Link]
TokenizerOptions=0
UseLikeForTheseRegions=
OverTokenizedRegions=
LikeUsesStemming=true
AllowAlternateTokenizerChangeOnThisDate=20170925

ReindexMODFieldsIfChangeAlternateTokenizer=true

// Facets
ExpectedNumberOfValuesPerFacet=16
ExpectedNumberOfFacetObjects=100000
MaximumFacetValueLength=32
UseFacetDataStructure=true
MaximumNumberOfValuesPerFacet=32767
NumberOfDesiredFacetValues=20
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
MaximumNumberOfCachedFacets=25
DesiredNumberOfCachedFacets=16

// Facet regions to compute on startup and protect
PrecomputeFacetsCSL=
PersistFacetDataStructure=true

// Enable and configure span features
SpanScanning=false
SpanMaxNumOfWords=20000
SpanMaxNumOfOffsets=1000000
SpanMaxTmpDirSizeInMB=1000
SpanDiskModeSizeOfOr=30

// Disk I/O tuning
DelayedCommitInMilliseconds=0
IOChunkBufferSize=8192
ParallelCommit=true
SmallReadCacheDesiredMaximumSizeOfSmallReadCachesInMB=0
NumberOfFileRecoveryAttempts=5

// Control reporting of disk timing in logs
LogDiskIOTimings=true
LogDiskIOPeriod=25

// Enable recording of network problems
LogNetworkIOStatistics=true

// Search IO buffers default to index buffer size.
// Modest space savings with small performance hit if set smaller.
MaxSizeInBytesOfSearchIOBuffers=-1

// The region name and value used to identify
// when objects should be indexed as XML with text regions
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml

// Enable region forgery checking with otb= attribute
IgnoreOTBAttribute=false

// Several controls exist for determining when a new metalog should
// be created. All are approximations!
MetaLogSizeDumpPointInBytes=16777216
MetaLogSizeDumpPointInObjects=5000
MetaLogSizeDumpPointInReplaceOps=500
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000

SubIndexCapSizeInMBytes=2147483647

// Skips indexing of regions when new data same as old
SkipMetadataSetOfEqualValues=true

// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
MaximumSubIndexArraySize=512
CompactEveryNDays=30
NeighbouringIndexRatio=3

// Enable Merge Files for Text Metadata values, set to 1
MODCheckMode=0
// Control Merge File metalog size and merge time checks
MODCheckLogSizeDumpPointInBytes=536870912
MODCheckMergeThreadIntervalInMS=10000
MODCheckMergeMemoryOptions=0

// Some metadata regions need to be treated as content because
// they are derived from full text. MS Office properties.
ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType
ExtraDCSStartsWithNames=OTDoc,OTCA_,OTXMP_,OTCount_,OTMeta
DCSStartsWithNameExemptions=OTDocumentUserComment,OTDocumentUserExplanation
ExtrasWillOverride=false
// Handle bug where thumbnail requests were indexed as text
EnableWeakContentCheck=true

// Cleanup thread, removes unused files
// Cleanup mode 0 is pre-Update 3 algorithm
FileCleanupIntervalInMS=600000
SubIndexCleanupMode=1
WantFileCleanup=all | none
SecureDelete=false

// Metadata defragmentation
DefragmentFirstSundayOfMonthOnly=0
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
DefragmentMaxStaggerInMinutes=60
DefragmentStaggerSeedToAppend=SEED

// If a “validate” operation on a Checkpoint file fails, stop.
ContinueOnCorruptCheckpoint=false

// If changing existing types with [Link],
// enter today’s date
EnableRegionTypeConversionAsADate=YYYYMMDD

// List of Content Server fields that MUST be long
FieldsToBeLongCSL=OTCreatedByGroupID, OTDataID, OTOwnerID, OTParentID,
OTUserGroupID, OTVerCreatedByGroupID, OTWFManagerID, OTWFMapManagerID,
OTWFMapTaskPerformerID, OTWFMapTaskSubMapID, OTWFSubWorkMapID,
OTWFTaskPerformerID

// Set to false to disable removing empty regions
RemoveEmptyRegionsOnStartup=true

// Set to true to enable compression of Checkpoint files on disk
UseCompressedCheckpoints=false

// Two techniques for converting RAM/DISK fields for text metadata
// 0 is faster but uses more RAM. 1 is slower, less memory
MetadataConversionOptions=0

// Force IE to checkpoint if regions change at startup
// For instance, remove, merge, rename
WantCheckpointAfterFieldDefChanges=true

// On startup, if index has regions with Null characters.
// .. this is BAD DATA – a repair feature.
RemoveRegionsWithNull=false

// Relevance tuning
ExpressionWeight=100

ObjectRankRanker=
ExtraWeightFieldRankers=
DateFieldRankers=
TypeFieldRankers=
DefaultMetadataFieldNamesCSL=
// Set true for minor query performance boost on older CS instances
ConvertREtoRelevancy=false

// Maximum number of operators before a huge query should be
// broken into chunks, glued together.
// Slower, but handles extreme cases
MaximumNumberOfBinaryOperators=15000

// Field length included in relevance for metadata on disk
// 2015-09 for scanning operators (regex, *, etc.).
// To reset to old form (less RAM, faster), set false
MODScanningLengthNorm=true

// For backwards compatibility. New apps should set this false
// Affects how the OTScore is computed in some edge cases
UseOldIntScores=true

// New stemming is faster, but less comprehensive
UseOldStem=false

// If updates and queries are contending, how many queries
// should be serviced before allowing an update
MaxSearchesWhileUpdateBlocked=0

// If updates and queries are contending, time before retry
RWGateRetryIntervalInMS=1000

// Optional logging to check for potential IO resource leaks
LogHighestNumberOfIOBuffers=false

// Faster when false. Set true to force more frequent logging output.
SyncIndexEngineLogEveryCommit=false

// Smaller values provide faster text operation, use more RAM
TextIndexSynchronizationPointGap=1000

// By default, relative comparison in full text is not allowed
// (although it is still allowed in Metadata regions)
AllowFullTextComparison=false

// Optimization that groups local updates for MBQ, DBQ
GroupLocalUpdates=true

// Restrict automatic timestamping to a list of
// Content Server object types (numeric OTSubType)
IndexTimestampOnlyCSL=

// For multivalue metadata with attributes, which attribute
// should have precedence for sorting. “Language” used since
// multilingual metadata is the primary user of this feature.
SystemDefaultSortLanguage=

// Orderby collation default is locale-sensitive
OrderedbyRegionOld=false

// Names of the DiskRet and Field Alias sections
DiskRetSection=DISK_RET
FieldAliasSection=FAS_label

// Determines whether RMI or sockets are used within GRID
// New / better is sockets
GridConnectionType=rmi | direct
// Policy file – new location when sockets in use. Replaces the
// RMIPolicyFile from General section
PolicyFile=<path to [Link]>
// Timeout for socket connections only
IEUpdateTimeoutinMS=120000
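
As a sketch only, a grid configured for direct socket connections rather than RMI might
set the following (the policy file path is a placeholder):

GridConnectionType=direct
PolicyFile=<path to policy file>
IEUpdateTimeoutinMS=120000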

// name the search agent sections
SA0=label

// Configure email domain search
EmailDomainSourcesCSL=
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]
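
As a hedged example, if OTEmailSenderAddress were listed as a source region, then
with the default suffix the engine would be expected to maintain a derived, searchable
domain region named OTEmailSenderAddress_OTDomain. The source region chosen
here is illustrative only:

EmailDomainSourcesCSL=OTEmailSenderAddress
EmailDomainFieldSuffix=_OTDomain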

// Optional – append lines to [Link]
ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber

// Index Engine Bloom Filter configuration
LogPeriodOfDataIdQueries=1000
NumBitsInDataIdBloomFilter=67108864
NumDataIdHashFunctions=3
AutoAdjustDataIdBloomFilterSize=true
AutoAdjustDataIdBloomFilterMinAddsBetweenRebuilds=1048576
DisableDataIdPhraseOpt=false

// Tuning for the TEXT query operator
TextCutOff=0.33

TextAllowTopwordsBuild=true
TextNumberOfWordsInSet=15
TextUseTermSet=true
TextPercentage=80

// Define large object partition size threshold
ObjectSizeThresholdInBytes=1000000

// Compress text content to Index Engines
CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535

// TimeStamp region used for search agent scheduling
AgentTimestampField=OTObjectUpdateTime

// Do not use. Sets unit test conditions
BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists

Update Distributor Section
Each Update Distributor requires an instance of this section. The name of the Update
Distributor is appended to the section name after the underscore.

[UpdateDistributor_]
// RMIServerPort not needed for direct socket connection mode
RMIServerPort=

AdminPort=
AllowRebalancingOfNoAddPartitions=false
IEUpdateTimeoutMilliSecs=3600000
MaxItemsInUpdateBatch=100
MaxBatchesPerIETransaction=1000
MaxBatchSizeInBytes=20000000
ReadOnlyConvertionBatchSize=1

// Retry and total wait time talking to UD, direct socket mode
WaitForTransactionMS=10000
MaxWaitForTransactionMS=600000

// For direct (non RMI) how often / long to try connecting to IE
ConnectionAttempts=5
ConnectionDelayBetweenAttemptsInMS=1000

// P0, P1, etc.
P#=partitionName

// ID is the path where inbound IPools reside, and the ReadArea
// is an integer for the folder number
IPoolId=
IPoolReadArea=
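
For example (the path is hypothetical), an Update Distributor reading inbound IPools
from folder 2 under an index update area might use:

IPoolId=c:/opentext/index/update
IPoolReadArea=2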

// Number of partitions that can merge even if disk percent full
NumOfMergeTokens=0

// Number of partitions to which new objects should be added
NumActivePartitions=0

// Number of partitions allowed to write checkpoints at same time
MaximumParallelCheckpoints=0

// logging
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// index backup configuration
BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false

Index Engine Section
This section is used primarily by the Index Engines. The Index Engine name must be
added to the section name after the underscore.

[IndexEngine_]
AdminPort=
IndexDirectory=

// RMI settings not needed if using sockets between UD and IE
RMIServerPort=
RMIUpdateDistributorURL=

// For direct (non RMI) a timeout between connection and first command
IEConnectionTimeoutInMS=10000

// Used in the backup/restore process
IndexName=Livelink

// Metadata Integrity Checksums
MetadataIntegrityMode=off | on | idle
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
TestMetadataIntegrityonDisk=true

// Log file configuration
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// Transaction log files
TransactionLogFile=
TransactionLogRequired=false

// Level 3 Index Verify option
MaxVerifyIndexMODExceptions=300

Search Federator Section
This section is consumed by the Search Federators. The name of the Search
Federator must follow the underscore character in the section name.

[SearchFederator_]
RMIServerPort=
AdminPort=
SearchPort=8500

// SE0, SE1, etc. Required for each SE attached.
// For RMI, this is RMIRegistry location (same for each)
// For sockets, this is Search Engine location
SE#=//host:port/searchEngineName

// For sockets (not RMI), SE connections: 5 retries 1 second apart
ConnectionAttempts=5
ConnectionDelayBetweenAttempts=1000
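
For example (host, port and engine name are hypothetical), a Search Federator with
one attached Search Engine in direct socket mode might declare:

SE0=//searchhost:9830/SearchEngine_p0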

// Worker threads and queue size determine how many queries
// can be active and waiting. Larger values consume more
// system resources.
WorkerThreads=10
QueueSize=25

// Low priority search queue, disabled by default
LowPrioritySearchPort=-1
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25

// Test setting, forcing long search queries
MinSearchTimeInMS=0
// Test setting, forcing random long holds on a read lock
RandomMaxReaderDelayInMS=0

// Chunks are results the SF asks SE to fetch
// Larger chunks are slower. Very small chunks add overhead
MergeSortChunkSize=50

// Cache threshold determines how aggressively SF cache is filled
// by the SEs. 0 is most aggressive, best in most cases.
MergeSortCacheThreshold=0

// Timeouts for closing application connections that are idle
// Use 0 to disable feature
FirstCommandReadTimeoutInMS=10000
SubsequentCommandReadTimeoutInMS=120000

// Query suspension time to prevent index throttling
BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInMS=30000

// Removing duplicates slows search queries. More than about
// 1 million values to dedupe uses a LOT of RAM.
RemoveDuplicatesIDs=false
MaximumDuplicatesToRemove=1000000

// Optimistic – if an SE dies, restart it.
// Pessimistic – glass is half empty
ErrorRecovery=optimistic | pessimistic

// Log file management
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true

Logfile=<SectionName>.log
RequestsPerLogFlush=1

// Search Result Cache time trigger and temp directory
SearchResultCacheDirectory=G:\cache
TimeBeforeCachingResultsInMS=300000

Search Engine Section
The configuration settings in this section are consumed by the Search Engines. The
partition name must follow the underscore character in the section name.

[SearchEngine_]
AdminPort=
IndexDirectory=

// Used in backup processes
IndexName=Livelink

// Log file configuration
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// CS10 Update 4 and later can use direct instead of RMI
ConnectionType=rmi | direct

// If using direct sockets, RMI settings not needed
RMISearchFederatorServerName=
RMIServerPort=

// If using direct sockets, need this for each SE
ServerPort=<search_engine_port>

// Disk tuning values that you should leave alone unless you
// are having disk problems. Use cautiously.
UseSystemIOBuffers=true
MaximumNumberCachedIOBuffers=100
SizeInBytesIOBuffers=4096

DiskRet Section
This section is present to allow use of DISK_RET storage mode in older systems where
Content Server does not support DISK_RET configuration in the search administration
pages. Normally, this section should only be present in a search.ini_override file. CS10
Update 3 and later would put this into the [Link] file instead.

[DiskRetSection]
RegionsOnReadWritePartitions=
RegionsOnNoAddPartitions=
RegionsOnReadOnlyPartitions=
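
A sketch only, using a hypothetical region name: to apply DISK_RET storage to one
region on read-write partitions, with the section named DISK_RET as referenced by
the DataFlow DiskRetSection setting, the section might read:

[DISK_RET]
RegionsOnReadWritePartitions=OTDocumentSummary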

Search Agent Section
Search Agents are queries that are run on objects as they are being indexed. Within
Content Server, these are typically from Intelligent Classification or from Prospectors.
The search agent name must follow the underscore character in the section name.

[SearchAgent_]
operation=OTProspector | OTClassify

// IPool is the path, and readArea is a number which represents a
// folder within the ipool path where the results are stored.
readArea=
readIpool=

// The queries to be applied are contained in this path/file
queryFile=
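
For illustration only (the agent name, paths and query file name are hypothetical):

[SearchAgent_Classify1]
operation=OTClassify
readIpool=c:/opentext/index/agents
readArea=1
queryFile=c:/opentext/config/agentqueries.txt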

Field Alias Section
This area defines a list of Content Server field names that are mapped to OTSE region
names at query time and during indexing. The label named by the DataFlow
FieldAliasSection setting must be appended to the FAS section name after the
underscore character.

[FAS_label]
From=to
// example
Author=OTUserName

Index Maker Section
This section defines a number of internal values used to configure how a full text search
index is constructed and interpreted. It is included here for completeness. DO NOT
CHANGE these settings unless you have strong reasons for doing so and understand
exactly what you are doing. In general, this section is not present in a [Link] file,
and the default values are used.

[IndexMaker]
ObjectSkip=32
ObjectUseRLE=true
ObjectUseNyble=true
OffsetSkip=16
OffsetUseRLE=true
OffsetUseNyble=true
SmallestIndexIndexSizeInBytes=1048576
IndexingPartitionFactor=256
UseLongSkips=true
LongSkipInterval=4096

Reloadable Settings
A subset of the [Link] settings can be applied to search processes that are already
running. This feature is triggered using the “reloadSettings” command over the admin
API port. The [Link] settings applied at reload are:

Common Values
These values are reloadable in the Update Distributor, Index Engines, Search
Federator and Search Engines.

Logfile
RequestsPerLogFlush
CreationStatus
DebugLevel
LogSizeLimitInMBytes
MaxLogFiles
MaxStartupLogFiles
IncludeConfigurationFilesInLogs
NumberOfFileRecoveryAttempts
LargeObjectPartition
ObjectSizeThresholdInBytes
BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists

If using RMI…

RMIRegistryPort
RMIPolicyFile
RMICodebase
AdminServerHostName

If not using RMI…

PolicyFile

Search Engines
DefaultMetadataFieldNamesCSL
DefragmentMemoryOptions
DefragmentSpaceInMBytes
DefragmentDailyTimes
DefragmentMaxStaggerInMinutes
DefragmentStaggerSeedToAppend
SkipMetadataSetOfEqualValues
MetadataConversionOptions
ExpressionWeight
ObjectRankRanker
ExtraWeightFieldRankers
DateFieldRankers
TypeFieldRankers
UseOldStem
HitLocationRestrictionFields
FieldAliasSection
DefaultMetadataAttributeFieldNames
SystemDefaultSortLanguage
SortingSequences
PrecomputeFacetsCSL
MaximumNumberOfCachedFacets
DesiredNumberOfCachedFacets
TextNumberOfWordsInSet
TextUseTermSet
TextPercentage

Update Distributor
MaxItemsInUpdateBatch
MaxBatchSizeInBytes
MaxBatchesPerIETransaction
NumOfMergeTokens
RunAgentIntervalInMS

** The list of partitions is also reloaded from the section names in the Update
Distributor, allowing partitions to be added without restarts.
Although Search Agent definitions are not included in this list, changes to the Search
Agents do not require a restart. Search Agents use another mechanism for updates;
refer to the section on Search Agents for details.

Tokenizer Mapping
Earlier in this document, the Tokenizer section references various character mappings.
For reference, a detailed list of character mappings performed by the tokenizer is
included below. If a character is not included in this table, it is not mapped – it is added
to the index as itself.
The leftmost character in each row (and its hexadecimal Unicode value) represents
the output character(s) of the mapping. The remaining values following the colon
represent a list of source characters that are mapped to that output character. Each of
these source characters in the list is separated by a comma, with Unicode values in
parentheses. For example, the row “b (62): B (42), B (ff22), b (ff42)” means that the
uppercase letter B (41) and the full-width forms of B (ff22) and b (ff42) are all indexed
as the lowercase letter b (62).

a (61): A (41), À (c0), Á (c1), Â (c2), Ã (c3), Å (c5), à (e0), á
(e1), â (e2), ã (e3), å (e5), Ā (100), ā (101), Ă (102), ă
(103), Ą (104), A (ff21), a (ff41)
b (62): B (42), B (ff22), b (ff42)
c (63): C (43), Ç (c7), ç (e7), ą (105), Ć (106), ć (107), Ĉ (108), ĉ
(109), Ċ (10a), ċ (10b), Č (10c), č (10d), C (ff23), c
(ff43)
d (64): D (44), Ď (10e), ď (10f), Đ (110), đ (111), D (ff24), d
(ff44)
e (65): E (45), È (c8), É (c9), Ê (ca), Ë (cb), è (e8), é (e9), ê
(ea), ë (eb), Ē (112), ē (113), Ĕ (114), ĕ (115), Ė (116), ė
(117), Ę (118), ę (119), Ě (11a), ě (11b), E (ff25), e
(ff45)
f (66): F (46), F (ff26), f (ff46)
g (67): G (47), Ĝ (11c), ĝ (11d), Ğ (11e), ğ (11f), Ġ (120), ġ (121),
Ģ (122), ģ (123), G (ff27), g (ff47)
h (68): H (48), Ĥ (124), ĥ (125), Ħ (126), ħ (127), H (ff28), h
(ff48)
i (69): I (49), Ì (cc), Í (cd), Î (ce), Ï (cf), ì (ec), í (ed), î
(ee), ï (ef), Ĩ (128), ĩ (129), Ī (12a), ī (12b), Ĭ (12c), ĭ
(12d), Į (12e), į (12f), İ (130), ı (131), I (ff29), i
(ff49)
j (6a): J (4a), Ĵ (134), ĵ (135), J (ff2a), j (ff4a)
k (6b): K (4b), Ķ (136), ķ (137), K (ff2b), k (ff4b)
l (6c): L (4c), ĸ (138), Ĺ (139), ĺ (13a), Ļ (13b), ļ (13c), Ľ (13d),
ľ (13e), Ł (141), ł (142), L (ff2c), l (ff4c)
m (6d): M (4d), M (ff2d), m (ff4d)
n (6e): N (4e), Ñ (d1), ñ (f1), Ń (143), ń (144), Ņ (145), ņ (146), Ň
(147), ň (148), Ŋ (14a), ŋ (14b), N (ff2e), n (ff4e)
o (6f): O (4f), Ò (d2), Ó (d3), Ô (d4), Õ (d5), Ø (d8), ò (f2), ó
(f3), ô (f4), õ (f5), ø (f8), Ō (14c), ō (14d), Ő (150), ő
(151), O (ff2f), o (ff4f)
p (70): P (50), P (ff30), p (ff50)
q (71): Q (51), Q (ff31), q (ff51)
r (72): R (52), Ŕ (154), ŕ (155), Ŗ (156), ŗ (157), Ř (158), ř (159),
R (ff32), r (ff52)
s (73): S (53), Ś (15a), ś (15b), Ŝ (15c), ŝ (15d), Ş (15e), ş (15f),
Š (160), š (161), S (ff33), s (ff53)
t (74): T (54), Ţ (162), ţ (163), Ť (164), ť (165), Ŧ (166), ŧ (167),
T (ff34), t (ff54)
u (75): U (55), Ù (d9), Ú (da), Û (db), ù (f9), ú (fa), û (fb), Ũ
(168), ũ (169), Ū (16a), ū (16b), Ŭ (16c), ŭ (16d), Ů (16e),
ů (16f), Ű (170), ű (171), Ų (172), ų (173), U (ff35), u (ff55)
v (76): V (56), V (ff36), v (ff56)
w (77): W (57), W (ff37), w (ff57)
x (78): X (58), X (ff38), x (ff58)
y (79): Y (59), Ý (dd), ý (fd), ÿ (ff), Y (ff39), y (ff59)
z (7a): Z (5a), Ź (179), ź (17a), Ż (17b), ż (17c), Ž (17d), ž (17e),
Z (ff3a), z (ff5a)
ae (00650061): Ä (c4), Æ (c6), ä (e4), æ (e6)
ð (f0): Ð (d0)
oe (0065006f): Ö (d6), ö (f6), Œ (152), œ (153)
ue (00650075): Ü (dc), ü (fc)
ss (00730073): ß (df)
ij (006a0069): IJ (132), ij (133)
ά (3ac): Ά (386)
έ (3ad): Έ (388)
ή (3ae): Ή (389)
ί (3af): Ί (38a)
ό (3cc): Ό (38c)
ύ (3cd): Ύ (38e)
ώ (3ce): Ώ (38f)
α (3b1): Α (391)
β (3b2): Β (392)
γ (3b3): Γ (393)
δ (3b4): Δ (394)
ε (3b5): Ε (395)
ζ (3b6): Ζ (396)
η (3b7): Η (397)
θ (3b8): Θ (398)
ι (3b9): Ι (399)
κ (3ba): Κ (39a)
λ (3bb): Λ (39b)
μ (3bc): Μ (39c)
ν (3bd): Ν (39d)
ξ (3be): Ξ (39e)
ο (3bf): Ο (39f)
π (3c0): Π (3a0)
ρ (3c1): Ρ (3a1)
σ (3c3): Σ (3a3)
τ (3c4): Τ (3a4)
υ (3c5): Υ (3a5)
φ (3c6): Φ (3a6)
χ (3c7): Χ (3a7)
ψ (3c8): Ψ (3a8)
ω (3c9): Ω (3a9)
ϊ (3ca): Ϊ (3aa)
ϋ (3cb): Ϋ (3ab)
ѐ (450): Ѐ (400)
ё (451): Ё (401)
ђ (452): Ђ (402)
ѓ (453): Ѓ (403)
є (454): Є (404)
ѕ (455): Ѕ (405)
і (456): І (406)
ї (457): Ї (407)
ј (458): Ј (408)
љ (459): Љ (409)
њ (45a): Њ (40a)
ћ (45b): Ћ (40b)
ќ (45c): Ќ (40c)
ѝ (45d): Ѝ (40d)
ў (45e): Ў (40e)
џ (45f): Џ (40f)
а (430): А (410)
б (431): Б (411)
в (432): В (412)
г (433): Г (413)
д (434): Д (414)
е (435): Е (415)
ж (436): Ж (416)
з (437): З (417)
и (438): И (418)
й (439): Й (419)
к (43a): К (41a)
л (43b): Л (41b)
м (43c): М (41c)
н (43d): Н (41d)
о (43e): О (41e)
п (43f): П (41f)
р (440): Р (420)
с (441): С (421)
т (442): Т (422)
у (443): У (423)
ф (444): Ф (424)
х (445): Х (425)
ц (446): Ц (426)
ч (447): Ч (427)
ш (448): Ш (428)
щ (449): Щ (429)
ъ (44a): Ъ (42a)
ы (44b): Ы (42b)
ь (44c): Ь (42c)
э (44d): Э (42d)
ю (44e): Ю (42e)
я (44f): Я (42f)
‫ا‬ Arabic
(627): ‫ ﴼ‬,(675) ‫ٵ‬ ,(625) ‫ إ‬,(623) ‫ أ‬,(622) ‫( آ‬fd3c), ‫ﴽ‬
(fd3d), ‫( ﹵‬fe75), ‫( ﺁ‬fe81), ‫( ﺂ‬fe82), ‫( ﺃ‬fe83), ‫( ﺄ‬fe84), ‫ﺇ‬
(fe87), ‫( ﺈ‬fe88), ‫( ﺍ‬fe8d), ‫( ﺎ‬fe8e)
‫ و‬Arabic (648): ‫ ؤ‬,(676) ‫ٶ‬ ,(624) ‫( ؤ‬fe85), ‫( ﺆ‬fe86), ‫( و‬feed), ‫ﻮ‬
(feee)
‫ ي‬Arabic (64a): ‫ ﯨ‬,(678) ‫ ٸ‬,(649) ‫ ى‬,(626) ‫( ئ‬fbe8), ‫( ﯩ‬fbe9), ‫ﱝ‬
(fc5d), ‫( ﲐ‬fc90), ‫( ﺉ‬fe89), ‫( ﺊ‬fe8a), ‫( ﺋ‬fe8b), ‫( ﺌ‬fe8c), ‫ﻯ‬
(feef), ‫( ﻰ‬fef0), ‫( ﻱ‬fef1), ‫( ﻲ‬fef2), ‫( ﻳ‬fef3), ‫( ﻴ‬fef4)
‫ ه‬Arabic (647): ‫ ﳙ‬,(629) ‫( ة‬fcd9), ‫( ﺓ‬fe93), ‫( ﺔ‬fe94), ‫( ﻩ‬fee9), ‫ﻪ‬
(feea), ‫( ﻫ‬feeb), ‫( ﻬ‬feec)
0 (30): ۰ (660), ۰ (6f0), 0 (ff10)
1 (31): ۱ (661), ۱ (6f1), 1 (ff11)
2 (32): ۲ (662), ۲ (6f2), 2 (ff12)
3 (33): ۳ (663), ۳ (6f3), 3 (ff13)
4 (34): ٤ (664), ۴ (6f4), 4 (ff14)
5 (35): ٥ (665), ۵ (6f5), 5 (ff15)
6 (36): ٦ (666), ۶ (6f6), 6 (ff16)
7 (37): ۷ (667), ۷ (6f7), 7 (ff17)
8 (38): ۸ (668), ۸ (6f8), 8 (ff18)
9 (39): ۹ (669), ۹ (6f9), 9 (ff19)
‫ ۇ‬Arabic (6c7): ‫ ﯇‬,(677) ‫( ٷ‬fbc7), ‫( ﯗ‬fbd7), ‫( ﯘ‬fbd8), ‫( ﯝ‬fbdd)
‫ ە‬Arabic (6d5): ‫( ۀ‬6c0), ‫( ۀ‬fba4), ‫( ﮥ‬fba5), ‫( ﯀‬fbc0)

‫ ه‬Arabic (6c1): ‫( ۀ‬6c2), ‫( ﮦ‬fba6), ‫( ﮧ‬fba7), ‫( ﮨ‬fba8), ‫( ﮩ‬fba9), ‫﯁ (fbc1), ‫( ﯂‬fbc2)
‫ ے‬Arabic (6d2): ‫( ۓ‬6d3), ‫( ے‬fbae), ‫( ﮯ‬fbaf), ‫( ۓ‬fbb0), ‫( ﮱ‬fbb1), ‫﯒‬
(fbd2)
- (2d): − (2212), - (ff0d)
ゟ (309f): ゜ (309c)
ア (30a2): ァ (30a1), ァ (ff67), ア (ff71)
イ (30a4): ィ (30a3), ィ (ff68), イ (ff72)
ウ (30a6): ゥ (30a5), ゥ (ff69), ウ (ff73)
エ (30a8): ェ (30a7), ェ (ff6a), エ (ff74)
オ (30aa): ォ (30a9), ォ (ff6b), オ (ff75)
ツ (30c4): ッ (30c3), ッ (ff6f), ツ (ff82)
ヤ (30e4): ャ (30e3), ャ (ff6c), ヤ (ff94)
ユ (30e6): ュ (30e5), ュ (ff6d), ユ (ff95)
ヨ (30e8): ョ (30e7), ョ (ff6e), ヨ (ff96)
ワ (30ef): ヮ (30ee), ワ (ff9c)
カ (30ab): ヵ (30f5), カ (ff76)
ケ (30b1): ヶ (30f6), ケ (ff79)
‫ ٱ‬Arabic (671): ‫( ٱ‬fb50), ‫( ﭑ‬fb51)
‫ ٻ‬Arabic (67b): ‫( ٻ‬fb52), ‫( ﭓ‬fb53), ‫( ﭔ‬fb54), ‫( ﭕ‬fb55)
‫ پ‬Arabic (67e): ‫( پ‬fb56), ‫( ﭗ‬fb57), ‫( ﭘ‬fb58), ‫( ﭙ‬fb59)
‫ ڀ‬Arabic (680): ‫( ڀ‬fb5a), ‫( ﭛ‬fb5b), ‫( ﭜ‬fb5c), ‫( ﭝ‬fb5d)
‫ ٺ‬Arabic (67a): ‫( ٺ‬fb5e), ‫( ﭟ‬fb5f), ‫( ﭠ‬fb60), ‫( ﭡ‬fb61)
‫ ٿ‬Arabic (67f): ‫( ٿ‬fb62), ‫( ﭣ‬fb63), ‫( ﭤ‬fb64), ‫( ﭥ‬fb65)
‫ ٹ‬Arabic (679): ‫( ٹ‬fb66), ‫( ﭧ‬fb67), ‫( ﭨ‬fb68), ‫( ﭩ‬fb69)
‫ ڤ‬Arabic (6a4): ‫( ڤ‬fb6a), ‫( ﭫ‬fb6b), ‫( ﭬ‬fb6c), ‫( ﭭ‬fb6d)
‫ ڦ‬Arabic (6a6): ‫( ڦ‬fb6e), ‫( ﭯ‬fb6f), ‫( ﭰ‬fb70), ‫( ﭱ‬fb71)
‫ ڄ‬Arabic (684): ‫( ڄ‬fb72), ‫( ﭳ‬fb73), ‫( ﭴ‬fb74), ‫( ﭵ‬fb75)
‫ ڃ‬Arabic (683): ‫( ڃ‬fb76), ‫( ﭷ‬fb77), ‫( ﭸ‬fb78), ‫( ﭹ‬fb79)
‫ چ‬Arabic (686): ‫( چ‬fb7a), ‫( ﭻ‬fb7b), ‫( ﭼ‬fb7c), ‫( ﭽ‬fb7d)
‫ ڇ‬Arabic (687): ‫( ڇ‬fb7e), ‫( ﭿ‬fb7f), ‫( ﮀ‬fb80), ‫( ﮁ‬fb81)
‫ ڍ‬Arabic (68d): ‫( ڍ‬fb82), ‫( ﮃ‬fb83)
‫ ڌ‬Arabic (68c): ‫( ڌ‬fb84), ‫( ﮅ‬fb85)
‫ ڎ‬Arabic (68e): ‫( ڎ‬fb86), ‫( ﮇ‬fb87)
‫ ڈ‬Arabic (688): ‫( ڈ‬fb88), ‫( ﮉ‬fb89)
‫ ژ‬Arabic (698): ‫( ژ‬fb8a), ‫( ﮋ‬fb8b)
‫ ڑ‬Arabic (691): ‫( ڑ‬fb8c), ‫( ﮍ‬fb8d)
‫ ک‬Arabic (6a9): ‫( ک‬fb8e), ‫( ﮏ‬fb8f), ‫( ﮐ‬fb90), ‫( ﮑ‬fb91)
‫ گ‬Arabic (6af): ‫( گ‬fb92), ‫( ﮓ‬fb93), ‫( ﮔ‬fb94), ‫( ﮕ‬fb95)
‫ ڳ‬Arabic (6b3): ‫( ڳ‬fb96), ‫( ﮗ‬fb97), ‫( ﮘ‬fb98), ‫( ﮙ‬fb99), ‫( ﮳‬fbb3)
‫ ڱ‬Arabic (6b1): ‫( ڱ‬fb9a), ‫( ﮛ‬fb9b), ‫( ﮜ‬fb9c), ‫( ﮝ‬fb9d)
‫ ں‬Arabic (6ba): ‫( ں‬fb9e), ‫( ﮟ‬fb9f), ‫( ﮺‬fbba)
‫ ڻ‬Arabic (6bb): ‫( ڻ‬fba0), ‫( ﮡ‬fba1), ‫( ﮢ‬fba2), ‫( ﮣ‬fba3), ‫( ﮻‬fbbb)
‫ ھ‬Arabic (6be): ‫( ھ‬fbaa), ‫( ﮫ‬fbab), ‫( ھ‬fbac), ‫( ﮭ‬fbad), ‫( ﮾‬fbbe)
‫ ڲ‬Arabic (6b2): ‫( ﮲‬fbb2)
‫ ڴ‬Arabic (6b4): ‫( ﮴‬fbb4)
‫ ڵ‬Arabic (6b5): ‫( ﮵‬fbb5)
‫ ڶ‬Arabic (6b6): ‫( ﮶‬fbb6)
‫ ڷ‬Arabic (6b7): ‫( ﮷‬fbb7)
‫ ڸ‬Arabic (6b8): ‫( ﮸‬fbb8)
‫ ڹ‬Arabic (6b9): ‫( ﮹‬fbb9)
‫ ڼ‬Arabic (6bc): ‫( ﮼‬fbbc)
‫ ڽ‬Arabic (6bd): ‫( ﮽‬fbbd)
‫ ڿ‬Arabic (6bf): ‫( ﮿‬fbbf)
‫ ة‬Arabic (6c3): ‫( ﯃‬fbc3)
‫ ۄ‬Arabic (6c4): ‫( ﯄‬fbc4)
‫ ۅ‬Arabic (6c5): ‫( ﯅‬fbc5), ‫( ﯠ‬fbe0), ‫( ﯡ‬fbe1)
‫ ۆ‬Arabic (6c6): ‫( ﯆‬fbc6), ‫( ﯙ‬fbd9), ‫( ﯚ‬fbda)

‫ۈ‬ Arabic (6c8): ‫﯈‬ (fbc8), ‫( ﯛ‬fbdb), ‫( ﯜ‬fbdc)
‫ۉ‬ Arabic (6c9): ‫﯉‬ (fbc9), ‫( ﯢ‬fbe2), ‫( ﯣ‬fbe3)
‫ۊ‬ Arabic (6ca): ‫﯊‬ (fbca)
‫ۋ‬ Arabic (6cb): ‫﯋‬ (fbcb), ‫( ﯞ‬fbde), ‫( ﯟ‬fbdf)
‫ی‬ Arabic (6cc): ‫﯌‬ (fbcc), ‫( ﯼ‬fbfc), ‫( ﯽ‬fbfd), ‫( ﯾ‬fbfe), ‫( ﯿ‬fbff)
‫ۍ‬ Arabic (6cd): ‫﯍‬ (fbcd)
‫ێ‬ Arabic (6ce): ‫﯎‬ (fbce)
‫ۏ‬ Arabic (6cf): ‫﯏‬ (fbcf)
‫ې‬ Arabic (6d0): ‫﯐‬ (fbd0), ‫( ﯤ‬fbe4), ‫( ﯥ‬fbe5), ‫( ﯦ‬fbe6), ‫( ﯧ‬fbe7)
‫ۑ‬ Arabic (6d1): ‫﯑‬ (fbd1)
‫ڭ‬ Arabic (6ad): ‫ڭ‬ (fbd3), ‫( ﯔ‬fbd4), ‫( ﯕ‬fbd5), ‫( ﯖ‬fbd6)
‫ ﯾﺎ‬Arabic (0627064a): ‫( ﯪ‬fbea), ‫( ﯫ‬fbeb)
‫ ﯾە‬Arabic (06d5064a): ‫( ﯬ‬fbec), ‫( ﯭ‬fbed)
‫ ﯾﻮ‬Arabic (0648064a): ‫( ﯮ‬fbee), ‫( ﯯ‬fbef)
‫ ﯾﯘ‬Arabic (06c7064a): ‫( ﯰ‬fbf0), ‫( ﯱ‬fbf1)
‫ ﯾﯚ‬Arabic (06c6064a): ‫( ﯲ‬fbf2), ‫( ﯳ‬fbf3)
‫ ﯾﯜ‬Arabic (06c8064a): ‫( ﯴ‬fbf4), ‫( ﯵ‬fbf5)
‫ ﯾﯥ‬Arabic (06d0064a): ‫( ﯶ‬fbf6), ‫( ﯷ‬fbf7), ‫( ﯸ‬fbf8)
‫ ﯾﻲ‬Arabic (064a064a): ‫( ﯹ‬fbf9), ‫( ﯺ‬fbfa), ‫( ﯻ‬fbfb), ‫( ﰃ‬fc03), ‫ﰄ‬
(fc04), ‫( ﱙ‬fc59), ‫( ﱚ‬fc5a), ‫( ﱨ‬fc68), ‫( ﱩ‬fc69), ‫( ﲕ‬fc95),
‫( ﲖ‬fc96)
‫ ﯾﺞ‬Arabic (062c064a): ‫( ﰀ‬fc00), ‫( ﱕ‬fc55), ‫( ﲗ‬fc97), ‫( ﳚ‬fcda)
‫ ﯾﺢ‬Arabic (062d064a): ‫( ﰁ‬fc01), ‫( ﱖ‬fc56), ‫( ﲘ‬fc98), ‫( ﳛ‬fcdb)
‫ ﯾﻢ‬Arabic (0645064a): ‫( ﰂ‬fc02), ‫( ﱘ‬fc58), ‫( ﱦ‬fc66), ‫( ﲓ‬fc93), ‫ﲚ‬
(fc9a), ‫( ﳝ‬fcdd), ‫( ﳟ‬fcdf), ‫( ﳰ‬fcf0)
‫ ﺑﺞ‬Arabic (062c0628): ‫( ﰅ‬fc05), ‫( ﲜ‬fc9c)
‫ ﺑﺢ‬Arabic (062d0628): ‫( ﰆ‬fc06), ‫( ﲝ‬fc9d)
‫ ﺑﺦ‬Arabic (062e0628): ‫( ﰇ‬fc07), ‫( ﲞ‬fc9e)
‫ ﺑﻢ‬Arabic (06450628): ‫( ﰈ‬fc08), ‫( ﱬ‬fc6c), ‫( ﲟ‬fc9f), ‫( ﳡ‬fce1)
‫ ﺑﻲ‬Arabic (064a0628): ‫( ﰉ‬fc09), ‫( ﰊ‬fc0a), ‫( ﱮ‬fc6e), ‫( ﱯ‬fc6f)
‫ ﺗﺞ‬Arabic (062c062a): ‫( ﰋ‬fc0b), ‫( ﲡ‬fca1)
‫ ﺗﺢ‬Arabic (062d062a): ‫( ﰌ‬fc0c), ‫( ﲢ‬fca2)
‫ ﺗﺦ‬Arabic (062e062a): ‫( ﰍ‬fc0d), ‫( ﲣ‬fca3)
‫ ﺗﻢ‬Arabic (0645062a): ‫( ﰎ‬fc0e), ‫( ﱲ‬fc72), ‫( ﲤ‬fca4), ‫( ﳣ‬fce3)
‫ ﺗﻲ‬Arabic (064a062a): ‫( ﰏ‬fc0f), ‫( ﰐ‬fc10), ‫( ﱴ‬fc74), ‫( ﱵ‬fc75)
‫ ﺛﺞ‬Arabic (062c062b): ‫( ﰑ‬fc11)
‫ ﺛﻢ‬Arabic (0645062b): ‫( ﰒ‬fc12), ‫( ﱸ‬fc78), ‫( ﲦ‬fca6), ‫( ﳥ‬fce5)
‫ ﺛﻲ‬Arabic (064a062b): ‫( ﰓ‬fc13), ‫( ﰔ‬fc14), ‫( ﱺ‬fc7a), ‫( ﱻ‬fc7b)
‫ ﺟﺢ‬Arabic (062d062c): ‫( ﰕ‬fc15), ‫( ﲧ‬fca7)
‫ ﺟﻢ‬Arabic (0645062c): ‫( ﰖ‬fc16), ‫( ﲨ‬fca8)
‫ ﺣﺞ‬Arabic (062c062d): ‫( ﰗ‬fc17), ‫( ﲩ‬fca9)
‫ ﺣﻢ‬Arabic (0645062d): ‫( ﰘ‬fc18), ‫( ﲪ‬fcaa)
‫ ﺧﺞ‬Arabic (062c062e): ‫( ﰙ‬fc19), ‫( ﲫ‬fcab)

‫ ﺧﺢ‬Arabic (062d062e): ‫( ﰚ‬fc1a)
‫ ﺧﻢ‬Arabic (0645062e): ‫( ﰛ‬fc1b), ‫( ﲬ‬fcac)
‫ ﺳﺞ‬Arabic (062c0633): ‫( ﰜ‬fc1c), ‫( ﲭ‬fcad), ‫( ﴴ‬fd34)
‫ ﺳﺢ‬Arabic (062d0633): ‫( ﰝ‬fc1d), ‫( ﲮ‬fcae), ‫( ﴵ‬fd35)
‫ ﺳﺦ‬Arabic (062e0633): ‫( ﰞ‬fc1e), ‫( ﲯ‬fcaf), ‫( ﴶ‬fd36)
‫ ﺳﻢ‬Arabic (06450633): ‫( ﰟ‬fc1f), ‫( ﲰ‬fcb0), ‫( ﳧ‬fce7)
‫ ﺻﺢ‬Arabic (062d0635): ‫( ﰠ‬fc20), ‫( ﲱ‬fcb1)
‫ ﺻﻢ‬Arabic (06450635): ‫( ﰡ‬fc21), ‫( ﲳ‬fcb3)
‫ ﺿﺞ‬Arabic (062c0636): ‫( ﰢ‬fc22), ‫( ﲴ‬fcb4)
‫ ﺿﺢ‬Arabic (062d0636): ‫( ﰣ‬fc23), ‫( ﲵ‬fcb5)
‫ ﺿﺦ‬Arabic (062e0636): ‫( ﰤ‬fc24), ‫( ﲶ‬fcb6)
‫ ﺿﻢ‬Arabic (06450636): ‫( ﰥ‬fc25), ‫( ﲷ‬fcb7)
‫ ﻃﺢ‬Arabic (062d0637): ‫( ﰦ‬fc26), ‫( ﲸ‬fcb8)
‫ ﻃﻢ‬Arabic (06450637): ‫( ﰧ‬fc27), ‫( ﴳ‬fd33), ‫( ﴺ‬fd3a)
‫ ﻇﻢ‬Arabic (06450638): ‫( ﰨ‬fc28), ‫( ﲹ‬fcb9), ‫( ﴻ‬fd3b)
‫ ﻋﺞ‬Arabic (062c0639): ‫( ﰩ‬fc29), ‫( ﲺ‬fcba)
‫ ﻋﻢ‬Arabic (06450639): ‫( ﰪ‬fc2a), ‫( ﲻ‬fcbb)
‫ ﻏﺞ‬Arabic (062c063a): ‫( ﰫ‬fc2b), ‫( ﲼ‬fcbc)
‫ ﻏﻢ‬Arabic (0645063a): ‫( ﰬ‬fc2c), ‫( ﲽ‬fcbd)
‫ ﻓﺞ‬Arabic (062c0641): ‫( ﰭ‬fc2d), ‫( ﲾ‬fcbe)
‫ ﻓﺢ‬Arabic (062d0641): ‫( ﰮ‬fc2e), ‫( ﲿ‬fcbf)
‫ ﻓﺦ‬Arabic (062e0641): ‫( ﰯ‬fc2f), ‫( ﳀ‬fcc0)
‫ ﻓﻢ‬Arabic (06450641): ‫( ﰰ‬fc30), ‫( ﳁ‬fcc1)
‫ ﻓﻲ‬Arabic (064a0641): ‫( ﰱ‬fc31), ‫( ﰲ‬fc32), ‫( ﱼ‬fc7c), ‫( ﱽ‬fc7d)
‫ ﻗﺢ‬Arabic (062d0642): ‫( ﰳ‬fc33), ‫( ﳂ‬fcc2)
‫ ﻗﻢ‬Arabic (06450642): ‫( ﰴ‬fc34), ‫( ﳃ‬fcc3)
‫ ﻗﻲ‬Arabic (064a0642): ‫( ﰵ‬fc35), ‫( ﰶ‬fc36), ‫( ﱾ‬fc7e), ‫( ﱿ‬fc7f)
‫ ﻛﺎ‬Arabic (06270643): ‫( ﰷ‬fc37), ‫( ﲀ‬fc80)
‫ ﻛﺞ‬Arabic (062c0643): ‫( ﰸ‬fc38), ‫( ﳄ‬fcc4)
‫ ﻛﺢ‬Arabic (062d0643): ‫( ﰹ‬fc39), ‫( ﳅ‬fcc5)
‫ ﻛﺦ‬Arabic (062e0643): ‫( ﰺ‬fc3a), ‫( ﳆ‬fcc6)
‫ ﻛﻞ‬Arabic (06440643): ‫( ﰻ‬fc3b), ‫( ﲁ‬fc81), ‫( ﳇ‬fcc7), ‫( ﳫ‬fceb)
‫ ﻛﻢ‬Arabic (06450643): ‫( ﰼ‬fc3c), ‫( ﲂ‬fc82), ‫( ﳈ‬fcc8), ‫( ﳬ‬fcec)
‫ﻛﻲ‬ Arabic (064a0643): ‫( ﰽ‬fc3d), ‫( ﰾ‬fc3e), ‫( ﲃ‬fc83), ‫( ﲄ‬fc84)
‫ﻟﺞ‬ Arabic (062c0644): ‫( ﰿ‬fc3f), ‫( ﳉ‬fcc9)
‫ﻟﺢ‬ Arabic (062d0644): ‫( ﱀ‬fc40), ‫( ﳊ‬fcca)
‫ﻟﺦ‬ Arabic (062e0644): ‫( ﱁ‬fc41), ‫( ﳋ‬fccb)
‫ ﻟﻢ‬Arabic (06450644): ‫( ﱂ‬fc42), ‫( ﲅ‬fc85), ‫( ﳌ‬fccc), ‫( ﳭ‬fced)
‫ ﻟﻲ‬Arabic (064a0644): ‫( ﱃ‬fc43), ‫( ﱄ‬fc44), ‫( ﲆ‬fc86), ‫( ﲇ‬fc87)
‫ ﻣﺞ‬Arabic (062c0645): ‫( ﱅ‬fc45), ‫( ﳎ‬fcce)

‫ ﻣﺢ‬Arabic (062d0645): ‫( ﱆ‬fc46), ‫( ﳏ‬fccf)
‫ ﻣﺦ‬Arabic (062e0645): ‫( ﱇ‬fc47), ‫( ﳐ‬fcd0)
‫ ﻣﻢ‬Arabic (06450645): ‫( ﱈ‬fc48), ‫( ﲉ‬fc89), ‫( ﳑ‬fcd1)
‫ ﻣﻲ‬Arabic (064a0645): ‫( ﱉ‬fc49), ‫( ﱊ‬fc4a)
‫ ﻧﺞ‬Arabic (062c0646): ‫( ﱋ‬fc4b), ‫( ﳒ‬fcd2)
‫ ﻧﺢ‬Arabic (062d0646): ‫( ﱌ‬fc4c), ‫( ﳓ‬fcd3)
‫ ﻧﺦ‬Arabic (062e0646): ‫( ﱍ‬fc4d), ‫( ﳔ‬fcd4)
‫ ﻧﻢ‬Arabic (06450646): ‫( ﱎ‬fc4e), ‫( ﲌ‬fc8c), ‫( ﳕ‬fcd5), ‫( ﳮ‬fcee)
‫ ﻧﻲ‬Arabic (064a0646): ‫( ﱏ‬fc4f), ‫( ﱐ‬fc50), ‫( ﲎ‬fc8e), ‫( ﲏ‬fc8f)
‫ ھﺞ‬Arabic (062c0647): ‫( ﱑ‬fc51), ‫( ﳗ‬fcd7)
‫ ھﻢ‬Arabic (06450647): ‫( ﱒ‬fc52), ‫( ﳘ‬fcd8)
‫ ھﻲ‬Arabic (064a0647): ‫( ﱓ‬fc53), ‫( ﱔ‬fc54)
‫ ﯾﺦ‬Arabic (062e064a): ‫( ﱗ‬fc57), ‫( ﲙ‬fc99), ‫( ﳜ‬fcdc)
‫ ذ‬Arabic (630): ‫( ﱛ‬fc5b), ‫( ﺫ‬feab), ‫( ﺬ‬feac)
‫ ر‬Arabic (631): ‫( ﱜ‬fc5c), ‫( ﺭ‬fead), ‫( ﺮ‬feae)
‫ ﯾﺮ‬Arabic (0631064a): ‫( ﱤ‬fc64), ‫( ﲑ‬fc91)
‫ ﯾﺰ‬Arabic (0632064a): ‫( ﱥ‬fc65), ‫( ﲒ‬fc92)
‫ ﯾﻦ‬Arabic (0646064a): ‫( ﱧ‬fc67), ‫( ﲔ‬fc94)
‫ ﺑﺮ‬Arabic (06310628): ‫( ﱪ‬fc6a)
‫ ﺑﺰ‬Arabic (06320628): ‫( ﱫ‬fc6b)
‫ ﺑﻦ‬Arabic (06460628): ‫( ﱭ‬fc6d)
‫ ﺗﺮ‬Arabic (0631062a): ‫( ﱰ‬fc70)
‫ ﺗﺰ‬Arabic (0632062a): ‫( ﱱ‬fc71)
‫ ﺗﻦ‬Arabic (0646062a): ‫( ﱳ‬fc73)
‫ ﺛﺮ‬Arabic (0631062b): ‫( ﱶ‬fc76)
‫ ﺛﺰ‬Arabic (0632062b): ‫( ﱷ‬fc77)
‫ ﺛﻦ‬Arabic (0646062b): ‫( ﱹ‬fc79)
‫ ﻣﺎ‬Arabic (06270645): ‫( ﲈ‬fc88)
‫ ﻧﺮ‬Arabic (06310646): ‫( ﲊ‬fc8a)
‫ ﻧﺰ‬Arabic (06320646): ‫( ﲋ‬fc8b)
‫ ﻧﻦ‬Arabic (06460646): ‫( ﲍ‬fc8d)
‫ ﯾﮫ‬Arabic (0647064a): ‫( ﲛ‬fc9b), ‫( ﳞ‬fcde), ‫( ﳠ‬fce0), ‫( ﳱ‬fcf1)
‫ ﺑﮫ‬Arabic (06470628): ‫( ﲠ‬fca0), ‫( ﳢ‬fce2)
‫ ﺗﮫ‬Arabic (0647062a): ‫( ﲥ‬fca5), ‫( ﳤ‬fce4)
‫ ﺻﺦ‬Arabic (062e0635): ‫( ﲲ‬fcb2)
‫ ﻟﮫ‬Arabic (06470644): ‫( ﳍ‬fccd), ‫( ﷲ‬fdf2)
‫ ﻧﮫ‬Arabic (06470646): ‫( ﳖ‬fcd6), ‫( ﳯ‬fcef)
‫ ﺛﮫ‬Arabic (0647062b): ‫( ﳦ‬fce6)
‫ ﺳﮫ‬Arabic (06470633): ‫( ﳨ‬fce8), ‫( ﴱ‬fd31)
‫ ﺷﻢ‬Arabic (06450634): ‫( ﳩ‬fce9), ‫( ﴌ‬fd0c), ‫( ﴨ‬fd28), ‫( ﴰ‬fd30)
‫ ﺷﮫ‬Arabic (06470634): ‫( ﳪ‬fcea), ‫( ﴲ‬fd32)

‫ ﻃﻲ‬Arabic (064a0637): ‫( ﳵ‬fcf5), ‫( ﳶ‬fcf6), ‫( ﴑ‬fd11), ‫( ﴒ‬fd12)
‫ ﻋﻲ‬Arabic (064a0639): ‫( ﳷ‬fcf7), ‫( ﳸ‬fcf8), ‫( ﴓ‬fd13), ‫( ﴔ‬fd14)
‫ ﻏﻲ‬Arabic (064a063a): ‫( ﳹ‬fcf9), ‫( ﳺ‬fcfa), ‫( ﴕ‬fd15), ‫( ﴖ‬fd16)
‫ ﺳﻲ‬Arabic (064a0633): ‫( ﳻ‬fcfb), ‫( ﳼ‬fcfc), ‫( ﴗ‬fd17), ‫( ﴘ‬fd18)
‫ ﺷﻲ‬Arabic (064a0634): ‫( ﳽ‬fcfd), ‫( ﳾ‬fcfe), ‫( ﴙ‬fd19), ‫( ﴚ‬fd1a)
‫ ﺣﻲ‬Arabic (064a062d): ‫( ﳿ‬fcff), ‫( ﴀ‬fd00), ‫( ﴛ‬fd1b), ‫( ﴜ‬fd1c)
‫ ﺟﻲ‬Arabic (064a062c): ‫( ﴁ‬fd01), ‫( ﴂ‬fd02), ‫( ﴝ‬fd1d), ‫( ﴞ‬fd1e)
‫ ﺧﻲ‬Arabic (064a062e): ‫( ﴃ‬fd03), ‫( ﴄ‬fd04), ‫( ﴟ‬fd1f), ‫( ﴠ‬fd20)
‫ ﺻﻲ‬Arabic (064a0635): ‫( ﴅ‬fd05), ‫( ﴆ‬fd06), ‫( ﴡ‬fd21), ‫( ﴢ‬fd22)
‫ ﺿﻲ‬Arabic (064a0636): ‫( ﴇ‬fd07), ‫( ﴈ‬fd08), ‫( ﴣ‬fd23), ‫( ﴤ‬fd24)
‫ ﺷﺞ‬Arabic (062c0634): ‫( ﴉ‬fd09), ‫( ﴥ‬fd25), ‫( ﴭ‬fd2d), ‫( ﴷ‬fd37)
‫ ﺷﺢ‬Arabic (062d0634): ‫( ﴊ‬fd0a), ‫( ﴦ‬fd26), ‫( ﴮ‬fd2e), ‫( ﴸ‬fd38)
‫ ﺷﺦ‬Arabic (062e0634): ‫( ﴋ‬fd0b), ‫( ﴧ‬fd27), ‫( ﴯ‬fd2f), ‫( ﴹ‬fd39)
‫ ﺷﺮ‬Arabic (06310634): ‫( ﴍ‬fd0d), ‫( ﴩ‬fd29)
‫ ﺳﺮ‬Arabic (06310633): ‫( ﴎ‬fd0e), ‫( ﴪ‬fd2a)
‫ ﺻﺮ‬Arabic (06310635): ‫( ﴏ‬fd0f), ‫( ﴫ‬fd2b)
‫ ﺿﺮ‬Arabic (06310636): ‫( ﴐ‬fd10), ‫( ﴬ‬fd2c)
‫ ﺗﺠﻢ‬Arabic (0645062c062a): ‫( ﵐ‬fd50)
‫ ﺗﺤﺞ‬Arabic (062c062d062a): ‫( ﵑ‬fd51), ‫( ﵒ‬fd52)
‫ ﺗﺤﻢ‬Arabic (0645062d062a): ‫( ﵓ‬fd53)
‫ ﺗﺨﻢ‬Arabic (0645062e062a): ‫( ﵔ‬fd54)
‫ ﺗﻤﺞ‬Arabic (062c0645062a): ‫( ﵕ‬fd55)
‫ ﺗﻤﺢ‬Arabic (062d0645062a): ‫( ﵖ‬fd56)
‫ ﺗﻤﺦ‬Arabic (062e0645062a): ‫( ﵗ‬fd57)
‫ ﺟﻤﺢ‬Arabic (062d0645062c): ‫( ﵘ‬fd58), ‫( ﵙ‬fd59)
‫ ﺣﻤﻲ‬Arabic (064a0645062d): ‫( ﵚ‬fd5a), ‫( ﵛ‬fd5b)
‫ ﺳﺤﺞ‬Arabic (062c062d0633): ‫( ﵜ‬fd5c)
‫ ﺳﺠﺢ‬Arabic (062d062c0633): ‫( ﵝ‬fd5d)
‫ ﺳﺠﻲ‬Arabic (064a062c0633): ‫( ﵞ‬fd5e)
‫ ﺳﻤﺢ‬Arabic (062d06450633): ‫( ﵟ‬fd5f), ‫( ﵠ‬fd60)
‫ ﺳﻤﺞ‬Arabic (062c06450633): ‫( ﵡ‬fd61)
‫ ﺳﻤﻢ‬Arabic (064506450633): ‫( ﵢ‬fd62), ‫( ﵣ‬fd63)
‫ ﺻﺤﺢ‬Arabic (062d062d0635): ‫( ﵤ‬fd64), ‫( ﵥ‬fd65)
‫ ﺻﻤﻢ‬Arabic (064506450635): ‫( ﵦ‬fd66), ‫( ﷅ‬fdc5)
‫ ﺷﺤﻢ‬Arabic (0645062d0634): ‫( ﵧ‬fd67), ‫( ﵨ‬fd68)
‫ ﺷﺠﻲ‬Arabic (064a062c0634): ‫( ﵩ‬fd69)
‫ ﺷﻤﺦ‬Arabic (062e06450634): ‫( ﵪ‬fd6a), ‫( ﵫ‬fd6b)
‫ ﺷﻤﻢ‬Arabic (064506450634): ‫( ﵬ‬fd6c), ‫( ﵭ‬fd6d)
‫ ﺿﺤﻲ‬Arabic (064a062d0636): ‫( ﵮ‬fd6e), ‫( ﶫ‬fdab)

‫ ﺿﺨﻢ‬Arabic (0645062e0636): ‫( ﵯ‬fd6f), ‫( ﵰ‬fd70)
‫ ﻃﻤﺢ‬Arabic (062d06450637): ‫( ﵱ‬fd71), ‫( ﵲ‬fd72)
‫ ﻃﻤﻢ‬Arabic (064506450637): ‫( ﵳ‬fd73)
‫ ﻃﻤﻲ‬Arabic (064a06450637): ‫( ﵴ‬fd74)
‫ ﻋﺠﻢ‬Arabic (0645062c0639): ‫( ﵵ‬fd75), ‫( ﷄ‬fdc4)
‫ ﻋﻤﻢ‬Arabic (064506450639): ‫( ﵶ‬fd76), ‫( ﵷ‬fd77)
‫ ﻋﻤﻲ‬Arabic (064a06450639): ‫( ﵸ‬fd78), ‫( ﶶ‬fdb6)
‫ ﻏﻤﻢ‬Arabic (06450645063a): ‫( ﵹ‬fd79)
‫ ﻏﻤﻲ‬Arabic (064a0645063a): ‫( ﵺ‬fd7a), ‫( ﵻ‬fd7b)
‫ ﻓﺨﻢ‬Arabic (0645062e0641): ‫( ﵼ‬fd7c), ‫( ﵽ‬fd7d)
‫ ﻗﻤﺢ‬Arabic (062d06450642): ‫( ﵾ‬fd7e), ‫( ﶴ‬fdb4)
‫ ﻗﻤﻢ‬Arabic (064506450642): ‫( ﵿ‬fd7f)
‫ ﻟﺤﻢ‬Arabic (0645062d0644): ‫( ﶀ‬fd80), ‫( ﶵ‬fdb5)
‫ ﻟﺤﻲ‬Arabic (064a062d0644): ‫( ﶁ‬fd81), ‫( ﶂ‬fd82)
‫ ﻟﺠﺞ‬Arabic (062c062c0644): ‫( ﶃ‬fd83), ‫( ﶄ‬fd84)
‫ ﻟﺨﻢ‬Arabic (0645062e0644): ‫( ﶅ‬fd85), ‫( ﶆ‬fd86)
‫ ﻟﻤﺢ‬Arabic (062d06450644): ‫( ﶇ‬fd87), ‫( ﶈ‬fd88)
‫ ﻣﺤﺞ‬Arabic (062c062d0645): ‫( ﶉ‬fd89)
‫ ﻣﺤﻢ‬Arabic (0645062d0645): ‫( ﶊ‬fd8a)
‫ ﻣﺤﻲ‬Arabic (064a062d0645): ‫( ﶋ‬fd8b)
‫ ﻣﺠﺢ‬Arabic (062d062c0645): ‫( ﶌ‬fd8c)
‫ ﻣﺠﻢ‬Arabic (0645062c0645): ‫( ﶍ‬fd8d)
‫ ﻣﺨﺞ‬Arabic (062c062e0645): ‫( ﶎ‬fd8e)
‫ ﻣﺨﻢ‬Arabic (0645062e0645): ‫( ﶏ‬fd8f)
‫ ﻣﺠﺦ‬Arabic (062e062c0645): ‫( ﶒ‬fd92)
‫ ھﻤﺞ‬Arabic (062c06450647): ‫( ﶓ‬fd93)
‫ ھﻤﻢ‬Arabic (064506450647): ‫( ﶔ‬fd94)
‫ ﻧﺤﻢ‬Arabic (0645062d0646): ‫( ﶕ‬fd95)
‫ ﻧﺤﻲ‬Arabic (064a062d0646): ‫( ﶖ‬fd96), ‫( ﶳ‬fdb3)
‫ ﻧﺠﻢ‬Arabic (0645062c0646): ‫( ﶗ‬fd97), ‫( ﶘ‬fd98)
‫ ﻧﺠﻲ‬Arabic (064a062c0646): ‫( ﶙ‬fd99), ‫( ﷇ‬fdc7)
‫ ﻧﻤﻲ‬Arabic (064a06450646): ‫( ﶚ‬fd9a), ‫( ﶛ‬fd9b)
‫ ﯾﻤﻢ‬Arabic (06450645064a): ‫( ﶜ‬fd9c), ‫( ﶝ‬fd9d)
‫ ﺑﺨﻲ‬Arabic (064a062e0628): ‫( ﶞ‬fd9e)
‫ ﺗﺠﻲ‬Arabic (064a062c062a): ‫( ﶟ‬fd9f), ‫( ﶠ‬fda0)
‫ ﺗﺨﻲ‬Arabic (064a062e062a): ‫( ﶡ‬fda1), ‫( ﶢ‬fda2)
‫ ﺗﻤﻲ‬Arabic (064a0645062a): ‫( ﶣ‬fda3), ‫( ﶤ‬fda4)
‫ ﺟﻤﻲ‬Arabic (064a0645062c): ‫( ﶥ‬fda5), ‫( ﶧ‬fda7)
‫ ﺟﺤﻲ‬Arabic (064a062d062c): ‫( ﶦ‬fda6), ‫( ﶾ‬fdbe)

‫ ﺳﺨﻲ‬Arabic (064a062e0633): ‫( ﶨ‬fda8), ‫( ﷆ‬fdc6)
‫ ﺻﺤﻲ‬Arabic (064a062d0635): ‫( ﶩ‬fda9)
‫ ﺷﺤﻲ‬Arabic (064a062d0634): ‫( ﶪ‬fdaa)
‫ ﻟﺠﻲ‬Arabic (064a062c0644): ‫( ﶬ‬fdac)
‫ ﻟﻤﻲ‬Arabic (064a06450644): ‫( ﶭ‬fdad)
‫ ﯾﺤﻲ‬Arabic (064a062d064a): ‫( ﶮ‬fdae)
‫ ﯾﺠﻲ‬Arabic (064a062c064a): ‫( ﶯ‬fdaf)
‫ ﯾﻤﻲ‬Arabic (064a0645064a): ‫( ﶰ‬fdb0)
‫ ﻣﻤﻲ‬Arabic (064a06450645): ‫( ﶱ‬fdb1)
‫ ﻗﻤﻲ‬Arabic (064a06450642): ‫( ﶲ‬fdb2)
‫ ﻛﻤﻲ‬Arabic (064a06450643): ‫( ﶷ‬fdb7)
‫ ﻧﺠﺢ‬Arabic (062d062c0646): ‫( ﶸ‬fdb8), ‫( ﶽ‬fdbd)
‫ ﻣﺨﻲ‬Arabic (064a062e0645): ‫( ﶹ‬fdb9)
‫ ﻟﺠﻢ‬Arabic (0645062c0644): ‫( ﶺ‬fdba), ‫( ﶼ‬fdbc)
‫ ﻛﻤﻢ‬Arabic (064506450643): ‫( ﶻ‬fdbb), ‫( ﷃ‬fdc3)
‫ ﺣﺠﻲ‬Arabic (064a062c062d): ‫( ﶿ‬fdbf)
‫ ﻣﺠﻲ‬Arabic (064a062c0645): ‫( ﷀ‬fdc0)
‫ ﻓﻤﻲ‬Arabic (064a06450641): ‫( ﷁ‬fdc1)
‫ ﺑﺤﻲ‬Arabic (064a062d0628): ‫( ﷂ‬fdc2)
‫ ﺻﻠﮯ‬Arabic (06d206440635): ‫( ﷰ‬fdf0)
‫ ﻗﻠﮯ‬Arabic (06d206440642): ‫( ﷱ‬fdf1)
‫ اﻛﺒﺮ‬Arabic (0631062806430627): ‫( ﷳ‬fdf3)
‫ ﻣﺤﻤﺪ‬Arabic (062f0645062d0645): ‫( ﷴ‬fdf4)
‫ ﺻﻠﻌﻢ‬Arabic (0645063906440635): ‫( ﷵ‬fdf5)
‫ رﺳﻮل‬Arabic (0644064806330631): ‫( ﷶ‬fdf6)
‫ ﻋﻞ‬Arabic (06440639): ‫( ﷷ‬fdf7)
‫ ﺳﻠﻢ‬Arabic (064506440633): ‫( ﷸ‬fdf8)
‫ ﺻﻠﻲ‬Arabic (064a06440635): ‫( ﷹ‬fdf9)
‫ ﺻﻠﻰ ﷲ ﻋﻠﯿﮫ وﺳﻠﻢ‬Arabic
(064506440633064800200647064a0644063900200647064406440627002
0064906440635): ‫( ﷺ‬fdfa)
‫ ﺟﻞ ﺟﻼﻟﮫ‬Arabic (0647064406270644062c00200644062c): ‫( ﷻ‬fdfb)
‫ ﷼‬Arabic (0644062706cc0631): ‫( ﷼‬fdfc)
ًArabic (64b): ‫( ﹱ‬fe71)
‫ ٳ‬Arabic (673): ‫( ﹳ‬fe73)
َArabic (64e): ‫( ﹷ‬fe77)
ُArabic (64f): ‫( ﹹ‬fe79)
ِArabic (650): ‫( ﹻ‬fe7b)
ّArabic (651): ‫( ﹽ‬fe7d)
ْArabic (652): ‫( ﹿ‬fe7f)
‫ ء‬Arabic (621): ‫( ﺀ‬fe80)
‫ ب‬Arabic (628): ‫( ب‬fe8f), ‫( ﺐ‬fe90), ‫( ﺑ‬fe91), ‫( ﺒ‬fe92)
‫ ت‬Arabic (62a): ‫( ت‬fe95), ‫( ﺖ‬fe96), ‫( ﺗ‬fe97), ‫( ﺘ‬fe98)
‫ ث‬Arabic (62b): ‫( ﺙ‬fe99), ‫( ﺚ‬fe9a), ‫( ﺛ‬fe9b), ‫( ﺜ‬fe9c)

‫ج‬ Arabic (62c): ‫( ج‬fe9d), ‫( ﺞ‬fe9e), ‫( ﺟ‬fe9f), ‫( ﺠ‬fea0)
‫ح‬ Arabic (62d): ‫( ح‬fea1), ‫( ﺢ‬fea2), ‫( ﺣ‬fea3), ‫( ﺤ‬fea4)
‫خ‬ Arabic (62e): ‫( خ‬fea5), ‫( ﺦ‬fea6), ‫( ﺧ‬fea7), ‫( ﺨ‬fea8)
‫د‬ Arabic (62f): ‫( د‬fea9), ‫( ﺪ‬feaa)
‫ز‬ Arabic (632): ‫( ﺯ‬feaf), ‫( ﺰ‬feb0)
‫س‬ Arabic (633): ‫( س‬feb1), ‫( ﺲ‬feb2), ‫( ﺳ‬feb3), ‫( ﺴ‬feb4)
‫ش‬ Arabic (634): ‫( ش‬feb5), ‫( ﺶ‬feb6), ‫( ﺷ‬feb7), ‫( ﺸ‬feb8)
‫ص‬ Arabic (635): ‫( ص‬feb9), ‫( ﺺ‬feba), ‫( ﺻ‬febb), ‫( ﺼ‬febc)
‫ض‬ Arabic (636): ‫( ض‬febd), ‫( ﺾ‬febe), ‫( ﺿ‬febf), ‫( ﻀ‬fec0)
‫ط‬ Arabic (637): ‫( ﻁ‬fec1), ‫( ﻂ‬fec2), ‫( ﻃ‬fec3), ‫( ﻄ‬fec4)
‫ظ‬ Arabic (638): ‫( ظ‬fec5), ‫( ﻆ‬fec6), ‫( ﻇ‬fec7), ‫( ﻈ‬fec8)
‫ع‬ Arabic (639): ‫( ع‬fec9), ‫( ﻊ‬feca), ‫( ﻋ‬fecb), ‫( ﻌ‬fecc)
‫غ‬ Arabic (63a): ‫( غ‬fecd), ‫( ﻎ‬fece), ‫( ﻏ‬fecf), ‫( ﻐ‬fed0)
‫ف‬ Arabic (641): ‫( ف‬fed1), ‫( ﻒ‬fed2), ‫( ﻓ‬fed3), ‫( ﻔ‬fed4)
‫ق‬ Arabic (642): ‫( ق‬fed5), ‫( ﻖ‬fed6), ‫( ﻗ‬fed7), ‫( ﻘ‬fed8)
‫ك‬ Arabic (643): ‫( ك‬fed9), ‫( ﻚ‬feda), ‫( ﻛ‬fedb), ‫( ﻜ‬fedc)
‫ل‬ Arabic (644): ‫( ل‬fedd), ‫( ﻞ‬fede), ‫( ﻟ‬fedf), ‫( ﻠ‬fee0)
‫م‬ Arabic (645): ‫( م‬fee1), ‫( ﻢ‬fee2), ‫( ﻣ‬fee3), ‫( ﻤ‬fee4)
‫ن‬ Arabic (646): ‫( ن‬fee5), ‫( ﻦ‬fee6), ‫( ﻧ‬fee7), ‫( ﻨ‬fee8)
‫ﻻ‬ Arabic (06270644): ‫( ﻵ‬fef5), ‫( ﻶ‬fef6), ‫( ﻷ‬fef7), ‫( ﻸ‬fef8), ‫ﻹ‬
(fef9), ‫( ﻺ‬fefa), ‫( ﻻ‬fefb), ‫( ﻼ‬fefc)
・ (30fb): ・ (ff65)
ヲ (30f2): ヲ (ff66)
ー (30fc): ー (ff70)
キ (30ad): キ (ff77)
ク (30af): ク (ff78)
コ (30b3): コ (ff7a)
サ (30b5): サ (ff7b)
シ (30b7): シ (ff7c)
ス (30b9): ス (ff7d)
セ (30bb): セ (ff7e)
ソ (30bd): ソ (ff7f)
タ (30bf): タ (ff80)
チ (30c1): チ (ff81)
テ (30c6): テ (ff83)
ト (30c8): ト (ff84)
ナ (30ca): ナ (ff85)
ニ (30cb): ニ (ff86)
ヌ (30cc): ヌ (ff87)
ネ (30cd): ネ (ff88)
ノ (30ce): ノ (ff89)
ハ (30cf): ハ (ff8a)
ヒ (30d2): ヒ (ff8b)
フ (30d5): フ (ff8c)
ヘ (30d8): ヘ (ff8d)
ホ (30db): ホ (ff8e)
マ (30de): マ (ff8f)
ミ (30df): ミ (ff90)
ム (30e0): ム (ff91)
メ (30e1): メ (ff92)
モ (30e2): モ (ff93)
ラ (30e9): ラ (ff97)
リ (30ea): リ (ff98)
ル (30eb): ル (ff99)
レ (30ec): レ (ff9a)

ロ (30ed): ロ (ff9b)
ン (30f3): ン (ff9d)
゛ (309b): ゙ (ff9e)
゜ (309c): ゚ (ff9f)

Additional Information
Version history and selected built-in utilities.

Version History
This section of the document identifies which releases of the search engine, from
Search Engine 10 through Search Engine 21, contain new features or material changes
in behavior. This is not comprehensive, but a list of the more notable changes.

Search Engine 10
Released with Content Server 10, approximately September 2010. The versions of
the search engine prior to this release were generally referred to as OT7.
• Add support for key-value attributes in text metadata, used for multi-lingual
metadata indexing and search.
• Added Hindi, Tamil and Telugu to the standard tokenizer.
• New percent full model with “soft” update-only mode and rebalancing.
• Defragmentation of metadata storage.
• Added ModifyByQuery.
• Added DeleteByQuery.
• Added Disk Retrieval Storage mode.
• Bi-gram indexing of far-east character sets. May require re-indexing of existing
content with far-east character sets.
• Faster ‘stemming’ focused on noun plurals.
• Content Status feature added.
• Synthetic regions: partition name and mode.
• Change bad metadata to record error instead of halting.
• Search Federator closes connections from inactive clients.
• Rolling log file support added.
• Various bug fixes
• Support for Java 6 (Update 20)

Search Engine 10 Update 1


Released with Content Server 10 Update 1 and Content Server 9.7.1 November 2010
cumulative patch, approximately February 2011.
• Stagger defragmentation times to limit CPU loading.
• Fewer checkpoints on startup if conversions take place.

• Tokenizer modified for case-insensitive Russian character indexing. Optional re-
indexing of Russian content may be desired to leverage this feature for older
objects.
• Default number of results per query reduced, improves get results performance.
• New TIMESTAMP data type implemented using ISO 8601 format.
• Aggregate-text feature implemented.
• REMOVE region capability added for [Link].
• RENAME feature for existing regions added.
• MERGE feature for existing regions added.
• Various bug fixes

Search Engine 10 Update 2


Released with Content Server 10 Update 2 and Content Server 9.7.1 Update 2,
approximately April 2011.
• RENAME feature extended for new data being indexed.
• MERGE feature extended for new data being indexed.
• Copy of configuration files included as reference in log files.
• Various bug fixes

Search Engine 10 Update 3


Released with Content Server 10 Update 3 and Content Server 9.7.1 Update 3,
approximately July 2011.
• Percent full defaults revised downwards for conservative deployment.
• Backup utility now performs Level 1 + partial Level 4 verification.
• Add OTFileType and OTContentLanguage to default [Link]
• Check for base offset errors when creating fragments.
• Various bug fixes.

Search Engine 10 Update 4


Released with Content Server 10 Update 4 and Content Server 9.7.1 Update 4,
approximately September 2011.
• Adds search facet generation capabilities.
• Adds socket communication as alternative to RMI.
• Enhanced cleanup thread, more aggressive file removal.
• Fixed: Memory loss with defragmentation.

• Configuration limits for maximum transactions in a metalog.

Search Engine 10 Update 5


Released with Content Server 10 Update 5 and Content Server 9.7.1 Update 5,
approximately December 2011.
• Sockets and threads persist between Search Federators and Search Engines
• Search Engines can now terminate / recover broken socket connections
• Accumulator memory requirements reduced
• GetStatusText now much faster, but possibly less accurate in estimates

Search Engine 10 Update 5 Release 2


Available March 2012, this interim release fixed a select set of issues and represented
a “stable” version for use with Content Server 10 and 9.7.1 Updates 3 through 5.

Search Engine 10 Update 6


Released with Content Server 10 Update 6 and Content Server 9.7.1 Update 6,
approximately March 2012.
• Accumulator memory use reduced by chunking large text objects

Search Engine 10 Update 7


Released with Content Server 10 Update 7 and Content Server 9.7.1 Update 7,
approximately June 2012.
• Additional accumulator memory use reduction
• Accumulator performance improvement when chunking
• Multi-value text region support with Facets
• Relaxed whitespace rules parsing configuration files
• Invalid hostnames allowed by Windows now reported as errors

Search Engine 10 Update 8


Released with Content Server 10 Service Pack 2 and Content Server 9.7.1 Update 8,
approximately September 2012.
• Cleanup of unused facet data structures
• Improved Date facets
• Support for FileSize facets
• Significant performance improvements converting OT7 indexes
• Text metadata size and number of values protection

• Disk-based index storage available for beta testing

Search Engine 10 Update 9


Released with Content Server 10 Service Pack 2 Update 9 and Content Server 9.7.1
Update 9, approximately December 2012.
• Added in-place conversion of region type definitions
• Performance improvements for regular expressions and left truncation
• Improved cache management of search facets
• Added limits to number of values and total length of values for text metadata

Search Engine 10 Update 10


Released with Content Server 10 Service Pack 2 Update 10 and Content Server 9.7.1
Update 10, approximately March 2013.
• Addition of support for DATE metadata region type
• Optional compression of Checkpoint files
• Improved removal of invalid regions
• Performance improvements with Low Memory mode
• Improved hit highlighting for bi-gram indexed characters
• Command line echo option added to search client

Search Engine 10 Update 11


Released with Content Server 10 Service Pack 2 Update 11 and Content Server 9.7.1
Update 11, approximately June 2013.
• Accepts “Oracle” as vendor for use with Java 7
• Improved removal of nulls in regions during conversion
• Compute protected facets on startup
• IO Buffer leaks corrected
• TEXT fields can now be used as TypeFieldRankers
• Added query by OTPartitionName or OTPartitionMode

Search Engine 10 Update 12


Released with Content Server 10 Service Pack 2 Update 12 and Content Server 9.7.1
Update 12, approximately September 2013.
• Separate metalog checkpoint settings for Low Memory Mode

• Index biasing feature introduced to give preference to filling partitions
• Optional limit on number of simultaneous partitions writing checkpoints
• Fast traversal option for TEXT region in RAM with many identical values
• Get Regions function now includes region type information

Search Engine 10.5


Released with Content Server 10.5 and Content Server 10 Service Pack 2 Update 13,
approximately December 2013.
• Index Engines will tolerate Search Engines not consuming metalogs
• Improved Index Engine shutdown reduces forced search grid restarts
• Introduction of the “LIKE” modifier
• RETIRE mode for partitions introduced
• REMOVE date regions
• Optimize facet creation on startup

Search Engine 10.5 Update 2014-03


Released with Content Server 10.5 Update 2014-03 and Content Server 10 Service
Pack 2 Update 2014-03, approximately March 2014.
• Performance optimization for sorting search results
• Optimize disk reads during metadata updates
• Remove TIME regions from DATETIME pairs

Search Engine 10.5 Update 2014-03 R2


Released as a hotfix for Content Server 10.5 Update 2014-03 and Content Server 10
Service Pack 2 Update 2014-03, March 2014.
• Bypass sorting if “Nothing” selected for sort order
• Optimize batch processing of DeleteByQuery and ModifyByQuery
• Add features for caching of search results

Search Engine 10.5 Update 2014-06


Released with Content Server 10.5 Update 2014-06 and Content Server 10 Service
Pack 2 Update 2014-06, June 2014.
• Add features for caching of search results
• Optimize indexing by using 1 thread per partition in Update Distributor

• Add administration command to force checkpoint writes
• Enhance Update Distributor “getstatustext” to include checkpoint data
• Enable the ‘LIKE’ capabilities for filename and part number search
• Add ‘Modify’ indexing operation

Search Engine 10.5 Update 2014-09


Released with Content Server 10.5 Update 2014-09 and Content Server 10 Service
Pack 2 Update 2014-09, September 2014.
• Reduced garbage collection, different estimation of percent full
• Added feature for searchable email domain regions
• ModifyByQuery will update regions with empty values
• OTFileType region repair on startup

Search Engine 10.5 Update 2014-12


Released with Content Server 10.5 Service Pack 1 Update 2014-12 and Content
Server 10 Update 2014-12, December 2014.
• Comparison queries for full text disallowed by default
• Relative day, week, month, quarter and year queries for DATE regions
• Added support for IN and NOT IN operators
• ModifyByQuery can completely remove empty values
• Improved Norsk/Dansk and Arabic tokenization
• Optional selective timestamps based on subtype

Search Engine 10.5 Update 2015-03


Released with Content Server 10.5 Update 2015-03 and Content Server 10 Update
2015-03, March 2015.
• Merge Tokens added to allow partitions to merge when out of disk space
• Partition rebalancing using disk percent full is supported
• Background text region index merges enabled, providing smaller checkpoint
files, faster startup, and higher ingestion throughput

Search Engine 10.5 Update 2015-06


Released with Content Server 10.5 Update 2015-06 and Content Server 10 Update
2015-06, June 2015.
• Additional Tokenizers can be used with text metadata regions

• Phonetic matching optimization, typically 30% faster
• Update Distributor performance statistics are logged

Search Engine 10.5 Update 2015-09


Released with Content Server 10.5 Update 2015-09 and Content Server 10 Update
2015-09, September 2015.
• Search execution times available per query or statistically
• Search facet types can be queried
• Getstatustext basic - efficient partition status calls added
• Added 2 decimal currency data type
• Exact substring matching for text metadata values added

Search Engine 10.5 Update 2015-12


Released with Content Server 10.5 Update 2015-12 and Content Server 10 Update
2015-12, December 2015.
• ConvertREtoRelevancy setting for query performance improvement on older
updates of Content Server.
• Reduced memory use with queries on ENUM and DATE types
• Federator caching now supports facets, statistics

Search Engine 16 Update 2016-03


Released with Content Server 16, Content Server 10.5 Update 2016-03 and Content
Server 10 Update 2016-03, March 2016.
• Protected facets are stored in the checkpoint for faster startup
• Per-query relevance boosting introduced
• Improved indexing throughput by optimizing file reads
• Reduced memory use in Search Engines for queries
• Large queries can be processed in chunks

Search Engine 16.0.1 (June 2016)


Released with Content Server 16.0.1, Content Server 10.5 Update 2016-06 and
Content Server 10 Update 2016-06, June 2016.
• Default operators for queries before chunking increased to 15,000
• Region comparisons converted to range operators
• Termset operator introduced

The Information Company™ 262


Understanding Search Engine 21

• Stemset operation introduced


• Query memory optimizations

Search Engine 16.0.2 (September 2016)


Released with Content Server 16.0.2 and Content Server 10.5 Update 2016-09,
September 2016.
• Metadata region forgery prevention added, using otb= attribute
• Optimizations for certain search query scenarios
• Corrected several relevance computation edge case issues

Search Engine 16.0.3 (December 2016)


Released with Content Server 16.0.3 and Content Server 10.5 Update 2016-12,
December 2016.
• Defragmentation is monthly with Low Memory Mode
• Optimized date facet generation
• Various indexing and query performance improvements

Search Engine 16.2.0 (March 2017)


Released with Content Server 16.2.0, Content Server 16.0.4, and Content Server 10.5
Update 2017-03, March 2017.
• [Link] can be used to logically append lines to [Link]
• Maximum number of sub-indexes now configurable
• Search Federator can report more than 2 billion results

Search Engine 16.2.1 (June 2017)


Released with Content Server 16.2.1, Content Server 16.0.5, and Content Server 10.5
Update 2017-06, June 2017.
• Priority CHAIN region introduced
• [first …] syntax added for dynamic queries on priority chains
• IndexVerify can test that TEXT values are readable
• Maximum number of sub-indexes now configurable
• Optimization in Metalog replay and AGGREGATE-TEXT indexing
• Java 8 u121

Search Engine 16.2.2 (September 2017)


Released with Content Server 16.2.2, Content Server 16.0.6, and Content Server 10.5
Update 2017-09, September 2017.
• MIN and MAX region capabilities added
• Optimization for grouping ModifyByQuery ops added, off by default
• Improved timeout handling for very large result sets
• Java 8 u131

Search Engine 16.2.3 (December 2017)


Released with Content Server 16.2.3, Content Server 16.0.7, and Content Server 10.5
Update 2017-12, December 2017.
• Bloom Filters added to optimize ModifyByQuery performance
• Reduced memory required to index very large objects
• Reduced temporary memory needed to build facets
• ANY search operator added
• ANY region query feature added
• ALL region query feature added
• Optimized hit highlighting performance through parallelization
• Java 8 u144

Search Engine 16.2.4 (March 2018)


Released with Content Server 16.2.4, Content Server 16.0.8, and Content Server 10.5
Update 2018-03, March 2018.
• Fix problem with thumbnail requests being indexed as part of text
• Extended file error retries due to DFS problems
• Introduce Top Words lists
• First implementation of the TEXT operator
• Java 8 u152

Search Engine 16.2.5 (June 2018)


Released with Content Server 16.2.5, Content Server 16.0.9, and Content Server 10.5
Update 2018-06, June 2018.
• Add Transaction Log file capability
• Add Reverse Dictionary
• Force conversion of some CS Integers to Long

• Java 8 u162

Search Engine 16.2.6 (September 2018)


Released with Content Server 16.2.6, Content Server 16.0.10, and Content Server
10.5 Update 2018-09, September 2018.
• Reverse Dictionary optimizations
• Report Disk I/O performance stats in log files
• Java 8 u181

Search Engine 16.2.7 (December 2018)


Released with Content Server 16.2.7, Content Server 16.0.11, and Content Server
10.5 Update 2018-12, December 2018.
• Optional low priority search queue added
• Java 8 u192

Search Engine 16.2.8 (March 2019)


Released with Content Server 16.2.8, Content Server 16.0.12, March 2019.
• Optional query suspension feature to prevent index throttling
• OpenJDK 11.0.1

Search Engine 16.2.9 (June 2019)


Released with Content Server 16.2.9 and Content Server 16.0.13, June 2019.
• Introduced span operator for advanced proximity searching
• New backup procedure
• Capture and log statistics on network errors
• OpenJDK 11.0.1

Search Engine 16.2.10 (September 2019)


Released with Content Server 16.2.10 and Content Server 16.0.14, September 2019.
• Optimization of numeric range search in text metadata fields
• OpenJDK 11.0.3

Search Engine 16.2.11 (December 2019)


Released with Content Server 16.2.11 and Content Server 16.0.15, December 2019.
• Capture and log statistics on network errors

• OTObjectUpdateTime always refreshed
• Regular expression and wildcard support in span operator
• Interval-based Search Agents introduced
• AdoptOpenJDK 11.0.5

Search Engine 20.2 (March 2020)


Released with Content Server 20.2 and Content Server 16.0.16, March 2020.
• Reserved partitions for objects with very large full text
• Additional span limits and controls
• Search Agent timing added to performance summary
• Support for long tokens
• AdoptOpenJDK 11.0.5

Search Engine 20.3 (July 2020)


Released with Content Server 20.3 and Content Server 16.0.17, July 2020.
• Improved batch splitting based on object and metadata size
• GroupLocalUpdates defaults to true
• Improve Search Federator query queue servicing
• Agent IPool info added to hourly stats
• Option to compress content sent to Index Engines
• Search Agent timing added to performance summary
• AdoptOpenJDK 11.0.6

Search Engine 20.4 (October 2020)


Released with Content Server 20.4 and Content Server 16.0.18, October 2020.
• Optimize modify/delete by query lookup times
• Locale-sensitive ordering of text metadata
• Improve distribution of objects to partitions by total size
• getstatustext performance for Update Distributor stats
• “good” file indicators written for index backups
• AdoptOpenJDK 11.0.7

Search Engine 21.1 (January 2021)


Released with Content Server 21.1 and Content Server 16.0.19, January 2021.

• Disk retries reading control files
• Add GlobalUpdate times to performance metrics
• Optimize performance with concurrent searches
• OTTextSize added
• DumpKeys added to shipping utilities
• AdoptOpenJDK 11.0.8

Search Engine 21.2 (April 2021)


Released with Content Server 21.2, April 2021.
• Improve relevance ranking expression evaluation
• Query optimization using long skip lists
• Various bug fixes
• AdoptOpenJDK 11.0.8


Error Codes

Errors and warnings from OTSE may be exposed in multiple ways. Process Error
codes are returned in response to communications with the search processes. Detailed
information about errors is normally recorded in the log files. The tables below
describe many of the possible Process Error codes. This is not a comprehensive list.

Update Distributor
Code Description

129 Unable to load JNI library. To read or write IPools, OTSE leverages
Content Server libraries. This file is named [Link] (Windows) or
[Link] and is expected to reside in the <OTHOME>\bin directory.

131 Insufficient memory. The memory allocation can be adjusted using the
-Xmx parameter on the command line. Content Server exposes this
control in its administration pages.

132 Unhandled exception. This error generally indicates that an error
occurred for which no specific error handling exists. Resolving the
cause of this error will usually require examination of the detailed logs.

149 Command line error. At least one of the parameters on the
command line used to start the Update Distributor is incorrect.

150 Invalid URL.

152 Invalid partition name.

153 Error reading configuration file. The [Link] file is improperly
constructed, or has an invalid setting.

170 Insufficient memory in an Index Engine. The Update Distributor
cannot run because at least one Index Engine is out of memory.

171 Unable to contact at least one Index Engine. Possible causes
include incorrect configuration or conflicting use of ports.

172 One or more Index Engines have insufficient disk space.

173 Index is full. All Index Engines report they are unable to accept new
objects.

174 IPool read or write error occurred.


Index Engine
Code Description

132 Unhandled exception. This error generally indicates that an error
occurred for which no specific error handling exists. Resolving the
cause of this error will usually require examination of the detailed logs.

149 Command line error. At least one of the parameters on the
command line used to start the Index Engine is incorrect.

150 Invalid URL.

153 Error reading configuration file. The [Link] file is improperly
constructed, or has an invalid setting.

171 Unable to contact at least one Index Engine. Possible causes
include incorrect configuration or conflicting use of ports.

174 IPool read or write error occurred.

175 Unreadable index. The Index Engine is unable to load an existing
index partition.

176 A restore from backup operation has failed.

180 Index failed to start. In some cases, this error is acceptable if the
Index Engine is already running.

181 Request to start the Index Engine has been ignored because an
index restore operation is in progress.

Search Federator
Code Description

132 Unhandled exception. This error generally indicates that an error
occurred for which no specific error handling exists. Resolving the
cause of this error will usually require examination of the detailed logs.

149 Command line error. At least one of the parameters on the
command line used to start the Search Federator is incorrect.

150 Invalid URL.

153 Error reading configuration file. The [Link] file is improperly
constructed, or has an invalid setting.


Search Engine
Code Description

132 Unhandled exception. This error generally indicates that an error
occurred for which no specific error handling exists. Resolving the
cause of this error will usually require examination of the detailed logs.

149 Command line error. At least one of the parameters on the
command line used to start the Search Engine is incorrect.

150 Invalid URL.

153 Error reading configuration file. The [Link] file is improperly
constructed, or has an invalid setting.

Utilities
OTSE contains a number of built-in utilities and diagnostic tools. These are often used
by OpenText support staff and developers when analyzing and testing an index. Many
of these will have limited value for customers, but may be of assistance when
diagnosing particular index problems. For convenience, basic documentation for some
of the more common utilities is included here.
Many of the utilities are NOT a supported feature of the product. They are not
guaranteed to work as described, and may be modified or removed at any time.

You are strongly advised to use the utilities on a backup of your index,
and not on a production copy. The potential exists to render an index
unusable for your application with some of these tools. You have
been warned.

General Syntax

The utilities are invoked by launching the search JAR using appropriate parameters.
The general syntax is:
java [-Xmx#M] -classpath <othome>\bin\[Link]
[Link].<subclasspath>
[parameters]

Where:

<othome>\bin is the file path where the search JAR file is located.

<subclasspath> is the name of the utility to be used.

[parameters] vary depending on the utility, as described in the following sections.


An example command line, using the VerifyIndex utility:

java -classpath c:\opentext\bin\[Link] [Link]
-level 5 -config [Link] -indexengine ieName
-outFile verify_results.out -verbose true

Backup
The backup utility is used to create either differential or full backups of a partition. Refer
to the section on Backup and Restore for more information.
java -classpath [Link] [Link]
-inifile J:\index\[Link]
Where the inifile identifies the backup configuration file to be used.

Restore
The restore utility is used to restore an index from a prior backup. Refer to the
section on Backup and Restore for more information.

java -classpath [Link] [Link]
-inifile J:\index\[Link]

Where the inifile identifies the [Link] file to be used. You may need to run the
restore process several times. Using the utility directly is not for the faint of heart;
in most cases you should let Content Server manage the restore for you.

DumpKeys
The DumpKeys utility attempts to generate a list of all the object IDs for objects in the
partition. This is often a tool of last resort for repairing a corrupted index, because
DumpKeys can sometimes extract data from a partition which is otherwise unreadable.
The input to DumpKeys is the [Link] file and partition information, and the output is
a file of object IDs. Sample output looks like this:
c DataId=41280133&Version=1
c DataId=41280132&Version=1
c DataId=41280131&Version=1
The first character details where the object ID was found. If in the checkpoint file, the
first character is a ‘c’ (as in the example above). If an object ID was found in the
metalog file (recently indexed), the first character reflects the operation type:


n: new
a: add
r: replace
m: modify
d: delete
Invoking DumpKeys:

java -Xmx2000M -Xss10M -cp .\[Link] [Link]
-inifile <path_to_search.ini> -sectionName <IE_or_SE_Section_Name>
-log <path_to_log_file> -output <path_to_DumpKeys_output>

Parameters:
path_to_search.ini: Path to the [Link] file, typically /config/[Link]
IE_or_SE_Section_Name: The full section name, including the SearchEngine_ or
IndexEngine_ prefix.
path_to_log_file: Path to where the log file should be created.
path_to_DumpKeys_output: Path to where the output file should be created.

VerifyIndex
This utility performs internal checks of the structure of the index. Levels 1 through 5
are cumulative, and level 10 is a distinct operation. Parameters are:
-level K -config SearchIniFile -indexengine IEName
[-outFile OutFile] [-html true] [-verbose true]

level: a value between 1-5 or 10; see below for details
config: the [Link] file to use
IEName: Index Engine signature in the [Link] file of the partition to be processed
outFile: output file containing results of verification
html: true requests output in HTML format
verbose: true generates progress and status information, even without errors

Levels are cumulative from 1-5; level 10 is distinct:

1: surface-level check of the index, identifying inconsistencies
2: also verifies the checksum of the checkpoint file
3: also verifies the checkpoint is loadable
4: also verifies the checksum of all full text index files
5: also verifies the word pointers in the content index are consistent
10: verifies that the word pointers in the metadata index are present
In addition to the output report, the VerifyIndex process exits with code 0 if the index
is OK, and 1 if it is not.
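Because the exit code is machine-readable, VerifyIndex lends itself to scheduled health
checks. The following is a minimal sketch in Java, assuming the general invocation
syntax described earlier; the JAR path, class name and parameter values are
placeholders to be substituted, not actual product names:

import java.util.List;

// Minimal sketch: run VerifyIndex from a wrapper and act on its exit code
// (0 = index OK, 1 = problems found). Substitute real paths and names.
public class NightlyVerify {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder(List.of(
                "java", "-classpath", "<othome>/bin/<search jar>",
                "<package>.VerifyIndex",
                "-level", "1",
                "-config", "<path to search ini>",
                "-indexengine", "ieName"))
            .inheritIO() // show the utility's report on this console
            .start();
        if (p.waitFor() != 0) {
            System.err.println("VerifyIndex reported problems; review the report.");
        }
    }
}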
VerifyIndex should be run with a partition which is currently not in use by search or
index engines. Although a level 1 test may only take a few seconds, a level 5 or level
10 test might require up to 10 hours to run, depending upon the partition size and the
capabilities of the computer.
A Level 3 test has an option to rigorously test that TEXT metadata values can be
successfully read. When enabled, this check nearly doubles the execution time. It is
controlled by a setting in the [IndexEngine_] section of the [Link] file that identifies
how many exceptions should be logged before stopping; the default is 300, and a value
of 0 disables the check:
MaxVerifyIndexMODExceptions=300
By way of example, a sample level 5 test output is shown below:

Level 5 check starting on /index4
Level 5 verifies the postings in the content files are consistent

SubIndex Statistics

SubIndex \index4\12401 Tokens 22675687/25761333 (88%) Postings 1361632092/1440242913 (95%)
SubIndex \index4\21883 Tokens 15940369/16069809 (99%) Postings 761869365/774825024 (98%)
SubIndex \index4\23990 Tokens 931805/934883 (100%) Postings 34611908/34731025 (100%)
SubIndex \index4\25273 Tokens 348058/348058 (100%) Postings 5892009/5892009 (100%)
SubIndex \index4\27066 Tokens 28350/28350 (100%) Postings 1293163/1293163 (100%)

Index Statistics

Index Size = 4975289946
Max internalID = 629940
Active ObjectIDs = 594584
Total index: Tokens 39924269/43142433 (93%) Postings 2165298537/2256984134 (96%)
Total core tokens = 5954186 Total token length = 48085094 Average token length = 8.075847
Total other tokens = 37185437 Total token length = 405466390 Average token length = 10.903903
Total region tokens = 2810 Total token length = 42521 Average token length = 15.132029
Total content compression = 0.2505714
Level 5 check complete in 23528700 ms
If errors are found in a Level 10 diagnostic, they can usually be corrected using the
RebuildIndex utility.


RebuildIndex
This utility rebuilds the dictionary and index for metadata in a partition. This is possible
because an exact copy of the metadata is stored in the checkpoint files. This does not
affect the full text index. This utility can often be used to repair errors detected by a
Level 10 VerifyIndex.
Parameters:

-iniFile SearchIniFile -indexengine IEName

Where
SearchIniFile is the location and name of the [Link] file which should be used.
IEName is the name of the partition which should be rebuilt.

Because this utility needs to build and load the entire index, you may need to ensure
an appropriate -Xmx (memory allocation) parameter is specified on the Java command
line.
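An example invocation, following the same general syntax as the other utilities; the
memory allocation, paths and engine name shown are illustrative:

java -Xmx8000M -classpath c:\opentext\bin\[Link] [Link]
-iniFile c:\opentext\config\[Link] -indexengine IEname0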

LogInterleaver
Each component of the search grid (Index and Search Engines, Search Federators and
the Update Distributor) creates its own log files, and it can be difficult to trace a
single operation through multiple log files. The LogInterleaver utility combines
multiple log files into a single file, ordering entries by their time stamps, to simplify
interpretation. The output has a slightly different syntax: each line of output is
prefixed by the name of the original log file.
Parameters:

-d logDir -o outputFile

OR

outputFile logFile1 logFile2 ... logFileN


The first usage will combine all the log files within a requested directory into the log file
specified by outputfile. In this usage, the logDir should be the same as the working
directory. The second usage combines a specific list of files.
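Example invocations for both usages; the directory and file names are illustrative:

java -classpath c:\opentext\bin\[Link] [Link]
-d c:\opentext\logs -o combined.log

java -classpath c:\opentext\bin\[Link] [Link]
combined.log se1.log ie1.log updatedist.log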

[Link]
Log files from the search components have a time stamp in milliseconds from a
reference date. This utility will convert a log file to have human-readable time/date
values instead, which can be helpful when interpreting the logs manually.
This utility is somewhat unusual in that it reads from console input and writes to console
output, so the typical usage is to “pipe” the source logfile into the java command line,
and redirect the output to a target file like this:


type <logfile> | java -classpath c:\opentext\bin\[Link]
[Link] > [Link]

[Link]
This utility enters a console loop. You enter one line of text, and it responds by printing
out each search token generated on a separate line. Control-C will terminate the loop.
Optional command line parameters:
-TokenizerOptions <Number> -tokenizerfile <RegExParserFile>

Where Number represents the bitwise controls for tokenizer options, as defined in the
Tokenizer section of this document. The optional tokenizerfile parameter specifies a
custom tokenizer definition to be used.
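An example invocation using the default tokenizer settings:

java -classpath c:\opentext\bin\[Link] [Link]

Each line typed at the console is then echoed back as one token per line; Control-C
exits the loop.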

ProfileMetadata
This utility function loads a checkpoint file, and writes information about the metadata
in the checkpoint to the console. You may wish to redirect the console output to a file
to capture the data.
Parameters:

[-l (0|1|2)] [-values (true|false)] checkpointFile

Where:
l: profile level where 0=High Level,
1=Field Level (Default),
2=Field Part Level
values: true requests the # of objects with values
and the estimated total memory requirement
checkpointFile: file name of the checkpoint file to be
profiled
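An example invocation at the default field level, with value details enabled and the
console output redirected to a file; the memory allocation and checkpoint file name
are illustrative:

java -Xmx2000M -classpath c:\opentext\bin\[Link] [Link]
-l 1 -values true checkpoint.dat > profile.out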
Refer to sample output fragments for the profile levels below.

Level 0:
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719

Level 1:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
2060 Field(Text):OTDocCompany
1996 Field(Text):OTDocRevisionNumber
10668 Field(Text):OTVerCDate
0 Field(Text):OTReservedByName


3872084 Total accounted for memory


NumOfDataIDs=10721
NumOfValidDataIDs=10719

Level 2:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
1376 Field(Text [RAM]):OTDocCompany dictionary (mappingEntries=0 wsEntries=1
tokenEntries=3)
256 Field(Text [RAM]):OTDocCompany content
428 Field(Text [RAM]):OTDocCompany index
2060 Field(Text [RAM]):OTDocCompany combined
1312 Field(Text [RAM]):OTDocRevisionNumber dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
256 Field(Text [RAM]):OTDocRevisionNumber content
428 Field(Text [RAM]):OTDocRevisionNumber index
1996 Field(Text [RAM]):OTDocRevisionNumber combined
10668 Field(Date):OTVerCDate combined
684 Field(Date):OTDateEffective combined
1312 Field(Text [RAM]):OTContentIsTruncated dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
33920 Field(Text [RAM]):OTContentIsTruncated content
428 Field(Text [RAM]):OTContentIsTruncated index
35660 Field(Text [RAM]):OTContentIsTruncated combined
Field(UserID):OTAssignedTo combined

Field(Integer):OTTimeCompleted combined
0 Field(UserLogin):OTReservedByName combined
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
If the parameter “values” is true, the information for each region is considerably more
detailed:

Field(Text):OTWFMapTaskUserData values= 18 valuesSize= 7133 memorySize= 4108


Field(Text):OTVersionName values= 10630 valuesSize= 10630 memorySize= 35788
Field(Text):OTHP values= 5 valuesSize= 948 memorySize= 4364
Field(Date):OTVerMDate values= 10630 valuesSize= 85040 memorySize= 10668
Field(Integer):OTOwnerID values= 10719 valuesSize= 53583 memorySize= 7424
Field(Text):OTUserGroupID values= 10706 valuesSize= 42824 memorySize= 35916

[Link]
The search configuration files allow you to control several aspects of file I/O. Tuning
these for optimal performance can be difficult, since many factors are involved. The
DiskReadWriteSpeed utility can help by simulating disk performance using several of
the available configurations. For each mode, this utility performs 32768 iterations of
the test using an 8 KB block of data. Note that this information can help you tune disk
performance or identify system I/O bottlenecks, but it is not necessarily sufficient to
draw a firm conclusion regarding the optimal configuration.
a firm conclusion regarding the optimal configuration.
Parameters:

(write|read|both) TestDirectory
The operations tested are:

Stream read/write using RandomAccessFile
Stream read/write using FileOutputStream
Stream read/write using NIO1 (FileOutputStream base)
Stream read/write using NIO2 (RandomAccessFile base)
Random read/write using RandomAccessFile
Random read/write using NIO1 (FileOutputStream base)
Random read/write using NIO2 (RandomAccessFile base)

NOTE: NIO operations pull from a ByteBuffer as opposed to using a static byte array.
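An example invocation that measures both read and write performance against a
directory on an index volume; the path is illustrative:

java -classpath c:\opentext\bin\[Link] [Link]
both J:\index\iotest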

SearchClient
The SearchClient is a console application that allows you to interactively issue
commands to the Search Federator. It is useful for confirming that search is working
as expected, or for running queries without an application such as Content Server. All
console output is expressed in UTF-8 characters. Note that you might need to increase
the default Search Federator timeout values when using the SearchClient.
It is possible to use the SearchClient with an index that is also being used in a live
production system. In this situation, an open SearchClient consumes a search
transaction from the available pool, which may reduce the transactions available to
production searches.
Parameters:
-host SFHost -port SearchPort [-adminport SFAdminPort] [-time true]
[-echo true] [-pretty true] [-csv true]

SFHost is the URI for the target Search Federator, connected on SearchPort. The
-time true parameter adds response time information to each response.
The -echo parameter will add the input command to the output. This is useful when
redirecting input from a file for batch operations, so you can associate the commands
with the responses. By default, echo is false.
The -pretty parameter will use an alternate formatting of GET RESULTS. The alternate
format does not adhere to the API spec, but is better formatted for human readability
when developing or debugging.


The -csv true parameter will output the results in a form that can be easily imported
into a spreadsheet (comma-separated values). This feature is most useful when
redirecting input and output from/to files. If -pretty is specified, it takes precedence
over -csv.
The -adminport setting enables specific commands to be interpreted and sent to the
administration port of the Search Federator. These admin commands are:

Reload: reload settings from the [Link] file
Stats: get statistics
Sendshutdown: send a request to shut down the Search Federator
In operation, the console of the SearchClient supports search query operations plus
some special commands. Query operations include SELECT, GET RESULTS, and
similar functions. The special administrative operations are:

exit / quit: close the client
close: close the socket without closing the client
sleep #: make the client wait for # ms
sendquit: shut down the Search Federator via the search port
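An example interactive session, plus a batch variant that redirects queries from a file;
the host, port and file names are illustrative:

java -classpath c:\opentext\bin\[Link] [Link]
-host localhost -port 9801 -time true

java -classpath c:\opentext\bin\[Link] [Link]
-host localhost -port 9801 -echo true -csv true < queries.txt > results.csv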

Repair BaseOffset Errors


To minimize disk space, the offsets (or pointers) to portions of the index are kept as
values relative to the start of an index fragment. In rare cases where network or disk
errors have occurred, it might be possible for the base values to become misaligned,
resulting in overlapping (and therefore incorrect) indices which will not load properly.
The current version of the Search Engine will normally catch these cases when they
occur and exit immediately, but older versions of the software would sometimes
propagate these, resulting in a badly formed index.
The search engine contains a number of utilities that can be used to repair an index in
this state. This is a multi-step process, and ultimately any identified overlapping
objects will need to be deleted and re-indexed.
The main utility is RepairSubIndexes, which cuts the overlapped portions out of the
affected sub-indexes.
The second utility is DumpSubIndexesIDs. For an input partition, the
DumpSubIndexesIDs utility goes through every active sub-index in the partition and
prints internal-external objectID pairs to a file. It also prints all the internal-external
IDs in the deleteMask (i.e., items marked for deletion).
The third tool is DiffObjectIDFiles. It takes as input the output file of
RepairSubIndexes and an output file from DumpSubIndexesIDs, and produces a diff of
internal-external ID pairs. The use of these tools is explained below.

Problem Illustration
subindex1 has internal IDs = 1,2,3,4,5,6,8,9
subindex2 has internal IDs = 5,7,8,9,10


BaseOffset problem: subindex1 should only contain 1,2,3,4. Internal IDs 5,6,8 and 9
overlap with subindex2.
Fix: cut 5,6,8 and 9 from subindex1.
Items 5, 8, and 9 already exist as duplicates in subindex2. However, item 6 only exists
in subindex1, so the fix would remove the only instance of item 6 from the index
content.
After fix:

subindex1New: 1,2,3,4
subindex2: 5,7,8,9,10
Output of DumpSubIndexesIDs before fix: IDs for subindex1, subindex2 and the
deleteMask
Output of RepairSubIndexes: a file which lists the objects removed from subindex1
(5, 6, 8 and 9) along with their external IDs for re-indexing.
Output of the Diff tool: a file which lists only object 6 along with its external ID for
re-indexing.
Output of DumpSubIndexesIDs after fix: IDs for subindex1New, subindex2 and the
deleteMask

Repair Option 1
This approach requires about 30 to 60 minutes for a typical partition, and makes the
index usable as quickly as possible. However, there may be a lot of objects that need
to be reindexed.
Running the RepairSubIndexes utility:

java -classpath [Link] [Link]
-level x -config [Link] -indexengine firstEngine

where level x is 1, 2 or 4 (slowest but most detailed)
where config is [Link]
where firstEngine is the IE associated with the partition that you want fixed, as
specified in the [Link]
example (assuming the new [Link] is in the current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./[Link]
[Link] -level 1 -config
/opentext/config/[Link] -indexengine IEname0
The minimum [Link] sections necessary to run this tool are the IE section, the
Dataflow section and the Partition section. Any file paths mentioned in these sections
should be adjusted to point to the actual location of your index partition directory in
your environment.


Steps
1. Back up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc.) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will warn
and exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending
on the size of the sub-index that is being fixed. The utility will produce an
output file bearing the name of the sub-index that was fixed. This file contains
the internal-external objectID (OTObject region value) pairs that can be
utilized for re-indexing.
3. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.

Additional Comments:
While running the tools, it is strongly recommended that the output be
redirected out to a file for easier analysis (… > [Link]).
During the repair process, it is possible to navigate inside the directory where
the index under repair sits. It is possible to observe the new sub-index
fragment being written out, growing larger in size over time.
At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being
repaired (the same location where the new fragment is made).

Repair Option 2
This method typically requires about 45 minutes longer per partition, but minimizes the
number of objects which may require re-indexing.
Running the RepairSubIndexes utility

java -classpath [Link] [Link]
-level x -config [Link] -indexengine firstEngine

• where level x is 1, 2 or 4 (slowest but most detailed)


• where config is [Link]
• where firstEngine is the IE associated with the partition that you want fixed, as
specified in the [Link]
example (assuming the new [Link] is in the current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./[Link]
[Link] -level 1
-config /opentext/config/[Link] -indexengine IEname0
Running the DumpSubIndexesIDs utility

java -classpath [Link];[Link]
[Link]
-config [Link] -indexengine firstEngine

NOTE: no level info needs to be specified, and the utility jar is required.

example (assuming that both the new [Link] and [Link] are in the
current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./[Link];./[Link]
[Link]
-config /opentext/config/[Link] -indexengine IEname0

Running the DiffObjectIDFiles utility


prompt>java -classpath [Link];[Link]
[Link]
-dir /index -deleteIDsFile fileName -subIndexIDsFile fileName

• where dir is the index directory where all the output files were written out
• where deleteIDsFile is the output file made by the RepairSubIndexes utility for
the sub-index that was fixed
• where subIndexIDsFile is the appropriate output file made by
DumpSubIndexesIDs utility. It is crucial to use the correct file; if we have
subindex1 and subindex2 with overlap and subindex1 was cut out, then use the
DumpSubIndexesIDs file for subindex2.
example:

prompt>java -Xmx1300M -Xss10M -classpath ./[Link];./[Link]


[Link] -dir /index -deleteIDsFile
index12401_ReIndexIDs.log -subIndexIDsFile 21883.log_1299091996464


The minimum [Link] sections necessary to run this tool are the Index Engine
section, the Dataflow section and the Partition section. Any file paths mentioned in
these sections should be adjusted to point to the actual location of your index partition
directory in your environment.
Steps
1. Back up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc.) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.

If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will warn and
exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending on the
size of the sub-index that is being fixed. The utility will produce an output file
bearing the name of the sub-index that was fixed. This file contains the internal-
external objectID (OTObject region value) pairs that can be utilized for re-indexing.
3a. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
3b. Run the DumpSubIndexesIDs utility after repair. This will generate a date-stamped
file for each sub-index. The file contains all the internal-external IDs for each sub-
index.
3c. Run the DiffObjectIDFiles tool (this only takes a few minutes). This will produce a
smaller set of objects to re-index. This set contains objects whose content was
cut from the bad sub-index and whose content is NOT contained anywhere else
in the partition.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.

NOTE: While running the DumpSubIndexesIDs tool, the utility will


likely report that many regions were ‘removed’ from the index. This
is due to the mode in which the utility runs while hydrating the
metadata part. No actual regions are permanently removed and
this should not cause alarm.

Additional Comments:
While running the tools, it is strongly recommended that the output be redirected
out to a file for easier analysis (… > [Link]).
During the repair process, it is possible to navigate inside the directory where the
index under repair sits. It is possible to observe the new sub-index fragment being
written out, growing larger in size over time.


At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being repaired
(the same location where the new fragment is made).

New Base Offset Errors


If a repaired sub-index (or an existing good index) generates a new index fragment
which has overlapping base offsets, this case will be detected when the Index Engine
next attempts to merge subindexes or dump the accumulator to a new subindex.
At the point of this detection, the IE will stop and the partition data will remain
unchanged.
A new lock file, called “[Link]” will also be written out to the current index
directory.
While this file remains in that directory, the Index Engine will be unable to start or
restart. This ensures that the state of the index is not modified. The relevant files
should be collected for customer support to assist in determining the root cause of the
problem (Index Engine and Update Distributor logs, and the IPools that triggered the
error).
If for some reason you must ignore this condition and continue:
• if the [Link] lock is already present in the index directory, delete it
• make an empty file called “[Link]” and place it in the index directory
The IE should come up and ignore the baseOffset problem. WARNING: this WILL
generate an index with base offset errors that will later need to be repaired.
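For example, on Windows the override could be put in place from within the index
directory as follows; the placeholder names stand for the lock and ignore files
described above:

cd <index directory>
del <baseoffset lock file>
type nul > <ignore file>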


Index of Terms

!=, 74
", 84
$, 83
( ), 84
*, 83
., 82
.in, 100
.new, 100
?, 83
[ ], 82
[^ ], 83
^, 83
|, 83
+, 83
<, 74, 76
<=, 74
=, 73, 74
>, 74
>=, 74
3-gram, 156
4-grams, 156
Accumulator, 137, 139
AddOrModify, 54
AddOrReplace, 53
addRegionsOrFields, 169
Aggregate-Text, 31
all, 73
AllowedNumConfigs, 192
AND, 72
AND-NOT, 69, 72
Arabic, 149
ASC, 89
ascending, 89
asterisk, 73
attribute, 79
ATTRIBUTE, 70, 71
Attributes, 24
Backup, 207
[Link], 215
[Link], 282
Bloom Filter, 182
Boolean, 30
Boost, 107
Brackets, 72
Caching, 160
Case Sensitivity, 145
CHAIN, 32
Character Mapping, 148
checkpoint command, 168
Checkpoint Compression, 191
Chunk Size, 192
Cleanup Thread, 140, 141, 142
ConversionProcessPercent, 169
ConvertDateFormat, 273, 274
Currency, 30
Cursor, 58
Date Facets, 93, 94
DateTime, 31
de-duplication, 67
Default, 89
Default Search Regions, 114
Defining a Region, 18
defragmentation, 170
Delayed Commit, 191
DelayedCommitInMilliseconds, 191
Delete, 54
DeleteByQuery, 54
DESC, 89
descending, 89
[Link], 212
DiffObjectIDFiles, 280
Disk Configuration, 191
Disk fragmentation, 188
Disk Performance, 189, 190
Disk Storage, 36
DiskReadWriteSpeed, 275
DROP, 19
DumpSubIndexesIDs, 280
Email Domain, 132
Empty Regions, 20
Entire Value, 73
ENUM, 30
Error Codes, 267
EuroWordNet, 118
Existence, 90
Expand, 64
EXTRAFILTER, 100
Facet Memory, 96
Facet Security, 94
Facets, 91
File Monitoring, 196
FileCleanupIntervalInMS, 142
first, 80
Fragmentation, 170
[Link], 212
Garbage Collection, 195
Get Facets, 61
Get Regions, 67
Get Results, 59
Get Time, 66
getstatuscode, 167
getstatustext, 162
getsystemvalue, 168
hh, 65
High Ingestion, 175
Hit Highlight, 65
HIT LOCATIONS, 71
HyperV, 195
[Link], 282
IN operator, 86
Index Engines, 8
Integer, 28
Interchange Pools, 51
IOChunkBufferSize, 192
iPool errors, 50
iPools, 51
IPv6, 207
JNI, 5
Key, 25, 26
Lang File, 213
left-truncation, 75, 76
Like, 128
LogInterleaver, 273
Long, 28
Low Memory, 37
LQL, 69
marco, 169
Maximum, 81
Memory Sizing, 171
Memory Storage, 36
Memory Use, 200
Merge Thread, 143
Merge Tokens, 144, 145, 174
MergeSortCacheThreshold, 192
MergeSortChunkSize, 192
Merging Regions, 21, 22
MetadataValueSizeLimitInKBytes, 141
Minimum, 81
MODDeflateMode, 37
Modify, 54
ModifyByQuery, 55
Multiple CPUs, 193, 194
MultiValueLimitDefault, 141
Non-Uniform Memory Access, 194
Nothing, 89
null characters, 19
Null Regions, 19
NUMA, 194
Object Ranking, 107
OR, 70, 72
ORDEREDBY, 88
OT7, 4
otb, 25
OTChecksum, 40
OTContentLanguage, 43
OTContentStatus, 41
OTContentUpdateTime, 45
OTData, 39
OTIndexError, 44, 45
OTIndexLengthOverflow, 141
OTIndexMultiValueOverflow, 141
OTMeta, 39
OTMetadataChecksum, 40
OTMetadataUpdateTime, 45
OTObject, 27, 40
OTObjectIndexTime, 45
OTObjectUpdateTime, 46
OTPartitionMode, 44
OTPartitionName, 43
OTSQL, 56, 69
OTSTARTS, 57
OTURN, 26
ParallelGCThreads, 195
Part Numbers, 128
Partition Biasing, 180
Partitions, 13
phonetic, 75
polo, 169
port scanners, 196
ProfileMetadata, 274
PROX, 72
Purge, 204
Quarantine, 56
range, 75
RankingExpression, 89
Rawcount, 90
Read-Only, 15
Read-Write, 16
RebuildIndex, 273
regex, 75, 82
Region, 89
Region Names, 18
Regions, 18
registerWithRMIRegistry, 168
Regular Expressions, 82
Re-Indexing, 186
Relative Date, 85
Relevance, 102
Relevancy, 89
reloadSettings, 168
Removing Regions, 20
Renaming Regions, 21
RepairSubIndexes, 278, 279
Restore, 207
Retired, 15
Retrieval Storage, 38
RFC 2373, 13
RFC 952, 13
right-truncation, 75
runSearchAgent, 169
runSearchAgents, 169, 170
Search Engines, 9
Search Federator, 8
SearchClient, 276
Select, 57
SEQ, 89
SEQUENCE, 89
Server Names, 13
Set lexicon, 66
Set thesaurus, 66
Set uniqueids, 66
Shadow Regions, 129
shards, 13
Signature File, 221
SmartSharing, 195
Sockets, 9
Solaris Light Weight Processes, 195
Solaris Zones, 195
Solid State Disks, 142
SOR, 72
span, 76, 77, 78
starting at, 58
stem, 75
STEMSET, 86
stop, 162
Substring, 125
SYNC, 100
TERMSET, 86
Text, 27
Text Operator, 133
thesaurus, 75
Thread Management, 196
Throttling Indexing, 192, 193
Timestamp, 28
Tokenizer, 145
Type Ranking, 105
Update Distributor, 8
Update-Only, 14
User, 31
Values, 23
VerifyIndex, 270, 271
Virtual Machines, 194
Virus Scanning, 196
VMWare ESX, 195
WHERE, 71
WHERE Operators, 73
WHERE Regions, 79, 80
WHERE Relationships, 72
WHERE Terms, 73
WordNet, 118
XML Text, 39
XOR, 70, 72


About OpenText
OpenText enables the digital world, creating a better way for organizations to work with information, on premises or in the
cloud. For more information about OpenText (NASDAQ: OTEX, TSX: OTC) visit [Link].
Connect with us:

OpenText CEO Mark Barrenechea’s blog


Twitter | LinkedIn

[Link]
Copyright © 2021 Open Text SA or Open Text ULC (in Canada).
All rights reserved. Trademarks owned by Open Text SA or Open Text ULC (in Canada).
