Understanding OpenText Search Engine 21
White Paper
Understanding Search Engine 21
Patrick Pidduck, Director, Product Management
Foreword
Since 2009, I have had the honor of working with the extraordinary software
development teams at OpenText responsible for the OpenText Search Engine. Search
has always been a fundamental component of the OpenText Content Suite Platform,
and OpenText pioneered several key technologies that serve as the foundation for
modern search engines. Our team has built upon more than 25 years of search
innovation, and contributed to external research initiatives such as TREC for years.
OpenText knows search.
In the last few years, our customers have pushed scalability and reliability requirements
to new levels. The OpenText Search Engine has met these goals, and continues to
improve with each quarterly product update. Several billion documents in a single
search index, unthinkable just a few years ago, is a reality today at customer sites.
This edition of the “Understanding Search” document covers the capabilities of Search
Engine 21. Search Engine 21 is the most recent version, superseding 16.2, 16, 10.5,
10.0 and versions reaching back to Content Server 9.7. We understand our enterprise
customers' needs, and this latest search engine provides seamless upgrade paths from
all supported versions of Content Server. While protecting your existing investments,
we continue to add incredible new capabilities, such as efficient search methods
optimized for eDiscovery and Classification applications, enhanced backups, and
integrated performance monitoring.
This document would not be possible without the help of our resident search experts.
As always, you have my thanks: Alex and Alex, Ann, Annie, Christine, Daniel, Dave,
Dave, Dave, Hiral, Jody, Johan, Kyle, Laura, Mariana, Michelle, Mike, Ming, Parmis,
Paul, Ray, Rick, Riston, Ryan, Scott and Stephen.
Patrick.
Contents
Basics
Overview
Introduction
Disclaimer
Relative Strengths
Upgrade Migration:
Transactional Capability:
Metadata Updates:
Search-Driven Update:
Maintenance Commitment:
Data Integrity:
Scaling:
Advanced Queries:
Related Components
Admin Server
Document Conversion Server
IPool Library
Content Server Search Administration
Query Languages
Remote Search
Backwards Compatibility
Installation with Content Server
Search Engine Components
Update Distributor
Index Engines
Search Federator
Search Engines
Inter-Process Communication
External Socket Connections
Internal Socket Connections
Search Federator Connections
Search Queues
Queue Servicing
Search Timeouts
Testing Timeouts
File System
Server Names
Partitions
Basic Concepts
Large Object Partitions
Regions and Metadata
OTContentStatus
OTTextSize
OTContentLanguage
OTPartitionName
OTPartitionMode
OTIndexError
OTScore
TimeStamp Regions
OTObjectIndexTime
OTContentUpdateTime
OTMetadataUpdateTime
OTObjectUpdateTime
_OTDomain
_OTShadow
Regions and Content Server
MIME and File Types
Extracted Document Properties
Workflow
Categories and Attributes
Forms
Custom Applications
Default Search Settings
Indexing and Query
Indexing
Indexing using IPools
AddOrReplace
AddOrModify
Modify
Delete
DeleteByQuery
ModifyByQuery
Transactional Indexing
IPool Quarantine
Query Interface
Select Command
Set Cursor Command
Get Results Command
Get Facets Command
Date Facets
FileSize Facets
Expand Command
Hit Highlight Command
Get Time
Basics
This section is an overview of Search Engine 21, and introduces fundamental concepts
needed to understand some of the later topics.
Overview
Introduction
Search Engine 21 (“OTSE” – OpenText Search Engine) is the search engine provided
as part of OpenText Content Server. This document provides information about the
most common Search Engine 21 features and configuration, suitable for
administrators, application integrators and support staff tasked with maintaining and
tuning a search grid. If you are looking for information on the internal details of the
data structures and algorithms, you won’t find it here.
This document is based upon the features and capabilities of Search Engine 21.2,
which has a release date of April 2021.
Disclaimer
DISCLAIMER: This document is not official OpenText product
documentation. Any procedures or sample code are specific to the
scenario presented in this White Paper, are delivered as-is, and
are for educational purposes only. It is presented as a guide to
supplement official OpenText product documentation.
While efforts have been made to ensure correctness, the
information here is supplementary to the product documentation
and release notes.
Relative Strengths
There are many search engines available on the market, each of which has relative
merits. Search Engine 21 is a product of the ECM market space, developed by
OpenText, with a proven record as part of OpenText Content Server. This search
engine has been in active use and development for many years, and was previously
known by names such as “OT7” and “Search Engine 10”.
Because of the nature of OpenText ECM solutions, OTSE has a feature set oriented
towards enterprise-grade ECM applications. Some of the pertinent features which
make OTSE a preferred solution for these applications include:
Upgrade Migration:
As new features and capabilities are added, you are not required to re-index your data.
OTSE includes transparent conversion of older indexes to newer versions. Our
experience is that customers with large data sets often do not have the time or
infrastructure to re-index their data, so this is a key requirement.
Transactional Capability:
During indexing, objects are committed to the index in much the same way that
databases perform updates. If a catastrophic outage happens in the midst of a
transaction, the system can recover without data corruption. Additionally, logical
groups of objects for indexing can be treated as a single transaction, and the entire
transaction can be rolled back in the event that one object cannot be handled properly.
Metadata Updates:
The OpenText search technology has the ability to make in-place updates of some or
all of the metadata for an object. This represents a significant performance
improvement over search technology that must delete and add complete objects,
particularly for ECM applications where metadata may be changing frequently.
Search-Driven Update:
OTSE has the ability to perform bulk operations, such as modification and deletion, on
sets of data that match search criteria. This allows for very efficient index updates for
specific types of transactions.
Maintenance Commitment:
OpenText controls the code and release schedules. This way, we can ensure that our
ECM solutions customers will have a supported search solution throughout the life of
their ECM application.
Data Integrity:
OTSE contains a number of features that allow the quality, consistency and integrity of
the search index and the data to be assessed. These features give system
administrators the tools they need to ensure that mission critical applications are
operating within specification.
Scaling:
Not only can OTSE support very large indices (1 billion+ objects), it can be restructured
to add capacity, rebalance the distribution of objects across servers, switch portions
from read-write to update-only or read-only, and perform in-place addition or removal
of metadata fields. OTSE shelters applications from the complexity of tracking which
objects are indexed into each Search Engine.
Advanced Queries:
Customers engaged in Records Management, Discovery and Classification have
unique query features available optimized for these applications. Examples include
searching for N of M terms in a document, searching for similar information, and
conditional term matching where sparse metadata exists.
Related Components
The scope of this document is constrained to the core OTSE components which are
located within the search JAR file ([Link]).
There are a number of other components of both the overall search solution and
Content Server which are strongly related to OTSE but are not covered in this
document. In some instances, because of the tight relationship with other components,
references may be made in this document to these other components. For a complete
understanding of the search technology, you may wish to also learn about the following
products and technologies:
Admin Server
The Admin Server is a middleware application which provides control, monitoring and
management of processes for Content Server. The Admin Server performs a number
of services, and is critical to the operation of the search grid when used with Content
Server. As a rule of thumb, there is generally one Admin Server installed on each
physical computer hosting OTSE components.
Document Conversion Server
DCS is a set of processes and services responsible for preparing data prior to indexing.
DCS performs tasks such as managing the data flows and IPools during ingestion,
extracting text and metadata from content, generating hot phrases and summaries,
performing language identification, and more. You should ensure that DCS is optimally
configured for use with your application before indexing objects.
IPool Library
Interchange Pools (IPools) are a mechanism for managing batch-oriented Data Flows
within Content Server. IPools are used to encapsulate data for indexing. OTSE uses
the Java Native Interface (JNI) to leverage OpenText libraries for reading and writing
IPools.
Content Server Search Administration
While most OTSE setup is managed using configuration files, in practice many of these
files are generated and controlled by Content Server. Many of the concepts and
settings described in this document have analogous settings within Content Server
Search Administration pages, and should be managed from those pages wherever
possible.
Query Languages
This document describes the search query language implemented by the OTSE. It is
common for applications to hide the OTSE query language and provide an alternative
query language to end users. The Content Server query language – LQL – is NOT
described in this document.
Remote Search
Content Server Remote Search currently uses code within OTSE to facilitate obtaining
search results from remote instances of Content Server.
Backwards Compatibility
OTSE is capable of reading all indexes and index configuration files from all released
versions of OpenText Search Engine 20, Search Engine 16.2, Search Engine 16,
Search Engine 10.5, Search Engine 10, and OT7. OT7 is the predecessor to SE10.0
that was part of Content Server 9.6 and 9.7. For most of these, an index conversion
will take place. The new index will not be readable by older versions of the search
engines.
Indexes created with OT6 are not directly readable. Search Engine 10 can be used to
convert an OT6 index to a format Search Engine 10 can use, which can then be
upgraded in a second step using OTSE. In practice, given the improvements and fixes
since OT6, you would be best advised to re-index extremely old data sets. You should
consult with OpenText Customer Support if you are considering a migration from these
older search indices.
[Link]
[Link]
[Link]
[Link]
[Link]
The [Link] file is specifically for Content Server Remote Search, and is
not a requirement for other OTSE installations.
Update Distributor
The Update Distributor is the front end for indexing. The Update Distributor performs
the following tasks, not necessarily in this order:
• Monitors an input IPool directory to check for indexing requests.
• Reads IPools, unpacks the indexing requests.
• Breaks larger IPools into smaller batches if necessary.
• Determines which Index Engines should service an indexing request.
• Sends indexing requests to Index Engines.
• Rolls back transactions and sets aside the IPool message if indexing of an
object fails.
• Rebalances objects to a new Index Engine during update operations if a
partition is too full or retired.
• Manages which Index Engines can write Checkpoints.
• Grants merge tokens to Index Engines that have insufficient disk space.
• Controls sequence of operations for Index Engines writing backups.
Index Engines
An Index Engine is responsible for adding, removing and updating objects in the search
index. The Index Engines accept requests from the Update Distributor, and update the
index as appropriate. Multiple Index Engines in a system are common, each one
representing a portion of the overall index known as a “partition”.
The search index itself is stored on disk. In operation, portions of the search index are
loaded into memory for performance reasons.
Index Engines are also responsible for tasks such as:
• Converting older versions of the index to newer formats.
• Converting metadata from one type to another.
• Converting metadata between different storage modes.
• Background operations to merge (compact) index files.
• Writing backup files and transaction logs.
Search Federator
The Search Federator is the entry point for search queries. The Search Federator
receives queries from Content Server, sends queries to Search Engines, gathers the
results from all Search Engines together, and responds to the Content Server with the
search results.
The Search Federator performs tasks such as:
• Maintaining the queues for search requests.
Search Engines
The Search Engines perform the heavy lifting for search queries. They are responsible
for performing searches on a single partition, computing relevance score, sorting
results, and retrieving metadata regions to return in a query. Every partition requires
a Search Engine to support queries.
The Search Engines keep an in-memory representation of key data that replicates the
memory in the Index Engines. The files on disk are shared with the Index Engines.
Search Engines read Checkpoint files at startup and incremental Metalog and
AccumLog files during operation to keep their view of the index data current. These
Metalog and AccumLog files are checked every 10 seconds by default, and any time a
search query is run.
Search Engines also perform tasks such as building facets, and computing position
information used for highlighting search results.
Inter-Process Communication
Each component of the search engine exposes APIs for a variety of purposes. This
section outlines the various communication methods used.
Socket connections are the backbone of this distributed system. You must ensure that socket communications are not blocked by
firewalls, switches or other networking elements.
Sockets
    Threads: 210
    Connections: 200
    Ports: 21
The socket connections allocate and hold the threads and connections. Although this
approach holds the maximum number of resources, it offers performance benefits and
predictability, since it avoids the allocation and re-use overhead that may exist within
Java or the operating system.
The low priority queue is disabled by default, and activated with the following settings:
[SearchFederator_xxx]
LowPrioritySearchPort=-1
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25
Note that using the low priority queue requires an additional port. As a general
recommendation, small values (perhaps 2 or 3) should be used for the threads to
prevent the low priority searches from consuming too many resources.
Queue Servicing
There are three phases to servicing a request to the Search Federator.
Phase 1 – Content Server indicates a desire to start a search query by opening
a connection to the Search Federator. The connection is put on an operating
system / Java queue (not in the search code).
Phase 2 – a dedicated thread takes the connection from Java, and places it in
an internal queue. If the internal queue is full, the request is discarded and the
connection is closed.
Phase 3 – when a search worker thread becomes available, the connection is
removed from the queue and given to the worker. At this point, the worker
responds to Content Server to indicate it is ready to receive the search request,
and Content Server sends the search query for processing.
Note that in versions prior to 20.2, the process around Phase 1 and Phase 2 was
different – the pending requests were left on the operating system queue, and the
internal queue had an effective size of 1.
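The three phases above can be sketched as a bounded internal queue serviced by worker threads. The following Python sketch is illustrative only; the class and method names are invented (the real Search Federator is Java code with its own internals). It shows the key behavior of Phase 2: when the internal queue is full, the request is discarded rather than left pending.

```python
import queue
import threading

# Hypothetical sketch of the three-phase servicing model; names are
# illustrative, not OTSE APIs.
class SearchFederatorQueue:
    def __init__(self, queue_size=25, worker_threads=2):
        # Phase 2 target: the bounded internal queue.
        self._queue = queue.Queue(maxsize=queue_size)
        self._results = []
        self._lock = threading.Lock()
        for _ in range(worker_threads):
            threading.Thread(target=self._serve, daemon=True).start()

    def accept(self, connection):
        """Phase 2: move a connection from the OS/Java accept queue to
        the internal queue. If the internal queue is full, the request
        is discarded and the connection closed by the caller."""
        try:
            self._queue.put_nowait(connection)
            return True
        except queue.Full:
            return False  # discarded; caller closes the connection

    def _serve(self):
        # Phase 3: a worker removes a connection and processes the query.
        while True:
            connection = self._queue.get()
            with self._lock:
                self._results.append(f"served:{connection}")
            self._queue.task_done()
```

With `worker_threads=0` (no consumers), the discard behavior is easy to observe: once `queue_size` connections are waiting, further `accept` calls return `False`.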
Search Timeouts
The Search Federator places a limit on how long it will wait for an application which
has opened a search transaction. If the application does not initiate a message in the
available time, then the Search Federator will close the connection and terminate the
transaction.
Keeping a connection and transaction open is expensive from a resource perspective,
and applications that leave connections open and idle can block search activity by
consuming all available threads from the search query pool.
There are two timeout values. The first is the time between acknowledging that a
worker is ready to receive a query and the first message arriving. This is expected to
be a short time, and the default is 10 seconds. The second is the time between messages
– for instance between consecutive “GET RESULTS” messages. This is longer, with
a default of 120 seconds. Both times can be adjusted or disabled in the
search.ini_override file. Bear in mind these are timeouts from the server perspective
– Content Server will also have timeout values from the client perspective.
Within the [SearchFederator] section of the [Link] file, you may specify the time the
Federator will wait between a connection being created and the first command arriving
(10 second default):
FirstCommandReadTimeoutInMS=10000
Time the Search Federator will wait between commands (2 minute default):
SubsequentCommandReadTimeoutInMS=120000
In either case, the timeouts can be completely disabled with a value of 0.
The Search Federator also places a limit on how long it will wait for a response from a
Search Engine with an open search session. If the Search Engine does not reply within
the available time, then the Search Federator will terminate the search session. For
example, if the Search Federator has issued a “SELECT” to a Search Engine, it will
wait a limited amount of time for the reply. This timeout value, in the [DataFlow]
section of the [Link] file, has a default value of 2 minutes:
QueryTimeOutInMS=120000
The search session on a Search Engine will regularly ping the Search Federator to
ensure that it is still responding. If the Search Federator does not answer, then the
Search Engine will terminate its search session to recover resources. In addition, there
is a failsafe timeout which is the maximum time that a Search Engine will leave a
session active. In normal operation, even if the Search Federator fails, this is not
typically encountered. Located in the [DataFlow] section of the [Link] file, the
failsafe timeout value is 6 hours:
SessionTimeOutInMS=21600000
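The timeout settings above follow ordinary INI conventions, so they can be read with any INI parser. The Python sketch below is illustrative only; the `read_timeout` helper is invented, not part of OTSE. It uses the default values stated in the text and treats a value of 0 as "disabled".

```python
import configparser

# Sample fragment mirroring the settings described above.
INI_TEXT = """
[SearchFederator]
FirstCommandReadTimeoutInMS=10000
SubsequentCommandReadTimeoutInMS=120000

[DataFlow]
QueryTimeOutInMS=120000
SessionTimeOutInMS=21600000
"""

def read_timeout(config, section, key, default_ms):
    """Return the timeout in milliseconds, or None when disabled (0)."""
    value = config.getint(section, key, fallback=default_ms)
    return None if value == 0 else value

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

first = read_timeout(config, "SearchFederator",
                     "FirstCommandReadTimeoutInMS", 10000)
query = read_timeout(config, "DataFlow", "QueryTimeOutInMS", 120000)
```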
Testing Timeouts
In a test environment, search results are often completed too quickly to permit testing
of system behavior for long searches and search timeouts. For test purposes, there is
a configuration setting that will cause all searches to take at least a defined period of
time. In production environments, this value should be 0.
MinSearchTimeInMS=0
File System
The Index Engines communicate updates to the Search Engines using a shared file
system. At various times, files may be locked to ensure data integrity during updates.
It is important that the Search and Index Engines have accurate file information for this
to work correctly. Some file systems use aggressive caching techniques that can break
this communication method. The Microsoft SMB2 caching is one example, and it must
be disabled for correct operation of OTSE. Microsoft SMB3 reverts to using the SMB2
protocol in many situations, and so should also be avoided. You must disable SMB2
caching on the servers running the search processes and on the file server. Similarly,
Microsoft Distributed File System (DFS) is known to have unpredictable file locking
behavior and must not be used.
Some customers have also experienced locking issues with NFS, and have needed to
use the NOLOCK or NOAC parameter in their NFS configuration to ensure correct
operation.
Server Names
Java enforces strict adherence to the various IETF standards for URIs and server
naming conventions. RFC 952, RFC 2396 and RFC 2373 are examples. Some
operating systems allow server names that do not meet the criteria for these standards.
When this happens, OTSE will likely fail with exceptions at startup. One example we
have seen is violation of this rule in RFC 952: “The rightmost label of a domain name
consisting of two or more labels, begins with an alpha character”. This means a domain
name such as “zulu.server3.7up” is invalid because the “7” must instead be an alpha
character.
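The RFC 952 rule quoted above can be checked mechanically. The helper below is hypothetical, not something OTSE exposes; it tests only the rightmost-label rule from the example, while Java's URI handling enforces many more constraints from the RFCs listed above.

```python
import re

# Illustrative check of one RFC 952 rule: for a domain name with two or
# more labels, the rightmost label must begin with an alpha character.
def rightmost_label_ok(hostname):
    labels = hostname.split(".")
    if len(labels) < 2:
        return True  # the quoted rule applies only to multi-label names
    return bool(re.match(r"[A-Za-z]", labels[-1]))
```

Applied to the example in the text, `"zulu.server3.7up"` fails because its rightmost label begins with "7".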
Partitions
Basic Concepts
The concept of partitions is central to how OTSE scales and manages search indexes.
A search index may be broken horizontally into a number of pieces. These pieces are
known as “partitions” in OTSE terminology. The sum of all the partitions together
represents the search index.
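A minimal sketch of the partition concept: the index is the union of its partitions, and a query gathers results from every partition. The round-robin placement below is an invented stand-in; the real Update Distributor applies its own placement logic, and the classes here are illustrative, not OTSE APIs.

```python
# Each partition holds a horizontal slice of the index.
class Partition:
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def index(self, object_id, text):
        self.objects[object_id] = text

    def search(self, term):
        return [oid for oid, text in self.objects.items() if term in text]

partitions = [Partition(f"partition_{i}") for i in range(3)]

def index_object(object_id, text):
    # Stand-in routing: pick a partition by round-robin on the id.
    partitions[object_id % len(partitions)].index(object_id, text)

def search(term):
    # The federated view: results gathered from all partitions.
    results = []
    for p in partitions:
        results.extend(p.search(term))
    return sorted(results)
```

The application never needs to know which partition holds a given object; it only sees the merged result list.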
Update-Only Partitions
It is possible to place a partition in “Update-Only” mode. In this mode, the partition will
not accept new objects to index, but it will update existing objects or delete existing
objects. If a partition is marked as Update-Only, then the Update Distributor will not
send it new objects.
Update-Only behavior is a legacy feature inherited from OT7, and is still supported for
backwards compatibility. However, it is recommended that you do not use Update-
Only mode for future applications. In normal Read-Write mode, OTSE contains a
dynamic “soft” update-only feature which is generally superior. The use and
configuration of dynamic update-only mode is covered elsewhere in this document.
Beginning with Content Server 16, Update-Only mode is not available as a
configuration option from within Content Server.
The default storage mechanism for text metadata is independently configured for
Update-Only partitions. If your default configuration for Update-Only mode differs from
Read-Write mode, then the Index Engines will convert the index data structures the
first time they restart after the configuration is changed. This default configuration
setting is found in the [Link] file.
Read-Only Partitions
OTSE allows partitions to be placed in a “Read-Only” mode. In this mode, the partition
will respond to search queries, but will not process any indexing requests. Objects
cannot be added to the partition, removed or modified.
In operation, when started, the Index Engines for Read-Only partitions will shut down
once they have verified the index integrity. This means that fewer system resources
are being consumed. It also means that, since there is no Index Engine to respond to
the Update Distributor, a new instance of an object will be created in another partition
if you attempt to replace or update an object in a Read-Only partition.
You should only use Read-Only partitions in very specific cases. Customers will
occasionally get into trouble because they use Read-Only partitions when their
applications are still updating objects. This would happen in an application such as
Records Management – a “hold” is put on an object in a Read-Only partition, and a
duplicate entry is inadvertently created in another partition. Similarly, moving items to
another folder, updating classifications, updating category attributes and other
operations will cause this type of behavior. The search engines then respond to search
queries with multiple copies of objects.
The use of “Retired” mode for partitions avoids these issues, and should be considered
instead of Read-Only mode. Beginning with Content Server 16, Read-Only mode will
no longer be provided as a configuration option in the Content Server administration
interface.
Read-Only partitions also have a distinct default configuration for text metadata
storage in the [Link] file, and changing to or from Read-Only mode
may trigger data conversion on startup.
Retired Partitions
OTSE allows partitions to be placed in a “Retired” mode. This mode of operation is
intended for use when a partition is being replaced. The behavior is close to that of
an Update-Only partition: it will not accept new items, but will update or delete
existing objects. If a partition is marked as Retired, the Update Distributor will not
send it new objects. The key difference is that when an object in a Retired partition is
re-indexed, it is deleted from the Retired partition and added to a Read-Write
partition.
Support for Retired Partitions is new starting with Search Engine 10.5. Retired mode
is strongly preferred over Read-Only mode, since Retired mode avoids problems
related to creating duplicate copies of objects in the Index.
Retired partitions are also a key feature for merging many small partitions into a set of
larger partitions. This is typical for customers upgrading older systems that use RAM
mode, and are switching to Low Memory mode. In this case, approximately 65% of
the partitions can be marked as "Retired", and incremental re-indexing of the Retired
partitions will move all the objects out of them. When empty,
the partitions can be removed from the search grid.
One common strategy for moving items from one partition to another is to place a
partition into Retired Mode, perform a search for all items in the Retired partition, add
them to a Collection, and re-index the Collection. This moves all the items that are re-
indexed from the Retired partition into other partitions. In practice, items are often
left behind in the Retired partition after this is done, and this is usually expected.
Occasionally, a Content Server object will be deleted but not removed from
the index. When this happens, it cannot be Collected. In other cases, the Extractor
may be set to re-index only recent versions of objects, and will not re-index older
versions. In some cases, when a document was deleted, an associated Rendition may
not have been removed from the index. If unsure about whether a re-indexed Retired
partition can be deleted, the OpenText customer support organization may be able to
provide some guidance.
Note that when objects are deleted from a partition, some of the data structures remain
in place. For example, a dictionary entry for a word may exist, even though no objects
now contain that word. It is normal for a retired partition that has had all objects
removed to show a small non-zero size. The search engine will also mark items as
deleted, but leave them in place until scheduled processes compact and refresh the
data – which may take days depending on the situation.
Read-Write Partitions
For completeness, the normal mode of operation for a partition is “Read-Write” mode.
In this mode, the partition will accept new objects, can delete objects and update
objects.
Read-Write partitions can be configured to automatically behave as Update-Only
partitions as they become full. More information on soft Update-Only configuration is
available in the optimization section.
To set the size threshold for determining if an object should be sent to a large object
partition:
[DataFlow_yyyy]
ObjectSizeThresholdInBytes=1000000
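The routing implied by this setting can be sketched as follows. This is an illustrative model of the behavior, not OTSE source code; the partition labels are assumptions.

```python
# Illustrative sketch: how a size threshold could route objects to a
# dedicated large-object partition. The constant mirrors the
# ObjectSizeThresholdInBytes setting shown above.

OBJECT_SIZE_THRESHOLD_IN_BYTES = 1_000_000

def choose_partition(object_size_bytes: int) -> str:
    """Return the class of partition an object of this size would be sent to."""
    if object_size_bytes > OBJECT_SIZE_THRESHOLD_IN_BYTES:
        return "large-object"
    return "standard"

print(choose_partition(250_000))    # standard
print(choose_partition(5_000_000))  # large-object
```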
Metadata Regions
A region is OTSE terminology for a metadata field. Using a database analogy, you can
think of a region as being roughly equivalent to a column in a database. Understanding
and optimizing how metadata regions are defined and stored has a big impact on
performance, sizing, usability and search relevance. This section provides background
on the administration of regions to optimize the search experience.
Defining a Region
Regions are defined in the configuration file “[Link]”. This file is edited
to define the desired regions and their behaviors, and interpreted by the Index Engines
when they start. Currently, Content Server does not provide an interface for editing
and managing this file, so you must do this with a text editor.
Once a region is defined, it is recorded in the search index. Changing the definition for
an existing region in the [Link] file or attempting to index a metadata
value that is incompatible with the defined region type will usually result in an error. It
is possible to redefine the type for existing metadata regions in many cases as
explained under the heading “Changing Region Types”.
Region Names
There are limitations on the labels which can be used for a metadata region. The rules
for acceptable region names are approximately the same as the rules for valid XML
labels.
The simplified explanation is that almost any valid UTF-8 characters can be used in
the name, with some exceptions. White-space characters (various forms of spaces,
nulls and control characters) are not permitted. To remain compliant with XML naming
conventions, a hyphen ("-"), a period ("."), a digit (0-9) or various diacritical marks
are discouraged as the first character.
The DCS filters often create region names from extracted document properties. In
some cases, DCS will strip white space and punctuation from the property names to
ensure that the region names are comprised of valid characters.
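A minimal sketch of this kind of name cleanup is shown below. The exact DCS algorithm is not documented here, so the rules in this function are assumptions based on the description above; the property names are illustrative.

```python
import re

def sanitize_region_name(property_name: str) -> str:
    """Illustrative sketch (not the actual DCS algorithm): strip white space
    and punctuation so that a document property name becomes a valid region
    name, avoiding a leading digit, hyphen, or period."""
    # Keep only letters, digits, and underscores.
    name = re.sub(r"[^0-9A-Za-z_]", "", property_name)
    # XML-style names should not begin with a digit, hyphen, or period.
    name = re.sub(r"^[0-9.\-]+", "", name)
    return name

print(sanitize_region_name("Last Saved By:"))  # LastSavedBy
print(sanitize_region_name("2nd Author"))      # ndAuthor
```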
Region names are case sensitive. The region “author” is different from the region
“Author”.
For example, suppose the following nested XML is presented for indexing:
<customerName>
<firstName>bob</firstName>
<lastName>smith</lastName>
</customerName>
Then the region “customerName” is indexed, and it will have the value:
<firstName>bob</firstName><lastName>smith</lastName>.
Within the definitions file, you can define hierarchy structures that should be ignored
and flattened when looking for regions to index. In the case above, by declaring
“customerName” as a nested region, the field customerName is ignored and the regions
firstName and lastName would be recognized and indexed. This is not intended to
handle arbitrarily complex nesting structures, but was designed to accommodate a few
specific instances in data presented for indexing by Content Server. In particular,
indexing of Workflow objects within Content Server prior to Content Server 10 SP2
Update 10 is the only known requirement for the use of nested region names. Using
the above example, a nested value is expressed within the definitions file like this:
NESTED customerName
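The effect of the NESTED declaration can be modeled as below. This mimics the described flattening behavior and is not OTSE code; the parsing approach is an assumption.

```python
# Hedged sketch of what declaring "NESTED customerName" means: the wrapper
# element is skipped, and its children are indexed as top-level regions.
import xml.etree.ElementTree as ET

NESTED = {"customerName"}  # wrappers declared as NESTED in the definitions file

def flatten_regions(xml_text: str) -> dict:
    regions = {}
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element is root or element.tag in NESTED:
            continue  # ignore the document root and NESTED wrappers
        regions[element.tag] = (element.text or "").strip()
    return regions

doc = ("<meta><customerName><firstName>bob</firstName>"
       "<lastName>smith</lastName></customerName></meta>")
print(flatten_regions(doc))  # {'firstName': 'bob', 'lastName': 'smith'}
```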
Removing Regions
A region can be removed from the index by adding a REMOVE entry to the
definitions file:
REMOVE someRegionName
Special considerations exist for the compound region types DATETIME and USER.
USER regions must be removed together in the same way they were defined, with 3
regions removed:
DATETIME regions can also be removed in their entirety by specifying both regions:
There is a special case supported for removing the TIME portion of a DATETIME pair
to leave only the DATE field behind. Ensure that you also add a DATE field to prevent
conversion of the DATE field to TEXT. There is no method available to remove just the
date portion of a DATETIME field to leave the time intact.
REMOVE OTVerMTime
DATE OTVerMDate
RemoveEmptyRegionsOnStartup=false
Renaming Regions
Consider the case where you need to change the name of a metadata field in Content
Server or a custom application. You are now confronted with the problem that data
which is already indexed is using an older name for the region.
OTSE provides a mechanism for handling these situations. Within the region
definitions file, you can rename an existing region like this:
Merging Regions
The merge capability of OTSE is similar to the RENAME capability, but is instead used
to combine two existing regions. Within the definitions file:
Once an index is running, any new values for sourceRegion will instead be indexed
within the targetRegion.
If targetRegion does not exist, the effective behavior of the MERGE command is the
same as a RENAME command.
There are limitations. The MERGE operation is NOT capable of merging text metadata
values that contain attributes. For Content Server, this includes the OTName,
OTDescription and OTGUID regions. The attributes will be silently lost during the
merge operation. You must check to ensure that regions being merged do not
incorporate attributes.
If conversions are required for MERGE at startup, this will trigger writing new
checkpoint files.
The supported conversions form a From/To matrix: the From types are Boolean,
Integer, Long, Enum, Text, Date and TimeStamp, and the To types are Boolean,
Integer, Long, Enum, Text and Date.
You cannot change the type of a Text region that has multiple values or uses
attribute/value pairs, since these concepts are only available for Text regions.
The procedure is as follows:
Edit the [Link] (or search.ini_override) file to include the following entry in the
[Dataflow] section: EnableRegionTypeConversionAsADate=YYYYMMDD, where
YYYYMMDD is today's date. This informs OTSE that type conversion is allowable today.
This is a safety feature to prevent inadvertent region type conversion.
Edit the [Link] file to have the desired region type definitions.
Restart the search processes. On startup, the Index Engines will determine that a
conversion is required, and use the stored values to rebuild the metadata indexes for
the changed regions. This process may require several minutes per partition, longer if
many region types are being defined.
In the event that a given value cannot be converted, the failure is recorded in the log
files and the OTIndexError count for metadata errors is incremented for the affected
object in the index.
You are strongly encouraged to back up an index before converting region types and
ensure that conversion has succeeded, reverting to the backups if there are problems.
In the log files, each failed conversion has an entry along these lines:
Couldn't set field OTFilterMIMEType for object
DataId=254417&Version=1 to text/plain:
Similarly, if the region OfficeLocation is selected for retrieval, the results would return
all three values.
When updating values in regions, you cannot selectively update one specific value of
a multi-value region. If a new value is provided for OfficeLocation for this object, all 3
existing values would be replaced with the new data – which may be a single value or
multiple values.
<OTMeta>
…
<OTName lang="en">My red car</OTName>
<OTName lang="fr">Ma voiture rouge</OTName>
…
</OTMeta>
In addition to using the multiple value capabilities of OTSE, region attributes are used
by Content Server to tag each metadata value with attribute key/value pairs. In this
example, the key is “lang”, and the values are “en” and “fr”.
When constructing a search query, use of the region attributes is optional. A search
for “red car” or a search for “rouge” will find this object and return the values. When
values are returned, the attributes are included in the results only on request.
It is possible to construct a search query against regions that have specific region
attributes. If you only want to locate objects that contain the term “rouge” in the French
language value for OTName, the where clause would look like this:
where [region "OTName"][attribute "lang"="fr"] "rouge"
The query language has also been extended to permit sorting of results using an
attribute. Consider the case where there are values for both French and English, but
the user preference is French. Sorting based on the French values is therefore
desired. Within the “ORDEREDBY” portion of a SELECT statement, the SEQ keyword
is used to specify the attribute to be used for sort preferences:
The use of attributes with text values for specifying language values is a relatively
simple example. You may index multiple attributes within a single region. You may
also have different attributes for each value. The following example illustrates this
concept for indexing:
Consider a document whose Description property has been crafted to contain
embedded markup, such as:
Silly stuff</Description><fakeRegion>Certified
Paid</fakeRegion><Description>nothing to see here
Then this data could be wrapped in a legitimate Description region when extracted for
indexing, resulting in:
<Description>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>
This effectively forges a value for fakeRegion. By using the otb attribute,
<Description otb=94>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>
The Index Engine would notice that the Description region ended after only 11 bytes
instead of 94 bytes, and would prevent the injection of the fakeRegion by flagging the
object metadata as unacceptable. Content Server first began using this otb protection
for regions generated by Document Conversion Server in September 2016, and for
regions provided by Content Server metadata in December 2016.
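The length check behind the otb attribute can be sketched as follows. This mirrors the described behavior only; the function name and return convention are assumptions, not the OTSE implementation.

```python
# Hedged sketch of the otb check: if the declared byte length of a region
# value does not match what actually arrived, the value was likely truncated
# by injected markup, and the object metadata is flagged as unacceptable.

def otb_check(declared_bytes: int, value: str) -> bool:
    """Return True if the value is acceptable (declared and actual lengths agree)."""
    return len(value.encode("utf-8")) == declared_bytes

# The Description was declared as 94 bytes, but after the injected
# </Description> tag only "Silly stuff" (11 bytes) remains.
print(otb_check(94, "Silly stuff"))  # False: flagged as unacceptable
print(otb_check(11, "Silly stuff"))  # True
```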
The otb attribute is never stored in the index. A [Link] setting can disable this
capability, causing the otb value to be ignored. In the [Dataflow_] section:
IgnoreOTBAttribute=true
Key
Each object in the index must have a unique identifier, or key. The KEY entry in the
region definitions file identifies which region will be used as this unique identifier. It is
of type text and may not have multiple values. Exactly one Key must be defined. The
default Key name is OTObject. During indexing, the Key is typically represented by
the entry OTURN within an IPool. In other words, in a default Content Server
installation, the OTURN entry in an IPool is treated as the Key, and populates the
region OTObject.
KEY OTObject
Text
Text, or character strings. Text strings must be defined in UTF-8 encoding. Text strings
can potentially be very large. Because of this, many customers find that the available
space in their search index is consumed quickly by text regions. To help manage the
large potential sizes, there are several methods available for storing text metadata.
This is covered in a separate section.
Text values may contain spaces and special punctuation. When represented in the
input IPools, certain characters may need to be ‘escaped’ to allow them to be
expressed in the IPools. In general, this means placing a backslash (‘\’) character
before “greater than” and “less than” characters (‘<’ and ‘>’).
There are some features available for TEXT regions which are not available for other
data types, and these may affect the decision about which type of region is suitable for
a given metadata field. TEXT regions support multiple values for an object, and TEXT
regions also support attribute keys and values.
It is possible to index numeric information in a text region, but the
values are indexed as strings. When using comparison operations – such
as greater than, less than, ranges and sorting – remember that
strings sort differently than numbers. Intuitively, you expect the
number 123 to be greater than the number 50. But text comparisons
consider 123 to be less than 50. For example, in a TEXT region, a
clause of WHERE [region "partnum"] range "100~200" will
match a value of 1245872. If numeric comparisons are important, a
TEXT region is not a good choice.
TEXT textRegionName
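The pitfall described in the note above can be demonstrated directly with ordinary string comparison, which is how values in a TEXT region are compared:

```python
# Strings compare lexicographically, character by character, so numeric
# data stored in a TEXT region sorts unexpectedly.

print("123" < "50")   # True  - '1' sorts before '5'
print(123 < 50)       # False - numeric comparison behaves as expected

# A TEXT range of "100"~"200" matches the value "1245872", because
# "100" <= "1245872" <= "200" holds lexicographically:
print("100" <= "1245872" <= "200")  # True
```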
There are default limits on the size and number of values you can place in a text region.
It is possible to configure these limits on a per-region basis. Size is expressed in
Kbytes. These parameters are optional. More details are available in the “Protection”
section of this document.
Rank
The rank type region is a special case for modifiers used in computing the relevance
of an object to boost its position in the result list. For example, frequently used objects
may be given a rank of 50. The default is 0. Values in this region must be between 0
and 100 inclusive. Only 1 region may be defined with type of rank. In the definitions
file:
RANK rankRegionName
Integer
An integer is a 32 bit signed value, which can represent an integer value between
-2,147,483,648 and 2,147,483,647. Integer values are stored in memory. Search
results can be sorted on an integer field. In the definitions file:
INT integerRegionName
Long Integer
A long integer is a 64 bit signed value, which can represent a number between
−9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 inclusive. LONG integer
values are stored in memory. Existing Integer fields in an index can be converted to
LONG Integer values by changing their definition. Search results can be sorted on a
LONG integer field. In the definitions file:
LONG longRegionName
Timestamp
A TIMESTAMP region encodes a date and time value. TIMESTAMP values are
expressed in a string format that is compatible with the standard ISO 8601 format. The
milliseconds and time zone are optional, but time up to the seconds is mandatory:
2011-10-21T[Link].354+05:00
2011-10-21T[Link]
The time zone is always optional. If omitted, the local system time zone will be
assumed. The local system time zone is determined from the operating system, but
can also be explicitly set by means of a [Link] file setting. Internally, timestamp
values are converted to UTC time before being indexed.
During search queries, lower significance time elements can be omitted. For instance,
the following will all be accepted:
2011-05-30T[Link]
2011-05-30T13:20
2011-05-30-2:30
2011
If not fully specified, during indexing the earliest possible time for a value will be used.
For example:
2011-05
Would be interpreted as:
2011-05-01T[Link].000
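The "earliest possible time" completion can be sketched with standard date parsing. The list of accepted formats below is an assumption inferred from the examples above, not the full set OTSE accepts.

```python
# Hedged sketch: interpret a partially specified timestamp as its earliest
# possible instant, as described above. strptime fills omitted components
# with their minimums (month/day become 01, time components become zero).
from datetime import datetime

FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%dT%H",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
]

def earliest_instant(value: str) -> datetime:
    """Return the earliest instant matching a partial timestamp string."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value}")

print(earliest_instant("2011-05"))  # 2011-05-01 00:00:00
print(earliest_instant("2011"))     # 2011-01-01 00:00:00
```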
TIMESTAMP values are kept in memory, stored as 64 bit integers. In the definitions
file:
TIMESTAMP timestampRegionName
There are special behaviors for several reserved metadata regions that use
TIMESTAMP definitions for tracking the time when objects are indexed or modified.
See the section on Reserved Regions for more information.
Enumerated List
The enumerated type is ideal for metadata regions which will have one of a defined set
of values. For example, file type identifiers (Word, Excel, etc.) are members of a set
of file types. Enumerated lists use less memory than text if RAM storage is being used.
In the definitions file:
ENUM enumerableRegionName
Boolean
The BOOLEAN type is used for objects which can have a value of true or false. Fields
of type BOOLEAN use memory very efficiently. In order to accommodate the reality
that different applications represent BOOLEAN values in different ways, the indexing
processes will accept BOOLEAN values in any of the following alternate forms:
true false
yes no
1 0
on off
y n
t f
Boolean values are not case sensitive, so that False, FALSE and false are equivalent.
When retrieved, the values are always presented as true or false, regardless of which
form was used for indexing. If building a new indexing application, the use of true and
false is the preferred form.
BOOLEAN booleanRegionName
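The accepted forms listed above amount to a simple case-insensitive normalization, sketched below; the function name is illustrative.

```python
# Normalize the alternate BOOLEAN forms accepted at indexing time to the
# canonical true/false presented on retrieval, per the table above.

TRUE_FORMS = {"true", "yes", "1", "on", "y", "t"}
FALSE_FORMS = {"false", "no", "0", "off", "n", "f"}

def normalize_boolean(value: str) -> str:
    v = value.strip().lower()  # comparison is not case sensitive
    if v in TRUE_FORMS:
        return "true"
    if v in FALSE_FORMS:
        return "false"
    raise ValueError(f"not a recognized boolean form: {value}")

print(normalize_boolean("Yes"))    # true
print(normalize_boolean("FALSE"))  # false
print(normalize_boolean("0"))      # false
```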
Date
A Date region accepts a string that represents a date in the form ‘YYYYMMDD’, where
YYYY is the year, MM the month, and DD the day. For example, 20130208 would
represent February 8th, 2013. Date values can be presented in search facets, and
used in relevance scoring computation. This form of a Date matches the format for
dates used in Content Server. The date portion of a DateTime region is effectively a
Date region. The Date region type is first available in Search Engine 10 Update 10.
DATE dateRegionName
Currency
A region can be defined as a currency, a feature first available with Update 2015-09.
When so declared, the input data will be assumed to be in one of several common
forms that are used to represent currency values. The data is stored internally as a
long integer, with an implied 2 decimal digits. Character strings preceding or trailing
the currency value are discarded, which would typically be a symbol or a country
currency designation. Although some tolerance of poorly formed currency values is
built in, the expectation is that well formed data with 0 or 2 digits after the decimal will
be present. Examples of valid currency representations are:
$1,376,378 1376378.00
CURRENCY2 ListPrice
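The normalization described above can be sketched as follows. The parsing rules here are assumptions inferred from the description (discard surrounding symbols, two implied decimal digits); this is not the OTSE parser.

```python
# Hedged sketch of currency normalization: strip leading/trailing symbols,
# drop thousands separators, and store the amount as a long integer with
# two implied decimal digits.
import re

def parse_currency(value: str) -> int:
    """Return the amount in hundredths (two implied decimal digits)."""
    # Keep only the numeric core: digits, commas, an optional decimal part.
    match = re.search(r"-?[\d,]+(?:\.\d{1,2})?", value)
    if not match:
        raise ValueError(f"no currency value found: {value}")
    number = match.group().replace(",", "")
    whole, _, frac = number.partition(".")
    frac = (frac + "00")[:2]  # pad 0 or 2 decimal digits to exactly 2
    sign = -1 if whole.startswith("-") else 1
    return sign * (abs(int(whole)) * 100 + int(frac))

print(parse_currency("$1,376,378"))  # 137637800
print(parse_currency("1376378.00")) # 137637800
```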
Aggregate-Text Regions
An AGGREGATE-TEXT region has a search index which is the sum of all the regions
it aggregates, but does not store a copy of the values. The values remain within the
original regions. Aggregation only applies to TEXT regions.
Judicious use of AGGREGATE-TEXT regions can improve search performance and
simplify the user experience. Searching many text regions is slower than searching
against an equivalent AGGREGATE-TEXT region. When the AGGREGATE-TEXT
feature is combined with the DISK_RET storage mode for text regions, a significant
reduction in the total memory used to store the index and metadata of the aggregate
is possible if not using Low Memory mode.
AGGREGATE-TEXT regions are constructed using the [Link] file.
Create an entry along these lines:
AGGREGATE-TEXT AggName OTCreatedBy,OTModifiedBy,OTDocAuthor
In this example, a new field is created, “AggName”. The values from the regions
named OTCreatedBy, OTModifiedBy and OTDocAuthor are all placed as separate
values into the AggName field.
There is a special case for defining aggregates, a trailing wildcard character.
AGGREGATE-TEXT DocProperties OTFileName,OTDoc*
This would place the OTFileName region and any text region that starts with OTDoc
into the DocProperties region.
Regions that match the wildcard pattern can be excluded by using an exclamation
mark instead of a comma as the preceding delimiter. The following illustrates excluding
two regions from a pattern match:
AGGREGATE-TEXT DocProperties
OTFilterName,OTDoc*!OTDocAuthor!OTDocumentUserRating
The exclusions must be exactly specified: they must follow the wildcard pattern; they
must themselves match the wildcard pattern; and they must not contain wildcards.
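The matching rules can be modeled as below. This mimics the described expansion against a set of known region names; the region names and function are illustrative, not OTSE code.

```python
# Hedged sketch: expand an AGGREGATE-TEXT definition with a trailing
# wildcard and "!" exclusions against a list of known region names.
import fnmatch

def expand_aggregate(spec: str, known_regions: list) -> list:
    members = []
    for part in spec.split(","):
        pattern, *exclusions = part.split("!")
        if "*" in pattern:
            # wildcard: include matching regions that are not excluded
            members += [r for r in known_regions
                        if fnmatch.fnmatch(r, pattern) and r not in exclusions]
        elif pattern in known_regions:
            members.append(pattern)
    return members

regions = ["OTFileName", "OTDocAuthor", "OTDocTitle", "OTDocumentUserRating"]
print(expand_aggregate(
    "OTFileName,OTDoc*!OTDocAuthor!OTDocumentUserRating", regions))
# ['OTFileName', 'OTDocTitle']
```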
When the Index Engines start, if the AGGREGATE-TEXT configuration has been
changed, a one-time conversion of the index takes place. The Aggregate configuration
is then subsequently applied to new objects as they are indexed or updated.
Deleting the entry for an AGGREGATE-TEXT field within the [Link] file
does not cause the field to be deleted. The REMOVE command in the
[Link] file must be used to remove an AGGREGATE-TEXT region.
REMOVING an AGGREGATE-TEXT region will delete the index for the region, but
does not eliminate the underlying regions that comprise the Aggregate.
If the definition of an AGGREGATE-TEXT field is edited to add or remove regions from
the list of regions which comprise an Aggregate, then when the Index Engines are next
started, the AGGREGATE-TEXT region will be rebuilt. This will take some time, and
results in a new checkpoint being written.
It is possible to combine AGGREGATE-TEXT with any text region storage mode. For
example, if Storage-Only mode (DISK_RET) is used, then only the Aggregate region
can be searched, but each component region can be retrieved.
CHAIN Regions
The CHAIN definition can be used to define a synthetic region which is used for
constructing queries against lists of regions. The list is prioritized. The value of the
first region that is defined (not null) is used for evaluating the query. There is no
additional storage or index penalty since the definition is an instruction used at query
execution that directs how the CHAIN region should be evaluated.
CHAIN UserHandle UserID FacebookID TwitterID
CHAIN regions can be used with any region type. Using different region types within
a single CHAIN region is not recommended, since not all search operators are
consistently available or applied to all region types.
The [first "UserID","FacebookID","TwitterID"] syntax in a query is equivalent to a
CHAIN region for queries. However, when a CHAIN region is predefined, the value of
the CHAIN region can also be requested in the search results using the SELECT
statement.
It is possible to change the text metadata storage modes for an existing index without
re-indexing the content. The Index Engines can perform any necessary storage mode
conversions when they are started.
Content Server exposes control over the storage modes in the search administration
pages. Beginning with Content Server 16, support for several legacy configuration
modes has been removed, forcing indexes to use DISK + Low Memory + Merge Files
as the proven best overall configuration. For most applications, the configuration file
settings described here will not need to be directly manipulated.
[ReadWrite]
SomeRegionName=DISK
OtherRegionName=DISK_RET
[ReadOnly]
ImportantRegionName=RAM
[NoAdd]
HugeRegionName=DISK
[Retired]
HugeRegionName=DISK
The [General] section of this file specifies the default storage mode for text metadata.
The ‘NoAdd’ value is the setting for Update-Only partitions.
You can also specify storage modes for regions which differ from the default settings.
Each partition mode has a section, and a list of regions and their storage modes can
be provided. Note that Low Memory and Merge File storage modes require DISK
configuration as a pre-requisite.
The [Link] file is
generated dynamically by administration
interfaces within Content Server. Normally,
you should not edit this file.
Beginning with Content Server 16, RAM based storage, ReadOnly mode and NoAdd
mode are no longer available through the administrative interfaces.
Customers typically see a 30% reduction in indexing performance with Disk storage
relative to Memory storage. For example, in one OpenText test case, a 4-partition
system indexing more than 1 million objects took 7 hours 24 minutes in Disk
mode versus 5 hours 9 minutes in RAM mode.
Switching between Value Storage and Low Memory disk modes will trigger a
conversion of the index format when the Index Engines are next started. Typically,
conversion of a partition should be less than 20 minutes. Value Storage mode is
backwards compatible with versions of Search Engine 10.0 back to Update 2. Low
Memory mode is new beginning with Update 9, and partitions in Low Memory mode
cannot be read by earlier versions of Search Engine 10.5.
The Merge File storage mode is first available in Content Server 10.5 Update 2015-03.
Retrieval Storage
This mode of storage is optimized for text metadata regions which need to be retrieved
and displayed, but do not need to be searchable. In this mode, the text values are
stored on disk within the Checkpoint file, and there is no dictionary or index at all. This
mode of operation is recommended for regions such as Hot Phrases and Summaries.
These regions do not need to be searchable since they are subsets of the full text
content (you can search the full body text instead). Typical ECM applications see a
savings of 25% of metadata memory using Retrieval Storage mode instead of Memory
Storage for these two fields.
Retrieval Storage mode can be configured in the [Link] file using the
value DISK_RET.
[DataFlow_DFname0]
DiskRetSection=DISK_RET
[DISK_RET]
RegionsOnReadWritePartitions=OTSummary,OTHP
RegionsOnNoAddPartitions=OTSummary,OTHP
RegionsOnReadOnlyPartitions=OTSummary,OTHP
Reserved Regions
There are a number of region names which are reserved by OTSE, and application
developers must be aware of the restrictions on their use. In most scenarios, the
Document Conversion Server is part of the indexing process, and DCS will also add a
number of metadata regions that are not described here.
OTMeta
The OTMeta region is reserved for use in two ways. In the first case, the region
OTMeta is reserved to indicate the collection of all metadata regions defined in the
Default Metadata List. This list is described in the [Link] file by the entry
DefaultMetadataFieldNamesCSL. A query against the OTMeta region will search
this entire list of regions. Where possible, this usage should be discouraged, since
searches of this form may be relatively slow compared to searching a specific region,
particularly if there are many regions included in the default search region list.
The second application is using OTMeta as the prefix for a region in a search query. A
query with a WHERE clause of [region "someRegion"] "term" is equivalent to
[region "OTMeta": "someRegion"] "term".
OTData
The OTData region represents the full text content of an object. When the indexed
content is XML, the element structure can be searched within queries. For example,
suppose an indexed XML document contains:
<furniture>
<chairs>
4
<chairColor>red</chairColor>
</chairs>
</furniture>
You can construct a query to locate objects where the chair color is red. The WHERE
clause of the search query would look something like this:
[region "OTData":"furniture":"chairs":"chairColor"] "red"
The XML search capability does not require a complete XML path specification. The
following WHERE clauses would also match this result, but would potentially also
match other results that are less specific:
[region "OTData":"chairs":"chairColor"] "red"
[region "OTData":"chairs"] "red"
To be a candidate for XML search matching, the XML document must have been
assigned the value text/xml in the OTFilterMIMEType region, which is typically
the responsibility of the Document Conversion Server. The metadata region and the
value for allowing XML content search are configurable in the DataFlow section of the
[Link] file:
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml
OTObject
Each index must specify a unique key region which functions as the master reference
identifier for an object. The region which represents the key is declared in the region
definitions file, but by convention and by default, the region OTObject is almost always
used as the key. During indexing, the unique key is defined in the OTURN entry for an
IPool object.
In practice, Content Server uses strings that begin with “DataId=” for the unique
identifier of managed objects. There are special cases in the code that rely on this form
of the OTObject field to determine when certain optimizations can be applied, such as
Bloom Filters for membership within a partition. If you are creating alternative or
custom unique object identifiers, ensure that the string “DataId” is not present in the
identifier to avoid unexpected behaviors.
OTCheckSum
This region contains a checksum for the full text content indexed for an object. The
value is generated by the Index Engines. Attempts to provide an OTCheckSum value
when indexing an object will increment the metadata error count for the object, and be
ignored. You can search and retrieve this region.
Internally, the Index Engines use this field to optimize re-indexing operations by
skipping content that is unchanged. This value is also used by index verification utilities
to verify that data has not been corrupted.
OTMetadataChecksum
This region has several purposes related to checksums for metadata. You cannot
index this region, but you can query against it and retrieve the values. Internally, this
value is used to verify the correctness of the metadata. Errors in the checksum
generally indicate severe hardware errors.
When a new object is indexed, a checksum of each metadata value is made. These
values are combined to create an aggregate checksum value, and the checksum is
stored in the region OTMetadataChecksum.
A background process is then scheduled which runs at a low priority. This process
traverses all objects in the index and recalculates the metadata checksum. If the
recalculated value does not match the stored value, a message is logged, and an error
code (-1) is placed in the OTMetadataChecksum region for that object. Applications
can find objects with metadata checksum errors by searching for a value of -1 in this
region.
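The checksum lifecycle described above can be sketched as follows. The combining function (CRC-32 over sorted region name/value pairs) is an assumption chosen for illustration; it is not the OTSE algorithm, and the -1 error marker mirrors the behavior described above.

```python
# Hedged sketch: per-value checksums combined into one aggregate stored in
# OTMetadataChecksum, plus a verification pass that flags mismatches.
import zlib

def metadata_checksum(regions: dict) -> int:
    """Combine a checksum of each region value into one aggregate value."""
    combined = 0
    for name in sorted(regions):  # deterministic order
        combined = zlib.crc32(f"{name}={regions[name]}".encode(), combined)
    return combined

def verify(regions: dict, stored_checksum: int) -> int:
    """Return the stored checksum if intact, or -1 on a mismatch."""
    if metadata_checksum(regions) == stored_checksum:
        return stored_checksum
    return -1  # searchable error marker, as described above

stored = metadata_checksum({"OTName": "My red car", "OTDocAuthor": "bob"})
print(verify({"OTName": "My red car", "OTDocAuthor": "bob"}, stored) == stored)  # True
print(verify({"OTName": "corrupted", "OTDocAuthor": "bob"}, stored))             # -1
```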
If an existing index does NOT have checksums computed, then the background
process will populate checksum values. When objects are re-indexed, changes to the
metadata will be reflected in the new checksum. Transactional integrity for metadata
regions that were not changed is preserved.
A configuration setting in the Index Engine section of the [Link] file controls this
feature, with acceptable values of ON, OFF and IDLE. When IDLE, new indexing
operations will still create checksums, but the background process will not validate
them. The default value is OFF for backwards compatibility.
MetadataIntegrityMode=OFF (IDLE | ON)
By default, the engines will wake up once every two seconds and verify 100 objects:
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
Metadata regions stored on disk are excluded from this processing by default, since
disk files have other checksum validation mechanisms. It is possible to include
checksum validation for regions stored on disk, as indicated below, but the
processing is considerably slower in this mode:
TestMetadataIntegrityOnDisk=OFF (ON)
OTContentStatus
This region is used to record an indicator of the quality of the full text index for each
object. This data can assist applications with assessing the quality of the indexed data,
and taking corrective action when necessary. The status codes are roughly grouped
into 4 levels of severity – level 100, 200, 300 and 400 codes, where 100 level codes
indicate good indexed content, and level 400 codes represent significant problems with
the content.
Applications can provide a status code as part of the indexing process. If the Indexing
Engines encounter a more serious content quality condition (a higher number code)
then the higher value is used. In other words, the most serious code is recorded if
multiple status conditions exist.
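The "most serious code wins" rule can be sketched in a few lines of Python (the helper names are illustrative, not part of any OpenText API):

```python
def severity_level(code: int) -> int:
    """Map a status code to its severity band (100, 200, 300 or 400)."""
    return (code // 100) * 100

def recorded_status(codes: list) -> int:
    """When multiple status conditions exist, the highest (most serious)
    code is the one recorded in OTContentStatus."""
    return max(codes)
```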
The majority of the codes are generated within DCS. Based upon Content Server 16,
the defined codes are:
100 There is no content indexed, only metadata. This is expected behavior, since
no content was provided as part of the indexing request.
103 This is the value for a normal, successful extraction and indexing of a single
document, both text and metadata.
104 One or more metadata regions contained non-UTF8 data. The non-UTF8 bytes
were removed and best-attempt indexing of the region performed. This
behavior only exists when region forgery detection is disabled.
120 The full text content of the indexing request was correctly processed, and is
comprised of multiple objects. The metadata of only the top or parent object
was extracted. The full text content of all objects is concatenated together. An
example is when multiple documents within a single ZIP file are indexed.
125 There were multiple objects provided for indexing, but some of them were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. The metadata of only the top or parent object was extracted. The
full text content of all objects that were not discarded are concatenated
together. A typical example would be when a Word document and JPEG photo
are attached to an email object, and the JPEG was discarded as an excluded
file type.
130 There were one or more content objects provided for indexing, but all were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. There is no full text content.
150 During indexing, the statistical analyzer in the Index Engine identified that the
content has a relatively high degree of randomness. This is a warning; the data
was accepted and indexed.
300 During indexing, the text required more memory than is allowed by the
Accumulator memory settings that are currently configured. The text has been
truncated, and only the first portion of the text that fit in the available memory
has been indexed.
305 Multiple content objects were provided, and at least one but not all of them are
an unsupported file format. There is some full text content, but the content of
the unsupported files has not been indexed.
310 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects consists of an unsupported file
format.
320 Multiple content objects were provided, and at least one but not all of them
timed out while trying to extract the full text content. There is some full text
content, but the content of the objects which timed out has not been indexed.
360 Multiple content objects were provided, and at least one but not all of them
could not be read. There is some full text content, but the content of the objects
exhibiting read problems has not been indexed.
365 One or more content objects were provided, and the full text of at least one but
not all of them could be indexed. At least one of these objects was rejected
because of a serious internal or code error while preparing the content. This
error may or may not recur if you re-index this object.
401 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of
unsupported character encoding.
405 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because the
process timed out while trying to extract the full text content from a file.
406 Non-UTF8 data was found in metadata regions with region forgery detection
enabled. The metadata was discarded.
408 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of a
serious internal or code error while preparing the content. This error may or
may not recur if you re-index this object.
410 DCS was unable to read the contents of the IPool message or the file
containing the content. No full text content has been indexed.
OTTextSize
This region captures the size of the indexed full text content in bytes. For many
languages there may be fewer characters than bytes. Note that this value reflects the
size of the text extracted by DCS and filters, and can be significantly different from
the OTFileSize region defined by Content Server. This region should be declared as
type INTEGER. First available in update 21.1.
OTContentLanguage
This region is optionally generated by the Document Conversion Server. DCS can
assess the full text content of an object to determine the language in which the content
is written. The language code is then typically represented in this region.
OTPartitionName
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the name of the partition which contains
the object. In a search query, OTPartitionName supports equals and not equals, for
either an exact value or a specific list of range values. Operations like regular
expressions or wildcards are not supported. This limited query set is intended to help
administrators with system management tasks, such as locating all the objects in a
given partition. In Content Server, partition names usually start with the text
“Partition_”.
OTPartitionMode
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the operating mode of the partition which
contains the object. In a search query, OTPartitionMode supports equals and not
equals, for either an exact value or a specific list of range values. Operations like
regular expressions or wildcards are not supported. This limited query set is intended
to help administrators with system management tasks, such as locating all the objects
in a retired partition. The mode will be one of Read-Write, Update-Only, Read-Only
or Retired.
OTIndexError
This field is used to contain a count of metadata indexing errors associated with an
object. Metadata indexing errors occur for situations such as:
• An improperly formatted metadata object. A string value within an integer or
date field would be examples of this.
• An improperly formed region name.
• Attempts to provide values for reserved and protected region names.
For each such instance, the OTIndexError count region is incremented. Applications
providing objects for indexing may provide an initial value. For example, DCS may
have found that a date or integer value it attempted to extract was incorrect, and
therefore could determine that there is already a metadata error before the Index
Engine is provided with the object.
The error counts are incremental. Updates to objects which contain metadata errors
can cause this value to become artificially inflated. For example, if an object is added
with a date error, and the same date error is then included in 10 subsequent updates,
the error count may be 11.
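The inflation effect described above can be sketched as follows (illustrative only, not engine code):

```python
def index_error_count(initial_errors: int, errors_per_update: list) -> int:
    """OTIndexError is incremented on every indexing operation that hits
    a metadata error, so repeated updates carrying the same bad value
    inflate the count."""
    count = initial_errors
    for errors in errors_per_update:
        count += errors
    return count

# One add with a date error, followed by 10 updates repeating it:
total = index_error_count(1, [1] * 10)
```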
Applications can query and retrieve this field to help assess the quality of the search
index.
OTScore
This synthetic region usually contains the computed relevance score for a search result
as an integer value. With the default configurations, a relevance score is between 0
and 100. It is important to understand that the computed relevance score does NOT
have any measurable correlation with the relevance of an object as assessed by a
user. At best, these scores should be considered relative. For most applications,
displaying the OTScore (or computed relevance) is not appropriate.
Although a simple integer is presented in the OTScore, internally the relevance
differences between objects may be very small fractions. The sorting of objects
internally for relevance is based on the floating point value.
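The distinction between the displayed integer and the internal floating point ordering can be illustrated with a small sketch (the scores shown are invented for illustration; the actual scoring internals are not exposed):

```python
# Hypothetical (title, internal relevance) pairs; relevance is a float.
results = [("B", 72.4031), ("A", 72.4058), ("C", 51.0)]

# Sorting uses the full floating point value...
ranked = sorted(results, key=lambda r: r[1], reverse=True)

# ...even though A and B would both display as OTScore 72.
displayed = [(title, int(score)) for title, score in ranked]
```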
TimeStamp Regions
During indexing operations, the Index Engine can mark objects with the time that
objects are created or updated. This behavior is enabled by including the appropriate
definitions in the [Link] file as described below. When enabled, by
default these timestamps are added on all objects. If trying to minimize the index size,
you might want to add timestamps to only a subset of objects. For example, with
Content Server, you might want to add timestamps to only the Content Server “Index
Tracer” objects. To stamp only specific object types, ensure the TimeStamp fields
are defined in [Link], and add the list of object types to the [DataFlow_]
section of the [Link] file. Only objects that contain an OTSubType value in the list
will have the time stamp values added:
IndexTimestampOnlyCSL=147
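The effect of IndexTimestampOnlyCSL can be sketched as a simple membership test (an illustrative helper, not actual engine code):

```python
def should_timestamp(obj_subtype: str, only_csl) -> bool:
    """If IndexTimestampOnlyCSL is set, only objects whose OTSubType is in
    the comma-separated list receive timestamp regions; otherwise all
    objects are stamped."""
    if not only_csl:
        return True
    allowed = {s.strip() for s in only_csl.split(",")}
    return obj_subtype in allowed
```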
OTObjectIndexTime
When an object is created, this field will be populated with the current time, as
determined by the system clock. This field has the type TIMESTAMP, and must
be declared in the [Link] file to function.
OTContentUpdateTime
When the text content of an object is updated, this value records the current time
for the update. Only actual changes to the content will trigger a change. If an object
is re-indexed, but the text content is identical, then this value will not be updated.
This region has the type TIMESTAMP, and must be declared in the
[Link] file to function.
The definition of “identical” is based upon the text as interpreted by the index
engine. Changes in the tokenizer or file format filters may result in the text being
declared “different”, even if the master object content is unchanged.
OTMetadataUpdateTime
This field records the time at which the metadata for an object was last modified.
If an object is re-indexed and no metadata changes, then this value is not updated.
This region has the type TIMESTAMP, and must be declared in the
[Link] file to function.
OTObjectUpdateTime
This field is updated any time the metadata OR the content is changed. You should
normally not remove this field, since it is required for correct operation of Search
Agents.
_OTDomain
The searchable email domain feature generates synthetic regions by appending this
suffix to the email region name. For instance, if your region that contains email is
OTEmailSender, then the region OTEmailSender_OTDomain will be created to
support the email domain search capability.
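The derived region naming, and the kind of value it holds, can be sketched as below. The domain extraction shown is an assumption about the feature's behavior, included only to illustrate what the synthetic region makes searchable:

```python
def domain_region_name(email_region: str) -> str:
    """The synthetic region is the email region name plus _OTDomain."""
    return email_region + "_OTDomain"

def email_domain(address: str) -> str:
    """Assumed behavior: the searchable domain is the part after '@'."""
    return address.rsplit("@", 1)[-1]
```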
_OTShadow
Regions ending with the string _OTShadow are created when the LIKE operator is
configured. If the Content Server region OTName is configured for use with LIKE, then
the region OTName_OTShadow contains the extended indexing information required
by the LIKE feature.
In the index, these types of regions are typically prefixed with OTDocXXXX or
OTXMP_XXXX. Be careful if you choose to remove these, since it is possible that
region names from other sources might match this naming convention. For example,
the Content Server ‘User Rating’ metadata fields OTDocSynopsis and
OTDocUserRating also have this form.
Workflow
Indexing of Workflow metadata from Content Server has been problematic historically,
but is considerably better since Content Server 10.0 Update 10.
Firstly, the default Workflow configuration indexes all the internal Workflow metadata
to the search engine. In most applications, many of these regions have no value for
user search. The default region definitions file has DROP or REMOVE instructions in
place to prevent this data from being indexed. If you need to make these metadata
fields searchable, edit the definitions file appropriately.
The other aspect is Workflow Map attributes. These are presented as regions for
indexing in the form WFAttr_xxxx, where xxxx is text that represents the name of the
Workflow attribute. It is possible for a very large number of these WFAttr_ regions to
exist, especially in older versions of Content Server where the default setting was to
always index these regions. This increases the size of the index. If you do not need
to search on these fields, you might consider DROP or REMOVE in the definitions file.
If searching the aggregate value of these fields is sufficient, you might also want to
consider using AGGREGATE-TEXT for queries against these regions, in conjunction
with DISK_RET for storing the values.
indexed, they will be marked as type ‘TEXT’, and cannot be changed short of removing
the entire region and re-indexing the objects, or using the region type conversion
features.
This is an optimization consideration only. Leaving the Category and Attribute values
as TEXT within the index does not affect feature availability, although differences in
behavior between integer and text values may be a concern.
Forms
Within Content Server 10, the Forms module permits users to create arbitrary labels
for form fields. The region names are generated directly from these labels.
Unfortunately, this can result in conflicts with other search regions in the index. It is
recommended that you enforce a business practice of prefixing all form names with a
unique value, such as OTForm_. This will provide two major benefits: it will minimize
the chance of name conflicts, and it allows use of AGGREGATE-TEXT regions to
improve search usability.
Content Server 10.5 or later will generate region names that follow a well defined
syntax, along the lines of OTForm_1234_5678. This change makes it much easier to
identify regions associated with forms, and simplifies selecting them for REMOVE or
aggregation purposes.
Custom Applications
It is common for OpenText customers to create their own solutions using Content
Server as a platform. Often, the considerations for metadata indexing and search are
overlooked. If you have custom applications that index metadata fields, you should
consider the impact on search index size and performance.
• Only index object subtypes that are of interest to users
• Only extract metadata fields that are useful for search
• Ensure that the region definition file has optimal configuration for each region
• Provide a unique prefix so that the custom metadata will not conflict with
other region names
• If appropriate, add the custom regions to the default Content Server search
regions.
Indexing
Updating the search index is performed by preparing files containing indexing
commands in a defined location. The input files and structures are in the OpenText
“IPool” format. The Update Distributor watches for these files, and initiates indexing
when IPools arrive.
A single IPool may contain many indexing commands and objects. Updates to the
index from an IPool are only “committed” once all of the messages within the IPool are
successfully handled. If either the Update Distributor or one of the Index Engines is
unable to process a message, then the indexing process will halt and the all the
changes from the IPool are rolled back when the Index Engines are restarted. This
behavior applies to serious indexing IPool errors, such as malformed IPool messages.
Objects too large, for example, are not IPool errors.
If multiple partitions exist for an index, the Update Distributor chooses which partition
will index an object. Some operations, such as Modify By Query, are broadcast to all
the Index Engines. Most operations are specific to a single partition, and the first step
in deciding which partition to use is to ask if any of the existing Index Engines already
have an entry with the same object identifier (the “Key” value). If one of the Index
Engines responds affirmatively, then the object is given to that Index Engine to add,
modify or remove.
If no partition already has the object, the Update Distributor will make a selection
based upon the Read-Write or Update-Only mode of the partitions, and whether they
are full.
Partitions which are in “Update-Only” or “Retired” mode are never given new objects
to index. Partitions which are in “Read-Only” mode do not have Index Engines running,
and are not given any indexing tasks.
<Object>
<Entry>
<Key>OTURN</Key>
<Value>
<Size>16</Size>
<Raw>8273908620;ver=1</Raw>
</Value>
</Entry>
<Entry>
<Key>Operation</Key>
<Value>
<Size>12</Size>
<Raw>AddOrReplace</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>187</Size>
<Raw>
<FileName>/MyContentInstances/[Link]</FileName>
<ObjectTitle>Things that go bump</ObjectTitle>
<OTName>Cars</OTName>
<OTName lang="fr">Voitures</OTName>
<OTCurrentVersion>true</OTCurrentVersion>
</Raw>
</Value>
</Entry>
<Entry>
<Key>ContentReferenceTemp</Key>
<Value>
<Size>20</Size>
<Raw>C:/dev/[Link]</Raw>
</Value>
</Entry>
<Entry>
<Key>Content</Key>
<Value>
<Size>28</Size>
<Raw>full text to be indexed here</Raw>
</Value>
</Entry>
</Object>
The <Size> value reports the number of characters contained within a <Raw> section.
The <Raw> section contains the actual values. The <Raw> section can contain
arbitrary data expressed in UTF-8 encoding, and does not require character escaping
because the <Size> is known, although for metadata regions this data is expected to
be structured much like XML. The <Key> value specifies the top level purpose for
each entry, sometimes processed by DCS, sometimes by the Index Engines. This
object contains 5 entries – the OTURN, Operation, Metadata, and content referenced
in two different ways.
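The <Size> bookkeeping can be verified with a trivial sketch; the sizes in the example above simply match the character counts of their <Raw> payloads:

```python
def raw_size(raw: str) -> int:
    """<Size> reports the number of characters in the <Raw> section."""
    return len(raw)
```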
Every object to be indexed requires a unique identifier. For typical Content Server
applications, the unique identifier is provided in the region “OTURN”, as shown in this
example. The value for the OTURN is “8273908620;ver=1” – different Content Server
modules may provide OTURN values in different forms. Operations such as
ModifyByQuery would use a query “where clause” as the OTURN.
The Operation entry instructs the Index Engines how the object should be interpreted
as explained in the sections below.
The Metadata entry is used to provide the regions names and values that are provided
for indexing. In the example above, metadata for the regions FileName, ObjectTitle,
OTName and OTCurrentVersion are provided. You can specify multiple values for one
region. The OTName region, for example, has two values, and one of them also uses
the attribute key/value feature of OTSE to specify that “voitures” is the French language
value.
The entry for ContentReferenceTemp is used to identify that the content data is located
at the specified file location. The IPool libraries would normally delete the file after
processing, since by convention ContentReferenceTemp is used when a temporary
copy of a file was made. A permanent copy can also be specified using
ContentReference as the key, which does not delete the original. IPools given to the
Index Engines normally should NOT have either ContentReferenceTemp or
ContentReference entries, since extraction and preprocessing of files should already
have occurred to extract the raw text data. These modes are common for earlier steps
in the DCS process.
The entry for Content in the example indicates that the data in question is contained
within the IPool, in the <Raw> section. This is the normal expected use case for IPools
being consumed by the Update Distributor. Unlike this artificial example, having both
Content and ContentReferenceTemp values is atypical.
AddOrReplace
This is the primary indexing operation used to create new objects in the index. If the
object does not exist, it will be created. If an entry with the same OTURN exists in
either a Read-Write or Update-Only partition, then it will be completely replaced with
the new data, equivalent to a delete and add.
The AddOrReplace function distinguishes between content and metadata. If an object
already exists, and metadata only is provided, the existing full text content is retained.
However, the line between content and metadata is somewhat distorted. The DCS
processes will typically extract metadata from content and insert this metadata into
regions for indexing. There is a list of metadata regions which are therefore considered
to be “content”, and not replaced or deleted if content is not provided in a replace
operation.
The list of metadata considered to be content for this purpose is defined in the
[DataFlow_] section of the [Link] file by:
ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType
ExtraDCSStartsWithNames=OTDoc,OTCA_,OTXMP_,OTCount_,OTMeta_
DCSStartsWithNameExemptions=OTDocumentUserComment,
OTDocumentUserExplanation
ExtrasWillOverride=false
The ExtrasWillOverride setting can be used to disable this feature; when set to true,
these regions are deleted if content is not indexed in an AddOrReplace operation. The
DCSStartsWith entry is used to capture the dynamic regions that DCS extracts from
document properties.
The Exemptions list identifies regions that should not be treated as part of the full text
content, despite matching the DCS “starts with” pattern.
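The decision of whether a metadata region is treated as "content" for AddOrReplace purposes can be sketched from the settings above. This is an illustrative reading of the configuration, not engine source; the prefix list follows the INI entries shown:

```python
EXTRA_NAMES = {"OTSummary", "OTHP", "OTFilterMIMEType", "OTContentLanguage",
               "OTConversionError", "OTFileName", "OTFileType"}
STARTS_WITH = ("OTDoc", "OTXMP_", "OTCount_", "OTMeta_")
EXEMPTIONS = {"OTDocumentUserComment", "OTDocumentUserExplanation"}

def is_dcs_content_region(region: str) -> bool:
    """True if the region is replaced along with the full text content,
    rather than being treated as ordinary metadata."""
    if region in EXEMPTIONS:
        return False
    return region in EXTRA_NAMES or region.startswith(STARTS_WITH)
```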
The AddOrReplace function can also trigger “rebalancing” operations. If the target
partition is Retired or has exceeded its rebalancing threshold, the Update Distributor
will instead delete the object from the partition where it currently resides, and redirect
the AddOrReplace operation to a partition with available space.
AddOrModify
The intended use of AddOrModify is to update selected metadata regions for an item
thought to already exist in the index. The AddOrModify function will update an existing
object, or create a new object if it does not already exist. When modifying an existing
object, only the provided content and metadata is updated. Any metadata regions that
already exist which are not specified in the AddOrModify command will be left intact.
There is no mechanism to delete a region which has already been defined for an object,
but you can delete the values by providing an empty string as the value for the region
("").
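The merge semantics can be sketched as a dictionary overlay (illustrative only):

```python
def add_or_modify(existing, provided: dict) -> dict:
    """AddOrModify: create the object if absent; otherwise overlay only
    the provided regions, leaving unspecified regions intact. An empty
    string clears a region's value (the region itself is not deleted)."""
    merged = dict(existing) if existing is not None else {}
    merged.update(provided)
    return merged
```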
One potential downside of the AddOrModify operation is that if you selectively modify
metadata regions and the target object is not already correctly indexed, you will create
a new object that only has the metadata regions or content which was defined in the
modify operation. This will effectively create an object which only has partial data
indexed. If you provide all metadata region values in a modify operation, this situation
will not arise. New applications may want to consider using the “ModifyByQuery” or
“Modify” indexing operators instead of AddOrModify, since they do not create an
object if one does not already exist.
Modify
The Modify operation is used to update specific metadata in an object. Unlike the
AddOrModify operation, Modify will never create a new object. If the OTURN specified
in a Modify operation does not exist, the transaction is simply discarded. Modify can
add new metadata, or replace existing metadata. Metadata for regions not included in
the IPool message are unaffected.
Delete
The Delete function will remove an object from the index, including both the metadata
and the content.
Note that if an object exists in multiple partitions, it will only be removed from the
partition to which the Update Distributor sent the Delete operation. This is a very rare
case, and would likely only arise if partitions were marked as Read-Only, then updates
to objects in the Read-Only partition were performed.
DeleteByQuery
The DeleteByQuery operator deletes objects which meet the provided search criteria.
A standard “WHERE” clause is provided in OTURN. This operator can be used to
delete many objects at once. Since the Update Distributor broadcasts the function to
all active partitions, duplicate objects can also be removed.
DeleteByQuery is of particular usefulness for applications that no longer track the
unique identifier for an object.
Applications which need to perform bulk deletes on a project will also find this far more
efficient. Instead of issuing an individual delete request for each of the 25,432 objects
in a project, a single
DeleteByQuery operation with an OTURN of
[region "ProjectName"] "old project"
would delete all objects marked as belonging to the project in a single transaction.
ModifyByQuery
This operation is used to selectively modify the content or specific metadata regions
for objects in the index. The affected objects are specified by search parameters – a
valid “WHERE” clause within the OTURN entry of the IPool. If no objects match the
query, then no updates are performed. Every object in the index which matches the
query will have the provided regions updated. Other regions for objects are not
affected; for example, you could change the value in the region “CurrentVersion” to
“false” without modifying values in other regions.
The Update Distributor will send ModifyByQuery operations to every active partition.
To modify a specific known object, you can place an object ID in the OTURN field:
[region "OTURN"] "ObjectID=1833746;ver=3"
You can also quickly perform bulk operations, such as marking all the objects
associated with a specific project as “released”. The IPool would contain region values
such as:
<ProjectStatus>released</ProjectStatus>
And the Key field in the IPool would contain a ‘WHERE’ clause such as:
[region "ProjectName"] "Great Scott"
All objects with the value of “Great Scott” in a region labeled “ProjectName” will then
have their ProjectStatus region populated with the value “released”.
A value for a region cannot be completely removed, but it can be replaced with an
empty string by providing a region definition in the IPool that has an empty string:
<ProjectStatus></ProjectStatus>
The full text content of an object cannot be updated using ModifyByQuery.
Transactional Indexing
The indexing process with OTSE is transactional in nature. This essentially means
that the indexing request is not deleted until the index updates have been committed
to disk.
Transactional indexing ensures that no indexing requests are lost in the event of a
power loss or similar problem while indexing is taking place.
OTSE treats all of the indexing requests within an input IPool as a single transaction.
The input IPool is not considered complete until every request in the IPool is serviced
and committed to disk. Only then is the IPool deleted.
There are performance considerations related to transactional indexing. The more
objects there are within an IPool indexing transaction, the more efficient the indexing
process is. This is because a new index fragment is created each time a transaction
completes. Many objects in a transaction therefore generate fewer new index
fragments, and use the disk bandwidth more efficiently.
The converse of this is the time to index. By collecting index updates and packaging
them into transactions, on low-load systems the average time for an object to be
indexed becomes somewhat longer. The majority of applications do not have a
requirement to minimize the lag between an object update and the moment the
changes are reflected in the index, so a large number of objects per indexing IPool is
generally the best approach.
OTSE does not collect objects to create transactions. The number of objects in a
transaction is set by the upstream applications which are generating the indexing
updates. By default, Content Server 16 will attempt to package up to 1000 objects
within a single indexing transaction.
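The fragment arithmetic behind this advice is simple; a sketch:

```python
import math

def fragments_created(num_objects: int, objects_per_transaction: int) -> int:
    """One new index fragment is created per completed transaction, so
    larger transactions mean fewer fragments for the same object count."""
    return math.ceil(num_objects / objects_per_transaction)
```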
IPool Quarantine
In the event that an object in an IPool cannot be indexed because of severe errors, the
affected indexing component will halt. Upon restart, all of the indexing operations for
the IPool will be rolled back. Depending on the error code and configuration settings,
the Admin Server might automatically restart the component. If an IPool fails in this
way 3 times, it is moved into quarantine and the next IPool is processed. The
quarantine location is a sub-directory named \failure in the IPool input directory. If there
are too many quarantined items, the IPool libraries can be configured to either halt or
discard the oldest IPool. Quarantine behavior is a Content Server configuration, not in
OTSE.
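The retry-then-quarantine behavior can be sketched as the following control flow (illustrative, not OpenText source):

```python
def always_fails(_ipool):
    """Stand-in for an IPool that cannot be indexed (e.g., malformed)."""
    raise RuntimeError("malformed IPool message")

def process_ipool(attempt_fn, ipool, max_attempts: int = 3):
    """Try to index an IPool; after max_attempts failures (each failure
    rolling back the whole IPool), quarantine it and move on."""
    for _ in range(max_attempts):
        try:
            attempt_fn(ipool)
            return "indexed"
        except Exception:
            continue  # rolled back; component restarted, IPool retried
    return "quarantined"  # moved to the \failure sub-directory
```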
Query Interface
Queries to OTSE are submitted to the Search Federator over a socket connection
using a language known as OpenText Search Query Language (OTSQL). Applications
communicating directly with the Search Federator will need to understand and
implement this wire-level protocol exposed by the Search Federator. Content Server
implements this protocol, as does the Admin Server component of Content Server and
the search client built into OTSE.
Responses from the Search Federator are expressed in a clear text data stream which
explicitly includes data size information to allow parsing values without needing to
escape special characters.
The available commands are described below. The commands themselves are not
case sensitive, although parameters to the commands such as region names may be
case sensitive.
Select Command
The select command is used to initiate a query. This command is essentially the
OpenText “OTSTARTS” query language, which is described in more detail in the
OTSQL section of this document. The basic form is:
<OTResult>
Cursor 0
DocSetSize 1012
</OTResult>
<OTResult>
cursor 100
</OTResult>
The cursor is automatically advanced after a get results command, which means
that use of set cursor between get results is optional if you are retrieving
consecutive sets of results. It should also be noted that moving the cursor forward is
relatively efficient. Moving the cursor backwards internally requires a reset to the start
of the results and moving forward to the desired location. If you are performing multiple
get results operations, structuring them to move strictly forward through the results
is much faster. This observation is only true within a search transaction (between open
and close operations), and has no impact on distinct queries.
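The asymmetry in cursor movement cost can be expressed as a simple cost model (conceptual, counting positions traversed, not measured figures):

```python
def cursor_move_cost(current: int, target: int) -> int:
    """Moving forward costs the distance moved; moving backward requires
    a reset to the start of the results plus a forward scan to the target."""
    if target >= current:
        return target - current
    return target  # reset to 0, then advance 'target' positions
```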
There is an alternative method for managing the cursor location. The general form of
a query is:
Select … where … orderedby … starting at N for M
where N is the number of the first desired result, and M is the number of results to
return in the Get Results command. The first result has a number of 0. For example,
“starting at 1000 for 250” would return results number 1000 through 1249 when Get
Results is called. This method is not generally used or recommended, and is noted
here for completeness.
Using Set Cursor with Get Results is the recommended usage pattern.
<OTResult>
ROWS 4
ROW 0
COLUMN 0 "OTObject"
DATA 25
DataId=41280133&Version=1DATA END
COLUMN 1 "OTName"
DATA 29
Approval Handilist [Link] END
ROW 1
COLUMN 0
DATA 25
DataId=41280094&Version=1DATA END
COLUMN 1
DATA 18
P&L Jun to [Link] END
ROW 2
COLUMN 0
DATA 25
DataId=41280131&Version=1DATA END
COLUMN 1
DATA 0
DATA END
ROW 3
COLUMN 0
DATA 25
DataId=41280093&Version=1DATA END
COLUMN 1
DATA 10
Mar [Link] END
</OTResult>
In this example, there are 4 results, indicated by the “ROW” values. ROW values are
numbered starting at 0.
Each result contains 2 returned regions, identified by the COLUMN values. In the first
ROW, the COLUMN labels are provided. To save bandwidth, the COLUMN values are not
labeled in subsequent ROWS.
The COLUMN values are numbered starting at 0, in the same order in which the regions
were requested in the SELECT statement for the query. Note that the DataId= portion
of the COLUMN 0 results is typical of how Content Server provides the data for indexing;
it is not an artifact of the search technology.
If a value is not defined for a region, the region is still returned in the results with an
empty value. ROW 2 COLUMN 1 illustrates this case.
If ATTRIBUTES were requested in the select statement, then the requested attribute
information will be appended to the get results data. In the example below, the data
element for the region “TestSplit” has 3 values. The first value had one attribute, the
language (English), the second has two attributes, and the third value has no attributes
– indicated by the empty placeholder.
COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr"
translated="true"</><></>ATTRIBUTES END
If HIT LOCATIONS were requested in the select statement, the locations are added
to the results:
COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr" translated="true"</>
<></>ATTRIBUTES END
LOCATIONS 17
0 4 6 1; 2 10 7 3 LOCATIONS END
Each semicolon-separated entry is a set of four numbers: the cell (counting from 0),
the hit position, the hit length, and the matching query term. Here, the first cell (0) has
a hit at location 4, length 6, matching term 1; the third cell (2) has a hit starting at
character 10 with length 7, matching query term 3.
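As an illustration, the LOCATIONS payload can be split into structured records with a few lines of code. This is a client-side parsing sketch; the dictionary field names are labels chosen here, not part of the protocol.

```python
# Parse a LOCATIONS payload: each semicolon-separated entry holds four
# numbers - cell index, hit start position, hit length, and the matching
# query term number (all 0-based).

def parse_locations(payload):
    hits = []
    for entry in payload.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        cell, start, length, term = (int(n) for n in entry.split())
        hits.append({"cell": cell, "start": start,
                     "length": length, "term": term})
    return hits

hits = parse_locations("0 4 6 1; 2 10 7 3")
# hits[0] describes cell 0: a hit at position 4, length 6, query term 1
```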
If you are retrieving large numbers of search results, it can be more efficient to break
the operation into multiple get results operations. Typically, these “gulp” sizes are
optimal in the 500 to 2000 results range. The performance benefit of using an optimal
size is typically only about 10 percent, so this is not a critical adjustment.
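The batched "gulp" pattern can be sketched as a simple loop. The conn object and the exact command strings below are illustrative assumptions, not an official client API.

```python
# Retrieve a large result set in gulps using Set Cursor + Get Results.
# `conn.send` and the command strings are assumptions for illustration.

def fetch_all_results(conn, total, gulp=1000):
    """Fetch `total` results in gulps of `gulp` results each."""
    results = []
    position = 0
    while position < total:
        count = min(gulp, total - position)
        conn.send("set cursor %d" % position)                # position the cursor
        results.extend(conn.send("get results %d" % count))  # retrieve one gulp
        position += count
    return results
```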
Facet results may be incomplete because of size restrictions. This means that there are
facet values in the index for this region that have not been considered in computing
these facet results.
The facet data is terminated with the FACETS END text.
A simple example of output from a get facets command is included below. Note the
special case where a facet has no values, as illustrated in the COLUMN 1 values.
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "OTModifyDate","Date"
FACETS 45
3,9,d20120605,14;9,d20120528,4;9,d20120514,1;
FACETS END
COLUMN 1 "OTUserName","UserLogin"
FACETS 3
1,;FACETS END
</OTResult>
Date Facets
Facets for regions that are defined as type DATE in the [Link] file have
a special presentation in the facet results.
Each date value is placed into buckets representing days, weeks, quarters, months
and years. Instead of the most frequent values being returned in facets, the most
recent values are returned instead. For most search-based applications, the
“recentness” of an object is a key consideration, and the implementation of date facets
reflects this requirement.
A single date value may be represented in multiple buckets. For example, if today is
July 1st 2012, an object with an OTCreateDate of June 30 2012 may be represented in
the facet values for yesterday, for this week, for last month, last quarter and this year.
Each date bucket type has a distinct naming convention to help parsers discriminate
between the buckets.
• Years have the form y2012. Years are aligned to the calendar. The current year
will include dates from the start of the year to today.
• Quarters have the form q201204, which represent the year and the month in which
the quarter starts. Quarters start in January, April, July and October. The current
quarter will include dates from the start of the quarter to today.
• Months have the form m201206, which represent the year and the month. Month
facets are aligned to the calendar month. The current month will include dates from
the start of the month to today.
• Weeks have the form w20120624, which represents the year, month and first day
of the week. Weeks are always aligned to start on Sundays. The current week will
include dates from the start of the week to today.
• Days have the form d20120630, which represents the year, month and day.
If the contents of a date bucket are empty (count of zero), then no result is returned for
that bucket.
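The bucket naming rules above can be sketched as a small function, assuming weeks aligned to Sundays and quarters starting in January, April, July and October.

```python
# Compute the date facet bucket labels for a given date.
from datetime import date, timedelta

def date_facet_buckets(d):
    # Sunday that starts the week containing d; Python's weekday() is
    # Monday=0..Sunday=6, so step back (weekday+1) % 7 days.
    week_start = d - timedelta(days=(d.weekday() + 1) % 7)
    quarter_start_month = ((d.month - 1) // 3) * 3 + 1
    return {
        "year":    "y%04d" % d.year,
        "quarter": "q%04d%02d" % (d.year, quarter_start_month),
        "month":   "m%04d%02d" % (d.year, d.month),
        "week":    "w%04d%02d%02d" % (week_start.year, week_start.month,
                                      week_start.day),
        "day":     "d%04d%02d%02d" % (d.year, d.month, d.day),
    }

date_facet_buckets(date(2012, 6, 30))
# → {'year': 'y2012', 'quarter': 'q201204', 'month': 'm201206',
#    'week': 'w20120624', 'day': 'd20120630'}
```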
Refer to the FACETS portion of the SELECT statement for information on requesting
the number of facet values for each of years, quarters, months, weeks and days.
FileSize Facets
The [Link] file can be used to identify integer or long regions that should be treated
as FileSize facets. Size facets are optimized for values that represent file sizes.
Discrete file size facets are not useful on their own. File sizes range from 0 bytes to
gigabytes, but are psychologically thought of in geometric ranges. The FileSize facet
places integers into ranges that follow this geometric pattern. The entire set of ranges
is returned, rather than the most frequent counts for facets. Applications presenting
facets may choose to combine these ranges into larger ranges.
The buckets for FileSize facets and the corresponding labels for those buckets are
captured in the table below:
The list of integer regions to be presented as FileSize facets is within the [Link] file
in the [Dataflow_] section. The default regions shown here are tailored for typical
Content Server installations:
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
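To illustrate the idea of geometric bucketing, the sketch below assigns sizes to ranges. The boundaries and labels here are purely illustrative assumptions; the engine's actual bucket table is product-defined and differs from this sketch.

```python
# Illustrative geometric bucketing for file sizes. These boundaries and
# labels are assumptions, not the engine's real FileSize bucket table.

def filesize_bucket(size_bytes):
    buckets = [
        (10 * 1024, "0-10KB"),
        (100 * 1024, "10KB-100KB"),
        (1024 * 1024, "100KB-1MB"),
        (10 * 1024 * 1024, "1MB-10MB"),
        (100 * 1024 * 1024, "10MB-100MB"),
    ]
    for upper, label in buckets:
        if size_bytes < upper:
            return label
    return "100MB+"
```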
Expand Command
This command is used to determine the list of words that are used in a search query
for a given term expansion operation. Term expansions occur when features such as
stemming, regular expressions or a thesaurus are used in a term. The simple case of
stemming to match boat and boats is illustrated below.
<OTResult>
ROWS 2
ROW 0
COLUMN 0 "Data"
DATA 4
boatDATA END
ROW 1
COLUMN 0 "Data"
DATA 5
boatsDATA END
</OTResult>
The expand command can be used with the other term expansion operators as well.
HH Command
The HH (hit highlighting) command returns the locations of matching terms and
phrases within a supplied block of text, as shown in the following example.
> HH
> DATA 61
> The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the
plain
> TERMS 2
> the
> spain falls
<OTResult>
HITS 3
0,3,0
52,3,0
24,17,1
</OTResult>
After the TERMS element, each keyword to be matched is entered on a separate line.
If there are multiple words in the line, it is considered to be a phrase to be matched.
This example requests hit highlighting for the terms “the” and “spain falls”.
The results are comprised of numeric triplets, where each triplet is of the form
POSITION,LENGTH,TERM. The position starts at 0, and the term numbering starts at
0.
The hit highlighting code strips common HTML formatting characters out of the data.
In this example, the </Tag> is ignored when matching the phrase “spain falls”, although
these formatting tags are counted in the character positions.
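Applying the HITS triplets to the original data can be sketched as below; inserting markers from the highest position backwards keeps the earlier offsets valid. The square-bracket markers are an arbitrary choice for illustration.

```python
# Wrap each hit (position, length, term) in markers, working backwards
# so that insertions do not shift the positions of earlier hits.

def apply_hits(data, hits):
    for pos, length, term in sorted(hits, reverse=True):
        data = (data[:pos] + "[" + data[pos:pos + length] + "]"
                + data[pos + length:])
    return data

text = "The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the plain"
marked = apply_hits(text, [(0, 3, 0), (52, 3, 0), (24, 17, 1)])
# marked → "[The] <B>rain</B> in <Tag>[Spain</Tag> falls] mainly on [the] plain"
```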
You may need to use the EXPAND command to obtain a list of terms that should be
tested in hit highlighting.
Get Time
While a query is executing, detailed timing information for each element of the query
is tracked. The Get Time command will return this data, including total time, wait time,
execution time, and execution time broken down by each command execution within
the connection. To obtain accurate information about the entire search query, this
should be the last command executed before closing the connection.
<OTResult>
<TIME>
<ELAPSED>68638</ELAPSED>
<SELECT>21329</SELECT>
<GET RESULTS>610</GET RESULTS>
<GET FACETS>187</GET FACETS>
<HH>0</HH>
<GET STATS>31</GET STATS>
<EXECUTION>22157</EXECUTION>
<WAIT>46481</WAIT>
</TIME>
</OTResult>
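Because the tag names in this response can contain spaces (such as GET RESULTS), a strict XML parser will reject them; a regular expression suffices for this flat structure. This client-side parsing sketch is an assumption, not part of the product. Note that in this sample, ELAPSED equals EXECUTION plus WAIT.

```python
# Parse the flat Get Time response into a dictionary of timings.
# Tag names may contain spaces, so an XML parser is avoided.
import re

def parse_times(response):
    return {name: int(value)
            for name, value in re.findall(r"<([A-Z ]+)>(\d+)</\1>", response)}

sample = """<TIME>
<ELAPSED>68638</ELAPSED>
<SELECT>21329</SELECT>
<GET RESULTS>610</GET RESULTS>
<GET FACETS>187</GET FACETS>
<EXECUTION>22157</EXECUTION>
<WAIT>46481</WAIT>
</TIME>"""
times = parse_times(sample)
# times["ELAPSED"] → 68638 = times["EXECUTION"] + times["WAIT"]
```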
Set Command
The set command is used to specify values for variables that apply to subsequent
operations. The supported set operations are described below.
Set uniqueids true requests that the Search Federator remove duplicate results from
multiple Search Engines. The optional maxNum parameter is the upper limit on
performing de-duplication. If there are more results than maxNum, de-duplication does
not occur. De-duplication is generally not recommended, since it can negatively impact
query performance and increases the memory used by the Search Federator.
Duplicates of objects may exist if a partition was placed in read-only mode, and
subsequent attempts are made to modify an object managed by the read-only partition.
This causes a new instance of the object to be created in a read-write partition.
De-duplication is a last-resort method if you have misused the read-only mode for
partitions.
The Set Lexicon and Set Thesaurus commands are usually the first operations in a
handshaking sequence for a search query. If one or more search engines are
unavailable, an error message is returned.
Get Regions Command
The Get Regions command returns the names and descriptions of the regions defined
in the index:
get regions
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 0
DATA END
ROW 2
COLUMN 0
DATA 16
OTWFSubWorkMapIDDATA END
COLUMN 1
DATA 0
DATA END
…
</OTResult>
The Get Regions command can take an optional parameter, “types”.
get regions types
When the types parameter is present, this function will include the type definition for
the region in the response. This type definition can be used to provide optimized
interfaces for users (for example, integer comparisons instead of text modifiers). If
multiple partitions report different types, then the Search Federator will respond with
the value “inconsistent” as the type. Note that differences in region types for partitions
in Retired mode are allowed; the assessment of inconsistency is based only on
partitions that are not Retired. The possible types are: Integer, Long, Enum, Date, Text,
Boolean, Timestamp.
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "RegionType"
DATA 4
DateDATA END
COLUMN 2 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 4
EnumDATA END
COLUMN 2
DATA 0
DATA END
…
</OTResult>
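The consistency rule described above can be modeled as follows. The (mode, type) pair representation of partitions is an assumption for illustration.

```python
# Model the Search Federator's region type reconciliation: Retired
# partitions are excluded, then the remaining types must agree.

def merged_region_type(partition_types):
    """partition_types: list of (mode, region_type) pairs."""
    active = {t for mode, t in partition_types if mode != "Retired"}
    if len(active) == 1:
        return active.pop()
    return "inconsistent"

merged_region_type([("ReadWrite", "Date"), ("Retired", "Text")])
# → "Date" (the Retired partition's differing type is ignored)
```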
When the facets parameter is present, then the type definition of generated facets is
included in the response. Normally, the facet types are the same as the region types,
but the special handling of integers that represent file sizes is an exception, returning
the value ‘FileSize’.
get regions types facets
<OTResult>
…
ROW 98
COLUMN 0
DATA 12
OTObjectSizeDATA END
COLUMN 1
DATA 4
LongDATA END
COLUMN 2
DATA 8
FileSizeDATA END
COLUMN 3
DATA 0
DATA END
…
</OTResult>
A query is comprised of the following components:
• SELECT parameters
• FACETS parameters
• WHERE clauses
• ORDEREDBY parameters
Content Server users do not directly use OTSQL. The Content Server search query
language is known as LQL (historically, the Livelink Query Language). LQL is similar
to OTSQL in most respects, but provides some convenience operators and generally
uses different keywords. LQL in Content Server represents only the subset of OTSQL
that defines the WHERE clauses. Some of the differences between LQL and OTSQL
include:
LQL                 OTSQL
termset             termset
stemset             stemset
qlprox              prox
qlregion            region
qlleft-truncation   left-truncation
qlright-truncation  right-truncation
qlthesaurus         thesaurus
qlstem              stem
qlphonetic          phonetic
qlregex             regex
qlrange             range
qllike              like
in                  in
any                 any
text                text
” « » ‟ ″ “ „ ″     "
SELECT Syntax
The SELECT section is used to specify which regions in the index should be included
in the returned results. The more regions that are requested, the longer the ‘get results’
operations will take, but this does not impact the query time.
SELECT "region1","region2","region3"
To return all of the regions use the * keyword. For a Content Server installation, this is
not recommended, since there may be hundreds of regions. Requesting the minimum
necessary regions is suggested for optimal performance.
If you want to return information about the key/value attributes within text regions, you
can use the ATTRIBUTES modifier.
When attributes are requested, the response in the get results command is modified
to append the attribute information (see the “get results” description for more
information). The primary usage for requesting attributes is to identify language tags
attached to values in multi-language applications. The attributes modifier is applied to
all the regions specified in the select list.
The select statement can also be modified to request hit locations within the results.
FACETS Statement
The FACETS section specifies whether facets are desired, and if so, for which regions.
This is optional, with the default being no facets returned. Refer to the next major
section of this document entitled “Facets” for a complete description of the FACETS
statement.
Sample facet requests:
FACETS "regionX"[10],"regionY"
FACETS "OTCreateDate"[d100,m24]
The ‘get facets’ command is used to retrieve the results. See the commands section
for additional details.
WHERE Clause
The WHERE clause defines the rules by which an object satisfies the search query.
The basic form is:
where "red"
where "red riding hood"
where [region "name"] "red riding hood"
where [region "FileSize"] >= "1000" and [region "FileSize"]
< "10000"
WHERE Relationships
Each WHERE clause in a query is evaluated relative to other WHERE clauses by a
logical relationship, such as AND, OR and AND-NOT.
Relationships are evaluated from left to right. Brackets can be used to clarify and
modify the order of evaluation of clauses. For example, using single letters a through
d to represent entire clauses, a or b and c or d is evaluated left to right as
((a or b) and c) or d, while a or (b and (c or d)) forces the bracketed clauses to be
evaluated first.
WHERE Terms
The search terms in a WHERE clause should normally be enclosed in quotes.
Although there are some specific cases where the lack of quotes is tolerated, if you
are writing a query application, quotes are recommended in all cases.
The first form of a search term is the simple token. This is a value which is normally
expected to pass through the tokenizer and be recognized in its entirety as a single
token. All operators work on simple terms.
"hello"
"pottery123"
"3.1415926"
The second form is an exact phrase. Not all operators are compatible with phrases.
Phrases should normally only be used in string comparison operations.
"the quick brown fox"
"1334.8556/995-x"
You can also request that matches are only returned when the entire value is an exact
match for the phrase. For example, if there is a search region “ProjectName”, and
possible values are “Plan A” and “Plan A Extended”, searching for “Plan A” will match
both of these cases. Preceding the phrase with an equality operator ( = ) can
differentiate these, and match only the values that do not include the “Extended” term:
[region "ProjectName"] = "Plan A"
Finally, there is a special case for search terms, the * character (asterisk or star) or the
keyword all, with no quotation marks. This value is interpreted by the search engine
to match any object which has a value for the specified region. This will not match
objects if the region does not have a value defined for an object.
[region "name"] *
[region "name"] all
WHERE Operators
Each WHERE clause is comprised of a region specification, a comparison operation,
and a term. The region is optional, and if missing is assumed to be the default search
region list. The operation is optional, and if absent is assumed to match any token
within the region.
The following operators function with either simple tokens or phrases:
= Use of the equality operator will only match if the entire value is
identical to the term provided. “York” will not match “New York” but a
query for “New York” will.
!= Will match all values which exist and do not exactly match the term.
The next set of operators is available for use with integers, dates and text metadata
values. They are disabled by default for full text query, since comparison queries in
full text are generally misleading and perform very slowly, although this behavior can
be changed by setting AllowFullTextComparison=true in the [Link] file.
These operators also have special capabilities for Date regions described later.
< Will match all values which exist and are less
than the specified term. If a phrase is
provided, only the first term in the phrase is
used.
<= Will match all values which exist and are less
than or equal to the specified term. If a phrase
is provided, only the first term in the phrase is
used.
> Will match all values which exist and are
greater than the specified term. If a phrase is
provided, only the first term in the phrase is
used.
>= Will match all values which exist and are
greater than or equal to the specified term. If
a phrase is provided, only the first term in the
phrase is used.
A query that makes multiple comparisons against the same region, such as
[region "x"] >= "20150621" and [region "x"] < "20160101", is not efficient. To improve
performance, the query syntax parser will attempt to identify usage patterns where
multiple comparisons are made to a single region, and convert them to the more
efficient form of
[region "x"] range "20150621~20160101"
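The rewrite performed by the parser can be sketched as below. The tuple representation of clauses is an assumption for illustration; the parser operates on its own internal query tree.

```python
# Convert a pair of comparisons on the same region into one range clause,
# mirroring the optimization described above.

def rewrite_to_range(clause_a, clause_b):
    """Each clause is a (region, operator, value) tuple."""
    (r1, op1, v1), (r2, op2, v2) = clause_a, clause_b
    if r1 != r2:
        return None                      # different regions: no rewrite
    if op1 == ">=" and op2 == "<":
        return '[region "%s"] range "%s~%s"' % (r1, v1, v2)
    if op1 == "<" and op2 == ">=":
        return '[region "%s"] range "%s~%s"' % (r1, v2, v1)
    return None

rewrite_to_range(("x", ">=", "20150621"), ("x", "<", "20160101"))
# → '[region "x"] range "20150621~20160101"'
```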
The following operators are designed for use with single tokens, not phrases. Some
limited phrase support is available with some of the operators as noted in the
explanations.
range "start~to" Will match any value between the start term
and the end term, inclusive. Note that the start
term must be less than the end term.
range "value1|value2|value3" The range operator can be provided with a list
of terms or phrases. This is equivalent to
value1 OR value2 OR value3. This operator
matches any value in a region; it is not
restricted to matching entire values.
thesaurus Will match the exact term or synonyms for the
term using the currently defined thesaurus.
phonetic Will match phonetic equivalents for the term.
If applied to a phrase, phonetic matching for
each word in the phrase will be performed.
Refer to the Phonetic matching section for
more information.
regex Will interpret the term as a regular expression.
Values which satisfy the regular expression
match the term. Regular expressions apply
only to a single token. Regular expressions
are more fully described later.
stem Will match values that meet the stemming
rules. Refer to the Stemming section for more
information. If stemming is applied to a
phrase, then the last word in the phrase is
stemmed.
right-truncation Right truncation matches terms which begin
with the provided search term. The user
would typically consider this as term*. If used
with a phrase, then the last word in the phrase
is truncated.
left-truncation Left truncation matches terms which end with
the provided search term. The user would
typically consider this to be of the form *term.
This operator is valid only for single tokens.
like String matching optimized for part number and
file names. Only valid with “Likable” regions.
any (term,"search phrase") Match any term or phrase in the list. Unlike
the IN operator, partial matches within a
metadata region are acceptable. Equivalent to
(term SOR "search phrase").
in (term, "search phrase") Match any term or phrase in the list. Within a
region, only matches complete values.
Equivalent to (=term SOR ="search phrase").
not in (term, "search phrase") Excludes any objects containing the term or
phrase. For regions, equivalent to (and-not
[region "xx"] in (term,"search phrase")).
termset (N, term, term, "search
phrase") Matches objects where full text contains N or
more of terms and phrases. N% may also be
used.
stemset (N, term, term, "search
phrase") Matches objects where full text contains N or
more of the stems (singular/plural) of the terms
and phrases. N% may also be used.
text (something to search)
For large blocks of text, finds objects with
similar common terms. Check Advanced
Concepts section for more details.
span (distance, query)
Match query within distance number of terms. A single-letter parameter indicates
whether order needs to match: use a ‘t’ (true) or ‘f’ (false). With order required, a span
query for big and truck will match “big truck” or “big red truck” but not “truck is big”;
using f would also match “truck is big”.
The first parameter of the span operator is the maximum distance between terms that
will satisfy the query. These fragments would meet the distance of 4 requirement:
Mike smith
A smith named Michael
Michael Herbert James Smith
The span operator supports query fragments for any combination of AND, OR, and
nesting (brackets) for single search terms.
“space” and span(10, ((Yellow and sun) or (blue and moon))
and (earth or planet))
The span operator can be used with full text, but not with text metadata.
A span query is a relatively expensive operation and can be very expensive when used
with wildcards (left-truncation and right-truncation) or regular expressions. By default,
the engine is configured to disable support for these types of term expansions within
the span operator. If term expansion is enabled, the search engines will store
temporary working data on disk files during the evaluation of the span. Temporary files
are stored by each Search Engine in their corresponding index\tmp directory, and files
are named matchingWordsNNNNN and spanValuesNNNNN, where NNNNN is a
dynamically generated unique value. The temporary files are deleted when the query
completes, and also by the general purpose cleanup thread which runs from time to
time.
If abused, the span operator has the potential to require large amounts of disk space
and will take a long time to execute. There are a number of limits set by default in the
[Link] configuration file, which can be adjusted if more complex queries must be
run. When a limit is reached, the search will be terminated as unsuccessful. The limits
apply to a single partition (not the entire query for the entire index) and are located in
the [Dataflow_] section of the configuration file, with the defaults shown below.
SpanScanning=false
By default, use of term expansion (regex and wildcards) is not permitted with the span
operator. Set true to enable.
SpanMaxNumOfWords=20000
The upper limit on the number of terms that will be considered when wildcards and
regular expressions are expanded.
SpanMaxNumOfOffsets=1000000
Each term in the span expression may exist multiple times in documents. This file
stores the locations of the terms being evaluated. This is the upper limit for the number
of instances of matching terms.
SpanMaxTmpDirSizeInMB=1000
Limits the temporary disk space the partition can use for storing temporary data during
span operation evaluation.
SpanDiskModeSizeOfOr=30
The cost of executing a span is directly related to the number of “OR” operations in the
span query. This setting is an upper limit on the number of “OR” Boolean operators
that can be assessed.
WHERE Regions
A region is specified within square brackets with a region keyword, and enclosed in
quotation marks. The search term is likewise enclosed in quotation marks. There are
specific cases which are unambiguous and quotation marks are not required, but for
consistency your application should use quotation marks regularly. Region names are
case sensitive!
If the region portion of a WHERE clause is absent then the default search list is used
to determine the regions.
The following are examples of WHERE clauses using regions:
[region "OTNAME"] "cars"
[region "OTNAME"] all
[region "OTDate"] > "20100602"
[region "abc"] <= "string1"
Regions are grouped by OTSE into content and metadata regions, which are internally
represented by OTData and OTMeta. The representation of the “OTNAME” in the
example above is actually an abbreviated form of:
[region "OTMeta":"OTNAME"]
You can use OTMeta without a region name to examine all of the metadata regions.
However, this is relatively slow (depending on the number of regions) and in many
cases is not logical because of the different type definitions for regions.
You can also use OTMeta with some surrounding syntax to search within metadata
regions. For example, the clause:
[region "OTMeta"] "<someRegion>123 ABC</someRegion>"
Will find the exact value ‘123 ABC’ within the region “someRegion”. This is a much
slower way to locate the value, but there may be special cases where matching a
phrase anchored to the start or end of a region is needed.
You can specify searching in the full text using the OTData region:
[region "OTData"] "looking for this"
If you have indexed XML content, you can also search within specific XML regions of
the full text content using the XML structure, refer the section on indexing XML data for
more information.
The WHERE clause can also be used to set restrictions on attribute/value tags for text
metadata. For example, to restrict a search to looking at French language values of
the OTName field, you might use the syntax:
[region "OTName"][attribute "lang"="fr"] "voiture"
This presumes that “lang” is the attribute name, and “fr” is the value for that attribute.
Multiple attribute fields are possible, which effectively operates as a Boolean “and”,
requiring that both attributes must match:
[region "OTName"][attribute "lang"="fr"][attribute
"size"="med"] all
This syntax can be used to dynamically define the regions and their priority as part of
the query. However, this approach does not allow the value that matched the query to
be returned. If retrieving a priority value is necessary, then a synthetic region
declaration must be made in the [Link] file:
CHAIN GoodDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate OTCreateDate
A query can then be made using the pre-defined date, and the GoodDate field can also
be returned as a target of the SELECT:
[region "GoodDate"] < "-5y"
For those interested in trying to construct the equivalent query using standard Boolean
operators, an example is shown below. Note that using the ‘first’ feature is not only
more convenient, but the implementation is more efficient. Internally, a new operator
performs the necessary logic with fewer operations; it is not simply converted to this
Boolean equivalent:
[region "OTExternalCreateDate"] < "-5y" or ([region
"OTExternalCreateDate"] != all and ([region
"OTExternalModifyDate"] < "-5y" or ([region
"OTExternalModifyDate"] != all and ([region
"OTDocCreatedDate"] < "-5y" or ([region "OTDocCreatedDate"]
!= all and ([region "OTCreateDate"] < "-5y"))))))
The ‘first’ region method can be used with all region types and most operators.
However, search within a specific text metadata attribute value with the CHAIN / first
operator is not supported.
The min and max operators will skip assessment when an object lacks a value. For
example, if an object had only Attr2 defined in the example above, then it would
automatically be evaluated as the minimum value. If none of the regions has a value,
the object does not match.
Min and max region assessments work for all data types, although not all operations
are supported. Supported operations include comparisons against a value (<,=, >,
etc.), basic term and phrase matching, IN, ranges, etc. However, operators that
expand to multiple elements are not available, such as termset, stemset, thesaurus,
wildcards and regular expressions.
For multi-value TEXT metadata regions, the smallest value in a set of values for a
region will be used when assessing a minimum region, and the largest value will be
used when assessing a maximum region.
In addition to specifying ad-hoc minimum and maximum region evaluations in a query,
a synthetic region may be defined as a convenience using the [Link]
file:
MIN SmallAttr Attr1 Attr2 Attr3
MAX BigDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate
A predefined region has the additional property that the tested value can also be
returned in a SELECT statement. Note that no additional storage or indexes are
created; this region definition is a directive to the query constructor. Both the dynamic
and predefined approaches execute identically.
As a point of interest, it is usually possible to construct an equivalent query using
standard Boolean logic, although the min and max forms are computationally more
efficient. The equivalent query is quite complex, and varies depending on the nature
of the comparison (greater than, equal, less than) and whether a minimum or maximum
is required. Where multi-value text is present, there is no Boolean logic equivalent. As
one example,
[min created,modified,record,system] >= "20150403"
has a Boolean equivalent spanning many clauses, and is impractical to write by hand.
Similarly, the all region designation is a syntax shortcut for using the AND operator.
The convenience form:
[all "r4", "r5", "r6"] "sue"
is equivalent to:
[region "r4"] "sue" and [region "r5"] "sue" and [region "r6"] "sue"
Regular Expressions
OTSE supports the use of regular expressions for matching tokens. A regular
expression is a pattern of characters. In the OTSE query language, a term preceded
by the operator regex is interpreted as a regular expression. Patterns are defined
using the following rules:
+ The plus character matches the smallest preceding range one or more
times. For example,
"tr[eay]+ " will match words like try, tree, trey, treayaaa or country. It will
not match tr.
? The question mark character matches the smallest preceding range
exactly zero or one time. Reusing the previous example:
"tr[eay]? " will match try or pictr. However, it will not match tree.
| The vertical bar functions as an OR operation between patterns.
"go|stay" will match cargo or stay.
The range "[a-c]" could be represented as "a|b|c".
"^....s?$" Match five letter words that end with the letter s, or four
letter words.
"^en[a-z]+p[eaid]+$" Not sure how you spell encyclopedia? It starts with ‘en’,
has some letters, then a ‘p’, then some combination of e,
a, i and d. Mind you, this also matches envelope.
"(0?[1-9])|(1[0-2]):[0-5][0-9]" Find words that contain a string that might be a time in 12
hour format, such as 1:30, 03:26, 12:59.
"^s(ch)?m[iy](th|dt|tt)e?$" Match words like smith, smyth, Schmidt, smitte.
"^ope.+ext$" Matches the common user expectation of a wildcard in the
middle of a word: ope*ext.
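Several of these patterns can be checked with a conventional regular expression engine. The sketch below uses Python's re module; OTSE's regular expression dialect is its own, so behavior is only guaranteed to coincide for simple patterns like these.

```python
# Exercise two of the example patterns with Python's re module.
import re

name_pat = re.compile(r"^s(ch)?m[iy](th|dt|tt)e?$")
for word in ["smith", "smyth", "schmidt", "smitte"]:
    assert name_pat.match(word)

five_pat = re.compile(r"^....s?$")
assert five_pat.match("boats")        # five letters ending in s
assert five_pat.match("boat")         # four letters
assert not five_pat.match("anchors")  # too long
```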
The IN operator matches any of the listed terms or phrases:
in(superior, erie, "Lake of the Woods")
This expands to an equality match on each value (=superior SOR =erie SOR ="Lake
of the Woods"). Using the SOR operator ensures that multiple matches won’t rank the
result higher. Note the use of the = modifier; the IN operator will only match entire
values in metadata regions. The behavior in full text content is slightly different, in that
the entire value matching is no longer pertinent.
The TERMSET feature allows you to locate objects that have at least N matching
values from the provided list. For example, the clause:
termset(5,Water, river, lake, pond, stream, creek, rain,
rainfall, dam)
will match an object that contains 5 or more of the terms and phrases. This is a very
powerful construct for discovery and classification applications. There is no simple
equivalent representation. The example above could be expressed like…
SELECT ... WHERE
(stream AND pond AND lake AND river AND water) OR
(creek AND pond AND lake AND river AND water) OR
(creek AND stream AND lake AND river AND water) OR
(creek AND stream AND pond AND river AND water) OR
(creek AND stream AND pond AND lake AND water) OR
(creek AND stream AND pond AND lake AND river) OR
(rain AND pond AND lake AND river AND water) OR
(rain AND stream AND lake AND river AND water) OR …
Fully written out, this query is comprised of 126 lines with 629 operators. The
TERMSET operator is powerful, concise, and eliminates errors constructing complex
queries. The implementation of TERMSET and STEMSET is also internally optimized
for these cases. Queries may operate considerably faster with less memory using
TERMSET/STEMSET compared to executing the fully expanded equivalent queries
constructed of AND / OR terms.
The value of N can also be a percentage, meaning that it must match at least the
specified percentage of terms. 50% of 4 terms means that 2 or more matching terms
are needed. 51% means that 3 or more must match, since the percentage is a
minimum requirement. Using percentages is typically useful when there are longer
lists of candidate matching terms. These are equivalent:
Termset( 3, Water, river, lake, "duck pond", "stream")
Termset( 50%, Water, river, lake, "duck pond", "stream")
Negative values for N are interpreted to mean M-N as the threshold. For example, if
there are 10 terms, a value of -2 is equivalent to a value of 8 for N. It may be of interest
to note that at the endpoints for a list of N terms, TERMSET 1 is an effective OR, and
TERMSET N is an effective AND.
Termset (1, red, blue, green) red OR blue OR green
Termset (3, red, blue, green) red AND blue AND green
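The threshold rules for N (integers, percentages as a minimum, and negative values meaning M-N) can be sketched as:

```python
# Compute the effective TERMSET/STEMSET threshold for a list of M terms.
# Percentages are a minimum requirement, so fractional results round up;
# negative N means "all but |N|" of the terms.
import math

def termset_threshold(n, num_terms):
    if isinstance(n, str) and n.endswith("%"):
        return math.ceil(float(n[:-1]) / 100.0 * num_terms)
    if n < 0:
        return num_terms + n
    return n

termset_threshold("50%", 4)   # → 2
termset_threshold("51%", 4)   # → 3
termset_threshold(-2, 10)     # → 8
```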
The STEMSET operator is similar to TERMSET, except that it matches stems of the
values (that is, singular and plural variations).
stemset(5, Water, river, lake, pond, stream, creek, rain,
rainfall, dam)
Being singular/plural aware means that a document that had only the words:
Water, river, rivers, pond, ponds
will not match, since STEMSET considers the singular and plural forms of river and
pond to be the same term. This document therefore only has 3 matching terms, instead
of the desired 5. Essentially,
stemset(2,water,river,pond)
can be thought of as
((stem(water) and stem(river)) or (stem(water) and
stem(pond)) or (stem(river) and stem(pond)))
or, in a somewhat simplified form which doesn’t really cover all the variations of
stemming,
((water or waters) and (river or rivers)) or ((water or
waters) and (pond or ponds)) or ((river or rivers) and
(pond or ponds))
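The expansion shown above can be generated mechanically. This short Python sketch (illustrative only, not part of the product) makes clear why the TERMSET/STEMSET form is preferable: the equivalent Boolean query grows as C(n, k) clauses.

```python
from itertools import combinations

def expand_termset(k, terms):
    """Generate the fully expanded AND/OR equivalent of termset(k, terms).
    Illustrative only; it shows why TERMSET is preferable, since the
    expansion grows as C(n, k) clauses."""
    groups = [" and ".join(group) for group in combinations(terms, k)]
    return "(" + ") or (".join(groups) + ")"

print(expand_termset(2, ["stem(water)", "stem(river)", "stem(pond)"]))
# (stem(water) and stem(river)) or (stem(water) and stem(pond)) or (stem(river) and stem(pond))
```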
Unlike the IN operator, STEMSET and TERMSET are not constrained to matching only
full values in text metadata regions. The negation of these operators is possible using
NOT, and can be interpreted as follows:
(m or n) not termset(2,a,b,c)
(m or n) and-not (termset(2,a,b,c))
The TERMSET and STEMSET operators were first introduced in version 16.0.1 (June
2016).
ORDEREDBY
The ORDEREDBY portion of a query is optional. Its purpose is to give you control over
how the search results should be sorted (ranked) and returned in the get results
command. If omitted from the query, the result ranking is sorted by the relevance score
in descending order. This means that the most “relevant” results are returned first.
language, then the language with the smallest value is used, otherwise use the
standard “no attribute” sorting.
ORDEREDBY Existence
Rank the search results by the number of matching terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is not important, only the number of terms which exist in the document.
ORDEREDBY Rawcount
Rank the search results by the number of instances of terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is highly rated. The default scoring algorithm considers the number of times
a word appears, but it is only a modifier. Using Rawcount will make the number of
times words appear a major factor in the score.
ORDEREDBY Score[N]
Rank the search results using a combination of the ranking computation (global
settings) and boost values specified as parameters in the query. Refer to the
Relevance section of this document for details.
Performance Considerations for Sort Order
In some cases, the sorting requested for results can be a factor in search performance.
Sorting is performed in the search engines, and each search engine requires
temporary memory allocation and time to perform the sorting. For both time and
memory, the key variables are the type of sort, and the cursor position of the requested
results.
Orderedby Nothing is the fastest performer, and uses the least memory – since it skips
the sorting step entirely. If your application needs to gather all the results from a query,
the use of Nothing as the sort order is strongly recommended, especially if you are
dealing with large data sets. Sorting and retrieving 1 million results may require on the
order of 100 Mbytes of temporary memory. Sorting by Nothing will avoid this penalty.
Sorting by primitive data types such as floats (relevance), integers, or dates is the next
best performing configuration. Roughly speaking, primitive types require about 4
Mbytes of RAM for each 100,000 results the cursor is advanced.
Sorting by string values is slower and uses more memory. The performance impact
may become material when moving the cursor past about 20,000 results. The memory requirement
varies depending on the lengths of the strings, but typically runs about 15 Mbytes of
temporary memory per 100,000 results the cursor is advanced.
Sorting on multiple fields is slower, and uses more memory. The performance
penalties are difficult to predict, since they depend on the numbers and types of sorts.
The order also matters – a sort on a number first, then on a string uses about 8 Mbytes
per 100,000 results the cursor advances. Reversing it to sort on the string first, then a
number, would use more memory than just a string sort.
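The rule-of-thumb figures above can be combined into a rough estimator. The sketch below simply encodes the per-100,000-result figures quoted in this section; treat it as an illustration of the scaling, not a guarantee of actual memory use:

```python
def sort_memory_mb(cursor_position, sort_type):
    """Rough per-search-engine sorting memory estimate, using the
    rule-of-thumb figures quoted above (illustrative only):
    ~4 MB per 100,000 results for primitive sorts (floats/ints/dates),
    ~15 MB per 100,000 results for string sorts, and nothing at all
    when sorting is skipped with Orderedby Nothing."""
    mb_per_100k = {"nothing": 0, "primitive": 4, "string": 15}
    return mb_per_100k[sort_type] * cursor_position / 100_000

print(sort_memory_mb(1_000_000, "primitive"))  # 40.0
print(sort_memory_mb(100_000, "string"))       # 15.0
```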
What does it mean when we talk about advancing the cursor position? Regardless of
how many search results there are, if you are only retrieving the first few hundred, the
sort time and memory required will be low. However, if you want sorted results
numbers 99,900 to 100,000 – then the cursor must be advanced to at least position
100,000. The search engines must sort at least that number of results, requiring
significant resources. When asking for results 1 to 100, the search engines can
optimize their sorting implementation to focus on ensuring that just the minimum set of
values is properly sorted.
The memory resources required for sorting are per search engine, per concurrent
search query. If you want to support up to 10 concurrent queries, each asking for
100,000 results, then each search engine may need over 150 Mbytes of working space
available. In normal types of applications this pattern is rarely observed, and in practice
most applications use relatively small amounts of memory to retrieve less than 10,000
results from a few concurrent queries.
Text Locale Sensitivity
When ordering results by a text region, locale-sensitive sorting is used by default. As
a result, sorting can differ somewhat depending upon the locale. Locale-sensitive
collation generally groups accented characters near their unaccented equivalents.
Depending on the locale, multiple characters may be considered as a single logical
character, and some punctuation may be ignored.
The locale for a system is determined from the operating system by Java, and uses
the Java system properties user.language, user.country and user.variant. For
debugging, these values are logged during startup. In Java, the locale can explicitly be
set to override system defaults with command line parameters. For example:
java -Duser.country=CA -Duser.language=fr …
Locale sensitive sorting was first added in 20.4, and can be disabled in the [Dataflow_]
section of the [Link] file by requesting the older behavior:
OrderedbyRegionOld=true
Facets
Purpose of Facets
Facets allow metadata statistics about a search query to be retrieved. For example, if
facets are built for the region “Author”, and there were 300 results, facets might supply
the following information from the “Author” region:
Mike 121
Alexandra 72
David 32
Michelle 21
Stephen 19
Alex 11
Paul 6
The interpretation would be that of the 300 results, 121 of them had the value “Mike”
in the “Author” region, 72 had the value “Alexandra”, and so forth. As an application
developer, you can present this information to the user to help them understand more
about their search results. It is also common to allow the user to “drill down” into the
results based on facets. For example, the user might determine they only want results
authored by Alexandra. They select Alexandra, which re-issues the same search, this
time with an additional clause in the query along the lines of AND [region
"Author"] "Alexandra" (require “Alexandra” in the region “Author”).
Requesting Facets
OTSE generates facet results when requested within the search queries. There are
no special configuration settings necessary to use facets, although optimization by
protecting commonly required facets may be a good idea. To request facets, in the
‘SELECT’ portion of the query, you add text along these lines:
SELECT "OTObject","OTSummary" FACETS "Author","CreationDate" WHERE …
OTSE would then generate facets for two regions: Author and CreationDate. There is
no defined limit to the number of facets that can be requested for a query, but memory
or performance limitations will become a factor for large numbers of facets. The design
optimizations selected for OTSE are based on expectations of 100 or fewer distinct
facets in use at any time.
Once the query completes, you retrieve the results from the search engines with the
command:
GET FACETS
The output from the GET FACETS command is described in more detail in the Query
Interface section.
Like the search results, the facets for the query are retained until the query is
terminated or times out. Except for date facets, the values are returned sorted from
highest frequency to lowest frequency.
When facet values are returned, there are a couple of additional values provided. The
number of facet values identifies the total number of facet values found. The returned
count is the number of facet values actually returned, which is usually smaller. There
is also an overflow indicator, which identifies whether the number of facet values
exceeded the configurable limit – meaning that the facet results are not exact since
they are incomplete.
In most applications, a user is not interested in reviewing thousands of possible
metadata values in a facet. Usually, only the most common values are of interest. The
facets implementation allows you to place a limit on the number of values for each
facet you want to see. Using syntax such as:
SELECT "OTObject" FACETS "Author"[5], "DocType"[15]
This would return only the 5 highest frequency values in the field “Author” and the 15
highest frequency values in the field “DocType”. By default, the first 20 values are
returned. This default can be overridden by a configuration setting. You are strongly
advised to limit the number of values returned, especially with facets that may contain
arbitrary values, since they can potentially contain millions of values which would
significantly impact search performance.
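The facet output described above (most-common values first, a total count, a returned count, and an overflow indicator) can be modelled concisely. The following Python sketch is an illustrative model of that behavior, not the engine's code:

```python
from collections import Counter

def facet_values(values, limit=20, max_distinct=32767):
    """Illustrative model of facet output (not the engine's code): counts
    per value, sorted highest frequency first, truncated to `limit`, with
    the total, returned count, and overflow indicator described above."""
    counts = Counter(values)
    top = counts.most_common(limit)
    return {"total_values": len(counts),
            "returned": len(top),
            "overflow": len(counts) > max_distinct,
            "values": top}

result = facet_values(["Mike"] * 3 + ["Alexandra"] * 2 + ["Paul"], limit=2)
print(result["values"])  # [('Mike', 3), ('Alexandra', 2)]
print(result["total_values"], result["returned"])  # 3 2
```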
Facet Caching
Facets data structures are built on demand. Once created for a given facet, the
structure is retained in memory so that subsequent queries using the facet are very
fast. In order to keep memory use constrained, there is a maximum number of facets
that the search engine will retain. If a query requests new facets that are not in memory
and the maximum number of facets is exceeded, then the search engine will delete the
facet structure that has not been used for the longest time. The default is to retain up
to 25 facet structures in memory. There is a 10 minute “safety margin” – meaning that
even if the limit of 25 facets is exceeded, a facet that was used in the last 10 minutes
will not be deleted. A facet that is currently included in a query also cannot be
deleted. The limit is therefore a guideline rather than an absolute maximum.
If your applications use more than 25 facets regularly, then search query performance
may suffer as facet data structures are regularly created and deleted. You can adjust
the number of facets to retain in memory in the [Dataflow_] section of the [Link]
file:
MaximumNumberOfCachedFacets=25
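The recycling policy just described – least recently used eviction, tempered by a safety window and protection of in-use facets – can be sketched as follows. This is an illustrative approximation of the policy, not the engine's actual implementation:

```python
import time

class FacetCache:
    """Sketch of the facet recycling policy described above (an
    illustrative approximation, not the engine's implementation): keep up
    to `maximum` facet structures, and never evict one that is protected,
    currently in use, or touched within the `safety` window (seconds)."""

    def __init__(self, maximum=25, safety=600):
        self.maximum, self.safety = maximum, safety
        self.entries = {}  # region name -> (last_used, protected)

    def touch(self, region, protected=False):
        """Record use of a facet, then recycle if over the limit."""
        self.entries[region] = (time.time(), protected)
        self._recycle(current=region)

    def _recycle(self, current):
        while len(self.entries) > self.maximum:
            now = time.time()
            candidates = [(t, r) for r, (t, p) in self.entries.items()
                          if not p and r != current and now - t > self.safety]
            if not candidates:
                break  # the limit is a guideline, not an absolute maximum
            _, oldest = min(candidates)
            del self.entries[oldest]

# With a tiny limit and no safety window, the least recently used
# unprotected facet is recycled first.
cache = FacetCache(maximum=2, safety=-1)
for region in ("Author", "DocType", "CreateDate"):
    cache.touch(region)
print(sorted(cache.entries))  # ['CreateDate', 'DocType']
```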
Date Facets
Date facets represent a special case, which has been constructed specifically to
address a very common and important requirement, namely presenting facets that
represent the “recentness” of an object in the index. Date facets are not designed to
handle arbitrary dates or future dates.
If facets are requested for regions of type DATE, special handling occurs. Each day
within the supported time range is counted multiple times – as a day, within a week,
within a month, within a calendar quarter, and within a calendar year.
Date facets are not sorted by frequency. Instead they are ordered by recentness. If
you have requested facets for 8 months, you will always get the most recent 8 months
returned. When constructing a query for date facets, the syntax within the SELECT
statement is:
… FACETS "CreateDate"[d30,w0,m12,q0,y10] …
The facet counts are optionally specified as a letter followed by the number of facet
values desired, where:
d – number of days, including today
w – number of weeks starting on Sunday, including today
m – number of months, including the current month
q – number of calendar quarters (Jan, Apr, Jul, Oct), including the current quarter
y – number of calendar years, including the current year
The example above would request the last 30 days, the last 12 months, the last 10
years, and no facets for weeks or quarters. To obtain no values for a category, specify
zero. Omitting the category will result in the default number of values being returned.
If the count for a value is zero, then no facet value will be returned.
The default number of date values to be returned is defined in the [Link] file. In
the [DataFlow_] section:
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
The values returned for date facets are formatted to easily identify their type and date
range.
Days: d20120126 (dYYYYMMDD) 26 Jan 2012
Weeks: w20120108 (wYYYYMMDD) week starting 8 Jan 2012
Months: m201202 (mYYYYMM) Feb 2012
Quarters: q201204 (qYYYYMM) quarter starting Apr 2012
Years: y2012 (yYYYY) year 2012
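The day, month, and year label conventions above can be reproduced with simple date arithmetic. The sketch below is an illustration of the label formats only (weeks and quarters are omitted for brevity, and it is not the engine's code):

```python
from datetime import date, timedelta

def date_facet_labels(today, days=3, months=2, years=1):
    """Format date-facet bucket labels in the conventions shown above
    (dYYYYMMDD, mYYYYMM, yYYYY). Weeks and quarters are omitted for
    brevity; illustrative only."""
    labels = [f"d{(today - timedelta(days=i)):%Y%m%d}" for i in range(days)]
    y, m = today.year, today.month
    for _ in range(months):
        labels.append(f"m{y}{m:02d}")
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    labels += [f"y{today.year - i}" for i in range(years)]
    return labels

print(date_facet_labels(date(2012, 2, 15), days=2, months=2, years=1))
# ['d20120215', 'd20120214', 'm201202', 'm201201', 'y2012']
```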
Date facets can only be built for dates where the day is within range of the
maximum number of facet values, per the settings described later. The default is
32767, or about 90 years.
FileSize Facets
Integer regions may be marked in the [Link] file to have their facets presented as
FileSize facets. This mode groups file sizes into a set of about 30 pre-defined ranges.
This mode ignores the number of facet values requested, and always returns a fixed number of
facet values representing the buckets (or ranges). Details of these facet values are
described in the get facets command section.
For applications in which the security requirements are high, you must ensure that
facets which contain sensitive information are not made available to users without
suitable clearance. In many cases, it is considered acceptable to display facets which
do not contain sensitive data, such as file sizes, object types, or dates. It might also
be possible to achieve acceptable security by reducing the exactness of the object
counts – displaying a more generic frequency count (eg: 1 to 4 “bars”, or labels such
as “many” or “few”) instead of the precise counts from the search engine.
Ultimately, you will need to choose an appropriate tradeoff between the user
convenience and improved search experience that facets provide, versus the risk that
a user might glean sensitive information from facet values.
The maximum number of values per facet sets the upper limit on how many distinct
facet values are possible. This limitation is present as a failsafe from abuse, and
presumes the typical facet application is intended for much smaller data sets.
Increasing this value will increase the amount of memory required to store facet
information. Because the internal data structures use bit-fields, the optimal setting for
this value is 1 less than a power of 2 (eg: 2**N – 1). It should be noted that
multi-value text fields consume a facet value for every combination of text values contained
in the field. For example, if the region “Colors” can contain combinations of “red”,
“blue”, “green” and “black”, then 15 combinations are possible and 15 of the facet
values could potentially be used. If you expect to create facets for regions that may
have many combinations (such as email distribution lists) then this number may need
to be very large, and you may be limited by usable memory.
MaximumNumberOfValuesPerFacet=32767
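The combination count for a multi-value field follows directly from the description above: n possible values yield 2**n − 1 distinct non-empty combinations. A one-line Python illustration:

```python
def multivalue_facet_combinations(num_values):
    """Distinct non-empty combinations a multi-value text field can
    consume as facet values: 2**n - 1."""
    return 2 ** num_values - 1

print(multivalue_facet_combinations(4))   # 15, as in the "Colors" example
print(multivalue_facet_combinations(20))  # 1048575 -- why regions such as
                                          # distribution lists need care
```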
The number of desired facets is the default number for the “most common N” facet
values to be returned if the number of desired facets is not specified in the query. This
ini setting does not affect the special return values for Date type facets.
NumberOfDesiredFacetValues=20
You are unlikely to need to resort to these calculations. In practice, for a 1 GB partition
in Low Memory mode (3 to 5 million objects) with 10 to 15 typical facets in use, memory
consumption by facets is usually less than 40 MB. Content Server uses this guideline
for its default setting.
To re-iterate: facet memory allocation is NOT an explicit setting. Simply increase the
Java heap size available on the command line. Content Server 16 exposes this
incremental memory allocation for search engines in its search admin pages.
For queries that return many millions of results, average times would be closer to
1 second. As more facets are requested for a query, these times are additive.
Experience has been that facet computation is not a material consideration for
performance in most scenarios.
Conversely, initial generation of facet data structures can be relatively expensive. Each
potential metadata value must be examined, and a new facet value created or the data
structures updated if it already exists. The time to perform this task varies widely based
on the number of items in the partition, the data type, the number of possible unique
values, and for text metadata – whether the values are stored in memory or on disk.
For example, if there is an enumerated data type with less than 100 possible values in
a partition containing just 1 million items, generation of the facet data structures is likely
less than 1 second.
At the other extreme, generating facet data structures on a text region that has high
cardinality (e.g. 200,000 possible values, such as a folder location or keywords/hot
phrases), in a large partition containing 10 million items that is configured for storage
on disk will take considerably longer, potentially many minutes.
For larger systems in particular, limiting the use of facets for regions with high
cardinality may be necessary to meet performance objectives.
Protected Facets
As noted above, the time required to generate facet data structures can be material. In
addition to building search facets on demand, it is possible to specify facets that are
known to be commonly used. On startup, the data structures for these facets will be
built if they are not in the Checkpoint; they are excluded from facet recycling (never
destroyed); and they are optionally saved in the Checkpoint file for faster loading on
next startup. Content Server uses this feature. To build protected facets at startup, in
the [Link] file, specify the regions in the [Dataflow_] section:
PrecomputeFacetsCSL=region1,region2,region3
As an option, the protected facets may be stored in the Checkpoint file. This also
means a copy of the facet data is maintained in the Index Engines, which requires
additional memory. To enable persisting facets in the Checkpoints, in the [Dataflow_]
section of the [Link] file add:
PersistFacetDataStructure=true
When specifying protected regions, you should also ensure that the desired number of
cached facets is greater than or equal to the number of protected facets specified in
this list. The desired number represents the point at which the search engine will begin
recycling non-protected facets to make room for new facets requested in queries. In
addition, the maximum number of facets should be higher still. The maximum number
of facets is the limit, which may be higher than the desired number if there are many
facets requested in a single query. Beyond this maximum number, the facet requests
are discarded.
DesiredNumberOfCachedFacets=16
MaximumNumberOfCachedFacets=25
Search Agents
Search Agents are stored queries that are tested against new and changed objects as
part of the indexing process. The two most common uses of Search Agents are to stay
up to date on topics of interest, and for assigning classifications.
The monitoring case is illustrated by the Content Server concept of Prospectors.
Consider a situation where you want to know everything about a particular customer.
You construct a query to match the name of the customer or a few of the known key
contacts at that customer. By adding this as a Prospector, you are notified any time
new data is indexed that matches this query.
For classification, you construct a set of queries that define a specific classification
profile. For example, if all customer service requests use a form that contains the text
“customer support ticket”, then this query is attached to the classification agent, and
any object containing this phrase is marked with the classification. By using many
queries, you can build a complete set of classification categories. One object may
match several possible queries, and be tagged with multiple classifications this way. In
Content Server, this is known as Intelligent Classification.
In operation, the queries to be tested against new data are contained in a file. Matches
to the search agent queries are placed in iPools which are monitored by the parent
application, typically Content Server.
If the interval is set to a value of -1, the agent execution will pause. There is no loss of
activity – when the interval is restored to a positive value, the agent queries will include
all objects that were indexed while paused. Pausing may be desirable if there is a
temporary need to maximize indexing performance.
The Update Distributor keeps track of the agent execution in files that are stored in a
subdirectory of the search index:
index/enterprise/controls
The files are named upDist.N and contain the timestamp for each of the last agent
runs, expressed in milliseconds since Jan 1, 1970 UTC (the start of Unix epoch
time). Sample file below.
UpDistVersion 1
SearchAgentBaseTimestamp 1571261889130 "MySA0"
SearchAgentBaseTimestamp 1571261889130 "MySA1"
EndOfUpDistState
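Given the file format shown in the sample, the agent timestamps are easy to extract and convert from epoch milliseconds. The following Python sketch is an illustrative parser for the sample layout, not a supported tool:

```python
from datetime import datetime, timezone

def parse_updist(text):
    """Extract agent timestamps from an upDist.N control file, using the
    layout shown in the sample above (illustrative parser only). Values
    are epoch milliseconds, converted here to UTC datetimes."""
    stamps = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "SearchAgentBaseTimestamp":
            ms, name = int(parts[1]), parts[2].strip('"')
            stamps[name] = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return stamps

sample = """UpDistVersion 1
SearchAgentBaseTimestamp 1571261889130 "MySA0"
EndOfUpDistState"""
print(parse_updist(sample)["MySA0"].year)  # 2019
```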
The timestamp field used by default is the OTObjectUpdateTime. The field can be
changed, but there are currently no known scenarios where the default value should
not be used.
[Dataflow_xxx]
AgentTimestampField=OTObjectUpdateTime
When using interval agent execution, the Update Distributor timing summaries will
include the time spent running agent queries, identified with the label SAgents.
[SearchAgent_agent1]
operation=OTProspector
readArea=d:\\locationpath
readIpool=334
queryFile=d:\\someDirectory\[Link]
The readArea and readIpool parameters specify the file path and directory name where
iPools with results from the Search Agent should be written. These are then consumed
by the controlling application.
The queryFile contains the search queries to be applied during indexing. You can have
many search queries within each queryFile.
The operation can be one of OTProspector or OTClassify. This value does not change
the operation of the search agents, but is recorded in the output iPools, and is used to
help the application (typically Content Server) determine how the iPool should be
processed.
<Object>
<Entry>
<Key>OPERATION</Key>
<Value>
<Size>10</Size>
<Raw>OTClassify</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>297</Size>
<Raw>
<SYNC1>2959</SYNC1>
<Q1N0>OTObject</Q1N0>
<Q1R0C0>
<OTObject>DataId=16412&Version=1</OTObject>
</Q1R0C0>
<Q1N1>OTScore</Q1N1>
<Q1R0C1>93</Q1R0C1>
<Q1R1C0>
<OTObject>DataId=16389&Version=1</OTObject>
</Q1R1C0>
<Q1R1C1>39</Q1R1C1>
<Q1R2C0>
<OTObject>DataId=16390&Version=1</OTObject>
</Q1R2C0>
<Q1R2C1>29</Q1R2C1>
</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>2178</Size>
<Raw>
<SYNC2>3276</SYNC2>
<Q2N0>OTObject</Q2N0>
<Q2R0C0>
<OTObject>DataId=16388&Version=0</OTObject>
</Q2R0C0>
<Q2N1>OTScore</Q2N1>
<Q2R0C1>71</Q2R0C1>
<Q2R1C0>
<OTObject>DataId=16398&Version=0</OTObject>
</Q2R1C0>
<Q2R1C1>71</Q2R1C1>
<Q2R2C0>
<OTObject>DataId=16409&Version=0</OTObject>
The Search Agent type, in this case OTClassify, is the first entry in the IPool. This
value is drawn from the [Link] file in the Search Agent configuration setting.
The search results themselves are presented with a naming convention that reflects a
QUERY, ROW, COLUMN numbering convention. For instance, the value <Q2R0C1>
is used for Query 2, Row 0 (the first result), Column 1 (the second region in the select
clause). Likewise, the value <Q1N0> is used to label the Name of Column 1 for Query
1 (in this case “OTObject”). Note that the names of the regions are only provided in
the first row for a given query.
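The QUERY/ROW/COLUMN naming convention lends itself to a simple decoder. This Python sketch parses the tag names into column names and result cells; it is an illustration of the convention, not part of the product:

```python
import re

def parse_agent_results(raw):
    """Decode the Q<q>R<r>C<c> / Q<q>N<c> naming convention from the
    Search Agent iPool metadata (an illustrative parser, not part of
    the product)."""
    pattern = re.compile(r"<(Q(\d+)(N|R)(\d+)(?:C(\d+))?)>(.*?)</\1>", re.S)
    names, cells = {}, {}
    for _, q, kind, a, c, value in pattern.findall(raw):
        if kind == "N":                      # QqNc: name of column c
            names[(int(q), int(a))] = value.strip()
        else:                                # QqRrCc: row r, column c
            cells[(int(q), int(a), int(c))] = value.strip()
    return names, cells

names, cells = parse_agent_results("<Q1N0>OTObject</Q1N0><Q1R0C1>93</Q1R0C1>")
print(names[(1, 0)], cells[(1, 0, 1)])  # OTObject 93
```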
Performance Considerations
Search Agents are not free. Although the Agents are only applied to newly added
objects, the frequency, complexity and number of queries run as agents can have a
noticeable impact on indexing performance. For applications with high indexing rates,
Search Agents may not be an appropriate feature.
If you require these types of features for high indexing volumes, you can consider
implementing your solution using standard search queries, serviced by the Search
Engines. By enabling the TIMESTAMP feature for objects, the exact indexing time of
objects can be determined, and a pure search application can provide similar features,
running on a scheduled interval.
Relevance Computation
Relevance is a measurement of how well actual search results meet the user
expectations for search result ranking. Relevance is a subjective area, based upon
user judgments and perception, and often requires experimentation and tuning to
optimize. This is one of the fundamental challenges with relevance tuning: if you
improve relevance for one type of user, you may well be reducing relevance for other
users who have different expectations.
Relevance is a method for determining how close to the top of the list a search result
should be placed. However, relevance has NO IMPACT on whether an object actually
satisfies a query. If a query matches 100,000 items, tuning relevance only affects the
ordering of the items, not which items are matched.
Search relevance is not entirely the responsibility of the search engine. Relevance
scoring is a function of many parameters, most of which are provided by the
application, such as Content Server. Tuning Content Server is also required to
optimize search relevance, but this document will focus more on the OTSE
contributions to relevance.
For typical users trying to find objects, relevance is an important consideration, and the
search results are usually presented sorted by the relevance score. However,
relevance is not a consideration for certain types of applications. For example, Legal
Discovery search is concerned with locating all objects, but does not care about the
order of presentation. Likewise, when using search to browse, results are often sorted
by date or object name.
Components of Relevance
There are two different types of computations that are applied to objects in the index
to determine their relevance. The first is “ranking”, which is a computation applied in
the same way on every search query. Ranking typically adjusts relevance by giving
higher weights to recently created objects, office documents, or known important
locations. Before Search Engine 16, Ranking was the only available relevance scoring
method, and ranking and relevance were often used interchangeably.
Beginning with Search Engine 16, a second type of relevance computation is available,
known as “boost”. Unlike Ranking, the Boosting parameters are dynamic, and are
provided on each query. This permits the application to add relevance adjustments
based on context, such as the user identity or current folder location.
The remainder of this section will cover the Ranking capabilities, with Boost features
detailed later. You can mix and match both Ranking and Boost, although each
additional relevance feature slightly increases the overall search query time.
In most cases, the ranking configuration is comprised of weights and regions. The
weights indicate how important the parameter is in scoring. Note that these weight
values are relative. Setting all the weights high is the same as setting all the weights
to a medium value. The difference in weights is ultimately what matters.
Some of the explanations below contain simplified versions of the equations used to
compute the scores. They are simplified to the extent that a number of additional
computations are performed to adjust the results from each computation to a
normalized range. The equations presented here are only intended to clarify the
impact that adjustments to the parameters make on the ranking computations.
Note: result scoring has been improved with OTSE 21.2. The relevance computation
no longer includes adjustments for a number of search clauses that have little meaning
for relevance, such as Boolean operations, and “not” clauses (not in(), not termset(),
not stemset(), etc.). Synonym-Or (“SOR”) has changed to score matching either/both
terms as a single value.
Date Ranking
The date an object was created or updated is typically an important aspect of
relevance, especially for a dynamic or social application. In these cases, users tend
to favor objects that are recent. Applications such as archival on the other hand
typically do not care about recentness, and different settings might be appropriate.
The date ranking parameter allows you to identify metadata regions which contain date
values that reflect the recentness of an object, and configure their scoring parameters.
Date ranking is computed using a decay rate from the current date. The decay rate is
one of the configurable values. Small values for decay rates will reduce the score of
older items more rapidly. A simplified approximation of the algorithm is:
Date Relevance = decay / (recentness + decay)
In practice, a very aggressive value that strongly favors recent objects would be a
decay rate of 20 days.
(Table: contribution to ranking by decay rate and age in days; higher values
represent higher ranking.)
Clearly, small values of decay rates generate small ranking contributions for older
items. Remember that the date ranking value is only one component of the ranking
score, and you also control the weight to be applied to this computed value.
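The simplified approximation above can be tabulated directly to see how the decay rate shapes the contribution. The sketch below evaluates decay / (recentness + decay) for an aggressive and a moderate decay rate; values are representative only:

```python
def date_relevance(age_in_days, decay):
    """The simplified date-ranking approximation given above:
    decay / (recentness + decay)."""
    return decay / (age_in_days + decay)

# Contribution to ranking for an aggressive (20-day) and a more
# moderate (45-day) decay rate, at several object ages.
for age in (0, 30, 90, 365):
    print(age, round(date_relevance(age, 20), 2), round(date_relevance(age, 45), 2))
```

Small decay rates drive the contribution of older items toward zero quickly, which matches the guidance that aggressive values strongly favor recent objects.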
The syntax for the date ranking configuration in the [Link] file is:
DateFieldRankers="dateRegion",decay,weight
For example, the following would use the last modified date on an object to compute
date ranking, with a moderately aggressive decay of 45 – but then make the overall
contribution of date to the ranking score small by giving it a weight of 2:
DateFieldRankers="OTModifiedDate",45,2
The date scoring algorithm supports multiple elements. For example, if you had two
different metadata regions that commonly contain important dates that reflect object
recentness, you can specify both, and each is independently computed and added to
the overall ranking score:
DateFieldRankers="OTCreateDate",45,50;"OTVerCDate",30,30
The DateFieldRankers setting is recorded in the [Link] file, and Content Server
exposes this configuration setting in the search administration pages.
Relative frequency
The relative ratio of matched search terms to the overall content size is a factor. The
higher this ratio, the higher the relevance. An obvious example… assume you search
for “combustible”. If document ROMEO has the word combustible 30 times in 1000
words (3%) and document JULIETTE has 50 instances of combustible in 2000 words
(2.5%), then document ROMEO will be ranked higher.
Frequency
The more often the search terms occur in the text for an object, the higher the ranking
score.
Commonality
The more common a search term is in the dictionary for this partition, the less weight
it is given in computing the text score. For example, with typical English language
data, if you search for keywords “the” AND “scooter” – the value given to matches for
“scooter” will be considerably higher than matches for “the”, since “the” is overly
common.
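The three text-scoring factors just described (frequency, relative frequency, and commonality) can be illustrated with a toy formula. This sketch is only a teaching aid; the actual OTSE computation is more elaborate and normalized differently:

```python
import math

def term_score(occurrences, doc_length, doc_freq, num_docs):
    """Toy illustration of the text-ranking factors described above:
    relative frequency of the term in the document, damped by how common
    the term is across the partition (an IDF-style factor). The actual
    OTSE computation is more elaborate and normalized differently."""
    relative_frequency = occurrences / doc_length
    commonality_damping = math.log((num_docs + 1) / (doc_freq + 1))
    return relative_frequency * commonality_damping

# The ROMEO/JULIETTE example: 30 hits in 1000 words (3%) outranks
# 50 hits in 2000 words (2.5%) for the same term.
romeo = term_score(30, 1000, 5000, 1_000_000)
juliette = term_score(50, 2000, 5000, 1_000_000)
print(romeo > juliette)  # True
```

In the same model, a very common term such as “the” receives a small damping factor, so matches for it contribute far less than matches for a rare term such as “scooter”.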
The full text search ranking algorithm is applied to the indexed content, plus any
metadata regions defined in the default search list. The relative weight of the full text
search is also configurable. Both values are specified in the [Link] file.
The default region search list is defined in the search INI file as:
DefaultMetadataFieldNamesCSL="OTName,OTDComment"
ExpressionWeight=100
Content Server exposes the list of default regions to search in the administration pages
for search, and the values are stored in the [Link] file. Remember to ensure that
any metadata text regions given an adjusted score are included in this default region
search list.
Object Ranking
The search ranking algorithm also allows external applications to provide ranking hints
for objects. In a defined metadata field, the application can provide a numeric ranking
score – an integer between 0 and 100. The search ranking algorithm can incorporate
this ranking value into the overall rank. You have the ability to set a ranking value for
each object, define the field to be used for object ranking, and assign an overall weight
to Object Ranking relative to other elements of the ranking algorithm. If there is no
Object Ranking value for an object, it gets a ranking adjustment of zero.
The Object Ranking settings are kept in the [Link] file. In the example below,
OTObjectScore is the metadata region that contains the ranking value, and 80 is the
relative weight attached to the Object Ranking component of the ranking calculation.
ObjectRankRanker="OTObjectScore",80
If you are developing applications around search, using the Object Ranking feature
can improve the overall user experience. Some of the common events used to modify
the ranking include tracking objects that are popular for download, objects placed in
particular “important” folders, how frequently objects are bookmarked, or other
situations which are appropriate to the application. As a developer, you also need to
remember to degrade the object ranking over time – an object which is important now
may well lose its relevance later.
One other observation for developers setting Object Ranking values: as described
elsewhere in this document, OTSE supports indexing select metadata regions for
objects. You do not need to re-index the entire object in order to set the Object Rank
value; using the ModifyByQuery indexing operation is usually a good choice. Re-
indexing the entire object each time a ranking value changes would likely have a
material negative impact on overall system performance – both on the application and
OTSE.
Within Content Server, Object Ranking is leveraged by the Recommender module.
The ranking algorithm combines these elements into an interim score in the range of
0 to 100. Boost operations are applied later, and modify the ranking score to generate
the final relevance score.
Relevance boosting is specified in the ORDEREDBY section of the search query:
SELECT … WHERE … ORDEREDBY SCORE[N] boost parameters
SCORE[N] identifies that boost adjusting is desired. N is a multiplier (in percent) of the
relevance computed in the ranking algorithm. Normally, N of 100 would be
recommended, which means that the ranking values are used without modification. If
N was 80, then the ranking values would be multiplied by 0.8 before final adjustments
from boosting. Setting the value of N to 0 would cause the ranking component of
relevance to be ignored (treated as 0).
There are three types of boost operations that may be applied: text, integer and date.
Boosting may allow the score to rise above 100, but never below 0.
Query Boost
This boost method is used to adjust the relevance based on whether an object matches
query clauses. For illustration, consider the following example…
SELECT "OTObject" where "animal" ORDEREDBY Score[100] "dog"
BOOST[-10] "cat" BOOST[+15] ("t-rex" and "evolution")
BOOST[+%40]
The query will match items containing the text “animal”. However, we are less
interested in objects that also contain the text “dog”, so 10 is subtracted from the
relevance score. The user likes cats, so if the result contains the text “cat”, then we
add 15 to the score. If the result contained both “dog” and “cat”, then the net
adjustment would be +5. The full text clauses do not need to be simple, as shown with
the dinosaur adjustment. The dinosaur adjustment also illustrates that the relevance
can be boosted by a relative percentage. The text clause can also specify text
metadata regions and include complex parameters…
SELECT "OTObject" where "accident" ORDEREDBY Score[100]
([region "model"] in ("ford","Toyota","gm") and [region
"Date"] > "-2m") BOOST[+15]
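The arithmetic of SCORE[N] scaling combined with additive and percentage boosts can be modelled directly. The sketch below only models the score adjustments described above (the clause matching itself is omitted); the clamping follows the rule that scores may rise above 100 but never fall below 0.

```python
def boosted_score(rank, n=100, adjustments=()):
    """Apply SCORE[N] scaling, then each boost in order.

    adjustments: a sequence of ("add", value) or ("pct", value) pairs,
    one for each boost clause the result matched.
    """
    score = rank * n / 100.0
    for kind, value in adjustments:
        if kind == "add":
            score += value
        elif kind == "pct":
            score *= 1 + value / 100.0
    return max(score, 0.0)  # may exceed 100, never drops below 0

# A result matching both "dog" (-10) and "cat" (+15): net adjustment +5
score = boosted_score(50, 100, [("add", -10), ("add", 15)])
```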
Date Boost
This boost method is used to adjust the relevance based on how closely the value in a
Date region matches a target date. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Date,"region","target",range,adjust]
Range is an integer number of days on either side of the target for which a
boost adjustment should be applied.
Adjust is an integer value that specifies the maximum adjustment to be
applied if the value in the region is an exact match for the target. The
adjustment is reduced in a linear fashion based on distance from the target.
An example is in order.
SELECT … ORDEREDBY Score[100]
BOOST[Date,"OTCreateDate","20140415",60,40]
This boost essentially states: Examine the value in OTCreateDate for each matching
search result. If the value is April 15 2014, then add 40 to the relevance score. If the
value in the OTCreateDate field is within 60 days of April 15, then add a pro-rated
value. For example, if the value in OTCreateDate was May 30 (45 days away), then
adjust the relevance score by 10 (which is 40 * (60-45)/60).
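The pro-rated date adjustment can be expressed in code. This sketch follows the linear reduction described above, using the same OTCreateDate example values:

```python
def date_boost(days_from_target, range_days, adjust):
    """Linear falloff: full adjustment at the target, zero at the range edge."""
    distance = abs(days_from_target)
    if distance > range_days:
        return 0.0
    return adjust * (range_days - distance) / range_days

# BOOST[Date,"OTCreateDate","20140415",60,40]:
exact = date_boost(0, 60, 40)    # exact match on April 15 -> +40
close = date_boost(45, 60, 40)   # May 30 is 45 days away -> +10
```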
The intent of this type of boost is to help users find items based on dates. A typical
use case might be “I am trying to find a document that I think was issued June of 2000,
but maybe I am off by 6 months”. Any document in that +/- 6 month range gets a
boosted relevance, with a higher adjustment the closer to the target date.
Another common application would be adjusting for recency, where the target date
is today, and all objects with dates within 90 days receive an adjustment.
Integer Boost
This boost method is designed to allow a range of values to be mapped to a relevance
contribution. For example, if there was a “usefulness” rating for a document on a scale
of 1 to 10, you could use that range to boost relevance on the objects. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"region",lower,upper,adjust]
For example:
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,30]
This boost essentially states: Items with a Popularity value greater than 100 and less
than or equal to 200 will receive a relevance boost of up to 30. A value of 200 gets the
maximum adjustment of 30. A value of 120 would get a boost of 6 [ =30*(120-
100)/(200-100) ].
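The same mapping in code, using the Popularity example: values at or below the lower bound get no boost, and the adjustment scales linearly up to the upper bound.

```python
def integer_boost(value, lower, upper, adjust):
    """Map a region value in (lower, upper] onto a 0..adjust boost."""
    if value <= lower:
        return 0.0
    if value >= upper:
        return adjust
    return adjust * (value - lower) / (upper - lower)

# BOOST[Integer,"Popularity",100,200,30]
max_boost = integer_boost(200, 100, 200, 30)  # 30
partial = integer_boost(120, 100, 200, 30)    # 6
```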
So why are there separate methods for Dates and Integers? The Date and Integer
boost features allow the boost adjustment to be varied depending on how close the
values are to a target, versus the all or nothing adjustment that occurs with Query
Boosting. If you have applications where getting close is useful, versus matching
exactly, then Date or Integer Boosting is superior.
For most customers however, a review of their search expectations and some Content
Server 16 considerations are in order.
Date Relevance
This is usually an important factor. Content Server has many ‘Date’ fields, where the
date represents specific information. Consider some of the following:
Creation Date – usually refers to the date an object was added to the system. Often
this is a good value for relevance, but the creation date only refers to the first version.
Versioned objects which are updated will not change this date, which reduces its value
for these data types.
Version Creation Date – for versioned objects, such as documents, this is a good
choice. Each version of the object gets an updated version creation date. On the other
hand, many objects do not have the concept of a version creation date.
Modified Date – for some types of objects, such as folders, the modified date clearly
identifies when the folder has been created or updated. However, for other types of
objects, the modified date is too volatile. Depending upon other settings in Content
Server, the modified date may change for many reasons, and therefore does not reflect
the user expectation for when an object has truly changed.
Understanding which types of objects are most important in your application for search
relevance will help you determine which Content Server date values should be used
for date relevance scoring.
There are several other date fields in Content Server that may also be used. Review
the types of objects that are most important for your application, and choose dates that
best reflect creation or change that users would consider material to search relevance.
Recent experiments suggest that new default values for Content Server using both the
Creation Date and the Version Creation Date, with relatively high weights, may be a
good choice for typical document management and workflow applications.
For new installations of Content Server, the use of MIME types and OTSubTypes for
Type Ranking is discouraged in favor of using OTFileType instead. OTFileType is
generated by the Document Conversion Server during indexing, and gives every object
a type such as “Microsoft Word”, “Adobe PDF” or “Audio”. This greatly simplifies
constructing the Type Rank, and improves accuracy.
Note that OTFileType was introduced in Content Server 10 Update 5, with some minor
tuning since then. If you have older data, then you may need to re-index the objects.
Details about the values for OTFileType are not included in this document. Some of
the more common values you may want to configure for Type Ranking using the
OTFileType region might be:
Word, Excel, PowerPoint, PDF, Folder, “Web Page”, Text, Audio, Video or Email.
Are HTML pages a key part of your data? Consider adding the HTML keywords
region to the default search regions.
Some applications, such as eDiscovery, are biased towards searching all possible
regions. The challenge is this: more default search regions results in slower query
performance. For small numbers of regions, this is not an issue. For eDiscovery, with
thousands of potential Microsoft Office document properties, this performance
degradation can be material. The “Aggregate-Text” features of the search engine may
be helpful for these situations.
Using Recommender
Recommender is a feature of Content Server which monitors user activity, and
leverages the Object Ranking feature of the search engine to boost the relevance
scores of certain objects. Specifically, the feature of Recommender known as “Object
Ranker” is responsible for computing relevance adjustments and triggering the
appropriate indexing updates. You can review the use of Recommender in the Content
Server documentation.
User Context
Statistically, a user is more likely to be searching for objects that meet one or more of
these types of criteria…
• It is located in my personal work area;
• It was created by me;
• It is located in the folder in which I am currently working;
• It is located in a sub-folder of my current location;
• It is in a location where I was recently working.
OTSE has no knowledge of the user performing a search. Content Server, however,
is aware of the user identity and location. New to Content Server 16, the relevance
boost features allow user context to be incorporated in relevance computation. For
example, each query could specify that items with the current user in the “created by”
metadata fields are emphasized, or that objects in specific locations and folders have
their relevance score enhanced. You should review these configuration settings in
Content Server, and adjust them to reflect your expected user behaviors.
Enforcing Relevancy
Adding Ranking Expressions to a search query results in more work for the Search
Engines. If the default relevance computation is performed (based on the WHERE
clause), then no material penalty occurs since the values are already retrieved as part
of the query evaluation. The Search Engines have an optimization that will determine
if the Ranking Expression is the same as the WHERE clause, in which case the
Ranking Expression computation is skipped. In updates of Content Server prior to
December 2015, the Ranking Expression differs from the WHERE clause, which will
reduce query performance.
There is a configuration setting that will ignore the Ranking Expression and enforce
use of the default WHERE clause ranking. Effectively, this is the same as using
ORDEREDBY RELEVANCY in the query. For older versions of Content Server that
have installed the 2015-12 or later update, this setting can be used to achieve a
modest search query performance gain. In the [Link] file [Dataflow_] section, add:
ConvertREtoRelevancy=true
Thesaurus
OTSE has the ability to search not only for keywords, but for synonyms of keywords,
using a thesaurus system. This section of the document explores the use of a
thesaurus with OTSE.
Overview
Searching with a thesaurus specified allows a query to match synonyms of words. For
example, the English thesaurus might have an entry for house which includes “home”,
“residence” and “dwelling”. A search for the keyword “house” would also match any of
those words if the thesaurus is enabled.
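Conceptually, thesaurus search expands the query term into an OR over its synonym set before matching. A toy sketch follows; the synonym data and matching function are illustrative only, and do not reflect the OTSE thesaurus file format:

```python
# Illustrative synonym data, mirroring the "house" example above
THESAURUS = {
    "house": {"home", "residence", "dwelling"},
}

def expand(term):
    """Return the term plus its synonyms, if any."""
    return {term} | THESAURUS.get(term, set())

def matches(term, document_words):
    """True if the document contains the term or any synonym."""
    return bool(expand(term) & set(document_words))
```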
The list of synonyms to be used is contained within a thesaurus file. You can have
many thesaurus files, and each query can specify which thesaurus file should be used.
In practice, this flexibility is generally used to select a thesaurus containing synonyms
for a particular language. OTSE ships with a number of standard thesaurus files:
English, French, German, Spanish, and Multilingual.
It is also possible to use a thesaurus to help find specialized words in specific
applications. For example, a medical thesaurus file could contain alternate names for
drugs, symptoms or other medical terminology. A custom corporate thesaurus could
contain synonyms for products, part numbers, customer names or departments.
Thesaurus Files
Thesaurus files should be placed in the “config” directory. They should follow a naming
convention of “[Link]”, where xxx defines the language and identifies the
thesaurus file as provided in the search query. By convention, OpenText default
thesaurus files are provided for English, French, German, Spanish and Euro
(multilingual) as follows:
[Link]
[Link]
[Link]
[Link]
[Link]
Thesaurus files are stored in a proprietary file format which is optimized for
performance and size. These files are created using a thesaurus builder utility, which
converts a thesaurus from the Princeton WordNet format to the OpenText thesaurus
format.
Thesaurus Queries
In order to leverage a thesaurus in a search query, you choose the thesaurus using
the “SET” command, and specify thesaurus use for a search term using the “thesaurus”
operator in the query select statement.
set thesaurus eng
select “OTName” where thesaurus “home”
The value for the language (in this case “eng”) must match the extension of the
thesaurus file. This is an optional statement. The default language setting for the
Thesaurus is English.
The “thesaurus” operator in the select statement only applies to simple single terms –
it cannot be combined with other features such as proximity, stemming, wildcards or
phrase search.
Stemming
Stemming is a method used to find words which have similar root forms, called “stems”.
The easiest way to explain stemming is by example.
The words flowers, flowering and flowered all have the same stem: flower. When
stemming is applied during a search, then a search for one of these words would match
any of these words.
The special terminology “stem” is used since the common element is not always a
word. For instance, for algorithmic reasons, the stem for “baby” might be “babi”, which
facilitates matching words such as babied or babies.
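A toy suffix-stripping stemmer shows why “babi”, not “baby”, ends up as the stem. This is a drastically simplified Porter-style sketch, not the actual OTSE stemming rules:

```python
def toy_stem(word):
    """Strip one common English suffix, then normalize trailing y to i."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word
```

With these rules, flowers, flowering and flowered all reduce to “flower”, while baby, babies and babied all reduce to “babi”.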
Stemming algorithms are not foolproof. In our example of “flower”, the stemming
algorithm might identify that “flow” is the stem – and try to find matches such as flows,
flowing or flowed. Stemming is a useful tool, but cannot always be relied upon to
behave as a user expects.
The concepts that make stemming possible are not applicable to all languages. In
general, Western European languages can use stemming, since plurals, tenses and
gender are typically formulated in terms of appending different endings to root forms
of words. Accordingly, the algorithms for stemming are different for each language.
There are many languages, such as East Asian languages, where the concept of
stemming does not apply.
Because of the language-specific aspects of stemming, a search engine has many
options available for how stemming should be implemented. One approach is to stem
words during indexing, and create an index of word stems. This can result in very fast
searches (since the stems are all pre-computed), but requires that you know the
language at index time. If only one language will ever be used, this is acceptable. In
multi-language environments, it is less useful. Some search implementations will
guess at the language during indexing and stem accordingly, which is statistically
useful but not always correct.
OTSE applies stemming rules at query time. This reduces the size of the index (since
word stems are not stored), but has a query performance penalty since the stems for
candidate words must be computed for each query.
The other key advantage of query-time stemming is that true multi-lingual stemming
can be used. Consider an index containing the following words:
Arrives (in English documents)
Arrivons (in French documents)
Arriva (in Spanish documents)
Each of these words might have the same stem (“Arriv”). By applying the stemming
algorithm at query time, the search system can differentiate between the English,
French and Spanish forms of the word based on the language preferences used for
stemming, since the English algorithms would not generate query expansions for the
words arrivons or arriva. This approach is not perfect, since in many cases similar
languages have common rules. For example, the French word “arriver” would match
the English stem for “Arrived”, since the postfix “er” is also common in the English
language.
OTSE supplies stemming rules for 5 languages: English, French, German, Spanish
and Italian. When building a search query, you request the stemming rules in the “SET”
command, using the language preference. To request a match for keyword stems, use
the “stem” operator on a keyword in the select statement:
SET language fre
select “OTName” where stem “arrive”
The stem operator does not work in conjunction with other operators, such as
proximity, wildcards and exact phrase searches.
Phonetic Matching
Phonetic matching, using “sounds like” algorithms, matches words that have
similarities when spoken aloud. There are many possible algorithms that can be used
for phonetic matching, and OTSE contains a phonetic matching algorithm which is a
variation of the classic US Government ‘Soundex’ algorithm.
Phonetic algorithms are primarily designed to help match surnames, particularly where
the names have been transcribed with potential errors. Matching surnames is of
particular interest for a number of reasons:
• Many surnames were recorded as phonetic equivalents from other languages, often
with variations in spelling.
• A name which sounds generally similar may in fact have different spelling, particularly
with language variations. Consider the dozens of variations of the name “Stephen”
that exist, including Steven, Steffen, Steffan, Stephan, Steafán, and Esteban.
• There is no master dictionary that contains a “right” way to spell a surname, so it is
common for people hearing a name to write it as they think it should be spelled. Smith,
Smithe, and Smyth are all legitimate surnames – you cannot perform spelling
correction, since they are all correct.
• In many applications, names are recorded over a poor quality phone connection,
which can introduce errors. I say ‘Pidduck’, the recipient hears and records ‘Pittock’.
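The classic Soundex algorithm illustrates the idea. The sketch below is a simplified standard Soundex (vowels, y, h and w all act as separators between repeated digits), not OTSE's modified variant, but it shows how near-homophones collapse to the same code:

```python
def soundex(name):
    """4-character Soundex code: first letter plus up to three digits."""
    digits = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            digits[ch] = digit
    name = name.lower()
    encoded = [digits.get(ch, "") for ch in name]  # uncoded letters -> ""
    code = []
    prev = encoded[0]
    for d in encoded[1:]:
        if d and d != prev:   # skip runs of the same digit
            code.append(d)
        prev = d
    return (name[0].upper() + "".join(code) + "000")[:4]
```

Here “Pidduck” and “Pittock” share the code P320, and “Smith” and “Smyth” share S530, which is exactly the kind of match a phonetic search is after.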
All phonetic matching algorithms share some common attributes. They relax the rules
for matching search terms in certain ways. The result is more terms matching, but with
a decrease in accuracy. This decrease occurs because the algorithms can match
words which are clearly not related, despite having similar phonetic properties.
Matching “Schmidt” when querying for “Smith” makes sense. But you also need to be
prepared for false matches, such as finding “Country” when searching for “Ghandi”.
Phonetic matching is generally NOT recommended for general keyword searching. It
is intended for use with names, and works best when applied against a metadata
region which is known to contain names. Otherwise, the number of false positives will
almost certainly be frustrating to a user.
There is one phonetic matching algorithm within OTSE, a modified “Soundex”
implementation. This algorithm is optimized for English. However, the algorithm is
sufficiently generic that it does provide useful results for many Western European
languages. The phonetic matching does not work for non-European languages.
To request a phonetic match for a keyword in a query, use the modifier ‘phonetic’:
Select X where [region "UserName"] phonetic "smith"
A phonetic modifier can only be applied to a simple keyword, and cannot be combined
with other features such as proximity, wildcards, regular expressions or exact phrase
searches.
There are two dictionaries of terms within the search engine, the primary dictionary for
terms that are “typical” western language words (Western European characters, no
punctuation or numbers), and the secondary dictionary for everything else. Phonetic
matching searches only for terms that meet the criteria for inclusion in the primary
dictionary.
Users might be accustomed to working only with a subset of the complete value, and
expect to find matches using arbitrary substrings of the value, such as:
Acme:SSU 87
ACF/24
F/24 3.5inches
The traditional searches using tokens, regular expressions and Like operators are not
sufficient.
Configuration
The implementation of exact substring matching is configured on a per-region basis,
and is valid only for text metadata regions. A custom tokenizer ID is configured for the
region in the [Link] file; the tokenizer itself is specified in the [Link] file,
and is constructed to encode the entire value using 4-grams.
For example, in [Link] file [DataFlow_xxx] section:
RegExTokenizerFile2=c:/config/[Link]
In [Link]:
TEXT MyRegion FieldTokenizer=RegExTokenizerFile2
Note that there is an alternative mechanism available for specifying the entry in the
[Link] file. The [Link] file can be used to logically append lines to
the field definitions file at startup (the file is not actually modified). This alternative can
be used by Content Server to control the configuration, since Content Server does
write the [Link] file.
ExtraLLFieldDefinitionsLine0=TEXT MyRegion
FieldTokenizer=RegExTokenizerFile2
Re-indexing is not required. When the Index Engines are next started, a conversion
of the index for the region will be performed. You can apply or remove a custom
tokenizer this way for existing data.
By convention, tokenizers should be located in the config\tokenizers directory. Content
Server uses this location to present a list of available tokenizers to administrators.
Substring Performance
A region indexed for exact substring matching will require about 8 times as much space
for storing the index for that region. In a typical situation, with only a few regions
configured this way, the storage requirement difference will be minimal. Exact
substring configuration is only possible when the “Low Memory” mode configuration is
enabled for text metadata.
When a region is configured for exact substring matching, every query is equivalent to
having wildcards on either side of the query string. In the example above, a search for
“SSU 87” is effectively a search for “*SSU 87*”. No other operators (comparisons,
regular expressions, etc.) are allowed with regions configured for exact substring
searches.
The exact substring is usually much faster than a regular expression because of the
way the indexing is performed. By way of example, assume the indexed value is:
abcdefghijk. Using 4-grams, the following tokens are added to the dictionary: abcd
bcde cdef defg efgh fghi ghij hijk. You want to search for cdefgh. The query engine will
first look for the first 4-gram, “cdef”, which is fast because it is in the dictionary. It then
looks for all 4-grams starting with “gh**”, and finds values with adjacent “cdef + gh**”
4-grams. While there may be a number of 4-grams for the regions beginning with “gh”,
this is much more efficient than scanning the entire dictionary with a regular expression
to find matches.
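The 4-gram scheme can be sketched as follows. This is illustrative only; OTSE's actual dictionary structures differ. The sketch indexes each 4-gram with its position and accepts a query (assumed to be at least 4 characters) when its 4-grams occur at consecutive positions, i.e. when the query is a contiguous substring:

```python
def build_index(value, n=4):
    """All n-grams of the value, each paired with its start position."""
    return {(value[i:i + n], i) for i in range(len(value) - n + 1)}

def substring_match(query, index):
    """True if the query's 4-grams appear at consecutive positions.

    Assumes the query is at least 4 characters long.
    """
    qgrams = [query[i:i + 4] for i in range(len(query) - 3)]
    starts = [pos for g, pos in index if g == qgrams[0]]
    return any(all((qgrams[k], s + k) in index for k in range(len(qgrams)))
               for s in starts)

index = build_index("abcdefghijk")
```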
Substring Variations
The choice of Tokenizer determines the behavior of substring matching. The usual
suggested tokenizer would make the data case-insensitive, but otherwise leave all
other characters unchanged, including whitespace and punctuation.
Case sensitivity requires additional mappings in the tokenizer file. By default, the
tokenizer performs upper to lower case conversion. To preserve case sensitivity, add
a section to the start of a tokenizer file:
mappings {
0x41=0x41
0x42=0x42
0x43=0x43
…
}
Include a mapping to itself for every character that requires case preservation. Ensure
that suitable mappings for non-ASCII characters are included if those are important for
your application.
The other use case to be aware of is punctuation normalization or elimination.
Consider the example which includes ACF/24 in the value. If users are not expected
to use the slash character “/” correctly, there are a couple of variations that may be
used. Normalization would convert all (or a desired set) of punctuation to a
standard value, perhaps Underscore. The string would be indexed as if it had the
value:
“Vendor_Acme_SSU_876MJACF_24_3_5inchesus”
If the user searches for Acme-SSU or ACF:24, the engine would similarly convert the
queries to “Acme_SSU” and “ACF_24”, which would then match.
Similarly, elimination strips all whitespace and punctuation from index and query
values. The index is built from:
“VendorAcmeSSU876MJACF2435inchesus”
With elimination, the test queries “Acme-SSU” or “ACF:24” are handled as if they were
“AcmeSSU” or “ACF24”, again generating a match. Eliminating punctuation is
generally better at finding a match (since it also handles extraneous whitespace), but
is not as precise – potentially returning some false positives.
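Both variations amount to a canonicalization applied identically to indexed values and queries. A sketch follows; the character class used here is an assumption, since real tokenizer files define their character mappings explicitly:

```python
import re

# Assumption: anything other than a lowercase letter or digit counts
# as punctuation/whitespace for these examples.
PUNCT = re.compile(r"[^0-9a-z]")

def normalize(value):
    """Lowercase, then map punctuation and whitespace to underscore."""
    return PUNCT.sub("_", value.lower())

def eliminate(value):
    """Lowercase, then strip punctuation and whitespace entirely."""
    return PUNCT.sub("", value.lower())
```

With normalization, “Acme-SSU” and “Acme:SSU” both become “acme_ssu”; with elimination, “ACF:24” and “ACF/24” both become “acf24”, so either query form matches.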
Included Tokenizers
Customizing a tokenizer can be a challenge. To facilitate substring matching, there are
3 tokenizers provided with OTSE that cover the most common exact substring
requirements, in addition to the default tokenizer.
[Link]
This tokenizer is case insensitive, but otherwise preserves all punctuation and
spaces.
[Link]
This tokenizer eliminates all punctuation and whitespace. The strings “[Link]
name” and “123-m&n_amE” are equivalent, being interpreted as “123myname” in
both queries and indexed values.
[Link]
This tokenizer treats email addresses in common forms as a single token. With
the traditional tokenizer, [Link]@[Link] would be 5 tokens, as the
punctuation would be interpreted as white space. The email address tokenizer
would leave the email address intact as a single token. Searching on a single token
for email is faster and more accurate than a phrase search for multiple tokens.
Problem
Part numbers and file names are primary examples. A human might describe a part
for a machine as: “the 14 centimeter widget that fits jx27 engine”. Instead, we create
names along the lines of “PN3004/widget-14JX27”. Search technology that is
trying to formulate tokens and patterns based on regular sentence structure and
grammar rules will struggle to match these types of values.
Similarly, we create file names such as “SalesForecast2013-europeFRANCE
Rene&[Link]”. With file uploads and Internet encoding, this can even inject
strings such as %20 or & into the metadata values. Again, algorithms designed
to parse human language have difficulty succeeding with these metadata fields.
Like Operator
To accommodate these types of metadata search requirements, OTSE includes the
concept of a “Likable” region. If you have metadata that fits the problem profile, a list
of the appropriate metadata regions can be declared as Likable in the [Link] file:
OverTokenizedRegions=OTFileName,MyParts
This instructs the Index Engines to build a “shadow” region derived from the original
metadata region, but using a very different set of rules for interpreting the metadata
and building tokens. For example, the traditional indexed tokens for our sample part
number and file name values might be:
pn3004 widget 14jx27
salesforecast2013 europefrance rene gina1 doc
When a query using the like operator is processed, the query is also tokenized using
the alternate rules, and is tested against the shadow region instead of the original
region. In this case, the following queries would succeed that would typically fail using
normal human language tokenizing rules:
where [region "OTFileName"] like "gina 2013 sales forecast"
where [region "MyParts"] like "JX27 widget 3004"
If the like operator is requested for a region that does not support it, then the operator
is treated as an “AND” between the provided terms, and applied against the original
region instead of the shadow region.
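The shadow-region tokenizing rules are not spelled out in this document, but the behavior described is consistent with splitting on punctuation, whitespace and letter/digit boundaries, then treating Like as a token-subset test. The following is a speculative sketch built on that assumption:

```python
import re

# Assumed rule: a token is a maximal run of letters or a maximal run of
# digits, lowercased. The real OTSE over-tokenizing rules may differ.
TOKEN = re.compile(r"[a-zA-Z]+|[0-9]+")

def over_tokenize(value):
    return {t.lower() for t in TOKEN.findall(value)}

def like(query, value):
    """True if every query token appears among the value's tokens."""
    return over_tokenize(query) <= over_tokenize(value)
```

Under these rules, “PN3004/widget-14JX27” yields the tokens pn, 3004, widget, 14, jx, 27, so the query “JX27 widget 3004” matches, while a numeric fragment such as “395” does not.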
Like Defaults
Since many users will not understand the requirement to specify the “like” operator
in a query, a configuration option is provided in the [Link] file that allows the use of
Like as the default operator.
UseLikeForTheseRegions=OTName,OTFileName
If a query for a token or phrase is requested against one of these regions and there is
no explicit term operator provided, then Like will be assumed. This also works if the
region is in the list of default search regions. For example, the common Content Server
region OTName can be both a default search region and have the Like operator
applied by default. Note that Content Server can be configured to inject a default
operator such as stem into a query term, which would negate using Like by default.
There is also a configuration setting that controls whether stemming should be used
when searching with Like queries. By default, this feature is active. If there is a term
component in a query that is 3 letters or longer, then either the singular or plural form
will match. To disable this feature and only match the entered values, in the [Dataflow_]
section of the [Link] file:
LikeUsesStemming=false
Shadow Regions
The synthetic shadow regions built to support the Like operator have some properties
of interest. They are created when the Index Engines start based upon the [Link]
settings. This adds some time to startup, but allows the Like feature to be applied to
existing data sets without re-indexing. The shadow regions are saved on disk as part
of the index until removed from the list of over-tokenized regions, which also occurs on
Index Engine restart.
The shadow regions have the same names as their masters, appended by
_OTShadow. If the region OTName is configured as likable, then the synthetic region
is named OTName_OTShadow.
Limitations
Multi-value text regions have some limitations in behavior. The aggregate strings from
all the values are gathered together to create a single region that is tokenized for the
like operator. This means that there is no ability to combine the like operator with
User Guidance
The description of the like operator so far may be good background into configuration
and applications, but does not provide much practical advice for an end user. The
normal warning that this guidance is not applicable in all situations applies.
Suggestions for a user trying to maximize success using a metadata region with the
like operator may include:
• Select fragments that appear to be logically distinct.
• Use spaces in place of punctuation.
• Do not enter a fragment of a longer numeric sequence as a search term.
• Do not enter a fragment of a text sequence as a search term.
• Do not use wild card operators.
• More terms or portions of the part number will be more precise.
An example using a fictitious part number string in a metadata region:
PN4556-WidgetRED01395b/v5.68.99 $2,867
Queries such as the following would succeed:
2867
867
On the other hand, these queries would fail:
idget [use widget]
2,86* [wildcards not permitted]
395 [fragment of 1395]
A search for [Link] might also find [Link]. A search expecting to find
smith in the domain might also find [Link]@acme. A search for the
other-acme domain might also find other@acme. In some cases, you could use exact
phrases to better constrain the queries, but this places a high knowledge burden on
the user. Beginning with SE10.5 Update 2014-09, capabilities exist to facilitate the
common email domain search case.
If the region OTSender is declared to be an email region, the Index Engine will
construct a new region named OTSender_OTDomain, and place the domain portions
of the email addresses in this new region. The original OTSender region remains
unaffected. The OTSender_OTDomain region can now be easily searched.
The email domain indexing process can handle multiple values for email addresses in
two ways. If there is a list of addresses in a single value, they will be split using some
simple pattern matching rules, typically comma or semicolon delimited. Multi-value
regions are also supported. In both cases, each distinct email domain will be
represented as a value in the _OTDomain region.
Where multiple identical email domain values exist for an object in the email region,
duplicates will be removed. This behavior is important given that many recipients of
an email message are often in the same organization or email domain.
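The extraction and de-duplication behavior described above can be sketched as follows. This is a minimal illustration, not the engine's implementation: the separator set mirrors the documented default, and the address values used in the test are hypothetical.

```python
import re

# Mirrors the documented default:
# EmailDomainSeparators=[,:;<>\[\]\(\)\s]
SEPARATORS = re.compile(r"[,:;<>\[\]\(\)\s]")

def extract_domains(values, max_domains=50):
    """Collect distinct email domains from the values of an email
    region: split on separators, take the text after '@', and keep
    first-seen order while dropping duplicates."""
    domains = []
    for value in values:
        for token in SEPARATORS.split(value):
            if "@" in token:
                domain = token.rsplit("@", 1)[1]
                if domain and domain not in domains:
                    domains.append(domain)
                if len(domains) >= max_domains:  # MaxNumberEmailDomains
                    return domains
    return domains
```

The `max_domains` cap corresponds to the MaxNumberEmailDomains setting discussed below.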
In the [Link] file there are several configuration settings for tuning and enabling
email domain search capabilities. The main setting to enable or disable the feature is
a comma-separated list of text metadata regions that should be treated as email
regions. By default, this list is empty. A configured example:
EmailDomainSourcesCSL=OTEmailSender,OTEmailRecipient
When you add or remove regions from the email domain list, the changes take effect
the next time the Index Engines are started. At startup, any new email domain regions
will be created and the values populated. This may add 10 or so minutes to the first
startup process. Likewise, if any regions were removed, they will be deleted from the
index at next startup.
Tuning of the behavior is possible with remaining configuration settings. By default,
_OTDomain is used as the suffix for the email domain regions, but this can be adjusted.
There is an upper limit on the number of distinct email domain values that will be
retained for a given email value, which defaults to 50. If you anticipate longer lists of
email domains, this value can be adjusted upwards. Finally, the separators used to
delimit an email domain can be defined. When indexing, a simple rule is used that text
after the @ symbol up to a separator character represents the email domain. The
separators are defined in the [Link] file, and default to comma, colon, semicolon,
various brackets and whitespace. The separator string must be compatible with a Java
regular expression.
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]
An example: if a multi-value email region for an indexed object has the values:
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>sue@[Link]</OTEmailSender>
<OTEmailSender>bob@[Link]</OTEmailSender>
<OTEmailSender>sue@[Link]</OTEmailSender>
The OTEmailSender_OTDomain for that object will have effective values of:
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>[Link]</OTEmailSender_OTDomain>
The same _OTDomain values would exist if a single value email region contains the
string:
<OTEmailSender>bob@[Link], bob@[Link][Robert]
sue@[Link];bob@[Link](“MightyBob”);
sue@[Link]</OTEmailSender>
TEXT Operator
Unlike other search operators, the user does not have direct control over the exact
behavior of the search query. A typical use case would be to copy a couple of
paragraphs from a document, and search using the TEXT operator to find documents
with similar information. The TEXT operator takes arbitrary text as the parameter,
excluding closing brackets and end of line characters.
To illustrate by example, perhaps the first few lines from Lewis Carroll’s “Alice in
Wonderland” are used:
text (Alice was beginning to get very tired of sitting by
her sister on the bank, and of having nothing to do: once
or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it,
‘and what is the use of a book,’ thought Alice ‘without
pictures or conversations?’ So she was considering in her
own mind (as well as she could, for the hot day made her
feel very sleepy and stupid, whether the pleasure of making
a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink
eyes ran close by her. There was nothing so very remarkable
in that; nor did Alice think it so very much out of the way
to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I
shall be late!’)
First observation: to be compatible with the TEXT operator, the paragraph-end “CRLF”
characters and closing parentheses “)” in the source were removed.
The Text operator would then analyze the text, discarding short words and top words.
Statistical analysis would select notable words (and phrases). Although not in this text
example, overly long words or lists of numbers would be ignored. The resulting set of
8 to 15 terms would then be used internally with stemset, with an effective internal
query something like:
stemset(80%,alice,sister,book,”pictures or
conversations”,rabbit,considering,trouble,picking,pleasure,
sleepy,stupid,daisies)
This in turn would match all items that have 80% or more of those terms and phrases
in the full text of the object. In general, numbers are dropped from consideration in
TEXT queries. However, if the provided block of TEXT is relatively short (less than
about 250 characters), numbers will be included if necessary to meet the minimum
number of terms.
The TEXT operator has a number of configuration settings. See the Top Words section
below for more settings.
Performance degrades with more words used in stemset, while accuracy drops with
too few words. The upper limit on the number of terms and phrases to use is:
TextNumberOfWordsInSet=15
For accuracy, such as trying to match exact documents, termset is a better choice.
Otherwise, stemset is used to find more objects with singular/plural variations but runs
slightly slower:
TextUseTermSet=true
The percentage of matches with termset and stemset can be adjusted. Low values
find more objects with less similarity (e.g., 40%). Higher values, such as 80%, require
better matches with the source material:
TextPercentage=80
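A simplified sketch of the term selection and percentage matching described above. The tokenization, ranking rule, and the small TOP_WORDS list here are all illustrative stand-ins; the real engine's statistical analysis is more involved.

```python
import re
from collections import Counter

# Hypothetical Top Words list; OTSE computes these per partition.
TOP_WORDS = {"the", "and", "was", "she", "her", "into", "that"}

def select_terms(text, max_terms=15):
    """Pick up to TextNumberOfWordsInSet notable terms: drop short
    words and Top Words, then keep the most frequent remainder."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    candidates = [w for w in words if len(w) > 3 and w not in TOP_WORDS]
    return [w for w, _ in Counter(candidates).most_common(max_terms)]

def stemset_match(terms, document, percentage=80):
    """An object matches when at least TextPercentage of the selected
    terms occur in its full text."""
    doc_words = {w.lower() for w in re.findall(r"[A-Za-z]+", document)}
    hits = sum(1 for t in terms if t in doc_words)
    return hits * 100 >= percentage * len(terms)
```

Note that stemset would also match singular/plural variations of each term; this sketch only tests exact presence.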
Top Words
The TEXT query operator is specifically designed to efficiently locate good quality
results when provided with large blocks of text. In this particular scenario, overly
common words are of little value, and need to be discarded. In OTSE, the Top Words
feature is used for this purpose.
Top Words are those words which are found within a large percentage of the
documents. For example, the OpenText corporate document management system has
the word OpenText in many documents, and hence it is eliminated from TEXT queries.
Top Words are determined based upon the percentage of objects containing a word.
For example, if more than 30% of objects contain the word ‘date’, then ‘date’ is added
to the Top Words list.
Top Words are computed independently for each search partition. Usually, more
partitions are added over a prolonged period. If the frequency of words changes over
time, then newer partitions will have slightly different Top Words than older partitions.
This also means that TEXT queries which eliminate Top Words might construct slightly
different queries on each partition.
The Top Words are first computed for a partition once it contains approximately 10,000
objects. On reaching 100,000 and 1,000,000 objects, the list is discarded and
recomputed. This helps to ensure that the Top Words properly reflect the contents of
the partition. The Top Words are stored in a file that is not human readable, and has
the name topwords.10000, with the number changing to reflect the size. If the
topwords.n file is missing, it will be generated during next startup or checkpoint write.
The threshold for selecting Top Words is a real number that should be between 0.01
and 0.99, representing the fraction of objects in the partition that contain the word. The
default value is 0.33 (33%), which in some typical partitions larger than 1 million
objects generated a Top Words list of about 750 words. Larger fractions result in fewer
Top Words. In the [Dataflow_] section:
TextCutOff=0.33
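The selection rule can be sketched in a few lines. Whole-word splitting and the in-memory frequency map are simplifications; the engine works over its on-disk dictionaries.

```python
def compute_top_words(documents, cutoff=0.33):
    """A word becomes a Top Word when the fraction of objects in the
    partition containing it exceeds TextCutOff (document frequency,
    not raw occurrence count)."""
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):  # count each object once
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w for w, f in doc_freq.items() if f / len(documents) > cutoff}
```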
If the Top Words features are not required, generation and use can be disabled by
setting:
TextAllowTopwordsBuild=false
Stop Words
Stop words are words which are considered too common to be relevant, or do not
convey any meaning, and are therefore stripped from search queries, or potentially not
even indexed. For English, a typical list of stop words would contain words such as:
a, about, above, after, again, against, all, am, an,
and, any, are, aren't, as, at, be, because, been,
before, being, below, between, both, but…
The potential advantage of stop words is a reduction in the size of the search index.
However, use of stop words introduces several limitations for search.
If stop words are applied at indexing time, certain types of queries become impossible.
A Shakespearean scholar could never find Hamlet’s soliloquy “to be or not to be”, since
all of those words are considered stop words, and would not be in the index.
Another reason to not apply stop words during indexing is the multi-lingual capability
of OTSE. The Spanish word “ante” is very common, so it should be a stop word, and
not indexed. However, in English, this is an uncommon word, so it clearly should be
indexed.
As a result, the search engine does not use stop words during indexing, nor are they
applied as a general rule during search queries. However, there is a closely related
capability known as Top Words that is used under special circumstances.
Accumulator
The Accumulator is an internal component of the Index Engines which is responsible
for gathering the tokens (or words) that are to be added to the full text search index. A
basic understanding of the Accumulator is useful when considering how to tune and
optimize an OTSE installation.
As objects are provided to the Index Engine, the full text objects are broken into words
using the Tokenizer, and added to the Accumulator. When the Accumulator is full, this
event triggers creation of a new full text search fragment. In a process known as
“dumping” the Accumulator, a fragment containing the objects stored within the
Accumulator is written to disk.
The transactional correctness of indexing is possible in part because of how the
Accumulator works. As objects are added to the accumulator, they are also written to
disk in the accumlog file. These files are monitored by the search engines to keep the
search index incrementally updated. When the Accumulator dumps, a new index
fragment is created, and the accumlog files are available for cleanup.
The size of the accumulator has an impact on system performance, and on the
maximum size of an object that can be indexed. A small Accumulator is forced to dump
frequently, which can reduce indexing performance. A large Accumulator consumes
more memory. The default size value for the Accumulator is 30 Mbytes (which is a
nominal allocation target – Java overhead results in the actual memory consumption
being higher), and can be set from within the Content Server search administration
pages, which sets the [Dataflow_] value in the [Link] file:
AccumulatorSizeInMBytes=30
If a single object is too large to fit within the Accumulator, it will be truncated –
discarding the excess text content. You cannot always predict whether an object will
exceed this size limit, since this is a measurement of internal memory use including
data structures, and not a measurement of the length of the strings being indexed.
The Accumulator will dump if it contains data and indexing has been idle. The idle time
before dumping is configurable:
DumpOnInactiveIntervalInMS=3600000
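The two dump triggers can be sketched as follows. Sizes here count characters rather than real internal memory use (which, as noted, includes Java data-structure overhead), and timestamps are passed in explicitly for clarity.

```python
class Accumulator:
    """Sketch of the dump triggers: a size threshold
    (AccumulatorSizeInMBytes) and an idle timeout
    (DumpOnInactiveIntervalInMS)."""

    def __init__(self, max_bytes=30 * 1024 * 1024, idle_ms=3_600_000):
        self.max_bytes = max_bytes
        self.idle_ms = idle_ms
        self.buffer, self.size = [], 0
        self.last_add_ms = 0
        self.fragments = []  # each dump writes one index fragment

    def add(self, object_text, now_ms):
        self.buffer.append(object_text)
        self.size += len(object_text)
        self.last_add_ms = now_ms
        if self.size >= self.max_bytes:  # full: dump a fragment
            self.dump()

    def tick(self, now_ms):
        """Called periodically: dump if data is pending and idle."""
        if self.buffer and now_ms - self.last_add_ms >= self.idle_ms:
            self.dump()

    def dump(self):
        self.fragments.append(list(self.buffer))
        self.buffer, self.size = [], 0
```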
During indexing of an object, the accumulator also makes an assessment of the quality
of the data it is given to index. If the data is too “random” from a statistical perspective,
then the accumulator will reject it with a “BadObjectHeuristics” error. The randomness
configuration settings in the [Dataflow_] section are:
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
The heuristics are relatively lax, and essentially designed to try and protect the index
from situations where random data or binary data was provided. It is rare that these
values need to be adjusted, and some experimenting will be needed to find values that
meet special needs. There is a minimum size of about 16,384 bytes before these
heuristics are applied, since small objects would otherwise fail the uniqueness
requirement.
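The idea behind these heuristics can be illustrated with a sketch. The limits echo the defaults above, but the real engine combines two heuristic levels, which this single-pass check simplifies.

```python
def looks_random(tokens, unique_ratio_limit=0.5,
                 avg_length_limit=15.0, min_size=16384):
    """Sketch of the BadObjectHeuristics idea: very high unique-token
    ratios or very long average tokens suggest random or binary data."""
    text_size = sum(len(t) for t in tokens)
    if text_size < min_size:
        return False  # small objects are exempt from the check
    unique_ratio = len(set(tokens)) / len(tokens)
    avg_length = text_size / len(tokens)
    return unique_ratio > unique_ratio_limit or avg_length > avg_length_limit
```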
There is one situation where this safety feature is known to occasionally discard good
objects. If a spreadsheet is indexed that contains lists of names, numbers and
addresses, the uniqueness of the tokens may be very high, and it may be rejected as
random.
A related configuration setting is an upper limit on the size of a single object. Objects
are truncated at this limit, meaning that only the first part of the object is indexed. Note
that this size limit is applied to the text given to the Index Engine, not the size of an
original document file. For example, a 15 MB Microsoft PowerPoint file might only
have a filtered size of 100 Kbytes. Conversely, an archive file (ZIP file) with a size of
1 MB might expand to more than 10 MB after filtering.
ContentTruncSizeInMBytes=10
From an indexing perspective, 10 Mbytes is a lot of information. For English language
documents, this would normally be more than 1 million words. By way of comparison,
this entire document in UTF-8 form is well under 1 MByte.
Accumulator Chunking
Starting with Search Engine 10 Update 7, the Accumulator also has the ability to limit
the amount of memory consumed by “chunking” data during the indexing process.
Essentially, if the size of the accumulator exceeds a certain threshold, the input is
broken into smaller pieces, or chunks. Each chunk is separately prepared and written
to disk. When all the chunks are completed, a “merge” operation combines the chunks
into the index.
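The splitting step can be sketched as follows; preparing each chunk and the final merge are elided, and the threshold plays the role of AccumulatorBigDocumentThresholdInBytes.

```python
def chunk_tokens(tokens, threshold_bytes=5_000_000):
    """Sketch of chunking: once the accumulated input reaches the
    threshold, close the current chunk; each chunk is prepared and
    written separately, then merged into the index."""
    chunks, current, size = [], [], 0
    for token in tokens:
        current.append(token)
        size += len(token)
        if size >= threshold_bytes:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)  # final partial chunk
    return chunks
```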
Chunking is a very disk-intensive process. When chunking occurs, there is a
noticeable impact on the indexing performance. Fortunately, chunking is only required
when indexing very large objects. Using the default settings, we noted while indexing
our own typical “document management” data set that chunking occurs with hundreds
of documents per million indexed, and showed an overall indexing performance hit of
about 15% in a development environment. If indexing performance must be optimized,
you can disable chunking or even reduce the Content Truncation size described above
to a small value (perhaps 1 MByte) such that chunking may never happen.
There are configuration settings in the [DataFlow_] section of the [Link] file for
tuning the chunking process. The number of bytes in an object before chunking will
occur has a default of 5 MBytes. The feature can be disabled with a large value, say
100,000,000.
AccumulatorBigDocumentThresholdInBytes=5000000
An additional amount of memory for related data such as the dictionary is reserved
as working space, expressed as a percentage of the Accumulator size (typically 30
Mbytes), with a default of 10 percent.
AccumulatorBigDocumentOverhead=10
As a result of this change, it will no longer be possible to search within XML regions
in the body of text for large XML objects where chunking occurs. Chunking can be
disabled for XML documents by setting the configuration to true, but this will negate
the memory savings from chunking.
CompleteXML=false
Reverse Dictionary
The search engine maintains dictionaries of words in the index. The dictionary is
sorted to be efficient for matching words, and for matching portions of words where the
beginning of the word is known (right-truncation, such as washin*). However, for
matching terms that start with wildcards (left-truncation), the dictionary is not optimal.
The search engine can optionally store a second dictionary, known as the Reverse
Dictionary. This is a dictionary of each term spelled backwards. For instance, the term
“reverse” is stored as “esrever”. This Reverse Dictionary allows high performance
matching of terms that begin with a wildcard, and for certain types of regular
expressions that are right anchored (ending with a $).
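The lookup technique can be sketched with an in-memory list; in OTSE the dictionaries are on-disk structures maintained by the Index Engine, and the term list here is hypothetical.

```python
import bisect

terms = sorted(["reverse", "reversal", "washing", "working", "king"])
# Reverse Dictionary: every term spelled backwards, kept sorted.
reverse_dict = sorted(t[::-1] for t in terms)

def leading_wildcard(suffix):
    """Resolve a query like *ing by prefix-searching the reversed
    dictionary for the reversed suffix ("gni...")."""
    key = suffix[::-1]
    lo = bisect.bisect_left(reverse_dict, key)
    hi = bisect.bisect_right(reverse_dict, key + "\uffff")
    return sorted(r[::-1] for r in reverse_dict[lo:hi])
```

A leading-wildcard match thus becomes an ordinary binary-searchable prefix match, which is why the feature is fast.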
There is an indexing performance penalty associated with building and maintaining the
Reverse Dictionary. The penalty varies due to many factors, but has been observed
to be over 10%. There is additional disk space required, typically about 1 GB for a
partition with 10 million objects. As far as memory is concerned, another Accumulator
instance is used which consumes about 30 MB of RAM in the default configuration,
and space of about 15 MB is required for term sorting. The Reverse Dictionary is
enabled with a setting in the [Dataflow] section of the [Link] file:
ReverseDictionary=true
The Reverse Dictionary works with full text content and text metadata stored in “Low
Memory” mode. Older storage modes are not supported. The Reverse Dictionary is
not used with regions that are over-tokenized or configured for exact substring
matching.
Transaction Logs
In the event that an index or partition is corrupted or destroyed, OTSE provides
Transaction Logs to help rebuild and recover indexes with the least amount of re-
indexing. Transaction Logs are generated by the Index Engines with a minimal record
of the indexing operations that have been applied. A fragment of a Transaction Log
looks like this:
2018-03-15T[Link]Z, replace - content, DataId=1009174&Version=1
2018-03-15T[Link]Z, add, DataId=1036021&Version=1
2018-03-15T[Link]Z, delete, DataId=1015932&Version=1
2018-03-15T[Link]Z, add, DataId=1036022&Version=1
2018-03-15T[Link]Z, add, DataId=1036023&Version=1
2018-03-15T[Link]Z, Start writing new checkpoint
2018-03-15T[Link]Z, Finish writing new checkpoint
2018-03-15T[Link]Z, add, DataId=834715&Version=1
If an index is corrupted, it can be restored from the most recent backup. The
Transaction Log can then be used to determine which Content Server objects should
be re-indexed or deleted to bring the backup copy of the index up to date, based on
the date/time of the operations since the date of the backup.
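The recovery scan can be sketched as below. The ISO 8601 timestamp format and the sample log lines are assumptions for illustration; checkpoint markers carry no object operation and are skipped.

```python
from datetime import datetime

def operations_since(log_lines, backup_time):
    """After restoring a backup, scan the Transaction Log for object
    operations recorded after the backup's timestamp."""
    pending = []
    for line in log_lines:
        stamp, _, rest = line.partition(", ")
        if "checkpoint" in rest:
            continue  # checkpoint markers are not object operations
        if datetime.fromisoformat(stamp) > backup_time:
            pending.append(rest)
    return pending
```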
The transaction logs are set up to rotate 4 logs of size 100 MB each, which should
typically be able to record more than 50 million operations for a partition. At this time,
these values are not adjustable. In a typical system with regular backups, this should
be more than enough to recover all transactions. If your backups are less frequent,
you may wish to copy these logs on a regular basis.
Multiple copies of the Transaction Logs can be written. The idea here is that these
logs must survive a disk crash to be useful for recovery. If you are concerned about
system recovery, consider recording the Transaction Logs on two different physical
disks. In the [IndexEngine_] section of the [Link] file:
TransactionLogFile=c:\logs\p1\[Link],
f:\logs\[Link]
TransactionLogRequired=false
In this example, logs are written to two locations. By default, the list is empty, which
disables writing the Transaction logs. The Index Engine will append text to the
provided file name to differentiate between the rotating logs. A second setting dictates
whether a failure to write Transaction Logs should be considered a transaction failure,
or should be accepted and allow indexing to continue. By default, this is false –
meaning the Transaction Logs are “nice to have”.
Protection
Because Content Server is relatively open and allows many types of applications to be
built on top of it, the search system can be exposed to unexpected data and
applications. This section touches on some of the configurable protection features of
OTSE.
Cleanup Thread
As the Index Engines update the index, they create new files and folders. The Search
Engines read these files to update their view of the index. Left alone, these files will
eventually fill the disk. The Cleanup Thread is the component of the Index Engine that
runs on a schedule to analyze the usage of the files, and delete those which are no
longer necessary.
A Cleanup Thread only examines and deletes files for a single partition; each Index
Engine therefore schedules a Cleanup Thread. The Cleanup Thread will delete
unused configuration files, as well as unused files listed in the configuration files, such
as accumlog, metalog, checkpoint and subindex fragment files. Search Engines keep
file handles open for config files currently in use, and this is the primary mechanism
used by the Cleanup Thread to determine if files can be deleted.
There is no specific process to monitor for the Cleanup Thread; it is part of the Index
Engine process. By default, the Cleanup Thread is scheduled to run every 10 minutes.
You can adjust the interval in the [Link] file [Dataflow_] section:
FileCleanupIntervalInMS=600000
The Cleanup Thread also has a secure delete capability, disabled by default.
SecureDelete=false
When set to true, the Cleanup Thread will perform multiple overwrites of files with
patterns and random information before deleting them, making them unreadable by
most disk forensic tools. This also makes the file delete process considerably slower,
and uses significant I/O bandwidth. Some additional notes on this feature:
• The US Government has updated their guidelines to require physical
destruction of disk drives for highest security situations.
• Overwriting files is ineffective with journaling file systems.
• The algorithm is designed for use with magnetic media, and may not provide
any additional security with Solid State Disks.
• Optimizations by Storage Array Network storage systems may defeat this
feature.
The Cleanup Thread code has been enhanced starting with Search Engine 10 Update
4 to delete unused fragments more aggressively. If for some reason you require the
previous behavior, it can be requested in the search INI file by setting
SubIndexCleanupMode=0. The default value is 1.
Merge Thread
The Merge Thread is a component of the Index Engine that consolidates full text index
fragments. As the Index Engines add or modify the index, they do not change the
existing files. Instead, they append new files, referred to as the “tail” fragments. The
Search Engines must search against all of the files that comprise the full text index.
As the number of files containing index fragments grows, the performance of search
queries deteriorates. The purpose of the Merge Thread is to combine fragments to
create fewer files that the Search Engines need to use, ensuring that query
performance remains high. Merging also reduces the overall size of the index on disk,
since deleted objects are simply “marked” as deleted in the tail fragments, and modified
objects will have multiple representations until they are merged.
The Merge Thread will create new full-text index fragment files and then communicate
with the Search Engine using the Control File regarding which files now comprise the
index. Once the Search Engine has switched over (locking the new files), the Cleanup
Thread will delete the older index files.
delete the older index files.
Merging is a disk-intensive process. The Merge Thread therefore tries to maintain a
balance between how frequently merges occur and how many index fragments exist.
In a typical index, there are frequent merges taking place within the tail index
fragments, which tend to be small and can be merged quickly. Eventually, older and
larger fragments must also be merged.
An optimal target for the number of fragments an index should have is about 5. In
practice, the number of smaller fragments can grow quite large depending upon the
characteristics of the index. As a safeguard, there is a configuration setting that places
an upper limit on the number of fragments that are permitted for a partition index, and
this will force merges to occur. Too many fragments can seriously affect query
performance due to the level of disk activity in a query and the number of file handles
needed.
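One way to picture the balancing act is the sketch below. The selection rule is illustrative only; it borrows the names of the DesiredMaximumNumberOfSubIndexes and NeighbouringIndexRatio settings but does not reproduce the engine's actual scheduling logic.

```python
def pick_merge(fragment_sizes, desired_max=5, ratio=3):
    """Sketch: with no more than desired_max fragments, do nothing;
    otherwise prefer merging neighbouring fragments of similar size
    (within the ratio), since those merges are cheapest."""
    if len(fragment_sizes) <= desired_max:
        return None
    for i in range(len(fragment_sizes) - 1):
        a, b = fragment_sizes[i], fragment_sizes[i + 1]
        if max(a, b) <= ratio * min(a, b):
            return (i, i + 1)  # merge this similar-sized pair
    # no similar pair: force a merge of the small tail fragments
    return (len(fragment_sizes) - 2, len(fragment_sizes) - 1)
```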
Tail Fragments
The Merge Thread configuration settings are located in the [Dataflow_] section of the
[Link] file:
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
CompactEveryNDays=30
NeighbouringIndexRatio=3
“Want Merges” would normally only be changed for debugging purposes. In most
installations, these settings do not need to be modified. One setting of note is the
Compact Every N Days value, which instructs the Merge Thread to make a more
aggressive attempt to merge indexes over the long term. This setting helps to merge
older index fragments which are relatively stable, and would otherwise not be
scheduled for compaction.
Merge Tokens
Merging fragments temporarily requires additional disk space, nominally the size of all
the fragments being merged. If the temporary disk space needed causes the partition
to exceed the configured maximum size of the partition, then the merge will fail. One
way to address this is to increase the configured allowable disk space. However,
increasing the disk space for every partition can be a costly approach to solving the
problem.
The better approach is to enable Merge Tokens. Merge Tokens are managed by the
Update Distributor, and can be granted on an as-needed basis to Index Engines that
do not have sufficient space to perform merges. If given a Merge Token, the Index
Engine will proceed to perform a merge even if this exceeds the configured maximum
disk space. If the largest index fragments are 20 GB, then 100 GB of temporary space
would suffice for 4 or 5 Merge Tokens. Relatively few Merge Tokens are needed: 3
tokens would likely suffice for 10 partitions, perhaps 10 tokens for 100 partitions.
The Merge Token capability was first added in Update 2015-03, and the default setting
is disabled for backwards compatibility. In the [UpdateDistributor_] section of the
[Link] file:
NumOfMergeTokens=0
Too Many Sub-Indexes
Although OTSE has a typical target of merging down to 5 or so index fragments, there
are situations when this may not be possible. There is a maximum number of allowable
index fragments (or sub-indexes), which by default is 512. There have been scenarios,
usually due to odd disk file locking, where this limit has been reached or exceeded. In
this case, a Java exception will occur, logging a message along these lines:
MergeThr[Link]xception:Exception in
MergeThread:[Link]; 512
To recover from this, you can edit the [Dataflow_] section of the [Link] file to
increase the number of allowable sub-indexes (perhaps 600), and restart the affected
engines. Once recovered, the lower number should be restored, since running with
larger values has a potential negative performance impact.
MaximumSubIndexArraySize=512
Tokenizer
The Tokenizer is the module within OTSE that breaks the input data into tokens. A
token is the basic element that is indexed and can be searched. The Tokenization
process is applied to both the input data to be indexed, and the search query terms to
be searched.
There is a default standard Tokenizer (Tokenizer1) built into OTSE that applies to both
the full text and all search regions. The system supports adding new tokenizers that
can be applied to specific metadata regions. In addition, Tokenizer1 can be replaced
and customized, or can be used with a number of configuration options. Everything
that follows until the section entitled “Metadata Tokenizers” describes the use of the
default Tokenizer1.
Language Support
OTSE is based upon the Unicode character set, specifically using the UTF-8 encoding
method. This means that all indexing and query features can handle text from most
languages. If there are limitations in supported character sets, any necessary changes
would take place within the Tokenizer.
Case Sensitivity
By design, OTSE is not case sensitive. Text presented for indexing or terms provided
in a query are passed through the Tokenizer, which performs mapping to lower case.
This design decision provides a slight loss of potential feature capability in full text
search, but improves performance and reduces index size dramatically. Note that text
metadata values are stored in their original form, including accents and case, so that
retrieval of metadata has no accuracy loss. The mapping to lower case is not applied
to other aspects of the index, such as region names, which ARE case sensitive.
The comm|nocomm line is optional, and not recommended. This controls whether text
that meets the criteria for SGML or XML style comments should be retained or
discarded. The default value is nocomm (do not index comments). This line is
equivalent to setting the standard Tokenizer options in the [Link] file with a value
of TokenizerOptions=2.
Using the null character as the “to” value in a mapping is a special case. Null
characters are skipped during a subsequent Indexing step, so mapping a character to
0x00 will effectively drop it from the string. This may be useful for removing standalone
diacritical marks or punctuation such as the single quote mark from the word
“shouldn’t”.
The following table illustrates the default character mappings for many of the European
languages.
From To
A-Z a-z
À Á Â Ã Å à á ã å Ā ā Ă ă Ą ą a
Ä Æ ä æ ae
Ç ç Ć ć Ĉ ĉ Ċ ċ Č č c
Ď ď Đ đ d
È É Ê Ë è é ê ë Ē e
Ì Í Î Ï ì í î ï i
Ð ð ð
Ñ ñ n
Ò Ó Ô Õ Ø ò ó ô õ ø o
Ö ö oe
Ú Û ù ú û u
Ü ü ue
Ý ý ÿ y
Þ þ (Large Thorn)
þ þ (small Thorn)
ß ss
Note: prior to Update 2014-12, upper and lower case Ø characters were mapped to a zero.
The upper and lower case IJ ligatures are mapped to the two letters I J.
Upper and lower case Letter L with Middle Dot are preserved ( Ŀ and ŀ).
Upper and lower case Œ ligatures converted to oe.
Accented W and Y characters are preserved (Ŵ ŵ Ŷ ŷ Ÿ ).
The ſ character (small letter “long s”) is preserved.
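The mapping step, including the null-character drop described earlier, can be sketched with a small lookup table. This table is an illustrative subset only; the full OTSE mappings cover many more characters and languages.

```python
# Illustrative subset of the folding table.
FOLD = {
    "ä": "ae", "æ": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "é": "e", "è": "e", "ñ": "n", "ç": "c",
    "'": "",  # a null ("to" = 0x00) mapping drops the character
}

def fold(text):
    """Lower-case the input, then apply the character mappings;
    unmapped characters pass through unchanged."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())
```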
Arabic Characters
There are special cases implemented for tokenization of Arabic character sets, which
improves the findability of Arabic words.
Step 1 is character mapping. The character mapping is extended to handle cases in
which multiple characters must be mapped as a group. These mappings are:
Step 3 is removal of WAW and ALEF-LAM prefixes, only if doing so leaves at least 2
characters remaining.
The final step is removal of HEH-ALEF and YEH-HEH suffixes, again only if at least 2
characters will remain in the token.
Note that Arabic tokenization was improved significantly starting with Update 2014-
12.
Tokenizer Ranges
Ranges define the primitive building blocks of characters, organizing them in logical
groups. Each range specification is comprised of Unicode characters and character
ranges, expressed in hexadecimal notation. For example, a range for the simple
numeric characters 0 through 9 would be:
number 0x30-0x39
In practice, there are multiple Unicode code points where numbers could be
represented, so a richer definition of a number might need to include Arabic numerals
(0x660-0x669), Devanagari numerals (0x966-0x96f) and similar representations from
other languages. You would probably also want to use the character mapping feature
to convert these all to the ASCII equivalents:
number 0x30-0x39 0x660-0x669 0x966-0x96f
• May or may not start with currency – currency would be a list of symbols such as
$ ¥ £ or €.
• May or may not start with a dash after the optional currency sign.
• Has one or more numbers (0-9) following the optional dash and currency.
• Has zero or more sets of nseparators (, and .) and numbers following the first
number.
In general, the regular expressions are greedy – matching the longest possible string.
The following operations on ranges are supported, and are applied following the range:
? Zero or one instances of the range
- Token matching this pattern is not valid, advance start pointer one
character and continue
The Tokenizer begins at a specific character, and attempts to find the longest valid
regular expression match. Once found, it takes the matching value as a word,
advances to the character following the match, and repeats. If no match is found, it
advances one character and repeats.
In general, regular expressions that you construct should be relatively lax. In the
currency example above, for instance, we do not enforce 3 digits between commas.
Erring on the side of indexing information rather than rejecting it is a good guideline.
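The matching loop described above can be sketched as follows. This is a simplified illustration, not the OTSE implementation; the pattern shown is a stand-in for the configured ranges:

```python
import re

def tokenize(text, pattern=re.compile(r"[A-Za-z]+|[0-9]+")):
    """Longest-match tokenization: take the longest regex match as a token
    and continue after it; if nothing matches, advance one character."""
    tokens, i = [], 0
    while i < len(text):
        m = pattern.match(text, i)
        if m and m.end() > i:
            tokens.append(m.group())
            i = m.end()   # advance past the match and repeat
        else:
            i += 1        # no match: advance one character and repeat
    return tokens

print(tokenize("red-bananas-26"))  # ['red', 'bananas', '26']
```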
}
Bigram indexing is the default behavior for these languages. Older versions of the
Search Engine indexed each East Asian character as a separate token. There is a
configuration setting in the [Link] file that can force use of the older method. This
may be useful if you have an older index, with significant East Asian character
content, that predates OTSE and that you do not wish to re-index.
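As an illustration of the bigram behavior, overlapping 2-character tokens can be produced like this (a sketch, not the engine's code):

```python
def bigrams(text):
    """Return overlapping 2-character tokens, as used for East Asian text."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("検索エンジン"))  # ['検索', '索エ', 'エン', 'ンジ', 'ジン']
```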
Tokenizer Options
If you are using the standard Tokenizer, the following options are available in
[Dataflow_xxx] section of the [Link] file:
TokenizerOptions=128
The default value is 0 (no options set). The options are a bit field, and can be added
together to combine values. The bit field values are:
1 : a dash character “-“ is counted as a standard character for words. The string
“red-bananas-26” would be indexed as a single token, instead of as the 3
consecutive tokens “red”, “bananas”, “26”.
2 : XML comments are indexed. By default, strings which fit the pattern for an
XML comment are stripped from the input. XML comments have the form
<!--any text in comment-->
4 : treat underscore characters “_” as separators. This would cause input such
as “My_house” to be indexed as two tokens, “my” and “house”. The default
would preserve this as a single token.
8 : special case handling to look for software version numbers of the form v2.0
and treat them as a single token.
16: treat the “at symbol” @ as a character in a word.
32: treat the Euro symbol as a character in a word.
128 : used to request the “older” method of indexing East Asian character strings
with each character as a separate token. The default indexes these strings as 2-
character “bi-grams”.
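Because the options are a bit field, combined settings are simple sums of the individual values. A sketch with illustrative flag names (the names are not part of the product, only the numeric values above are):

```python
# Illustrative flag names for the TokenizerOptions bit values listed above:
DASH_IN_WORD       = 1
INDEX_XML_COMMENTS = 2
UNDERSCORE_SPLITS  = 4
VERSION_NUMBERS    = 8
AT_IN_WORD         = 16
EURO_IN_WORD       = 32
OLD_EAST_ASIAN     = 128

# TokenizerOptions=133 would combine dash-in-word, underscore-splitting
# and the older East Asian indexing method (1 + 4 + 128).
options = DASH_IN_WORD | UNDERSCORE_SPLITS | OLD_EAST_ASIAN
print(options)                            # 133
print(bool(options & UNDERSCORE_SPLITS))  # True
```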
inputfile is the name of the file containing the data you wish to tokenize.
If inputfile contains “THIS is a TEßT”, the output would be of the form:
|THIS|this
|is|is
|a|a
|TEßT|tesst
Where the first value on each line represents the word tokens accepted by the regular
expression parser, and the second value represents the results after the character
mappings are applied.
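The “TEßT” → “tesst” result resembles Unicode case folding, which Python also applies via str.casefold(); this comparison is shown purely as an aside, not as the tokenizer's mechanism:

```python
# Unicode case folding maps the German sharp s (ß) to "ss", matching
# the character-mapping output shown above.
print("TEßT".casefold())  # tesst
```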
Sample Tokenizer
The following sample tokenizer file is similar to the default implementation. Indented
lines have been wrapped to fit the available space. In practice, lines should not be
broken.
ranges {
alpha 0x30-0x39 0x41-0x5a 0x5f 0x61-0x7a 0xc0-0xd6
0xd8-0xf6 0xf8-0x131 0x134-0x13e 0x141-0x148
0x14a-0x173 0x179-0x17e 0x384-0x386 0x388-0x38a
0x38c 0x38e-0x3a1 0x3a3-0x3ce 0x400-0x45f 0x5d0-0x5ea
0xFF10-0xFF19 0xFF21-0xFF3a 0xFF41-0xFF5a
number 0x30-0x39
numin 0x2c-0x2e
currency 0x24 0xfdfc
numstart 0x2d
alphain 0x5f
tagstart 0x3c
colon 0x3a
tagend 0x3e
slash 0x2f
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-0xfa6a
0xfa70-0xfad9 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-0xff9f
arabic 0x621-0x63a 0x640-0x655 0x660-0x669 0x670-0x6d3
0x6f0-0x6f9 0x6fa-0x6fc 0xFB50-0xFD3D 0xFD50-0xFDFB
0xFE70-0xFEFC 0x6d5 0x66e 0x66f 0x6e5 0x6e6 0x6ee 0x6ef
0x6ff 0xFDFD
indic 0x900-0x939 0x93C-0x94E 0x950-0x955 0x958-0x972
0x979-0x97F 0xA8E0-0xA8FB 0xC01-0xC03 0xC05-0xC0C
0xC0E-0xC10 0xC12-0xC28 0xC2A-0xC33 0xC35-0xC39
0xC3D-0xC44 0xC46-0xC48 0xC4A-0xC4D 0xC55 0xC56
0xC58 0xC59 0xC60-0xC63 0xC66-0xC6F 0xC78-0xC7F
0xB82 0xB83 0xB85-0xB8A 0xB8E-0xB90 0xB92-0xB95
0xB99 0xB9A 0xB9C 0xB9E 0xB9F 0xBA3 0xBA4
Metadata Tokenizers
The default configuration uses the full text tokenizer for text metadata regions. OTSE
supports the use of additional tokenizers for text metadata regions. There are 3
requirements to enable this: creating the tokenizer file; referencing the tokenizer file in
the [Link] file; and associating the tokenizer with a metadata region.
Adding or changing the tokenizer configuration for text metadata is possible. When
the search system is restarted, the text metadata stored values are used to rebuild the
text metadata index using the new tokenizer settings. This may require several hours
on large search grids. There are configuration settings that determine the behavior of
the rebuilding when the tokenizers are changed. The first setting is a failsafe to prevent
accidental conversion if the tokenizers are deleted or changed unintentionally. It
requires that today’s date be provided for the conversion to occur. Use the value “any”
to allow conversion any time the tokenizers are changed. The second setting
determines whether the conversion is applied to existing data, or only to new data.
Applying the change only to new data is usually not recommended, because it
produces inconsistent results, so the default value is true. In the [Dataflow_] section:
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true
The [Link] file is used to define where the search tokenizer files are located. In the
[Link] file, to add two metadata tokenizer files:
[Dataflow_]
RegExTokenizerFile2=c:/config/tokenizers/[Link]
RegExTokenizerFile3=c:/config/tokenizers/[Link]
Note that the additional tokenizer values start at the number 2. The first tokenizer entry
is always reserved for the full text tokenizer. The tokenizer definition files in this
example are located in the config/tokenizers directory, which is the recommended
location for tokenizer definition files.
The next step is to identify the text metadata regions which should use the enumerated
tokenizers. This is done as an optional extension to the text region definition in the
[Link] file:
The search engine would then apply the rules defined in [Link] to the region
OTPartNum, and the tokenizer rules in the file [Link] to RegionX. The
tokenizer files are constructed using the same rules as the default full text tokenizer.
somevaluehere, then encoded with 3-grams (som ome mev eva val alu lue
ueh ehe her ere).
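The 3-gram encoding shown can be sketched as a sliding window (illustrative only):

```python
def ngrams(text, n=3):
    """Return overlapping n-character windows of the input string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("somevaluehere"))
# ['som', 'ome', 'mev', 'eva', 'val', 'alu', 'lue', 'ueh', 'ehe', 'her', 'ere']
```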
mappings {
0x9=0x0
0xa=0x0
0xb=0x0
0xc=0x0
0xd=0x0
0xe=0x0
0xf=0x0
0x10=0x0
0x11=0x0
0x12=0x0
0x13=0x0
0x14=0x0
0x15=0x0
0x16=0x0
0x17=0x0
0x18=0x0
0x19=0x0
0x1a=0x0
0x1b=0x0
0x1c=0x0
0x1d=0x0
0x1e=0x0
0x1f=0x0
0x20=0x0
0x21=0x0
0x22=0x0
0x23=0x0
0xfffb=0x0
0xfffc=0x0
0xfffd=0x0
}
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}
words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}
Partition Sizes
Search for a partition name in the OTPartitionName region to get a count of the number
of objects stored in a given partition.
Metadata Corruption
Search for -1 in the region OTMetadataChecksum to identify whether the metadata for
any object is corrupt. This is only valid if the metadata checksum feature is enabled.
Query time and throughput varies based on many factors. The first step in optimizing
search query behavior is understanding how time is being consumed during search
queries. To help with this, the Search Federator keeps statistical information about
query performance, which is written to the Search Federator log once per hour. Using
this data, you can assess whether changes to the system or configuration are
improving or degrading search performance.
The data is written in tabular form, such that you can copy it and paste it into a
spreadsheet as Comma Separated Values to make analysis easier. The log entries
have this form, with leading time stamps and thread data omitted. The statistics
are not persisted between restarts, so the data starts at zero after every startup of the
search grid. This information is written when the log level is set to status level or higher.
Data on a given query is collected when the query completes, so queries that cross an
hour or day boundary are reported for the time when the query finished.
This data is also available on demand through the admin interface using the command:
getstatustext performance
Administration API
In addition to a socket-level interface to support search queries, the search
components have a socket-level interface that supports a number of administration
tasks. Each component honors a different set of commands, and in some cases
components respond to the same command with different information. Commands that
make sense for an Index Engine may be irrelevant for the Search Federator.
This section outlines the most common commands and the components to which they
apply. The client making the requests is also responsible for establishing a socket
connection to the component. The configuration of the port numbers for the sockets is
controlled in the [Link] file.
You do not need to use this API for management and maintenance. Applications such
as Content Server leverage the Administration API to hide details of administration and
provide unified administration interfaces.
The examples below use a > (prompt) symbol to represent the command(s), followed
by the response. White space has been added in responses for readability.
stop
Stops the process as soon as possible. Applies to all processes.
> stop
true
getstatustext
In the Index Engine, this command returns information about uptime, memory use and
the number of index operations performed:
> getstatustext
In the Search Federator, getStatusText returns summary information about uptime and
requests. In addition, this call is used to obtain detailed information about the current
status of each metadata region. In the “moveable” section, each region defined in the
index is listed along with a status indicating whether it is moveable. The moveable
status essentially identifies text regions, which can be moved to other storage modes
(DISK versus RAM storage, for example).
There are sections for ReadWrite, NoAdd and ReadOnly. In these sections, every text
(moveable) region is listed. In this example, the partition is in Read-Write mode, so
the regions are listed in the ReadWrite section. For each text region, the response
provides an estimate of the memory currently used by the region, and of the memory
that would be used if the region were changed to other storage modes. Note
that these are ESTIMATES and should not be used to accurately compute memory
requirements.
> getstatustext
With the Search Federator, a variation of getstatustext can be used to retrieve data
about search query performance. The interpretation of the values is outlined in the
section entitled “Query Time Analysis”.
> getstatustext performance
<performance>
<hours>
<hour>
<hourNumber>13</hourNumber>
<numQueries>1</numQueries>
<elapsed>71305</elapsed>
<execution>1149</execution>
<wait>70156</wait>
<SELECT>376</SELECT>
<RESULTS>773</RESULTS>
<FACETS>0</FACETS>
<HH>0</HH>
<STATS>0</STATS>
</hour>
<hour>
<hourNumber>12</hourNumber>
<numQueries>4</numQueries>
<elapsed>149954</elapsed>
<execution>100071</execution>
<wait>49883</wait>
<SELECT>99761</SELECT>
<RESULTS>201</RESULTS>
<FACETS>16</FACETS>
<HH>0</HH>
<STATS>93</STATS>
</hour>
</hours>
</performance>
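This XML is straightforward to post-process. A sketch using the Python standard library to compute the average elapsed time per query for each hour of the sample above (abbreviated to the fields used):

```python
import xml.etree.ElementTree as ET

# The sample "getstatustext performance" output above, abbreviated:
xml_text = """<performance><hours>
<hour><hourNumber>13</hourNumber><numQueries>1</numQueries><elapsed>71305</elapsed></hour>
<hour><hourNumber>12</hourNumber><numQueries>4</numQueries><elapsed>149954</elapsed></hour>
</hours></performance>"""

avg_ms = {}
for hour in ET.fromstring(xml_text).iter("hour"):
    n = int(hour.findtext("numQueries"))
    avg_ms[hour.findtext("hourNumber")] = int(hour.findtext("elapsed")) / n

print(avg_ms)  # {'13': 71305.0, '12': 37488.5}
```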
Similarly, the Update Distributor can provide accumulated statistics about indexing
throughput and errors with “getstatustext performance”. First introduced in 20.4, the
output is in XML form and includes the same data that is written to the logs on an hourly
basis.
<?xml version="1.0" encoding="UTF-8"?>
<performance>
<hours>
<hour>
<hourNumber>8</hourNumber>
<AddOrReplace>0</AddOrReplace>
<AddOrModify>0</AddOrModify>
<Delete>0</Delete>
<DeleteByQuery>0</DeleteByQuery>
<ModifyByQuery>0</ModifyByQuery>
<Modify>0</Modify>
…
Starting with the 2015-09 update, a new option for getstatustext returns a subset of
information more quickly. The “basic” variation reduces the time needed by Content Server
to display partition data. The subset of data was specifically selected to meet the
needs of the Content Server “partition map” administration page. When basic is used,
the status and size of partitions is retrieved from cached data, and only updated during
select indexing operations such as “end transaction”. While technically the information
could be slightly incorrect, it is accurate enough for practical purposes. If there is no
cached data, then the slower methods are used – querying each index engine for data.
For the Index Engines, there is new data in this response. Percentage full is presented
in two different ways: one for text metadata, and one for usage of the allocated space
on disk of the index. The Behaviour value represents the “soft” modes of a read/write
partition, such as update only or rebalancing. Sample responses from the other search
processes are shown below, returning the same codes as a “getstatuscode” command.
<?xml version="1.0" encoding="UTF-8"?>
<stats>
<UpDist1>
<status>135</status>
</UpDist1>
</stats>
getstatuscode
This function is used to determine if a process is ready, in error, or starting up. Starting
up is generally the status while an index is being loaded.
> getstatuscode
12
All Processes
12 Ready
133 Done
registerWithRMIRegistry
For all processes, this command forces a reconnection with the RMI Registry, and
reloads the remote process dependencies. This is useful for resynchronizing after some
types of configuration changes without needing to restart the processes. If the search
grid is configured to not use RMI, this command is ignored.
> registerWithRMIRegistry
received ack
checkpoint
The checkpoint function is issued to the Update Distributor to force all partitions to write
a checkpoint file. This is especially useful as part of a graceful shutdown process. If
large metalogs are configured, the time to replay the metalogs during startup can take
a long time. Forcing checkpoints shortly before shutdown eliminates metalogs and can
dramatically improve startup time. After issuing the checkpoint command, the Update
Distributor waits for a number to be provided. The number is a percentage,
representing the threshold over which a checkpoint should be written. For example, if
a checkpoint is normally written when metalogs reach 200 Mbytes, a value of 10 means
that a checkpoint should be immediately forced if the metalog has reached 20 Mbytes
in size. The same logic applies for other checkpoint triggers, such as number of new
objects or number of objects modified. Any value other than an integer from 0 to 99
will simply abort the command.
> checkpoint
> 10
true
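The threshold arithmetic described above can be sketched as follows (illustrative; the real triggers also include counts of new and modified objects):

```python
def should_force_checkpoint(metalog_bytes, normal_trigger_bytes, percent):
    """The supplied percentage scales the normal checkpoint trigger."""
    # Any value other than an integer from 0 to 99 aborts the command.
    if not isinstance(percent, int) or not 0 <= percent <= 99:
        return None
    # Example: 10% of a normal 200 MB trigger forces a checkpoint at 20 MB.
    return metalog_bytes >= normal_trigger_bytes * percent / 100

print(should_force_checkpoint(25_000_000, 200_000_000, 10))  # True
print(should_force_checkpoint(15_000_000, 200_000_000, 10))  # False
```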
reloadSettings
This command applies to all processes. Some, but not all, of the [Link] settings
can be applied while the processes are running, and some can only be applied when
the processes first start. This command requests that the process reload settings. A
list of reloadable settings is included near the end of this document.
> reloadSettings
received ack
getsystemvalue
Used to obtain specific values from the Index Engine. Currently, there are only two
keys defined. ConversionProgressPercent will return the percentage complete when
an index conversion is taking place. A “ping” operation to check that the process is
responding is also available. This command is different from the others in that it
requires two separate submissions, the first being the command and the second being
the key.
> getsystemvalue
> marco
polo
> getsystemvalue
> ConversionProgressPercent
36
addRegionsOrFields
This command applies to the Update Distributor only, and can be used to dynamically
add a region definition. Once added to an index, regions are generally sticky. The
[Link] file is not updated, so note that using this command may cause
a drift between the index and the [Link] file. This discrepancy is not a
problem, but should be kept in mind in support situations.
The syntax requires exactly one TAB character after the type and before the region
name. This command waits for additional lines of definitions until an empty line is sent,
which terminates the input mode. The function returns true on completion.
> addRegionsOrFields
> text flip
> integer flop
>
true
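A sketch of composing that request programmatically (the function name is hypothetical; note the single TAB between type and name, and the terminating empty line):

```python
def build_add_regions(regions):
    """Build an addRegionsOrFields request: one TAB-delimited line per
    region, terminated by an empty line that ends input mode."""
    lines = ["addRegionsOrFields"]
    lines += [f"{rtype}\t{name}" for rtype, name in regions]
    lines.append("")  # empty line terminates the input mode
    return "\n".join(lines) + "\n"

payload = build_add_regions([("text", "flip"), ("integer", "flop")])
print(repr(payload))
```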
runSearchAgents
Update Distributor only. Instructs the Update Distributor to run all of the search agents
which are currently defined against the entire index. Results are sent to the search
agent IPool.
> runsearchagents
true
runSearchAgent
Update Distributor only. Instructs the Update Distributor to run a specific search agent.
The search agent named must be correctly defined in the [Link] file. Results are
sent to the search agent IPool. This command expects one line with the search agent
after the command.
> runsearchagent
> bob
true
runSearchAgentOnUpdated
Update Distributor only. Instructs the Update Distributor to run the specific search
agents listed. Time is based on the values in the upDist.N file, and the timestamp is
updated (see Search Agent Scheduling). Requests are added to a queue and may
require some time to complete. Results are sent to the search agent IPool.
> runsearchagentonupdated
> MyAgentName
> AnotherAgent
true
runSearchAgentsOnUpdated
Update Distributor only. Instructs the Update Distributor to run all the search agents.
Time is based on the values in upDist.N file, and the timestamp is updated (see Search
Agent Scheduling). Requests are added to a queue and may require some time to
complete. Results are sent to the search agent IPool.
> runsearchagentsonupdated
true
Server Optimization
There are many performance tuning parameters available with OTSE. There is no
single perfect configuration that meets all requirements. You can optimize for indexing
performance or query performance. There are tradeoffs between memory and
performance, and many external parameters can affect the OTSE behavior. In this
section we examine some of the most common options for system tuning. The focus
here is on administration and configuration tuning, not on application optimization.
If your use of OTSE includes high volumes of indexing and metadata updates, then
fragmentation may occur more quickly. You can consider modifying the configuration
settings to run the defragmentation several times per day. While defragmentation is
happening, there will be short periods, typically a few seconds at a time, where search
query performance is degraded. In practice, we find that Low Memory Mode without
daily defragmentation provides the best indexing throughput.
The tuning parameters typically do not require adjustment unless you are experiencing
extraordinary levels of memory fragmentation. Within the search.ini_override
file, in the [DataFlow] section, the following settings can be added to make adjustments
if necessary:
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
Defragmentation times can be a list in 24 hour format (for example, 2:30;14:30) to run
multiple times per day. Space is the maximum temporary memory to consume while
defragmenting in MB; the larger the value, the faster defragmentation runs – up to a
limit based on the size of the largest region. To completely disable defragmentation,
set the DefragmentMemoryOptions value to 0. Setting the options value to 1 is not
recommended – it enables aggressive defragmentation, whereby all regions are
defragmented without relinquishing control to allow searches while defragmentation
occurs.
There are two other defragmentation settings that you will normally not need to adjust:
DefragmentMaxStaggerMinutes=60
DefragmentStaggerSeedToAppend=SEED
If you have multiple search partitions, each partition will randomly select a
defragmentation start time up to “MaxStaggerMinutes” after the specified daily
defragmentation time. The purpose of this is to distribute CPU load randomly if you
have many partitions. The SEED value is a string used to seed the random number,
and is available to change if for some reason the default string “SEED” produces start
times which cluster too tightly. It is unlikely you will need to provide an alternative
string.
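The staggering could plausibly be modeled like this; this is an assumption about the mechanics for illustration only, using a deterministic offset derived from the partition name and the seed string:

```python
import hashlib

def stagger_minutes(partition_name, seed="SEED", max_stagger=60):
    """Deterministic pseudo-random offset in [0, max_stagger) minutes,
    derived from the partition name plus the seed string."""
    digest = hashlib.sha256((partition_name + seed).encode()).digest()
    return int.from_bytes(digest[:4], "big") % max_stagger

for name in ("Partition1", "Partition2", "Partition3"):
    print(name, stagger_minutes(name))
```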
memory needed for other purposes, the practical upper limit for memory that can be
reserved for metadata is about 1 gigabyte. Customers using Content Server on
Solaris, which uses a 64 bit JVM, have reported success using larger partition sizes,
up to 3 gigabytes.
Assuming a 64 bit Java environment, such as Content Server 10.5 or 16, you can set
the partition sizes larger. Because of the number of variables, there is no simple
optimal size which is always correct. For systems which cannot contain the entire
index within a single partition, larger partition sizes are synonymous with fewer
partitions. Here are some of the tradeoffs:
• The memory overhead for a partition is more or less constant, regardless of the
partition size. Larger partitions are therefore more efficient in terms of memory
use, which can reduce the overall cost of hardware.
• During indexing, the Update Distributor will balance the load over the available
index engines. If high indexing performance is a key requirement, more partitions
may be preferable.
• For search queries returning small numbers of results (typical user searches),
fewer partitions are more efficient. This is typical of most Content Server
installations.
• Some specific types of queries are slow, and their performance depends on the
number of text values in the partition dictionary; smaller partitions are therefore
faster. If regular expression (complex pattern) queries on text values stored in
memory are common for your application, then smaller partitions may be a better
choice.
• A small partition would reserve about 1 gigabyte of RAM for metadata. A very
large partition would be about 8 gigabytes in size. Experimenting with intermediate
sizes before configuring a large partition is strongly recommended.
Currently, very conservative default values are used: 80% full for rebalancing and 77%
for the stop rebalancing threshold, which reflects the amount of memory typically used
by existing Content Server customers.
Selecting a suitable threshold for update-only mode requires a little more thought, and
depends upon your expected use of the search engine. The default value with Content
Server is a setting of 70%, which reserves 10% of the space for metadata changes.
Some considerations for adjusting this setting include:
• If your system has applications or custom modules known to add significant
new metadata to existing objects, you should allow more space for updates.
• Archival systems which rarely modify metadata can reduce the space
reserved for updates. Note that Content Server Records Management will
often update metadata when activities such as holds take place, even with
archive applications.
Note that these values are representative for traditional partitions with 1 GB of memory
for metadata. If you are using a larger partition, then reserving less space for updates
and rebalancing may be appropriate. The best practice is to periodically review the
percent full status of your partitions, and adjust the partition percent full thresholds
based upon your actual usage patterns.
The values in the [Link] file that define the various thresholds are:
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90
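With MaxMetadataSizeInMBytes=1000, the percentage thresholds above translate into absolute sizes as follows (simple arithmetic, shown for clarity):

```python
max_mb = 1000  # MaxMetadataSizeInMBytes
thresholds_pct = {"StartRebalancing": 99, "StopRebalancing": 96,
                  "StopAdd": 95, "Warn": 90}

# Convert each percent-full threshold to an absolute metadata size in MB.
thresholds_mb = {k: max_mb * v // 100 for k, v in thresholds_pct.items()}
print(thresholds_mb)
# StopAdd, for example, puts the partition into update-only mode at 950 MB.
```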
The times are cumulative since the Update Distributor was started. Each entry has the
form:
Category N ms (count).
Total Time Total uptime of the Update Distributor – this includes the start-up time
that is not included in any other category – hence it will be larger than
the sum of the other categories.
Start Transaction Time the Update Distributor spends waiting for the Index Engines to be
ready to start a transaction.
End Transaction Time the Update Distributor spends waiting for a transaction to end,
excluding time to write checkpoint files. Too much time in this category
may indicate an excessive amount of time is spent running search
agents (for Content Server, usually Intelligent Classification or
Prospectors).
Checkpoint Time the Update Distributor waits for the Index Engines to write
checkpoint files. Large percentages of time here suggest that
checkpoints are created too frequently, or the storage system is underpowered.
Metalog thresholds can be adjusted to reduce the frequency of checkpoint writes.
Local Update Time the Update Distributor is working with the Index Engines to update
the search index. This is useful time. It is common for this value to
remain below 15% of the time even when a system is performing well.
Global Update Time in which the Update Distributor is interrogating the Index Engines
prior to initiating the local update steps. A typical purpose is to establish
which Index Engine should receive a given indexing operation. Long
times here may indicate that Update Distributor batch sizes are too
small.
Idle The amount of time the update distributor is idle – it has completed all
the indexing it can, and is waiting for new updates to ingest. A high
percentage of time idle indicates that OTSE has additional capacity. If
indexing is slow and there is sufficient idle time, the bottlenecks likely
exist upstream in the indexing process (DCS, Extractors or DataFlow
processes). Note that you should always have some idle time, since the
demand on indexing throughput is not constant.
IPool Reading The amount of time the Update Distributor spends reading indexing
instructions from the disk. In general, this should be relatively small
compared to measurements such as Local Updates. If not, it may
indicate poor disk performance for the disk hosting the input IPools.
Batch Processing The amount of time spent planning how to proceed with the local
update. This value should be very small as a percentage of global update time.
Start Transaction Older systems, using RMI mode could not differentiate between time
and Checkpoint spent writing checkpoints and time spent on starting a transaction.
Therefore on these systems those two operations are grouped into a
single category. A properly configured system should have a value of 0
in this field.
Search Agents Time spent running search agent queries. Does not apply when
configured to use the older method of running agents after every index
transaction.
Network Problems The values NetIO1 through NetIO5 capture the number of times 1 to 5
retries were needed to read or write to network IO. The NetIOFailed
counts the number of times IO failed after 5 retries.
Because the characteristics of Low Memory mode are different, these values can be
adjusted upwards significantly, perhaps to 100 MB, or 50,000 new objects or 10,000
objects modified. In order to maintain backwards compatibility and mixed mode
operation, OTSE has a separate set of Checkpoint Threshold configuration settings for
Low Memory Mode:
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000
Throughput normally increases with larger values because the number of times that
Checkpoints are created decreases. At the same time, this increases the likelihood
that many partitions will need to create checkpoint files at the same time. This may
place a high load on your disk system, and stall indexing for longer periods when
Checkpoint writes happen.
Larger values mean that more data is kept in the metalog and accumlog files instead
of in the Checkpoint. Larger metalog files require more time to consume during the
startup process for Index Engines or Search Engines. In most cases, this is a one-time
penalty and is acceptable.
When checkpoints are written, the Update Distributor writes lines to the log file that
indicate progress against each of the three configuration thresholds for each partition
that will write a checkpoint. Reviewing these lines can help you understand where
adjustments may be appropriate. The log lines look like this:
Set the CheckMode to 1 to enable use of metadata Merge File mode. The LogSize
determines how large the CheckLog files may become before a merge operation is
triggered, and defaults to 512 MBytes. The MergeThreadInterval determines how often
the Index Engines check to see if a merge should be performed, with a default of 10
seconds. The MemoryOptions default is optimized to minimize memory use; setting this
value to 1 uses perhaps 100 MB of additional RAM per partition for a relatively small
performance increase while performing merge operations.
Index Batch Sizes
The Update Distributor breaks input IPools into smaller batches for delivery to Index
Engines. The default is a batch size of 100. For Low Memory mode, this can be higher,
perhaps 500. Since the batch size is distributed across all the Index Engines that are
currently accepting new objects, the batch size can be further increased if you have
many partitions. A guideline might be 500 + 50 per partition. Larger batches result in
less transaction overhead.
[Update Distributor section]
MaxItemsInUpdateBatch=500
Note that the batch size is also limited by the number of items in an IPool. Often, the
default Content Server maximum size for IPools is about 1000, so this may also need
to be modified to take full advantage of increases in the Update Distributor batch size.
Starting with 20.3, batches are also split when the total size of the metadata plus text
in the objects to be indexed exceeds a defined threshold. The default is 10 MB, but
can be set higher if indexing large objects is common. This has been seen when
indexing email that has distribution lists with thousands of recipients. In the [Dataflow_]
section:
MaxBatchSizeInBytes=20000000
Prior to 20.3 the splitting of batches based on size used a different approach, where
the total size of the metadata of the objects in the batch cannot exceed half of the
content truncation size (typically 5 MB).
There is another configuration setting that enables an optimization added in 16.2.2
related to how batches are handled. When processing ModifyByQuery or
DeleteByQuery operations, each request is sent to every Index Engine separately. In
practice, there are often many such contiguous operations in an IPool. The
optimization bundles these contiguous operations into a single communication to each
Index Engine, reducing the coordination overhead. By default, this optimization is
enabled, and can be controlled in the [DataFlow] section of the [Link] file:
GroupLocalUpdates=true
Partition Biasing
Research has shown that there is a strong correlation between the number of partitions
used for indexing and the typical indexing throughput rate. As expected, more
partitions improve parallel operation and increase throughput. However, the
transaction overhead per partition is relatively fixed, and the batch sizes become
fragmented into small batches when the operations are distributed to many partitions.
The default value is 0, which disables partition biasing. Biasing only applies to new
objects being indexed. Updates to existing objects are always sent to the partition that
contains the object, regardless of biasing. For biasing purposes, a partition is
considered “full” when it reaches its “update only” percent full setting. The algorithm
for distributing new objects across active partitions is based upon sending objects with
approximately similar total sizes of full text and text metadata.
During an indexing performance test at HP labs in the summer of 2013, a brief test of
indexing throughput versus the number of partitions was performed. At the time, the
index contained about 46 million objects. There was plenty of spare CPU capacity,
and a very fast SAN was used for the index. In this particular test, the throughput
peaked around 12 partitions.
Parallel Checkpoints
Another index throughput adjustment setting is control over parallel checkpoints.
When a partition completes an indexing batch, it checks to see if the conditions for
writing a Checkpoint have been met. If so, then all partitions are given the opportunity
to write Checkpoint files. The logic is that if at least one partition is stalled anyway,
then any partition that might need to write a Checkpoint soon should do it now.
However, when large numbers of partitions write Checkpoints at once, you may
saturate disk or CPU capacity, causing dramatic performance degradation. The
parallel Checkpoint control lets you specify the maximum number of partitions that
are allowed to write a Checkpoint at the same moment. If more need to write
Checkpoints, they must wait until a slot is freed up by a Checkpoint write completing
in another partition. You should only need to adjust this setting if thrashing due to
Checkpoint writing is suspected as a problem. The control is disabled by default, and
is set in the [Dataflow] section of the [Link] file:
MaximumParallelCheckpoints=8
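The slot behavior described above maps naturally onto a counting semaphore, as in this sketch. The names and structure are illustrative, not actual OTSE internals.

```python
# Sketch: cap the number of partitions writing a Checkpoint at the same
# moment with a counting semaphore, mirroring MaximumParallelCheckpoints.
import threading
import time

MAX_PARALLEL_CHECKPOINTS = 2
checkpoint_slots = threading.Semaphore(MAX_PARALLEL_CHECKPOINTS)
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()

def write_checkpoint(partition_id):
    with checkpoint_slots:              # block until a slot is free
        with state_lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)                # stand-in for the Checkpoint write
        with state_lock:
            state["active"] -= 1

# eight partitions all want to write a Checkpoint at the same moment
threads = [threading.Thread(target=write_checkpoint, args=(p,))
           for p in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# state["peak"] never exceeds MAX_PARALLEL_CHECKPOINTS
```

The remaining partitions simply queue on the semaphore, which is exactly the "wait until a slot is freed up" behavior described above.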
A further optimization was added in version 20.4, in which a quick single-token search
for the data ID is performed to get a short list of objects, which are then tested for the
phrase match. This is considerably faster, since phrase searches are much slower
than single-token searches. This fast lookup can be disabled if necessary in the
[Dataflow_] section of the [Link] file:
DisableDataIdPhraseOpt=true
Compressed Communications
There is a configurable option in OTSE that allows the content data sent from the
Update Distributor to the Index Engines to be compressed. For systems which have
excess CPU capacity and slow networking to the Index Engines, enabling this option
can improve indexing throughput. Most systems do not have this performance profile,
so the feature is disabled by default. The threshold setting determines the minimum
size of full text content that needs to be present before the compression is triggered
for a specific object. Note that compression also requires additional memory. The
memory requirement varies based upon the maximum size of the text content, and for
a system with a content truncation size of 10 MB an Index Engine would consume
another 12 MB of RAM. In the [Dataflow_] section:
CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535
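The threshold logic can be sketched as a simple size gate. The setting names come from the configuration above; the surrounding code and the choice of deflate compression are assumptions for illustration only.

```python
# Sketch: compress full-text content sent to the Index Engines only when
# the feature is enabled and the content meets the size threshold.
import zlib

COMPRESS_ENABLED = True        # CompressContentInLocalUpdate
THRESHOLD_BYTES = 65535        # CompressContentInLocalUpdateThresholdInBytes

def maybe_compress(content: bytes):
    """Return a (tag, payload) pair; small content passes through raw."""
    if COMPRESS_ENABLED and len(content) >= THRESHOLD_BYTES:
        return ("deflate", zlib.compress(content))
    return ("raw", content)

small = b"x" * 100
large = b"the quick brown fox " * 5000   # ~100 KB of repetitive text

tag_small, _ = maybe_compress(small)
tag_large, payload = maybe_compress(large)
```

This illustrates why the feature trades CPU and memory for network bandwidth: only objects with substantial full-text content pay the compression cost.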
Data Storage Optimization
Regions are encoded for storage using a data structure that contains look-ahead
pointers which allow traversing the list quickly. Beginning with 21.2, this “skip list” has
been extended to support longer skips, allowing faster search and retrieval at the cost
of a relatively small amount of additional memory. There are configuration
settings to enable and tune this behavior, with a default of “on” and skip values of 4096.
[IndexMaker]
UseLongSkips=true
LongSkipInterval=4096
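The look-ahead idea can be illustrated with a minimal skip-pointer search over a sorted postings list. This is a toy model (a skip stride of 4 rather than 4096, and an in-memory list rather than OTSE's on-disk structure), but the traversal pattern is the same: take long jumps while the skipped-to entry is still at or below the target, then scan linearly within the final stride.

```python
# Minimal sketch of skip pointers over a sorted postings list.

def find(postings, target, skip=4):
    """Return True if target is in the sorted list, using look-ahead skips."""
    i = 0
    # take long skips while the skipped-to entry does not overshoot
    while i + skip < len(postings) and postings[i + skip] <= target:
        i += skip
    # linear scan within the final stride
    while i < len(postings) and postings[i] < target:
        i += 1
    return i < len(postings) and postings[i] == target

postings = list(range(0, 1000, 5))   # 0, 5, 10, ..., 995
```

With a stride of 4, locating 995 in this 200-entry list touches roughly 50 skip entries plus at most 4 scanned entries, instead of scanning all 200; longer strides reduce the skip count further on large lists, which is the point of the 21.2 change.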
Scanning Long Lists
There is a specific optimization available for updates to text metadata in partitions not
using Low Memory mode. Low Memory mode uses different data structures and does
not exhibit this behavior.
If metadata updates are applied to metadata values where many objects have the
same value, the update operation can be extremely slow. For example, the
“OTCurrentVersion” region may have 1 million objects with the value “true”. Updates
to this field would be very slow.
The optimization makes these updates fast, but requires additional memory. Because
many customers with this configuration have full partitions, they cannot tolerate extra
memory requirements, so the default is for the optimization to be disabled (a value of
0). The configuration setting specifies the distance between known synchronization
points in the data structure. Values of about 2000 perform well; values below 500
become memory-intensive. In the [Dataflow] section:
TextIndexSynchronizationPointGap=2000
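The role of the synchronization points can be sketched as follows: by recording a (value, index) pair every `gap` entries, an update can seek close to its target instead of scanning a long list of identical values from the head. The names and structure here are assumptions for illustration, with a tiny gap rather than the recommended 2000.

```python
# Illustrative sketch of synchronization points in a long postings list.

GAP = 4  # stand-in for TextIndexSynchronizationPointGap

def build_sync_points(postings, gap=GAP):
    """Record (value, index) pairs every `gap` entries."""
    return [(postings[i], i) for i in range(0, len(postings), gap)]

def locate(postings, sync_points, target):
    """Find target's index, scanning only from the nearest sync point."""
    start = 0
    for value, index in sync_points:
        if value <= target:
            start = index
        else:
            break
    for i in range(start, len(postings)):
        if postings[i] == target:
            return i
    return -1

postings = list(range(0, 100, 2))       # 50 entries: 0, 2, ..., 98
sync = build_sync_points(postings)
```

The memory trade-off is visible in the gap: halving the gap doubles the number of recorded points, which is why small values (below about 500) become memory-intensive.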
scenarios, about 2 KB of metadata and 31 KB of full text content. Each batch added
a net increase of about 2 million new objects, although a mixture of metadata updates
and deletes were also included in each batch to simulate real-world behavior with
Content Server. The number of partitions was nominally 8, although variations were
tested. The chart on the next page provides a summary, with commentary below.
The test was seeded with an 8-partition index of about 14 million items. Initially, 12 to
16 partitions were enabled. After each batch of 2 million items was ingested, the
performance was reviewed and occasionally changes made to the configuration of
hardware or the index.
Below 50 million items in the index, an important observation is that the Update
Distributor does not appear to be a bottleneck, despite all data for all Index Engines
passing through the Update Distributor. We see many data points where the overall
throughput exceeds 100 items per second, which would be in the neighborhood of 8
million objects per day.
Once we had confirmed that performance with 16 partitions was relatively high, we
adjusted the number of partitions down to 8, to focus on building larger partitions in the
available lab time. As expected, the throughput with 8 partitions is significantly lower.
By the end of the test, the 8 partitions contained indexes of 10 million objects each. At
this size, the indexing throughput had decreased to just under 30 objects per second.
This is roughly 2.5 million objects per day, before allowing any headroom for
downtime or spikes.
Some interesting data points:
• At about 94 million objects, we enabled more active partitions and observed
that much higher ingestion rates were still possible.
• Around the 30 million object mark, a faulty network card was replaced,
resulting in a material jump in performance.
• During one interval we duplicated the exact same test on the same hardware,
running concurrently. Our indexing tests were not fully engaging the capacity
of the HP hardware, generally staying below 30% CPU use. Doubling the
indexing load on the hardware resulted in dropping the throughput from about
40 to about 30 objects per second for the observed test, although we did
manage to get a peak CPU use above 60%. The duplicate concurrent test
had similar performance characteristics. It would appear that the HP
environment has capacity for a much larger index than we tested, or could also
be used for other purposes such as the Document Conversion Server.
• We disabled CPU hyper-threading for two runs, which reduced throughput
again from about 40 to 30 new objects per second. Lesson learned: leave
hyper-threading enabled for Intel CPUs.
What about searching? Search load tests from within Content Server were performed
concurrently while indexing was occurring. As expected, search became slower as the
index size increased. By test end, with 100 million items and indexing 40 objects per
second, simple keyword searches from the search bar averaged less than 3 seconds,
and advanced search queries about 6 seconds, including search facets. This is not
the search engine time, but the overall time including Content Server.
Does this ingestion case study have relevance for even larger systems? Yes. The
indexing throughput we measured is based on the number of “active” partitions, using
partition biasing. Eventually, you may have many more partitions, but by biasing
indexing to a limited subset, the indexing throughput can be modeled along the lines
seen in this example.
As a final note, this test was performed using Search Engine 10.0 Update 11. A
number of performance improvements, in particular for high ingestion rates, have
been implemented since this test was performed. Consider these data points to
be conservative.
Re-Indexing
Although OTSE has many features that provide upgrade capability and in-place data
correction, there are times when you may want to completely re-index your data set.
If you have a small index, re-indexing is fast and easy. For larger indexes, there are
some performance considerations.
It is faster to rebuild from an empty index than to re-index over existing data. There
are several reasons for this. Firstly, the checkpoint writing process slows down as the
index becomes larger, since there is more data to write to disk. When starting fresh,
the early checkpoint writing overhead is very small. Modifying values is also more
expensive than adding values – searching for existing values, removing them, and
adding new values to the structure is slower than simply adding data to a structure.
Another key factor is the metalog update rules. In particular, the default checkpoint
write threshold is lower for updates than it is for adding new items to the index. This is
a reasonable value during normal operation, but when a complete re-index is in
progress and all objects are being modified, this setting will result in a high checkpoint
overhead. A purge and re-index avoids this problem entirely. If re-indexing very large
data sets, increasing the checkpoint threshold for replace operations may be a useful
strategy.
below a few typical updates per second (yes – per second. Depending upon the
situation, an Index Engine is capable of indexing 50 or more objects per second).
For maximizing indexing throughput, disk performance is a key parameter, since disk
I/O is usually the limiting factor. Using several sample test setups on similar (but not
identical) configurations in 2012, we measured indexing times with 4 partitions of:
390 Minutes with a single good SCSI hard disk installed in the computer.
5000+ Minutes attached to a busy NFS storage array shared with other
applications over a 10 Gb network connection, running on VMware ESX.
Read that last one again. You really can configure disk storage that will reduce the
performance of OTSE by a factor of 20 or more. Disk fragmentation also has an
impact. On Windows, we typically see a 20% indexing performance drop between a
pristine disk and one with 60% file fragmentation.
Note that the caching features of some SANs are too aggressive, and can report
incorrect information about file locking and update times.
Customers using basic Network Attached Storage such as file shares generally report
poor search performance. In general, storing the search index on a network file share
will give very poor results.
The incidence of network errors that customers experience when using either SAN or
NAS is surprisingly high. OTSE has relatively robust error detection and retries for
these cases, but failure of the search grid due to network errors is still possible. When
using any type of network storage for the index, monitoring the network for errors is a
good practice that may prevent a lot of frustration due to intermittent errors.
A dedicated physical high performance disk system will usually outperform a network
attached disk system. However, a SAN with high bandwidth often has other benefits,
such as high availability, which make it attractive. If you are configuring a SAN for
use with search, treat the search engine like a database. The performance of the disk
system is almost always the limiting factor in performance.
Any type of network storage is acceptable for index backups. In fact, backing up the
index onto a different physical system is generally recommended.
Finally, a word about Solid State Disks (SSD). SSDs are gaining acceptance for high
performance enterprise storage. The characteristics of fast SSD are a good fit for
search engines. Given the large number of small random access reads that occur
when searching, SSD storage is an excellent choice for maximizing search query
performance. Indexing performance is not as dramatically affected, since the Index
Engines are generally optimized to read and write data in larger sequential blocks.
However, even with indexing, the highest indexing throughputs we have measured in
our labs occurred with local SSD storage for the index, around 1 million objects indexed
per hour. If you need to improve the query performance or indexing throughput,
investing in good SSD storage media for the index is likely the best hardware
investment you can make.
NetIO5 for retries, and NetIOFailed for failures. The counts are also included in a
“getstatustext performance” query to the admin port.
Retries and failures indicate problems in the environment and may include unreliable
network cards, bad cables, port conflicts, or virus/port scanners.
By default, recording of the network quality metrics is enabled, and can be disabled in
the [Dataflow_] section of the configuration file by setting the value to false:
LogNetworkIOStatistics=true
In addition to the times, the number of disk errors that occur and the number of retries
needed to succeed are recorded. If errors exist, an additional line of this form will be
written:
Disk IO Retries Needed. 1 (7). 2 (6). 3 (8). 4 (2). 5+ (22).
failed (17).
For example, this entry indicates that on 7 occasions, 1 error/retry was required. On
22 occasions 5 or more retries were attempted, and 17 times the disk I/O failed even
with retries.
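A histogram like the one above could be accumulated with a simple bucketing scheme, sketched below. The bucket layout (1 through 4, 5+, failed) follows the reported log format; the code itself is an assumption, not OTSE internals.

```python
# Illustrative sketch: bucket disk I/O operations by retries needed and
# format a summary line matching the log entry shown above.
from collections import Counter

def record_io(histogram, retries, failed=False):
    """Record one disk I/O operation that needed `retries` retries."""
    if failed:
        histogram["failed"] += 1
    elif retries >= 5:
        histogram["5+"] += 1
    elif retries >= 1:
        histogram[str(retries)] += 1

def format_line(histogram):
    parts = [f"{b} ({histogram[b]})" for b in ["1", "2", "3", "4", "5+"]]
    return ("Disk IO Retries Needed. " + ". ".join(parts) +
            f". failed ({histogram['failed']}).")

hist = Counter()
for retries in [1, 1, 2, 5, 9]:
    record_io(hist, retries)
record_io(hist, 5, failed=True)
line = format_line(hist)
```

Reading such a line the same way as the example above: two operations needed 1 retry, two needed 5 or more, and one failed even with retries.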
Similarly, the Search Engine reports performance for selected disk operations, writing
entries of this form:
Disk IO Counters. Read Bytes 112711231. Write Bytes 0.
By default, reporting of this data is enabled and is written every 25 transactions. The
feature can be disabled and the frequency of reporting can be controlled in the
[Dataflow_] section of the [Link] file:
LogDiskIOTimings=true
LogDiskIOPeriod=25
Checkpoint Compression
There is an optional feature in OTSE that allows Checkpoint files to be compressed.
Checkpoint files can be large, over 1 GB as you exceed 1 million objects in a partition.
New Checkpoint files are written from time to time, usually by all partitions at once,
which can place a significant burden on the disk system.
The compression feature is disabled by default since, in a simple system with a single
spinning disk, compression makes Checkpoint writing CPU bound, and indexing
throughput may decrease by 10% to 15%. However, if you have a system which is
limited by disk bandwidth rather than CPU, then enabling Checkpoint compression
may be a good choice, and actually increase indexing performance. The compression
feature generally reduces the size of Checkpoint files by about 60%. Compression is
enabled in the [Dataflow_] section of the [Link] file:
UseCompressedCheckpoints=true
Chunk Size
Some storage systems are sensitive to the chunk size when reading or writing data.
The default is 8192. Although normally we recommend the Java default of 32768, this
parameter can be forced to a smaller maximum value if necessary in the [DataFlow]
section of the [Link] file:
IOChunkBufferSize=8192
Query Parallelism
The Search Federator asks each Search Engine to return results. There are two key
performance tuning values in this process in the [Link] file. The first is how
aggressive the Search Federator will be with respect to asking Search Engines to pre-
fetch results to keep the Search Federator result merging queue full. The default value
of 0 is used to pre-fetch as much as possible, measured in terms of Search Engine
result blocks. Setting this number higher will delay pre-fetching, which can reduce the
number of results fetched but introduces delays into result retrieval. For example, a
value of 3 will wait until a Search Engine has been asked for 3 blocks of results before
beginning to pre-fetch results.
MergeSortCacheThreshold=3
The other parameter is the number of results a Search Engine fetches each time the
Search Federator asks for a set of results. The default value is 50. Larger values are
more efficient when the typical query is for many results. Smaller values are more
efficient for typical relevance-driven queries. In general, if using the preload above, a
value of 20 to 50 is likely optimal, and reduces the potential load on the disk system.
MergeSortChunkSize=50
These values are multiplicative with the number of partitions. For example, if you have
8 partitions and a MergeSortChunkSize of 250, then the MINIMUM number of results
that the Search Engines together will provide to the Search Federator is 2000. Keeping
the MergeSortChunkSize value low for systems with many partitions is recommended.
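The multiplicative floor above reduces to one line of arithmetic, since every partition's Search Engine returns at least one chunk of results. The function name is illustrative.

```python
# The minimum-results floor: each of the partitions' Search Engines
# returns at least one chunk to the Search Federator.

def min_results_fetched(num_partitions, merge_sort_chunk_size):
    """Minimum results the Search Engines collectively provide."""
    return num_partitions * merge_sort_chunk_size

floor_large_chunks = min_results_fetched(8, 250)   # the example above
floor_default = min_results_fetched(8, 50)         # default chunk size
```

With the default chunk size of 50, the same 8-partition system has a floor of 400 results rather than 2000, which is why keeping the chunk size low matters as partition counts grow.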
Throttling Indexing
In some environments, it may be the case that indexing operations are creating
metalogs faster than they can be consumed by the search engines. There is an upper
limit on how many unprocessed metalog files are acceptable, which can be adjusted if
necessary should Search Engines chronically lag behind the Index Engines. This can
happen in environments in which long-running search queries tie up the Search
Engines at the same time that high indexing rates are occurring. In some cases this
problem can be resolved by configuring Search Federator caching. When this limit is
reached the indexing updates will pause to allow the Search Engines to close the gap.
AllowedNumConfigs=200
In situations where queries are constantly running, it may be necessary to force a
pause in processing search queries in order to give the Search Engines an opportunity
to consume the index changes. There are two settings to control this, one that specifies
the maximum time that queries are allowed to run continuously (thus blocking updates),
and the other is the duration of the pause which is injected into searching. By default,
this feature is disabled.
[SearchFederator_xxx]
BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInMS=30000
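The blocking window can be sketched as a simple gate in front of query admission. The class, method names, and shortened times are assumptions for illustration; the real controls are the two settings shown above.

```python
# Sketch: once queries have run continuously past the configured window,
# a new search pauses so the Search Engines can consume index changes.
import time

BLOCK_AFTER_MS = 100     # stand-in for BlockNewSearchesAfterTimeInMS
PAUSE_MS = 50            # stand-in for PauseTimeForIndexUpdatingInMS

class QueryGate:
    def __init__(self):
        self.window_start = time.monotonic()

    def admit(self):
        elapsed_ms = (time.monotonic() - self.window_start) * 1000
        if elapsed_ms >= BLOCK_AFTER_MS:
            time.sleep(PAUSE_MS / 1000)  # let pending index updates apply
            self.window_start = time.monotonic()
            return "paused-then-admitted"
        return "admitted"

gate = QueryGate()
first = gate.admit()             # inside the window: runs immediately
time.sleep(0.15)                 # queries have now run past the window
second = gate.admit()            # forced pause before this one proceeds
```

Setting the block time to 0, as in the default configuration above, disables the gate entirely: queries are never forced to pause.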
The truncation size will also need to be adjusted upwards from about 10 MB to the
desired size, perhaps 210 MB. The timeouts for the Index Engines may also need to
be increased. Changes to settings in the Document Conversion Server will also be
required, including allocating more memory, adjusting truncation limits, and providing
much longer timeout values for processing formats.
Virtual Machines
In principle, virtual machines should be indistinguishable from physical computers
from the perspective of the software. In practice, problems occasionally arise when
running software in a virtual environment. OTSE is known to operate with VMware
ESX, Microsoft Hyper-V, and Solaris Zones. However, OpenText cannot realistically
test and certify every possible combination of hardware and virtual
environment, and there may be configurations of these virtual environments that
OpenText has not encountered which might be incompatible with the search grid.
The most important point is this: virtual machines do NOT reduce the size of the
hardware you need to successfully operate a search grid. If anything, operating a
search grid in a virtual environment will require MORE hardware to achieve the same
performance levels, when measured in terms of memory and CPU cores/speed.
For small installations of the search grid where performance issues are not a factor, a
virtual environment can be attractive. However, as your system increases in size to
require many partitions, be aware that a virtual environment may be more costly than
a physical environment for the search grid, which needs to be considered against VM
benefits such as simplified deployment and management. Consider a search engine
as being analogous to a database. For larger or performance-intensive database
applications, the database is often left on bare metal, even if the remainder of an
application is virtualized. The Search Engine has performance characteristics similar
to a database and it may make sense to leave the Search Engine on dedicated
hardware.
One example of a limitation we have seen is virtual machines in a Windows server
environment. In some cases, the I/O stack space is not sufficient once the extra VM
layers are introduced, and tuning of the Windows settings to increase I/O resources
becomes necessary.
As with most applications deployed in a virtual environment, the software runs slower.
The change in performance depends on many factors, but a 10% to 15% performance
penalty is not uncommon.
We have also seen instances in which the memory used by Java in a VM environment
is reported as much higher than the equivalent situation on bare hardware. In practice,
the actual memory in use is very similar, but the reported values can differ wildly. Often,
over a period of many hours, the reported VM memory will decline and converge on
memory consumption reported on a bare hardware environment.
Garbage Collection
The Java Virtual Machine will generally try to optimize the number of threads it
allocates to Garbage Collection. However, it is not always correct. For example, when
running in a Solaris Zones environment, the “SmartSharing” feature of Zones can
trigger the Java Garbage Collector to allocate very large numbers of threads and
memory resources, which in Zones may be manifested as Solaris Light Weight
Processes (LWPs).
If the number of threads on a system allocated to Garbage Collection seems unusually
large, you likely need to place a limit on the number of Garbage Collection threads,
which can be done by modifying the Java command line to add
-XX:ParallelGCThreads=N, where N is the maximum number of threads. Selecting N
may require experimentation, but values on the order of 8 are typical for a system with
8 partitions, and values over 16 may provide little or no incremental value.
File Monitoring
Some tools that monitor file systems can cause contention for file access. One known
example of this is Windows Explorer. If you browse to a folder used by SE 10.5 to
represent the search index using Windows Explorer, then you will likely cause file I/O
errors and a failure of the search system.
Virus Scanning
The performance impact of virus scanning applications on the search grid is
catastrophic because of the intense disk activity that the search grid performs. In some
cases, file lock contention can also cause failure or corruption of the index. You must
ensure that virus scanning applications are disabled on all search grid file I/O. The
search system only indexes data provided by other applications. If virus scanning is
necessary, then scanning the data as it is added to the controlling application (such as
Content Server) is the recommended approach.
Related to this, we see virus scanners now offering port scanning features as well.
Like virus scanners, we have found that port scanners can significantly reduce
performance or cause failure of the software.
Thread Management
OTSE makes extensive use of the multi-threading capabilities of Java. In general, this
leads to performance improvements when the CPUs have threads available. However,
for very large search grids with over 100 search partitions, the number of threads
requested by OTSE may exceed the default configuration values for specific operating
systems. Depending upon the operating system, it is usually possible to increase the
limits for the number of usable threads. This problem is less likely to occur when
running with socket connections instead of RMI connections.
Configuring an operating system to permit more threads for a single Java application
is beyond the scope of this document, and may also include tuning memory allocation
parameters for the JRE. The objective here is simply to make you aware that additional
system tuning outside the parameters of OTSE may be necessary.
Scalability
This section explores various approaches to scaling OTSE for performance or high
availability. OTSE does not incorporate specific scalability features. Instead, by
leveraging standard methods for system scalability with an understanding of how the
search grid functions, we can illustrate some typical approaches to search scalability.
Query Availability
The majority of customers that desire high availability are generally concerned with
search query performance and uptime. Usually, this is tackled by running parallel sets
of the Search Federators and Search Engines in ‘silos’, with a shared search index
stored on a high availability file system, as illustrated below:
To obtain the benefit of high availability, the search silos should be located on separate
physical hardware in order to tolerate equipment failure.
Search queries are not stateless transactions; they consist of a sequence of operations
– open a connection, issue a query, fetch results, and close the connection. Because
of this, simple load balancing solutions cannot easily be used as a front end for multiple
search federators. Instead, the application issuing search queries should have the
ability to direct entire query sequences to the appropriate silo and Search Federator.
Content Server is one such application. If multiple silos are configured, search queries
will be issued to each one alternately. In the event that one silo stops responding,
Content Server will remove that target from the query rotation. Refer to the Content
Server search administration documentation for more information.
In this configuration, the Search Engines share access to a single search index. This
works because Search Engines are “read only” services which lock files that are in
use. All changes to the Search Index files are performed by the Index Engines. When
a Search Engine is using an index file, it keeps a file handle open – effectively locking
it. The Index Engines will not remove an index file until all Search Engines remove
their locks on a fragment. Because these locks are based on file handles in the
operating system, a Search Engine which crashes will not leave locks on files.
When Search Engines start, they load their status from the latest current checkpoint
and index files, and apply incremental changes from the accumlog and metalog files.
Because of this, no special steps are needed to ensure that Search Engines in each
silo are synchronized. They will automatically synchronize to the current version of
the index.
It is possible for an identical query sent to each silo at the same time to have minor
differences in the search results. The differences are rare, probably small, and short
lived – and would not be noticed or important for most applications. These potential
variances arise due to race conditions. The Search Engines in each silo update their
data autonomously. When an Index Engine updates the index files, perhaps adding or
modifying a number of objects, the Search Engines will independently detect the
change and update their data. For a short period of time, a given update to the search
index may be reflected in one of the search silos but not the other.
This approach to high availability for queries also allows many search grid maintenance
tasks to be performed on Search Federators or Search Engines without disrupting
search query availability. By stopping one silo, performing maintenance, restarting the
silo, and then repeating the process with the other silo, user queries are not impacted
throughout the process. Note that some administration tasks which change
fundamental configuration settings may not be possible without service interruption.
An additional benefit of parallel silos is search throughput. Since applications such as
Content Server can distribute the query load across multiple silos, the overall search
performance might be higher. This will not be the case if the hardware on which the
search index is stored is a performance bottleneck, particularly the disk which is shared
by each silo.
For correct operation, each silo must have identical configuration settings. If you have
hand-edited any of the configuration files, you must ensure this is properly reflected on
both silos.
settings and external clustering hardware or software. The general principle is that two
completely separate search grids are created, the indexing workflow is split and
duplicated, and the indexes are independently created and managed. This is an
exercise pursued using products such as Microsoft Cluster Server, and beyond the
scope of this document.
Minimizing Metadata
Many Content Server applications index much more metadata than is actually used in
searches. Using the LLFieldDefinitions file to REMOVE metadata fields that will never
be used can minimize the RAM requirements.
Metadata Types
By default, metadata regions are created as type TEXT. Integer, ENUM, and Boolean
types are more efficient, and using the LLFieldDefinitions file to pre-configure types
for these
regions can reduce the RAM requirements.
may require additional CPU cores or disk bandwidth to leverage the parallel capabilities
for performance.
Sample Data Point
Our sample system comprises a relatively typical mix of Content Server data
types from a “document management” application, including some use of forms
and workflow. There are several hundred core metadata regions, and several
thousand lesser-value metadata regions from applications such as Workflow.
In “RAM” mode, without tuning, a default 1 GB partition holds about 1.5 million
objects. Using a 3 GB partition size, we measure 4.4 million objects using about
2.5 GBytes of RAM for metadata.
In “DISK” mode, with the same data we can index the same 4.4 million objects
using a little more than 1.8 GBytes of RAM for metadata, which is roughly a 2.5
GByte partition.
In “Low Memory” disk mode, the same 4.4 million objects require about 700
Mbytes of RAM, which can be done in a 1 GByte partition. We extrapolate that a
2 GB partition in Low Memory mode can potentially handle up to 10 million indexed
objects from Content Server.
The general guideline using Low Memory mode with Content Server is that you can
expect a partition to accommodate 7 to 10 million typical Content Server objects with
reasonable performance using a 2 GB RAM partition size. The overall conservative
memory budget for such a partition is approximately 6 GB (2 GB RAM + 1 GB overhead
and Java for each of the Index Engine and Search Engine).
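One reading of that budget is that each of the two engine processes holds the partition's RAM plus its own overhead and Java footprint, which makes the arithmetic explicit. The function name and parameter breakdown are assumptions; only the 2 GB, 1 GB, and 6 GB figures come from the guideline above.

```python
# Sketch of the conservative per-partition memory budget: each of the
# Index Engine and Search Engine accounts for the partition RAM plus
# roughly 1 GB of overhead and Java.

def partition_memory_budget_gb(partition_ram_gb=2, overhead_gb=1, engines=2):
    """Total memory to budget for one partition across both engines."""
    return (partition_ram_gb + overhead_gb) * engines

budget = partition_memory_budget_gb()   # 2 GB Low Memory mode partition
```

This reproduces the approximately 6 GB figure for a 2 GB Low Memory mode partition, and scales in the obvious way if you choose a larger partition RAM size.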
Memory Use
When running a Java process, the amount of memory it may use is specified on
the command line. Java can be aggressive about consuming this memory. You
may be able to operate a partition with 1 GB of RAM, but if you made 8 GB of
memory available, Java may consume all of it. This memory use can be
misleading when analyzing resources used by a search partition.
Redundancy
If you are building a high availability system with failover capabilities, the hardware
must be suitably duplicated.
Spare Capacity
In the event that there are maintenance outages, or a requirement to re-index portions
of your data, you will need spare CPU capacity to handle this situation. Although OTSE
is a solid product, indexing problems can happen – generally incorrect configuration or
network/disk errors, although (perish the thought) there are occasionally bugs found.
Sizing the hardware to meet the bare minimum operating capacity won’t allow you any
headroom to recover from problems.
Indexing Performance
As with all sizing exercises, making predictions is fraught with danger. Ignoring the
peril, our anecdotal experience is that the Index Engines can ingest more than 1
Gigabyte of IPool data per hour.
A specific example on a computer that we frequently use for performance testing:
• Windows 2008 operating system, 2 Intel X5660 CPUs, 16 Gbytes RAM
• Update Distributor
• 4 Index Engines / partitions
• Partition metadata size of 1000 Mbytes
• Index stored on a single SCSI local hard disk
• Predominantly English data flow
This configuration consumes more than 4 GB of IPool data per hour, comprising
nearly 200,000 objects added or
modified per hour. Usually, high performance indexing is limited by disk I/O capacity.
Refer to the Hard Drive Storage section for more information.
Beyond about 4 partitions, the performance of the Update Distributor becomes a factor,
and you may need to ensure that the disk read capability for the indexing IPools is
adequate.
CPU Requirements
There is no single rule for the number of physical CPUs needed for a search grid. Don’t
rely on hyper-threading to multiply capacity; physical CPU cores are key. The
requirement is directly related to
your performance expectations. Some of the variables you should bear in mind are
outlined here.
Most customers optimize for cost and have low CPU counts. This means that search
works, but user satisfaction with performance may be low.
Active searches are CPU intensive. If good search time performance is expected, you
should have at least 1 CPU per search engine. This is especially true if multiple
concurrent searches will be running.
Searches are bursty in nature. CPUs will sit idle until a search request arrives, then
saturate the system. Administrators will tend to look at the average CPU use over time,
and claim that utilization is low, therefore no additional CPUs are needed. They are
wrong. Check to see if CPU utilization hits high levels during active searches, then
plan your CPUs based on load during that period.
Search Agents (Intelligent Classification, Prospectors) place an additional load on the
Search Engines. If you are using these features heavily, you may need to
accommodate them with some additional fractional CPU capacity. Search Agents run on a schedule,
so they have no impact most of the time, but a potentially heavy impact when run.
Indexing is expensive. If you need high indexing throughput, you should have at least
1 CPU per active partition, plus 0.25 CPU per inactive partition, plus 1 CPU for the
Update Distributor. With low indexing throughput requirements, 1 CPU for 4 Index
Engines may suffice.
In addition, spare capacity is needed on the Index Engines for the following events:
running index backups, writing checkpoints, and performing background merge operations.
These operations are designed to limit activity to a subset of partitions concurrently
(by default, about 6). You can choose degraded indexing during these periods or allocate
additional CPUs.
Example: suppose you want good search performance with many searches being run
(including searches for background RM disposition and hold), and expect hundreds of
thousands of indexing additions and updates every day, on a medium-large system with
40 partitions (perhaps 500 million items), configured with 6 active partitions (the number
of partitions that accept new data, write checkpoints, and merge concurrently):
1 CPU – Update Distributor
6 CPUs – Active Index Engines
8 CPUs – Update Index Engines
40 CPUs - Search engines with fast response
Assuming indexing throughput can tolerate short slowdowns for background operations
(with no extra CPUs allocated), over 50 CPUs is an appropriate size. Conversely, the same system,
if it can tolerate large backlogs for indexing (perhaps catching up in the evenings) and is
comfortable with users waiting 20 seconds on average for a search, can probably get
by with 16 CPUs.
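The arithmetic behind this example can be sketched as a small helper. This is not an official sizing formula – it simply encodes the guidance above (1 CPU for the Update Distributor, 1 per active partition, 0.25 per inactive partition, and optionally 1 Search Engine CPU per partition for fast response):

```python
def cpu_estimate(total_partitions: int, active_partitions: int,
                 fast_search: bool = True) -> int:
    """Sketch of the CPU sizing guidance in this section.

    1 CPU for the Update Distributor, 1 per active partition,
    0.25 per inactive partition, and (optionally) 1 Search Engine
    CPU per partition when fast search response is required.
    """
    inactive = total_partitions - active_partitions
    cpus = 1 + active_partitions + int(inactive * 0.25)
    if fast_search:
        cpus += total_partitions  # 1 Search Engine CPU per partition
    return cpus

# The worked example: 40 partitions, 6 active.
# 1 + 6 + int(34 * 0.25) + 40 = 1 + 6 + 8 + 40 = 55 ("over 50 CPUs").
```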
Maintenance
As with all sophisticated server software, there are a number of suggestions, best
practices and configurations that contribute to the long-term health and performance
of the system. This section outlines some of the considerations.
Log Files
Each OTSE component has the ability to generate log files. There are separate log
files for each instance of each component. The basic settings are:
Logfile=<SectionName>.log
RequestsPerLogFlush=1
IncludeConfigurationFilesInLogs=true
Logfile= specifies the path for logging (the file name is generated from the
component and the name of the partition). RequestsPerLogFlush specifies how many
logging events should be buffered before writing. The value of 1 is the least
performant, but does the best job of guaranteeing that logging occurs if something
crashes unexpectedly.
At startup, information about the version of OTSE and the environment is recorded
in the form of copies of the main configuration files, and can be used to verify that the
correct versions of software are running. This can be disabled by setting
IncludeConfigurationFilesInLogs to false.
Log Levels
Each component writes log files with a configurable level of detail. The log
level for each component of the search engine is separately configured in the [Link]
file:
DebugLevel=0
The available log levels are:
0 – Lowest level; “Guaranteed logging” output still occurs.
1 – Severe Errors are logged
2 – All Error conditions are logged
3 – Warnings are logged
4 – Significant status information is logged
5 – Information level, most detail
If you are experiencing problems that require diagnosis, setting the log level to 5 is
recommended. You do not need to restart the search engine processes to change the
DebugLevel, these are reloadable settings.
RMI Logging
The RMI logging section determines how the RMI Registry component performs
logging. It is defined in the General section and the behavior is similar to the
descriptions above, however the names of settings in the [Link] file are different.
RMILogFile ---> Logfile
RMILogTreatment ---> CreationStatus
RMILogLevel ---> DebugLevel
the corruption might be such that searching can still occur if only index offset files are
corrupted, preventing further indexing from happening.
Step 2
Check the IndexDirectory= setting in the [Link] file in the
[Partition_xxxx] section to be certain which directory you should work in. Certain
key files in the index partition directory need to be preserved and all other files in the
directory removed. The files that must be KEPT are:
Signature file (partition name with .txt extension, typically of the form
[Link])
ALL the .ini configuration files, which includes:
[Link]
Backup process definition files
Step 3
Create an empty file in the partition index directory named [Link]. At this
point the directory should have only the INI files, the signature file, and [Link].
Step 4
Start the Index Engine. It will create a new, empty search index.
Security Considerations
OTSE does not directly implement any application security measures. However, the
interfaces to the search components are well defined, and if necessary can be locked
down using standard computer and network security tools.
A quick checklist of security access points that should be considered if you are
contemplating securing access to OTSE and the index:
• Socket API ports
• RMI API ports
• Access to folders where OTSE stores the index on disk.
• Access to the configuration files – [Link], search.ini_override,
[Link], [Link].
• Access to create indexing requests, written to an input IPool folder.
• Access to logging files or folders.
• Access to the search agents configuration file.
• Access to the search agents output IPool.
• Execute permissions for launching the application.
• Folders used in backup and restore operations.
grant {
permission [Link] "<<ALL FILES>>", "read, write, delete, execute";
does not represent a single moment in time – each partition may have a different
capture time. The Index Transaction Logs can be used in conjunction with the backups
to reconstitute a current index from the backups.
There are several configuration settings that control the behavior of the backup
process. In the [UpdateDistributor_] section of the [Link] file:
BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false
The BackupParentDir field specifies where the backups should be written. This must
be a drive mapping that is visible to all the admin servers running search indexing
processes. Within this directory, a sub-directory with the time the backup starts will be
created, and within that directory each Index Engine will create a directory using the
partition names to store the index. You must have enough space available to capture
a complete copy of the index. The MaximumParallelBackups setting determines how
many Index Engines can be running backups concurrently. This number should reflect
the CPU and disk capacity of your system. The BackupLabelPrefix is optional and can be
used by a controlling application to help track status. The ControlDirectory is optional,
allowing you to override the default location for control files used to manage the backup
process. The KeepOldControlFiles setting is included for completeness and is generally
reserved for running test scenarios. Except for the ControlDirectory, these settings can
be reloaded (changed without restart). However, some of the settings are only used
at the start of a backup, and best practice is to make changes only when there is no
backup running.
The admin port on the Update Distributor listens for and responds to the following
commands related to creating backups:
backup
backup pause
backup resume
backup cancel
getstatustext
Backup is used to start a new backup process. Cancel and pause will complete writing
backups for the partitions that have already been instructed to create backup files. This
may take several minutes, so status checks include “pausing” status results (note that
some partitions may still be writing their output even though the status is “paused”).
Resume will continue a paused backup. The response to backup commands is “true”
if the command has been accepted and acted upon, and “false” otherwise.
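A controlling application might drive these commands over the admin port along the following lines. The wire format here (newline-terminated UTF-8 text) is an assumption for illustration, not a documented protocol; only the “true”/“false” response convention comes from the text above:

```python
import socket

def send_admin_command(host: str, port: int, command: str,
                       timeout: float = 10.0) -> str:
    """Send one text command (e.g. "backup", "getstatustext") to the
    Update Distributor admin port and return the raw response.
    NOTE: the newline-terminated, UTF-8 framing is an assumption
    made for illustration."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\n").encode("utf-8"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def command_accepted(response: str) -> bool:
    """Backup commands answer "true" if accepted and acted upon."""
    return response.strip().lower() == "true"
```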
Getstatustext responses are extended to include information about backups. Status
includes: None; InProgress; Completed; Paused; Pausing; Cancelled; Failed. The
details about the backups are returned as XML elements in a getstatustext operation,
along these lines:
<BackupStatus>
<InBackup>InProgress</InBackup>
<BackupLabel>MyLabel_20190322_112519734</BackupLabel>
<TotalPartitionsToBackup>10</TotalPartitionsToBackup>
<PartitionsInBackup>4</PartitionsInBackup>
<PartitionsFinishedBackup>0</PartitionsFinishedBackup>
<BackupDir>C:\p4\search.ot7\main\obj\log\ot7testoutput\
BackupGridTest_testBackup5199Ten4\backups\
20190322_112519734</BackupDir>
<BackupMessage></BackupMessage>
</BackupStatus>
Up to 3 BackupStatus elements may exist, one each for “InProgress” (including Paused
or Pausing), “Cancelled” (including Failed) and “Completed”.
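The XML returned by getstatustext is straightforward to consume programmatically. A minimal sketch in Python, using an abridged copy of the sample response above:

```python
import xml.etree.ElementTree as ET

# Abridged from the sample getstatustext response shown above.
STATUS_XML = """\
<BackupStatus>
  <InBackup>InProgress</InBackup>
  <BackupLabel>MyLabel_20190322_112519734</BackupLabel>
  <TotalPartitionsToBackup>10</TotalPartitionsToBackup>
  <PartitionsInBackup>4</PartitionsInBackup>
  <PartitionsFinishedBackup>0</PartitionsFinishedBackup>
  <BackupMessage></BackupMessage>
</BackupStatus>"""

def parse_backup_status(xml_text: str) -> dict:
    """Flatten one BackupStatus element into a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}

status = parse_backup_status(STATUS_XML)
```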
The Update Distributor persists the status and progress of backups in files named
upDist.#. A file named [Link] is used to track the current version of the [Link].
Because this data is persisted, any configured backup will resume and complete even
if the Update Distributor is stopped and started. By default, these files are stored in
the same directory that contains the Update Distributor log files. The file contents are
similar to this:
UpDistVersion 1
BackupStatus Completed
BackupTimestampString 20190325_143138107
BackupLabel MyLabel_20190325_143138107
BackupDir c:/temp/backups\20190325_143138107
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 1
EndOfBackupRecord ----------------------------------------
BackupStatus Cancelled
BackupTimestampString 20190325_162312142
BackupLabel MyLabel_20190325_162312142
BackupDir c:/temp/backups\20190325_162312142
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 0
EndOfBackupRecord ----------------------------------------
EndOfUpDistState
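A monitoring script could read this state file as follows. The sketch assumes the whitespace-delimited key/value layout shown above; the exact field set may vary between versions:

```python
def parse_updist_state(text: str) -> list[dict]:
    """Parse the key/value records of an upDist.# state file into
    one dict per backup record (records are separated by
    EndOfBackupRecord lines)."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "EndOfUpDistState":
            continue
        key, _, value = line.partition(" ")
        if key == "EndOfBackupRecord":
            records.append(current)
            current = {}
        else:
            current[key] = value.strip()
    if current:
        records.append(current)
    return records

# Abridged from the sample file contents shown above.
SAMPLE = """UpDistVersion 1
BackupStatus Completed
BackupTimestampString 20190325_143138107
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 1
EndOfBackupRecord ----------------------------------------
BackupStatus Cancelled
NumBackupPartitionsCompleted 0
EndOfBackupRecord ----------------------------------------
EndOfUpDistState
"""
```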
When a backup process completes successfully, a file named [Link] is
added to the backup location. This file is not required by OTSE for operation – its
presence is to make it easier for administrators who are inspecting the file system to
determine whether the backup in that location is good. The file contains a summary of the
backup using the same syntax as the [Link] file, for example:
BackupStatus Completed
BackupTimestampString 20200528_123935489
BackupLabel SocketGridBase_BackupLabel_20200528_123935489
BackupDir
C:\p4\search.ot7\main\obj\log\ot7testoutput\BackupGridTest_testBackup5179a\b
ackups\20200528_123935489
NumPartitionsInThisBackup 2
NumBackupPartitionsCompleted 2
EndOfBackupRecord ----------------------------------------
Restoring Partitions
When restoring an index, the search partition(s) being restored must first be stopped.
Use file copy to restore the entire contents of the partition backup, then start the Index
Engine and Search Engine. The Transaction Logs can then be used to identify missing
transactions to bring the index up to date. Be sure you have Transaction Logs enabled.
As a convenience, entries are written to the Transaction Logs to mark the point at which
a backup occurred. The backup markers in the Transaction Log have this form:
2018-06-11T[Link]Z, Backup started,
backupDir="c:/temp/backups\20180608_132859489/partition1",
label="MyLabel_20180608_132859489-partition1", config="livelink.27"
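These marker lines can be picked apart with a simple pattern match. A minimal sketch, where the MARKER sample mirrors the attributes shown above:

```python
import re

# Attribute portion of the sample Transaction Log backup marker above.
MARKER = ('backupDir="c:/temp/backups\\20180608_132859489/partition1", '
          'label="MyLabel_20180608_132859489-partition1", '
          'config="livelink.27"')

def parse_backup_marker(line: str) -> dict:
    """Extract the key="value" attributes from a Transaction Log
    backup marker line, in the format shown above."""
    return dict(re.findall(r'(\w+)="([^"]*)"', line))
```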
Backup – Method 2
Operating system file copy utilities can be used to back up the search index. All search
and index processes must be stopped for this approach to succeed. Ensure that the
entire contents of the index directories for each partition are copied.
needed for backups. This information is primarily for troubleshooting and as a starting
point for developers that are integrating index backup and restore into their
applications.
To run a full backup, a configuration file with the name ‘[Link]’ must first be created
and placed in each partition folder. For a differential backup, a file with the name
‘[Link]’ must be created.
The backup utility is then run, which performs the backup operation on a single
partition.
On completion, the backup data is contained in a target directory, called FULL
for a full backup or DIFFx for a differential backup (where x is the order number of
this differential backup relative to the baseline full backup). The backup process also
creates a file called ‘[Link]’, with copies in the source and backup target partition
folders.
Sample [Link] File
Note that the [Link] file is identical except for its name. The [Link] file uses a basic
Windows INI file syntax with a single section, [Backup]. Comments have been injected
here for explanatory purposes (lines starting with a # symbol); these should not exist
in the actual file. In practice, the only values you may want to change are the log file
name and log level.
[Backup]
# AutoNew requests that a new folder is created if it
# does not already exist.
AutoNewDir=True
DelConfig=FALSE
# Index is the root location of the source index being backed up.
Index=F:/OpenText/cs1064main01/index/enterprise/index1
# Specify the names of regions that contain date and time values
# that can reasonably be expected to reflect object index dates.
IndexDateTag=OTCreateDate
IndexTimeTag=OTCreateTime
Related to this are the format codes that are used in the label string. The codes are:
Value Description
%% A percentage sign
%p AM or PM
%P AD or BC
[General]
# 0 status is good, other values are error codes
Status=0
DiffString=Differential
FullString=Full
[FULL]
CheckPointSize=624
MetaLogNumber=51
MetaLogOffset=0
AccumLogNumber=39
AccumLogOffset=0
I1=61
I1Size=447
I2=66
I2Size=39
TotalIndexSize=1109
Label=Enterprise_04_08_2011_Full_58863
Date=20110408 145139
MetaLogChkSum=524293
AccumLogChkSum=524293
CheckPointChkSum=206517074
I1ChkSum=15804739427
I2ChkSum=11071697352
ConfigChkSum=1160933350
Success=0
[DIFF2]
CheckPointSize=665
MetaLogNumber=53
MetaLogOffset=0
AccumLogNumber=41
AccumLogOffset=11785068
I1=69
I1Size=9
TotalIndexSize=674
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 150047
MetaLogChkSum=524293
AccumLogChkSum=258080884
CheckPointChkSum=1282731032
I1ChkSum=9500506248
ConfigChkSum=624209792
Success=0
[DIFF1]
CheckPointSize=664
MetaLogNumber=52
MetaLogOffset=4732284
AccumLogNumber=39
AccumLogOffset=5824292
TotalIndexSize=664
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 145644
MetaLogChkSum=1542696885
AccumLogChkSum=238343344
CheckPointChkSum=3018112456
ConfigChkSum=389190926
Success=0
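Since this file uses standard INI syntax, a script can verify the recorded results. A sketch that checks whether every FULL/DIFFx record reported success, using an abridged sample modeled on the file above:

```python
import configparser

# Abridged from the sample backup record file shown above.
SAMPLE_CFG = """\
[General]
Status=0
[FULL]
TotalIndexSize=1109
Success=0
[DIFF1]
TotalIndexSize=664
Success=0
[DIFF2]
TotalIndexSize=674
Success=0
"""

def backup_sections_ok(cfg_text: str) -> bool:
    """True when every FULL/DIFFx record reports Success=0."""
    cp = configparser.ConfigParser()
    cp.read_string(cfg_text)
    return all(
        cp.get(s, "Success", fallback=None) == "0"
        for s in cp.sections()
        if s == "FULL" or s.startswith("DIFF")
    )
```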
Running the Backup Utility
Once the [Link] or [Link] file is in place the backup utility can be run. The utility is
contained within the Search Engine, and documented in the Utilities section of this
document.
Preparation
The partition to be restored is placed in a known location. A configuration file is created
which points to this location, called [Link]. The target directory needs to be empty,
which means moving or deleting any existing index.
Analysis
The [Link] file from the backup location is analyzed to determine which files and
folders are required to perform the restore operation. This information is written into
the [Link] file.
The controlling application then needs to prompt the administrator to stage the
necessary folders before proceeding. Content Server is one application which
performs this coordination.
Copy
In the copy phase, the files specified in the [Link] file are used as a guideline for
copying all the necessary files from that backup location to the search index. The copy
process takes place iteratively, with one differential backup folder processed on each
invocation, and the administrator staging needed files for the next copy operation. The
process is structured to support complex backup storage systems, where each backup
may have been placed in a tape archive.
Validate
The final step is validation, in which the restored index is checked for integrity.
These stages do not automatically happen one after the other. The administrator or
the controlling application needs to initiate the steps sequentially after ensuring that
appropriate file preparation occurs.
The restore operation works on a single partition. Content Server provides a
mechanism to simplify the restore of the entire index, and prompts the administrator
to ensure the appropriate files and folders are available at each step. The syntax of
the restore utility is documented in the Utilities section of this document.
[Link] File
The [Link] file is used for each stage of the restore procedure, and modified after
each stage. This file is the mechanism for transporting process information from one
phase to the next.
Before first running the analyze stage, a [Link] file needs to be created that looks
like this:
[restore]
otbinpath=d:\opentext\bin
SourceDir=d:\llbackup\ent\incr18
destdir=d:\temprest
option=analyse
Once the analysis is complete, the [Link] file will have been updated with
information about files that will be copied, and should look like this, without the added
comments and white space:
[restore]
OTBinPath=d:\opentext\bin
BackupIndexName=livelink
LogFilename=[Link]
RestoreHistory=[Link]
BackupHistory=[Link]
DestDir=d:\temprest
SourceDir=d:\llbackup\ent\incr18
loglevel=1
# The insert option identifies that copy will take place next
option=insert
Frag2=00094
Frag1Size=2.669858
Frag1CkSum=59557
Frag1=00087
Master=Yes
In operation, the administrator (or controlling application) is expected to examine the
IMAGE# section for the current image number, and mount the backup folder which has
the specified label and date. Once this is staged, the administrator edits the [Link] file to
change the option from “insert” to “copy”, and runs the restore.
The restore utility will then copy the files from that one image, change the option to
insert, update the current image number, and the process repeats until all the IMAGE
sections are processed.
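The insert-to-copy edit described above can be scripted. The sketch below uses Python’s configparser; note that configparser rewrites entries as `key = value`, which is assumed (not confirmed) to be acceptable to the restore utility:

```python
import configparser
import io

def advance_restore_option(cfg_text: str) -> str:
    """Flip the restore option from "insert" to "copy", mirroring
    the manual edit described above. Returns the rewritten file
    contents. Illustrative sketch only."""
    cp = configparser.ConfigParser()
    cp.read_string(cfg_text)
    if cp.get("restore", "option") == "insert":
        cp.set("restore", "option", "copy")
    out = io.StringIO()
    cp.write(out)
    return out.getvalue()
```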
Index Files
OTSE persists the search index on disk in a specific hierarchy of folders and file
names. Below is a typical listing for a search partition; each of the folders and files
is described in detail in this section. There is one such folder for each partition.
[Link]
accumlog.39
checkpoint.51
[Link]
[Link]
livelink.280
[Link]
metalog.51
topwords.100000
MODaccumlog.47
MODindex
\2
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\map
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\\[Link]
\3
\\ same
61
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\map
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
\[Link]
62
\ same
Signature File
This first file in the list, [Link], is technically not part of the search
index, and not required for search or indexing operations. Content Server adds this
file to allow the administration interfaces in Content Server to verify that related Search
Engines and Index Engines are referencing the same directories. If upgrades occur,
older server names may migrate, this is expected.
Checkpoint Files
It is possible for multiple checkpoint files to exist for a partition. Normally, this only
occurs for a short period, when a Search Engine is still using an older checkpoint file
after the Index Engine has created a new one. The Index Engines will reduce the
number of checkpoint files to one at the earliest safe opportunity.
Lock File
The Lock File is used by the Index Engine to indicate that this partition is in use. This
is a failsafe mechanism to ensure that multiple Index Engines will not attempt to use
the same data. In a properly configured system, this would not happen. The Lock file
provides additional insurance.
Control File
The Control File, named [Link], is used by the Index Engines to record the name
of the current Config file. The Search Engines read this file to obtain the name of the
current Config file. To ensure atomic reads and writes, both the Index Engine and
Search Engine will lock this file when accessing it.
Top Words
Optional file. Top Words are used to track which words in an index are candidates for
exclusion from TEXT queries because they are too common. The file is named
topwords.n, where n is one of 10000, 100000 or 1000000 – which reflects the number
of objects in the partition when the file was generated.
Config File
Named livelink.x, where x is an incrementing number. The config file contains detailed
information about the index fragments, working file offsets, file checksums, and other
parameters needed by the Index Engine and Search Engine to properly interpret the
index files.
A new Config file is written each time the Index Engine creates a new fragment or
generates a checkpoint. A Search Engine will place a non-exclusive lock on the Config
file which represents the accumlog and metalog files it is currently consuming.
The Index Engine will clean up older, unused Config files.
Metalogs
A metalog contains incremental updates to metadata. The Index Engine writes
updates to the metalogs, and occasionally creates a checkpoint file that rolls up all the
metalogs since the last checkpoint into a new checkpoint file.
Search engines consume updates from the metalog files to keep their copy of the
metadata current. When a metalog exceeds a configurable size, a new checkpoint is
created and a new metalog started. It is possible for multiple metalogs to exist for short
periods while the Search Engines consume older metalogs.
the start of this section, these folders are labeled 61 and 62. Folder 61 is exploded to
show the files within.
A new Index Fragment is created when the Index Engine fills the accumulator and
‘dumps’ it to disk. The files within a fragment are never modified once written to disk.
The Index Engines occasionally merge fragments to consolidate them, creating new
larger fragments in the process, and allowing the smaller fragments to be deleted. A
cleanup task in the Index Engine will delete the older, smaller fragments once the
Search Engine stops referencing them.
In an optimal configuration, the merge process attempts to structure the fragments
such that the number approaches about 5 fragments, with geometrically related sizes.
For example, 1000 MB, 300 MB, 100 MB, 30 MB, 10 MB. In practice, the sizes will
vary from this pattern given the reality of the sizes available for merging and the
opportunistic scheduling of merges based on the indexing load. If the indexing load is
high and sustained, the opportunity for merges may be rare, and the number of Index
Fragments can become large. Large numbers of fragments are undesirable for query
performance, so there is a configuration setting in the [Link] file that places an
upper limit on the number of acceptable fragments, which will force merge activity,
stalling the indexing process if necessary.
Within the Index Fragment Folder, there are a number of files as described below.
Core, Region and Other
Examining the fragment folder, note that there are files of the same type but with the
prefixes core, region and other. These file sets are similar, but are used for different data.
The ‘core’ files contain the full text search data for words which are comprised of the
basic ASCII character set (typically English).
The ‘region’ files contain the full text index for XML region names. These are special
cases that improve the performance of search for values within XML fields.
The ‘other’ files contain the full text index for all other words – those which are not
English and not XML tags.
The descriptions below for core files are also applicable to the files with ‘region’ and
‘other’ prefixes.
Index Files
The file [Link] contains the ‘dictionary’ of terms, plus pointers to the object id file.
As the dictionary grows large, multiple levels of dictionary pointers are created, so you
will often see [Link], [Link], and so forth. These higher numbered index
files contain references to the lower numbered files, with successively more accurate
data points. For instance, the [Link] file contains entries for every 16th dictionary
value. This hierarchy improves dictionary lookup time. This structure repeats until the
highest numbered index file is smaller than 1 MByte. This 1 MByte dictionary is kept
in memory to optimize performance.
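The number of dictionary levels implied by this scheme can be estimated with a simple loop. The fanout of 16 and the 1 MByte cap come from the description above; this is an illustration, not the engine’s actual algorithm:

```python
def dictionary_levels(index1_size_mb: float, fanout: int = 16,
                      cap_mb: float = 1.0) -> int:
    """Estimate how many index.N files the hierarchy needs: each
    level keeps one entry per `fanout` entries of the level below,
    and levels are added until the top file is under cap_mb."""
    levels, size = 1, index1_size_mb
    while size > cap_mb:
        size /= fanout
        levels += 1
    return levels

# A 200 MB dictionary: 200 -> 12.5 -> ~0.8 MB, i.e. three levels
# (index.1, index.2, index.3).
```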
Object Files
The file [Link] contains a list of all internal object IDs and pointers to the word
location lists in the offset file.
Offset File
The file [Link] contains the lists of word offsets. These word offsets indicate to
the search engine the relative position of a word within an indexed object.
Skip File
The file [Link] contains pointers to the offset file that allows the Search Engine
to quickly skip over large data sets.
Map File
The map file contains checksums that can be used to verify that the index fragment
files have not been corrupted. There is only one Map file per partition fragment.
MODCheck.x
This is the master file for the metadata values, and the target after a merge.
The value of x increments after each merge operation.
MODcheckLog.x
Changes to text values are recorded in this file until a merge operation occurs.
MODpremerge.x+1
MODptrs.x+1
Files containing pointers used for recovery and playback during startup.
It is possible that multiple versions (values of .x) of these files may exist, especially if
a Search Engine is lagging in accepting updates from the Index Engine, or multiple
Search Engines exist.
Configuration Files
OTSE derives the bulk of its configuration settings from a number of files. In this
section, we review each of the files to convey the basic purpose of each.
[Link]
Most settings for OTSE are contained within the [Link] file. There is one [Link]
file per Admin Server. In practice, this usually means one per physical computer,
although other permutations are possible.
When used with Content Server, the [Link] file is generated by Content Server.
Although Content Server may preserve some of the edit changes you might make to
the [Link] file, this is not guaranteed. In general, you should not edit this file. Most
of the entries are set by Content Server, and using the Content Server search
administration pages is the preferred method for interacting with this file.
If you must edit this file within a Content Server application, consider using the
search.ini_override file instead.
The [Link] file follows generally accepted conventions for the structure of a ‘.ini’
file.
The file consists of several configuration sections. Where sections contain settings for
a particular partition, the section name will include the partition name. Refer to the
[Link] section of this document for detailed information on entries in the [Link]
file.
Search.ini_override
This file is specifically designed to supplement or override any values set in the
[Link] file. Because the [Link] file is controlled by Content Server, editing the
[Link] file does not ensure that your changes will be preserved.
The override file is optional. When present, it need contain only those configuration
settings which you want to take precedence over the default settings or the settings
within the [Link] file.
There is a special value that can be used in override settings, the DELETE_OVERRIDE
value. When this value is encountered, it means that the explicit value for the setting
in the [Link] file should be ignored, and the default value used instead.
For example, the default value for CompactEveryNDays is 30. If the [Link] file
contains the setting:
CompactEveryNDays=100
and the override file contains:
CompactEveryNDays=DELETE_OVERRIDE
Then the default value of 30 will be used.
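The resolution order can be sketched as follows, using the CompactEveryNDays example above (a simplified model, not the actual OTSE implementation):

```python
DELETE_OVERRIDE = "DELETE_OVERRIDE"

def effective_setting(name, defaults, base, override):
    """Resolve a setting: an override wins over the base file,
    except DELETE_OVERRIDE, which discards the base value and
    falls back to the built-in default."""
    if name in override:
        if override[name] == DELETE_OVERRIDE:
            return defaults.get(name)
        return override[name]
    if name in base:
        return base[name]
    return defaults.get(name)

defaults = {"CompactEveryNDays": "30"}
base = {"CompactEveryNDays": "100"}
override = {"CompactEveryNDays": "DELETE_OVERRIDE"}
```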
Note that the override file may need to be edited any time the partition configuration
changes. The most common situation is that when you create new partitions, you will
need to add corresponding sections to the override file.
If you use automatic partition creation (such as date based partition creation) within
Content Server, you may have difficulty keeping the override file current with newly
created partitions, and the override file might not be a good choice for this type of
deployment.
[Link]
This is an optional configuration file which is used to set the parameters for index
backup operations and record the status of the last backup operation. You should not
normally modify this file. Refer to the section on index backup for more information.
[Link]
This file defines the storage modes for text metadata regions, and should be located
in the partition directory. There is one [Link] file per partition.
Although each partition could have different settings, keeping them identical across
partitions is generally recommended, and within a Content Server environment this is
enforced. A [Link] file has the following form:
[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM
[ReadWrite]
someRegion1=DISK
someRegion2=RAM
[ReadOnly]
someRegion1=RAM
someRegion3=DISK
[NoAdd]
someRegion1=DISK_RET
someRegion2=RAM
The General section defines the default storage mode for a text metadata region. The
ReadWrite, ReadOnly and NoAdd sections allow control over storage of specific
regions, and take priority over the General section. The possible values are DISK,
RAM and DISK_RET. Refer to the section on text metadata storage for details.
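The priority rule can be modeled with a small lookup, using the sample file above (a simplified illustration of the resolution logic, not the engine’s code):

```python
# Mirror of the sample storage-mode file shown above.
CONFIG = {
    "General": {"NoAdd": "DISK", "ReadOnly": "DISK", "ReadWrite": "RAM"},
    "ReadWrite": {"someRegion1": "DISK", "someRegion2": "RAM"},
    "ReadOnly": {"someRegion1": "RAM", "someRegion3": "DISK"},
    "NoAdd": {"someRegion1": "DISK_RET", "someRegion2": "RAM"},
}

def storage_mode(region: str, mode: str, config: dict) -> str:
    """Resolve the storage mode for a text metadata region: a
    per-region entry in the mode's section takes priority over
    the General default for that mode."""
    return config.get(mode, {}).get(region, config["General"][mode])
```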
[Link]
The field definitions file has several purposes. Experience indicates that most
customers do not understand or modify this file, which is unfortunate, since significant
performance and memory use benefits may be possible by reviewing and editing this
file BEFORE indexing your content. Once an index has been created, it is not possible
to change some of the settings in this file without generating startup errors.
One function of the file is to establish the type for each metadata region to be indexed.
Each region is tagged with a type such as:
• INT
• LONG
• TEXT
• DATETIME
• TIMESTAMP
• USER
• CHAIN
• AGGREGATE-TEXT
A second purpose for the field definitions file is to provide metadata parsing hints for
nested metadata regions. Using the NESTED operative, the input IPool parser can
ignore outer tags and extract and index the inner region elements.
The field definitions file also provides instructions for special handling of certain region
types. This includes dropping, removing, renaming and merging metadata regions.
You can also use the aggregate feature to create a new region comprised of multiple
text regions.
One field definitions file is required per Admin server. As a general rule, the field
definitions files should be identical across Admin servers; differences will result in
inconsistent handling of regions between partitions.
Content Server does not edit, generate or manage this file. In general, changes to this
file must be done manually. There is one exception to this – the [Link] file has a
special setting for logically appending lines to the [Link] file. This allows
limited control over the definitions from Content Server. For example, if the [Link]
file contained these two lines:
ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber
Then at startup time, OTSE acts as if these lines existed at the end of the
[Link] file:
CHAIN MyID UserID TwitterID FacebookID
LONG OTBigNumber
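The effect of these settings can be modeled as a simple append, using the example above. The base line `TEXT OTName` is a hypothetical placeholder for the file’s own content:

```python
PREFIX = "ExtraLLFieldDefinitionsLine"

def effective_definitions(base_lines, search_ini):
    """Append ExtraLLFieldDefinitionsLineN values to the field
    definitions, in numeric order, as described above."""
    keys = sorted(
        (k for k in search_ini if k.startswith(PREFIX)),
        key=lambda k: int(k[len(PREFIX):]),
    )
    return list(base_lines) + [search_ini[k] for k in keys]

ini = {
    "ExtraLLFieldDefinitionsLine0": "CHAIN MyID UserID TwitterID FacebookID",
    "ExtraLLFieldDefinitionsLine1": "LONG OTBigNumber",
}
```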
Content Server usually ships with two versions of this file – a standard version, and
one for use with Enterprise Library Services. Which version to use is determined
by a setting in the [Link] file:
FieldModeDefinitions=[Link]
Detailed information about each of the functions and data types of the field mode
definitions file can be found in the section of this document which covers metadata
regions.
[Link] Summary
This section gathers together most of the accessible configuration values that can be
used in the [Link] file, or the search.ini_override file. There are a number of
additional values which are only used for specific debugging or testing purposes that
are not listed here. A number of these configuration values are covered in more detail
in relevant sections of this document.
Not all processes read all sections of the [Link] file. Content Server generates a
[Link] file for each process, and typically includes only the values needed by that
process. Entries omitted from a generated file take their default settings.
Default values are displayed in this section wherever possible. Annotations in this
section are indicated with a // at the beginning of the line; this is not supported syntax
in an actual [Link] file, and is used here only as a documentation device.
The settings in the INI file are applied when the processes start. Changes to this file
may require a restart of some or all of the search grid in order to take effect. Some of
these values can be re-applied to a running process without a restart, refer to the
“Reloadable Settings” section for a list.
General Section
This section appears in every [Link] file. Its basic purpose is to share the
configuration settings for the RMI Grid Registry and the Admin Server with all
components. If RMI communication between grid components is not used, the
General section is ignored and may be omitted.
[General]
AdminServerHostName=localhost
// RMI Registry
RMIRegistryPort=1099
RMIPolicyFile=[Link]
RMICodebase=../bin/[Link]
RMIAdminPort=8997
RMILogTreatment=0
RMILogLevel=10
Partition Section
The Partition section contains basic information about a partition, such as its size,
memory usage preferences, and mode of operation. The section name must include
the partition name after the underscore.
[Partition_]
AllowedNumConfigs=500 (-1 = none)
AccumulatorSizeInMBytes=30
PartitionMode=ReadWrite | ReadOnly | NoAdd | Retired
DataFlow Section
The DataFlow section contains the majority of configuration settings relating to how
data should be processed. The partition name must be appended to the section name
after the underscore.
[DataFlow_]
FieldDefinitionFile=[Link]
FieldModeDefinitions=[Link]
QueryTimeOutInMS=120000
SessionTimeOutInMS=216000
StatsTriggerThreshold=200
LastModifiedFieldName=OTModifyDate
// Time zone obtained from OS by default, you can set e.g +5 for EST
TimestampTimeZone=
// Accumulator configuration
ContentTruncSizeInMBytes=10
DumpOnInactiveIntervalInMS=3600000
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
DumpToDiskOnStart=false
AccumulatorBigDocumentThresholdInBytes=5000000
AccumulatorBigDocumentOverhead=10
CompleteXML=false
// Tokenizer
RegExTokenizerFile=[Link]
RegExTokenizerFileX=c:/config/tokenizers/[Link]
TokenizerOptions=0
UseLikeForTheseRegions=
OverTokenizedRegions=
LikeUsesStemming=true
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true
// Facets
ExpectedNumberOfValuesPerFacet=16
ExpectedNumberOfFacetObjects=100000
MaximumFacetValueLength=32
UseFacetDataStructure=true
MaximumNumberOfValuesPerFacet=32767
NumberOfDesiredFacetValues=20
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
MaximumNumberOfCachedFacets=25
DesiredNumberOfCachedFacets=16
SubIndexCapSizeInMBytes=2147483647
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
MaximumSubIndexArraySize=512
CompactEveryNDays=30
NeighbouringIndexRatio=3
// Metadata defragmentation
DefragmentFirstSundayOfMonthOnly=0
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
DefragmentMaxStaggerInMinutes=60
DefragmentStaggerSeedToAppend=SEED
// Relevance tuning
ExpressionWeight=100
ObjectRankRanker=
ExtraWeightFieldRankers=
DateFieldRankers=
TypeFieldRankers=
DefaultMetadataFieldNamesCSL=
// Set true for minor query performance boost on older CS instances
ConvertREtoRelevancy=false
//
DiskRetSection=DISK_RET
FieldAliasSection=FAS_label
TextAllowTopwordsBuild=true
TextNumberOfWordsInSet=15
TextUseTermSet=true
TextPercentage=80
[UpdateDistributor_]
// RMIServerPort not needed for direct socket connection mode
RMIServerPort=
AdminPort=
AllowRebalancingOfNoAddPartitions=false
IEUpdateTimeoutMilliSecs=3600000
MaxItemsInUpdateBatch=100
MaxBatchesPerIETransaction=1000
MaxBatchSizeInBytes=20000000
ReadOnlyConvertionBatchSize=1
// Retry and total wait time talking to UD, direct socket mode
WaitForTransactionMS=10000
MaxWaitForTransactionMS=600000
// logging
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1
[IndexEngine_]
AdminPort=
IndexDirectory=
// For direct (non-RMI) connections, a timeout between connection and the first command
IEConnectionTimeoutInMS=10000
[SearchFederator_]
RMIServerPort=
AdminPort=
SearchPort=8500
Logfile=<SectionName>.log
RequestsPerLogFlush=1
[SearchEngine_]
AdminPort=
IndexDirectory=
// Disk tuning values that you should leave alone unless you
// are having disk problems. Use cautiously.
UseSystemIOBuffers=true
MaximumNumberCachedIOBuffers=100
SizeInBytesIOBuffers=4096
DiskRet Section
This section allows the use of DISK_RET storage mode on older systems where
Content Server does not support DISK_RET configuration in the search administration
pages. Normally, it should only be present in a search.ini_override file; CS10 Update 3
and later put this into the [Link] file instead.
[DiskRetSection]
RegionsOnReadWritePartitions=
RegionsOnNoAddPartitions=
RegionsOnReadOnlyPartitions=
[SearchAgent_]
operation=OTProspector | OTClassify
[FAS_label]
From=to
// example
Author=OTUserName
[IndexMaker]
ObjectSkip=32
ObjectUseRLE=true
ObjectUseNyble=true
OffsetSkip=16
OffsetUseRLE=true
OffsetUseNyble=true
SmallestIndexIndexSizeInBytes=1048576
IndexingPartitionFactor=256
UseLongSkips=true
LongSkipInterval=4096
Reloadable Settings
A subset of the [Link] settings can be applied to search processes that are already
running. This feature is triggered using the “reloadSettings” command over the admin
API port. The [Link] settings applied at reload are:
Common Values
These values are reloadable in the Update Distributor, Index Engines, Search
Federator and Search Engines.
Logfile
RequestsPerLogFlush
CreationStatus
DebugLevel
LogSizeLimitInMBytes
MaxLogFiles
MaxStartupLogFiles
IncludeConfigurationFilesInLogs
NumberOfFileRecoveryAttempts
LargeObjectPartition
ObjectSizeThresholdInBytes
BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists
If using RMI…
RMIRegistryPort
RMIPolicyFile
RMICodebase
AdminServerHostName
PolicyFile
Search Engines
DefaultMetadataFieldNamesCSL
DefragmentMemoryOptions
DefragmentSpaceInMBytes
DefragmentDailyTimes
DefragmentMaxStaggerInMinutes
DefragmentStaggerSeedToAppend
SkipMetadataSetOfEqualValues
MetadataConversionOptions
ExpressionWeight
ObjectRankRanker
ExtraWeightFieldRankers
DateFieldRankers
TypeFieldRankers
UseOldStem
HitLocationRestrictionFields
FieldAliasSection
DefaultMetadataAttributeFieldNames
SystemDefaultSortLanguage
SortingSequences
PrecomputeFacetsCSL
MaximumNumberOfCachedFacets
DesiredNumberOfCachedFacets
TextNumberOfWordsInSet=15
TextUseTermSet=true
TextPercentage=80
Update Distributor
MaxItemsInUpdateBatch
MaxBatchSizeInBytes
MaxBatchesPerIETransaction
NumOfMergeTokens
RunAgentIntervalInMS
** The list of partitions is also reloaded from the section names in the Update
Distributor, allowing partitions to be added without restarts.
Although Search Agent definitions are not included in this list, changes to the Search
Agents do not require a restart. Search Agents use another mechanism for updates;
refer to the section on Search Agents for details.
Tokenizer Mapping
Earlier in this document, the Tokenizer section references various character mappings.
For reference, a detailed list of character mappings performed by the tokenizer is
included below. If a character is not included in this table, it is not mapped – it is added
to the index as itself.
The leftmost character in each row (and its hexadecimal Unicode value) represents
the output character(s) of the mapping. The remaining values following the colon
represent a list of source characters that are mapped to that output character. Each of
these source characters in the list is separated by a comma, with Unicode values in
parentheses.
ѝ (45d):
Ѝ (40d)
ў (45e):
Ў (40e)
џ (45f):
Џ (40f)
а (430):
А (410)
б (431):
Б (411)
в (432):
В (412)
г (433):
Г (413)
д (434):
Д (414)
е (435):
Е (415)
ж (436):
Ж (416)
з (437):
З (417)
и (438):
И (418)
й (439):
Й (419)
к (43a):
К (41a)
л (43b):
Л (41b)
м (43c):
М (41c)
н (43d):
Н (41d)
о (43e):
О (41e)
п (43f):
П (41f)
р (440):
Р (420)
с (441):
С (421)
т (442):
Т (422)
у (443):
У (423)
ф (444):
Ф (424)
х (445):
Х (425)
ц (446):
Ц (426)
ч (447):
Ч (427)
ш (448):
Ш (428)
щ (449):
Щ (429)
ъ (44a):
Ъ (42a)
ы (44b):
Ы (42b)
ь (44c):
Ь (42c)
э (44d):
Э (42d)
ю (44e):
Ю (42e)
я (44f):
Я (42f)
ا Arabic (627): آ (622), أ (623), إ (625), ٵ (675), ﴼ (fd3c), ﴽ (fd3d), (fe75),
ﺁ (fe81), ﺂ (fe82), ﺃ (fe83), ﺄ (fe84), ﺇ (fe87), ﺈ (fe88), ﺍ (fe8d), ﺎ (fe8e)
و Arabic (648): ؤ (624), ٶ (676), ﺅ (fe85), ﺆ (fe86), ﻭ (feed), ﻮ (feee)
ي Arabic (64a): ئ (626), ى (649), ٸ (678), ﯨ (fbe8), ﯩ (fbe9), ﱝ (fc5d), ﲐ (fc90),
ﺉ (fe89), ﺊ (fe8a), ﺋ (fe8b), ﺌ (fe8c), ﻯ (feef), ﻰ (fef0), ﻱ (fef1), ﻲ (fef2),
ﻳ (fef3), ﻴ (fef4)
ه Arabic (647): ة (629), ﳙ (fcd9), ﺓ (fe93), ﺔ (fe94), ﻩ (fee9), ﻪ (feea), ﻫ (feeb), ﻬ (feec)
0 (30): ٠ (660), ۰ (6f0), 0 (ff10)
1 (31): ١ (661), ۱ (6f1), 1 (ff11)
2 (32): ٢ (662), ۲ (6f2), 2 (ff12)
3 (33): ٣ (663), ۳ (6f3), 3 (ff13)
4 (34): ٤ (664), ۴ (6f4), 4 (ff14)
5 (35): ٥ (665), ۵ (6f5), 5 (ff15)
6 (36): ٦ (666), ۶ (6f6), 6 (ff16)
7 (37): ٧ (667), ۷ (6f7), 7 (ff17)
8 (38): ٨ (668), ۸ (6f8), 8 (ff18)
9 (39): ٩ (669), ۹ (6f9), 9 (ff19)
ۇ Arabic (6c7): ٷ (677), (fbc7), ﯗ (fbd7), ﯘ (fbd8), ﯝ (fbdd)
ە Arabic (6d5): ۀ (6c0), ﮤ (fba4), ﮥ (fba5), (fbc0)
ロ (30ed): ロ (ff9b)
ン (30f3): ン (ff9d)
゛ (309b): ゙ (ff9e)
゜ (309c): ゚ (ff9f)
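The mapping rows above can be read as simple "source character maps to output character" substitutions applied before indexing. The following is an illustrative Python sketch only, using a few rows from the table; it is not the engine's actual implementation, and the function name is hypothetical:

```python
# Illustrative character-folding sketch based on a few rows of the
# mapping table above (Cyrillic uppercase -> lowercase, fullwidth
# digits -> ASCII). Characters not in the map pass through unchanged.
FOLD_MAP = {
    "\u0410": "\u0430",  # А (410) -> а (430)
    "\u0411": "\u0431",  # Б (411) -> б (431)
    "\uff10": "0",       # fullwidth 0 (ff10) -> 0 (30)
    "\uff11": "1",       # fullwidth 1 (ff11) -> 1 (30+1)
}

def fold(text: str) -> str:
    """Map each character through the table; unmapped characters are kept as-is."""
    return "".join(FOLD_MAP.get(ch, ch) for ch in text)

print(fold("\u0410\u0411\uff10\uff11x"))  # -> "аб01x"
```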
Additional Information
Version history and selected built-in utilities.
Version History
This section of the document identifies which updates of Search Engine 10 and 10.5
contain new features or material changes in behavior. This is not a comprehensive
list, but covers the more notable changes.
Search Engine 10
Released with Content Server 10, approximately September 2010. The versions of
the search engine prior to this release were generally referred to as OT7.
• Added support for key-value attributes in text metadata, used for multi-lingual
metadata indexing and search.
• Added Hindi, Tamil and Telugu to the standard tokenizer.
• New percent full model with “soft” update-only mode and rebalancing.
• Defragmentation of metadata storage.
• Added ModifyByQuery.
• Added DeleteByQuery.
• Added Disk Retrieval Storage mode.
• Bi-gram indexing of far-east character sets. May require re-indexing of existing
content with far-east character sets.
• Faster ‘stemming’ focused on noun plurals.
• Content Status feature added.
• Synthetic regions: partition name and mode.
• Change bad metadata to record error instead of halting.
• Search Federator closes connections from inactive clients.
• Rolling log file support added.
• Various bug fixes
• Support for Java 6 (Update 20)
• Java 8 u162
Error Codes
Errors and warnings from OTSE may be exposed in multiple ways. Process Error
codes are responses to communications. Detailed information about errors is normally
contained in the log files. The table below lists many of the possible Process Error
codes. This is not a comprehensive list.
Update Distributor
Code Description
129 Unable to load JNI library. To read or write IPools, OTSE leverages
Content Server libraries. This file is named [Link] (Windows) or
[Link] and is expected to reside in the <OTHOME>\bin directory.
131 Insufficient memory. The memory can be adjusted using the -Xmx
parameter on the command line. Content Server exposes this
control in its administration pages.
173 Index is full. All Index Engines report they are unable to accept new
objects.
Index Engine
Code Description
180 Index failed to start. In some cases, this error is acceptable if the
Index Engine is already running.
181 Request to start the Index Engine has been ignored because an
index restore operation is in progress.
Search Federator
Code Description
Search Engine
Code Description
Utilities
OTSE contains a number of built-in utilities and diagnostic tools. These are often used
by OpenText support staff and developers when analyzing and testing an index. Many
of these will have limited value for customers, but may be of assistance when
diagnosing particular index problems. For convenience, basic documentation for some
of the more common utilities is included here.
Many of the utilities are NOT a supported feature of the product. They are not
guaranteed to work as described, and may be modified or removed at any time.
You are strongly advised to use the utilities on a backup of your index,
and not on a production copy. The potential exists to render an index
unusable for your application with some of these tools. You have
been warned.
General Syntax
The utilities are invoked by launching the search JAR using appropriate parameters.
The general syntax is:
java [-Xmx#M] -classpath <othome>\bin\[Link]
[Link].<subclasspath>
[parameters]
Where:
<othome>\bin is the file path where the search JAR file is located.
Backup
The backup utility is used to create either differential or full backups of a partition. Refer
to the section on Backup and Restore for more information.
java -classpath [Link] [Link]
-inifile J:\index\[Link]
Where the inifile identifies the backup configuration file to be used.
Restore
The restore utility is used to restore an index from a prior backup. Refer to the
section on Backup and Restore for more information.
Where the inifile identifies the [Link] file to be used. You may need to run the
restore process many times. Using the utility directly is not for the faint of heart, and
you should probably let Content Server manage this for you.
DumpKeys
The DumpKeys utility attempts to generate a list of all the object IDs for objects in the
partition. This is often a tool of last resort for repairing a corrupted index. The
DumpKeys tool can sometimes extract data from a partition that is otherwise unreadable.
The input to DumpKeys is the [Link] file and partition information, and the output is
a file of object IDs. Sample output looks like this:
c DataId=41280133&Version=1
c DataId=41280132&Version=1
c DataId=41280131&Version=1
The first character details where the object ID was found. If in the checkpoint file, the
first character is a ‘c’ (as in the example above). If an object ID was found in the
metalog file (recently indexed), the first character reflects the operation type:
n: new
a: add
r: replace
m: modify
d: delete
Invoking DumpKeys:
java -Xmx2000M -Xss10M -cp .\[Link];
[Link] -inifile <path_to_search.ini>
-sectionName <IE_or_SE_Section_Name> -log <Path_to_log_file> -output
<Path_to_DumpKeys_Output>
Parameters:
path_to_search.ini: Path to the [Link] file, typically /config/[Link]
IE_or_SE_Section_Name: The full section name including the SearchEngine_ or
IndexEngine_ prefix.
Path_to_DumpKeys_Output: Path to where the output file should be created.
Path_to_log_file: Path to where the log file should be created.
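When post-processing a DumpKeys output file, each line can be split into its source character and its key/value fields. The following Python sketch is a hypothetical helper for doing so; it is not part of OTSE:

```python
# Sketch: parse a DumpKeys output line of the form
#   c DataId=41280133&Version=1
# into a (source, DataId, Version) tuple. Hypothetical helper, not an OTSE tool.
from urllib.parse import parse_qs

def parse_dumpkeys_line(line: str):
    # The first character records where the ID was found ('c' = checkpoint,
    # or an operation code from the metalog); the rest is a query-style string.
    source, _, query = line.strip().partition(" ")
    fields = parse_qs(query)
    return source, int(fields["DataId"][0]), int(fields["Version"][0])

print(parse_dumpkeys_line("c DataId=41280133&Version=1"))  # -> ('c', 41280133, 1)
```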
VerifyIndex
This utility performs internal checks of the structure of the index. Levels 1 through 5
are cumulative, and level 10 is a distinct operation. Parameters are:
-level K -config SearchIniFile -indexengine IEName
[-outFile OutFile] [-html true] [-verbose true]
SubIndex Statistics
Index Statistics
RebuildIndex
This utility rebuilds the dictionary and index for metadata in a partition. This is possible
because an exact copy of the metadata is stored in the checkpoint files. This does not
affect the full text index. This utility can often be used to repair errors detected by a
Level 10 VerifyIndex.
Parameters:
Where
SearchIniFile is the location and name of the [Link] file which should be used.
IEName is the name of the partition which should be rebuilt.
Because this utility needs to build and load the entire index, you may need to ensure
an appropriate -Xmx (memory allocation) parameter is specified on the Java command
line.
LogInterleaver
Each component of the search grid – index and search engines, search federators and
the update distributor – creates its own log files. It can be difficult to trace a single
operation through multiple log files. The LogInterleaver function combines multiple
log files into a single file, ordering entries by their time stamps to simplify
interpretation. The output file has a slightly different syntax – each line of output is
prefixed by the original log file name.
Parameters:
-d logDir | -o outputFile
OR
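The interleaving idea itself can be sketched as a timestamp-ordered merge. The Python below is an illustration only, not the shipped LogInterleaver; the file names are hypothetical, and it assumes each log line begins with a numeric millisecond timestamp:

```python
# Sketch: merge several time-ordered logs into one stream, ordering lines
# by their leading timestamps and prefixing each with its source file name.
import heapq

def interleave(logs):
    """logs maps a file name to its (already time-ordered) list of lines."""
    streams = [
        [(int(line.split()[0]), name, line) for line in lines]
        for name, lines in logs.items()
    ]
    # heapq.merge yields tuples in global timestamp order across all streams.
    return [f"{name}: {line}" for _, name, line in heapq.merge(*streams)]

merged = interleave({
    "se.log": ["100 start", "300 query"],  # hypothetical file names
    "ud.log": ["200 index"],
})
print(merged)
```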
[Link]
Log files from the search components have a time stamp in milliseconds from a
reference date. This utility will convert a log file to have human-readable time/date
values instead, which can be helpful when interpreting the logs manually.
This utility is somewhat unusual in that it reads from console input and writes to console
output, so the typical usage is to “pipe” the source logfile into the java command line,
and redirect the output to a target file like this:
[Link]
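The conversion this utility performs can be sketched as follows. This Python illustration assumes the Unix epoch as the reference date, which is an assumption for illustration only; the engine's actual reference date may differ:

```python
# Sketch: convert a millisecond timestamp offset to a human-readable
# UTC date/time string. The Unix epoch reference date is assumed here.
from datetime import datetime, timezone

def to_readable(ms: int) -> str:
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

print(to_readable(1_600_000_000_000))  # -> "2020-09-13 12:26:40"
```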
This utility enters a console loop. You enter one line of text, and it responds by printing
out each search token generated on a separate line. Control-C will terminate the loop.
Optional command line parameters:
-TokenizerOptions <Number> -tokenizerfile <RegExParserFile>
Where Number represents the bitwise controls for tokenizer options, as defined in the
Tokenizer section of this document. The tokenizerfile parameter specifies an
optional or custom tokenizer definition that may be used.
ProfileMetadata
This utility function loads a checkpoint file, and writes information about the metadata
in the checkpoint to the console. You may wish to redirect the console output to a file
to capture the data.
Parameters:
Where:
l: profile level where 0=High Level,
1=Field Level (Default),
2=Field Part Level
values: true requests the # of objects with values
and the estimated total memory requirement
checkpointFile: file name of the checkpoint file to be
profiled
Refer to sample output fragments for the profile levels below.
Level 0:
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
Level 1:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
2060 Field(Text):OTDocCompany
1996 Field(Text):OTDocRevisionNumber
10668 Field(Text):OTVerCDate
0 Field(Text):OTReservedByName
Level 2:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
1376 Field(Text [RAM]):OTDocCompany dictionary (mappingEntries=0 wsEntries=1
tokenEntries=3)
256 Field(Text [RAM]):OTDocCompany content
428 Field(Text [RAM]):OTDocCompany index
2060 Field(Text [RAM]):OTDocCompany combined
1312 Field(Text [RAM]):OTDocRevisionNumber dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
256 Field(Text [RAM]):OTDocRevisionNumber content
428 Field(Text [RAM]):OTDocRevisionNumber index
1996 Field(Text [RAM]):OTDocRevisionNumber combined
10668 Field(Date):OTVerCDate combined
684 Field(Date):OTDateEffective combined
1312 Field(Text [RAM]):OTContentIsTruncated dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
33920 Field(Text [RAM]):OTContentIsTruncated content
428 Field(Text [RAM]):OTContentIsTruncated index
35660 Field(Text [RAM]):OTContentIsTruncated combined
Field(UserID):OTAssignedTo combined
…
Field(Integer):OTTimeCompleted combined
0 Field(UserLogin):OTReservedByName combined
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
If the parameter “values” is true, the information for each region is considerably more
detailed:
[Link]
The search configuration files allow you to control several aspects of file I/O. Tuning
these for optimal performance can be difficult, since many factors are involved. The
DiskReadWriteSpeed utility can help by simulating disk performance using several of
the available configurations. For each mode, this utility performs 32678 iterations of
the test using an 8KB block of data. Note that this information can help you tune disk
performance or identify system I/O bottlenecks, but it is not necessarily sufficient to
draw a firm conclusion regarding the optimal configuration.
Parameters:
(write|read|both) TestDirectory
The operations tested are:
SearchClient
The SearchClient is a console application that allows you to interactively issue
commands to the Search Federator. The SearchClient is useful for determining that
search is working as expected, or running queries without having an application such
as Content Server running. All console output is expressed in UTF-8 characters. Note
that you might need to increase the default Search Federator timeout values when
using the SearchClient.
It is possible to use the SearchClient with an index that is also being used in a live
production system. In this situation, an open SearchClient consumes a search
transaction from the available pool, which may reduce the transactions available to
other applications.
Parameters:
-host SFHost -port SearchPort [-adminport SFAdminPort] [-time true]
[-echo true] [-pretty true]
SFHost is the URI for the target Search Federator, connected on SearchPort. The
-time true parameter adds response time information to each response.
The -echo parameter will add the input command to the output. This is useful when
redirecting input from a file for batch operations, so you can associate the commands
with the responses. By default, echo is false.
The -pretty parameter will use an alternate formatting of GET RESULTS. The alternate
format does not adhere to the API spec, but is better formatted for human readability
when developing or debugging.
The -csv true parameter will output the results in a form that can be easily imported
into a spreadsheet (comma-separated values). This feature is most useful when
redirecting input and output from/to files. If -pretty is specified, it takes precedence
over -csv.
The -adminport setting enables specific commands to be interpreted and sent to the
administration port of the Search Federator. These admin commands are:
Problem Illustration
subindex1 has internal IDs = 1,2,3,4,5,6,8,9
subindex2 has internal IDs = 5,7,8,9,10
BaseOffset problem: subindex1 should only contain 1,2,3,4. Internal IDs 5,6,8 and 9
overlap with subindex2.
Fix: cut 5,6,8 and 9 from subindex1.
Items 5, 8, and 9 already exist as duplicates in subindex2. However, item 6 only exists
in subindex1, so the fix would remove the only instance of item 6 from the index
content.
After fix:
subindex1New: 1,2,3,4
subindex2: 5,7,8,9,10
Output of DumpSubIndexes before fix: ids for subindex1, subindex2 and deleteMask
Output of RepairSubIndexes: a file which lists the objects removed from subindex1
(5, 6, 8 and 9) along with their external IDs for re-indexing.
Output of Diff tool: a file which only lists object 6 along with its external ID for re-
indexing.
Output of DumpSubIndexes after fix: ids for subindex1New, subindex2 and
deleteMask
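The repair arithmetic from the illustration above reduces to simple set operations, sketched here in Python as an aid to reading the example (the variable names are illustrative only):

```python
# The worked example above as set operations.
subindex1 = {1, 2, 3, 4, 5, 6, 8, 9}
subindex2 = {5, 7, 8, 9, 10}

owned_by_1 = {1, 2, 3, 4}        # IDs subindex1 should contain
cut = subindex1 - owned_by_1     # removed by the fix: {5, 6, 8, 9}
lost = cut - subindex2           # content in no other sub-index: {6}

print(sorted(cut), sorted(lost))  # -> [5, 6, 8, 9] [6]
```

Objects in `lost` are the ones whose content must be re-indexed; objects in `cut` but not in `lost` still exist as duplicates in subindex2.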
Repair Option 1
This approach requires about 30 to 60 minutes for a typical partition, and makes the
index usable as quickly as possible. However, there may be a lot of objects that need
to be reindexed.
Running the RepairSubIndexes utility
java -[Link]
[Link]
-level x -config [Link] -indexengine firstEngine
Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will warn
and exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending
on the size of the sub-index that is being fixed. The utility will produce an
output file bearing the name of the sub-index that was fixed. This file contains
the internal-external objectID (OTObject region value) pairs that can be
utilized for re-indexing.
3. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.
Additional Comments:
While running the tools, it is strongly recommended that the output be
redirected out to a file for easier analysis (… > [Link]).
During the repair process, it is possible to navigate inside the directory where
the index under repair sits. It is possible to observe the new sub-index
fragment being written out, growing larger in size over time.
At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being
repaired (same location where new fragment is made)
Repair Option 2
This method typically requires about 45 minutes longer per partition, but minimizes the
number of objects which may require re-indexing.
Running the RepairSubIndexes utility
example (assuming that both the new [Link] and [Link] are in the
current directory):
• where dir is the index directory where all the output files were written out
• where deleteIDsFile is the output file made by the RepairSubIndexes utility for
the sub-index that was fixed
• where subIndexIDsFile is the appropriate output file made by
DumpSubIndexesIDs utility. It is crucial to use the correct file; if we have
subindex1 and subindex2 with overlap and subindex1 was cut out, then use the
DumpSubIndexesIDs file for subindex2.
example:
The minimum [Link] sections necessary to run this tool are the Index Engine
section, DataFlow section and Partition section. Any file paths mentioned in these
sections should be adjusted to point to the actual location of your index partition
directory in your environment.
Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will warn and
exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending on the
size of the sub-index that is being fixed. The utility will produce an output file
bearing the name of the sub-index that was fixed. This file contains the internal-
external objectID (OTObject region value) pairs that can be utilized for re-indexing.
3a. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
3b. Run the DumpSubIndexesIDs utility after repair. This will generate a date-stamped
file for each sub-index. The file contains all the internal-external IDs for each sub-
index.
3c. Run the DiffObjectIDFiles tool (this only takes a few minutes). This will produce a
smaller set of objects to re-index. This set contains objects whose content was
cut from the bad sub-index and whose content is NOT contained anywhere else
in the partition.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.
Additional Comments:
While running the tools, it is strongly recommended that the output be redirected
out to a file for easier analysis (… > [Link]).
During the repair process, it is possible to navigate inside the directory where the
index under repair sits. It is possible to observe the new sub-index fragment being
written out, growing larger in size over time.
At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being repaired
(same location where the new fragment is made).
Index of Terms
de-duplication, 67
Default, 89
Default Search Regions, 114
Defining a Region, 18
defragmentation, 170
Delayed Commit, 191
DelayedCommitInMilliseconds, 191
Delete, 54
DeleteByQuery, 54
DESC, 89
descending, 89
[Link], 212
DiffObjectIDFiles, 280
Disk Configuration, 191
Disk fragmentation, 188
Disk Performance, 189, 190
Disk Storage, 36
DiskReadWriteSpeed, 275
DROP, 19
DumpSubIndexesIDs, 280
Email Domain, 132
Empty Regions, 20
Entire Value, 73
ENUM, 30
Error Codes, 267
EuroWordNet, 118
Existence, 90
Expand, 64
EXTRAFILTER, 100
Facet Memory, 96
Facet Security, 94
Facets, 91
File Monitoring, 196
FileCleanupIntervalInMS, 142
first, 80
Fragmentation, 170
[Link], 212
Garbage Collection, 195
Get Facets, 61
Get Regions, 67
Get Results, 59
Get Time, 66
getstatuscode, 167
getstatustext, 162
getsystemvalue, 168
hh, 65
High Ingestion, 175
Hit Highlight, 65
HIT LOCATIONS, 71
HyperV, 195
[Link], 282
IN operator, 86
Index Engines, 8
Integer, 28
Interchange Pools, 51
IOChunkBufferSize, 192
iPool errors, 50
iPools, 51
IPv6, 207
JNI, 5
Key, 25, 26
Lang File, 213
left-truncation, 75, 76
Like, 128
LogInterleaver, 273
Long, 28
Low Memory, 37
LQL, 69
About OpenText
OpenText enables the digital world, creating a better way for organizations to work with information, on premises or in the
cloud. For more information about OpenText (NASDAQ: OTEX, TSX: OTC) visit [Link].
Connect with us:
[Link]
Copyright © 2021 Open Text SA or Open Text ULC (in Canada).
All rights reserved. Trademarks owned by Open Text SA or Open Text ULC (in Canada).