Indexed word limits
By default, Tealeaf imposes a limit of 32 characters on the length of words to be indexed. Any word that is longer than 32 characters is truncated to 32 characters for purposes of indexing.
For example, when the maximum word length is 32 characters, the words ThisWordIsMyFavoriteWordOfAllTime and ThisWordIsMyFavoriteWordOfAllTimeNoItsNot are both indexed as ThisWordIsMyFavoriteWordOfAllTim.
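The truncation described above can be illustrated with a minimal Python sketch. The function name `index_form` and the constant `MAX_WORD_SIZE` are hypothetical names chosen for this illustration; they are not part of Tealeaf.

```python
MAX_WORD_SIZE = 32  # default limit described above (hypothetical constant name)


def index_form(word: str, max_word_size: int = MAX_WORD_SIZE) -> str:
    """Return the word as it would be stored after truncation to the limit."""
    return word[:max_word_size]


words = [
    "ThisWordIsMyFavoriteWordOfAllTime",
    "ThisWordIsMyFavoriteWordOfAllTimeNoItsNot",
]

# Both words share the same first 32 characters, so they collapse
# to a single indexed form.
indexed = {index_form(w) for w in words}
print(indexed)
```

Because both words reduce to the same indexed form, a search cannot distinguish between them at the default setting.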
You can change the value of the Maximum Word Size setting to accommodate longer words if they are commonly in use on your web application. The maximum accepted word length is 128 characters.
- Changing this value can significantly alter the size of your indexes. Tealeaf recommends using the default setting.
- Changes to this setting apply only to indexes that are created after the change. Typically, those indexes are created the following day.
- The underlying search engine imposes a maximum limit of 80 characters on field names. This limit still applies when the maximum word length is set to a value greater than 80 characters. Field names that are longer than 80 characters are not included in the index at all, so using them as search terms or field names produces no results.
If you are searching for words longer than the maximum word size:
- You can use the wildcard (*) to search.
- You can create a search field that applies an MD5 hash to the value. Users submit the full text version of the search term, which is converted to the 32-character MD5 hash value and submitted to the search engine for processing.
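The MD5 approach works because the hexadecimal form of an MD5 digest is always 32 characters long, which fits within the default word limit. The following sketch uses Python's standard `hashlib` module. The function name `md5_search_term` is hypothetical, and the UTF-8 encoding and lowercase hexadecimal output are assumptions; the normalization that Tealeaf applies before hashing is not described here.

```python
import hashlib


def md5_search_term(value: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a search term.

    Assumption: the term is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(value.encode("utf-8")).hexdigest()


long_term = "ThisWordIsMyFavoriteWordOfAllTimeNoItsNot"
hashed = md5_search_term(long_term)
print(len(hashed))  # the digest length does not depend on the input length
```

The same function must be applied both when the value is indexed and when the user submits the search term, otherwise the hashes do not match.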
Indexing hyphens
The Tealeaf search engine indexes blocks of text and provides mechanisms that control how special characters are treated. Hyphens in session data can be treated in multiple ways.
For example, the term cross-reference might appear in indexed data as:
crossreference
cross-reference
cross reference
Individual words within the hyphenated phrase are always indexed. In the above example, cross and reference are indexed in all methods.
You can configure the session indexer to index hyphenated text using any or all of the above methods. To specify the indexing style for hyphens, set Indexing Hyphen Style to one of the following values:
Value | Description |
---|---|
Ignored | Ignore hyphen (crossreference). |
Searchable Text | Treat hyphens as searchable text (cross-reference). |
Spaces | Treat hyphens as space (cross reference). This is the default value. |
All | Index in all of the above styles. Note: Setting this value to All may bloat index sizes and produce unexpected results in searches involving longer phrases or words with multiple hyphens. |
You should monitor changes in indexing rates after making this change.
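The effect of each hyphen style can be modeled with a short Python sketch. This is only an illustration of the behavior described in the table, not the indexer's actual implementation; the function name `hyphen_index_terms` is hypothetical, and the style names are taken from the table above.

```python
STYLES = ("Ignored", "Searchable Text", "Spaces", "All")


def hyphen_index_terms(term: str, style: str = "Spaces") -> set:
    """Return the set of forms a hyphenated term could take in the index."""
    if style not in STYLES:
        raise ValueError(f"unknown hyphen style: {style!r}")

    parts = [p for p in term.split("-") if p]
    terms = set(parts)  # individual words are always indexed

    if style in ("Ignored", "All"):
        terms.add("".join(parts))   # crossreference
    if style in ("Searchable Text", "All"):
        terms.add(term)             # cross-reference
    if style in ("Spaces", "All"):
        terms.add(" ".join(parts))  # cross reference
    return terms


for style in STYLES:
    print(style, sorted(hyphen_index_terms("cross-reference", style)))
```

In this model, the All style stores three extra forms for each hyphenated term, which is consistent with the note about larger index sizes.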
Index format and storage
Indexes consist of an index library file (IXLIB.ILB) and a corresponding group of index files (*.IX).
- The ILB file is used only if dtSearch Desktop is enabled. Index libraries are essentially lists that keep track of the names and locations of each index. The IXLIB.TLL file contains the same information as the library file, in addition to information used exclusively by Tealeaf CX.
Index directories
An index directory is a sub-directory below the TeaLeaf\Canister\Indexes directory. Index directories are named with the time and date of index creation in the following format:
YYYYMMDDxxx
where xxx is three sequential uppercase letters. For example, an index created on December 12, 2018 may be stored in a directory named 20181212AAA.
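When troubleshooting, it can be useful to split a directory name into its date and suffix. The following Python sketch does this under the assumption that the suffix is any three uppercase ASCII letters; the function name `parse_index_dir` is hypothetical and not part of Tealeaf.

```python
import re
from datetime import date, datetime

# YYYYMMDD followed by three uppercase letters, for example 20181212AAA
INDEX_DIR_PATTERN = re.compile(r"(\d{8})([A-Z]{3})")


def parse_index_dir(name: str):
    """Split an index directory name into (creation date, letter suffix)."""
    match = INDEX_DIR_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an index directory name: {name!r}")
    # strptime raises ValueError if the digits are not a real calendar date
    created = datetime.strptime(match.group(1), "%Y%m%d").date()
    return created, match.group(2)


created, suffix = parse_index_dir("20181212AAA")
print(created.isoformat(), suffix)
```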
An index file may represent several sessions, a single session, or a partial session depending on the limits specified for your indexing options. The number of created indexes depends on the individual index size limit specified in the Indexing Options dialog box. For example, if the individual index size is limited to 50 MB, a new index directory is created after the files in the current index directory reach this limit.
After an index is created, it is added to the library file and listed by directory name.
Character indexing
Some rules that govern how specific characters are indexed can be applied through the alphabet.dat file. Additional special rules may apply to specific data structures.
Format of Index Control File (IXLIB.TLL)
The following table provides a description of each tag that comprises the IXLIB.TLL file, a Tealeaf-specific file used by the CX RealiTea Viewer for searching. It may be necessary to check this file for troubleshooting purposes.
Tag | Description |
---|---|
<Day> | A text version of the date |
<Julian> | A pseudo-julian date: ((year - 2000) * 1000) + DayOfTheYear |
<FirstUse> | UNIX™ time of the last time of the first session in the index |
<LastUse> | UNIX time of the last time of the last session in the index |
<IndexName> | The name of the index |
<IndexPath> | Relative path of the index |
<Valid> | Is this index valid? False under certain situations, primarily merging, while indexes are being created. |
<InUse> | Is the index in use? |
<FirstSession> | Canister session identifier of the first session in the index |
<LastSession> | Canister session identifier of the last session in the index |
<CheckRequired> | Should a verification be run on this index? This option is set only when the -F flag is given to IndexCheck, or if something went wrong during normal operation. |
<IndexSize> | dtSearch determination of the size of the index |
<DocCount> | dtSearch internal value of the document count for this index |
<CheckCount> | Is the TLPIS.ix file current for this index? |
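The pseudo-julian formula given for the Julian tag can be checked with a small Python sketch. The function name `pseudo_julian` is hypothetical; only the formula itself comes from the table above.

```python
from datetime import date


def pseudo_julian(d: date) -> int:
    """Compute ((year - 2000) * 1000) + day of the year."""
    day_of_year = d.timetuple().tm_yday  # January 1 is day 1
    return (d.year - 2000) * 1000 + day_of_year


# December 12, 2018 is day 346 of a non-leap year.
print(pseudo_julian(date(2018, 12, 12)))  # 18346
```

Comparing this value with the Day tag of the same entry is one way to confirm that the two fields agree.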
Indexed content types
The following content types, also called Internet media types and MIME types, are indexed by default.
text/html
text/plain
text/xml
application/xhtml+xml
application/rdf+xml
application/vnd.mozilla.xul+xml
application/xml
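As an illustration, the default list can be expressed as a simple membership check in Python. The function name `is_indexed_by_default` is hypothetical, and the handling of the header value (ignoring parameters such as `charset` and comparing case-insensitively) is an assumption; how Tealeaf itself matches content types is not described here.

```python
INDEXED_CONTENT_TYPES = frozenset({
    "text/html",
    "text/plain",
    "text/xml",
    "application/xhtml+xml",
    "application/rdf+xml",
    "application/vnd.mozilla.xul+xml",
    "application/xml",
})


def is_indexed_by_default(content_type_header: str) -> bool:
    """Check a Content-Type header value against the default list.

    Assumption: parameters after ';' are ignored and case is not significant.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXED_CONTENT_TYPES


print(is_indexed_by_default("text/html; charset=utf-8"))
print(is_indexed_by_default("image/png"))
```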