About Tokens and Patterns

For data processed prior to Release 4.3.11.0, the software uses tokens to identify Patterns and different types of information in a document, such as emails, images, audio files, or parsing errors. These tokens and their associated Pattern values are stored in content processed prior to 4.3.11.0, and you can search using the tokens and the values.

For newly imported, updated, or reprocessed data in 4.3.11.0 or later, the software no longer stores tokens and values associated with any enabled System Patterns or Custom Patterns in the content. Instead, the software uses two Pattern-related metadata fields to contain Pattern information, and you can search for Pattern information using the metadata fields. In addition, the tokens related to parsing errors and types of content such as image files or audio files (parsing tokens) no longer apply.

Working with Patterns and Pattern Values (for data processed as of 4.3.11.0)

For data processed in 4.3.11.0 or later, the software stores Pattern information in two Pattern-related metadata fields, pattern and patternvalue, and you can search for Pattern information using these fields.

If a Pattern is enabled, you can search for documents with that Pattern using the pattern metadata field and the Pattern name. You can type the Pattern name in either lowercase or uppercase, because the software always normalizes the name to lowercase. The results include the Pattern matches. For example:

pattern::email

If a Pattern is enabled with values stored, you can also search using the patternvalue metadata field. This patternvalue search enables you to find documents with a matching Pattern value from the document content or email content (which does not include the email header). For this search, you must place the value within single quotes (for a literal search). For example:

patternvalue::'jsmith@someco.com'

patternvalue::'http://www.state.gov/s/ct/'

patternvalue::'\\\server\'
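The two search formats above can be sketched as small query-building helpers. This is only an illustration: the function names pattern_query and patternvalue_query are hypothetical and not part of the product, and the helper refuses values that contain a single quote because this section does not describe how such values should be escaped.

```python
def pattern_query(name: str) -> str:
    """Build a pattern field search for a Pattern name (hypothetical helper)."""
    name = name.strip()
    if not name:
        raise ValueError("Pattern name must not be empty")
    # The software uses lowercase Pattern names, so normalize up front.
    return f"pattern::{name.lower()}"


def patternvalue_query(value: str) -> str:
    """Build a literal patternvalue search by wrapping the value in single quotes."""
    if "'" in value:
        # Assumption: escaping of embedded quotes is not documented here.
        raise ValueError("value contains a single quote; escaping is not defined here")
    return f"patternvalue::'{value}'"


print(pattern_query("EMAIL"))
print(patternvalue_query("jsmith@someco.com"))
```

The value is passed through unchanged, because a patternvalue search is literal.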

Only three System Patterns (email, unc, and uri) are enabled by default and store matching values as terms. The following table shows which Patterns are enabled by default and how to search for them using a pattern field search with the Pattern name. For more information, see System Patterns.

 

Searchable Pattern Name (for data processed in 4.3.11.0 or later) Status Values Stored
pattern::<custom_name> Disabled  
pattern::creditcard Disabled  
pattern::date_euro Disabled  
pattern::date_us Disabled  
pattern::email Enabled Yes
pattern::ipv4 Disabled  
pattern::phone Disabled  
pattern::ssn Disabled  
pattern::unc Enabled Yes
pattern::uri Enabled Yes
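As a rough sketch, the default settings in the table above can be captured in a lookup that tells you which pattern searches should return results out of the box. The names SYSTEM_PATTERN_DEFAULTS, default_searches, and supports_value_search are hypothetical, a blank Values Stored cell is treated as "not stored" (an assumption), and the actual settings of a given Project may differ from these defaults.

```python
# (enabled_by_default, values_stored) per System Pattern, copied from the table.
# Assumption: a blank "Values Stored" cell means values are not stored.
SYSTEM_PATTERN_DEFAULTS = {
    "creditcard": (False, False),
    "date_euro": (False, False),
    "date_us": (False, False),
    "email": (True, True),
    "ipv4": (False, False),
    "phone": (False, False),
    "ssn": (False, False),
    "unc": (True, True),
    "uri": (True, True),
}


def default_searches() -> list[str]:
    """Return the pattern searches for Patterns that are enabled by default."""
    return [
        f"pattern::{name}"
        for name, (enabled, _stored) in SYSTEM_PATTERN_DEFAULTS.items()
        if enabled
    ]


def supports_value_search(name: str) -> bool:
    """True if a patternvalue search is expected to work with default settings."""
    enabled, stored = SYSTEM_PATTERN_DEFAULTS.get(name.lower(), (False, False))
    return enabled and stored


print(default_searches())
```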

Working with Tokens (for data processed before 4.3.11.0)

Data processed prior to 4.3.11.0 supports tokens and token search in content.

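Each token is associated with one of the following: a parsing result (a type of content, such as an image or audio file, or a parsing error) or a Pattern (a System Pattern or a Custom Pattern).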

Tokens are applied during the pre-4.3.11.0 parsing process. When the pre-4.3.11.0 parsing process finds content that matches token criteria, it tags that document with that token. Also, Clustering of data processed prior to 4.3.11.0 may yield parsing tokens and Pattern tokens in individual Clusters.

For data processed prior to 4.3.11.0, you can also find documents that contain parsing errors or a certain type of content by searching for their parsing tokens using the searchable 'token-<token_name>' format. This format requires you to place the search within single quotes and specify the Token name in lowercase, since the software normalizes a Token name to lowercase.

In a Clustered view of data processed prior to 4.3.11.0, you might see Clusters with parsing tokens in their top terms list. These documents were clustered based on how their content was assessed, for example, token-error_unknown_type, token-error_no_content, and token-error_parsing.

To perform a parsing token search in data processed prior to 4.3.11.0, use the following format:

'token-<token_name>'

For example:

'token-image'

For data processed prior to 4.3.11.0, you can also perform a token search directly for a matching value of a Pattern if the Pattern is enabled and has values stored (for example, a given email).

Note: For Projects that use V2 tokenization, you must enclose a token search within single quotes. Projects that use V1 tokenization do not require single quotes, although a quoted token search still yields the expected results. The token- format itself works the same way in both cases, and new Projects also require the single quotes.
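The 'token-<token_name>' format and the quoting rule in the note above can be sketched as a single helper. The function name token_query is hypothetical. It always adds the single quotes, since a quoted search is required for V2 tokenization and still works for V1.

```python
def token_query(token_name: str) -> str:
    """Build a quoted token search such as 'token-image' (hypothetical helper)."""
    # Token names are normalized to lowercase by the software.
    name = token_name.strip().lower()
    # Accept either "image" or "token-image" as input.
    if name.startswith("token-"):
        name = name[len("token-"):]
    if not name:
        raise ValueError("token name must not be empty")
    # Always quote: required for V2 tokenization, harmless for V1.
    return f"'token-{name}'"


print(token_query("Image"))
print(token_query("token-error_parsing"))
```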

Working with Parsing Tokens (for data processed before 4.3.11.0)

For data processed prior to Release 4.3.11.0, the software uses tokens to identify different types of content in a document, such as image or audio files, or parsing errors. These tokens and their associated values are stored in content, and you can search using the token names, as shown in the following table.

 

Searchable Token Name Description
'token-audio' Audio file
'token-encrypted' Encrypted file, which could be password protected and/or encrypted
'token-error_file_open' Error opening file
'token-error_no_content' File with no content or with zero length
'token-error_parsing' File with a parsing error
'token-error_timeout' Timeout trying to parse file
'token-error_unknown_type' File with undetermined type
'token-error_unrecognized_type' File type not supported by parsing library
'token-error_xml_parse' Error trying to read metadata
'token-exe' Executable file (this will find all types of exe, dll, or other system files identified as binary files)
'token-image' Image file
'token-md5' MD5 hash codes (see Note 1)
'token-stopwords_only' File with content that includes only stop words (words that do not generally enhance meaning, such as articles, pronouns, and prepositions, plus other selected words in languages such as English, Spanish, and French). Note: A token search for documents that contain only stop words returns no results, but the system considers this token when it forms clusters.

Notes:

  • Note 1: MD5 values are always stored as terms so that you can search for an MD5 hash code as content.
  • Note 2: For data processed prior to 4.3.11.0, three tokens, token-currency, token-quantity, and token-numeric_term, are reserved for identifying numeric content and are disabled (that is, you cannot use them to search for numeric content). However, the Numeric Index Settings are enabled by default and store numeric content values that are searchable. Custom Patterns can also be created to employ numeric content, if desired.
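As an illustration of working with the table above, the following sketch keeps the parsing tokens in a lookup and generates one quoted search per parsing-error token. The names PARSING_TOKENS and error_token_queries are hypothetical. The searches are returned separately rather than combined, because this section does not describe any syntax for joining several token searches.

```python
# Parsing token names and descriptions, copied from the table above.
PARSING_TOKENS = {
    "audio": "Audio file",
    "encrypted": "Encrypted file (password protected and/or encrypted)",
    "error_file_open": "Error opening file",
    "error_no_content": "File with no content or with zero length",
    "error_parsing": "File with a parsing error",
    "error_timeout": "Timeout trying to parse file",
    "error_unknown_type": "File with undetermined type",
    "error_unrecognized_type": "File type not supported by parsing library",
    "error_xml_parse": "Error trying to read metadata",
    "exe": "Executable file",
    "image": "Image file",
    "md5": "MD5 hash codes",
    "stopwords_only": "File with content that includes only stop words",
}


def error_token_queries() -> list[str]:
    """Return one quoted token search per parsing-error token, sorted by name."""
    return [
        f"'token-{name}'"
        for name in sorted(PARSING_TOKENS)
        if name.startswith("error_")
    ]


for query in error_token_queries():
    print(query)
```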

Working with System Patterns and Their Tokens (for data processed before 4.3.11.0)

In data processed prior to 4.3.11.0, you can search for enabled System Patterns using the searchable token name format. The following table shows which System Patterns are enabled by default and how to search for them in data processed prior to 4.3.11.0. For more information, see System Patterns.

 

Searchable Token Name (for data processed prior to 4.3.11.0) Status Values Stored
'token-<custom_name>' Disabled  
'token-creditcard' Disabled  
'token-date_euro' Disabled  
'token-date_us' Disabled  
'token-email' Enabled Yes
'token-ipv4' Disabled  
'token-phone' Disabled  
'token-ssn' Disabled  
'token-unc' Enabled Yes
'token-uri' Enabled Yes