Parsing Tokens, Numbers, and Patterned Data

To get the most out of the Digital Reef analysis techniques, it helps to understand the steps that documents go through as the Administrator populates the Project with data. This topic describes the steps the Administrator can take to get documents ready for analysis.

Manage Project Settings

Project Settings affect all Data Sets in the Project. The Admin, or a user with equivalent permissions, can specify a wide range of settings, including how Stop Words are treated. The Admin can also manage Tags, Numeric Settings, and System Patterns, and can allow Users to create Custom Patterns to extend the types of content recognized by the parsing process. These terms are defined as follows:

  • Stop Words: Words that do not generally enhance meaning, such as articles, pronouns, and prepositions, plus other selected words in languages such as English, Spanish, and French.
  • System Patterns: A set of preconfigured Patterns (also known as regular expressions) that match specific types of data. The content of a System Pattern cannot be edited, and a System Pattern cannot be deleted. System Patterns can be enabled and disabled, with or without storing values, and can be copied to serve as the basis for Custom (user-defined) Patterns. They must be enabled before they are available for use. A subset of System Patterns is enabled automatically.
  • Custom Patterns: Locally defined Patterns (also known as regular expressions), each identified by its name, that are used during the parsing process to match specific data patterns. Custom Patterns can be created, deleted, enabled, and disabled. They must be enabled before data is added or reprocessed.
  • Parsing: Parsing occurs when you add files to a Case or Project. The parsing operation identifies file types, intelligently extracts useful text, words, phrases, and metadata for each supported file type, and assigns tokens to files. The tokens track different types of content and can be used to search for different types of content and, optionally, values of that type of content. You can enable optional patterns to expand the parsing behavior, but they must be configured and enabled before you add documents to an Index.

Note: Patterns apply to initial import, a Pattern update, or reprocessing. If you change the System Patterns or Custom Patterns for your Project, you can either use the Reprocess option from results to reprocess and pick up the latest Pattern changes, or you can update the Patterns using the standalone Update Patterns option (for example, by right-clicking on a Data Set in the tree). Otherwise, your Pattern changes will have no effect on the existing Data Set documents.

Parsing Defaults

The default parsing process enables the email, uri, and unc System Patterns with stored values, so you can search for individual values of those types. The default Project Index Settings include numeric settings that enable you to search for the stored numeric values. Note that the default parsing process does not capture other patterned data as "terms." Patterned data is data that can be identified by a predictable sequence of characters, such as a Social Security Number (NNN-NN-NNNN); an IP address (192.168.12.133); a Uniform Resource Locator (HTTP://, HTTPS://, FTP://); and others. As a result, by default, searches on these items return no results, and, for data processed prior to 4.3.11.0, clustering cannot be based on other types of patterned data.

As part of determining the appropriate Project Settings (or Organization Settings using templates), an Administrator can change the default parsing behavior by changing the Patterns. For example, the Admin may want to enable more of the System Patterns or establish Custom Patterns. An enabled Pattern can be used merely to identify documents that contain matching data (by using the Pattern name in a pattern metadata field search), or it can also capture the matching data and make the stored values available for searches and similarity comparisons.
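As a rough illustration of what "patterned data" means, the following Python sketch applies simplified regular expressions to the three kinds of data mentioned above (a Social Security Number, an IP address, and a URL). The expressions, the dictionary name, and the function name are assumptions made for this example only; they are not the definitions used by the product's System Patterns.

```python
import re

# Hypothetical, simplified expressions for illustration only.
# The product's actual System Patterns may be defined differently.
ILLUSTRATIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"\b(?:https?|ftp)://\S+", re.IGNORECASE),
}


def find_patterned_data(text):
    """Return {pattern name: [matched strings]} for each pattern that matches."""
    found = {}
    for name, regex in ILLUSTRATIVE_PATTERNS.items():
        matches = regex.findall(text)
        if matches:
            found[name] = matches
    return found


sample = "Host 192.168.12.133 served HTTPS://example.com/a for 123-45-6789."
result = find_patterned_data(sample)
```

Each expression describes a predictable character sequence rather than a specific value, which is why one pattern can recognize every value of that type in a document.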

Each document added to a Data Set is parsed to create the derivative files that are used in clustering and searches. The parsing process can perform the following actions:

Advanced Parsing Options

A user with the ability to manage Project or Organization Settings can affect how data is recognized during the parsing operation. For example, an Organization Administrator can enable selected System Patterns and Custom Patterns for a Project. When the parsing software finds a document that includes data that matches an enabled Pattern, it flags that document (assigns a token) as containing that type of data. In data processed prior to 4.3.11.0, tokens are labels that the parsing software associates with each document to track the different types of content in the document; they are also used to identify documents that experienced parsing errors. The Administrator can optionally configure a System Pattern or a Custom Pattern to also store matching content as values, which makes that data available for searches and clustering.

For example, during the parsing operation, the system can assign a credit card number token (one of the System Patterns) to every document that contains one or more credit card numbers. You can then do a token search ('token-creditcard') and list the documents that include credit card numbers. If you configure the System Pattern to also store values, the parsing operation captures credit card numbers as data and you can then search for specific credit card numbers.
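The difference between flagging a document and storing its matching values can be sketched in a few lines of Python. This is a conceptual model only: the regular expression is a simplified stand-in for a credit card pattern, and the parse_document function and the shape of the returned record are hypothetical. Only the 'token-creditcard' token name and the patternvalue field name come from this topic.

```python
import re

# Hypothetical stand-in for a credit card pattern: 16 digits in groups of
# four, optionally separated by spaces or hyphens. Not the product's pattern.
CREDIT_CARD = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def parse_document(text, store_values=False):
    """Flag a document that matches the pattern; optionally keep the values."""
    matches = CREDIT_CARD.findall(text)
    record = {"tokens": [], "patternvalue": []}
    if matches:
        # The document is flagged whether or not values are stored.
        record["tokens"].append("token-creditcard")
        if store_values:
            # Stored values are what make value-specific searches possible.
            record["patternvalue"].extend(matches)
    return record


doc = "Card on file: 4111 1111 1111 1111."
flag_only = parse_document(doc)
with_values = parse_document(doc, store_values=True)
```

In this model, flag_only can answer "which documents contain a credit card number?", while with_values can also answer "which documents contain this particular number?".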

Project Settings are available to do the following:

  • Enable the parsing of Stop Words.
  • Enable and disable System Patterns.
  • Create Custom Patterns.
  • Allow users to create Custom Patterns. (These must then be enabled before they are included in the parsing operation.)
  • Configure System and Custom Patterns so that the system stores matching values (in the patternvalue metadata field).

For example, if the System Pattern for Social Security Numbers is enabled and configured to capture SSNs as values, the parsing process detects all documents that satisfy the ssn Pattern and have SSN values. This enables searches for documents that have the ssn Pattern as well as searches for specific Social Security Numbers. Documents processed prior to 4.3.11.0 that have a large number of Social Security Numbers could be similar enough to be formed into their own Cluster.

For more information on all of the System Patterns, see System Patterns.

Note: Enabling Patterns may increase the time it takes to import files and the amount of space required to store the files. You enable System and Custom Patterns individually, so the more Patterns you enable, the greater the increase in processing time and storage requirements.

Custom Patterns

A user with the appropriate permissions normally creates Custom Patterns to perform pattern matches that are not covered by any of the System Patterns. To go into effect for the next parsing of data, a Pattern must be enabled.
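The following Python sketch models the rule that a Pattern takes part in parsing only when it is enabled. Everything in it is hypothetical: the CustomPattern class, the apply_patterns function, and the two example patterns (a ticket ID and a badge ID) are invented for illustration and do not reflect how the product stores or evaluates Custom Patterns.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomPattern:
    """Hypothetical model of a named, user-defined pattern."""
    name: str
    regex: str
    enabled: bool = False  # modeled as disabled until explicitly enabled


def apply_patterns(text, patterns):
    """Return the names of enabled patterns that match the text."""
    return [p.name for p in patterns if p.enabled and re.search(p.regex, text)]


patterns = [
    CustomPattern("ticket-id", r"\bTKT-\d{6}\b", enabled=True),
    CustomPattern("badge-id", r"\bBDG-[A-Z]{2}\d{4}\b"),  # left disabled
]

text = "See TKT-004217; badge BDG-QX1234 was used."
matched = apply_patterns(text, patterns)
```

Although the text contains a value that fits the badge-id expression, only ticket-id is reported, because the disabled pattern is skipped during the pass.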

Notes:

  • Only a few System Patterns are enabled for you initially. Many System and Custom Patterns are disabled by default. They must be enabled, and once enabled, they apply to all Data Sets in the Project.
  • You cannot edit or delete the predefined System Patterns. You can copy them as the basis for a Custom Pattern, however.
  • System and Custom Patterns must be configured and enabled before adding documents because that is when the parsing operation takes place.
  • For information on searching for values associated with Patterns, see Use the Standard Search Syntax for Basic Queries.
