Parsing Tokens, Numbers, and Patterned Data
To get the most out of the Digital Reef analysis techniques, it helps to understand the steps that documents go through as the Administrator populates the Project with data. This topic describes the steps the Administrator can take to get documents ready for analysis.
Manage Project Settings
Project Settings affect all Data Sets in the Project. The Admin or a user with equivalent permissions can specify a wide range of settings, including how Stop Words are treated. The Admin can also manage Tags, Numeric Settings, and System Patterns, and can allow Users to create Custom Patterns to enhance the type of content that is recognized by the parsing process. These terms are defined as follows:
- Stop Words: Words that do not generally enhance meaning, such as articles, pronouns, and prepositions, plus other selected words in languages such as English, Spanish, and French.
- System Patterns: A set of preconfigured Patterns (also known as regular expressions) that match specific types of data. The content of a System Pattern cannot be edited and a System Pattern cannot be deleted. System Patterns can be enabled and disabled with or without storing values and can be copied to serve as the basis for Custom (user-defined) Patterns. They must be enabled before they are available for use. A subset of System Patterns is enabled automatically.
- Custom Patterns: A locally defined pattern (also known as a regular expression), identified by its name, that is used during the parsing process to match specific data patterns. Custom Patterns can be created, deleted, enabled, and disabled. Custom Patterns must be enabled before data is added or reprocessed.
- Parsing: Parsing occurs when you add files to a Case or Project. The parsing operation identifies file types, intelligently extracts useful text, words, phrases, and metadata for each supported file type, and assigns tokens to files. The tokens track different types of content and can be used to search for different types of content and, optionally, values of that type of content. You can enable optional patterns to expand the parsing behavior, but they must be configured and enabled before you add documents to an Index.
Note: Patterns apply to initial import, a Pattern update, or reprocessing. If you change the System Patterns or Custom Patterns for your Project, you can either use the Reprocess option from results to reprocess and pick up the latest Pattern changes, or you can update the Patterns using the standalone Update Patterns option (for example, by right-clicking on a Data Set in the tree). Otherwise, your Pattern changes will have no effect on the existing Data Set documents.
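The effect described in this Note can be illustrated with a minimal Python sketch. It is not the product's implementation: the in-memory `enabled_patterns` table, the `parse` function, the regular expressions, and the `ipaddress` Pattern name are all hypothetical stand-ins chosen for illustration (only `email` is named in this topic as a default System Pattern).

```python
import re

# Hypothetical model: enabled Pattern names mapped to simplified regexes.
enabled_patterns = {"email": r"[\w.+-]+@[\w-]+\.[\w.-]+"}


def parse(text):
    """Return the names of the enabled Patterns that match the text."""
    return {name for name, regex in enabled_patterns.items() if re.search(regex, text)}


text = "Contact jo@example.com from 192.168.12.133"

matched = parse(text)      # initial import
print(sorted(matched))     # ['email']

# A Pattern is enabled after the document was imported (hypothetical name).
enabled_patterns["ipaddress"] = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"
print(sorted(matched))     # still ['email']: the stored result is unchanged

matched = parse(text)      # stands in for Reprocess or Update Patterns
print(sorted(matched))     # ['email', 'ipaddress']
```

In this sketch the second `parse` call plays the role of the Reprocess or Update Patterns option: until it runs, the newly enabled Pattern has no effect on the existing document.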
Parsing Defaults
The default parsing process enables System Patterns (email, uri, and unc) with stored values to enable searching for individual values of that type. The default Project Index Settings include numeric settings that enable you to search for the stored numeric values. Note that the default parsing process does not capture other patterned data as "terms," which means that by default, searches on these items return no results, and, for data processed prior to 4.3.11.0, clustering cannot be based on other types of patterned data.
Patterned data is data that can be identified in a predictable sequence of characters, such as a Social Security Number (NNN-NN-NNNN); an IP address (192.168.12.133); a Uniform Resource Locator (HTTP://, HTTPS://, FTP://); and others. A user with the appropriate permissions can enable predefined System Patterns (a few are enabled by default) or define Custom Patterns.
As part of determining the appropriate Project Settings (or Organization Settings using templates), an Administrator can change the default parsing behavior by changing the Patterns. For example, the Admin may want to enable more of the System Patterns or establish Custom Patterns. For the enabled Patterns, Patterns can be used to merely identify documents that contain the data identified by the Patterns (using the
Each document added to a Data Set is parsed to create the derivative files that are used in clustering and searches. The parsing process can perform the following actions:
- Build an index of the document terms.
- Capture embedded metadata, structural metadata, and analytic metadata.
  Embedded metadata: Metadata about the file content. For an email message, embedded metadata includes fields such as To, From, and Subject. For other types of documents, embedded metadata includes fields such as Author, File Type, and Date Modified. Embedded metadata fields are available for Search operations. Note that Digital Reef uses "normalized" metadata fields because different file types encode the same metadata using different labels.
  Structural metadata: Metadata fields that show information about the structure of the file. For example, an email might have an "attachments" metadata field to indicate that the email message had an attachment. Other structural fields include filetype, size, and filemd5. You can view structural and other metadata fields in the Metadata tab of the Document Viewer.
- Prior to export, a user with the appropriate permissions can use Project Settings (or templates) to determine which Export Fields appear in the export manifest, the order of the fields, and the field names. For example, to accommodate an eDiscovery workflow, you may want to order key export-only fields ahead of other fields and rename certain fields.
After export, you can view the complete set of system metadata fields (all indexed fields and export-only fields) in the appropriate file manifest (for example, Concordance .DAT, .CSV, or EDRM XML). An EDRM XML file manifest additionally provides EDRM-specific fields.
- Calculate one MD5 hash code value for the content (contentmd5) and one for the content plus the embedded metadata (filemd5). Emails get an additional hash code for each altbody section.
  MD5 (Message-Digest Algorithm 5) is an IETF standards-based method of converting an input to a unique 128-bit value. Each document in a Data Set has two hash codes calculated for it. If you view metadata for a document, you can view (and copy) the "contentmd5" value, which represents the content of the file, and the "filemd5" value, which represents both the content and the embedded metadata for the file.
- Break out email archive files into individual files and attachments.
- For data processed prior to 4.3.11.0, tag documents with tokens, which identify the different types of content in a document and allow you to search for documents with specific types of content. Each enabled System Pattern and Custom Pattern adds a token to responsive documents as well. See Tokens for a complete list of tokens.
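The two MD5 hash codes in the list above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that contentmd5 is the MD5 digest of the content bytes and that filemd5 is the digest of the content followed by a serialized form of the embedded metadata. The function name `document_hashes` and the serialization format are hypothetical; the product may combine content and metadata differently.

```python
import hashlib


def document_hashes(content: bytes, embedded_metadata: dict) -> dict:
    """Return illustrative contentmd5 and filemd5 values for one document.

    Assumption: metadata is serialized as sorted "key=value" lines before
    hashing. This is a stand-in, not the product's actual method.
    """
    # Hash of the content alone.
    content_md5 = hashlib.md5(content).hexdigest()

    # Hash of the content plus the embedded metadata.
    serialized = "\n".join(
        f"{key}={value}" for key, value in sorted(embedded_metadata.items())
    ).encode("utf-8")
    file_md5 = hashlib.md5(content + serialized).hexdigest()

    return {"contentmd5": content_md5, "filemd5": file_md5}


body = b"Quarterly report draft"
first = document_hashes(body, {"Author": "A. Smith"})
second = document_hashes(body, {"Author": "B. Jones"})

# Same content gives the same contentmd5; different metadata changes filemd5.
print(first["contentmd5"] == second["contentmd5"])  # True
print(first["filemd5"] == second["filemd5"])        # False
```

Under these assumptions, two copies of a file whose text is identical but whose embedded metadata differs share a contentmd5 value and differ in filemd5.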
Advanced Parsing Options
A user with the ability to manage Project or Organization Settings can affect how data is recognized during the parsing operation. For example, an Organization Administrator can enable selected System Patterns and Custom Patterns for a Project. When the parsing software finds a document that includes data that matches an enabled Pattern, it flags that document (assigns a token) as containing that type of data. In data processed prior to 4.3.11.0, tokens are labels that the parsing software associates with each document to track the different types of content in the document; they are also used to identify documents that experienced parsing errors. The Administrator can optionally configure a System Pattern or a Custom Pattern to also store matching content as values, which makes that data available for searches and clustering.
For example, during the parsing operation, the system can assign a credit card number token (one of the System Patterns) to every document that contains one or more credit card numbers. You can then do a token search ('token-creditcard') and list the documents that include credit card numbers. If you configure the System Pattern to also store values, the parsing operation captures credit card numbers as data and you can then search for specific credit card numbers.
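The difference between flagging a document with a token and also storing matching values can be sketched in Python. This is a simplified model, not the product's parser: the `Pattern` and `ParsedDocument` classes, the `parse` function, and both regular expressions are hypothetical stand-ins. The token name `token-creditcard` comes from the example above; building tokens as `token-` plus the Pattern name is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Pattern:
    name: str                   # assumed to form the token, e.g. "token-creditcard"
    regex: str                  # simplified stand-in, not the real System Pattern
    enabled: bool = False
    store_values: bool = False


@dataclass
class ParsedDocument:
    text: str
    tokens: set = field(default_factory=set)
    pattern_values: dict = field(default_factory=dict)


def parse(text: str, patterns: list) -> ParsedDocument:
    """Assign a token per matching enabled Pattern; optionally keep the values."""
    doc = ParsedDocument(text)
    for pattern in patterns:
        if not pattern.enabled:
            continue  # disabled Patterns are ignored during parsing
        matches = re.findall(pattern.regex, text)
        if matches:
            doc.tokens.add(f"token-{pattern.name}")
            if pattern.store_values:
                doc.pattern_values[pattern.name] = matches
    return doc


# Stand-in regexes: four groups of four digits, and NNN-NN-NNNN.
creditcard = Pattern("creditcard", r"\b(?:\d{4}[- ]){3}\d{4}\b",
                     enabled=True, store_values=True)
ssn = Pattern("ssn", r"\b\d{3}-\d{2}-\d{4}\b", enabled=False)

doc = parse("Card 4111-1111-1111-1111 on file; SSN 123-45-6789.", [creditcard, ssn])
print(doc.tokens)          # {'token-creditcard'}
print(doc.pattern_values)  # {'creditcard': ['4111-1111-1111-1111']}
```

In this model the disabled ssn Pattern contributes neither a token nor values, and a Pattern enabled without `store_values` would add only its token, which supports a token search but not a search for a specific value.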
Project Settings are available to do the following:
- Enable the parsing of Stop Words.
- Enable and disable System Patterns.
- Create Custom Patterns.
- Allow users to create Custom Patterns. (These Patterns must then be enabled before they are included in the parsing operation.)
- Configure System and Custom Patterns so that the system stores matching values (in the patternvalue metadata field).
For example, if the System Pattern for Social Security Numbers is enabled and configured to capture SSNs as values, the parsing process detects all documents that satisfy the ssn Pattern and have SSN values. This enables searches for documents that have the ssn Pattern as well as searches for specific Social Security Numbers. Documents processed prior to 4.3.11.0 that have a large number of Social Security Numbers could be similar enough to be formed into their own Cluster.
For more information on all of the System Patterns, see System Patterns.
Note: Enabling Patterns may increase the time it takes to import files and the amount of space required to store the files. You enable System and Custom Patterns individually, so the more Patterns you enable, the greater the increase in processing time and storage requirements.
Custom Patterns
A user with the appropriate permissions normally creates Custom Patterns to perform pattern matches that are not covered by any of the System Patterns. To go into effect for the next parsing of data, a Pattern must be enabled.
Notes:
- Only a few System Patterns are enabled for you initially. Many System and Custom Patterns are disabled by default. They must be enabled, and once enabled they apply to all Data Sets in the Project.
- You cannot edit or delete the predefined System Patterns. You can copy them as the basis for a Custom Pattern, however.
- System and Custom Patterns must be configured and enabled before adding documents because that is when the parsing operation takes place.
- For information on searching for values associated with Patterns, see Use the Standard Search Syntax for Basic Queries.
See also: