Manage Project Index Settings

Home > selected Project > menu or right-click > Settings > Index Settings
Project > Settings drop-down > Project Settings > Index Settings

Note: Digital Reef now restricts import and reprocessing of data to Projects using Parsing Library V2. All new or migrated Digital Reef Projects use Parsing Library V2, which is identified in the Project Index Settings. Legacy Parsing Library V1 Projects will need migration to V2. When writing queries that include file types in new or migrated Projects, remember to use the Parsing Library V2 file type name.

Users in a role with the appropriate permissions can view and/or manage settings related to indexing and parsing. These settings can also be edited when you import a particular Data Set.

Note: As of 4.3.11.0, the current Project Index Settings (and Patterns) generally apply to initial import or reprocessing. The exceptions are the Custodian Directory Location, Media ID Location, and Stop Words settings, which are not subject to change upon reprocessing.

If you have the appropriate permissions, you can also manage the Index Settings in a template at the Organization level.

Parsing Settings

As of 5.2.5.x, the top right portion of the Index Settings screen will identify the Parsing Library Version in effect in your Project, as follows:

Parsing Library Version: V2 — The Parsing Library version that applies to all Projects created as of 5.2.5.x or migrated to V2.
Parsing Library Version: V1 — The legacy Parsing Library version that applies to Projects created prior to 5.2.5.x that have not yet been migrated to V2.

Note: All new Projects will use Parsing Library V2 by default.

You can configure a number of parsing settings for the Project:

Parse Currency
Parse Numeric Quantities
Parse Numeric Terms
Detect Languages
Detect Viruses
Extract Embedded Images wider than 200 px OR taller than 200 px
Prioritize MAPI Fields over Transport Header Metadata
Split Bloomberg Chat
Split Journaled Emails
Split Individual RSMF Messages
Document Processing Timeout (in minutes, 5 by default)
Archive Processing Timeout (in minutes, 600 by default)

Parse Currency

This setting is selected by default to control the parsing of terms representing currency for any data that is Indexed in the Project.

A currency value within a document must be unambiguous. To be recognized as a currency, a string must consist of one of these currency symbols ($, €, £, or ¥) immediately followed by a numerical quantity.
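The recognition rule above can be illustrated with a minimal Python sketch. The regular expression and the name looks_like_currency are assumptions made for illustration only; they are not the pattern Digital Reef uses internally, and the accepted number shapes (comma grouping, optional decimals) are guesses.

```python
import re

# Assumed pattern: one of the four documented symbols, immediately followed by
# digits with optional comma grouping and an optional decimal part.
CURRENCY_RE = re.compile(r"[$€£¥]\d+(?:,\d{3})*(?:\.\d+)?")


def looks_like_currency(token: str) -> bool:
    """Return True if the whole token is a currency symbol directly followed by a number."""
    return CURRENCY_RE.fullmatch(token) is not None
```

Under these assumptions, "$100" and "€1,250.50" are accepted, while "$ 100" (space after the symbol) and "100$" (symbol after the number) are not.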

Note: In general, if any of the numeric settings are needed, make sure they are all enabled before data is processed to the selected Index state (or reprocessed), which enables the software to store values representing numeric content and allow users to search for numeric content such as numeric currency. See System Expressions and Numeric Settings for more information about the numeric settings. Disabling these settings means that numerics are not part of the Index. If you import with these settings disabled, you must reprocess or reimport to have the numerics in the Index.

Parse Numeric Quantities

This setting is selected by default to control the parsing of terms representing numeric quantities. The following examples show how the parsing process identifies numeric quantities:

One or more numbers, which can be separated by a comma (,) or a period. Examples:
- 100
- 123,456
- 1.1
A different base or radix, such as 0xA26.
Scientific notation, for example, 6.022×10²³.
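As a rough illustration of the three shapes listed above, the following Python sketch classifies a token as a plain, radix, or scientific quantity. The patterns, the labels, and the function name classify_quantity are hypothetical, and the exponent spellings accepted here (e-notation and ×10^) are assumptions; the actual parser may accept different forms.

```python
import re
from typing import Optional

# Hypothetical patterns for the three documented shapes of numeric quantity.
PLAIN = re.compile(r"\d+(?:[.,]\d+)*")        # 100, 123,456, 1.1
RADIX = re.compile(r"0[xX][0-9A-Fa-f]+")      # 0xA26
SCIENTIFIC = re.compile(r"\d+(?:\.\d+)?(?:[eE][+-]?\d+|×10\^[+-]?\d+)")


def classify_quantity(token: str) -> Optional[str]:
    """Return the matching shape label, or None if the token is not a quantity."""
    for label, pattern in (("plain", PLAIN), ("radix", RADIX), ("scientific", SCIENTIFIC)):
        if pattern.fullmatch(token):
            return label
    return None
```

Tokens that fall through every pattern, such as 75bn, return None here.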

Parse Numeric Terms

This setting is selected by default to control the parsing of numeric terms. A numerical term contains numbers and other characters but does not match the definition of a numerical quantity or numerical currency. Examples include part numbers, serial numbers, phone numbers, chemical compounds, and percentages. Note that the hyphen character (-) is allowed in a numerical term.

Unrecognized numerical quantities may be recognized as numerical terms instead. Examples:

1-1
75bn
100% (results not currently highlighted)

In addition, one or more numbers followed immediately by a single occurrence of m, b, or k are accepted as a numeric term. Examples:

135m
1.212k
1,345b
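The m, b, or k suffix rule can be sketched as a single pattern. This models only the suffix rule; a token such as 75bn fails this check but can still qualify as a numeric term under the general definition. The pattern and the name has_magnitude_suffix are illustrative assumptions, including the assumption that the suffix is lowercase as shown in the examples.

```python
import re

# Assumed pattern: digits (optionally grouped with commas or periods)
# followed by exactly one of m, b, or k.
SUFFIXED = re.compile(r"\d+(?:[.,]\d+)*[mbk]")


def has_magnitude_suffix(token: str) -> bool:
    """Return True if the token is a number followed by a single m, b, or k."""
    return SUFFIXED.fullmatch(token) is not None
```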

Detect Languages

This setting is selected by default to have the software automatically detect supported languages when data is imported and indexed or reprocessed. As long as it is selected prior to import or reprocessing of data, you can see the list of detected languages and the dominant languages. Language codes are used to identify the supported languages, as listed in Supported Languages for Automatic Language Detection (for example, en for English, es for Spanish, fr for French, and ro for Romanian).

Detect Viruses

Select this setting to have the software detect viruses upon initial import or during reprocessing of a document. By default, this setting is cleared, which means that viruses are not identified at import (or during document reprocessing).

When this option is enabled, you can view information about a detected virus (the name of the virus) for a document in the virus metadata field. Note that if a virus is detected for an attachment, both the attachment and its parent report the virus detected in this field. To find all documents with viruses, you can search using virus::<exists>.

Extract Embedded Images wider than 200 px OR taller than 200 px (in Parsing Library V2 projects, values appear upon checkbox selection)

Select this setting for import or reprocessing to have the software extract embedded images from MSGs, EMLs, or eDocs and have them appear as separate documents in the index. By default, this setting is cleared, which means that embedded images (for example, a logo in gif format) are not extracted from MSGs, EMLs, or eDocs at import or reprocessing.

When you select this option in a Parsing Library V2 Project, you will see the minimum size requirement (by default, wider than 200 pixels or taller than 200 pixels). If you specify your own values instead of using the default values, the minimum is based on whichever value is larger (width or height). If you enable this setting and supply values, and then disable the setting, your values will be preserved for your convenience (but shown grayed out). If you clear out all values when you have the setting enabled and try to save your changes, a message indicates that the field cannot be blank and the default values of 200 are restored.

The embeddedchildren metadata field identifies embedded images with a value of image.
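The default size rule in the setting label (wider than 200 px OR taller than 200 px) amounts to a simple threshold test, sketched below in Python. The function name and parameters are hypothetical, and the sketch models only the OR rule from the label; it does not attempt to model how custom width and height values are combined.

```python
def should_extract_image(width_px: int, height_px: int,
                         min_width: int = 200, min_height: int = 200) -> bool:
    """Return True if the image exceeds the minimum width OR the minimum height."""
    return width_px > min_width or height_px > min_height
```

With the defaults, a 16 x 16 icon or a 200 x 200 image is skipped, while a 201 x 50 banner qualifies.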

Prioritize MAPI Fields over Transport Header Metadata

For MSGs at initial import or reprocessing, this setting affects how the software uses field information from the Message Transport Header fields versus the individual MAPI fields:

When this setting is cleared (the default), the software uses the values from all available Message Transport Header metadata fields over values from available MAPI fields. Where a Message Transport Header metadata field is not populated, the MAPI field value will be used, ensuring more complete metadata population.
When this setting is selected, the software uses the values from all available MAPI fields over values from available Message Transport Header fields. Where a MAPI field is not populated, the Message Transport Header value will be used, ensuring more complete metadata population.
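The two behaviors above differ only in which source is consulted first; the other source fills in unpopulated fields. A minimal Python sketch of that precedence follows, assuming each source is represented as a plain dictionary of field names to values. The function name resolve_metadata and the dictionary representation are illustrative assumptions, not the product's internal model.

```python
from typing import Dict, Mapping, Optional


def resolve_metadata(transport: Mapping[str, Optional[str]],
                     mapi: Mapping[str, Optional[str]],
                     prioritize_mapi: bool = False) -> Dict[str, str]:
    """Merge two metadata sources, preferring one and falling back to the other."""
    # Cleared (default): Transport Header values win. Selected: MAPI values win.
    primary, fallback = (mapi, transport) if prioritize_mapi else (transport, mapi)
    resolved: Dict[str, str] = {}
    for field in set(primary) | set(fallback):
        # An empty or missing primary value falls back to the other source.
        value = primary.get(field) or fallback.get(field)
        if value:
            resolved[field] = value
    return resolved
```

Comparing the output for prioritize_mapi=False and prioritize_mapi=True on the same message shows which fields would change if the setting were selected.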

Note: If you decide to select this setting for initial import or reprocessing, you might want to ensure that the results are as expected before you populate Project Data. (You may want to compare the results of an import using the default setting of Cleared to the results of an import using the setting of Selected.)

This setting affects the calculation of the dupe_fingerprint metadata field value. If you enable this option, you may see additional email metadata and duplicates detected.

This setting is available for Project-, Organization-, and System-level Index Settings, and as a Data Set option when importing a new Data Set. It is also available for reprocessing.

Split Bloomberg Chat

Select this setting to split up Bloomberg chats for import or reprocessing. Once enabled, this option splits a given chat at midnight Coordinated Universal Time (UTC): the chat for a given day ends at 11:59:59 PM, and a new chat begins at 12:00:00 AM. Each split chat will look like an email message, and you can view the entire chat as an Email Thread, where the split chats in the Thread appear in chronological order. By default, this setting is cleared, which means that Bloomberg chats are kept whole.
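Splitting at midnight UTC is equivalent to grouping messages by their UTC calendar day. The Python sketch below illustrates the idea, assuming a chat is a list of (timestamp, text) pairs with timezone-aware timestamps; the function name and data shape are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Dict, Iterable, List, Tuple


def split_chat_by_utc_day(messages: Iterable[Tuple[datetime, str]]) -> Dict[date, List[str]]:
    """Group (timestamp, text) pairs by the UTC calendar day of the timestamp.

    Timestamps are assumed to be timezone-aware.
    """
    parts: Dict[date, List[str]] = defaultdict(list)
    # Sort first so that each day's messages stay in chronological order.
    for sent_at, text in sorted(messages, key=lambda m: m[0]):
        parts[sent_at.astimezone(timezone.utc).date()].append(text)
    return dict(parts)
```

A message sent at 11:59:59 PM UTC and one sent a second later land in different parts.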

Split Journaled Emails

Select this setting to extract the attachments from parent Microsoft Exchange journaled emails. (Microsoft Exchange Journal Archiving archives emails by attaching them to new respective wrapper emails.) By default, this setting is cleared, which means that the software identifies the parent journaled emails (based on the header values in those emails) and reports the appropriate journal type, but does not extract the attachments from the parents. When this setting is selected, the items are treated as follows:

The parent wrapper will be a standalone item.
The first-level child (the real email) will be a standalone parent journaled email.
Any extracted children of the first-level child will be children of that parent email (not the wrapper).

Note: Any reprocessing must be done with children when this option is enabled.

After initial processing, or reprocessing with children (when this option is enabled), you can view information about the processed items in the following metadata fields:

journalemailparent — The original parent handle value for any top-level emails extracted from the wrapper when the Split Journaled Emails option is enabled.
journalemailhandle — The handle value of the journal wrapper email for all members in the original email family of one of these wrappers.
journalemailtype — The type of journal email (always reported for any parent journaled email, regardless of the Split Journaled Emails setting):
- JournalEmailWrapper_Standalone (wrapper with no attachments)
- JournalEmailWrapper_SingleEmail (wrapper with one child email)
- JournalEmailWrapper_MultipleEmails (wrapper with multiple child emails)
- JournalEmailChild (reported for all extracted children of the parent email when the Split Journaled Emails option is enabled).
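The three wrapper values of journalemailtype follow from the number of child emails in the wrapper. A small Python sketch of that mapping is shown here; the function name and the use of a child email count as input are assumptions for illustration, and the JournalEmailChild value (which applies to extracted children rather than wrappers) is not modeled.

```python
def journal_wrapper_type(child_email_count: int) -> str:
    """Map a wrapper's child email count to the documented journalemailtype value."""
    if child_email_count < 0:
        raise ValueError("child_email_count cannot be negative")
    if child_email_count == 0:
        return "JournalEmailWrapper_Standalone"
    if child_email_count == 1:
        return "JournalEmailWrapper_SingleEmail"
    return "JournalEmailWrapper_MultipleEmails"
```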

Note: The deduplication scope is affected by this setting. When cleared, the deduplication will only take into account the email wrappers. When selected, the deduplication takes into account both the email wrappers and the original emails.

Split Individual RSMF Messages

Select this setting to split each Relativity Short Message Format (RSMF) file into an RSMF file parent and a series of individual child EML files that were embedded in it. By default, this setting is cleared, which means that Digital Reef identifies the RSMF file and (based on the JSON metadata embedded in the file) associates it with metadata describing its contents, but does not extract the messages from the file as separate items. (Although metadata is extracted from embedded JSON and ZIP files, these files are not themselves extracted from the parent RSMF file, regardless of the Split Individual RSMF Messages setting.)

Note: Deduplication scope is affected by this setting. When cleared, deduplication takes into account only the RSMF files themselves; when selected, deduplication takes into account both the RSMF files and the child EML files.

Document Processing Timeout

Document Processing Timeout:<value> — This document timeout value applies to initial import. When a given document (for example, a loose document) reaches the limit, processing of the document stops at that point. You can use the default document timeout value of 5 minutes, or you can specify a timeout value in the range 3 to 180 minutes (3 hours), inclusive.

Archive Processing Timeout

Use this setting if you need to adjust the timeout value (in minutes) used for the processing of archives, including mail archives such as PST, NSF, MBOX, and file archives such as ZIP, TAR, and RAR. (This timeout does not apply to Bloomberg archive processing.) The default processing timeout value is 600 minutes, which should help accommodate archives that take considerable time to process.

You can use this default or specify your own archive processing timeout value in the range 10 to 1200 minutes (20 hours). This setting will not use a value less than 10 or greater than 1200. Specifying a value less than 10 will display a popup message indicating that the value used will be 10. Likewise, specifying a value greater than 1200 will display a popup message indicating that the value used will be 1200.

To illustrate how this timeout value works, consider an NSF that is subject to the default timeout value of 600 minutes (10 hours). If the processing time of an email in the NSF archive reaches that limit of 600 minutes, then processing of the entire NSF will stop at that point. A higher timeout value may be warranted to help avoid timeouts for certain archives and enable processing to complete. Although the import may take longer with a higher value, it will have more opportunity to complete.
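The documented ranges for the two timeouts can be summarized in a short Python sketch. The archive timeout is clamped to its range, as described above; for the document timeout, the text gives only the inclusive range of 3 to 180 minutes, so the sketch merely checks membership. The function and constant names are hypothetical.

```python
DOCUMENT_TIMEOUT_RANGE = (3, 180)    # minutes, inclusive; default is 5
ARCHIVE_TIMEOUT_RANGE = (10, 1200)   # minutes, inclusive; default is 600


def is_valid_document_timeout(minutes: int) -> bool:
    """Return True if the value lies within the documented document timeout range."""
    low, high = DOCUMENT_TIMEOUT_RANGE
    return low <= minutes <= high


def effective_archive_timeout(minutes: int) -> int:
    """Clamp the requested archive timeout to the documented 10 to 1200 range."""
    low, high = ARCHIVE_TIMEOUT_RANGE
    return max(low, min(high, minutes))
```

For example, a requested archive timeout of 5 yields 10, and a requested value of 5000 yields 1200.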

Note: The current Archive Processing Timeout value in effect in the Project Index Settings will initially populate the Archive Processing Timeout value in the Reprocess dialog (for reprocessing documents with children). Users can then use this value for reprocessing documents with children, or override the value with another value in the acceptable range.

Advanced Analytics Operations

Include Stop Words - When data is imported and added to Project Data, this Stop Words setting dictates whether Stop Words are ignored or included in Advanced Analytics operations such as Clustering, Word List generation, and comparisons that assess Document Similarity. Stop Words are always included for Indexed operations such as Freeform Searches, but Stop Words are ignored by default for Clustering, Word List generation, and similarity comparisons. Users with permissions may decide to enable this option before data import to have Stop Words treated as valid terms for such operations. Stop Words include words that function as articles, pronouns, and prepositions, plus other selected words. See Stop Words for a complete list of the Stop Words. This setting is not currently subject to change upon reprocessing.

Note: Data that is prepared with an Analytic Index also observes a minimum term length setting of 3 (to avoid common short words, one- and two-letter words), as well as a maximum term length setting of 32 characters. The Stop Words list happens to include many of these short words. However, only words appearing on the actual Stop Words list are affected by the Include Stop Words setting.
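The interaction between the term length limits and the Include Stop Words setting can be sketched as a two-step filter. In the Python sketch below, the three-word stop list is a stand-in for illustration (the real list is in the Stop Words topic), and the function name analytic_terms is hypothetical.

```python
from typing import FrozenSet, Iterable, List

MIN_TERM_LENGTH = 3
MAX_TERM_LENGTH = 32
SAMPLE_STOP_WORDS = frozenset({"the", "and", "with"})  # hypothetical stand-in list


def analytic_terms(terms: Iterable[str],
                   include_stop_words: bool = False,
                   stop_words: FrozenSet[str] = SAMPLE_STOP_WORDS) -> List[str]:
    """Keep terms that pass the length limits and, unless included, are not Stop Words."""
    kept = []
    for term in terms:
        # The length limits apply regardless of the Include Stop Words setting.
        if not MIN_TERM_LENGTH <= len(term) <= MAX_TERM_LENGTH:
            continue
        if not include_stop_words and term.lower() in stop_words:
            continue
        kept.append(term)
    return kept
```

Note that a two-letter word such as "of" is dropped by the length limit even when include_stop_words is True, which mirrors the distinction drawn in the note above.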

Stop Words Configuration Notes:

Stop Words are ignored by default in Document Similarity operations and Clustering, if applied, and do not appear in a calculated Word List. Including Stop Words will change the default behavior (that is, have Stop Words treated as valid terms in such operations).
For such operations, the Stop Words setting in effect when the data was Indexed affects how Stop Words are treated.
For query-based Search operations, Stop Words are valid searchable terms and the Stop Words setting has no effect.

Copy to Document Storage

You have the option of automatically performing a Copy to Document Storage operation on an imported Data Set with or without exclusions. Copy to Document Storage copies the Data Set's source files from the import location to the Data Set's folder on the available document storage location along with the extracted documents, enabling you to reduce storage usage at the import location by removing the source files. This check-box is cleared by default; select it to add automatic Copy to Document Storage.

Enable Automatic Copy to Document Storage

Note: Copy to Document Storage does not apply to a Load File import.

Automatic Copy to Document Storage can be used with a standard import of a Data Set from a Data Area or with a shared Data Set. Bear in mind that performing this copy as part of import increases the time required. You can also perform a Copy to Document Storage operation after import, using the equivalent right-click option for a given Data Set under Imports or the equivalent toolbar option when viewing the Imports Summary with a list of all Data Sets.

When you copy a Data Set's source files to Document Storage, a subsequent Export of the Project also exports the source files from Document Storage; deleting the Project also deletes the source files from Document Storage. Note that this operation by default uses all available processing Cores when accommodating a large data set that requires a distribution of work over multiple Analytic Engine resources. In this case, during the copy operation, the software compresses all source data (that is, it utilizes containers for appropriate file types). If you have the appropriate permissions, you can control the use of all available Cores for the Job.

Note that a Warning icon appears in the Work Basket if a Copy to Document Storage task completed with exceptions as a partial copy (that is, not everything was copied). The Warning status applies to any file that was excluded or failed to copy. In this case, the downloadable WARNING_DETAILS_REPORT.csv identifies the reason why the file was not copied. (See How to Perform a Copy to Document Storage for more information about the exclusions or errors that apply to the copy operation.)

Note: The Copy to Document Storage operation has the ability to preserve the staging used by certain file types, such as Forensic Image file types, multi-part RAR files, and Bloomberg Dump files, as long as their associated document classes are not excluded. When a Copy to Document Storage operation successfully copies everything in the Data Set, the Connector or Data Area information for the Data Set is no longer displayed in the Imports Summary, since the Data Area import location is no longer associated with that Data Set.

Use the following checkbox settings to control the exclusion of some document classes or NIST eDocs during a Copy to Document Storage operation. By default, the operation excludes these document classes as well as NIST eDocs (for example, if you use the settings from a new Index template in an existing Organization, or the default Index settings in a new Organization).

Exclude Document Class: Disk Image

By default, a Copy to Document Storage operation excludes files with a document class of Disk Image (for example, a Logical Evidence File). Clear this checkbox to include documents with a document class of Disk Image.

Exclude Document Class: Archive

By default, a Copy to Document Storage operation excludes files with a document class of Archive (for example, a RAR, TAR, or ZIP/ZIPX). Clear this checkbox to include documents with a document class of Archive.

Exclude Document Class: Message Archive

By default, a Copy to Document Storage operation excludes files with a document class of Message Archive (for example, a mail container such as a PST or NSF). Clear this checkbox to include documents with a document class of Message Archive.

Exclude NIST eDoc

By default, a Copy to Document Storage operation excludes NIST eDoc files. Clear this checkbox to include NIST eDoc files.

Other Settings

A user with the appropriate permissions can configure other eDiscovery settings.

Specifying a Custodian Directory allows auto-discovery of Custodians. This means that once data becomes part of Project Data, you will see the Custodians automatically added to the Project.

Note: To change any of these settings, click the setting, type in the box, and press Enter. When you are done changing all settings on the screen, click Save.

Custodian Directory [value] level(s) down from Data Area

Use this field to assign a numeric value that reflects the position of the Custodian Directory at a data area. The default is 0. A value of 1 indicates that the Custodian Directory is in the first directory position. Initially, the value you specify reflects the actual staging used. If you change the value, it dictates the Custodian Directory position for future data added to the Project. The mapping affects the file manifest upon export. Specifying a value enables you to auto-discover Custodians to add to the Project. You can specify only numeric values for this field. This setting is not subject to change upon reprocessing.

Media Directory [value] level(s) down from Data Area

Use this field to assign a numeric value that reflects the position of the Media (Type) Directory at the import location. The default is 0. A value of 2 indicates that the Media Directory is in the second directory position. Initially, the value you specify reflects the actual staging used for the data. If you change the value, it dictates the Media Directory position for future data added to the Project. The mapping affects the file manifest upon export. You can specify only numeric values for this field. This setting is not subject to change upon reprocessing.
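The "level(s) down from Data Area" value in the two fields above can be read as an index into the directory components between the Data Area and the file. The Python sketch below illustrates that reading with hypothetical paths; treating 0 as "no directory designated" is an assumption, as are the function name and the POSIX-style paths.

```python
from pathlib import PurePosixPath
from typing import Optional


def directory_at_level(data_area: str, file_path: str, level: int) -> Optional[str]:
    """Return the directory name `level` positions below the data area, if any."""
    if level <= 0:
        return None  # assumption: 0 means no directory position is designated
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(data_area))
    directories = relative.parts[:-1]  # drop the file name itself
    return directories[level - 1] if level <= len(directories) else None
```

For a file staged at /data/area1/Smith/Laptop/mail.pst under the Data Area /data/area1, level 1 yields Smith (a Custodian Directory in the first position) and level 2 yields Laptop (a Media Directory in the second position).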

Matter Name

Use this field to assign an appropriate name used to identify the eDiscovery Matter, which appears in the file manifest upon export (by default, in the ProjectMatterName field). Initially, the Project Name assigned during Project creation is used as the Matter Name.

Matter Number

Use this field to assign an ID (numeric value) used to identify the eDiscovery Matter, which appears in the file manifest upon export (by default, in the Matter field). This field is initially blank.

Note: Matter Name and Matter Number are useful at the Project level; they are not intended to apply to the Organization template level.

Excluded File Extensions for Extraction

For initial import or reprocessing using the current Index Settings, use this section to specify the common file extensions of any container files that you want to explicitly exclude from file extraction. When you enter one or more file extensions in this section and save your changes, the software will not extract files from the container files with the specified file extension (their docext value or their origdocext value, if populated). This section is intended to help you prevent the extraction of files from container files that are not deemed useful, such as jar files. By default, this section is empty and there are no container file extensions excluded for extraction.

To fill out this section, enter one file extension per line, without the leading periods (for example, specify jar, not .jar). Each entry must be unique or you will see an error message saying that the named extension already exists. (The uniqueness check is not case sensitive, so JAR would be the same as jar.) Note that the entry for a file extension supports a maximum of 100 characters. A scroll bar will appear to accommodate a longer list of items.

If you want to delete an entry, click the delete icon for that entry.

The list will be sorted in alphabetical order when you click Save.
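The entry rules described above (no leading period, a 100-character maximum, case-insensitive uniqueness, and alphabetical sorting on save) can be sketched as a small validation helper. This Python sketch is illustrative only: the function name is hypothetical, and rejecting a leading period with an error is an assumption, since the text only says to enter extensions without it.

```python
from typing import List

MAX_EXTENSION_LENGTH = 100


def add_extension(existing: List[str], entry: str) -> List[str]:
    """Validate a new extension and return the updated, alphabetically sorted list."""
    ext = entry.strip()
    if not ext:
        raise ValueError("extension cannot be blank")
    if ext.startswith("."):
        raise ValueError("enter the extension without the leading period")
    if len(ext) > MAX_EXTENSION_LENGTH:
        raise ValueError("extension exceeds 100 characters")
    # The uniqueness check ignores case, so JAR collides with jar.
    if ext.lower() in {e.lower() for e in existing}:
        raise ValueError(f"{ext} already exists")
    return sorted(existing + [ext], key=str.lower)
```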

Files that are excluded from extraction during processing will have the 01000 SKIPPED_FILE parsingstatus.

Note: Do not reprocess files with the 01000 SKIPPED_FILE parsingstatus; reprocessing them has no effect.

Automatic OCR Settings

You can specify language, accuracy, and timeout settings that apply to automatic OCR processing.

Enable Automatic OCR Processing

Use this setting to control whether OCR processing is performed automatically as part of the import process. By default, this setting is cleared. If you select this setting (which affects all subsequent imports associated with added Data Sets), OCR processing is performed using the current automatic OCR Settings and queries defined under the OCR Settings. The default OCR Settings calculate no-content PDFs, no-content Microsoft documents, TIFF files, OCR Failures, and low-content PDFs.

Note: If you have Automatic OCR enabled and perform reprocessing, the reprocessing will perform OCR processing of all documents that meet the OCR Candidate queries, regardless of whether they have been previously subject to OCR processing.

Language Selection

Use the default of General, which accommodates all detectable languages, including all Latin languages and Chinese, Japanese, and Korean (CJK) languages, or specify a language that requires explicit enabling for OCR processing:

Arabic
Cyrillic (includes Bulgarian, Byelorussian, Chechen, Kabardian, Macedonian, Moldavian, Serbian (non-Latin), Russian, and Ukrainian)
English (for English-only OCR)
Greek
Hebrew
Thai

Accuracy

This setting enables you to select the level of accuracy for automatic OCR processing, either Medium (the default) or High.

Page Timeout

This setting controls the OCR page timeout. The default is 120 seconds. You can edit this value as long as you maintain a non-zero value. Negative values are not valid.

Note: Instead of enabling automatic OCR processing, which performs OCR automatically for each import, consider performing OCR processing after import, after you have had a chance to review the calculated OCR Candidates in the report for your imported Data Set. From the OCR Candidates chart in the report, you can drill through an entry to view the results. You can then use the OCR option available from the document results to perform OCR processing of the selected results. You can perform OCR processing regardless of whether you have already populated Project Data. Remember that if you have damaged, encrypted, or protected documents that you want to reprocess, you can use the Reprocess option from results. (Reprocessing is not generally associated with OCR.) See How to Perform OCR Processing for information about the steps and supported languages for OCR processing.

OCR Queries

A user with the appropriate permissions can take advantage of the default OCR queries, which calculate the different types of OCR Candidates as part of the import process, or add custom queries to identify OCR Candidates. The default queries can be removed, if necessary, or just supplemented by other queries that you add.

The OCR Candidates calculated by the applied queries will appear on the OCR section of the Scan Report for a Data Set under Imports. For all OCR Candidates, or a given type of Candidate, drill-through is supported. From the Drill-through Search Result, click OCR to perform OCR processing of selected or all documents in the list. (You can also perform OCR processing of Candidates for a Custodian.)

Note: Queries to be loaded must not have names that collide with existing views in the Project. When naming your queries, make sure that the names do not match the names of existing views in the Project.

The default queries to detect OCR Candidates are listed in the following sections.

Note: When writing queries that include file types, you must use the correct file type based on the Parsing Library Version in effect in your Project. New and migrated Projects will be V2. Projects created prior to Release 5.2.5.x will be V1 until migrated. See Supported File Types for Analysis for a list of file types based on the Parsing Library Version. In a new Organization as of 5.2.5.x, the default OCR queries will have the V2 file types. In an existing Organization, the default OCR queries in existing Projects and templates will continue to have the V1 file types prior to migration, but all new Projects and templates will have the V2 file types.

Low Content PDF (< 5 terms/page)

This query calculates the number of PDF files with a low amount of content (less than 5 terms per page) that are eligible for OCR processing.

For Parsing Library Version V2:

filetype::"Adobe PDF" AND NOT (parsingstatus::NODATA OR parsingstatus::UNSUPPORTED_PDF_FORM) AND (averagenumberoftermsperpage::[00000 ~~ 00010] OR charcountlongestword::[00000 ~~ 00003]) AND NOT ocrstatus::<exists>

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND NOT (parsingstatus::NODATA OR parsingstatus::UNSUPPORTED_PDF_FORM) AND (averagenumberoftermsperpage::[00000 ~~ 00010] OR charcountlongestword::[00000 ~~ 00003]) AND NOT ocrstatus::<exists>

No Content Microsoft Docs

This query calculates the number of no-content Microsoft documents that are eligible for OCR processing.

For Parsing Library Version V2:

(filetype::"Microsoft Excel" OR filetype::"Microsoft Word for Windows" OR filetype::"Microsoft PowerPoint" OR filetype::"Microsoft Word for Mac") AND ((averagenumberoftermsperpage::00000) OR (parsingstatus::00005)) AND (NOT ocrstatus::<exists>)

For legacy Parsing Library Version V1:

filetype::Microsoft AND ((averagenumberoftermsperpage::00000) OR (parsingstatus::00005)) AND (NOT ocrstatus::<exists>)

No Content PDF

This query calculates the number of no-content PDFs that are eligible for OCR processing:

For Parsing Library Version V2:

filetype::"Adobe PDF" AND parsingstatus::NODATA AND NOT ocrstatus::<exists>

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND parsingstatus::NODATA AND NOT ocrstatus::<exists>
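Because the V1 and V2 queries differ only in the file type name, a script that generates query text for several Projects could select the name by Parsing Library Version. The Python sketch below rebuilds the No Content PDF query this way; the lookup table and the function name are hypothetical helpers and are not part of Digital Reef.

```python
# File type names for PDFs, as they appear in the default queries above.
PDF_FILETYPE = {"V2": "Adobe PDF", "V1": "Adobe Acrobat (PDF)"}


def no_content_pdf_query(parsing_library_version: str) -> str:
    """Build the No Content PDF query text for the given Parsing Library Version."""
    filetype = PDF_FILETYPE[parsing_library_version]  # raises KeyError if unknown
    return f'filetype::"{filetype}" AND parsingstatus::NODATA AND NOT ocrstatus::<exists>'
```

The generated text would then be pasted into the Search Query field for the matching Project.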

OCR Failure

This query calculates the number of files (image files) that were subject to an unexpected OCR failure during processing. If the OCR Candidates report shows a non-zero value for OCR Failure, you can try to perform OCR on the affected files again to attain parsingstatus::SUCCESS.

parsingstatus::01200

Tiff

This query calculates the number of TIFF files that are eligible for OCR processing:

filetype::"Tagged Image File Format" AND NOT ocrstatus::<exists>

Unknown Language PDF

This query calculates the number of unknown language PDFs that are eligible for OCR processing (that is, the number of PDFs for which the language is unknown).

Note: This query became available as of 4.3.10.0 for new Organizations and their Projects using the default Organization Index Settings template (or a newly created System Index template).

For Parsing Library Version V2:

filetype::"Adobe PDF" AND language::unknown AND NOT ocrstatus::<exists> AND NOT parsingstatus::(ENCRYPTED OR PROTECTED)

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND language::unknown AND NOT ocrstatus::<exists>

To add your own OCR Candidate query, enter a name for the query for Search Name and enter the query for Search Query.

Note: The maximum length for an OCR query is 1024 characters.
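Before entering a query, it can help to check it against the two constraints stated in this section: the 1024-character maximum and the requirement that query names not collide with existing views. The Python sketch below performs both checks; the function name is hypothetical, and comparing names by exact match is an assumption.

```python
from typing import Iterable, List

MAX_QUERY_LENGTH = 1024


def check_ocr_query(name: str, query: str, existing_view_names: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if len(query) > MAX_QUERY_LENGTH:
        problems.append(f"query is {len(query)} characters; the maximum is {MAX_QUERY_LENGTH}")
    if name in set(existing_view_names):
        problems.append(f"name '{name}' collides with an existing view")
    return problems
```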

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a query, click the delete icon for that query.

You can add an OCR query after import, have the Data Set Report updated to reflect calculation of the new candidate documents, and perform OCR of those candidate documents.

Additional Example: Other Images

If you want to process secondary image files, you can add more OCR queries. For example, you can add queries for the additional image types:

For Parsing Library Version V2:

(filetype::"JPEG Image" OR filetype::"JPEG2000 Image" OR filetype::"GIF Image" OR filetype::"Portable Network Graphics Image" OR filetype::"Microsoft Bitmap" OR filetype::"X-Windows xbitmap") AND (NOT ocrstatus::<exists>)

For legacy Parsing Library Version V1:

(filetype::"JPEG File Interchange" OR filetype::"Compuserve GIF" OR filetype::"Portable Network Graphics Format" OR filetype::"Windows Bitmap" OR filetype::"X-Windows Bitmap" OR filetype::"Progressive JPEG") AND (NOT ocrstatus::<exists>)
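If you maintain several image-type queries, it can help to assemble them outside the product before pasting them into Search Query. The following Python sketch is only an illustration: the function name build_ocr_candidate_query and the constant MAX_QUERY_LENGTH are hypothetical, and the only facts it relies on are the query shape shown above and the 1024-character limit noted in this section.

```python
# Illustrative helper, not part of Digital Reef.
MAX_QUERY_LENGTH = 1024  # limit stated in this section for an OCR query


def build_ocr_candidate_query(file_types):
    """Join file type names into an OCR Candidate query string.

    file_types must use the names for the Parsing Library Version
    in effect in the Project (V1 and V2 names differ).
    """
    if not file_types:
        raise ValueError("at least one file type is required")
    clauses = " OR ".join(f'filetype::"{name}"' for name in file_types)
    query = f"({clauses}) AND (NOT ocrstatus::<exists>)"
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}"
        )
    return query


# Two of the V2 file type names from the example above.
print(build_ocr_candidate_query(["JPEG Image", "GIF Image"]))
```

The length check runs before you paste the query, so an overly long list of file types is caught early and can be split into two named queries.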

Note: When editing an OCR Candidate Search in the Index Settings, be sure to change the name of the query. This ensures proper handling of the query.

Custom Warnings and Errors

A user with the appropriate permissions can add custom queries to identify the documents that match them. There are no default custom queries.

The queries will appear in the Custom Warnings and Errors section of the Scan Report for a Data Set or for all Imports (if the report is enabled for those views). For a given query, drill-through is supported.

To add a custom query, enter a name for the query for Search Name and enter the query for Search Query.

Note: The maximum length for a custom query is 1024 characters.

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a custom query, click the icon.

You can add a custom query after import and update the report to reflect calculation of the documents meeting the query.

Note: When writing queries that include file types, you must use the correct file type based on the Parsing Library Version in effect in your Project. See Supported File Types for Analysis for a list of file types based on the Parsing Library Version.

Sample custom query

The following shows a custom query (which could be called Potential Mail Store Error) that checks for potentially corrupted PST or OST Mail Containers. This query helps pinpoint PST or OST files that could not be processed correctly and therefore could not be identified by their expected filetype and a docclass of Message_Archive:

(docext::PST OR docext::OST) AND docclass::EDoc AND NOT childcount::<exists>
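To see why a healthy mail container does not match this query, the same logic can be written as a predicate over a document's metadata. This is only a sketch: it assumes each document is a Python dictionary keyed by field name, which is not how Digital Reef stores its index, and the function name is hypothetical.

```python
# Illustrative model of the sample query's logic; documents are assumed
# to be plain dictionaries of metadata fields for this sketch only.
def is_potential_mail_store_error(doc):
    has_mail_extension = doc.get("docext") in {"PST", "OST"}
    is_plain_edoc = doc.get("docclass") == "EDoc"
    has_no_children = "childcount" not in doc  # models NOT childcount::<exists>
    return has_mail_extension and is_plain_edoc and has_no_children


# A container that processed correctly is classed as Message_Archive
# and has child documents, so it does not match.
healthy = {"docext": "PST", "docclass": "Message_Archive", "childcount": 1200}

# A container that could not be opened has neither property, so it matches.
suspect = {"docext": "PST", "docclass": "EDoc"}

print(is_potential_mail_store_error(healthy))
print(is_potential_mail_store_error(suspect))
```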

Custom Email Header Field Mapping

Data ingested by Digital Reef may include emails from various email archiving platforms (such as Zantaz, Smarsh, and Global Relay), the email headers of which contain fields that are not currently supported by Digital Reef. This section enables you to choose one of these options for including this information when ingesting such data:

Include all of the fields in each email header in a single, searchable emailheader metadata field.
Map specific email header fields to custom metadata fields at ingest or reprocessing so that the header data can be searched and exported more flexibly and efficiently using those custom fields.

For example, the X-ZANTAZ-RECIP field in Zantaz email headers contains the full list of email recipients, whereas the body of any given email may not. With email data from Zantaz, you can use this section to be sure the header data you need is retained, either by including all of the Zantaz header fields for each message in the emailheader field or by mapping one or more of the Zantaz header fields (X-ZANTAZ-RECIP, X-ZANTAZ-BCC, and so on) to individual custom fields (such as drcustom_X_ZANTAZ_RECIP, drcustom_X_ZANTAZ_BCC, and so on). After ingestion or reprocessing, the emailheader field or custom fields (depending on the option you choose) are available for inclusion in Export Fields templates, allowing you to export the email header data you need.

Include Entire Email Header

When this setting is selected, the entire email header of each email is indexed and stored as a single metadata field called emailheader at ingest or reprocessing. (For MSGs, this field is generated only from the transport header, where it is available.) This makes it possible to search for information that may appear in the email headers and export it by selecting the emailheader metadata field for inclusion in an Export Fields template.

Parse Custom Email Header

When you select this setting (which is available only if Include Entire Email Header is selected), you can enter a list of the email header fields found in the data, either one by one or by pasting in the whole list. For the metadata fields that will hold the corresponding data, you can either generate custom names automatically or enter your own (again, one at a time or by pasting a list). A generated name has the form drcustom_header-field-name, in lowercase only and with hyphens in the header field name replaced by underscores, so the custom field containing X-ZANTAZ-RECIP field values would be drcustom_x_zantaz_recip. The custom field names list is validated so that no duplicates of existing fields are included; when you save the index settings, the names are sorted alphabetically.
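The naming rule for generated fields (the drcustom_ prefix, lowercase only, hyphens replaced by underscores) can be previewed with a short Python sketch. The function names below are hypothetical and the duplicate check is a simplified stand-in for the product's validation, so treat this as a way to predict names before you paste a list, not as the product's implementation.

```python
# Illustrative preview of the generated custom field names.
def generated_field_name(header_field):
    """Apply the stated rule: drcustom_ prefix, lowercase, '-' becomes '_'."""
    return "drcustom_" + header_field.strip().lower().replace("-", "_")


def map_header_fields(header_fields):
    """Return {header field: custom field name}, rejecting duplicate names."""
    mapping = {}
    used_names = set()
    for header_field in header_fields:
        name = generated_field_name(header_field)
        if name in used_names:
            raise ValueError(f"duplicate custom field name: {name}")
        used_names.add(name)
        mapping[header_field] = name
    return mapping


for header, custom in map_header_fields(["X-ZANTAZ-RECIP", "X-ZANTAZ-BCC"]).items():
    print(header, "->", custom)
```

Because the rule lowercases the header name, two header fields that differ only in case would produce the same generated name, which is why the sketch checks for duplicates.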

In some cases the best approach may be to ingest with only Include Entire Email Header selected and then search on the emailheader field to determine which fields in the email headers you want to break out as custom fields. You can then select Parse Custom Email Header and reprocess.
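When reviewing header text to decide which fields to break out, it can be useful to list the distinct field names in a header. The following Python sketch uses the standard library email.parser module and assumes you have the raw header text available outside Digital Reef (for example, copied from an emailheader value); the sample header and its addresses are invented for illustration.

```python
from email.parser import HeaderParser


def header_field_names(raw_header):
    """Return the distinct header field names in order of first appearance."""
    message = HeaderParser().parsestr(raw_header)
    names = []
    for name in message.keys():  # keys() repeats names that occur more than once
        if name not in names:
            names.append(name)
    return names


# Invented sample header for illustration only.
sample_header = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "X-ZANTAZ-RECIP: bob@example.com, carol@example.com\n"
    "\n"
)

names = header_field_names(sample_header)
# Archive-specific fields typically use an X- prefix, so filter on that.
candidates = [name for name in names if name.upper().startswith("X-")]
print(candidates)
```

The resulting list can be pasted into Parse Custom Email Header as the set of header fields to map.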

Index Settings: Save to and Load from Template Options

If you have the appropriate permissions, you can save your Index Settings to a template or load the settings from a template. To do this, click the ellipsis to the right of Index Settings in the tree of eDiscovery Project Settings.

Note: Save to and Load from operations for this setting observe an "overwrite" behavior. For example, for a Load from operation, your current settings are replaced by the settings from the selected template. Some other settings, such as Patterns, Tags, Domain Lists, Alias Lists, and Excluded Content, observe an "append" behavior instead. As of 5.2.5.x, these template operations require a match in parsing library versions (for example, you can only load existing V1 template information into V1 Project Index Settings).

Save to Template - If you have Add/Edit permissions to Index Settings templates in the Organization, you can use the Save to Template option to save your current settings to a selected Organization template. You can either select an existing Organization template (including the Default Index Settings template), or you can select the top-level (New Template), which launches the New Template dialog. As of 5.2.5.x, this operation requires a match in parsing library versions. If you try to save information to a template that does not match the parsing library version of your current Settings, you will see the error "The template operation cannot be performed due to a mismatch in parsing library versions between the project and the selected template". You can only save V1 Project Index Settings into existing V1 template information or save V2 Project Index Settings into V2 template information. All new templates are V2. If you try to save V1 settings to a new template, a new template is created, but it is a V2 template. Therefore, when you click OK for the Save to Template operation, you will see the mismatch error, which indicates that the operation could not be performed as you had intended. You can then close the error popup and close or cancel the Save to Template operation.
Load from Template - If you have Add/Edit permissions to Index Settings in the Project, you can use the Load from Template option to load settings from a selected template (from a list of available Organization templates). The loaded settings then appear and are saved automatically. Loading from a System template requires System-level View permission for a given Setting. (This means you must be a System User in a role with at least View permission to see a list of System templates for a particular type of template.) As of 5.2.5.x, this operation requires a match in parsing library versions. If you try to load template information that does not match the parsing library version of your current Settings, you will see the error "The template operation cannot be performed due to a mismatch in parsing library versions between the project and the selected template". You can only load existing V1 template information into V1 Project Index Settings or load V2 template information into V2 Project Index Settings.
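
The version-match rule and the "overwrite" behavior for Load from Template can be summarized in a short Python sketch. This is a simplified model, not the product's implementation: the dictionary layout, the function name, and the setting names (detect_languages, detect_viruses) are hypothetical, and only the error text and the V1/V2 rule come from this page.

```python
# Illustrative model of Load from Template for Index Settings.
MISMATCH_ERROR = (
    "The template operation cannot be performed due to a mismatch in parsing "
    "library versions between the project and the selected template"
)


def load_from_template(project, template):
    """Return the project's Index Settings after a Load from Template."""
    if project["parsing_library"] != template["parsing_library"]:
        raise ValueError(MISMATCH_ERROR)
    # Overwrite behavior: the current settings are replaced, not merged.
    return {
        "parsing_library": project["parsing_library"],
        "settings": dict(template["settings"]),
    }


v2_project = {"parsing_library": "V2", "settings": {"detect_languages": True}}
v2_template = {"parsing_library": "V2", "settings": {"detect_viruses": True}}
v1_template = {"parsing_library": "V1", "settings": {"detect_viruses": False}}

print(load_from_template(v2_project, v2_template)["settings"])

try:
    load_from_template(v2_project, v1_template)
except ValueError as error:
    print(error)
```

In this model the project's own detect_languages value does not survive the load, which mirrors the overwrite behavior described in the note above; settings that observe an append behavior would need a merge step instead.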