Manage Index Settings in a Template

Home > selected Organization > menu or right-click > Settings > eDiscovery Templates > Index Settings
Project > Settings drop-down > Organization Settings > eDiscovery Templates > Index Settings

Requires Organization - Index Settings - Add/Edit Permissions

Note: Digital Reef now restricts import and reprocessing of data to Projects using Parsing Library V2. All new or migrated Digital Reef Projects use Parsing Library V2, which is identified in the Project Index Settings. Legacy Parsing Library V1 Projects will need migration to V2. When writing queries that include file types in new or migrated Projects, remember to use the Parsing Library V2 file type name.

Users in a role with the appropriate permissions can view and/or manage settings related to indexing and parsing in a template at the Organization level.

Note: As of 4.3.11.0, the current Project Index Settings (and Patterns) generally apply to initial import or reprocessing. The exceptions are the Custodian Directory Location, Media ID Location, and Stop Words settings, which are not subject to change upon reprocessing.

When a template type is selected in the Organization Settings, a user with the appropriate permissions can use the top-level Templates context menu to perform the following action:

Create a template by clicking the (New Template) option, which launches the New Template properties dialog.

For a selected template, a user with the appropriate permissions can click the ellipsis and use the context menu to perform the following actions:

Save to Template – Launches the Save to Template dialog, which enables you to save the current settings to an available template, or to select New Template, which launches the New Template dialog. As of 5.2.5.x, this operation requires matching parsing library versions. If you try to save template information to a template that does not match the parsing library version of your current template, you will see the error "The template operation cannot be performed due to a mismatch in parsing library versions between the project and the selected template". All new templates are Parsing Library Version V2 templates.
Load from Template – Launches the Load from Template dialog, which enables you to load the settings from a System template or Organization template that you select. The loaded settings and fields then appear. As of 5.2.5.x, this operation requires matching parsing library versions. If you try to load template information that does not match the parsing library version of your current template, you will see the error "The template operation cannot be performed due to a mismatch in parsing library versions between the project and the selected template".

Set as Default – Marks the selected template as the default template. This option is not available for the Default template of a given type, or for any other template already set as the default. As of 5.2.5.x, in existing Organizations, you cannot set a Parsing Library V1 template as the default template, for either an Index Settings template or a Billing Report template. Attempting to set a V1 template as the default displays the error "Cannot make OIT template (V1) the default template".

Edit – Launches the Edit Template dialog, which enables you to edit the template name and/or description of the selected template.
Delete – Delete a template, which causes the display of a popup asking you to verify the deletion of the template from the Organization.

Parsing Settings

The top right portion of the Index Settings template will identify the appropriate Parsing Library Version for the template, as follows:

Parsing Library Version: V2 — The Parsing Library version to be used in all Projects created as of 5.2.5.x or migrated to V2.
Parsing Library Version: V1 — The legacy Parsing Library version that can only be used in Projects created prior to 5.2.5.x that have not yet been migrated to V2.

Note: All new and migrated Projects use Parsing Library V2 by default.

You can configure a number of parsing settings:

Parse Currency
Parse Numeric Quantities
Parse Numeric Terms
Detect Languages
Detect Viruses
Include Entire Email Header
Extract Embedded Images
Prioritize MAPI Fields over Transport Header Metadata
Split Bloomberg Chat
Split Journaled Emails
Split Individual RSMF Messages
Document Processing Timeout (in minutes, 5 by default)
Archive Processing Timeout (in minutes, 600 by default)

Parse Currency

This setting is selected by default to control the parsing of terms representing currency for any data that is Indexed in the Project.

A currency value within a document must be unambiguous. To be recognized as a currency, a string must consist of one of these currency symbols ($, €, £, ¥) immediately followed by a numerical quantity.
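As a rough illustration of this rule (not the product's actual parser), the following Python sketch matches one of the four listed symbols immediately followed by a number. The pattern, the handling of thousands separators and decimals, and the function name are all assumptions made for this example.

```python
import re

# Hypothetical pattern: a listed currency symbol immediately followed by
# digits, with optional comma-separated thousands groups and decimals.
CURRENCY_RE = re.compile(r"[$€£¥]\d+(?:,\d{3})*(?:\.\d+)?")

def find_currency(text: str) -> list[str]:
    """Return substrings that look like unambiguous currency values."""
    return CURRENCY_RE.findall(text)

sample = "Paid $1,250.00 and €30, but not $ 40 or 50 USD."
print(find_currency(sample))
```

In this sketch, "$ 40" is skipped because a space separates the symbol from the number, and "50 USD" is skipped because it has no leading symbol.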

Note: In general, if any of the numeric settings are needed, make sure they are all enabled before data is processed to the selected Index state (or reprocessed), which enables the software to store values representing numeric content and allow users to search for numeric content such as numeric currency. See System Expressions and Numeric Settings for more information about the numeric settings. Disabling these settings means that numerics are not part of the Index. If you import with these settings disabled, you must reprocess or reimport to have the numerics in the Index.

Parse Numeric Quantities

This setting is selected by default to control the parsing of terms representing numeric quantities. The following examples show how the parsing process identifies numeric quantities:

One or more numbers, which can be separated by a comma (,) or a period. Examples:
- 100
- 123,456
- 1.1
A different base or radix, such as 0xA26.
Scientific notation, for example, 6.022×10²³.

Parse Numeric Terms

This setting is selected by default to control the parsing of numeric terms. A numerical term contains numbers and other characters but does not match the definition of a numerical quantity or numerical currency, such as part numbers, serial numbers, phone numbers, chemical compounds, and percentages. Note that the hyphen character, '-', is allowed in a numerical term.

Unrecognized numerical quantities may be recognized as numerical terms instead. Examples:

1-1
75bn
100% (results not currently highlighted)

In addition, one or more numbers followed immediately by a single occurrence of m, b, or k are accepted as a numeric term. Examples:

135m
1.212k
1,345b
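
The distinction between numeric quantities and numeric terms can be sketched in Python as follows. This is only an approximation under stated assumptions: the regular expression and the classify function are hypothetical, scientific notation is left out for brevity, and the real parser may apply additional rules.

```python
import re

# Assumed shapes for a numeric quantity: digit groups separated by commas or
# periods (100, 123,456, 1.1) or a hexadecimal literal (0xA26).
QUANTITY_RE = re.compile(r"\d+(?:[.,]\d+)*|0[xX][0-9A-Fa-f]+")

def classify(token: str) -> str:
    """Label a token as 'quantity', 'term', or 'other'."""
    if QUANTITY_RE.fullmatch(token):
        return "quantity"
    # Anything else containing a digit is treated as a numeric term here,
    # which covers 1-1, 75bn, 100%, and the m/b/k suffix forms.
    if any(ch.isdigit() for ch in token):
        return "term"
    return "other"

for token in ["123,456", "0xA26", "135m", "1-1", "100%", "memo"]:
    print(token, classify(token))
```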

Detect Languages

This setting is selected by default to have the software automatically detect supported languages when data is imported and indexed or reprocessed. As long as it is selected prior to import of data, you can see the list of detected languages and the dominant languages. Language codes are used to identify the supported languages, as listed in Supported Languages for Automatic Language Detection (for example, en for English, es for Spanish, fr for French, and ro for Romanian).

Detect Viruses

Select this setting to have the software detect viruses upon initial import (or during reprocessing of a document). By default, this setting is cleared, which means that viruses are not identified at import (or during document reprocessing).

When this option is enabled, you can view information about a detected virus (the name of the virus) for a document in the virus metadata field. Note that if a virus is detected for an attachment, both the attachment and its parent report the virus detected in this field. To find all documents with viruses, you can search using virus::<exists>.

Include Entire Email Header

Select this setting to have the software automatically include the entire Email Header for an email, including custom header fields, during parsing or reprocessing. For MSGs, this field is generated only from the transport header, where it is available. By default, this option is cleared. When it is enabled, the entire email header for an email is indexed and stored as a single metadata field called emailheader. Indexing the entire header information enables a search for information that may appear in custom header fields. For example, some email messages may not contain all recipients of a message, but the header does contain the full recipient list in custom fields. Example: X-ZANTAZ-RECIP: "John Smith" <john.smith@sample.com>, "Jane Sample" <jane.sample@sample.com>. When setting up your Export Fields Template, you can decide whether to include the emailheader metadata field with the complete header content in the load file.

Extract Embedded Images wider than 200 px OR taller than 200 px (values appear upon checkbox selection)

Select this setting for import or reprocessing to have the software extract embedded images from MSGs, EMLs, or eDocs and have them appear as separate documents in the index. When you select this option, you will see the minimum size requirement (by default, wider than 200 pixels or taller than 200 pixels). If you specify your own values instead of using the default values, the minimum is based on whichever value is larger (width or height). If you enable this setting and supply values, and then disable the setting, your values are preserved for your convenience (but shown grayed out). If you clear out all values while the setting is enabled and try to save your changes, a message indicates that the field cannot be blank and the default values of 200 are restored. By default, this setting is cleared, which means that embedded images (for example, a logo in GIF format) are not extracted from MSGs, EMLs, or eDocs at import or reprocessing. The embeddedchildren metadata field identifies embedded images with a value of image.

Prioritize MAPI Fields over Transport Header Metadata

For MSGs at initial import or reprocessing, this setting affects how the software uses field information from the Message Transport Header fields versus the individual MAPI fields:

When this setting is cleared (the default), the software uses the values from all available Message Transport Header metadata fields over values from available MAPI fields. Where a Message Transport Header metadata field is not populated, the MAPI field value will be used, ensuring more complete metadata population.
When this setting is selected, the software uses the values from all available MAPI fields over values from available Message Transport Header fields. Where a MAPI field is not populated, the Message Transport Header value will be used, ensuring more complete metadata population.
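
The precedence-with-fallback behavior described above can be sketched in Python. The function name and the field dictionaries are hypothetical and serve only to illustrate the selection logic; they do not reflect the software's internal data model.

```python
def merge_msg_fields(transport: dict, mapi: dict, prioritize_mapi: bool = False) -> dict:
    """Pick each field from the preferred source, falling back to the other
    source when the preferred value is missing or empty."""
    primary, fallback = (mapi, transport) if prioritize_mapi else (transport, mapi)
    merged = {}
    for name in set(primary) | set(fallback):
        value = primary.get(name)
        if not value:  # not populated in the preferred source
            value = fallback.get(name)
        merged[name] = value
    return merged

# Hypothetical field values for one MSG file.
transport = {"subject": "Q3 report", "to": ""}
mapi = {"subject": "Q3 Report (MAPI)", "to": "jane.sample@sample.com"}

print(merge_msg_fields(transport, mapi))                        # setting cleared
print(merge_msg_fields(transport, mapi, prioritize_mapi=True))  # setting selected
```

In both cases the empty transport "to" value is filled from the MAPI field, which mirrors the "more complete metadata population" described above.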

Note: If you decide to select this setting for initial import or reprocessing, you might want to ensure that the results are as expected before you populate Project Data. (You may want to compare the results of an import using the default setting of Cleared to the results of an import using the setting of Selected.)

This setting affects the calculation of the dupe_fingerprint metadata field value. If you enable this option, you may see additional email metadata and duplicates detected.

This setting is available for Project-, Organization-, and System-level Index Settings, and as a Data Set option when importing a new Data Set. It is also available for reprocessing.

Split Bloomberg Chat

Select this setting to split up Bloomberg chats. Once enabled, this option splits a given chat at midnight Coordinated Universal Time (UTC), which means that messages after 11:59:59 PM on a given day start a new chat at 12:00:00 AM. Each split chat will look like an email message, and you can view the entire chat as an Email Thread, where the split chats in the Thread appear in chronological order. By default, this setting is cleared, which means that Bloomberg chats are kept whole.
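
To illustrate the midnight UTC boundary, the following Python sketch groups chat messages by UTC calendar date. The (timestamp, text) message structure and the function name are assumptions for this example, not the product's internal representation.

```python
from collections import defaultdict
from datetime import datetime, timezone

def split_chat_by_utc_day(messages):
    """Group (timestamp, text) pairs by UTC calendar date, oldest first."""
    parts = defaultdict(list)
    for ts, text in sorted(messages, key=lambda m: m[0]):
        parts[ts.astimezone(timezone.utc).date()].append(text)
    return dict(sorted(parts.items()))

# Hypothetical chat that crosses midnight UTC.
chat = [
    (datetime(2024, 3, 2, 0, 0, 0, tzinfo=timezone.utc), "first message of the new day"),
    (datetime(2024, 3, 1, 23, 59, 59, tzinfo=timezone.utc), "last message of the day"),
]
parts = split_chat_by_utc_day(chat)
for day, texts in parts.items():
    print(day.isoformat(), texts)
```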

Split Journaled Emails

Select this setting to have the software extract attachments from parent Microsoft Exchange journaled emails. (Microsoft Exchange Journal Archiving will archive emails by attaching them to new respective wrapper emails.) By default, this setting is cleared, which means that the software will identify the parent journaled emails (based on the header values in those emails) and report the appropriate journal type, but will not extract the attachments from the parents. When this setting is selected, the items are treated as follows:

The parent wrapper will be a standalone item.
The first-level child (the real email) will be a standalone parent journaled email.
Any extracted children of the first-level child will be children of that parent email (not the wrapper).

After initial processing, or reprocessing with children, you can view information about the processed items in the following metadata fields:

Note: Any reprocessing must be done with children.

journalemailparent — The original parent handle value for any top-level emails extracted from the wrapper when the Split Journaled Emails option is enabled.
journalemailhandle — The handle value of the journal wrapper email for all members in the original email family of one of these wrappers.
journalemailtype — The type of journal email (always reported for any parent journaled email, regardless of the Split Journaled Emails setting):
- JournalEmailWrapper_Standalone (wrapper with no attachments)
- JournalEmailWrapper_SingleEmail (wrapper with one child email)
- JournalEmailWrapper_MultipleEmails (wrapper with multiple child emails)
- JournalEmailChild (reported for all extracted children of the parent email when the Split Journaled Emails option is enabled).

Note: The deduplication scope is affected by this setting. When cleared, the deduplication will only take into account the email wrappers. When selected, the deduplication takes into account both the email wrappers and the original emails.

Split Individual RSMF Messages

Select this setting to split each Relativity Short Message Format (RSMF) file into an RSMF file parent and a series of individual child EML files that were embedded in it. By default, this setting is cleared, which means that Digital Reef identifies the RSMF file and (based on the JSON metadata embedded in the file) associates it with metadata describing its contents, but does not extract the messages from the file as separate items. (Although metadata is extracted from embedded JSON and ZIP files, these files are not themselves extracted from the parent RSMF file, regardless of the Split Individual RSMF Messages setting.)

Note: Deduplication scope is affected by this setting. When cleared, deduplication takes into account only the RSMF files themselves; when selected, deduplication takes into account both the RSMF files and the child EML files.

Document Processing Timeout

Document Processing Timeout:<value> — This document timeout value applies to initial import. When a given document (for example, a loose document) reaches the limit, processing of the document stops at that point. You can use the default document timeout value of 5 minutes, or you can specify a timeout value in the range 3 to 180 minutes (3 hours), inclusive.

Archive Processing Timeout

Use this setting if you need to adjust the timeout value (in minutes) used for the processing of archives, including mail archives such as PST, NSF, MBOX, and file archives such as ZIP, TAR, and RAR. (This timeout does not apply to Bloomberg archive processing.) The default processing timeout value is 600 minutes, which should help accommodate archives that take considerable time to process.

You can use this default or specify your own archive processing timeout value in the range 10 to 1200 minutes (20 hours). This setting will not use a value less than 10 or greater than 1200. Specifying a value less than 10 will display a popup message indicating that the value used will be 10. Likewise, specifying a value greater than 1200 will display a popup message indicating that the value used will be 1200.

To illustrate how this timeout value works, consider an NSF that is subject to the default timeout value of 600 minutes (10 hours). If the processing time of an email in the NSF archive reaches that limit of 600 minutes, then processing of the entire NSF will stop at that point. A higher timeout value may be warranted to help avoid timeouts for certain archives and enable processing to complete. Although the import may take longer with a higher value, it will have more opportunity to complete.
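
The range handling described above amounts to clamping the requested value. A minimal Python sketch follows; the constant and function names are hypothetical, and the popup messages are not modeled.

```python
ARCHIVE_TIMEOUT_MIN = 10       # minutes
ARCHIVE_TIMEOUT_MAX = 1200     # minutes (20 hours)
ARCHIVE_TIMEOUT_DEFAULT = 600  # minutes (10 hours)

def effective_archive_timeout(requested=None):
    """Return the timeout that would be used for a requested value in minutes."""
    if requested is None:
        return ARCHIVE_TIMEOUT_DEFAULT
    # Values outside the accepted range are pulled to the nearest limit.
    return max(ARCHIVE_TIMEOUT_MIN, min(ARCHIVE_TIMEOUT_MAX, requested))

for value in (None, 5, 720, 5000):
    print(value, "->", effective_archive_timeout(value))
```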

Note: The current Archive Processing Timeout value in effect in the Project Index Settings will initially populate the Archive Processing Timeout value in the Reprocess dialog (for reprocessing documents with children). Users can then use this value for reprocessing documents with children, or override the value with another value in the acceptable range.

Advanced Analytics Operations

Include Stop Words - When data is imported and added to Project Data, this Stop Words setting dictates whether Stop Words are ignored or included in Advanced Analytics operations such as Clustering, Word List generation, and comparisons that assess Document Similarity. (When you are working in Project Data, you can run a search for Document Similarity using one or more selected documents or an entire view as the basis of the search against a given target. The operation compares a calculated value for the content of the selected documents or a Synthetic Document to the calculated value of the target.) Stop Words are always included for Indexed operations such as Freeform Searches, but Stop Words are ignored by default for Clustering, Word List generation, and similarity comparisons. Users with permissions may decide to enable this option before data import to have Stop Words treated as valid terms for such operations. Stop Words include words that function as articles, pronouns, and prepositions, plus other selected words. See Stop Words for a complete list of the Stop Words. This setting is not currently subject to change upon reprocessing.

Note: Data that is prepared with an Analytic Index also observes a minimum term length setting of 3 (to avoid common short words, one- and two-letter words), as well as a maximum term length setting of 32 characters. The Stop Words list happens to include many of these short words. However, only words appearing on the actual Stop Words list are affected by the Include Stop Words setting.
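
The interaction between the term-length limits and the Include Stop Words setting can be sketched in Python. The tiny stop-word set below is a hypothetical subset chosen for illustration (see Stop Words for the actual list), and the function name is an assumption.

```python
MIN_TERM_LENGTH = 3
MAX_TERM_LENGTH = 32
SAMPLE_STOP_WORDS = {"the", "and", "with"}  # hypothetical subset, not the real list

def analytic_terms(terms, include_stop_words=False):
    """Keep terms eligible for analytics under the assumed rules."""
    kept = []
    for term in terms:
        # Length limits apply regardless of the Include Stop Words setting.
        if not MIN_TERM_LENGTH <= len(term) <= MAX_TERM_LENGTH:
            continue
        if not include_stop_words and term.lower() in SAMPLE_STOP_WORDS:
            continue
        kept.append(term)
    return kept

words = ["the", "of", "merger", "and", "agreement"]
print(analytic_terms(words))
print(analytic_terms(words, include_stop_words=True))
```

Note that "of" is dropped in both calls: it is shorter than the minimum term length, so enabling Include Stop Words does not bring it back.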

Stop Words Configuration Notes:

Note the following:

Stop Words are ignored by default in Document Similarity operations and Clustering, if applied, and do not appear in a calculated Word List. Including Stop Words will change the default behavior (that is, have Stop Words treated as valid terms in such operations).
For such operations, the Stop Words setting in effect when the data was Indexed affects how Stop Words are treated.
For query-based Search operations, Stop Words are valid searchable terms and the Stop Words setting has no effect.

Copy to Document Storage

You have the option to automatically perform a Copy to Document Storage operation of an imported Data Set with or without exclusions.

Enable Automatic Copy to Document Storage

Note: Copy to Document Storage does not apply to a Load File import.

For a standard import of a Data Set from a Data Area or for a shared Data Set, you can select the Copy to Document Storage setting if you want to copy the source files from the import location to the Organization’s designated Document Storage as part of the import configuration process. This enables you to free the storage associated with the import location. This setting is cleared by default. Performing this copy as part of import will impact the import time.

You can also perform this copy of source files to Document Storage after import, using the equivalent right-click option for a given Data Set under Imports or the equivalent toolbar option when viewing the Imports Summary with a list of all Data Sets.

Document Storage is established for an Organization during Organization Provisioning. If you copy documents to Document Storage, an Export of the Project (for Project Export) will also export the source documents in Document Storage. Deleting the Project will also delete the source documents in the Document Storage.

Note that this operation by default uses all available processing Cores when accommodating a large data set that requires a distribution of work over multiple Analytic Engine resources. In this case, during the copy operation, the software will then compress all source data (that is, it will utilize containers for appropriate file types). If you have the appropriate permissions, you can control the use of all available Cores for the Job.

Note that a Warning icon appears in the Work Basket if a Copy to Document Storage task completed with exceptions as a partial copy (that is, not everything was copied). The Warning status applies to any file that was excluded or failed to copy. In this case, the WARNING_DETAILS_REPORT.csv download identifies the reason why each file was not copied. (See How to Perform a Copy to Document Storage for more information about the exclusions or errors that apply to the copy operation.)

Note: The Copy to Document Storage operation has the ability to preserve the staging used by certain file types, such as Forensic Image file types, multi-part RAR files, and Bloomberg Dump files, as long as their associated document classes are not excluded. When a Copy to Document Storage operation successfully copies everything in the Data Set, you will no longer see the Connector or Data Area information displayed for the Data Set in the Imports Summary, since the Data Area import location is no longer associated with that Data Set.

Use the following checkbox settings to control the exclusion of some document classes or NIST eDocs during a Copy to Document Storage operation. By default, the operation excludes these document classes as well as NIST eDocs (for example, if you use the settings from a new Index template in an existing Organization, or the default Index settings in a new Organization).

Exclude Document Class: Disk Image

By default, a Copy to Document Storage operation excludes files with a document class of Disk Image (for example, a Logical Evidence File). Clear this checkbox to include documents with a document class of Disk Image.

Exclude Document Class: Archive

By default, a Copy to Document Storage operation excludes files with a document class of Archive (for example, a RAR, TAR, or ZIP/ZIPX). Clear this checkbox to include documents with a document class of Archive.

Exclude Document Class: Message Archive

By default, a Copy to Document Storage operation excludes files with a document class of Message Archive (for example, a mail container such as a PST or NSF). Clear this checkbox to include documents with a document class of Message Archive.

Exclude NIST eDoc

By default, a Copy to Document Storage operation excludes NIST eDoc files. Clear this checkbox to include NIST eDoc files.

Excluded File Extensions for Extraction

For initial import or reprocessing using the current Index Settings, use this section to specify the common file extensions of any container files that you want to explicitly exclude from file extraction. When you enter one or more file extensions in this section and save your changes, the software will not extract files from the container files with the specified file extension (their docext value or their origdocext value, if populated). This section is intended to help you prevent the extraction of files from container files that are not deemed useful, such as jar files. By default, this section is empty and there are no container file extensions excluded for extraction.

To fill out this section, enter one file extension per line, without the leading periods (for example, specify jar, not .jar). Each entry must be unique or you will see an error message saying that the named extension already exists. (The uniqueness check is not case sensitive, so JAR would be the same as jar.) Note that the entry for a file extension supports a maximum of 100 characters. A scroll bar will appear to accommodate a longer list of items.

If you want to delete an entry, click the delete icon for that entry.

The list will be sorted in alphabetical order when you click Save.
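
The entry rules above (a 100-character maximum, a case-insensitive uniqueness check, and alphabetical sorting on save) can be sketched in Python. The function name and error wording are hypothetical; only the rules themselves come from the description above.

```python
MAX_EXTENSION_LENGTH = 100

def add_extension(entries, new_entry):
    """Return a new sorted list with new_entry added, or raise ValueError."""
    ext = new_entry.strip()
    if not ext or len(ext) > MAX_EXTENSION_LENGTH:
        raise ValueError("extension must be 1 to 100 characters")
    # The uniqueness check ignores case, so JAR collides with jar.
    if ext.lower() in (e.lower() for e in entries):
        raise ValueError(f"extension already exists: {ext}")
    return sorted([*entries, ext], key=str.lower)

print(add_extension(["war", "jar"], "ear"))
try:
    add_extension(["jar"], "JAR")
except ValueError as err:
    print(err)
```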

Files that are excluded from extraction during processing will have the 01000 SKIPPED_FILE parsingstatus.

Note: Do not reprocess files with the 01000 SKIPPED_FILE parsingstatus, as it will have no effect on the situation.

Other Settings

A user with the appropriate permissions can configure eDiscovery settings.

Specifying a Custodian Directory allows auto-discovery of Custodians. This means that once data becomes part of Project Data, you will see the Custodians automatically added to the Project.

Note: To select any of these settings, click on a setting, then type in the box and click Enter. When you are done changing all settings on the screen, click Save.

Custodian Directory [value] level(s) down from Data Area

Use this field to assign a numeric value that reflects the position of the Custodian Directory at a data area. The default is 0. A value of 1 indicates that the Custodian Directory is in the first directory position. Initially, the value you specify reflects the actual staging used. If you change the value, it dictates the Custodian Directory position for future data added to the Project. The mapping affects the file manifest upon export. Specifying a value enables you to auto-discover Custodians to add to the Project. You can specify only numeric values for this field. This setting is not subject to change upon reprocessing.

Media Directory [value] level(s) down from Data Area

Use this field to assign a numeric value that reflects the position of the Media (Type) Directory at the import location. The default is 0. A value of 2 indicates that the Media Directory is in the second directory position. Initially, the value you specify reflects the actual staging used for the data. If you change the value, it dictates the Media Directory position for future data added to the Project. The mapping affects the file manifest upon export. You can specify only numeric values for this field. This setting is not subject to change upon reprocessing.
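
One way to picture the "level(s) down" values for the Custodian Directory and Media Directory is as a 1-based index into the directory portion of a file's path relative to the Data Area. The Python sketch below rests on that reading; the function name and sample path are hypothetical, and treating 0 as "no directory at this position" is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath

def directory_at_level(relative_path, level):
    """Return the directory name at the given 1-based level, or None."""
    if level <= 0:
        return None  # assumed meaning of the default value 0
    directories = PurePosixPath(relative_path).parts[:-1]  # drop the file name
    return directories[level - 1] if level <= len(directories) else None

path = "jsmith/email/inbox.pst"  # hypothetical staging under a Data Area
print(directory_at_level(path, 1))  # candidate Custodian Directory
print(directory_at_level(path, 2))  # candidate Media Directory
```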

Note: Project-level Index Settings include two additional fields, Matter Name and Matter Number, which are useful at the Project level only. They are not intended to apply to the Organization template level.

Automatic OCR Settings

You can specify language, accuracy, and timeout settings that apply to automatic OCR processing.

Enable Automatic OCR Processing

Use this setting to control whether OCR processing is performed automatically as part of the import process. By default, this setting is cleared. If you select this setting (which affects all subsequent imports associated with added Data Sets), OCR processing is performed using the current automatic OCR Settings and the queries defined under the OCR Settings. The default OCR Settings calculate no-content PDFs, no-content Microsoft documents, TIFF files, OCR Failures, and low-content PDFs.

Language Selection

Use the default of General, which accommodates all detectable languages, including all Latin languages and Chinese, Japanese, and Korean (CJK) languages, or specify a language that requires explicit enabling for OCR processing:

Arabic
Cyrillic (includes Bulgarian, Byelorussian, Chechen, Kabardian, Macedonian, Moldavian, Serbian (non-Latin), Russian, and Ukrainian)
English (for English-only OCR)
Greek
Hebrew
Thai

Accuracy

This setting enables you to select the level of accuracy for automatic OCR processing, either Medium (the default) or High.

Page Timeout

This setting controls the OCR page timeout. The default is 120 seconds. You can edit this value as long as you maintain a non-zero value. Negative values are not valid.

Note: Instead of enabling OCR processing, which performs OCR automatically for each import, consider performing OCR processing after import, after you have had a chance to review the calculated OCR Candidates in the report for your imported Data Set. From the OCR Candidates chart in the report, you can drill through an entry to view the results. You can then use the OCR option available from the document results to perform OCR processing of the selected results. You can perform OCR processing regardless of whether you have already populated Project Data. Remember that if you have damaged, encrypted, or protected documents that you want to reprocess, you can use the Reprocess option from results. (Reprocessing is no longer associated with OCR.) See How to Perform OCR Processing for information about the steps and supported languages for OCR processing.

OCR Queries

A user with the appropriate permissions can take advantage of the default OCR queries, which calculate the different types of OCR Candidates as part of the import process, or add custom queries to identify OCR Candidates. The default queries can be removed, if necessary, or just supplemented by other queries that you add.

The OCR Candidates calculated by the applied queries will appear on the OCR section of the Scan Report for a Data Set under Imports. For all OCR Candidates, or a given type of Candidate, drill-through is supported. From the Drill-through Search Result, click OCR to perform OCR processing of selected or all documents in the list. (You can also perform OCR processing of Candidates for a Custodian.)

Note: When loading these queries, note that queries to be loaded must not contain any name collisions with existing views in the Project. When naming your queries, make sure that they do not match existing views in the Project.

The default queries to detect OCR Candidates are listed in the following sections.

Note: When writing queries that include file types, you must use the correct file type based on the Parsing Library Version in effect in your Project. New and migrated Projects will be V2. Projects created prior to Release 5.2.5.x will be V1 until migrated. See Supported File Types for Analysis for a list of file types based on the Parsing Library Version. In a new Organization as of 5.2.5.x, the default OCR queries will have the V2 file types. In an existing Organization, the default OCR queries in existing Projects and templates will continue to have the V1 file types prior to migration, but all new Projects and templates will have the V2 file types. Template operations will require a match in library versions.

Low Content PDF (< 5 terms/page)

This query calculates the number of PDF files with a low amount of content (less than 5 terms per page) that are eligible for OCR processing.

For Parsing Library Version V2:

filetype::"Adobe PDF" AND NOT (parsingstatus::NODATA OR parsingstatus::UNSUPPORTED_PDF_FORM) AND (averagenumberoftermsperpage::[00000 ~~ 00010] OR charcountlongestword::[00000 ~~ 00003]) AND NOT ocrstatus::<exists>

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND NOT (parsingstatus::NODATA OR parsingstatus::UNSUPPORTED_PDF_FORM) AND (averagenumberoftermsperpage::[00000 ~~ 00010] OR charcountlongestword::[00000 ~~ 00003]) AND NOT ocrstatus::<exists>

No Content Microsoft Docs

This query calculates the number of no-content Microsoft documents that are eligible for OCR processing.

For Parsing Library Version V2:

(filetype::"Microsoft Excel" OR filetype::"Microsoft Word for Windows" OR filetype::"Microsoft PowerPoint" OR filetype::"Microsoft Word for Mac") AND ((averagenumberoftermsperpage::00000) OR (parsingstatus::00005)) AND (NOT ocrstatus::<exists>)

For legacy Parsing Library Version V1:

filetype::Microsoft AND ((averagenumberoftermsperpage::00000) OR (parsingstatus::00005)) AND (NOT ocrstatus::<exists>)

No Content PDF

This query calculates the number of no-content PDFs that are eligible for OCR processing:

For Parsing Library Version V2:

filetype::"Adobe PDF" AND parsingstatus::NODATA AND NOT ocrstatus::<exists>

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND parsingstatus::NODATA AND NOT ocrstatus::<exists>

OCR Failure

This query calculates the number of files (image files) that were subject to an unexpected OCR failure during processing. If the OCR Candidates report shows a non-zero value for OCR Failure, you can try to perform OCR on the affected files again to attain parsingstatus::SUCCESS.

parsingstatus::01200

Tiff

This query calculates the number of TIFF files that are eligible for OCR processing:

filetype::"Tagged Image File Format" AND NOT ocrstatus::<exists>

Unknown Language PDF

This query calculates the number of unknown language PDFs that are eligible for OCR processing (that is, the number of PDFs for which the language is unknown).

Note: This query became available as of 4.3.10.0 for new Organizations and their Projects using the default Organization Index Settings template (or a newly created System Index template).

For Parsing Library Version V2:

filetype::"Adobe PDF" AND language::unknown AND NOT ocrstatus::<exists> AND NOT parsingstatus::(ENCRYPTED OR PROTECTED)

For legacy Parsing Library Version V1:

filetype::"Adobe Acrobat (PDF)" AND language::unknown AND NOT ocrstatus::<exists>

To add your own OCR Candidate query, enter a name for the query in Search Name and enter the query itself in Search Query.

Note: The maximum length for an OCR query is 1024 characters.
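Long queries, such as those that list many file types, can approach this limit. The following is a minimal sketch of a length check you could run before pasting a query into Search Query. The function and constant names are hypothetical, and it assumes the limit counts characters the same way Python's `len` does, which the product may not.

```python
MAX_QUERY_LENGTH = 1024  # limit stated in the note above


def check_query_length(query, limit=MAX_QUERY_LENGTH):
    """Return (fits, length) for a query string.

    Assumption: the product counts characters as Python's len() does.
    """
    length = len(query)
    return length <= limit, length


# The default Tiff query, copied from this page.
tiff_query = 'filetype::"Tagged Image File Format" AND NOT ocrstatus::<exists>'
fits, length = check_query_length(tiff_query)
```

If `fits` is False, consider splitting the query into two named queries.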

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a query, click the icon.

You can add an OCR query after import, have the Data Set Report updated to reflect calculation of the new candidate documents, and perform OCR of those candidate documents.

Additional Example: Other Images

If you want to process secondary image files, you can add more OCR queries. For example, you can add queries for the additional image types:

For Parsing Library Version V2:

(filetype::"JPEG Image" OR filetype::"JPEG2000 Image" OR filetype::"GIF Image" OR filetype::"Portable Network Graphics Image" OR filetype::"Microsoft Bitmap" OR filetype::"X-Windows xbitmap") AND (NOT ocrstatus::<exists>)

For legacy Parsing Library Version V1:

(filetype::"JPEG File Interchange" OR filetype::"Compuserve GIF" OR filetype::"Portable Network Graphics Format" OR filetype::"Windows Bitmap" OR filetype::"X-Windows Bitmap" OR filetype::"Progressive JPEG") AND (NOT ocrstatus::<exists>)
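Both queries above follow the same pattern: a parenthesized list of quoted file type names joined with OR, followed by the ocrstatus exclusion. If you maintain the list of file types elsewhere, the query text can be assembled with a short script. This is only a sketch. The function name is hypothetical, and the file type names you pass in must be the ones valid for your Project's Parsing Library Version.

```python
def build_image_ocr_query(file_types):
    """Build an OCR Candidate query for the given file type names.

    Mirrors the pattern of the example queries on this page:
    (filetype::"A" OR filetype::"B") AND (NOT ocrstatus::<exists>)
    """
    if not file_types:
        raise ValueError("at least one file type is required")
    clauses = " OR ".join(f'filetype::"{name}"' for name in file_types)
    return f"({clauses}) AND (NOT ocrstatus::<exists>)"


# Two of the V2 file type names from the example above.
query = build_image_ocr_query(["JPEG Image", "GIF Image"])
```

The resulting string can be pasted into Search Query. Keep in mind the 1024-character limit on OCR queries.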

Note: If you are going to edit an OCR Candidate Search in the Index Settings, be sure to change the name of the query. This ensures proper handling of the query.

Custom Warnings and Errors

A user with the appropriate permissions can add custom queries to identify the documents that match them. There are no default custom queries.

The queries will appear in the Custom Warnings and Errors section of the Scan Report for a Data Set or for all Imports (if the report is enabled for those views). For a given query, drill-through is supported.

To add a custom query, enter a name for the query in Search Name and enter the query itself in Search Query.

Note: The maximum length for a custom query is 1024 characters.

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a custom query, click the icon.

You can add a custom query after import and update the report to reflect calculation of the documents meeting the query.

Sample custom query

The following shows a custom query (which could be called Potential Mail Store Error) that checks for potentially corrupted PST or OST Mail Containers. This query would help pinpoint PST or OST files that could not be processed correctly and therefore could not be identified by their expected filetype and a docclass of Message_Archive:

(docext::PST OR docext::OST) AND docclass::EDoc AND NOT childcount::<exists>

Index Settings: Save or Discard Changes

If you do not save your changes before navigating away, you will be prompted to save your changes and continue navigating away, discard your changes and continue navigating away, or cancel the navigation and remain in the current location.

Save – Saves your changes to the Index Settings.
Discard Changes – Discards your changes to the Index Settings.