Set Reprocess Document Options

Imports or a Data Set under Imports > Search Results > Reprocess
Project Data or any view of Project Data (except Export views) > Search Results > Reprocess

Requires Imports - Add/Edit Permissions

Note: Digital Reef now restricts import and reprocessing of data to Projects using Parsing Library V2. You can no longer import or reprocess data in a Parsing Library V1 Project.

Users in a role with permissions can reprocess selected documents (from certain Search results) after the initial creation of an Index. Users with these permissions can request reprocessing either with or without the children of the parent documents from Search results (typically, from all Imports or a private Data Set).

Note: Index Settings such as Prioritize MAPI Fields over Transport Header Metadata, Detect Viruses, and Split Journaled Emails apply to reprocessing. Be aware that reprocessing will remove any previous OCR results, and if you have Automatic OCR enabled in the Index Settings, all documents that meet the OCR Candidate queries will be subject to OCR processing. When Automatic OCR processing is performed as part of reprocessing, two additional Work Basket tasks are generated to indicate the associated OCR operations: one for a drill-through search of the Reprocess results view to get the Total OCR Candidates, and another for the OCR Processing of the Total OCR Candidates.

You can request reprocessing regardless of whether you have populated Project Data. However, be sure to follow the guidelines described for the reprocessing options.

Note: If you want to share a Data Set that you have created with other Projects in your Organization, make sure that you perform all reprocessing before you share the Data Set. Reprocessing and OCR are unavailable for a Shared Data Set. When you select Reprocess in a view, the software validates whether reprocessing is permitted, and a Work Basket task appears to confirm the validation. Reprocessing is not permitted for any view of a Shared Data Set.

You may want to reprocess documents after import in the following situations:

  • You have documents that could not be parsed successfully because they are damaged, encrypted, or password-protected files. For example, you inspect the Warning and Errors section of a Data Set Summary Report and notice that you have many damaged, encrypted, or password-protected files that could be reprocessed after the underlying problems have been addressed (for example, you have configured password-cracking options and supplied password files, repaired a PST, decrypted an NSF file, or addressed a protected ZIP file). If such files can be fixed, you can have them reprocessed from their original import location, or you can copy them to an export area to be externally reprocessed for the Project. Drill through entries such as Damaged, Encrypted, Protected, or Archive Extraction Error in the Warning and Errors section, and from the drill-through Search results, select some or all of the files and click Reprocess or, for external reprocessing, Copy to External Area. If you have already populated Project Data, perform the reprocessing or loading of the files with the Reprocess documents with children option. After reprocessing or external processing, check the report again. Files that previously had no children detected because they were damaged, encrypted, or protected may now have newly discovered children. For example, you may see that a repaired PST now has children that have been added to the Index, or that password cracking has opened encrypted or protected PDFs, ZIPs, RAR files, or Microsoft Office documents. For more information about password cracking, see Configure Password Cracking for Reprocessing. For more information about handling protected key files such as Lotus Notes NSF files, see Container Key Files. See How to Perform External Processing for more information about external reprocessing.
  • You perform a Search (for example, of a Data Set) and find that you need to reprocess certain documents due to a parsing change (and therefore get updated metadata information). From the Search results, you can select the documents from the documents list, or right-click on the entire Search Results view. Select Reprocess and select the appropriate option to reprocess documents from the Search results. If you have populated Project Data, you can select Reprocess documents with children for parent-level documents that you know do not have children present in Project Data; for documents that have children present in Project Data, select the Reprocess documents only option instead, or remove the children from Project Data before attempting to reprocess with children. (For example, if you need to reprocess an entire PST, you would need to remove all of its children from Project Data before attempting to reprocess with children.) When reprocessing documents with children, you can optionally specify a processing timeout value (in minutes) for an archive, as described later in this topic.
  • You have changed your Project Patterns or Index Settings. (Most likely, you will want to do this reprocessing of all documents with the Reprocess documents only option.)

Note: Reprocessing is a powerful feature, but it can be complicated to determine the optimal way to perform the reprocessing for a given situation. If you are unsure about how you should perform reprocessing (for example, if you need help assessing the impact of a parsing change in terms of reprocessing), please consult your Digital Reef representative.

When you select Reprocess from a Search Results view, you can choose the appropriate reprocessing options, as described in the next sections.

General Reprocess Document Timeout Option

The following option generally applies to reprocessing:

  • Document Processing Timeout:<value> — This document timeout value applies to any individual document selected for reprocessing (either reprocessing documents only or documents with children). When a given document (for example, a loose document) reaches the limit, processing of the document stops at that point. You can use the default document timeout value of 5 minutes, or you can specify a timeout value in the range 3 to 180 minutes (3 hours). This option will not allow you to use a value less than 3 or greater than 180 minutes. Specifying a value less than 3 will display a popup message indicating that the value used will be 3. Likewise, specifying a value greater than 180 will display a popup message indicating that the value used will be 180.
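The range handling described for the document timeout can be sketched as follows. This is a minimal illustration only, not the product's implementation: the function name, the constants, and the message wording are hypothetical, and only the default (5) and the limits (3 and 180 minutes) come from this topic.

```python
# Illustrative sketch only; names and message text are assumptions.
DOC_TIMEOUT_DEFAULT = 5    # minutes, default stated in this topic
DOC_TIMEOUT_MIN = 3        # minutes, lowest accepted value
DOC_TIMEOUT_MAX = 180      # minutes (3 hours), highest accepted value


def effective_document_timeout(requested=None):
    """Return (value_used, popup_message) for a requested timeout in minutes.

    popup_message is None when the requested value is accepted unchanged.
    """
    if requested is None:
        # No value supplied: the default applies.
        return DOC_TIMEOUT_DEFAULT, None
    if requested < DOC_TIMEOUT_MIN:
        return DOC_TIMEOUT_MIN, f"The value used will be {DOC_TIMEOUT_MIN}."
    if requested > DOC_TIMEOUT_MAX:
        return DOC_TIMEOUT_MAX, f"The value used will be {DOC_TIMEOUT_MAX}."
    return requested, None


if __name__ == "__main__":
    for value in (None, 1, 45, 500):
        print(value, "->", effective_document_timeout(value))
```

In this sketch, an out-of-range request is adjusted to the nearest limit rather than rejected, which mirrors the popup behavior described above.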

Reprocess Options by Type

There are two main types of reprocessing, selectable by clicking the appropriate tab:

  • Reprocess Documents Only (default) — Use this mode to either reprocess an entire result view (for example, a search of a Data Set, all Imports, or Project Data view), or to reprocess selected documents in a result view. Keep in mind that if you select a subset of documents (for example, a selected parent document or an individually selected attachment), this mode reprocesses only those selected documents and does not extend the reprocessing to any unselected family members. This type of reprocessing ignores container files. For example, a disk image or archive selected for reprocessing with this option will always be skipped.
    • Extract from Container (Lotus Notes only) — This option restricts reprocessing to Lotus Notes parent-level items extracted from NSF container files. Existing attachments to those parent-level items are not changed, and new attachments are not extracted. After reprocessing with this option, you may see a change in the document handles displayed for the parent Lotus Notes items. This may also cause changes to any exported volumes that included the reprocessed items.
  • Reprocess Documents with Children — Use this mode to reprocess selected parent-level documents and all children of the reprocessed parent documents or emails, as long as the children of those parent-level records are not present in Project Data. This option helps discover the children of documents that could not previously be processed (for example, previously damaged or encrypted files that have been fixed in a Data Set). All parent-level records whose children are present in Project Data are not reprocessed, regardless of their source Data Set (that is, they are skipped). Before attempting to reprocess a given parent-level record, remove its children records from Project Data. If you select this option, the following options apply:
    • Sync Document Children to Project Data — For reprocessed parent-level documents that already reside in Project Data, synchronizes the children of those documents to ensure that Project Data is updated to reflect changes (for example, newly discovered children). You can select one or both of the following options with this option:
      • Apply Parent Tags to Children — Copies the tags associated with a parent to all of its children. In this case, the children inherit the parent's tag history (for example, the Tag Apply events).
      • Apply New Tags to Children - Select Tags — Applies one or more tags that you select to the children. Use the Select Tags button to select the tags you want in a popup. A Selected:<tags> message appears to indicate what you have selected, or a Selected: 0 tags message indicates if you have not selected any tags for this option.
    • Archive Processing Timeout: <value> — When reprocessing documents with children, you can specify a timeout value (in minutes) for the reprocessing of archives, including mail archives such as PST, NSF, and MBOX, and file archives such as ZIP, TAR, and RAR. (This timeout does not apply to Bloomberg archive processing.) When you launch this dialog, the initial value you see reflects the current processing timeout value in effect in the Project Index Settings (either the default timeout value of 600 minutes if no user has changed the value, or the timeout value configured by a user with the appropriate permissions). You can retain this value, or override it for reprocessing with another value in the range 10 to 1200 minutes (20 hours). This option will not use a value less than 10 or greater than 1200 minutes. Specifying a value less than 10 will display a popup message indicating that the value used will be 10. Likewise, specifying a value greater than 1200 will display a popup message indicating that the value used will be 1200. To illustrate how this timeout value works, consider an NSF that is subject to the default timeout value of 600 minutes (10 hours). If the processing time of an email in the NSF archive reaches that limit, then processing of the entire NSF will stop at that point. Do not expect the Work Basket Reprocessing task to directly reflect the reprocessing timeout value you specify. The timeout value applies to the archive, and many factors can impact the entire task time. This option is not available until you select Reprocess Documents with Children.
  • OK — Click this to start the reprocess operation. You can monitor the Reprocess task in the Work Basket and use View Details to view information about the operation, as described in About Viewing the Reprocessing Task Details. If the task completes but with exceptions, you can right-click the task in the Work Basket and download the generated WarningDetails.csv file, which will identify each error encountered (one per line), with the document handle and reason for each. See About the CSV Warnings File for Reprocessing for more information.

Note: In a Project using a Shared Data Set (a sub-Project), you will be blocked from performing a Reprocess operation for that Shared Data Set.

  • Cancel — Click this to cancel the operation.
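The skip rules for the two reprocessing types can be summarized in a short sketch. It is a simplified model under stated assumptions, not the product's logic: the Doc class, its fields, and the plan_reprocess function are hypothetical names, and options such as Extract from Container and Sync Document Children to Project Data are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document selected in a Search Results view."""
    handle: str
    is_container: bool = False              # for example, a disk image or archive
    children_in_project_data: bool = False  # any child already in Project Data


def plan_reprocess(docs, with_children):
    """Split the selection into handles to reprocess and handles to skip.

    with_children=False models Reprocess Documents Only:
        container files are always skipped.
    with_children=True models Reprocess Documents with Children:
        parents whose children are present in Project Data are skipped.
    """
    to_process, skipped = [], []
    for doc in docs:
        if with_children and doc.children_in_project_data:
            skipped.append(doc.handle)
        elif not with_children and doc.is_container:
            skipped.append(doc.handle)
        else:
            to_process.append(doc.handle)
    return to_process, skipped


if __name__ == "__main__":
    selection = [
        Doc("loose-file"),
        Doc("repaired-pst", is_container=True),
        Doc("old-pst", is_container=True, children_in_project_data=True),
    ]
    print(plan_reprocess(selection, with_children=False))
    print(plan_reprocess(selection, with_children=True))
```

In this model, the same archive can be skipped in one mode and reprocessed in the other, which is why removing children from Project Data matters before you reprocess a parent with children.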

About Viewing the Reprocessing Task Details

If you select View Details for the reprocessing task, you can verify the reprocessing options in effect for the operation. Examine the Reprocessing Details section of the details for the reprocessing task to see entries for the selected reprocessing options (separate lines for each option).

Example of Reprocessing Details (Reprocess Documents with Children):

Apply New Tag to Children                 False
Apply Parent Tags to Children             True
Document Timeout                          5 Minutes
Archive Timeout                           600 Minutes
Reprocess Documents with Children         True
Scan Date                                 2020-11-17-19-28-08
Sync Document Children to Project Data    True
Update Native Files                       False

If you run the Reprocess operation with Reprocess Documents Only, the Extract from Container entry also appears as True or False, depending on the selection made.

About the CSV Warnings File for Reprocessing

If the Reprocessing operation encounters exceptions, the Work Basket task displays a Warning icon, and you can right-click the task and use the Download option to download a CSV (WARNING_DETAILS_REPORT.csv) that contains the following column information:

  • Document Handle
  • Reason

The following information can appear in the Reason column, which provides information about why a given document was not reprocessed:

  • Document skipped: Reported for a document that was skipped because it could not be reprocessed. For Reprocess Documents with Children, all parent-level records whose children are present in Project Data are skipped, regardless of their source Data Set. Before attempting to reprocess a given parent-level record with children, remove its children records from Project Data. For Reprocess documents only, a container file such as a disk image or archive will always be skipped.
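If you want to review a downloaded warnings file outside the application, a small script can group the document handles by reason. The sketch below assumes that the CSV has a header row with the column names Document Handle and Reason exactly as listed above. The function name and the sample handles are hypothetical.

```python
import csv
import io


def handles_by_reason(csv_text):
    """Group document handles by the value in the Reason column.

    Assumes a header row with 'Document Handle' and 'Reason' columns.
    """
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        reason = (row.get("Reason") or "").strip()
        handle = (row.get("Document Handle") or "").strip()
        grouped.setdefault(reason, []).append(handle)
    return grouped


if __name__ == "__main__":
    # Made-up sample content; real handles will look different.
    sample = (
        "Document Handle,Reason\n"
        "EXAMPLE-0001,Document skipped\n"
        "EXAMPLE-0002,Document skipped\n"
    )
    for reason, handles in handles_by_reason(sample).items():
        print(f"{reason}: {len(handles)} document(s)")
```

To use it with a downloaded file, read the file contents into a string first and pass that string to the function.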

See How to Perform Document Reprocessing for more information.