How to Perform Document Reprocessing of Results

Imports or a Data Set under Imports > Search Results > Reprocess
Project Data or a view of Project Data (except Export views) > Search Results > Reprocess

Requires Imports - Add/Edit Permissions

Note: Digital Reef now restricts import and reprocessing of data to Projects using Parsing Library V2. You can no longer import or reprocess data in a Parsing Library V1 Project.

Users in a role with the appropriate permissions can use the Reprocess option to request the reprocessing of selected documents from Search Results (for example, for all of Imports, a private Data Set, or even Project Data). Documents selected for reprocessing using the Reprocess option from Search Results are reprocessed at their original source data location.

For all document reprocessing, the software performs its own password cracking and accumulates a list of working (found) passwords. You can also specify your own password cracking criteria. For more information about configuring password cracking criteria, see Configure Password Cracking for Reprocessing.

Note: Reprocessing and OCR processing are not permitted if any of the documents are from a Shared (public in the Organization) Data Set. Once a Data Set is Shared, it is owned by the Organization. When you select Reprocess, the software validates whether reprocessing is permitted, and a Work Basket task appears to confirm the validation. Reprocessing is not permitted for any view of a Shared Data Set.

You may want to reprocess documents after import in the following situations:

  • You have documents that could not be parsed successfully because they are damaged, encrypted, or password-protected. For example, you inspect the Warning and Errors section of a Data Set Summary Report and notice many damaged, encrypted, or password-protected files that could be reprocessed once the underlying problems have been addressed (for example, you have configured password-cracking options and supplied password files, repaired a PST, decrypted an NSF file, or addressed a protected ZIP file). If such files can be fixed, you can have them reprocessed from their original import location, or you can copy them to an export area to be reprocessed externally. Drill through entries such as Damaged, Encrypted, Protected, or Archive Extraction Error in the Warning and Errors section, select some or all of the files in the drill-through Search Results, and then use the Reprocess option or, for external reprocessing, the Copy to External Area option. If you have already populated Project Data, perform the reprocessing or loading of the files with the Reprocess documents with children option. After reprocessing or external processing, check the report again. Files that previously had no children detected because they were damaged, encrypted, or protected may now have newly discovered children. For example, a repaired PST may now have children that have been added to the Index, or password cracking may have opened encrypted or protected PDFs, ZIPs, RAR files, or Microsoft Office documents. For more information about password cracking, see Configure Password Cracking for Reprocessing. For more information about handling protected Lotus Notes NSF files, see Add and Manage Lotus Notes ID Files. For more information about external reprocessing, see How to Perform External Processing.
  • You perform a Search (for example, of a Data Set) and find that you need to reprocess certain documents because of a parsing change (and therefore to get updated metadata). From the Search Results, you can select the documents from the documents list, or right-click the entire Search Results view. Select Reprocess and then select the appropriate option to reprocess documents from the Search Results. If you have populated Project Data, you can select Reprocess documents with children for parent-level documents that you know do not have children present in Project Data. For documents that do have children present in Project Data, select the Reprocess documents only option instead, or remove the children from Project Data before attempting to reprocess with children. When reprocessing documents with children, you can optionally specify a timeout value for a message archive (for example, a PST or NSF). The default timeout value is 600 minutes, and you can specify any value in the range of 10 to 1200 minutes (20 hours). For example, for a message archive such as an NSF with the default value of 600 minutes, if the processing time of an email in the NSF reaches the limit, processing of the entire NSF stops at that point.
  • You have changed your Project Patterns or Index Settings. (Most likely, you will want to do this reprocessing of all documents with the Reprocess documents only option.)
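
The message archive timeout rule described above (default of 600 minutes, allowed range of 10 to 1200 minutes) can be illustrated with a short sketch. This is not Digital Reef code: the function name and the choice to raise an error for an out-of-range value are assumptions made for illustration, and the product may handle invalid input differently.

```python
# Illustrative sketch only; names and error handling are hypothetical,
# not part of the Digital Reef software.
DEFAULT_TIMEOUT_MINUTES = 600   # default stated in this topic
MIN_TIMEOUT_MINUTES = 10        # lower bound stated in this topic
MAX_TIMEOUT_MINUTES = 1200      # upper bound stated in this topic (20 hours)


def resolve_archive_timeout(requested_minutes=None):
    """Return the message archive timeout, in minutes, to use for a reprocess request.

    Falls back to the default when no value is supplied and rejects
    values outside the documented range (the rejection is an assumption).
    """
    if requested_minutes is None:
        return DEFAULT_TIMEOUT_MINUTES
    if not MIN_TIMEOUT_MINUTES <= requested_minutes <= MAX_TIMEOUT_MINUTES:
        raise ValueError(
            f"Timeout must be between {MIN_TIMEOUT_MINUTES} and "
            f"{MAX_TIMEOUT_MINUTES} minutes, got {requested_minutes}"
        )
    return requested_minutes
```

For example, calling resolve_archive_timeout() with no argument returns the default of 600, while resolve_archive_timeout(5) raises a ValueError in this sketch.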

Note: Reprocessing is a powerful feature, but it can be complicated to determine the optimal way to perform the reprocessing for a given situation. If you are unsure about how you should perform reprocessing (for example, if you need help assessing the impact of a parsing change in terms of reprocessing), please consult your Digital Reef representative.

How Reprocessing Works

The following summarizes key points about how reprocessing works:

  • If you have not yet added any documents from a Data Set to Project Data, all documents are eligible for reprocessing by default, regardless of whether or not they have children present in Project Data.
  • Once you add documents from a Data Set to Project Data, the following rules apply when you attempt to use the reprocess option Reprocess documents with children (instead of the default Reprocess documents only):
    • Any parent-level records whose children are present in Project Data are not reprocessed, regardless of their source Data Set (that is, they are skipped). Before attempting to reprocess a given parent-level record, remove its child records from Project Data. Note that a child is a descendant of a document, located in either a Message Attachment Group (MAG) or a Document Attachment Group (DAG). A child is not limited to being a direct descendant of a document and can be a descendant at any level. For example, a Word document (embedded OLE) attached to an Excel file that is attached to an email is a child of the email itself.

    • Any parent-level records whose children are not present in Project Data are reprocessed. This means that a previously damaged, encrypted, or protected parent-level document that has been fixed can have its newly discovered children added to Project Data.
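
The skip rule for Reprocess documents with children can be pictured with a small model. The sketch below is only an illustration of the rule as described above, not the software's implementation: the Doc class, the use of file names as identifiers, and the function names are all hypothetical. It treats a child as a descendant at any level, so the email in the example is skipped when the embedded Word document is present in Project Data.

```python
# Illustrative model of the "Reprocess documents with children" rule.
# All names here are hypothetical; this is not Digital Reef code.
from dataclasses import dataclass, field
from typing import Iterator, List, Set


@dataclass
class Doc:
    name: str
    children: List["Doc"] = field(default_factory=list)


def descendants(doc: Doc) -> Iterator[Doc]:
    """Yield every descendant of doc, at any depth."""
    for child in doc.children:
        yield child
        yield from descendants(child)


def is_reprocessed_with_children(doc: Doc, project_data: Set[str]) -> bool:
    """Return True if doc would be reprocessed, False if it would be skipped.

    A parent-level record is skipped when any of its descendants is
    present in Project Data (modeled here as a set of document names).
    """
    return not any(d.name in project_data for d in descendants(doc))


# Example from this topic: a Word document attached to an Excel file
# that is attached to an email.
word = Doc("notes.docx")
excel = Doc("budget.xlsx", [word])
email = Doc("message.eml", [excel])
```

With this model, is_reprocessed_with_children(email, {"notes.docx"}) returns False (the email is skipped), and is_reprocessed_with_children(email, set()) returns True.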

Note: If you select the Reprocess documents only option, you can also use the Extract from Container option, which limits the reprocessing to parent email items that can be extracted from their NSF container files. Once the Extract from Container option is selected, the reprocessing does not include files other than the extracted parent Lotus Notes email items. Therefore, only select this option when you want to re-extract selected Lotus Notes parent email items from NSF files in order to reprocess them for changes (including calendar items, journal entries, tasks, and contacts). This option applies to the parent emails only, not their attachments. If you select this option, you may see a change in the document handles displayed for the parent Lotus Notes emails (for example, when viewing the metadata for the attachments to the parent Lotus Notes emails). Also, this option may cause changes to any exported volumes that included the selected parent emails.

See Reprocess Options for more information about using the reprocessing options.

The following summarizes the steps involved in reprocessing to address files that are initially identified as damaged, encrypted, protected, or archive extraction errors:

  1. Review the Organization Index Settings template and Project Index Settings to select the appropriate settings that affect your import. The Index Settings include an OCR enabled/disabled setting (which you typically keep disabled for import).
  2. When you are done establishing the import settings you want, select your data for import (and any key Legal Discovery options that need to be set), and perform the import. You can decide whether to perform the import with the Add to Project Data option cleared (the default) or set (which populates Project Data). Keeping the option cleared gives you more flexibility if you later need to reprocess documents with existing children.
  3. If you want to inspect potential problem files after import, select the imported Data Set under Imports and click the Reports tab to view the Scan Report. In particular, view the Warning and Errors section, which lists any Damaged, Encrypted, Protected, or Archive Extraction Error files. You can also search for these files using the parsing status (for example, parsingstatus::00027 for encrypted files, parsingstatus::00028 for damaged files, parsingstatus::00029 for protected files, and parsingstatus::01010 for archives with extraction errors, such as a protected ZIP file). Once you resolve the issues associated with these files (for example, you fix a damaged PST file or configure password-cracking criteria), you can reprocess them.
  4. Drill through the appropriate Damaged, Encrypted, or Protected file categories and then use the appropriate Reprocess option for selected documents in the drill-through Results view. The Reprocess option is available from the Search Results only (for example, as a toolbar option from the documents list, or as a right-click option in the Navigation tree for a results view). If you have populated Project Data, you can also issue Reprocess from any Project Data-based results view.
  5. Review the Reprocess options. See Reprocess Options for more information about using the reprocessing options.
  6. After you start the reprocessing operation, you can monitor the task in the Work Basket until it completes. Selecting View Details for the reprocessing task in the Work Basket enables you to verify the reprocessing options in effect for the operation. Examine the Reprocessing Details section of the details for the reprocessing task to see entries for the selected reprocessing options (a separate line for each option). See Reprocess Options for an example.

Note: Be aware that reprocessing will remove any previous OCR results, and if you have Automatic OCR enabled in the Index Settings, all documents that meet the OCR Candidate queries will be subject to OCR processing. When Automatic OCR processing is performed as part of reprocessing, two additional Work Basket tasks are generated to indicate the associated OCR operations: one for a drill-through search of the Reprocess results view to get the Total OCR Candidates, and another for the OCR Processing of the Total OCR Candidates.

  7. View the Scan Report information again. The Warning and Errors counts should be lower in the Scan Report and the Summary should indicate more successfully parsed documents. The parsingstatus metadata field will be updated with the latest parsing status, and you can check the origparsingstatus field if you want to see the parsing status originally reported after import.
  8. When you are satisfied with the results of your reprocessing, add documents to the Project Data if you have not already done so (using a right-click of the Data Set under Imports and selecting Add to Project Data).
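
If you script or template the parsing status searches mentioned in the steps above, a small lookup table can help keep the codes straight. The sketch below only builds the query strings given in this topic (parsingstatus::00027, 00028, 00029, and 01010). The category keys and the function name are hypothetical, and the sketch does not submit anything to the software.

```python
# Illustrative helper that builds the parsing status queries listed in this topic.
# The dictionary keys and function name are hypothetical labels.
PARSING_STATUS_CODES = {
    "encrypted": "00027",
    "damaged": "00028",
    "protected": "00029",
    "archive_extraction_error": "01010",
}


def parsing_status_query(category: str) -> str:
    """Return the search query string for one problem-file category."""
    try:
        code = PARSING_STATUS_CODES[category]
    except KeyError:
        known = ", ".join(sorted(PARSING_STATUS_CODES))
        raise ValueError(f"Unknown category {category!r}; expected one of: {known}")
    return f"parsingstatus::{code}"
```

For example, parsing_status_query("damaged") returns the string parsingstatus::00028, which you can paste into a Search of the imported Data Set.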

Note: If any document from a Data Set has been added to Project Data, the Reprocess option to Reprocess documents with children does not apply to any documents in that Data Set that have existing children. If you need to reprocess some previously imported documents that have existing children, and you have already populated Project Data, you may want to consult your Digital Reef representative. You may need to remove the documents with existing children from Project Data before performing the reprocess operation.

Reprocessing documents that reside in Project Data causes a recalculation of the dupe_fingerprint value for those documents (just those documents, not all of Project Data).