How to Perform OCR Processing

Drill-through of entry in OCR Candidates Report from a Data Set Report or All Imports Report > OCR

Requires Imports - Add/Edit Permissions

Users in a role with the appropriate permissions can perform OCR processing of imported data based on the calculated OCR Candidates. You can either perform OCR processing of selected candidates from within eDiscovery, or copy a search results view of OCR candidates to an export area in order to perform external OCR processing.

Note: OCR processing and reprocessing are not permitted if any of the documents are from a Shared (public in the Organization) Data Set. Once a Data Set is Shared, it is owned by the Organization.

About OCR Processing of Selected Documents

The OCR software is designed to limit OCR processing to unique documents that have not already been submitted for OCR processing in the Project. Whenever you submit documents for OCR processing, the software checks your list against the list of documents for which OCR processing has already been performed in the Project. Note the following:

  • Your list is checked for duplicates in order to limit OCR processing to one representative document within each set of duplicates. In this case, all duplicates in the set will have the same OCR text and OCR metadata.
  • If you submit a filemd5 duplicate of a document that has already been subject to OCR processing, the document you submit will inherit the OCR text and OCR metadata from the already processed document.
  • In the situation where you need to produce new OCR results for a document that has already been subject to OCR processing, you must set up an OCR operation that includes all filemd5 duplicates of that document.
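
The duplicate handling described above can be sketched in a few lines of Python. This is an illustration of the rules only, not the product's implementation: the function names `plan_ocr` and `covers_all_duplicates`, the tuple layout, and the sample identifiers are all hypothetical. Only the idea of grouping by `filemd5` comes from this topic.

```python
from collections import defaultdict


def plan_ocr(submitted, already_processed):
    """Decide which submitted documents are processed and which inherit results.

    submitted: list of (doc_id, filemd5) tuples (hypothetical layout).
    already_processed: dict mapping filemd5 -> doc_id of a document that
        has already been subject to OCR processing in the Project.
    """
    groups = defaultdict(list)
    for doc_id, md5 in submitted:
        groups[md5].append(doc_id)

    to_process = []  # one representative per new set of duplicates
    inherits = {}    # doc_id -> doc_id whose OCR text and metadata it receives
    for md5, doc_ids in groups.items():
        if md5 in already_processed:
            # Every submitted duplicate inherits from the processed document.
            source = already_processed[md5]
            members = doc_ids
        else:
            # The first document in the set stands in for the whole set.
            source = doc_ids[0]
            to_process.append(source)
            members = doc_ids[1:]
        for doc_id in members:
            inherits[doc_id] = source
    return to_process, inherits


def covers_all_duplicates(submitted_ids, project_ids_for_md5):
    """True if the operation includes every filemd5 duplicate of a document,
    which is the stated precondition for producing new OCR results."""
    return set(project_ids_for_md5) <= set(submitted_ids)


submitted = [("A", "md5-1"), ("B", "md5-1"), ("C", "md5-2"), ("D", "md5-3")]
to_process, inherits = plan_ocr(submitted, {"md5-3": "Z"})
print(to_process)  # ['A', 'C']
print(inherits)    # {'B': 'A', 'D': 'Z'}
```

In this sample, A and B are duplicates, so only A is processed and B receives the same OCR text and metadata. D is a duplicate of the previously processed document Z, so it inherits from Z rather than being processed again.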

OCR Processing Steps

To import data and perform OCR processing, follow these steps:

  1. Review the Organization Index Settings template and Project Index Settings to select the appropriate settings that affect your import. The Index Settings include two OCR-related sections. The Automatic OCR Settings section contains settings for automatic OCR processing (enable/disable), language selection, accuracy, and page timeout. The OCR Settings section contains a default set of queries to calculate OCR Candidates (No Content PDFs, TIFFs, Low Content PDFs, No Content Microsoft documents, and OCR Failures). You can use these default queries as they are, or add your own queries to calculate OCR Candidates.

Note: Be aware that reprocessing will remove any previous OCR results, and if you have Automatic OCR enabled in the Index Settings, all documents that meet the OCR Candidate queries will be subject to OCR processing. When Automatic OCR processing is performed as part of reprocessing, two additional Work Basket tasks are generated to indicate the associated OCR operations: one for a drill-through search of the Reprocess results view to get the Total OCR Candidates, and another for the OCR Processing of the Total OCR Candidates.

  2. When you have finished establishing the OCR Settings and any other import settings, select your data for import (and any key Legal Discovery options that need to be set), and perform the import. Typically, you perform the import with the Add to Project Data option cleared (the default), which lets you review any problem files, review the OCR Candidates, and perform OCR processing before populating Project Data.
  3. Under Imports, either view the Reports tab for all Imports, or select a particular imported Data Set and click the Reports tab. The OCR Candidates section shows the calculated candidates: No Content PDFs, TIFF files, Low Content PDFs (PDFs with content of fewer than 5 terms per page), No Content Microsoft documents, OCR Failures, and a Total of all candidates. The candidates are calculated based on the queries applied in the OCR Settings section of the Index Settings template or Project Index Settings.
  4. Drill through one of the OCR candidate categories, or the Total OCR Candidates entry (to process all candidates), and then use the OCR option for selected documents or all documents in the drill-through Results view. The OCR option is available only from the drill-through Search Results, either as a right-click option or as a toolbar option. When you select this option, a popup allows you to select OCR Processing Settings.
  5. In the Work Basket, you will see a task validating that OCR processing can be done, followed by a running OCR task that enables you to track progress. For example, you can right-click and select Task Details for the running OCR task and see the following information to track progress:
    • Documents Processed: Master <value> - The number of master documents processed.
    • Documents Submitted: Master <value> - The number of master documents submitted for OCR processing.
    • Documents Processed: Clone <value> - The number of cloned documents processed.
    • Documents Submitted: Clone <value> - The number of cloned documents submitted for OCR processing.

    Note: If you cancel an in-progress OCR task, you will see a message asking you to confirm the cancellation. This message includes a warning that canceling an OCR task will not preserve any OCR text already generated. You may want to evaluate the current progress before proceeding.

  6. View the Scan Report information again. The OCR Documents and OCR Pages counts are updated in the Scan Report. The index is updated with the latest information, even if the documents have already been added to Project Data. Metadata field information is also updated. For example, the averagenumberoftermsperpage field and the pagecount field, which applies to PDFs as well as Microsoft Office files, are populated based on OCR information. When you view a document that has been subject to OCR processing in the Document Viewer, the HTML tab shows the converted file.
  7. You may want to view other sections of the report, such as the OCR Confidence chart and the Warning and Errors section, to see if you should also reprocess any Damaged, Encrypted, or Protected files. To reprocess, drill through the appropriate entries, and use the Process > Reprocess option to reprocess the selected files. To learn about the general reprocessing restrictions, see the topic How to Perform Document Reprocessing of Results. These restrictions determine whether reprocessing or external reprocessing changes are reflected or ignored.
  8. When you are satisfied with the results of your OCR processing (and any reprocessing you decide to do), you can add documents to the Project if you have not already done so. For example, you can right-click the Data Set under Imports, and select Add to Project Data.

Note: If Automatic OCR is enabled in the Index Settings when the import occurs, OCR processing is performed automatically based on the calculated OCR Candidates, and you can review the results in the OCR Documents and OCR Pages entries in the Scan Report. After import, OCR processing can be performed at any time, regardless of whether the documents have been added to Project Data; the information will be updated for the affected files as long as all filemd5 duplicates are included in the OCR processing. In a typical scenario, you would process the desired OCR Candidates (for either all Imports or a particular Data Set) and then populate Project Data.
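
To make the OCR Candidates categories more concrete, the following Python sketch shows one way the counts in the report could be derived. It is an illustration under stated assumptions, not the queries the product actually runs: those are defined in the OCR Settings section of the Index Settings. The field names `extension` and `ocr_failed`, the list of Microsoft Office extensions, the order of the checks, and the reading of "No Content" as zero terms per page are all assumptions. Only the `averagenumberoftermsperpage` field and the threshold of 5 terms per page come from this topic.

```python
from collections import Counter

LOW_CONTENT_TERMS_PER_PAGE = 5  # from the report label "< 5 terms/page"
OFFICE_EXTENSIONS = {"doc", "docx", "xls", "xlsx", "ppt", "pptx"}  # assumed list


def ocr_candidate_category(doc):
    """Return the candidate category for a document, or None if it is not one."""
    ext = doc["extension"].lower()
    terms_per_page = doc.get("averagenumberoftermsperpage", 0)
    if doc.get("ocr_failed"):
        return "OCR Failures"
    if ext in ("tif", "tiff"):
        return "TIFFs"
    if ext == "pdf":
        if terms_per_page == 0:  # assumption: "No Content" means no terms
            return "No Content PDFs"
        if terms_per_page < LOW_CONTENT_TERMS_PER_PAGE:
            return "Low Content PDFs"
    if ext in OFFICE_EXTENSIONS and terms_per_page == 0:
        return "No Content Microsoft documents"
    return None


def count_candidates(docs):
    """Count documents per category and add a total of all candidates."""
    counts = Counter()
    for doc in docs:
        category = ocr_candidate_category(doc)
        if category is not None:
            counts[category] += 1
    counts["Total OCR Candidates"] = sum(counts.values())
    return counts


docs = [
    {"extension": "pdf", "averagenumberoftermsperpage": 0},
    {"extension": "PDF", "averagenumberoftermsperpage": 3.5},
    {"extension": "tiff"},
    {"extension": "docx", "averagenumberoftermsperpage": 120},
]
print(dict(count_candidates(docs)))
```

In the sample data, the first three documents each fall into a candidate category, and the Word document with 120 terms per page is not a candidate.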

OCR Language Support

The OCR Processing software detects and handles the following languages, which fall under the General category, without additional configuration:

  • Albanian
  • Catalan
  • CJK (simplified and traditional Chinese, Japanese, and Korean)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Esperanto
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Hungarian
  • Icelandic
  • Italian
  • Latvian
  • Lithuanian
  • Maltese
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Serbian (Latin) (referred to as Serbian in earlier 4.3.x releases)
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Turkish

The following languages are not detected automatically, and require individual selection for OCR Processing:

  • Arabic
  • Cyrillic languages, which include the following:
    • Bulgarian
    • Byelorussian (also known as Belarusian)
    • Chechen
    • Kabardian
    • Macedonian
    • Moldavian
    • Serbian (non-Latin)
    • Russian
    • Ukrainian
  • Greek
  • Thai

For a list of the language codes associated with language detection, see Supported Languages for Automatic Language Detection.
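
When planning an OCR operation, it can help to check in advance whether a document's language is handled automatically or needs to be selected individually. The following Python sketch encodes the two lists above as simple lookup sets. The function name `needs_individual_selection` and the lowercase spellings are hypothetical conveniences; the product itself exposes the choice through the OCR Processing Settings popup, not through code like this.

```python
# Languages in the General category, detected without additional configuration.
AUTO_DETECTED = {
    "albanian", "catalan", "cjk", "croatian", "czech", "danish", "dutch",
    "english", "esperanto", "estonian", "finnish", "french", "galician",
    "german", "hungarian", "icelandic", "italian", "latvian", "lithuanian",
    "maltese", "norwegian", "polish", "portuguese", "romanian",
    "serbian (latin)", "slovak", "slovenian", "spanish", "swedish", "turkish",
}

# Languages that are not detected automatically and must be selected.
INDIVIDUAL_SELECTION = {
    "arabic", "greek", "thai",
    # Cyrillic languages
    "bulgarian", "byelorussian", "belarusian", "chechen", "kabardian",
    "macedonian", "moldavian", "serbian (non-latin)", "russian", "ukrainian",
}


def needs_individual_selection(language):
    """Return True if the language must be selected for OCR processing,
    False if it is detected automatically. Raise ValueError if the
    language appears in neither list."""
    name = language.strip().lower()
    if name in INDIVIDUAL_SELECTION:
        return True
    if name in AUTO_DETECTED:
        return False
    raise ValueError(f"Language not in either list: {language!r}")


print(needs_individual_selection("Russian"))          # True
print(needs_individual_selection("English"))          # False
print(needs_individual_selection("Serbian (Latin)"))  # False
```

Note that Serbian appears in both lists depending on the script, so the lookup keys include the script qualifier.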