Find Duplicates of a Selected Document or View

Document List > selected Document > Find Exact | Content | Near Duplicates of This...

View in Tree > Find Exact | Content Duplicates

A user with the appropriate permissions can use duplicate detection operations to find duplicate documents for the following:

  • For a selected document or email in a document list, to use the document or email as the basis of a search against all documents in the target view. This kind of document-specific search for duplicates can help identify Exact Duplicates, Content Duplicates, or Near Duplicates of the selected document. See Find Duplicates of a Selected Document or Email for more information.
  • For an entire view in the Navigation Tree, to find all documents that are either Exact Duplicates or Content Duplicates within the view. See Find Duplicates within a Selected View for more information.

The three types of duplicate detection operations can be summarized as follows:

  • Exact Duplicates (minimum index level for a Data Set view: File Metadata Index) – Documents that have the same content and embedded metadata. Files are exact duplicates if they have matching filemd5 values. A search for Exact Duplicates applies to an entire view or a selected document.
  • Content Duplicates (minimum index level: Content Index) – Documents that have the same content (contentmd5 value). The contentmd5 represents the content of a selected document, or the subject line and body text of a selected email. Document type and formatting are ignored. For example, a PDF and a Word document used to create that PDF would be a content match. Files are a content match if they have matching contentmd5 values. A search for Content Duplicates applies to an entire view or a selected document.
  • Near Duplicates (minimum index level: Analytic Index for all of Project Data) – This type of search finds documents whose content is almost the same as that of the selected document, without regard to file type or format, while also taking the number of terms into account. The Threshold setting lets you specify the level of content match for the operation. A Near Duplicate operation is a similarity comparison that also calculates whether the source and compared documents have about the same number of terms. A search for Near Duplicates applies only to a selected document from a document list, not to an entire view.
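
The difference between an exact match and a content match can be illustrated with a minimal Python sketch. It assumes that filemd5 behaves like an MD5 hash over the raw file bytes and that contentmd5 behaves like an MD5 hash over the extracted text with whitespace collapsed. The product's actual computation is not described here, so the function names, the normalization step, and the sample documents are all hypothetical.

```python
import hashlib
import re


def file_md5(raw_bytes: bytes) -> str:
    """Hash every byte of the file, so container format and metadata count."""
    return hashlib.md5(raw_bytes).hexdigest()


def content_md5(extracted_text: str) -> str:
    """Hash only the extracted text, with whitespace collapsed (an assumption)."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical documents: the same text stored in two different containers.
word_doc = {"raw": b"DOCX-container:Quarterly report", "text": "Quarterly  report\n"}
pdf_doc = {"raw": b"PDF-container:Quarterly report", "text": "Quarterly report"}

# The raw bytes differ, so the two files are not exact duplicates.
is_exact_duplicate = file_md5(word_doc["raw"]) == file_md5(pdf_doc["raw"])
# The normalized text is identical, so the two files are a content match.
is_content_duplicate = content_md5(word_doc["text"]) == content_md5(pdf_doc["text"])
```

In this sketch the Word document and the PDF are a content match but not an exact match, which mirrors the PDF and Word example above.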

Find Duplicates of a Selected Document or Email

Near Duplicate detection evaluates Document Similarity. (Document Similarity: When you are working in Project Data, you can run a search for Document Similarity using one or more selected documents or an entire view as the basis of the search against a given target. The operation compares a calculated value for the content of the selected documents or a Synthetic Document to the calculated value of the target.) This operation requires an Analytic Index representation level for all of Project Data, which your eDiscovery Administrator typically handles as part of Import. A document viewed from these results indicates common terms, as long as highlighting is enabled.

Find Duplicates of a Document Options

The following options help you filter the list of available targets and/or hide targets that have no documents:

  • Filter (2+ chars)... – You can use the Filter text box at the top right to filter the list, which is shown in a tree-like structure. (An icon indicates that filtering is available.) If you have a large number of locations, the Filter box helps you pinpoint the items you want to work with, based on a Filter term of two or more characters. You can explicitly apply a filter by typing two or more characters in the text box and pressing Enter (the return key). The software also starts to apply the filter automatically as soon as you type two or more characters. While filtering is in progress, a progress icon appears, and you can continue to refine the filter while it is displayed. When filtering is complete, the text box changes to a yellow background color. To clear an applied filter, remove the text from the box (with or without pressing Enter), or click the clear icon that appears at the far right of the Filter box. Clearing a filter restores the list to its original state.
  • Hide Empty Views – Hides any views that have no documents (that is, they are empty) from the list of available views.
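
The combined effect of the two options can be sketched in Python. This is only an illustration of the behavior described above, not the product's implementation: the function name, the case-insensitive substring match, and the sample view names are assumptions.

```python
def filter_views(views, filter_text="", hide_empty=False):
    """Return the view names that survive the Filter term and Hide Empty Views."""
    term = filter_text.strip().lower()
    result = []
    for name, doc_count in views:
        # Hide Empty Views: skip views that contain no documents.
        if hide_empty and doc_count == 0:
            continue
        # A filter applies only once two or more characters are entered.
        # Assumption: matching is a case-insensitive substring test.
        if len(term) >= 2 and term not in name.lower():
            continue
        result.append(name)
    return result


# Hypothetical views as (name, number of documents) pairs.
views = [("Custodian: Smith", 120), ("Custodian: Jones", 0), ("Batch 7", 45)]

filtered = filter_views(views, "cu")
filtered_non_empty = filter_views(views, "cu", hide_empty=True)
```

Here a one-character term leaves the list unchanged, and clearing the term restores the full list, as described for the Filter box.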

Once you have the list of top-level nodes you expect to see based on your permissions, use the node controls (the expand and collapse icons) to open and close a top-level node to show or hide its associated views. Views that may be selected as a target for the operation, depending on your permissions, include the following:

Note: Double-clicking a target will select the target, close the dialog as if you clicked the OK button, and run the search.

  • All of Imports or a Data Set (for Find Exact Duplicates or Find Content Duplicates only, not Near Duplicates)
  • All of Project Data
  • Under Project Data, a selection from one of the following nodes (when opened):
    • Under Custodians, a Custodian view
    • Under MediaIDs, a MediaID view
    • Under Batches, a Batch view
    • Under Folders, a Folder view
    • Under Tags, a Tag view
  • Under Searches > Saved Searches, a Saved Search
  • Under Searches > Search History, a Search Result view (to see more searches in the list, click 10 more...)
  • Under Workflows and a given Workflow, a given Step in the Workflow (which appears with the Step Number, followed by Term Query or Date Range Query and a portion of the query, up to 255 characters)

Note: Individual items appear with their respective icon and name. They also reflect their archive state (archived items appear grayed out). If applicable, they also show their number of documents in parentheses. Hovering over an archived item temporarily changes the appearance of the item to an enabled (unarchived) state.

For a Find Near Duplicates of This... operation, once you have selected a target, you also specify a threshold for the operation, as follows:

Note: This operation applies only to Project Data-based views.

  • Threshold – The threshold is set to 80 by default when searching for Near Duplicates, in order to require a high degree of similarity. You can specify a threshold value in the range 0 to 99, where 0 detects any nonzero amount of similarity or commonality. Higher values such as 80 or 90 require a higher degree of similarity or commonality. To require a moderate degree of similarity or commonality, select a value such as 40 or 50. In general, the lower the threshold, the more results you will see, since you are requiring less similarity or commonality; a higher threshold value yields fewer results.
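
The way a threshold narrows or widens the result set can be illustrated with a minimal Python sketch. The scoring below (cosine similarity over term counts, scaled by the ratio of the two term totals) is a stand-in chosen for illustration; the product's actual similarity calculation is not documented here. The function names, the strict greater-than comparison, and the sample texts are all assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count lowercase alphanumeric terms in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity_score(source: str, candidate: str) -> float:
    """Return a 0-100 score: cosine similarity scaled by a term-count ratio."""
    a, b = term_counts(source), term_counts(candidate)
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    total_a, total_b = sum(a.values()), sum(b.values())
    # Penalize documents whose number of terms differs from the source.
    length_ratio = min(total_a, total_b) / max(total_a, total_b)
    return 100.0 * (dot / norm) * length_ratio


def near_duplicates(source, candidates, threshold=80):
    """Return names of candidates scoring above the threshold (0 to 99)."""
    if not 0 <= threshold <= 99:
        raise ValueError("threshold must be in the range 0 to 99")
    # Assumption: a match must score strictly above the threshold, so that
    # a threshold of 0 keeps only documents with nonzero similarity.
    return [name for name, text in candidates.items()
            if similarity_score(source, text) > threshold]


# Hypothetical source document and target documents.
source = "the quick brown fox jumps over the lazy dog"
candidates = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox",
    "c": "completely unrelated words",
}

high = near_duplicates(source, candidates)                # default threshold of 80
moderate = near_duplicates(source, candidates, threshold=30)
```

With these sample texts, the default threshold keeps only the identical document, while a threshold of 30 also admits the shorter, partially overlapping one.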

Search or Cancel the Operation

Once you are satisfied with your target selection (and threshold for Near Duplicates), click the appropriate action button:

  • Search – Runs the appropriate search for a type of duplicate using the selected target. (As an alternative, you can double-click a target, which selects the target, closes the dialog, and runs the search.) A Work Basket task is generated for the search, and the Search Results folder of the tree is activated to show you the Search Results.
  • Cancel – Cancels the operation and returns to the appropriate view.

See also:

Finding Duplicate Documents for more information about searching for document duplicates.

Find Duplicates within a Selected View

For a selected view in the Navigation Tree, you can right-click and select the Find Exact Duplicates or Find Content Duplicates operation. (For items that provide an ellipsis menu, you can use that menu instead.)

For example, you can evaluate Exact Duplicates of all of Project Data in the Navigation Tree by selecting the Find Exact Duplicates option. When performing duplicate detection on an item in the Navigation Tree, you do not specify any search criteria; the operation runs and reports Search Results immediately.

The search results for this type of duplicate search are organized into Duplicate Groups. All documents that are duplicates of one another reside in the same numbered group (as shown in the Group column).
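
The grouping can be pictured with a short Python sketch that buckets documents by a shared hash value and numbers each bucket. The document records, hash values, and function name are hypothetical, and the way the product assigns group numbers is an assumption.

```python
from collections import defaultdict


def duplicate_groups(documents, hash_field):
    """Map a group number to the names of documents sharing one hash value."""
    by_hash = defaultdict(list)
    for doc in documents:
        by_hash[doc[hash_field]].append(doc["name"])
    # Keep only hashes shared by two or more documents; number groups from 1.
    shared = [names for names in by_hash.values() if len(names) > 1]
    return {number: names for number, names in enumerate(shared, start=1)}


# Hypothetical view contents with placeholder hash values.
view = [
    {"name": "report.docx", "filemd5": "aaa", "contentmd5": "c1"},
    {"name": "report-copy.docx", "filemd5": "aaa", "contentmd5": "c1"},
    {"name": "report.pdf", "filemd5": "bbb", "contentmd5": "c1"},
    {"name": "memo.txt", "filemd5": "ccc", "contentmd5": "c2"},
]

exact_groups = duplicate_groups(view, "filemd5")
content_groups = duplicate_groups(view, "contentmd5")
```

In this sketch, the Exact Duplicates search places the two .docx files in group 1, while the Content Duplicates search also adds the PDF to that group; memo.txt has no duplicate and belongs to no group.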

This search can be run for the following views:

  • Under Imports, a single Data Set
  • All of Project Data
  • Under Project Data, a selection from one of the following nodes (when opened):
    • Under Custodians, a Custodian view
    • Under MediaIDs, a MediaID view
    • Under Batches, a Batch view
    • Under Folders, a Folder view
    • Under Tags, a Tag view
  • Under Searches > Saved Searches, a Saved Search
  • Under Searches > Search History, a Search Result view, including results of all Imports, a Data Set, or a Project-Data-based result view except an Export Comparison Report (to see more searches in the list, click 10 more...)
  • Under Workflows and a given Workflow, a given Step in the Workflow