Duplicate Detection

Duplicate detection enables you to identify files that are exact duplicates, content duplicates, or near duplicates of a source document or email. You can run a duplicate detection search from a document list or from Document Viewer: select the source document or email, choose whether to find exact, content, or near duplicates, and then select the target view to search.

Note: A search for near duplicates applies only to Project Data-based views.

Note: For email, duplicate detection works according to the email de-dupe settings under Project Settings > Analytic Settings.

A duplicate can be an exact duplicate, a content duplicate, or a near duplicate.

About Near Duplicates

You can find Near Duplicates of a selected document or email across one or more Project Data-based views. This operation requires an Analytic Index, so if documents in the target views are not yet at the Analytic Index level, you are notified.

To be identified as a Near Duplicate of the source document or email (which serves as a pivot document of the search), a document must meet the following criteria:

  • It must satisfy the specified similarity threshold when compared to the source document; that is, its content must be at least that similar to the source document’s content. With the default threshold of 80, a document must have a content similarity score of at least 80. The default threshold aims to pinpoint documents with a high level of similarity. Lowering the threshold reduces the content similarity required and therefore broadens the search.
  • It must have about the same number of terms as the source document; that is, about the same overall document term length, after including or ignoring Stop Words based on Project Settings and observing the maximum/minimum individual term length. The allowed term length falls within a minimum and maximum range calculated from the source document’s number of terms and the threshold. For example, for a source document with 100 terms and a similarity threshold of 80, a Near Duplicate would have a document term length of between 80 and 120.
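
The two criteria above can be illustrated with a short Python sketch. This is not the product's implementation: the function names are hypothetical, the similarity score is taken as an input rather than computed, and the range formula (a symmetric window of plus or minus (100 - threshold) percent around the source term count, with inclusive bounds) is an assumption chosen only because it reproduces the documented example of 100 terms and a threshold of 80 giving 80 to 120.

```python
from typing import Tuple


def term_length_range(source_terms: int, threshold: int = 80) -> Tuple[float, float]:
    """Return the (minimum, maximum) term length allowed for a candidate.

    Assumption: the window is symmetric around the source term count and
    its half-width is (100 - threshold) percent of that count.
    """
    margin = source_terms * (100 - threshold) / 100
    return source_terms - margin, source_terms + margin


def is_near_duplicate(score: float, source_terms: int, candidate_terms: int,
                      threshold: int = 80) -> bool:
    """Apply both criteria: similarity score and document term length.

    `score` is the content similarity score (0-100) of the candidate
    against the source document; how it is computed is not modeled here.
    """
    minimum, maximum = term_length_range(source_terms, threshold)
    meets_threshold = score >= threshold
    # Assumption: the range bounds are inclusive.
    similar_length = minimum <= candidate_terms <= maximum
    return meets_threshold and similar_length


# Documented example: 100 source terms, threshold 80.
print(term_length_range(100, 80))        # (80.0, 120.0)
print(is_near_duplicate(85, 100, 110))   # True: score and length both qualify
print(is_near_duplicate(90, 100, 130))   # False: term length outside the range
```

Under this assumption, lowering the threshold both reduces the required score and widens the term length window, which is consistent with the statement that a lower threshold broadens the search.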

Results Window

The Search Results folder is activated when you perform a Search, and your results appear in the center pane.