Manage Analytic Settings in a System Template

Settings > System Settings > Templates > Analytic Settings

Requires System-level Analytic Settings - View permissions to view the information, Add/Edit permissions to manage the template information, and Delete permissions to delete a template.

System Users with the appropriate permissions can select System Settings > Templates > Analytic Settings to view, modify, create, and delete System templates for analytic settings such as Deduplication and Email Threading.

Loading from and Saving to Templates

If no System templates have been created, the page is initially empty. If there are existing System templates, they are listed in the Templates pane, with the first one selected and loaded into the settings in the Deduplication, Email Threading, and Exclusion Searches sections. You can load any other template either by selecting it in the list or by clicking the context menu icon to the right of any listed template, choosing Load from Template, and selecting the template from the dialog.

After you have made changes to the Analytic Settings (as described below) you can click Save to save to the current template, or save as a different template by choosing Save to Template from the context menu and selecting the one you want to save as from the dialog.

To create a new template, launch the New Template dialog from the top of the Templates pane. Once you have created the template, you can choose the analytic settings you want and click Save.

Additional context menu options include Edit, which lets you modify the name and/or description of the selected template, and Delete, which lets you delete the selected template.

Note: As of the 5.0 release, Stop Words is part of Index Settings, Hide Duplicates in List is part of Project Preference, and eDiscovery Filters are no longer supported.

Deduplication

An eDiscovery Administrator can select the deduplication scope in the Project, and for email within Project Data, whether to detect strict file duplicates only, or duplicates in message content and attributes.

As a user preference, you have the option to hide duplicates in all document lists in the Project.

Deduplication Scope:

Global (default) — By default, the software calculates duplicates across all documents in Project Data. This is referred to as Global deduplication (previously referred to as Horizontal).

Custodial — The software calculates duplicates on a per-Custodian basis (vertical by custodian) instead of globally across the Project Data. Documents associated with the Unassigned Custodian are subject to deduplication as a group.
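The difference between the two scopes can be illustrated with a minimal Python sketch. It is not the product's implementation: the tuple layout, the function name count_duplicates, and the sample fingerprints are hypothetical, and the sketch assumes that the first document seen with a given fingerprint is the one kept as unique.

```python
from collections import defaultdict

def count_duplicates(docs, scope="global"):
    """Count documents that duplicate an earlier document.

    docs:  iterable of (doc_id, custodian, fingerprint) tuples (hypothetical layout).
    scope: "global" compares across all documents;
           "custodial" compares only within the same custodian.
    """
    seen = defaultdict(set)  # grouping key -> fingerprints already seen
    duplicates = 0
    for doc_id, custodian, fingerprint in docs:
        # Global scope uses a single shared group; custodial scope uses one group per custodian.
        key = None if scope == "global" else custodian
        if fingerprint in seen[key]:
            duplicates += 1
        else:
            seen[key].add(fingerprint)
    return duplicates

docs = [
    ("d1", "Alice", "aaa"),
    ("d2", "Bob", "aaa"),
    ("d3", "Alice", "aaa"),
    ("d4", "Unassigned", "bbb"),  # Unassigned Custodian documents are treated as one group
    ("d5", "Unassigned", "bbb"),
]
global_count = count_duplicates(docs, "global")        # d2, d3, d5 are duplicates
custodial_count = count_duplicates(docs, "custodial")  # d3, d5 are duplicates
```

In this sample, the copy held by Bob counts as a duplicate under Global scope but remains unique under Custodial scope, because no other document for Bob shares its fingerprint.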

Note: The dedupe counts and file size you see in key reports (the Search Summary per-clause information of a Search Results report, or the Document Classification chart in a given report) reflect the current deduplication scope for the Project. For an all Imports or Data Set view, the values shown in the Document Classification chart reflect file MD5 deduplication. This is because the Email Deduplication Settings and the calculations based on family membership apply only to views of Project Data, and are initially calculated when data is added to Project Data by a user with permissions.

Email Deduplication Scope:

A user with permissions can use the default scope (duplicates in message content and common message attributes), select message attributes to form a custom email deduplication scope, or limit the deduplication to file MD5 (exact file duplicates) only:

File Duplicates Only (File MD5) — Your Organization Administrator can select this option to have the software detect duplicates strictly on a file-level basis (file MD5). When this option is enabled, emails that have duplicate content may still be considered unique files because they differ at the file level.
Message Content and Attributes (default) — By default, the system detects email duplicates by examining content (content MD5 values) as well as common attributes (From and Recipients, which represent participant/recipient information, Date Sent, and Attachments). When this option is enabled, your Organization Administrator can decide whether to limit the detection to content MD5 only, or include one or more of the following common attributes, which you control using checkboxes:
- From (set by default) — Detects email duplicates using the email from field information.
- Sender (cleared by default) — Detects email duplicates using the email sender field information.
- Recipients (set by default) — Detects email duplicates using the set of recipients (for example, to and cc fields). This option uses the recipientsmd5 metadata field value. If two documents have the same recipients MD5 value, they have the same recipients.
- Bcc (cleared by default) — Detects email duplicates using the bcc field values for email recipients blind copied on emails.
- Attachments (set by default) — Detects email duplicates using the set of attachments. This option uses the attachmentsmd5 metadata field value. If two documents have the same attachments MD5 value, they have the same attachments.
- Date Sent (set by default) — Detects email duplicates based on a sent date.
- MessageID (cleared by default) — Detects email duplicates using a standard and unique Message ID. Any empty or blank Message IDs are not deduped against one another; they are produced individually.

The selections determine the processing of email in Project Data, affect the reporting of dedupe counts and size values, and control how email is handled for an Export that includes duplicates. The dupe_fingerprint metadata field value is computed according to the Email deduplication Settings. For files that are not email, this is always the file MD5 value. (The dupe_fingerprint field is one of the Analytic Metadata fields, as described in View and Learn about Metadata for a Document.) The dupe_fingerprint value is available for Export by default.
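As a rough illustration of how the selected attributes could feed into a fingerprint, the following Python sketch combines a content MD5 with the enabled attributes. The exact way the software computes dupe_fingerprint is not documented here, so the combining scheme, the function name, and the dictionary keys filemd5, contentmd5, from, datesent, bcc, and is_email are assumptions. Only recipientsmd5 and attachmentsmd5 are field names taken from the text above.

```python
import hashlib

# Assumed mapping of the default checkboxes (From, Recipients, Attachments, Date Sent)
# to metadata keys; only recipientsmd5 and attachmentsmd5 come from the documentation.
DEFAULT_ATTRIBUTES = ("from", "recipientsmd5", "attachmentsmd5", "datesent")

def dupe_fingerprint(doc, attributes=DEFAULT_ATTRIBUTES, file_only=False):
    """Return a fingerprint for a document dictionary (hypothetical scheme)."""
    # Non-email files, and email under File Duplicates Only, use the file MD5.
    if not doc.get("is_email") or file_only:
        return doc["filemd5"]
    # Otherwise start from the content MD5 and mix in each selected attribute.
    parts = [doc["contentmd5"]]
    for name in attributes:
        parts.append(f"{name}={doc.get(name, '')}")
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

email_a = {
    "is_email": True, "filemd5": "f1", "contentmd5": "c1",
    "from": "x@example.com", "recipientsmd5": "r1",
    "attachmentsmd5": "a1", "datesent": "2020-01-01", "bcc": "",
}
# Same message content and default attributes, different file bytes and Bcc value.
email_b = dict(email_a, filemd5="f2", bcc="z@example.com")

same_by_default = dupe_fingerprint(email_a) == dupe_fingerprint(email_b)
same_by_file = (dupe_fingerprint(email_a, file_only=True)
                == dupe_fingerprint(email_b, file_only=True))
with_bcc = DEFAULT_ATTRIBUTES + ("bcc",)
same_with_bcc = (dupe_fingerprint(email_a, with_bcc)
                 == dupe_fingerprint(email_b, with_bcc))
```

Under these assumptions, the two sample emails are duplicates with the default attributes, but become unique when Bcc is added to the scope or when File Duplicates Only is used.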

Email deduplication Usage Notes:

For Lotus Notes emails, any alt fields are not considered for deduplication (for example, altfrom and altbcc).
For export, the DuplicateCustodian field contains a semicolon-delimited list of the Custodians for which this document had a duplicate, including the Primary Custodian, if applicable. Note that this field applies for Global deduplication only (where the software calculates duplicates across all documents in Project Data). Alternatively, you can use the DuplicateCustodianOther field, which contains the same semicolon-delimited list but never includes the Primary Custodian, even if that Primary Custodian did in fact have duplicates. In all other respects, the two fields behave the same.
The software also supports reporting of file SHA-1 values (40-digit Secure Hash Algorithm values) upon import in the filesha1 field. At export, the software also supports reporting of a calculated DupeFingerprintSha1 field value. For email, this is the SHA-1 value computed at export according to the Email deduplication Settings and populated in the appropriate export load file. For files that are not email, this is always the file SHA-1 value.
Reprocessing documents from Project Data causes a recalculation of the dupe_fingerprint values for just those documents (not for all of Project Data).
Deduplication is not performed for any Synthetic Document.
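The relationship between the DuplicateCustodian and DuplicateCustodianOther export fields described in the notes above can be sketched in Python as follows. The function name and its inputs are hypothetical, and the sketch assumes a plain semicolon separator with no surrounding spaces.

```python
def duplicate_custodian_fields(primary, custodians_with_duplicates):
    """Build both export field values for one document (illustrative only).

    primary: name of the Primary Custodian.
    custodians_with_duplicates: custodians for which the document had a duplicate.
    """
    # Remove repeated names while preserving the original order.
    ordered = list(dict.fromkeys(custodians_with_duplicates))
    return {
        # May include the Primary Custodian.
        "DuplicateCustodian": ";".join(ordered),
        # Never includes the Primary Custodian.
        "DuplicateCustodianOther": ";".join(c for c in ordered if c != primary),
    }

fields = duplicate_custodian_fields("Alice", ["Alice", "Bob", "Carol"])
```

Here DuplicateCustodian would hold Alice;Bob;Carol and DuplicateCustodianOther would hold Bob;Carol.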

Note: If a user with permissions makes changes to email deduplication settings or calendar deduplication settings, the Apply button must be used to apply the changes. To confirm changes, monitor a Work Basket task called Recalculate Working Set Deduplication Terms (one for Project Data, and one for the Discard Pile). If this task does not appear after changing settings (for example, after loading settings from a template), then the recalculation has not occurred and the changes must be applied to see the task.

Calendar Deduplication Scope:

The Calendar deduplication Settings control the processing of calendar items (from Microsoft Outlook or Lotus Notes) in Project Data, affect the reporting of dedupe counts and size value, and control how calendar items are handled for an Export that includes these duplicates. (For Outlook, Calendar items have a file type of MS Outlook, an auxfiletype of msg, and a msgclass of Calendar. For Lotus Notes, Calendar items have the file type vCalendar and a msgclass of Calendar.) The Calendar settings determine how the value for a metadata field called dupe_fingerprint is computed for calendar items. In a new Project, your Administrator can use the default scope (duplicates in message content and common message attributes), select a subset of attributes to form another Calendar deduplication scope, or limit the deduplication to file MD5 (exact file duplicates) only:

File Duplicates Only (File MD5) (default for existing Projects, prior to support of Calendar dedupe) — Detects calendar duplicates strictly on a file-level basis (file MD5). When this option is enabled, calendar items that have duplicate content may still be considered unique files because they differ at the file level.
Message Content and Attributes (default for new Projects) — Detects calendar duplicates by examining content (content MD5 values) as well as common attributes. When this option is enabled, you can decide whether to limit the detection to content MD5 only, or include one or more of the following common attributes, which you control using checkboxes:
- From (set by default) — Detects calendar duplicates using the from field information.
- Sender (cleared by default) — Detects calendar duplicates using the email sender field information.
- Recipients (set by default) — Detects calendar duplicates using the set of recipients. This option uses the recipientsmd5 metadata field value. If two calendar items have the same recipients MD5 value, they have the same recipients.
- Attachments (set by default) — Detects calendar duplicates using the set of attachments. This option uses the attachmentsmd5 metadata field value. If two calendar items have the same attachments MD5 value, they have the same attachments.
- Date Started / Ended (set by default) — Detects calendar duplicates using the datestarted and dateended metadata fields.

Users with permissions can reprocess Calendar items with the Reprocess documents only option from Project Data. This causes a recalculation of the dupe_fingerprint values for just those Calendar items.

Email Threading

Use this section to review and modify settings that affect the behavior of Email Threading.

Enable Automatic Email Threading

When Automatic Email Threading is enabled (the default), email threading is automatically applied to all data added to the Project, and users can view email threads in Project Data on the Email Threads tab. If Automatic Email Threading is not enabled, users can perform on-demand threading by selecting the Thread All Emails option on that tab.

Email Threading Modes

Whether the Project is set up for Automatic Email Threading or you perform on-demand Email Threading (by selecting the Thread All Emails option), Email Thread membership is determined by the Email Threading modes. The Email Threading modes use different metadata field information to determine the members of a Thread. By default, Email Threading is performed using all three supported Email Threading modes, but your eDiscovery Administrator can select a subset of the modes, as long as at least one mode is enabled. If one or two of these modes is cleared, you will see the change take effect when you perform an action that causes Email Threading to be recalculated (for example, adding or removing data from Project Data). Expect to see a different number of threads calculated after a change in modes.

RFC 2822 Metadata — Performs Email Threading using the messageid, references, and inreplyto metadata fields. The messageid is a unique ID (alphanumeric value) that identifies an email message. The inreplyto field identifies the message to which a new message serves as a reply. The references field is used to identify a thread of conversation.
MS Outlook Metadata — Performs Email Threading using the threadindex metadata field value, which acts like a combination of the inreplyto and references fields. This Threading mode is used by Microsoft for Outlook Exchange.
Content-Based — Performs Email Threading based on email content. This type of threading determines whether the content from one email is partially or wholly included in another email. Content-based threading includes header terms (from fields such as from, sent, to, and subject) and content terms.
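To make the RFC 2822 Metadata mode more concrete, the sketch below groups messages into threads by linking each messageid to the IDs named in its inreplyto and references fields. This is a generic union-find illustration, not the product's algorithm. The dictionary shape and the function name thread_by_rfc2822 are assumptions, and the other two modes are not modeled.

```python
def thread_by_rfc2822(messages):
    """Group messages into threads using messageid, inreplyto, and references.

    messages: list of dicts with 'messageid', an optional 'inreplyto' string,
              and an optional 'references' list (hypothetical input shape).
    Returns a sorted list of threads, each a sorted list of message IDs.
    """
    parent = {}

    def find(x):
        # Locate the representative ID for x, shortening paths along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for msg in messages:
        mid = msg["messageid"]
        find(mid)  # register the message even if it links to nothing
        linked = list(msg.get("references", []))
        if msg.get("inreplyto"):
            linked.append(msg["inreplyto"])
        for other in linked:
            union(mid, other)

    threads = {}
    for msg in messages:
        threads.setdefault(find(msg["messageid"]), []).append(msg["messageid"])
    return sorted(sorted(ids) for ids in threads.values())

messages = [
    {"messageid": "<1@x>"},
    {"messageid": "<2@x>", "inreplyto": "<1@x>", "references": ["<1@x>"]},
    {"messageid": "<3@x>", "inreplyto": "<2@x>", "references": ["<1@x>", "<2@x>"]},
    {"messageid": "<9@y>"},
]
threads = thread_by_rfc2822(messages)
```

The first three messages end up in one thread, and the unrelated fourth message forms a thread of its own.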

Inclusive Email Scope

This setting determines whether the evaluation of "inclusive" (unique content) within an email thread is limited to email messages only or both email messages and their attachments.

Message Only (default) — Evaluates only the email messages within an email thread for unique content, not their attachments. An email is considered inclusive if it has unique content when compared to other email messages within the email thread.
Message and Attachments — When set, this option extends the evaluation of unique content within an email thread to each email and its attachments. In this case, an email is considered inclusive if one or more of its attachments are unique (that is, not associated with other members of the email thread).

Users with permissions to perform exports can check the Export-Only metadata field ThreadInclusive to see if an email is considered inclusive within an email thread. If so, the field's value is Y.
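The effect of the two Inclusive Email Scope options can be sketched in Python under simplifying assumptions. Each email is modeled as a list of content segments plus a list of attachment identifiers, and "unique" is taken to mean "not present on any other member of the thread". The data model and the function name inclusive_flags are hypothetical, and the real evaluation is likely more involved.

```python
def inclusive_flags(thread, scope="message_only"):
    """Return {email id: True/False} indicating which emails are inclusive.

    thread: list of dicts with 'id', 'segments', and optional 'attachments'
            (hypothetical model of an email thread).
    scope:  "message_only" or "message_and_attachments".
    """
    flags = {}
    for email in thread:
        others = [e for e in thread if e is not email]
        # Content found in any other email of the thread.
        other_segments = set().union(*(e["segments"] for e in others))
        inclusive = bool(set(email["segments"]) - other_segments)
        if not inclusive and scope == "message_and_attachments":
            # Extend the check to attachments not associated with other members.
            other_attachments = set().union(*(e.get("attachments", ()) for e in others))
            inclusive = bool(set(email.get("attachments", ())) - other_attachments)
        flags[email["id"]] = inclusive
    return flags

thread = [
    {"id": "e1", "segments": ["A"], "attachments": ["att1"]},
    {"id": "e2", "segments": ["A", "B"], "attachments": []},
    {"id": "e3", "segments": ["A", "B", "C"], "attachments": []},
]
message_only = inclusive_flags(thread, "message_only")
with_attachments = inclusive_flags(thread, "message_and_attachments")
```

In this sample, only e3 is inclusive under Message Only. Under Message and Attachments, e1 also becomes inclusive because its attachment appears nowhere else in the thread.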

Exclusion Searches

When you add a Data Set from Imports to Project Data, you can leverage a set of Exclusion Searches to omit certain types of data typically considered irrelevant in a Project. You can use the predefined exclusion searches or add your own, one at a time. You can also remove the Exclusion Searches.

Note: When writing queries that include file types, you must use the correct file type based on the Parsing Library Version in effect in your Project, either V1 (Projects prior to Release 5.2.5.x) or V2 (new Projects as of 5.2.5.x by default). See Supported File Types for Analysis for a list of file types based on the Parsing Library Version.

Exclusion searches are intended to run on an entire Data Set view. You can bypass the Exclusion Searches if you perform an operation that adds selected documents in a Data Set instead of an entire Data Set to Project Data. This means that if you select one or more directories, archives, disk images, or NIST files in the Data Set and add/save them to Project Data, or tag them, they will become part of Project Data.

Note: When you load Analytic Settings, the list of Exclusion Searches to be loaded must not contain any name collisions with existing views in the Project. When naming your Exclusion Searches, make sure that they do not match the names of existing views in the Project.

Directories

Uses the following query to exclude directories:

docclass::DIRECTORY

Disk Images

Uses the following query to exclude disk images (for example, for disk image types such as EWF and LEF):

docclass::DISK_IMAGE

NIST

Uses the following query to exclude NIST file types listed in the National Software Reference Library (NSRL), a database provided by the National Institute of Standards and Technology (NIST). NIST EDoc files are often system-related files that provide no value.

docclass::EDOC AND kftdesc::<exists>

To add your own Exclusion Search query, enter a name for the query for Search Name and enter the query for Search Query.

Note: The maximum length for an Exclusion Search query is 1024 characters.
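If you prepare Exclusion Searches outside the product (for example, before entering them into a template), a small pre-check such as the following Python sketch can catch the two constraints mentioned in this section: the 1024-character limit and name collisions with existing views. The function name, the way existing view names are supplied, and the exact-match comparison are assumptions. The software performs its own validation.

```python
MAX_QUERY_LENGTH = 1024  # limit stated in the note above

def validate_exclusion_search(name, query, existing_view_names):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not name.strip():
        problems.append("Search Name is empty")
    if not query.strip():
        problems.append("Search Query is empty")
    if len(query) > MAX_QUERY_LENGTH:
        problems.append(
            f"Search Query is {len(query)} characters; maximum is {MAX_QUERY_LENGTH}"
        )
    # Assumes names collide only on an exact match.
    if name in existing_view_names:
        problems.append(f"Search Name '{name}' collides with an existing view")
    return problems

existing_views = {"Directories", "Disk Images", "NIST"}
ok = validate_exclusion_search("Temp Files", "docclass::DIRECTORY", existing_views)
collision = validate_exclusion_search("NIST", "docclass::EDOC AND kftdesc::<exists>", existing_views)
too_long = validate_exclusion_search("Long Query", "x" * 1025, existing_views)
```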

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a query, click the delete icon for that query.

Note: Any changes you make to the list of Exclusion Searches affect the next batch of data added to the Project. Data already added to Project Data is not affected.