Manage Analytic Settings in an Organization Template

Home > selected Organization > menu or right-click > Settings > eDiscovery Templates > Analytic Settings
Settings > Organization Settings > eDiscovery Templates > Analytic Settings

Requires Organization - Analytic Settings - View (to view these settings), Add/Edit (to manage the settings), and Delete (to delete an Analytic Settings template)

Users in a role with the appropriate permissions can select Organization Settings > eDiscovery Templates > Analytic Settings to view, modify, create, and delete Organization templates for analytic settings such as Deduplication and Email Threading.

Loading from and Saving to Templates

If no Organization templates have been created, the Default Analytics Settings template is displayed in the Templates pane and loaded into the Deduplication, Email Threading, and Exclusion Searches sections. (This template provides the default Analytics Settings for Projects within the Organization.) If there are existing Organization templates, they are also listed in the Templates pane, and you can load any of them either by selecting the one you want or by clicking the context menu icon to the right, choosing Load from Template, and selecting it from the dialog.

After you have made changes to the Analytic Settings (as described below) you can click Save to save to the current template, or save as a different template by choosing Save to Template from the context menu and selecting the one you want to save as from the dialog.

To create a new template, launch the New Template dialog by clicking the icon at the top of the Templates pane. Once you have created the template, you can choose the analytic settings you want and click Save.

Additional context menu options include Edit, which lets you modify the name and/or description of the selected template, and Delete, which lets you delete the selected template.

Note: As of the 5.0 release, Stop Words is part of Index Settings, Hide Duplicates in List is part of Project Preference, and eDiscovery Filters are no longer supported.

Deduplication

Your Organization Administrator can select the deduplication scope for the Project and, for email within Project Data, choose whether to detect strict file duplicates only or duplicates in message content and attributes.

As a user preference, you have the option to hide duplicates in all document lists in the Project.

Deduplication Scope:

Global (default) — By default, the software calculates duplicates across all documents in Project Data. This is referred to as Global deduplication (previously referred to as Horizontal).

Custodial — The software calculates duplicates on a per-Custodian basis (vertical by custodian) instead of globally across the Project Data. Documents associated with the Unassigned Custodian are subject to deduplication as a group.
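The difference between the two scopes can be illustrated with a minimal Python sketch. It is not the product's implementation: the document dictionaries, their keys ("id", "custodian", "data"), and the rule that the first document seen in a group counts as the original are all assumptions made for illustration. A custodian of None stands in for the Unassigned Custodian, whose documents are deduplicated as a group.

```python
import hashlib
from collections import defaultdict


def file_md5(data: bytes) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def find_duplicates(docs, scope="global"):
    """Return the IDs of documents flagged as duplicates.

    docs  : list of dicts with hypothetical keys "id", "custodian", "data".
    scope : "global" compares across all documents; "custodial" compares
            only within each custodian (None = Unassigned Custodian).
    Assumption: the first document seen in a group is the original.
    """
    if scope not in ("global", "custodial"):
        raise ValueError("scope must be 'global' or 'custodial'")
    seen = defaultdict(set)  # group key -> fingerprints already seen
    duplicates = []
    for doc in docs:
        group = "*" if scope == "global" else ("custodian", doc.get("custodian"))
        fingerprint = file_md5(doc["data"])
        if fingerprint in seen[group]:
            duplicates.append(doc["id"])
        else:
            seen[group].add(fingerprint)
    return duplicates


docs = [
    {"id": 1, "custodian": "alice", "data": b"quarterly report"},
    {"id": 2, "custodian": "bob", "data": b"quarterly report"},
    {"id": 3, "custodian": "alice", "data": b"quarterly report"},
    {"id": 4, "custodian": None, "data": b"meeting notes"},
    {"id": 5, "custodian": None, "data": b"meeting notes"},
]

print(find_duplicates(docs, "global"))     # [2, 3, 5]
print(find_duplicates(docs, "custodial"))  # [3, 5]
```

With Custodial scope, document 2 is no longer a duplicate because no other document held by the same custodian has the same fingerprint.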

Note: The dedupe counts and file size shown in key reports (the Search Summary per-clause information of a Search Results report, or the Document Classification chart in a given report) reflect the current deduplication scope for the Project. For an all Imports or Data Set view, the values shown in the Document Classification chart reflect file MD5 deduplication, because the Email Deduplication Settings and the calculations based on family membership apply only to views of Project Data. Those values are first calculated when a user with permissions adds data to Project Data.

Email Deduplication Scope:

A user with permissions can use the default scope (duplicates in message content and common message attributes), select message attributes to form a custom email deduplication scope, or limit the deduplication to file MD5 (exact file duplicates) only:

File Duplicates Only (File MD5) — Your Organization Administrator can select this option to have the software detect duplicates strictly on a file-level basis (file MD5). When this option is enabled, emails that have duplicate content may still be considered unique files because they differ at the file level.
Message Content and Attributes (default) — By default, the system detects email duplicates by examining content (content MD5 values) as well as common attributes (From and Recipients, which represent participant/recipient information, Date Sent, and Attachments). When this option is enabled, your Organization Administrator can decide whether to limit the detection to content MD5 only, or include one or more of the following common attributes, which you control using checkboxes:
- From (set by default) — Detects email duplicates using the email from field information.
- Sender (cleared by default) — Detects email duplicates using the email sender field information.
- Recipients (set by default) — Detects email duplicates using the set of recipients (for example, to and cc fields). This option uses the recipientsmd5 metadata field value. If two documents have the same recipients MD5 value, they have the same recipients.
- Bcc (cleared by default) — Detects email duplicates using the bcc field values for email recipients blind copied on emails.
- Attachments (set by default) — Detects email duplicates using the set of attachments. This option uses the attachmentsmd5 metadata field value. If two documents have the same attachments MD5 value, they have the same attachments.
- Date Sent (set by default) — Detects email duplicates based on a sent date.
- MessageID (cleared by default) — Detects email duplicates using a standard and unique Message ID. Any empty or blank Message IDs are not deduped against one another; they are produced individually.

The selections determine the processing of email in Project Data, affect the reporting of dedupe counts and size value, and control how email is handled for an Export that includes duplicates. The dupe_fingerprint metadata field value is computed according to the Email deduplication Settings. For files that are not email, this is always the file MD5 value. (The dupe_fingerprint field is one of the Analytic Metadata fields, as described in View and Learn about Metadata for a Document.) The dupe_fingerprint value is available for Export by default.
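How the selected checkboxes could feed into a single fingerprint can be sketched in Python as follows. This is an illustration only; the documentation does not state the actual formula behind dupe_fingerprint. The field names recipientsmd5, attachmentsmd5, and messageid come from the descriptions above, while the keys "id", "filemd5", "contentmd5", "from", and "datesent", the join format, and the handling of blank Message IDs are assumptions.

```python
import hashlib

# Attributes that are set by default, per the checkbox list above.
# The key names "from" and "datesent" are assumptions.
DEFAULT_ATTRIBUTES = ("from", "recipientsmd5", "attachmentsmd5", "datesent")


def email_fingerprint(email, attributes=DEFAULT_ATTRIBUTES, file_only=False):
    """Illustrative fingerprint for an email (not the product's formula).

    file_only=True models File Duplicates Only (File MD5).
    Otherwise the content MD5 is combined with each selected attribute.
    """
    if file_only:
        return email["filemd5"]
    parts = [email["contentmd5"]]
    for name in attributes:
        value = email.get(name, "")
        if name == "messageid" and not value:
            # Blank Message IDs are not deduped against one another, so
            # substitute a value unique to this document (assumption).
            value = "blank-messageid:" + str(email["id"])
        parts.append(f"{name}={value}")
    return hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()


# Two copies of the same message that differ only at the file level.
a = {
    "id": "A", "filemd5": "f1", "contentmd5": "c1",
    "from": "x@example.com", "recipientsmd5": "r1",
    "attachmentsmd5": "a1", "datesent": "2020-01-01T10:00:00Z",
    "messageid": "",
}
b = dict(a, id="B", filemd5="f2")

print(email_fingerprint(a) == email_fingerprint(b))  # True
print(email_fingerprint(a, file_only=True) == email_fingerprint(b, file_only=True))  # False
```

Passing an empty attribute tuple models the case where detection is limited to content MD5 only.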

Email deduplication Usage Notes:

For Lotus Notes emails, any alt fields are not considered for deduplication (for example, altfrom and altbcc).
For export, the DuplicateCustodian field contains a semicolon-delimited list of the Custodians that had a duplicate of this document, including the Primary Custodian, if applicable. This field applies to Global deduplication only (where the software calculates duplicates across all documents in Project Data). The DuplicateCustodianOther field contains the same semicolon-delimited list but never includes the Primary Custodian, even if that Primary Custodian did in fact have duplicates.
The software also supports reporting of file SHA-1 values (40-digit Secure Hash Algorithm values) upon import in the filesha1 field. At export, the software also supports reporting of a calculated DupeFingerprintSha1 field value. For email, this is the SHA-1 value computed at export according to the Email deduplication Settings and populated in the appropriate export load file. For files that are not email, this is always the file SHA-1 value.
Reprocessing documents from Project Data causes a recalculation of the dupe_fingerprint values for just those documents (not for all of Project Data).
Deduplication is not performed for any Synthetic Document.
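The relationship between the DuplicateCustodian and DuplicateCustodianOther export fields described in these notes can be sketched in Python. The function name, its inputs, and the exact delimiter formatting (a bare semicolon with no surrounding spaces) are assumptions; only the field names and their semantics come from the notes.

```python
def duplicate_custodian_fields(primary, custodians_with_duplicates):
    """Build illustrative values for the two export fields.

    primary                    : the document's Primary Custodian.
    custodians_with_duplicates : Custodians that had a duplicate of this
                                 document (may include the primary).
    """
    # Remove repeated names while keeping first-seen order.
    ordered = list(dict.fromkeys(custodians_with_duplicates))
    return {
        "DuplicateCustodian": ";".join(ordered),
        "DuplicateCustodianOther": ";".join(c for c in ordered if c != primary),
    }


fields = duplicate_custodian_fields("alice", ["alice", "bob", "carol", "bob"])
print(fields["DuplicateCustodian"])       # alice;bob;carol
print(fields["DuplicateCustodianOther"])  # bob;carol
```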

Note: If a user with permissions makes changes to email deduplication settings or calendar deduplication settings, the Apply button must be used to apply the changes. To confirm changes, monitor a Work Basket task called Recalculate Working Set Deduplication Terms (one for Project Data, and one for the Discard Pile). If this task does not appear after changing settings (for example, after loading settings from a template), then the recalculation has not occurred and the changes must be applied to see the task.

Calendar Deduplication Scope:

The Calendar deduplication Settings control the processing of calendar items (from Microsoft Outlook or Lotus Notes) in Project Data, affect the reporting of dedupe counts and size value, and control how calendar items are handled for an Export that includes these duplicates. (For Outlook, Calendar items have a file type of MS Outlook, an auxfiletype of msg, and a msgclass of Calendar. For Lotus Notes, Calendar items have the file type vCalendar and a msgclass of Calendar.) The Calendar dedupe settings determine how the value for a metadata field called dupe_fingerprint is computed for calendar items. In a new Project, your Administrator can use the default scope (duplicates in message content and common message attributes), select a subset of attributes to form another Calendar deduplication scope, or limit the deduplication to file MD5 (exact file duplicates) only:

File Duplicates Only (File MD5) (default for existing Projects, prior to support of Calendar dedupe) — Detects calendar duplicates on strictly a file-level basis (file MD5). When this option is enabled, calendar items that may indeed have duplicate content may be considered unique files because they have differences at the file level.
Message Content and Attributes (default for new Projects) — Detects calendar duplicates by examining content (content MD5 values) as well as common attributes. When this option is enabled, you can decide whether to limit the detection to content MD5 only, or include one or more of the following common attributes, which you control using checkboxes:
- From (set by default) — Detects calendar duplicates using the email from field information.
- Sender (cleared by default) — Detects calendar duplicates using the email sender field information.
- Recipients (set by default) — Detects calendar duplicates using the set of recipients. This option uses the recipientsmd5 metadata field value. If two calendar items have the same recipients MD5 value, they have the same recipients.
- Attachments (set by default) — Detects calendar duplicates using the set of attachments. This option uses the attachmentsmd5 metadata field value. If two calendar items have the same attachments MD5 value, they have the same attachments.
- Date Started / Ended (set by default) — Detects calendar duplicates using the datestarted and dateended metadata fields.

Users with permissions can reprocess Calendar items with the Reprocess documents only option from Project Data. This causes a recalculation of the dupe_fingerprint values for just those Calendar items.

Email Threading

Use this section to view settings that affect the behavior of Email Threading.

Automatic Email Threading

By default, Automatic Email Threading is selected and enabled, which means that the software automatically performs email threading for the data that has been added to the Project. This means that users can use the Email Threads tab to view the email threads in Project Data and do not need to select the Thread All Emails option on that tab.

Email Threading Modes

Whether the Project is set up for Automatic Email Threading or on-demand Email Threading (where you select the Thread All Emails option from the Email Threads tab), Email Thread membership is determined by the Email Threading modes. The Email Threading modes use different metadata field information to determine the members of a Thread. By default, Email Threading is performed using all three supported Email Threading modes, but your eDiscovery Administrator can select a subset of the modes, as long as at least one mode is enabled. If one or two of these modes are cleared, you will see the change take effect when you perform an action that causes Email Threading to be recalculated (for example, adding or removing data from Project Data). Expect to see a different number of threads calculated after a change in modes.

RFC 2822 Metadata — Performs Email Threading using the messageid, references, and inreplyto metadata fields. The messageid is a unique ID (alphanumeric value) that identifies an email message. The inreplyto field identifies the message to which a new message serves as a reply. The references field is used to identify a thread of conversation.
MS Outlook Metadata — Performs Email Threading using the threadindex metadata field value, which acts like a combination of the inreplyto and references fields. This Threading mode is used by Microsoft for Outlook Exchange.
Content-Based — Performs Email Threading based on email content. This type of threading determines whether the content from one email is partially or wholly included in another email. Content-based threading includes header terms (from fields such as from, sent, to, and subject) and content terms.
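As an illustration of the RFC 2822 Metadata mode, the following Python sketch groups messages into threads by linking each messageid to the IDs named in its inreplyto and references fields. It uses a generic union-find approach, which is an assumption and not necessarily how the software builds threads. The message dictionaries and sample IDs are hypothetical.

```python
def thread_by_headers(messages):
    """Group messages into threads using messageid, inreplyto, references.

    messages: list of dicts with "messageid", and optionally "inreplyto"
              (a string) and "references" (a list of strings).
    Returns a sorted list of threads, each a sorted list of message IDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for msg in messages:
        find(msg["messageid"])  # register the message itself
        linked = list(msg.get("references", []))
        if msg.get("inreplyto"):
            linked.append(msg["inreplyto"])
        for other in linked:
            union(msg["messageid"], other)

    threads = {}
    for msg in messages:
        threads.setdefault(find(msg["messageid"]), []).append(msg["messageid"])
    return sorted(sorted(ids) for ids in threads.values())


messages = [
    {"messageid": "<a@x>"},
    {"messageid": "<b@x>", "inreplyto": "<a@x>", "references": ["<a@x>"]},
    {"messageid": "<c@x>", "inreplyto": "<b@x>", "references": ["<a@x>", "<b@x>"]},
    {"messageid": "<d@x>"},
    {"messageid": "<e@x>", "inreplyto": "<missing@x>"},
]

print(thread_by_headers(messages))
# [['<a@x>', '<b@x>', '<c@x>'], ['<d@x>'], ['<e@x>']]
```

The last message replies to an ID that is not in the collection, so in this sketch it forms a thread of its own.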

Inclusive Email Scope

This setting determines whether the evaluation of inclusive emails (those with unique content) within an email thread is limited to email messages only or covers both email messages and their attachments.

Message Only (default) — Evaluates only the email messages within an email thread for unique content, not their attachments. An email is considered inclusive if it has unique content when compared to other email messages within the email thread.
Message and Attachments — When set, this option extends the evaluation of unique content within an email thread to each email and its attachments. In this case, an email is considered inclusive if one or more of its attachments are unique (that is, not associated with other members of the email thread).

Users with permissions to perform exports can check the Export-Only ThreadInclusive metadata field to see if an email is considered inclusive within an email thread. If so, the field's value is Y.
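One way to picture the two Inclusive Email Scope options is the Python sketch below. It is an interpretation, not the product's algorithm: each email is modeled with a hypothetical "segments" set (its own text plus any quoted earlier messages) and an "attachments" set of attachment hashes. The sketch treats an email as inclusive when its content is not wholly contained in another email of the thread and, under Message and Attachments, also when it carries an attachment that no other thread member has.

```python
def inclusive_flags(thread, scope="message"):
    """Return {email id: True/False} for an illustrative inclusive test.

    thread : list of dicts with hypothetical keys "id", "segments" (set),
             and "attachments" (set).
    scope  : "message" or "message_and_attachments".
    """
    if scope not in ("message", "message_and_attachments"):
        raise ValueError("unknown scope")
    flags = {}
    for email in thread:
        others = [e for e in thread if e is not email]
        # Content is not unique if another email strictly contains it.
        contained = any(email["segments"] < o["segments"] for o in others)
        inclusive = not contained
        if scope == "message_and_attachments" and not inclusive:
            other_attachments = set().union(*(o["attachments"] for o in others))
            inclusive = bool(email["attachments"] - other_attachments)
        flags[email["id"]] = inclusive
    return flags


thread = [
    {"id": "e1", "segments": {"s1"}, "attachments": {"att1"}},
    {"id": "e2", "segments": {"s1", "s2"}, "attachments": set()},
    {"id": "e3", "segments": {"s1", "s2", "s3"}, "attachments": set()},
]

print(inclusive_flags(thread, "message"))
# {'e1': False, 'e2': False, 'e3': True}
print(inclusive_flags(thread, "message_and_attachments"))
# {'e1': True, 'e2': False, 'e3': True}
```

In this sketch a True result corresponds to a ThreadInclusive value of Y. The first email becomes inclusive only when attachments are evaluated, because its attachment does not appear on any later message.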

Exclusion Searches

When you add a Data Set from Imports to Project Data, you can leverage a set of Exclusion Searches to omit certain types of data typically considered irrelevant in a Project. You can use the predefined exclusion searches or add your own, one at a time. You can also remove the Exclusion Searches.

Note: When writing queries that include file types, you must use the correct file type based on the Parsing Library Version in effect in your Project, either V1 (Projects prior to Release 5.2.5.x) or V2 (new Projects as of 5.2.5.x by default). See Supported File Types for Analysis for a list of file types based on the Parsing Library Version.

Exclusion searches are intended to run on an entire Data Set view. You can bypass the Exclusion Searches if you perform an operation that adds selected documents in a Data Set instead of an entire Data Set to Project Data. This means that if you select one or more directories, archives, disk images, or NIST files in the Data Set and add/save them to Project Data, or tag them, they will become part of Project Data.

Note: When you load Analytic Settings, the names of the Exclusion Searches being loaded must not collide with the names of existing views in the Project. When naming your Exclusion Searches, make sure that they do not match existing views in the Project.

Directories

Uses the following query to exclude directories:

docclass::DIRECTORY

Disk Images

Uses the following query to exclude disk images (for example, for disk image types such as EWF and LEF):

docclass::DISK_IMAGE

NIST

Uses the following query to exclude NIST file types listed in the National Software Reference Library (NSRL), a database provided by the National Institute of Standards and Technology (NIST). NIST EDoc files are often system-related files that provide no value.

docclass::EDOC AND kftdesc::<exists>

To add your own Exclusion Search query, enter a name for the query for Search Name and enter the query for Search Query.

Note: The maximum length for an Exclusion Search query is 1024 characters.

Each query addition generates a new line so that you can continue to add queries, if desired. If you need to delete a query, click its delete icon.

Note: Any changes you make to the list of Exclusion Searches affect the next batch of data added to the Project. Data already added to Project Data is not affected.
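The constraints noted in this section (a 1024-character query limit and no name collisions with existing views) can be checked before entering a custom Exclusion Search. The Python sketch below is a hypothetical pre-check and not part of the product. The function name and messages are invented for illustration, and the exact-match, case-sensitive name comparison is an assumption, since the documentation does not say how names are compared.

```python
MAX_QUERY_LENGTH = 1024  # documented maximum for an Exclusion Search query


def validate_exclusion_search(name, query, existing_view_names):
    """Return a list of problems found; an empty list means none found.

    existing_view_names: names of views already present in the Project.
    Assumption: names collide only on an exact, case-sensitive match.
    """
    problems = []
    if not name.strip():
        problems.append("Search Name is empty")
    elif name in existing_view_names:
        problems.append(f"Search Name '{name}' matches an existing view")
    if not query.strip():
        problems.append("Search Query is empty")
    elif len(query) > MAX_QUERY_LENGTH:
        problems.append(
            f"Search Query is {len(query)} characters; limit is {MAX_QUERY_LENGTH}"
        )
    return problems


views = {"All Documents"}
print(validate_exclusion_search("Directories", "docclass::DIRECTORY", views))
# []
print(validate_exclusion_search("All Documents", "x" * 1025, views))
# ["Search Name 'All Documents' matches an existing view",
#  'Search Query is 1025 characters; limit is 1024']
```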