Create and Manage Clusters

Project Data > Selected view > Clusters tab

Requires Project Data - View Permissions

Clustering a Project Data view provides an enhanced level of discovery: the documents are grouped by common content without regard to file format. Clustering builds Clusters of similar documents in a given Project Data view for all documents that are backed by an Analytic Index, and it supports the advanced analytics operations and the assessment of Document Similarity. (When you are working in Project Data, you can run a search for Document Similarity using one or more selected documents or an entire view as the basis of the search against a given target. The operation compares a calculated value for the content of the selected documents or a Synthetic Document to the calculated value of the target.) To view the Clusters for a set of documents you are analyzing in a view in Project Data, select the Clusters tab. If the view has not been Clustered, click the Build Clusters... button. Note that the initial Clustering operation can take time to complete.

You can generate and download a Cluster report that captures Clustering information for the view.

Build Clusters

When you first click the Clusters tab for a view, there are no Clusters, so you must build the Clusters for that view. Use the main Build Clusters... button to build Clusters based on the contents of the view (for example, all of Project Data or a Search Results view). When you build Clusters, you have the option to Subcluster using a number of Subcluster levels. You should try to have at least 25 documents before creating Clusters. If the result set is too small (for example, 7 documents), you will see an error stating that the document set is too small to cluster, and the failed Cluster view will be deleted. (In this case, you will see Work Basket tasks for the Cluster failure and the deletion of the failed Cluster root view.)

Note the following about building Clusters.

  • A message appears in the Clusters tab for the view to indicate that Clustering is in progress.
  • The Clustering could take some time, depending on the size of the view.
  • All future visits to the Clusters tab for that view can be handled quickly.
  • You will see a number of Work Basket tasks associated with the Clustering of each top-level Cluster and Subcluster while the Clustering is in progress. (Many of these tasks are temporary, and are not retained once Clustering has been completed.)

Once you build Clusters, you can use the Clusters tab for the following tasks:

  • Browse the Clusters of related documents.
  • View the Top Terms that define the data within the view, Cluster by Cluster.
  • View the documents in a Cluster.
  • Update Clustering, Rebuild Clustering, and build Subclusters from a large Cluster of documents.

The Clusters tab provides a Cluster View with two sections:

  • Cluster List and Cluster toolbar – This section provides a paged list of all Clusters in the view. Each Cluster has a name and Top Terms, document state, and a Tag State for applied Tags. When you click to select a Cluster, the bottom half of the screen updates to show the documents list for the Cluster. Each Cluster represents a group of documents with similar content. With a small view, you see a simple list of the Clusters. On a larger view, or after performing a Subcluster operation on an existing Cluster, you might have levels of Clusters (Subclusters).
  • Document List and Document toolbar – This section provides a paged list of the documents in the currently selected Cluster.

Note: For a Cluster document list, if you enable the Hide Duplicates option, the number shown in the bottom right portion of the screen will update to show the deduplicated count (for example, Display 1-50 of 250). The number of duplicates is not displayed for this view type, however. For information that applies to all document lists, see Work with a Document List.

Expand and Explore Clusters in the Cluster List

Once you have built Clusters with the desired level of Subclusters, you can expand and explore the content of a Cluster, keeping the top-level Cluster open as you explore the available Subcluster levels (based on how you chose to build your Clusters). Indentation of the associated icon in the Top Terms column helps you keep track of the different levels. To explore a Cluster, use the appropriate icon in the Top Terms column, as follows:

  • Click the icon to expand a given Cluster to reveal its first-level Subclusters. Click the indented icon for a first-level Subcluster to reveal its second-level Subclusters. Do the same for any additional Subcluster levels that you want to explore.

  • Click the icon to close a given Cluster or Subcluster.

If there are no Subclusters to open and view, you will not see an icon in the Top Terms column.

You can use the main toolbar or context menu options (right-click options) for a Cluster to perform additional Cluster operations.

The Clusters list displays these columns by default:

  • ID – The ID of the Cluster (for example, 1). A plus sign + indicates a Cluster that has Subclusters. Clicking the + reveals the Clusters in that parent Cluster.
  • Top Terms – The Top Terms for the Cluster. These indicate the type of information that is shared by the documents.

Note: For data that was processed or reprocessed prior to 4.3.11.0, the Top Terms for a Cluster may include tokens (parsing tokens, or tokens associated with Patterns that are enabled and have values stored). In data processed prior to 4.3.11.0, tokens are labels that the parsing software associates with each document to track the different types of content in the document; they are also used to identify documents that experienced parsing errors. Clusters do not include tokens as of 4.3.11.0.

  • Documents – The number of documents in that Cluster (or Subcluster).
  • Tags – One or more Tags that have been applied to an entire Cluster. Each Tag has a color assigned to it. You can tag Clusters or Subclusters. When you tag an entire Subcluster, the Tag appears for that entire Subcluster, but not at the Cluster level.

Optional column you can display:

  • Name – The full name used for the Cluster (for example, ProjectData-1). The name includes the view or search result name plus the ID.

For multipage lists, you can select a page to display. By default, a given page will display 100 items. The paging area shows a range and enables you to enter the page number in the box or use the appropriate arrows to navigate.

You can perform an immediate refresh by clicking the icon on the Page Control bar.

Use the Cluster Toolbar Actions

You can perform the following main Cluster actions:

  • Rebuild... – This toolbar option removes all of the current Clusters from the view and rebuilds all Clusters, potentially yielding different results. When you rebuild Clusters, you have the option to Subcluster using a number of Subcluster levels. The Subcluster level you select for this operation is retained as the default when you later perform Subclustering of a Cluster, and it persists until you Rebuild again at a different level or generate only top-level Clusters without selecting Subclustering. Note that generation and download of the Cluster report is unavailable during a rebuild of the Clusters. If you are constantly adding and removing documents from a clustered view, you may need to navigate back to the view and rebuild all of the Clusters.
  • Generate Report... (requires Document Reports permission) – This toolbar option enables you to generate a Cluster report as an XLSX file. Once the report has been generated, you can download the report from the Work Basket. When you select this option, a Generate Cluster Report dialog enables you to select whether you want to include a number of Subcluster levels in the report. Note that the XLSX uses a structure that enables you to view the different levels of Clusters (the top-level Clusters as well as the selected number of Subcluster levels). In the XLSX, you can click the plus sign (+) to the far left to expand a given top-level Cluster and explore its Subclusters.
  • Update – If you add documents to a clustered view (for example, you are on the Clusters tab for Project Data and you add documents to Project Data), you can click this toolbar option to update Clustering. Removing documents from a view (for example, removing documents from Project Data) also makes this option available. If no updates are needed, this option will be grayed out.

About Out-of-Date Clusters

If Clusters are out of date (for example, because documents have been added to or removed from the view), you will see a warning icon and the following message:

Clusters are out of date.

Hovering over the icon will also display the following message:

<view_name> has been changed. Clusters are out of date.

If you see this icon and message, consider updating your Clusters.

Use the Context Menu Options for a Cluster

For a selected Cluster, right-click the Cluster or click the ellipses at the far right to see a context menu with the following options, as long as you have permissions to perform those actions (actions that are not permitted will be grayed out):

  • Build Subclusters – Perform further Clustering (Subclustering) of a selected Cluster (typically, a Cluster with many documents that you want to investigate further, perhaps at a deep Subclustering level). When you build Subclusters, you can select the number of Subcluster levels. The level you see when the dialog opens is either what you previously selected for a Rebuild operation, or the value 2, and you can use this level or select a different level. Subclustering a Cluster involves Uncluster and Cluster operations. In general, Subclustering of a Cluster can be done when the Cluster has at least 10 documents. If the Cluster is too small for Subclustering (under 10 documents), you will not be able to perform the Subclustering.
  • Create Manifest (requires Document Reports permission) – Launches the Create Manifest dialog, from which you can generate a CSV or XML manifest of a selected Cluster or Subcluster, using either the current fields or all fields. From the Work Basket task for the manifest generation, you can then right-click and select Download to download the file to a destination local to your computer. Users with permissions can also save the manifest to a server location. For download of a large manifest file (over 200 MB), the software places the manifest in a ZIP file, which you can then unzip. Note that this process can take time.
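
The size guidance in this topic (at least 25 documents recommended before building Clusters, and at least 10 documents in a Cluster before Subclustering) can be expressed as a small pre-check. The following is only an illustrative sketch: the function and constant names are hypothetical and are not part of the software, and the exact size at which a build fails is not stated in this topic, so the first check reflects the recommendation only.

```python
# Thresholds come from the guidance in this topic; the helper names are hypothetical.
RECOMMENDED_MIN_DOCS_FOR_CLUSTERING = 25
MIN_DOCS_FOR_SUBCLUSTERING = 10


def meets_clustering_recommendation(view_doc_count: int) -> bool:
    """True when the view has at least the recommended 25 documents.

    A False result does not guarantee a failure; it only signals that the
    view is below the recommended size and the build may be rejected as too small.
    """
    return view_doc_count >= RECOMMENDED_MIN_DOCS_FOR_CLUSTERING


def can_subcluster(cluster_doc_count: int) -> bool:
    """True when a Cluster has the 10 or more documents needed for Subclustering."""
    return cluster_doc_count >= MIN_DOCS_FOR_SUBCLUSTERING


if __name__ == "__main__":
    for count in (7, 10, 25):
        print(count, meets_clustering_recommendation(count), can_subcluster(count))
```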

Select Documents in a Cluster Document List

The documents list for a given Cluster uses a per-document checkbox to support selection of individual documents and a top checkbox to support selection of all documents in the view:

Note: For information that applies to all document lists, such as the paging area options, see Work with a Document List.

  • (per-document checkbox) – Use the checkbox to the left of a document name to select that document in the Cluster. To clear the selection of a document on a page of the Cluster list, clear the per-document checkbox. Use the per-document checkbox to select one or more documents on a given page of a Cluster for an intended action.
  • (top-level checkbox) – Use the checkbox that appears at the top of the list to select all documents in the currently selected Cluster (across all pages for a tab). To clear the selection of all documents in the view, clear the top checkbox. Note that if you want to Tag an entire Cluster, select the top-level checkbox for the Cluster document list and use the Document menu Add Tags... option. (The Document menu offers other operations as well, such as Remove Tags... and Add To... or Remove From....) Tagging that applies to an entire Cluster will be reflected in the Tags column for that Cluster. If you select one or more but not all documents in a given list, the top checkbox changes to indicate that one or more items have been selected, but not all items (for example, if you selected 17 of 18 documents on a page, you would see Selected: 17 on this page).

Cluster Document Information

The documents list for a selected Cluster provides information about each document in the Cluster:

  • Doc Number – A three-part number representing the Document Number in the format C.V.N, where C = a Data Collection (Data Set) number, unique per Organization; V = a Data Collection Checkpoint Value, unique per Data Collection; and N = a document number, unique within the Data Collection Checkpoint. When searching for a Document Number using the docnum metadata field, specify the entire value, since wildcards are not supported for this field. You can also use a range search. Example: docnum::[3.101.50000~~3.101.60000].
  • Family/Thread – Enables you to identify whether a document is part of a Family and/or Thread:
    • identifies a document that is part of a family (message attachment group or document attachment group). You can click this icon to open the family inline.
    • identifies a document that is part of an email thread. You can click this icon to open the email thread inline.
  • Tags – One or more Tags that have been applied to the document. Each Tag has a color assigned to it. You can select a Tag from the list to apply it. You can see a number of Tags (as colored checkmarks) applied to the document in the Tag column; to see a complete list of Tags, hover over the icon, which shows you the full list of applied Tags.
  • Name – This column displays key information about the file. The information displayed depends on the type of file:
    • The icon indicates that the file is a document found on disk. If an Author is available, the author's name is identified, followed by the Filename of the document. Note that embedded documents extracted during import are assigned a filename in the format <parentfilename>_OLE_<value>.<ext>. Embedded documents include Microsoft Excel files and text files. For example, for an Excel (xls) file that is the first OLE child linked to a Word document named spreadsheet.doc, the OLE filename would be spreadsheet.doc_OLE_1.xls.
    • The icon indicates that the file is an email. The sender of the email is identified in the From: field, followed by the Subject line of the email. This applies to email messages, calendar items, as well as journal entries and tasks. (Note that embedded images are not extracted for MSGs, EMLs, or eDocs during import by default, but are identified in the embeddedchildren metadata field with a value of image.)
    • The icon indicates a directory, if directories have not already been excluded from the Project Data. Your eDiscovery Administrator can take advantage of default exclusion queries in the Analytic Settings to exclude items such as directories, NIST files, and archive files from Project Data. If directories are included in the Project Data, then this field identifies the name of the directory.

Note: A document's file extension will not always reflect the document's real file type. For example, a mydoc.txt file may actually be an MBOX from which emails are extracted. You can rely on the Digital Reef software to detect the correct file type, which you can verify in the document metadata.
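
The naming pattern described for embedded documents in the Name column, <parentfilename>_OLE_<value>.<ext>, can be composed or split with a short script when you work with exported file names. This is a minimal sketch, not the software's own logic: the function names are hypothetical, and the pattern assumes that <value> and <ext> contain no periods.

```python
import re
from typing import NamedTuple, Optional


class OleName(NamedTuple):
    parent_filename: str  # for example, spreadsheet.doc
    value: str            # for example, 1
    extension: str        # for example, xls


# Assumption: <value> and <ext> contain no periods; the parent name may.
_OLE_PATTERN = re.compile(r"^(?P<parent>.+)_OLE_(?P<value>[^.]+)\.(?P<ext>[^.]+)$")


def build_ole_filename(parent_filename: str, value: str, extension: str) -> str:
    """Compose a name in the <parentfilename>_OLE_<value>.<ext> format."""
    return f"{parent_filename}_OLE_{value}.{extension}"


def parse_ole_filename(filename: str) -> Optional[OleName]:
    """Split an embedded-document name into its parts, or return None if it does not match."""
    match = _OLE_PATTERN.match(filename)
    if match is None:
        return None
    return OleName(match["parent"], match["value"], match["ext"])


if __name__ == "__main__":
    print(build_ole_filename("spreadsheet.doc", "1", "xls"))
    print(parse_ole_filename("spreadsheet.doc_OLE_1.xls"))
```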

  • To – For emails, this is derived from the display name, if available (for example, Joe Jones), or the email address of the email sender and recipient (for example, jjones@someco.com). Each recipient is separated by a comma or semicolon, depending on the source data.
  • Size – Shows the document size on disk.
  • Date – For document lists derived from Project Data, this column displays the primary date information based on the file type of the source file, displayed in the format yyyy-MM-dd-HH-mm-ss. The value in this field is propagated from parent files to their child files (and the children will have that primary date only, not their own). You can sort on this column in order to see families grouped. This field displays information associated with the dateprimary field, which determines the primary date as described in Work with a Document List.
  • Score (default sort column for a Cluster document list, in descending order) – Rates how relevant a document is within a selected Cluster. Documents are sorted according to their Score with the most relevant documents (higher Score) listed first. A Score can be in the range 0 to 100, and the value appears in the Score column to two decimal places (for example, 18.33). For a given Cluster, the first document shown in the list is the seed document and has a Score of 100. Duplicate documents in a Cluster all have a Score of 100 (100.0000). Documents in the Unclaimed List have a Score of 0.
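
If you build docnum searches outside the user interface, the C.V.N format and the range syntax shown for the Doc Number column can be validated before you paste a query into a search. The sketch below is illustrative only: the helper names are hypothetical, it assumes that all three parts are plain integers (as in the example), and the ordering check on the range is an assumption rather than documented behavior.

```python
from typing import NamedTuple


class DocNumber(NamedTuple):
    collection: int  # C: Data Collection (Data Set) number
    checkpoint: int  # V: Data Collection Checkpoint Value
    number: int      # N: document number within the checkpoint


def parse_docnum(text: str) -> DocNumber:
    """Parse a complete C.V.N value; wildcards and partial values are rejected."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected a complete C.V.N value, got {text!r}")
    return DocNumber(*(int(part) for part in parts))


def docnum_range_query(start: str, end: str) -> str:
    """Build a range search string in the docnum::[start~~end] form."""
    low, high = parse_docnum(start), parse_docnum(end)
    # Assumption: a range is only meaningful when start does not exceed end.
    if low > high:
        raise ValueError("range start must not be greater than range end")
    return f"docnum::[{start.strip()}~~{end.strip()}]"


if __name__ == "__main__":
    print(docnum_range_query("3.101.50000", "3.101.60000"))
```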

Optional columns that can be displayed for a Cluster document list:

The following columns are hidden by default for a Cluster document list, but you can change your column selections to display them, and you can change the column order by dragging a column to the desired position:

  • Sent – The sent display date for emails in the format yyyy-MM-dd-HH-mm-ss (for example, 2000-02-17-06-17-13). You can sort on this column in order to see families grouped.
  • Received – The received date for emails in the format yyyy-MM-dd-HH-mm-ss. You can sort on this column in order to see families grouped.
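
The Date, Sent, and Received columns all use the yyyy-MM-dd-HH-mm-ss display format. If you post-process exported values (for example, from a manifest), that format maps to the Python pattern shown below. This is a minimal sketch with hypothetical helper names; it treats the values as naive timestamps because this topic does not state a time zone.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd-HH-mm-ss display format.
DISPLAY_FORMAT = "%Y-%m-%d-%H-%M-%S"


def parse_display_date(value: str) -> datetime:
    """Convert a displayed date such as 2000-02-17-06-17-13 to a datetime."""
    return datetime.strptime(value, DISPLAY_FORMAT)


def format_display_date(moment: datetime) -> str:
    """Render a datetime back into the display format."""
    return moment.strftime(DISPLAY_FORMAT)


if __name__ == "__main__":
    sent = parse_display_date("2000-02-17-06-17-13")
    print(sent.isoformat())
    print(format_display_date(sent))
```

Because the format is zero-padded and ordered from year to second, sorting the raw strings gives the same order as sorting the parsed dates.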

Use the Document Menu Options for a Cluster

For a selected Cluster (or Subcluster) on the Clusters tab for a view, use the Document drop-down menu to see a list of document options.

Note: For operations that require you to select a target Folder or other view, be aware that the available target options change based on your context. For example, if you are removing documents from a Folder, you cannot create a new Folder.

  • Add Tags – Launches the Tag dialog, from which you can select one or more Tags to apply to selected documents in the current Cluster (or Subcluster), or to tag the entire Cluster (by selecting the top-level checkbox for all docs). You can also create a Tag and use it right away.
  • Remove Tags – Launches the Tag dialog, from which you can select Tags to remove.
  • Add to – Enables you to add selected documents or all documents in the current Cluster (or Subcluster) to a selected Custodian, MediaID, Batch, or Folder view based on your current view and your permissions. For more information about adding documents, see Add to or Remove Documents from Select Project Data Views. For more information about managing Custodian views, see Manage Custodians and Data Assigned to Custodians.
  • Remove from – Removes selected documents (or all documents) in the current Cluster from a selected Custodian, MediaID, Batch, or Folder view in Project Data based on permissions. The documents are still available within the Project; they just no longer reside within that view. For a Custodian (or MediaID or Batch), removing documents from a given named Custodian automatically reassigns the documents to the Unassigned view of that type. (Removing documents from the Unassigned Custodian is not permitted; if you want to assign documents from Unassigned to another view such as a Custodian, perform an Add To operation to the appropriate view. For more information about managing Custodian views, see Manage Custodians and Data Assigned to Custodians.)
  • Find More like these – Uses a selected set of documents/emails in the current Cluster to search for similar content. This type of search finds documents that have the most content similarity to the documents submitted as the focus of the search. It assesses whole-document similarity and reports a Score and %Terms match.
  • Download as PDF – Enables you to download a single document, multiple selected documents on a page, or all documents in a Cluster or Subcluster as a PDF to your local environment so that you can view the documents in PDF format. When you select this operation, you can select the Stamp Document Number option if you want to include a stamp with the document number (docnum) on the bottom right of each page in the PDF. If you select the top checkbox to save all documents as PDFs, you will see a Warning popup that states the following: You are attempting to download all documents in this list as PDFs. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the PDFs directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. Whether you select one, multiple, or all documents to download, the software will prepare a ZIP file, by default named <projectname>_PDFs.zip. An information popup indicates that the PDFs are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket. Note that certain file types are ignored for PDF generation, including any selected directory folders not removed from your Project during setup by your administrator, disk images, file archives, mail archives, empty files, and files for which the native is not available. A WARNING_DETAILS_REPORT.csv file identifying the files that were skipped or failed PDF generation can be downloaded from the appropriate PDF-related Work Basket task. See About Downloading Documents as PDFs and Natives for more information.
  • Download Native – Enables you to download a single document, multiple selected documents on a page, or all documents in a Cluster or Subcluster to your local environment so that you can view the documents in their native format. Any selected directory folders not removed from your Project during setup by your administrator are ignored for the download. A WARNING_DETAILS_REPORT.csv file identifies any native files that were not downloaded. (See About Downloading Documents as PDFs and Natives for more information.) If you select the top checkbox for all documents, you will see a Warning popup that states the following: You are attempting to download all natives in this list. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the natives directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. Whether you select one, multiple, or all documents to download, the software will prepare a ZIP file, by default named <projectname>_Documents.zip. An information popup indicates that the documents are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket.
  • Send to Discard Pile – Removes the selected documents from Project Data and places a copy in the Discard Pile. A Work Basket task called Sending Documents to Discard Pile reports the results. Documents removed from Project Data can later be restored with their Project Data information (for example, Tags).


Use the Document Options for a Selected Document

When you select a single document in a Project Data-based list and right-click (or click the ellipses at the far right), a document context menu appears with a list of options:

  • Open Document Inline – Launches the Document Viewer inline, within your current browser window.
  • Open Document in New Window – Launches the Document Viewer in a new browser window (or tab, depending on your browser options). This enables you to select any document in the paged Document List and see the full content of that document (or other views, such as Metadata or History). You can also launch multiple windows for different documents to perform side-by-side reviews of multiple documents. When you open the Document Viewer in a new browser window, you can select view modes in the top center portion of the screen, navigate documents by using the page controls at the bottom, and perform operations such as tagging.
  • Open Family Inline – Launches a Family-specific view of the Document Viewer for a given Family (MAG or DAG) inline, within your current browser window.
  • Open Family in New Window – Launches the Document Viewer for a Family (MAG or DAG) in a new browser window (or tab). This enables you to focus on the other family members of a selected parent email/document or email or embedded attachment. Family members are indented under their parent. MAGs are sorted by the email sent date.
  • Open Thread Inline – Launches a Thread-specific list in the Document Viewer inline, within your current browser window.
  • Open Thread in New Window – Launches a Thread-specific list in the Document Viewer in a new browser window (or tab). This enables you to focus on each message in the Thread and the associated attachments, if applicable.
  • Find Exact Duplicates of This... – Searches for documents that have exactly the same content and metadata as the selected document. An exact duplicate would have the same file MD5 value.
  • Find Content Duplications of This... – Searches for documents that have the same content as the selected document.
  • Find Near Duplicates of This... – Searches for documents whose content is almost the same as a selected document. Evaluation of what constitutes a near-duplicate document includes comparison of the overall term length, but not file type or format. A Threshold setting enables you to specify the level of content match for the operation. Find Near Duplicates minimally requires an Analytic Index.
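
Because an exact duplicate has the same file MD5 value, you can spot-check a set of downloaded natives outside the software by hashing the files yourself. The following sketch is an independent illustration and not the software's implementation; the function names are hypothetical, and it compares only file bytes, so it says nothing about the content or near-duplicate searches.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files that share an MD5 value; files with a unique MD5 are omitted."""
    by_hash: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        by_hash[file_md5(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical folder of downloaded native files.
    folder = Path("downloaded_natives")
    if folder.is_dir():
        for group in group_exact_duplicates(p for p in folder.iterdir() if p.is_file()):
            print([p.name for p in group])
```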

About Seed Documents

When a Clustering operation starts, the system selects a "seed document" and compares it to all the other documents in the view. A Cluster is formed when other documents with similar content are clustered around the seed document. If you update Clustering for a set of documents, it is possible, even likely, that the system will select different seed documents. Different seed documents form different Clusters. So removing and reclustering or updating Clustering can show you different relationships and allow you to gain a greater understanding of content.

About the Unclaimed List

The Clusters tab displays a Cluster titled Unclaimed List, which holds documents that have no similarity to any of the terms in the current seed documents in the view. An Unclaimed List Cluster is always the last Cluster in the view. No Top Terms appear for an Unclaimed List when the documents in the Cluster contain no common terms.
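
To make the seed document and Unclaimed List ideas concrete, the toy example below groups documents around caller-supplied seeds. It is a conceptual sketch only and does not reproduce the software's Clustering algorithm, seed selection, or Score calculation: the Jaccard term-overlap measure, the 0 to 100 scaling, and all names are assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Member = Tuple[str, float]  # (document name, illustrative score)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of terms two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_around_seeds(
    docs: Dict[str, Set[str]], seeds: List[str]
) -> Tuple[Dict[str, List[Member]], List[str]]:
    """Assign each document to its most similar seed, or to the unclaimed list."""
    if not seeds:
        raise ValueError("at least one seed document is required")
    clusters: Dict[str, List[Member]] = {seed: [(seed, 100.0)] for seed in seeds}
    unclaimed: List[str] = []
    for name, terms in docs.items():
        if name in clusters:
            continue  # seeds already head their own cluster
        best_seed, best = max(
            ((seed, jaccard(terms, docs[seed])) for seed in seeds),
            key=lambda pair: pair[1],
        )
        if best == 0.0:
            unclaimed.append(name)  # no terms in common with any seed
        else:
            clusters[best_seed].append((name, round(best * 100, 2)))
    for members in clusters.values():
        members.sort(key=lambda member: member[1], reverse=True)
    return clusters, unclaimed


if __name__ == "__main__":
    docs = {
        "a": {"contract", "lease", "rent"},
        "b": {"contract", "lease", "tenant"},
        "c": {"invoice", "payment"},
        "d": {"invoice", "payment", "due"},
        "e": {"picnic"},
    }
    print(cluster_around_seeds(docs, ["a", "c"]))
    print(cluster_around_seeds(docs, ["b", "e"]))
```

Running the sketch with different seeds produces different groupings and a different unclaimed list from the same documents, which mirrors why rebuilding or updating Clustering can surface different relationships.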