Work with Data Set Documents

Selected Data Set in Navigation Tree > All Docs tab

Requires Data Set - View Permissions

If you have permissions, you can select a given Data Set under Imports in the tree, and an All Docs tab by default displays all documents that apply to the Data Set. You can also use an additional Exceptions tab to get a list of documents in the Data Set that have parsing exceptions. Two special tabs also provide a Communication Grid and a Domain Grid.

For information that applies to all document lists, such as checkboxes for document selection and column selection, see Work with a Document List.

This topic focuses on the columns and options that apply to a non-result document list for a Data Set.

Supported Tabs

You can use different tabs to display a list of all documents, or information for specific items such as Exceptions, Communication Grid, Domain Grid, and Reports:

All Docs — A general view of all documents in the Data Set view. You can double-click a document to open the document in the inline Document Viewer.
Exceptions — For a given Data Set, this tab acts like a filter to display only the documents in the Data Set that have Parsing Exceptions. In this Exception list, additional columns display the File Type and Parsing Status. File Type provides the detected file type for the document, while Parsing Status displays the Exception Code followed by the brief description of the Exception (for example, 00005 NODATA). If you also display Import Path, you can see the path of the document at import. You can double-click a document to open the Document Viewer inline. Note that when you are on the Exceptions tab, performing a search will still search the entire Data Set, and performing other operations (for example, tagging) on all selected documents will act on the entire Data Set.
Communication Grid — This tab shows you a Communication Grid that helps you analyze the email communication for the current view and see how many emails were sent from a given email address to another email address. It shows the Top 50 email address FROM and TO combinations and lets you make grid selections.
Domain Grid – This tab shows you a Domain Grid that helps you analyze the sending and receiving email domain information for the current view and see how many emails were sent from a given domain to another domain. It shows the Top 50 FROM and TO domain combinations and lets you make grid selections.

Reports — This tab shows you the appropriate report information for the Data Set view.

Note: A Loading message appears while documents are being loaded into a view. In a large Data Set or view, the documents list (or sorting or report generation) may take some time to complete. If you see the Loading message for a while, you may want to go to another view, perform other operations, and return to this view later.

Document Information for Data Set Views

By default, the All Docs tab for a Data Set provides the following columns, with information about each document in a Data Set:

Doc Number (default sort ascending column for a Data Set view) – A three-part number representing the Document Number in the format C.V.N, where C = a Data Collection (Data Set) number, unique per Organization; V = a Data Collection Checkpoint Value, unique per Data Collection; and N = a document number, unique within the Data Collection Checkpoint. When searching for a Document Number using the docnum metadata field, specify the entire value, since wildcards are not supported for this field. You can also use a range search. Example: docnum::[3.101.50000~~3.101.60000]. Family members (a parent and its children) have sequential document numbers.
Family/Thread – Enables you to identify whether a Data Set document is part of a Family. You can open a Family by clicking the Family icon for the document. (Threads apply only to views of Project Data.)

Name – This column displays key information about the file. The information displayed depends on the type of file:
- The icon indicates that the file is a document found on disk and is followed by the filename. Note that embedded documents extracted during import are assigned a filename in the format <parentfilename>_OLE_<value>.<ext>. Embedded documents include Microsoft Excel files and text files. For example, for an Excel (xls) file that is the first OLE child linked to a Word document named spreadsheet.doc, the OLE filename would be spreadsheet.doc_OLE_1.xls. (Note that embedded images are not extracted during import, but are identified in the embeddedchildren metadata field with a value of image.) Parent OLE documents (Message attachments with Message OLE attachments or eDocs with eDoc OLE attachments) will include the text of each OLE document, and each OLE document in the parent document is separated by a row of dashes.
- The icon indicates that the file is an email and is followed by the subject line of the email. This applies to email messages, calendar items, journal entries, and tasks. (Note that embedded images are not extracted for MSGs, EMLs, or eDocs during import by default, but are identified in the embeddedchildren metadata field with a value of image.)
- The icon indicates a directory, followed by the directory name.
- The icon indicates that the file is an attachment. An attachment is shown indented under its parent in a non-search results view.
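The embedded-document naming pattern described above, <parentfilename>_OLE_<value>.<ext>, can be sketched in Python. This is only an illustration: the function name ole_child_filename is hypothetical, and treating <value> as the 1-based position of the OLE child is an assumption drawn from the spreadsheet.doc example.

```python
def ole_child_filename(parent_filename: str, position: int, ext: str) -> str:
    """Build <parentfilename>_OLE_<value>.<ext> for an embedded document.

    Assumption: <value> is the 1-based position of the OLE child.
    """
    if position < 1:
        raise ValueError("position is assumed to start at 1")
    # Accept the extension with or without a leading dot.
    return f"{parent_filename}_OLE_{position}.{ext.lstrip('.')}"


# First OLE child (an Excel file) of a Word document named spreadsheet.doc
print(ole_child_filename("spreadsheet.doc", 1, "xls"))
```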

Note: A document's file extension will not always reflect the document's real file type. For example, a mydoc.txt file may actually be an MBOX from which emails are extracted. You can rely on the Digital Reef software to detect the correct file type, which you can verify in the document metadata.

To – For emails, this is derived from the display name, if available (for example, Joe Jones), or the email address of the email sender and recipient (for example, jjones@someco.com). Each recipient is separated by a comma or semicolon, depending on the source data.
Size – Shows the document size on disk.
Date – For Data Set document lists, this column reports either the last modified date for files or the sent date for emails in the format yyyy-MM-dd-HH-mm-ss. The date information is shown according to the Project time zone, either the default time zone of Coordinated Universal Time (UTC), or a time zone selected using the Project Preferences.
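As a rough illustration of the Doc Number column described above, the following Python sketch splits a C.V.N value into its three parts and builds a docnum range search in the syntax shown in the example. The names DocNum, parse_docnum, and docnum_range_query are hypothetical helpers, not part of the software.

```python
from typing import NamedTuple


class DocNum(NamedTuple):
    collection: int  # C: Data Collection (Data Set) number
    checkpoint: int  # V: Data Collection Checkpoint Value
    number: int      # N: document number within the checkpoint


def parse_docnum(text: str) -> DocNum:
    """Split a full C.V.N value; wildcards are rejected because the field does not support them."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"expected a full C.V.N value, got {text!r}")
    return DocNum(*(int(p) for p in parts))


def docnum_range_query(start: str, end: str) -> str:
    """Build a range search string in the documented docnum::[a~~b] form."""
    parse_docnum(start)  # validate both ends before building the query
    parse_docnum(end)
    return f"docnum::[{start}~~{end}]"


print(parse_docnum("3.101.50000"))
print(docnum_range_query("3.101.50000", "3.101.60000"))
```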

Additional Columns Shown by Default for the Data Set Exceptions tab

The following columns are additionally shown by default when you click the Exceptions tab for a Data Set.

Parsing Status – The 5-digit parsing status code followed by the brief description of the Exception (for example, 00005 NODATA).
File Type – The identified file type for the document (for example, Microsoft Word 2000, Adobe Acrobat (PDF), or Internet HTML).
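If you copy Parsing Status values out of the list for your own analysis, a small Python sketch such as the following could separate the code from the description. The function name parse_parsing_status is hypothetical, and the sketch assumes that a single space separates the 5-digit code from the description, as in the example above.

```python
from typing import Tuple


def parse_parsing_status(value: str) -> Tuple[str, str]:
    """Split a value such as '00005 NODATA' into (code, description).

    The code is kept as a string so that leading zeros are preserved.
    """
    code, _, description = value.strip().partition(" ")
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 5-digit code, got {value!r}")
    return code, description.strip()


print(parse_parsing_status("00005 NODATA"))
```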

Optional Columns to Display for a Data Set

The following columns are hidden by default for a Data Set view, but you can change your column selections to display them, and you can change the column order by dragging a column to the desired position:

Attachments – For an email message only, indicates the number of attachments (childcount) that this email has. This field remains blank for documents that are not email messages.
Is Attachment – For a document that serves as an attachment to an email message, confirms whether or not this document is an attachment by showing true or false. Note that this field will show true for any direct attachment as well as any files associated with the attachment, such as an embedded file or .zip file. This field remains blank for regular documents that are neither attachments of an email message nor files associated with such an attachment.
Author – For a document, the author of the document, if author information is available. For an email, the person or entity responsible for sending an email (derived from the from field information).
Scan Date – Time stamp of when this document was scanned and added to the system in the format yyyy-MM-dd-HH-mm-ss. For example, a file uploaded might report 2022-05-02-20-07-43 for the scan date (based on the metadata field datescanned). You can see where the document came from by viewing its metadata.
Date Created – The document creation date in the format yyyy-MM-dd-HH-mm-ss (for example, 2021-03-10-20-19-25). The corresponding metadata field is createdtime, which applies to NTFS with CIFS (for example, an import from a CIFS Connector).
Parsing Status (optional column from All Docs tab) – The 5-digit parsing status code followed by the brief description of the Exception (for example, 00021 FILE_NOT_SUPPORTED for Parsing Library V1, or 00068 FILE_ID_ONLY for Parsing Library V2).
Import Path – The import path for the document, which includes the import location label and/or the method of import (for example, upload) and archive information, if applicable.
File Type (optional column from All Docs tab) – The identified file type for the document (for example, Microsoft Word 2000, Adobe Acrobat (PDF), or Internet HTML).
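The date columns above report 24-hour time stamps such as 2022-05-02-20-07-43. A minimal Python sketch for parsing such a value follows. The helper name parse_stamp is hypothetical, and attaching UTC to the result is an assumption; your Project may use a different time zone selected in the Project Preferences.

```python
from datetime import datetime, timezone, tzinfo

# strptime equivalent of the documented year-month-day-hour-minute-second layout
STAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"


def parse_stamp(value: str, tz: tzinfo = timezone.utc) -> datetime:
    """Parse a time stamp and attach a time zone (UTC is assumed by default)."""
    return datetime.strptime(value.strip(), STAMP_FORMAT).replace(tzinfo=tz)


print(parse_stamp("2022-05-02-20-07-43").isoformat())
```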

Note: When you change column selections and/or position for a view, your current selections are retained for that type of view for the duration of your session. This enables you to keep your column preferences for a given view type in effect as you navigate to different places in the application. For example, if you make column selection and/or position changes for a view such as a Data Set view, you can open the Document Viewer and see those selections, then close the Viewer and still see your selections. Your selections are maintained whenever you switch from one view to another view of the same type (for example, you switch from one Data Set to another), even if you move to another type of view in between. For example, if you change column selections and positions for a Data Set view, then move to a Project Data-based view (which shows its column selections), and then move to another Data Set view, you would still see your Data Set column selections and positions.

Document Menu Options for a Data Set

Once you select one or more documents, use the Document drop-down menu to see a list of options available for a non-results Data Set view based on permissions. Note that when working with the Exceptions tab, search operations and operations performed for all selected documents (with the top checkbox) will act on the entire Data Set view, not just the Exceptions tab. For more information about operations and their associated permissions, see View and Manage Role-Based Permissions.

Note: If you perform an operation that adds selected documents in a Data Set instead of an entire Data Set to Project Data, the software does not run the Exclusion Searches, which by default exclude archives, directories, disk images, and NIST files from Project Data. This means that if you select one or more directories, archives, disk images, or NIST files in the Data Set and add/save them to Project Data, or tag them, they will become part of Project Data.

For options that require an entire view, use the right-click options for the Data Set view in the Navigation Tree.

Note: For operations that require you to select a target Folder or other view, be aware that the available target options change based on your context. For example, if you are removing documents from a Folder, you cannot create a new Folder.

Add Tags... – Launches the Tag dialog, from which you can select Tags to apply. You can also create a Tag and use it right away. If you Tag documents in a Data Set view, you can only select Document for the scope, and the software adds the documents to Project Data and performs the tagging.
Remove Tags... – Launches the Tag dialog, from which you can select Tags to remove.
Add to... – Enables you to add documents to a selected Custodian, MediaID, Batch, or Folder view in Project Data based on permissions. For more information, see Add to or Remove Documents from Select Project Data Views. For more information about managing Custodian views, see Manage Custodians and Data Assigned to Custodians.
Add to Project Data – Adds the selected documents in the Data Set to Project Data. The Data Set must be at a Content or Analytic Index level. This command is unavailable for a Data Set at the File Metadata or System Metadata Index level. Note that when you select this operation, the software performs Custodian, MediaID, and Batch view generation for all documents in Project Data, not just for the selected documents.
Remove from... – Removes selected documents from a Custodian, MediaID, Batch, or Folder view in Project Data. The documents are still available within the Project; they just no longer reside within that selected view. Removing documents from a given named Custodian (or MediaID or Batch) automatically reassigns the documents to the Unassigned view of that type. (Removing documents from an Unassigned view is not permitted; if you want to assign documents from Unassigned to another view such as a Custodian, perform an Add to operation to the appropriate view. For more information about managing Custodian views, see Manage Custodians and Data Assigned to Custodians.)
Remove from Project Data – Removes the selected documents from Project Data entirely, including the Discard Pile, if the selected documents reside there. A Work Basket task called Removing Documents from Project Data reports the results. Documents removed from Project Data/Discard Pile are still available in the appropriate Data Set in the Project, in the event that you need to add them to Project Data again, but the documents no longer have any Project Data information that was previously applied, such as Tags.
Download as PDF – Enables you to download a single document, multiple selected documents on a page, or all documents in the view as a PDF to your local environment so that you can view the documents in PDF format. When you select this operation, you can select the Stamp Document Number option if you want to include a stamp with the document number (docnum) on the bottom right of each page in the PDF. If you select the top checkbox to save all documents as PDFs, you will see a Warning popup that states the following: You are attempting to download all documents in this list as PDFs. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the PDFs directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. Whether you select one, multiple, or all documents to download, the software will prepare a ZIP file, by default named <projectname>_PDFs.zip. An information popup indicates that the PDFs are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket. Note that certain file types are ignored for PDF generation, including any selected directory folders not removed from your Project during setup by your administrator, disk images, file archives, mail archives, empty files, and files for which the native is not available. A WARNING_DETAILS_REPORT.csv file identifying the files that were skipped or failed PDF generation can be downloaded from the appropriate PDF-related Work Basket task. See About Downloading Documents as PDFs and Natives for more information.

Download Native – From the Exceptions tab or All Docs tabs, enables you to download a single document, multiple selected documents on a page, or all documents in the view to your local environment so that you can view the documents in their native format. Any selected directory folders are ignored for the download. A WARNING_DETAILS_REPORT.csv file identifies any native files that were not downloaded. (See About Downloading Documents as PDFs and Natives for more information.) If you select the top checkbox for all documents, you will see a Warning popup that states the following: You are attempting to download all natives in this list. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the natives directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. Whether you select one, multiple, or all documents to download, the software will prepare a ZIP file, by default named <projectname>_Documents.zip. An information popup indicates that the documents are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket.
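Once you have downloaded a WARNING_DETAILS_REPORT.csv file, you might want a quick count of how many files it lists. The Python sketch below is one way to do that. It assumes the first row of the report is a header row, which this topic does not confirm, and the sample data uses placeholder column names because the real columns are not described here.

```python
import csv
import io
from typing import List, Tuple


def summarize_warning_report(csv_text: str) -> Tuple[List[str], int]:
    """Return (header columns, number of data rows) for a warning report.

    Assumption: the first non-empty row is a header row.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return [], 0
    return rows[0], len(rows) - 1


# Placeholder content only; substitute the text of your downloaded report.
sample = "column_a,column_b\nvalue_1,value_2\nvalue_3,value_4\n"
header, file_count = summarize_warning_report(sample)
print(header, file_count)
```

To use it with a real report, read the file first, for example with open(path, newline="") and read(), and pass the text to the function.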

Note: If you see a CAE_ERROR with a description of PAGE_JOB:null, ask your System Administrator to check your NAS storage timing. If the NAS timing is off, you may see this error when generating certain document lists that rely on the availability of created files (for example, if you try to use View Exceptions for a data set after Project Data is populated).

Selected Document Options

When you select a single document in a Data Set document list and right-click (or click the ellipsis at the far right), a document context menu appears with a list of options:

Open Document Inline – Launches the Document Viewer, which appears in place of your Document List content in the lower portion of the screen. When working inline, you can select view modes, navigate documents by using the page controls at the bottom, and perform operations such as tagging.
Open Document in New Window – Launches the Document Viewer in a new browser window (or tab, depending on your browser options). This version of the Document Viewer enables you to select any document in the paged Document List and see the full content of that document (or other views, such as Metadata or History). You can also launch multiple windows for different documents to perform side-by-side reviews of multiple documents. When you open the Document Viewer in a new browser window, you can select view modes in the top center portion of the screen, navigate documents by using the page controls at the bottom, and perform operations such as tagging.
Open Family Inline – Launches a Family-specific version of the Document Viewer for a given Family (MAG or DAG) inline, which appears in place of your Document List content in the lower portion of the screen.
Open Family in New Window – Launches the Document Viewer for a Family (MAG or DAG) in a new browser window (or tab). Launching this version of the Document Viewer enables you to focus on the other family members of a selected parent email/document or email or embedded attachment. Family members are indented under their parent. MAGs are sorted by the email sent date.
Find Exact Duplicates of This... – Searches for documents that have exactly the same content and metadata as the selected document. An exact duplicate would have the same file MD5 value.
Find Content Duplications of This... – Searches for documents that have the same content as the selected document.
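The idea that exact duplicates share the same file MD5 value can be illustrated outside the product with a short Python sketch. This is not the software's implementation. It only shows how files with identical bytes produce identical MD5 values, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_of(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents have the same MD5 value."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[md5_of(data)].append(name)
    # Keep only groups with more than one member, that is, actual duplicates.
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


files = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
print(group_exact_duplicates(files))
```

A content duplicate search, by contrast, compares document content only, so two files can be content duplicates without sharing a file MD5 value.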

Navigation Tree Options for an Entire Data Set

For a list of options that apply to an entire Data Set, you can use the right-click options for a Data Set in the Navigation Tree.

In general, options that include ... in the name indicate that they have an associated dialog. Options without ... run when you select them and do not have an associated dialog.

The right-click options for an entire Data Set are as follows:

Add Tags... – Launches the Tag dialog, from which you can select Tags to apply. You can also create a Tag and use it right away.

Add To... — If you have the appropriate permissions, you can either add documents from a Data Set view to all of Project Data, or to a selected Custodian, MediaID, Batch, or Folder view in Project Data based on permissions. Note that you can only add data at the Analytic Index or Content Index level to Project Data. You cannot add data at the System or File Metadata Index level to Project Data.
Add To Project Data – Provides a quick way to add all the documents in a Data Set to Project Data. Note that when you select this operation, the software performs Custodian, MediaID, and Batch view generation for all documents in Project Data, not just for the documents from this Data Set.
Remove from Project Data — Removes all Data Set documents that are in Project Data, if applicable. The documents will still reside in the Data Set, just not in Project Data.
Find Exact Duplicates — Searches for documents in the Data Set that have exactly the same content and metadata as the selected document. An exact duplicate would have the same file MD5 value.
Find Content Duplicates — Searches for documents in the Data Set that have the same content as the selected document.
Calculate Word List — Calculates the Word List for all documents in the Data Set. A task appears in the Work Basket while the Word List is being generated. When the task completes, you can view the Word List.
View Word List... — Launches the Word List dialog and enables you to view the calculated Word List for all documents in the Data Set.
Create Manifest... — For a Data Set selected from the Imports Summary, launches the Create Manifest dialog, from which you can generate a CSV or XML manifest for a Data Set, using either the current fields or all fields. From the Work Basket task for the manifest generation, you can then right-click and select Download to download the file to a destination local to your computer. Users with Server Access permissions can also save the manifest to a server location. For download of a large manifest file (over 200 MB), the software places the manifest in a ZIP file, which you can then unzip. Note that this process can take time.
Download All as PDFs — Enables you to download all documents in the view as PDFs to your local environment so that you can view the documents in PDF format. When you select this operation, you can select the Stamp Document Number option if you want to include a stamp with the document number (docnum) on the bottom right of each page in the PDF. Note that this operation will also show a Warning popup that states the following: You are attempting to download all documents in this list as PDFs. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the PDFs directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. If you proceed, the software will prepare a ZIP file, by default named <projectname>_PDFs.zip. An information popup indicates that the PDFs are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket. Note that certain file types are ignored for PDF generation, including any selected directory folders not removed from your Project during setup by your administrator, disk images, file archives, mail archives, empty files, and files for which the native is not available. A WARNING_DETAILS_REPORT.csv file identifying the files that were skipped or failed PDF generation can be downloaded from the appropriate PDF-related Work Basket task. See About Downloading Documents as PDFs and Natives for more information.
Download All Natives — Enables you to download all documents in the view to your local environment so that you can view the documents in their native format. You will see a Warning popup that states the following: You are attempting to download all natives in this list. Depending on the size of the documents, this could take considerable time and/or render the browser unresponsive. Consider creating a new export stream to produce the natives directly to an export location instead. At this point, you must either confirm the operation by clicking Continue, or click Cancel instead. If you proceed, the software will prepare a ZIP file, by default named <projectname>_Documents.zip. An information popup indicates that the documents are being prepared for downloading, and once finished, the archive (ZIP) can be downloaded from the Work Basket. Note that any directory folders are ignored for the download. See About Downloading Documents as PDFs and Natives for more information.
View Configuration... — For a selected Data Set, displays the set of index settings that were in effect when the Data Set was processed. See View Configuration for more information.
Reindex... — Submits a request to reindex a Private Data Set at an available level (for example, to go from Do Not Index to a selected level, or to go from Content to Analytic level). You cannot reindex a Shared Data Set. The Index Level for the Data Set, shown in the Data Sets Summary, will change to In Progress if the operation is permitted. This option is unavailable for a Shared Data Set in a sub-Project (that is, a Project using the Shared Data Set).
Update Patterns – For a Private Data Set or Shared Data Set in the originating Project that is at the Content or Analytic Index level, this option submits a request to update the Patterns for the Data Set (that is, Patterns that have been modified since import). You cannot update Patterns for a Data Set at index levels other than Content or Analytic (for example, File Metadata or System Metadata). This option is grayed out and not available for a Shared Data Set in a sub-Project (that is, a Project using the Shared Data Set), but it is available for a Shared Data Set in the originating Project. If you update Patterns in the originating Project for a Shared Data Set, all sub-Projects using the Shared Data Set will receive the Pattern changes for that particular Data Set. Sub-Projects will be blocked while the Pattern update is in progress. Enabled Pattern matches will be stored in the pattern metadata field; Pattern matches for enabled Patterns with Store Value enabled will be stored in the patternvalue field. This option will act on the text representations that already exist from initial import, reprocessing, or OCR. If you see a Warning icon for this operation in the Work Basket, one or more documents did not have their Patterns updated. You can then click the icon to request a download or use the right-click Download option for the completed Work Basket task to download a WARNING_DETAILS_REPORT.csv file that identifies the reason why the Patterns were not updated for certain files. (See How to Update Patterns for more information about the exceptions that apply to this operation.)
Share | Unshare – For a Data Set eligible for sharing in the Organization, makes the Data Set Public (Shared). A Data Set is initially Private. Note that imported Load Files cannot be shared. If you want to share a Private Data Set with other Projects in your Organization, make sure that you have the Index Level that you want, and have performed any OCR or other reprocessing before you share the Data Set. (Reindexing, OCR, and reprocessing are unavailable once the Data Set is Shared.) When you are satisfied with the Index Level and processing of the Data Set, select the Share option for the Data Set. A popup confirms that the named Data Set has been shared. Note that once a Data Set is Shared, you will get an error if you try to change the Shared state by clicking Unshare while the Data Set is in use by any other Project. Only the originating Project can unshare the Shared Data Set, and only when it is not in use by other Projects. Projects that add the Shared Data Set (that is, sub-Projects rather than the originating Project) cannot use either the Share or Unshare option for that Data Set; these options are unavailable.
Copy to Document Storage... (available only for Data Sets that have not already been copied or were partially copied) – For a Data Set created in your Project or a shared Data Set added to your Project, this option enables you to review the Copy to Document Storage Exclusion Options and then copy the source files from an imported Data Set's import location to the Organization’s designated Document Storage. (A System Administrator manages the Document Storage for an Organization using the Admin interface.) This helps free the storage associated with the import location. Some document metadata fields (dahandle and darelativepath) will be updated to reflect the new location of the documents after the copy. If this option appears grayed out for a Data Set, everything in the Data Set has already been copied to Document Storage. When you perform this operation, a copy task (Copy to Document Storage <Data Set Name>) appears in the Work Basket. If necessary, you can cancel this task from the Work Basket. If you see a Warning icon in a completed Copy to Document Storage task in the Work Basket, one or more files were either excluded from the copy or failed to copy. You can then use the right-click Download option for the Work Basket task to download a WARNING_DETAILS_REPORT.csv file that identifies the reason why each file was not copied. (See How to Perform a Copy to Document Storage for more information about the exclusions or errors that apply to the copy operation.) Since the Data Set is considered partially copied, you can then select the Copy to Document Storage option from the Imports Summary again if you want to retry the copy to potentially copy more of the files that were previously excluded or that failed to copy. You can also perform a copy of the Data Set source files to Document Storage as part of a given import, although doing so will impact your import time. This option does not apply to a Load File and is grayed out and not available for a Shared Data Set in a sub-Project (that is, a Project using the Shared Data Set).

Note: The Copy to Document Storage operation can preserve the staging used by certain file types (as long as their associated document classes are not excluded by the operation), such as Forensic Image file types, multi-part RAR files, and Bloomberg Message Dump files. For more information about Copy to Document Storage, see How to Perform a Copy to Document Storage Operation.

Edit... — For users with the appropriate Imports - Add/Edit permissions, launches the Edit Data Set dialog to permit editing of the name or description of the Data Set. This option is unavailable for a Shared Data Set in a sub-Project (that is, a Project using the Shared Data Set).
Delete — For users with the appropriate Imports - Delete permissions, initiates a request to delete a Data Set and its documents. You can also use this option to delete a Data Set based on an imported Load File or detach from a Data Set that you have added to your Project from the Organization. When you select this option, confirmation is required; select OK to process or Cancel to cancel the operation. Your ability to delete a Data Set from a Project is based on whether the Data Set is currently in use within the Project (for example, in Project Data, or a Folder). If the Data Set is not in use, the operation deletes documents and references to the selected Data Set from the Project. For a Data Set owned by the Project (Private), the operation also deletes the Data Set from the Organization and frees all associated resources.
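Several of the right-click options above depend on the Index Level of the Data Set and on whether the Data Set is Shared and used in a sub-Project. The Python sketch below summarizes those rules as described in this topic. It is a simplified illustration with hypothetical names. It ignores role-based permissions, Load File restrictions, and whether a Shared Data Set is currently in use by other Projects.

```python
from dataclasses import dataclass
from typing import Dict

# Index levels that allow Add To Project Data and Update Patterns
CONTENT_LEVELS = {"Content", "Analytic"}


@dataclass(frozen=True)
class DataSetState:
    index_level: str      # for example "Content", "Analytic", "File Metadata"
    shared: bool          # True once the Data Set has been Shared
    in_sub_project: bool  # True when this Project only uses the Shared Data Set


def available_options(ds: DataSetState) -> Dict[str, bool]:
    """Return which options would be available under the simplified rules."""
    shared_in_sub = ds.shared and ds.in_sub_project
    content_level = ds.index_level in CONTENT_LEVELS
    return {
        "Add To Project Data": content_level,
        "Update Patterns": content_level and not shared_in_sub,
        "Reindex...": not ds.shared,  # a Shared Data Set cannot be reindexed
        "Share | Unshare": not shared_in_sub,
        "Edit...": not shared_in_sub,
    }


print(available_options(DataSetState("Content", shared=True, in_sub_project=True)))
```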