Calculate and Use the Word List

Imports or Data Set view > Calculate Word List
Imports or Data Set view > View Word List


Project Data | Custodian view | Tag view | Folder view > Search Result view > Calculate Word List

For Imports or a Data Set, calculating the Word List requires Imports - Add/Edit permissions.

If you have the appropriate permissions, you can request calculation of the Word List for the following views:

  • All Imports
  • A Data Set
  • Project Data
  • A Custodian view
  • A MediaID view
  • A Batch view
  • A Folder view
  • A Tag view
  • A Project Data-based search result from Search History (including a Sample)
  • A Saved Search of a Project Data-based search result

Note: The Word List applies to all of the data with an Analytic Index level (for example, under Imports). At least one of the imported or source Data Sets must be at the Analytic Index level to generate the Word List. Data Sets at other levels (for example, Content Index, File Metadata, or System Metadata) are ignored. If no Data Sets are at the Analytic Index level, you cannot calculate the Word List, and an error message appears. Contact your Administrator to address the situation. You also cannot calculate the Word List for an individual Data Set that is at an Index level other than Analytic.

The Word List could take time to produce in a large Project, so a temporary Work Basket task enables you to monitor the progress. When this Work Basket task completes and no longer appears, you can attempt to view the Word List.

Note: You cannot calculate or view the Word List for tree items such as the Discard Pile, Analytics (for example, a Comparison or Synthetic Document), a Workflow, or any type of Export.

As long as the Word List has been calculated and is up-to-date for the selected view, you can select View Word List to examine the Word List Report for that view's data and perform searches using words you select from the Word List.

Note that if the view contents have changed since the Word List was last calculated (for example, documents have been added or removed), you cannot view the Word List until you select Calculate Word List to recalculate it for that view. This ensures that you always see an up-to-date Word List. If the view has not changed since the Word List was last calculated, you can view the Word List at will. In this case, the Calculate Word List option is unavailable, which indicates that you have the most recent Word List for the view.
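The availability rules above can be summarized in a short sketch. The function name and its two boolean inputs are hypothetical and only model the behavior described here; they are not part of the product.

```python
def available_actions(calculated, view_changed):
    """Return the Word List menu options this simplified model treats as available.

    calculated   -- True if a Word List has been calculated for the view
    view_changed -- True if documents were added or removed since that calculation
    """
    if not calculated or view_changed:
        # No current Word List exists, so it must be (re)calculated before viewing.
        return {"Calculate Word List"}
    # The Word List is up to date: viewing is allowed, recalculation is unavailable.
    return {"View Word List"}
```

For example, a view whose Word List was calculated and then received new documents would offer only Calculate Word List in this model.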

About the Words in a Word List

When examining words in the Word List, note the following:

  • All words in the list are shown in lowercase.
  • The list will include simple terms.
  • The list will not include alphanumerics.
  • The list will not include standalone numbers.
  • Stop Words will not appear in the calculated Word List if the default Stop Words setting is in effect during import and indexing. To have Stop Words included, the Index Settings must include Stop Words when the data is imported and indexed.
  • The Word List is calculated from a vector dictionary (not directly from the index) and includes content only. It upholds a minimum term size (by default, 3 characters) and a maximum term size (by default, 32 characters). Therefore, your Word List will not include terms shorter than 3 characters or terms longer than 32 characters, regardless of whether you include Stop Words for imports (in the Index Settings).
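As a rough illustration of the rules in this list, the following sketch checks whether a token would be eligible for the Word List. It assumes that a "simple term" consists of letters only, and it does not model Stop Words; the function name and constants are hypothetical, with the size limits taken from the defaults stated above.

```python
MIN_TERM_SIZE = 3   # default minimum term size stated in the text
MAX_TERM_SIZE = 32  # default maximum term size stated in the text

def is_listed_term(token):
    """Approximate whether a token would be eligible for the Word List."""
    term = token.lower()  # words in the list are shown in lowercase
    if not MIN_TERM_SIZE <= len(term) <= MAX_TERM_SIZE:
        return False      # too short or too long
    # Assumption: letters only, which excludes alphanumerics and standalone numbers.
    return term.isalpha()
```

Under these assumptions, "Report" is eligible, while "ab", "abc123", and "2024" are not.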

For data processed prior to 4.3.11.0, also note the following:

  • The list will include matches for the default enabled and stored patterns, which include any detected email addresses, local or UNC paths, and complete URIs/URLs. For data processed as of 4.3.11.0, the Word List will not contain matches for patterns, as those values are associated with the pattern and patternvalue fields.
  • The list will include matches for tokens associated with enabled patterns.
  • If you select local or UNC paths or URIs/URLs for a search, note that you will have to edit them to place them in single quotes and escape any \ (backslash) characters in a path.
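The manual edit described in the last item can be sketched as follows. This is a minimal illustration that assumes a backslash is escaped by doubling it; the exact escape syntax the search expects is not stated here, and the function name is hypothetical.

```python
def quote_path_term(term):
    """Wrap a path or URI/URL in single quotes and escape its backslashes."""
    # Assumption: each backslash is escaped by doubling it.
    escaped = term.replace("\\", "\\\\")
    return "'" + escaped + "'"

print(quote_path_term(r"c:\temp\report.txt"))
print(quote_path_term("http://example.com/a"))
```

A URL without backslashes is only wrapped in single quotes; a local or UNC path also has each backslash doubled.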

Available Words

The Word List Report provides a list of available words with column information and checkboxes, as follows:

  • (top-level checkbox) — You can use this top-level checkbox to select all words (for example, in a filtered list) on a page. You can then add up to a page of these words at a time (100 words at a time) to the Words to Search list for a search. To deselect all words, clear the top checkbox. If you select one or more but not all words in the list, the top checkbox changes to indicate that some, but not all, items on the page have been selected. In addition, a message indicates how many items you have selected (for example, Selected: 1 on this page). This top checkbox only affects the population of the Words to Search list, not the list to download. Any filtering you do determines what is downloaded (or, if you do not filter, all words are eligible to download).
  • (per-word checkbox) — Use the checkbox to the left of a word to select that word. To deselect a word, clear the per-word checkbox. Use the per-word checkbox to add selected individual words to the Words to Search list for a search.
  • Word — A list of words detected for the view for which the Word List was calculated (after import). You can optionally sort the Word List by this column in alphabetical ascending or descending order. This column provides a Filter box, as described in the next section.
  • Occurrences — The total number of times the word appears (based on the data imported in the Project). You can optionally sort the Word List using this column.
  • Documents (default sort column) — The number of imported documents in the Project containing the word. This is the default sort column for the Word List, in descending order.

Word List Filter Option

  • Filter (wildcards accepted) – Use the Filter text box under the Word column. (The icon indicates that filtering is available.) The Filter box lets you pinpoint the words you want to match using a filter term that is not case-sensitive. You can specify a whole word to match that word, use the * (asterisk) wildcard to match one or more characters in a word, or use the ? wildcard to match a single character in a word. For example, to find all words that include da, type *da*. The Word List is then filtered based on what you typed (for example, the filter *da* matches words that include da or DA, and the filter c:\* matches terms that start with c:\ or C:\). A download of the Word List upholds that filtering, so you get the filtered list rather than the entire Word List. For example, if you filter for words matching wa?t and download the Word List, the downloaded document contains only words matching wa?t (such as want and wait). The filter can accommodate terms that start with a character such as a single quote (for example, 'info*). You can explicitly apply a filter by typing in the text box and pressing Enter (the Return key). If you type in the text box, the software automatically applies the filter for you, and the text box changes to a yellow background color. To clear an applied filter, remove the text from the box (optionally pressing Enter), or click the icon that appears at the far right of the Filter box. Clearing a filter restores the list to its original state. See About the Words in a Word List for more information.
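The wildcard matching described above can be approximated outside the product, for example to predict what a filter will return. The sketch below is not the product's implementation: it translates a filter term into a case-insensitive regular expression and assumes that * may also match zero characters, which is the common wildcard convention. Both function names are hypothetical.

```python
import re

def filter_to_regex(pattern):
    """Translate a Word List style filter term into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # assumption: zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character, including \ and '
    return re.compile("".join(parts), re.IGNORECASE)

def apply_filter(words, pattern):
    """Return the words that match the whole filter term, ignoring case."""
    rx = filter_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]

print(apply_filter(["want", "wait", "water", "wat"], "wa?t"))
print(apply_filter(["data", "Sunday", "dog"], "*da*"))
```

Because the whole word must match, a filter term with no wildcards matches only that exact word, as the text describes.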

Page Control Bar Options

For multi-page lists, you can select a page to display. The Page Control bar on the screen (at the bottom of the Available list) shows a range and enables you to enter the page number in the box or use the appropriate arrows to navigate. Note that you must press the Enter key after typing a page number in the box for the change to take effect.

Each page displays 100 words.
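Assuming the page size of 100 entries stated above, the number of pages and the range shown for a given page follow from simple arithmetic. The helper names below are hypothetical and are shown only to make the paging concrete.

```python
import math

PAGE_SIZE = 100  # entries per page, as stated in the text

def page_count(total_words):
    """Number of pages needed to show total_words entries (at least one page)."""
    return max(1, math.ceil(total_words / PAGE_SIZE))

def page_range(page, total_words):
    """First and last entry numbers (1-based) shown on the given page."""
    first = (page - 1) * PAGE_SIZE + 1
    last = min(page * PAGE_SIZE, total_words)
    return first, last
```

For a filtered list of 250 words, this gives three pages, with the last page covering entries 201 through 250.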

The Page Control bar also offers the following:

  • (Refresh icon) – Enables you to perform an immediate refresh by clicking the Refresh icon on the Page Control bar.
  • (Download button) – Enables you to request a download of the Word List. Downloading the Word List requires, at minimum, Document Reports permissions. For a user with both Document Reports and Connector Access permissions, this button launches the Download Word List popup, which offers both local Download and Save to Server options. For a user with only Document Reports permissions, this button requests a download of the Word List using a Downloading Word list manifest task in the Work Basket. You can monitor and later retrieve the downloaded Word List from this Work Basket task using a right-click Download option, which downloads the Word List to your local environment in CSV format. Any download of the Word List respects your current sort and filter options for the Word List. In a large Project, the Word List may take some time to download, as indicated by a message. To open the Work Basket task and view details about the task, click View Details.

Note: The Download function currently supports 2 million words. For a larger set of data, you can either filter the list or, if you do not have Save to Server permissions, contact a Digital Reef Administrator, who can arrange an export of the Word List to a server location (using an Export Data Area). The Work Basket task for the Download reports a Failure message when the Word List is too large to download.

Words to Search

If you want to set up searches using words from the Word List, you may want to start by filtering the list. Then select words from the filtered list (or the top-level checkbox for all words on a page) and use the center button to add the words to the Words to Search list on the right. You can add up to a page of words at a time to the Words to Search list.

Note: When you add words to the list, the words still remain on the left (so that they can be included in a download).

Important Note: You can use the top-level checkbox in the Available Words list to add an entire page of these words at a time (100 words at a time) to the Words to Search list for a search.

The Words to Search list is not paged.

For data processed prior to 4.3.11.0, when you add words to the right, the software automatically accommodates words that represent a local or UNC path or URI/URL to ensure that the search will not fail due to the existence of a backslash (\). UNC paths and URIs/URLs will be automatically placed in single quotes, and any backslash (\) characters that are part of a path will be escaped for you. (These automatic changes will not be visible to you in the Word List dialog.)

You can continue to build your Words to Search list until you are satisfied. Then select the search settings that you want and select where to search.

Once you populate the Words to Search list, use the checkboxes to select any words you want to remove from the list:

  • (top-level checkbox) — You can use this top-level checkbox to select all words (for example, in a filtered list) on a page. You can then add up to a page of these words at a time (100 words at a time) to the Words to Search list for a search. To deselect all words, clear the top checkbox. If you select one or more but not all words in the list, the top checkbox changes to indicate that some, but not all, items on the page have been selected. In addition, a message indicates how many items you have selected (for example, Selected: 1 on this page).
  • (per-word checkbox) — Use the checkbox to the left of a word to select that word. To deselect a word, clear the per-word checkbox. Use the per-word checkbox to add selected individual words to the Words to Search list for a search.

(right green arrow) — After you make the appropriate checkbox selections on the Available Words list, you can click the right green arrow button to add the selected words to the Words to Search list. (This arrow becomes active when you make selections on the Available list.)

(left red arrow) — After you make the appropriate checkbox selections on the Words to Search list, click the left red arrow to move the selected words back to the Available Words list. (This arrow becomes active when you make selections on the Words to Search list.)

Options for Setting Up a Search

Search Settings

The following three search settings enable you to request that your search include email or document families, metadata field information for a subset of common fields, and/or synonyms.

Include Families (enabled by default) — This checkbox option is enabled by default to ensure that all available family members of a Message Attachment Group (MAG) or Document Attachment Group (DAG) are included in the results of the Search operation. This includes a selected parent email, parent document, associated attachments, and embedded messages or documents. For example, with this option enabled, a search that returns a parent email in the results also includes that parent's attachments and any associated embedded files. Similarly, if a document attachment appears in the results, its other family members, such as its parent email, also appear in the results. Disabling this option causes the results to include just the selected documents, not the entire family (MAG or DAG).
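The effect of Include Families can be pictured as expanding each hit to every member of its family. The sketch below is only a model of that behavior: the function name and the family_of mapping (document ID to MAG or DAG ID) are hypothetical data structures, not product APIs.

```python
def include_families(hits, family_of):
    """Expand a set of hit document IDs to all members of their families.

    hits      -- set of document IDs returned by the search
    family_of -- dict mapping a document ID to its family (MAG or DAG) ID;
                 documents that belong to no family are absent from the dict
    """
    hit_families = {family_of[doc] for doc in hits if doc in family_of}
    expanded = set(hits)
    for doc, family in family_of.items():
        if family in hit_families:
            expanded.add(doc)  # parent, attachment, or embedded item of a hit
    return expanded

family_of = {"email1": "mag1", "att1": "mag1", "att2": "mag1",
             "doc9": "dag2", "emb9": "dag2"}
print(sorted(include_families({"att1"}, family_of)))
```

With the option disabled, the result in this model would simply be the original hits, without the expansion step.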

Include Metadata (enabled by default) — This checkbox option is enabled by default to expand the search of each keyword in a query to include a set of metadata fields as well as content. You can select the Search Fields you want to have searched automatically. See Using the Include Metadata Option for a list of the default fields searched. When the Include Metadata option is enabled, all individual keywords as well as the keywords in phrases are subject to expansion. You can control the expansion on a per-term basis and limit the search of a given keyword to just content by specifying content::<keyword> for a given keyword or content::(<keyword1> <keyword2> <keyword3>) for a group of keywords.
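The two content:: forms above can be generated with a small helper when you prepare terms outside the product. This is an illustrative sketch only: the helper name is hypothetical, and the example keywords are placeholders.

```python
def content_only(*keywords):
    """Build a content:: term that limits one or more keywords to content."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    if len(keywords) == 1:
        return "content::" + keywords[0]              # content::<keyword>
    return "content::(" + " ".join(keywords) + ")"    # content::(<k1> <k2> <k3>)

print(content_only("merger"))
print(content_only("merger", "audit", "budget"))
```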

in <selected view> — After you click the Select Target button to select where to search, the search target appears, and you can click this button to start the search using the selected words. This button is not active unless you have words in the list. The Select Target button enables you to select an available target, such as Project Data, a Folder, or a Custodian.

Close — Click this button to close the Word List.

Note: Once you start a search, the Word List stays open to enable you to perform additional searches. When you want to view the results (shown as a Freeform Search), close the window using the close control at the top right, or click Close. In the Reports tab for the results, you will see the selected words included in the search, each separated by an OR.
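To make the OR relationship concrete, the sketch below joins a Words to Search list the way the note describes the words appearing in the results report. It is an assumption that the report text corresponds to a plain " OR " join; the function name is hypothetical.

```python
def build_or_query(words):
    """Join the selected words with OR, skipping empty entries."""
    return " OR ".join(w for w in words if w)

print(build_or_query(["want", "wait", "water"]))
```

A document containing any one of the selected words would satisfy a query of this form.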

Usage Notes

  • You may want to filter the Word List before you set up a search. Filtering enables you to quickly focus on words that match what you specify (for example, using wildcards).
  • You can sort by the Word, Occurrences, or Documents column.
  • You can page through a list of the Available Words.
  • You will be notified if the Word List is too large to download locally.