Not All Search Technologies Are Created Equal

It's helpful to understand that eDiscovery platforms have different search technologies under the hood. Historically, dtSearch was widely used because it was affordable and easy to implement. More recently, free and open source search technologies have become available. The most widely used is Lucene, which major companies rely on both in eDiscovery and in the general technology world.

dtSearch and Lucene

dtSearch, a constant in the market for years, is a closed source tool whose syntax has been widely adopted. Searching "within" by using "W/" is fairly well known; "pipe W/5 broken" means the user is looking for any mention of the term pipe within five words of broken. dtSearch also provides a robust set of default indexing settings covering common noise words, vocabulary, and ignorable whitespace characters. These defaults are easily modified, but note that a complete reindex is required whenever one of them is changed.
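
As a rough illustration of what a "within" search does, the following Python sketch checks whether two terms appear within a given number of words of each other. The function name `within` and the simple word-splitting rule are assumptions made for this example; dtSearch's own handling of noise words, punctuation, and word counting may differ.

```python
import re

def within(text: str, first: str, second: str, distance: int) -> bool:
    """Return True if `first` appears within `distance` words of `second`."""
    # Simplified tokenization: lowercase runs of letters and digits.
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Proximity is checked in both directions, like an unordered W/ search.
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )

near = within("The pipe in the basement was broken.", "pipe", "broken", 5)
far = within(
    "The pipe was installed in 1990; years later the window was broken.",
    "pipe", "broken", 5,
)
print(near)  # True: "broken" is five words after "pipe"
print(far)   # False: the terms are ten words apart
```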

In contrast, Lucene, in its purest form, does not provide a set of default noise words, vocabulary, and ignorable whitespace; those must be sourced elsewhere. Lucene has its own query syntax, and search servers built on it, such as Elasticsearch, accept queries expressed in JSON, a structured data format. Most implementations of Lucene-style indexes in the eDiscovery space mask this native query language with an interpreter that mirrors the syntax used by dtSearch, which looks and feels familiar to users. The primary benefits of a modern index like Lucene are lower licensing cost and advanced features such as distributing an index across multiple discrete servers. Distribution increases the management cost of an index but adds redundancy to the search system. For example, one index server can be rebooted while the others continue to serve searches, which is not possible with dtSearch.
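
To give a feel for what such an interpreter does, here is a minimal Python sketch that rewrites a simple two-term dtSearch-style proximity expression into the proximity form used by Lucene's classic query parser (a quoted phrase followed by `~N`). The function name `to_lucene_proximity` is hypothetical, and the two engines do not count distance in exactly the same way, so a real interpreter would need to adjust the number and handle far more syntax than this.

```python
import re

# Matches a simple two-term proximity expression such as: pipe W/5 broken
PROXIMITY = re.compile(r"^\s*(\w+)\s+[Ww]/(\d+)\s+(\w+)\s*$")

def to_lucene_proximity(query: str) -> str:
    """Rewrite 'a W/N b' as a Lucene-style proximity phrase (illustrative only)."""
    match = PROXIMITY.match(query)
    if match is None:
        # Anything beyond the two-term form is out of scope for this sketch.
        raise ValueError(f"unsupported query: {query!r}")
    left, distance, right = match.groups()
    return f'"{left} {right}"~{distance}'

print(to_lucene_proximity("pipe W/5 broken"))  # "pipe broken"~5
```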

Live Searching

Live searches, or database text searches, are alternatives to the index-based searching discussed above. A significant portion of eDiscovery platforms rely on some type of database to store metadata and coding results. Platforms often allow a search directly against these databases for metadata filtering. Some also accommodate real-time searching of textual data stored in these databases.

Microsoft SQL Server is a common choice, and it has a robust full-text index if configured properly. If a platform allows searching this way, it's a quick route around noise words and special characters. For instance, an index-assisted search for the term "2%" will not locate any hits because the index treats % as a noise character, but an SQL search interpreter that executes the search correctly will locate those two characters next to each other.
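
The difference can be sketched in a few lines of Python. This example is an assumption-laden stand-in: it uses SQLite (from Python's standard library) in place of Microsoft SQL Server, a toy tokenizer in place of a real indexing engine, and two made-up documents. It is meant only to show why a literal database match can find "2%" when an index that discards % cannot.

```python
import re
import sqlite3

# Made-up sample documents for illustration.
documents = {
    1: "Sales rose 2% over the prior quarter.",
    2: "We shipped 2 pallets on Monday.",
}

def index_tokens(text):
    # Toy indexer: keeps only letters and digits, so "%" is discarded.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

index = {doc_id: index_tokens(text) for doc_id, text in documents.items()}

# Index-assisted search: the token "2%" was never stored, so nothing matches.
index_hits = [doc_id for doc_id, tokens in index.items() if "2%" in tokens]

# Live database search: LIKE compares raw characters; "\%" is a literal percent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", list(documents.items()))
rows = conn.execute(
    r"SELECT id FROM docs WHERE body LIKE '%2\%%' ESCAPE '\' ORDER BY id"
).fetchall()
live_hits = [row[0] for row in rows]
conn.close()

print(index_hits)  # []
print(live_hits)   # [1]
```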

Knowing how searches will perform and which search engine underlies the program helps users understand the limitations of their technology. Users can then develop better searches and get to key documents faster. It's critical to employ an expert in this area, and computer scientists can be particularly beneficial. Programming languages and search syntax follow similar logic, and both can become complex very quickly. Litigation support professionals or vendors with extensive technical experience can save many hours of frustration with search issues.

Search Term Reports

One important tool in everyone's search workflow is the search term report. This is more than just putting the number of results from each individual search into an Excel sheet. When using a search term report, the user provides the platform with a list of searches, which can include filtering, keywords, concept searching, PII, searching for inclusive emails, and other AI-assisted searches. Reporting on how a set of keywords performed is only the basic functionality.

Once a user provides a set of searches, the system runs them and presents various calculations. The most basic is the number of documents that hit on a given search, which is useful for identifying broken or bad searches. As an example, if a search returns no results, it may be because the search is not formatted correctly. Search term reports also calculate review burden and unique hits. For review burden, the system counts each document that has a hit along with all the family items related to that document. If a hit is in an email containing 3 attachments, the single hit will cause a review burden of 4 total documents.
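
The review burden calculation can be sketched as follows. The document IDs, family IDs, and the `review_burden` function are hypothetical names chosen for this example; an actual platform would derive family relationships from its own metadata.

```python
from collections import defaultdict

# Hypothetical document set: document ID -> family ID.
family_of = {
    "EMAIL-1": "FAM-1", "ATT-1a": "FAM-1", "ATT-1b": "FAM-1", "ATT-1c": "FAM-1",
    "EMAIL-2": "FAM-2",
    "DOC-3": "FAM-3",
}

def review_burden(hit_ids, family_of):
    """Return every document that must be reviewed: hits plus their families."""
    families = defaultdict(set)
    for doc_id, family_id in family_of.items():
        families[family_id].add(doc_id)
    burden = set()
    for doc_id in hit_ids:
        # Pull in the whole family of each document that hit.
        burden |= families[family_of[doc_id]]
    return burden

hits = {"ATT-1b"}  # a single hit inside one attachment
burden = review_burden(hits, family_of)
print(len(hits), len(burden))  # 1 4
```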

Unique hits count the documents that hit on one search and on no other search in the report. A large number of unique hits for a term indicates that the search should be qualified or refined to narrow the results. Imagine a search term report containing the words ball, football, and hockey, run against a set of 10,000 documents. The report shows 5 unique hits for football, 1,000 for hockey, and 7,000 for ball. This indicates the term ball should be qualified, because many documents contain the word ball but neither hockey nor football.
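
A minimal sketch of the unique hit calculation, assuming each search has already been run and its hits collected as a set of document IDs. The `unique_hits` function and the tiny sample data are illustrative only and are not taken from any particular platform.

```python
def unique_hits(hits_by_term):
    """For each term, count total hits and hits found by no other term."""
    report = {}
    for term, docs in hits_by_term.items():
        # Union of every document hit by any of the other terms.
        others = set().union(
            *(d for t, d in hits_by_term.items() if t != term)
        )
        report[term] = {"hits": len(docs), "unique": len(docs - others)}
    return report

# Made-up document IDs for illustration.
hits_by_term = {
    "ball": {1, 2, 3, 4},
    "football": {4, 5},
    "hockey": {3},
}

report = unique_hits(hits_by_term)
for term, counts in report.items():
    print(term, counts)
# ball {'hits': 4, 'unique': 2}
# football {'hits': 2, 'unique': 1}
# hockey {'hits': 1, 'unique': 0}
```

In this toy data, ball has the most unique hits, which is the signal that the term is a candidate for qualification.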

Not One and Done

It's very rare that a search is performed only once, whether because of user error or because the documents contain unexpected content. Search term results are meant to be analyzed and used to craft a more targeted follow-up search. Searching isn’t as easy as it may seem, which is helpful to remember when choosing a subscription service or eDiscovery solution. Having experts who can help with search nuance is a considerable time saver and can also help users learn how to search effectively. If you’re having trouble with search, or wonder if your platform is right for you, take the eDiscovery Assessment and identify the best approach to handle your projects.