
So, You Want DIY eDiscovery? First Up: Loading


You’ve decided to handle eDiscovery yourself, whether that means starting with a new tool, switching tools, or learning more about the processes you’ve been using. There are so many benefits to doing your own eDiscovery, and a few places where things can get complicated. This blog installment shares a few tips we’ve picked up through years of loading data ourselves and helping clients load their information.

First Steps

The biggest challenge of loading data yourself is determining what type of data you have. Is it raw native? Or processed data? And what exactly are those?

Raw native data is generally a mix of files like PDFs and Excel documents, and sometimes PSTs. Processed data is either load files or productions, typically OPT, DAT, or CSV files accompanied by images, text files, or native files. An easy way to tell the difference between raw native and processed data is the file names: processed data will have sequential file names, while raw data will not.
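As a rough illustration, here is a minimal Python sketch of that file-name heuristic; the function name and the trailing-digits pattern are assumptions for demonstration, not features of any particular tool.

```python
import re
from pathlib import Path

def looks_processed(folder):
    """Rough heuristic: processed productions tend to use sequential file names
    (e.g. ABC000001.tif, ABC000002.tif), while raw native collections do not."""
    stems = [p.stem for p in Path(folder).iterdir() if p.is_file()]
    numbers = sorted(
        int(m.group(1))
        for m in (re.search(r"(\d+)$", s) for s in stems)
        if m
    )
    # Call it "processed" if every trailing number increments by exactly one.
    return len(numbers) > 1 and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
```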

Raw Native

There are a few things to find out before loading raw native data, and they affect how it will be handled.

First, who is the custodian of the data? Although this seems elementary, this information is often not communicated, which can cause significant de-duplication issues down the line.

Second, does the data need to be filtered? If so, what are the parameters of that filtering (the most common is a date restriction)?

Third, what is the method of de-duplication? The most effective de-duplication is global (across the entire data set) because it maintains the custodian, original file names, and sometimes dates. Some tools do not globally de-dupe, so ensure that your software can; if not, custodian-level de-dupe can be performed. How duplicates are identified also matters, whether by hash, metadata, or text comparison.
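To make the hash-based option concrete, here is a minimal sketch of a global de-dupe pass in Python. The folder layout, function name, and choice of SHA-1 are illustrative assumptions, not how any particular platform works; real tools typically compare hashes computed during processing rather than re-reading every file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def global_dedupe(custodian_dirs):
    """Hash-based global de-duplication across all custodians.

    custodian_dirs maps a custodian name to a folder of collected files.
    Returns one kept file per unique hash, plus every custodian holding a copy."""
    kept = {}                   # hash -> first file encountered
    holders = defaultdict(set)  # hash -> custodians with a copy
    for custodian, folder in custodian_dirs.items():
        for path in Path(folder).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            holders[digest].add(custodian)
            kept.setdefault(digest, path)
    return kept, holders
```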

An additional consideration is how the data was collected, as that can affect file paths and dates. This is particularly important if a certain date range is going to be filtered out. For instance, moving data by drag and drop from a desktop to a thumb drive can change the file-creation metadata.

Loading Processed Data

Load files come as DAT, CSV, or OPT file formats, and you need one of those three for a processed load. DAT and CSV files generally carry metadata, whereas OPT files only reference images and contain neither metadata nor parent-child relationships. The load file contains the metadata and links to native files, images, and text. Load files also have a common key field, such as a BegDoc or ControlID, that links parents to children and ties natives, images, and text back to their records. The data accompanying a load file may be natives, images, or both. Some load files have placeholders for large natives like Excel documents.
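As a sketch of what reading one of these files can look like, the snippet below parses a DAT file with Python's csv module, assuming the common Concordance-style delimiters (ASCII 020 as the field separator and the thorn character þ as the text qualifier); your production spec may use different delimiters or a different encoding.

```python
import csv

FIELD_SEP = "\x14"   # ASCII 020 field separator (common Concordance convention)
QUOTE = "\xfe"       # thorn (þ) text qualifier

def read_dat(path, encoding="utf-8"):
    """Yield each record in a DAT load file as a dict keyed by the header row."""
    with open(path, newline="", encoding=encoding) as f:
        for row in csv.DictReader(f, delimiter=FIELD_SEP, quotechar=QUOTE):
            yield row

# Example: confirm every record carries the key field that links records together.
# for rec in read_dat("production.dat"):
#     assert rec.get("BegDoc"), "record is missing its key field"
```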

If you are requesting a load file, be very specific about the production format so the data loads easily and comes with the richest possible metadata. If you don’t have a format of your own, you can find an example here.

Quality Control

In general, quality control is the same whether you have processed or native data. What you’re looking for in quality control, sometimes known as exception handling, is what is missing: Are there gaps in Bates numbers? Are there emails that don’t appear? Some tools can automatically detect missing emails and threads if the Conversation Index field is available.
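Here is a small sketch of the Bates-gap check; the numbering pattern (letter prefix plus zero-padded digits) is an assumption, so adjust the regular expression to your own convention.

```python
import re

def find_bates_gaps(bates_numbers):
    """Return pairs of Bates numbers with missing documents between them."""
    pattern = re.compile(r"^([A-Za-z]+)(\d+)$")
    parsed = sorted(
        (m.group(1), int(m.group(2)), b)
        for b, m in ((b, pattern.match(b)) for b in bates_numbers)
        if m
    )
    gaps = []
    for (p1, n1, b1), (p2, n2, b2) in zip(parsed, parsed[1:]):
        if p1 == p2 and n2 - n1 > 1:
            gaps.append((b1, b2))   # everything between these two is missing
    return gaps

# find_bates_gaps(["ABC000001", "ABC000002", "ABC000005"])
# -> [("ABC000002", "ABC000005")]
```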

When performing quality control on raw native data, the most important question is whether the extracted text is usable. Password-protected data or corrupt data in a container file (such as a ZIP) creates an exception because the tool knows the data is there but can’t retrieve it. It may know the file name or size but can’t extract the information from the container; these are generally logged as extraction exceptions.
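For container files specifically, a quick way to see what a tool would flag is to attempt extraction yourself and record whatever fails. The sketch below does this for a ZIP with Python's standard library; encrypted members raise a RuntimeError when no password is supplied, and corrupt members typically fail their CRC check.

```python
import zipfile
import zlib

def zip_extraction_exceptions(path):
    """List members of a ZIP that cannot be extracted (encrypted or corrupt)."""
    problems = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    member.read()
            except (RuntimeError, zipfile.BadZipFile, zlib.error) as exc:
                # The name and size are known, but the content can't be retrieved.
                problems.append((info.filename, info.file_size, str(exc)))
    return problems
```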

Another issue is text extraction vs. OCR. Most native file types (PDF being the main exception) contain extractable text. Sometimes, however, that text comes out poorly; in that case, you should OCR the documents to make them easier to review.

The third issue with native data is uncommon file types. In general, the “big five” are the most important file types: Office documents, email, images, containers, and AutoCAD. Others may create an exception and require special handling.

Issues specific to processed data include bad or missing load files and incorrect file paths. The first issue arises from incomplete or improperly written production specifications. Missing parent-child relationships manifest as an email referencing an attachment that isn’t there, or as a child document whose parent was never produced.
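One way to catch the parent-child problem during QC is to check whether every parent reference in the load file actually points at a produced record. The field names below (BegDoc, ParentID) are assumptions; use whatever your production spec calls them.

```python
def find_orphaned_children(records, id_field="BegDoc", parent_field="ParentID"):
    """Return the IDs of records whose parent never appears in the load file."""
    produced = {rec[id_field] for rec in records}
    return [
        rec[id_field]
        for rec in records
        if rec.get(parent_field) and rec[parent_field] not in produced
    ]

# Example, using records yielded by read_dat() above:
# orphans = find_orphaned_children(list(read_dat("production.dat")))
```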

Conclusion

When doing eDiscovery yourself, loading may be the first major hurdle. It can seem daunting, but the items listed here can form a checklist that covers the majority of problems encountered during data loading. Knowing these common issues, combined with a specific production specification, should head off most loading snafus.