When is a Terabyte Not a Terabyte?

Tuesday, July 25, 2017

Today, Advanced Discovery released a set of eDiscovery national and peer group benchmarks; I’m thrilled to be able to share the news – and my excitement.  While our new metrics answer some old questions, and challenge some long-standing assumptions, they more importantly lay out a transparent set of performance expectations with which to improve the eDiscovery industry – for both clients and service providers.

First, the eDiscovery challenge: let’s say you are handed a hard drive with exactly one terabyte of data on it, totaling 3.2 million parent items. Prior to processing, what’s your method for estimating the final data size and the number of documents that will need to be analyzed and eventually reviewed? Could you get everything you need to know by asking for one piece of background information? And even if you had estimates from prior projects, how do you know if your numbers reflect best practices for similar types of matters in your specific industry?

Somewhere in the doldrum hours between handing a hard drive to a courier and the notification that it’s been processed for review, we’ve all had a moment where the eternal question is asked: how much data will this really be – and how many documents will we really need to review? Almost every project manager, analyst, financier, budgeteer, musketeer, attorney, lawyer, barrister or paralegal who works with eDiscovery projects has asked this question at least once. And anyone who asks it before the data is ingested usually gets the same answer: it depends.

Between expansion, extraction, exceptions, deNISTing, and deduping, determining what to forecast before you get the real number is usually based less on slide rule and more on rule of thumb.

Maybe your experience says it’s usually 4,000 documents per gigabyte, but then along comes the case where it’s 15,000. Maybe your experience says every gigabyte collected expands by a factor of 1.5. In 2009, I had a case where a 25-megabyte zip file containing 25 text files expanded by a factor of 8,500, to more than 200 gigabytes. My IT department called me, concerned my workstation was under attack. That’s what “it depends” looks like.

But what would I do if “it depends” really can’t hold off the question today? What if I need to provide a defensible estimate of how much it will cost to review that next drive? What if I can’t wait for that shipment and processing to make my forecast? What if I need to Get Answers Now?

Well, I’ll tell you what we did. We took three years of aggregated information from thousands of matters, and identified repeatable, verifiable trends. The numbers were compelling, and it turns out that asking what industry the data came from may be all you need to know to get a more accurate and defensible forecast.

If our one terabyte example came from Biotechnology or Finance, then the data size in gigabytes will likely expand by a factor of 1.6. But the document count will expand by a factor of 4.4 for the Biotech company, compared to a whopping 18.8 for Finance data of comparable size. Compare that to the average Food and Beverage company’s file expansion factor of 19.4 or the average Entertainment company’s 31.9, and you start to see why it’s so important to get this benchmarked.

Even though the “average” size-on-disk data expansion factor is between 1.5 and 2.0, the target company’s Standard Industrial Classification (SIC) code can now be used to tell you if the document count will expand by a factor of 6 (Healthcare) or 19.4 (Utilities). We even had to run several internal double-checks to confirm the numbers for Insurance, where data expands on disk by a factor of 2 but expands in record count by a factor of 389. In most cases, industry-specific document expansion factors are closely correlated to the number of attachments to email data and the amount of network collections, but for Insurance the expansion appears, across the board, to be driven by how often vast quantities of data are imaged, high-page-count documents.
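As a rough illustration, the factors quoted above can be applied to the one terabyte example in a few lines of Python. This is a minimal sketch under stated assumptions, not our benchmarking method: the names `EXPANSION_FACTORS` and `forecast` are hypothetical, only the industries for which both factors are quoted above are included, one terabyte is treated as 1,000 gigabytes, and the document factor is assumed to apply directly to the parent item count.

```python
# Expansion factors quoted in this post, as (data size factor, document count factor).
# Only industries with both numbers stated above are listed.
EXPANSION_FACTORS = {
    "biotechnology": (1.6, 4.4),
    "finance": (1.6, 18.8),
    "insurance": (2.0, 389.0),
}


def forecast(industry, collected_gb, parent_items):
    """Return (estimated processed GB, estimated document count).

    Assumption: each factor is a simple multiplier on the collected
    size and on the parent item count, respectively.
    """
    size_factor, doc_factor = EXPANSION_FACTORS[industry.lower()]
    return collected_gb * size_factor, round(parent_items * doc_factor)


if __name__ == "__main__":
    # The one terabyte drive with 3.2 million parent items.
    for industry in EXPANSION_FACTORS:
        gb, docs = forecast(industry, 1000, 3_200_000)
        print(f"{industry}: ~{gb:,.0f} GB, ~{docs:,} documents")
```

Under these assumptions the same drive forecasts to roughly 14 million documents for Biotechnology and roughly 60 million for Finance, even though both land at about 1,600 GB on disk.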

We’re not just providing forecasting numbers: in the interest of transparency, we’re laying out our performance benchmarks as well. We’re averaging an 85% reduction in total data size through our project lifecycles, and a 90% reduction in document count by the time that we’re done with our ECA and Analytics tools. That means that on average, clients across industries are only required to review 10% of the full native document count they had after ingestion and processing.

And document counts drive eDiscovery review costs and timing.
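To see what those averages mean for review scope, here is a small sketch in Python. It assumes both reductions are measured against post-processing totals (the 10% figure above is stated that way for document count; for data size it is an assumption), and the function name `review_scope` is hypothetical. The inputs are the Finance figures from the one terabyte example: 3.2 million parent items at a factor of 18.8 is about 60.16 million documents, and 1,000 GB at a factor of 1.6 is 1,600 GB.

```python
def review_scope(processed_docs, processed_gb,
                 doc_reduction=0.90, size_reduction=0.85):
    """Return (documents left to review, GB remaining).

    Defaults are the average reductions quoted in this post; they are
    averages across industries, not a guarantee for any one matter.
    """
    docs_to_review = round(processed_docs * (1 - doc_reduction))
    gb_remaining = processed_gb * (1 - size_reduction)
    return docs_to_review, gb_remaining


if __name__ == "__main__":
    # Finance example: 3.2M parent items x 18.8, and 1,000 GB x 1.6.
    docs, gb = review_scope(60_160_000, 1600)
    print(f"~{docs:,} documents to review, ~{gb:,.0f} GB remaining")
```

For the Finance example that works out to roughly 6 million documents and about 240 GB, which is the number a review budget would actually be built on.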

Don’t just take my word for it: we published the indices so you can compare your own performance to our national and peer industry benchmarks, too. I’m proud to say our eDiscovery Industry Benchmarks bring a new level of transparency into key cost- and time-management indicators. Our numbers go beyond data size expansion, providing information on file count expansion and on the reduction through ECA and analytics to eventual document review size, broken down by industry code.

Check out our interactive infographic, and take note of the key indicators that most affect you or your clients, and how changes in those indicators may affect eventual review scope.

If you’re seeing lower data reductions and higher review percentages in your cases, think about how you can better use analytics to save you time and money. Partnering with an expert to get you to or above your peer group performance can have a significant impact on your eDiscovery cost, turnaround time – and matter outcomes.


PAUL LAVEN – Relativity Certified Master and Director of Solutions, North America

Paul is a Relativity Certified Master, with a double certification in  Analytics and Assisted Review. With more than 18 years of technical project management experience, he has worked extensively in eDiscovery, with a specific focus on advanced analytics in the financial, energy, and pharmaceutical industries.  Paul has provided expert witness testimony and depositions, been featured on the kCura Advice Blog and presented case studies at LegalTech NY. He is a former non-commissioned officer in the United States Marines.
