
Five Crucial Mistakes Archives Make with OCR


by Marcia Spicer

One thing almost every digitization client has in common is a desire to share their collection. That’s great, and it’s a goal digitization works toward almost naturally. But someone has to think about the how. Yes, we want to share, facilitate research, create opportunities for spontaneous discovery of the collection… but how? 

For collections that want to contain full-text document search, a key part of the how is OCR.

Learn more about Optical Character Recognition in our explainer. 

Like most things, OCR (Optical Character Recognition) exists on a spectrum. Simply checking the OCR box doesn’t mean that the job is done. Without fully considering the ways OCR can work for your collection, it is easy to make these five crucial OCR mistakes.

  1. “Automated OCR will do the job.”

The old saying is true: time is money. When faced with the complicated and extensive costs of digitizing a collection, it is tempting to save a few pennies by relying on automated OCR alone rather than paying for the precision of a manual review.

The truth is that modern internet users expect instant and complete results when they search a collection. Because of the complicating factors detailed below, a user could pull up your collection, hit search, and find zero results. For all but the most dedicated users, that bad result could be the last time they try to engage with the collection, no matter the quality of the digitization.

A manual review of OCRed text catches—and corrects—errors that the software misses.

  2. “We’ll skip the review of proper names.”

Proper names, along with identified search terms, are among the most important text to get right, even if you can’t or don’t want to invest in a 100% OCR quality review. Taking the time to correct key names and places makes it more likely that your digital collection is used effectively.

Why does OCR software so frequently miss proper names? The software relies heavily on a built-in dictionary, and words it can’t find there throw it for a loop. Factors like poor image quality and old or nonstandard fonts add to the challenge of recognizing proper names. Without human review, Tolkien might be read as Tolkion or To1kien.

Correcting these errors manually means a search for Tolkien always finds a result.
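As a rough illustration of how a reviewer might find such misreadings, the sketch below flags OCR tokens that are close to, but not exactly, a known key name. It is a minimal example using Python's standard `difflib` module, not a description of any particular OCR product or of Anderson Archival's workflow. The name list, the function name `flag_suspect_names`, and the 0.8 similarity cutoff are all assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical list of names the collection's owner has identified as key search terms.
KEY_NAMES = ["Tolkien", "Lewis"]

def flag_suspect_names(ocr_text, key_names=KEY_NAMES, cutoff=0.8):
    """Return (token, probable_name) pairs for tokens that nearly match a key name."""
    suspects = []
    # Split the OCR output into runs of letters and digits (digits matter: "To1kien").
    for token in re.findall(r"[A-Za-z0-9]+", ocr_text):
        if token in key_names:
            continue  # exact match, nothing to review
        # The cutoff is an assumed similarity threshold; tune it per collection.
        match = difflib.get_close_matches(token, key_names, n=1, cutoff=cutoff)
        if match:
            suspects.append((token, match[0]))
    return suspects

if __name__ == "__main__":
    sample = "To1kien wrote to Tolkion about Tolkien."
    for token, name in flag_suspect_names(sample):
        print(f"Review '{token}': possibly '{name}'")
```

A list like this only points a human reviewer at likely problems. The reviewer still has to check each flagged word against the original image before correcting it.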

  3. “Numbers aren’t essential.”

We at Anderson Archival strive to know your collection as well as you do. You know who will be searching the digital version and how they expect to arrive at the correct result. For many collections, numbers may not be as important to potential audiences, but for others, key dates, addresses, and designations may be prime search material.

In a recent project, OCR software misread the number 22 as 55. The software did its best, but the misreading came down to italics, a slightly crooked scan, and fonts that have fallen out of common use. Whether reading 22 as 55 will affect the usability of your collection depends on the particular features of the collection itself.

  4. “Since OCR is what’s really important, let’s use the scans we already have.”

Images are everything. When a piece of software is “reading” an image, everything matters—every pixel, background noise, color, printing over images—everything. When scans are blurry or low-resolution, the software has even less to work with than normal, and the results can be abysmal.

On fuzzy, unclear images, automated OCR struggles to parse even common words, let alone proper names or historical spellings. Manual review requires time and will take even longer when working with poor-quality images, because there’s more to correct. And humans struggle to read hazy visuals just as much as the software.

How can you fix this? New clean, high-resolution images not only improve initial automated OCR, but also improve the speed at which a manual review can proceed. Beyond the OCR factor, scans that are accurate to the original are just nicer to look at.

  5. “Just the text, thanks.”

There’s no doubt that raw full-text search gets the job done. When it is performed correctly, you know the results are present. However, it isn’t uncommon for the result you are looking for (a specific reference to Tolkien’s experience during World War I, for example) to be lost among numerous other results when searching the full text for the keyword “Tolkien.”

Granular search capabilities help users retrieve exactly the result they need. When search functionality is augmented with metadata and faceted search options, that same user searching for “Tolkien” could narrow the results to a specific time period or filter out works written by Tolkien. These laser-focused search results save time and frustration compared to a simple full-text search.
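To make the difference concrete, here is a minimal sketch of a keyword search narrowed by metadata facets. The `Record` fields, the `search` function, and the sample records are all invented for illustration. A real collection would use its own metadata schema and a proper search engine rather than a loop over a list.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    author: str
    year: int
    text: str  # the OCRed full text

# Invented placeholder records, not items from any real collection.
RECORDS = [
    Record("Sample letter", "J. R. R. Tolkien", 1916, "Tolkien writes from the front."),
    Record("Sample review", "A. Reviewer", 1917, "A note on Tolkien's wartime service."),
    Record("Sample essay", "A. Reviewer", 1954, "Tolkien publishes a novel."),
]

def search(records, keyword, year_range=None, exclude_author=None):
    """Full-text keyword search, optionally narrowed by year and author facets."""
    results = []
    for record in records:
        if keyword.lower() not in record.text.lower():
            continue  # keyword not found in the full text
        if year_range and not (year_range[0] <= record.year <= year_range[1]):
            continue  # outside the requested time period
        if exclude_author and record.author == exclude_author:
            continue  # filter out works written by this author
        results.append(record)
    return results

if __name__ == "__main__":
    everything = search(RECORDS, "Tolkien")
    narrowed = search(RECORDS, "Tolkien", year_range=(1914, 1918),
                      exclude_author="J. R. R. Tolkien")
    print(len(everything), "results without facets")
    print([r.title for r in narrowed], "after narrowing")
```

The facets only work if the metadata (dates, authors, document types) is captured accurately during digitization, which is why it pays to plan for it alongside OCR.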

 

When designing your digital collection, it is important to balance budgetary concerns with creating a collection that has high adoption rates. The easiest way to get there is to work with a true partner like Anderson Archival. We take great pride in treating your collections like they are our own.
