Enhancing accuracy using the ocr.config file and regex.txt file

You can enhance the accuracy of OCR by modifying the ocr.config file and the regex.txt file.

OCR configuration file (ocr.config)

OCR accuracy depends on the clarity of the image sent to the Tesseract engine. You can influence image clarity by modifying the ocr.config file in the tessdata folder (C:\ProgramData\Verint\DPA\Data\tessdata).

The ocr.config file contains these settings:

engine_mode=0
perform_negate=1
perform_scale=1
perform_unsharp_mask=1
perform_otsu_binarize=0
dark_bg_threshold=0.5
scale_factor=7
usm_halfwidth=15
usm_fract=1
otsu_sx=2000
otsu_sy=2000
otsu_smoothx=0.8
otsu_smoothy=0.8
otsu_scorefract=0.8
regex_file=regex.txt

The most effective configuration options are described below. You should not modify any other configuration options in the ocr.config file.

perform_otsu_binarize - Blurs pixelation on images that have been magnified using scale_factor.
scale_factor - Magnifies the original detected area.
usm_halfwidth - Increases the smoothing of letter edges.
usm_fract - Reduces pixelation caused by magnification.
regex_file=regex.txt - Enables post-OCR regular expression filtering using the regex.txt file.

The ocr.config settings are case sensitive and must all remain in lower case.

After modifying the ocr.config file, changes do not take effect until the next restart of the DPA client.
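Before restarting the DPA client, it can help to check an edited file for accidental upper-case setting names. The following Python sketch is not part of DPA; it assumes the file holds one key=value setting per line, as in the listing above, and the function name parse_ocr_config is hypothetical.

```python
def parse_ocr_config(text):
    """Parse key=value lines into a dict and reject keys that are not lower case."""
    settings = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines carry no setting
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"line {line_number}: expected key=value, got {line!r}")
        key = key.strip()
        if key != key.lower():
            raise ValueError(f"line {line_number}: setting {key!r} must be lower case")
        settings[key] = value.strip()
    return settings


# A shortened sample using values from the listing above.
sample = """engine_mode=0
scale_factor=7
usm_halfwidth=15
usm_fract=1
regex_file=regex.txt
"""
settings = parse_ocr_config(sample)
print(settings["scale_factor"], settings["regex_file"])
```

Values are kept as strings because the check only concerns the setting names; how DPA itself interprets each value is not covered here.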

Post OCR detection filtering using the regex.txt file

OCR can detect false items. DPA can perform additional filtering on the OCR detection to clear away any "noise" that the OCR falsely detects. This noise is very specific to the application in use and cannot be predicted easily. To combat this false detection, DPA can apply regular expression matching on the OCR detected output. The regular expression is configured in a regex.txt file contained in the tessdata folder.

The regex.txt file supplied in the standard DPA TessDataInstaller MSI is not generic and requires updating for specific customer requirements.

For this filtering to run, the regex_file= setting must be present in the ocr.config file and the regex.txt file must be present in the tessdata folder. When both conditions are met, DPA uses the content of the regex.txt file to perform search and replacement filtering against the text detected by OCR.

If the regex_file=regex.txt setting is not present in the ocr.config file, this feature is ignored.
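The two preconditions can be checked with a short script. This is a sketch only: the helper name regex_filter_file is hypothetical, the settings are passed in as a plain dictionary, and treating a missing file the same way as a missing setting is an assumption rather than documented DPA behavior.

```python
import tempfile
from pathlib import Path


def regex_filter_file(settings, tessdata_dir):
    """Return the regex file path if both preconditions hold, otherwise None."""
    name = settings.get("regex_file", "").strip()
    if not name:
        return None  # setting absent: the feature is ignored
    path = Path(tessdata_dir) / name
    # Assumption: a configured but missing file also disables filtering.
    return path if path.is_file() else None


# A temporary folder stands in for the tessdata folder.
with tempfile.TemporaryDirectory() as folder:
    missing = regex_filter_file({"regex_file": "regex.txt"}, folder)
    (Path(folder) / "regex.txt").write_text("T\n7\n")
    present = regex_filter_file({"regex_file": "regex.txt"}, folder)
    unset = regex_filter_file({"engine_mode": "0"}, folder)
    present_name = present.name if present else None

print(missing, present_name, unset)
```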

Each search and replace expression in the regex.txt file is applied to every piece of text that OCR detects. You cannot specify different regular expressions for different triggers.

If the regular expression matches, the replacement is performed and the changed text becomes the detected text. You can "chain together" as many regular expressions as are required within the regular expression file. This chaining allows progressive filtering to enhance the accuracy of the detected text.

If the regular expression does not match the detected text, the replacement is ignored and the next search and replacement pair is checked.
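The chaining behavior described above can be sketched in Python as follows. This illustrates the idea and is not the DPA implementation: the function name apply_filters and the sample pairs are hypothetical, every occurrence of a match is assumed to be replaced, and Python writes capture group references as \1 rather than the $1 form used in regex.txt.

```python
import re


def apply_filters(detected_text, pairs):
    """Apply each (search, replacement) pair in order to the running text."""
    text = detected_text
    for pattern, replacement in pairs:
        if re.search(pattern, text):
            # Match: the changed text becomes the input for the next pair.
            text = re.sub(pattern, replacement, text)
        # No match: the replacement is ignored and the next pair is checked.
    return text


# Hypothetical pairs: fix a misread letter, drop whitespace, then format.
pairs = [
    (r"O", "0"),
    (r"\s+", ""),
    (r"^(\d{3})(\d{3})$", r"\1-\2"),
]
cleaned = apply_filters("12O 456", pairs)
untouched = apply_filters("abc", pairs)
print(cleaned, untouched)
```

Because each pair sees the output of the previous one, the order of the pairs matters: the formatting expression at the end only matches once the earlier pairs have cleaned the text.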

Regular expression file: regex.txt

This section provides an example regex.txt file containing several search and replacement filters. These filters progressively "clean" the detected text.

In this example, the final expressions match two different outcomes: one matches a 10-digit customer account number and one matches a USPS ZIP code. This example shows how the regex.txt file can be used to filter two different numerical detections.

Example regex.txt file content:

#The next two lines search for T and replace T with 7
T
7
#The next two lines search for ? and replace ? with 7
\?
7
#The next two lines search for all digits in detected text, and returns them all as one number
\d
*ALL*
#The next two lines match if there are 10 digits, and formats them with a hyphen
^(\d{5}).*?(\d{5})$
$1-$2
#The next two lines match if there are only 5 numbers, or 5 followed by 4 numbers
^(\d{5})(\d{4})?$
$1 $2

Explanation of example regex.txt file

Line pair	Outcome/Description
T 7	The target was guaranteed to be a numerical value, but OCR was found to confuse 7 with T. This line pair replaces T with 7.
\? 7	The target was guaranteed to be a numerical value, but OCR was found to confuse 7 with ?. This line pair replaces ? with 7. The \ is needed as ? is a reserved character in regular expressions.
\d ALL	The target was guaranteed to be a numerical value, but OCR was found to include additional items within the output (for example 12'345-67890). This line pair matches all digits and removes everything else from the input.
^(\d{5}).*?(\d{5})$ $01-$02	This line pair places the first 5 digits and the last 5 digits into a value separated by a hyphen. This number matches the customer account number.
^(\d{5})(\d{4})?$ $01 $02	This line pair matches only if there are 5 digits or 9 digits. It then splits them into the standard USPS ZIP code format.

$01 ... $n correspond to the regular expression capture groups on each line, which allows you considerable flexibility in building up the final expression filter.
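To see how the line pairs in the table interact, the following Python sketch runs them in order against sample inputs. It is an approximation under stated assumptions, not the DPA engine: the *ALL* replacement is assumed to join every match into one string, $1 and $01 are assumed to refer to the same capture group, and the names PAIRS, to_python_template, and apply_pairs are hypothetical.

```python
import re

# Line pairs from the explanation table, as (search, replacement).
PAIRS = [
    (r"T", "7"),
    (r"\?", "7"),
    (r"\d", "*ALL*"),
    (r"^(\d{5}).*?(\d{5})$", "$1-$2"),
    (r"^(\d{5})(\d{4})?$", "$1 $2"),
]


def to_python_template(replacement):
    """Convert $1 or $01 style group references to Python's g<1> form."""
    return re.sub(r"\$(\d+)", lambda m: rf"\g<{int(m.group(1))}>", replacement)


def apply_pairs(text, pairs=PAIRS):
    """Run every pair in order; pairs that do not match are skipped."""
    for search, replacement in pairs:
        found = [m.group(0) for m in re.finditer(search, text)]
        if not found:
            continue  # no match: move on to the next pair
        if replacement == "*ALL*":
            # Assumed meaning: keep only the matched pieces, joined together.
            text = "".join(found)
        else:
            text = re.sub(search, to_python_template(replacement), text)
    return text


account = apply_pairs("12'345-6T890")  # noisy 10-digit account number
zip_code = apply_pairs("123456789")    # 9-digit ZIP code
print(account)
print(zip_code)
```

In the first input, T becomes 7, the digits are collected into 1234567890, and the 10-digit expression inserts the hyphen. The ZIP code expression then no longer matches, because the text now contains a hyphen.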