Academic Cancer Center Partners with GenomOncology to Optimize Optical Character Recognition Pipeline

Overview

A US-based academic cancer center focuses on integrating medical education, research, and patient care. Its department of pathology delivers advanced laboratory diagnostics to patients and advances the understanding, diagnosis, and treatment of human disease.

The Problem

This academic cancer center’s department of pathology recently recovered 250,000 Surgical Pathology and 100,000 Blood Lab / Autopsy reports. These reports were typewritten between the 1980s and early 2000s, before the implementation of electronic health records, and stored on microfiche as image files (TIFF). To use this information within its current systems, the academic cancer center needed to extract and convert the report data.

The Solution

To extract the text from these report images and create new, searchable PDFs with a text overlay, the academic cancer center partnered with GenomOncology.

GenomOncology offers a breadth of precision oncology solutions, including a focused data enablement solution, igniteIQ, that simplifies the extraction of clinically relevant information from PDF documents and images to access discrete data for registries, analytics, and research. The academic cancer center implemented igniteIQ for this project.

The Process

All the historical documents were uploaded to igniteIQ, where they underwent an image cleaning process. This process is designed to improve recognition in lower-quality documents, such as printed or scanned reports. After each image was cleaned, igniteIQ used optical character recognition (OCR) to convert the image of text into a machine-readable text format. The tool then applied its biomedical term-based OCR spell correction to automatically fix likely spelling errors.
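To illustrate the idea behind term-based spell correction, here is a minimal sketch in Python. It is not igniteIQ's implementation, which is not described here. The tiny lexicon, the function names, and the similarity cutoff are all assumptions chosen for the example, and matching uses the standard library's difflib rather than any biomedical tooling.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon for illustration only; a production system
# would load a much larger biomedical vocabulary.
BIOMEDICAL_TERMS = [
    "adenocarcinoma", "biopsy", "carcinoma",
    "lymph", "metastatic", "specimen",
]


def correct_token(token, lexicon=BIOMEDICAL_TERMS, cutoff=0.8):
    """Return the closest lexicon term for a token, or the token unchanged.

    The cutoff (an assumed value) keeps ordinary words such as "node"
    from being forced onto an unrelated lexicon term.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return token
    matches = get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    # Note: a corrected token takes the lexicon's lowercase spelling.
    return matches[0] if matches else token


def correct_text(text, lexicon=BIOMEDICAL_TERMS, cutoff=0.8):
    """Apply token-level correction to whitespace-separated OCR output."""
    return " ".join(correct_token(tok, lexicon, cutoff) for tok in text.split())


# Typical OCR confusion: the letter "o" read as the digit "0".
print(correct_text("metastatic carcin0ma in lymph node bi0psy"))
# prints: metastatic carcinoma in lymph node biopsy
```

A stricter cutoff reduces false corrections at the cost of missing more heavily garbled terms, which is one reason a manual quality check remains useful.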

The GenomOncology team completed a final quality check of the process, ultimately reducing overall errors. After the final review, igniteIQ returned a new PDF version of the file that was sent back to the academic cancer center for further inspection.

“These searchable pathology reports now provide our pathologists with the tools to effectively identify and retrieve tissue and tumor blocks from the clinical pathology archive that date back more than 40 years. Important research can now be carried out on these valuable tissue resources,” said the Vice-Chair of Research (MD, PhD) of the Department of Pathology at the academic cancer center.

Results

GenomOncology selected 210 random images to use as test examples and generated a “gold standard” text for each image. The text extracted under each image cleaning pipeline was then compared to the gold standard, word by word, to calculate OCR accuracy, resulting in:

  • Image Cleaning Pipelines and Accuracy: 83.4% accuracy

  • Greyscale: 83.9% accuracy

  • Erosion / Dilation: 85.1% accuracy

  • Artifact removal (despeckle): 87.4% accuracy

  • Despeckle + Erosion / Dilation: 83.1% accuracy
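The word-by-word comparison against a gold standard can be sketched as follows. The exact formula GenomOncology used is not stated, so this example assumes one plausible definition: the share of gold-standard words that also appear, in order, in the OCR output. The function name and sample sentences are hypothetical, and the alignment relies on Python's difflib.

```python
from difflib import SequenceMatcher


def word_accuracy(gold_text, ocr_text):
    """Fraction of gold-standard words reproduced, in order, by the OCR text.

    This is an assumed metric for illustration; other definitions
    (for example, ones based on word error rate) are equally plausible.
    """
    gold = gold_text.split()
    ocr = ocr_text.split()
    if not gold:
        return 1.0 if not ocr else 0.0
    # Align the two word sequences and count the words in matching runs.
    matcher = SequenceMatcher(None, gold, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gold)


gold = "invasive ductal carcinoma of the left breast"
ocr = "invasive ductal carcin0ma of the left breast"
print(f"{word_accuracy(gold, ocr):.1%}")  # 6 of 7 words match: 85.7%
```

Averaging such a score over all test images, once per cleaning pipeline, would give one number per pipeline that can be compared directly.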

Ongoing Plan

With searchable PDFs of historical pathology reports, the academic cancer center's department of pathology can efficiently search its pathology archive to identify and retrieve histological slides and tissue blocks that date back to 1981. These valuable clinical specimens are now available to its pathologists and their collaborators for new research opportunities.