Healthcare NLP - Using Existing IP & Expert in the Loop Model to Overcome Cold Start

Introduction

GenomOncology helps healthcare providers use data to improve cancer care. We do this by finding and sorting treatment and clinical trial options based on a patient's diagnosis and molecular biomarkers. Our solutions even take a patient's zip code into account when identifying trial options, to ensure trials are within a reasonable driving distance.

While our software has a complete database of clinical trials and therapies for cancer, we often find that our clients, healthcare institutions, can't fully take advantage of our solutions due to "unstructured" patient data. Critical patient information, including lab results, can be trapped inside images or PDFs from scans, emails, or faxes. It has been estimated that over 80% of all medical data is in such unstructured formats.

Unfortunately, the solutions available for converting this data are not ideal.

Automated solutions, even the "state of the art" Artificial Intelligence and Machine Learning tools, do not provide a sufficient level of quality for patient care. While the technology has advanced significantly in the last 5 years, the main challenge in healthcare remains a lack of the annotated clinical data required to train deep learning-based Natural Language Processing (NLP) engines. This will likely remain a challenge because of the legal protections on patient health information.

Without automation, the data entry burden is time consuming, costly, and error prone. This increases the total cost and reduces the effectiveness of precision medicine-based solutions such as ours. Because of these challenges to our core clinical decision support solutions, we have invested in building a pragmatic solution for extracting clinical data at scale with high quality results.

However, like any ambitious machine learning project, we had a "cold start" problem: not only did we have no labeled clinical data... we had no clinical data at all, outside of some small open-source data sets.

Overcoming the Cold Start

At GenomOncology, not only do we understand software development, but we have deep expertise in the clinical, laboratory, genomic, and bioinformatic processes that inform precision oncology. Our team of curators applies that expertise in reviewing documents, such as clinical trial protocols, lab testing results, and clinical guidelines. This team has curated and assembled, through open-source and licensed content, a comprehensive knowledge graph of clinical trials, approved treatments, diseases, molecular biomarkers, and other ontologies.

Our first step in getting past our "cold start" was to leverage our existing knowledge graph used for our Clinical Decision Support (CDS) solutions. We built a processing engine that leverages our graph's database of names and synonyms for things ("entities") like diseases, drugs, and genes. This engine can parse any text and identify the entities within it. Not only can our engine recognize that a string like "NSCLC" means the disease "Non-Small Cell Lung Carcinoma," but it can also link to its unique identifier in several well-known, public datasets. The engine also can parse and link what we call "dynamic entities," such as dates, genomic variants, or measurements such as tumor dimensions.
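In spirit, this kind of dictionary-backed entity linking can be sketched in a few lines. The lexicon, surface forms, and identifiers below are illustrative stand-ins, not GenomOncology's actual knowledge graph or engine:

```python
import re

# Illustrative lexicon: surface forms (names and synonyms) mapped to a
# canonical entity with a public identifier. A real knowledge graph would
# hold many thousands of these.
LEXICON = {
    "nsclc": {"name": "Non-Small Cell Lung Carcinoma", "type": "disease", "id": "DOID:3908"},
    "egfr": {"name": "EGFR", "type": "gene", "id": "HGNC:3236"},
    "erlotinib": {"name": "Erlotinib", "type": "drug", "id": "DB00530"},
}

def link_entities(text):
    """Scan text for known surface forms and return linked entities."""
    found = []
    for token in re.findall(r"[A-Za-z][A-Za-z-]*", text):
        record = LEXICON.get(token.lower())
        if record:
            found.append((token, record["name"], record["id"]))
    return found

note = "Patient with NSCLC; EGFR mutation detected, started erlotinib."
print(link_entities(note))
```

A production engine would also handle multi-word synonyms, overlapping matches, and context-dependent ambiguity, but the core idea is the same: the curated graph supplies the vocabulary, so no trained model is needed to get started.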

Along with converting text to linked entities, our engine has the ability to convert images/scans via an oncology-specific pipeline that does image cleaning, Optical Character Recognition (OCR), and structure detection for pulling data out of tables, lists, and other semi-structured sections of text.
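The structure-detection step can be illustrated with a small sketch. Assume OCR has already produced whitespace-aligned text from a lab report table; the column names and values below are made up for the example:

```python
import re

def parse_table(text):
    """Split a whitespace-aligned table into header and row dicts.

    Cells are separated on runs of 2+ spaces, a common layout in
    OCR output from semi-structured lab reports.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for ln in lines[1:]:
        cells = re.split(r"\s{2,}", ln.strip())
        rows.append(dict(zip(header, cells)))
    return rows

# Illustrative OCR output from a molecular testing report.
ocr_text = """
Gene    Variant        VAF
EGFR    p.L858R        42%
TP53    p.R273H        18%
"""
print(parse_table(ocr_text))
```

Real documents need much more (image cleanup before OCR, detecting where a table starts and ends, handling merged or wrapped cells), but turning aligned text into keyed records is the essential move.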

Expert in the Loop

Once our engine was extracting data with a reasonable level of quality, our next step was to build an optimized content review experience in a web user interface (UI). This UI was designed to be responsive and easy to use to maximize the valuable time of our team of subject matter experts (SMEs). Our curators can view PDF documents side-by-side with the extracted structured data for verifying and correcting any mistakes or missing elements.

Each time a document is reviewed and its data corrected, the system gets better: the feedback loop to the processing engine builds annotated data sets and augments the knowledge graph with new entities and synonyms. This increases the accuracy of the engine and reduces the review time of the curators, turning the NLP flywheel faster and faster, document by document.
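The synonym-augmentation half of that loop can be sketched simply. Everything here is a hypothetical stand-in for the real review workflow:

```python
# Illustrative lexicon: phrase -> entity identifier.
lexicon = {"non-small cell lung carcinoma": "DOID:3908"}

def recognize(phrase, lex):
    """Return the linked entity id for a phrase, or None if unknown."""
    return lex.get(phrase.lower())

def record_correction(phrase, entity_id, lex):
    """A curator links a missed phrase to an entity; keep it as a synonym."""
    lex[phrase.lower()] = entity_id

# First pass: the engine misses the abbreviation...
assert recognize("NSCLC", lexicon) is None
# ...a curator corrects it in the review UI...
record_correction("NSCLC", "DOID:3908", lexicon)
# ...and every later document benefits from the new synonym.
assert recognize("NSCLC", lexicon) == "DOID:3908"
```

The same corrections also accumulate as annotated documents, which is exactly the training data that was missing at the cold start.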


igniteIQ

We call this system igniteIQ, and it has been designed specifically with our industry in mind:

  • Hosted in the cloud or behind your institution's firewall

  • Fully encrypted to protect patient data

  • Flexible document types and data schema

  • Data review by your subject matter experts or ours

  • Easy to install and maintain with Docker containers

  • Extensible by your development teams

  • Analytics dashboard for quality and efficiency tracking

  • Configurable integration with backend systems via API, FTP, etc.

Learn more about the features and benefits of igniteIQ today.

Ian Maurer