Information Extraction via Large Language Model

May 30, 2024

I recently graduated from the Software Engineering Master's program at the Harvard Extension School. My final course was a semester-long Capstone project, where I worked alongside four other engineers on an information extraction project with Pfizer. The article summarized below describes the tool we created for Pfizer. The full article can be found here.

The logos of the Harvard Extension School and the HECTRE project.

Article Title
Automated Clinical Trial Data Extraction with H.E.C.T.R.E. (Harvard Extension Clinical Trial Results Extraction)

Team Members
Manny Kwaning, James Nicholson, Veronika Post, Kartik Srikumar, Carl Zhao

Pfizer Partners
Sima Ahadieh, Meg Bennetts, Dr. Jim Hughes

Professor
Dr. Peter Henstock

Teaching Staff
Simili Abhilash

Abstract
Results from clinical trials of drugs are published first in the news and then in the scientific literature. Pharmaceutical companies like Pfizer collect published journal articles, manually extract data, and perform meta-analysis for drug development. The extraction process is tedious and error-prone. A systematic review of literature stored in unstructured PDF documents can take several months.

Objective
To help companies and researchers with data extraction from clinical trial papers, we developed a solution that makes this process faster and cheaper.

Methods
HECTRE is a web-based application that can also be used from a command-line interface (CLI). It uses large language models (LLMs) and extensible prompt engineering techniques to extract clinical trial data from PDF documents. HECTRE processes documents in parallel, so multiple documents can be uploaded and extracted at the same time. This reduces extraction time from several months of manual work to less than 10 minutes per 10 journal papers.
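
To illustrate the idea of extracting several documents at the same time, here is a minimal Python sketch. It is not HECTRE's actual code: the function names `extract_trial_data` and `extract_all`, the thread-pool approach, and the placeholder "extraction" (a simple word count standing in for an LLM call) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_trial_data(document_text: str) -> dict:
    """Hypothetical stand-in for one LLM extraction call on one paper."""
    # A real implementation would send a prompt plus the paper text to an LLM
    # and parse the response. Here we only count words so the sketch runs.
    return {"word_count": len(document_text.split())}


def extract_all(documents: dict[str, str], max_workers: int = 4) -> dict[str, dict]:
    """Run the (hypothetical) extraction for every document concurrently."""
    names = list(documents)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(extract_trial_data, (documents[n] for n in names)))
    return dict(zip(names, results))


if __name__ == "__main__":
    papers = {
        "paper_a.pdf": "placebo arm with 120 participants",
        "paper_b.pdf": "treatment arm with 118 participants",
    }
    for name, data in extract_all(papers).items():
        print(name, data)
```

Threads are a reasonable fit when each extraction spends most of its time waiting on a remote model, since the waits overlap instead of adding up.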

Results
The accuracy of data extraction performed by HECTRE varies from paper to paper and cannot be summarized in a single number. Automated testing showed a literature data extraction accuracy of 75.93% and a clinical data extraction accuracy of 61.71% (the respective means across all the papers tested). However, manual testing showed that these numbers might understate HECTRE's true accuracy.
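
As a rough illustration of what "the mean across all the papers tested" can look like in practice, here is a small Python sketch. The summary above does not describe how HECTRE's automated tests score a paper, so the exact-match rule, the function names `field_accuracy` and `mean_accuracy`, and the toy values are all assumptions.

```python
from statistics import mean


def field_accuracy(extracted: dict, expected: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    matches = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return matches / len(expected)


def mean_accuracy(pairs: list[tuple[dict, dict]]) -> float:
    """Average the per-paper accuracy over (extracted, expected) pairs."""
    return mean(field_accuracy(extracted, expected) for extracted, expected in pairs)


if __name__ == "__main__":
    # Toy values for illustration only, not HECTRE results.
    pairs = [
        ({"drug": "A", "dose": "10 mg"}, {"drug": "A", "dose": "10 mg"}),
        ({"drug": "B", "dose": "5mg"}, {"drug": "B", "dose": "5 mg"}),
    ]
    print(f"{mean_accuracy(pairs):.2%}")
```

In this sketch, a strict comparison counts a formatting difference such as "5mg" versus "5 mg" as an error, which is one way an automated score can differ from a human reviewer's judgment.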

Conclusions
HECTRE can extract clinical data efficiently with a high degree of accuracy. With additional fine-tuning of the chosen model and refined prompts, the accuracy of the extraction can be improved significantly. HECTRE turns extraction from an entirely manual process into one that may require only minimal manual verification.

The full article can be found here.