Automated Optical Structure Recognition and Activity Extraction

The goal is to automatically process PDF files containing 2D chemical structures, convert these structures into a SMILES/machine-readable form. Corollary information like structure identification should be extracted and associated with the relevant structure. Final stage would be to also recognize and extract relevant tables that could then be collated with the extracted structures.

A prototype of this system is already in place, but still requires manual intervention and verification at various stages. Ideal cases can be processed in a few minutes, average cases 10-15 min, and extreme cases or recognition failures require human intervention.

Term

Spring 2023

Topic

Industry/Economics

Technical Area(s)

Machine Learning (ML)

Natural language processing (NLP)

Featured

Off