The goal is to automatically process PDF files containing 2D chemical structures, convert these structures into a SMILES/machine-readable form. Corollary information like structure identification should be extracted and associated with the relevant structure. Final stage would be to also recognize and extract relevant tables that could then be collated with the extracted structures.

A prototype of this system is already in place, but still requires manual intervention and verification at various stages. Ideal cases can be processed in a few minutes, average cases 10-15 min, and extreme cases or recognition failures require human intervention.

Automated Optical Structure Recognition and Activity Extraction - Spring 2023 Discovery Project
Term
Spring 2023
Topic
Industry/Economics
Technical Area(s)
Machine Learning (ML)
Natural language processing (NLP)