The goal is to automatically process PDF files containing 2D chemical structures, convert these structures into a SMILES/machine-readable form. Corollary information like structure identification should be extracted and associated with the relevant structure. Final stage would be to also recognize and extract relevant tables that could then be collated with the extracted structures.
A prototype of this system is already in place, but still requires manual intervention and verification at various stages. Ideal cases can be processed in a few minutes, average cases 10-15 min, and extreme cases or recognition failures require human intervention.
Term
Spring 2023
Topic
Industry/Economics
Technical Area(s)
Machine Learning (ML)
Natural language processing (NLP)
Featured
Off