Seminar | November 2 | 10:30-11:30 a.m. | 177 Stanley Hall

 Neil Thomas

 Electrical Engineering and Computer Sciences (EECS)

Proteins are the molecular machines that perform the vast majority of natural biological functions. Discovering proteins that perform novel functions and optimizing them for existing functions are central goals of synthetic biology. Doing so is challenging primarily because for most proteins there is limited understanding of how they function, let alone how to modify them; experimental characterization and crystal structures are expensive and time-consuming to collect. For a given protein, however, genes performing related functions can be found in the genomes of diverse organisms, a natural result of evolution. With improved techniques for genetic sequencing, an abundance of data has become available in protein sequence databases. This presents a tantalizing modeling opportunity: models that can infer protein function from related sequences can reduce the reliance on experimental characterization and unlock new possibilities for protein discovery and optimization. Building such models has long been a goal of bioinformatics research, and has more recently emerged as a goal of machine learning research. In particular, "protein language models," models trained to learn a distribution over sequence data, have shown promise in predicting functional properties of proteins.

In this work, we leverage the information in protein sequence databases toward three ends. First, we present a benchmark that measures the effectiveness of protein language models on a suite of protein prediction tasks. Second, we draw a connection between a well-established graphical model of protein families and the neural network architecture of protein language models. Third, we present a framework for deriving synthetic protein fitness landscapes from evolutionary data, which can be used to evaluate strategies for model-guided protein design in silico.

Zoom link:
https://berkeley.zoom.us/j/94781598923?pwd=WDU4d0VITjJWMEV2TTVzZXBJeEtDUT09
passcode: proteinml

 nthomas@berkeley.edu

 Jean Nguyen,  jeannguyen@eecs.berkeley.edu,  510-642-9413

Status
Happening As Scheduled
Primary Event Type
Seminar
Location
177 Stanley Hall
Performers
Neil Thomas
Subtitle
Leveraging Evolutionary Information to Improve Protein Modeling
Event ID
149242