Introduction

Integral membrane proteins as important targets of study

Cellular life owes its existence to biological membranes: oily, two-molecule-thick barriers that separate a cell’s water-soaked interior from a water-permeating world. The goal of my research has been to answer specific questions about the proteins embedded in these ‘barriers’, termed Integral Membrane Proteins (IMPs). IMPs serve as essential conduits between the interior and exterior of cells and subcellular environments, playing critical roles in the transmission of information, the transport of cargo, and the exchange of energy. They are fundamental to numerous biological processes and are encoded by approximately 25% of human genes. Their significance is underscored by their prominence as drug targets, with many pharmaceuticals acting on these proteins.

One important way of understanding proteins is through their three-dimensional structure. Despite their importance, IMPs are underrepresented in structural databases: they account for only 2% of the structures in the Protein Data Bank (PDB), in stark contrast to their biological and pharmacological relevance. In one view, this historical discrepancy arises in part from the difficulty of producing sufficient quantities of a specific IMP in model laboratory systems (termed heterologous overexpression) for the downstream analyses that lead to a structure. The complex and hydrophobic nature of these proteins, whose production necessarily includes successful targeting to a membrane in the heterologous system, makes them difficult to express in large amounts.

Overcoming these expression challenges has been considered an important research target for advancing our understanding of IMPs, in part because it supports structural studies. Typically, expression challenges are studied piecemeal, in the context of a single protein of interest or a single biochemical phenomenon of interest, e.g. translation initiation or targeting to the membrane. Here, however, we explored a more expansive approach: predicting the expression of IMPs by learning from data. In other words, we aimed to develop a generalizable mathematical model that could predict the expression of IMPs in a specific model laboratory system based on the protein’s coding (DNA) sequence.

The problem of membrane protein expression

The heterologous expression of integral membrane proteins (IMPs) in Escherichia coli frequently results in insufficient yields for effective structural or biochemical analysis. Researchers investigating specific IMPs often test multiple homologs to identify one that produces adequate quantities. The problem is chaotic: expression levels often vary significantly even among protein homologs with high sequence similarity.

There is currently no universally effective method for improving the expression of IMPs. Indeed, much work has gone into pinpointing the cellular mechanisms behind initial expression failures, with no overarching answers. Efforts to enhance IMP expression in E. coli can be categorized into two main groups: (a) altering the target nucleotide or amino acid sequence or (b) altering the host organism. In both cases, the hope is to sidestep issues the initial sequence or baseline host may have posed for efficient heterologous expression.

In the first group (a), variants of the protein that are more amenable to biogenesis or other cellular processes are sought, thereby enhancing overall expression yield. One popular method is codon optimization. In this method, the coding sequence of a target protein is engineered to meet certain criteria, e.g. to match the host organism’s codon usage preference or to match the codon usage of highly expressed genes. Many complex implementations of codon optimization exist; for example, in codon “harmonization,” each codon of the target is replaced with a codon whose relative frequency in the heterologous host is similar to the relative frequency of the original codon in the native host. Many mechanistic explanations exist for why codon changes could improve expression, e.g. more efficient use of the host’s tRNA pool, modulation of mRNA secondary structure, or induction of ribosomal pausing.
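As a minimal sketch of the harmonization idea described above, the following Python snippet replaces each codon with the synonymous codon whose host frequency is closest to the original codon's native frequency. The two-amino-acid codon table and all frequency values are made up for illustration, and the function names are hypothetical; a real implementation would use measured codon usage tables for both organisms and typically more elaborate matching rules.

```python
# Toy codon table covering two amino acids only (Lys and Glu).
AMINO_ACID_OF = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}

# Made-up relative codon frequencies per amino acid (each pair sums to 1).
# These are NOT measured values for any organism.
NATIVE_FREQ = {"AAA": 0.30, "AAG": 0.70, "GAA": 0.60, "GAG": 0.40}
HOST_FREQ = {"AAA": 0.75, "AAG": 0.25, "GAA": 0.70, "GAG": 0.30}


def harmonize_codon(codon: str) -> str:
    """Return the synonymous codon whose host frequency is closest to
    the native frequency of `codon` (ties broken alphabetically)."""
    amino_acid = AMINO_ACID_OF[codon]
    target = NATIVE_FREQ[codon]
    synonyms = [c for c, aa in AMINO_ACID_OF.items() if aa == amino_acid]
    return min(synonyms, key=lambda c: (abs(HOST_FREQ[c] - target), c))


def harmonize(cds: str) -> str:
    """Harmonize a coding sequence codon by codon."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(harmonize_codon(c) for c in codons)


def translate(cds: str) -> str:
    """Translate using the toy table, to check the protein is unchanged."""
    return "".join(AMINO_ACID_OF[cds[i:i + 3]] for i in range(0, len(cds), 3))


original = "AAGGAAAAA"
harmonized = harmonize(original)
```

In this toy case the common native codon AAG maps to the common host codon AAA, while the encoded peptide stays the same; only the nucleotide sequence changes.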

Target proteins themselves are also modified in hopes of increasing protein expression. Methods like error-prone PCR and directed evolution have been used to introduce mutations that increase yield (in some cases by increasing protein stability). Other approaches, such as fusion to partner proteins (e.g. maltose-binding protein (MBP), glutathione S-transferase (GST), green fluorescent protein (GFP)), are sometimes used as well; these fusions are thought to enhance solubility and stability and to aid in proper folding and purification [37, 38]. These approaches are broadly hit-or-miss, and future choices are often guided by personal or anecdotal experience with a particular method rather than by an algorithmic procedure.

In the second group (b), the host organism is modified to optimize the expression of a target IMP. This can involve a variety of changes, such as co-expressing molecular chaperones like GroEL/GroES or DnaK/DnaJ/GrpE, which are thought to help prevent aggregation and assist in membrane insertion, or knocking out genes for proteases that degrade misfolded or improperly assembled proteins. Directed host evolution has also been used to improve protein expression: the host organism is subjected to iterative rounds of mutagenesis and selection to evolve strains that can better handle the target at hand. Testing a variety of commercially available protein-expression strains is commonplace; some of these strains simply have lower transcription levels or any of a variety of minor changes that, by chance, improve expression of a particular target protein. Finally, wholesale change of the heterologous expression host is an option that researchers consider.

A data-driven approach to predicting membrane protein expression

We sought an alternative approach to increasing the heterologous expression of membrane protein targets. Namely, we aimed to predict how well a given target IMP would express before any experimental work. To do so, we proposed to analyze data from past experiments to systematically develop a mathematical model that could “learn” trends from past data and use those trends to predict the outcome for new targets: in short, machine learning, which in some circles is also referred to as data-driven analysis or artificial intelligence.

The most critical enabling part of developing a machine learning model is access to datasets, preferably large ones. Thankfully, moderate- and large-scale datasets have been published in the literature, largely owing to an explosion of structural biology research (and large-scale programs) driven by the NIH-funded Protein Structure Initiative (PSI). The PSI aimed to determine as many protein structures as possible, generating a wealth of experimental data on thousands of target proteins, including membrane proteins. These studies, encompassing high-throughput expression trials, detailed biophysical characterizations, and structural determinations, provide a rich repository of both successful and failed attempts at membrane protein expression. Meticulous records of gene sequences, plasmid constructs, expression systems, growth conditions, solubilization and purification protocols, and final expression yields were all maintained and made publicly available. In addition, small- and moderate-scale datasets have been published in the literature by individual research groups, often focused on a specific family of proteins or a specific host system. Chapter 1, A statistical model for improved membrane protein expression using sequence-derived features, describes how we went about this task: where we sourced data from, how the data was cleaned, how we built a predictive model, and how that predictive model performed on unseen data.

Since the model was built and published in 2017, machine learning (“artificial intelligence” as it is known in some circles) has progressed significantly, especially in applications to biology. The question of predicting membrane protein expression merits revisiting, and I offer some parting thoughts below.

At a high level, the way we built our initial model was to tabulate all the biological and biochemical processes (or hunches) that experts thought were somehow related to membrane protein expression. Code was written to capture each of these concepts as numerical features calculated from nucleic acid and protein sequences. A large set of data with known outcomes was used to parameterize a model built on these features, which was then validated against other datasets.
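To make the idea of encoding a hunch as a numerical feature concrete, here is a small Python sketch that computes a few simple values from a coding sequence and its protein translation. The particular features (GC content, fraction of hydrophobic residues, count of positively charged residues), the residue set used to define "hydrophobic", and all function names are illustrative assumptions; they are not the feature set of the published model.

```python
# Residues treated as hydrophobic in this sketch (an illustrative choice).
HYDROPHOBIC = set("AILMFVW")
# Residues treated as positively charged (Lys, Arg).
POSITIVE = set("KR")


def gc_content(dna: str) -> float:
    """Fraction of G or C bases in a nucleotide sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty nucleotide sequence")
    return sum(base in "GC" for base in dna) / len(dna)


def hydrophobic_fraction(protein: str) -> float:
    """Fraction of residues that fall in the HYDROPHOBIC set."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    return sum(aa in HYDROPHOBIC for aa in protein) / len(protein)


def positive_residue_count(protein: str) -> int:
    """Number of Lys and Arg residues."""
    return sum(aa in POSITIVE for aa in protein.upper())


def featurize(dna: str, protein: str) -> dict:
    """Collect the features for one gene/protein pair into a flat record."""
    return {
        "gc_content": gc_content(dna),
        "hydrophobic_fraction": hydrophobic_fraction(protein),
        "positive_residues": positive_residue_count(protein),
        "protein_length": len(protein),
    }


# A three-codon toy gene (ATG AAA CTG) and its translation (M K L).
record = featurize("ATGAAACTG", "MKL")
```

Each gene then becomes one row of numbers, and a table of such rows paired with measured outcomes is what a statistical model can be fit to.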

The beauty of this approach is also its most significant pitfall: we were enabled by and also necessarily limited by what was known (or “hunched”) and how this ‘knowledge’ was encoded into numerical features. Given the scale of the data, it was not possible to use the data to significantly explore and refine the feature space without overtraining the model. The ideal approach, however, would be able to “discover” features important to membrane protein expression without needing to be told what they are. This would also enable the discovery of features related to processes that are not yet known but are still important to expression. Operationally, one way to understand this desire is that we aim to create a numerical space that compresses and captures the essence of membrane protein and gene sequences. Advances in large-scale neural-network models applied to biological sequences have aimed to do just this for proteins at large. In this approach, a primary “foundation” model is trained on a large corpus of data to learn a dense representation of that data. Future models that use the same type of data can build upon the “foundation” model by adding just a few more free parameters to learn the specifics of the problem under study.

As a specific example, Transformer models have been trained to take a protein sequence as input (20 x N, i.e. one of 20 amino acids at each of N positions), process it through a much denser representation (e.g. 100 x 100 real numbers), and then reconstruct the initial sequence as output. This type of model is attractive for a variety of reasons. By virtue of parameterizing itself to reconstruct the input sequence, the numerical intermediate space is thought to capture ‘everything’ about the input sequence. The numerical space is also continuous, meaning its encoding lends itself to traditional machine learning approaches. Critically, the model is parameterized by sequence data alone (no expression data), meaning the corpus available for training is far larger than the data we used to build our expression model.

For the next researcher interested in this topic, my suggestion is to explore this route by first using the intermediate parameter space provided by the multitude of embedding-type “foundation” models that exist. One complexity is that many foundation models are trained on general protein sequences, not gene sequences, yet we know that gene sequences are important to membrane protein expression. Retraining a foundation model to include gene sequences is a non-trivial task (not to mention munging the data to be fed into such a model) that would require significant computational resources, but perhaps it is a task that could be attempted with the right collaborative team.
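The workflow suggested here, keeping a pretrained embedding fixed and fitting only a handful of new parameters on top of it, can be sketched in Python as follows. Everything in this sketch is a stand-in: `embed` is a hypothetical placeholder that returns a fixed random vector per residue rather than calling any real foundation model, the sequences and labels are made up and are not expression data, and the "head" is a plain logistic regression fit by gradient descent.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIM = 8  # toy embedding width; real models use far wider vectors

# Placeholder for a frozen pretrained model: one fixed random vector per
# residue type. A real model would return context-dependent vectors.
_rng = random.Random(0)
_TABLE = {aa: [_rng.gauss(0.0, 1.0) for _ in range(DIM)] for aa in AMINO_ACIDS}


def embed(protein):
    """Hypothetical embedding call: one DIM-length vector per residue."""
    return [_TABLE[aa] for aa in protein]


def mean_pool(vectors):
    """Collapse per-residue vectors into one fixed-length vector."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def predict(weights, bias, x):
    """Logistic head: predicted probability of the positive label."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, xs, ys):
    """Mean binary cross-entropy of the head on (xs, ys)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def fit_head(xs, ys, lr=0.1, steps=200):
    """Full-batch gradient descent on the DIM + 1 head parameters only;
    the embedding table is never updated."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            grad_b += err
            for i in range(DIM):
                grad_w[i] += err * x[i]
        bias -= lr * grad_b / len(xs)
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up sequences and labels, for illustration only.
sequences = ["LLVVAAILLF", "KKDDEERRKK", "AILVFFLLIV", "DEKRDEKRNQ"]
labels = [1, 0, 1, 0]

features = [mean_pool(embed(s)) for s in sequences]
weights, bias = fit_head(features, labels)
```

The design point is that only `DIM + 1` numbers are learned from the labeled data, so a small expression dataset is asked to constrain far fewer parameters than it would if the representation itself had to be learned.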

Extensions to tail-anchored membrane protein targeting

The above discussion has focused on the expression of integral membrane proteins (IMPs). However, a specific subset of IMPs known as tail-anchored (TA) proteins is the topic of Chapter 2 (Molecular basis of tail-anchored integral membrane protein recognition by the co-chaperone Sgt2), Chapter 3 (The STI1-domain is a flexible alpha-helical fold with a hydrophobic groove), and Chapter 4 (Sequence-based features that are determinant for tail-anchored membrane protein sorting in eukaryotes). Tail-anchored proteins are characterized by a single transmembrane domain located at the C-terminal “tail”, which anchors the protein to its target membrane. Due to their distinct topology, TA proteins cannot use the conventional pathway for integral membrane protein targeting (i.e. via the Signal Recognition Particle “SRP” and Sec pathway) and instead require their own specialized targeting and insertion mechanisms to ensure proper membrane localization. One of these alternate pathways is the guided entry of TA proteins (GET) pathway, which involves a series of chaperones and translocation factors. Dysregulation of TA protein targeting can lead to mislocalization, aggregation, and loss of function, highlighting the importance of understanding TA protein biogenesis. In Chapter 2, we describe how multiple lines of evidence support a structural model for one of the co-chaperones involved in TA protein targeting, Sgt2 (yeast)/SGTA (human) (“Sgt2/A”). Based on this model, we made various inferences about a class of proteins that could serve similar functional roles and argue that they likely have similar structures.

Furthermore, the Sgt2/A model provided motivation to explore how the geometry of hydrophobic residues on tail-anchors might influence the targeting of TA proteins to the membrane. In particular, it showed a groove where hydrophobic residues on a tail-anchor could be recognized, and biochemical experiments suggested that a “face” of hydrophobic residues on an alpha-helix was actually necessary for binding. We wondered if a face of residues (or geometry otherwise) was a broader biological phenomenon that could help us better understand the overall determinants of TA protein targeting.

For our part, it was an interesting question because the going hypothesis was that the overall hydrophobicity and C-terminal positive charge of tail-anchors alone determined TA protein localization. In particular, in the case of moderately hydrophobic tail-anchors, e.g. those in which only half of the residues are hydrophobic, we wondered if the geometry of the hydrophobic residues could be an additional axis that determined localization. To this end, we showed that considering geometry along with hydrophobicity can better predict the protein localization data collated in bioinformatics databases.
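One standard way to put a number on a hydrophobic “face” is the helical hydrophobic moment, which sums per-residue hydrophobicity as vectors spaced 100 degrees apart around an ideal alpha-helix. The Python sketch below is offered only as an illustration of how geometry can differ when overall hydrophobicity is identical; it is not necessarily the metric used in Chapter 4, and the use of the Kyte-Doolittle scale and the two test peptides are assumptions made for this example.

```python
import math

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(helix: str) -> float:
    """Average hydropathy, ignoring where residues sit on the helix."""
    return sum(KD[aa] for aa in helix) / len(helix)


def hydrophobic_moment(helix: str, angle_deg: float = 100.0) -> float:
    """Mean hydrophobic moment, assuming an ideal alpha-helix in which
    consecutive residues are rotated by `angle_deg` around the axis."""
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(helix))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(helix))
    return math.hypot(sin_sum, cos_sum) / len(helix)


# Two made-up 18-residue peptides, each with nine Leu and nine Ser.
# In `faced`, the Leu residues cluster on one side of the helix;
# in `alternating`, they are spread evenly around it.
faced = "LSSLLSSLLSLLSSLLSS"
alternating = "LSLSLSLSLSLSLSLSLS"
```

Both peptides are exactly half hydrophobic and share the same mean hydropathy, yet `hydrophobic_moment(faced)` is large while `hydrophobic_moment(alternating)` is essentially zero, which is the kind of additional axis the paragraph above describes.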

While our path towards this hypothesis is curious, I wonder about the application of “foundation” models to problems of this sort. In particular, consider a foundation model trained on protein sequence data that is then extended to predict the localization of human TA proteins. In doing so, a subset of the dimensions of the continuous embedding space would vary and thereby be considered by the localization model. Among these dimensions, would one encode hydrophobicity, another C-terminal positive charge, and another the geometry of hydrophobic residues? Could other dimensions relevant to localization encode biophysical parameters important to TA protein targeting that we have not yet considered? The application of foundation models to TA protein targeting, and indeed to mechanistic biochemistry at large, will be an exceptionally interesting area of study in the coming years.

Colophon

This thesis document was written using Quarto inside RStudio. The complete source is available from GitHub. This version was built with R version 4.4.0 (2024-04-24) and Quarto 1.4.528.