A machine-learning algorithm to identify transporters of aromatic bioproducts for the food industry

Microbes can be engineered to produce valuable chemicals, including aromatic compounds of importance to the food industry. For example, two important flavor/fragrance compounds, 2-phenylethanol (2PE) and vanillin, can already be microbially synthesized. However, for such non-native aromatics, end-product toxicity is a significant concern, with feedback inhibition typically limiting achievable titers, yields, and productivities. To counter the toxic effects of these and other aromatics, our groups are improving tolerance and productivity by using transporters (e.g., efflux pumps) to actively excrete toxic products from cells as they are synthesized. These efforts are limited, however, by the scarcity of known transporters for aromatics in the literature. To address this issue, the proposed project seeks to develop an algorithm capable of accurately predicting the activity of both characterized and largely uncharacterized transporters on industrially relevant aromatics. This flexible algorithm (which can be tuned to work on other products of interest) will streamline the search for relevant exporters to enhance tolerance to and production of target aromatics. No algorithm that can successfully predict transport activity for diverse substrates currently exists. The main obstacles are the lack of 1) characterized transporter structures and mechanisms, and 2) a database of substrates and non-substrates for transporters. In cases such as these, the minimalist strategies proposed here, which will use machine learning to predict activity by building up from protein sequences and substrate chemical structures, are well suited.
Data on transporter-substrate interactions will be collected by text mining public transporter databases (e.g., the Transporter Classification Database (TCDB)) for lists of known transporter-substrate interactions. Natural language processing will further help to identify substrates included in the TCDB. This will cover both native and heterologous transporters. One major limitation of this approach alone is that non-substrates for most transporters are rarely listed in such databases. Recently, our groups screened a library of native E. coli transporters for activity on a subset of aromatics (including 2PE). These additional experimental results can provide important non-substrate information, which will form the foundation of our datasets. To quantitatively describe substrate structural similarity for more accurate prediction, we will use chemical structure descriptors to decompose substrates into substructures supplemented with chemical properties (e.g., electronegativity). To relate transporter protein sequences to substrates, sequences will be reduced to short fragments composed of amino acid alphabets. Individual sequence fragments will be evaluated for their significance to transport activity using principal component analysis, and the significant ones will be kernelized as input for machine learning. The model can be progressively optimized for greater predictive performance by tuning the input variables using machine-learning methods. As a preliminary test, we have finished text mining the TCDB for transporter-substrate interactions and identified optimal chemical descriptors for target aromatics. Under various testing constraints, the model has proven predictive, with accuracy consistently exceeding 80%, suggesting the viability of our approach.
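The sequence-fragment and kernel steps described above can be illustrated with a minimal sketch. It is not the project's implementation: the fragment length `k = 3`, the plain one-letter amino acid alphabet, the cosine-normalized spectrum kernel, and the toy sequences are all assumptions chosen for illustration, and the principal component analysis step that selects significant fragments is omitted.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping length-k fragments (k-mers) in a protein sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a: str, seq_b: str, k: int = 3) -> float:
    """Cosine-normalized spectrum kernel between two sequences.

    The value is the dot product of the two k-mer count profiles divided by
    the product of their Euclidean norms; it is 0.0 when either sequence is
    shorter than k and therefore has no fragments.
    """
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm = sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def gram_matrix(sequences: list[str], k: int = 3) -> list[list[float]]:
    """Pairwise kernel values, usable as a precomputed kernel for a learner."""
    return [[spectrum_kernel(s, t, k) for t in sequences] for s in sequences]


# Hypothetical toy sequences, not real transporter proteins.
toy_sequences = ["MKLMKLAVG", "MKLAVGGTT", "PPQQRRSSW"]
K = gram_matrix(toy_sequences)
```

A matrix such as `K` could be passed to any kernel method that accepts a precomputed kernel, alongside a separate representation of the substrates built from chemical descriptors.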
With support from AIAP, we will further develop this algorithm by integrating our newly obtained experimental dataset to improve prediction of negative interactions. Ultimately, we will test the utility of computationally identified aromatic exporters for improving microbial tolerance to and production of 2PE and/or vanillin using previously engineered strains. The results obtained will enable iterative optimization of the algorithm for higher accuracy.
Effective start/end date: 9/7/18 → 9/6/19
- INDUSTRY: Foreign Company: $18,181.00