
Researchers unveil BioTranslator, a machine learning model that bridges biological data and text to accelerate biomedical discovery

A visualization of p97, an enzyme that plays a crucial role in regulating proteins in cancer cells, inhibited from completing its normal reaction cycle by a potential small molecule drug. With BioTranslator, the first multilingual translation framework for biomedical research, scientists will be able to search potential drug targets like p97 and other non-text biological data using free-form text descriptions. National Cancer Institute, National Institutes of Health

When the novel coronavirus SARS-CoV-2 began sweeping across the globe, scientists raced to figure out how the virus infected human cells so they could halt the spread.

What if scientists had been able to simply type a description of the virus and its spike protein into a search bar and receive information on the angiotensin-converting enzyme 2 — colloquially known as the ACE2 receptor, through which the virus infects human cells — in return? And what if, in addition to identifying the mechanism of infection for similar proteins, this same search also returned potential drug candidates known to inhibit those proteins’ ability to bind to the ACE2 receptor?

Biomedical research has yielded troves of data on protein function, cell types, gene expression and drug formulas that hold tremendous promise for assisting scientists in responding to novel diseases as well as fighting old foes such as Alzheimer’s, cancer and Parkinson’s. Historically, however, scientists’ ability to explore these massive datasets has been hampered by an outmoded model that relies on painstakingly annotated data, unique to each dataset, and precludes more open-ended exploration.

But that may be about to change. In a recent paper published in Nature Communications, Allen School researchers and their collaborators at Microsoft and Stanford University unveiled BioTranslator, the first multilingual translation framework for biomedical research. BioTranslator — a portmanteau of “biological” and “translator” — is a state-of-the-art, zero-shot classification tool for retrieving non-text biological data using free-form text descriptions.  

Hanwen Xu (left) and Addie Woicik

“BioTranslator serves as a bridge connecting the various datasets and the biological modalities they contain together,” explained lead author Hanwen Xu, a Ph.D. student in the Allen School. “If you think about how people who speak different languages communicate, they need to translate to a common language to talk to each other. We borrowed this idea to create our model that can ‘talk’ to different biological data and translate them into a common language — in this case, text.”

The ability to perform text-based search across multiple biological databases breaks from conventional approaches that rely on controlled vocabularies (CVs). As the name implies, CVs come with constraints. Once the original dataset has been created via the painstaking process of manual or automatic annotation according to a predefined set of terms, it is difficult to extend it to the analysis of new findings; meanwhile, creating new CVs is time-consuming and requires extensive domain knowledge to compose the data descriptions.

BioTranslator frees scientists from this rigidity by enabling them to search and retrieve biological data with the ease of free-form text. Allen School professor Sheng Wang, senior author of the paper, likens the shift to when the act of finding information online progressed from combing through predefined directories to being able to enter a search term into open-ended search engines like Google and Bing.

Sheng Wang

“The old Yahoo! directories relied on these hierarchical categories like ‘education,’ ‘health,’ ‘entertainment’ and so on. That meant that if I wanted to find something online 20 years ago, I couldn’t just enter search terms for anything I wanted; I had to know where to look,” said Wang. “Google changed that by introducing the concept of an intermediate layer that enables me to enter free text in its search bar and retrieve any website that matches my text. BioTranslator acts as that intermediate layer, but instead of websites, it retrieves biological data.”

Wang and Xu previously explored text-based search of biological data by developing ProTranslator, a bilingual framework for translating text to protein function. While ProTranslator is limited to proteins, BioTranslator is domain-agnostic, meaning it can pull from multiple modalities in response to a text-based input — and, as with the switch from old-school directories to modern search engines, the person querying the data no longer has to know where to look.

BioTranslator does not merely perform similarity search on existing CVs using text-based semantics; instead, it translates the user-generated text description into a biological data instance, such as a protein sequence, and then searches for similar instances across biological datasets. The framework is based on large-scale pretrained language models that have been fine-tuned using biomedical ontologies from a variety of related domains. Unlike other language models that are having a moment — ChatGPT comes to mind — BioTranslator isn’t limited to searching text but rather can pull from various data structures, including sequences, vectors and graphs. And because it’s bidirectional, BioTranslator can not only take text as input but also generate text as output.

“Once BioTranslator converts the biological data to text, people can then plug that description into ChatGPT or a general search engine to find more information on the topic,” Xu noted.
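
To make the idea of that intermediate layer concrete, the sketch below shows one way a shared embedding space can link free-form text to non-text biological data: a text query and a set of protein sequences are each mapped into the same vector space, and candidates are ranked by cosine similarity. The encoders here are deliberately simplistic, hypothetical stand-ins (hashed word and k-mer counts with fixed random projections), not the fine-tuned biomedical language models used in BioTranslator itself, and the sequences are toy examples.

import zlib
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # dimension of the shared text/biology embedding space (illustrative)

# Fixed random projections standing in for trained text and protein encoders.
TEXT_PROJ = rng.normal(size=(1024, DIM))
PROT_PROJ = rng.normal(size=(1024, DIM))

def _hashed_counts(tokens):
    """Map a list of tokens to a fixed-size count vector via hashing."""
    feats = np.zeros(1024)
    for t in tokens:
        feats[zlib.crc32(t.encode()) % 1024] += 1.0
    return feats

def embed_text(description):
    """Hypothetical text encoder: hashed bag-of-words, projected and L2-normalized."""
    vec = _hashed_counts(description.lower().split()) @ TEXT_PROJ
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_protein(sequence, k=3):
    """Hypothetical protein encoder: hashed k-mer counts, projected and L2-normalized."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    vec = _hashed_counts(kmers) @ PROT_PROJ
    return vec / (np.linalg.norm(vec) + 1e-8)

# Toy "database" of non-text biological instances (protein sequences).
database = {
    "protein_A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "protein_B": "MSGRGKQGGKARAKAKTRSSRAGLQFPVGRV",
}

# Free-form text query, embedded into the shared space and matched by cosine similarity.
query = "an enzyme that removes sugar residues from a glycosylated protein"
q = embed_text(query)
for name, seq in database.items():
    print(f"{name}: cosine similarity = {float(q @ embed_protein(seq)):.3f}")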

A diagram from the paper illustrating how BioTranslator converts user-written text (input) into non-text biological data (output): a description of a cell found in the embryo before the formation of all the germ layers is complete returns gene expression data; a description of the removal of sugar residues from a glycosylated protein returns a protein sequence; and a description of the complex network of interacting proteins and enzymes required for DNA replication returns a pathway.
BioTranslator functions as an intermediate layer between written text descriptions and biological data. The framework, which is based on large-scale pretrained language models that have been refined using biological ontologies from a variety of domains, translates user-generated text into a non-text biological data instance — for example, a protein sequence — and searches for similar instances across multiple biological datasets. Nature Communications

Xu and his colleagues developed BioTranslator using an unsupervised learning approach. Part of what makes BioTranslator unique is its ability to make predictions across multiple biological modalities without the benefit of paired data.

“We assessed BioTranslator’s performance on a selection of prediction tasks, spanning drug-target interaction, phenotype-gene association and phenotype-pathway association,” explained co-author and Allen School Ph.D. student Addie Woicik. “BioTranslator was able to predict the target gene for a drug using only the biological features of the drugs and phenotypes — no corresponding text descriptions — and without access to paired data between two of the non-text modalities. This sets it apart from supervised approaches like multiclass classification and logistic regression, which require paired data in training.”

BioTranslator outperformed both of those approaches in two out of the four tasks, and was better than the supervised approach that doesn’t use class features in the remaining two. In the team’s experiments, BioTranslator also successfully classified novel cell types and identified marker genes that were omitted from the training data. This indicates that BioTranslator can not only draw information from new or expanded datasets — no additional annotation or training required — but also contribute to the expansion of those datasets.
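
The zero-shot setup Woicik describes can be sketched in the same spirit: classes are represented purely by free-text descriptions, so a class that never appeared in training — such as a novel cell type — can still be predicted by embedding its description and matching it against the embedded instance. Everything below is a hypothetical illustration using the same stand-in encoders as the previous sketch and toy gene expression data, not the models or datasets from the paper.

import zlib
import numpy as np

rng = np.random.default_rng(1)
DIM = 64
N_GENES = 200

# Stand-in encoders: fixed random projections into a shared embedding space.
TEXT_PROJ = rng.normal(size=(1024, DIM))
EXPR_PROJ = rng.normal(size=(N_GENES, DIM))

def embed_text(description):
    """Hypothetical text encoder: hashed bag-of-words, projected and L2-normalized."""
    feats = np.zeros(1024)
    for token in description.lower().split():
        feats[zlib.crc32(token.encode()) % 1024] += 1.0
    vec = feats @ TEXT_PROJ
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_expression(profile):
    """Hypothetical encoder for a cell's gene expression vector."""
    vec = profile @ EXPR_PROJ
    return vec / (np.linalg.norm(vec) + 1e-8)

# Candidate cell types are defined only by free-text descriptions; the last one is
# treated as "novel" -- no labeled training examples are assumed for it.
class_descriptions = {
    "T cell": "a lymphocyte that matures in the thymus and mediates adaptive immunity",
    "hepatocyte": "the main epithelial cell of the liver involved in metabolism",
    "novel embryonic cell": "a cell found in the embryo before all germ layers have formed",
}
class_embeddings = {name: embed_text(d) for name, d in class_descriptions.items()}

# Toy gene expression profile for one cell; in practice this would be real data.
cell_profile = rng.random(N_GENES)
cell_embedding = embed_expression(cell_profile)

# Zero-shot prediction: pick the class whose description is closest in the shared space.
prediction = max(class_embeddings, key=lambda name: float(cell_embedding @ class_embeddings[name]))
print("predicted cell type:", prediction)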

Hoifung Poon (left) and Dr. Russ Altman

“The number of potential text and biological data pairings is approaching one million and counting,” Wang said. “BioTranslator has the potential to enhance scientists’ ability to respond quickly to the next novel virus, pinpoint the genetic markers for diseases, and identify new drug candidates for treating those diseases.”

Other co-authors on the paper are Allen School alum Hoifung Poon (Ph.D., ‘11), general manager at Microsoft Health Futures, and Dr. Russ Altman, the Kenneth Fong Professor of Bioengineering, Genetics, Medicine and Biomedical Data Science, with a courtesy appointment in Computer Science, at Stanford University. Next steps for the team include expanding the model beyond expertly written descriptions to accommodate more plain language and noisy text.

Read the Nature Communications paper here, and access the BioTranslator code package here.