Meet the people who are redefining the field of data science! We capture the paradigm shifts that are driving the discoveries in cancer research of tomorrow, whether it is a brand new tool, model, or way to use data.
Svitlana Volkova, Ph.D., is a chief scientist at Pacific Northwest National Laboratory (PNNL). She will describe how the “foundation model,” a term coined at the Stanford Institute for Human-Centered Artificial Intelligence (HAI), is giving scientists and analysts a new way of unleashing artificial intelligence’s power.
According to Dr. Volkova, these multipurpose models represent a paradigm change in AI. A foundation model is trained on large data sets in an unsupervised fashion. The result is a model that can quickly be applied to a variety of tasks with minimal fine-tuning, rather than having to be built for each task specifically. This approach is especially relevant to researchers who want to apply a multi-omic, integrative data approach to cancer research.
What is the difference between AI foundation models and other AI models?
AI (or machine learning) is increasingly being used to enhance human intelligence. This technology is used to analyze large amounts of data to make predictions, find patterns, and forecast trends.
This partnership between AI and humans hinges on our team’s ability to create models that improve our cognitive abilities, such as how we reason, learn, and make decisions. The more accurate our models are, the better we can augment human tasks.
Until a few years ago, we would train a model on data and then use it for a specific task, such as text translation or object recognition. We call this “narrow” AI.
We’re now moving from that narrow view to a broader one, called “foundation models.” These models are all-encompassing and can be used for many tasks with little adjustment; text applications include summarization and machine translation. A single model can serve a variety of tasks.
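To make the “one model, many tasks” idea concrete, here is a minimal sketch (not from the interview) that reuses a single publicly available text-to-text model for both summarization and English-to-German translation. The t5-small checkpoint and the Hugging Face transformers pipeline API are assumptions chosen purely for illustration.

```python
# Minimal sketch: one pretrained text-to-text model reused for two different
# tasks with no task-specific retraining. The checkpoint choice ("t5-small")
# is illustrative, not a model mentioned in the interview.
from transformers import pipeline

MODEL = "t5-small"  # small, publicly available text-to-text model (assumed)

summarizer = pipeline("summarization", model=MODEL)
translator = pipeline("translation_en_to_de", model=MODEL)

text = (
    "Foundation models are trained once on large, broad data sets and then "
    "reused for many downstream tasks with minimal additional tuning."
)

# Same underlying model, two different tasks.
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
print(translator("Foundation models can be reused for many tasks.")[0]["translation_text"])
```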
Many people may wonder, “Why is this model better?” Others might argue that developing a model of this type is detrimental because it is costly to the environment and takes time and resources (often months) to train. We’re finding that those costs are offset by the benefits of accuracy and generalizability.
Foundation models are more accurate than narrow AI, and they are also highly generalizable. That’s an important attribute, because it speaks to the core problem of narrow AI: a model you create for a particular task is often not easily transferable to another data set, new task, or domain. This has been a persistent problem in AI, one that especially impacts our ability to share results, reproduce findings, and advance science.
Why use foundation models for cancer research?
In cancer research, there are many opportunities to use foundation models. These models are costly to develop and train and have been mainly used in industry. They are not common in biomedical research. Recent advances in hardware development and model creation have allowed us to train our models more quickly and at a lower cost.
Work from OpenAI and Google DeepMind offers two good examples. These open-ended algorithms were designed with sustainability in mind: the models can solve new tasks and keep learning (seemingly forever) while only just exceeding 1 billion parameters, as with DeepMind’s multitasking Gato algorithm.
Foundation models also perform exceptionally well when they are deployed across different modalities. Models like BERT and GPT-2, for example, have revolutionized the field of natural language processing, and models such as AlexNet have revolutionized our ability to interpret visual data.
Now we’re exploring the possibility of applying these large-scale models to domain-specific cancer data, particularly multi-omics data.
Already, there have been some successes. In a recent article, researchers trained a large model (nearly one billion parameters) on 250 million protein sequences. They then demonstrated how this protein model can be adapted to perform a variety of tasks, including narrow AI tasks such as predicting biological structure and function. This research is cutting-edge.
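As an illustration of how a pretrained protein language model might be adapted to a downstream task, here is a hedged sketch of the general recipe: extract embeddings from the pretrained backbone and train only a small task-specific head on top of them. The facebook/esm2_t6_8M_UR50D checkpoint and the toy sequences and labels are assumptions for illustration; this is not the specific model or data described in the article.

```python
# Sketch of the general recipe: take a pretrained protein language model,
# extract its sequence embeddings, and train a small task-specific head.
# Checkpoint name, sequences, and labels below are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "facebook/esm2_t6_8M_UR50D"  # small public protein language model (assumed)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
backbone = AutoModel.from_pretrained(checkpoint)

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MGSSHHHHHHSSGLVPRGSHM"]
labels = torch.tensor([1, 0])  # toy binary property labels

inputs = tokenizer(sequences, padding=True, return_tensors="pt")
with torch.no_grad():
    # Mean-pool the last hidden states into one embedding per sequence.
    hidden = backbone(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)

# Lightweight adaptation: only this small head is trained on the new task,
# while the large pretrained backbone stays frozen.
head = torch.nn.Linear(embeddings.shape[-1], 2)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss = torch.nn.functional.cross_entropy(head(embeddings), labels)
loss.backward()
optimizer.step()
print(f"toy training loss: {loss.item():.3f}")
```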
Do foundation-model approaches have unique risks?
Risks are present, but they are not limited to foundation models; even narrow models are subject to the same risks. All models are built on data, so bias can be introduced. If a model’s development is not done responsibly and fairly, it’s harder to hold it accountable or to trust it.
What does it mean to be responsible? It refers to the experimental setup of the model. The community must be able to reproduce the model and understand how it is built, so each step of the process should be publicly available. I am a strong believer in reproducible research.
Fairness is also important, especially in the biomedical field. What if we only use data from one population and ignore underrepresented minorities or environmental factors? We need a fairly distributed data set to achieve the best results, and that is often difficult.
We also need to test the model and make sure it is reliable. Meeting baselines is not enough; we can’t just show that our model performs similarly to others in the field on a specific data set. We must test the robustness of our model across data sets, use adversarial attacks to probe its behavior, and then examine the results critically.
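One common way to probe robustness, shown here as a sketch of the general idea rather than any particular team’s method, is the fast gradient sign method (FGSM): perturb an input in the direction of the loss gradient and check whether the model’s prediction flips. The toy model and synthetic data below are placeholders.

```python
# Minimal sketch of a robustness probe using the fast gradient sign method
# (FGSM). A small perturbation crafted from the loss gradient is added to an
# input, and we check whether the prediction changes. The toy model and
# random data are placeholders, not the models discussed in the interview.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(20, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2)
)

x = torch.randn(1, 20, requires_grad=True)  # one synthetic input example
y = torch.tensor([1])                       # its assumed true label

# The gradient of the loss with respect to the input drives the perturbation.
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

epsilon = 0.1
x_adv = x + epsilon * x.grad.sign()

clean_pred = model(x).argmax(dim=1).item()
adv_pred = model(x_adv).argmax(dim=1).item()
print(f"clean prediction: {clean_pred}, adversarial prediction: {adv_pred}")
# A reliable model keeps its prediction stable under small perturbations
# like this; frequent flips indicate fragility worth reporting.
```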
There’s no universal model or hammer that can do everything. Reporting limitations from the beginning is important so that users of the model understand what it can and cannot accomplish. To build accountability, it’s important to set clear expectations.
If we are to progress in model development and deployment, then full transparency is required.
You mentioned that foundation models depend on large amounts of data. Is there a downside to using so much data?
In my career, I have worked with huge amounts of data including social media data generated by humans. The AI community has adopted many models that reflect the way humans learn. You can only learn a limited amount if you are exposed to one thing. Imagine that you are sitting in your home and only exposed to the living room. You will only have a limited view of the world. Imagine you are traveling to your nearest town, to a large city, and then overseas. Your horizons suddenly become much broader. The model and you will benefit from this increased exposure to data.
There are also downsides to this type of exposure. A foundation model trained on internet data, for example, will reflect all of the human biases it encounters, including inaccuracies and distortions. Models will learn whatever they are exposed to unless we intervene. We’ve learned that we must keep our models on a leash.
If we relate this to our example, “leashing our model” is similar to going into the city but only following a guidebook that leads you to the places that you want to visit.
We can control a model in many ways, either by guiding the learning process or by adjusting the model afterward. Nobody wants to use a model that has learned biases. We want to create technology that is better than we are.
In terms of building technology, what do you think the future holds for foundation models in the next five to ten years?
AI has reached a point where it can be used to augment the work of scientists and analysts. AI is a tool that helps humans perform cognitive tasks like reasoning, making decisions, and gaining knowledge from large data sets. But AI is still limited; it has not yet reached the next level of being autonomous and operational.
We should expect to see this level of autonomy increase in the coming years: not full autonomy, but semi-autonomous models that could drive lab automation. This partnership has already shown promise. Humans will still be needed, acting as “guidebooks” and interacting with the models. AlphaFold, for example, is a tool that scientists can use to predict protein structure and function.
Other foundation models will streamline the discovery process. It’s almost impossible to keep up with all of today’s scientific literature; humans cannot read and interpret these findings in an appropriate and timely manner. Future foundation models will be capable of gleaning insights from the most recent literature, generating recommendations, and guiding scientists’ next steps.
In 10 years I believe we will take this further and models will become a part of the discovery cycle. The model will be able to read existing knowledge and make recommendations for experiments and hypotheses. Scientists will be able to focus on implementing the model’s findings, with minimal intervention.
What led you to study foundation models?
I was born and raised in Ukraine and moved to the United States on a Fulbright scholarship 13 years ago. During my master’s degree at Kansas State University, I first heard the phrase “machine learning.” At that time, I was extracting biomedical data from open sources, such as scientific literature, websites, and reports, and I was especially interested in machine learning and natural language processing using narrow AI models.
I earned my doctorate at Johns Hopkins University, where I worked with open-source data and machine learning models. This was before deep learning. As a member of JHU’s Human Language Technology Center of Excellence and Center for Language and Speech Processing, I started building models to study human behavior online. Although not directly related to biomedicine, this work was similar because we looked at the entire range of human experience: people’s interests, properties, actions, preferences, feelings, and so on. This is similar, in a sense, to the data we collect for cancer research.
Mega AI is the internal investment I am leading at PNNL. It is focused on building foundation models for scientific knowledge that will enhance our ability to reason and perceive at previously unimagined scales. We have created small-scale but durable 7-billion-parameter climate and chemistry models from scientific literature. This work contrasts with some of the bigger models available today, like one created in China with 2 trillion parameters, or Google’s, which isn’t far behind at 1.7 trillion. At PNNL, we are committed to creating smaller, more sustainable foundation models focused on applications in science and security. These can be deployed much more quickly.