Pushing computational biology into the open
The trend towards the democratization of AI for biology.
In an earlier blogpost, I mentioned that a trend we notice is the democratization of AI tools in the (computational) biology field. Research groups are not only increasingly developing AI-based tools and systems for biology, but they are deliberately making these tools accessible online for other researchers to use and build upon. This trend is important in the bigger picture of technology x biology: the more people have access to AI, the faster this blended field can potentially progress and the faster we will see meaningful societal impact (but also: the faster we should think about potentially worrisome outcomes hinted at by Harari’s Danger Formula).
In this blogpost, let’s dive into democratization in tech x bio, look at some specific examples, and ponder how it could play out further.
Democratization means accessible to anyone
In their book ‘Bold’, Peter Diamandis and Steven Kotler define democratization as:
“… the result of demonetization and dematerialization. It is what happens when physical objects are turned into bits and then hosted on a digital platform in such high volume that their price approaches zero.”
For AI tools specifically, the demonetization and dematerialization aspects are interesting to look at. The biggest cost in constructing large state-of-the-art AI models lies in training them on huge amounts of data, using large quantities of specialized computer hardware (GPUs or TPUs). In that regard, if a research group or company makes an already trained model publicly available, that is a form of demonetization.
The dematerialization aspect is interesting as well. An AI model itself is basically just lines of code in the digital realm (it is substrate independent), but running or training that model still requires computer hardware. If a tool requires specialized hardware to run, it is not truly democratized. One of the biggest contributors to the ‘dematerialization’ of AI tools is therefore the availability of free compute resources in the cloud, most notably Google’s Colaboratory and Kaggle Notebooks. Both platforms provide notebook environments with a CPU or GPU backend to run computations in the cloud, making it very easy to provide broad access to otherwise computationally demanding state-of-the-art AI tools.
Beyond analyzing those aspects, I haven’t found a definition of democratization specific to AI tools, but the OpenFold project provides a great set of principles that are generally desirable when tools are democratized:
Open-source: the complete codebase of the tool or project should be available online for the community and companies to use and improve. Arguably equally necessary is good documentation supporting the codebase.
Parameter availability: the trained model underlying the AI tool, with all of its parameters (also called weights), should be available for immediate use.
Permissive license: tools are only fully democratized if they contain a license that permits both commercial and non-commercial use.
Training pipeline available: the code is available to train the AI model from scratch or repurpose components in other architectures. This goes beyond the first principle, and together with the second principle it ensures full reproducibility of the published results related to the tool.
Optimized for performance: the tool is able to run on widely available computer hardware.
Looking at the bioinformatics and computational biology space, we could argue that this space has been largely democratized from the start! All of the foundational algorithms and tools are publicly available for researchers to use and build upon: the PAM and BLOSUM matrices, the Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment, BLAST, Clustal Omega, the Pfam database and a lot more. These tools and resources have been the foundational building blocks of bioinformatics and have had a huge impact over the years.
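To give a feel for how approachable these foundational building blocks are, here is a minimal sketch of the Needleman-Wunsch global alignment score in plain Python. It is deliberately simplified (a flat +1/-1 match/mismatch score and a linear gap penalty, rather than a substitution matrix like BLOSUM62 with affine gaps), but it implements the same dynamic-programming idea the real tools use:

```python
# Minimal Needleman-Wunsch global alignment score (illustrative sketch).
# Simplified scoring: +1 match, -1 mismatch, -1 gap. Production tools
# use substitution matrices (e.g. BLOSUM62) and affine gap penalties.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning prefix a[:i] with prefix b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align a[:i] against all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # align b[:j] against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                    # match/mismatch
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

# e.g. aligning "GAT" with "GT": two matches and one gap give a score of 1
print(needleman_wunsch("GAT", "GT"))
```

A full implementation would also traceback through the matrix to recover the alignment itself, not just its score; libraries like Biopython provide that out of the box.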
Today, we see a second wave of democratization in biology, a wave that is about AI-based tools. Interestingly, we now see both well-funded research groups and large corporations enter the computational biology field and open-source their research findings and tools for other researchers and groups to use and play with. This makes sense, as good funding is imperative to get to enough compute resources to train state-of-the-art models. But it is also somewhat worrisome, as it inherently limits the development (not the use) of groundbreaking new AI tools to these well-funded groups. That’s a reason I am a big fan of the model that StabilityAI is pursuing.
But leaving limited access aside, let’s look at some of the most interesting democratized tools in the computational biology space!
AlphaFold, ColabFold and OpenFold
Let’s start with the most high-profile code release in the computational biology field in recent years: AlphaFold2. When Google DeepMind won the CASP14 contest in 2020, they took the field by storm. Later, they decided to open-source their code, making protein folding available to the masses. Or at least, the masses that can figure out how to run the code on specific hardware. While it is technically possible to run AlphaFold on a basic laptop, it is not an effective solution. Therefore, the DeepMind team provided a Google Colab notebook for people to run the code easily, with the benefit of having a free GPU at their disposal. And beyond that, DeepMind predicted the entire human proteome, and later set up a collaboration with EMBL to provide over 350,000 predicted protein structures for more than 20 model organisms. Since this summer, that number has grown further to beyond 200 million structures!
Still, various groups were thinking about further improving AlphaFold. Among them were researchers at SteineggerLab. They presented ColabFold, taking the democratized AlphaFold code a level higher by making it even more accessible to any researcher! ColabFold speeds up AlphaFold’s prediction pipeline by 20-30x by using a much faster way to compute the necessary input (multiple sequence alignments via MMseqs2), by avoiding recompilation, and by adding early stopping criteria to the pipeline.
But the team at OpenFold took the idea of democratizing AlphaFold the furthest. As captured in their principles above, OpenFold provides a completely trainable replication of AlphaFold (which wasn't possible before), together with the model weights for inference and a permissive license. The DeepMind team hasn’t provided code for training AlphaFold, which prevents training new variants of the model or combining components of it with different models. Enabling exactly that has been OpenFold’s mission, for which they teamed up with StabilityAI. Furthermore, they have made sure that their replica is highly optimized for performance and can be run on widely available GPUs.
ESM2 and ESMFold
A second important example is the significant progress we have seen in large language models for proteins. Large natural language models powered by transformer architectures have been all the rage for the past year, and we have seen them do amazing things, such as generating art from text, playing strategy games and even conversing with you quite naturally.
In my previous blogpost, I explained that these language models are increasingly being trained on protein data, and the result is that these models learn (to some extent) the distribution of what proteins look like. The field is moving quickly, but today the state-of-the-art protein language models are Meta AI’s ESM models: ESM-2 represents protein sequences as dense numerical vectors, while ESMFold takes this a step further and uses that information to predict the 3D structure of a protein (like AlphaFold does, but a lot faster).
Just like DeepMind, Meta AI has open-sourced their trained models for researchers around the world to use and has provided instructions to use these various models in the cloud or on local computers. Just a few years ago, when you wanted to build a machine learning model for protein sequences, you had to invest a significant amount of time into properly representing your proteins as a numerical vector for machine learning models to be able to learn from. These protein language models now enable that task to be performed automatically.
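To make that contrast concrete, a classic hand-rolled representation is one-hot encoding: each residue becomes a sparse indicator vector that captures its identity but none of the evolutionary or structural context a learned language-model embedding carries. A minimal sketch in plain Python (the 20-letter standard amino-acid alphabet is assumed; this is an illustration of the manual approach, not Meta AI's method):

```python
# One-hot encoding of a protein sequence: the kind of hand-engineered
# numerical representation that learned protein language model
# embeddings (e.g. from ESM-2) now replace.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return a (len(sequence), 20) list-of-lists one-hot matrix."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)   # all zeros...
        row[AA_INDEX[residue]] = 1     # ...except the residue's position
        matrix.append(row)
    return matrix

encoding = one_hot_encode("MKT")  # 3 residues -> 3 rows of length 20
```

Note that every methionine gets the identical vector here regardless of its neighbors, whereas a protein language model produces context-dependent embeddings for each position, which is precisely why they transfer so well to downstream tasks like structure prediction.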
Default-to-open for AI-based tools in biology?
Looking ahead, I think there are two interesting aspects we should always take into account: advancing science versus pursuing economic opportunities.
Purely looking at the science side of things, one can only argue that democratizing every tool or model out there is the best thing to do. When researchers can quickly build on top of one another’s findings, science advances more rapidly (the vaccine development against SARS-CoV-2 was a nice example of that). But at the same time, we have to find a balance between open-sourcing and pursuing economic opportunities. Two examples are interesting to discuss here. The first is DeepMind, which open-sourced AlphaFold, but also started a sister company, Isomorphic Labs, that aims to further capitalize on the technology to improve drug discovery. The second is StabilityAI. They have provided complete open-source access to their Stable Diffusion model, but they also provide products on top of it (as well as consulting services), which they monetize. The company argues that releasing products helps them bring broader access to state-of-the-art AI (which I think is a fair point).
Both companies demonstrate a very interesting balance between democratization and commercialization that is likely to be copied in future endeavors. As Nathan Benaich argues, the real value of an AI-based tool is in a full-stack product, a product that delivers a complete solution for a specific problem that customers have. So my best guess is that we’ll see a lot more of this in the Tech x Bio space as well!