We ❤️ Open Source
A community education resource
You don’t need a biochemistry degree to analyze proteins
How pre-trained models democratize advanced research.
Protein research once took months or years of intensive lab work. Today, transformer-based models can predict protein structures and provide functional insights far more quickly. In this lightning talk at All Things AI, Tia Pope, a third-year PhD student at North Carolina A&T, shows how tools like ProtGPT2 and ESM are lowering the barrier to advanced protein analysis for anyone curious enough to try.
Tia frames protein research as a hidden war taking place inside our bodies. One side consists of proteins such as antibodies, enzymes, and hormones that build, repair, and protect us. The other side includes viral proteins that hijack cells, as well as misfolded or dysfunctional proteins that contribute to disease. COVID-19 made this dynamic more visible, as the virus relies on its spike protein to bind to human cells and begin infection.
Artificial intelligence has accelerated how we study this microscopic world. Tools like AlphaFold, including its newer versions, have dramatically improved the speed of protein structure prediction. Tasks that once required months or years of experimental work can now often be approximated computationally in hours or days. Experimental techniques such as X-ray crystallography are still necessary for confirmation, but AI has made structural insights far more accessible.
A major driver of this progress is the transformer architecture. Unlike older recurrent neural networks, which process sequences one step at a time, transformers use attention mechanisms to relate every position in a sequence to every other position at once. That matters for proteins, where amino acids far apart in the sequence can end up close together in the folded structure. This approach has enabled powerful advances in protein modeling, including systems that learn from large datasets of amino acid sequences.
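To make the attention idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not how ProtGPT2, ESM, or AlphaFold are implemented: it assumes one-hot encoded residues and uses the input directly as queries, keys, and values, whereas real models learn separate projections, add positional information, and stack many attention heads and layers. The example sequence and the function names are made up for this sketch.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode each residue as a 20-dimensional one-hot vector."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in seq]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Compare this position against every position in the sequence at once.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, x)) for i in range(d)]
        )
    return outputs, weights


sequence = "MKTAYIAKQR"  # hypothetical example sequence
outputs, attn = self_attention(one_hot(sequence))
```

Each row of `attn` says how strongly one residue attends to every other residue. With one-hot inputs and no learned weights, a residue simply attends most to identical residues elsewhere in the sequence; training is what lets real models pick up biologically meaningful relationships.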
Despite these advances, much of biology remains poorly understood. Many proteins in public databases have not been experimentally validated, and large portions of the so-called dark proteome have unknown or unclear functions. At the same time, evolving pathogens and increasing antibiotic resistance highlight the need for faster and more effective discovery methods.
In practice, generative models such as ProtGPT2, which is available through platforms like Hugging Face, can propose novel protein sequences. These generated sequences can then be analyzed with models such as ESM, which provide predictions about structure and potential function. These predictions are probabilistic and require experimental validation before any real-world conclusions can be drawn.
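The generate-then-analyze workflow described above might look roughly like the following sketch. It is not taken from Tia's notebooks. Several things here are assumptions: the Hugging Face model IDs `nferruz/ProtGPT2` and `facebook/esm2_t6_8M_UR50D`, the sampling settings (borrowed from the ProtGPT2 model card), and the choice to use a mean-pooled ESM embedding as the analysis step. The helper names are hypothetical. The code requires the `transformers` and `torch` packages, which are imported lazily inside the functions that need them.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def clean_generated(text):
    """Strip the end-of-text token and FASTA-style line breaks from model output."""
    return text.replace("<|endoftext|>", "").replace("\n", "").strip()


def is_standard_protein(seq):
    """True if seq is non-empty and uses only the 20 standard residues."""
    return bool(seq) and set(seq) <= AMINO_ACIDS


def generate_candidates(n=5, max_length=100):
    """Sample candidate sequences from ProtGPT2 (assumed model ID, see lead-in)."""
    from transformers import pipeline  # lazy import: heavy dependency

    generator = pipeline("text-generation", model="nferruz/ProtGPT2")
    raw = generator(
        "<|endoftext|>",
        max_length=max_length,
        do_sample=True,
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    cleaned = [clean_generated(r["generated_text"]) for r in raw]
    # Filtering to standard residues is a design choice made for this sketch.
    return [seq for seq in cleaned if is_standard_protein(seq)]


def embed_with_esm(seq, model_id="facebook/esm2_t6_8M_UR50D"):
    """Return a mean-pooled ESM-2 embedding for one sequence (assumed model ID)."""
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = EsmModel.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_size)
    # Simple average over all tokens, including the special start/end tokens.
    return hidden.mean(dim=1).squeeze(0)
```

Calling `generate_candidates()` and then `embed_with_esm(seq)` on each result downloads model weights on first use. The embedding is only a starting point for downstream analysis such as clustering or comparison against known proteins; it is not evidence that a generated sequence folds or functions.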
Key takeaways
- Models such as AlphaFold and protein language models have significantly accelerated protein structure and sequence analysis. They complement rather than replace experimental methods.
- Generative models such as ProtGPT2 can propose new protein sequences, while models such as ESM can help evaluate their plausibility.
- Significant gaps remain in our understanding of protein function, especially within the dark proteome, which creates opportunities for further research.
- These tools are increasingly accessible, but meaningful impact still depends on careful interpretation, validation, and domain expertise.
Tia’s research builds on these models, and she provides notebooks for hands-on exploration. The barrier to entry is low, and the potential impact is high. Curious minds are needed to test these models, separate signal from noise, and refine them so they can safely contribute to real-world treatments.
The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.