
PLUMB blab #1: We need better benchmarks for computer-aided drug design

Written by: Ariana Brenner Clerkin, PhD

Welcome to a new series on the OMSF blog: PLUMB blab! This series will take you behind the scenes of PLUMB (Protein-Ligand Unified Metrics Benchmark), an open-source effort to build better benchmarks for computer-aided drug design (CADD). At OMSF, we believe that rigorous, transparent, and community-driven benchmarking is essential for improving predictive models in molecular science. With PLUMB, Ariana Brenner Clerkin is working to create a high-quality, reproducible benchmark that integrates structural and affinity data across diverse protein-ligand systems. Throughout this series, Ariana will share key decisions, challenges, and insights as we develop PLUMB into a valuable resource for researchers, force field developers, and machine learning scientists alike.

Welcome to PLUMB blab!

This blog series will take you behind the scenes as we build PLUMB (Protein-Ligand Unified Metrics Benchmark)—an effort to build better benchmarks for computer-aided drug design (CADD). Along the way, I will share key decisions, challenges, and insights. 

This effort is conducted in conjunction with the Living Journal of Computational Molecular Sciences best practices paper, “Best Practices for Constructing, Preparing, and Evaluating Protein-Ligand Binding Affinity Benchmarks [Article v1.0],” which aims to create a fully self-consistent, open community resource and set of best practices for building protein-ligand binding affinity benchmarks [1].

In this first post, I want to start with a fundamental question: why do we need better benchmarks for protein-ligand affinity prediction? 


Why Benchmarking Matters for Drug Discovery

Proteins control countless cellular processes, and some play a role in disease. These disease-associated proteins, or target proteins, can often be modulated by therapeutic drugs—small molecules that interact with the protein to produce a disease-mitigating effect. Algorithms that predict how well potential drugs bind to target proteins are a cornerstone of computer-aided drug design (CADD), helping researchers identify promising drug candidates faster and more efficiently.

The accuracy of these predictions depends heavily on the models we use to simulate molecular interactions. Force fields—mathematical models that rely on physical principles to define the potential energy of a system from its structure [2]—are fundamental to CADD. Scientists apply these force fields in molecular dynamics (MD) simulations, which model the motion of protein-ligand systems over time, enabling researchers to estimate thermodynamic properties such as binding free energy. Binding free energy calculations have become a highly promising approach in CADD [1], [3]. But how do we assess the accuracy of force fields and the free energy calculations that depend on them?  
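To make the "potential energy from structure" idea concrete, here is a toy sketch of two nonbonded terms that appear in typical classical force fields: the Lennard-Jones and Coulomb interactions between a pair of atoms. Real force fields also include bonded terms (bonds, angles, torsions) and carefully fitted parameters; the parameter values below are placeholders for illustration only.

```python
def lennard_jones(r, sigma=0.34, epsilon=0.36):
    """12-6 Lennard-Jones energy (r, sigma in nm; epsilon in kJ/mol).

    Placeholder parameters roughly in the range used for carbon atoms.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb(r, q1, q2):
    """Coulomb energy in kJ/mol for point charges q1, q2 (in units of e) at distance r (nm)."""
    ke = 138.935458  # Coulomb constant in kJ mol^-1 nm e^-2
    return ke * q1 * q2 / r

def pair_energy(r, q1, q2):
    """Total nonbonded energy for one atom pair under this toy model."""
    return lennard_jones(r) + coulomb(r, q1, q2)
```

A full force field sums terms like these over every atom pair (plus the bonded terms) to assign a potential energy to any configuration of the system, which is what an MD engine then uses to propagate the dynamics.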

To determine whether these computed thermodynamic properties are accurate, we must compare against experimental data. This requires benchmark datasets—collections of protein-ligand systems with experimentally measured binding affinities that allow us to evaluate and refine force fields and free energy calculation methods.
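In practice, "comparing against experimental data" usually comes down to summary statistics over a set of ligands: an error metric such as RMSE, and a rank-correlation metric that asks whether the method orders ligands correctly. A minimal sketch with made-up numbers (not real data; kcal/mol assumed):

```python
import math

def rmse(pred, expt):
    """Root-mean-square error between predicted and experimental values."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, expt)) / len(pred))

def kendall_tau(pred, expt):
    """Kendall rank correlation: do the predictions rank the ligands correctly?"""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred[i] - pred[j]) * (expt[i] - expt[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy example: predicted vs. experimental binding free energies (kcal/mol)
predicted = [-9.1, -7.8, -8.5, -6.9]
experimental = [-9.4, -7.5, -8.9, -7.1]
```

A benchmark dataset supplies the `experimental` column for many diverse systems; the quality of that column is exactly what the criteria below are about.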

A useful benchmark dataset should be:

✅ Diverse, covering a broad range of protein-ligand complexes to ensure general applicability.

✅ High-quality, with reliable ground truth affinity measurements.

✅ Open and transparent, so the research community can use and improve it. 

Unfortunately, current benchmarks fall short on at least one of these criteria.


Limitations of Existing Benchmarks

While structural and affinity data are available across multiple databases, they can be difficult to directly use for benchmarking. Some common challenges include:

🚨 Data quality issues – Some protein structures may have low quality or resolution. This makes it difficult or impossible to correctly model bound ligand poses or reliably prepare the protein-ligand system for MD simulation. 

🚨 Scalability vs. accuracy trade-offs – Manually curated datasets may be high quality, but curation is time-intensive, whereas automated docking methods scale better yet can introduce errors such as unrealistic ligand poses. 

🚨 Lack of reproducibility – Some benchmark sets were curated by hand without sharing the assumptions, protocols, or code used to build them, and therefore cannot be reproduced. 

🚨 Lack of quality control and annotations – Many benchmark sets lack a uniformly applied quality control process to ensure that all ligand poses are stable and that challenging aspects of the systems (such as ligands of differing charge, the presence of cofactors, or ordered waters present for some ligands) are appropriately annotated.

As a result, we do not have a comprehensive, rigorously curated benchmark dataset that is both scalable and reproducible.

To address this challenge, we are developing PLUMB, an open-source benchmark designed to integrate structural and affinity data for a diverse and continually growing range of protein-ligand systems. 

PLUMB is intended for multiple audiences who benefit from having a high-quality, automatically-curated collection of protein-ligand structures and affinity data. These audiences include:

  • Free energy methods developers assessing their methods on different target classes or in the presence of various challenges (e.g., ligand charge transformations or ring opening/closing)
  • Force field developers assessing force field accuracy on easy-to-sample protein-ligand systems
  • Structure-based machine learning researchers training or assessing their models on large protein-ligand benchmark sets

For more details, check out our PLUMB codebase here: https://github.com/omsf/plumb 


What’s Next?

This blog, PLUMB blab, will keep you updated on PLUMB’s progress—discussing key decisions, challenges, and insights. 

In next month’s post, I will tackle our first major decision: which existing datasets should be included in PLUMB? 

We will analyze available structural and affinity datasets, evaluating their strengths and limitations in the context of force field and free energy method benchmarking. Stay tuned! 


Sources

[1] D. F. Hahn et al., “Best Practices for Constructing, Preparing, and Evaluating Protein-Ligand Binding Affinity Benchmarks [Article v1.0],” Living J. Comput. Mol. Sci., vol. 4, no. 1, 2022, doi: 10.33011/livecoms.4.1.1497.

[2] S. Barnett and J. D. Chodera, “Neural Network Potentials for Enabling Advanced Small-Molecule Drug Discovery and Generative Design,” GEN Biotechnol., vol. 3, no. 3, pp. 119–129, Jun. 2024, doi: 10.1089/genbio.2024.0011.

[3] Z. Cournia, B. Allen, and W. Sherman, “Relative Binding Free Energy Calculations in Drug Discovery: Recent Advances and Practical Considerations,” J. Chem. Inf. Model., vol. 57, no. 12, pp. 2911–2937, Dec. 2017, doi: 10.1021/acs.jcim.7b00564.