We won a hackathon by letting an LLM agent decide what a property-prediction model should learn from. The interesting part was finding where it stopped helping.
A quick framing before the results. Most “LLM agents for science” posts tell you where an agent worked
In drug discovery you measure ADMET properties — absorption, distribution, metabolism, excretion, toxicity — one assay at a time. Every new assay starts cold: a handful of measured molecules, often 10–50, because each label is a real experiment that costs real time and money. Train a model on that handful and you get nothing. On our benchmark, few-shot prediction from scratch scored a mean test Spearman of 0.08 — statistically indistinguishable from guessing.
The standard fix is transfer learning: pretrain on past assays you already have labels for, then fine-tune on the few new ones. Which raises the question that is usually settled by a senior scientist’s intuition — which past assays do you transfer from? (Classical transferability scores try to answer this without an LLM
So we made that decision the object of study. We held the predictor fixed — a plain shared-trunk MLP — so that any improvement is attributable to selection, not to a fancier model. Then we asked: how far can an LLM agent push that one decision, and where does it stop mattering?
The first agent is small and boring by design. It sees candidate source assays as a list with summary statistics and metadata (what each assay measures, how it correlates with the target, how much the molecules overlap). It has two actions: evaluate a proposed subset against a held-out split, and finalize a choice with a short, chemist-readable rationale. It loops a few times, then commits.
Against a brute-force powerset oracle — which literally tries every combination of source assays and keeps the best — the agent reaches the oracle’s performance. It beats naive “pretrain on everything” pooling decisively ($p < 10^{-4}$) and matches a strong correlation heuristic, with its margin concentrated at the coldest start ($n=5$), exactly where the decision is hardest and the data gives you the least to go on.
Scaling it up to the Therapeutics Data Commons benchmark evaluate signal sharpens with data while a static correlation statistic does not. On regression tasks it’s the best method at every sample size.
The detail I keep coming back to: nobody told the agent any chemistry. Yet its selections are mechanistically sensible — it routes a half-life task to clearance assays, a protein-binding task to lipophilicity. It rediscovered the relationships a pharmacologist would draw, from metadata and a few dozen labels. That’s the moment that makes you think the reasoning is doing something real.
Picking sources is one decision. A real pipeline has several: how many labels to use, how to spend the labeling budget, which assays to transfer. So we handed all of it to a single ReAct agent
On the three sparsest endpoints — the genuinely label-starved ones, 222 to 1300 labels — the autonomous agent reached 96% of the brute-force oracle and +56% over no transfer, beating the no-transfer floor on all nine runs.
Here’s the honest framing, because it matters: it does not beat a carefully-tuned expert greedy pipeline — on average it ties it. The claim is automation, not dominance: a single agent that also tunes the knobs a human would tune, matching the expert and nearly reaching the oracle, while writing down its reasoning. And on the hardest cold start — the endpoint with the lowest floor — it did edge past greedy. The reasoning earns the most where transfer is hardest. That’s the pattern you’d hope for.
This is the test that makes the post worth writing.
Same setup, a neighbouring question — arguably the more valuable one to a chemist: with zero labels on a new target, which molecules should you measure first? Spend your first five experiments well and you bootstrap the whole assay. We built the agent a per-molecule recommender: it reads the other assays’ values plus RDKit descriptors and names specific molecules to measure, with a reason for each (“unique scaffold and low LogD”, “high Caco-2 efflux”, “similar to m12, skip it”).
It lost. Not to another agent — to k-medoids, a label-free clustering heuristic that just picks geometrically representative molecules in the embedding space. At the real cold-start regime ($n=5$), k-medoids was the best method on 8 of 9 targets. The agent beat random sampling but couldn’t catch the heuristic.
My first instinct was that we’d handicapped the agent — it saw a lossy summary, k-medoids saw the full embedding. So we ran the control: we gave the agent k-medoids’ exact input, the full pool embedding, plus a Python REPL with k-medoids, farthest-first, and V-optimal design as callable tools. If the gap were an information problem, it would close. It didn’t. The agent inspected the data, judged the k-medoids picks good, and finalized them — reproducing the heuristic’s selection exactly on 81% of cases. It had nothing to add because there was nothing to add: representative geometric sampling is simply the right objective for “cover the space”, and an LLM can’t out-reason a problem that’s already solved by an algorithm. A clean null result.
(The per-molecule agent still earns its place in the demo — it’s the interpretable layer, naming molecules with a chemical reason a human can audit. But sell that as transparency, not accuracy. It ties the heuristic; it doesn’t beat it.)
Put the three tests side by side and the line is sharp:
The practical takeaway, and the thing I’d tell anyone building these systems: use the LLM for the judgment calls, not the math that already has a good algorithm. The wins come from reasoning over messy, semantic, discrete choices — not from asking a language model to do numerical optimization that scipy does better. Knowing where the agent doesn’t help is what lets you build something that does.
This was a hackathon: two benchmarks (OpenADMET ExpansionRx
Built with [teammates] at the Exponential Discovery in Life Sciences hackathon.