Meet the Humans Building AI Scientists


From inside an industrial facade in San Francisco’s Dogpatch neighborhood, a crow takes flight. This tool-using corvid is the chosen mascot of FutureHouse, a nonprofit determined to automate scientific discovery using AI agents that can generate hypotheses, connect existing findings, and even suggest experiments.
Since its launch two years ago, FutureHouse has steadily rolled out a family of “crow”-themed tools for researchers. ChemCrow helps design and execute chemical reactions. WikiCrow compiles encyclopedia-style summaries of human proteins — including their structure and known functions — by drawing on thousands of papers. ContraCrow sifts through the literature to find contradictory claims. PaperQA and its successor, PaperQA2, enable users to query PDFs and glean reliable answers without “hallucinated” misinformation. LAB-Bench, their benchmarking suite, measures how well these agents handle real-world biological tasks. And Aviary — software explicitly designed “to give language models access to the same tools as human researchers” — has enabled open-source LLMs “to exceed human-level performance on two more of the lab-bench tasks: doing scientific literature research and reasoning about DNA constructs” with only “modest compute budgets.”
Despite their varied uses, each tool revolves around a common principle: letting an AI system read and reason about biological data to accelerate discoveries.
FutureHouse’s focus on the scientific literature is no accident, either. CEO Sam Rodriques has long lamented the state of publishing, writing that “the biomedical literature is vast and suffers from three problems: it does not lend itself to summarization in textbooks; it is unreliable by commission; and it is unreliable by omission.” Many other scientists share his view.
The Allen Institute for AI introduced Semantic Scholar way back in 2015; it was among the earliest platforms to rank and predict research relevance with machine learning rather than raw citation counts. Elicit, launched in the fall of 2023, gained two hundred thousand users by word of mouth; it promised a “one-click literature review” that, in controlled tests, cut time in half for researchers sifting through papers. Meanwhile, OpenAI’s “Deep Research” is now offering automated assistance for tasks ranging from summarizing journal articles to generating experiment designs.
While these tools move us closer to the ideal of frictionless access to biological knowledge, FutureHouse is aiming higher. The team wants not only to streamline access to the scientific literature but also to mine it for untapped research directions — “unknown unknowns” that could lead to breakthroughs. Their ten-year mission is to build semi-autonomous AIs for science, from predictive models that explore genetic variants to humanoid robots that could one day run entire experiments on their own.
To learn more about FutureHouse’s ambitions, we sat down with co-founders Sam Rodriques and Andrew White.

{{divider}}
The Interview
A lot of your tools reference crows. What’s up with that?
White: When I got started in this space around October 2022, I was red-teaming GPT-4. Around the same time, the paper “On the Dangers of Stochastic Parrots” was circulating, and people were debating whether these models were just regurgitating their training data or truly reasoning. The parrot analogy is appealing; parrots are certainly known for mimicking speech. But what we saw was that pairing these language models with external tools made them much more accurate — a bit like crows, which can use tools to solve puzzles.
In the work that led to ChemCrow,1 for instance, we found that giving the large language model access to calculators or chemistry software made its answers much better. So we retconned things a bit: “Crows” became our name for agents that can interact with tools using natural language.
FutureHouse launched a bit more than two years ago. When you first set out on this quest to build an AI scientist, what did you assume would be simple? And which problems turned out to be surprisingly difficult?
Rodriques: The first thing I did when thinking of making an AI scientist — which was a little bit before ChatGPT came out in November 2022 — was to figure out which tasks are easy for humans and which are easy for AI models. A great example is flipping burgers; it’s relatively easy for humans, but pretty difficult for robots. Solving mathematical proofs, on the other hand, seems to be easier for AI models and more difficult for humans.
Practically speaking, one thing we found difficult was creating the infrastructure for these agents and getting them access to data and various web sources. We’ve also been surprised, like many people, by how easy the cognitive work is for these models; they’re exceptionally good at both hypothesis generation and drawing conclusions.
White: I thought that most things were going to be hard, actually. But it turns out that some of the hardest things have nothing to do with AI. Engineering and production work were unexpectedly difficult. Going from a demo in a Jupyter notebook (used to write Python code) to getting something that can run at scale is a lot of work.
In January 2023, PaperQA was working quite well, but the scores against humans weren’t great; the model was only about half as good as humans. When we added better parsing, though, and built tools to better find open-access papers — so not even any breakthroughs on the algorithm itself — PaperQA almost doubled in performance. That took about one year of work.
Making WikiCrow, which wrote Wikipedia-style articles for every protein-coding gene in the human genome, was also a ton of engineering work. We were writing 20,000 articles, and each article required five PaperQA calls, so it was 100,000 calls in total. One of the most challenging things was getting that to run in a reasonable amount of time.
{{signup}}
What kinds of data are still needed to build your AI agents?
Rodriques: Data is certainly a limiting factor right now. We need both better and more data on how humans do science, including recordings of how people actually talk about it. We have almost none of this kind of data, but it is crucial to build a human-level AI scientist.

How can we trust that the agents you’re building are giving reliable information? A person who speaks with enough authority, after all, is capable of convincing an expert despite gaps in their knowledge.
Rodriques: That’s a great question. It’s similar to talking to a human who tells you what to do and sounds very authoritative: as a scientist, you still need to go think about it and see if it’s right. People who trust the models without thinking may be the same people who trust other people without questioning them. People need to be critical and skeptical when it makes sense to be.
I’m optimistic that an AI scientist will help with reproducibility overall. Did you actually do the experiment you said you did? Did you record all the variables so that you can report the experiment the way you actually ran it? Obviously, if someone is making something up, it is going to be as hard for a model to detect as it would be for a human.
Another kind of reproducibility problem, which I expect to be even more common, is where the data is real but you’ve analyzed it in a way that invalidates your conclusions. Here’s an example:
Say you conduct twenty analyses on your data until you find one with a p-value of less than 0.05. This is something where the agents we are building will be very helpful. You can say to the agent, “Hey! Here are some papers with an analysis; please reproduce it on this data.” It should be able to access the data and run the analysis to tell you if it reproduces. And it should be able to run twenty other analyses. If you can systematically run data analysis at scale, p-hacking is no longer an issue. Instead of reporting the one significant p-value out of 100 tests, you run all 100 analyses and report the distribution of p-values. That distribution tells you much more about the data than a single test does, especially if you know how correlated those statistical tests are.
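To make this concrete, here is a minimal sketch (not FutureHouse code) of the idea: run the same analysis many times on data with no real effect and look at the whole distribution of p-values rather than cherry-picking the single “significant” one. The dataset and the statistical test below are invented for illustration.

```python
# Minimal sketch of the p-hacking scenario described above: run many analyses
# on data with no true effect, then examine the full distribution of p-values
# instead of reporting only the single smallest one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups drawn from the same distribution: there is no real effect.
# Each of the 100 rows stands in for one candidate "analysis."
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 30))
group_b = rng.normal(loc=0.0, scale=1.0, size=(100, 30))

# Run the same test across all 100 analyses and keep every p-value.
p_values = np.array([stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)])

# Cherry-picking: roughly 5% of tests fall below 0.05 purely by chance.
print(f"smallest p-value: {p_values.min():.4f}")
print(f"fraction below 0.05: {(p_values < 0.05).mean():.2f}")

# Under the null, p-values are roughly uniform on [0, 1]; a skew toward small
# values across the whole distribution, not one lucky test, is what would
# actually indicate a real effect.
print(np.histogram(p_values, bins=10, range=(0, 1))[0])
```

Reported this way, a handful of values below 0.05 is expected by chance, and only a systematic skew across the whole distribution counts as evidence.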
How are your autonomous science agents evaluated on tasks pertaining to the scientific literature? And how do they stack up against human scientists?
White: We made LitQA because we needed something automatic and fast so we could iterate quickly. It’s about 250 questions that are very hard. Humans score about 67 percent and our latest models are at 90 percent. So we are well above humans — and these are PhD-level professional biologists who are paid to answer and are incentivized to do well. Of course, that doesn’t represent real science; it’s more like trivia questions. We also had WikiCrow write Wikipedia articles, paired them with existing human-written ones, and blind-evaluated them. This is a good example of how we trained on trivia questions but ended up outperforming humans on general knowledge.
As an organization, however, we want to measure our performance on novel scientific discoveries. That’s our five-year KPI. We will know PaperQA is working well when it is integrated into research and contributing to discoveries. At the end of the day, as these models get closer and closer to humans, we will evaluate them like humans. What makes a good PhD student, for example? They come up with good ideas, scope them correctly, drive progress forward, and write a paper. That is eventually how we will evaluate these models. But just as it is hard to evaluate a PhD student based on grades or a first-year exam, we won’t know how good these models are until we put them in the lab and see what they can come up with.

FutureHouse is a non-profit research institute, but you are not a Focused Research Organization (FRO). What is the distinction, and do you foresee a for-profit offshoot in the future?
Rodriques: I imagine that there will be a point where what we’ve made has so much commercial demand that you have to spin out a for-profit. It’s very common for non-profits to spin out for-profits. Universities do it all the time.
FROs are non-profits that tackle projects too large for academia but that cannot be done as a for-profit. In that sense, we are very much an FRO. The difference, though, is that when we first wrote down the FRO model, we specified a few things about how FROs work that don’t apply to FutureHouse; those specifics were partly there to make the model palatable to a certain set of funders. FROs are funded for five years, we ask that they be funded at a particular scale, and they are really supposed to be milestone-driven. When we got started, building an AI scientist was a new and nebulous idea. We have a much better sense of what it means now.
But at the time, it was difficult for us to write out clear milestones or objectives for FutureHouse, because we didn’t know what was going to happen. So now, our funding isn’t limited to five years and we are funded in a different way compared to FROs. This structure allows us to stay nimble.
What are some misleading assumptions that people make about your work?
Rodriques: Many people assume we’re focused on wet lab automation. There are certainly opportunities there and we are exploring them, but the biggest opportunities are actually on the cognitive side.
We also contend with a lot of biosecurity assumptions. There is a community of people who are very concerned about biosecurity and some who assume that what we are building will be dangerous. I like to emphasize that, fundamentally in biology, you have to bring things into the world. Biosecurity is always a challenging question because our goal is to manipulate human biology to cure disease. If you can manipulate biology, you can also create things that are very dangerous. We think about it a lot.
In terms of bringing things into the world: How are your wet lab automation efforts going?
Rodriques: It’s not the main focus of our work, as I said. But AI models are going to be way better at it than we are, especially in cases where you have high-throughput wet lab automation. For a human scientist doing wet lab work, the hardest thing is often just remembering the dozens of different experimental conditions being tested at once — but that’s exactly what these AI systems are designed for.
What we are interested in building is the cognition layer that goes on top of the actual experiments. Once you have an experiment you want to test, there are established methods — like design-of-experiments — that define the parameter space and help decide what to test.
Put another way, what is really special about AI today is that, with language models, we are going to be able to apply AI to poorly structured spaces. When you’re in a well-structured space, like protein sequences, DNA sequences, or even defined experimental variables such as chemical concentrations, there are lots of classic methods you can use, such as training a foundation model or running Bayesian optimization. But when you are in a poorly defined space, like natural language, where hypotheses about anything conceivable can be explored, traditional methods don’t work very well. This is why the revolution is really going to come from being able to apply AI to those poorly structured spaces.
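To illustrate the distinction, here is a toy sketch (again, not FutureHouse code) of classic Bayesian optimization over a well-structured space: a single hypothetical “concentration” parameter, a simulated yield measurement standing in for a wet lab experiment, and a Gaussian-process surrogate that picks the next condition to test. The parameter range, yield function, and acquisition rule are all invented for illustration.

```python
# Toy Bayesian optimization over a well-structured experimental space
# (one hypothetical "concentration" knob). The yield function is simulated;
# in a real lab it would be the measured outcome of an experiment.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

def run_experiment(concentration: float) -> float:
    """Stand-in for a wet lab measurement: noisy yield that peaks near 0.3."""
    return float(np.exp(-((concentration - 0.3) ** 2) / 0.02) + rng.normal(scale=0.05))

candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)  # the defined parameter space
X = [[0.1], [0.5], [0.9]]                               # a few initial conditions
y = [run_experiment(c[0]) for c in X]

for _ in range(10):
    # Fit a Gaussian-process surrogate to all experiments run so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper-confidence-bound acquisition: favor high predicted yield plus uncertainty.
    next_x = candidates[np.argmax(mean + 1.5 * std)]
    X.append(list(next_x))
    y.append(run_experiment(next_x[0]))

best = X[int(np.argmax(y))][0]
print(f"best concentration found: {best:.2f} (yield {max(y):.2f})")
```

Methods like this work precisely because the space is defined up front; there is no analogous parameter grid for hypotheses expressed in natural language, which is the gap language models are meant to fill.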
In a recent tweet, you showed off a humanoid robot sitting on a couch in your HQ. Why are you using humanoid robots instead of more conventional robots specifically designed for biology?
Rodriques: There is a key distinction, in terms of automating biology, between one-off experiments and running experiments at scale. The tools required in each case are very different.
Imagine you are building a car. There are two different regimes: either you’re in the “I’ve never built a car before and I want to build one” regime or the “I’ve built a car before, but I want to build 100,000 cars” regime. If you want to produce 100,000 cars, you build an assembly line with a bunch of specially designed robots that each do one thing. If you want to change the size of the wheels, you’re going to have to throw out some of the robots. God forbid you suddenly want to build a helicopter. There would be no chance, right?
If you’re in the “I’ve never built a car before” regime, you don’t do any of that. You don’t buy a bunch of robots. You buy a machine shop and you build the car as a one-off. Basic discovery research is just a bunch of one-off experiments: imagine a graduate student who does every experiment once, for the first time, and comes up with an amazing discovery. That’s the kind of science we want to automate, and it isn’t compatible with huge robotic systems. Most of what we do right now is humans designing and performing experiments; the next step is AI assisting those humans, and the future is general-purpose robotics guided by AI. We are interested in humanoid robots because they map more closely onto the kind of experimentation we are looking to emulate.

In one of your essays, Sam, you wrote about how lab automation is difficult, in part, because robots can’t adapt to surprises. You discovered, in your laboratory, that a broken gasket was leaching a chemical into your cell cultures — unbeknownst to your team — that interfered with results. Will robots ever be able to diagnose and resolve problems like this?
Rodriques: Using robots to run wet lab experiments is ultimately a sensing problem. The ability of humans to perceive things is very good. So much of biology is having a tube that you tilt up to the light in just the right way so you can see its contents. Just try to get a robot with a camera to do that. It’s tough. And if you aren’t able to sense like that, something like a chemical leaching into an experiment is incredibly difficult to detect.
Human sensorimotor function is way more evolved than human cognition. Sensorimotor function has been evolving since the Cambrian, roughly 500 million years ago. Cognition, in the human sense, has been evolving for only a few million years by comparison. It’s not surprising that AI models will match human-level cognition before we have robots and sensory systems that are as good as humans.
Okay, last question. How are you getting your tools into the hands of scientists? How do you grow your community of users?
Rodriques: We’re still trying to figure that out. The first thing to appreciate is that our mission is to automate and scale scientific research. That’s the core goal. The idea is not to create productivity tools; we want to make sure we keep building and don’t get pulled into a commercialization cycle. We don’t want the quality of the technology we are building to be shaped by commercialization, and that’s why we operate as a non-profit.
The plan is to eventually launch a platform for people to use our tools. For the foreseeable future, though, the North Star is building more capable agents, because that’s where the value ultimately comes from.
White: The end goal is to have a platform that runs scientific intelligence at a scale that is good enough to work on all genes, all proteins, all diseases.
{{divider}}
Interviews by Bryan Duoto. Photography by Xiaofan Fang. This article was edited for brevity and clarity.
Cite: Duoto, B. "Meet the Humans Building AI Scientists." Asimov Press (2025). DOI: 10.62211/42py-87gh
This article was published on 19 March 2025.
Lead image by Ella Watkins-Dulaney.
Footnote
1. A large language model that plans and executes chemical synthesis steps. It was pre-printed in April 2023 and published in May 2024.