
Do AI Scientists perform abductive reasoning, and what does that mean for how we should build AI Scientists?

Blog post containing a white paper on whether AI Scientists perform abductive reasoning, written as part of CMU's 80-329 Philosophy And Causation course.


This blog post contains the white paper that I wrote as part of CMU's 80-329 Philosophy And Causation course which I took in Fall 2025, taught by Prof. Peter Spirtes. This course was my first Philosophy class. In Spring 2025, when I saw the course being offered for the first time, I was drawn to its description, so I contacted Prof. Spirtes and requested permission to enroll. The class size was small (6 students, with the only 2 undergraduates being me and a 4th-year), which gave ample room for discussion. (I was really surprised at how much philosophers love to debate and engage in dialectical thinking. Once someone starts talking they can go on for 20 minutes!) I am thankful for the learning opportunity as it exposed me to the world of Causation, which I feel holds some gems that are highly relevant to current AI progress. If you are interested in the question of how to untangle causation from correlation, then you should look at the CMU Philosophy department's work. They are SOTA for Causation!!

This is my TLDR of the course content.

  1. Philosophers have been trying for ages to come up with a formal definition for Causation. 4 main schools of thought are (I'm glossing heavily over the actual phrasing):
    1. Regularity Theory of Causation: causation is all about regularities (i.e., when we see $X$ happening, we tend to see $Y$ happening after $X$).
    2. Counterfactual Theory of Causation: suppose we wanted to say $X$ causes $Y$. If it's true that the absence of $X$ brings about the absence of $Y$, then $X$ causes $Y$.
    3. Process Theory Of Causation: Imagine a ball hitting another ball. Both balls are travelling through space-time. So they both have world lines. If the world line of the first ball intersects the world line of the second ball (in this case yes, because the first ball hit the second ball, so the first ball's position in space overlaps that of the second ball), and there's an exchange of a conserved quantity (in this case, momentum), then the first ball causes the second ball to move.
    4. Probabilistic Theory of Causation: $X$ causes $Y$ if $X$ happening raises the probability of $Y$ happening.
  2. Every one of the 4 schools of thought has problems. As people tried to patch the counterexamples, the formal definition just kept getting more and more complicated.
  3. Instead of arguing about how to define Causation, which is still unsolved, another approach is to lay down some axioms, and then build upon those axioms to make useful tools.
  4. Prof. Spirtes did point 3, and together with other researchers like Prof. Clark Glymour, created causal discovery algorithms that extract the causal structure (or part of it) from observational data, which allows us to untangle causation from correlation. E.g. the "PC" in "PC algorithm" comes from their first names!
  5. Point 4 is extremely useful for certain fields like genomics or climate studies where it is very costly or outright impossible for scientists to intervene in systems of interest. When we do science, we can either intervene in a system and collect data about what happens after the intervention, or we can study the system as an external observer. Many times we have to stay as an observer because of practical or ethical constraints. These causal discovery algorithms then help researchers learn which variables drive behavior in others without the need to intervene.

Now on to my project. Causal reasoning is central to so much of Science. The paper is about one type of causal reasoning: abductive reasoning. This is the kind of reasoning that allows us to think of hypotheses that explain why other phenomena occur. (E.g. Newton saw an apple fall + looked at planetary motion and thought of gravity. August Kekulé imagined a snake biting its tail and thought of benzene's ring structure to explain its properties.) This ability is mandatory for effective AI Scientists, hence my project. Note that the kind of analysis I did in Section 4 can technically apply to the questions of "can LLMs reason?" and "can LLMs understand things?". So it provides (the beginnings of) a framework to think about how to measure these capabilities and devise methods that increase these capabilities. Once again, I'm available via email (see the left sidebar) to learn from any criticism.


1. Introduction

Much of human progress has been driven by the systematic pursuit of understanding the world through science. From early inventions such as controlled fire and agriculture to modern revolutions like electricity and antibiotics, scientific inquiry has repeatedly transformed the quality of human life. Our ability to do science, enabled by our intelligence, is arguably the key reason for our species' dominance and survival. Yet, despite this intellectual prowess, human existence remains fragile relative to the scale and unpredictability of nature. To deepen our understanding of the world and to secure a more sustainable future, society has long sought to accelerate the process of scientific discovery itself.

As scientific problems have grown more complex, researchers have increasingly turned to computational methods to assist the search for solutions. Early efforts relied on hand-designed software to automate data analysis, and specialized hardware to run experiments more quickly and reliably. Tools such as SPSS [1] and MATLAB [2] streamlined statistical analysis, while Applied Biosystems' PRISM 3700 DNA Analyzer sped up the Human Genome Project [3]. With the rise of artificial intelligence (AI), these efforts evolved into training models to accelerate key steps in scientific discovery. DeepMind's AlphaFold2, which solved the decades-long challenge of predicting protein structures [4], is a characteristic example.

Today, the growing generality and intelligence of Large Language Models (LLMs) have inspired many efforts to build "AI Scientists" - systems capable of automating large portions of the scientific method [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Without a doubt, these advances are remarkable, but they can still fail in serious ways that undermine the utility of their research outputs [17].

To address these limitations, we must examine what abilities human scientists have that frontier AI Scientists currently do not. Understanding these gaps clarifies the capabilities that future AI Scientists need to possess in order to do trustworthy science. A number of studies have done this analysis [17], [18], [19], [20], but this paper argues that they focus mainly on detecting issues with the observable behaviors, such as the trace of actions taken by the AI Scientist agent, or final artifacts of these systems, such as the generated research papers. This paper claims that such an evaluation strategy, while highly practical, may not be sufficient, as the observable failures stem from deeper deficiencies in the underlying cognitive skills required for an intelligent being to conduct science.

Therefore, this paper

  1. briefly reviews how AI Scientist systems are currently evaluated and highlights the predominance of evaluation at the observable layer;
  2. argues that abductive reasoning is a core cognitive prerequisite for scientific inquiry, therefore an AI Scientist system must possess robust abductive capabilities to be genuinely useful;
  3. examines different perspectives on what it means for an intelligent system to be "performing abduction", contrasting functional and mechanistic criteria and defending a predominantly functional view;
  4. analyzes one case study using the criteria developed in (3) to show that state-of-the-art AI Scientist systems can do a non-trivial form of abductive reasoning, but there are rough edges that need fixing;
  5. discusses how we might evaluate abductive reasoning in practice using the criteria developed in (3), and
  6. sketches possible approaches for improving the abductive reasoning capabilities of the LLMs that power these systems, including the potential role of more open-ended, world modelling-oriented training regimes.

2. Brief review of AI Scientist Evaluation as of Dec 2025

Research on AI Scientists has grown rapidly in the past few years, and with it a body of work that attempts to evaluate these systems. At a high level, existing approaches to evaluation can be grouped into four types. This section analyzes each type and argues that existing evaluations predominantly focus on the observable layer of behaviors and artifacts.

(1) Landscape surveys. The first category consists of survey papers that classify existing AI Scientists, summarize current evaluation protocols, and identify broad areas for improvement. Tie et al. categorized AI Scientist systems and benchmarks from mid-2023 to late-2025 according to the stages of the scientific method, and highlighted frontiers such as reproducibility, epistemic humility, cross-domain generalization, and synergy with human researchers [20]. Gridach et al. similarly survey "agentic AI for scientific discovery", describing systems across domains such as chemistry, biology, and materials science, and reviewing commonly used datasets (e.g. LAB-Bench, MoleculeNet) and metrics (e.g. task accuracy, prediction error, human evaluation of generated reports) [18]. A common theme among these surveys is the emphasis on ensuring that the outputs of the system are sound and can be checked.

(2) System papers that grade final artifacts. A second type of evaluation appears within the papers that introduce specific AI Scientist frameworks. These works typically propose an architecture (often an LLM-centric, multi-agent pipeline) and then demonstrate its effectiveness through a suite of tests. For instance, systems such as Sakana.AI's AI Scientist v2, AMD's Agent Laboratory, Hong Kong University's AI-Researcher, and others report metrics like success rates on end-to-end research workflows, reproduction of known results, discovery of new variants that match or exceed baseline performance, or human ratings of the quality, novelty, and soundness of generated papers [6], [8], [9], [20]. A more comprehensive list of AI Scientist systems can be found in Section 5. Some papers also adopt conference-style review rubrics (e.g. NeurIPS-like scores for quality, significance, and clarity) to evaluate AI-generated manuscripts [18]. While these evaluations differ in detail, they all primarily judge the final research artifacts (papers, experimental results) rather than the internal reasoning processes that produced them.

(3) Process-level audits that look at traces and code. More recently, researchers have moved beyond grading final papers to auditing the integrity of the automated research process itself. Luo et al. investigated whether AI Scientists adhere to scientific norms by focusing on four specific pitfalls: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias [17]. Their methodology proceeded in three distinct steps. First, to prevent the AI Scientist from recalling memorized solutions from the internet, they created a fully synthetic task called "Symbolic Pattern Reasoning" (SPR), where the AI Scientist must identify hidden logical rules governing sequences of abstract shapes and colors. Second, they ran two prominent systems (Sakana.AI's AI Scientist v2 [6], AMD's Agent Laboratory [8]) on this task to generate a corpus of research projects. They found that these systems naturally committed pitfalls, such as picking easier benchmarks or selecting models based on test-set performance. To create a balanced dataset for testing detection, they supplemented these naturally occurring failures by manually injecting errors (such as explicit data leakage) into some of the clean projects. Third, they developed an LLM-based auditor and tested its ability to flag these flawed projects. The results showed that the auditor could rarely detect the flaws when reviewing the final paper alone (51.4% accuracy), but performance improved significantly when it was given access to the execution logs and generated code (78.3% accuracy). This confirms that critical process failures often remain invisible in the final manuscript and require transparent access to the system's trace to be detected. While a key step in verifying the authenticity of AI Scientist outputs, this study also focuses on the observable layer.

(4) Skill-specific benchmarks. The final line of work consists of benchmarks that target specific skills in the scientific method. ScienceAgentBench, for example, extracts 102 code- and data-analysis tasks from real papers and scores agents by whether their generated Python programs run and recover the reported results [21]. DiscoveryWorld provides a virtual environment in which agents must complete full cycles of scientific discovery, but performance is still measured by task completion and discovered knowledge rather than by process-level reasoning quality [22]. These benchmarks thus refine how behavior is tested, but remain focused on observable success at specific stages of the pipeline.

Separate from these system-oriented benchmarks, there is a rapidly growing ecosystem of evaluations that probe cognitive abilities closely related to scientific work. HypoBench evaluates hypothesis generation along dimensions such as practical utility and discovery rate [23]; LLM-SRBench and NewtonBench test equation and law discovery beyond memorization in synthetic scientific settings [24], [25]; and CURIE measures long-context scientific understanding and reasoning across multiple disciplines [26]. These benchmarks are much closer in spirit to the "evaluation below the observable layer" that this paper advocates, thus they are a valuable start. However, current practice still positions such cognitive-skill benchmarks as secondary checks within broader tests of AI Scientist workflow success, rather than as the core of the evaluation.

3. Abductive Reasoning as a cognitive prerequisite for science

The previous section argued that existing evaluations of AI Scientists focus mostly on the observable layer. This focus is natural and practically useful, but may not be sufficient to determine whether a system is cognitively able to do trustworthy science in the first place. LLMs are well known to exhibit "savant-like" profiles: they can have superhuman performance on hard tasks while failing in unexpected and sometimes pathological ways, including hallucinations, spurious reasoning, and brittle generalization [27]. These failure modes suggest that something is amiss with their underlying cognitive competencies.

This paper's hypothesis is that the observable failures of AI Scientist systems stem from deficiencies in the cognitive prerequisites for science. For example, if an AI Scientist systematically struggles to notice when data contradicts its assumptions, these shortcomings will surface as misleading conclusions. Thus, if we wish to construct trustworthy AI Scientists, we need to look below the observable layer and investigate whether the system under test possesses the cognitive prerequisites for doing science. This paper focuses on one such prerequisite: abductive reasoning.

Abduction is often characterized as inference to the best explanation. Given a surprising or puzzling observation, an agent generates a hypothesis such that, if the hypothesis were true, the observation would be the least surprising. For instance, after noticing an apple fall on his head and observing patterns in planetary motion, Newton thought of gravity as a fundamental force in the universe. The concept of abduction dates back to Aristotle's apagoge, but Charles Sanders Peirce was the first major logician to systematically outline that there are 3 basic forms of inference: abduction, deduction, and induction, and that scientific inquiry cycles through them in that order [28]. Abduction generates candidate explanatory hypotheses, deduction derives testable predictions from those hypotheses, while induction tests those predictions against data and updates the agent's belief in light of the results.

A simple medical example helps to illustrate the difference between Peirce's 3 types of reasoning:

| Abduction | Deduction | Induction |
| --- | --- | --- |
| I see a pattern of surprising cases. I think of a hypothesis such that, if it were true, the observation would be the least surprising. | Given the rule and the case, what follows? | I see many cases. I generalize a rule or probability that is probably true beyond those cases. |
| I notice Alice has a fever. Maybe she has the flu (caused by a virus); if that were true, the fever would make sense as the body raises its temperature to fight pathogens. | Assume that people with the flu develop a fever (rule). A test shows Alice has the flu (case). So Alice will have a fever (result). | Observe many patients. When tests show they have the flu, they almost always have a fever. So it is probably true that people with the flu tend to develop fevers. |

Seen from this perspective, it becomes apparent that abduction underpins several key scientific activities:

  • Formulating research questions and hypotheses. Scientists routinely move from patterns in data, anomalies in existing theories, or surprising empirical findings to conjectures about underlying causal structure. This is abductive in character.
  • Interpreting results. After an experiment or analysis, researchers must ask what could explain the observed outcomes: whether they confirm, undermine, or refine existing hypotheses, and what alternative explanations remain viable. This again requires inference to the best explanation given the data and background knowledge.
  • Unifying and extending theories. Many important advances in science involve finding a single explanatory framework that accounts for previously disparate phenomena. Such moves are cases of abduction.

Without some capacity for abduction, a system might still be able to apply fixed rules or check the internal consistency of a given argument. But it would lack the generative side of science, and this would eliminate most exploration. If the system applies "faulty" abduction, it will waste time and compute as its hypotheses do not bear fruit, or yield marginal gains that overfit to benchmarks without solving structural issues. It may also provide weak explanations for experimental results, which dilutes the signal provided to future iterations of its exploration loop. For these reasons, this paper argues that abduction is one of the abilities that make scientific inquiry possible at all.

If we accept that abductive reasoning is a cognitive prerequisite for doing science, then a natural question that peers below the observable layer would be: "Do frontier AI Scientist systems perform abductive reasoning?" Note that in order to answer this question, there is a need to test for abductive reasoning. However, before doing any tests, we must be clear about the behaviors we are trying to test for. This warrants a philosophical discussion of what it even means for an intelligent system to "perform abduction", or in other words, what criteria should be used to decide whether an agent has successfully executed abductive reasoning. The next section tackles this conceptual issue.

4. What does it mean for an intelligent system to "perform abduction"?

Clarifying whether AI Scientists perform abductive reasoning requires an analysis of what would count as an instance of abduction in an intelligent system. This section distinguishes two broad families of criteria (functional vs. mechanistic) and motivates a predominantly functional approach that will guide the rest of the paper.

4.1. A functional criterion: abduction is a mapping $\mathcal{P}(\text{observations}) \mapsto \mathcal{P}(\text{explanations})$

This view characterizes abductive reasoning as a (possibly stochastic) mathematical function that maps a set of observations to a set of explanatory hypotheses. The internal workings of the system are treated as a black box; what matters is whether, given certain inputs, the system tends to produce appropriate outputs.

The domain is the power set of individual observations - e.g., experimental results, anomalous measurements, or surprising regularities in a dataset. The codomain is the power set of candidate hypotheses about unobserved structure, causal mechanisms, or governing laws. Note that the function outputs a set of hypotheses, which covers cases of overdetermination. Each hypothesis is supposed to explain the set of input observations, and should come with a score that estimates how true the hypothesis is. Then, a system can be considered as "performing abduction" if, for a wide range of observation sets, it produces hypothesis sets that satisfy these two properties:

  1. Explanations are correct (high recall). At least one hypothesis is present in the set such that if it were true, the set of observations would be rendered unsurprising. Therefore, if we were to act on this hypothesis, it would lead to correct insights, fixes and improvements.
  2. Explanations are calibrated (high precision). Hypotheses in the set that are closer to the truth, or that better account for the observations given background knowledge, tend to receive higher scores. So, the ranking induced by the scores (which can have multiple modes) meaningfully tracks the actual hypothesis quality.

A common analogy makes the point vivid. Airplanes do not fly by flapping their wings like birds, but they still fly. What justifies calling both activities "flying" is not shared mechanism, but shared functional role: supporting sustained motion through the air under gravity. So, on a functional reading, if a system behaves in ways that satisfy the specification above for the range of cases we care about, then it might be perfectly reasonable to say it is performing abduction. This echoes how abduction was originally introduced in philosophy; Peirce's distinction between abduction, deduction, and induction is drawn at the level of inferences, not neural implementation.

This view naturally leads to evaluations that test for task performance. To assess whether a system instantiates the abductive function, one designs tasks where explanations are needed and then judges its outputs: Does the agent, given sets of observations, propose hypotheses about unobserved structure? Does it treat those hypotheses as explanations with varying degrees of plausibility? And when those hypotheses are used to guide further interventions, do they lead to successful control of the system? Because it is infeasible to check the function specification for the entire domain, practical evaluation seeks broad coverage of the contexts that matter.
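
To make this specification concrete, the sketch below writes the functional criterion as a type signature: a map from a set of observations to a set of scored hypotheses. This is a minimal illustration, not part of any existing system; the `AbductiveReasoner` protocol and `ScoredHypothesis` record are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Protocol, Set


@dataclass(frozen=True)
class ScoredHypothesis:
    """A candidate explanation plus the system's own estimate of its quality."""
    statement: str  # e.g. "human-text detector scores have heavier tails than a Gaussian"
    score: float    # higher means the system believes the explanation is more likely true


class AbductiveReasoner(Protocol):
    """Functional criterion: a (possibly stochastic) map from the power set of
    observations to the power set of explanations, with self-assigned scores."""

    def abduce(self, observations: FrozenSet[str]) -> Set[ScoredHypothesis]:
        ...
```

On this reading, anything that implements `abduce`, whether an LLM call, a symbolic search procedure, or a human scientist, can be evaluated against the recall and calibration requirements above.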

4.2. Mechanistic perspectives: abduction is tied to its implementation

Mechanistic perspectives, by contrast, are concerned not only with what comes in and goes out, but with how the inference is carried out. On this view, it is not sufficient that a system's outputs look abductive from the outside; one must also inspect aspects of the internal process. There are at least two subcamps here.

Mechanistic interpretability concerns. This subcamp emphasizes studying a system's internal workings in order to guide further development. Researchers in this camp worry that LLMs may arrive at apparently explanatory outputs by trivial means: for example, by retrieving memorized patterns from pretraining data [29] or by exploiting superficial statistical regularities [30], rather than by constructing and comparing hypotheses in a way that deserves to be called inference to the best explanation. Recent benchmarks suggest that even when LLMs sometimes return plausible answers to observational questions, their estimates of real-world distributions and their performance on fresh causal scenarios are systematically misaligned with ground truth, so there is little evidence that they can internalize a robust model of the underlying causal processes [31], [32]. If they cannot internalize causal processes, then this calls into question their ability to generate such processes, as required in abduction. From this perspective, merely checking task performance is not enough; one must also probe the intermediate representations and reasoning traces to see whether the model is actually integrating evidence, combining prior knowledge in non-trivial ways, and updating its commitments when faced with new evidence [33], [34]. If the internal process looks wrong, the system might be "bluffing" abduction, even if its outputs align with what an abductive reasoner would say.

Concerns about similarity to human reasoning. A second, more demanding subcamp goes further and takes humans as an implicit gold standard. Here the suggestion is that an AI system should only count as performing abduction if it does so in the same way that humans do. On this view, it is not enough for the system to generate good explanations; to qualify as genuinely abductive, the underlying reasoning process should mirror human scientific cognition in some believable way. When critics say that current LLMs are not "really reasoning" or "thinking like us", they often gesture at this stronger mechanistic requirement.

These two subcamps are not mutually exclusive. Interpretability work can morph into concerns about similarity to humans when researchers implicitly take human-like chains of thought as the target pattern. Note that the functional-mechanistic distinction can also apply to reasoning in more general terms: replacing "abduction" with "reasoning" or "understanding" yields essentially the same debate. The dialectic developed in this paper can therefore also be used to address whether LLMs genuinely "reason" or "understand" anything.

4.3. The limits of using human reasoning as the gold standard

The human-mimicry subcamp faces a serious epistemic obstacle. We currently lack a detailed account of how human brains implement abduction. Cognitive science and neuroscience provide valuable models, but there is no settled story about the precise neural mechanisms that underpin this capability. Worse, our intuitive attempts at reasoning are themselves unreliable. Humans systematically fail at probabilistic reasoning and are prone to fallacies and biases; we often need to override intuition by explicitly invoking mathematical machinery. This suggests that there may be multiple, interacting "systems" of reasoning in the human mind, some more normatively aligned than others, and that our introspective access to them is limited.

Given this, using human abduction as the gold standard is difficult to make precise. At best, human cognition can provide a rich source of inspiration and a set of informal benchmarks. It is entirely reasonable for neuroscience to ask how brains implement abductive inference (or reasoning in more general terms) and for AI researchers to explore biologically inspired architectures. But at this juncture, if one were to insist that AI systems only count as performing abduction if they do so "the way we do", and refuse to deploy them otherwise, significant productivity gains might be forgone as the empirical question may remain unresolved for quite some time.

Moreover, while evolution has had hundreds of thousands of years to discover the current mechanism implemented in our brains, there is no guarantee that this solution is unique or even optimal in all respects. Just as systems like AlphaGo have discovered strategies in games that are alien to human intuition yet undeniably effective [35], it is conceivable that future AI systems could implement abductive reasoning in ways that diverge from human mechanisms while still satisfying the functional specification.

For these reasons, this paper treats human-mimicry as an interesting research direction but not as a plausible criterion (as of now) for deciding whether a system performs abduction.

4.4. Interpretability as an engineering tool rather than part of the definition

The interpretability-oriented subcamp raises a different, and more tractable, concern. Here the worry is not that AI systems fail to match human mechanisms, but that without some insight into their internal processes, we cannot safely trust or improve them. Interpretability is crucial for debugging models, identifying spurious shortcuts, and designing safer training procedures. In the context of AI Scientists, being able to inspect how hypotheses are generated, how evidence is weighed, and why particular experimental choices are made is essential for safety, accountability, and efficient iteration.

This paper takes this concern very seriously. Interpretability should be treated as a first-class citizen in the engineering and governance of AI Scientists. However, there is an important distinction between using interpretability to study and improve abductive reasoning versus building interpretability into the definition of abduction itself. If we say that a system is only performing abduction when its internal processes meet some specific mechanistic norm, we face the same problem as the human-mimicry subcamp: we must first specify a gold standard, and at present we do not have a uniquely privileged one. Different architectures may implement the abductive role in different ways, and requiring a particular internal pattern risks excluding alternative, potentially superior solutions.

Furthermore, the very fact that interpretability work can surface hallucinations or detect faulty outputs when inputs are perturbed, shows that current systems do not satisfy the full abductive specification yet. Efforts to make these systems robust to such failures can then be viewed as attempts to realize that specification, bringing the debate back to functional criteria. Thus, this paper argues that a more sustainable position is to separate two questions:

  1. Definition: What does it mean, in functional terms, for a system to perform abduction?
  2. Engineering and safety: How can we use interpretability and related tools to ensure that systems actually satisfy this functional specification in a safe manner?

On this view, interpretability is indispensable as an engineering tool and safety guard, but it is not itself part of the conceptual criterion for abduction.

4.5. A predominantly functional stance

The stance adopted in this paper is therefore mostly functional. Abductive reasoning is defined in terms of its role: taking a set of observations as input and generating a weighted set of hypotheses as output. An AI system counts as "performing abduction" if, across the relevant contexts, its behavior meets this specification. Interpretability plays a crucial supporting role: it helps us reveal hidden failure modes and be more efficient when searching for improvements to existing systems, but this search may well uncover novel mechanisms that do not resemble human cognition. In the next section, this functional perspective is applied to analyze whether abductive reasoning occurs within a state-of-the-art AI Scientist.

5. A Case Study

The preceding section developed a functional criterion for what it would mean for an intelligent system to "perform abduction". This criterion will now be applied to a concrete system.

Table 1: AI Scientist systems released from mid-2024 to late-2025 and the stages of the scientific method they automate (adapted from [17]).

| Release Date | System Name | Creator | Hyp. Gen. | Exp. Exec. | Paper Writing | Peer Review | Field | Open Source? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aug 2024 | The AI Scientist v1 [5] | Sakana.AI | Yes | Yes | Yes | Yes | Computer Science | Yes |
| Jan 2025 | Agent Laboratory [8] | AMD | Yes | Yes | Yes | No | Computer Science | Yes |
| Feb 2025 | AI Co-Scientist | Google DeepMind | Partial | No | Partial | Partial | Generic, BioMed | No |
| Mar 2025 | Carl [11] | Autoscience | Partial | Partial | Partial | No | Computer Science | No |
| Mar 2025 | Zochi [12] | Intology.AI | Partial | Partial | Partial | No | Computer Science | No |
| Apr 2025 | The AI Scientist v2 [6] | Sakana.AI | Yes | Yes | Yes | No | Computer Science | Yes |
| May 2025 | Robin [15] | Future House | Yes | Partial | No | No | BioMed | Yes |
| May 2025 | AI-Researcher [9] | Hong Kong University | Yes | Yes | Yes | Yes | Computer Science | Yes |
| Jul 2025 | InternAgent [10] | Shanghai AI Lab | Partial | Yes | No | No | Multiple | No |
| Oct 2025 | DeepScientist [14] | Westlake University | Yes | Yes | Yes | No | Computer Science | Yes |
| Nov 2025 | Kosmos [16] | Future House | Yes | Yes | Partial | No | Generic, BioMed | No |

Based on Table 1, this study focuses on DeepScientist [14] as it is (i) one of the most recent systems, (ii) open-source, (iii) automates most stages of the scientific workflow, and (iv) claims to have advanced the state-of-the-art in 3 computer-science domains.

5.1. How DeepScientist works

DeepScientist views scientific discovery as an iterative search over research ideas, guided by a global memory of past results, a hypothesis generator powered by Gemini 2.5 Pro, and a Claude Code agent to implement experiments. Each run of the system follows the same conceptual loop, regardless of the task.

Global Findings Memory. The system maintains a Global Findings Memory that stores structured entries of three types: Idea Findings (hypotheses that have been proposed but not yet shown to improve the state of the art), Implementation Findings (ideas that have been implemented and tested but did not beat the baseline), and Progress Findings (ideas whose implementations achieved measurable improvements over a human baseline). Each record contains the hypothesis description, its reviewer scores, associated code, logs, and experimental metrics.
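
As a rough illustration only (the field names below are guesses based on the paper's description, not the actual DeepScientist schema), a memory entry could be represented along these lines:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class FindingType(Enum):
    IDEA = "idea"                      # proposed, not yet shown to improve the SOTA
    IMPLEMENTATION = "implementation"  # implemented and tested, did not beat the baseline
    PROGRESS = "progress"              # implementation measurably improved on the human baseline


@dataclass
class Finding:
    finding_type: FindingType
    hypothesis: str                    # natural-language description of the idea
    reviewer_scores: Dict[str, float]  # e.g. {"utility": 72.0, "value": 65.0, "exploration": 80.0}
    code_path: str = ""                # associated code, if any
    logs: List[str] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)
```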

(1) Task specification and retrieval of past findings. A human first specifies a concrete target, such as "improve LLM inference throughput on benchmark $B$" or "design a better AI-text detector on dataset $D$". Given this task description, DeepScientist uses an embedding model to retrieve the top-$K$ ($K=15$ in the reported experiments) most relevant past findings from the Global Findings Memory. These retrieved findings constitute the observational context for the next round of idea generation. In the published implementation, the human authors seeded the Global Findings Memory with data from papers published at ICLR 2025.

(2) Hypothesis generation and valuation. The retrieved findings are passed to Gemini 2.5 Pro, which is prompted to identify current limitations of the baseline and to propose a batch of new hypotheses $H_1, \dots, H_N$. For each hypothesis $H_i$, the same LLM is then asked to assign three scalar scores in $[0, 100]$: utility (expected performance gain), value/quality (coherence and technical plausibility), and exploration (novelty or uncertainty). Each hypothesis together with its scores is written back into the Global Findings Memory as an Idea Finding.

(3) Selecting the next idea to implement. To decide which hypothesis to test next, DeepScientist applies an acquisition rule inspired by upper-confidence-bound Bayesian optimization. For each $H_i$, it computes

$$\text{score}(H_i) = \text{utility}_i + \text{value}_i + \text{exploration}_i,$$

and selects the hypothesis $H_{\text{best}}$ with the highest score (in the original system, the three components can be weighted differently; the experiments described in the DeepScientist paper use equal weights). This balances exploitation (choosing ideas that look immediately promising) with exploration (testing ideas that push into less tried directions).
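
A minimal sketch of this acquisition step (the `select_next_idea` helper and its weighting interface are illustrative, not DeepScientist's actual code):

```python
from typing import Dict, List

# Each entry holds the three LLM-assigned scores for one Idea Finding,
# e.g. {"utility": 72.0, "value": 65.0, "exploration": 80.0}.
IdeaScores = Dict[str, float]


def select_next_idea(ideas: List[IdeaScores],
                     w_utility: float = 1.0,
                     w_value: float = 1.0,
                     w_exploration: float = 1.0) -> int:
    """UCB-inspired acquisition: return the index of the idea with the highest
    weighted sum of utility, value, and exploration scores. Equal weights
    reproduce the setting reported in the DeepScientist experiments."""
    def total(s: IdeaScores) -> float:
        return (w_utility * s["utility"]
                + w_value * s["value"]
                + w_exploration * s["exploration"])

    return max(range(len(ideas)), key=lambda i: total(ideas[i]))
```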

(4) Implementation and verification. Once $H_{\text{best}}$ is chosen, DeepScientist spawns a Claude Code agent inside a sandboxed copy of the relevant SOTA repository. The coding agent is instructed to (i) understand the baseline implementation, (ii) modify the code to realize $H_{\text{best}}$, and (iii) run the corresponding experiments. To reduce false positives, the system re-runs the main experiment pipeline independently after the coding agent reports success. The resulting code, logs, and metrics are recorded. If the new method fails to improve on the baseline, the corresponding record is stored as an Implementation Finding.

(5) Analysis, paper drafting, and promotion to Progress Findings. If the implementation of $H_{\text{best}}$ does improve the baseline, DeepScientist spawns more specialized analysis agents to run follow-up experiments such as ablations, robustness checks, and evaluation on additional datasets. Using the accumulated findings (plans, results, and plots), the system then prompts Gemini 2.5 Pro to generate a paper describing the new method and its empirical performance. The underlying idea, together with its validated results and paper draft, is stored as a Progress Finding in the Global Findings Memory so that future runs can treat it as a source of inspiration.

(6) Advancement of SOTA in 3 computer science domains. DeepScientist was pointed at three different computer-science tasks: agent failure attribution (where it discovered A2P, a counterfactual-based debugger that figures out where failures occurred in the logs of multi-agent LLM systems), LLM inference acceleration (where it proposed ACRA, a speculative decoding scheme that modestly but meaningfully increased throughput), and AI text detection (where it developed T-Detect, TDT, and PA-Detect, a sequence of detectors that improved robustness and latency over the SOTA). In total, DeepScientist generated roughly 5000 ideas, implemented about 1100 of them, and obtained 21 "Progress Findings" that surpassed the human baselines.

5.2. Where abduction occurs in the pipeline, and why it has rough edges

Locating the abductive step. At first glance it is tempting to ask whether the DeepScientist system performs abduction. But DeepScientist is not a monolithic agent: it is a scaffold that orchestrates several LLMs (Gemini 2.5 Pro for strategy, Claude 4 Opus for implementation, smaller embedding models for retrieval). To apply the functional criterion from Section 4, we need to zoom in on the following parts of the pipeline:

  1. the step where DeepScientist retrieves $K=15$ records from the Global Findings Memory and prompts Gemini 2.5 Pro to propose new research hypotheses, and
  2. the step where each hypothesis is assigned a triple of scores for utility, quality, and exploration value.

The more precise question then becomes:

Does Gemini 2.5 Pro, when prompted with information that is present within the DeepScientist framework, perform abduction according to the functional criterion?

The inputs and outputs type-check. The inputs to this Gemini subroutine are sets of observations because they are the set of 15 findings from the shared memory. The outputs are sets of candidate explanations $H_1, \dots, H_N$ where $N$ is decided by the model. This criterion can be satisfied rather trivially.

Hypotheses are graded, leading to partial satisfaction of the precision requirement. Recall that our specification states that each hypothesis should come with some score that meaningfully tracks how correct the explanation is. DeepScientist explicitly attempts this: Gemini produces utility, value and exploration scores for each idea and the system chooses the hypothesis that maximizes their sum. However, the way these scores are assigned is largely heuristic: the paper does not provide a theory tying the numbers to any calibrated probability of success, and there is no ground-truth signal supervising Gemini 2.5 Pro beyond its own informal judgments in its chain of thought.

Empirically, the idea funnel is extremely lossy. Roughly 1-3% of implemented hypotheses led to measurable improvements, and essentially none of 100 randomly selected ideas per task did. This suggests that the internal scoring has a discriminative power that is much better than random, but it is still far from a high-precision estimator of "true" hypothesis quality.

Some hypotheses were meaningful, suggesting non-trivial recall. DeepScientist did provide positive evidence that acting on some of its top-ranked hypotheses led to real improvements. On three frontier tasks, some Gemini-generated hypotheses, once implemented and refined, led to SOTA-surpassing methods:

  • AI text detection. For this task, each text receives a scalar score denoting the probability that it was AI-generated. The baseline assumes that, for human texts, these scores follow a Gaussian distribution and uses that model to set thresholds and probabilities. Under adversarial editing, however, many more texts end up with very extreme scores than a Gaussian would predict, so this assumption breaks and the detector becomes badly miscalibrated. Gemini noticed this limitation and proposed replacing the Gaussian with a heavier-tailed Student-$t$ model. Later, it switched to a wavelet-based method that looks at the stream of probabilities for each token and checks for short, localized spikes rather than relying on global Gaussian statistics. The final method, PA-Detect, achieved a 7.9-point AUROC gain over the prior SOTA while roughly halving latency, and remained robust across diverse adversarial attacks.
  • Agent failure attribution. In the Who&When benchmark, the goal is to read the logs of a multi-agent LLM system and identify which agent at which step caused a failure. Gemini inferred that the existing "All at Once" baseline simply feeds the whole trajectory to an LLM and asks it to guess, without explicitly reasoning about what would happen if a particular step were different. It proposed the Abduction-Action-Prediction (A2P) framework: first infer the most likely failure step (Abduction), then suggest a concrete fix (Action), and finally predict whether that single change would have led to success (Prediction). Implementing A2P raised accuracy from 12.07% to 29.31% in the handcrafted setting and from 16.67% to 47.46% in the algorithm-generated setting, and the same method worked well across both types of trajectories.
  • LLM inference acceleration. On MBPP, a program-synthesis benchmark used to measure decoding throughput of an LLM, Gemini's hypotheses explored ways to give speculative decoding a longer-term memory. The final ACRA method assumes that, over several decoding steps, part of the output sequence often stabilizes; it identifies the longest stable suffix and reuses it as a smarter draft for the next step, while still verifying every token exactly. This yields a modest but non-trivial 1.9% throughput gain over the heavily tuned Token Recycling baseline, without sacrificing correctness.

Across these cases, the hypotheses are clearly not random perturbations: they diagnose structural limitations in existing methods (Gaussian tails, lack of counterfactuals, short-context decoders), propose mechanisms that would resolve those limitations, and, once implemented, do in fact produce better behavior that often generalizes across settings (e.g., A2P working on both handcrafted and algorithm-generated logs, PA-Detect remaining robust to many attack types). This supports the claim that DeepScientist's abductive steps provide a non-trivial approximation to the abductive function.

Hallucination and reward hacking. However, there are also rough edges where the abductive mapping fails. DeepScientist did not operate fully autonomously, as 3 experts had to monitor its runs and veto hallucinations. Moreover, the second verification step (in which DeepScientist reruns the main script after Claude Code reports success) was added because roughly half of initial implementation attempts silently failed due to bugs, or because Claude Code simply claimed success when it had not achieved it. A post-hoc audit of 300 failed trials also found that about 60% of terminations were caused by bugs rather than flawed ideas. These facts show that the abductive module can sometimes propose bogus hypotheses, but even when its hypotheses are coherent, the different parts of the system do not necessarily "understand each other", so unreliable execution can easily break the end-to-end loop.

Human reviews concluded that DeepScientist's ideation lacked empirical soundness. Human reviewers checked the 3 research papers written by DeepScientist on AI text detection, agent failure attribution, and LLM inference acceleration. They agreed that DeepScientist "consistently excels at ideation" and has novel hypotheses, but found a recurring lack of empirical soundness. The proposed ideas frequently lacked comprehensive evaluation on standard benchmarks that a human familiar with the field would immediately think of, and were supported by only shallow ablations. In our terms, this means Gemini 2.5 Pro can produce a plausible main explanation, but does not systematically test alternative explanations, so its hypothesis recall may not be very high.

Overall Assessment. Functionally, Gemini 2.5 Pro inside DeepScientist does achieve a non-trivial approximation to the abductive mapping. It turns observations into graded hypotheses, some of which robustly fix real shortcomings and even generalize across settings. Yet the need for heavy human supervision, the very noisy idea funnel, and the issues with empirical soundness indicate that the specification from Section 4 is only partially met.

6. How might we test for abductive reasoning in AI Scientists?

The functional criterion developed in Section 4 treats abduction as a mapping from sets of observations to sets of weighted hypotheses, where the hypothesis set should ideally have high "recall" and "precision". This section sketches how these desiderata can guide benchmark design and then discusses a very recent framework, GEAR [36], which seems to be the closest instantiation of this criterion thus far.

6.1. Design principles from the functional criterion

The functional view suggests several general principles for evaluating abductive reasoning:

  • Evaluation via task performance, with interpretability guiding design of robust tasks. Benchmarks should still judge models by task performance, but on tasks where the system receives sets of observations and may return sets of hypotheses, with the whole set scored rather than only the top item (to respect underdetermination). Interpretability tools can then be used to stress-test these tasks, for example, by adding logically irrelevant distractors or adversarial perturbations and checking whether the system's hypotheses remain stable rather than "hacking" spurious cues.
  • Recall via ground-truth or interventional success. For each task, there should be a notion of what counts as a "good" explanation: either one or more ground-truth causal mechanisms (in simulations and toy domains), or an intervention that measurably improves the system under study (in realistic domains). Recall is then the probability that at least one hypothesis in the set meets this criterion.
  • Calibration via ranking. To test precision, the evaluation should expect the system to rank its hypotheses in some way, then measure how strongly this ordering correlates with downstream success (see the sketch after this list).
  • Module-level testing inside AI Scientists. For agentic AI Scientist systems, tests should target the abductive module directly: fix the surrounding orchestration, feed carefully controlled observation sets into the hypothesis-generation step, and evaluate the resulting hypothesis sets as above. This avoids conflating abductive failures with unrelated engineering errors elsewhere in the pipeline.
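
A minimal sketch of the recall and calibration checks described above, assuming each task supplies a judge `is_good` (ground-truth match or successful intervention) and a measured downstream-success value for every hypothesis that was acted upon; the helper names are hypothetical:

```python
from typing import Callable, Sequence

from scipy.stats import spearmanr  # assumes SciPy is available


def recall_over_tasks(hypothesis_sets: Sequence[Sequence[str]],
                      is_good: Callable[[str], bool]) -> float:
    """Fraction of observation sets for which at least one returned hypothesis is
    judged good (matches a ground-truth mechanism or measurably improves the system)."""
    hits = sum(any(is_good(h) for h in hs) for hs in hypothesis_sets)
    return hits / len(hypothesis_sets)


def calibration(system_scores: Sequence[float],
                downstream_success: Sequence[float]) -> float:
    """Spearman rank correlation between the system's own scores for its hypotheses
    and the downstream success observed when each hypothesis is acted upon."""
    corr, _ = spearmanr(system_scores, downstream_success)
    return corr
```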

6.2. GEAR: a general recipe for scoring sets of hypotheses

Most existing abductive benchmarks still rely on a single "gold" hypothesis and treat any alternative explanation as wrong. The GEAR framework takes a different approach that aligns closely with our functional criterion. It was explicitly designed as a benchmark for abductive reasoning in toy, programmable domains, where each hypothesis can be represented as an executable program. GEAR measured abduction on the following datasets:

  1. MINI-ARC [37] / ARC-2025 [38]: 2D visual "IQ" puzzles derived from François Chollet's ARC benchmark, where each observation is a pair of colored grids (input and transformed output), and a hypothesis is a program that maps any input grid to an output grid.
  2. ACRE [39]: lists of simple objects (e.g. [red cube, blue sphere]) paired with binary labels, where hypotheses are programs that implement abstract rules such as "return 1 iff there is at least one red object".
  3. LIST FUNCTIONS [40]: integer lists paired with outputs (e.g. [3,5,3] $\mapsto$ 2), where hypotheses are list-processing functions like "count how many times the first element appears".

In all cases, an observation is an input-output pair $o_i = (in_i, out_i)$, and a hypothesis is an executable program $f$ that takes any admissible input and returns an output. For each dataset, the authors construct a finite but rich sample space $S$ of possible inputs (e.g. many candidate grids or lists) and study each hypothesis through its prediction set

$$P_f = \{(in, f(in)) : in \in S\}.$$

Given a set $F$ of hypotheses produced by an LLM, GEAR scores it along three dimensions:

  1. Consistency. A hypothesis is consistent if it matches all observed examples in $O$ (i.e. $f(in_i) = out_i$ for every $(in_i, out_i) \in O$). This captures the minimal requirement that an explanation fit the known data.
  2. Generalizability. For each consistent hypothesis, GEAR measures how much of $S$ it covers with well-defined predictions (no crashes, no undefined behaviour). Intuitively, higher coverage means the hypothesis makes more testable claims about unseen cases rather than only handling a narrow corner of the input space.
  3. Diversity. Two diversity metrics (average per-input disagreement and Jaccard dissimilarity between prediction sets) quantify how much the hypotheses in $F$ genuinely differ in their predictions on $S$, rather than being trivial variants that behave almost identically.

These metrics are used twice: first, to analyze the abductive behaviour of nine LLMs across four benchmarks, showing that many consistent but diverse explanations can coexist even after additional test cases are revealed; and second, as a training signal. By turning GEAR scores into preference pairs and applying Direct Preference Optimization with a simple curriculum, the authors fine-tune open-source models to generate hypothesis sets that are more consistent, more general, and more diverse, and they show that these gains transfer to higher pass-rates and top-$k$ accuracy on held-out problems.
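
A hedged sketch of how these three dimensions could be computed when each hypothesis is an executable Python function, using simplified definitions (the exact GEAR formulas may differ); for LIST FUNCTIONS, for example, a hypothesis might be `lambda xs: xs.count(xs[0])`:

```python
from typing import Any, Callable, Sequence, Tuple

Hypothesis = Callable[[Any], Any]   # an executable program f
Observation = Tuple[Any, Any]       # an (input, output) pair


def safe_call(f: Hypothesis, x: Any) -> Any:
    """Run f on x, treating crashes as undefined behaviour."""
    try:
        return f(x)
    except Exception:
        return None


def consistent(f: Hypothesis, observations: Sequence[Observation]) -> bool:
    """Consistency: f reproduces every observed input-output pair."""
    return all(safe_call(f, i) == o for i, o in observations)


def generalizability(f: Hypothesis, sample_space: Sequence[Any]) -> float:
    """Generalizability: fraction of the sample space S with a well-defined prediction."""
    return sum(safe_call(f, x) is not None for x in sample_space) / len(sample_space)


def diversity(hypotheses: Sequence[Hypothesis], sample_space: Sequence[Any]) -> float:
    """Diversity: average pairwise per-input disagreement between prediction sets."""
    if len(hypotheses) < 2:
        return 0.0
    pairs = [(a, b) for i, a in enumerate(hypotheses) for b in hypotheses[i + 1:]]

    def disagreement(a: Hypothesis, b: Hypothesis) -> float:
        return sum(safe_call(a, x) != safe_call(b, x) for x in sample_space) / len(sample_space)

    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)
```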

6.3. How well does GEAR realize the functional criterion?

Viewed through the lens of Section 4, GEAR is a strong proof-of-concept for functional evaluation of abduction, but it does not fully realize the specification.

On the positive side, the framework is explicitly set-based: it evaluates sets of hypotheses rather than single outputs, and it embraces underdetermination by rewarding diversity. Consistency and generalizability, combined with stress-testing on hidden examples, act as a practical proxy for the recall requirement: in many problems, hypothesis sets produced by strong LLMs contain at least one program that continues to predict unseen observations correctly. Moreover, Simulation Study 2 shows that hypotheses with higher combined GEAR scores are systematically more likely to pass hidden test cases than lower-scoring alternatives, indicating that the metric itself correlates with future explanatory success.

However, two gaps remain relative to the full functional criterion. First, GEAR operates entirely at the level of prediction, not intervention: hypotheses are judged by how well they match additional data, not by whether acting on them would fix a faulty system or improve a scientific method. Second, the calibration requirement in Section 4 concerns the system's own scoring of hypotheses. In GEAR, the scoring function is external: GEAR defines how hypotheses should be ranked, and training nudges models toward generating higher-scoring hypotheses, but models are not required to output explicit confidences or implicit rankings that can themselves be tested for calibration.

In sum, GEAR offers a practical way to test whether a model, given a set of observations, can propose a diverse set of good hypotheses without gold labels. It provides a valuable template for abductive evaluation and aligns well with the functional stance of this paper. However, it operates in toy domains for which we can represent the hypothesis as a program. Extending the framework to cover the gaps mentioned above, and applying it directly to AI Scientist settings, where the data is messy, remains an important direction for future work.

7. How might we create AI Scientists with strong abductive capabilities?

The preceding sections treated abduction functionally and showed how to test for it. A natural next step is to ask how one might train AI Scientists so that they better realize the mapping $\mathcal{P}(\text{observations}) \mapsto \mathcal{P}(\text{explanations})$. Current work already hints at three broad strategies.

Camp 1: Directly optimize abductive behaviour over many relevant contexts. Systems in this camp try to train LLMs to score better on abductive benchmarks themselves. As mentioned in Section 6, GEAR used its consistency, generalizability and diversity scores as preferences to fine-tune LLMs [36] and improved pass-rates on the same toy domains. This is the most straightforward way of optimizing the abductive function, but so far it has only been demonstrated on programmable benchmarks where hypotheses can be executed as code.
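
As a rough illustration of this strategy (the combined score and the adjacent-pairing rule below are illustrative assumptions, not GEAR's exact recipe), preference pairs for DPO could be constructed like this:

```python
from typing import Dict, List, Tuple


def build_preference_pairs(prompt: str, candidates: List[Dict]) -> List[Tuple[str, str, str]]:
    """Turn GEAR-style scores into (prompt, chosen, rejected) triples for DPO.
    Each candidate is assumed to look like {"text": <hypothesis set as text>,
    "consistency": ..., "generalizability": ..., "diversity": ...}."""
    def combined(c: Dict) -> float:
        return c["consistency"] + c["generalizability"] + c["diversity"]

    ranked = sorted(candidates, key=combined, reverse=True)
    pairs = []
    for better, worse in zip(ranked, ranked[1:]):  # pair each generation with the next-best one
        if combined(better) > combined(worse):
            pairs.append((prompt, better["text"], worse["text"]))
    return pairs
```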

Camp 2: Ground hypothesis generation in formal structure. A second line of work focuses on constraining what counts as a permissible hypothesis. DeepScientist, for instance, explicitly flags the need for "derivable models" that embed scientific axioms as hard constraints, so that the ideas that are generated are rooted in theory [14]. On this view, hallucinations and weak empirical discipline are symptoms of an underconstrained hypothesis space. The cure is to add some sort of provenance to each claim and prevent the LLM from speculating in a free format. This has a nice side effect of increasing interpretability as well.

Camp 3: Combine LLMs with causal search algorithms. A third strategy lets LLMs share the abductive workload with symbolic causal search tools. CARE [41] fine-tunes an LLM to read the outputs of classical causal discovery algorithms (e.g., PC, GES, LiNGAM) together with observational data, and to output refined causal graphs that outperform both the raw algorithms and larger LLMs that did not undergo finetuning. CARE treats the causal discovery algorithm outputs as compressed "evidence" and trains the LLM to integrate that evidence with its world knowledge, using data augmentations (e.g., permuted names, omitted variables) to discourage superficial semantic shortcuts. In functional terms, this improves the precision of the abductive mapping by biasing it toward more robust causal structures.
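
A hedged sketch of the kind of input construction and name-permutation augmentation that this description suggests (the prompt format and helper names are hypothetical, not CARE's actual implementation):

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # a directed edge (cause, effect)


def permute_names(variables: List[str], edges: List[Edge]) -> Tuple[List[str], List[Edge]]:
    """Augmentation: replace variable names with randomly assigned placeholders so the
    model cannot lean on superficial semantic cues."""
    shuffled = random.sample(variables, len(variables))
    mapping = {v: f"V{i}" for i, v in enumerate(shuffled)}
    return [mapping[v] for v in variables], [(mapping[a], mapping[b]) for a, b in edges]


def build_prompt(variables: List[str],
                 algo_edges: Dict[str, List[Edge]],
                 data_summary: str) -> str:
    """Serialize the outputs of classical causal discovery algorithms (e.g. PC, GES,
    LiNGAM) together with a summary of the observational data for the LLM to refine."""
    lines = [f"Variables: {', '.join(variables)}",
             f"Observational data summary: {data_summary}"]
    for algo, edges in algo_edges.items():
        lines.append(f"{algo} proposed edges: " + "; ".join(f"{a} -> {b}" for a, b in edges))
    lines.append("Output a refined causal graph as a list of directed edges.")
    return "\n".join(lines)
```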

7.1. A speculative route: aim for world-modelling, and abduction emerges as a by-product

All three camps optimize abduction more or less directly: they either reward explanations that behave well (Camp 1), restrict the space of allowable explanations (Camp 2), or bolt the LLM onto an explicit causal search procedure (Camp 3). A complementary hypothesis is that human-like abductive skill arises from a deeper cognitive capacity: the ability to build rich, manipulable world models. Humans seem to possess strong causal intuitions. We are good at asking "what could have produced this?" and at mentally simulating counterfactual interventions. This paper speculates that these intuitions emerge from internal generative models that represent objects, dynamics, and abstract structures and can be updated under interventions.

If this is right, then the key step is not abduction itself but modelling. Once an agent has an internal model of how a system behaves, it can imagine interventions on that model, which naturally gives rise to causal intuitions. If the agent has learned that $X$ can produce $Y$, then upon observing $Y$ it becomes meaningful to search over candidate causes $X$ that could have generated it. This is precisely the abductive move. The same world model also supports other cognitive prerequisites for science, such as predicting the effects of interventions, checking consistency across domains, and re-using causal knowledge in new contexts.

On this picture, if we want AI Scientists with strong abductive abilities, we might aim at the higher target of training systems to acquire powerful world models. Abduction would then emerge as a consequence of having good models and good search, rather than as a hand-engineered capability. The space of possible world-modelling architectures and training regimes is enormous, so the search over this space should be done by our methods, not by us. This philosophy echoes Rich Sutton's "Bitter Lesson" about leveraging computation and search over hand-crafted structure.

The three camps discussed above can then be reinterpreted as different attempts to navigate the space of solutions that implement the abductive function. As we tinker with ideas in those camps, our methods will gradually improve, and the hope is that they will soon be able to lead their own search. The training objective could then be switched to a highly abstract goal, such as one oriented around world-modelling. Abduction, along with the other cognitive prerequisites of science, would then ideally arise as emergent properties.

8. Conclusion

Humans have long sought to understand the world in order to preserve our miraculous existence, and we are increasingly turning to AI to help formulate and solve scientific problems. Recent AI Scientist systems are a promising step in this direction, but can still fail in serious ways. Evaluation of these systems also focuses mainly on what is directly observable: traces of agent actions, benchmark scores, and the quality of generated papers. If we want AI Scientists that we can genuinely trust, we must look beneath these surface artifacts and uncover the degree to which the underlying cognitive prerequisites for science are present.

Focusing on abduction as one such prerequisite, this paper argued that the right level of analysis is not the AI Scientist scaffold as a whole, but the underlying LLM(s) that execute the abductive steps. The question then becomes: does an LLM, when used in a specific AI Scientist context, perform abduction? To make this question concrete, this paper developed a functional criterion for abduction and separated it from stricter mechanistic demands, which are better treated as engineering and safety concerns.

Using this criterion, this paper analyzed DeepScientist [14] and found that its Gemini-based hypothesis module does approximate abductive behaviour in a non-trivial way. At the same time, hallucinations, noisy scoring, weak empirical soundness, and implementation failures show that the functional specification is only partially met.

Finally, the paper sketched how to move forward. On the evaluation side, the functional criterion suggests treating abduction as a mapping from sets of observations to sets of weighted hypotheses, and using this to design benchmarks that score whole sets of explanations on recall and calibration rather than just checking a single "gold" answer. On the training side, the paper highlighted three emerging paradigms, plus a more speculative route that targets a more ambitious goal of rich world-modelling and treats abduction as an emergent by-product.
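
As a rough sketch of what such a benchmark metric could look like (one possible instantiation on my part, not an established proposal), a scorer might combine recall against a set of expert-approved explanations with a Brier-style calibration term on the assigned weights:

```python
# One possible instantiation of scoring a *set* of weighted hypotheses:
# recall against expert-approved gold explanations, plus a Brier-style
# calibration term on the weights. Metric choices are assumptions.

def score_hypothesis_set(weighted_hypotheses, gold_set):
    """weighted_hypotheses: dict mapping hypothesis text -> weight in [0, 1].
    gold_set: set of hypothesis texts judged acceptable by experts."""
    # Recall: fraction of acceptable explanations the system produced at all.
    produced = set(weighted_hypotheses)
    recall = len(produced & gold_set) / len(gold_set) if gold_set else 0.0

    # Calibration: weights should be high on acceptable hypotheses and low on
    # unacceptable ones (Brier score; lower is better, so report 1 - Brier).
    brier = sum((w - (1.0 if h in gold_set else 0.0)) ** 2
                for h, w in weighted_hypotheses.items()) / max(len(weighted_hypotheses), 1)
    return {"recall": recall, "calibration": 1.0 - brier}

# Example: the gold set has two acceptable explanations; the system found one
# of them with high weight plus one distractor with low weight.
print(score_hypothesis_set({"H1": 0.9, "H3": 0.2}, {"H1", "H2"}))
# {'recall': 0.5, 'calibration': 0.975}
```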

If this picture is right, then the path to AI Scientists runs through (1) explicit functional targets for the abstract cognitive capacities we care about but cannot yet introspect well; (2) interpretability to design robust tasks that genuinely test those specifications; and (3) iteration until our methods can explore the solution space themselves, at which point they may discover their own powerful ways of doing science that transform human productivity.

[1]
“SPSS - About SPSS Inc.” Accessed: Nov. 09, 2025. [Online]. Available: https://www.spss-asp.com/corpinfo/history/
[2]
“A Brief History of MATLAB.” Accessed: Nov. 09, 2025. [Online]. Available: https://www.mathworks.com/company/technical-articles/a-brief-history-of-matlab.html
[3]
C. A. Hutchison III, “DNA Sequencing: Bench to Bedside and beyond,” Nucleic Acids Res, vol. 35, no. 18, pp. 6227–6237, Sep. 2007, doi: 10.1093/nar/gkm688.
[4]
J. Jumper et al., “Highly Accurate Protein Structure Prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, Aug. 2021, doi: 10.1038/s41586-021-03819-2.
[5]
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2408.06292
[6]
Y. Yamada et al., “The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2504.08066
[7]
“Accelerating Scientific Breakthroughs with an AI Co-Scientist.” Accessed: Nov. 09, 2025. [Online]. Available: https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/
[8]
S. Schmidgall et al., “Agent Laboratory: Using LLM Agents as Research Assistants.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2501.04227
[9]
J. Tang, L. Xia, Z. Li, and C. Huang, “AI-Researcher: Autonomous Scientific Innovation.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2505.18705
[10]
I. Team et al., “InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2505.16938
[11]
“Meet Carl: The First AI System To Produce Academically Peer-Reviewed Research | Autoscience.” Accessed: Nov. 09, 2025. [Online]. Available: https://autoscience.ai
[12]
“Zochi Technical Report.” Accessed: Nov. 09, 2025. [Online]. Available: https://www.intology.ai/blog/zochi-tech-report
[13]
“Periodic Labs.” Accessed: Nov. 09, 2025. [Online]. Available: https://periodic.com/
[14]
Y. Weng et al., “DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively.” Accessed: Dec. 11, 2025. [Online]. Available: http://arxiv.org/abs/2509.26603
[15]
A. E. Ghareeb et al., “Robin: A Multi-Agent System for Automating Scientific Discovery.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2505.13400
[16]
L. Mitchener et al., “Kosmos: An AI Scientist for Autonomous Discovery.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2511.02824
[17]
Z. Luo, A. Kasirzadeh, and N. B. Shah, “The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2509.08713
[18]
M. Gridach, J. Nanavati, K. Z. E. Abidine, L. Mendes, and C. Mack, “Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2503.08979
[19]
R. Pool, Ed., AI for Scientific Discovery: Proceedings of a Workshop. Washington, D.C.: National Academies Press, 2024. doi: 10.17226/27457.
[20]
G. Tie, P. Zhou, and L. Sun, “A Survey of AI Scientists.” Accessed: Nov. 09, 2025. [Online]. Available: http://arxiv.org/abs/2510.23045
[21]
Z. Chen et al., “ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2410.05080
[22]
P. Jansen et al., “DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2406.06769
[23]
H. Liu, S. Huang, J. Hu, Y. Zhou, and C. Tan, “HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2504.11524
[24]
P. Shojaee, N.-H. Nguyen, K. Meidani, A. B. Farimani, K. D. Doan, and C. K. Reddy, “LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2504.10415
[25]
T. Zheng et al., “NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2510.07172
[26]
H. Cui et al., “CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2503.13517
[27]
“Andrej Karpathy: Software Is Changing (Again) : YC Startup Library | Y Combinator.” Accessed: Dec. 13, 2025. [Online]. Available: https://www.ycombinator.com/library/MW-andrej-karpathy-software-is-changing-again
[28]
D. R. Anderson, “The Evolution of Peirce’s Concept of Abduction,” Transactions of the Charles S. Peirce Society, vol. 22, no. 2, pp. 145–164, 1986, Accessed: Dec. 02, 2025. [Online]. Available: https://www.jstor.org/stable/40320131
[29]
M. Nasr et al., “Scalable Extraction of Training Data from Aligned, Production Language Models,” Oct. 2024. Accessed: Dec. 12, 2025. [Online]. Available: https://openreview.net/forum?id=vjel3nWP2a&utm_source=chatgpt.com
[30]
G. Bao, H. Zhang, C. Wang, L. Yang, and Y. Zhang, “How Likely Do LLMs with CoT Mimic Human Reasoning?,” in Proceedings of the 31st International Conference on Computational Linguistics, O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, Eds., Abu Dhabi, UAE: Association for Computational Linguistics, Jan. 2025, pp. 7831–7850. Accessed: Mar. 31, 2025. [Online]. Available: https://aclanthology.org/2025.coling-main.524/
[31]
H. Chi et al., “Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2506.21215
[32]
D. Plecko, P. Okanovic, S. Havaldar, T. Hoefler, and E. Bareinboim, “Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2511.03070
[33]
C. Agarwal, S. H. Tanneru, and H. Lakkaraju, “Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2402.04614
[34]
Y. Chen et al., “Reasoning Models Don’t Always Say What They Think.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2505.05410
[35]
D. Silver et al., “Mastering the Game of Go without Human Knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017, doi: 10.1038/nature24270.
[36]
K. He et al., “GEAR: A General Evaluation Framework for Abductive Reasoning.” Accessed: Dec. 12, 2025. [Online]. Available: http://arxiv.org/abs/2509.24096
[37]
S. Kim, P. Phunyaphibarn, D. Ahn, and S. Kim, “Playgrounds for Abstraction and Reasoning,” Oct. 2022. Accessed: Dec. 13, 2025. [Online]. Available: https://openreview.net/forum?id=F4RNpByoqP
[38]
F. Chollet, “On the Measure of Intelligence.” Accessed: Dec. 13, 2025. [Online]. Available: http://arxiv.org/abs/1911.01547
[39]
C. Zhang, B. Jia, M. Edmonds, S.-C. Zhu, and Y. Zhu, “ACRE: Abstract Causal REasoning Beyond Covariation.” Accessed: Dec. 13, 2025. [Online]. Available: http://arxiv.org/abs/2103.14232
[40]
J. S. Rule, “The Child as Hacker: Building More Human-like Models of Learning,” Thesis, Massachusetts Institute of Technology, 2020. Accessed: Dec. 13, 2025. [Online]. Available: https://dspace.mit.edu/handle/1721.1/129232
[41]
J. Dong, Y. Liu, A. Aloui, V. Tarokh, and D. Carlson, “CARE: Turning LLMs Into Causal Reasoning Expert.” Accessed: Dec. 13, 2025. [Online]. Available: http://arxiv.org/abs/2511.16016