From Prompt to Protocol: What Autonomous AI Agents Actually Do

In late 2023, a chemistry laboratory at Carnegie Mellon University ran an experiment with a notable absence: no human hand adjusted the pipettes, no researcher monitored the reaction vessels, and no graduate student recorded the observations. The entire procedure – designing the chemical synthesis, calculating reagent quantities, programming the liquid-handling robot, and interpreting the results – was conducted by an artificial intelligence system called Coscientist. Operating solely on a plain-language instruction to perform Suzuki–Miyaura coupling reactions, the system browsed the internet for reaction conditions, wrote Python code to control laboratory hardware, executed the protocol, and verified the formation of the target compounds through mass spectrometry. The work was published in Nature in December 2023 (Boiko et al., 2023).

Coscientist represents something qualitatively different from the generative AI systems that have dominated public attention. Large language models such as GPT-4 produce text when prompted; they respond to queries, draft documents, and generate code upon request. Coscientist, by contrast, operates across an extended temporal horizon. It plans multi-step procedures, executes physical actions through robotic interfaces, evaluates outcomes against goals, and iterates accordingly. The shift is from tool to actor – from systems that await instruction to systems that pursue objectives.

This distinction, while subtle, carries substantial implications for how scientific research is conducted, who can conduct it, and what regulatory frameworks might apply.

The Architecture of Autonomy

Understanding this shift requires examining how these systems are constructed. Coscientist employs a modular architecture in which a central "Planner" module, powered by GPT-4, coordinates specialised subsystems for web search, documentation retrieval, code execution, and robotic control (Boiko et al., 2023). When instructed to perform a reaction, the Planner does not simply retrieve a synthesis procedure from its training data – an approach that would risk the confident fabrication of chemical protocols. Instead, it actively searches the internet for current literature, consults hardware documentation to learn robotic control APIs, performs calculations in an isolated environment, and generates executable code to conduct the experiment.
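The planner-plus-tools pattern can be made concrete with a minimal sketch. The module names, tool interfaces, and scripted plan below are illustrative stand-ins, not Coscientist's actual code; in the real system the next step is chosen by GPT-4 conditioned on the accumulated log, and code runs in an isolated sandbox rather than inline.

```python
from dataclasses import dataclass, field
from typing import Callable

# Each specialised subsystem is exposed to the planner as a named tool.
# These stubs stand in for modules like web search, documentation
# retrieval, and code execution.
def web_search(query: str) -> str:
    return f"literature snippets for: {query}"

def read_docs(api: str) -> str:
    return f"usage notes for: {api}"

def run_code(source: str) -> str:
    # A real system would execute this in an isolated environment.
    return f"executed: {source}"

@dataclass
class Planner:
    tools: dict[str, Callable[[str], str]]
    log: list[str] = field(default_factory=list)

    def run(self, plan: list[tuple[str, str]]) -> list[str]:
        """Execute a plan of (tool, argument) steps, recording each result.

        In an agentic system the next step would be selected by a language
        model reading the log; here the plan is fixed for clarity."""
        for tool_name, arg in plan:
            result = self.tools[tool_name](arg)
            self.log.append(result)
        return self.log

planner = Planner(tools={"search": web_search, "docs": read_docs, "code": run_code})
trace = planner.run([
    ("search", "Suzuki-Miyaura coupling conditions"),
    ("docs", "liquid handler API"),
    ("code", "dispense(reagent_a, 50)"),
])
```

The essential design point is that the planner never touches hardware or the web directly: every capability is mediated by a tool with a narrow interface, which is also where safeguards and audit logging naturally attach.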

This capacity for tool use and iterative refinement marks a departure from earlier AI systems. Where a generative model provides a static response to a prompt, autonomous agents engage in extended reasoning chains. They recognise when they lack information, seek it out, adjust their plans based on feedback, and persist toward objectives across multiple operational cycles.

The approach has proven extensible beyond chemistry. At Princeton University, researchers developed SWE-agent, a system that resolves software engineering tasks drawn from real GitHub repositories (Yang et al., 2024). Given a bug report for an open-source Python library, SWE-agent retrieves the relevant codebase, locates the problematic function, proposes a fix, implements the modification, and verifies that the solution passes the existing test suite. On the SWE-bench benchmark – a collection of 2,294 real-world issues from 12 popular Python repositories – the system resolved 12.47% of tasks autonomously, a marked improvement over the sub-2% resolution rates of the retrieval-augmented generation baselines established months earlier (Jimenez et al., 2023).
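The locate-fix-verify cycle at the heart of this workflow can be caricatured in a few lines. The toy below is a hypothetical illustration, not SWE-agent's implementation: candidate patches are tried in turn and the first version that passes the test suite is kept, whereas the real agent edits files on disk, runs the repository's own tests, and generates its patches with a language model.

```python
from typing import Optional

def run_tests(module: dict) -> bool:
    # Stand-in for a project's test suite.
    add = module["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0

buggy_source = "def add(a, b):\n    return a - b\n"   # the reported bug

candidate_patches = [
    "def add(a, b):\n    return a * b\n",   # a wrong guess
    "def add(a, b):\n    return a + b\n",   # the correct fix
]

def repair(source: str, patches: list) -> Optional[str]:
    """Return the first version of the module that passes the tests."""
    for candidate in [source] + patches:
        namespace: dict = {}
        exec(candidate, namespace)   # a real agent edits files and runs pytest
        if run_tests(namespace):
            return candidate
    return None

fixed = repair(buggy_source, candidate_patches)
```

The verification step is what separates this from one-shot code generation: a wrong guess is caught and discarded rather than shipped, which is the property the benchmark measures.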

In February 2025, Google Research introduced AI co-scientist, a multi-agent system built upon Gemini 2.0 designed to function as a collaborative tool for biomedical research (Gottweis & Natarajan, 2025). Unlike systems that merely summarise existing literature, AI co-scientist generates novel, testable hypotheses through an iterative process involving multiple specialised agents – Generation, Reflection, Ranking, Evolution, Proximity, and Meta-review – that collectively mirror the scientific method. In one validation study, the system proposed novel drug repurposing candidates for acute myeloid leukaemia; subsequent laboratory experiments confirmed that the suggested compounds inhibited tumour viability at clinically relevant concentrations. In another test, researchers at Imperial College London challenged the system to explain the mechanism by which certain bacterial genetic elements spread across species – a phenomenon the researchers had themselves recently discovered but had not yet published. The AI co-scientist independently proposed the correct mechanism, demonstrating its capacity to synthesise novel inferences from existing literature (Gottweis & Natarajan, 2025).
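The staged structure of such a pipeline can be sketched abstractly. Everything below is a placeholder in the spirit of the Generation, Reflection, Ranking, and Evolution stages described for AI co-scientist: the "hypotheses" are labelled strings and the plausibility scores are random numbers, whereas the real system uses specialised LLM agents and tournament-style comparison at each stage.

```python
import random

def generate(n: int, rng: random.Random) -> list:
    # Generation: propose candidate hypotheses (here, scored placeholders).
    return [{"text": f"hypothesis-{i}", "plausibility": rng.random()} for i in range(n)]

def reflect(pool: list, threshold: float = 0.3) -> list:
    # Reflection: discard candidates that fail a basic plausibility check.
    return [h for h in pool if h["plausibility"] >= threshold]

def rank(pool: list) -> list:
    # Ranking: order surviving hypotheses, strongest first.
    return sorted(pool, key=lambda h: h["plausibility"], reverse=True)

def evolve(best: dict, rng: random.Random) -> dict:
    # Evolution: refine a top hypothesis into a (hopefully) stronger variant.
    return {"text": best["text"] + "-refined",
            "plausibility": min(1.0, best["plausibility"] + rng.random() * 0.1)}

rng = random.Random(0)
pool = rank(reflect(generate(8, rng)))
pool.append(evolve(pool[0], rng))
top = rank(pool)[0]
```

The point of the architecture is the feedback loop: each pass filters, orders, and refines the pool, so repeated cycles mirror the propose-critique-revise rhythm of the scientific method rather than producing a single static answer.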

The biological sciences have seen comparable developments. AlphaFold 3, described in Nature in May 2024, employs a diffusion-based architecture to predict the joint structure of molecular complexes encompassing proteins, nucleic acids, small molecules, ions, and modified residues (Abramson et al., 2024). The system predicts structures that would previously have required months or years of laboratory work – crystallography experiments, nuclear magnetic resonance spectroscopy, or cryo-electron microscopy – delivering results in hours.

The Geography of Capability

A conspicuous feature unites these examples: their institutional provenance. Coscientist emerged from a well-resourced laboratory at Carnegie Mellon, supported by the US National Science Foundation. SWE-agent was developed at Princeton with access to substantial computational infrastructure. AI co-scientist is the work of Google Research. AlphaFold 3 was developed by Isomorphic Labs and Google DeepMind.

These systems require resources concentrated in a small number of wealthy institutions and corporations, predominantly in the United States and Europe. Access to the underlying large language models typically requires paid API access or institutional partnerships. The computational costs of running extensive agentic reasoning cycles are non-trivial. The robotic laboratory infrastructure that enables Coscientist to operate in the physical world costs millions of dollars to establish.

This concentration raises questions about distribution. The same capabilities that enable a Carnegie Mellon chemist to automate reaction optimisation could, in principle, assist a researcher at the University of Nairobi in developing novel catalysts for water purification. The hypothesis-generation capacities demonstrated by AI co-scientist could support public health workers in rural Brazil. The software engineering capabilities of SWE-agent could enable a developer in Tamil Nadu to maintain critical infrastructure codebases.

But the infrastructure to support such applications is absent. Cloud laboratories remain geographically concentrated in North America and Europe. High-quality internet connectivity remains unevenly distributed. The technical expertise required to implement and maintain autonomous AI systems remains scarce in many regions.

The International Monetary Fund has warned that AI could exacerbate cross-country income inequality, with growth impacts in advanced economies potentially more than double those in low-income countries (IMF, 2024). The Center for Global Development has noted that while AI may fuel within-country inequality, it could also slow or reverse gains made in reducing between-country inequality (Center for Global Development, 2024). These assessments focused primarily on generative AI; autonomous agents, with their capacity to execute complex workflows independently, may widen these disparities further.

Regulatory Frameworks Designed for a Previous Generation

Current AI governance frameworks were constructed with generative systems in mind. The European Union's AI Act, which entered into force in 2024, establishes risk-based categories for AI applications, with heightened requirements for "high-risk" systems affecting safety, fundamental rights, or critical infrastructure (Regulation (EU) 2024/1689). The Act's provisions address transparency, data governance, and human oversight – concepts well-suited to systems that produce content for human review.

Autonomous agents present distinct challenges. A system that designs and executes chemical experiments operates with a degree of independence that complicates traditional accountability structures. If an autonomous agent makes an error, responsibility may lie with the system developer, the laboratory operator, the organisation deploying the system, or some combination thereof. Current liability frameworks, developed for tools operated by human agents, do not map cleanly onto systems that pursue objectives across extended time horizons with minimal human intervention.

The capacity for autonomous agents to take actions in the physical world introduces safety considerations that differ from those associated with text generation. The developers of Coscientist explicitly investigated whether the system could be coerced into planning the synthesis of hazardous chemicals or controlled substances, finding that while safeguards could be implemented, the risk of misuse required ongoing attention (Boiko et al., 2023).

Policy responses remain nascent. The OECD, African Union, and United Nations have released frameworks emphasising transparency and trustworthiness, but these principles remain general. The EU AI Act's requirements for "human oversight" assume a human in the loop; autonomous agents are designed to operate with humans on the loop, or in some configurations, out of the loop entirely.

A Question of Agency

The emergence of autonomous AI agents does not represent a discontinuity in technological development, but rather an extension of trends visible for decades: the automation of cognitive labour, the delegation of decision-making to algorithms, the increasing capacity of machines to operate independently in complex environments. What distinguishes the current moment is the breadth of domains – chemistry, software engineering, biomedical research – where autonomous operation has become feasible.

Whether this development proves broadly beneficial or further concentrates advantage among the already-advantaged depends not on the technology itself, but on choices about access, governance, and distribution. The Carnegie Mellon researchers who developed Coscientist noted that their system could democratise access to advanced laboratory infrastructure. But democratisation requires more than technical capability; it requires investment in connectivity, infrastructure, training, and political will.

If autonomous AI agents can plan, execute, and refine scientific experiments, software systems, and research programmes, the question becomes not what these systems can do, but who will be able to use them, and to what ends.

---

References

Abramson, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w

Boiko, D. A., MacKnight, R., Kline, B., & Gomes, G. (2023). Autonomous chemical research with large language models. Nature, 624(7992), 570–578. https://doi.org/10.1038/s41586-023-06792-0

Center for Global Development. (2024). Three reasons why AI may widen global inequality. https://www.cgdev.org/blog/three-reasons-why-ai-may-widen-global-inequality

Gottweis, J., & Natarajan, V. (2025). Accelerating scientific breakthroughs with an AI co-scientist. Google Research Blog. https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/

International Monetary Fund. (2024). Gen-AI: Artificial intelligence and the future of work. IMF Staff Discussion Note. https://www.imf.org/en/Publications/Staff-Discussion-Notes/Issues/2024/01/14/Gen-AI-Artificial-Intelligence-and-the-Future-of-Work-542379

Jimenez, C. E., et al. (2023). SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint. https://arxiv.org/abs/2310.06770

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L, 2024/1689.

Stanford HAI. (2025). 2025 AI Index Report. Stanford University. https://hai.stanford.edu/ai-index/2025-ai-index-report

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37. https://arxiv.org/abs/2405.15793