Grounded rewards in the era of experience: A commentary on Silver and Sutton, “Welcome to the era of experience” (2025)
- Noumenal Labs
- Apr 28
The tl;dr
This post is a commentary on a new paper by Silver and Sutton, entitled “Welcome to the era of experience” (2025).
Silver and Sutton (2025) provide a thought-provoking discussion of the last decade of research and development in the field of artificial intelligence (AI), and where the field is heading. The core idea is that we have reached a performance ceiling for AI agents trained via supervised learning from human data — and that we have entered a new epoch in the development of AI, which the authors call the “era of experience.”
The era of experience, as the authors describe it, is a forthcoming phase in the development of AI that will be characterized by “grounding” in the real world, online action-perception loops, physical embodiment, environment-sourced reward signals, and online real-time experiential learning.
In particular, Silver and Sutton argue that the era of experience heralds a shift from hand-crafted, user-specified reward functions and the heavy use of human expert feedback and supervision, towards “grounded rewards,” which are measured and evaluated by AI agents themselves by continually assessing the sensory consequences of their actions in real time.
Here, we review and evaluate their argument. We enthusiastically embrace several aspects of their discussion and offer some constructive feedback pertaining to the learning of grounded reward functions.
Reinforcement learning and the era of simulation
Silver and Sutton (2025) propose that we have just entered (or are about to enter) a new era in the research and development of AI systems, which the authors call the “era of experience.” As they recount it, this new era follows from two previous ones, which together span the last decade or so. They dub the period between 2015–2020 the “era of simulation,” followed by the “era of human data” or “human-centric AI” (2020–present). We briefly rehearse their proposed recent history of the field.
The era of simulation corresponds to the heyday of classical reinforcement learning (RL). During this era, the AI community witnessed impressive, rapid, headline-grabbing progress in the performance of AI systems, albeit in relatively narrow domains, usually simple games. The success of AI systems such as AlphaGo, AlphaZero, and AlphaStar, which achieved superhuman performance in games like Go, chess, and StarCraft II and discovered surprising strategies that humans had not previously considered, demonstrated for the first time, in a compelling way, the capacity of AI systems to engage in autonomous learning and discovery.
Then as now, (model-based) RL agents have only a few basic components: an inference/prediction engine and a reward function. They are typically formulated as partially observable Markov decision processes (POMDPs). A POMDP is defined as follows.
First, we define the machinery of an inference/prediction engine, beginning with a world model:
A state space S, which represents (fictive) states of the environment
A likelihood model P(o | s), which represents the conditional probability of observing a given outcome o, given that the system is in state s
An action space A, the actions taken by the agent in its environment, which select between possible state transitions in S
A dynamics model P(s’ | s, a), which represents the effect of a given action a on the state of the world, i.e., the probability of transitioning from state s to state s’
The world model is coupled to data and an associated prediction engine that together endow it with the ability to predict the consequences of a given series of actions, ultimately enabling rational planning.
In order to achieve goal-directed behavior, this world model is then augmented with a reward function, R(o, s’, s, a), which assigns a scalar value to observations, actions, states, and state transitions from s to s’, thereby specifying the relative value of outcomes and the costs associated with actions and state transitions.
RL agents learn the optimal policy, p(a | s), defined as the one that maximizes expected reward. In other words, given a prespecified function describing their incentive structure, agents figure out what they can do that maximizes expected reward.
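To make these components concrete, here is a minimal sketch in Python of a discrete POMDP agent of this kind, with small arrays standing in for the likelihood model, dynamics model, and reward function. All names, shapes, and numbers are our own illustrative choices, not anything specified by Silver and Sutton (2025).

```python
import numpy as np
from itertools import product

n_states, n_obs, n_actions = 3, 2, 2
rng = np.random.default_rng(0)

# Likelihood model P(o | s): likelihood[s, o] is the probability of observing o in state s.
likelihood = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.1, 0.9]])

# Dynamics model P(s' | s, a): transition[a, s, s'] is the probability of
# moving from state s to state s' under action a.
transition = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

# Reward function R(o, s', s, a), here simplified to a function of (s, a) only.
reward = rng.normal(size=(n_states, n_actions))

def update_belief(belief, action, observation):
    """Bayesian belief update over hidden states after acting and observing."""
    prior = belief @ transition[action]               # predicted next-state belief
    posterior = prior * likelihood[:, observation]    # weight by likelihood of the observation
    return posterior / posterior.sum()

def plan(belief, horizon=2):
    """Pick the first action of the action sequence with highest expected reward."""
    best_value, best_action = -np.inf, 0
    for seq in product(range(n_actions), repeat=horizon):
        b, value = belief.copy(), 0.0
        for a in seq:
            value += b @ reward[:, a]                 # expected immediate reward
            b = b @ transition[a]                     # predicted belief after acting
        if value > best_value:
            best_value, best_action = value, seq[0]
    return best_action

# One step of the action-perception loop; the observation is a placeholder
# for whatever the environment actually returns.
belief = np.ones(n_states) / n_states
action = plan(belief)
observation = 0
belief = update_belief(belief, action, observation)
```

The point of the sketch is only to show the division of labor: the world model (likelihood and dynamics) supports prediction and belief updating, while the reward function is a separate, prespecified ingredient that planning consumes.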
The quality of a world model is determined by the world. It is necessarily grounded in data consisting of the observations and actions that an agent generates. In a very real way, world models have the capacity to be normative, in the sense that the world itself provides an objective metric by which we can measure the quality of a world model: a better world model generates more accurate predictions of future world states.
By contrast, there is no procedure or objective metric by which to evaluate the goodness of a reward function. Reward functions merely motivate behavior, and while we can judge a given behavior as good or bad, that evaluation is based on our values and goals and is therefore highly subjective. Indeed, the problem of reward function selection is the POMDP version of the fundamental problem of moral philosophy: What makes for a good life? This is a hard question. And any answer to it presupposes a set of norms against which to evaluate the answer: reward function selection seems to require a sort of meta-reward function.
The era of human data
According to Silver and Sutton, the main limitations of AI systems from the era of simulation were their relative simplicity and their brittleness, i.e., their lack of generalizability. Simulation environments of that era all featured well-defined, closed problems characterized by trivial reward functions like "win the game" or "be the fastest." While effective in toy cases, these techniques failed to generalize to the real world, which is open ended, filled with people who can have multiple, conflicting goals and aims, and can sometimes afford multiple, equally good solutions to a problem. One might think that games with clear win conditions, like chess or Go, are an exception to this rule. But even this is not necessarily the case in the real world. Consider a chess tutor teaching a new player how to play the game. In that case, the tutor's goal is definitely not to win, but rather to educate.
Silver and Sutton argue that the era of simulation eventually came to an end because of the inability of AI systems to close the sim-to-real gap, that is, to solve the problem of redeploying an AI system trained in well-controlled simulation environments in the real world, and because of the ensuing lack of task generality of agents designed using RL approaches in simulated environments or simple games. Ultimately, this is a criticism of the quality of the world models and inference engines, and in our view, it acknowledges only part of the problem.
What followed was the “era of human data” or “era of human-centric AI,” from 2020 onward. Around the turn of the 2020s, the availability of enormous, highly curated and annotated, human-generated datasets, combined with new techniques to make good use of this massive amount of data, like reinforcement learning from human feedback (RLHF) and expert trajectory learning, offered an attractive solution to the brittleness of systems trained in simple simulated environments and on synthetic data. In this setting, we replaced the agent’s reward function with the normative goal of replicating human behavior in narrow task settings. In this way, human raters have become the reward function evaluators.
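To illustrate what it means for human raters to become the reward function evaluators, here is a schematic sketch of preference-based reward modeling in the Bradley-Terry style, where a reward model is fit to pairwise human preferences and then stands in for the reward. This is our own simplified illustration with placeholder data, not the specific pipelines Silver and Sutton discuss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_features = 256, 8
phi = np.zeros(n_features)                       # linear reward model parameters

# Each pair: features of a response a human preferred and of one they rejected
# (synthetic placeholders standing in for rated model outputs).
preferred = rng.normal(size=(n_pairs, n_features)) + 0.5
rejected = rng.normal(size=(n_pairs, n_features))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(500):
    margin = (preferred - rejected) @ phi        # r(preferred) - r(rejected)
    p = sigmoid(margin)                          # Bradley-Terry preference probability
    grad = (preferred - rejected).T @ (1.0 - p) / n_pairs   # ascent on log-likelihood
    phi += 0.1 * grad

print("learned reward weights:", np.round(phi, 2))
```

The learned model encodes the raters' judgments: whatever the raters prefer is, by construction, what counts as rewarding.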
According to Silver and Sutton, the upshot was vastly improved generalizability — the main bottleneck being the availability and quality of massive datasets that exhaustively cover all relevant use cases. As everyone knows by now, this new wave of human-centric AI has led to impressive advancements in the capabilities of AI systems. Indeed, this is the technology that underwrites the current cohort of highly impressive, high-powered, multimodal large language models (LLMs).
But, as Silver and Sutton argue, something key was lost in the transition to human-centric AI. In particular, they argue that some of the most productive core concepts of RL have been abandoned, leading to the impoverishment of the field as a whole. Indeed, while RL is still core to the era of human data, there are key differences between how it was implemented then and now.
The key difference between RL agents in the era of simulation and the era of human data relates to the ability of AI systems to learn autonomously, without heavy-handed human supervision or expert mimicry. AI systems of the human-centric era are no longer designed to learn autonomously from their own experience, that is, from the observed consequences of their actions in the world. Rather, as just discussed, in state-of-the-art human-centric AI systems, a human rater is tasked with evaluating model output and tells the agent what counts as a good versus a bad outcome. While this is still RL of a sort, it moves away from the core concepts that attracted interest in RL to begin with. As Silver and Sutton note, techniques like RLHF “side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction” (2025, p. 7).
And this, on Silver and Sutton’s account, has come at a cost, namely, agents’ inability to autonomously discover new knowledge. We completely agree. It is underappreciated today that early RL systems were capable of autonomous learning in a way that current state-of-the-art AI systems are not. That is, rather than providing the agent with an incentive structure (i.e., a reward function) and letting it determine how best to achieve its goals, we now tell the agent what to do directly.
Human-centric AI also displays an unexpected, but related, performance ceiling. In the era of human data, all the knowledge to which an AI system has access is ultimately derived from the feedback (and therefore the expertise) of human users. This has allowed us to achieve expert human-level performance on several key tasks, such as the impressive language generation of LLMs. But Silver and Sutton argue that this approach cannot lead to superhuman performance, because human feedback and the associated level of expertise impose a performance ceiling: that of the human expert. This is a ceiling that current AIs cannot exceed, because they can neither generate their own data nor learn autonomously.
Where to go from here? According to Silver and Sutton, these problems are leading to another era change. We have now entered what they call the “era of experience.”
The end of an era, the beginning of a new one
Silver and Sutton note that we have reached a performance ceiling for AI agents designed using supervised learning from human data — and propose that we have entered a new era or phase in the development of AI, which the authors call the “era of experience.” They argue that we will soon live in a world where the majority of the data that is generated is generated by autonomous agents themselves. On their account, this new era will be characterized by what they call “grounding” in the real world, which entails online action-perception loops, physical embodiment, and experiential learning. Crucially, Silver and Sutton argue that in the era of experience, core concepts of RL, which were deemphasized in the human-centric era, will make a triumphant return.
As evidence for their view, they point to the recent success of AlphaProof. AlphaProof is based on an RL algorithm trained on 100,000 formal proofs, and was allowed to interact with a formal proving system, generating new data autonomously by trying out new solutions to novel problems. Once trained, AlphaProof was able to generate 100,000,000 additional proofs through continual interaction with the proving system.
Silver and Sutton propose that the next generation of RL should be designed to pursue “grounded rewards.” They define grounded rewards as signals that are generated directly by the environment as a consequence of the physical actions undertaken by an agent in that environment. Grounded rewards are grounded, in their parlance, because they originate directly from the agent’s experience of the effects of its actions in the world. A grounded reward function, according to this line of reasoning, would be one in which the arguments of the reward function (i.e., the elements of its domain) are signals that originate from the environment.
Silver and Sutton argue that grounded rewards of this sort would obviate the need for heavy reliance on costly human supervision and expert feedback. And they argue that the human environment is extremely rich in such “grounded signals,” which the agent itself can evaluate by continually assessing, in real time, the feedback provided by the sensory consequences of its actions. These are quantities that the agent itself can measure and evaluate, like “cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results, success, visits, yields, stocks, likes, income, pleasure/pain, economic indicators, accuracy, power, distance, speed, efficiency, or energy consumption” (2025, p. 4).
Of course, such grounded reward signals can also be computed from human feedback — but the key difference is that rewards are not directly prespecified by human raters. For instance, a grounded reward could be a human’s reporting on whether or not they are satisfied with the performance of a robotic assistant on a task. But the key to making this kind of feedback into a grounded reward is to provide such feedback to the agent as a consequence of its own decisions and actions — consequences which the agent itself can measure and evaluate. In other words, the key question is: Does a human user provide the agent with (the outputs of) an entire reward function, or only with sparse feedback, leaving it to reason about a course of action based on its incentives and understanding?
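As a concrete, entirely hypothetical illustration of this distinction, the sketch below contrasts a reward that simply is a human rating with a grounded reward computed from signals the agent itself measures as consequences of its actions. The signal names (error rate, energy consumption, user satisfaction) are invented for illustration, not drawn from Silver and Sutton.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    observation: dict        # raw sensory signals before acting
    action: str
    next_observation: dict   # raw sensory signals returned after acting

def human_rated_reward(transition: Transition, rating: float) -> float:
    """Era-of-human-data style: the reward IS the human rating."""
    return rating

def grounded_reward(transition: Transition) -> float:
    """Era-of-experience style: the reward is computed from measurable
    consequences of the agent's own action."""
    energy_used = transition.next_observation["energy_consumption"]
    task_error = transition.next_observation["error_rate"]
    # Sparse human feedback can still enter as one grounded signal among many,
    # arriving as a consequence of the agent's action rather than as a prespecified rating.
    user_feedback = transition.next_observation.get("user_satisfied", 0.0)
    return -0.1 * energy_used - 1.0 * task_error + 0.5 * user_feedback

t = Transition(
    observation={"error_rate": 0.3, "energy_consumption": 1.2},
    action="fetch_mug",
    next_observation={"error_rate": 0.1, "energy_consumption": 1.5, "user_satisfied": 1.0},
)
print(grounded_reward(t))    # evaluated by the agent from measurable consequences
```

Note that even in this sketch, someone still had to choose the weights on energy, error, and satisfaction; this is precisely the point we take up next.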
We embrace the new experiential era described by Silver and Sutton. We agree with most of the points raised in their excellent article, and with their conclusion that “once agents become connected to the world through rich action and observation spaces, there will be no shortage of grounded signals to provide a basis for reward” (2025, p. 4). Unfortunately, this does not address the central problem: Where does the reward function come from? Indeed, the central prescription merely specifies the proper domain in which reward functions should live, but says nothing about the precise means by which one should map actions and observations onto a real-valued reward signal.
Experiential learning and grounded reward functions
Thus, our main criticism of Silver and Sutton is that grounding rewards in actions and observations provides little insight into how to solve the hard problem of reward function selection. In their terminology, a reward is grounded if it is computed from a signal provided directly to an agent, as sensory feedback to actions taken in the environment. We agree that a reward only counts as grounded in the appropriate sense if it is evaluated in terms of grounded, environmental variables, specifically, the actions and observations of the agent. But the specification of the domain of a function provides little insight into what the function itself should actually be.
This is the “hard problem” of RL and is the central focus of the inverse reinforcement learning (IRL) literature. In IRL, the aim is to develop techniques to translate behavior (i.e., a policy or an observed expert trajectory) into a reward function, given a model of the belief formation process implemented by an agent. IRL thus requires observations of expert trajectories, which must be selected by a human user or rater. This means that IRL is based upon expert demonstrations, i.e., on extensive supervision from costly, and possibly biased or incomplete human feedback — and its use pulls us back into the era of human data.
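For readers unfamiliar with how IRL works mechanically, here is a minimal maximum-entropy-style IRL sketch on a tiny, fully observed MDP (we set aside partial observability for brevity). The reward is assumed linear in state features, and the "expert" visitation counts are fabricated placeholders standing in for demonstrated trajectories; none of the specifics come from Silver and Sutton.

```python
import numpy as np

n_states, n_actions, horizon = 4, 2, 10
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
features = np.eye(n_states)                                        # one-hot state features

def soft_value_iteration(theta):
    """Return a soft-optimal stochastic policy pi[s, a] for the reward r = features @ theta."""
    r = features @ theta
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + P @ V                                      # Q[s, a]
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()  # soft max over actions
    return np.exp(Q - V[:, None])                                   # softmax policy

def state_visitation(pi, start):
    """Expected state visitation counts over the horizon under policy pi."""
    d = np.zeros(n_states); d[start] = 1.0
    total = d.copy()
    for _ in range(horizon - 1):
        d = np.einsum("s,sa,sap->p", d, pi, P)
        total += d
    return total

# Placeholder "expert" visitation counts; in real IRL these come from human demonstrations.
expert_visits = np.array([1.0, 3.0, 4.0, 2.0])
expert_visits *= horizon / expert_visits.sum()

theta = np.zeros(n_states)
for _ in range(100):
    pi = soft_value_iteration(theta)
    learner_visits = state_visitation(pi, start=0)
    # MaxEnt IRL gradient: expert feature counts minus learner feature counts.
    theta += 0.01 * features.T @ (expert_visits - learner_visits)

print("recovered per-state reward:", np.round(features @ theta, 2))
```

The structure makes the dependence explicit: the recovered reward is only as good as the expert demonstrations and the assumed model of how the expert forms beliefs and selects actions.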
If the era of experience is meant to transcend the era of human data, IRL in its current form is inadequate. While it’s tempting to hope that one could simply apply IRL to model human behavior, obtain a grounded instantiation of the human’s reward function, and then inject that reward function into the artificial agent, this too is problematic. This is because the use of IRL methods requires knowledge of the belief formation process of the agent to which it is applied. Outside of simple laboratory environments, we simply don’t know much about the belief formation process of humans. See our recent blog post on this topic for more information.
So if IRL doesn’t quite fit the bill, what could solve the problem of reward function learning? In machine learning, the word “learning” always refers to the resolution of some optimization problem. In RL, the reward function specifies the quantity that must be optimized. If reward functions are themselves to be learned, we must specify a distinct objective function and optimize it in order to identify the reward function. But note that, in this setting, the reward function is no longer the thing that we are optimizing. As such, we agree that something like a grounded reward function could, in principle, be learned through experience by agents. But technically, this would only be possible via the instantiation of some kind of meta-reward objective. Moreover, if that meta-reward function is “reproduce the observed behavior of other agents or humans,” then we are still in the era of human data.
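To make the structure of this argument explicit, here is a toy, entirely invented illustration of reward learning as a bilevel problem: an outer loop selects the parameter of a reward function by optimizing a meta-objective (here, a hypothetical homeostatic one of keeping a temperature signal near a set point), while the inner agent simply acts greedily on whatever reward it is given. The reward function is learned, but it is the meta-objective that is actually optimized.

```python
import numpy as np

set_point = 21.0
actions = np.array([-1.0, 0.0, 1.0])             # cool, do nothing, heat

def induced_policy(theta):
    """Greedy policy under the invented reward family
    r(temp, a) = -theta * |temp + a - set_point| - (1 - theta) * |a|."""
    def act(temp):
        r = -theta * np.abs(temp + actions - set_point) - (1 - theta) * np.abs(actions)
        return actions[int(np.argmax(r))]
    return act

def meta_objective(policy, n_steps=50):
    """Meta-objective: mean squared deviation of the observed signal from the set point."""
    rng = np.random.default_rng(0)
    temp, deviations = 25.0, []
    for _ in range(n_steps):
        temp += policy(temp) + rng.normal(scale=0.3)   # action plus environmental drift
        deviations.append((temp - set_point) ** 2)
    return float(np.mean(deviations))

# Outer loop: the reward parameter theta is chosen to optimize the meta-objective,
# so the reward function itself is no longer the quantity being optimized directly.
thetas = np.linspace(0.0, 1.0, 11)
scores = [meta_objective(induced_policy(t)) for t in thetas]
best_theta = thetas[int(np.argmin(scores))]
print(f"selected reward parameter theta = {best_theta:.1f}")
```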
So where does this leave us? Transcending this era requires identifying a meta-objective or meta-reward function, like knowledge acquisition (information seeking) or maintaining an observed equilibrium or steady state of the environment, the two objectives used in the active inference formulation of behavior (both are sketched below). But it suffices to say that simply insisting that reward functions operate on actions and outcomes provides little guidance when it comes to solving the hard problem of reward function selection.
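For completeness, here is a compact sketch of how those two meta-objectives are typically scored in the active inference literature: an epistemic, information-seeking term, plus a pragmatic term that keeps predicted observations close to a preferred, steady-state distribution. The model arrays are placeholders with the same shapes as the earlier POMDP sketch; this is a schematic of the standard expected free energy decomposition, not a complete agent.

```python
import numpy as np

def negative_expected_free_energy(belief, likelihood, transition, action, log_preferences):
    """Score an action by epistemic value (expected information gain about hidden
    states) plus pragmatic value (expected log-probability of preferred observations)."""
    qs = belief @ transition[action]                 # predicted next-state belief Q(s')
    qo = qs @ likelihood                             # predicted observation distribution Q(o)
    joint = likelihood * qs[:, None]                 # Q(s', o)
    post = joint / joint.sum(axis=0, keepdims=True)  # Q(s' | o)
    eps = 1e-12
    # Epistemic value: E_Q(o)[ KL( Q(s'|o) || Q(s') ) ], i.e. expected information gain.
    epistemic = np.sum(qo * np.sum(post * (np.log(post + eps) - np.log(qs[:, None] + eps)), axis=0))
    # Pragmatic value: how probable the preferred (steady-state) observations are.
    pragmatic = qo @ log_preferences
    return epistemic + pragmatic

# Toy usage with placeholder model arrays.
likelihood = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])         # P(o | s)
transition = np.stack([np.eye(3), np.roll(np.eye(3), 1, axis=1)])   # P(s' | s, a)
log_preferences = np.log(np.array([0.8, 0.2]))                      # preferred observations
belief = np.ones(3) / 3
scores = [negative_expected_free_energy(belief, likelihood, transition, a, log_preferences)
          for a in range(2)]
print("action scores (higher is better):", np.round(scores, 3))
```

The appeal of this kind of meta-objective is that both terms are evaluated entirely on the agent's own model and observations, i.e., on grounded quantities, without a human rater in the loop.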