Toward Agents That Reason About Their Computation

One of my takeaways from the Bitter Lesson is that “we want AI agents that can discover like we can, not which contain what we have discovered.” It is natural for researchers to identify a problem in our agents and then build a solution into them. But if the solution matters to the agent’s success on the task, why did the designer need to be the one to make that change to the agent’s design? This question led to the first paper published in my PhD: why don’t we let agents change how they are designed to use compute so that they can improve their own performance?

In a normal Arcade Learning Environment training run, DQN makes 40 million decisions, which corresponds to 200 million emulator frames, or over 925 hours of gameplay at 60 frames per second. After that much experience, the agent has improved at the game but still uses compute in the way the designer specified before training.
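The figures above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this post; the five frames per decision is simply what the two counts imply when divided, and the constant names are mine.

```python
# Figures quoted in the text.
EMULATOR_FRAMES = 200_000_000
DECISIONS = 40_000_000
FRAMES_PER_SECOND = 60

# Frames that pass between consecutive decisions, implied by the two counts.
frames_per_decision = EMULATOR_FRAMES / DECISIONS

# Wall-clock gameplay represented by the training run.
hours_of_gameplay = EMULATOR_FRAMES / FRAMES_PER_SECOND / 3600

print(frames_per_decision)          # 5.0
print(round(hours_of_gameplay, 1))  # 925.9
```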

When people identify a problem, I think it is important to understand why the agent wasn’t capable of improving on that problem for itself. Many of the problems researchers identify are issues in how the agent is designed, and my paper calls for researchers to find general-purpose ways for agents to modify their design at run time in order to improve performance. To make a persuasive argument, I needed to present a tangible improvement in both the agent’s behavior and its performance, so I looked for a common design choice to automate and report empirical results on.

The design choice the paper hands to the agent is the rate at which it senses the world and acts. At first glance, nothing in the agent’s objective motivates it to reduce its decision rate, as its only objective is to improve performance. So we include a small penalty in the agent’s reward each time it senses and acts, because doing so uses compute. When there is an animation sequence where the agent’s actions simply advance the frames in the emulator, we want the agent to learn to skip through it faster. But when precise control could improve performance, we want the agent to ramp up its compute.
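To make the incentive concrete, here is a minimal sketch of a per-decision penalty. It assumes a constant cost charged once per decision; the value 0.01 and the names `penalized_reward` and `episode_return` are hypothetical and chosen for illustration, not taken from the paper.

```python
from typing import Iterable


def penalized_reward(game_reward: float, decision_cost: float = 0.01) -> float:
    """Reward credited for one decision: game reward minus a compute penalty.

    decision_cost is an assumed constant, not a value from the paper.
    """
    return game_reward - decision_cost


def episode_return(game_rewards: Iterable[float], decision_cost: float = 0.01) -> float:
    """Sum of penalized rewards, one entry per decision the agent made."""
    return sum(penalized_reward(r, decision_cost) for r in game_rewards)


# A 20-frame animation in which no action can earn any game reward.
# Deciding every 5 frames costs four penalties; deciding once costs one.
frequent = episode_return([0.0, 0.0, 0.0, 0.0])
sparse = episode_return([0.0])
print(sparse > frequent)  # True
```

Under this assumption, skipping through the animation is strictly better for the agent, while a moment where extra decisions earn more game reward than they cost still favors deciding often.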

I see this as a frustratingly small step in the direction the Bitter Lesson calls for: instead of a human selecting a fixed rate, a human proposes various rates that the agent can select among at run time. I don’t think the paper sufficiently captured the Bitter Lesson, so it is important that I paint that picture better here than the paper did. Here is a spectrum of choices I could have made, so that you, the reader, can see how closely each version of the paper would have adhered to the lesson:

1. Choose a fixed decision rate for the agent. This is the antithesis of the lesson.
2. Let the agent choose a fixed decision rate. This is slightly better, kinda like learning a hyperparameter.
3. Let the agent choose its decision rate at decision time. This is what most readers interpret our work as being, but I would say many of the related works fit best in this category.
4. Expose more controls over the sensor and actuator in the agent’s interface. Move the interface so that it encompasses more of the design choices that people need to make.

My paper lives in the third and fourth categories: the Compute DQN method we propose is solidly an example of the third, but the paper is calling for the fourth. The agent’s interface is expanded so that the agent can choose both what action it performs in the game and when it will next sense the world and act. Compute DQN chooses an Atari action together with a duration. During that duration the action is repeated, and the agent does not process another observation. A short duration lets the agent respond sooner and spends more compute. A long duration delays the next observation and action selection, and therefore spends less compute.
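The interaction loop described above might look roughly like the sketch below. It is an illustration under several assumptions rather than the paper's implementation: the flat index over (action, duration) pairs, the `DURATIONS` menu, and the toy `CountingEnv` emulator are all hypothetical stand-ins.

```python
from typing import Callable, Tuple

DURATIONS = [1, 2, 4, 8]  # hypothetical menu of durations, in frames


def decode(index: int) -> Tuple[int, int]:
    """Split a flat choice index into (game action, duration in frames)."""
    action, duration_idx = divmod(index, len(DURATIONS))
    return action, DURATIONS[duration_idx]


class CountingEnv:
    """Toy stand-in for an emulator: pays 1.0 for each frame action 1 is held."""

    def __init__(self, length: int = 16) -> None:
        self.length = length
        self.frame = 0

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.frame += 1
        reward = 1.0 if action == 1 else 0.0
        return self.frame, reward, self.frame >= self.length


def act_for_duration(env: CountingEnv, action: int, duration: int):
    """Repeat one action for `duration` frames without consulting the agent."""
    total, obs, done = 0.0, None, False
    for _ in range(duration):
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return obs, total, done


def run_episode(env: CountingEnv, policy: Callable[[int], int]) -> Tuple[float, int]:
    """Return (total game reward, number of decisions the agent made)."""
    total, decisions, obs, done = 0.0, 0, 0, False
    while not done:
        action, duration = decode(policy(obs))
        obs, reward, done = act_for_duration(env, action, duration)
        total += reward
        decisions += 1
    return total, decisions


print(run_episode(CountingEnv(16), lambda obs: 7))  # (16.0, 2)
print(run_episode(CountingEnv(16), lambda obs: 4))  # (16.0, 16)
```

In this toy setting both policies hold action 1 and collect the same game reward, but the long-duration policy makes 2 decisions where the short-duration one makes 16. That gap is the compute a per-decision penalty would price.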

The important result to me is that compute use was learned, game-specific, and moment-specific. Below is a video from Asterix where the agent ramped up compute during waves with many collectibles and lowered its compute cost in the lulls between waves.

Asterix Video: Compute DQN in Asterix changes its decision rate during play, using more compute during dense waves of objects and less compute between them. The agent's attention is visualised in the heat-map, showing where in the image the agent is considering when selecting its next action.

The result of this paper is that we have one example of the data rate clearly not being a hindrance to performance, so we should seek more ways to hand the design problem over to the agent. We should give agents actions over their computational processes, give them feedback about the consequences of the real problem the designer faces, and let them wield their compute resources toward attaining higher performance.