This blog post is about running more ROMs in the Arcade Learning Environment by delegating execution to C++.

Background

ALE-Py has gained a lot of new features, but not enough people know about the native C++ vector implementation (PR 599). Since April 2025, ALE-Py has shipped the AtariVectorEnv interface, which adds support for standard preprocessing, asynchronous send/recv between the agent and the environment, and multiple instances of the same ROM; see the ALE-Py vector environment docs for more info. Here’s how you create the Python interface:

from ale_py.vector_env import AtariVectorEnv

# Create a vector environment with 4 parallel instances of Breakout
envs = AtariVectorEnv(
    game="breakout",  # The ROM id, not the Gymnasium environment id
    num_envs=4,
)

Support for multiple ROMs (PR)

My PR extends AtariVectorEnv to accept a list of ROMs through a new games parameter in the constructor; num_envs can be used to repeat each entry in the list. Each ROM runs independently: it has its own episode state, autoresets independently, and terminates and truncates independently. Here is how you spawn multiple ROMs:

# with num_envs
envs = AtariVectorEnv(games=["pong", "breakout", "space_invaders"], num_envs=2)
# ["pong", "pong", "breakout", "breakout", "space_invaders", "space_invaders"]

# manually
envs = AtariVectorEnv(games=["pong", "pong", "breakout", "space_invaders", "space_invaders", "space_invaders"])
# ["pong", "pong", "breakout", "space_invaders", "space_invaders", "space_invaders"]
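To make the resulting layout concrete, here is a minimal pure-Python sketch of the expansion shown in the comments above. The helper name expand_games is hypothetical and is not part of ALE-Py; it only reproduces the ordering from the example, assuming each entry is repeated num_envs times with the copies kept adjacent.

```python
def expand_games(games, num_envs=1):
    """Repeat each ROM id num_envs times, keeping the copies adjacent.

    Hypothetical helper for illustration only; it mirrors the ordering
    shown in the example comments, not ALE-Py's internal code.
    """
    return [game for game in games for _ in range(num_envs)]


layout = expand_games(["pong", "breakout", "space_invaders"], num_envs=2)
print(layout)
```

With num_envs left at 1 the list is returned unchanged, which corresponds to the "manually" form above.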

As full_action_space is False by default, each ROM keeps its minimal action set and so the number of valid actions per ROM can be different. When ROMs have different action counts, single_action_space is None (there is no shared single space) and action_space is a MultiDiscrete with one count per ROM. Here’s an example with four ROMs where the last three support 6 actions:

import gymnasium as gym

assert envs.single_action_space is None
assert isinstance(envs.action_space, gym.spaces.MultiDiscrete)
print(envs.num_actions)  # [4, 6, 6, 6]; prefer num_actions for the per-ROM action counts
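Because each ROM can have a different number of valid actions, a random policy has to draw each action from that ROM's own range. The sketch below does this in plain Python from a list of counts like the one printed above; sample_actions is a hypothetical helper and the counts are hard-coded rather than read from a real environment.

```python
import random


def sample_actions(num_actions, rng):
    """Draw one valid action index per ROM: action i lies in [0, num_actions[i])."""
    return [rng.randrange(n) for n in num_actions]


rng = random.Random(0)          # seeded so the sketch is reproducible
num_actions = [4, 6, 6, 6]      # assumed counts, copied from the example above
actions = sample_actions(num_actions, rng)
print(actions)
```

Sampling every action from a single shared range would occasionally hand the first ROM an index it does not support, which is why the per-ROM counts matter.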

The chart below shows throughput versus latency using the multi-ROM feature. I use unique ROMs until 100 different ROMs are in use, and then add duplicate ROMs using round-robin assignment until 256 ROMs are in use. The line is the mean and the shaded band is the 95% confidence interval over 10 independent runs.

Figure: Latency vs throughput as we increase the number of ROMs (all ROMs). Series: ALE-Py AtariVectorEnv and Gymnasium AsyncVectorEnv. Panels: Ryzen 9 7950X (16 cores) and 2x Xeon Gold 6448Y (64 cores).
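The ROM assignment used for this benchmark can be sketched as a simple round-robin over the list of unique ROMs. This is an illustration under assumptions: assign_roms is a hypothetical helper and the ROM ids are placeholders, not the actual benchmark script or the real ROM names.

```python
def assign_roms(rom_ids, total_envs):
    """Use each ROM once, then wrap around the list (round-robin) for extras."""
    return [rom_ids[i % len(rom_ids)] for i in range(total_envs)]


# Placeholder ids standing in for 100 distinct ROMs.
rom_ids = [f"rom_{i:03d}" for i in range(100)]
assignment = assign_roms(rom_ids, 256)

print(len(set(assignment[:100])))  # the first 100 slots are all distinct
print(assignment[100])             # slot 100 wraps back to the first ROM
```

With 256 slots over 100 ROMs, the first 56 ROMs appear three times and the remaining 44 appear twice.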

Packing all runs into a single job reduces the sequential speed of any single ROM but can massively increase the overall rate of completing an experiment, up to the thread count of the CPU. I gathered results on an AMD Ryzen 9 7950X 16-core CPU and on a machine with two Intel Xeon Gold 6448Y CPUs totaling 64 cores. On both machines, latency and throughput improve until the CPU is saturated (32 threads on the Ryzen, 64 cores on the Xeon machine). Increasing the number of ROMs past the thread count still yields a speedup because each ROM takes a different amount of time to execute, and the work-stealing thread pool lets idle threads pick up the remaining work.
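A toy model helps show why uneven step times reward running more ROMs than threads. The sketch below is not ALE-Py's thread pool: it approximates work stealing with greedy scheduling (an idle worker always takes the next pending step) and compares it with a fixed assignment of ROMs to workers. The step costs are made-up numbers chosen for illustration.

```python
import heapq


def dynamic_makespan(costs, workers):
    """Time to finish all steps when an idle worker always takes the next one.

    A rough stand-in for a work-stealing pool, not a faithful model of it.
    """
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)  # worker that frees up first
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


def static_makespan(costs, workers):
    """Time to finish all steps when ROM i is pinned to worker i % workers."""
    return max(sum(costs[w::workers]) for w in range(workers))


workers = 2
few_roms = [5.0, 1.0]                        # one slow ROM, one fast ROM
many_roms = [5.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # same slow ROM plus more fast ones

for costs in (few_roms, many_roms):
    t = dynamic_makespan(costs, workers)
    print(len(costs), "ROMs:", t, "time units,", len(costs) / t, "steps per unit")

print("pinned assignment:", static_makespan(many_roms, workers), "time units")
```

In this toy setup the batch of six ROMs finishes in the same time as the batch of two, because the second worker fills the time it would otherwise spend waiting on the slow ROM, while the pinned assignment takes longer for the same six ROMs.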

If you’ve been using EnvPool for its vectorised implementation of environments I’d recommend giving ALE-Py’s vector interface a shot as it is actively maintained.