VLM-RB is a plug-and-play framework that uses frozen Vision-Language Models to prioritize semantically meaningful experiences in reinforcement learning replay buffers. Unlike traditional methods that rely on TD-error or other statistical proxies, VLM-RB leverages the semantic reasoning capabilities of pre-trained VLMs to identify which transitions represent genuine task progress.
Experience replay is fundamental to off-policy RL, but which experiences should be prioritized? Traditional methods like Prioritized Experience Replay (PER) use TD-error as a proxy for importance. However, TD-error lacks semantic awareness—it cannot distinguish transitions representing genuine task progress from those that do not.
This limitation is acute in sparse-reward, long-horizon tasks. Critical transitions (e.g., grasping an object, unlocking a door) may have low TD-error early in training due to delayed credit assignment. Meanwhile, task-irrelevant motions can yield large TD-errors despite contributing nothing to task completion.
Key insight: Pre-trained VLMs already encode rich semantic priors about what constitutes meaningful behavior. We can use these priors to identify important experiences from the very first update, without waiting for the critic to converge.
VLM-RB scores sub-trajectories (short video clips) rather than individual frames. A single frame can be ambiguous—a gripper above an object could be the start of a successful grasp or the aftermath of a failed attempt. Temporal context resolves this ambiguity.
Let $\tau^O_i=(o_{i}, o_{i+1}, \ldots, o_{i+L-1})$ denote a visual clip of $L$ rendered frames. The VLM maps this clip and a text prompt $\mathsf{P}$ to a scalar utility score $s_i = f_{\mathrm{VLM}}(\tau^O_i, \mathsf{P})$, which serves as the semantic priority of the corresponding transition.
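Below is a minimal sketch of computing such clip scores over a stored trajectory. It assumes a generic `vlm.score(clip, prompt)` wrapper around a frozen VLM (a placeholder interface, not a specific model API), and attaching each clip's score to the transition at its first frame is likewise an illustrative convention.

```python
import numpy as np

def score_subtrajectories(frames, prompt, vlm, clip_len):
    """Score each transition by the length-L clip that starts at it.

    `vlm.score(clip, prompt)` is a hypothetical wrapper around a frozen VLM
    that returns a scalar utility; adapt it to the actual model interface.
    """
    n = len(frames)
    scores = np.zeros(n)
    for i in range(max(n - clip_len + 1, 0)):
        clip = frames[i : i + clip_len]        # (o_i, ..., o_{i+L-1})
        scores[i] = vlm.score(clip, prompt)    # semantic utility of the clip
    # Transitions near the end of the trajectory have no full clip;
    # reuse the last available score for them.
    if n > clip_len:
        scores[n - clip_len + 1:] = scores[n - clip_len]
    return scores
```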
Rather than sampling exclusively from VLM-prioritized transitions, we use a mixture strategy that interpolates between VLM-guided prioritization and uniform replay: transition $i$ is sampled with probability $P(i) = \lambda\, P_{\mathrm{VLM}}(i) + (1-\lambda)\,\frac{1}{N}$, where $P_{\mathrm{VLM}}$ is the distribution induced by the VLM scores, $N$ is the buffer size, and $\lambda \in [0,1]$ controls how strongly sampling is biased toward VLM-preferred transitions.
We use a linear warm-up schedule, starting with $\lambda_0=0$ (pure uniform) and gradually increasing to $\lambda_{\max}=0.5$. This ensures broad coverage early in training while biasing toward high-utility experiences later.
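The following sketch implements the mixture sampling with a linear warm-up; normalizing the raw VLM scores into $P_{\mathrm{VLM}}$ by their sum and the `warmup_steps` parameter are illustrative choices, not details taken from the paper.

```python
import numpy as np

def lam_schedule(step, warmup_steps, lam_max=0.5):
    """Linear warm-up of the mixture coefficient: 0 (pure uniform) -> lam_max."""
    return lam_max * min(step / warmup_steps, 1.0)

def sample_batch(vlm_scores, batch_size, step, warmup_steps, rng=None):
    """Draw replay indices from P(i) = lam * P_VLM(i) + (1 - lam) / N."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = lam_schedule(step, warmup_steps)
    n = len(vlm_scores)
    s = np.asarray(vlm_scores, dtype=np.float64)
    p_vlm = s / s.sum() if s.sum() > 0 else np.full(n, 1.0 / n)  # score-proportional
    p = lam * p_vlm + (1.0 - lam) / n                            # mixture distribution
    return rng.choice(n, size=batch_size, p=p)
```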
For continuous control tasks that require fine-grained motion, we combine VLM scoring with the TD-error: the VLM score acts as a semantic filter, masking out "irrelevant" transitions, while the TD-error $\delta_i$ refines prioritization within the filtered set.
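One plausible instantiation of this filter-then-refine combination is sketched below; the thresholding form and the hyperparameter values are assumptions for illustration, not the exact rule used by VLM-RB.

```python
import numpy as np

def combined_priority(vlm_scores, td_errors, score_threshold=0.5, eps=1e-6):
    """Mask out transitions whose VLM score s_i falls below a threshold, then
    prioritize the rest by |TD-error|. `score_threshold` and `eps` are
    illustrative values, not tuned hyperparameters from the paper."""
    mask = np.asarray(vlm_scores) >= score_threshold
    return np.where(mask, np.abs(td_errors) + eps, 0.0)
```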
DoorKey (MiniGrid): discrete navigation requiring sequential reasoning (pick up the key, unlock the door, reach the goal), with a sparse reward given only at task completion.
Scene tasks: continuous manipulation with a UR5e arm on long-horizon compositional tasks such as unlocking, placing, and coordinating objects.
VLM-RB consistently outperforms baselines across all settings. The largest gains appear in challenging sparse-reward scenarios.
| Algorithm | Environment | Baseline | Performance Gain | Sample Efficiency |
|---|---|---|---|---|
| DQN + IQN | DoorKey 8x8 | UER / PER | +0.0% / +0.0% | +19.1% / +23.0% |
| DQN + IQN | DoorKey 12x12 | UER / PER | +61.3% / +22.0% | +52.8% / +32.1% |
| DQN + IQN | DoorKey 16x16 | UER / PER | +241.7% / +70.8% | +37.6% / +24.1% |
| SAC + TD3 | Scene Task 3 | UER / PER | +0.0% / +0.0% | +40.7% / +21.1% |
| SAC + TD3 | Scene Task 4 | UER / PER | +22.0% / +2.0% | +44.6% / +19.7% |
| SAC + TD3 | Scene Task 5 | UER / PER | +119.4% / +49.1% | +46.3% / +17.9% |
We compare VLM-RB against alternative prioritization schemes: Attentive Experience Replay (AER), Experience Replay Optimization (ERO), and Reducible Loss Prioritization (ReLo).
To verify that VLM-RB relies on semantic understanding, we use the "Modified Game" paradigm: we modify only the rendered frames used for VLM scoring while keeping the underlying MDP identical.
Figure: rendering variants used for VLM scoring, shown as (a) Standard, (b) Misleading, and (c) Abstract.
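A minimal sketch of this separation, assuming a Gymnasium-style environment and a hypothetical `restyle` callable that produces the Standard, Misleading, or Abstract rendering:

```python
import gymnasium as gym

class VLMRenderWrapper(gym.Wrapper):
    """Keep the underlying MDP untouched; only the frames handed to the VLM
    scorer are restyled. `restyle` is a hypothetical callable mapping a
    rendered frame to one of the modified renderings."""

    def __init__(self, env, restyle):
        super().__init__(env)
        self.restyle = restyle

    def render_for_vlm(self):
        # The agent never sees this frame; it is used only for VLM scoring,
        # so observations, rewards, and dynamics are unchanged.
        return self.restyle(self.env.render())
```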
@article{sharony2025vlmrb,
title = {VLM-Guided Experience Replay},
author = {Elad Sharony and Tom Jurgenson and Orr Krupnik and Dotan Di Castro and Shie Mannor},
journal = {arXiv preprint},
year = {2025}
}