VLM-RB is a plug-and-play framework that uses frozen Vision-Language Models to prioritize semantically meaningful experiences in reinforcement learning replay buffers. Unlike traditional methods that rely on TD-error or other statistical proxies, VLM-RB leverages the semantic reasoning capabilities of pre-trained VLMs to identify which transitions represent genuine task progress.
Experience replay is fundamental to off-policy RL, but which experiences should be prioritized? Traditional methods such as Prioritized Experience Replay (PER) use TD-error as a proxy for importance. TD-error, however, lacks semantic awareness: it cannot distinguish transitions that represent genuine task progress from those that do not.
This limitation is acute in sparse-reward, long-horizon tasks. Critical transitions (e.g., grasping an object, unlocking a door) may have low TD-error early in training due to delayed credit assignment. Meanwhile, task-irrelevant motions can yield large TD-errors despite contributing nothing to task completion.
Key insight: Pre-trained VLMs already encode rich semantic priors about what constitutes meaningful behavior. We can use these priors to identify important experiences from the very first update, without waiting for the critic to converge.
VLM-RB scores sub-trajectories (short video clips) rather than individual frames. A single frame can be ambiguous—a gripper above an object could be the start of a successful grasp or the aftermath of a failed attempt. Temporal context resolves this ambiguity.
Let $\tau^O_i=(o_{i}, o_{i+1}, \ldots, o_{i+L-1})$ denote a visual clip of $L$ rendered frames. The VLM maps this clip and a text prompt $\mathsf{P}$ to a scalar score $s_i$:

$$s_i = f_{\mathrm{VLM}}\big(\tau^O_i, \mathsf{P}\big).$$
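Clip-level scoring can be sketched as a sliding window over an episode's rendered frames. The VLM call itself is abstracted behind `vlm_yes_prob`, a hypothetical stand-in that returns the model's probability of answering "Yes" to the prompt for a given clip; the name and interface are assumptions for illustration, not the paper's API.

```python
import numpy as np

def score_subtrajectories(frames, L, vlm_yes_prob, prompt):
    """Slide a length-L window over an episode's rendered frames and score
    each clip. `vlm_yes_prob(clip, prompt)` is a stand-in for the frozen
    VLM; it should return P("Yes") for the prompt given the clip."""
    n_clips = len(frames) - L + 1
    scores = np.zeros(n_clips)
    for i in range(n_clips):
        scores[i] = vlm_yes_prob(frames[i:i + L], prompt)
    return scores
```

Because each window shares $L-1$ frames with its neighbor, a success event influences the scores of every clip that contains it, which is what gives the method its temporal context.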
Rather than sampling exclusively from VLM-prioritized transitions, we use a mixture strategy that interpolates between VLM-guided prioritization and uniform replay:

$$p(i) = \lambda \, p_{\mathrm{VLM}}(i) + (1 - \lambda) \, \frac{1}{N},$$

where $N$ is the buffer size and $\lambda \in [0, 1]$ controls the interpolation.
We use a linear warm-up schedule, starting with $\lambda_0=0$ (pure uniform) and gradually increasing to $\lambda_{\max}=0.5$. This ensures broad coverage early in training while biasing toward high-utility experiences later.
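A minimal sketch of the warm-up schedule and mixture sampling, assuming the mixture takes the linear-interpolation form described above ($\lambda_0=0$, $\lambda_{\max}=0.5$ are from the text; `warmup_steps` is an illustrative hyperparameter):

```python
import numpy as np

def lam_schedule(step, warmup_steps, lam_max=0.5):
    """Linear warm-up from lam_0 = 0 to lam_max = 0.5 (values from the text)."""
    return lam_max * min(1.0, step / warmup_steps)

def mixture_probs(vlm_scores, lam):
    """Assumed form of the mixture: p(i) = lam * p_vlm(i) + (1 - lam) / N."""
    s = np.asarray(vlm_scores, dtype=float)
    p_vlm = s / s.sum()
    return lam * p_vlm + (1.0 - lam) / len(s)

def sample_batch(rng, vlm_scores, step, warmup_steps, batch_size):
    """Draw replay indices from the current mixture distribution."""
    lam = lam_schedule(step, warmup_steps)
    p = mixture_probs(vlm_scores, lam)
    return rng.choice(len(p), size=batch_size, p=p)
```

At `step = 0` this reduces to pure uniform replay, so coverage is unaffected early on; only as training progresses does the sampler lean on the VLM scores.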
For continuous control tasks requiring fine-grained motion, we combine VLM scoring with TD-error, e.g.

$$p_i \propto \mathbb{1}[s_i \geq \theta] \cdot |\delta_i|,$$

where $\theta$ is a filtering threshold.
The VLM score acts as a semantic filter, masking "irrelevant" transitions, while the TD-error $\delta_i$ refines prioritization within the filtered set.
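The filter-then-refine combination can be sketched as follows; the threshold `theta` and the floor `eps` are illustrative hyperparameters (assumptions, not values from the paper):

```python
import numpy as np

def combined_priorities(vlm_scores, td_errors, theta=0.5, eps=1e-6):
    """VLM score as a semantic filter, TD-error as the refinement.
    Transitions whose VLM score falls below theta get zero priority;
    TD-error magnitude ranks the rest. theta and eps are illustrative."""
    mask = (np.asarray(vlm_scores, dtype=float) >= theta).astype(float)
    return mask * (np.abs(np.asarray(td_errors, dtype=float)) + eps)
```

Note that a transition with a huge TD-error still receives zero priority if the VLM deems it semantically irrelevant, which is exactly the failure mode of pure PER that this combination is meant to avoid.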
Discrete navigation requiring sequential reasoning: pick up the key, unlock the door, reach the goal. Rewards are sparse, given only at task completion.
Prompt: "Does this clip contain a clear instance of goal satisfaction anywhere in it? If no visible success occurs, answer No. Do not guess. Output exactly Answer: Yes or Answer: No."
Continuous manipulation with a UR5e arm. Long-horizon compositional tasks: unlocking, placing, and coordinating objects.
Prompt: "Is there at least one clear instance of goal satisfaction in these frames? Look for contact + displacement consistent with the goal (lift off surface, place into receptacle, open/close articulation, move to target zone). Do not guess. If not visible, answer No. Output exactly Answer: Yes or Answer: No."
VLM-RB matches or outperforms baselines across all settings, with the largest gains in the most challenging sparse-reward scenarios. On easier tasks where final performance saturates, the advantage appears as improved sample efficiency.
| Algorithm | Environment | Baseline | Performance Gain | Sample Efficiency |
|---|---|---|---|---|
| DQN + IQN | DoorKey 8x8 | UER / PER | +0.0% / +0.0% | +19.1% / +23.0% |
| DQN + IQN | DoorKey 12x12 | UER / PER | +61.3% / +22.0% | +52.8% / +32.1% |
| DQN + IQN | DoorKey 16x16 | UER / PER | +241.7% / +70.8% | +37.6% / +24.1% |
| SAC + TD3 | Scene Task 3 | UER / PER | +0.0% / +0.0% | +40.7% / +21.1% |
| SAC + TD3 | Scene Task 4 | UER / PER | +22.0% / +2.0% | +44.6% / +19.7% |
| SAC + TD3 | Scene Task 5 | UER / PER | +119.4% / +49.1% | +46.3% / +17.9% |
We compare VLM-RB against alternative prioritization schemes: Attentive Experience Replay (AER), Experience Replay Optimization (ERO), and Reducible Loss Prioritization (ReLo).
To verify that VLM-RB relies on semantic understanding, we use the "Modified Game" paradigm: we modify only the rendered frames used for VLM scoring while keeping the underlying MDP identical.
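The key property of the Modified Game is that only the frames handed to the VLM scorer change; the transitions stored in the buffer, and hence the MDP, are untouched. A minimal illustration of one hypothetical "Misleading" rendering (the specific transformation is an assumption for illustration):

```python
import numpy as np

def misleading_frames(frames):
    """Illustrative 'Misleading' variant: invert the rendering so it no
    longer matches the VLM's visual priors. Only the frames passed to the
    VLM scorer change; the stored transitions (s, a, r, s') are untouched,
    so the underlying MDP is identical."""
    return [255 - f for f in frames]
```

If VLM-RB's prioritization truly depends on semantic understanding of the rendering, its advantage should degrade under such modifications even though the learning problem itself is unchanged.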
Figure: Modified Game renderings. (a) Standard, (b) Misleading, (c) Abstract.
```bibtex
@article{sharony2026vlm,
  title={VLM-Guided Experience Replay},
  author={Sharony, Elad and Jurgenson, Tom and Krupnik, Orr and Di Castro, Dotan and Mannor, Shie},
  journal={arXiv preprint arXiv:2602.01915},
  year={2026}
}
```