VLM-Guided Experience Replay

Elad Sharony1, Tom Jurgenson1, Orr Krupnik1, Dotan Di Castro2, Shie Mannor1,3
1Technion    2ForSight Robotics    3Nvidia Research

Overview

VLM-RB is a plug-and-play framework that uses frozen Vision-Language Models to prioritize semantically meaningful experiences in reinforcement learning replay buffers. Unlike traditional methods that rely on TD-error or other statistical proxies, VLM-RB leverages the semantic reasoning capabilities of pre-trained VLMs to identify which transitions represent genuine task progress.

Experience replay is fundamental to off-policy RL, but which experiences should be prioritized? Traditional methods like Prioritized Experience Replay (PER) use TD-error as a proxy for importance. However, TD-error lacks semantic awareness—it cannot distinguish transitions representing genuine task progress from those that do not.

This limitation is acute in sparse-reward, long-horizon tasks. Critical transitions (e.g., grasping an object, unlocking a door) may have low TD-error early in training due to delayed credit assignment. Meanwhile, task-irrelevant motions can yield large TD-errors despite contributing nothing to task completion.

  • 11–52% higher success rates compared to uniform and prioritized experience replay
  • 19–45% improved sample efficiency across discrete and continuous domains
  • No fine-tuning required — uses frozen, off-the-shelf VLMs
  • ~12% throughput overhead via asynchronous VLM scoring
VLM-RB Method Overview
Figure 1: System diagram. (1) Data is collected with the current policy $\pi_k$. (2) Transitions $(s,a,r,s')$ are stored in a prioritized replay buffer with default priority $\bar{p}$. (3) Asynchronously, a VLM worker scores rendered clips $\tau^O$ under prompt $\mathsf{P}$ and writes priorities $p^{\text{VLM}}$ back to the buffer. (4) The learner samples using a mixture distribution $q_t$ interpolating between VLM-prioritized and uniform replay. (5) The policy is updated to $\pi_{k+1}$.
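Steps (2)–(3) of the diagram can be sketched as a shared buffer that stores transitions at the default priority $\bar{p}$ and lets an asynchronous worker overwrite that slot once the VLM has scored a clip. This is a minimal illustration; the class and method names are ours, not the paper's implementation.

```python
import threading

class VLMPrioritizedBuffer:
    """Sketch of the prioritized buffer in Figure 1: transitions enter
    with the default priority p_bar; an asynchronous VLM worker later
    writes p_vlm back into the same slot. Names are illustrative."""

    DEFAULT_PRIORITY = 1.0  # the default priority p_bar

    def __init__(self):
        self.transitions = []          # stored (s, a, r, s') tuples
        self.priorities = []           # one priority per transition
        self._lock = threading.Lock()  # buffer is shared with the worker

    def add(self, transition):
        """Learner side: store a transition at the default priority and
        return its index so the worker can write a score back later."""
        with self._lock:
            self.transitions.append(transition)
            self.priorities.append(self.DEFAULT_PRIORITY)
            return len(self.transitions) - 1

    def write_back(self, index, p_vlm):
        """VLM worker side: replace the default priority with the score
        of a rendered clip containing this transition."""
        with self._lock:
            self.priorities[index] = p_vlm
```

Because scoring runs in a separate worker, the learner never blocks on the VLM, which is what keeps the throughput overhead modest.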

Method

Key insight: Pre-trained VLMs already encode rich semantic priors about what constitutes meaningful behavior. We can use these priors to identify important experiences from the very first update, without waiting for the critic to converge.

VLM Score vs Q-value
Figure 2: Frozen VLM scoring anticipates learned value. The rose curve shows the frozen VLM score for $L=32$-frame clips. The gray curves show the temporal value difference from critic checkpoints at increasing training steps. The VLM identifies semantically relevant events immediately, while the critic only gradually learns to assign value to these same transitions.

Scoring Sub-Trajectories

VLM-RB scores sub-trajectories (short video clips) rather than individual frames. A single frame can be ambiguous—a gripper above an object could be the start of a successful grasp or the aftermath of a failed attempt. Temporal context resolves this ambiguity.

Let $\tau^O_i=(o_{i}, o_{i+1}, \ldots, o_{i+L-1})$ denote a visual clip of $L$ rendered frames. The VLM maps this clip and a text prompt $\mathsf{P}$ to a scalar score:

$$p^{\text{VLM}} = f_{\text{VLM}}(\tau^O, \mathsf{P}) \in \mathbb{R}$$
Temporal context resolves ambiguity
Figure 3: From the same initial observation $o_i$ (left), multiple futures are possible. The top trajectory shows a successful grasp; the bottom shows stagnation. By scoring sub-trajectories, the VLM has sufficient context to distinguish meaningful progress from failure.
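The clip construction and the constrained Yes/No prompts used in the Experiments section suggest a simple scoring pipeline. A hedged sketch: the window stride and the Yes/No-to-priority mapping below are our assumptions, and the VLM call itself is left abstract.

```python
def clips(frames, L=32, stride=16):
    """Slice rendered frames into sub-trajectories
    tau^O_i = (o_i, ..., o_{i+L-1}). The stride is an assumption;
    the text specifies only the clip length L."""
    return [frames[i:i + L] for i in range(0, len(frames) - L + 1, stride)]

def answer_to_priority(vlm_text, p_yes=0.9, p_no=0.1):
    """Map the VLM's constrained output ('Answer: Yes' / 'Answer: No')
    to a scalar priority p_vlm. The values p_yes and p_no are
    illustrative, not the paper's constants."""
    answer = vlm_text.strip().lower()
    if answer.endswith("yes"):
        return p_yes
    if answer.endswith("no"):
        return p_no
    return 0.5 * (p_yes + p_no)  # malformed output: neutral fallback
```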

Mixture Sampling

Rather than sampling exclusively from VLM-prioritized transitions, we use a mixture strategy that interpolates between VLM-guided prioritization and uniform replay:

$$q_t(i) = \lambda_t \cdot q^P(i) + (1 - \lambda_t) \cdot q^U(i)$$

We use a linear warm-up schedule, starting with $\lambda_0=0$ (pure uniform) and gradually increasing to $\lambda_{\max}=0.5$. This ensures broad coverage early in training while biasing toward high-utility experiences later.
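The mixture distribution and the warm-up schedule can be written in a few lines. A minimal sketch: the number of warm-up steps is an assumption, since the text fixes only the endpoints $\lambda_0=0$ and $\lambda_{\max}=0.5$.

```python
import numpy as np

def mixture_probs(priorities, lam):
    """q_t(i) = lam * q^P(i) + (1 - lam) * q^U(i), where q^P is
    proportional to stored priorities and q^U is uniform."""
    p = np.asarray(priorities, dtype=np.float64)
    q_p = p / p.sum()                       # prioritized component
    q_u = np.full_like(q_p, 1.0 / len(q_p))  # uniform component
    return lam * q_p + (1.0 - lam) * q_u

def lambda_schedule(step, warmup_steps, lam_max=0.5):
    """Linear warm-up from lambda_0 = 0 to lambda_max = 0.5.
    warmup_steps is an assumed hyperparameter."""
    return lam_max * min(1.0, step / warmup_steps)
```

With `lam=0` the mixture reduces to uniform replay, so early training keeps broad coverage exactly as described above.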

TD-Error Boosting (Continuous Control)

For continuous control tasks requiring fine-grained motion, we combine VLM scoring with TD-error:

$$q^P(i) \propto p^{\text{VLM}}_{i} \cdot |\delta_i|$$

The VLM score acts as a semantic filter, masking "irrelevant" transitions, while the TD-error $\delta_i$ refines prioritization within the filtered set.
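This filter-then-refine combination is a one-liner in practice. A minimal sketch: the small `eps` is an assumed numerical safeguard in the style of PER, not stated in the text.

```python
import numpy as np

def boosted_priorities(p_vlm, td_errors, eps=1e-6):
    """q^P(i) proportional to p_vlm_i * |delta_i|: the VLM score masks
    semantically irrelevant transitions, and |delta_i| refines ranking
    within the remaining set. eps is an assumed detail that keeps
    priorities positive when a TD-error is exactly zero."""
    p = np.asarray(p_vlm, dtype=np.float64)
    d = np.abs(np.asarray(td_errors, dtype=np.float64))
    prios = p * (d + eps)
    return prios / prios.sum()  # normalize into the distribution q^P
```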

Experiments

Environments

MiniGrid / DoorKey
8x8, 12x12, 16x16

Discrete navigation requiring sequential reasoning: pick up key, unlock door, reach goal. Sparse rewards only at task completion.

Prompt: "Does this clip contain a clear instance of goal satisfaction anywhere in it? If no visible success occurs, answer No. Do not guess. Output exactly Answer: Yes or Answer: No."

OGBench / Scene
Task 3, Task 4, Task 5

Continuous manipulation with UR5e arm. Long-horizon compositional tasks: unlocking, placing, coordinating objects.

Prompt: "Is there at least one clear instance of goal satisfaction in these frames? Look for contact + displacement consistent with the goal (lift off surface, place into receptacle, open/close articulation, move to target zone). Do not guess. If not visible, answer No. Output exactly Answer: Yes or Answer: No."

Main Results

VLM-RB consistently outperforms baselines across all settings. The largest gains appear in challenging sparse-reward scenarios.

Figure 4: Training progression on Scene/Task-4 (TD3). VLM-RB shows improved sample efficiency and a higher final success rate.
Aggregated training curves
Figure 5: VLM-RB consistently outperforms baselines across continuous and discrete tasks. Aggregated success rates for four algorithms (DQN, IQN, SAC, TD3) on MiniGrid and OGBench domains.
| Algorithm | Environment | Baseline | Performance Gain | Sample Efficiency |
|---|---|---|---|---|
| DQN + IQN | DoorKey 8x8 | UER / PER | +0.0% / +0.0% | +19.1% / +23.0% |
| DQN + IQN | DoorKey 12x12 | UER / PER | +61.3% / +22.0% | +52.8% / +32.1% |
| DQN + IQN | DoorKey 16x16 | UER / PER | +241.7% / +70.8% | +37.6% / +24.1% |
| SAC + TD3 | Scene Task 3 | UER / PER | +0.0% / +0.0% | +40.7% / +21.1% |
| SAC + TD3 | Scene Task 4 | UER / PER | +22.0% / +2.0% | +44.6% / +19.7% |
| SAC + TD3 | Scene Task 5 | UER / PER | +119.4% / +49.1% | +46.3% / +17.9% |

Comparison to Alternative Prioritization Methods

We compare VLM-RB against alternative prioritization schemes: Attentive Experience Replay (AER), Experience Replay Optimization (ERO), and Reducible Loss Prioritization (ReLo).

Baseline comparison
Figure 6: VLM-RB is the only method to consistently solve sparse-reward tasks. Alternatives fail because they depend on dense feedback or local similarity metrics.

Semantic Alignment Ablation

To verify that VLM-RB relies on semantic understanding, we use the "Modified Game" paradigm: we modify only the rendered frames used for VLM scoring while keeping the underlying MDP identical.

(a) Standard   (b) Misleading   (c) Abstract

VLM Prior Ablation
Figure 7: Success depends on alignment with semantic priors. With misleading or abstract visuals, performance degrades to baseline levels—confirming that VLM-RB's prioritization depends on recognizing meaningful visual semantics.

Citation

@article{sharony2026vlm,
  title={VLM-Guided Experience Replay},
  author={Sharony, Elad and Jurgenson, Tom and Krupnik, Orr and Di Castro, Dotan and Mannor, Shie},
  journal={arXiv preprint arXiv:2602.01915},
  year={2026}
}

References

  1. Schaul, T., et al. (2015). Prioritized Experience Replay. arXiv:1511.05952.
  2. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
  3. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
  4. Cho, J.H., et al. (2025). PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. arXiv.
  5. Chevalier-Boisvert, M., et al. (2023). Minigrid & Miniworld: Modular & Customizable RL Environments. NeurIPS.
  6. Park, S., et al. (2024). OGBench: Benchmarking Offline Goal-Conditioned RL. arXiv.