VLM-Guided Experience Replay

Elad Sharony1, Tom Jurgenson1, Orr Krupnik1, Dotan Di Castro2, Shie Mannor1,3
1Technion    2ForSight Robotics    3Nvidia Research

Overview

VLM-RB is a plug-and-play framework that uses frozen Vision-Language Models to prioritize semantically meaningful experiences in reinforcement learning replay buffers. Unlike traditional methods that rely on TD-error or other statistical proxies, VLM-RB leverages the semantic reasoning capabilities of pre-trained VLMs to identify which transitions represent genuine task progress.

Experience replay is fundamental to off-policy RL, but which experiences should be prioritized? Traditional methods like Prioritized Experience Replay (PER) use TD-error as a proxy for importance. However, TD-error lacks semantic awareness—it cannot distinguish transitions representing genuine task progress from those that do not.

This limitation is acute in sparse-reward, long-horizon tasks. Critical transitions (e.g., grasping an object, unlocking a door) may have low TD-error early in training due to delayed credit assignment. Meanwhile, task-irrelevant motions can yield large TD-errors despite contributing nothing to task completion.

  • 11–52% higher success rates compared to uniform and prioritized experience replay
  • 19–45% improved sample efficiency across discrete and continuous domains
  • No fine-tuning required — uses frozen, off-the-shelf VLMs
  • ~12% throughput overhead via asynchronous VLM scoring
VLM-RB Method Overview
Figure 1: System diagram. (1) Data is collected with the current policy $\pi_k$. (2) Transitions $(s,a,r,s')$ are stored in a prioritized replay buffer with default priority $\bar{p}$. (3) Asynchronously, a VLM worker scores rendered clips $\tau^O$ under prompt $\mathsf{P}$ and writes priorities $p^{\text{VLM}}$ back to the buffer. (4) The learner samples using a mixture distribution $q_t$ interpolating between VLM-prioritized and uniform replay. (5) The policy is updated to $\pi_{k+1}$.
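Steps (2)–(3) of the diagram can be sketched as a shared buffer that stores transitions at the default priority $\bar{p}$ and lets an asynchronous worker overwrite that slot once the VLM has scored a clip. This is a minimal illustration; the class and method names are ours, not the paper's implementation.

```python
import threading

class VLMPrioritizedBuffer:
    """Sketch of the prioritized buffer in Figure 1: transitions enter
    with the default priority p_bar; an asynchronous VLM worker later
    writes p_vlm back into the same slot. Names are illustrative."""

    DEFAULT_PRIORITY = 1.0  # the default priority p_bar

    def __init__(self):
        self.transitions = []          # stored (s, a, r, s') tuples
        self.priorities = []           # one priority per transition
        self._lock = threading.Lock()  # buffer is shared with the worker

    def add(self, transition):
        """Learner side: store a transition at the default priority and
        return its index so the worker can write a score back later."""
        with self._lock:
            self.transitions.append(transition)
            self.priorities.append(self.DEFAULT_PRIORITY)
            return len(self.transitions) - 1

    def write_back(self, index, p_vlm):
        """VLM worker side: replace the default priority with the score
        of a rendered clip containing this transition."""
        with self._lock:
            self.priorities[index] = p_vlm
```

Because scoring runs in a separate worker, the learner never blocks on the VLM, which is what keeps the throughput overhead modest.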

Method

Key insight: Pre-trained VLMs already encode rich semantic priors about what constitutes meaningful behavior. We can use these priors to identify important experiences from the very first update, without waiting for the critic to converge.

VLM Score vs Q-value
Figure 2: Frozen VLM scoring anticipates learned value. The rose curve shows the frozen VLM score for $L=32$-frame clips. The gray curves show the temporal value difference from critic checkpoints at increasing training steps. The VLM identifies semantically relevant events immediately, while the critic only gradually learns to assign value to these same transitions.

Scoring Sub-Trajectories

VLM-RB scores sub-trajectories (short video clips) rather than individual frames. A single frame can be ambiguous—a gripper above an object could be the start of a successful grasp or the aftermath of a failed attempt. Temporal context resolves this ambiguity.

Let $\tau^O_i=(o_{i}, o_{i+1}, \ldots, o_{i+L-1})$ denote a visual clip of $L$ rendered frames. The VLM maps this clip and a text prompt $\mathsf{P}$ to a scalar score:

$$p^{\text{VLM}} = f_{\text{VLM}}(\tau^O, \mathsf{P}) \in \mathbb{R}$$
Temporal context resolves ambiguity
Figure 3: From the same initial observation $o_i$ (left), multiple futures are possible. The top trajectory shows a successful grasp; the bottom shows stagnation. By scoring sub-trajectories, the VLM has sufficient context to distinguish meaningful progress from failure.
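The clip construction and the constrained Yes/No prompts used in the Experiments section suggest a simple scoring pipeline. A hedged sketch: the window stride and the Yes/No-to-priority mapping below are our assumptions, and the VLM call itself is left abstract.

```python
def clips(frames, L=32, stride=16):
    """Slice rendered frames into sub-trajectories
    tau^O_i = (o_i, ..., o_{i+L-1}). The stride is an assumption;
    the text specifies only the clip length L."""
    return [frames[i:i + L] for i in range(0, len(frames) - L + 1, stride)]

def answer_to_priority(vlm_text, p_yes=0.9, p_no=0.1):
    """Map the VLM's constrained output ('Answer: Yes' / 'Answer: No')
    to a scalar priority p_vlm. The values p_yes and p_no are
    illustrative, not the paper's constants."""
    answer = vlm_text.strip().lower()
    if answer.endswith("yes"):
        return p_yes
    if answer.endswith("no"):
        return p_no
    return 0.5 * (p_yes + p_no)  # malformed output: neutral fallback
```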

Mixture Sampling

Rather than sampling exclusively from VLM-prioritized transitions, we use a mixture strategy that interpolates between VLM-guided prioritization and uniform replay:

$$q_t(i) = \lambda_t \cdot q^P(i) + (1 - \lambda_t) \cdot q^U(i)$$

We use a linear warm-up schedule, starting with $\lambda_0=0$ (pure uniform) and gradually increasing to $\lambda_{\max}=0.5$. This ensures broad coverage early in training while biasing toward high-utility experiences later.
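The mixture distribution and the warm-up schedule can be written in a few lines. A minimal sketch: the number of warm-up steps is an assumption, since the text fixes only the endpoints $\lambda_0=0$ and $\lambda_{\max}=0.5$.

```python
import numpy as np

def mixture_probs(priorities, lam):
    """q_t(i) = lam * q^P(i) + (1 - lam) * q^U(i), where q^P is
    proportional to stored priorities and q^U is uniform."""
    p = np.asarray(priorities, dtype=np.float64)
    q_p = p / p.sum()                       # prioritized component
    q_u = np.full_like(q_p, 1.0 / len(q_p))  # uniform component
    return lam * q_p + (1.0 - lam) * q_u

def lambda_schedule(step, warmup_steps, lam_max=0.5):
    """Linear warm-up from lambda_0 = 0 to lambda_max = 0.5.
    warmup_steps is an assumed hyperparameter."""
    return lam_max * min(1.0, step / warmup_steps)
```

With `lam=0` the mixture reduces to uniform replay, so early training keeps broad coverage exactly as described above.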

TD-Error Boosting (Continuous Control)

For continuous control tasks requiring fine-grained motion, we combine VLM scoring with TD-error:

$$q^P(i) \propto p^{\text{VLM}}_{i} \cdot |\delta_i|$$

The VLM score acts as a semantic filter, masking "irrelevant" transitions, while the TD-error $\delta_i$ refines prioritization within the filtered set.
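This filter-then-refine combination is a one-liner in practice. A minimal sketch: the small `eps` is an assumed numerical safeguard in the style of PER, not stated in the text.

```python
import numpy as np

def boosted_priorities(p_vlm, td_errors, eps=1e-6):
    """q^P(i) proportional to p_vlm_i * |delta_i|: the VLM score masks
    semantically irrelevant transitions, and |delta_i| refines ranking
    within the remaining set. eps is an assumed detail that keeps
    priorities positive when a TD-error is exactly zero."""
    p = np.asarray(p_vlm, dtype=np.float64)
    d = np.abs(np.asarray(td_errors, dtype=np.float64))
    prios = p * (d + eps)
    return prios / prios.sum()  # normalize into the distribution q^P
```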

Experiments

Environments

MiniGrid / DoorKey
8x8, 12x12, 16x16

Discrete navigation requiring sequential reasoning: pick up key, unlock door, reach goal. Sparse rewards only at task completion.

Prompt: "Does this clip contain a clear instance of goal satisfaction anywhere in it? If no visible success occurs, answer No. Do not guess. Output exactly Answer: Yes or Answer: No."

OGBench / Scene
Task 3, Task 4, Task 5

Continuous manipulation with UR5e arm. Long-horizon compositional tasks: unlocking, placing, coordinating objects.

Prompt: "Is there at least one clear instance of goal satisfaction in these frames? Look for contact + displacement consistent with the goal (lift off surface, place into receptacle, open/close articulation, move to target zone). Do not guess. If not visible, answer No. Output exactly Answer: Yes or Answer: No."

Main Results

VLM-RB consistently outperforms baselines across all settings. The largest gains appear in challenging sparse-reward scenarios.

Figure 4: Training progression on Scene/Task-4 (TD3). VLM-RB shows improved sample efficiency and a higher final success rate.
Aggregated training curves
Figure 5: VLM-RB consistently outperforms baselines across continuous and discrete tasks. Aggregated success rates for four algorithms (DQN, IQN, SAC, TD3) on MiniGrid and OGBench domains.
| Algorithm | Environment | Baseline | Performance Gain | Sample Efficiency |
|---|---|---|---|---|
| DQN + IQN | DoorKey 8x8 | UER / PER | +0.0% / +0.0% | +19.1% / +23.0% |
| DQN + IQN | DoorKey 12x12 | UER / PER | +61.3% / +22.0% | +52.8% / +32.1% |
| DQN + IQN | DoorKey 16x16 | UER / PER | +241.7% / +70.8% | +37.6% / +24.1% |
| SAC + TD3 | Scene Task 3 | UER / PER | +0.0% / +0.0% | +40.7% / +21.1% |
| SAC + TD3 | Scene Task 4 | UER / PER | +22.0% / +2.0% | +44.6% / +19.7% |
| SAC + TD3 | Scene Task 5 | UER / PER | +119.4% / +49.1% | +46.3% / +17.9% |

Comparison to Alternative Prioritization Methods

We compare VLM-RB against alternative prioritization schemes: Attentive Experience Replay (AER), Experience Replay Optimization (ERO), and Reducible Loss Prioritization (ReLo).

Baseline comparison
Figure 6: VLM-RB is the only method to consistently solve sparse-reward tasks. Alternatives fail because they depend on dense feedback or local similarity metrics.

Semantic Alignment Ablation

To verify that VLM-RB relies on semantic understanding, we use the "Modified Game" paradigm: we modify only the rendered frames used for VLM scoring while keeping the underlying MDP identical.

(a) Standard   (b) Misleading   (c) Abstract

VLM Prior Ablation
Figure 7: Success depends on alignment with semantic priors. With misleading or abstract visuals, performance degrades to baseline levels—confirming that VLM-RB's prioritization depends on recognizing meaningful visual semantics.

Citation

@article{sharony2026vlm,
  title={VLM-Guided Experience Replay},
  author={Sharony, Elad and Jurgenson, Tom and Krupnik, Orr and Di Castro, Dotan and Mannor, Shie},
  journal={arXiv preprint arXiv:2602.01915},
  year={2026}
}

References

  1. Schaul, T., et al. (2015). Prioritized Experience Replay. arXiv:1511.05952.
  2. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
  3. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
  4. Cho, J.H., et al. (2025). PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. arXiv.
  5. Chevalier-Boisvert, M., et al. (2023). Minigrid & Miniworld: Modular & Customizable RL Environments. NeurIPS.
  6. Park, S., et al. (2024). OGBench: Benchmarking Offline Goal-Conditioned RL. arXiv.