VLM-Guided Experience Replay Buffer

Elad Sharony1, Tom Jurgenson1,2, Orr Krupnik2, Dotan Di Castro2, Shie Mannor1
1Technion    2Bosch AI (BCAI)

Overview

VLM-RB is a plug-and-play framework that uses frozen Vision-Language Models to prioritize semantically meaningful experiences in reinforcement learning replay buffers. Unlike traditional methods that rely on TD-error or other statistical proxies, VLM-RB leverages the semantic reasoning capabilities of pre-trained VLMs to identify which transitions represent genuine task progress.

  • 11–52% higher success rates compared to uniform and prioritized experience replay
  • 19–45% improved sample efficiency across discrete and continuous domains
  • No fine-tuning required — uses frozen, off-the-shelf VLMs
  • ~12% throughput overhead via asynchronous VLM scoring
[Figure 1: VLM-RB method overview]
Figure 1: System diagram. (1) Data is collected with the current policy $\pi_k$. (2) Transitions $(s,a,r,s')$ are stored in a prioritized replay buffer with default priority $\bar{p}$. (3) Asynchronously, a VLM worker scores rendered clips $\tau^O$ under prompt $\mathsf{P}$ and writes priorities $p^{\text{VLM}}$ back to the buffer. (4) The learner samples using a mixture distribution $q_t$ interpolating between VLM-prioritized and uniform replay. (5) The policy is updated to $\pi_{k+1}$.
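
For concreteness, here is a minimal sketch of the asynchronous scoring loop from the diagram. It assumes a hypothetical buffer interface (`pop_unscored_clip`, `update_priority`) and a generic `score_clip` wrapper around the frozen VLM; it is an illustration, not the authors' implementation.

```python
import threading

def vlm_scoring_worker(buffer, score_clip, prompt, stop_event):
    """Background scorer: pull rendered clips tau^O from the replay buffer,
    query the frozen VLM, and write the resulting priority p^VLM back.

    `buffer.pop_unscored_clip`, `buffer.update_priority`, and `score_clip`
    are hypothetical interfaces standing in for the actual buffer/VLM code."""
    while not stop_event.is_set():
        clip = buffer.pop_unscored_clip(timeout=1.0)  # L rendered frames + buffer indices
        if clip is None:
            continue                                  # nothing new yet; transitions keep the default priority
        p_vlm = score_clip(clip.frames, prompt)       # scalar p^VLM = f_VLM(tau^O, P)
        buffer.update_priority(clip.indices, p_vlm)   # asynchronous write-back; the learner never blocks

# Typical launch alongside the learner loop:
#   stop = threading.Event()
#   threading.Thread(target=vlm_scoring_worker,
#                    args=(buffer, score_clip, prompt, stop), daemon=True).start()
```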

Motivation

Experience replay is fundamental to off-policy RL, but which experiences should be prioritized? Traditional methods like Prioritized Experience Replay (PER) use TD-error as a proxy for importance. However, TD-error lacks semantic awareness: it cannot distinguish transitions that represent genuine task progress from those that do not.

This limitation is acute in sparse-reward, long-horizon tasks. Critical transitions (e.g., grasping an object, unlocking a door) may have low TD-error early in training due to delayed credit assignment. Meanwhile, task-irrelevant motions can yield large TD-errors despite contributing nothing to task completion.

Key insight: Pre-trained VLMs already encode rich semantic priors about what constitutes meaningful behavior. We can use these priors to identify important experiences from the very first update, without waiting for the critic to converge.

[Figure 2: VLM score vs. Q-value]
Figure 2: Frozen VLM scoring anticipates learned value. The rose curve shows the frozen VLM score for $L=32$-frame clips. The gray curves show the temporal value difference from critic checkpoints at increasing training steps. The VLM identifies semantically relevant events immediately, while the critic only gradually learns to assign value to these same transitions.

Method

Scoring Sub-Trajectories

VLM-RB scores sub-trajectories (short video clips) rather than individual frames. A single frame can be ambiguous—a gripper above an object could be the start of a successful grasp or the aftermath of a failed attempt. Temporal context resolves this ambiguity.

Let $\tau^O_i=(o_{i}, o_{i+1}, \ldots, o_{i+L-1})$ denote a visual clip of $L$ rendered frames. The VLM maps this clip and a text prompt $\mathsf{P}$ to a scalar score:

$$p^{\text{VLM}}_i = f_{\text{VLM}}(\tau^O_i, \mathsf{P}) \in \mathbb{R}$$
[Figure 3: temporal context resolves ambiguity]
Figure 3: From the same initial observation $o_i$ (left), multiple futures are possible. The top trajectory shows a successful grasp; the bottom shows stagnation. By scoring sub-trajectories, the VLM has sufficient context to distinguish meaningful progress from failure.
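
As a rough illustration of this interface (not the authors' code), the sketch below slices an episode's rendered observations into $L$-frame clips and scores each one; `query_vlm` is a stand-in for whatever frozen VLM backend is used.

```python
L = 32  # clip length in rendered frames (matches the L used in the text)

def score_subtrajectory(frames, prompt, query_vlm):
    """Map a clip tau^O_i of L rendered frames plus a text prompt P to a scalar p^VLM.

    `query_vlm(frames, prompt)` is a hypothetical callable wrapping a frozen,
    off-the-shelf VLM that returns a task-progress score; it is not a real API."""
    assert len(frames) == L, "score sub-trajectories, not single frames"
    return float(query_vlm(frames, prompt))

def clips_from_episode(rendered_observations, stride=L):
    """Slice an episode's rendered observations into L-frame clips (non-overlapping
    slicing is an illustrative choice, not a detail taken from the paper)."""
    return [rendered_observations[i:i + L]
            for i in range(0, len(rendered_observations) - L + 1, stride)]
```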

Mixture Sampling

Rather than sampling exclusively from VLM-prioritized transitions, we use a mixture strategy that interpolates between VLM-guided prioritization and uniform replay:

$$q_t(i) = \lambda_t \cdot q^P(i) + (1 - \lambda_t) \cdot q^U(i)$$

Here $q^P$ is the VLM-prioritized sampling distribution and $q^U$ is the uniform distribution. We anneal $\lambda_t$ with a linear warm-up schedule, starting at $\lambda_0=0$ (pure uniform replay) and increasing gradually to $\lambda_{\max}=0.5$. This ensures broad coverage early in training while biasing sampling toward high-utility experiences later.
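
A minimal NumPy sketch of the mixture distribution and linear warm-up; the `warmup_steps` parameter and the way priorities are stored are assumptions rather than details from the paper.

```python
import numpy as np

def lambda_schedule(step, warmup_steps, lam_max=0.5):
    """Linear warm-up from lambda_0 = 0 (pure uniform) to lambda_max."""
    return lam_max * min(1.0, step / float(warmup_steps))

def mixture_probs(vlm_priorities, step, warmup_steps, lam_max=0.5):
    """q_t(i) = lambda_t * q^P(i) + (1 - lambda_t) * q^U(i)."""
    p = np.asarray(vlm_priorities, dtype=np.float64)
    q_P = p / p.sum()                           # VLM-prioritized distribution
    q_U = np.full_like(q_P, 1.0 / len(q_P))     # uniform distribution
    lam = lambda_schedule(step, warmup_steps, lam_max)
    return lam * q_P + (1.0 - lam) * q_U

# Drawing a minibatch of buffer indices from the mixture:
# q_t = mixture_probs(priorities, step=t, warmup_steps=100_000)
# idx = np.random.choice(len(q_t), size=batch_size, p=q_t)
```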

TD-Error Boosting (Continuous Control)

For continuous control tasks requiring fine-grained motion, we combine VLM scoring with TD-error:

$$q^P(i) \propto p^{\text{VLM}}_{i} \cdot |\delta_i|$$

The VLM score acts as a semantic filter, masking "irrelevant" transitions, while the TD-error $\delta_i$ refines prioritization within the filtered set.
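
A small sketch of this combination under the same caveats, assuming the buffer stores a VLM score and a current TD-error per transition; the `eps` floor is an added assumption to keep priorities strictly positive.

```python
import numpy as np

def boosted_priorities(vlm_scores, td_errors, eps=1e-6):
    """Unnormalized q^P(i) proportional to p^VLM_i * |delta_i|.

    The VLM score gates which transitions count as task-relevant, and the
    TD-error magnitude ranks transitions within that filtered set. The `eps`
    floor is an assumption, not a detail taken from the paper."""
    p_vlm = np.asarray(vlm_scores, dtype=np.float64)
    delta = np.asarray(td_errors, dtype=np.float64)
    prio = p_vlm * np.abs(delta) + eps
    return prio / prio.sum()    # normalized q^P, plugged into the mixture above
```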

Experiments

Environments

MiniGrid / DoorKey

Discrete navigation requiring sequential reasoning: pick up key, unlock door, reach goal. Sparse rewards only at task completion.

[DoorKey grid sizes: 8x8, 12x12, 16x16]

OGBench / Scene

Continuous manipulation with a UR5e arm. Long-horizon compositional tasks: unlocking, placing, and coordinating objects.

[Scene tasks: Task 3, Task 4, Task 5]

Main Results

VLM-RB consistently outperforms baselines across all settings. The largest gains appear in challenging sparse-reward scenarios.

[Figure 4: aggregated training curves]
Figure 4: VLM-RB consistently outperforms baselines across continuous and discrete tasks. Aggregated success rates for four algorithms (DQN, IQN, SAC, TD3) on MiniGrid and OGBench domains.
| Algorithm | Environment | Baseline | Performance Gain | Sample Efficiency |
|---|---|---|---|---|
| DQN + IQN | DoorKey 8x8 | UER / PER | +0.0% / +0.0% | +19.1% / +23.0% |
| DQN + IQN | DoorKey 12x12 | UER / PER | +61.3% / +22.0% | +52.8% / +32.1% |
| DQN + IQN | DoorKey 16x16 | UER / PER | +241.7% / +70.8% | +37.6% / +24.1% |
| SAC + TD3 | Scene Task 3 | UER / PER | +0.0% / +0.0% | +40.7% / +21.1% |
| SAC + TD3 | Scene Task 4 | UER / PER | +22.0% / +2.0% | +44.6% / +19.7% |
| SAC + TD3 | Scene Task 5 | UER / PER | +119.4% / +49.1% | +46.3% / +17.9% |

Comparison to Alternative Prioritization Methods

We compare VLM-RB against alternative prioritization schemes: Attentive Experience Replay (AER), Experience Replay Optimization (ERO), and Reducible Loss Prioritization (ReLo).

[Figure 5: baseline comparison]
Figure 5: VLM-RB is the only method to consistently solve sparse-reward tasks. Alternatives fail because they depend on dense feedback or local similarity metrics.

Semantic Alignment Ablation

To verify that VLM-RB relies on semantic understanding, we use the "Modified Game" paradigm: we modify only the rendered frames used for VLM scoring while keeping the underlying MDP identical.

[Figure 6: rendered frame variants, panels (a) Standard, (b) Misleading, (c) Abstract]
Figure 6: Success depends on alignment with semantic priors. With misleading or abstract visuals, performance degrades to baseline levels—confirming that VLM-RB's prioritization depends on recognizing meaningful visual semantics.

Citation

@article{sharony2025vlmrb,
  title   = {VLM-Guided Experience Replay},
  author  = {Elad Sharony and Tom Jurgenson and Orr Krupnik and Dotan Di Castro and Shie Mannor},
  journal = {arXiv preprint},
  year    = {2025}
}

References

  1. Schaul, T., et al. (2015). Prioritized Experience Replay. arXiv:1511.05952.
  2. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
  3. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
  4. Cho, J.H., et al. (2025). PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. arXiv.
  5. Chevalier-Boisvert, M., et al. (2023). Minigrid & Miniworld: Modular & Customizable RL Environments. NeurIPS.
  6. Park, S., et al. (2024). OGBench: Benchmarking Offline Goal-Conditioned RL. arXiv.