Overview
Reinforcement learning (RL) aims to solve long-horizon tasks even when training data contains only short fragments of behavior — a capability often called experience stitching. This ability is commonly attributed to bootstrapping via temporal-difference (TD) updates, i.e., updating value estimates using predictions about successor states instead of relying on full rollouts. However, stitching remains far from solved: we show that the common belief — that TD methods enable stitching automatically — fails even in relatively simple environments as the state space grows. We further show that Monte-Carlo (MC) methods can achieve stitching in some settings, and that model scale is an important lever that can enable stitching.
The problem with stitching
Outside of tabular or highly constrained settings, TD methods cannot literally stitch trajectories together, as trajectories rarely self-intersect in real-world scenarios. For example, compare an ant crawling on a sheet of paper (2D) with a fly flying in an empty room (3D) in the figure below: the ant's 2D path will self-cross far more often than the fly's 3D path. Following this example, we observe that stitching has a dual relationship with generalization. On one hand, stitching requires generalization: the value function must assign similar values to similar states, enabling values to propagate across disconnected trajectories. On the other hand, stitching itself provides generalization: it allows a policy to traverse between states that were never observed as connected during training.
What does "stitching" mean for TD updates?
Temporal-Difference learning “stitches” information by updating the value of the current state using the TD target $R_{t+1} + \gamma\,V(S_{t+1})$, which combines the observed reward and the estimated value of the next state. During the update, $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma\,V(S_{t+1}) - V(S_t)]$, the TD error term $[R_{t+1} + \gamma\,V(S_{t+1}) - V(S_t)]$ propagates information about future outcomes backward to preceding states. In principle, this mechanism should allow value estimates to integrate information gathered across different trajectories.
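To make this concrete, below is a minimal, self-contained sketch of a tabular TD(0) update (illustrative Python, not the implementation used in our experiments); the value table, step size $\alpha$, and discount $\gamma$ are placeholder choices. The two toy trajectories share the state "w", so value learned on one propagates to the start of the other, which is exactly the mechanism that stitching relies on.

```python
# Minimal tabular TD(0) sketch (illustrative only): update V(S_t) toward the
# TD target R_{t+1} + gamma * V(S_{t+1}) using one observed transition.
from collections import defaultdict

def td0_update(values, transition, alpha=0.1, gamma=0.99):
    """Apply a single TD(0) update for a (state, reward, next_state) tuple."""
    state, reward, next_state = transition
    td_target = reward + gamma * values[next_state]
    td_error = td_target - values[state]
    values[state] += alpha * td_error
    return td_error

# Toy usage: two trajectories that share the waypoint "w".
values = defaultdict(float)
trajectory_a = [("s0", 0.0, "w"), ("w", 1.0, "g")]  # s0 -> w -> g (reaches the reward)
trajectory_b = [("s1", 0.0, "w")]                   # s1 -> w (never reaches g itself)
for _ in range(200):
    for transition in trajectory_a + trajectory_b:
        td0_update(values, transition)
print(values["s1"])  # close to gamma * 1: value propagated through the shared "w"
```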
Stitching regimes
We describe stitching regimes purely by what is present in the replay buffer $\mathcal{D}$ at train time and what is queried at test time; a minimal code sketch of these definitions follows the list below.
- No stitching (end-to-end only).
  Train: $\mathcal{D}$ contains end-to-end trajectories $(s' \to g')$.
  Test: evaluate a held-out end-to-end pair $(s \to g)$ (same generator, disjoint pairs).
- Exact stitching (shared waypoint).
  Train: $\mathcal{D}$ contains trajectories $(s \to w')$ and $(w' \to g)$ for the same waypoint $w'$; no $(s \to g)$.
  Test: evaluate the end-to-end query $(s \to g)$.
  This setting aligns with classic dynamic programming, temporal-difference propagation across a shared waypoint, and recent discussions of “stitching”.
- Generalized stitching (waypoint mismatch).
  Train: $\mathcal{D}$ contains $(s \to w')$ and $(w'' \to g)$ with $w' \neq w''$; there is no waypoint $\tilde{w}$ for which both trajectories $(s \to \tilde{w})$ and $(\tilde{w} \to g)$ are present.
  Test: evaluate $(s \to g)$.
  Success requires a representation that bridges mismatched trajectories (e.g., successor features with generalized policy improvement, or temporal distance/value models).
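As a rough illustration of these definitions, the sketch below classifies what kind of support a buffer offers for a test query $(s \to g)$, treating each stored trajectory only by its (start, goal) endpoints; the function name and return labels are illustrative rather than part of any released code.

```python
# Illustrative classifier (a sketch, not benchmark code): given the set of
# (start, goal) endpoint pairs whose trajectories are stored in D, report which
# stitching regime a test query (s, g) falls into.
def stitching_regime(train_pairs, s, g):
    if (s, g) in train_pairs:
        return "in support"  # the exact end-to-end pair was seen during training
    waypoints_from_s = {w for (a, w) in train_pairs if a == s}  # fragments s -> w'
    waypoints_to_g = {w for (w, b) in train_pairs if b == g}    # fragments w'' -> g
    if waypoints_from_s & waypoints_to_g:
        return "exact stitching"        # some shared waypoint w' connects s and g
    if waypoints_from_s and waypoints_to_g:
        return "generalized stitching"  # only mismatched waypoints w' != w''
    return "no stitching"               # no usable fragments: held-out end-to-end query

# Example with abstract state labels:
print(stitching_regime({("s", "w1"), ("w1", "g")}, "s", "g"))  # exact stitching
print(stitching_regime({("s", "w1"), ("w2", "g")}, "s", "g"))  # generalized stitching
```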
Experimental setup
The agent (red ball) needs to push the boxes into the target locations (yellow transparent areas). If the box is placed in the correct quarter, the target location lights up green.
Exact Stitching (Quarters setting)
During training, boxes are placed in one quarter and must be moved to an adjacent quarter (gray arrow indicates the required direction of transfer). During testing, boxes must be moved to the diagonal quarter. The gray arrows illustrate one of the valid two-step routes via adjacent quarters (adjacent $\rightarrow$ adjacent), which were seen separately during training but never as an end-to-end diagonal move.
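To make the split concrete, here is a hedged sketch of how Quarters-style tasks could be sampled, with adjacent-quarter moves at train time and diagonal-quarter moves at test time; the quarter labels and helper function are hypothetical, not the benchmark's actual interface.

```python
# Hypothetical sketch of the Quarters split: adjacent-quarter moves at train
# time, diagonal-quarter moves at test time (not the benchmark's real code).
import random

ADJACENT = {"NW": ["NE", "SW"], "NE": ["NW", "SE"],
            "SW": ["NW", "SE"], "SE": ["NE", "SW"]}
DIAGONAL = {"NW": "SE", "NE": "SW", "SW": "NE", "SE": "NW"}

def sample_box_task(split, rng=random):
    """Return a (start_quarter, goal_quarter) pair for one box."""
    start = rng.choice(list(ADJACENT))
    if split == "train":
        return start, rng.choice(ADJACENT[start])  # adjacent move, seen in training
    return start, DIAGONAL[start]                  # diagonal move, test-time only

print(sample_box_task("train"), sample_box_task("test"))
```

The key property of this split is that every diagonal test move decomposes into two adjacent moves, each of which appears (separately) in the training distribution.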
Generalized Stitching (Few-to-Many setting)
During training, one box is already on a target, and the agent must place the remaining two. During testing, no boxes start on targets. Although both start and goal configurations are individually familiar, training never includes segments that involve moving three boxes.
- We label a setup open if test solutions are likely to leave the training support.
- We label a setup closed if all test solutions can be realized within the training support.
Experimental Highlights
In the Quarters setting (6×6 grid) — which tests exact stitching — increasing the number of boxes widens the generalization gap for both TD and MC methods.
The failure mode of TD methods
A subtle failure of stitching. An agent trained on the Quarters task should first move all boxes to an adjacent quarter and then to the goal quarter. However, if the agent prematurely moves a box along the diagonal, it ends up in a state never seen during training, because the setup is open.
In the Few-to-Many setting, we probe methods' generalized stitching capabilities. The more difficult the training tasks (fewer boxes starting on target), the smaller the generalization gap for both MC and TD methods.
Scaling Narrows the Generalization Gap
Previous works (Nauman et al., 2024; Lee et al., 2025; Wang et al., 2025) have shown that proper scaling of critics' and actors' neural networks can provide enormous benefits in online RL. Strikingly, the generalization gap can be reduced simply by increasing the scale of the critic, for both TD and MC methods.
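As a toy illustration of what scaling the critic can mean in practice, the sketch below widens a critic MLP by a scale multiplier and counts the resulting parameters; the layer layout and default sizes are assumptions, not the architecture from our experiments.

```python
# Toy sketch (assumed layer layout, not the experimental architecture): scaling
# the critic here means multiplying the hidden width of its Q(s, a) MLP.
def critic_layer_sizes(obs_dim, action_dim, base_width=256, depth=2, scale=1):
    hidden = [base_width * scale] * depth
    return [obs_dim + action_dim] + hidden + [1]  # input -> hidden layers -> scalar Q

def num_parameters(layer_sizes):
    # Dense layers: weights (fan_in * fan_out) plus biases (fan_out).
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

for scale in (1, 2, 4):
    sizes = critic_layer_sizes(obs_dim=64, action_dim=8, scale=scale)
    print(f"scale={scale}: layers={sizes}, params={num_parameters(sizes):,}")
```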
Empirical Takeaways
- TD updates excel at exact stitching for a couple of boxes, yet performance collapses as the space of possible states expands.
- Monte Carlo variants such as CRL close most of the generalization gap in closed settings, demonstrating implicit stitching despite the absence of bootstrapping.
- Increasing critic capacity consistently narrows the train-test gap for both paradigms, positioning model scale as an important lever for stitching.
Call to action
Our findings challenge the conventional wisdom that TD learning is the sole effective method for experience stitching. As shown by our experiments, even a couple of boxes in the Quarters task can cause TD methods to fail. We encourage readers to explore the code linked above to experiment with the benchmark and propose new methods that can mitigate this problem! Also, feel free to get in touch with us for collaboration or questions!
BibTeX
@article{bortkiewicz2025temporal,
title = {Is Temporal Difference Learning the Gold Standard for Stitching in RL?},
author = {Michał Bortkiewicz and Władysław Pałucki and Mateusz Ostaszewski and Benjamin Eysenbach},
year = {2025},
journal = {arXiv preprint arXiv:2510.21995}
}