Overview
Reinforcement learning (RL) promises to solve long-horizon tasks even when the training data contains only short fragments of the target behaviors. This ability is called stitching, and it is a crucial prerequisite for more general, foundational RL models. Conventional wisdom dictates that only temporal difference (TD) methods can stitch together fragments of experience gathered during training and use them to solve more complex tasks. We show that, while TD methods can indeed stitch experiences in simple, low-dimensional settings, this ability does not transfer to more complex, high-dimensional tasks. Additionally, we show that Monte Carlo (MC) methods, while they still fall behind TD methods, exhibit some stitching behavior as well. Furthermore, we find that scaling network sizes plays a more critical role in closing the generalization gap than previously thought, making it a promising avenue of research, especially in the age of larger models in RL.
The problem with stitching
Outside of tabular or highly constrained settings, TD methods cannot literally stitch trajectories together, as trajectories rarely self-intersect in real-world scenarios. For example, compare an ant crawling on a sheet of paper (2D) with a fly flying in an empty room (3D) in the figure below: the ant's 2D path will self-cross far more often than the fly's 3D path. Following this example, we observe that stitching has a dual relationship with generalization. On one hand, stitching requires generalization: the value function must assign similar values to similar states, enabling values to propagate across disconnected trajectories. On the other hand, stitching itself provides generalization: it allows a policy to traverse between states that were never observed as connected during training.
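To make the dimensionality intuition concrete, here is a small, illustrative simulation (not from the paper): it estimates how often a simple lattice random walk, a crude stand-in for the ant's and the fly's paths, lands on a cell it has already visited in 2D versus 3D.

```python
import numpy as np

def revisit_fraction(dim: int, n_steps: int = 1000, n_walks: int = 200, seed: int = 0) -> float:
    """Fraction of steps at which a simple lattice random walk lands on an already-visited cell."""
    rng = np.random.default_rng(seed)
    revisits, total = 0, 0
    for _ in range(n_walks):
        pos = np.zeros(dim, dtype=int)
        visited = {tuple(pos)}
        for _ in range(n_steps):
            axis = rng.integers(dim)           # pick a coordinate axis
            pos[axis] += rng.choice([-1, 1])   # move one cell along it
            key = tuple(pos)
            revisits += key in visited
            visited.add(key)
            total += 1
    return revisits / total

# The 2D walk ("ant on paper") crosses its own path far more often than the 3D walk ("fly in a room").
print("2D revisit fraction:", revisit_fraction(dim=2))
print("3D revisit fraction:", revisit_fraction(dim=3))
```

With these defaults, the 2D walk revisits previously visited cells much more often than the 3D walk, mirroring the ant-versus-fly example: the higher the dimension, the rarer the self-intersections that literal trajectory stitching would rely on.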
Stitching regimes
- No stitching (end-to-end only).
  - Train: $\mathcal{D}$ contains end-to-end trajectories $(s' \to g')$.
  - Test: evaluate a held-out end-to-end pair $(s \to g)$ (same generator, disjoint pairs).
- Exact stitching (shared waypoint).
  - Train: $\mathcal{D}$ contains trajectories $(s \to w')$ and $(w' \to g)$ for the same waypoint $w'$; no $(s \to g)$.
  - Test: evaluate the end-to-end query $(s \to g)$.
  - This setting aligns with classic dynamic programming, temporal-difference propagation across a shared waypoint, and recent discussions of “stitching”.
- Generalized stitching (waypoint mismatch).
  - Train: $\mathcal{D}$ contains $(s \to w')$ and $(w'' \to g)$ with $w' \neq w''$; there is no waypoint $\tilde{w}$ for which both trajectories $(s \to \tilde{w})$ and $(\tilde{w} \to g)$ are present.
  - Test: evaluate $(s \to g)$.
  - Success requires a representation that bridges mismatched trajectories (e.g., successor features with generalized policy improvement or temporal distance/value models).
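As a rough illustration of how datasets for the three regimes differ, here is a minimal sketch. The `collect_segment` helper, the start/waypoint/goal sets, and the pairing logic are hypothetical placeholders, not the benchmark's actual data pipeline.

```python
import random

# Hypothetical placeholder: roll out some controller from `start` to `goal` and record the segment.
def collect_segment(start, goal):
    return {"start": start, "goal": goal}

starts = [f"s{i}" for i in range(4)]
goals = [f"g{i}" for i in range(4)]
waypoints = [f"w{i}" for i in range(4)]

def no_stitching_data():
    """Train on some end-to-end (s -> g) pairs; test on held-out (s -> g) pairs."""
    pairs = [(s, g) for s in starts for g in goals]
    random.shuffle(pairs)
    train_pairs, test_pairs = pairs[: len(pairs) // 2], pairs[len(pairs) // 2:]
    return [collect_segment(s, g) for s, g in train_pairs], test_pairs

def exact_stitching_data():
    """Train on (s -> w) and (w -> g) that share the same waypoint w; never on (s -> g)."""
    train = []
    for s, g in zip(starts, goals):
        w = random.choice(waypoints)
        train += [collect_segment(s, w), collect_segment(w, g)]
    return train, list(zip(starts, goals))

def generalized_stitching_data():
    """Train on (s -> w') and (w'' -> g) with w' != w''; no shared waypoint links s to g."""
    train = []
    for s, g in zip(starts, goals):
        w1, w2 = random.sample(waypoints, 2)  # sampling without replacement guarantees w1 != w2
        train += [collect_segment(s, w1), collect_segment(w2, g)]
    return train, list(zip(starts, goals))
```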
Experimental setup
The agent (red ball) needs to push the boxes into the target locations (yellow transparent areas). If the box is placed in the correct quarter, the target location lights up green.
Exact Stitching (Quarters setting)
During training, boxes are placed in one quarter and must be moved to an adjacent quarter (gray arrow indicates the required direction of transfer). During testing, boxes must be moved to the diagonal quarter. The gray arrows illustrate one of the valid two-step routes via adjacent quarters (adjacent $\rightarrow$ adjacent), which were seen separately during training but never as an end-to-end diagonal move.
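Below is a minimal sketch of the train/test task split in the Quarters setting, assuming quarters are indexed 0 to 3 around the grid; the adjacency tables and function names are illustrative, not the benchmark's API.

```python
import random

# Quarters indexed clockwise: 0 = top-left, 1 = top-right, 2 = bottom-right, 3 = bottom-left.
ADJACENT = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
DIAGONAL = {0: 2, 1: 3, 2: 0, 3: 1}

def sample_train_task():
    """Training: boxes start in one quarter and must reach an adjacent quarter."""
    start = random.randrange(4)
    return start, random.choice(ADJACENT[start])

def sample_test_task():
    """Testing: boxes must reach the diagonal quarter, never seen as an end-to-end move."""
    start = random.randrange(4)
    return start, DIAGONAL[start]
```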
Generalized Stitching (Few-to-Many setting)
During training, one box is already on a target, and the agent must place the remaining two. During testing, no boxes start on targets. Although both start and goal configurations are individually familiar, training never includes segments that involve moving three boxes.
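Analogously, a sketch of the Few-to-Many initial-state split; `targets` and `free_cells` are hypothetical placeholders for the target cells and the non-target cells of the grid.

```python
import random

N_BOXES = 3

def sample_train_state(targets, free_cells):
    """Training: one box already sits on a target; the agent must place the remaining two."""
    return [random.choice(targets)] + random.sample(free_cells, N_BOXES - 1)

def sample_test_state(targets, free_cells):
    """Testing: no box starts on a target; the agent must place all three."""
    # `targets` is unused here; kept only for a uniform interface with sample_train_state.
    return random.sample(free_cells, N_BOXES)
```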
- We label a setup open if test solutions are likely to leave the training support.
- We label a setup closed if all test solutions can be realized within the training support.
Experimental Highlights
In the Quarters setting (6×6 grid) — which tests exact stitching — increasing the number of boxes widens the generalization gap for both TD and MC methods.
A subtle failure of stitching. An agent trained on the Quarters task should first move all boxes to an adjacent quarter and then to the goal quarter. However, if the agent prematurely moves a box along the diagonal, it ends up in a state that was never seen during training, because the setup is open.
In the Few-to-Many setting, we probe methods’ generalized stitching capabilities. The more difficult the training tasks (the fewer boxes that start on a target), the smaller the generalization gap for both MC and TD methods.
Scaling Narrows the Generalization Gap
Previous works (Nauman et al., 2024; Lee et al., 2025; Wang et al., 2025) have shown that proper scaling of critics’ and actors’ neural networks can provide enormous benefits in online RL. Strikingly, the generalization gap might be reduced by simply increasing the scale of the critic for both TD and MC methods.
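As a concrete, illustrative example of what "scaling the critic" can mean in practice, the sketch below builds a Q-network whose capacity is controlled by its width and depth; the specific layer sizes are assumptions, not the paper's configuration.

```python
import torch.nn as nn

def make_critic(obs_dim: int, action_dim: int, width: int = 256, depth: int = 2) -> nn.Module:
    """Q(s, a) network whose capacity is set by `width` (hidden units) and `depth` (hidden layers)."""
    layers = [nn.Linear(obs_dim + action_dim, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1)]
    return nn.Sequential(*layers)

# Illustrative sweep: the same training run repeated with progressively larger critics.
small_critic = make_critic(obs_dim=32, action_dim=8, width=256, depth=2)
large_critic = make_critic(obs_dim=32, action_dim=8, width=2048, depth=4)
```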
Empirical Takeaways
- TD updates excel at exact stitching for a couple of boxes, yet performance collapses as the space of possible states expands.
- Monte Carlo variants such as CRL close most of the generalization gap in closed settings, demonstrating implicit stitching despite the absence of bootstrapping.
- Increasing critic capacity consistently narrows the train-test gap for both paradigms, positioning scale as a stronger lever for stitching than previously appreciated.
Call to action
Our findings challenge the conventional wisdom that TD learning is the sole effective method for experience stitching. As shown by our experiments, even a couple of boxes in the Quarters task can cause TD methods to fail. We encourage readers to explore the code linked above to experiment with the benchmark and propose new methods that can mitigate this problem! Also, feel free to get in touch with the authors for collaboration or questions.
BibTeX
@inproceedings{anonymous2026temporal,
title={Is Temporal Difference Learning the Gold Standard for Stitching in RL?},
author={Michał Bortkiewicz and Władysław Pałucki and Benjamin Eysenbach and Mateusz Ostaszewski},
year={2026},
url={https://michalbortkiewicz.github.io/golden-standard/}
}