q-convergence
We present the proof of model-free $q$-convergence. The generator-based rate $q_V$ is replaced by a finite-horizon estimate $q_V^u$, which introduces both an approximation bias and a stronger sensitivity to the value function. We show how smoothness plus Richardson extrapolation tighten those error bounds enough to recover convergence of both $V_k$ and the learned instantaneous rate.
Proof roadmap
The argument has four layers: start from the baseline finite-horizon bias, strengthen it using a second-order semigroup expansion, cancel the leading $u$-bias by Richardson extrapolation, and then transfer the resulting Hamiltonian mismatch into value and $q$-error recursions.
What changes from the value proof
The ideal Hamiltonian $H^{(\alpha)}(V)$ is no longer used directly. Instead, we study finite-horizon Hamiltonians built from $q_V^u$ and the Richardson-corrected $\tilde q_V^u$, so there are now two error sources: a local semigroup discretization error and a modelling bias in $u$.
Main technical difficulty
The map $V \mapsto q_V^u$ is Lipschitz only with a constant of order $1/u$. This is why the final $q$-bound contains a factor $1/u$, and why the theorem requires $\tau/u \to 0$ when both discretization scales are sent to zero.
Main theorem on this page
The first estimate controls the value flow with a finite-horizon Hamiltonian. The second estimate transfers value convergence into convergence of the learned instantaneous advantage-rate.
1. Baseline finite-horizon bias
Before imposing extra smoothness, we record the natural bias scale that already follows from Dynkin's formula and the small-time moment bounds. Under the standard diffusion assumptions one has:
where $q_V^u$ is the short-rollout estimator and $q_V$ is the generator-based infinitesimal rate. This already shows that finite-horizon estimates are asymptotically correct, but the rate is too weak for the sharper theorem in the main paper.
We additionally note that the associated Hamiltonian inherits the same baseline scale:
2. Smoothness assumptions and second-order expansion
To sharpen the bias, the appendix strengthens the assumptions: the controlled drift and diffusion are $C^2$ in the state variable with bounded derivatives, the reward is $C^2$, and the value class is upgraded to $\mathcal D \subset C_b^4(\mathbb R^d)$. Under these assumptions, the fixed-action semigroup admits a second-order expansion:
Substituting this expansion into the finite-horizon definition
improves the modelling bias to first order in $u$:
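For intuition about the first-order scale, here is a minimal sketch in a deliberately simplified setting that is an assumption of this example and not the paper's definition: an uncontrolled Ornstein-Uhlenbeck process $dX_t=-\theta X_t\,dt+\sigma\,dW_t$ with $V(x)=x^2$, no reward and no discounting. The finite-horizon rate $(\mathbb E[V(X_u)]-V(x))/u$ is then available in closed form and can be compared with the generator value $-2\theta x^2+\sigma^2$. The parameter values and function names are hypothetical.

```python
import math

THETA, SIGMA = 1.5, 0.8  # hypothetical OU parameters, chosen only for illustration


def generator_rate(x):
    """Generator applied to V(x) = x^2: L V(x) = -2*theta*x^2 + sigma^2."""
    return -2.0 * THETA * x * x + SIGMA ** 2


def finite_horizon_rate(x, u):
    """(E[V(X_u) | X_0 = x] - V(x)) / u, using the exact OU second moment."""
    decay = math.exp(-2.0 * THETA * u)
    second_moment = x * x * decay + SIGMA ** 2 * (1.0 - decay) / (2.0 * THETA)
    return (second_moment - x * x) / u


def bias(x, u):
    """Absolute gap between the finite-horizon rate and the generator rate."""
    return abs(finite_horizon_rate(x, u) - generator_rate(x))


if __name__ == "__main__":
    x = 1.0
    for u in (0.1, 0.05, 0.025, 0.0125):
        # bias / u should settle near a constant if the bias is first order in u
        print(f"u={u:<7} bias={bias(x, u):.6f} bias/u={bias(x, u) / u:.4f}")
```

In this toy case, halving $u$ roughly halves the bias, which is the first-order behaviour the expansion predicts. It does not check the general bound.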
3. Richardson cancellation
The corrected estimator is defined by
Because the finite-horizon expansion has the form
the linear term cancels, leaving the sharper bound
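A minimal numerical sketch of the cancellation, assuming the standard two-point Richardson combination $2f(u/2)-f(u)$ (the source's own definition of $\tilde q_V^u$ is not reproduced here, so this form is an assumption). The toy estimator `forward_rate` is hypothetical: it approximates the derivative of $e^t$ at $t=0$ and has a bias expansion $u/2+u^2/6+\dots$, playing the role of a finite-horizon rate with a linear leading bias.

```python
import math


def forward_rate(u):
    """Toy finite-horizon estimate of d/dt exp(t) at t = 0 (true value is 1)."""
    return math.expm1(u) / u


def richardson(estimator, u):
    """Two-point Richardson combination that cancels a linear-in-u bias term."""
    return 2.0 * estimator(u / 2.0) - estimator(u)


if __name__ == "__main__":
    for u in (0.2, 0.1, 0.05):
        raw = abs(forward_rate(u) - 1.0)
        corrected = abs(richardson(forward_rate, u) - 1.0)
        print(f"u={u:<5} raw error={raw:.2e} corrected error={corrected:.2e}")
```

Halving $u$ roughly halves the raw error but cuts the corrected error by about a factor of four, consistent with a bias that moves from order $u$ to order $u^2$.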
4. Hamiltonian bias and Lipschitz transfer
The next step is to push the $q$-bias through the soft Hamiltonian. We use the Lipschitz property of the log-sum-exp map: for bounded functions $z$ and $\bar z$,
Applying this pointwise with $z=q_V^u(x,\cdot)$ or $z=\tilde q_V^u(x,\cdot)$ yields:
We also have the stability estimates:
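As a numerical sanity check of the Lipschitz property used in this section, the sketch below assumes a finite action set and the tempered form $\alpha\log\sum_a \exp(z(a)/\alpha)$. Both are assumptions of this example, since the source may work with an integral over actions. It tests that the soft-max difference never exceeds the sup-norm distance between the inputs.

```python
import math
import random


def soft_max(z, alpha):
    """alpha * log(sum_a exp(z[a] / alpha)), computed with the max-shift trick."""
    m = max(z)
    return m + alpha * math.log(sum(math.exp((v - m) / alpha) for v in z))


def lipschitz_gap(z, z_bar, alpha):
    """Sup-norm distance minus the soft-max difference; 1-Lipschitz means >= 0."""
    sup_dist = max(abs(a - b) for a, b in zip(z, z_bar))
    return sup_dist - abs(soft_max(z, alpha) - soft_max(z_bar, alpha))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    worst = min(
        lipschitz_gap(
            [rng.uniform(-5, 5) for _ in range(8)],
            [rng.uniform(-5, 5) for _ in range(8)],
            alpha=0.5,
        )
        for _ in range(1000)
    )
    print(f"smallest gap over 1000 random pairs: {worst:.4f}")
```

A constant shift of all entries attains the bound with equality, so the constant 1 cannot be improved in this finite setting.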
5. Perturbed value recursion
Under the stronger smoothness assumptions of Section 2, the short-horizon semigroup has a sharper local expansion than in the value-only theorem:
Combining that with the Hamiltonian bias above gives the one-step mismatch for the Richardson flow
Finally, the same contraction comparison as in the value-convergence proof yields:
So the value iterate is still controlled by a geometric contraction term plus an accumulated local mismatch term, except the mismatch is now sharper and explicitly depends on $u$.
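The structure "geometric contraction plus accumulated mismatch" can be sketched numerically. The recursion below assumes a one-step bound of the form $e_{k+1}\le e^{-\rho\tau}e_k+\tau\,\delta$, where $\delta$ is a hypothetical per-unit-time mismatch. The exact constants and the form of the mismatch in the source are not reproduced here.

```python
import math


def iterate_error(e0, rho, tau, delta, steps):
    """Worst case of the assumed recursion e_{k+1} = exp(-rho*tau)*e_k + tau*delta."""
    contraction = math.exp(-rho * tau)
    errors = [e0]
    for _ in range(steps):
        errors.append(contraction * errors[-1] + tau * delta)
    return errors


def closed_form_bound(e0, rho, tau, delta, k):
    """Geometric term plus the accumulated mismatch tau*delta / (1 - exp(-rho*tau))."""
    contraction = math.exp(-rho * tau)
    return contraction ** k * e0 + tau * delta / (1.0 - contraction)


if __name__ == "__main__":
    # Hypothetical values: initial error 2.0, discount 1.0, step 0.01, mismatch 0.3
    errs = iterate_error(e0=2.0, rho=1.0, tau=0.01, delta=0.3, steps=2000)
    for k in (0, 100, 500, 2000):
        bound = closed_form_bound(2.0, 1.0, 0.01, 0.3, k)
        print(f"k={k:<5} error={errs[k]:.4f} bound={bound:.4f}")
```

Since $1-e^{-\rho\tau}\approx\rho\tau$ for small $\tau$, the accumulated term settles near $\delta/\rho$: a per-step mismatch of size $\tau\delta$ accumulates to order $\delta$, not to order $\tau\delta$.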
6. Transfer from $V$ to $q$
The final step is a triangle inequality that separates approximation and stability:
The first term is controlled by the Lipschitz-in-$V$ bound, and the second term is the Richardson bias at the fixed point. Together they give:
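The decomposition described above can be written out as a sketch. The notation is an assumption of this example: $V^*$ denotes the fixed point, $C_L/u$ the Lipschitz-in-$V$ constant, and $C_R$ the Richardson bias constant. The source's exact norms and constants may differ.

```latex
\begin{align*}
\|\tilde q^{u}_{V_k} - q_{V^*}\|_\infty
  &\le \|\tilde q^{u}_{V_k} - \tilde q^{u}_{V^*}\|_\infty
     + \|\tilde q^{u}_{V^*} - q_{V^*}\|_\infty \\
  &\le \frac{C_L}{u}\,\|V_k - V^*\|_\infty + C_R\,u^{2}.
\end{align*}
```

The first line is the triangle inequality. The second applies the order-$1/u$ Lipschitz estimate to the first term and the second-order Richardson bias to the second.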
7. Final limit when $\tau/u\to 0$
The Richardson value bound contains a term of order $\tau+u^2$ once the denominator $1-e^{-\rho\tau}=\rho\tau+O(\tau^2)$ is expanded, while the $q$-bound multiplies the value error by $1/u$. So the combined error behaves like
To ensure this goes to zero, it is enough to take $u\to 0$ and $\tau/u\to 0$. In this setting, the learned finite-horizon rate converges to the true infinitesimal advantage-rate.
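To see why the coupling between $\tau$ and $u$ matters, the sketch below evaluates a proxy for the combined error with all constants set to 1: the value error $\tau+u^2$ scaled by $1/u$, plus a Richardson bias of $u^2$. The proxy and the function name `combined_error` are assumptions of this example, not the theorem's exact bound.

```python
def combined_error(tau, u):
    """Proxy q-error: value error (tau + u^2) times 1/u, plus Richardson bias u^2."""
    return (tau + u ** 2) / u + u ** 2


if __name__ == "__main__":
    for u in (0.1, 0.01, 0.001):
        coupled = combined_error(u ** 2, u)  # tau = u^2, so tau/u -> 0
        stuck = combined_error(u, u)         # tau = u, so tau/u stays equal to 1
        print(f"u={u:<6} tau=u^2: {coupled:.4f}   tau=u: {stuck:.4f}")
```

With $\tau=u^2$ the proxy vanishes as $u\to 0$, while with $\tau=u$ it stays above 1 however small $u$ becomes.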