Continuous-Time RL

q-convergence

We present the proof of model-free $q$-convergence. The generator-based rate $q_V$ is replaced by a finite-horizon estimate $q_V^u$, which introduces both approximation bias and stronger sensitivity to the value function. We show how smoothness plus Richardson extrapolation sharpen those errors enough to recover convergence of both $V_k$ and the learned instantaneous rate.

Proof roadmap

The argument has four layers: start from the baseline finite-horizon bias, strengthen it using a second-order semigroup expansion, cancel the leading $u$-bias by Richardson extrapolation, and then transfer the resulting Hamiltonian mismatch into value and $q$-error recursions.

Finite-horizon estimator: $q_V^u$
Richardson correction: $\tilde q_V^u=2q_V^{u/2}-q_V^u$
Mismatch: $\tau^2+\tau u^2$

Baseline bias: standard assumptions give the $\sqrt{u}$ scale.
Smooth expansion: the second-order semigroup expansion improves the bias to $u$.
Richardson step: cancel the leading term and leave a $u^2$ bias.
Value recursion: contraction plus a $\tau^2+\tau u^2$ mismatch.
Transfer to $q$: Lipschitz in $V$ with factor $1/u$.

What changes from the value proof

The ideal Hamiltonian $H^{(\alpha)}(V)$ is no longer used directly. Instead, we study finite-horizon Hamiltonians built from $q_V^u$ and the Richardson-corrected $\tilde q_V^u$, so there is now both a local semigroup discretization error and a modelling bias in $u$.

Main technical difficulty

The map $V \mapsto q_V^u$ is only Lipschitz with constant of order $1/u$. This is why the final $q$-bound contains a factor $1/u$, and why the theorem requires the limit $\tau/u \to 0$ when sending both discretization scales to zero.

Main theorem on this page

$$ \|V_k - V^{(\alpha)}\|_\infty \le e^{-\rho \tau k}\,\|V_0 - V^{(\alpha)}\|_\infty + \frac{C_1(\tau^2 + \tau u^2)}{1-e^{-\rho \tau}}, $$
$$ \|\tilde q_{V_k}^u - q_{V^{(\alpha)}}\|_\infty \le \frac{L}{u}\,\|V_k - V^{(\alpha)}\|_\infty + C_2 u^2. $$

The first estimate controls the value flow with a finite-horizon Hamiltonian. The second estimate transfers value convergence into convergence of the learned instantaneous advantage-rate.

1. Baseline finite-horizon bias

Before imposing extra smoothness, we record the natural bias scale that already follows from Dynkin's formula and the small-time moment bounds. Under the standard diffusion assumptions one has:

$$ \|q_V^u-q_V\|_\infty \le C\sqrt{u}, $$

where $q_V^u$ is the short-rollout estimator and $q_V$ is the generator-based infinitesimal rate. This already shows that finite-horizon estimates are asymptotically correct, but the rate is too weak for the sharper theorem in the main paper.

We additionally note that the associated Hamiltonian inherits the same baseline scale:

$$ \|H_u^{(\alpha)}(V)-H^{(\alpha)}(V)\|_\infty \le C\sqrt{u}. $$

2. Smoothness assumptions and second-order expansion

To sharpen the bias, the appendix strengthens the assumptions: the controlled drift and diffusion are $C^2$ in the state variable with bounded derivatives, the reward is $C^2$, and the value class is upgraded to $\mathcal D \subset C_b^4(\mathbb R^d)$. Under these assumptions, the fixed-action semigroup admits a second-order expansion:

$$ (P_t^aV)(x)=V(x)+t(L^aV)(x)+R_V(t;x,a), \qquad \sup_{x,a}|R_V(t;x,a)|\le C_V t^2. $$

Substituting this expansion into the finite-horizon definition

$$ q_V^u(x,a)=\frac{e^{-\beta u}\,\mathbb E_x^a[V(X_u^a)]-V(x)}{u}+r(x,a) $$

improves the modelling bias to first order:

$$ \|q_V^u-q_V\|_\infty \le C_q u. $$
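The substitution can be written out term by term. This is a sketch that assumes the generator-based rate has the form $q_V=r+L^aV-\beta V$, which is the $u\to 0$ limit of the finite-horizon definition above but is not stated explicitly on this page:

$$ q_V^u-q_V=\Big(\frac{e^{-\beta u}-1}{u}+\beta\Big)V+\big(e^{-\beta u}-1\big)L^aV+\frac{e^{-\beta u}}{u}\,R_V(u;\cdot,a). $$

Since $\big|\tfrac{e^{-\beta u}-1}{u}+\beta\big|\le \tfrac{\beta^2}{2}u$, $|e^{-\beta u}-1|\le \beta u$, and $|R_V(u;x,a)|\le C_V u^2$, each of the three terms is of order $u$ provided $V$ and $L^aV$ are bounded, which gives a constant of the form $C_q=\tfrac{\beta^2}{2}\|V\|_\infty+\beta\sup_a\|L^aV\|_\infty+C_V$.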

3. Richardson cancellation

The corrected estimator is defined by

$$ \tilde q_V^u(x,a):=2q_V^{u/2}(x,a)-q_V^u(x,a). $$

Because the finite-horizon expansion has the form

$$ q_V^u(x,a)=q_V(x,a)+u\,c_1(x,a)+u^2 c_2(u;x,a), $$

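with $c_1$ independent of $u$ and $c_2$ bounded uniformly in small $u$, the linear term cancels in $2q_V^{u/2}-q_V^u=q_V+u^2\big(\tfrac12 c_2(u/2;x,a)-c_2(u;x,a)\big)$, leaving the sharper bound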

$$ \|\tilde q_V^u-q_V\|_\infty \le \tilde C_q u^2. $$
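The two bias orders can be checked numerically on a toy model. The sketch below is illustrative only and is not taken from the paper's codebase: it assumes a one-dimensional process $dX=a\,dt+\sigma\,dW$, the test function $V=\cos$, a hypothetical reward, and hypothetical values for $\beta$ and $\sigma$. For this choice the expectation in $q_V^u$ has a closed form, so no sampling is needed. Halving $u$ should roughly halve the plain bias and roughly quarter the Richardson bias.

```python
import math

BETA = 0.5    # discount rate (hypothetical value)
SIGMA = 0.8   # diffusion coefficient (hypothetical value)


def reward(x, a):
    # Hypothetical reward; it cancels in every bias computed below.
    return -0.5 * a * a


def value(x):
    # Smooth bounded test function standing in for V.
    return math.cos(x)


def q_exact(x, a):
    # Assumed generator-based rate r + L^a V - beta V,
    # with L^a V = a V' + (sigma^2 / 2) V'' and V = cos.
    return reward(x, a) - a * math.sin(x) - (0.5 * SIGMA**2 + BETA) * math.cos(x)


def q_finite(x, a, u):
    # Closed form: E[cos(x + a u + sigma sqrt(u) Z)] = cos(x + a u) exp(-sigma^2 u / 2).
    expected_v = math.cos(x + a * u) * math.exp(-0.5 * SIGMA**2 * u)
    return (math.exp(-BETA * u) * expected_v - value(x)) / u + reward(x, a)


def q_richardson(x, a, u):
    # Richardson-corrected estimator 2 q^{u/2} - q^u.
    return 2.0 * q_finite(x, a, u / 2.0) - q_finite(x, a, u)


def sup_error(estimator, u, n_states=201, n_actions=21):
    # Grid approximation of the sup-norm bias over x in [-pi, pi], a in [-1, 1].
    xs = [-math.pi + 2.0 * math.pi * i / (n_states - 1) for i in range(n_states)]
    acts = [-1.0 + 2.0 * j / (n_actions - 1) for j in range(n_actions)]
    return max(abs(estimator(x, a, u) - q_exact(x, a)) for x in xs for a in acts)


u = 0.1
plain_ratio = sup_error(q_finite, u) / sup_error(q_finite, u / 2.0)
rich_ratio = sup_error(q_richardson, u) / sup_error(q_richardson, u / 2.0)
print(f"plain bias ratio when halving u:      {plain_ratio:.3f}")
print(f"Richardson bias ratio when halving u: {rich_ratio:.3f}")
```

A ratio near 2 corresponds to a bias of order $u$, and a ratio near 4 to a bias of order $u^2$.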

4. Hamiltonian bias and Lipschitz transfer

The next step is to push the $q$-bias through the soft Hamiltonian. We use the log-sum-exp map's Lipschitz property: for bounded functions $z$ and $\bar z$,

$$ \left| \alpha\log\!\int_{\mathcal A} e^{z(a)/\alpha}\,da - \alpha\log\!\int_{\mathcal A} e^{\bar z(a)/\alpha}\,da \right| \le \|z-\bar z\|_\infty. $$
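A quick numerical sanity check of this 1-Lipschitz property is sketched below. It assumes a discretised action set $[-1,1]$ with the integral replaced by a Riemann sum, and the function name `soft_value` is hypothetical; the discretised map satisfies the same inequality, so the observed ratio should never exceed 1.

```python
import numpy as np


def soft_value(z, alpha, da):
    # alpha * log( sum_a exp(z(a) / alpha) * da ), stabilised by subtracting the max.
    m = z.max()
    return m + alpha * np.log(np.sum(np.exp((z - m) / alpha)) * da)


rng = np.random.default_rng(0)
actions = np.linspace(-1.0, 1.0, 401)   # hypothetical action grid
da = actions[1] - actions[0]
alpha = 0.3                             # hypothetical temperature

worst_ratio = 0.0
for _ in range(1000):
    z = rng.normal(size=actions.size)
    z_bar = z + rng.uniform(-0.5, 0.5, size=actions.size)
    gap = abs(soft_value(z, alpha, da) - soft_value(z_bar, alpha, da))
    worst_ratio = max(worst_ratio, gap / np.max(np.abs(z - z_bar)))

print(f"largest observed |S(z) - S(z_bar)| / ||z - z_bar||_inf: {worst_ratio:.4f}")
```

The constant 1 is attained by a uniform shift $\bar z=z+c$, which moves the soft value by exactly $c$.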

Applying this pointwise with $z=q_V^u(x,\cdot)$ or $z=\tilde q_V^u(x,\cdot)$ yields:

$$ \|H_u^{(\alpha)}(V)-H^{(\alpha)}(V)\|_\infty \le \|q_V^u-q_V\|_\infty, $$
$$ \|\tilde H_u^{(\alpha)}(V)-H^{(\alpha)}(V)\|_\infty \le \|\tilde q_V^u-q_V\|_\infty \le C u^2. $$

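We also have the stability estimates:

$$ \|q_V^u-q_W^u\|_\infty \le \frac{2}{u}\,\|V-W\|_\infty, \qquad \|\tilde q_V^u-\tilde q_W^u\|_\infty \le \frac{8}{u}\,\|V-W\|_\infty. $$

Both follow from the triangle inequality: the reward cancels in each difference and the discounted semigroup is a sup-norm contraction. For the Richardson estimator, with $D=V-W$, the difference is $\big(4e^{-\beta u/2}\,\mathbb E_x^a[D(X_{u/2}^a)]-e^{-\beta u}\,\mathbb E_x^a[D(X_u^a)]-3D(x)\big)/u$, whose three terms contribute $4+1+3=8$.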

5. Perturbed value recursion

Under the strengthened smoothness assumptions of Section 2, the short-horizon semigroup has a sharper local expansion than in the value-only theorem:

$$ \|\Phi_\tau^{(\alpha)}(V)-T_\tau^{(\alpha)}(V)\|_\infty \le C_\Phi \tau^2. $$

Combining that with the Hamiltonian bias above gives the one-step mismatch for the Richardson flow

$$ \|\Phi_\tau^{(\alpha)}(V)-T_{\tau,u}^{R,(\alpha)}(V)\|_\infty \le C(\tau^2+\tau u^2). $$

Finally, the same contraction comparison as in the value-convergence proof yields:

$$ \|V_k-V^{(\alpha)}\|_\infty \le e^{-\rho\tau k}\,\|V_0-V^{(\alpha)}\|_\infty + \frac{C_1(\tau^2+\tau u^2)}{1-e^{-\rho\tau}}. $$

So the value iterate is still controlled by a geometric contraction term plus an accumulated local mismatch term, except the mismatch is now sharper and explicitly depends on $u$.
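The structure of this bound can be illustrated with a scalar worst-case recursion. The sketch below assumes the error obeys $e_{k+1}\le\gamma e_k+m$ with contraction factor $\gamma=e^{-\text{rate}\cdot\tau}$ and one-step mismatch $m=C(\tau^2+\tau u^2)$; the names `rate`, `C`, and all numerical values are hypothetical placeholders for the constants in the displayed estimate.

```python
import math


def value_error_bound(k, e0, tau, u, rate, C):
    # Geometric contraction term plus the accumulated one-step mismatch.
    gamma = math.exp(-rate * tau)
    mismatch = C * (tau**2 + tau * u**2)
    return gamma**k * e0 + mismatch / (1.0 - gamma)


def iterate_worst_case(n_steps, e0, tau, u, rate, C):
    # Iterate e_{k+1} = gamma * e_k + mismatch, i.e. the recursion with equality.
    gamma = math.exp(-rate * tau)
    mismatch = C * (tau**2 + tau * u**2)
    errors = [e0]
    for _ in range(n_steps):
        errors.append(gamma * errors[-1] + mismatch)
    return errors


e0, tau, u, rate, C = 1.0, 0.01, 0.1, 1.0, 1.0   # hypothetical values
errors = iterate_worst_case(2000, e0, tau, u, rate, C)
bounds = [value_error_bound(k, e0, tau, u, rate, C) for k in range(len(errors))]
floor = C * (tau + u**2) / rate   # small-tau approximation of the error floor

print(f"error after 2000 steps: {errors[-1]:.5f}")
print(f"bound after 2000 steps: {bounds[-1]:.5f}")
print(f"approximate floor C (tau + u^2) / rate: {floor:.5f}")
```

The iterates stay below the bound at every step and settle near the floor, which is the accumulated mismatch rather than zero.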

6. Transfer from $V$ to $q$

The final step is a triangle inequality that separates approximation and stability:

$$ \|\tilde q_{V_k}^u-q_{V^{(\alpha)}}\|_\infty \le \|\tilde q_{V_k}^u-\tilde q_{V^{(\alpha)}}^u\|_\infty + \|\tilde q_{V^{(\alpha)}}^u-q_{V^{(\alpha)}}\|_\infty. $$

The first term is controlled by the Lipschitz-in-$V$ bound, and the second term is the Richardson bias at the fixed point. Together they give:

$$ \|\tilde q_{V_k}^u-q_{V^{(\alpha)}}\|_\infty \le \frac{L}{u}\,\|V_k-V^{(\alpha)}\|_\infty + C_2 u^2. $$

7. Final limit when $\tau/u\to 0$

For small $\tau$ the denominator satisfies $1-e^{-\rho\tau}\approx\rho\tau$, so the accumulated mismatch term in the Richardson value bound is of order $\tau+u^2$, while the $q$-bound multiplies the value error by $1/u$. Once the contraction term has decayed, the combined error therefore behaves like

$$ \frac{1}{u}(\tau+u^2)+u^2. $$

This expression equals $\tau/u+u+u^2$, so it vanishes as soon as $u\to 0$ and $\tau/u\to 0$. In this regime, the learned finite-horizon rate converges to the true infinitesimal advantage-rate.
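The role of the coupling between $\tau$ and $u$ can be seen by evaluating the combined error along two schedules. This is a sketch with all constants set to 1, and the schedules $\tau=u^2$ and $\tau=u$ are illustrative choices, not ones prescribed by the theorem.

```python
def combined_error(tau, u):
    # (1/u) * (tau + u^2) + u^2, with all constants dropped.
    return (tau + u**2) / u + u**2


us = [0.1 / 2**i for i in range(6)]

# Schedule with tau / u -> 0 (here tau = u^2): the error vanishes.
coupled = [combined_error(u**2, u) for u in us]

# Schedule with tau / u = 1 (here tau = u): the error stalls near 1.
uncoupled = [combined_error(u, u) for u in us]

for u, good, bad in zip(us, coupled, uncoupled):
    print(f"u = {u:.5f}   tau = u^2: {good:.5f}   tau = u: {bad:.5f}")
```

Along the first schedule the error shrinks roughly in proportion to $u$, while along the second it never drops below 1 however small $u$ becomes.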