Theory

The theoretical analysis is organized in three layers. First, the ideal Picard-Hamiltonian iteration is studied at the analytic level. Next, finite-horizon and Richardson approximations are inserted to recover model-free learning. Finally, these guarantees are lifted to the practical single-critic implementation of CT-SAC and CT-TD3.

Theory roadmap. The key reference object is the entropy-regularized dynamic-programming semigroup $\Phi_{\tau}^{(\alpha)}$, which is a contraction in the sup-norm even though the Hamiltonian update itself is not.

Why compare against a semigroup

The ideal value update in the paper is the Picard-Hamiltonian step

T_{\tau}^{(\alpha)}(V)=V+\tau H^{(\alpha)}(V).

Unlike a discrete-time Bellman operator, this map contains the infinitesimal generator term $L^aV$ and is therefore not a contraction on $(C_b,\|\cdot\|_\infty)$. Instead of forcing a direct analytic contraction argument, the paper introduces a probabilistic dynamic-programming reference operator:

\begin{aligned} (\Phi_{\tau}^{(\alpha)}V)(x) :=\sup_{\pi}\,\mathbb E\Bigg[ \int_0^{\tau} e^{-\beta t}\big(r(X_t,a_t)-\alpha\log \pi(a_t\mid X_t)\big)\,dt + e^{-\beta\tau}V(X_{\tau}) \;\Big|\; X_0=x \Bigg]. \end{aligned}

This semigroup is value-only and serves as the contraction anchor for the whole proof strategy. The Picard-Hamiltonian update is then treated as a local approximation to this semigroup.
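The sup-norm contraction can be checked numerically on a toy discrete analog. The sketch below is not the continuous-time operator $\Phi_{\tau}^{(\alpha)}$ itself: it assumes a finite state and action space, a single decision per step of length $\tau$, and a randomly generated model. The names `soft_bellman`, `random_model`, and `sup_dist` are hypothetical helpers introduced only for this illustration.

```python
import math
import random

def soft_bellman(V, r, P, tau, beta, alpha):
    """Toy entropy-regularized Bellman step over one interval of length tau.

    V[x]: value at state x; r[x][a]: reward rate; P[x][a][y]: transition probs.
    Assumption: one action per interval, so the entropy bonus has weight alpha * tau.
    """
    gamma = math.exp(-beta * tau)   # discount accumulated over one step
    temp = alpha * tau              # effective soft-max temperature
    out = []
    for x in range(len(V)):
        z = []
        for a in range(len(r[x])):
            cont = sum(p * v for p, v in zip(P[x][a], V))
            z.append((tau * r[x][a] + gamma * cont) / temp)
        m = max(z)                  # subtract the max for a stable log-sum-exp
        out.append(temp * (m + math.log(sum(math.exp(zi - m) for zi in z))))
    return out

def sup_dist(V, W):
    """Sup-norm distance between two value vectors."""
    return max(abs(v - w) for v, w in zip(V, W))

def random_model(n_states, n_actions, rng):
    """Random reward rates and row-stochastic transition kernels."""
    r = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            s = sum(w)
            rows.append([wi / s for wi in w])
        P.append(rows)
    return r, P

rng = random.Random(0)
tau, beta, alpha = 0.1, 1.0, 0.5
r, P = random_model(5, 3, rng)
V = [rng.uniform(-10, 10) for _ in range(5)]
W = [rng.uniform(-10, 10) for _ in range(5)]

lhs = sup_dist(soft_bellman(V, r, P, tau, beta, alpha),
               soft_bellman(W, r, P, tau, beta, alpha))
rhs = math.exp(-beta * tau) * sup_dist(V, W)
print(lhs <= rhs + 1e-9)  # contraction with factor exp(-beta * tau)
```

The contraction here comes from two facts: the log-sum-exp is non-expansive in the sup-norm, and the continuation value is multiplied by $e^{-\beta\tau}$.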

Main proof idea. Show that $T_{\tau}^{(\alpha)}$ is close to $\Phi_{\tau}^{(\alpha)}$ for small $\tau$, then let the contraction of $\Phi_{\tau}^{(\alpha)}$ absorb the accumulated local error.
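As a rough illustration of how a contraction absorbs a per-step error, the sketch below iterates the worst-case scalar recursion $e_{k+1}=\gamma e_k+\varepsilon$ with $\gamma=e^{-\beta\tau}$ and compares it with the geometric-series bound $\gamma^k e_0+\varepsilon/(1-\gamma)$. The constants (including the value 2.0 standing in for a mismatch constant) are hypothetical and chosen only for the demonstration.

```python
import math

def error_recursion(e0, gamma, eps, n_steps):
    """Iterate the worst-case recursion e_{k+1} = gamma * e_k + eps."""
    errs = [e0]
    for _ in range(n_steps):
        errs.append(gamma * errs[-1] + eps)
    return errs

def closed_form_bound(e0, gamma, eps, k):
    """Geometric-series bound: gamma^k * e0 + eps / (1 - gamma)."""
    return gamma ** k * e0 + eps / (1.0 - gamma)

beta, tau = 1.0, 0.05
gamma = math.exp(-beta * tau)   # contraction factor of the reference operator
eps = 2.0 * tau ** 1.5          # hypothetical one-step mismatch of order tau^{3/2}
errs = error_recursion(e0=1.0, gamma=gamma, eps=eps, n_steps=400)

# Every iterate stays below the closed-form bound.
print(all(errs[k] <= closed_form_bound(1.0, gamma, eps, k) + 1e-12
          for k in range(len(errs))))
```

The accumulated error never exceeds $\varepsilon/(1-\gamma)$ in the limit, regardless of how many steps are taken.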

Theorem 1

Value function convergence

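For the ideal Picard-Hamiltonian iteration $V_{k+1}=T_{\tau}^{(\alpha)}(V_k)$, the distance to $V^{(\alpha)}$ is bounded by an exponentially decaying contraction term plus a local approximation error. Because $1-e^{-\beta\tau}\approx\beta\tau$ for small $\tau$, the approximation term is of order $\tau^{1/2}$, so the limiting error as $k\to\infty$ vanishes as $\tau\to 0$.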

\|V_k - V^{(\alpha)}\|_\infty \le C_0 e^{-\beta\tau k} + \frac{C_1\tau^{3/2}}{1-e^{-\beta\tau}}.

Sketch. Compare the approximate update $T_{\tau}^{(\alpha)}$ to the semigroup iteration $W_{k+1}=\Phi_{\tau}^{(\alpha)}(W_k)$. A short-time Itô expansion yields a one-step mismatch of order $\tau^{3/2}$. The contraction of $\Phi_{\tau}^{(\alpha)}$ then sums these per-step mismatches into a geometric series, which produces the factor $1/(1-e^{-\beta\tau})$ and keeps the accumulated error bounded across iterations.
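The two terms of the Theorem 1 bound pull in opposite directions as $\tau$ shrinks. The sketch below evaluates the residual term $C_1\tau^{3/2}/(1-e^{-\beta\tau})$ and the number of iterations needed for the transient $C_0e^{-\beta\tau k}$ to fall below a tolerance. The constants $C_0=C_1=\beta=1$ and the function names are placeholders, not values from the paper.

```python
import math

def residual(tau, beta=1.0, c1=1.0):
    """Steady-state term of the bound: C1 * tau^{3/2} / (1 - exp(-beta * tau))."""
    return c1 * tau ** 1.5 / (1.0 - math.exp(-beta * tau))

def iterations_for_transient(tau, tol, beta=1.0, c0=1.0):
    """Smallest k with C0 * exp(-beta * tau * k) <= tol."""
    return math.ceil(math.log(c0 / tol) / (beta * tau))

for tau in (0.1, 0.01, 0.001):
    # The last column approaches C1 / beta, i.e. the residual scales like sqrt(tau).
    print(tau, residual(tau), residual(tau) / math.sqrt(tau),
          iterations_for_transient(tau, tol=1e-3))
```

Under these placeholder constants, a smaller step size lowers the residual at the price of proportionally more iterations.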

Theorem 2

$q$-convergence with finite-horizon and Richardson correction

In the model-free setting, the analytic rate $q_V$ is replaced by the finite-horizon approximation $q_V^u$, and then improved through Richardson extrapolation. Under stronger smoothness assumptions, both the value error and the corrected $q$-error can be controlled quantitatively:

\|V_k - V^{(\alpha)}\|_\infty \le e^{-\rho\tau k}\,\|V_0 - V^{(\alpha)}\|_\infty + \frac{C_1(\tau^2 + \tau u^2)}{1-e^{-\rho\tau}},

\|\tilde q_{V_k}^u - q_{V^{(\alpha)}}\|_\infty \le \frac{L}{u}\,\|V_k - V^{(\alpha)}\|_\infty + C_2 u^2.

Sketch. The key upgrade is that Richardson cancellation reduces the finite-horizon bias from first order to $O(u^2)$. That sharper local bias leads to a stronger one-step approximation of the semigroup. The final $q$-bound then follows by reducing the $q$-error to the value error through a Lipschitz estimate in $V$, whose constant scales like $L/u$.
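The $q$-error bound has one term that grows as $u\to 0$ and one that grows with $u$, which suggests a balance between them. The sketch below treats the value error as a fixed number $e$ (an assumption, since the value bound itself depends on $u$) and minimizes $Le/u+C_2u^2$ in closed form. The constants and function names are hypothetical.

```python
def q_error_bound(u, value_err, lip=1.0, c2=1.0):
    """Right-hand side of the q-bound: (L / u) * value_err + C2 * u^2."""
    return lip / u * value_err + c2 * u ** 2

def best_horizon(value_err, lip=1.0, c2=1.0):
    """Minimizer of the bound: d/du = 0 gives u^3 = L * value_err / (2 * C2)."""
    return (lip * value_err / (2.0 * c2)) ** (1.0 / 3.0)

value_err = 1e-3                 # placeholder for the sup-norm value error
u_star = best_horizon(value_err)
for u in (0.5 * u_star, u_star, 2.0 * u_star):
    print(u, q_error_bound(u, value_err))
```

At the minimizer the bound equals $3C_2u_*^2$, so under this simplification the $q$-error scales like the value error to the power $2/3$.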

Theorem 3

Single-critic algorithm convergence

The practical critic update used by CT-SAC and CT-TD3 is not a separate object: it is an algebraic rewriting of the decoupled $(V_k,q_k)$ iteration. In the exact population setting, the learned critic converges to $Q^{(\alpha)} = V^{(\alpha)} + q_{V^{(\alpha)}}$.

With finite samples, the theory adds a regression error term at each update and shows that sufficiently accurate fitting drives the learned critic arbitrarily close to the analytic iterate.

Proof-sketch narrative

Layer 1 · Ideal flow

Start with the analytic Hamiltonian flow $V_{k+1}=V_k+\tau H^{(\alpha)}(V_k)$ and compare it against the contraction semigroup $\Phi_{\tau}^{(\alpha)}$.

Layer 2 · Model-free correction

Replace $q_V$ by $q_V^u$ and then by the Richardson-corrected $\tilde q_V^u$. The sharper bias control yields better local error terms and convergence for both $V$ and $q$.
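The effect of Richardson correction can be seen on a deterministic stand-in. The sketch below assumes the finite-horizon rate behaves like a forward difference quotient over horizon $u$ and uses the standard two-point combination $2D(u/2)-D(u)$; the paper's exact definition of $\tilde q_V^u$ may differ. The test function `f` is an arbitrary smooth curve with known derivative $-1$ at zero.

```python
import math

def forward_rate(f, u):
    """Finite-horizon difference quotient (f(u) - f(0)) / u, with O(u) bias."""
    return (f(u) - f(0.0)) / u

def richardson_rate(f, u):
    """Two-point Richardson combination; cancels the O(u) bias term."""
    return 2.0 * forward_rate(f, u / 2.0) - forward_rate(f, u)

def f(t):
    """Smooth test curve with f'(0) = -1."""
    return math.exp(-t) * math.cos(2.0 * t)

true_rate = -1.0
for u in (0.1, 0.05, 0.025):
    print(u,
          abs(forward_rate(f, u) - true_rate),      # shrinks roughly like u
          abs(richardson_rate(f, u) - true_rate))   # shrinks roughly like u^2
```

Halving $u$ roughly halves the uncorrected error and roughly quarters the corrected one, which matches the first-order versus $O(u^2)$ bias described above.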

Layer 3 · Practical algorithm

Show that the single critic $Q\approx V+q$ preserves the same structure, so the closed-form theory carries over to the critic used in CT-SAC and CT-TD3.

Random-time and sample-complexity

We extend the same analysis to random holding times $U$ by averaging the deterministic bias bounds over the law of $U$. We also develop a more refined statistical argument that transfers $L^2$ regression control into sup-norm control under compactness, coverage, and Lipschitz assumptions, leading to regret that is sublinear in the total number of samples.
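To make the averaging step concrete, the sketch below assumes a deterministic bias bound of the form $C_2u^2$ and an exponentially distributed holding time (both are assumptions for illustration). Averaging over the law of $U$ then gives $C_2\,\mathbb E[U^2]$, which is estimated by seeded Monte Carlo and compared with the closed form $2\,\mathbb E[U]^2$ for the exponential law.

```python
import random

def averaged_bias_bound(sample_u, c2=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of E[C2 * U^2] for holding times drawn by sample_u."""
    rng = random.Random(seed)
    return c2 * sum(sample_u(rng) ** 2 for _ in range(n)) / n

mean_u = 0.1
# Exponential holding time with mean mean_u (rate 1 / mean_u).
mc = averaged_bias_bound(lambda rng: rng.expovariate(1.0 / mean_u))
exact = 2.0 * mean_u ** 2        # E[U^2] = 2 * mean^2 for an exponential law
deterministic = mean_u ** 2      # bound for a fixed horizon u = mean_u

print(mc, exact, deterministic)
```

In this example the averaged bound is twice the deterministic one at the same mean horizon, because the second moment of $U$, not its mean, controls the bias.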

Theory detail pages

Value convergence. Semigroup contraction, one-step mismatch, and the proof structure behind the ideal Picard-Hamiltonian iteration.

$q$-convergence. Finite-horizon bias, Richardson cancellation, Hamiltonian Lipschitz lemmas, and the final $q$-error transfer.

Algorithm convergence. How the single critic is induced from $(V,q)$ and how finite-sample regression error enters the recursion.

Random time and regret. Averaged bias bounds for random holding times and the refined sample-complexity discussion.