Algorithm convergence

We close the loop between the abstract decoupled theory and the practical critic update used in CT-SAC and CT-TD3. The key point is that the algorithm is not merely inspired by the theory: the single-critic recursion is an algebraic rewriting of the $(V_k,q_k)$ iteration. Once that equivalence is established, the algorithmic convergence theorem follows by combining the earlier value-convergence and $q$-convergence results, and the statistical corollary follows from a one-step regression error recursion.

Proof roadmap

The argument proceeds in four steps. First, rewrite the decoupled $(V_k,q_k)$ iteration as a single-critic update. Next, prove that the decomposition $Q_k=V_k+q_k$ is preserved across iterations. Then transfer the value-convergence and $q$-convergence bounds to the critic itself. Finally, control the learned critic with a regression-error recursion around the analytic update.

Main statements

Analytic theorem. Under the exact update defined by the single-critic target, $Q_k$ converges to the optimal critic $Q^{(\alpha)} = V^{(\alpha)} + q_{V^{(\alpha)}}$.

Statistical corollary. If sufficiently many samples are used in each regression step, the learned critic $\hat Q_k$ tracks the exact iterate $Q_k$ closely enough that, for any $\epsilon,\delta>0$, there is an iteration count $L$ for which $\mathbb E[|\hat Q_L(X,A)-Q^{(\alpha)}(X,A)|^2] < \epsilon$ with probability at least $1-\delta$.

1. From the decoupled iteration to a single critic

We begin by fixing an algorithmic flow step $\tau>0$ and a holding time $u>0$, with $\gamma=e^{-\beta u}$. For a value function $V$, recall the finite-horizon rate

q_V^u(x,a)=\frac{\gamma\,\mathbb E_x^a[V(X_u)]-V(x)}{u}+r(x,a),

or equivalently, with $\Delta_u V(x,a):=\gamma\,\mathbb E_x^a[V(X_u)]-V(x)$,

q_V^u(x,a)=\frac{\Delta_u V(x,a)}{u}+r(x,a).

Suppose the decoupled scheme updates the value through the Hamiltonian-flow step

V_{k+1}(x)=V_k(x)+\tau\,\mathcal Q^{(\alpha)}(q_k)(x),

and then refreshes the rate via

q_{k+1}(x,a)=\frac{\gamma\,\mathbb E_x^a[V_{k+1}(X_u)]-V_{k+1}(x)}{u}+r(x,a).

Defining the numerical critic by $Q_k(x,a):=V_k(x)+q_k(x,a)$, we prove a lemma showing that this pairwise update induces exactly the single-critic recursion used in the algorithm. Therefore, the practical critic target is, in fact, the algebraic form of the decoupled theory.

2. The closed-form critic update

Writing $\tilde Q_k(x,a):=Q_k(x,a)-\alpha\log\pi_k(a\mid x)$ and using the Boltzmann policy $\pi_k(a\mid x)\propto \exp(Q_k(x,a)/\alpha)$, the resulting single-critic update is

\begin{aligned} Q_{k+1}(x,a) &= (1-\tau)Q_k(x,a) + \tau r(x,a) + \tau\,\mathbb E_{a'\sim\pi_k(\cdot\mid x)}\big[\tilde Q_k(x,a')\big]\\ &\quad + \tau\,\frac{\gamma\,\mathbb E_x^a\Big[\mathbb E_{a'\sim\pi_k(\cdot\mid X_u)}\big[\tilde Q_k(X_u,a')\big]\Big] - \mathbb E_{a'\sim\pi_k(\cdot\mid x)}\big[\tilde Q_k(x,a')\big]}{u}. \end{aligned}

This is exactly the continuous-time critic update in the main algorithm. The derivation proceeds by first rewriting the value update in terms of $\tilde Q_k$, substituting that into the definition of $q_{k+1}$, rearranging terms, and then adding $V_{k+1}$ back to recover $Q_{k+1}=V_{k+1}+q_{k+1}$.
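The derivation steps just listed can be written out as follows. This is a sketch under one assumption not restated on this page: that the Hamiltonian term satisfies $\mathbb E_{a'\sim\pi_k(\cdot\mid x)}[\tilde Q_k(x,a')]=V_k(x)+\mathcal Q^{(\alpha)}(q_k)(x)$, which holds when $\mathcal Q^{(\alpha)}$ is the soft-max (log-partition) functional and $\pi_k$ is the Boltzmann policy. The shorthand $m_k$ is introduced here only for readability, and the transition expectation $\mathbb E_x^a$ over $X_u$ is written explicitly.

$$m_k(x):=\mathbb E_{a'\sim\pi_k(\cdot\mid x)}\big[\tilde Q_k(x,a')\big]$$

$$\begin{aligned}
V_{k+1}(x) &= V_k(x)+\tau\big(m_k(x)-V_k(x)\big) = (1-\tau)V_k(x)+\tau\,m_k(x),\\
q_{k+1}(x,a) &= (1-\tau)\,\frac{\gamma\,\mathbb E_x^a[V_k(X_u)]-V_k(x)}{u} + \tau\,\frac{\gamma\,\mathbb E_x^a[m_k(X_u)]-m_k(x)}{u} + r(x,a)\\
&= (1-\tau)\,q_k(x,a) + \tau\,r(x,a) + \tau\,\frac{\gamma\,\mathbb E_x^a[m_k(X_u)]-m_k(x)}{u},\\
Q_{k+1}(x,a) &= V_{k+1}(x)+q_{k+1}(x,a)\\
&= (1-\tau)\,Q_k(x,a) + \tau\,r(x,a) + \tau\,m_k(x) + \tau\,\frac{\gamma\,\mathbb E_x^a[m_k(X_u)]-m_k(x)}{u}.
\end{aligned}$$

The second line for $q_{k+1}$ uses $q_k=q_{V_k}^u$, that is, $\big(\gamma\,\mathbb E_x^a[V_k(X_u)]-V_k(x)\big)/u=q_k(x,a)-r(x,a)$, which holds at initialization by assumption and after every refresh of the rate.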

3. Preservation of the decomposition

The next lemma is an induction argument. Assume the initialization satisfies the finite-horizon relation $q_0=q_{V_0}^u$ and define $Q_0:=V_0+q_0$. If $(V_k,q_k)$ are generated by the decoupled update and $Q_k$ by the single-critic update above, then for every $k\ge 0$,

$$Q_k(x,a)=V_k(x)+q_k(x,a).$$

This is crucial because it allows the convergence proof to continue in the $(V,q)$ world while still concluding a theorem for the practical single critic. The algorithm never leaves the theoretical decomposition even though it is implemented with one network.
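The preservation property can be checked numerically on a small tabular problem. The following is a minimal sketch, not the CT-SAC / CT-TD3 implementation: it assumes finitely many states and actions, a hypothetical transition array `P[x, a, y]` standing in for the law of $X_u$, and the soft-max form $\mathcal Q^{(\alpha)}(q)(x)=\alpha\log\sum_a \exp(q(x,a)/\alpha)$ for the Hamiltonian. All function names and parameter values are illustrative.

```python
import numpy as np

def soft_max_value(Q, alpha):
    """Return alpha * log(sum_a exp(Q[x, a] / alpha)) per state, computed stably."""
    z = Q / alpha
    z_max = z.max(axis=1)
    return alpha * (z_max + np.log(np.exp(z - z_max[:, None]).sum(axis=1)))

def rate(V, P, r, gamma, u):
    """Finite-horizon rate q_V^u; P[x, a, y] is the assumed law of X_u given (x, a)."""
    return (gamma * (P @ V) - V[:, None]) / u + r

def decoupled_step(V, q, P, r, gamma, u, tau, alpha):
    """Value step along the (assumed soft-max) Hamiltonian, then rate refresh."""
    V_next = V + tau * soft_max_value(q, alpha)
    return V_next, rate(V_next, P, r, gamma, u)

def single_critic_step(Q, P, r, gamma, u, tau, alpha):
    """Single-critic recursion with Boltzmann policy proportional to exp(Q / alpha)."""
    log_pi = (Q - soft_max_value(Q, alpha)[:, None]) / alpha
    pi = np.exp(log_pi)
    Q_tilde = Q - alpha * log_pi
    m = (pi * Q_tilde).sum(axis=1)  # policy average of Q_tilde at each state
    return ((1 - tau) * Q + tau * r + tau * m[:, None]
            + tau * (gamma * (P @ m) - m[:, None]) / u)

def max_decomposition_gap(n_iters=30, n_states=5, n_actions=3, seed=0):
    """Run both recursions side by side; return the largest |Q_k - (V_k + q_k)|."""
    rng = np.random.default_rng(seed)
    beta, u, tau, alpha = 1.0, 0.1, 0.05, 0.5  # illustrative values only
    gamma = np.exp(-beta * u)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)          # normalize to a transition kernel
    r = rng.normal(size=(n_states, n_actions))
    V = rng.normal(size=n_states)
    q = rate(V, P, r, gamma, u)                # initialization q_0 = q_{V_0}^u
    Q = V[:, None] + q
    gap = 0.0
    for _ in range(n_iters):
        V, q = decoupled_step(V, q, P, r, gamma, u, tau, alpha)
        Q = single_critic_step(Q, P, r, gamma, u, tau, alpha)
        gap = max(gap, float(np.abs(Q - (V[:, None] + q)).max()))
    return gap
```

Calling `max_decomposition_gap()` returns the largest entrywise difference between $Q_k$ and $V_k+q_k$ over the run. Under the stated assumptions it should sit at floating-point round-off level, since the two recursions are algebraically identical.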

4. Analytic convergence of the critic

Once the decomposition is preserved, the error with respect to the optimal critic splits cleanly as:

Q_k-Q^{(\alpha)}=(V_k-V^{(\alpha)})+(q_k-q_{V^{(\alpha)}}).

Taking the sup-norm and applying the triangle inequality gives:

\|Q_k-Q^{(\alpha)}\|_\infty \le \|V_k-V^{(\alpha)}\|_\infty + \|q_k-q_{V^{(\alpha)}}\|_\infty.

The first term is controlled by the value-convergence theorem, and the second by the $q$-convergence theorem. Therefore, the single critic inherits convergence directly from the two earlier components. Hence, the practical CT-SAC / CT-TD3 critic remains faithful to the continuous-time fixed point $Q^{(\alpha)}=V^{(\alpha)}+q_{V^{(\alpha)}}$.

5. Statistical ingredients for the learned critic

The statistical extension introduces the learned critic $\hat Q_k$, produced by neural regression against the population target operator $\mathcal F_{\tau,u}$. At iteration $k$, one draws i.i.d. transition samples and fits the next critic by minimizing an empirical regression loss. We then invoke standard learning-theoretic tools: Rademacher complexity, the Lipschitz contraction property for losses, and high-probability generalization bounds for bounded neural-network classes. These ingredients yield a one-step statistical error term of the form:

\mathrm{stat}_k:=\|\hat Q_{k+1}-\mathcal F_{\tau,u}(\hat Q_k)\|_{L^2(\nu)},

together with the usual high-probability estimate:

\mathrm{stat}_k \lesssim \frac{1}{\sqrt{n_k}}\Big(1+\sqrt{\log d}+\sqrt{\log(1/\delta)}\Big).

Hence, the critic regression error at iteration $k$ can be made arbitrarily small by increasing the batch size $n_k$ used at that iteration.
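To see how the batch size scales, the estimate can be inverted for $n_k$. The helper below is a hedged sketch: the constant `C` is a hypothetical stand-in for the unspecified constant hidden in $\lesssim$, `d` is the quantity appearing inside the logarithm of the bound, and the function name is illustrative.

```python
import math

def required_batch_size(target, d, delta, C=1.0):
    """Smallest n with C / sqrt(n) * (1 + sqrt(log d) + sqrt(log(1/delta))) <= target.

    C is a hypothetical constant replacing the one hidden by the bound's
    "less than or approximately" sign; it is not given in the text.
    """
    if target <= 0 or C <= 0 or d < 1 or not (0 < delta < 1):
        raise ValueError("need target > 0, C > 0, d >= 1 and 0 < delta < 1")
    factor = 1.0 + math.sqrt(math.log(d)) + math.sqrt(math.log(1.0 / delta))
    return math.ceil((C * factor / target) ** 2)
```

Because the bound decays like $1/\sqrt{n_k}$, halving the target error roughly quadruples the required batch size, while the dependence on $d$ and $\delta$ is only logarithmic.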

6. Lipschitz stability of the critic operator

A second ingredient is that, when the policy is treated as fixed during the policy-evaluation comparison, the mapping $\mathcal F_{\tau,u}$ is Lipschitz in $L^2(\nu)$:

\|\mathcal F_{\tau,u}(Q)-\mathcal F_{\tau,u}(Q')\|_{L^2(\nu)} \le \Big(1+\frac{\tau}{u}\Big)\|Q-Q'\|_{L^2(\nu)}.

This bound comes from the affine structure of the update and Jensen's inequality for the policy-averaged terms. It is what allows one-step regression errors to be propagated across iterations.

7. Statistical recursion and final corollary

Define the tracking error between the learned and exact critics by:

\epsilon_k^{(2)}:=\|\hat Q_k-Q_k\|_{L^2(\nu)}.

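Inserting $\mathcal F_{\tau,u}(\hat Q_k)$, using $Q_{k+1}=\mathcal F_{\tau,u}(Q_k)$, and applying the triangle inequality together with the Lipschitz bound gives the recursion: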

\epsilon_{k+1}^{(2)} \le \mathrm{stat}_k + \Big(1+\frac{\tau}{u}\Big)\epsilon_k^{(2)}.

Unrolling this recursion shows that if the sequence of one-step regression errors is chosen small enough, then the total learned-to-exact gap remains small at the target iteration $L$. Combining this with the already established analytic convergence of $Q_L$ to $Q^{(\alpha)}$ yields the main statistical conclusion:

\|\hat Q_L-Q^{(\alpha)}\|_{L^2(\nu)} \le \underbrace{\|Q_L-Q^{(\alpha)}\|_{L^2(\nu)}}_{\text{algorithmic error}} + \underbrace{\|\hat Q_L-Q_L\|_{L^2(\nu)}}_{\text{statistical error}},

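and each of the two terms can be made small: the algorithmic error by the analytic theorem (when $\nu$ is a probability measure, the $L^2(\nu)$ norm is dominated by the sup-norm), and the statistical error by the recursion above. Making each term smaller than $\sqrt{\epsilon}/2$ bounds the squared $L^2(\nu)$ error by $\epsilon$, which proves the corollary.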
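For reference, the unrolled form of the tracking-error recursion can be written explicitly. The shorthand $c:=1+\tau/u$, the uniform per-step tolerance $\eta$, and the target gap $\theta$ are introduced here for illustration only, and the simplification in the second display assumes the learned and exact critics share their initialization, so that $\epsilon_0^{(2)}=0$.

$$\epsilon_L^{(2)} \le c^{L}\,\epsilon_0^{(2)} + \sum_{k=0}^{L-1} c^{\,L-1-k}\,\mathrm{stat}_k.$$

$$\text{If } \mathrm{stat}_k\le\eta \text{ for all } k<L \text{ and } \epsilon_0^{(2)}=0:\qquad \epsilon_L^{(2)} \le \eta\,\frac{c^{L}-1}{c-1} = \eta\,\frac{u}{\tau}\Big(\Big(1+\frac{\tau}{u}\Big)^{L}-1\Big).$$

A learned-to-exact gap of at most $\theta$ is therefore guaranteed once $\eta \le \theta\,\tau/\big(u\,(c^{L}-1)\big)$. Since $c>1$, the admissible per-step tolerance shrinks geometrically in $L$, so later target iterations call for more accurate regression steps.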

8. Final interpretation

The value-convergence and $q$-convergence pages establish the analytic behavior of the decoupled flow. This page shows that the implemented critic is exactly the same object, written in single-critic form, and that finite-sample learning does not break the link as long as regression errors are controlled. Together, these results give a complete chain from continuous-time stochastic control, to the decoupled $(V,q)$ theory, to the critic used in CT-SAC and CT-TD3, to a statistical convergence statement for the learned network.