Continuous-Time RL

Value convergence

We present the convergence proof for the ideal Picard-Hamiltonian iteration $V_{k+1}=T_\tau^{(\alpha)}(V_k)$. The main idea is to compare the explicit Hamiltonian step with a short-horizon dynamic-programming semigroup that is a strict sup-norm contraction, and to treat the Hamiltonian update as a controlled local approximation of it.

Proof roadmap

We organize the argument into three layers: define a contraction reference operator $\Phi_\tau^{(\alpha)}$, prove that one Hamiltonian step matches that operator up to order $\tau^{3/2}$, and then propagate the local mismatch through a recursive error bound.

  • Reference: $\Phi_\tau^{(\alpha)}$
  • Update: $T_\tau^{(\alpha)}(V)=V+\tau H^{(\alpha)}(V)$
  • Local error: $\mathcal O(\tau^{3/2})$

  1. Contraction reference: dynamic-programming semigroup.
  2. Local expansion: fixed action + fixed policy.
  3. Per-step mismatch: the operator gap is small.
  4. Error recursion: local error plus contraction.
  5. Final theorem: geometric decay plus discretization error.

Function space

The proof works on bounded continuous value functions $\mathcal D=C_b(\mathbb R^d)$ with sup norm $\|V\|_\infty$, and uses the controlled diffusion, generator, Hamiltonian, and short-horizon semigroup introduced in the appendix preliminaries.

Assumptions

We assume globally Lipschitz dynamics, a bounded Lipschitz reward, and enough regularity that $L^aV$ is also Lipschitz, uniformly in $a$. These assumptions justify the short-time expansion with an $\mathcal O(\tau^{3/2})$ remainder.

Main theorem on this page

$$ \|V_k - V^{(\alpha)}\|_\infty \le C_0 e^{-\beta \tau k} + \frac{C_1\tau^{3/2}}{1-e^{-\beta\tau}}. $$

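The first term comes from contraction of the semigroup iteration; the second is the accumulated price of replacing that semigroup by one explicit Picard-Hamiltonian step. In the derivation below, $C_0=\|V_0-V^{(\alpha)}\|_\infty$ and $C_1=C_\alpha$, the constant in the one-step mismatch bound.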

1. Setup and reference operators

The appendix starts from the entropy-regularized short-horizon dynamic-programming operator

$$ (\Phi_\tau^{(\alpha)}V)(x) := \sup_{\pi}\, \mathbb E_x^\pi\!\left[ \int_0^\tau e^{-\beta t} \big(r(X_t,a_t)-\alpha\log\pi(a_t\mid X_t)\big)\,dt + e^{-\beta\tau}V(X_\tau) \right]. $$

It compares this to the ideal Picard-Hamiltonian step

$$ (T_\tau^{(\alpha)}V)(x):=V(x)+\tau H^{(\alpha)}(V)(x), $$

where the Hamiltonian is built from the instantaneous advantage-rate

$$ q_V(x,a)=r(x,a)+(L^aV)(x)-\beta V(x). $$

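The strategy exploits the complementary strengths of the two operators: $\Phi_\tau^{(\alpha)}$ is easy to control globally because it is a contraction, while $T_\tau^{(\alpha)}$ is easy to compute locally because it is a single explicit step.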

2. Why the semigroup is the contraction reference

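For any two bounded value functions $V$ and $W$ and any fixed policy $\pi$, the running reward and entropy terms are identical in the two evaluations and cancel, leaving only the discounted terminal term $e^{-\beta\tau}\,\mathbb E_x^\pi[V(X_\tau)-W(X_\tau)]$. Since $|\sup_\pi f-\sup_\pi g|\le\sup_\pi|f-g|$, taking the supremum over policies gives the clean estimate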

$$ \|\Phi_\tau^{(\alpha)}(V)-\Phi_\tau^{(\alpha)}(W)\|_\infty \le e^{-\beta\tau}\,\|V-W\|_\infty. $$

That immediately implies:

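  1. $\Phi_\tau^{(\alpha)}$ is a strict contraction in the sup norm, with modulus $e^{-\beta\tau}<1$ for $\beta,\tau>0$.
  2. It has a unique fixed point, by the Banach fixed-point theorem on the complete space $C_b(\mathbb R^d)$.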
  3. That fixed point is the entropy-regularized value function $V^{(\alpha)}$.
  4. Its iterates converge geometrically:
$$ \| (\Phi_\tau^{(\alpha)})^{(k)}(V_0)-V^{(\alpha)}\|_\infty \le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty. $$
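As a sanity check on the geometric rate, the following minimal sketch iterates a scalar stand-in for $\Phi_\tau^{(\alpha)}$. It assumes a stateless toy problem in which the soft reward rate is a constant `s`, so the value function is a single number and the fixed point is `s / beta`. The names `phi_step`, `iterate_phi`, `s`, `beta`, and `tau` are illustrative and do not come from the paper.

```python
import math


def phi_step(v: float, s: float, beta: float, tau: float) -> float:
    """One application of the toy short-horizon operator.

    Assumption: the reward rate `s` is constant in state and time, so the
    discounted running reward integrates to s * (1 - exp(-beta*tau)) / beta.
    """
    decay = math.exp(-beta * tau)
    return s * (1.0 - decay) / beta + decay * v


def iterate_phi(v0: float, s: float, beta: float, tau: float, steps: int) -> list:
    """Return the iterates [W_0, ..., W_steps] of the toy operator."""
    values = [v0]
    for _ in range(steps):
        values.append(phi_step(values[-1], s, beta, tau))
    return values


# Hypothetical parameter values chosen only for illustration.
beta, tau, s, v0 = 0.5, 0.1, 2.0, 0.0
v_star = s / beta  # fixed point of the toy operator
iterates = iterate_phi(v0, s, beta, tau, steps=50)

for k in (0, 10, 50):
    err = abs(iterates[k] - v_star)
    bound = math.exp(-beta * tau * k) * abs(v0 - v_star)
    print(f"k={k:3d}  error={err:.6f}  bound={bound:.6f}")
```

In this affine toy case the error meets the bound with equality, which shows that the rate $e^{-\beta\tau k}$ cannot be improved without further structure.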

3. Local expansion: from diffusion dynamics to a one-step mismatch

The local analysis has two parts.

Fixed action

Dynkin's formula expands the terminal term under constant action $a$, while Lipschitz regularity of $L^aV$ and small-time moment bounds control the remainder. This yields

$$ (\Phi_\tau^a V)(x)=V(x)+\tau q_V(x,a)+R_\tau^a(V)(x), \qquad \sup_{x,a}|R_\tau^a(V)(x)|\le C\tau^{3/2}. $$
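The $\tau^{3/2}$ rate can be traced explicitly. The following is a sketch rather than the appendix's own derivation: it assumes $(\Phi_\tau^aV)(x)=\mathbb E_x^a\big[\int_0^\tau e^{-\beta t}r(X_t,a)\,dt+e^{-\beta\tau}V(X_\tau)\big]$, that $V$ is smooth enough for Dynkin's formula, that $q_V$ is bounded, and that $\mathbb E_x^a|X_t-x|\le C_m t^{1/2}$ uniformly in $x$ and $a$ for $t\le 1$. The constants $L_q$ (a Lipschitz constant of $q_V(\cdot,a)$, uniform in $a$) and $C_m$ are illustrative names. Applying Dynkin's formula to $e^{-\beta t}V(X_t)$ gives

$$ (\Phi_\tau^aV)(x)=V(x)+\mathbb E_x^a\!\int_0^\tau e^{-\beta t}q_V(X_t,a)\,dt, $$

so that

$$ \begin{aligned} |R_\tau^a(V)(x)| &=\left|\mathbb E_x^a\!\int_0^\tau\big(e^{-\beta t}q_V(X_t,a)-q_V(x,a)\big)\,dt\right|\\ &\le\int_0^\tau\big(L_q\,\mathbb E_x^a|X_t-x|+\beta t\,\|q_V\|_\infty\big)\,dt\\ &\le\tfrac23 L_qC_m\,\tau^{3/2}+\tfrac12\beta\|q_V\|_\infty\,\tau^2. \end{aligned} $$

For $\tau\le 1$ the $\tau^2$ term is dominated by $\tau^{3/2}$, which recovers the stated bound. The half power comes entirely from the diffusive displacement $\mathbb E|X_t-x|\sim t^{1/2}$.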

Fixed policy

Under the relaxed diffusion associated with a policy $\pi$, the same short-time argument gives

$$ (\Phi_\tau^{\pi,(\alpha)}V)(x) = V(x)+\tau\Big(\tilde r^{(\alpha)}(x;\pi)+(L^\pi V)(x)-\beta V(x)\Big)+R_\tau^{\pi,(\alpha)}(V)(x), $$

again with a uniform $\mathcal O(\tau^{3/2})$ remainder.

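The entropy-KL identity then evaluates the supremum over policies of the bracketed term in closed form, and the result is the soft Hamiltonian $H^{(\alpha)}(V)$. Because the remainder is uniform, the fixed-policy expansion becomes a global operator comparison: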

$$ \Phi_\tau^{(\alpha)}(V) = V+\tau H^{(\alpha)}(V)+R_\tau^{(\alpha)}(V), \qquad \|R_\tau^{(\alpha)}(V)\|_\infty\le C_\alpha\tau^{3/2}. $$

Equivalently,

$$ \|\Phi_\tau^{(\alpha)}(V)-T_\tau^{(\alpha)}(V)\|_\infty \le C_\alpha\tau^{3/2}. $$
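The entropy-KL step can be made explicit. As a sketch, assume $\tilde r^{(\alpha)}(x;\pi)=\int\pi(a\mid x)\big(r(x,a)-\alpha\log\pi(a\mid x)\big)\,da$ and $(L^\pi V)(x)=\int\pi(a\mid x)(L^aV)(x)\,da$, with $da$ a reference measure on the action set under which the integrals below are finite. These forms are inferred from the definition of $\Phi_\tau^{(\alpha)}$ rather than quoted from the appendix. Writing $\pi_V^*(a\mid x)\propto\exp\big(q_V(x,a)/\alpha\big)$ for the Gibbs policy, the bracketed term in the fixed-policy expansion satisfies

$$ \tilde r^{(\alpha)}(x;\pi)+(L^\pi V)(x)-\beta V(x) = \alpha\log\!\int\exp\!\big(q_V(x,a)/\alpha\big)\,da \;-\; \alpha\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_V^*(\cdot\mid x)\big). $$

Since the KL term is nonnegative and vanishes only at $\pi=\pi_V^*$, the supremum over $\pi$ equals the log-partition term. Under these assumptions the soft Hamiltonian would read

$$ H^{(\alpha)}(V)(x)=\alpha\log\!\int\exp\!\big(q_V(x,a)/\alpha\big)\,da. $$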

4. Error recursion

Let

$$ W_{k+1}:=\Phi_\tau^{(\alpha)}(W_k), \qquad W_0:=V_0, $$

and define the difference between the explicit Hamiltonian iterate and the semigroup iterate by

$$ \Delta_k:=V_k-W_k. $$

Then one step of algebra gives

$$ \begin{aligned} \Delta_{k+1} &=T_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(W_k)\\ &=\big(T_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(V_k)\big) +\big(\Phi_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(W_k)\big). \end{aligned} $$

The first term is the local mismatch; the second is contracted by the semigroup. Therefore

$$ \|\Delta_{k+1}\|_\infty \le C_\alpha\tau^{3/2}+e^{-\beta\tau}\,\|\Delta_k\|_\infty. $$

Because $\Delta_0=0$, unrolling the recursion yields

$$ \|\Delta_k\|_\infty \le \frac{C_\alpha\tau^{3/2}}{1-e^{-\beta\tau}}. $$
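Unrolling the recursion gives the finite geometric sum $\|\Delta_k\|_\infty\le C_\alpha\tau^{3/2}\sum_{j=0}^{k-1}e^{-\beta\tau j}$, which is bounded by the full series. The sketch below iterates the worst case of the scalar recursion and compares it with the closed form. The function names and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def unroll_error(c_alpha: float, beta: float, tau: float, steps: int) -> list:
    """Iterate d_{k+1} = c_alpha * tau**1.5 + exp(-beta*tau) * d_k from d_0 = 0.

    This is the worst case of the recursion, taken with equality.
    """
    rho = math.exp(-beta * tau)      # contraction factor of the semigroup
    local = c_alpha * tau ** 1.5     # one-step operator mismatch
    d = 0.0
    history = [d]
    for _ in range(steps):
        d = local + rho * d
        history.append(d)
    return history


def closed_form_bound(c_alpha: float, beta: float, tau: float) -> float:
    """Limit of the geometric series: c_alpha * tau**1.5 / (1 - exp(-beta*tau))."""
    return c_alpha * tau ** 1.5 / (1.0 - math.exp(-beta * tau))


# Hypothetical constants chosen only for illustration.
errors = unroll_error(c_alpha=1.0, beta=0.5, tau=0.1, steps=200)
bound = closed_form_bound(c_alpha=1.0, beta=0.5, tau=0.1)
print(f"after 1 step: {errors[1]:.6f}, after 200 steps: {errors[-1]:.6f}, bound: {bound:.6f}")
```

The accumulated error increases monotonically toward the bound but never exceeds it, which is the sense in which the mismatch stays controlled for all $k$.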

So the explicit Hamiltonian scheme never drifts too far from the exact contraction iteration.

5. Final convergence bound and interpretation

By the triangle inequality,

$$ \|V_k-V^{(\alpha)}\|_\infty \le \|V_k-W_k\|_\infty + \|W_k-V^{(\alpha)}\|_\infty. $$

Insert the two bounds already proved:

$$ \|V_k-V^{(\alpha)}\|_\infty \le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty + \frac{C_\alpha\tau^{3/2}}{1-e^{-\beta\tau}}. $$

The interpretation is clear:

  • the term $e^{-\beta\tau k}$ is inherited from the contraction of dynamic programming,
  • the term $\tau^{3/2}/(1-e^{-\beta\tau})$ is the accumulated discretization error from using the explicit Picard-Hamiltonian step instead of the exact short-horizon semigroup.

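Since $1-e^{-\beta\tau}=\beta\tau+\mathcal O(\tau^2)$, the second term behaves like $\mathcal O(\tau^{1/2})$ for small $\tau$. So the scheme converges to the correct value function as the flow discretization step $\tau$ shrinks, provided the number of iterations grows fast enough that $k\tau\to\infty$ and the first term vanishes as well.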
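The $\tau^{1/2}$ scaling of the error floor can be checked numerically. The sketch below evaluates the discretization term for decreasing step sizes and divides by $\tau^{1/2}$; the ratio should approach $C_\alpha/\beta$. The values of `c_alpha` and `beta` are arbitrary illustrative choices, and `error_floor` and `total_bound` are hypothetical helper names.

```python
import math


def error_floor(c_alpha: float, beta: float, tau: float) -> float:
    """Discretization term c_alpha * tau**1.5 / (1 - exp(-beta*tau)).

    math.expm1 keeps the denominator accurate when beta*tau is tiny.
    """
    return c_alpha * tau ** 1.5 / (-math.expm1(-beta * tau))


def total_bound(k: int, tau: float, beta: float, c_alpha: float, initial_gap: float) -> float:
    """Right-hand side of the final bound after k iterations."""
    return math.exp(-beta * tau * k) * initial_gap + error_floor(c_alpha, beta, tau)


c_alpha, beta = 1.0, 0.5  # illustrative constants, not from the paper
taus = (0.1, 0.01, 0.001)
floors = [error_floor(c_alpha, beta, tau) for tau in taus]
ratios = [floor / math.sqrt(tau) for floor, tau in zip(floors, taus)]

for tau, floor, ratio in zip(taus, floors, ratios):
    print(f"tau={tau:<6} floor={floor:.5f}  floor/sqrt(tau)={ratio:.5f}")
```

Halving the error floor therefore requires roughly a fourfold reduction in $\tau$, and the iteration count must grow accordingly to keep $k\tau$ large.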