Value convergence
We present the convergence proof for the ideal Picard-Hamiltonian iteration $V_{k+1}=T_\tau^{(\alpha)}(V_k)$. The main idea is to compare the explicit Hamiltonian step to a short-horizon dynamic-programming semigroup that is a strict sup-norm contraction, and to treat the Hamiltonian update as a controlled local approximation of that semigroup.
Proof roadmap
We organize the argument into three layers: define a contraction reference operator $\Phi_\tau^{(\alpha)}$, prove that one Hamiltonian step matches that operator up to order $\tau^{3/2}$, and then propagate the local mismatch through a recursive error bound.
Function space
The proof works on bounded continuous value functions $\mathcal D=C_b(\mathbb R^d)$ with sup norm $\|V\|_\infty$, and uses the controlled diffusion, generator, Hamiltonian, and short-horizon semigroup introduced in the appendix preliminaries.
Assumptions
We assume globally Lipschitz dynamics, a bounded Lipschitz reward, and enough regularity that $L^aV$ is also Lipschitz uniformly in $a$. These assumptions justify the short-time expansion with an $\mathcal O(\tau^{3/2})$ remainder.
Main theorem on this page
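For every $k\ge 0$, with $V^{(\alpha)}$ the entropy-regularized value function and a constant $C$ independent of $k$ and of $\tau$ (for small $\tau$),
$$\|V_k-V^{(\alpha)}\|_\infty\le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty+\frac{C\,\tau^{3/2}}{1-e^{-\beta\tau}}.$$
The first term comes from contraction of the semigroup iteration. The second term is the accumulated price of replacing that semigroup by an explicit Picard-Hamiltonian step at every iteration.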
1. Setup and reference operators
The appendix starts from the entropy-regularized short-horizon dynamic-programming operator $\Phi_\tau^{(\alpha)}$ and compares it to the ideal Picard-Hamiltonian step $T_\tau^{(\alpha)}$, whose Hamiltonian is built from the instantaneous advantage rate.
The strategy is to prove that $\Phi_\tau^{(\alpha)}$ is easy to control globally, while $T_\tau^{(\alpha)}$ is easy to compute locally.
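For orientation, one common way to write these objects is sketched below. This is an assumed form, not a transcription of the appendix: the reward $r$, the entropy $\mathcal H$, the Lebesgue reference measure on an action set $A$, the symbols $q_V$ and $H^{(\alpha)}$, and the reading of $\alpha$ as a temperature and $\beta$ as a discount rate are all illustrative choices. Only the operator names and the generator $L^a$ come from the text.

$$\Phi_\tau^{(\alpha)}(V)(x)=\sup_{\pi}\ \mathbb E_x\!\left[\int_0^\tau e^{-\beta s}\Big(\int_A r(X_s,a)\,\pi(da\mid X_s)+\alpha\,\mathcal H\big(\pi(\cdot\mid X_s)\big)\Big)\,ds+e^{-\beta\tau}\,V(X_\tau)\right],$$
$$T_\tau^{(\alpha)}(V)=V+\tau\,H^{(\alpha)}(V),\qquad H^{(\alpha)}(V)(x)=\alpha\log\int_A\exp\!\big(q_V(x,a)/\alpha\big)\,da,$$
$$q_V(x,a)=r(x,a)+L^aV(x)-\beta\,V(x).$$

Under this reading, $\Phi_\tau^{(\alpha)}$ involves an expectation over whole trajectories, while $T_\tau^{(\alpha)}$ only needs $V$ and its derivatives at the point $x$.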
2. Why the semigroup is the contraction reference
For any two bounded value functions $V$ and $W$, the running reward terms cancel when comparing the two semigroup evaluations. Only the terminal term remains, so the proof gets the clean estimate
$$\|\Phi_\tau^{(\alpha)}(V)-\Phi_\tau^{(\alpha)}(W)\|_\infty\le e^{-\beta\tau}\,\|V-W\|_\infty.$$
That immediately implies:
- $\Phi_\tau^{(\alpha)}$ is a strict contraction on the sup norm.
- It has a unique fixed point.
- That fixed point is the entropy-regularized value function $V^{(\alpha)}$.
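- Its iterates converge geometrically: $\|(\Phi_\tau^{(\alpha)})^k(V_0)-V^{(\alpha)}\|_\infty\le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty$.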
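The contraction property can be checked numerically on a toy model. The sketch below swaps the diffusion for a small finite-state, finite-action Markov model, so it is a discrete stand-in and not the operator from the appendix. The names `soft_backup`, `random_model`, and `sup_dist` are hypothetical, and the reward scaling by `tau` and the temperature `alpha * tau` are assumptions chosen to mimic a short horizon.

```python
import math
import random

def soft_backup(V, r, P, alpha, beta, tau):
    """Discrete stand-in for the short-horizon operator: a soft-max over actions
    of (tau * reward + e^{-beta*tau} * expected next value), temperature alpha*tau."""
    gamma = math.exp(-beta * tau)
    temp = alpha * tau
    out = []
    for s in range(len(V)):
        q = [tau * r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(r[s]))]
        m = max(q)  # shift by the max for a numerically stable log-sum-exp
        out.append(m + temp * math.log(sum(math.exp((x - m) / temp) for x in q)))
    return out

def sup_dist(V, W):
    """Sup-norm distance between two value vectors."""
    return max(abs(v - w) for v, w in zip(V, W))

def random_model(n_states, n_actions, rng):
    """Random rewards in [-1, 1] and random row-stochastic transition kernels."""
    r = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
    return r, P

rng = random.Random(0)
alpha, beta, tau = 0.5, 1.0, 0.1  # illustrative values, not from the source
r, P = random_model(5, 3, rng)
V = [rng.uniform(-2, 2) for _ in range(5)]
W = [rng.uniform(-2, 2) for _ in range(5)]
gamma = math.exp(-beta * tau)

# One application should shrink the sup-norm distance by at least e^{-beta*tau}.
ratio = sup_dist(soft_backup(V, r, P, alpha, beta, tau),
                 soft_backup(W, r, P, alpha, beta, tau)) / sup_dist(V, W)
print("observed ratio:", ratio, "contraction factor:", gamma)
```

The rewards drop out of the difference because both evaluations share them, and the log-sum-exp is 1-Lipschitz in the sup norm, which is the discrete counterpart of the cancellation argument above.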
3. Local expansion: from diffusion dynamics to a one-step mismatch
The local analysis has two parts.
Fixed action
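Dynkin's formula expands the terminal term under constant action $a$, while Lipschitz regularity of $L^aV$ and small-time moment bounds control the remainder. This yields a first-order expansion of the terminal term in $\tau$ with an $\mathcal O(\tau^{3/2})$ remainder that is uniform in $a$.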
Fixed policy
Under the relaxed diffusion associated with a policy $\pi$, the same short-time argument gives the analogous expansion, again with a uniform $\mathcal O(\tau^{3/2})$ remainder.
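The fixed-action estimate can be sketched in two lines. The notation $X_s^a$ for the diffusion under constant action $a$ started at $x$, the Lipschitz constant $K$ of $L^aV$, and the moment constant $M$ are labels introduced here. The sketch assumes a small-time moment bound $\mathbb E_x|X_s^a-x|\le M s^{1/2}$ for $s\le\tau$ and leaves out any discount factor for brevity.

$$\mathbb E_x\big[V(X_\tau^a)\big]-V(x)-\tau\,L^aV(x)=\mathbb E_x\int_0^\tau\big(L^aV(X_s^a)-L^aV(x)\big)\,ds,$$
$$\Big|\mathbb E_x\int_0^\tau\big(L^aV(X_s^a)-L^aV(x)\big)\,ds\Big|\le K\int_0^\tau\mathbb E_x|X_s^a-x|\,ds\le KM\int_0^\tau s^{1/2}\,ds=\tfrac{2}{3}\,KM\,\tau^{3/2}.$$

The first line is Dynkin's formula with $\tau L^aV(x)$ subtracted from both sides. If the relaxed diffusion under a policy $\pi$ has the averaged generator $\int_A L^aV(x)\,\pi(da\mid x)$, the same computation applies with $L^a$ replaced by that average.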
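Now the entropy-KL identity turns the policy supremum into the soft Hamiltonian. This converts the fixed-policy expansion into a global operator comparison:
$$\Phi_\tau^{(\alpha)}(V)=T_\tau^{(\alpha)}(V)+\mathcal O(\tau^{3/2}).$$
Equivalently, for a constant $C$ independent of $\tau$,
$$\|\Phi_\tau^{(\alpha)}(V)-T_\tau^{(\alpha)}(V)\|_\infty\le C\,\tau^{3/2}.$$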
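The entropy-KL identity invoked here is, in its standard form, the Gibbs variational principle. It is written below for a generic bounded payoff $q$ on an action set $A$ of finite measure, with differential entropy $\mathcal H$. These symbols are assumptions for illustration, and the appendix may use a different reference measure.

$$\int_A q\,d\pi+\alpha\,\mathcal H(\pi)=\alpha\log Z-\alpha\,\mathrm{KL}(\pi\,\|\,\pi^*),\qquad Z=\int_A e^{q(a)/\alpha}\,da,\qquad \pi^*(a)=\frac{e^{q(a)/\alpha}}{Z}.$$

Since $\mathrm{KL}(\pi\,\|\,\pi^*)\ge 0$ with equality only at $\pi=\pi^*$, the supremum over $\pi$ equals $\alpha\log Z$, which is the log-integral form of a soft Hamiltonian.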
4. Error recursion
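Let $\widetilde V_k$ denote the semigroup iterates started from the same initial function,
$$\widetilde V_{k+1}=\Phi_\tau^{(\alpha)}(\widetilde V_k),\qquad \widetilde V_0=V_0,$$
and define the difference between the explicit Hamiltonian iterate and the semigroup iterate by
$$\Delta_k=V_k-\widetilde V_k.$$
Then one step of algebra gives
$$\Delta_{k+1}=\big[T_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(V_k)\big]+\big[\Phi_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(\widetilde V_k)\big].$$
The first term is the local mismatch; the second is contracted by the semigroup. Therefore, with $C\,\tau^{3/2}$ bounding the one-step mismatch $\|T_\tau^{(\alpha)}(V)-\Phi_\tau^{(\alpha)}(V)\|_\infty$,
$$\|\Delta_{k+1}\|_\infty\le C\,\tau^{3/2}+e^{-\beta\tau}\,\|\Delta_k\|_\infty.$$
Because $\Delta_0=0$, unrolling the recursion yields
$$\|\Delta_k\|_\infty\le C\,\tau^{3/2}\sum_{j=0}^{k-1}e^{-\beta\tau j}\le\frac{C\,\tau^{3/2}}{1-e^{-\beta\tau}}.$$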
So the explicit Hamiltonian scheme never drifts too far from the exact contraction iteration.
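The unrolled recursion is a scalar geometric sum, so it is easy to check numerically. The following minimal sketch iterates the worst case $e_{j+1}=c+\gamma e_j$ with $c=C\tau^{3/2}$ and $\gamma=e^{-\beta\tau}$. The values of `beta`, `tau`, and `C` are illustrative assumptions, and `unrolled_error` is a hypothetical helper name.

```python
import math

def unrolled_error(c, gamma, k):
    """Iterate e_{j+1} = c + gamma * e_j from e_0 = 0 and return e_k."""
    e = 0.0
    for _ in range(k):
        e = c + gamma * e
    return e

beta, tau, C = 1.0, 0.05, 1.0   # illustrative values, not from the source
c = C * tau ** 1.5              # one-step mismatch bound
gamma = math.exp(-beta * tau)   # contraction factor of the semigroup
cap = c / (1.0 - gamma)         # geometric-series limit of the recursion

steps = (0, 1, 10, 100, 1000)
errors = [unrolled_error(c, gamma, k) for k in steps]
for k, e in zip(steps, errors):
    # Small relative tolerance guards against floating-point rounding near the cap.
    print(k, e, e <= cap * (1 + 1e-12))
```

The iterates increase with $k$ and saturate at the cap, which is why the accumulated error stays bounded no matter how many steps are taken.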
5. Final convergence bound and interpretation
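By the triangle inequality, with $\widetilde V_k=(\Phi_\tau^{(\alpha)})^k(V_0)$ the semigroup iterate,
$$\|V_k-V^{(\alpha)}\|_\infty\le\|V_k-\widetilde V_k\|_\infty+\|\widetilde V_k-V^{(\alpha)}\|_\infty.$$
Inserting the two bounds already proved gives, with $C$ the one-step mismatch constant,
$$\|V_k-V^{(\alpha)}\|_\infty\le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty+\frac{C\,\tau^{3/2}}{1-e^{-\beta\tau}}.$$
In this bound: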
- the term $e^{-\beta\tau k}$ is inherited from the contraction of dynamic programming,
- the term $\tau^{3/2}/(1-e^{-\beta\tau})$ is the accumulated discretization error from using the explicit Picard-Hamiltonian step instead of the exact short-horizon semigroup.
Since $1-e^{-\beta\tau}=\beta\tau+\mathcal O(\tau^2)$, the remainder behaves like $\mathcal O(\tau^{1/2})$ for small $\tau$. So the scheme converges to the correct value function as the flow discretization step $\tau$ shrinks, provided $k\tau\to\infty$ so that the contraction term also vanishes.
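The $\mathcal O(\tau^{1/2})$ behavior can be seen by tabulating the accumulated-error term against its leading-order approximation $\tau^{1/2}/\beta$. This is a small sketch with the constant set to $C=1$ and $\beta=1$ as illustrative assumptions; the function names are hypothetical.

```python
import math

def accumulated_error(tau, beta=1.0, C=1.0):
    """C * tau^{3/2} / (1 - e^{-beta*tau}), using expm1 for accuracy at small tau."""
    return C * tau ** 1.5 / (-math.expm1(-beta * tau))

def leading_order(tau, beta=1.0, C=1.0):
    """Leading-order approximation C * tau^{1/2} / beta."""
    return C * math.sqrt(tau) / beta

taus = (0.1, 0.01, 0.001, 0.0001)
ratios = []
for tau in taus:
    exact = accumulated_error(tau)
    approx = leading_order(tau)
    ratios.append(exact / approx)
    print(f"tau={tau:<7} bound={exact:.6f} sqrt(tau)/beta={approx:.6f} ratio={exact / approx:.4f}")
```

The ratio equals $\beta\tau/(1-e^{-\beta\tau})$, which tends to 1 as $\tau\to 0$, so the accumulated error shrinks at the rate $\tau^{1/2}$ and not at the one-step rate $\tau^{3/2}$.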