Value convergence

We present the convergence proof for the ideal Picard-Hamiltonian iteration $V_{k+1}=T_\tau^{(\alpha)}(V_k)$. The idea is to compare the explicit Hamiltonian step with a short-horizon dynamic-programming semigroup that is a strict sup-norm contraction, and to treat the Hamiltonian update as a controlled local approximation of that semigroup.

Proof roadmap

We organize the argument into three layers: define a contraction reference operator $\Phi_\tau^{(\alpha)}$, prove that one Hamiltonian step matches that operator up to order $\tau^{3/2}$, and then propagate the local mismatch through a recursive error bound.

Reference: $\Phi_\tau^{(\alpha)}$
Update: $T_\tau^{(\alpha)}(V)=V+\tau H^{(\alpha)}(V)$
Local error: $\mathcal O(\tau^{3/2})$

Function space

The proof works on bounded continuous value functions $\mathcal D=C_b(\mathbb R^d)$ with sup norm $\|V\|_\infty$, and uses the controlled diffusion, generator, Hamiltonian, and short-horizon semigroup introduced in the appendix preliminaries.

Assumptions

We assume globally Lipschitz dynamics, a bounded Lipschitz reward, and enough regularity that $L^aV$ is also Lipschitz, uniformly in $a$. These assumptions justify the short-time expansion with an $\mathcal O(\tau^{3/2})$ remainder.

Main theorem on this page

\|V_k - V^{(\alpha)}\|_\infty \le C_0 e^{-\beta \tau k} + \frac{C_1\tau^{3/2}}{1-e^{-\beta\tau}}.

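The first term comes from the contraction of the semigroup iteration. The second term is the accumulated price of replacing the semigroup by the explicit Picard-Hamiltonian step at every iteration. In the notation of the final section, $C_0=\|V_0-V^{(\alpha)}\|_\infty$ and $C_1=C_\alpha$.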

1. Setup and reference operators

The appendix starts from the entropy-regularized short-horizon dynamic-programming operator

(\Phi_\tau^{(\alpha)}V)(x) := \sup_{\pi}\, \mathbb E_x^\pi\!\left[ \int_0^\tau e^{-\beta t} \big(r(X_t,a_t)-\alpha\log\pi(a_t\mid X_t)\big)\,dt + e^{-\beta\tau}V(X_\tau) \right].

It compares this to the ideal Picard-Hamiltonian step

(T_\tau^{(\alpha)}V)(x):=V(x)+\tau H^{(\alpha)}(V)(x),

where the Hamiltonian is built from the instantaneous advantage-rate

q_V(x,a)=r(x,a)+(L^aV)(x)-\beta V(x).

The strategy rests on a division of labor: $\Phi_\tau^{(\alpha)}$ is easy to control globally, while $T_\tau^{(\alpha)}$ is easy to compute locally.

2. Why the semigroup is the contraction reference

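For any two bounded value functions $V$ and $W$ and any fixed policy, the running reward and entropy terms are identical in the two evaluations and cancel, so only the discounted terminal term $e^{-\beta\tau}\,\mathbb E_x^\pi[V(X_\tau)-W(X_\tau)]$ remains. Since a difference of suprema is bounded by the supremum of the differences, taking the supremum over policies preserves this bound, and the proof gets the clean estimate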

\|\Phi_\tau^{(\alpha)}(V)-\Phi_\tau^{(\alpha)}(W)\|_\infty \le e^{-\beta\tau}\,\|V-W\|_\infty.

That immediately implies:

$\Phi_\tau^{(\alpha)}$ is a strict contraction on the sup norm.
It has a unique fixed point.
That fixed point is the entropy-regularized value function $V^{(\alpha)}$.
Its iterates converge geometrically:

\| (\Phi_\tau^{(\alpha)})^{(k)}(V_0)-V^{(\alpha)}\|_\infty \le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty.
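
The contraction estimate can be illustrated on a toy problem. The sketch below is not the diffusion setting of the proof: it assumes a finite state space, a finite action set, and a discrete one-step transition kernel, and it approximates the running reward over $[0,\tau]$ by `tau * r`. The names `soft_bellman`, `r`, and `P` are hypothetical and chosen only for this illustration. Under those assumptions the entropy-regularized maximum is a log-sum-exp, which is non-expansive in the sup norm, so the only contraction factor is the discount $e^{-\beta\tau}$.

```python
import numpy as np

def soft_bellman(V, r, P, tau, beta, alpha):
    """Toy finite-state stand-in for the short-horizon operator.

    V: (S,) value vector, r: (S, A) reward rates,
    P: (A, S, S) row-stochastic one-step transition matrices.
    """
    gamma = np.exp(-beta * tau)
    # q[x, a] = tau * r(x, a) + gamma * E[V(next state) | x, a]
    q = tau * r + gamma * np.einsum("axy,y->xa", P, V)
    # Entropy-regularized maximum over actions, computed stably:
    # alpha * log sum_a exp(q[x, a] / alpha)
    m = q.max(axis=1, keepdims=True)
    lse = m + alpha * np.log(np.exp((q - m) / alpha).sum(axis=1, keepdims=True))
    return lse.ravel()

rng = np.random.default_rng(0)
S, A = 6, 3
tau, beta, alpha = 0.1, 1.0, 0.5

r = rng.uniform(-1.0, 1.0, size=(S, A))
P = rng.uniform(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)  # normalize each row to a probability vector

V = rng.normal(size=S)
W = rng.normal(size=S)

gap_before = np.max(np.abs(V - W))
gap_after = np.max(np.abs(soft_bellman(V, r, P, tau, beta, alpha)
                          - soft_bellman(W, r, P, tau, beta, alpha)))
contraction_ratio = gap_after / gap_before
discount = np.exp(-beta * tau)
```

In this toy model `contraction_ratio` cannot exceed `discount`, which mirrors the sup-norm estimate above; the reward array `r` has no influence on the ratio because it cancels in the difference.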

3. Local expansion: from diffusion dynamics to a one-step mismatch

The local analysis has two parts.

Fixed action

Dynkin's formula expands the terminal term under constant action $a$, while Lipschitz regularity of $L^aV$ and small-time moment bounds control the remainder. This yields

(\Phi_\tau^a V)(x)=V(x)+\tau q_V(x,a)+R_\tau^a(V)(x), \qquad \sup_{x,a}|R_\tau^a(V)(x)|\le C\tau^{3/2}.

Fixed policy

Under the relaxed diffusion associated with a policy $\pi$, the same short-time argument gives

(\Phi_\tau^{\pi,(\alpha)}V)(x) = V(x)+\tau\Big(\tilde r^{(\alpha)}(x;\pi)+(L^\pi V)(x)-\beta V(x)\Big)+R_\tau^{\pi,(\alpha)}(V)(x),

again with a uniform $\mathcal O(\tau^{3/2})$ remainder.

The entropy-KL identity then converts the supremum over policies into the soft Hamiltonian, which upgrades the fixed-policy expansion to a global operator comparison:

\Phi_\tau^{(\alpha)}(V) = V+\tau H^{(\alpha)}(V)+R_\tau^{(\alpha)}(V), \qquad \|R_\tau^{(\alpha)}(V)\|_\infty\le C_\alpha\tau^{3/2}.

Equivalently,

\|\Phi_\tau^{(\alpha)}(V)-T_\tau^{(\alpha)}(V)\|_\infty \le C_\alpha\tau^{3/2}.
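
The entropy-KL step can be checked numerically in a simplified setting. The sketch below assumes a finite action set with the counting measure as reference, which may differ from the base measure used in the appendix preliminaries; the function names are hypothetical. Under that assumption, the supremum over policies of $\sum_a \pi(a)\big(q(a)-\alpha\log\pi(a)\big)$ equals $\alpha\log\sum_a e^{q(a)/\alpha}$ and is attained by the Gibbs policy $\pi^*(a)\propto e^{q(a)/\alpha}$.

```python
import numpy as np

def soft_max_value(q, alpha):
    """alpha * log sum_a exp(q[a] / alpha), computed stably."""
    m = np.max(q)
    return float(m + alpha * np.log(np.sum(np.exp((q - m) / alpha))))

def entropy_regularized_objective(pi, q, alpha):
    """sum_a pi[a] * (q[a] - alpha * log pi[a]) for a strictly positive pi."""
    return float(np.sum(pi * (q - alpha * np.log(pi))))

def gibbs_policy(q, alpha):
    """Policy proportional to exp(q / alpha)."""
    w = np.exp((q - np.max(q)) / alpha)
    return w / w.sum()

rng = np.random.default_rng(1)
alpha = 0.7
q = rng.normal(size=5)  # stand-in for q_V(x, .) at one fixed state x

h_soft = soft_max_value(q, alpha)
value_at_gibbs = entropy_regularized_objective(gibbs_policy(q, alpha), q, alpha)

# No other policy should beat the Gibbs policy; sample some to check.
random_policies = rng.dirichlet(2.0 * np.ones(5), size=1000)
best_random = max(entropy_regularized_objective(p, q, alpha)
                  for p in random_policies)
```

Here `value_at_gibbs` agrees with `h_soft` up to rounding, and `best_random` stays below it; the gap for any other policy is $\alpha$ times its KL divergence from the Gibbs policy.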

4. Error recursion

Let

W_{k+1}:=\Phi_\tau^{(\alpha)}(W_k), \qquad W_0:=V_0,

and define the difference between the explicit Hamiltonian iterate and the semigroup iterate by

\Delta_k:=V_k-W_k.

Then one step of algebra gives

\begin{aligned} \Delta_{k+1} &=T_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(W_k)\\ &=\big(T_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(V_k)\big) +\big(\Phi_\tau^{(\alpha)}(V_k)-\Phi_\tau^{(\alpha)}(W_k)\big). \end{aligned}

The first term is the local mismatch; the second is contracted by the semigroup. Therefore

\|\Delta_{k+1}\|_\infty \le C_\alpha\tau^{3/2}+e^{-\beta\tau}\,\|\Delta_k\|_\infty.

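Because $\Delta_0=0$, unrolling the recursion gives the geometric sum $\|\Delta_k\|_\infty\le C_\alpha\tau^{3/2}\sum_{j=0}^{k-1}e^{-\beta\tau j}$, and bounding it by the full series yields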

\|\Delta_k\|_\infty \le \frac{C_\alpha\tau^{3/2}}{1-e^{-\beta\tau}}.

So the explicit Hamiltonian scheme stays within a distance of the exact contraction iteration that is uniform in $k$.
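
The recursion can be iterated directly to see how the bound saturates. The sketch below treats the worst case, where the inequality holds with equality at every step; the values of `c_alpha`, `tau`, and `beta` are arbitrary illustrative choices, not constants from the proof.

```python
import math

def mismatch_recursion(c_alpha, tau, beta, num_steps):
    """Iterate e_{k+1} = c_alpha * tau**1.5 + exp(-beta * tau) * e_k from e_0 = 0."""
    gamma = math.exp(-beta * tau)
    local = c_alpha * tau ** 1.5
    e = 0.0
    history = [e]
    for _ in range(num_steps):
        e = local + gamma * e
        history.append(e)
    return history

c_alpha, tau, beta = 2.0, 0.05, 1.0  # illustrative values only
history = mismatch_recursion(c_alpha, tau, beta, num_steps=500)

# Limit of the geometric series, i.e. the uniform bound on the mismatch.
uniform_bound = c_alpha * tau ** 1.5 / (1.0 - math.exp(-beta * tau))
```

The sequence `history` increases monotonically from zero and approaches `uniform_bound` from below, so the bound is tight for the recursion itself even though the true mismatch may be smaller.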

5. Final convergence bound and interpretation

By the triangle inequality,

\|V_k-V^{(\alpha)}\|_\infty \le \|V_k-W_k\|_\infty + \|W_k-V^{(\alpha)}\|_\infty.

Insert the two bounds already proved:

\|V_k-V^{(\alpha)}\|_\infty \le e^{-\beta\tau k}\,\|V_0-V^{(\alpha)}\|_\infty + \frac{C_\alpha\tau^{3/2}}{1-e^{-\beta\tau}}.

The interpretation is clear:

the term $e^{-\beta\tau k}$ is inherited from the contraction of dynamic programming,
the term $\tau^{3/2}/(1-e^{-\beta\tau})$ is the accumulated discretization error from using the explicit Picard-Hamiltonian step instead of the exact short-horizon semigroup.

Since $1-e^{-\beta\tau}=\beta\tau+\mathcal O(\tau^2)$, the accumulated error behaves like $(C_\alpha/\beta)\,\tau^{1/2}$ for small $\tau$. The scheme therefore converges to the correct value function provided the flow discretization step $\tau$ shrinks while the elapsed flow time $\tau k$ grows, so that both terms vanish.
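
The small-$\tau$ scaling is easy to confirm numerically. The sketch below compares the accumulated-error term with its leading-order approximation $\tau^{1/2}/\beta$, taking $C_\alpha=1$ and $\beta=1$ purely for illustration; `accumulated_error` is a hypothetical helper name.

```python
import math

def accumulated_error(tau, beta, c_alpha=1.0):
    """c_alpha * tau**1.5 / (1 - exp(-beta * tau)), using expm1 for accuracy."""
    return c_alpha * tau ** 1.5 / (-math.expm1(-beta * tau))

beta = 1.0
taus = [10.0 ** (-p) for p in range(1, 7)]  # 1e-1 down to 1e-6

# Ratio of the exact term to the leading-order approximation sqrt(tau) / beta.
ratios = [accumulated_error(t, beta) / (math.sqrt(t) / beta) for t in taus]
```

Each ratio equals $\beta\tau/(1-e^{-\beta\tau})$, which is slightly above one and tends to one as $\tau$ decreases, consistent with the $\mathcal O(\tau^{1/2})$ behavior stated above.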