Random time and regret bound
This page extends the deterministic finite-horizon theory in 2 directions. First, it shows that the value-flow and $q$-rate analysis remains valid when holding times are random. Second, it adds a more refined statistical argument that lifts one-step regression control to a cumulative sup-norm bound, leading to a sample-complexity and regret statement for the practical critic.
1. Why random time needs its own argument
Our work is motivated by irregular event times, so the deterministic increment $u$ should not be the end of the story. We therefore replace a fixed holding time by a random variable $U \in (0,1]$ and asks whether the same bias and convergence logic survives after averaging over that randomness.
The key simplification is that most deterministic Hamiltonian bias bounds depend only on powers of $u$. Once this is true, the random-time extension is obtained by taking expectations, so powers such as $u^{\gamma}$ become moments $\mathbb{E}[U^{\gamma}]$.
2. Hamiltonian bias transfers by averaging
Suppose the deterministic approximation error satisfies
Defining the averaged random-time Hamiltonian $\bar H_U^{(\alpha)}(V)=\mathbb{E}[H_U^{(\alpha)}(V)]$, we can prove the direct transfer bound:
The same argument also applies to the Richardson Hamiltonian.
3. Random-time Picard iteration
The deterministic Picard-Hamiltonian update becomes an averaged operator
If the dynamic-programming semigroup still approximates the ideal Hamiltonian step with the local error:
then the same contraction argument from the deterministic theory yields
Again, this is geometric contraction plus a residual discretization term. The only new part is that the residual now depends on moments of the random time.
4. Random-time $q$-convergence and the role of $\mathbb{E}[1/U]$
For the Hamiltonian, averaging is enough. For the finite-horizon rate $q_V^u$, however, the definition contains a division by $u$, so stability with respect to $V$ becomes worse when $u$ is small. We can still prove that:
and for the Richardson estimator,
Hence if $\mathbb{E}[1/U] < \infty$, the value error still transfers to the random-time rate bound exactly as in the deterministic proof:
5. The three random-time settings
We now summarize the random-time consequences of the 3 earlier deterministic cases:
- under basic assumptions, the Hamiltonian bias scales like $\mathbb{E}[U^{1/2}]$;
- under smoother assumptions, it improves to $\mathbb{E}[U]$;
- with Richardson correction, it improves further to $\mathbb{E}[U^2]$.
6. A more refined regret
We study the learned critic $\hat Q_k$ and asks how one-step regression error accumulates over time. The core challenge is that the available generalization control is naturally in $L^2(\nu)$, whereas the desired gap is in the sup norm. To bridge that gap, we assume a compact state-action domain, a replay distribution with density bounded away from zero, and a uniformly Lipschitz critic class. These assumptions allow a geometric norm-transfer argument from $L^2(\nu)$ to $\|\cdot\|_\infty$.
7. From $L^2(\nu)$ error to sup-norm error
The key lemma states that if a Lipschitz function $g$ satisfies $\|g\|_{L^2(\nu)}\le \varepsilon$, then
This geometric observation converts the regression bound on $\mathrm{stat}_k:=\|\hat Q_{k+1}-\mathcal F_{\tau,u}(\hat Q_k)\|_{L^2(\nu)}$ into a sup-norm one-step deviation:
8. Cumulative gap, sample complexity, and regret
Combining the population bias bound with the sup-norm statistical deviation gives a cumulative gap estimate of the form
Then chooses $\tau=L_{\max}^{-1/2}$, $u=L_{\max}^{-1/4}$, and $n_k=(k+1)^{m+2}$, which leads to
Rewriting the cumulative gap in terms of total samples $K$ yields the stated sublinear regret law
With reasonable assumptions, the practical single-critic learner still admits a meaningful long-run statistical guarantee.