Continuous-Time RL
Paper Codebase

Random time and regret bound

This page extends the deterministic finite-horizon theory in 2 directions. First, it shows that the value-flow and $q$-rate analysis remains valid when holding times are random. Second, it adds a more refined statistical argument that lifts one-step regression control to a cumulative sup-norm bound, leading to a sample-complexity and regret statement for the practical critic.

1. Why random time needs its own argument

Our work is motivated by irregular event times, so the deterministic increment $u$ should not be the end of the story. We therefore replace a fixed holding time by a random variable $U \in (0,1]$ and asks whether the same bias and convergence logic survives after averaging over that randomness.

The key simplification is that most deterministic Hamiltonian bias bounds depend only on powers of $u$. Once this is true, the random-time extension is obtained by taking expectations, so powers such as $u^{\gamma}$ become moments $\mathbb{E}[U^{\gamma}]$.

2. Hamiltonian bias transfers by averaging

Suppose the deterministic approximation error satisfies

$$ \|H_u^{(\alpha)}(V)-H^{(\alpha)}(V)\|_\infty \le C_H u^{\gamma}. $$

Defining the averaged random-time Hamiltonian $\bar H_U^{(\alpha)}(V)=\mathbb{E}[H_U^{(\alpha)}(V)]$, we can prove the direct transfer bound:

$$ \|\bar H_U^{(\alpha)}(V)-H^{(\alpha)}(V)\|_\infty \le C_H\,\mathbb{E}[U^{\gamma}]. $$

The same argument also applies to the Richardson Hamiltonian.

3. Random-time Picard iteration

The deterministic Picard-Hamiltonian update becomes an averaged operator

$$ (\widetilde T_{\tau}^{(\alpha)}V)(x) := V(x)+\tau\,\bar H_U^{(\alpha)}(V)(x). $$

If the dynamic-programming semigroup still approximates the ideal Hamiltonian step with the local error:

$$ \|\Phi_\tau^{(\alpha)}(V)-(V+\tau H^{(\alpha)}(V))\|_\infty \le C_{\mathrm{DP}}\tau^p, $$

then the same contraction argument from the deterministic theory yields

$$ \|\widetilde V_k - V^{(\alpha)}\|_\infty \le e^{-\rho\tau k}\,\|\widetilde V_0 - V^{(\alpha)}\|_\infty + \frac{C_{\mathrm{DP}}\tau^p + \tau C_H\,\mathbb{E}[U^{\gamma}]}{1-e^{-\rho\tau}}. $$

Again, this is geometric contraction plus a residual discretization term. The only new part is that the residual now depends on moments of the random time.

4. Random-time $q$-convergence and the role of $\mathbb{E}[1/U]$

For the Hamiltonian, averaging is enough. For the finite-horizon rate $q_V^u$, however, the definition contains a division by $u$, so stability with respect to $V$ becomes worse when $u$ is small. We can still prove that:

$$ \mathbb{E}\big[\|q_V^U-q_W^U\|_\infty\big] \le 2\,\mathbb{E}\!\left[\frac{1}{U}\right]\|V-W\|_\infty, $$

and for the Richardson estimator,

$$ \mathbb{E}\big[\|\tilde q_V^U-\tilde q_W^U\|_\infty\big] \le 10\,\mathbb{E}\!\left[\frac{1}{U}\right]\|V-W\|_\infty. $$

Hence if $\mathbb{E}[1/U] < \infty$, the value error still transfers to the random-time rate bound exactly as in the deterministic proof:

$$ \mathbb{E}\big[\|\tilde q_{\widetilde V_k}^U-q(V^*)\|_\infty\big] \le 10\,\mathbb{E}\!\left[\frac{1}{U}\right]\|\widetilde V_k-V^*\|_\infty + C_q'\,\mathbb{E}[U^2]. $$

5. The three random-time settings

We now summarize the random-time consequences of the 3 earlier deterministic cases:

  • under basic assumptions, the Hamiltonian bias scales like $\mathbb{E}[U^{1/2}]$;
  • under smoother assumptions, it improves to $\mathbb{E}[U]$;
  • with Richardson correction, it improves further to $\mathbb{E}[U^2]$.

6. A more refined regret

We study the learned critic $\hat Q_k$ and asks how one-step regression error accumulates over time. The core challenge is that the available generalization control is naturally in $L^2(\nu)$, whereas the desired gap is in the sup norm. To bridge that gap, we assume a compact state-action domain, a replay distribution with density bounded away from zero, and a uniformly Lipschitz critic class. These assumptions allow a geometric norm-transfer argument from $L^2(\nu)$ to $\|\cdot\|_\infty$.

7. From $L^2(\nu)$ error to sup-norm error

The key lemma states that if a Lipschitz function $g$ satisfies $\|g\|_{L^2(\nu)}\le \varepsilon$, then

$$ \|g\|_\infty \le \left(\frac{2^{m+2}L_g^m}{\nu_{\min}c_m}\right)^{\!\frac{1}{m+2}} \varepsilon^{\frac{2}{m+2}}. $$

This geometric observation converts the regression bound on $\mathrm{stat}_k:=\|\hat Q_{k+1}-\mathcal F_{\tau,u}(\hat Q_k)\|_{L^2(\nu)}$ into a sup-norm one-step deviation:

$$ \zeta_k:=\|\hat Q_{k+1}-\mathcal F_{\tau,u}(\hat Q_k)\|_\infty. $$

8. Cumulative gap, sample complexity, and regret

Combining the population bias bound with the sup-norm statistical deviation gives a cumulative gap estimate of the form

$$ \sum_{k=0}^{L-1}\|\hat Q_k-Q^{(\alpha)}\|_\infty \le \frac{1}{(1-e^{-\rho\tau})u}\|\hat Q_0-Q^{(\alpha)}\|_\infty + C L(\tau/u+u) + \frac{1}{\alpha}\sum_{k=0}^{L-1}\zeta_k. $$

Then chooses $\tau=L_{\max}^{-1/2}$, $u=L_{\max}^{-1/4}$, and $n_k=(k+1)^{m+2}$, which leads to

$$ \sum_{k=0}^{L-1}\|\hat Q_k-Q^{(\alpha)}\|_\infty \le C_1 L^{3/4}, \qquad K(L)=\sum_{k=0}^{L-1}n_k = C_2 L^{m+3}. $$

Rewriting the cumulative gap in terms of total samples $K$ yields the stated sublinear regret law

$$ \mathrm{Regret}(K)=\mathcal O\!\left(K^{\frac{4m+11}{4m+12}}\right). $$

With reasonable assumptions, the practical single-critic learner still admits a meaningful long-run statistical guarantee.