Why does continuous time matter?
Markets
Rebalancing happens when signals arrive, positions drift, or risk changes, so trajectories naturally mix very short and much longer holding intervals even when prices are observed at a fine resolution.
Robotics
Stable movement may tolerate coarse control, but contact changes and balance recovery often require much smaller corrective steps. A single fixed-step Bellman abstraction is a poor fit for this setup.
Diffusion noise
In continuous time, the value increment contains both Bellman drift and stochastic martingale variation. The drift scales with the step size while the martingale term scales with its square root, so as the step shrinks the noise can dominate naive TD regression unless the update is redesigned.
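The scaling behind the diffusion-noise point can be checked on a toy case. The sketch below is only an illustration and is not the paper's setup: it assumes a one-dimensional driftless Brownian state $dX = \sigma\,dW$ and a hypothetical value function $V(x) = x^2$, for which the mean and standard deviation of the one-step increment $V(X_{t+\Delta t}) - V(X_t)$ are available in closed form. All function names and parameters are invented for this example. Under these assumptions the noise-to-drift ratio grows roughly like $1/\sqrt{\Delta t}$.

```python
import math
import random


def increment_moments(dt, x0=1.0, sigma=1.0):
    """Closed-form mean and std of V(X_{t+dt}) - V(X_t) for V(x) = x**2.

    Toy assumption: dX = sigma dW started at x0, so
    X_{t+dt} = x0 + sigma * sqrt(dt) * Z with Z ~ N(0, 1).
    """
    # Increment = 2*x0*sigma*sqrt(dt)*Z + sigma**2 * dt * Z**2.
    # Its mean is the O(dt) drift part of the value increment.
    drift = sigma**2 * dt
    # Var(Z) = 1, Var(Z**2) = 2, Cov(Z, Z**2) = 0, so the variances add.
    noise_std = math.sqrt(4 * x0**2 * sigma**2 * dt + 2 * sigma**4 * dt**2)
    return drift, noise_std


def noise_to_drift_ratio(dt, x0=1.0, sigma=1.0):
    """Std of the increment divided by its mean (the drift)."""
    drift, noise_std = increment_moments(dt, x0, sigma)
    return noise_std / drift


def sample_increments(dt, n, x0=1.0, sigma=1.0, seed=0):
    """Draw n one-step increments of V(x) = x**2 by direct simulation."""
    rng = random.Random(seed)
    increments = []
    for _ in range(n):
        x1 = x0 + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        increments.append(x1**2 - x0**2)
    return increments


if __name__ == "__main__":
    # Smaller steps make the regression target noisier relative to the drift.
    for dt in (1e-1, 1e-2, 1e-3, 1e-4):
        print(f"dt={dt:g}  noise/drift={noise_to_drift_ratio(dt):.1f}")
```

In this toy model, shrinking the step by a factor of 100 raises the noise-to-drift ratio by about a factor of 10, which is why a regression target built from a single small-step increment carries little usable signal per sample.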
Research Questions
Why do fixed-step Bellman updates become unreliable in continuous-time settings?
What should reinforcement learning look like when the environment evolves continuously?
How can we keep the stochastic-control foundation without coupled martingale optimization?
How does our method differ from existing approaches?
Do these continuous-time ideas actually help in practice?
Sample illustrations
Below are short representative clips from the empirical study. The complete gallery is on the Gallery page.
CT-SAC vs CPPO on Trading Task
Our algorithm CT-SAC against continuous-time baselines on Cheetah
Our algorithm CT-SAC against continuous-time baselines on Humanoid
Contributions
- Decoupled continuous-time RL. The framework separates learning the advantage-rate signal from updating the value function, replacing martingale-coupled objectives with a simpler iterative scheme that remains compatible with off-policy training.
- Probabilistic convergence analysis. The theory compares the Picard-Hamiltonian value update to a dynamic-programming semigroup, avoiding heavy functional-analytic machinery while still controlling approximation error from sampled stochastic trajectories.
- Strong empirical performance under mixed time scales. The method is evaluated on high-dimensional locomotion and a real trading task with irregular decision intervals, where it outperforms prior continuous-time baselines and strong discrete-time algorithms.
Paper roadmap
Formulation
From discrete MDPs to controlled diffusions, generators, Hamiltonians, and the continuous-time advantage-rate $q$-function.
Algorithm
CT-SAC and CT-TD3, local $q^u$ estimation, Richardson correction, and the single-critic update $Q \approx V + q$.
Theory
Value convergence, $q$-convergence, algorithm convergence, and the random-time and regret extensions.
Experiments
Irregular-time control and trading benchmarks, visualization, time complexity, and generalization to regular-time evaluation.
Gallery
Qualitative images and full video galleries for trading, irregular-time control, and transfer to regular-time evaluation.
Takeaway
Our main claim is that continuous-time RL under irregular decision times becomes substantially easier to optimize once learning the local action signal and updating the value function are separated. That decoupling leads to a Hamiltonian-flow view of value learning, a practical single-critic implementation, rigorous convergence guarantees, and consistently strong performance on both continuous control benchmarks and a real trading task.