Continuous-Time RL

Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

Many decision problems evolve in continuous time, but most reinforcement learning algorithms still rely on fixed-step Bellman updates. This work develops a decoupled continuous-time actor-critic framework that learns a local advantage-rate signal $q(x,a)$ and updates the value function through a Hamiltonian flow, leading to both rigorous convergence guarantees and strong empirical performance under irregular decision times.

Irregular decision times lead to local $q$-learning, a Hamiltonian flow, and a single-critic continuous-time actor-critic update.

Why continuous time matters

Markets

Rebalancing happens when signals arrive, positions drift, or risk changes, so trajectories naturally mix very short and much longer holding intervals even when prices are observed at a fine resolution.

Robotics

Stable movement may tolerate coarse control, but contact changes and balance recovery often require much smaller corrective steps. A single fixed-step Bellman abstraction is a poor fit for this setup.

Diffusion noise

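In continuous time, the value increment contains both Bellman drift and stochastic martingale variation. Over a step of length $\Delta t$, the drift is of order $\Delta t$ while the martingale term is of order $\sqrt{\Delta t}$, so as the step size shrinks the noise can dominate naive TD regression unless the update is redesigned.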
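
The scaling behind this point can be checked numerically. The sketch below is illustrative only: the dynamics $dX = -X\,dt + \sigma\,dW$, the test function $V(x) = x^2$, and the helper name `td_increment_stats` are assumptions chosen for simplicity, not taken from the paper. It simulates one Euler-Maruyama step from $x_0 = 1$ and compares the mean of the value increment (the drift part) with its standard deviation (the martingale part).

```python
import numpy as np

def td_increment_stats(dt, n=200_000, sigma=1.0, x0=1.0, seed=0):
    """Mean and std of V(X_{t+dt}) - V(X_t) for the assumed toy diffusion."""
    rng = np.random.default_rng(seed)
    dw = rng.normal(0.0, np.sqrt(dt), size=n)   # Brownian increments, std sqrt(dt)
    x1 = x0 - x0 * dt + sigma * dw              # one Euler-Maruyama step of dX = -X dt + sigma dW
    inc = x1**2 - x0**2                         # value increment for V(x) = x^2
    return inc.mean(), inc.std()

for dt in (1e-1, 1e-2, 1e-3):
    drift, noise = td_increment_stats(dt)
    print(f"dt={dt:g}  mean={drift:+.5f}  std={noise:.5f}  "
          f"noise/|drift|={noise / abs(drift):.1f}")
```

For these assumed dynamics the mean increment is close to $-\Delta t$ and the standard deviation is close to $2\sqrt{\Delta t}$, so the noise-to-drift ratio grows roughly like $2/\sqrt{\Delta t}$ as the step shrinks.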


Research Questions

Why do fixed-step Bellman updates become unreliable in continuous-time settings?
With large $\Delta t$, a fixed-step model can miss important dynamics between decisions. With very small $\Delta t$, Bellman targets change only weakly from one step to the next, making learning signals fragile and optimization sensitive to noise. More fundamentally, when trajectories mix multiple time scales, the learned update can depend more on discretization than on the underlying continuous-time control problem itself.
What should reinforcement learning look like when the environment evolves continuously?
The ordinary discrete-time $Q$-function collapses to $V$ as $\Delta t \to 0$. Recent continuous-time RL addresses this by moving to stochastic control and replacing the degenerate small-step $Q$-function with the instantaneous advantage-rate function $q(x,a)$. But a major difficulty remains: under diffusion dynamics, minimizing a naive TD loss can fit Brownian-noise variance terms rather than the Bellman drift. Existing techniques rely on martingale-based objectives or orthogonality conditions, which couple $V$ and $q$ into a single optimization problem and make practical training difficult.
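
The collapse can be made explicit with a short expansion. This is a sketch in assumed notation (drift $b$, diffusion coefficient $\sigma$, running reward $r$, and discount rate $\beta$ are not defined on this page), following the standard little-$q$ formulation for controlled diffusions rather than quoting the paper. Here $Q_{\Delta t}(x,a)$ denotes the value of holding action $a$ for a duration $\Delta t$ and then following the policy whose value function is $V$:

$$
Q_{\Delta t}(x,a) = V(x) + q(x,a)\,\Delta t + o(\Delta t),
\qquad
q(x,a) = r(x,a) + b(x,a)^\top \nabla V(x) + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma\sigma^\top(x,a)\,\nabla^2 V(x)\big) - \beta V(x).
$$

All action dependence sits in the $O(\Delta t)$ term, so $Q_{\Delta t}(x,a) - Q_{\Delta t}(x,a') = O(\Delta t)$ vanishes with the step size, while $q$ retains an action ranking that does not depend on $\Delta t$.
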
How can we keep the stochastic-control foundation without coupled martingale optimization?
Decouple the problem. First learn the local action signal $q_V(x,a)$ for a fixed value function $V$. Then update $V$ not through a discrete-time max backup, but through a Hamiltonian-driven flow. This yields a modular continuous-time actor-critic method that remains aligned with stochastic-control theory while being practical for modern off-policy training.
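
The two steps can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a known one-dimensional model (drift $b(x,a) = a$, constant $\sigma$, reward $-x^2 - 0.1a^2$, discount rate $\beta$) and computes $q_V$ by finite differences instead of learning it from samples, and it reads the "Hamiltonian flow" as the explicit Euler iteration $V \leftarrow V + \eta \max_a q_V(x,a)$, which is an assumption. All names and constants are hypothetical. The fixed point of this iteration satisfies $\max_a q_V(x,a) = 0$, the Hamilton-Jacobi-Bellman equation for the toy problem.

```python
import numpy as np

# Toy 1-D control problem; every constant below is an illustrative assumption.
xs = np.linspace(-2.0, 2.0, 41)          # state grid
dx = xs[1] - xs[0]
actions = np.array([-1.0, 0.0, 1.0])     # drift b(x, a) = a
sigma, beta = 0.5, 1.0                   # diffusion scale and discount rate

def q_of_V(V):
    """q_V(x, a) = r + b V' + 0.5 sigma^2 V'' - beta V for a fixed V, shape (actions, states)."""
    Vp = np.pad(V, 1, mode="edge")                         # ghost nodes copy the boundary values
    dV = (Vp[2:] - Vp[:-2]) / (2.0 * dx)                   # central first difference
    d2V = (Vp[2:] - 2.0 * Vp[1:-1] + Vp[:-2]) / dx**2      # second difference
    r = -xs[None, :] ** 2 - 0.1 * actions[:, None] ** 2    # running reward
    return (r + actions[:, None] * dV[None, :]
            + 0.5 * sigma**2 * d2V[None, :] - beta * V[None, :])

def hamiltonian_flow(n_steps=1000, eta=0.02):
    """Alternate: (1) evaluate q_V for the current V, (2) move V along max_a q_V."""
    V = np.zeros_like(xs)
    for _ in range(n_steps):
        q = q_of_V(V)                    # step 1: local action signal for fixed V
        V = V + eta * q.max(axis=0)      # step 2: explicit Euler step in pseudo-time
    residual = np.abs(q_of_V(V).max(axis=0)).max()   # size of max_a q_V at the end
    return V, residual

V, residual = hamiltonian_flow()
print(f"max over x of |max_a q_V(x, a)|: {residual:.2e}")
```

The step size `eta` is chosen small enough for the explicit scheme to be stable on this grid; a greedy action can be read off as `actions[q_of_V(V).argmax(axis=0)]`. An entropy-regularized variant would replace the hard maximum with a soft one.
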
How does our method differ from existing approaches?
Standard discrete-time algorithms such as SAC, TD3, TRPO, and PPO remain highly effective, but they are fundamentally built around fixed-step MDPs and become sensitive when $\Delta t$ is very small or varies across samples. Earlier continuous-time approaches often relied on deterministic ODE models, closed-form dynamics, or other structural assumptions that are difficult to satisfy in general stochastic environments. More recent stochastic-control formulations introduce the little-$q$ advantage-rate and martingale characterizations under diffusion dynamics, providing the right continuous-time foundation but leaving optimization bottlenecked by coupled martingale objectives. Rather than changing the policy class or reusing the same martingale enforcement in a different form, the contribution here is a new decoupled learning scheme: estimate $q$ locally, update $V$ through a Hamiltonian flow, and recover a practical single-critic actor-critic algorithm with both theory and strong experiments.
Do these continuous-time ideas actually help in practice?
Real-world systems such as markets, robots, and autonomous agents act at irregular times, while mainstream RL still assumes a uniform environment step. Our experiments test precisely this irregular-time setting on continuous control tasks and a minute-resolution trading environment, where CT-SAC and CT-TD3 outperform prior continuous-time baselines and strong discrete-time algorithms.


Contributions

2 core algorithm variants: CT-SAC and CT-TD3
4 control benchmarks from DeepMind Control Suite
1 minute-level irregular-time trading environment
3 comprehensive experimental study settings

  1. Decoupled continuous-time RL. The framework separates learning the advantage-rate signal from updating the value function, replacing martingale-coupled objectives with a simpler iterative scheme that remains compatible with off-policy training.
  2. Probabilistic convergence analysis. The theory compares the Picard-Hamiltonian value update to a dynamic-programming semigroup, avoiding heavy functional-analytic machinery while still controlling approximation error from sampled stochastic trajectories.
  3. Strong empirical performance under mixed time scales. The method is evaluated on high-dimensional locomotion and a real trading task with irregular decision intervals, where it outperforms prior continuous-time baselines and strong discrete-time algorithms.

Paper roadmap

Formulation

From discrete MDPs to controlled diffusions, generators, Hamiltonians, and the continuous-time advantage-rate $q$-function.


Algorithm

CT-SAC and CT-TD3, local $q^u$ estimation, Richardson correction, and the single-critic update $Q \approx V + q$.


Theory

Value convergence, $q$-convergence, algorithm convergence, and the random-time and regret extensions.


Experiments

Irregular-time control and trading benchmarks, visualization, time complexity, and generalization to regular-time evaluation.


Gallery

Qualitative images and full video galleries for trading, irregular-time control, and transfer to regular-time evaluation.



Takeaway

Our main claim is that continuous-time RL under irregular decision times becomes substantially easier to optimize once learning the local action signal and updating the value function are separated. That decoupling leads to a Hamiltonian-flow view of value learning, a practical single-critic implementation, rigorous convergence guarantees, and consistently strong performance on both continuous control benchmarks and a real trading task.