Continuous-Time RL

Algorithm

The formulation page explains why continuous-time RL needs a Hamiltonian flow and a local advantage-rate signal. Here we focus on how to turn those ideas into practical critic targets, actor updates, and deterministic or Richardson-style variants.

Algorithm pipeline for CT-SAC and CT-TD3: irregular transitions enter a replay buffer, the critic target combines reward, local value aggregation, and a time-scaled bootstrap term, and the actor is updated from the single critic.
Algorithm overview. The environment samples irregular transition durations $u$, while the optimizer advances the critic through an algorithmic step size $\tau$.

Algorithmic pipeline

1. Collect irregular transitions

Store tuples $(x,a,r,x',u)$ in an off-policy replay buffer. The duration $u$ is part of the sample and can vary substantially across the batch.

2. Build a continuous-time critic target

Instead of a fixed-step Bellman backup, use a target that mixes reward, policy-averaged soft value, and a bootstrap term scaled by the observed holding time.

3. Represent both $V$ and $q$ with one critic

The practical parameterization is $Q(x,a)\approx V(x)+q(x,a)$, so a single network carries both value-like and advantage-rate information.

4. Update the actor from the critic

For CT-SAC the policy is improved toward a Boltzmann distribution induced by the critic; CT-TD3 replaces this with a deterministic actor and twin critics.
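The first pipeline step can be sketched as a replay buffer that stores the duration $u$ with every transition. This is a minimal illustration using only the Python standard library; the names `Transition`, `IrregularReplayBuffer`, and `discounts` are hypothetical and not taken from the codebase.

```python
import math
import random
from collections import deque
from typing import NamedTuple, Sequence


class Transition(NamedTuple):
    x: Sequence[float]       # state at the start of the holding interval
    a: Sequence[float]       # action applied during the interval
    r: float                 # reward recorded for this transition
    x_next: Sequence[float]  # state observed after the holding interval
    u: float                 # observed duration (holding time), must be > 0


class IrregularReplayBuffer:
    """FIFO replay buffer that keeps the duration u with every sample."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._storage)

    def add(self, x, a, r, x_next, u) -> None:
        if u <= 0.0:
            raise ValueError("duration u must be positive")
        self._storage.append(Transition(x, a, r, x_next, u))

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)


def discounts(batch, beta: float) -> list:
    """Per-sample discount gamma = exp(-beta * u); it differs across the batch."""
    return [math.exp(-beta * t.u) for t in batch]
```

Because $u$ travels with each sample, the discount is computed per transition at training time rather than fixed once for the whole buffer.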

Single-critic implementation

We use a single critic of the form:

$$Q_k(x,a):=V_k(x)+q_k(x,a).$$

Near the improved policy, the advantage-rate is often small, so $Q_k(x,a)$ behaves almost like a state-value predictor on-policy, but still retains action discrimination off-policy.

CT-SAC critic target

Writing $\tilde Q_k(x,a):=Q_k(x,a)-\alpha\log\pi(a\mid x)$ and $\gamma=e^{-\beta u}$, the single-critic update takes the form:

$$\begin{aligned} Q_{k+1}(x,a)=&(1-\tau)Q_k(x,a) +\tau\,\mathbb E_a[\tilde Q_k(x,a)] +\tau\,r(x,a)\\ &+\tau\, \frac{\gamma\,\mathbb E_{a'}[\tilde Q_k(x',a')]-\mathbb E_a[\tilde Q_k(x,a)]}{u}. \end{aligned}$$

The critic target now depends explicitly on the observed duration $u$, so the method can train on batches that mix short and long holding times without forcing every transition into the same discrete-time step.
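As a rough sketch of the update above, the target can be computed per sample once the two policy-averaged soft values are estimated. The function names, the Monte Carlo estimator, and all numbers below are illustrative assumptions; a real implementation would work on batched tensors and target networks.

```python
import math


def soft_value(q_values, log_probs, alpha):
    """Monte Carlo estimate of E_a[Q(x,a) - alpha * log pi(a|x)].

    q_values[i] and log_probs[i] belong to the i-th action sampled from pi(.|x).
    """
    if len(q_values) != len(log_probs) or not q_values:
        raise ValueError("need matching, non-empty samples")
    total = sum(q - alpha * lp for q, lp in zip(q_values, log_probs))
    return total / len(q_values)


def ct_sac_target(q_xa, r, s_x, s_next, u, tau, beta):
    """Single-critic CT-SAC target for one transition.

    s_x and s_next are the soft values at x and x'; u is the observed duration.
    """
    gamma = math.exp(-beta * u)
    bootstrap_rate = (gamma * s_next - s_x) / u
    return (1.0 - tau) * q_xa + tau * s_x + tau * r + tau * bootstrap_rate


# Toy mini-batch mixing short and long holding times (all numbers are made up).
batch = [
    {"q_xa": 1.00, "r": 0.2, "s_x": 1.05, "s_next": 1.06, "u": 0.01},
    {"q_xa": 1.00, "r": 0.2, "s_x": 1.05, "s_next": 1.30, "u": 0.50},
    {"q_xa": 1.00, "r": 0.2, "s_x": 1.05, "s_next": 1.90, "u": 2.00},
]
targets = [ct_sac_target(tau=0.05, beta=0.1, **row) for row in batch]
```

The step size `tau` is shared by the whole batch, while `u` and therefore `gamma` change from row to row.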

Note. The actor still uses a familiar update. Since the soft maximizing policy is Boltzmann in $q_V$, and $Q=V+q$ differs from $q$ only by an action-independent shift, CT-SAC updates the policy through $\pi(a\mid x)\propto \exp(Q(x,a)/\alpha)$.
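The shift-invariance argument in the note can be checked numerically. The sketch below assumes a small finite set of candidate actions purely for illustration (the method itself targets continuous actions), and the values of `v` and `q_rates` are made up.

```python
import math


def boltzmann(scores, alpha):
    """Softmax of scores / alpha, computed stably by subtracting the max."""
    m = max(scores)
    weights = [math.exp((s - m) / alpha) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


alpha = 0.2
v = 7.5                               # hypothetical state value V(x)
q_rates = [-0.3, 0.0, -1.2]           # hypothetical advantage rates q(x, a_i)
q_full = [v + q for q in q_rates]     # single-critic values Q(x, a_i) = V(x) + q(x, a_i)

pi_from_q = boltzmann(q_rates, alpha)
pi_from_Q = boltzmann(q_full, alpha)  # same distribution: V(x) cancels on normalization
```

The two distributions agree up to floating-point error, which is why the actor can be trained directly against the single critic.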

Fast-update view

For implementation, it is often convenient to separate the "new information" part of the critic target from the mixing step. Define:

$$Q_{k+1}^{\mathrm{fast}}(x,a):=\frac{Q_{k+1}(x,a)-(1-\tau)Q_k(x,a)}{\tau}.$$

Then the update can be written as:

$$\begin{aligned} Q_{k+1}^{\mathrm{fast}}(x,a) &= r(x,a)+\mathbb E_a[\tilde Q_k(x,a)] +\frac{\gamma\,\mathbb E_{a'}[\tilde Q_k(x',a')]-\mathbb E_a[\tilde Q_k(x,a)]}{u},\\ Q_{k+1}(x,a) &= \tau\,Q_{k+1}^{\mathrm{fast}}(x,a)+(1-\tau)Q_k(x,a). \end{aligned}$$

This presentation makes the connection to standard deep RL implementations easier to see: the first line isolates the actual continuous-time target, while the second line is a weighted interpolation step. When $u=1$, the two $\mathbb E_a[\tilde Q_k(x,a)]$ terms cancel and the fast target reduces to $r(x,a)+\gamma\,\mathbb E_{a'}[\tilde Q_k(x',a')]$, which has the form of the discrete-time SAC soft target.
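The two-line form maps onto two small functions. The sketch below uses scalar inputs and hypothetical names (`fast_target`, `mix`); it also checks the $u=1$ case against a plain discrete-time soft target.

```python
import math


def fast_target(r, s_x, s_next, u, beta):
    """The 'new information' part of the critic target.

    s_x and s_next are the policy-averaged soft values at the current and next state.
    """
    gamma = math.exp(-beta * u)
    return r + s_x + (gamma * s_next - s_x) / u


def mix(q_old, q_fast, tau):
    """Weighted interpolation between the old critic value and the fast target."""
    return tau * q_fast + (1.0 - tau) * q_old


def discrete_soft_target(r, s_next, gamma):
    """Standard discrete-time soft target r + gamma * E[soft value at next state]."""
    return r + gamma * s_next


r, s_x, s_next, beta = 0.5, 3.0, 2.0, 0.2
ct_value = fast_target(r, s_x, s_next, u=1.0, beta=beta)
dt_value = discrete_soft_target(r, s_next, gamma=math.exp(-beta))
# With u = 1 the s_x terms cancel, so ct_value and dt_value agree.
q_new = mix(q_old=1.0, q_fast=ct_value, tau=0.05)
```

Splitting the computation this way keeps the duration-dependent logic in `fast_target`, so the interpolation step can stay identical to whatever mixing rule the rest of the training code already uses.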

CT-SAC algorithm

Data collection

Roll out the current stochastic policy, collect $(x,a,r,x',u)$, and store transitions in an off-policy replay buffer. The duration $u$ is kept explicitly instead of being absorbed into a nominal step.

Critic regression

Use mini-batches to fit the single critic against the continuous-time target above. This is the main place where the formulation becomes a scalable algorithm.

Actor improvement

Update the stochastic policy toward $\pi_\phi(a\mid x)\propto \exp(Q_\theta(x,a)/\alpha)$, giving a soft continuous-time analogue of SAC.
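To illustrate the actor improvement step, the sketch below assumes the usual SAC-style objective $\mathbb E_{a\sim\pi}[\alpha\log\pi(a\mid x)-Q(x,a)]$ and, for simplicity, a finite action set with a softmax policy. The source only states the Boltzmann target; the objective, the finite action set, and the step size are assumptions made for this example.

```python
import math


def softmax(logits):
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def actor_loss(probs, q_values, alpha):
    """E_{a~pi}[alpha * log pi(a|x) - Q(x,a)] over a finite action set."""
    return sum(p * (alpha * math.log(p) - q) for p, q in zip(probs, q_values))


def actor_step(logits, q_values, alpha, lr):
    """One gradient-descent step on actor_loss with respect to the logits."""
    probs = softmax(logits)
    h = [alpha * math.log(p) - q for p, q in zip(probs, q_values)]
    h_bar = sum(p * hi for p, hi in zip(probs, h))
    # Exact gradient for a softmax policy: dL/dz_j = p_j * (h_j - h_bar).
    return [z - lr * p * (hi - h_bar) for z, p, hi in zip(logits, probs, h)]


q_values = [1.0, 0.5, -0.2]   # made-up critic values Q(x, a_i)
alpha = 0.5
logits = [0.0, 0.0, 0.0]      # start from the uniform policy
for _ in range(2000):
    logits = actor_step(logits, q_values, alpha, lr=0.5)

pi = softmax(logits)
target = softmax([q / alpha for q in q_values])  # Boltzmann policy exp(Q / alpha)
```

In this toy setting, gradient descent on the objective drives `pi` to the Boltzmann target, which is the fixed point the actor update aims for.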

CT-TD3 algorithm

We also give a deterministic variant of CT-SAC named CT-TD3 that mirrors discrete-time TD3 directly. CT-TD3 keeps the same irregular-time critic logic but replaces the stochastic actor with a deterministic policy $\mu_\phi$, uses twin critics, adds target policy smoothing, and delays actor updates.

CT-TD3 full algorithm

Target policy smoothing. Use $$\tilde a'=\mu_{\bar\phi}(x')+\epsilon',\qquad \epsilon'\sim \mathrm{clip}\big(\mathcal N(0,\sigma_{\mathrm{targ}}^2I),-c,c\big).$$

Clipped double-$Q$ bootstrap. Form $$\bar Q_{\bar\theta}(x',\tilde a'):=\min\{Q_{\bar\theta_1}(x',\tilde a'),\,Q_{\bar\theta_2}(x',\tilde a')\}.$$

Critic target. For each critic, $$\begin{aligned} y_i:=&(1-\tau)Q_{\bar\theta_i}(x,a) +\tau\,\bar Q_{\bar\theta}\big(x,\mu_{\bar\phi}(x)\big) +\tau\,r\\ &+\tau\,\frac{\gamma\,\bar Q_{\bar\theta}(x',\tilde a')-\bar Q_{\bar\theta}(x,\mu_{\bar\phi}(x))}{u}. \end{aligned}$$

Actor update. Every $d$ steps, update the deterministic actor through the gradient of $Q_{\theta_1}(x,a)$ evaluated at $a=\mu_\phi(x)$.
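The first three steps above can be sketched for a single transition as follows. The toy functions `q1_target`, `q2_target`, and `mu_target` are hypothetical stand-ins for the target networks, and the delayed actor update is omitted because it needs automatic differentiation.

```python
import math
import random


def smoothed_target_action(mu_next, sigma_targ, c, rng):
    """mu(x') plus Gaussian noise clipped to [-c, c] in every dimension."""
    return [m + max(-c, min(c, rng.gauss(0.0, sigma_targ))) for m in mu_next]


def ct_td3_target(q_i_xa, r, q_bar_x, q_bar_next, u, tau, beta):
    """CT-TD3 target y_i for critic i, given clipped double-Q values at x and x'."""
    gamma = math.exp(-beta * u)
    rate = (gamma * q_bar_next - q_bar_x) / u
    return (1.0 - tau) * q_i_xa + tau * q_bar_x + tau * r + tau * rate


# Hypothetical stand-ins for the target critics and target actor.
def q1_target(x, a):
    return x[0] - (a[0] - 0.5) ** 2


def q2_target(x, a):
    return x[0] - (a[0] - 0.4) ** 2


def mu_target(x):
    return [0.45 * x[0]]


def q_bar(x, a):
    """Clipped double-Q: the minimum of the two target critics."""
    return min(q1_target(x, a), q2_target(x, a))


rng = random.Random(0)
x, a, r, x_next, u = [1.0], [0.3], 0.2, [1.1], 0.25
a_next = smoothed_target_action(mu_target(x_next), sigma_targ=0.2, c=0.5, rng=rng)
q_bar_x = q_bar(x, mu_target(x))
q_bar_next = q_bar(x_next, a_next)
y1 = ct_td3_target(q1_target(x, a), r, q_bar_x, q_bar_next, u, tau=0.05, beta=0.1)
y2 = ct_td3_target(q2_target(x, a), r, q_bar_x, q_bar_next, u, tau=0.05, beta=0.1)
```

Both targets share the same clipped bootstrap terms and differ only through the $(1-\tau)Q_{\bar\theta_i}(x,a)$ term.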

Richardson variant

We also provide a corrected single-critic update based on the Richardson rate estimator. The goal is not to change the overall training loop, but to improve the short-horizon approximation used inside the critic target.

Define the policy-averaged soft value functional:

$$S_k(x):=\mathbb E_{a\sim\pi_k(\cdot\mid x)}[\tilde Q_k^{\mathrm R}(x,a)], \qquad \tilde Q_k^{\mathrm R}(x,a):=Q_k^{\mathrm R}(x,a)-\alpha\log\pi_k(a\mid x),$$

together with the Richardson critic decomposition:

$$Q_k^{\mathrm R}(x,a):=V_k(x)+\tilde q_{V_k}^u(x,a).$$

The resulting Richardson-corrected update is:

$$\begin{aligned} Q_{k+1}^{\mathrm R}(x,a) &=(1-\tau)Q_k^{\mathrm R}(x,a)+\tau r(x,a)+\tau S_k(x)\\ &\quad+\frac{\tau}{u}\Big( 4\gamma_{1/2}\,\mathbb E_x^a[S_k(X_{u/2})] -\gamma\,\mathbb E_x^a[S_k(X_u)] -3S_k(x)\Big), \end{aligned}$$

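where $\gamma=e^{-\beta u}$ and $\gamma_{1/2}=e^{-\beta u/2}$. The bracketed term combines a half-step and a full-step evaluation of $S_k$, so this variant is more accurate in $u$ under stronger smoothness assumptions, and it connects directly to the stronger convergence result on the theory page.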
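A small numerical check can make the accuracy claim concrete. The sketch below compares the one-sided rate estimate with the Richardson estimate on a made-up smooth, deterministic path `s_along_path(t)` standing in for $S_k(X_t)$. It assumes the soft value at the midpoint $u/2$ is available, which the source does not spell out how to obtain.

```python
import math


def one_sided_rate(s0, s_u, u, beta):
    """First-order estimate of d/dt [exp(-beta t) S(X_t)] at t = 0."""
    gamma = math.exp(-beta * u)
    return (gamma * s_u - s0) / u


def richardson_rate(s0, s_half, s_u, u, beta):
    """Richardson estimate using both the half step u/2 and the full step u."""
    gamma = math.exp(-beta * u)
    gamma_half = math.exp(-beta * u / 2.0)
    return (4.0 * gamma_half * s_half - gamma * s_u - 3.0 * s0) / u


def richardson_target(q_xa, r, s0, s_half, s_u, u, tau, beta):
    """Richardson-corrected single-critic target for one transition."""
    rate = richardson_rate(s0, s_half, s_u, u, beta)
    return (1.0 - tau) * q_xa + tau * r + tau * s0 + tau * rate


def s_along_path(t):
    """Made-up smooth soft value along a deterministic trajectory."""
    return 2.0 + math.sin(t)


beta, u = 0.3, 0.2
# Exact derivative of exp(-beta t) * s_along_path(t) at t = 0.
exact = -beta * s_along_path(0.0) + math.cos(0.0)
s0, s_half, s_u = s_along_path(0.0), s_along_path(u / 2.0), s_along_path(u)
err_one_sided = abs(one_sided_rate(s0, s_u, u, beta) - exact)
err_richardson = abs(richardson_rate(s0, s_half, s_u, u, beta) - exact)
```

On this smooth example the Richardson error is markedly smaller than the one-sided error for the same $u$; with noisy or non-smooth value estimates the advantage may not hold.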

Notes

$u$ versus $\tau$

$u$ comes from the environment and may vary sample by sample. $\tau$ is the algorithmic step size controlling the critic/value flow update. Keeping them separate is essential.

Bridge to theory

The formulation page explains why the Hamiltonian flow is needed. This page explains how that idea becomes actual replay-buffer training targets, deterministic variants, and corrected short-horizon updates.