Pharmacokinetics Playground

A drug concentration over time has a known mechanistic backbone (absorption + elimination) but a hidden latent factor - each person's clearance. A non-negative, mean-variance Neural ODE learns the unknown rate law from real, sparse blood samples and predicts its own error bar. Same machinery as the Lotka-Volterra and KINNs playgrounds, on a real dataset.

Gabriel S. Gusmão
Gabriel S. Gusmão · real open data (theophylline) · see Neural ODE & KINNs

Give someone an oral dose of a drug and draw a handful of blood samples: the concentration rises as the drug is absorbed, then falls as the body clears it. The backbone of that curve is known: a one-compartment model with first-order absorption and elimination,

$$\frac{dA_g}{dt} = -k_a\,A_g,\qquad \frac{dC}{dt} = \frac{k_a}{V}\,A_g - \frac{\mathrm{CL}}{V}\,C$$

Here $C$ is the observed plasma concentration (mg/L), exactly what the blood samples measure and what every plot below shows. $A_g$ is the hidden gut amount (mg): the unabsorbed drug still in the absorption reservoir. It is a mass, not a concentration, and is never plotted; it is the latent state the model carries internally so the rise of $C$ has somewhere to come from. $k_a$ is the absorption rate, $V$ the volume of distribution, and $\mathrm{CL}$ the clearance. The structure is the same for everyone; what differs person to person is the latent clearance $\mathrm{CL}$, exactly the kind of hidden factor a minimal model is built to recover. Below is the classic theophylline dataset: 12 subjects, ~11 samples each over 25 h.

One distinction is worth stating up front. The Lotka-Volterra and KINNs playgrounds run on synthetic data with a known ground truth, so a recovered parameter can be plotted against the value that generated it. Pharmacokinetics here is real data with no known truth: the only references are the measured plasma samples and the closed-form one-compartment least-squares fit, and the gut amount $A_g$ is never measured at all. There is nothing to plot a recovered $\mathrm{CL}$ against except other estimators of the same latent factor. That changes what validation means: the held-out plasma points are the only arbiter, and a parameter is trusted only to the extent the data constrain it, which is exactly what the propagated uncertainty below quantifies.

Train the non-negative UDE

Now the interesting version. The same selected subject, but instead of the closed-form one-compartment least-squares fit we integrate the model with RK4 and learn the rate field by gradient descent through a discrete adjoint, the same machinery as the Lotka-Volterra Neural ODE, pointed at real pharmacokinetics. The model is the non-negative universal differential equation (UDE): only mass balance and positivity are imposed, the rate laws are learned. Hit Train on the floating control (bottom-right); the fit, loss, and uncertainty update live.

autonomous net $f_\theta(A_g, C, W) = (r_a,\, \mathrm{NN_1},\, \mathrm{NN_2},\, \sigma_\varphi)$: three softplus rate heads $\ge 0$ and a noise head
vector field $\dot A_g = -A_g\,r_a$
$\dot C = (A_g/D)\,c\,\mathrm{NN_1} - C\,\mathrm{NN_2}$
integrate (RK4) $\Rightarrow \hat C(t),\, A_g(t)\ \ge 0$
noise head (MLE) $\sigma^2(t) = \sigma_\varphi\big(A_g(t), C(t)\big) \Rightarrow {\pm}\sigma$ - the NODE predicts its own error bar
One shared autonomous network emits three non-negative rate heads and a noise head; mass balance assembles the rates into the vector field, RK4 integrates it to the prediction, and the noise head sets the ±σ at each state. Edit the architecture chips in Model options and this redraws.

Each subject's samples are split by time into train (through the peak and early elimination), test (mid elimination), and validation (the late tail), with the two boundaries set by the draggable handles on the plot. The training loss sees the train region only. The other two get no gradient, but they play different roles: test is the model-selection signal that drives the best-checkpoint / early-stop - selection maximizes the test log-likelihood (the proper objective; the R² is shown alongside) - while validation is the true held-out set, never used for fitting or selection, so its R² is the honest measure of generalization. The fitted curve is still integrated across the full 0-25 h. The shaded bands on both plots mark the three regions, and the two draggable handles on the plot let you move the train|test and test|val boundaries and watch the region R² re-score live. The story to watch: a flexible net can fit the train region almost perfectly yet drift on the held-out tail, and the uncertainty band widens exactly where the data stop constraining it.

What is actually minimized? The net is trained by gradient descent on a cost summed over the train points only, where each prediction $\hat C(t_i;\theta)$ is the UDE integrated to time $t_i$ (not read off a network) and $\theta$ are the net weights:

$$J_{\mathrm{MLE}}(\theta)=\sum_{i\,\in\,\mathrm{train}}\sigma_i^{2\beta}\left[\frac{\big(\hat C(t_i;\theta)-C_i\big)^2}{\sigma_i^2}+\log\sigma_i^2\right].$$

This is the heteroscedastic Gaussian negative log-likelihood, and the twist is that the variance $\sigma_i^2$ is predicted by the network itself: a mean-variance Neural ODE that predicts its own error bar. The same autonomous net carries a fourth output head $\sigma_\varphi$ alongside the three rate heads, so at the integrated state it emits both the rate field and the local observation variance $\sigma_i^2=\sigma_\varphi\big(A_g(t_i),\,C(t_i)\big)=\text{floor}+\text{scale}\cdot\mathrm{softplus}(o_3)>0$. The NODE therefore predicts its own error bar at each point, and that prediction is the $\sigma_i$ the likelihood is written in. Rates and noise descend this one objective in a single optimizer step; because $\sigma_i$ depends on the integrated state, $\partial J/\partial v\cdot\partial v/\partial x$ propagates back through the dynamics on the same discrete adjoint (an extra state-space seed). The $\beta$-weight $\sigma_i^{2\beta}$ (Seitzer et al. 2022, $\beta=0.5$) plus a short warmup on the mean fit stop the variance from inflating to explain away a poor fit (the standard mean-variance collapse). The net learns a tight $\sigma$ where the train data pin it and, being unconstrained on the held-out tail, an overconfident one out there - the honest failure mode, visible in the band below.

The Error view selector (Model options) switches the main plot between two read-only views. Interpolation shows the $\pm\sigma$ ribbon on the concentration $\hat C(t)$. Derivative space shows the learned rate $f_C=\dot C$ along the trajectory with its own band $\operatorname{Var}[\dot C(t)] = g_f(t)^\top\Sigma_w\,g_f(t)$, propagated from the same weight-space covariance $\Sigma_w$ that paints the $\pm\sigma$ band on $\hat C(t)$. Neither view is a training objective. Where the trajectory runs past the training data the derivative band widens, an honest signal that the net was never constrained out there.

The MLE option closes the loop between the fit and its uncertainty, and there are two distinct uncertainties. The aleatoric piece is the network's predicted noise $\sigma_\varphi$ - the measurement spread it expects at each state. The epistemic piece is parameter uncertainty: how well the data pin the net weights. The predicted noise plays the first role directly, and it also sets the scale for the second - it reweights the loss point-by-point ($w_i=1/\sigma_i^2$), setting the balance between interpolating the data and trusting the physics, and its typical value feeds the Fisher information $\mathcal{I} = J^\top J / \hat\sigma^2$, where $J_{ik} = \partial C(t_i)/\partial\theta_k$ are the sensitivities of the predicted concentration to the parameters, evaluated by the same discrete adjoint that supplies the training gradient. The covariance $\Sigma = \mathcal{I}^{-1}$ gives the epistemic $\pm\sigma$, and the shaded ribbon on the fitted curve is $g(t)^\top \Sigma\, g(t)$ with $g(t)_k = \partial C(t)/\partial\theta_k$. A direction the data barely constrain shows up as a wide $\pm\sigma$, which is the honest answer for a real dataset with no ground truth to check against.

Scope
Model options
Objective Loss MLE (mean-variance NODE - rKINNs error propagation) i
Integration
Mechanistic Hybrid learns
Architecture
i
Train
-
log L-
Test · selection
-
log L-
Val · held-out
-
log L-
Loss (train)
-
RMSE train - test - val - σ̂2 -
Best checkpoint (selected on test log-likelihood)
- @ epoch -
The blue dots are the measured samples; the dashed gray line is the Neural ODE's current (live) fit and the solid blue line is the frozen best checkpoint; the faint red dashed line is the closed-form one-compartment least-squares fit, the analytic baseline the UDE has to match. Recovered CL for theophylline should land around 1.5-4 L/h with a half-life of roughly 5-13 h.

Where these models go next

Two structural choices carry the whole playground, and neither is specific to theophylline. The first is positivity and mass balance by construction: the rate field is factored into non-negative heads, a pure-decay gut $\dot A_g = -A_g\,r_a$ and a fed-then-eliminated plasma $\dot C = (A_g/D)\,c\,\mathrm{NN_1} - C\,\mathrm{NN_2}$, with every head softplus-bounded, so the vector field cannot push a state below zero no matter what the net learns and $A_g$ decays monotonically from the dose. The constraint lives in the geometry of the field, not in a post-hoc clamp. The second is a model that predicts its own uncertainty: under MLE the same net carries a fourth head $\sigma_\varphi$ that outputs the local observation noise at each state (a mean-variance Neural ODE), trained jointly with the rates by the heteroscedastic $\beta$-NLL - the aleatoric piece. On top of it, rKINNs-style error propagation through the integrator adds the epistemic piece: the same discrete-adjoint sensitivities that supply the training gradient build the MLE covariance $\Sigma$. The $\pm\sigma$ band on $\hat C(t)$ is the two in quadrature, and a read-only derivative error-view shows the learned rate's posterior variance where the data are held out.

Both ideas generalize directly. The non-negative factored field extends to multi-compartment and target-mediated drug disposition (TMDD) models, where a saturable binding term couples drug and receptor and positivity is non-negotiable; to hierarchical / mixed-effects populations with genuine random effects rather than the single shared net used here; and to mechanism discovery, where a sparse regression (SINDy-style $L_1$ / STLSQ pruning) on the learned rate field recovers an interpretable rate law instead of an opaque net. The same machinery applies to any positive, mass-action-structured dynamical system: chemical reaction networks, compartmental epidemic models, and ecological predator-prey dynamics (the Lotka-Volterra link to the companion Neural ODE playground).

Other coupled physiological systems fit the same template. Glucose-insulin dynamics, for instance, can be cast as a compartment model - non-negative, coupled states with a minimal-model skeleton (glucose effectiveness, insulin-dependent disposal, a remote insulin-action compartment) but uncertain, person-specific rate laws, observed only as a noisy continuous-glucose time series. That is the same regime as here: a known backbone, an unknown latent rate field, positivity required by the physiology, and held-out uncertainty that matters because the tail is where a forecast either holds or fails. The non-negative UDE supplies the positivity-preserving structure and the mean-variance noise head plus the propagated parameter covariance supply the honest uncertainty - the two pieces such a model needs that an unconstrained black box does not give.

None of these extensions changes the core loop on this page; they change the skeleton the net hangs off and the states it has to keep positive. The theophylline curve is just the smallest example where a known mechanism, a hidden latent factor, and real held-out data all sit in one place.

What ties this page to the rest of the program is the treatment of the data-physics balance and its uncertainty. A Neural ODE embedded as the right-hand side of an integrator is a universal differential equation: a known mechanistic skeleton with a learned component for the unknown rate law. The same construction underlies the KINNs formulation, where the mechanism is known and the network is a surrogate for the state trajectory, and the Lotka-Volterra playground, where the rate field itself is learned. What is taken from rKINNs is not the problem setup but the error-propagation machinery: the discrete-adjoint sensitivities that supply the training gradient also build the Fisher information and the Laplace covariance, so the uncertainty on the recovered rate law is derived from the same model geometry rather than estimated separately or set by a hand-tuned weight. The one addition specific to a real dataset with no ground truth is the mean-variance head, which lets the model report the aleatoric noise it expects alongside the epistemic uncertainty the data leave on its parameters. Population pharmacokinetics has historically resolved the same between-subject variability with nonlinear mixed-effects estimation; the construction here recovers a comparable decomposition through a single shared network and a propagated covariance, without committing in advance to a parametric rate law. The broader claim is narrow and testable: for positive, mechanistically structured dynamical systems observed sparsely and with noise, the rate law and a calibrated uncertainty on it can be read off the same integrator that fits the data, and the same machinery transfers across pharmacokinetics, chemical kinetics, epidemiology, and the coupled physiological systems where the held-out tail is what a forecast lives or dies on.

References

  1. Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural Ordinary Differential Equations. NeurIPS 2018. arXiv:1806.07366
  2. Gusmão, G. S.; Retnanto, A. P.; da Silva, S. C.; Medford, A. J. Kinetics-Informed Neural Networks. Catalysis Today, 2023. doi:10.1016/j.cattod.2022.04.002
  3. Gusmão, G. S.; Medford, A. J. Maximum-Likelihood Estimators in Physics-Informed Neural Networks for High-Dimensional Inverse Problems. Computers & Chemical Engineering, 2024, 182, 108547. doi:10.1016/j.compchemeng.2023.108547
  4. Rackauckas, C.; Ma, Y.; Martensen, J.; Warner, C.; Zubov, K.; Supekar, R.; Skinner, D.; Ramadhan, A.; Edelman, A. Universal Differential Equations for Scientific Machine Learning. 2020. arXiv:2001.04385
  5. Seitzer, M.; Tavakoli, A.; Antic, D.; Martius, G. On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. ICLR 2022. arXiv:2203.09168
  6. Beal, S. L.; Sheiner, L. B.; Boeckmann, A. J.; Bauer, R. J. NONMEM Users Guides. Icon Development Solutions, Ellicott City, MD, 1989-2018. Nonlinear mixed-effects modeling for population pharmacokinetics.
  7. Boeckmann, A. J.; Sheiner, L. B.; Beal, S. L. Theophylline. Classic open population-pharmacokinetics dataset (12 subjects, oral dose), distributed with NONMEM and the R datasets / nlme packages.
Cost function and how the bands are built

The objective is the heteroscedastic Gaussian negative log-likelihood, run as a mean-variance Neural ODE. The same autonomous net grows a 4th output head $\sigma_\varphi$ next to the three rate heads, so at the integrated state it predicts both the rate field and the local observation variance $\sigma_i^2=\sigma_\varphi(A_g,C)=\text{floor}+\text{scale}\cdot\mathrm{softplus}(o_3)>0$. The NODE literally predicts its own error bar. Heads and noise descend the SAME Gaussian $\beta$-NLL $\sum_\mathrm{train}\sigma_i^{2\beta}[r_i^2/\sigma_i^2+\log\sigma_i^2]$ ($\beta=0.5$, with a short warmup on the mean fit so the variance cannot inflate to explain away misfit) in one optimizer step. Because $\sigma_i$ depends on the integrated state, $\partial J/\partial v\cdot\partial v/\partial x$ rides back through the dynamics on the same discrete adjoint (an extra state-space seed). Each point is weighted by its inverse predicted variance $1/\sigma_i^2$, so the likelihood you minimize is the one the uncertainty is read off of. The net learns a tight $\sigma$ where the train data pin it and an overconfident one on the unconstrained held-out tail - visible as a small $\sigma$ where the fit has since drifted.

The $\pm\sigma$ band: two pieces. The aleatoric piece is the head's predicted noise $\sigma_\varphi$ at each state. The epistemic piece is parameter uncertainty, propagated through the model: we form the sensitivities $J_{ik}=\partial \hat C(t_i)/\partial\theta_k$ on the train points by running them through the SAME integrator (the discrete adjoint of the NODE), then the Fisher information $\mathcal I = J^\top J/\hat\sigma^2$ (with $\hat\sigma^2$ the typical head variance) and covariance $\Sigma=\mathcal I^{-1}$. The epistemic curve band is $\mathrm{var}\,\hat C(t)=g(t)^\top\Sigma\,g(t)$ with $g(t)=\partial\hat C(t)/\partial\theta$. Here $\theta$ are the net weights $w$: since $n_w\gg n_\mathrm{train}$ we never form $J^\top J$ but use the linearized-Laplace covariance $\Sigma_w=\hat\sigma^2(J^\top J+\lambda I)^{-1}$ through the Woodbury identity on the small $n_\mathrm{train}\times n_\mathrm{train}$ Gram, which is why the band widens in the held-out tail: there $g(t)$ points outside the row space the training points ever constrained.

The total ribbon is the two added in quadrature: $\sqrt{\,\sigma_\varphi^2 + g(t)^\top\Sigma\,g(t)\,}$ - the predicted measurement noise plus the parameter uncertainty - so it is a genuine predictive interval for a new observation, tight where the data constrain the net and wide on the held-out tail.

Live architecture editor

The chips are the real network: in takes $(A_g, C, W/70)$ - an autonomous field with no explicit time input, the rate law depends only on the state and the constant weight covariate - each hidden chip is one layer of swish units, and out is four heads of the same net: three softplus rate heads $(r_a,\mathrm{NN_1},\mathrm{NN_2})\ge 0$ that build $\dot A_g=-A_g\,r_a$ and $\dot C=(A_g/D)\,c\,\mathrm{NN_1}-C\,\mathrm{NN_2}$ (gut emptying $r_a$, absorption $\mathrm{NN_1}$ feeding plasma from the gut, elimination $\mathrm{NN_2}$; non-negative by construction), plus a noise head $\sigma_\varphi$ that predicts the local observation variance for the MLE (it does not enter the dynamics). Use $-$ / $+$ to change a layer's neuron count and + hidden layer to add depth; the net and optimizer rebuild live on each change.

The $W/70$ input is the black-box analog of allometry: instead of imposing $\mathrm{CL}\propto W^{0.75}$, the network learns the weight dependence, so one shared net can fit subjects of different size.

Why this is the correctness anchor

For a one-compartment model the analytic solution is the exact least-squares optimum. Fitting the hybrid backbone by gradient descent on the same points must therefore converge to it. Agreement to about 1 percent is the check that the adjoint gradient, integrator, and optimizer are all correct.

The live readout above trains on the train region only and so targets a different, held-out objective; this table fits on every point, which is the apples-to-apples comparison.

Epoch0