Pharmacokinetics Playground
A drug concentration over time has a known mechanistic backbone (absorption + elimination) but a hidden latent factor - each person's clearance. A non-negative, mean-variance Neural ODE learns the unknown rate law from real, sparse blood samples and predicts its own error bar. Same machinery as the Lotka-Volterra and KINNs playgrounds, on a real dataset.
Give someone an oral dose of a drug and draw a handful of blood samples: the concentration rises as the drug is absorbed, then falls as the body clears it. The backbone of that curve is known: a one-compartment model with first-order absorption and elimination,
$$\frac{dA_g}{dt} = -k_a\,A_g,\qquad \frac{dC}{dt} = \frac{k_a}{V}\,A_g - \frac{\mathrm{CL}}{V}\,C$$
Here $C$ is the observed plasma concentration (mg/L), exactly what the blood samples measure and what every plot below shows. $A_g$ is the hidden gut amount (mg): the unabsorbed drug still in the absorption reservoir. It is a mass, not a concentration, and is never plotted; it is the latent state the model carries internally so the rise of $C$ has somewhere to come from. $k_a$ is the absorption rate, $V$ the volume of distribution, and $\mathrm{CL}$ the clearance. The structure is the same for everyone; what differs person to person is the latent clearance $\mathrm{CL}$, exactly the kind of hidden factor a minimal model is built to recover. Below is the classic theophylline dataset: 12 subjects, ~11 samples each over 25 h.
One distinction is worth stating up front. The Lotka-Volterra and KINNs playgrounds run on synthetic data with a known ground truth, so a recovered parameter can be plotted against the value that generated it. Pharmacokinetics here is real data with no known truth: the only references are the measured plasma samples and the closed-form one-compartment least-squares fit, and the gut amount $A_g$ is never measured at all. There is nothing to plot a recovered $\mathrm{CL}$ against except other estimators of the same latent factor. That changes what validation means: the held-out plasma points are the only arbiter, and a parameter is trusted only to the extent the data constrain it, which is exactly what the propagated uncertainty below quantifies.
Train the non-negative UDE
Now the interesting version. The same selected subject, but instead of the closed-form one-compartment least-squares fit we integrate the model with RK4 and learn the rate field by gradient descent through a discrete adjoint, the same machinery as the Lotka-Volterra Neural ODE, pointed at real pharmacokinetics. The model is the non-negative universal differential equation (UDE): only mass balance and positivity are imposed, the rate laws are learned. Hit Train on the floating control (bottom-right); the fit, loss, and uncertainty update live.
$\dot C = (A_g/D)\,c\,\mathrm{NN_1} - C\,\mathrm{NN_2}$
Each subject's samples are split by time into train (through the peak and early elimination), test (mid elimination), and validation (the late tail), with the two boundaries set by the draggable handles on the plot. The training loss sees the train region only. The other two get no gradient, but they play different roles: test is the model-selection signal that drives the best-checkpoint / early-stop - selection maximizes the test log-likelihood (the proper objective; the R² is shown alongside) - while validation is the true held-out set, never used for fitting or selection, so its R² is the honest measure of generalization. The fitted curve is still integrated across the full 0-25 h. The shaded bands on both plots mark the three regions, and the two draggable handles on the plot let you move the train|test and test|val boundaries and watch the region R² re-score live. The story to watch: a flexible net can fit the train region almost perfectly yet drift on the held-out tail, and the uncertainty band widens exactly where the data stop constraining it.
What is actually minimized? The net is trained by gradient descent on a cost summed over the train points only, where each prediction $\hat C(t_i;\theta)$ is the UDE integrated to time $t_i$ (not read off a network) and $\theta$ are the net weights:
$$J_{\mathrm{MLE}}(\theta)=\sum_{i\,\in\,\mathrm{train}}\sigma_i^{2\beta}\left[\frac{\big(\hat C(t_i;\theta)-C_i\big)^2}{\sigma_i^2}+\log\sigma_i^2\right].$$
This is the heteroscedastic Gaussian negative log-likelihood, and the twist is that the variance $\sigma_i^2$ is predicted by the network itself: a mean-variance Neural ODE that predicts its own error bar. The same autonomous net carries a fourth output head $\sigma_\varphi$ alongside the three rate heads, so at the integrated state it emits both the rate field and the local observation variance $\sigma_i^2=\sigma_\varphi\big(A_g(t_i),\,C(t_i)\big)=\text{floor}+\text{scale}\cdot\mathrm{softplus}(o_3)>0$. The NODE therefore predicts its own error bar at each point, and that prediction is the $\sigma_i$ the likelihood is written in. Rates and noise descend this one objective in a single optimizer step; because $\sigma_i$ depends on the integrated state, $\partial J/\partial v\cdot\partial v/\partial x$ propagates back through the dynamics on the same discrete adjoint (an extra state-space seed). The $\beta$-weight $\sigma_i^{2\beta}$ (Seitzer et al. 2022, $\beta=0.5$) plus a short warmup on the mean fit stop the variance from inflating to explain away a poor fit (the standard mean-variance collapse). The net learns a tight $\sigma$ where the train data pin it and, being unconstrained on the held-out tail, an overconfident one out there - the honest failure mode, visible in the band below.
The Error view selector (Model options) switches the main plot between two read-only views. Interpolation shows the $\pm\sigma$ ribbon on the concentration $\hat C(t)$. Derivative space shows the learned rate $f_C=\dot C$ along the trajectory with its own band $\operatorname{Var}[\dot C(t)] = g_f(t)^\top\Sigma_w\,g_f(t)$, propagated from the same weight-space covariance $\Sigma_w$ that paints the $\pm\sigma$ band on $\hat C(t)$. Neither view is a training objective. Where the trajectory runs past the training data the derivative band widens, an honest signal that the net was never constrained out there.
The MLE option closes the loop between the fit and its uncertainty, and there are two distinct uncertainties. The aleatoric piece is the network's predicted noise $\sigma_\varphi$ - the measurement spread it expects at each state. The epistemic piece is parameter uncertainty: how well the data pin the net weights. The predicted noise plays the first role directly, and it also sets the scale for the second - it reweights the loss point-by-point ($w_i=1/\sigma_i^2$), setting the balance between interpolating the data and trusting the physics, and its typical value feeds the Fisher information $\mathcal{I} = J^\top J / \hat\sigma^2$, where $J_{ik} = \partial C(t_i)/\partial\theta_k$ are the sensitivities of the predicted concentration to the parameters, evaluated by the same discrete adjoint that supplies the training gradient. The covariance $\Sigma = \mathcal{I}^{-1}$ gives the epistemic $\pm\sigma$, and the shaded ribbon on the fitted curve is $g(t)^\top \Sigma\, g(t)$ with $g(t)_k = \partial C(t)/\partial\theta_k$. A direction the data barely constrain shows up as a wide $\pm\sigma$, which is the honest answer for a real dataset with no ground truth to check against.
Where these models go next
Two structural choices carry the whole playground, and neither is specific to theophylline. The first is positivity and mass balance by construction: the rate field is factored into non-negative heads, a pure-decay gut $\dot A_g = -A_g\,r_a$ and a fed-then-eliminated plasma $\dot C = (A_g/D)\,c\,\mathrm{NN_1} - C\,\mathrm{NN_2}$, with every head softplus-bounded, so the vector field cannot push a state below zero no matter what the net learns and $A_g$ decays monotonically from the dose. The constraint lives in the geometry of the field, not in a post-hoc clamp. The second is a model that predicts its own uncertainty: under MLE the same net carries a fourth head $\sigma_\varphi$ that outputs the local observation noise at each state (a mean-variance Neural ODE), trained jointly with the rates by the heteroscedastic $\beta$-NLL - the aleatoric piece. On top of it, rKINNs-style error propagation through the integrator adds the epistemic piece: the same discrete-adjoint sensitivities that supply the training gradient build the MLE covariance $\Sigma$. The $\pm\sigma$ band on $\hat C(t)$ is the two in quadrature, and a read-only derivative error-view shows the learned rate's posterior variance where the data are held out.
Both ideas generalize directly. The non-negative factored field extends to multi-compartment and target-mediated drug disposition (TMDD) models, where a saturable binding term couples drug and receptor and positivity is non-negotiable; to hierarchical / mixed-effects populations with genuine random effects rather than the single shared net used here; and to mechanism discovery, where a sparse regression (SINDy-style $L_1$ / STLSQ pruning) on the learned rate field recovers an interpretable rate law instead of an opaque net. The same machinery applies to any positive, mass-action-structured dynamical system: chemical reaction networks, compartmental epidemic models, and ecological predator-prey dynamics (the Lotka-Volterra link to the companion Neural ODE playground).
Other coupled physiological systems fit the same template. Glucose-insulin dynamics, for instance, can be cast as a compartment model - non-negative, coupled states with a minimal-model skeleton (glucose effectiveness, insulin-dependent disposal, a remote insulin-action compartment) but uncertain, person-specific rate laws, observed only as a noisy continuous-glucose time series. That is the same regime as here: a known backbone, an unknown latent rate field, positivity required by the physiology, and held-out uncertainty that matters because the tail is where a forecast either holds or fails. The non-negative UDE supplies the positivity-preserving structure and the mean-variance noise head plus the propagated parameter covariance supply the honest uncertainty - the two pieces such a model needs that an unconstrained black box does not give.
None of these extensions changes the core loop on this page; they change the skeleton the net hangs off and the states it has to keep positive. The theophylline curve is just the smallest example where a known mechanism, a hidden latent factor, and real held-out data all sit in one place.
What ties this page to the rest of the program is the treatment of the data-physics balance and its uncertainty. A Neural ODE embedded as the right-hand side of an integrator is a universal differential equation: a known mechanistic skeleton with a learned component for the unknown rate law. The same construction underlies the KINNs formulation, where the mechanism is known and the network is a surrogate for the state trajectory, and the Lotka-Volterra playground, where the rate field itself is learned. What is taken from rKINNs is not the problem setup but the error-propagation machinery: the discrete-adjoint sensitivities that supply the training gradient also build the Fisher information and the Laplace covariance, so the uncertainty on the recovered rate law is derived from the same model geometry rather than estimated separately or set by a hand-tuned weight. The one addition specific to a real dataset with no ground truth is the mean-variance head, which lets the model report the aleatoric noise it expects alongside the epistemic uncertainty the data leave on its parameters. Population pharmacokinetics has historically resolved the same between-subject variability with nonlinear mixed-effects estimation; the construction here recovers a comparable decomposition through a single shared network and a propagated covariance, without committing in advance to a parametric rate law. The broader claim is narrow and testable: for positive, mechanistically structured dynamical systems observed sparsely and with noise, the rate law and a calibrated uncertainty on it can be read off the same integrator that fits the data, and the same machinery transfers across pharmacokinetics, chemical kinetics, epidemiology, and the coupled physiological systems where the held-out tail is what a forecast lives or dies on.
References
- Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural Ordinary Differential Equations. NeurIPS 2018. arXiv:1806.07366
- Gusmão, G. S.; Retnanto, A. P.; da Silva, S. C.; Medford, A. J. Kinetics-Informed Neural Networks. Catalysis Today, 2023. doi:10.1016/j.cattod.2022.04.002
- Gusmão, G. S.; Medford, A. J. Maximum-Likelihood Estimators in Physics-Informed Neural Networks for High-Dimensional Inverse Problems. Computers & Chemical Engineering, 2024, 182, 108547. doi:10.1016/j.compchemeng.2023.108547
- Rackauckas, C.; Ma, Y.; Martensen, J.; Warner, C.; Zubov, K.; Supekar, R.; Skinner, D.; Ramadhan, A.; Edelman, A. Universal Differential Equations for Scientific Machine Learning. 2020. arXiv:2001.04385
- Seitzer, M.; Tavakoli, A.; Antic, D.; Martius, G. On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. ICLR 2022. arXiv:2203.09168
- Beal, S. L.; Sheiner, L. B.; Boeckmann, A. J.; Bauer, R. J. NONMEM Users Guides. Icon Development Solutions, Ellicott City, MD, 1989-2018. Nonlinear mixed-effects modeling for population pharmacokinetics.
-
Boeckmann, A. J.; Sheiner, L. B.; Beal, S. L.
Theophylline. Classic open population-pharmacokinetics dataset
(12 subjects, oral dose), distributed with NONMEM and the R
datasets/nlmepackages.