Lotka–Volterra Neural ODE Playground
A Neural ODE learns an unknown predator–prey rate law from noisy data. A single structural prior, enabled or disabled, separates a purely data-driven fit from a model that recovers the rate law's structure and extrapolates from it.
A Neural ODE recovers the rate law of an unknown predator–prey system from noisy, sparsely sampled observations. The top control, Model, selects between two formulations. A black-box NODE parameterizes the entire rate function with a free network and imposes structure only softly, through a monomial prior weighted by error propagation; it recovers the low-order structure but degrades at higher orders. A hybrid NODE is a monomial network whose hidden units are $h = \exp(W\log\mathbf u)$, so the first-layer weights $W$ are the powers and the read-out head the coefficients, $f_x = x^{\hat p_x}(c_0 + c_1 y^{\hat q_{xy}})$ (and symmetrically $f_y = y^{\hat p_y}(c_2 + c_3 x^{\hat q_{yx}})$); it recovers all the orders, diagonal and cross, at the cost of assuming the power-law form. Three panels track training: the populations plot, the phase portrait, and the discovery panel. The sections that follow set out the black-box formulation; the hybrid is summarized in the hybrid alternative at the end, with derivations collapsed by default.
1 · The Neural ODE as a learnable rate function
In a neural ordinary differential equation (NODE) [1] the network does not predict the answer directly. It parameterizes the rate of change, the right-hand side of a dynamical system, and the trajectory is recovered by integrating it forward in time:
\[ \dot{\mathbf{u}}(t) \;=\; \mathbf{f}_\theta(\mathbf{u}(t)), \qquad \mathbf{u}(t) \;=\; \mathbf{u}_0 + \!\int_{t_0}^{t}\! \mathbf{f}_\theta(\mathbf{u}(\tau))\,d\tau. \]So $\mathbf{f}_\theta$ is the unknown vector field. Training nudges the network until the integrated curve $\mathbf{u}(t;\theta)$ passes through the observed states; the gradients come from the discrete adjoint of the integrator.1 On its own, this construction encodes no prior on the class of system being fit. We close that gap with the physics principle once the problem and its data are in place (§4).
2 · Test system: the Lotka–Volterra equations
The predator–prey system serves as a low-dimensional test problem in which the ground-truth model is known but withheld from the network. In its general two-monomial form,
\[ \begin{aligned} \dot{x} &= \alpha\, x^{p_x} \;-\; \beta\, x^{p_x}\, y^{q_{xy}}, \\ \dot{y} &= \delta\, x^{q_{yx}}\, y^{p_y} \;-\; \gamma\, y^{p_y}, \end{aligned} \]with the classical Lotka–Volterra recovered at $(p_x, p_y, q_{xy}, q_{yx}) = (1, 1, 1, 1)$. Each of the eight scalars $(\alpha, \beta, \delta, \gamma, p_x, p_y, q_{xy}, q_{yx})$ is independently controllable from the panel above the time-series plot. Observations are noisy snapshots of $(x, y)$ at fixed temporal intervals; the noise level $\sigma_\mathrm{obs}$ and the sampling density are set in the same panel.
3 · Train, test, and validation partitions
The temporal axis is partitioned into three contiguous regions by two draggable handles. The first $t < t_\mathrm{train}$ defines the training set used to compute $\nabla_\theta \mathcal{L}$. The interval $t_\mathrm{train} \le t < t_\mathrm{val}$ is held out as the test set; the training step at which the test loss attains its minimum identifies the checkpoint $\theta^\star$ retained for downstream evaluation. The final region $t \ge t_\mathrm{val}$ is the validation hold-out, which receives no gradient signal and is never used for checkpoint selection; reporting the validation loss at $\theta^\star$ provides an unbiased estimate of generalization beyond the training distribution.
4 · The structural prior: data-driven and physics-informed regimes
With the problem and its data defined, the Physics priors control in the training-mode bar enables or disables a single structural assumption. The two regimes yield qualitatively different models from the same Neural ODE.
Disabled: an unconstrained Neural ODE. The network approximates the rate function and is constrained only by the data, which determines the rate function along the observed orbit. Away from that orbit the field is unidentified: distinct rate functions reproduce the same trajectory on the data while diverging elsewhere. The model interpolates the observations but recovers no structure, extrapolates poorly, and yields no characterization of the underlying dynamics.
Enabled: a physics-informed Neural ODE. We impose one generic structural assumption: each term of the rate function is locally a monomial in its arguments. This does not prescribe the rate law. It constrains the log-derivative to be constant and lets the data determine its value (§6). Under this constraint the network resolves the low-order diagonal structure of the rate law (the self-order exponents $\hat{p}_x,\hat{p}_y$); the cross orders emerge only as an orbit-weighted diagnostic rather than a true identification (§8). It extrapolates substantially better because the prior continues to constrain the solution where the data are uninformative, and reports that structure under a data-to-physics weighting that requires no tuned hyperparameter (§7). Toggling the control isolates the effect of the prior on the recovered exponents.
The distinction is most evident in the validation region. With the prior disabled, the integrated trajectory departs from the data beyond the observation window and the recovered exponents drift without converging. With the prior enabled, the trajectory retains its structure past the data and $\hat{p}_x,\hat{p}_y$ converge toward the true exponents.
5 · Trajectory fitting by multiple shooting
The network is trained to make its integrated curve pass through the noisy observations. There is a subtlety the playground lets you see directly: errors in the rate function compound through the integrator, so late points carry more accumulated error than early ones. Left unchecked, training over-corrects the far end of the orbit and training diverges. This is the classic long-horizon Neural ODE failure. The default multiple-shooting method sidesteps it by fitting many short segments anchored at real observations instead of one long shot. The gradient-method dropdown also offers two annealed curricula that grow the horizon gradually.
The View toggle above the populations plot switches between the two pictures. Segments shows the short forward/backward arcs the optimizer actually fits between observations. Full integral shows one continuous curve integrated from the true initial condition of each region, which isolates the behavior of the learned rate function under self-integration.
▶The data objective and horizon-annealing schedules derivation
With Gaussian observation noise of covariance $\boldsymbol\Sigma_\mathbf{x}$ (precision $\boldsymbol\Omega_\mathbf{x} \equiv \boldsymbol\Sigma_\mathbf{x}^{-1}$), the negative log-likelihood of the data under the integrated trajectory is the Mahalanobis misfit
\[ \mathcal{L}_\mathrm{data}(\theta) \;=\; \frac{1}{2} \sum_{i=1}^{N} \boldsymbol\varepsilon^{\mathrm{obs}\top}_i\,\boldsymbol\Omega_\mathbf{x}\, \boldsymbol\varepsilon^{\mathrm{obs}}_i, \qquad \boldsymbol\varepsilon^{\mathrm{obs}}_i = \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i, \]where $\hat{\mathbf{u}}(t_i;\theta)$ denotes the RK4 forward integration of $\mathbf{f}_\theta$. The isotropic case $\boldsymbol\Sigma_\mathbf{x} = \sigma^2_\mathrm{obs}\mathbf{I}$ reduces this to the scalar-weighted sum of squares $\tfrac{\omega_\mathrm{obs}}{2}\sum_i\lVert\boldsymbol\varepsilon^{\mathrm{obs}}_i\rVert^2$ with $\omega_\mathrm{obs}=1/\sigma^2_\mathrm{obs}$; the full $\boldsymbol\Sigma_\mathbf{x}$ is the single free covariance of the maximum-likelihood weighting in §7, where it is estimated from the residuals and propagated to the physics term. The residual at observation $i$ decomposes into the observational noise on $\mathbf{u}^\mathrm{obs}_i$ and the accumulated integration error $\int_{t_0}^{t_i} [\mathbf{f}_\theta(\mathbf{u}) - \mathbf{f}(\mathbf{u})]\,d\tau$, whose growth with $t_i$ is what biases a uniformly-weighted MLE toward late-time residuals (the long-horizon failure described above). Quantitatively, the magnitude of this term scales with the integration horizon $t_i - t_0$, so the bias is set by how far the supervised window extends past each anchor.
Two annealed curricula mitigate it: time-decay weighting down-weights each residual by a factor that decays with its distance from $t_\mathrm{start}$, with the decay rate $\alpha(\tau)$ annealed to zero so the supervised horizon lengthens across training; horizon annealing is the discrete counterpart, supervising only the first $K(\tau)$ observations of each batch with $K$ growing to the full batch size. Both are selectable from the training-mode bar below the plots (formula in the curriculum tooltip).
6 · The monomial prior as residual constraints
The prior from §4 says each term of the rate law is a power law (a monomial), $f_i \propto u_j^{\,p}$. To make it useful we turn it into something the optimizer can drive to zero. A power law has a constant log-slope, so taking the logarithmic derivative turns “is a monomial” into a simple algebraic condition:
\[ \frac{\partial \ln |f_i|}{\partial \ln u_j} \;=\; \frac{u_j}{f_i}\,\frac{\partial f_i}{\partial u_j} \;=\; p \qquad\Longleftrightarrow\qquad u_j\,\frac{\partial f_i}{\partial u_j} \;-\; p\, f_i \;=\; 0. \]For our two-variable system this gives four residuals on the network output, with four learnable exponents $(\hat{p}_x, \hat{p}_y, \hat{q}_{xy}, \hat{q}_{yx})$ that the optimizer discovers jointly with the weights $\theta$. These four exponents are exactly the numbers in the discovery panel.
\[ \begin{aligned} r_x &= x\,\partial f_x/\partial x \;-\; \hat{p}_x \, f_x, \\ r_y &= y\,\partial f_y/\partial y \;-\; \hat{p}_y \, f_y, \\ r_{xy} &= y\,\partial f_x/\partial y \;-\; \hat{q}_{xy}\, f_x, \\ r_{yx} &= x\,\partial f_y/\partial x \;-\; \hat{q}_{yx}\, f_y. \end{aligned} \]Driving these residuals toward zero forces the network's rate function to behave like a power law of the discovered order, term by term, without supplying the true exponents.
▶Matrix form of the residuals and convergence of the cross-exponents derivation
Collected into a single matrix relation, let $\mathbf{u} = (x, y)^\top$, $\mathbf{f}_\theta = (f_x, f_y)^\top$, $\mathbf{D} = \mathrm{diag}(x, y)$, and let $\mathbf{J}_\theta \in \mathbb{R}^{2\times 2}$ be the Jacobian $[\mathbf{J}_\theta]_{ij} = \partial f_i / \partial u_j$. Define the log-Jacobian $\mathbf{J}^{\log}_\theta \;\equiv\; \mathbf{J}_\theta\,\mathbf{D}$ so that $[\mathbf{J}^{\log}_\theta]_{ij} = u_j\,\partial f_i/\partial u_j$. The four residuals above are the entries of a single matrix,
\[ \begin{aligned} \mathbf{R}(\mathbf{u}, \theta, \hat{\mathbf{P}}) &= \mathbf{J}^{\log}_\theta(\mathbf{u}) \;-\; \hat{\mathbf{P}} \odot \bigl(\mathbf{f}_\theta(\mathbf{u})\,\mathbf{1}^\top\bigr), \\[4pt] \hat{\mathbf{P}} &= \begin{pmatrix} \hat{p}_x & \hat{q}_{xy} \\ \hat{q}_{yx} & \hat{p}_y \end{pmatrix}. \end{aligned} \]with $\odot$ the Hadamard product and $\mathbf{1}=(1,1)^\top$, so that $r_x = [\mathbf{R}]_{xx}$, $r_y = [\mathbf{R}]_{yy}$, $r_{xy} = [\mathbf{R}]_{xy}$, $r_{yx} = [\mathbf{R}]_{yx}$. The residuals are evaluated at collocation points sampled in the neighborhood of the current NN trajectory. For a pure monomial rate function $\mathbf{R} \equiv 0$; for a sum of monomials (Lotka–Volterra is the simplest case) no choice of $\hat{\mathbf{P}}$ annihilates all entries simultaneously, so the off-diagonal entries $\hat{q}_{xy}, \hat{q}_{yx}$ are not identified — see §8 for how they converge and how to read the discovery panel.
7 · Maximum-likelihood weighting by error propagation
A physics-informed objective introduces a regularization weight that balances the physics residual against the data misfit. Its value is consequential: too large and the prior dominates the likelihood, too small and it is inert, and conventionally it is selected by hand. The rKINNs [3] formulation eliminates it. The only free quantity is the observation-error covariance $\boldsymbol\Sigma_\mathbf{x}$; the physics-residual covariance is not estimated independently but obtained from $\boldsymbol\Sigma_\mathbf{x}$ by first-order propagation through the model. Each residual is then weighted by the precision of the integrated state that produces it.
The shaded $\pm\sigma$ bands on the discovery panel are this propagated uncertainty made visible: the bands are wide while the fit is poorly constrained and contract as the network resolves the structure.
▶The propagation law and the concentrated maximum-likelihood objective derivation
The state is 2D, so the residuals are vectors and their second-order statistics are matrices. Let $\boldsymbol\varepsilon^{\mathrm{obs}}_i = \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i \in \mathbb{R}^2$ be the per-observation data residual and $\mathbf{r}_\mathbf{R}(\mathbf{x}_c) \equiv \mathrm{vec}\,\mathbf{R}(\mathbf{x}_c, \theta, \hat{\mathbf{P}}) \in \mathbb{R}^4$ the stacked physics residual at integrated collocation point $\mathbf{x}_c$. The crucial point is that only one covariance is a free unknown, the state-error covariance $\boldsymbol\Sigma_\mathbf{x}$. The physics residual covariance $\boldsymbol\Sigma_\mathbf{R}$ is not a separate free parameter; it is determined deterministically from $\boldsymbol\Sigma_\mathbf{x}$ by the propagation law derived below.
The rKINNs propagation law
Every point $\mathbf{x}_c$ at which we evaluate the physics residual is an output of the NODE integration of $\mathbf{f}_\theta$ from the initial condition; the state at $\mathbf{x}_c$ carries an error covariance $\boldsymbol\Sigma_\mathbf{x}(\mathbf{x}_c)$ obtained by transporting the observation covariance through the integrator. A first-order expansion of the vectorized physics residual around the integrated state,
\[ \delta \mathbf{r}_\mathbf{R}(\mathbf{x}_c) \;\approx\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_c} \,\delta \mathbf{x}_c, \qquad \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}} \;\in\; \mathbb{R}^{4\times 2}, \]yields the central rKINNs propagation law: at each integrated point $\mathbf{x}_c$, the physics-residual covariance is a Jacobian sandwich of the state covariance at that same point,
\[ \boxed{\;\; \boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c) \;=\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_c} \;\boldsymbol\Sigma_\mathbf{x}\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}^{\!\top}\bigg|_{\mathbf{x}_c}. \;\;} \]Single-source chain: $\boldsymbol\Sigma_\mathbf{x}$ estimated from the data residuals $\to$ $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)$ at every collocation point via the residual Jacobian. The precision $\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c) \equiv \boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)$ used in the loss is then a deterministic function of $\theta$ and $\boldsymbol\Sigma_\mathbf{x}$, evaluated at each $\mathbf{x}_c$ along the orbit.
The MLE objective with propagated physics covariance
With the propagation law in hand, only $\boldsymbol\Sigma_\mathbf{x}$ is a free covariance; the physics term uses the propagated $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=J\boldsymbol\Sigma_\mathbf{x}J^\top$. Profiling $\partial\mathcal{L}/\partial\boldsymbol\Sigma_\mathbf{x}=0$ gives the sample covariance of the data residuals, $\hat{\boldsymbol\Sigma}_\mathbf{x}=\tfrac{1}{N}\sum_i\boldsymbol\varepsilon^{\mathrm{obs}}_i\boldsymbol\varepsilon^{\mathrm{obs}\top}_i$, and substituting back yields the rKINNs concentrated objective,
\[ \begin{aligned} \mathcal{L}_\mathrm{rKINNs}(\theta, \hat{\mathbf{P}}) \;=\; &\tfrac{N}{2}\,\log\bigl|\hat{\boldsymbol\Sigma}_\mathbf{x}(\theta)\bigr| \\[4pt] +\; &\tfrac{1}{2}\!\sum_{c=1}^{N_c}\!\Bigl[ \log\bigl|\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)\bigr| + \mathbf{r}_\mathbf{R}^{\top}(\mathbf{x}_c) \,\boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)\, \mathbf{r}_\mathbf{R}(\mathbf{x}_c) \Bigr] \,+\, \text{const}. \end{aligned} \]A caveat on rank. The residual Jacobian $\partial\mathbf{r}_\mathbf{R}/\partial\mathbf{x}\in\mathbb{R}^{4\times 2}$ carries a two-dimensional state covariance into a four-component residual space, so $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)$ is a $4\times4$ matrix of rank at most $2$ — singular. The $\log\bigl|\boldsymbol\Sigma_\mathbf{R}\bigr|$ and $\boldsymbol\Sigma_\mathbf{R}^{-1}$ above are thus the general-form objective: in this two-dimensional demo the weighting uses the diagonal of $\boldsymbol\Sigma_\mathbf{R}$ (one scalar precision per residual class), while the full-rank rKINNs treatment conditions the covariance through an SVD preconditioning of the coupling map before inversion [3].
In practice we optimize the precision-weighted, per-sample-mean form used in rKINNs [3]. The precision matrices $\boldsymbol\Omega_\mathbf{x}=\boldsymbol\Sigma_\mathbf{x}^{-1}$ and $\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c)=\boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)$ are held fixed within each step and refreshed from the residuals, exactly as in the KINNs auto-weighting scheme:
\[ \mathcal{L}_\mathrm{rKINNs} \;=\; \frac{1}{N}\sum_{i=1}^{N} \boldsymbol\varepsilon^{\mathrm{obs}\top}_i\,\boldsymbol\Omega_\mathbf{x}\,\boldsymbol\varepsilon^{\mathrm{obs}}_i \;+\; \frac{1}{N_c}\sum_{c=1}^{N_c} \mathbf{r}_\mathbf{R}^{\top}(\mathbf{x}_c)\,\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c)\,\mathbf{r}_\mathbf{R}(\mathbf{x}_c). \]The $\tfrac1N$ and $\tfrac1{N_c}$ make each term a per-sample mean, so the data/physics balance does not ride on how many observations or collocation points happen to be sampled. It is set by the precisions, with no hand-tuned $\alpha$.
The engine builds the full propagated covariance $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=\mathbf{J}_\mathbf{R}\,\boldsymbol\Sigma_\mathbf{x}\,\mathbf{J}_\mathbf{R}^\top$ at every collocation point: at each $\mathbf{x}_c$ the residual Jacobian $\mathbf{J}_\mathbf{R}=\partial\mathbf{r}_\mathbf{R}/\partial\mathbf{x}$ is taken by finite differences of $\mathbf{r}_\mathbf{R}$ along $x$ and $y$, sandwiched against the interpolation covariance $\boldsymbol\Sigma_\mathbf{x}$ (the per-state observation precisions), and its diagonal inverted to weight each residual class locally, $\omega_k(\mathbf{x}_c) = 1/[\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)]_{kk}$. This is the genuine rKINNs error-propagation weighting, not a global scalar. A residual that is stiff (large $\partial r/\partial\mathbf{x}$) where the state is poorly resolved gets damped automatically, while a well-resolved residual is trusted. A cheaper global-EMA fallback ($\hat\omega_k=1/\widehat{\mathrm{Var}}(r_k)$) is available for comparison, but the full per-point propagation is the default.
This is the direct Neural-ODE analogue of the rKINNs construction [3]. The one structural difference is the propagation law. rKINNs carries a second term $\mathbf{J}_p\boldsymbol\Sigma_p\mathbf{J}_p^\top$ for the uncertainty of the mechanistic rate constants $p$. Here the network is the rate function, so there are no separate kinetic parameters, that term drops, and the physics covariance is the single Jacobian sandwich of the interpolation covariance. For the full derivation and the high-dimensional case see the KINNs / rKINNs playground and the paper [3].
8 · Discovery diagnostics
The discovery panel tracks the recovered exponents throughout training with their propagated $\pm\sigma$ bands, and their convergence behavior is the diagnostic of interest. In the classical unit-order Lotka–Volterra case the diagonal exponents $\hat{p}_x, \hat{p}_y$ converge to the true orders ($1$ and $1$); at higher diagonal orders they are biased low, for the reason given in the hybrid section. The cross exponents do not identify the true coupling order: they converge to an orbit-weighted average between $0$ (where the collocation points are dominated by the self-growth or self-decay term) and the true coupling order (where the interaction term dominates), the OLS-best single-exponent fit along the orbit. Reshaping the exponents in the Truth model panel, or restricting the training window, exposes how identifiability depends on the available data. The diagnostic content lies in the trajectory of these scalars through training rather than in any single converged value.
Reading the loss panel
The two draggable handles set the train | test (amber, default $t = 10$) and test | validation (emerald, default $t = 20$) boundaries of §3. The loss panel tracks $\mathcal{L}_\mathrm{train}$ alongside derivative-matching losses on the test and validation regions, with a dotted reference line at the best test loss seen so far. The further apart the handles, the more demanding the extrapolation.
The chips reshape the network on the fly. The + / − controls on any chip resize that layer by $4$ neurons; × deletes a hidden layer; the dashed “+ hidden layer” chip duplicates the last layer's width as a new layer. Each hidden chip carries its own activation. The MLP / KAN toggle switches the model class. Every change re-initializes the network. Learning rate, optimizer, momentum, and the auto-stop watchdog live in the gear panel in the lower-right.
Disabled: an unconstrained Neural ODE. The network fits the observed orbit but recovers no structure; beyond the train/test boundary the integrated trajectory diverges and the four discovery scalars fail to converge. Enabled: the network additionally satisfies a generic monomial constraint, so the discovery scalars converge toward the true orders $(p_x, p_y)$, the orbit remains closed, and the held-out test loss tracks the training loss over a longer horizon.
Disable the prior, re-initialize with Reset NN, and compare the recovered exponents against the enabled case.
Long-horizon NODE rollouts compound integration error, so the late residuals dominate the adjoint and ruin the early dynamics. Time-decay weights each residual by $\exp(-\alpha(\tau)(t_i - t_\mathrm{start}))$ with $\alpha(\tau)$ annealed from a finite value to zero. Horizon annealing only supervises the first $K(\tau)$ observations in each batch; $K$ grows from a small constant to the full batch over training.
Multiple shooting (default): integrate each segment $\mathbf{u}_i^{obs} \to \mathbf{u}_{i+K}^{obs}$ (and backward) anchored at real observations, residual $\tfrac12\lVert\hat{\mathbf{u}}(t_{i+K};\mathbf{u}_i^{obs}) - \mathbf{u}_{i+K}^{obs}\rVert^2$. Short segments do not compound integration error, so it avoids the single-shooting bias where a noisy initial condition poisons the whole window. Two independent knobs: K is the number of observations a segment spans ($i \to i{+}K$); M is the integration resolution, the RK4 substeps taken between each pair of observations ($dt = \text{obs-spacing}/M$), so a segment uses $K\!\cdot\!M$ substeps, independent of data density. ±dir adds the backward segment, cancelling directional bias. Per-segment residuals are weighted by the adaptive per-residual MLE variance; the physics residuals additionally use the full per-collocation-point propagated covariance $\mathbf{J}_\mathbf{R}\boldsymbol\Sigma_\mathbf{x}\mathbf{J}_\mathbf{R}^\top$ of the rKINNs scheme.
Windowed rollout: single-shooting from one anchor across a long window, closer to a classic Neural ODE, but compounds error. Derivative match: match NN rates to finite-difference derivatives, no integration. Fastest, noisiest.
Move $p_x$ to $1.5$ or $q_{xy}$ to $2.0$ using the sliders. The truth trajectory regenerates instantly and the discovery scalars climb toward the new values. Full convergence on a non-unit truth needs several tens of thousands of steps because the NN has to relearn the new monomial shape from scratch.
The physics priors never tell the network what the orders are. The residual $r = x\,\partial f/\partial x - \hat{p}\,f$ only says the log-derivative is a constant; the optimizer and the data pick which constant. The same machinery admits non-integer orders (Hill kinetics, allometric scaling, anomalous diffusion) with no code changes.
Drag the amber train|test handle left to $t = 5$ so the network sees only the first half-cycle. With priors ON, the integrated trajectory still closes through the test and validation regions because the monomial-shape constraint extrapolates the rate function correctly. With priors OFF, the orbit visibly drifts: a pure data fit cannot recover what was never in the training window.
Train: SEEN by the optimizer, evaluates $\nabla_\theta\mathcal{L}$. Test: SEEN by checkpoint selection; we pick $\theta^\star$ that minimizes test loss. Validation: blind hold-out, never seen, ever. Reporting validation loss at $\theta^\star$ is the unbiased generalization estimate.
Relation to the unconstrained Neural ODE
| Unconstrained Neural ODE | Black-box + monomial prior | |
|---|---|---|
| Objective | data fit only | data fit + monomial physics residual |
| Off the data | field unconstrained → drifts | prior keeps constraining → extrapolates |
| Structure | none recovered | resolves low-order diagonal exponents $(\hat{p})$; cross orders as a diagnostic |
| Uncertainty | none | propagated $\pm\sigma$ on the orders |
| Data/physics balance | n/a (single term) | set by MLE, no hand-tuned $\alpha$ |
The table contrasts the two regimes of §4. An unconstrained Neural ODE [1] minimizes the data misfit alone, $\mathcal{L}_\mathrm{data}(\theta) = \tfrac{1}{2}\sum_i \lVert \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i \rVert^2$; as a universal approximator it interpolates the observed orbit to arbitrary precision but constrains the rate function only on the support of that orbit, so off the orbit the field is unidentified and neither structure nor uncertainty is available.
The present formulation adds a second objective on the same network: residual physics priors $\mathbf{R}_k(\mathbf{x}_c)$ requiring the rate function to be locally monomial (§6), evaluated at collocation points along the orbit. The prior is generic and not specific to KINNs; any physics-informed objective could impose it. The distinction lies in how the two objectives are balanced. A conventional physics-informed loss weights them with a hyperparameter $\alpha$, $\mathcal{L} = \mathcal{L}_\mathrm{data} + \alpha\,\mathcal{L}_\mathrm{phys}$, that controls the variance between adhering to the physical constraint and interpolating the observations. The construction adopted from rKINNs instead treats both residuals as Gaussian and lets a profiled maximum-likelihood estimator set the weights, with the physics-residual covariance derived from the interpolation covariance rather than estimated independently, eliminating $\alpha$.
The propagation law $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=J\boldsymbol\Sigma_\mathbf{x}J^\top$ of §7 is what closes this: with the physics-residual covariance derived from the single interpolation covariance, the data-to-physics balance is a deterministic function of the model geometry rather than a free hyperparameter, and the formulation yields two quantities unavailable to an unconstrained Neural ODE — a rate law with identified structure and a calibrated uncertainty on it.
The hybrid alternative
The soft monomial prior recovers the low-order structure but degrades for higher diagonal orders, where the free network fits the orbit with a field that is flatter than the truth and the recovered exponent is biased low. The cause is identifiable: the data constrain the rate function only along the orbit, leaving the transverse derivative underdetermined, so the soft penalty settles on a low-order compromise.
The hybrid model (selected by the Model control) removes this bias structurally. It is still a Neural ODE, but the network is built so its weights are the rate-law structure. Take the log of the state, $\mathbf z = \log\mathbf u$, pass it through a first layer with weight matrix $W$, and apply an elementwise $\exp$ activation:
\[ h_k(\mathbf u) \;=\; \exp\!\Big(\textstyle\sum_j W_{kj}\,\log u_j\Big) \;=\; \prod_j u_j^{\,W_{kj}}. \]Each hidden unit is a monomial whose exponents are exactly the first-layer weights $W$. The read-out is per output component: each $f_i$ has its own linear head $\mathbf b_i$ (and, for products of features, a bilinear matrix $A_i$) that maps the monomials to that component of the rate, $f_i(\mathbf u) = \mathbf b_i^\top \mathbf h + \mathbf h^\top A_i\,\mathbf h$. So the network's weights split into powers ($W$) and per-output coefficients ($\mathbf b_i, A_i$): reading them off after training gives the rate law directly. This is the same object as a fixed monomial dictionary (SINDy [4]), except the exponents are learned rather than enumerated, and embedding it as the right-hand side of the integrator makes it a universal differential equation [5] whose learned component is itself structured.
Pinning the monomials to the four physically relevant terms collapses $W$ to four exponents and the head to four coefficients, which is what the discovery panel reports:
\[ \dot x = x^{\hat p_x}\big(c_0 + c_1\, y^{\hat q_{xy}}\big), \qquad \dot y = y^{\hat p_y}\big(c_2 + c_3\, x^{\hat q_{yx}}\big). \]Here the diagonal orders $\hat p_x,\hat p_y$ and the cross orders $\hat q_{xy},\hat q_{yx}$ are entries of the exponent matrix $W$, and $c_0,\dots,c_3$ are the entries of the per-output read-out head ($c_0, c_1$ in $\mathbf b_x, A_x$; $c_2, c_3$ in $\mathbf b_y, A_y$). All eight are trained directly from the data loss through the discrete adjoint, and with sufficient orbit coverage the integration regularizes the observation noise enough for the parameters to converge to their true values. Unlike the soft prior, this recovers the cross orders as well as the diagonals, because the cross exponent is a network weight rather than a quantity read off a free field. This is the trade-off: it recovers the full structure precisely, but only by assuming the rate is a sum of power-law terms. It sits at the white-box end of the spectrum, next to the fully mechanistic KINNs (known mechanism, fit the rate constants), with the black-box NODE at the other end and the soft-prior formulation in between.
Identifiability still benefits from more data coverage. A single closed orbit is a one-dimensional curve in state space; adding initial conditions (the Experiments control) spreads several orbits across the plane, which tightens the recovered exponents and resolves the residual growth/coupling trade-off between the two additive terms.
Generalization to higher-dimensional systems
Each component of the formulation (the Neural ODE objective, the horizon-annealing schedule, the residual physics priors, and the propagated maximum-likelihood weighting) is system-agnostic. Substituting a higher-dimensional vector field for $\mathbf{f}_\theta$ and applying the same residual constraints on each component extends the method in principle to catalytic kinetics (mass-action exponents, adsorption-site orders), enzyme cascades, gene-regulatory networks, and mechanical or electrical systems.
Relation to KINNs and rKINNs
This is easy to conflate with related work. In KINNs [2] (PINNs for chemical kinetics) the mechanistic model is known: the network is a surrogate for the state trajectory, the physics residual is the known rate law, and the inverse problem recovers its rate constants. Here it is the other way around: the model is unknown, the network is the rate function, and the prior imposes only a generic monomial shape, never the true rate law. What this Neural ODE takes from rKINNs [3] is not the problem setup but the error-propagation machinery: the maximum-likelihood weighting that ties the data–physics balance to the observation uncertainty (§7), removing the hand-tuned loss-weight hyperparameter.
Notes
- How the gradient is computed. This playground uses the discrete adjoint (discretize-then-optimize, or backprop-through-the-solver): we RK4-integrate $\mathbf{f}_\theta$ forward while caching every step, then walk the cached steps in reverse, propagating the adjoint state $\boldsymbol\lambda$ through each RK4 stage to accumulate $\partial\mathcal{L}/\partial\theta$. This is the exact gradient of the discretized loss. The alternative, the continuous adjoint of Chen et al. (2018), instead solves a separate adjoint ODE backward in time (optimize-then-discretize). Discrete adjoint, pros: exact, consistent with the forward solver (no discretization mismatch), which keeps the discovery scalars $(\hat p_x,\hat p_y,\dots)$ clean; robust on stiff dynamics; trivial on a fixed-step solver. Cons: stores the forward trajectory, so memory grows with the number of steps. Continuous adjoint, pros: $O(1)$ memory in the number of steps (re-integrates backward), scaling to very long or deep trajectories. Cons: the re-discretized backward solve can mismatch the forward pass and inject gradient bias/noise, and is fragile on stiff systems. For a short, non-stiff predator–prey orbit the discrete adjoint is exact, fast, and the better choice; the continuous adjoint only pays off when the forward trajectory is too long to store. ↩
References
- Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural Ordinary Differential Equations. NeurIPS 2018. arXiv:1806.07366
- Gusmão, G. S.; Retnanto, A. P.; da Cunha, S. C.; Medford, A. J. Kinetics-Informed Neural Networks. Catalysis Today, 2023. doi:10.1016/j.cattod.2022.04.002
- Gusmão, G. S.; Medford, A. J. Maximum-Likelihood Estimators in Physics-Informed Neural Networks for High-Dimensional Inverse Problems. Computers & Chemical Engineering, 2024, 181, 108547. doi:10.1016/j.compchemeng.2023.108547
- Brunton, S. L.; Proctor, J. L.; Kutz, J. N. Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems (SINDy). PNAS, 2016, 113 (15), 3932–3937. doi:10.1073/pnas.1517384113
- Rackauckas, C.; Ma, Y.; Martensen, J.; Warner, C.; Zubov, K.; Supekar, R.; Skinner, D.; Ramadhan, A.; Edelman, A. Universal Differential Equations for Scientific Machine Learning. 2020. arXiv:2001.04385
- Lotka, A. J. Elements of Physical Biology. Williams & Wilkins, 1925.
- Volterra, V. Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. Mem. Acad. Lincei Roma, 1926.