What does the Lotka-Volterra Neural ODE playground do?

It trains a Neural ODE on noisy predator-prey time series and recovers the unknown rate law through the integrator. Generic monomial physics priors weighted by rKINNs-style error-propagation MLE expose which terms of the Lotka-Volterra structure are identifiable from the data.

How is a Neural ODE different from fitting a curve?

A Neural ODE learns the vector field (the right-hand side of the differential equation) rather than the trajectory itself. Gradients flow back through a discrete adjoint of the integrator, so the network discovers governing dynamics that generalize beyond the observed window.

Can it recover the true equation structure from noisy data?

The diagonal self-interaction and predation terms are identifiable under the error-propagation MLE, while higher-order and cross terms are often not identifiable from sparse noisy data. The playground makes this distinction visible as you train.

Does the playground run in the browser?

Yes. The Lotka-Volterra Neural ODE trains live in your browser with no install and no server. The data generation, integration, and adjoint-based training all run client-side in JavaScript.

Lotka-Volterra Neural ODE Playground

Name: Lotka-Volterra Neural ODE Playground
Author: Gabriel S. Gusmão

A Neural ODE learns an unknown predator-prey rate law from noisy data. A single structural prior, enabled or disabled, separates a purely data-driven fit from a model that recovers the rate law's structure and extrapolates from it.

Gabriel S. Gusmão · Neural ODE with rKINNs-style MLE · see KINNs & Pharmacokinetics

The Lotka-Volterra model is the canonical predator-prey system. Like any autonomous rate law, its dynamics depend on the current state alone: the prey population $x$ and predator population $y$ change at rates set by a fixed vector field $\mathbf f$, $\dot x = f_x(x, y)$ and $\dot y = f_y(x, y)$. Classical Lotka-Volterra takes the mass-action form $\dot x = \alpha x - \beta xy$, $\dot y = \delta xy - \gamma y$, whose coupling sustains the familiar out-of-phase oscillations. This playground uses a generalized version whose interaction orders are free parameters, $\dot x = \alpha\,x^{p_x} - \beta\,x^{p_x} y^{q_{xy}}$ and $\dot y = \delta\,x^{q_{yx}} y^{p_y} - \gamma\,y^{p_y}$, so the integer mass-action law ($p_x = p_y = q_{xy} = q_{yx} = 1$) is one point in a continuum the method must recover from data, including fractional orders that standard sparse regression cannot reach.

In practice the vector field $\mathbf f$ is exactly what we do not know; we know only that the rate of change depends on the state. A Neural ODE therefore represents $\mathbf f$ by a learnable function and fits it to noisy, sparsely sampled observations. How much structure we assume for $\mathbf f$ is the choice set by the top Model control. A black-box NODE lets a free network be the entire rate function and imposes structure only softly, through a monomial prior weighted by error propagation; it recovers the low-order structure but degrades at higher orders. When we do know something about the law, the hybrid and discovery models build that knowledge in: the hybrid NODE is a monomial network whose hidden units are $h = \exp(W\log\mathbf u)$, so the first-layer weights $W$ are the powers and the read-out head the coefficients, $f_x = x^{\hat p_x}(c_0 + c_1 y^{\hat q_{xy}})$ (and symmetrically $f_y = y^{\hat p_y}(c_2 + c_3 x^{\hat q_{yx}})$), recovering all the orders, diagonal and cross, at the cost of assuming the power-law form. Three panels track training: the populations plot, the phase portrait, and the discovery panel. The sections that follow set out the black-box formulation; the hybrid is summarized in the hybrid alternative at the end, with derivations collapsed by default.

1 · The Neural ODE as a learnable rate function

In a neural ordinary differential equation (NODE) [1] the network does not predict the answer directly. It parameterizes the rate of change, the right-hand side of a dynamical system, and the trajectory is recovered by integrating it forward in time:

\[ \dot{\mathbf{u}}(t) \;=\; \mathbf{f}_\theta(\mathbf{u}(t)), \qquad \mathbf{u}(t) \;=\; \mathbf{u}_0 + \!\int_{t_0}^{t}\! \mathbf{f}_\theta(\mathbf{u}(\tau))\,d\tau. \]

So $\mathbf{f}_\theta$ is the unknown vector field. Training nudges the network until the integrated curve $\mathbf{u}(t;\theta)$ passes through the observed states; the gradients come from the discrete adjoint of the integrator. On its own, this construction encodes no prior on the class of system being fit. We close that gap with the physics principle once the problem and its data are in place.

2 · Test system: the Lotka-Volterra equations

The predator-prey system serves as a low-dimensional test problem in which the ground-truth model is known but withheld from the network. In its general two-monomial form,

\[ \begin{aligned} \dot{x} &= \alpha\, x^{p_x} \;-\; \beta\, x^{p_x}\, y^{q_{xy}}, \\ \dot{y} &= \delta\, x^{q_{yx}}\, y^{p_y} \;-\; \gamma\, y^{p_y}, \end{aligned} \]

with the classical Lotka-Volterra recovered at $(p_x, p_y, q_{xy}, q_{yx}) = (1, 1, 1, 1)$. Each of the eight scalars $(\alpha, \beta, \delta, \gamma, p_x, p_y, q_{xy}, q_{yx})$ is independently controllable from the panel above the time-series plot. Observations are noisy snapshots of $(x, y)$ at fixed temporal intervals; the noise level $\sigma_\mathrm{obs}$ and the sampling density are set in the same panel.

3 · Train, test, and validation partitions

The temporal axis is partitioned into three contiguous regions by two draggable handles. The first $t < t_\mathrm{train}$ defines the training set used to compute $\nabla_\theta \mathcal{L}$. The interval $t_\mathrm{train} \le t < t_\mathrm{val}$ is held out as the test set; the training step at which the test loss attains its minimum identifies the checkpoint $\theta^\star$ retained for downstream evaluation. The final region $t \ge t_\mathrm{val}$ is the validation hold-out, which receives no gradient signal and is never used for checkpoint selection; reporting the validation loss at $\theta^\star$ provides an unbiased estimate of generalization beyond the training distribution.

4 · The structural prior: data-driven and physics-informed regimes

With the problem and its data defined, the Physics priors control in the training-mode bar enables or disables a single structural assumption. The two regimes yield qualitatively different models from the same Neural ODE.

Disabled: an unconstrained Neural ODE. The network approximates the rate function and is constrained only by the data, which determines the rate function along the observed orbit. Away from that orbit the field is unidentified: distinct rate functions reproduce the same trajectory on the data while diverging elsewhere. The model interpolates the observations but recovers no structure, extrapolates poorly, and yields no characterization of the underlying dynamics.

Enabled: a physics-informed Neural ODE. We impose one generic structural assumption: each term of the rate function is locally a monomial in its arguments. This does not prescribe the rate law. It constrains the log-derivative to be constant and lets the data determine its value. Under this constraint the network resolves the low-order diagonal structure of the rate law (the self-order exponents $\hat{p}_x,\hat{p}_y$); the cross orders emerge only as an orbit-weighted diagnostic rather than a true identification. It extrapolates substantially better because the prior continues to constrain the solution where the data are uninformative, and reports that structure under a data-to-physics weighting that requires no tuned hyperparameter. Toggling the control isolates the effect of the prior on the recovered exponents.

The distinction is most evident in the validation region. With the prior disabled, the integrated trajectory departs from the data beyond the observation window and the recovered exponents drift without converging. With the prior enabled, the trajectory retains its structure past the data and $\hat{p}_x,\hat{p}_y$ converge toward the true exponents.

5 · Trajectory fitting by multiple shooting

The network is trained to make its integrated curve pass through the noisy observations. There is a subtlety the playground lets you see directly: errors in the rate function compound through the integrator, so late points carry more accumulated error than early ones. Left unchecked, training over-corrects the far end of the orbit and training diverges. This is the classic long-horizon Neural ODE failure. The default multiple-shooting method sidesteps it by fitting many short segments anchored at real observations instead of one long shot. The gradient-method dropdown also offers two annealed curricula that grow the horizon gradually.

The View toggle above the populations plot switches between the two pictures. Segments shows the short forward/backward arcs the optimizer actually fits between observations. Full integral shows one continuous curve integrated from the true initial condition of each region, which isolates the behavior of the learned rate function under self-integration.

▶The data objective and horizon-annealing schedules derivation

With Gaussian observation noise of covariance $\boldsymbol\Sigma_\mathbf{x}$ (precision $\boldsymbol\Omega_\mathbf{x} \equiv \boldsymbol\Sigma_\mathbf{x}^{-1}$), the negative log-likelihood of the data under the integrated trajectory is the Mahalanobis misfit

\[ \mathcal{L}_\mathrm{data}(\theta) \;=\; \frac{1}{2} \sum_{i=1}^{N} \boldsymbol\varepsilon^{\mathrm{obs}\top}_i\,\boldsymbol\Omega_\mathbf{x}\, \boldsymbol\varepsilon^{\mathrm{obs}}_i, \qquad \boldsymbol\varepsilon^{\mathrm{obs}}_i = \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i, \]

where $\hat{\mathbf{u}}(t_i;\theta)$ denotes the RK4 forward integration of $\mathbf{f}_\theta$. The isotropic case $\boldsymbol\Sigma_\mathbf{x} = \sigma^2_\mathrm{obs}\mathbf{I}$ reduces this to the scalar-weighted sum of squares $\tfrac{\omega_\mathrm{obs}}{2}\sum_i\lVert\boldsymbol\varepsilon^{\mathrm{obs}}_i\rVert^2$ with $\omega_\mathrm{obs}=1/\sigma^2_\mathrm{obs}$; the full $\boldsymbol\Sigma_\mathbf{x}$ is the single free covariance of the maximum-likelihood weighting, where it is estimated from the residuals and propagated to the physics term. The residual at observation $i$ decomposes into the observational noise on $\mathbf{u}^\mathrm{obs}_i$ and the accumulated integration error $\int_{t_0}^{t_i} [\mathbf{f}_\theta(\mathbf{u}) - \mathbf{f}(\mathbf{u})]\,d\tau$, whose growth with $t_i$ is what biases a uniformly-weighted MLE toward late-time residuals (the long-horizon failure described above). Quantitatively, the magnitude of this term scales with the integration horizon $t_i - t_0$, so the bias is set by how far the supervised window extends past each anchor.

Two annealed curricula mitigate it: time-decay weighting down-weights each residual by a factor that decays with its distance from $t_\mathrm{start}$, with the decay rate $\alpha(\tau)$ annealed to zero so the supervised horizon lengthens across training; horizon annealing is the discrete counterpart, supervising only the first $K(\tau)$ observations of each batch with $K$ growing to the full batch size. Both are selectable from the training-mode bar below the plots (formula in the curriculum tooltip).

6 · The monomial prior as residual constraints

This prior says each term of the rate law is a power law (a monomial), $f_i \propto u_j^{\,p}$. To make it useful we turn it into something the optimizer can drive to zero. A power law has a constant log-slope, so taking the logarithmic derivative turns “is a monomial” into a simple algebraic condition:

\[ \frac{\partial \ln |f_i|}{\partial \ln u_j} \;=\; \frac{u_j}{f_i}\,\frac{\partial f_i}{\partial u_j} \;=\; p \qquad\Longleftrightarrow\qquad u_j\,\frac{\partial f_i}{\partial u_j} \;-\; p\, f_i \;=\; 0. \]

For our two-variable system this gives four residuals on the network output, with four learnable exponents $(\hat{p}_x, \hat{p}_y, \hat{q}_{xy}, \hat{q}_{yx})$ that the optimizer discovers jointly with the weights $\theta$. These four exponents are exactly the numbers in the discovery panel.

\[ \begin{aligned} r_x &= x\,\partial f_x/\partial x \;-\; \hat{p}_x \, f_x, \\ r_y &= y\,\partial f_y/\partial y \;-\; \hat{p}_y \, f_y, \\ r_{xy} &= y\,\partial f_x/\partial y \;-\; \hat{q}_{xy}\, f_x, \\ r_{yx} &= x\,\partial f_y/\partial x \;-\; \hat{q}_{yx}\, f_y. \end{aligned} \]

Driving these residuals toward zero forces the network's rate function to behave like a power law of the discovered order, term by term, without supplying the true exponents.

▶Matrix form of the residuals and convergence of the cross-exponents derivation

Collected into a single matrix relation, let $\mathbf{u} = (x, y)^\top$, $\mathbf{f}_\theta = (f_x, f_y)^\top$, $\mathbf{D} = \mathrm{diag}(x, y)$, and let $\mathbf{J}_\theta \in \mathbb{R}^{2\times 2}$ be the Jacobian $[\mathbf{J}_\theta]_{ij} = \partial f_i / \partial u_j$. Define the log-Jacobian $\mathbf{J}^{\log}_\theta \;\equiv\; \mathbf{J}_\theta\,\mathbf{D}$ so that $[\mathbf{J}^{\log}_\theta]_{ij} = u_j\,\partial f_i/\partial u_j$. The four residuals above are the entries of a single matrix,

\[ \begin{aligned} \mathbf{R}(\mathbf{u}, \theta, \hat{\mathbf{P}}) &= \mathbf{J}^{\log}_\theta(\mathbf{u}) \;-\; \hat{\mathbf{P}} \odot \bigl(\mathbf{f}_\theta(\mathbf{u})\,\mathbf{1}^\top\bigr), \\[4pt] \hat{\mathbf{P}} &= \begin{pmatrix} \hat{p}_x & \hat{q}_{xy} \\ \hat{q}_{yx} & \hat{p}_y \end{pmatrix}. \end{aligned} \]

with $\odot$ the Hadamard product and $\mathbf{1}=(1,1)^\top$, so that $r_x = [\mathbf{R}]_{xx}$, $r_y = [\mathbf{R}]_{yy}$, $r_{xy} = [\mathbf{R}]_{xy}$, $r_{yx} = [\mathbf{R}]_{yx}$. The residuals are evaluated at collocation points sampled in the neighborhood of the current NN trajectory. For a pure monomial rate function $\mathbf{R} \equiv 0$; for a sum of monomials (Lotka-Volterra is the simplest case) no choice of $\hat{\mathbf{P}}$ annihilates all entries simultaneously, so the off-diagonal entries $\hat{q}_{xy}, \hat{q}_{yx}$ are not identified - the discovery panel below shows how they converge.

7 · Maximum-likelihood weighting by error propagation

A physics-informed objective introduces a regularization weight that balances the physics residual against the data misfit. Its value is consequential: too large and the prior dominates the likelihood, too small and it is inert, and conventionally it is selected by hand. The rKINNs [3] formulation eliminates it. The only free quantity is the observation-error covariance $\boldsymbol\Sigma_\mathbf{x}$; the physics-residual covariance is not estimated independently but obtained from $\boldsymbol\Sigma_\mathbf{x}$ by first-order propagation through the model. Each residual is then weighted by the precision of the integrated state that produces it.

The shaded $\pm\sigma$ bands on the discovery panel are this propagated uncertainty made visible: the bands are wide while the fit is poorly constrained and contract as the network resolves the structure.

▶The propagation law and the concentrated maximum-likelihood objective derivation

The state is 2D, so the residuals are vectors and their second-order statistics are matrices. Let $\boldsymbol\varepsilon^{\mathrm{obs}}_i = \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i \in \mathbb{R}^2$ be the per-observation data residual and $\mathbf{r}_\mathbf{R}(\mathbf{x}_c) \equiv \mathrm{vec}\,\mathbf{R}(\mathbf{x}_c, \theta, \hat{\mathbf{P}}) \in \mathbb{R}^4$ the stacked physics residual at integrated collocation point $\mathbf{x}_c$. The crucial point is that only one covariance is a free unknown, the state-error covariance $\boldsymbol\Sigma_\mathbf{x}$. The physics residual covariance $\boldsymbol\Sigma_\mathbf{R}$ is not a separate free parameter; it is determined deterministically from $\boldsymbol\Sigma_\mathbf{x}$ by the propagation law derived below.

The rKINNs propagation law

Every point $\mathbf{x}_c$ at which we evaluate the physics residual is an output of the NODE integration of $\mathbf{f}_\theta$ from the initial condition; the state at $\mathbf{x}_c$ carries an error covariance $\boldsymbol\Sigma_\mathbf{x}(\mathbf{x}_c)$ obtained by transporting the observation covariance through the integrator. A first-order expansion of the vectorized physics residual around the integrated state,

\[ \delta \mathbf{r}_\mathbf{R}(\mathbf{x}_c) \;\approx\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_c} \,\delta \mathbf{x}_c, \qquad \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}} \;\in\; \mathbb{R}^{4\times 2}, \]

yields the central rKINNs propagation law: at each integrated point $\mathbf{x}_c$, the physics-residual covariance is a Jacobian sandwich of the state covariance at that same point,

\[ \boxed{\;\; \boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c) \;=\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_c} \;\boldsymbol\Sigma_\mathbf{x}\; \frac{\partial \mathbf{r}_\mathbf{R}}{\partial \mathbf{x}}^{\!\top}\bigg|_{\mathbf{x}_c}. \;\;} \]

Single-source chain: $\boldsymbol\Sigma_\mathbf{x}$ estimated from the data residuals $\to$ $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)$ at every collocation point via the residual Jacobian. The precision $\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c) \equiv \boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)$ used in the loss is then a deterministic function of $\theta$ and $\boldsymbol\Sigma_\mathbf{x}$, evaluated at each $\mathbf{x}_c$ along the orbit.

The MLE objective with propagated physics covariance

With the propagation law in hand, only $\boldsymbol\Sigma_\mathbf{x}$ is a free covariance; the physics term uses the propagated $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=J\boldsymbol\Sigma_\mathbf{x}J^\top$. Profiling $\partial\mathcal{L}/\partial\boldsymbol\Sigma_\mathbf{x}=0$ gives the sample covariance of the data residuals, $\hat{\boldsymbol\Sigma}_\mathbf{x}=\tfrac{1}{N}\sum_i\boldsymbol\varepsilon^{\mathrm{obs}}_i\boldsymbol\varepsilon^{\mathrm{obs}\top}_i$, and substituting back yields the rKINNs concentrated objective,

\[ \begin{aligned} \mathcal{L}_\mathrm{rKINNs}(\theta, \hat{\mathbf{P}}) \;=\; &\tfrac{N}{2}\,\log\bigl|\hat{\boldsymbol\Sigma}_\mathbf{x}(\theta)\bigr| \\[4pt] +\; &\tfrac{1}{2}\!\sum_{c=1}^{N_c}\!\Bigl[ \log\bigl|\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)\bigr| + \mathbf{r}_\mathbf{R}^{\top}(\mathbf{x}_c) \,\boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)\, \mathbf{r}_\mathbf{R}(\mathbf{x}_c) \Bigr] \,+\, \text{const}. \end{aligned} \]

A caveat on rank. The residual Jacobian $\partial\mathbf{r}_\mathbf{R}/\partial\mathbf{x}\in\mathbb{R}^{4\times 2}$ carries a two-dimensional state covariance into a four-component residual space, so $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)$ is a $4\times4$ matrix of rank at most $2$ - singular. The $\log\bigl|\boldsymbol\Sigma_\mathbf{R}\bigr|$ and $\boldsymbol\Sigma_\mathbf{R}^{-1}$ above are thus the general-form objective: in this two-dimensional demo the weighting uses the diagonal of $\boldsymbol\Sigma_\mathbf{R}$ (one scalar precision per residual class), while the full-rank rKINNs treatment conditions the covariance through an SVD preconditioning of the coupling map before inversion [3].

In practice we optimize the precision-weighted, per-sample-mean form used in rKINNs [3]. The precision matrices $\boldsymbol\Omega_\mathbf{x}=\boldsymbol\Sigma_\mathbf{x}^{-1}$ and $\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c)=\boldsymbol\Sigma_\mathbf{R}^{-1}(\mathbf{x}_c)$ are held fixed within each step and refreshed from the residuals, exactly as in the KINNs auto-weighting scheme:

\[ \mathcal{L}_\mathrm{rKINNs} \;=\; \frac{1}{N}\sum_{i=1}^{N} \boldsymbol\varepsilon^{\mathrm{obs}\top}_i\,\boldsymbol\Omega_\mathbf{x}\,\boldsymbol\varepsilon^{\mathrm{obs}}_i \;+\; \frac{1}{N_c}\sum_{c=1}^{N_c} \mathbf{r}_\mathbf{R}^{\top}(\mathbf{x}_c)\,\boldsymbol\Omega_\mathbf{R}(\mathbf{x}_c)\,\mathbf{r}_\mathbf{R}(\mathbf{x}_c). \]

The $\tfrac1N$ and $\tfrac1{N_c}$ make each term a per-sample mean, so the data/physics balance does not ride on how many observations or collocation points happen to be sampled. It is set by the precisions, with no hand-tuned $\alpha$.

What runs in the engine

The engine builds the full propagated covariance $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=\mathbf{J}_\mathbf{R}\,\boldsymbol\Sigma_\mathbf{x}\,\mathbf{J}_\mathbf{R}^\top$ at every collocation point: at each $\mathbf{x}_c$ the residual Jacobian $\mathbf{J}_\mathbf{R}=\partial\mathbf{r}_\mathbf{R}/\partial\mathbf{x}$ is taken by finite differences of $\mathbf{r}_\mathbf{R}$ along $x$ and $y$, sandwiched against the interpolation covariance $\boldsymbol\Sigma_\mathbf{x}$ (the per-state observation precisions), and its diagonal inverted to weight each residual class locally, $\omega_k(\mathbf{x}_c) = 1/[\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)]_{kk}$. This is the genuine rKINNs error-propagation weighting, not a global scalar. A residual that is stiff (large $\partial r/\partial\mathbf{x}$) where the state is poorly resolved gets damped automatically, while a well-resolved residual is trusted. A cheaper global-EMA fallback ($\hat\omega_k=1/\widehat{\mathrm{Var}}(r_k)$) is available for comparison, but the full per-point propagation is the default.

This is the direct Neural-ODE analogue of the rKINNs construction [3]. The one structural difference is the propagation law. rKINNs carries a second term $\mathbf{J}_p\boldsymbol\Sigma_p\mathbf{J}_p^\top$ for the uncertainty of the mechanistic rate constants $p$. Here the network is the rate function, so there are no separate kinetic parameters, that term drops, and the physics covariance is the single Jacobian sandwich of the interpolation covariance. For the full derivation and the high-dimensional case see the KINNs / rKINNs playground and the paper [3].

8 · Discovery diagnostics

The discovery panel tracks the recovered exponents throughout training with their propagated $\pm\sigma$ bands, and their convergence behavior is the diagnostic of interest. In the classical unit-order Lotka-Volterra case the diagonal exponents $\hat{p}_x, \hat{p}_y$ converge to the true orders ($1$ and $1$); at higher diagonal orders they are biased low, for the reason given in the hybrid section. The cross exponents do not identify the true coupling order: they converge to an orbit-weighted average between $0$ (where the collocation points are dominated by the self-growth or self-decay term) and the true coupling order (where the interaction term dominates), the OLS-best single-exponent fit along the orbit. Reshaping the exponents in the Truth model panel, or restricting the training window, exposes how identifiability depends on the available data. The diagnostic content lies in the trajectory of these scalars through training rather than in any single converged value.

Model

p̂_x · x-order of f_x

−

truth 1.00

p̂_y · y-order of f_y

−

truth 1.00

q̂_xy · y-order of f_x

−

mixes 0 & 1.00

q̂_yx · x-order of f_y

−

mixes 0 & 1.00

▶ TRUTH MODEL · DATA i $\dot{x} = \alpha\,x^{p_x} - \beta\,x^{p_x} y^{q_{xy}}$ · $\dot{y} = \delta\,x^{q_{yx}} y^{p_y} - \gamma\,y^{p_y}$

▶ Train · Test · Validation roles

TRAIN · t < ttrain
          i
evaluate gradients · update θ
TEST · ttrain ≤ t < tval
pick checkpoint with lowest error
VALIDATION · t ≥ tval
blind hold-out · honest generalization

View

        Data density
        
        Obs noise σ
        
        0.050
      
          Experiments (initial conditions)
          
More orbits cover
          more of the state space, sharpening the recovered orders. The populations plot
          shows the first orbit; the phase portrait below overlays all of them. Drag the
          blue / red
          grips at t = 0 on any orbit to set its initial condition.

Dynamics

Learned f_θ

True f

Same speed colour-scale and orbits on both. The learned field (left) should match the true field (right) along the orbits; away from them it is unconstrained and free to differ.

TRAINING MODE

Gradient method i

K (obs) 1 M (int/obs) 10 ±dir

Physics priors i

Grad accum

Rollout curriculum i

▶ Diagnostics step counts · losses · lr · grad norm · discovered orders · split bounds

Initializing…

Reading the loss panel

The two draggable handles set the train | test (amber, default $t = 10$) and test | validation (emerald, default $t = 20$) boundaries. The loss panel tracks $\mathcal{L}_\mathrm{train}$ alongside derivative-matching losses on the test and validation regions, with a dotted reference line at the best test loss seen so far. The further apart the handles, the more demanding the extrapolation.

Live architecture editor

The chips reshape the network on the fly. The + / − controls on any chip resize that layer by $4$ neurons; × deletes a hidden layer; the dashed “+ hidden layer” chip duplicates the last layer's width as a new layer. Each hidden chip carries its own activation. The MLP / KAN toggle switches the model class. Every change re-initializes the network. Learning rate, optimizer, momentum, and the auto-stop watchdog live in the gear panel in the lower-right.

Structural prior (enabled / disabled)

Disabled: an unconstrained Neural ODE. The network fits the observed orbit but recovers no structure; beyond the train/test boundary the integrated trajectory diverges and the four discovery scalars fail to converge. Enabled: the network additionally satisfies a generic monomial constraint, so the discovery scalars converge toward the true orders $(p_x, p_y)$, the orbit remains closed, and the held-out test loss tracks the training loss over a longer horizon.

Disable the prior, re-initialize with Reset NN, and compare the recovered exponents against the enabled case.

About the curriculum

Long-horizon NODE rollouts compound integration error, so the late residuals dominate the adjoint and ruin the early dynamics. Time-decay weights each residual by $\exp(-\alpha(\tau)(t_i - t_\mathrm{start}))$ with $\alpha(\tau)$ annealed from a finite value to zero. Horizon annealing only supervises the first $K(\tau)$ observations in each batch; $K$ grows from a small constant to the full batch over training.

Gradient method

Multiple shooting (default): integrate each segment $\mathbf{u}_i^{obs} \to \mathbf{u}_{i+K}^{obs}$ (and backward) anchored at real observations, residual $\tfrac12\lVert\hat{\mathbf{u}}(t_{i+K};\mathbf{u}_i^{obs}) - \mathbf{u}_{i+K}^{obs}\rVert^2$. Short segments do not compound integration error, so it avoids the single-shooting bias where a noisy initial condition poisons the whole window. Two independent knobs: K is the number of observations a segment spans ($i \to i{+}K$); M is the integration resolution, the RK4 substeps taken between each pair of observations ($dt = \text{obs-spacing}/M$), so a segment uses $K\!\cdot\!M$ substeps, independent of data density. ±dir adds the backward segment, cancelling directional bias. Per-segment residuals are weighted by the adaptive per-residual MLE variance; the physics residuals additionally use the full per-collocation-point propagated covariance $\mathbf{J}_\mathbf{R}\boldsymbol\Sigma_\mathbf{x}\mathbf{J}_\mathbf{R}^\top$ of the rKINNs scheme.

Windowed rollout: single-shooting from one anchor across a long window, closer to a classic Neural ODE, but compounds error. Derivative match: match NN rates to finite-difference derivatives, no integration. Fastest, noisiest.

Try this

Move $p_x$ to $1.5$ or $q_{xy}$ to $2.0$ using the sliders. The truth trajectory regenerates instantly and the discovery scalars climb toward the new values. Full convergence on a non-unit truth needs several tens of thousands of steps because the NN has to relearn the new monomial shape from scratch.

Think about

The physics priors never tell the network what the orders are. The residual $r = x\,\partial f/\partial x - \hat{p}\,f$ only says the log-derivative is a constant; the optimizer and the data pick which constant. The same machinery admits non-integer orders (Hill kinetics, allometric scaling, anomalous diffusion) with no code changes.

Try this

Drag the amber train|test handle left to $t = 5$ so the network sees only the first half-cycle. With priors ON, the integrated trajectory still closes through the test and validation regions because the monomial-shape constraint extrapolates the rate function correctly. With priors OFF, the orbit visibly drifts: a pure data fit cannot recover what was never in the training window.

Roles

Train: SEEN by the optimizer, evaluates $\nabla_\theta\mathcal{L}$. Test: SEEN by checkpoint selection; we pick $\theta^\star$ that minimizes test loss. Validation: blind hold-out, never seen, ever. Reporting validation loss at $\theta^\star$ is the unbiased generalization estimate.

Relation to the unconstrained Neural ODE

	Unconstrained Neural ODE	Black-box + monomial prior
Objective	data fit only	data fit + monomial physics residual
Off the data	field unconstrained → drifts	prior keeps constraining → extrapolates
Structure	none recovered	resolves low-order diagonal exponents $(\hat{p})$; cross orders as a diagnostic
Uncertainty	none	propagated $\pm\sigma$ on the orders
Data/physics balance	n/a (single term)	set by MLE, no hand-tuned $\alpha$

The table contrasts the two regimes. An unconstrained Neural ODE [1] minimizes the data misfit alone, $\mathcal{L}_\mathrm{data}(\theta) = \tfrac{1}{2}\sum_i \lVert \hat{\mathbf{u}}(t_i;\theta) - \mathbf{u}^\mathrm{obs}_i \rVert^2$; as a universal approximator it interpolates the observed orbit to arbitrary precision but constrains the rate function only on the support of that orbit, so off the orbit the field is unidentified and neither structure nor uncertainty is available.

The present formulation adds a second objective on the same network: residual physics priors $\mathbf{R}_k(\mathbf{x}_c)$ requiring the rate function to be locally monomial, evaluated at collocation points along the orbit. The prior is generic and not specific to KINNs; any physics-informed objective could impose it. The distinction lies in how the two objectives are balanced. A conventional physics-informed loss weights them with a hyperparameter $\alpha$, $\mathcal{L} = \mathcal{L}_\mathrm{data} + \alpha\,\mathcal{L}_\mathrm{phys}$, that controls the variance between adhering to the physical constraint and interpolating the observations. The construction adopted from rKINNs instead treats both residuals as Gaussian and lets a profiled maximum-likelihood estimator set the weights, with the physics-residual covariance derived from the interpolation covariance rather than estimated independently, eliminating $\alpha$.

The propagation law $\boldsymbol\Sigma_\mathbf{R}(\mathbf{x}_c)=J\boldsymbol\Sigma_\mathbf{x}J^\top$ is what closes this: with the physics-residual covariance derived from the single interpolation covariance, the data-to-physics balance is a deterministic function of the model geometry rather than a free hyperparameter, and the formulation yields two quantities unavailable to an unconstrained Neural ODE - a rate law with identified structure and a calibrated uncertainty on it.

The hybrid alternative

The soft monomial prior recovers the low-order structure but degrades for higher diagonal orders, where the free network fits the orbit with a field that is flatter than the truth and the recovered exponent is biased low. The cause is identifiable: the data constrain the rate function only along the orbit, leaving the transverse derivative underdetermined, so the soft penalty settles on a low-order compromise.

The hybrid model (selected by the Model control) removes this bias structurally. It is still a Neural ODE, but the network is built so its weights are the rate-law structure. Take the log of the state, $\mathbf z = \log\mathbf u$, pass it through a first layer with weight matrix $W$, and apply an elementwise $\exp$ activation:

\[ h_k(\mathbf u) \;=\; \exp\!\Big(\textstyle\sum_j W_{kj}\,\log u_j\Big) \;=\; \prod_j u_j^{\,W_{kj}}. \]

Each hidden unit is a monomial whose exponents are exactly the first-layer weights $W$. The read-out is per output component: each $f_i$ has its own linear head $\mathbf b_i$ (and, for products of features, a bilinear matrix $A_i$) that maps the monomials to that component of the rate, $f_i(\mathbf u) = \mathbf b_i^\top \mathbf h + \mathbf h^\top A_i\,\mathbf h$. So the network's weights split into powers ($W$) and per-output coefficients ($\mathbf b_i, A_i$): reading them off after training gives the rate law directly. This is the same object as a fixed monomial dictionary (SINDy [4]), except the exponents are learned rather than enumerated, and embedding it as the right-hand side of the integrator makes it a universal differential equation [5] whose learned component is itself structured.

Pinning the monomials to the four physically relevant terms collapses $W$ to four exponents and the head to four coefficients, which is what the discovery panel reports:

\[ \dot x = x^{\hat p_x}\big(c_0 + c_1\, y^{\hat q_{xy}}\big), \qquad \dot y = y^{\hat p_y}\big(c_2 + c_3\, x^{\hat q_{yx}}\big). \]

Here the diagonal orders $\hat p_x,\hat p_y$ and the cross orders $\hat q_{xy},\hat q_{yx}$ are entries of the exponent matrix $W$, and $c_0,\dots,c_3$ are the entries of the per-output read-out head ($c_0, c_1$ in $\mathbf b_x, A_x$; $c_2, c_3$ in $\mathbf b_y, A_y$). All eight are trained directly from the data loss through the discrete adjoint, and with sufficient orbit coverage the integration regularizes the observation noise enough for the parameters to converge to their true values. Unlike the soft prior, this recovers the cross orders as well as the diagonals, because the cross exponent is a network weight rather than a quantity read off a free field. This is the trade-off: it recovers the full structure precisely, but only by assuming the rate is a sum of power-law terms. It sits at the white-box end of the spectrum, next to the fully mechanistic KINNs (known mechanism, fit the rate constants), with the black-box NODE at the other end and the soft-prior formulation in between.

Identifiability still benefits from more data coverage. A single closed orbit is a one-dimensional curve in state space; adding initial conditions (the Experiments control) spreads several orbits across the plane, which tightens the recovered exponents and resolves the residual growth/coupling trade-off between the two additive terms.

Generalization to higher-dimensional systems

Each component of the formulation (the Neural ODE objective, the horizon-annealing schedule, the residual physics priors, and the propagated maximum-likelihood weighting) is system-agnostic. Substituting a higher-dimensional vector field for $\mathbf{f}_\theta$ and applying the same residual constraints on each component extends the method in principle to catalytic kinetics (mass-action exponents, adsorption-site orders), enzyme cascades, gene-regulatory networks, and mechanical or electrical systems.

Relation to KINNs and rKINNs

This is easy to conflate with related work. In KINNs [2] (PINNs for chemical kinetics) the mechanistic model is known: the network is a surrogate for the state trajectory, the physics residual is the known rate law, and the inverse problem recovers its rate constants. Here it is the other way around: the model is unknown, the network is the rate function, and the prior imposes only a generic monomial shape, never the true rate law. What this Neural ODE takes from rKINNs [3] is not the problem setup but the error-propagation machinery: the maximum-likelihood weighting that ties the data-physics balance to the observation uncertainty, removing the hand-tuned loss-weight hyperparameter.

How to cite

If you reference this playground or its underlying methods in a paper, presentation, or post, please cite the source papers below.

BibTeX - KINNs (Catalysis Today, 2023)

@article{gusmao2023kinns,
  author  = {Gusm{\~a}o, Gabriel S. and Retnanto, Adhika P. and da Silva, Shachi C. and Medford, Andrew J.},
  title   = {Kinetics-Informed Neural Networks},
  journal = {Catalysis Today},
  volume  = {417},
  pages   = {113936},
  year    = {2023},
  doi     = {10.1016/j.cattod.2022.04.002},
  url     = {https://www.gabrielgusmao.com/publications/kinetics-informed-neural-networks/}
}

BibTeX - rKINNs / MLE for PINNs (Computers & Chemical Engineering, 2024)

@article{gusmao2024rkinns,
  author  = {Gusm{\~a}o, Gabriel S. and Medford, Andrew J.},
  title   = {Maximum-Likelihood Estimators in Physics-Informed Neural Networks for High-Dimensional Inverse Problems},
  journal = {Computers \& Chemical Engineering},
  volume  = {182},
  pages   = {108547},
  year    = {2024},
  doi     = {10.1016/j.compchemeng.2023.108547},
  url     = {https://www.gabrielgusmao.com/publications/maximum-likelihood-estimators-in-pinns/}
}

References

Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural Ordinary Differential Equations. NeurIPS 2018. arXiv:1806.07366
Gusmão, G. S.; Retnanto, A. P.; da Cunha, S. C.; Medford, A. J. Kinetics-Informed Neural Networks. Catalysis Today, 2023. doi:10.1016/j.cattod.2022.04.002
Gusmão, G. S.; Medford, A. J. Maximum-Likelihood Estimators in Physics-Informed Neural Networks for High-Dimensional Inverse Problems. Computers & Chemical Engineering, 2024, 181, 108547. doi:10.1016/j.compchemeng.2023.108547
Brunton, S. L.; Proctor, J. L.; Kutz, J. N. Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems (SINDy). PNAS, 2016, 113 (15), 3932-3937. doi:10.1073/pnas.1517384113
Rackauckas, C.; Ma, Y.; Martensen, J.; Warner, C.; Zubov, K.; Supekar, R.; Skinner, D.; Ramadhan, A.; Edelman, A. Universal Differential Equations for Scientific Machine Learning. 2020. arXiv:2001.04385
Lotka, A. J. Elements of Physical Biology. Williams & Wilkins, 1925.
Volterra, V. Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. Mem. Acad. Lincei Roma, 1926.

Gabriel S. Gusmão