Operator Learning & Neural Operators

From Separation of Variables to Fourier Neural Operators


ML for Science and Engineering · Lecture 18
Joseph Bakarji

Where We Are

A spatio-temporal PDE solution $u(x, t)$, e.g. advection-diffusion

PINNs: train a network $(x, t) \to u(x,t)$ for a single initial condition.

Shortcoming: for a different $u_0(x)$, you must retrain from scratch.
Can we do better? Yes: learn the solution operator $\mathcal{G}: u_0 \mapsto u(x,t)$ (informally, the inverse of the differential operator), so that one trained model works for any initial condition.
This means mapping functions to functions, not vectors to vectors. This is where the notion of an operator comes in.

Part I: From Vectors to Functions

What are operators, and why should neural networks learn them?

What Is an Operator?

Standard Neural Network
Maps vectors to vectors:
$f_\theta: \mathbb{R}^n \to \mathbb{R}^m$

Input: $\mathbf{x} = [x_1, \ldots, x_n]$
Output: $\mathbf{y} = [y_1, \ldots, y_m]$
Operator Network
Maps functions to functions:
$\mathcal{G}_\theta: \mathcal{U} \to \mathcal{V}$

Input: $u(\cdot)$ (e.g., initial condition)
Output: $\mathcal{G}(u)(\cdot)$ (e.g., solution)

Example: $u_0(x) \mapsto u(x, T)$: given an initial temperature distribution, predict the temperature at time $T$.

Why Functions, Not Vectors?

A vector $[u_1, u_2, \ldots, u_n]$ is just a discretization of a function $u(x)$ on a grid.

If we think in terms of functions, the learned operator should work at any resolution: coarse or fine.

Discretization invariance: the same trained model works on grids of different sizes. Train on $64 \times 64$, evaluate on $256 \times 256$.

This is fundamentally different from CNNs, which are tied to a fixed grid resolution.

Universal Approximation for Operators

Chen & Chen (1995): A neural network with a single hidden layer can approximate any continuous nonlinear operator. [paper]

$$\mathcal{G}(u)(y) \approx \sum_{i=1}^{p} c_i(u) \cdot \psi_i(y)$$
  • $c_i(u)$: coefficients computed from the input function (the branch)
  • $\psi_i(y)$: basis functions evaluated at the query point (the trunk)

This is a modal decomposition: basis functions $\times$ coefficients, summed. It is the theoretical foundation for DeepONet.

The Operator Learning Setup

Input
$u_0(x)$, BCs, forcing $f$
Operator Network
$\mathcal{G}_\theta$
Output
$u(x,t)$ at any $(x,t)$

Training data: many (input function, solution) pairs from a classical PDE solver.

Once trained: instant evaluation for any new input (no retraining needed).

Two main architectures:
  • Deep Operator Networks (DeepONet), inspired by Chen & Chen's universal approximation theorem [Lu et al. 2021]
  • Fourier Neural Operators (FNO), inspired by Green's functions and Fourier transforms [Li et al. 2021]

Part II: Deep Operator Networks

From classical modal decomposition to learned operators.

Spatial-Temporal Modal Decomposition

Many PDE solutions can be written as a sum of spatial modes weighted by temporal coefficients:

$$u(x, t) = \sum_{k=1}^{K} a_k(t) \cdot \varphi_k(x)$$

You have seen this in multiple forms:

  • SVD/POD (Lecture 9): orthogonal modes from data ($U \Sigma V^\top$)
  • DMD (Lecture 10): dynamic modes with eigenvalue evolution
  • Fourier series: sines and cosines as basis functions
All of these share the same structure: basis functions × coefficients, summed over modes.
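
A quick refresher in code: a toy numpy sketch (synthetic snapshot data, illustrative names) showing that the SVD extracts exactly this structure, with spatial modes $\varphi_k(x)$ in the left singular vectors and temporal coefficients $a_k(t)$ in the weighted right singular vectors.

import numpy as np

# Toy snapshot matrix: rows = spatial points, columns = time samples.
x = np.linspace(0, 1, 128)
t = np.linspace(0, 1, 50)
U_data = np.sin(np.pi * x)[:, None] * np.exp(-t)[None, :] \
         + 0.3 * np.sin(3 * np.pi * x)[:, None] * np.cos(5 * t)[None, :]

# SVD: columns of Phi are spatial modes phi_k(x); rows of A are coefficients a_k(t).
Phi, S, Vt = np.linalg.svd(U_data, full_matrices=False)
A = np.diag(S) @ Vt

# Rank-K reconstruction: u(x, t) ≈ sum_k a_k(t) * phi_k(x)
K = 2
U_approx = Phi[:, :K] @ A[:K, :]
print('relative error:', np.linalg.norm(U_approx - U_data) / np.linalg.norm(U_data))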

DeepONet Architecture

Two sub-networks whose outputs are combined via a dot product (Lu et al., Nature Machine Intelligence, 2021): [paper]

DeepONet Architecture (Lu et al. 2021, Fig. 1)
$\mathcal{G}(u)(y) = \sum_{i=1}^{p} \underbrace{\text{Branch}_i(u)}_{\text{coefficients}} \cdot \underbrace{\text{Trunk}_i(y)}_{\text{basis functions}} = \langle \mathbf{b}(u),\, \mathbf{t}(y) \rangle$

DeepONet = Learned Modal Decomposition

Classical SVD/POD
$u(x,t) \approx \sum_k a_k \cdot \varphi_k(x)$

$\varphi_k(x)$: eigenvectors (fixed)
$a_k$: projection coefficients (linear)
DeepONet
$\mathcal{G}(u)(y) \approx \sum_i b_i(u) \cdot t_i(y)$

$t_i(y)$: trunk outputs (learned basis)
$b_i(u)$: branch outputs (nonlinear in $u$)
The trunk learns basis functions. The branch learns how to weight them given the input function. This is separation of variables, generalized by neural networks.

Training DeepONet

Supervised training: given $N$ (input function, solution) pairs from a PDE solver:

$\mathcal{L}_{\text{data}} = \frac{1}{N} \sum_{k=1}^{N} \sum_{j} \left\| \mathcal{G}_\theta(u_k)(y_j) - u_k^{\text{true}}(y_j) \right\|^2$

Here $u_k$ is the $k$-th initial condition (sampled at sensor locations), and $y_j$ are query points where we evaluate the output solution.


Physics-informed (PI-DeepONet): add a PDE residual loss:

$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{data}} + \beta \cdot \int \left\| \mathcal{N}\big[\mathcal{G}_\theta(u)\big] - f \right\|^2 dx$   ($\mathcal{N}$: PDE operator, $f$: forcing term)
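
A minimal sketch of how such a residual could be evaluated with automatic differentiation, assuming a DeepONet whose trunk takes space-time queries $(x, t)$; the names model, u0_batch, and xt are placeholders, and the PDE here is the unforced heat equation $u_t = \nu u_{xx}$ (so $f = 0$):

import torch

def pde_residual_loss(model, u0_batch, xt, nu=0.01):
    # xt: (n_query, 2) collocation points, columns (x, t)
    xt = xt.clone().requires_grad_(True)
    u = model(u0_batch, xt.unsqueeze(0)).squeeze(0)            # (n_query,)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0], du[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0]
    return ((u_t - nu * u_xx) ** 2).mean()                      # ||N[G(u0)] - f||^2 with f = 0

# total loss: alpha * data_loss + beta * pde_residual_loss(model, u0_batch, xt)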

DeepONet: Key Properties

Property | DeepONet
Irregular geometries | ✓  trunk evaluates at any point
Variable sensor locations | ✓  branch takes arbitrary inputs
Noise robustness | ✓  error increases <10× at 0.1% noise
No grid requirement | ✓  fully meshless
Resolution invariance | ~  approximate (depends on sensor density)

DeepONet: Results

From Lu et al. (2021), DeepONet achieves high accuracy across diverse PDE families:

Problem | Relative $L^2$ Error | Notes
Advection equation | $\sim\!10^{-3}$ | Linear, smooth solutions
Diffusion-reaction | $\sim\!10^{-3}$ | Nonlinear source term
Stochastic ODE system | $\sim\!10^{-2}$ | Random forcing
Gravity pendulum | $\sim\!10^{-3}$ | Nonlinear ODE
Key finding: A single trained DeepONet generalizes over thousands of initial conditions with millisecond inference time, while maintaining $<1\%$ relative error.

Lu et al. (2021). Learning nonlinear operators via DeepONet. Nature Machine Intelligence, 3, 218–229. [link]

Part III: Green's Functions → Fourier Neural Operators

From integral solutions to learnable spectral layers.

Green's Functions: The Integral Solution

$\mathcal{L}\, u(x) = f(x)$   (linear PDE, $f$ = forcing/source)

The Green's function $G(x,y)$ is the solution to the PDE when the forcing is a single impulse at $y$:

$\mathcal{L}_x\, G(x,y) = \delta(x - y)$

It tells you how a point source at $y$ influences the solution at $x$. Because the PDE is linear, the full solution is a superposition of these impulse responses:

$$u(x) = \int_\Omega G(x, y)\, f(y)\, dy$$
If you know $G$, you can solve the PDE for any forcing $f$ via integration. This is the ultimate non-local solution operator: every point in the domain can influence every other.
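
A minimal numpy sketch of this superposition for a case where $G$ is known in closed form: the steady problem $-u''(x) = f(x)$ on $[0, 1]$ with $u(0) = u(1) = 0$, whose Green's function is $G(x, y) = x(1-y)$ for $x \le y$ and $y(1-x)$ otherwise.

import numpy as np

n = 200
x = np.linspace(0, 1, n)
X, Y = np.meshgrid(x, x, indexing='ij')
G = np.where(X <= Y, X * (1 - Y), Y * (1 - X))    # Green's function G(x, y)

f = np.ones(n)                                    # constant forcing f(y) = 1
u = G @ f * (x[1] - x[0])                         # u(x) = ∫ G(x, y) f(y) dy (Riemann sum)

u_exact = x * (1 - x) / 2                         # analytic solution for f = 1
print('max error:', np.abs(u - u_exact).max())    # shrinks as n grows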

Superposition: Interactive

For the 1D diffusion equation $\;\frac{\partial u}{\partial t} = \nu \frac{\partial^2 u}{\partial x^2} + f(x)$, the Green's function is a Gaussian kernel: each impulse in the forcing $f(x)$ contributes a Gaussian bump, and the solution $u(x)$ is the superposition of these contributions.


From Green's Functions to Neural Operator Layers

Green's functions rely on linearity (superposition). For nonlinear PDEs, no simple $G$ exists. But we can learn a kernel and compose it through nonlinear layers.

Standard NN layer
$v^{(\ell+1)} = \sigma\!\big(W v^{(\ell)} + b\big)$

$W$ is a finite matrix (fixed input dimension)

Neural operator layer
$v^{(\ell+1)}\!(x) = \sigma\!\left(W v^{(\ell)}\!(x) + \int \kappa(x,y)\, v^{(\ell)}\!(y)\, dy + b\right)$

$\kappa(x,y)$ is a learnable kernel (continuous, any resolution)

The kernel $\kappa(x,y)$ is the learned analogue of the Green's function: a continuous description of how point $y$ influences point $x$. On a fixed grid, the integral becomes multiplication by a dense matrix, much like a fully connected layer; by learning $\kappa$ in continuous form instead, the operator generalizes across resolutions.

Kovachki, N. et al. (2023). Neural Operator: Learning Maps Between Function Spaces. JMLR, 24(89), 1–97.
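
To make the contrast concrete, here is a hypothetical sketch of such a layer with a generic (non-convolutional) kernel, parameterized by a small MLP; the class name and shapes are illustrative, and the $O(n^2)$ cost of this version is exactly what the next slide removes.

import torch
import torch.nn as nn

class KernelIntegralLayer1d(nn.Module):
    """Illustrative kernel-integral layer: kappa(x, y) is a small MLP.
    O(n^2) cost; FNO replaces this with an FFT-based version."""
    def __init__(self, channels, hidden=32):
        super().__init__()
        self.channels = channels
        self.kappa = nn.Sequential(               # (x, y) -> channels x channels matrix
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, channels * channels))
        self.W = nn.Linear(channels, channels)    # local (pointwise) linear term

    def forward(self, v, x):
        # v: (batch, n, channels) function values; x: (n,) grid points in [0, 1]
        n = x.shape[0]
        xy = torch.stack(torch.meshgrid(x, x, indexing='ij'), dim=-1)        # (n, n, 2)
        K = self.kappa(xy.reshape(-1, 2)).reshape(n, n, self.channels, self.channels)
        integral = torch.einsum('xyio,byi->bxo', K, v) / n                   # ∫ kappa(x,y) v(y) dy
        return torch.relu(self.W(v) + integral)

# Shape check: the same layer accepts any grid size (at O(n^2) cost).
layer = KernelIntegralLayer1d(channels=8)
v = torch.randn(4, 64, 8)
x = torch.linspace(0, 1, 64)
print(layer(v, x).shape)    # torch.Size([4, 64, 8])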

The Convolution Trick

A general kernel $\kappa(x, y)$ is expensive: $O(n^2)$ to evaluate on a grid of $n$ points.

Assume translation invariance: the kernel depends only on the difference $\kappa(x, y) = \kappa(x - y)$.

→ This is convolution!

→ Convolution in space = multiplication in Fourier domain.

The FFT makes this $O(n \log n)$ instead of $O(n^2)$. We fit the kernel parameters in Fourier space, where the integral becomes element-wise multiplication.
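
A quick numerical sanity check of this identity on a periodic grid (names and the Gaussian kernel are illustrative):

import numpy as np

n = 64
x = np.linspace(0, 1, n, endpoint=False)
v = np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)    # the function v(y)
kappa = np.exp(-(x - 0.5) ** 2 / 0.01)                      # a translation-invariant kernel

# Circular convolution, computed directly (O(n^2)) ...
direct = np.array([np.sum(kappa[(i - np.arange(n)) % n] * v) for i in range(n)]) / n
# ... and via the FFT (O(n log n)): multiplication in Fourier space.
spectral = np.real(np.fft.ifft(np.fft.fft(kappa) * np.fft.fft(v))) / n

print('max difference:', np.abs(direct - spectral).max())   # ~1e-15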

FNO Layer: Interactive

Each Fourier Neural Operator layer: apply FFT, multiply by learned weights, inverse FFT, add spatial path, activate.

$v^{(\ell+1)} = \sigma\!\left(W v^{(\ell)} + \mathcal{F}^{-1}\!\big(R_\phi \cdot \mathcal{F}(v^{(\ell)})\big)\right)$

FNO: Key Properties

Property | FNO
Resolution invariance | ✓  same parameters at any grid size
Computational cost | $O(n \log n)$ per layer (via FFT)
Regular grids only | requires a uniform Cartesian mesh
Spectral bias | mode truncation discards high frequencies
Speed vs. classical | up to 1000× faster than pseudo-spectral solvers

Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A. & Anandkumar, A. (2021). Fourier Neural Operator for Parametric PDEs. ICLR 2021.

In Code: DeepONet

import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, u_dim, y_dim, p=64):
        super().__init__()
        self.branch = nn.Sequential(        # input function → coefficients
            nn.Linear(u_dim, 128), nn.ReLU(),
            nn.Linear(128, p))
        self.trunk = nn.Sequential(         # query point → basis functions
            nn.Linear(y_dim, 128), nn.ReLU(),
            nn.Linear(128, p))

    def forward(self, u, y):
        b = self.branch(u)                  # (batch, p)
        t = self.trunk(y)                   # (batch, p)
        return (b * t).sum(-1)              # dot product
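
A quick shape check, assuming the minimal class above (one query point per input sample; the notebook version later handles many query points per input via an einsum):

model = DeepONet(u_dim=64, y_dim=2)     # e.g. 64 sensors, (x, t) queries
u = torch.randn(16, 64)                 # 16 input functions sampled at the sensors
y = torch.randn(16, 2)                  # one query point per sample
print(model(u, y).shape)                # torch.Size([16])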

In Code: FNO Layer

import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, ch, modes):
        super().__init__()
        self.modes = modes
        self.weights = nn.Parameter(              # learnable complex Fourier weights R
            torch.randn(ch, ch, modes, dtype=torch.cfloat) / ch)

    def forward(self, x):                         # x: (batch, ch, N)
        x_ft = torch.fft.rfft(x)                  # FFT along the spatial dimension
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum( # multiply the lowest modes by R
            'bix,iox->box', x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))
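
A quick check of the resolution claim, assuming the layer above: the learned weights depend only on the number of retained modes, not on the grid size.

layer = SpectralConv1d(ch=4, modes=12)
print(layer(torch.randn(8, 4, 64)).shape)    # torch.Size([8, 4, 64])
print(layer(torch.randn(8, 4, 256)).shape)   # torch.Size([8, 4, 256]) — same weights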

Operator Learning: DeepONet & FNO Hands-On


In this notebook we implement minimal versions of DeepONet and Fourier Neural Operator (FNO) and train them on a 1D diffusion equation.

Problem Setup

We learn the solution operator for the 1D heat equation:

$$\frac{\partial u}{\partial t} = \nu \frac{\partial^2 u}{\partial x^2}, \quad u(x, 0) = u_0(x)$$

Given random initial conditions $u_0(x)$, the operator maps $u_0 \mapsto u(x, T)$ at a fixed time $T$.

Python
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

# Grid parameters
N = 64            # spatial points
nu = 0.01         # diffusivity
T_final = 0.1     # evaluation time

x = np.linspace(0, 1, N, endpoint=False)
dx = x[1] - x[0]

Data Generation

We generate random initial conditions as sums of Fourier modes with decaying amplitudes, then solve the heat equation exactly using the spectral method (FFT-based). This gives us ground-truth (input, output) pairs for training.

Python
def random_ic(n_samples, n_modes=5):
    """Generate random initial conditions as sum of Fourier modes."""
    u0 = np.zeros((n_samples, N))
    for k in range(1, n_modes + 1):
        a = np.random.randn(n_samples, 1) / k
        b = np.random.randn(n_samples, 1) / k
        u0 += a * np.sin(2 * np.pi * k * x) + b * np.cos(2 * np.pi * k * x)
    return u0

def exact_solution(u0, t, nu):
    """Solve heat equation spectrally (exact for periodic BC)."""
    u0_hat = np.fft.fft(u0, axis=-1)
    k = np.fft.fftfreq(N, d=dx) * 2 * np.pi
    decay = np.exp(-nu * k**2 * t)
    return np.real(np.fft.ifft(u0_hat * decay, axis=-1))

# Generate training and test data
n_train, n_test = 1000, 200
u0_train = random_ic(n_train)
uT_train = exact_solution(u0_train, T_final, nu)
u0_test = random_ic(n_test)
uT_test = exact_solution(u0_test, T_final, nu)

print(f'Training: {n_train} samples, Test: {n_test} samples')
print(f'Grid: {N} points, T = {T_final}, nu = {nu}')

Part 1: DeepONet

The branch network takes the input function $u_0$ (sampled at sensor locations) and produces coefficients $b_1, \ldots, b_p$.

The trunk network takes query points $y$ and produces basis functions $t_1(y), \ldots, t_p(y)$.

$$\mathcal{G}(u_0)(y) = \sum_{i=1}^{p} b_i(u_0) \cdot t_i(y) = \langle \mathbf{b}(u_0), \mathbf{t}(y) \rangle$$

Python
class DeepONet(nn.Module):
    def __init__(self, u_dim, y_dim=1, p=64, hidden=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(u_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, p)
        )
        self.trunk = nn.Sequential(
            nn.Linear(y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, p)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u, y):
        # u: (batch, u_dim) — input function values
        # y: (batch, n_query, 1) — query points
        b = self.branch(u)                         # (batch, p)
        t = self.trunk(y)                          # (batch, n_query, p)
        out = torch.einsum('bp,bqp->bq', b, t)
        return out + self.bias

# Prepare data
u0_t = torch.tensor(u0_train, dtype=torch.float32)
uT_t = torch.tensor(uT_train, dtype=torch.float32)
y_grid = torch.tensor(x, dtype=torch.float32).unsqueeze(-1)    # (N, 1)
y_batch = y_grid.unsqueeze(0).expand(n_train, -1, -1)          # (n_train, N, 1)

model_don = DeepONet(u_dim=N, p=64)
optimizer = optim.Adam(model_don.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)

# Train
losses = []
for epoch in range(1500):
    pred = model_don(u0_t, y_batch)
    loss = nn.MSELoss()(pred, uT_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    losses.append(loss.item())
    if (epoch + 1) % 300 == 0:
        print(f'Epoch {epoch+1}: loss = {loss.item():.6f}')

plt.semilogy(losses)
plt.xlabel('Epoch'); plt.ylabel('MSE Loss'); plt.title('DeepONet Training')
plt.grid(alpha=0.3); plt.show()

DeepONet Architecture

The model defined above has two sub-networks:

  • Branch (nn.Sequential): takes the full input function $u_0$ sampled at $N$ sensor points and outputs $p$ coefficients.
  • Trunk (nn.Sequential): takes query coordinates $y$ and outputs $p$ basis function values.

The output is their dot product: $\mathcal{G}(u_0)(y) = \langle \mathbf{b}(u_0), \mathbf{t}(y) \rangle + \text{bias}$.

Python
# Test DeepONet
u0_te = torch.tensor(u0_test, dtype=torch.float32)
y_te = y_grid.unsqueeze(0).expand(n_test, -1, -1)
with torch.no_grad():
    pred_don = model_don(u0_te, y_te).numpy()

rel_err = np.mean(np.linalg.norm(pred_don - uT_test, axis=1) /
                  (np.linalg.norm(uT_test, axis=1) + 1e-8))
print(f'DeepONet relative L2 error: {rel_err:.4f}')

fig, axes = plt.subplots(1, 3, figsize=(14, 3.5))
for i, ax in enumerate(axes):
    ax.plot(x, uT_test[i], 'k-', label='True', linewidth=2)
    ax.plot(x, pred_don[i], '--', color='#61afef', label='DeepONet')
    ax.set_title(f'Test {i+1}', fontsize=11)
    ax.legend(fontsize=9); ax.set_xlabel('x')
plt.suptitle('DeepONet Predictions on Test Set', fontsize=13)
plt.tight_layout(); plt.show()

Part 2: Fourier Neural Operator (FNO)

Each FNO layer applies: $v^{(\ell+1)} = \sigma\big(W v^{(\ell)} + \mathcal{F}^{-1}(R_\phi \cdot \mathcal{F}(v^{(\ell)}))\big)$

  • FFT the input, multiply low modes by learned weights $R_\phi$, iFFT back
  • Add a local linear transform $W v$ (skip connection in spatial domain)
  • Apply nonlinear activation

FNO Architecture

The key building block is SpectralConv1d: it applies the FFT, multiplies the lowest-frequency Fourier coefficients by a learned complex weight matrix $R_\phi$, then applies the inverse FFT.

The full FNO1d model stacks multiple such layers. Each layer also has a local 1x1 convolution $W$ (skip connection in spatial domain). The input is first "lifted" from 1 channel to width channels, and the output is projected back to 1 channel.

Python
class SpectralConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes
        scale = 1 / (in_ch * out_ch)
        self.weights = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

    def forward(self, x):
        # x: (batch, in_ch, N)
        x_ft = torch.fft.rfft(x)
        out_ft = torch.zeros(x.size(0), self.weights.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes] = torch.einsum(
            'bix,iox->box', x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))


class FNO1d(nn.Module):
    def __init__(self, modes=16, width=32, n_layers=4):
        super().__init__()
        self.lift = nn.Linear(1, width)           # lift input to channel dim
        self.layers = nn.ModuleList()
        self.ws = nn.ModuleList()
        for _ in range(n_layers):
            self.layers.append(SpectralConv1d(width, width, modes))
            self.ws.append(nn.Conv1d(width, width, 1))
        self.proj = nn.Linear(width, 1)           # project back to 1D

    def forward(self, x):
        # x: (batch, N) → output: (batch, N)
        x = x.unsqueeze(-1)                       # (batch, N, 1)
        x = self.lift(x)                          # (batch, N, width)
        x = x.permute(0, 2, 1)                    # (batch, width, N)
        for conv, w in zip(self.layers, self.ws):
            x = torch.relu(conv(x) + w(x))
        x = x.permute(0, 2, 1)                    # (batch, N, width)
        return self.proj(x).squeeze(-1)           # (batch, N)


model_fno = FNO1d(modes=16, width=32, n_layers=4)
optimizer = optim.Adam(model_fno.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)

losses_fno = []
for epoch in range(1500):
    pred = model_fno(u0_t)
    loss = nn.MSELoss()(pred, uT_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    losses_fno.append(loss.item())
    if (epoch + 1) % 300 == 0:
        print(f'Epoch {epoch+1}: loss = {loss.item():.6f}')

plt.semilogy(losses_fno)
plt.xlabel('Epoch'); plt.ylabel('MSE Loss'); plt.title('FNO Training')
plt.grid(alpha=0.3); plt.show()
Python
# Test FNO
with torch.no_grad():
    pred_fno = model_fno(u0_te).numpy()

rel_err_fno = np.mean(np.linalg.norm(pred_fno - uT_test, axis=1) /
                      (np.linalg.norm(uT_test, axis=1) + 1e-8))
print(f'FNO relative L2 error: {rel_err_fno:.4f}')

fig, axes = plt.subplots(1, 3, figsize=(14, 3.5))
for i, ax in enumerate(axes):
    ax.plot(x, uT_test[i], 'k-', label='True', linewidth=2)
    ax.plot(x, pred_fno[i], '--', color='#c678dd', label='FNO')
    ax.set_title(f'Test {i+1}', fontsize=11)
    ax.legend(fontsize=9); ax.set_xlabel('x')
plt.suptitle('FNO Predictions on Test Set', fontsize=13)
plt.tight_layout(); plt.show()

Comparison

Let's compare DeepONet and FNO side by side on the same test samples.

Python
print(f'DeepONet relative L2 error: {rel_err:.4f}')
print(f'FNO relative L2 error: {rel_err_fno:.4f}')

fig, axes = plt.subplots(2, 3, figsize=(14, 7))
for i in range(3):
    axes[0, i].plot(x, uT_test[i], 'k-', lw=2, label='True')
    axes[0, i].plot(x, pred_don[i], '--', color='#61afef', label='DeepONet')
    axes[0, i].set_title(f'Test {i+1}'); axes[0, i].legend(fontsize=9)
    axes[1, i].plot(x, uT_test[i], 'k-', lw=2, label='True')
    axes[1, i].plot(x, pred_fno[i], '--', color='#c678dd', label='FNO')
    axes[1, i].set_xlabel('x'); axes[1, i].legend(fontsize=9)
axes[0, 0].set_ylabel('DeepONet'); axes[1, 0].set_ylabel('FNO')
plt.suptitle('DeepONet vs FNO on Heat Equation', fontsize=14)
plt.tight_layout(); plt.show()

Exercises

  1. Resolution transfer (FNO): Train the FNO on a 64-point grid, then evaluate it on a 128-point grid. Does it transfer? Why? (A starting-point sketch follows this list.)

  2. Sensor sparsity (DeepONet): Reduce the number of sensor points in the branch network (e.g., use every 4th point). How does accuracy degrade?

  3. Harder PDE: Replace the heat equation with Burgers' equation $u_t + u u_x = \nu u_{xx}$. Which method handles the nonlinearity better?

  4. Physics-informed DeepONet: Add a PDE residual loss term $\left\|\mathcal{N}[\mathcal{G}_\theta(u_0)]\right\|^2$ (the forcing is zero for the heat equation) and train with fewer data pairs. Does it help?
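
For Exercise 1, a possible starting point (it reuses the notebook's variables and mirrors random_ic / exact_solution inline, since those helpers are hard-wired to the 64-point grid; the new variable names are illustrative):

Python
# Exercise 1 starting point: evaluate the trained FNO on a finer grid.
N_fine = 128
x_fine = np.linspace(0, 1, N_fine, endpoint=False)

u0_fine = np.zeros((n_test, N_fine))
for k in range(1, 6):
    a = np.random.randn(n_test, 1) / k
    b = np.random.randn(n_test, 1) / k
    u0_fine += a * np.sin(2 * np.pi * k * x_fine) + b * np.cos(2 * np.pi * k * x_fine)

k_f = np.fft.fftfreq(N_fine, d=x_fine[1] - x_fine[0]) * 2 * np.pi
uT_fine = np.real(np.fft.ifft(np.fft.fft(u0_fine) * np.exp(-nu * k_f**2 * T_final)))

with torch.no_grad():
    pred_fine = model_fno(torch.tensor(u0_fine, dtype=torch.float32)).numpy()
err_fine = np.mean(np.linalg.norm(pred_fine - uT_fine, axis=1) /
                   (np.linalg.norm(uT_fine, axis=1) + 1e-8))
print(f'FNO trained on {N} points, evaluated on {N_fine}: rel. L2 error = {err_fine:.4f}')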


Part IV: Comparing Architectures

DeepONet vs FNO vs PINNs: when to use what.

When to Use What

Feature | DeepONet | FNO | PINNs
Irregular geometry | ✓ | ✗ | ✓
Resolution invariance | ~ | ✓ | ✓
Noise robustness | ✓ | ✓ | ~
No training data needed | ✗ | ✗ | ✓
Generalize over ICs | ✓ | ✓ | ✗
Speed (inference) | ✓ ~ms | ✓ ~ms | ✗ retrain per problem

Other Operator Architectures

The field is evolving rapidly. Some notable variants:

Neural POD / POD-DeepONet
Replace the learned trunk basis with POD modes computed from data. Better for problems with dominant coherent structures.
Graph Neural Operators
Kernel as message-passing on graphs. Works on irregular meshes. $O(n^2)$ complexity.
LOCA (Attention-based)
Attention-weighted averaging of features. Strong with sparse measurements. (Kissas et al., JMLR 2022)
Open question:
No single architecture dominates. Choice depends on geometry, data quality, and problem structure.

Connections Across the Course

SVD/POD (Lecture 9)
→ DeepONet trunk learns generalized basis functions
DMD (Lecture 10)
→ Temporal dynamics of modes, eigenvalue evolution
PINNs (Lecture 20)
→ PI-DeepONet adds PDE loss to operator training
Separation of Variables (Lectures 4–5)
→ Branch-trunk split is a neural generalization
Unifying theme: the more you know about classical methods, the better your neural architecture. Every operator architecture encodes some classical structure as an inductive bias.

Summary: Design Principles

Function → Function
Operators map functions to functions, generalizing beyond fixed-size vectors. Discretization invariance is the goal.
Classical → Neural
Modal decomposition → DeepONet.
Green's function → FNO.
Inductive bias from mathematics.
Train Once, Evaluate Many
The key advantage over PINNs: learn the operator once, then instantly solve for any new input.

References

Chen, T. & Chen, H. (1995). Universal approximation to nonlinear operators by neural networks with arbitrary activation functions. IEEE Trans. Neural Networks, 6(4), 911–917.

Lu, L., Jin, P., Pang, G., Zhang, Z. & Karniadakis, G.E. (2021). Learning nonlinear operators via DeepONet. Nature Machine Intelligence, 3, 218–229.

Li, Z., Kovachki, N., Azizzadenesheli, K. et al. (2021). Fourier Neural Operator for Parametric PDEs. ICLR 2021.

Kovachki, N., Li, Z., Liu, B. et al. (2023). Neural Operator: Learning Maps Between Function Spaces. JMLR, 24(89), 1–97.