What are Graph Transformers ?

From social networks to urban transit, graphs are everywhere. But beyond those obvious candidates, graphs can elegantly describe gridded data, where each cell’s value is dependent on its neighbours. This perspective opens a tremendous range of applications in scientific computing. In computational fluid dynamics, models are based on differential equations, such as the Navier-Stokes equations, which describe the behavior of the fluid through advection, diffusion and reaction. Numerical models solve them by approximating local gradients, and the value of each cell is updated at each time step from the value of a finite number of neighbors. This approach, despite it computational load, works well in idealized cases. However, as soon as complexity in the form of viscosity or turbulence arises, equations fall short in describing the entire picture. Modelling relies on empirical parametrizations to describe viscosity coefficients for instance.

Such cases are perfect candidates for graph transformers. Cells can be represented by nodes, and their connectivity by edges, thus creating regular graphs. They can be trained to not only approximate the gradient contribution from neighboring nodes, but also account for phenomenon that were previously parametrized. But how do graph transformers work in the first place ?

Let’s first take it back to the classical transformer architecture. Transformers are a type of neural networks specialized in sequence modelling ¹. They iteratively transform a sequence of data into another, while mixing information between elements, or tokens, of the sequence via self-attention ². A transformer typically has several attention heads through which the data goes in parallel. In each head, the input data is transformed by going through a transformer block a given number of time. A transformer block defines a series of deep learning layers. Typically, it includes layer normalization, self-attention and MLP layers. The output is given by concatenating the output of the different heads, before an output transformation can be applied. Graph Transformer use a variation of self-attention referred to as sparse attention. Figure 1 shows where sparse attention is implemented in a classical transformer architecture.

During self-attention, the output $Z$ is a weighted average of the $T_{in}$ input tokens and is computed from the query $Q$, keys $K$ and values $V$ matrices :

$$z_i = \sum_{j=1}^{T_{in}} p_{i,j} \cdot v_j$$ with $$\sum_{j=1}^{T_{in}} p_{i,j} = 1$$

forming a valid probability distribution. This can be written in matrix form as:

$$Z = PV = softmax(\frac{Q K^T}{\sqrt{d_k}}) \cdot V$$

The summation is taken over the entire input tokens sequence, making this multiplication very large, and potentially a computational bottleneck in the model. Sparse attention is a variant of the self-attention mechanism, where tokens only attend to a subset of other tokens. This subset can be defined in several manners, such as choosing a random subgroup, or attending to neighboring tokens only. Graph transformers employ sparse attention and define token attendance by their relative position on the graph. The softmax operation only sums the neighbouring tokens. This process is sometimes referred to as window attention. Figure 2 illustrates this process in the one-dimensional case.

**Figure 2.** The pink shadow represents tokens respectively attended to by each other token for (a) Regular Attention, (b) Sparse Attention with a window of size 3.

Additionally, the updated cell value also depends on edges attributes, such as length and direction, of the edge that links it to its neighbor. This method is referred to edge-conditioned convolution ³. The attention computation becomes :

$$z_i = \sum_{j \in \mathcal{N}_i} p_{i,j} \cdot v_j \cdot h_{\theta}(e_{i,j})$$

with $h_{\theta}(e_{i,j})$ a function of edge attributes $e_{i,j}$ between nodes $i$ and$j$. This function can be a learnt function, such as a simple MLP. $\mathcal{N}_i$ denotes the neighbourhood of cell $i$. This variant allows the transfer of information to be conditional on the graph structure.

In conclusion, graph transformers are constrained by which input fields are connected during the self-attention layer ⁴. This eliminates the need for positional encoding, as all positional information is contained in the existing edge connections and attributes. The final architecture induces a strong relational inductive bias as it assumes the structure of the data ⁵. In the context of computational fluid dynamics, it is equivalent to stating that the fluid at a point directly depends on the local fluid. Repeated passes of the transformer block allow the account of longer range spatial dependencies. The crux here is to have sufficient high-quality data to train such model, and a reliable way to assess its performance. In such applications, graph transformers are a prime example of how prior knowledge about the problem can be leveraged to apply an inductive bias in deep neural networks, thus reducing the computational and data requirements.

A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, Vol. 30, 2017. ↩
M. Jaggi and N. Flammarion, Class notes from CS-433 Machine Learning. Computer Sciences, EPFL, 2023. ↩
M. Simonovsky and N. Komodakis, “Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs.”. Available: https://arxiv.org/abs/1704.02901 ↩
C. K. Joshi, “Transformers are Graph Neural Networks.”. Available:https://arxiv.org/abs/2506.22084 ↩
P. W. Battaglia et al., “Relational inductive biases, deep learning, and graph networks.”. Available: https://arxiv.org/abs/1806.01261 ↩