
March 20, 2026

Hyperbolic Recurrent State-Space Model (H-RSSM)

Personal research on hyperbolic latent spaces for learning hierarchical representations.

World Models · Geometric Deep Learning · Hyperbolic Latent Spaces · Hierarchical Representations

Status: Work in progress. This is ongoing research.

All results below are preliminary and, except where stated, single-seed (due to computational limitations). Multi-seed variance is reported only for the comparison in Experiment 3.

Project Overview

World models for RL learn a latent representation of the environment and reason entirely within it. In practice, that latent space is almost always Euclidean, which works fine but isn't optimal for everything. When you're dealing with hierarchical structure (trees, nested sub-goals, part-whole relationships), hyperbolic geometry is theoretically superior: it can embed trees with bounded distortion in constant dimension, whereas Euclidean space simply can't. The problem is computational. The natural distribution on $\mathbb{H}^d$ lacks a closed-form normalising constant, which makes the KL term in the ELBO intractable at training speed.

That's what I set out to fix. I introduce the Horospherical Gaussian ($\mathcal{HG}$): a distribution family over the upper half-space model of hyperbolic space where the normalising constant, the KL divergence between pairs, and the reparameterised sampler all have closed-form expressions computable in $O(d)$, matching the cost of a standard Gaussian RSSM. It's geometrically motivated but not geodesically exact; I characterise the approximation error explicitly.

On a synthetic $B$-ary tree MDP, the $\mathcal{HG}$ prior does produce depth alignment ($|\rho_\tau| \approx 0.27$–$0.31$), but the effect is modest and the reconstruction MSE stays well above the observation noise floor. The most interesting empirical result is that alignment extends monotonically to depths the model never saw during training ($\rho_\text{ood} = +0.536$). But it doesn't hold up. Under $3\times$ longer training, the hyperbolic $\rho$ drops from $0.306$ to $0.200$ while the Euclidean baseline stays flat. Every architectural tweak I tried to strengthen the signal (decoupled KL weights, geometry-aware decoder, action-conditioned prior) either collapsed the representation or got absorbed by the reconstruction loss. This write-up documents what worked, what didn't, and where I'm still uncertain.

Key findings at a glance

  • Dimensional efficiency (consistent with Sarkar (2011)). $\mathbb{H}^2$ reaches $|\rho_\tau| = 0.264$, statistically indistinguishable from $\mathbb{R}^{32}$ at $|\rho| = 0.277$. But $|\rho|$ is essentially flat across dimensions for both models, so the result is only consistent with the capacity advantage of $\mathbb{H}^2$ and not a clear demonstration of the theorem.
  • Out-of-distribution depth ordering. Trained on depths 0–5, the standard $\mathcal{HG}$ model correctly ranks depth 6–7 nodes it never saw ($\rho_\text{ood} = +0.536$). The corresponding action-prior variant, whose bias parameters collapse to $\approx 0.0008$, inverts the OOD ordering ($\rho_\text{ood} = -0.511$).
  • Erosion under extended training. At 50k steps, $\rho_\tau$ for $\mathcal{HG}$ drops from $0.306$ to $0.200$; the Euclidean baseline stays put ($0.275 \to 0.276$). The geometric inductive bias is not self-reinforcing under the current (Euclidean) reconstruction loss.

Table of Contents

  1. Hyperbolic Latent Spaces
  2. The Technical Obstacle: Intractable KL Divergences
  3. The Horospherical Gaussian
  4. Experimental Setup
  5. Results
  6. Discussion
  7. Assumptions and Limitations
  8. Code and Reproducibility
  9. References
  10. Appendix: Derivations

1. Hyperbolic Latent Spaces

1.1 Why Hyperbolic Geometry?

World models like DreamerV3 or IRIS jointly learn a compact latent representation of the environment and a transition kernel over it. The agent reasons over latent states, not observations. So the geometry of that latent space shapes everything it can represent and predict.

In practice, we always choose Euclidean geometry: $\mathcal{Z} = \mathbb{R}^d$. It's the default because it's computationally convenient.

But many environments have hierarchical structure. Think of an agent navigating nested sub-goals in a grid world, manipulating objects with part-whole relationships, or exploring a maze with branches. In all these cases the state space is tree-like. States multiply exponentially with depth. Euclidean space has polynomial volume growth: the volume of a ball of radius $r$ grows as $r^d$. That means you can't fit exponential node counts without distorting distances badly; you'd need $O(n)$ dimensions to embed a tree of $n$ nodes with acceptable distortion.

Hyperbolic space is different. It has exponential volume growth: a ball of radius $r$ in $\mathbb{H}^d$ has volume growing as $e^{(d-1)r}$, exactly matching tree structure. Sarkar (2011) showed that any tree of $n$ nodes embeds into $\mathbb{H}^2$ with bounded distortion, constant and independent of $n$, in just 2 dimensions. In Euclidean space, you'd need $d = \Omega(\log n)$.

Tree embedded in R^2 (high distortion) vs tree embedded in H^2 (low distortion, Poincaré disk)

Trees embed efficiently in hyperbolic space ($\mathbb{H}^2$) with bounded distortion, compared to polynomial distortion in Euclidean space ($\mathbb{R}^2$).

1.2 Embedding in $\mathbb{H}^d$

Early work on hyperbolic geometry in ML focused on static embeddings: you have a dataset with known hierarchical structure (a taxonomy, knowledge graph, phylogeny), and you learn to map nodes to $\mathbb{H}^2$ while preserving distances. Nickel & Kiela's Poincaré embeddings are the classic example. They showed that WordNet hierarchies embed in $\mathbb{H}^2$ with far less distortion than any Euclidean baseline at the same dimension.

The insight is elegant. The Poincaré ball model $\mathbb{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$ with metric $ds^2 = \frac{4\|dx\|^2}{(1-\|x\|^2)^2}$ naturally places the root near the origin (where space is sparse) and leaves near the boundary (where space is exponentially dense). Depth in the hierarchy maps directly to distance from the center.

Poincaré disk with a small tree embedded. Root at center, leaves near boundary


But that's learning from ground truth. In a world model, you don't know the structure ahead of time. You have to infer it from observations alone. The embedding becomes a posterior $q_\phi(z_t \mid h_t, o_t)$: a distribution over $\mathbb{H}^d$ that changes at each timestep, learned end-to-end from raw data.

1.3 Learning a Model Inside $\mathbb{H}^d$

That's where the problems start. Moving from offline static embeddings to a learned generative model introduces three tight constraints, all rooted in the same issue: there's no tractable Gaussian-like distribution on $\mathbb{H}^d$. Specifically:

  • (i) no standard family has a closed-form density with respect to the Riemannian volume
  • (ii) without that, the KL in the ELBO has to be estimated by Monte Carlo
  • (iii) the reparameterisation trick doesn't carry over from $\mathbb{R}^d$

Distributions on $\mathbb{H}^d$. A VAE-style world model needs a prior $p_\theta(z_t \mid h_t)$ and a posterior $q_\phi(z_t \mid h_t, o_t)$, both on $\mathbb{H}^d$. The natural choice is the Riemannian Normal $\mathcal{N}_\mathbb{H}(\mu, \sigma^2) \propto e^{-d_\mathbb{H}^2(z, \mu)/2\sigma^2}$, but it has no closed-form normalising constant for $d \geq 2$. Other proposals exist: wrapped normals (Nagano et al., 2019) are computationally heavy; pseudo-hyperbolic Gaussians sacrifice geometric fidelity. None of them are simultaneously tractable, geometrically sound, and fast enough for RL training.

KL divergences. Training requires computing $\mathrm{KL}[q_\phi \| p_\theta]$ at every gradient step. Without a closed form, you resort to Monte Carlo estimation, which kills throughput when you need fast ELBO evaluations. That cost is a big part of why hyperbolic world models haven't really taken off.

Reparameterisation. You need to backprop through $z \sim q_\phi(\cdot \mid h_t, o_t)$. On a Riemannian manifold, the standard Euclidean reparameterisation trick doesn't apply directly. Exponential-map approaches exist but blow up numerically near the boundary of the Poincaré ball.

 full model pipeline — observation → GRU → HG posterior → latent z in H^d → decoder → reconstruction, with KL between HG prior and posterior

Full model pipeline

My approach solves all three by using the upper half-space model and introducing a new distribution: the Horospherical Gaussian. It's designed so the KL is exactly tractable in $O(d)$, sampling is exact, and the geometry holds to leading order.


2. The Technical Obstacle: Intractable KL Divergences

World models are trained by maximising the ELBO:

$$\mathcal{L}(\theta, \phi) = \sum_{t=1}^T \mathbb{E}_{q_\phi}\!\left[\log p_\theta(o_t \mid z_t, h_t)\right] - \beta \sum_{t=1}^T \mathrm{KL}\!\left[q_\phi(z_t \mid h_t, o_t)\,\big\|\,p_\theta(z_t \mid h_t)\right]$$

In the Euclidean case both prior and posterior are Gaussian and the KL has a closed form computable in $O(d)$. On $\mathbb{H}^d$ the natural candidate is the Riemannian Normal, but its normalising constant is the integral that blocks the whole approach:

$$Z_\mathbb{H}(\sigma) = \int_0^\infty e^{-r^2/2\sigma^2}\,\sinh^{d-1}(r)\, dr$$

For every $d \geq 2$ this integral has no closed form. KLs between Riemannian Normals therefore require Monte Carlo estimation, which is far too slow for the per-step ELBO evaluations needed during training. A like-for-like MC estimator would demand a large sample budget per step, compared with the $O(d)$ arithmetic of the closed-form alternative I develop below.

This is the problem I set out to fix first.


3. The Horospherical Gaussian

3.1 Construction

I work in the upper half-space model of hyperbolic space, using horospherical (log-height) coordinates:

$$\mathbb{H}^d = \{(\tau, b) : \tau \in \mathbb{R},\; b \in \mathbb{R}^{d-1}\}$$

The metric is

$$ds^2 = d\tau^2 + e^{-2\tau}\|db\|^2$$

and the Riemannian volume form is

$$d\mathrm{vol}(\tau, b) = e^{-(d-1)\tau}\, d\tau\, db$$

This is the same geometry as the usual upper half-space coordinates $(x,y)$ with $y>0$ after the change of variables $b=x$, $\tau=\log y$ (see A.6 for the derivation). The coordinate $\tau$ is the Busemann coordinate (signed distance to the ideal boundary $\partial\mathbb{H}^d$) and $b$ is the fibre coordinate (the lateral position within a level set). I prefer this model to the Poincaré ball because additive updates $(\tau, b) \leftarrow (\tau + \delta\tau, b + \delta b)$ are globally well-defined and avoid the boundary singularities that can destabilise Möbius-based implementations.

Horospherical Gaussian. Fix $\mu_\tau \in \mathbb{R}$, $\mu_b \in \mathbb{R}^{d-1}$, $\sigma_\tau > 0$, $\sigma_b > 0$. I define the distribution $\mathcal{HG}(\mu_\tau, \mu_b, \sigma_\tau, \sigma_b)$ with density w.r.t. $d\mathrm{vol}$ by

$$p(\tau, b) = \frac{1}{Z}\exp\!\left(-\frac{(\tau - \mu_\tau)^2}{2\sigma_\tau^2} - \frac{\|b - \mu_b\|^2}{2\,e^{2\tau}\,\sigma_b^2}\right)$$

The conditional variance $\mathrm{Var}(b \mid \tau) = e^{2\tau}\sigma_b^2$ is the unique scaling under which the two cancellations below occur.
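For concreteness, here is a minimal PyTorch sketch of the resulting log-density with respect to the Riemannian volume form, using the normalising constant derived in §3.2 / A.1. It assumes scalar $\sigma_\tau$, $\sigma_b$ per distribution and batched tensors; the function name is illustrative, not the hrssm API.

```python
import math
import torch

def hg_log_prob(tau, b, mu_tau, mu_b, sigma_tau, sigma_b):
    """Log-density of HG(mu_tau, mu_b, sigma_tau, sigma_b) w.r.t. the volume
    form e^{-(d-1) tau} dtau db. Shapes: tau (...,), b (..., d-1)."""
    d_minus_1 = b.shape[-1]                          # fibre dimension d-1
    # log Z = log(sqrt(2*pi) * sigma_tau) + (d-1)/2 * log(2*pi*sigma_b^2)
    log_Z = (0.5 * math.log(2 * math.pi) + torch.log(sigma_tau)
             + 0.5 * d_minus_1 * torch.log(2 * math.pi * sigma_b**2))
    quad_tau = (tau - mu_tau)**2 / (2 * sigma_tau**2)
    quad_b = ((b - mu_b)**2).sum(-1) / (2 * torch.exp(2 * tau) * sigma_b**2)
    return -(quad_tau + quad_b) - log_Z
```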

3.2 Tractability: the two cancellations

Two elementary cancellations are what make the family tractable.

Cancellation 1 (normalising constant). Integrating out the $b$-component at fixed $\tau$ picks up a factor $e^{(d-1)\tau}$ that exactly cancels the $e^{-(d-1)\tau}$ in the volume form. The normalising constant evaluates to

$$Z = \sqrt{2\pi}\,\sigma_\tau \cdot (2\pi\sigma_b^2)^{(d-1)/2}$$

independent of $\mu_\tau$ and $\mu_b$ (full derivation in A.1).

Cancellation 2 (drift moment). The tower property gives

$$\mathbb{E}_q\!\left[e^{-2\tau}\|b-\mu_b\|^2\right] = \mathbb{E}_\tau\!\left[e^{-2\tau} \cdot (d-1)e^{2\tau}\sigma_b^2\right] = (d-1)\sigma_b^2.$$

Both cancellations are exact and together they make the closed-form KL possible.

Closed-form KL. Combining the cancellations yields a closed-form KL between two $\mathcal{HG}$ distributions (derivation in A.2):

$$\mathrm{KL}[q\|p] = \underbrace{\log\frac{\sigma_\tau^p}{\sigma_\tau^q} + \frac{(\sigma_\tau^q)^2 + (\mu_\tau^q - \mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{1}{2}}_{\text{height KL}} + (d-1)\underbrace{\left(\log\frac{\sigma_b^p}{\sigma_b^q} + \frac{(\sigma_b^q)^2}{2(\sigma_b^p)^2} - \frac{1}{2}\right)}_{\text{fibre scale KL}} + \underbrace{\frac{\|\mu_b^q - \mu_b^p\|^2}{2(\sigma_b^p)^2}\,e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}}_{\text{drift term}}$$

All three terms are exact and the total computation is $O(d)$. The drift term follows from the MGF identity $\mathbb{E}[e^{-2\tau}] = e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}$. I validated the full expression against a large-sample GPU Monte Carlo estimate: relative error 0.005%.
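A minimal PyTorch sketch of this closed-form KL, written directly from the three terms above. Function and argument names are illustrative rather than the hrssm API, and I assume scalar $\sigma$'s per distribution; checking it against a Monte Carlo estimate of $\mathbb{E}_q[\log q - \log p]$ is the validation route mentioned above.

```python
import torch

def hg_kl(mu_tau_q, mu_b_q, sig_tau_q, sig_b_q,
          mu_tau_p, mu_b_p, sig_tau_p, sig_b_p):
    """Closed-form KL[q || p] between two HG distributions, O(d) per pair."""
    d_minus_1 = mu_b_q.shape[-1]
    # height (Busemann coordinate) KL: an ordinary 1-D Gaussian KL
    height = (torch.log(sig_tau_p / sig_tau_q)
              + (sig_tau_q**2 + (mu_tau_q - mu_tau_p)**2) / (2 * sig_tau_p**2) - 0.5)
    # fibre scale KL, weighted by the fibre dimension (d-1)
    fibre = d_minus_1 * (torch.log(sig_b_p / sig_b_q)
                         + sig_b_q**2 / (2 * sig_b_p**2) - 0.5)
    # drift term: fibre-mean mismatch weighted by E_q[e^{-2 tau}] (Gaussian MGF at s = -2)
    drift = (((mu_b_q - mu_b_p)**2).sum(-1) / (2 * sig_b_p**2)
             * torch.exp(-2 * mu_tau_q + 2 * sig_tau_q**2))
    return height + fibre + drift
```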

Reparameterisation. Sampling is exact and straightforward to implement:

$$\tau = \mu_\tau^q + \sigma_\tau^q\,\varepsilon_1, \quad \varepsilon_1 \sim \mathcal{N}(0,1), \qquad b = \mu_b^q + e^\tau\,\sigma_b^q\,\varepsilon_2, \quad \varepsilon_2 \sim \mathcal{N}(0, I_{d-1})$$

Note the coupling of $b$ to $\tau$ through the factor $e^\tau$: it is exact and must be preserved in autograd (see A.3).
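A sketch of the sampler, again with illustrative names; the only thing that matters is that the $e^\tau$ factor stays inside the autograd graph.

```python
import torch

def hg_rsample(mu_tau, mu_b, sigma_tau, sigma_b):
    """Exact reparameterised sample (tau, b) from an HG distribution."""
    eps1 = torch.randn_like(mu_tau)
    eps2 = torch.randn_like(mu_b)
    tau = mu_tau + sigma_tau * eps1
    # b is coupled to tau through e^tau; do not detach tau here (see A.3)
    b = mu_b + torch.exp(tau).unsqueeze(-1) * sigma_b * eps2
    return tau, b
```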

3.3 Cost of tractability: $\mathcal{HG}$ is not the Riemannian Normal

The $\mathcal{HG}$ is a tractable approximation rather than the geodesically exact Riemannian Normal. The leading-order discrepancy (see A.4) is

$$\mathrm{KL}\!\left[\mathcal{N}_\mathbb{H}(\mu, \sigma^2)\,\big\|\,\mathcal{HG}(\mu_\tau, \mu_b, \sigma, \sigma)\right] = \frac{d^2-1}{8}\,\sigma^2 + O(\sigma^3)$$

For $d = 16$ and $\sigma = 0.3$ this term evaluates to approximately $2.87$. Whether this approximation error matters in practice is one of the empirical questions the experiments address.

The trade-off summarised:

| | $\mathcal{HG}$ | Riemannian Normal | Wrapped Normal |
|---|---|---|---|
| Closed-form KL | ✓ ($O(d)$) | ✗ | partial |
| Exact reparameterisation | ✓ | ✗ | ✓ (exp-map) |
| No boundary singularities | ✓ (half-space) | ✗ (Möbius) | — |
| Geodesic exactness | leading-order | ✓ | approximate |

4. Experimental Setup

4.1 Hypotheses under test

I’m testing three concrete claims, in order:

  • H1 (alignment). Training on the ELBO alone is enough for $\mathcal{HG}$ to align its Busemann coordinate $\tau$ with ground-truth depth.
  • H2 (capacity). $\mathcal{HG}$ reaches a given level of depth alignment at lower dimension than a Euclidean Gaussian (a sample-efficiency consequence of Sarkar's theorem).
  • H3 (extrapolation). The alignment creates an ordering that extends monotonically to depths never seen in training.

4.2 Environment

I use a $B$-ary tree MDP: a rooted tree with branching factor $B$ and depth $L$. Each node $v$ has a known depth $\ell(v) \in \{0, \ldots, L\}$ and a fixed observation generated by mixing random depth and branch embeddings:

$$o_v = \tanh(W_{\mathrm{obs}} \cdot \mathrm{concat}(e_v^{\mathrm{depth}}, e_v^{\mathrm{branch}})) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_{\mathrm{obs}}^2 I)$$

The embeddings $e_v^{\mathrm{depth}}$ and $e_v^{\mathrm{branch}}$ are not given to the model. It only sees $o_v \in \mathbb{R}^{64}$ and has to infer the structure from sequential transitions alone. For the main experiments I use $B = 4$, $L = 5$, which gives 1365 nodes. The $d = \Omega(\log n)$ heuristic suggests a crossover around $d \approx \log_2(1365) \approx 10.4$: below that dimension you would expect a Euclidean model to struggle with the tree hierarchy relative to $\mathbb{H}^2$.
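A minimal sketch of the observation model above, assuming per-node embedding tables and a fixed projection matrix; names and shapes are illustrative, not the exact code in the repo.

```python
import torch

def make_tree_observations(depth_emb, branch_emb, W_obs, sigma_obs=0.05):
    """Fixed noisy 64-d observation for every node of the B-ary tree.
    depth_emb, branch_emb: (num_nodes, k/2) hidden per-node embeddings;
    W_obs: (64, k) projection. The model only ever sees the output."""
    features = torch.cat([depth_emb, branch_emb], dim=-1)   # (num_nodes, k)
    clean = torch.tanh(features @ W_obs.T)                  # (num_nodes, 64)
    return clean + sigma_obs * torch.randn_like(clean)
```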

diagram of the B=4, L=5 tree with color-coded depth levels

B-ary tree MDP

The observation noise creates a reconstruction MSE floor: if the decoder perfectly reconstructed the clean mean, the residual error from noise alone would be $\mathbb{E}\|\varepsilon\|^2/\dim(o) = \sigma_{\mathrm{obs}}^2$. In my code, $\sigma_{\mathrm{obs}} = 0.05$, so the per-dimension floor is $\sigma_{\mathrm{obs}}^2 = 0.0025$. I use this later to interpret whether the model's reconstruction is actually learning anything or just hitting the noise floor.

4.3 Model

I build a sequential world model with GRU recurrence. The prior and posterior are either $\mathcal{HG}$ distributions on $\mathbb{H}^{d}$ (the hyperbolic model) or Gaussians on $\mathbb{R}^{d}$ (the Euclidean baseline). Everything else is held equal: same dimensions, same decoder architecture, same parameter count. This makes the comparison fair: $\mathbb{H}^{d}$ vs $\mathbb{R}^{d}$.

I don't run any existing hyperbolic-VAE baselines (wrapped normal, Poincaré VAE). This is a real gap (see §7), so the main comparisons in this work are just $\mathcal{HG}$ vs Euclidean Gaussian.

The $\mathcal{HG}$ code path is the same order of complexity as the Euclidean baseline, but I haven't included a wall-clock benchmark to prove it. That's something I should add.

4.4 Metrics

I measure four things:

  • Reconstruction MSE on held-out test trajectories, interpreted relative to the $\sigma_{\mathrm{obs}}^2 = 0.0025$ floor above.
  • Spearman $\rho_\tau$: correlation between $\mu_\tau^q$ and ground-truth depth (hyperbolic only).
  • Linear probes: logistic regression from $\tau$ alone to depth, and from $b$ alone to branch index at depth 1, separating representation quality from reconstruction quality.
  • Gradient attenuation $\mathbb{E}[e^{\tau_t}]$: the amplification of $b$-gradients through the reparameterisation coupling.
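The first two metrics are cheap to compute offline. A sketch, assuming mu_tau and depths are 1-D numpy arrays collected over held-out nodes (the probe here scores on its own training split for brevity; a held-out split is the more careful choice):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

def depth_alignment(mu_tau, depths):
    """Spearman correlation between posterior mean tau and ground-truth depth."""
    rho, _ = spearmanr(mu_tau, depths)
    return rho

def depth_probe_accuracy(mu_tau, depths):
    """Linear probe: logistic regression from tau alone to the depth class."""
    x = np.asarray(mu_tau).reshape(-1, 1)
    probe = LogisticRegression(max_iter=1000).fit(x, depths)
    return probe.score(x, depths)
```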

4.5 Reporting convention

Except where I explicitly say otherwise, every table below shows a single training run per configuration. Computing limits meant I couldn't afford many seeds. The only multi-seed estimate is Experiment 3 (5 seeds), which gives a variance band of roughly $\pm 0.05$ in $\rho$. If you see a difference in $\rho$ smaller than 0.05 between two runs, it is probably within that noise band and shouldn't be trusted as a real effect.


5. Results

I’ve grouped the experiments into three blocks:

  • (A) the baseline claim that $\mathcal{HG}$ induces depth alignment
  • (B) four attempts to strengthen that signal, each one either trading off against MSE or collapsing the prior
  • (C) the two findings where hyperbolic geometry does provide something Euclidean does not: dimensional efficiency and out-of-distribution depth ordering

Group A - Baseline claim: does HG induce depth alignment?

Experiment 1: τ Spontaneously Encodes Depth

Question. Trained purely on the ELBO with no depth supervision, does the Busemann coordinate $\tau$ end up correlated with ground-truth depth?

Setup. Single seed, $d = 16$, standard $\mathcal{HG}$ prior/posterior.

Result. The hyperbolic model achieves $\rho_\tau \approx +0.306$ on a single run. The Euclidean baseline achieves $\rho_\mathrm{pc1} = 0.275$ at $d = 16$ (best principal component, the strongest possible PCA baseline). The difference is small and must be read together with Exp 3, which quantifies seed variance at $\pm 0.05$. $\tau$ does acquire a positive correlation with depth without any supervision; whether this constitutes a robust "advantage" over the Euclidean baseline is a question Exp 3 addresses directly.

The drift term penalises fibre mismatch less when $\tau$ is larger. So if deeper nodes usually have more fibre mismatch, the model is pushed toward larger $\tau$ for those nodes. This depth-to-fibre-mismatch link is only a hypothesis here; I did not measure it directly.


Experiment 2: Reconstruction vs Representation

Question. Do hyperbolic latent dimensions improve either reconstruction or representation quality?

Setup. Sweep $d \in \{2, 4, 8, 16, 32\}$, 15 000 training steps, KL warmup, single seed.

| $d$ | MSE hyp | MSE euc | $\mathbb{E}[e^\tau]$ | $\rho_\tau$ | Probe $\tau \to \ell$ | Probe $b \to$ branch |
|---|---|---|---|---|---|---|
| 2 | 0.405 | 0.405 | 1.04 | +0.321 | 0.68 | — |
| 4 | 0.405 | 0.405 | 1.64 | +0.266 | 0.68 | — |
| 8 | 0.405 | 0.405 | 1.07 | +0.285 | 0.68 | — |
| 16 | 0.406 | 0.405 | 1.04 | +0.306 | 0.68 | — |
| 32 | 0.405 | 0.405 | 0.73 | +0.277 | 0.68 | — |

(— : the $b \to$ branch probe did not converge; see below.)

Reconstruction (MSE). Both models stay at $\approx 0.405$ across every dimension, differing by $< 0.001$. With $\sigma_{\mathrm{obs}}^2 = 0.0025$, these are far above the noise floor, so what I'm comparing here is model reconstruction quality, not noise saturation.

Representation. $|\rho_\tau|$ ranges 0.266–0.321 across $d$, essentially flat within seed noise. The $\tau \to$ depth probe sits flat at 0.68 across all dimensions. This could mean one of two things:

  • (i) the depth signal in $\tau$ genuinely caps at the probe's capacity, or
  • (ii) the logistic-regression probe is saturated by the tree's depth class imbalance.

My current data doesn't distinguish between the two.

The $b \to$ branch probe never converges. Experiment 4 revises my interpretation of this: it's a regularisation artefact, not a structural property of the fibre coordinate.

Reconstruction MSE vs dimension. Euclidean (orange) stays flat; hyperbolic (blue) moves within a negligible range. Both remain far above the observation-noise floor.

Linear probe accuracy. Blue circles: depth from $\tau$ at 0.68, flat across $d$. The $b \to$ branch probe did not converge.


Experiment 3: Geometric Inductive Bias Across Seeds

Question. Is that modest hyperbolic advantage in Exp 1 larger than seed noise, or is it just a fluke?

Setup. Both models at $d = 16$, 2000 steps, 5 seeds each. Computing limits kept me from running more. I logged Spearman $\rho$ between $\mu_\tau^q$ and depth every 100 steps (hyperbolic), and best-PCA $|\rho|$ across all components for Euclidean (the strongest possible baseline). Shaded regions show ±1σ across seeds.

Mean $\rho$ at step 2000: hyperbolic $0.277 \pm 0.052$, Euclidean $0.271 \pm 0.054$. The two are indistinguishable in both mean and variance. The Euclidean model starts with high accidental alignment (best PC cherry-picked from random init), then settles into the same regime as hyperbolic. The hyperbolic trajectory starts lower but rises from step 0.

Within the noise from this experiment, hyperbolic and Euclidean reach the same depth-alignment magnitude at step 2000. What differs is the trajectory, not where they end up. So the honest headline for the depth-alignment claim is: Exp 1's single-seed $0.306$ vs $0.275$ sits right within the $\pm 0.05$ noise band.

What this does not show. I report absolute values of $\rho$, which hides the sign. A geometric inductive bias is only as strong as its sign consistency across seeds. Without checking the sign across all five seeds, I can't verify the "consistent push toward deeper = higher $\tau$" story directly, even though the mechanism supports it.

The drift term $\frac{\|\mu_b^q - \mu_b^p\|^2}{2(\sigma_b^p)^2} \cdot e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}$ penalises fibre mismatch in proportion to $e^{-2\mu_\tau^q}$, which creates a directional gradient pushing toward larger $\tau$ for nodes that diverge from the prior. But the signal-to-noise threshold for reconstructing tree-like structure (Kesten–Stigum, 1966) is not met at $B = 4$, so the maximum attainable $\rho$ stays capped regardless of geometry.


Group B - Attempts to strengthen the signal (all fail or trade off)

Below I try four different architectural levers to amplify the depth-encoding bias. Each one fails or introduces a matching cost elsewhere.

Experiment 4: Separate KL Weights β_τ / β_b

Question. In Experiments 1–3 I used a single $\beta$ to weight all KL terms equally. But the fibre KL is weighted by $(d-1)$ and dominates the height KL at large $d$. What if I decouple the weights ($\beta_\tau$ for the height term, $\beta_b$ for the fibre and drift terms) to improve the depth signal without hurting reconstruction?

Setup. Single seed, $d = 16$, sweep over $\beta_\tau$ and $\beta_b$.
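The decoupling itself is a one-line change if the three closed-form components are returned separately; a sketch under that assumption (names are mine):

```python
def weighted_kl(height_kl, fibre_kl, drift, beta_tau=1.0, beta_b=1.0):
    """Decoupled KL weighting for Exp 4: beta_tau scales the Busemann (height)
    term, beta_b scales the (d-1)-weighted fibre term and the drift term."""
    return beta_tau * height_kl + beta_b * (fibre_kl + drift)
```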

| $\beta_\tau$ | $\beta_b$ | MSE | $\rho_\tau$ | Probe $\tau \to \ell$ |
|---|---|---|---|---|
| 1.0 | 1.0 | 0.405 | +0.338 | 0.669 |
| 1.0 | 0.1 | 0.363 | +0.306 | 0.515 |
| 1.0 | 0.01 | 0.068 | +0.016 | 0.510 |
| 2.0 | 0.1 | 0.363 | +0.304 | 0.524 |

Reducing $\beta_b$ reveals a sharp trade-off. At $\beta_b = 0.1$, MSE drops 11% (0.405 → 0.363) with only marginal loss in $\rho$: a clear MSE vs depth trade, but still far above the noise floor. At $\beta_b = 0.01$, MSE collapses to 0.068 but $\rho_\tau$ collapses to 0.016: the model stops encoding depth and just learns a trivial reconstruction. Doubling $\beta_\tau$ to 2.0 with $\beta_b = 0.1$ doesn't recover the depth signal ($\rho = 0.304$).

What this reveals about Exp 2's $b \to$ branch null. The $b$-probe never converged in Exp 2. Here I see that the fibre KL wasn't suppressing $b$ in a way that encoded branch structure elsewhere: it was just acting as a regulariser that prevents MSE collapse. Remove it and MSE drops but the latent becomes unstructured. So I should revise my original reading of the branch-probe failure from "structural constraint" to "regularisation artefact".

What this does not show. The $\beta_b = 0.1$ MSE drop is tempting, but $\rho$ moves the wrong way. The trade-off is real; the best setting for depth alignment is just the unweighted default.


Experiment 5: Longer Training (50k steps)

Question. Does the depth alignment from Exp 1–3 strengthen with more training, stabilise, or erode?

Setup. $d = 16$, 50 000 steps (3× longer than Exp 2), single seed, direct comparison to the 15k numbers.

| Model | MSE (15k) | MSE (50k) | $\rho$ (15k) | $\rho$ (50k) |
|---|---|---|---|---|
| Hyperbolic | 0.406 | 0.404 | +0.306 | +0.200 |
| Euclidean | 0.405 | 0.404 | +0.275 | +0.276 |

MSE doesn't move for either model. The hyperbolic $\rho$ erodes from 0.306 to 0.200, falling below the Euclidean baseline at 0.276. This erosion is specific to $\mathcal{HG}$; it's not a generic over-training effect.

The drift term creates a depth-aligning gradient early in training, but the (Euclidean) MSE reconstruction objective doesn't reinforce it. As optimisation continues, $\mu_\tau^q$ drifts toward whatever value minimises reconstruction, which isn't necessarily depth-correlated. The Euclidean model, lacking any geometric inductive bias to lose, settles into a stable local minimum.

This same erosion mechanism (geometric structure in the prior being absorbed by reconstruction pressure) shows up again as an explicit parameter collapse in Exp 7.

What this does not show. Single seed; I don’t know the magnitude of erosion across different seeds.


Experiment 6: Geometry-Aware Decoder

Question. The decoder in all previous experiments receives $(\tau, b)$ as a flat vector and is not told that $\tau$ is a Busemann coordinate. If I give the decoder explicit geometric features, does that improve reconstruction or representation?

Setup. Augment the decoder input with $e^\tau$ (the metric scaling factor) and $\|b\|^2 e^{-2\tau}$ (the squared $b$-distance in the hyperbolic metric). Single seed, $d = 16$.

| Variant | MSE | $\rho_\tau$ | Probe $\tau \to \ell$ |
|---|---|---|---|
| Geometry-aware decoder | 0.405 | +0.270 | 0.632 |
| Standard (flat) decoder | 0.405 | +0.295 | 0.680 |
| Euclidean baseline | 0.405 | +0.287 | — |

The geometry-aware decoder does not help. It slightly worsens both $\rho_\tau$ (+0.270 vs +0.295) and the depth probe (0.632 vs 0.680), and it does so with no MSE improvement.

I think this is a classic representation-collapse dynamic. Exposing $e^\tau$ and $\|b\|^2 e^{-2\tau}$ directly gives the decoder a reconstruction shortcut that removes the encoder's incentive to encode depth cleanly in $\tau$. The pressure that created the structured representation is gone, and the MSE stays unchanged because both paths reach essentially the same reconstruction level.

Future work. Gating or annealing the shortcut features, for example warm-starting with flat input and slowly exposing the geometric features, might preserve the encoder pressure. I have not tested that.


Experiment 7: Action-Conditioned Prior

Question. Exp 5 showed depth alignment erodes under extended training. Can injecting action labels directly into the prior make depth alignment an explicit structural objective rather than a fragile KL side effect?

Setup. Augment the prior with an action-conditioned offset to $\mu_\tau^p$:

$$\mu_\tau^p(t) = f_\text{base}(h_t) + \delta(a_{t-1}), \quad \delta(+1) = +\text{softplus}(\alpha),\; \delta(-1) = -\text{softplus}(\beta),\; \delta(0) = 0$$

where $a_{t-1} \in \{-1, 0, +1\}$ encodes the direction of the last transition (deeper / shallower / same level) and $\alpha, \beta$ are learned scalars.
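A sketch of the offset module, assuming integer action codes in $\{-1, 0, +1\}$; class and attribute names are illustrative, and the zero initialisation of $\alpha, \beta$ is my choice rather than something specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionPriorOffset(nn.Module):
    """Learned action-conditioned offset added to the prior mean mu_tau^p."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(()))  # magnitude for a = +1 (deeper)
        self.beta = nn.Parameter(torch.zeros(()))   # magnitude for a = -1 (shallower)

    def forward(self, mu_tau_base, action):
        # action: tensor of {-1, 0, +1}, same shape as mu_tau_base
        delta = torch.where(action == 1, F.softplus(self.alpha),
                torch.where(action == -1, -F.softplus(self.beta),
                            torch.zeros_like(mu_tau_base)))
        return mu_tau_base + delta
```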

| Model | $\rho$ @ 5k | $\rho$ @ 15k | $\rho$ @ 30k | $\rho$ @ 50k |
|---|---|---|---|---|
| Standard HG | 0.080 | 0.281 | 0.271 | 0.273 |
| ActionPrior HG | 0.755 | 0.319 | 0.286 | 0.277 |

At step 5k the action prior reaches $\rho = 0.755$, which is an exceptional early alignment and far above the standard model's $\rho = 0.080$. That advantage disappears by step 15k, and both models converge to $\rho \approx 0.27$ by step 30k. The learned bias parameters collapse to near zero: $\text{softplus}(\alpha) = 0.0008$, $\text{softplus}(\beta) = 0.0008$. In practice, the model learns to ignore its own structural prior.

The action prior functions as a warm-start mechanism, not a stable structural constraint. Early in training the action signal provides a strong depth gradient that initialises $\tau$ well. As optimisation continues, the reconstruction loss dominates and the model adapts by zeroing out the prior bias, which is the same failure mode as Exp 5 but reached faster and more visibly.

What this does not show. I have not tested whether a non-learnable action offset, with frozen $\alpha$ and $\beta$, would preserve the early alignment.


Group C - Where hyperbolic geometry pays off

Experiment 8: Dimensional Efficiency - $\mathbb{H}^2$ vs $\mathbb{R}^{32}$

Question. Sarkar (2011) proves any finite tree embeds into $\mathbb{H}^2$ with distortion arbitrarily close to 1. In the learning setting, can a 2-dimensional hyperbolic latent capture depth as well as a 32-dimensional Euclidean latent?

Setup. Train each model at $d \in \{2, 4, 8, 16, 32, 64\}$ (where defined), 15k steps, single seed per cell.

| Model | $d=2$ | $d=4$ | $d=8$ | $d=16$ | $d=32$ | $d=64$ |
|---|---|---|---|---|---|---|
| $\mathbb{H}^d$ $\lvert\rho_\tau\rvert$ | 0.264 | 0.266 | 0.266 | 0.275 | 0.272 | — |
| $\mathbb{R}^d$ $\lvert\rho_\text{pc1}\rvert$ | — | 0.264 | 0.262 | 0.265 | 0.277 | 0.071 |

Three things are true here, and only one of them is what Sarkar predicts:

  1. $\mathbb{H}^2$ matches $\mathbb{R}^{32}$ within noise (0.264 vs 0.277). Consistent with the capacity advantage.
  2. Neither model benefits from dimension. Hyperbolic $\lvert\rho_\tau\rvert$ is flat at 0.26–0.28 across $d = 2\ldots 32$; Euclidean $\lvert\rho\rvert$ is similarly flat at 0.26–0.28 across $d = 4\ldots 32$. Sarkar predicts the hyperbolic model should beat Euclidean at matched low $d$; that is not observed.
  3. $\mathbb{R}^{64}$ collapses to 0.071. This is almost certainly an optimisation artefact (best-PC drift at high $d$ with a fixed training budget), not a capacity failure.

The honest reading is therefore: H² is not worse than R³² on this metric, and neither model is capacity-limited in the tested range. This is consistent with Sarkar's theorem carrying over to the learning setting but does not cleanly demonstrate the predicted advantage, because both models are in a regime where added dimensions are not helping.

The flatness in $\lvert\rho_\tau\rvert$ is a separate, interpretable finding: all depth information in $\mathcal{HG}$ is carried by the scalar $\tau$ by construction; the fibre coordinates $b \in \mathbb{R}^{d-1}$ are depth-uninformative by design. Adding fibre dimensions cannot change $\rho_\tau$.

What this does not show. A clean dimensional-efficiency advantage would require a setting where Euclidean $\rho$ grows with $d$, so that the capacity curve is visible, while hyperbolic saturates at low $d$. At the observation-noise and training-budget levels here, no such curve appears.


Experiment 9: OOD Depth Generalisation

Question. Does the depth ordering induced by $\mathcal{HG}$ extend to depths the model never saw in training?

Setup. Train on a deeper tree ($B = 4$, $L = 7$) with walks restricted to depth $\leq 5$. Evaluate $\rho_\tau$ separately on in-distribution nodes (depth 0–5) and OOD nodes (depth 6–7). Compare the standard $\mathcal{HG}$ to the ActionPrior variant from Exp 7.

Caveat, stated up front. The OOD depth range covers 2 levels (6, 7) versus 6 levels (0–5) in-distribution. A scalar correlation over a 2-level range is simpler than over a 6-level range and tends to give larger magnitudes. The sign and consistency of the OOD ranking are the load-bearing claims; the magnitude comparison to in-distribution $\rho$ is not.

| Model | $\rho$ in-dist (depth 0–5) | $\rho$ OOD (depth 6–7) |
|---|---|---|
| Standard HG | +0.334 | +0.536 |
| ActionPrior HG | +0.108 | −0.511 |

The standard $\mathcal{HG}$ model achieves $\rho_\text{ood} = +0.536$. The sign is correct, the magnitude is strong, and the model has never seen depth 6 or 7 in training. The induced ordering, deeper node to higher $\tau$, extends monotonically beyond the training range. This is the hardest result to explain without the hyperbolic geometry: the drift term's penalty structure creates a ranking that comes from the KL objective rather than from memorised depth values.

The ActionPrior variant inverts the OOD ordering ($\rho_\text{ood} = -0.511$). This is not a small degradation: the model actively predicts that deeper nodes are shallower. Combined with Exp 7's bias-parameter collapse ($\text{softplus}(\alpha) = 0.0008$), this suggests the ActionPrior creates a subtly distorted geometry that does not extrapolate. Structural priors that learn to zero out their own parameters can break the geometry they were meant to reinforce.

What this does not show. This is still single-seed. I have not tested whether the OOD result holds across seeds, and given Exp 3's seed variance on in-distribution $\rho$, a multi-seed replication of Exp 9 is the single most important follow-up. I also have not tested whether OOD generalisation degrades with extended training in the same way as in-distribution $\rho$ did in Exp 5.


6. Discussion

The strongest result in this work is Experiment 9: the standard $\mathcal{HG}$ model induces a depth ordering that extends monotonically to depths it never saw during training ($\rho_\text{ood} = +0.536$), while the ActionPrior variant, whose learned bias collapses to $\approx 0.0008$, inverts that ordering ($\rho_\text{ood} = -0.511$). The positive case is the clearest sign that the geometry is acting like a structural property of the objective rather than a memorised fit to the training set. I still need to keep the caveat from Exp 9 in mind: the OOD $\rho$ is computed over a 2-level depth range, so it should not be read as absolute evidence that the model is "generally better" at extrapolation.

The rest of the picture is more mixed. I summarise the evidence as honestly as I can:

| Claim | Evidence for | What's not yet shown |
|---|---|---|
| $\mathcal{HG}$ has closed-form KL at $O(d)$ | Derivation in §3.2 and A.2; Monte Carlo validation, relative error 0.005% | Wall-clock comparison to Euclidean Gaussian was not measured; comparison to wrapped normal was not run |
| $\tau$ aligns with depth without supervision | Consistent positive $\rho_\tau$ in Exp 1, 2, 8 | Sign consistency across seeds was not measured; mechanism (depth ↔ fibre mismatch) is hypothesised |
| Hyperbolic $\rho$ > Euclidean $\rho$ at $d=16$ | Exp 1 (single seed): 0.306 vs 0.275 | Exp 3 (5 seeds): 0.277 ± 0.052 vs 0.271 ± 0.054, indistinguishable; single-seed gap is within seed noise |
| Dimensional efficiency (Sarkar, learned) | Exp 8: H² (0.264) ≈ R³² (0.277) | Both models flat across $d$; expected capacity curve not visible |
| Depth generalisation OOD | Exp 9: $\rho_\text{ood} = +0.536$ (standard $\mathcal{HG}$) | Single seed; OOD depth range only 2 levels; degradation under long training untested |
| Alignment is durable | — | Exp 5: HG-specific erosion 0.306 → 0.200 at 50k; Exp 7: prior collapses to 0.0008 |

What works. The $\mathcal{HG}$ construction itself, namely closed-form KL at $O(d)$, exact reparameterisation, no Möbius singularities, and a controlled approximation error, is the load-bearing contribution. The OOD result in Exp 9 is the sharpest empirical sign that the geometry is doing something the Euclidean baseline cannot do.

What doesn't work. Every architectural lever I tried to strengthen the in-distribution depth signal either created a trade-off (Exp 4: MSE versus $\rho$), introduced a shortcut (Exp 6: representation collapse), or got absorbed by the reconstruction loss (Exp 5, Exp 7). My reading of that pattern is simple: Euclidean reconstruction pressure erodes hyperbolic structure because good reconstruction does not require good geometric alignment.

A speculative path forward, not evidence. Replacing Euclidean MSE with a geodesic reconstruction loss on $\mathbb{H}^d$ would couple reconstruction directly to the hyperbolic metric and might make geometric alignment a prerequisite for low loss rather than a side effect. That is a hypothesis, not a result.


7. Assumptions and Limitations

Taken in rough order of how much each point matters:

Single-seed main tables. Experiments 1, 2, 4, 5, 6, 7, 8, 9 report single training runs. Exp 3 (5 seeds, $\pm 0.052$) is the only variance estimate, and it shows that the single-seed $\rho$ differences of $< 0.05$ elsewhere are within seed noise. The OOD result ($\rho_\text{ood} = +0.536$) especially needs multi-seed replication before I would cite it confidently.

MSE floor $\approx \sigma_\mathrm{obs}^2$. The observation noise $\sigma_\mathrm{obs}$ in the tree MDP is 0.05, so the per-dimension floor is $\sigma_\mathrm{obs}^2 = 0.0025$. The recurring value 0.405 across Exps 2, 5, 6 is far above this floor, so cross-model MSE comparisons are informative here.

Unverified mechanism. The drift-term explanation for why $\tau$ aligns with depth requires that deeper nodes have higher fibre mismatch on average. That is plausible for typical random-walk trajectories on the tree, but I did not measure it directly.

No hyperbolic-VAE baseline. The baseline comparison is $\mathcal{HG}$ versus Euclidean Gaussian only. Existing hyperbolic alternatives, such as wrapped normals (Nagano et al., 2019) and Poincaré VAEs (Mathieu et al., 2019), were not run. Claims of "computational parity" are against Euclidean only, and the representation-quality results do not show that $\mathcal{HG}$ beats other hyperbolic distributions.

$\mathcal{HG}$ approximation error. $\mathrm{KL}[\mathcal{N}_\mathbb{H}\|\mathcal{HG}] = \frac{d^2-1}{8}\sigma^2 + O(\sigma^3)$. For $d \geq 8$ and typical posterior widths this is not small, so the optimisation target is not geodesically correct.

Sign of $\rho_\tau$. Exp 3 reports $|\rho|$, not $\rho$. Cross-seed sign consistency, which is crucial for the "geometric inductive bias" claim, is inferred from the drift-term mechanism but not measured.

GRU architecture. The GRU treats $(\tau_{t-1}, b_{t-1})$ as a flat Euclidean vector, with no inductive bias toward using $\tau$ for depth or $b$ for branch. An architecture with attention over hyperbolic representations, or a genuine hyperbolic recurrence, might produce qualitatively different results.

Synthetic, fully balanced tree. The tree MDP is maximally hierarchical. Real environments have weaker, irregular, and potentially non-tree-like structure. Whether $\mathcal{HG}$ helps there is untested.

Compute parity is claimed, not measured.


8. Code and Reproducibility

→ View source code on GitHub

All experiments are implemented in PyTorch. The hrssm library contains the $\mathcal{HG}$ distribution with closed-form KL (validated against Monte Carlo, relative error 0.005%), the $B$-ary tree MDP with OOD depth support, and all world model variants. Experiments 1–6 run in h_rssm_gpu.ipynb; Experiments 7–9 run in experiments_v2.ipynb; both are Kaggle-compatible.

The three formulas needed to reproduce the key results:

  • KL: $\mathrm{KL}[q\|p] = \text{height KL} + (d-1) \cdot \text{fibre KL} + \text{drift term}$, as given above.
  • Reparameterisation: $\tau = \mu_\tau^q + \sigma_\tau^q \varepsilon_1$, $b = \mu_b^q + e^\tau \sigma_b^q \varepsilon_2$ (the coupling must be preserved).
  • Gradient attenuation diagnostic: $\mathbb{E}[e^{\tau_t}] = e^{\mu_\tau^q + (\sigma_\tau^q)^2/2}$.

9. References


Appendix: Derivations

A.1 Normalising Constant of the Horospherical Gaussian

The unnormalised density w.r.t. the Riemannian volume form $d\mathrm{vol} = e^{-(d-1)\tau}\,d\tau\,db$ is:

$$\widetilde{p}(\tau, b) = \exp\!\left(-\frac{(\tau-\mu_\tau)^2}{2\sigma_\tau^2} - \frac{\|b-\mu_b\|^2}{2e^{2\tau}\sigma_b^2}\right)$$

Step 1 (integrate over $b$ at fixed $\tau$).

$$\int_{\mathbb{R}^{d-1}} \exp\!\left(-\frac{\|b-\mu_b\|^2}{2e^{2\tau}\sigma_b^2}\right) db$$

This is a $(d-1)$-dimensional Gaussian integral with variance $e^{2\tau}\sigma_b^2$ per dimension:

$$= \left(2\pi e^{2\tau}\sigma_b^2\right)^{(d-1)/2} = (2\pi\sigma_b^2)^{(d-1)/2}\cdot e^{(d-1)\tau}$$

Step 2 (the key cancellation). The volume form contributes $e^{-(d-1)\tau}$. Multiplying:

$$e^{(d-1)\tau} \cdot e^{-(d-1)\tau} = 1$$

The $\tau$-dependence of the $b$-integral cancels exactly against the volume form. The remaining integral is purely over $\tau$:

$$Z = (2\pi\sigma_b^2)^{(d-1)/2} \int_{-\infty}^\infty \exp\!\left(-\frac{(\tau-\mu_\tau)^2}{2\sigma_\tau^2}\right)d\tau = (2\pi\sigma_b^2)^{(d-1)/2} \cdot \sqrt{2\pi}\,\sigma_\tau$$

This is independent of $\mu_\tau$ and $\mu_b$, which is exactly what I need for closed-form KL divergences.
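As a quick numerical sanity check on this derivation (not part of the repo), a brute-force quadrature of the unnormalised density times the volume form for $d = 2$ should recover the closed-form $Z$. Parameter values are illustrative.

```python
import math
from scipy import integrate

mu_tau, mu_b, s_tau, s_b, d = 0.0, 0.3, 0.3, 0.5, 2

def integrand(b, tau):
    # unnormalised HG density times the volume form e^{-(d-1) tau}
    dens = math.exp(-(tau - mu_tau)**2 / (2 * s_tau**2)
                    - (b - mu_b)**2 / (2 * math.exp(2 * tau) * s_b**2))
    return dens * math.exp(-(d - 1) * tau)

Z_numeric, _ = integrate.dblquad(integrand, -3, 3, lambda t: -12, lambda t: 12)
Z_closed = math.sqrt(2 * math.pi) * s_tau * (2 * math.pi * s_b**2) ** ((d - 1) / 2)
print(Z_numeric, Z_closed)   # should agree to quadrature tolerance
```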


A.2 Closed-Form KL Divergence

Let $q = \mathcal{HG}(\mu_\tau^q, \mu_b^q, \sigma_\tau^q, \sigma_b^q)$ and $p = \mathcal{HG}(\mu_\tau^p, \mu_b^p, \sigma_\tau^p, \sigma_b^p)$.

The log-density w.r.t. $d\mathrm{vol}$ is:

$$\log q(\tau,b) = -\frac{(\tau-\mu_\tau^q)^2}{2(\sigma_\tau^q)^2} - \frac{\|b-\mu_b^q\|^2}{2e^{2\tau}(\sigma_b^q)^2} - \log Z_q$$

The KL divergence is $\mathrm{KL}[q\|p] = \mathbb{E}_q[\log q(\tau,b) - \log p(\tau,b)]$.

Taking the difference of log-densities:

$$\log q - \log p = \underbrace{\frac{(\tau-\mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{(\tau-\mu_\tau^q)^2}{2(\sigma_\tau^q)^2}}_{\text{height terms}} + \underbrace{\frac{\|b-\mu_b^p\|^2}{2e^{2\tau}(\sigma_b^p)^2} - \frac{\|b-\mu_b^q\|^2}{2e^{2\tau}(\sigma_b^q)^2}}_{\text{fibre terms}} + \log\frac{Z_p}{Z_q}$$

Height KL. Taking $\mathbb{E}_q$ over $\tau$ (which marginally is $\mathcal{N}(\mu_\tau^q, (\sigma_\tau^q)^2)$):

$$\mathbb{E}_q\!\left[\frac{(\tau-\mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{(\tau-\mu_\tau^q)^2}{2(\sigma_\tau^q)^2}\right] = \frac{(\sigma_\tau^q)^2 + (\mu_\tau^q-\mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{1}{2}$$

Combined with the $\tau$ part of $\log Z_p/Z_q$ (which is $\log\sigma_\tau^p/\sigma_\tau^q$):

$$\mathrm{height\,KL} = \log\frac{\sigma_\tau^p}{\sigma_\tau^q} + \frac{(\sigma_\tau^q)^2 + (\mu_\tau^q - \mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{1}{2}$$

Fibre terms. Expand $\|b - \mu_b^p\|^2 = \|b - \mu_b^q\|^2 + 2(b-\mu_b^q)^\top(\mu_b^q - \mu_b^p) + \|\mu_b^q - \mu_b^p\|^2$. Taking $\mathbb{E}_q$ conditioned on $\tau$, using $\mathbb{E}[b \mid \tau] = \mu_b^q$ and $\mathbb{E}[\|b-\mu_b^q\|^2 \mid \tau] = (d-1)e^{2\tau}(\sigma_b^q)^2$:

$$\mathbb{E}_q\!\left[\frac{\|b-\mu_b^p\|^2}{2e^{2\tau}(\sigma_b^p)^2} \,\middle|\, \tau\right] = \frac{(d-1)e^{2\tau}(\sigma_b^q)^2 + \|\mu_b^q-\mu_b^p\|^2}{2e^{2\tau}(\sigma_b^p)^2} = \frac{(d-1)(\sigma_b^q)^2}{2(\sigma_b^p)^2} + \frac{\|\mu_b^q-\mu_b^p\|^2}{2e^{2\tau}(\sigma_b^p)^2}$$

Second cancellation and the drift term. Taking $\mathbb{E}_q$ over $\tau$ for the second part:

$$\mathbb{E}_q\!\left[\frac{\|\mu_b^q-\mu_b^p\|^2}{2e^{2\tau}(\sigma_b^p)^2}\right] = \frac{\|\mu_b^q-\mu_b^p\|^2}{2(\sigma_b^p)^2}\,\mathbb{E}_q[e^{-2\tau}]$$

Since $\tau \sim \mathcal{N}(\mu_\tau^q, (\sigma_\tau^q)^2)$, the moment generating function at $s = -2$ gives:

$$\mathbb{E}[e^{-2\tau}] = e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}$$

This is the drift term. Combined with the $\log Z_p/Z_q$ contribution for $b$ (which is $(d-1)\log\sigma_b^p/\sigma_b^q$) and $\mathbb{E}_q\left[-\|b-\mu_b^q\|^2 / 2e^{2\tau}(\sigma_b^q)^2\right] = -(d-1)/2$:

$$\mathrm{fibre\,KL} = (d-1)\!\left(\log\frac{\sigma_b^p}{\sigma_b^q} + \frac{(\sigma_b^q)^2}{2(\sigma_b^p)^2} - \frac{1}{2}\right)$$

Assembling all terms:

$$\boxed{\mathrm{KL}[q\|p] = \underbrace{\log\frac{\sigma_\tau^p}{\sigma_\tau^q} + \frac{(\sigma_\tau^q)^2 + (\mu_\tau^q - \mu_\tau^p)^2}{2(\sigma_\tau^p)^2} - \frac{1}{2}}_{\text{height KL}} + (d-1)\underbrace{\left(\log\frac{\sigma_b^p}{\sigma_b^q} + \frac{(\sigma_b^q)^2}{2(\sigma_b^p)^2} - \frac{1}{2}\right)}_{\text{fibre KL}} + \underbrace{\frac{\|\mu_b^q - \mu_b^p\|^2}{2(\sigma_b^p)^2}\,e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}}_{\text{drift}}}$$

All three terms are $O(d)$ to compute.


A.3 Reparameterisation and Gradient Flow

Sampling. Draw $\varepsilon_1 \sim \mathcal{N}(0,1)$ and $\varepsilon_2 \sim \mathcal{N}(0, I_{d-1})$:

$$\tau = \mu_\tau^q + \sigma_\tau^q\,\varepsilon_1, \qquad b = \mu_b^q + e^\tau\,\sigma_b^q\,\varepsilon_2$$

Correctness. The marginal of $\tau$ is $\mathcal{N}(\mu_\tau^q, (\sigma_\tau^q)^2)$ by construction. The conditional $b \mid \tau$ is:

$$b \mid \tau \sim \mathcal{N}(\mu_b^q,\; e^{2\tau}(\sigma_b^q)^2 I_{d-1})$$

which matches $\mathrm{Var}(b \mid \tau) = e^{2\tau}(\sigma_b^q)^2$ from the $\mathcal{HG}$ definition. The coupling $b = f(\mu_\tau^q, \sigma_\tau^q, \mu_b^q, \sigma_b^q, \varepsilon_1, \varepsilon_2)$ is fully differentiable.

Gradient attenuation. The gradient of the decoder loss w.r.t. $\mu_b^q$ passes through $\partial b/\partial \mu_b^q = I$, but w.r.t. $\mu_\tau^q$ it passes through $\partial b/\partial \mu_\tau^q = e^\tau \sigma_b^q \varepsilon_2$. The expected magnitude of this gradient amplification factor is:

$$\mathbb{E}[e^\tau] = \mathbb{E}[e^{\mu_\tau^q + \sigma_\tau^q \varepsilon_1}] = e^{\mu_\tau^q + (\sigma_\tau^q)^2/2}$$

by the MGF of a Gaussian at $s = +1$. When $\mathbb{E}[e^\tau] \gg 1$, gradients from the fibre reconstruction are amplified into $\mu_\tau^q$ and can destabilise $\tau$ training; this is the gradient attenuation diagnostic reported in Experiment 2.
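A tiny sanity check of the diagnostic (illustrative parameter values), comparing the analytic MGF value against a sample estimate:

```python
import torch

mu_tau, sigma_tau = torch.tensor(0.4), torch.tensor(0.3)
analytic = torch.exp(mu_tau + sigma_tau**2 / 2)                   # E[e^tau] in closed form
empirical = torch.exp(mu_tau + sigma_tau * torch.randn(1_000_000)).mean()
print(analytic.item(), empirical.item())                          # agree up to sampling error
```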


A.4 KL Approximation Error

The $\mathcal{HG}$ is not the Riemannian Normal $\mathcal{N}_\mathbb{H}(\mu, \sigma^2)$. To quantify the discrepancy, consider the isotropic case $\mu_\tau = 0$, $\mu_b = 0$, $\sigma_\tau = \sigma_b = \sigma$, and compare with $\mathcal{N}_\mathbb{H}(\mathbf{0}, \sigma^2)$, whose density w.r.t. $d\mathrm{vol}$ is proportional to $e^{-d_\mathbb{H}^2(z,\mathbf{0})/2\sigma^2}$.

In the upper half-space, the geodesic distance from the origin $(\tau=0, b=0)$ satisfies:

$$d_\mathbb{H}^2(z, \mathbf{0}) = \tau^2 + e^{-2\tau}\|b\|^2 + O(\|b\|^4 e^{-4\tau})$$

The $\mathcal{HG}$ density exponent is $-(\tau^2/2\sigma^2 + e^{-2\tau}\|b\|^2/2\sigma^2)$, so it agrees with $\mathcal{N}_\mathbb{H}$ to leading order. The discrepancy arises from higher-order terms in the geodesic distance expansion. A Taylor expansion gives:

$$\mathrm{KL}\!\left[\mathcal{N}_\mathbb{H}(\mathbf{0},\sigma^2) \,\big\|\, \mathcal{HG}(\mathbf{0},\mathbf{0},\sigma,\sigma)\right] = \frac{d^2-1}{8}\,\sigma^2 + O(\sigma^3)$$

The leading coefficient $\frac{d^2-1}{8}$ grows quadratically with $d$. At $d = 16$, $\sigma = 0.3$: error $\approx \frac{255}{8} \times 0.09 \approx 2.87$. I do not think this is negligible for large $d$ or large posterior widths, and I treat it as the main theoretical limitation of $\mathcal{HG}$ as an approximation to the Riemannian Normal.


A.5 MGF Identity and Drift Term Stability

The drift term involves $e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2}$, which can grow without bound as $\sigma_\tau^q \to \infty$. In practice, a clamp at $\log \mathrm{MGF} \leq 10$ is applied before exponentiation to prevent KL overflow during early training when posteriors are wide.

The full drift is bounded by:

$$\mathrm{drift} \leq \frac{\|\mu_b^q - \mu_b^p\|^2}{2(\sigma_b^p)^2} \cdot e^{10} \approx \frac{\|\mu_b^q - \mu_b^p\|^2}{2(\sigma_b^p)^2} \times 22026$$

After warmup, posteriors narrow and $\sigma_\tau^q \to 0$, so $e^{-2\mu_\tau^q + 2(\sigma_\tau^q)^2} \to e^{-2\mu_\tau^q}$ and the drift naturally weights fibre mismatch by $e^{-2\mu_\tau^q}$, which penalises shallow nodes more than deep ones. This is the mechanism I rely on when I say that high fibre mismatch pushes $\mu_\tau^q$ upward and produces positive $\rho_\tau$.
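A sketch of the clamped drift term, with the clamp applied in log-space exactly as described above; names are illustrative.

```python
import torch

def drift_term(mu_b_q, mu_b_p, sig_b_p, mu_tau_q, sig_tau_q, log_mgf_max=10.0):
    """Drift term with the A.5 overflow clamp: the log of the MGF factor is
    clamped before exponentiation so wide early-training posteriors cannot
    blow up the KL."""
    log_mgf = torch.clamp(-2 * mu_tau_q + 2 * sig_tau_q**2, max=log_mgf_max)
    return ((mu_b_q - mu_b_p)**2).sum(-1) / (2 * sig_b_p**2) * torch.exp(log_mgf)
```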


A.6 Coordinate Change from $(x,y)$ to $(\tau,b)$

Step 1 (starting model). Start from the standard upper half-space model

$$\mathbb{H}^d = \{(x,y) : x \in \mathbb{R}^{d-1},\; y>0\},$$

with metric and volume form

$$g = \frac{\|dx\|^2 + dy^2}{y^2}, \qquad d\mathrm{vol} = y^{-d}\,dx\,dy.$$

Step 2 (change of variables). Define

$$b := x, \qquad \tau := \log y \quad \Longleftrightarrow \quad y = e^{\tau}.$$

Because $\log : (0,\infty) \to \mathbb{R}$ is a smooth bijection with smooth inverse $\exp$, this is a global diffeomorphism in the vertical coordinate, so $\tau \in \mathbb{R}$ (it can range from $-\infty$ to $+\infty$).

Step 3 (differentials).

$$dy = e^{\tau} d\tau, \qquad \|dx\|^2 = \|db\|^2, \qquad y^2 = e^{2\tau}.$$

Step 4 (metric substitution). Substitute into $g$:

$$g = \frac{\|db\|^2 + e^{2\tau}d\tau^2}{e^{2\tau}} = d\tau^2 + e^{-2\tau}\|db\|^2.$$

Step 5 (volume-form substitution). Use the Jacobian determinant of $(\tau,b) \mapsto (y,x) = (e^{\tau}, b)$:

$$\left|\det\frac{\partial(y,x)}{\partial(\tau,b)}\right| = e^{\tau},$$

hence

$$dx\,dy = db\,(e^{\tau}d\tau),$$

and therefore

$$d\mathrm{vol} = y^{-d}\,dx\,dy = e^{-d\tau}\,db\,(e^{\tau}d\tau) = e^{-(d-1)\tau}\,d\tau\,db.$$

So the $(\tau,b)$ parametrisation is not a new model. It is the standard upper half-space model written in log-height, or horospherical, coordinates.