Can the geometry of thought reveal how alignment works?
I develop differential-geometric frameworks for understanding how large language models encode, transform, and ultimately suppress beliefs, with a focus on mechanistic interpretability and AI alignment. The Torsional Belief Vector Field (TBVF) models transformer hidden states as discrete curves on a Riemannian belief manifold equipped with a Cartan torsion connection. This reveals, for the first time, where and how DPO/RLHF alignment geometrically reshapes model internals, creating what I call "brake layers": localized, geometrically distinct suppression mechanisms.
I am actively seeking fully funded PhD positions in mechanistic interpretability, geometric deep learning, and AI alignment at world-class research universities.
The Torsional Belief Vector Field treats each transformer layer's hidden state as a point on a high-dimensional Riemannian manifold equipped with the Fisher-Rao metric. The torsion tensor, the antisymmetric component of the cross-layer covariance, measures the rotational mismatch between consecutive belief updates.
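As a minimal sketch of how such a quantity might be estimated, the snippet below takes a stack of per-layer hidden states, forms the residual-stream updates between consecutive layers, and computes the antisymmetric part of their sample cross-covariance; its Frobenius norm gives a scalar torsion magnitude per layer. The function name `torsion_profile`, the token-wise centering, and the Frobenius-norm summary are illustrative assumptions, not the TBVF implementation itself.

```python
import numpy as np

def torsion_profile(hidden_states: np.ndarray) -> np.ndarray:
    """Per-layer torsion magnitudes from stacked hidden states.

    hidden_states: (num_layers, num_tokens, d_model), e.g. the
    per-layer activations a transformer returns when asked for
    all hidden states.
    """
    # Belief updates: residual-stream deltas between consecutive layers.
    updates = np.diff(hidden_states, axis=0)

    magnitudes = []
    for l in range(updates.shape[0] - 1):
        # Center each update across tokens before taking covariance.
        u = updates[l] - updates[l].mean(axis=0, keepdims=True)
        v = updates[l + 1] - updates[l + 1].mean(axis=0, keepdims=True)
        # Sample cross-covariance between consecutive belief updates.
        cross_cov = u.T @ v / (u.shape[0] - 1)
        # Torsion: the antisymmetric component of that covariance.
        torsion = 0.5 * (cross_cov - cross_cov.T)
        # Frobenius norm as a scalar rotational-mismatch score.
        magnitudes.append(np.linalg.norm(torsion))
    return np.array(magnitudes)
```

Under the same assumptions, one way to surface candidate brake layers would be to compare this profile between a base checkpoint and its DPO/RLHF-aligned counterpart; the three-layer cutoff below is an arbitrary illustration, not a claimed detection criterion.

```python
# Hypothetical usage: layers whose torsion jumps most after alignment
# are candidate "brake layers".
delta = torsion_profile(aligned_states) - torsion_profile(base_states)
brake_candidates = np.argsort(delta)[-3:]
```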

