According to a new paper on arxiv.org, researchers have developed a technique that needs only five lines of PyTorch code to reveal what a large language model learned during training, including potentially problematic content. It works by analyzing the model’s output-layer weight matrix, without running any inference.
The method applies singular value decomposition (SVD) to the lm_head weight matrix of transformer-based LLMs, the final projection that maps hidden states to vocabulary logits. According to the research, “each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction,” which the researchers use to infer the composition of the model’s training data.
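To make the idea concrete, here is a minimal sketch of this kind of analysis. It is not the paper's own five lines, which are not reproduced here. The function name top_tokens_per_direction, its parameters, and the toy matrix are illustrative assumptions. The sketch also assumes the weight has shape (vocab_size, hidden_dim), which is how torch.nn.Linear stores an output projection.

```python
import torch

def top_tokens_per_direction(lm_head_weight, num_directions=3, top_n=5):
    """For each leading singular direction, return its singular value and the
    token ids with the largest-magnitude entries in the left singular vector.

    Hypothetical helper for illustration; not the paper's code.
    """
    # Assumed shape: (vocab_size, hidden_dim). Cast to float32 for the SVD.
    W = lm_head_weight.detach().float()
    # Reduced SVD: U is (vocab_size, r), S is (r,), Vh is (r, hidden_dim).
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    results = []
    for i in range(num_directions):
        # Singular vectors are defined only up to sign, so rank by magnitude.
        token_ids = torch.topk(U[:, i].abs(), k=top_n).indices.tolist()
        results.append((S[i].item(), token_ids))
    return results

# Toy stand-in for a real lm_head: 50 "tokens", 8 hidden dimensions.
torch.manual_seed(0)
toy_weight = 0.01 * torch.randn(50, 8)

# Plant one strong hidden-state direction that favours tokens 3, 7 and 11.
direction = torch.zeros(8)
direction[0] = 1.0
for tok in (3, 7, 11):
    toy_weight[tok] += 5.0 * direction

report = top_tokens_per_direction(toy_weight, num_directions=2, top_n=3)
for sigma, ids in report:
    print(f"sigma={sigma:.3f} tokens={ids}")
```

In the toy matrix, the leading direction recovers the three planted tokens. With a Hugging Face Transformers model, the weight would typically come from model.lm_head.weight, and tokenizer.convert_ids_to_tokens would turn the ids into readable tokens. Whether the paper ranks tokens by magnitude or by signed value is not stated in this summary, so the absolute value here is an assumption.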
Analyzing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, the researchers found systematic differences between the models. GPT-OSS-120B exhibited “a graduated hierarchy of functionally differentiated subspaces,” while Gemma-2-2B was “dominated by pre-nineteenth-century English orthography.” For Qwen2.5-1.5B, the paper reports “subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication.”
Critically, a comparison of base and instruction-tuned models found that “ethically concerning subspaces originate in pretraining and are not removed by post-training alignment.”
The technique also detected glitch tokens. Applying the authors’ Weighted Projection Score (WPS) to GPT-OSS-120B recovered “shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference.”
The researchers propose that lm_head SVD analysis “be adopted as a standard pre-release safety auditing step.”