CrystalASR: Hierarchical Phoneme-Grounded Speech Decoding
- Po-Ting Lin (Independent Researcher)
We investigate a design principle for automatic speech recognition (ASR) in which linguistic structure is explicitly enforced through intermediate representations. The resulting system, CrystalASR, decomposes decoding into three modular layers: a 3.3M-parameter phoneme CTC head, a zero-parameter rule-based word decoder, and an optional language model (LM) for disambiguation. A defining constraint is strict upward information flow: higher layers modulate, but never override, lower-level acoustic evidence. On LibriSpeech dev-clean, CrystalASR achieves 17.44% WER with 21× fewer trainable parameters and 14× faster inference than a comparable end-to-end baseline. Error attribution shows that word-level errors originate primarily from subtle phoneme inaccuracies amplified by downstream segmentation. A sweep over the LM weight reveals a sharp phase transition: beyond a narrow tiebreaker regime (w_LM > 0.03), WER rises from 17% to 96% as the LM's score scale overwhelms the acoustic evidence. These findings suggest that explicitly decoupling acoustic and lexical processing yields interpretable error diagnostics and substantial parameter savings, at a moderate accuracy cost relative to end-to-end models.
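The tiebreaker regime described above can be illustrated with a minimal sketch. This is not the authors' code; the function name, candidate tuples, and score values are hypothetical, assuming only the standard linear interpolation of acoustic and LM log-scores with weight w_LM.

```python
# Illustrative sketch (hypothetical names/values): rescoring candidate word
# sequences with combined score = acoustic + w_lm * lm. With a small w_lm the
# LM only breaks near-ties in acoustic score; with a large w_lm the LM's
# score scale overwhelms the acoustic evidence.

def rescore(candidates, w_lm):
    """Return the candidate maximizing acoustic_score + w_lm * lm_score.

    candidates: list of (hypothesis, acoustic_log_score, lm_log_score).
    """
    return max(candidates, key=lambda c: c[1] + w_lm * c[2])

# Candidate "a" has much stronger acoustic evidence; the LM prefers "b".
cands = [("a", -10.0, -30.0),
         ("b", -15.0, -2.0)]

print(rescore(cands, w_lm=0.02)[0])  # tiebreaker regime: acoustics win -> "a"
print(rescore(cands, w_lm=1.0)[0])   # LM-dominated regime: LM wins    -> "b"

# With a genuine acoustic near-tie, even a tiny w_lm lets the LM decide:
tie = [("x", -10.00, -30.0), ("y", -10.01, -5.0)]
print(rescore(tie, w_lm=0.02)[0])    # -> "y"
```

At w_lm = 0.02, candidate "a" scores -10.6 versus -15.04 for "b", so acoustic evidence dominates; at w_lm = 1.0 the ordering flips (-40.0 versus -17.0), mirroring the phase transition reported in the abstract.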
- automatic speech recognition
- phoneme CTC
- modular decoding
- rule-based word decoder
- language model weighting
- interpretability
@misc{lin2026crystalasr,
title = {CrystalASR: Hierarchical Phoneme-Grounded Speech Decoding},
author = {Po-Ting Lin},
year = {2026},
howpublished = {Zenodo},
doi = {10.5281/zenodo.19317074}
}