Zenodo Preprint

CrystalASR: Hierarchical Phoneme-Grounded Speech Decoding

Po-Ting Lin 1

  1. Independent Researcher
DOI
10.5281/zenodo.19317074
License
CC BY 4.0
Categories
Speech Recognition · Machine Learning

We investigate a design principle for automatic speech recognition where linguistic structure is explicitly enforced through intermediate representations. The resulting system, CrystalASR, decomposes decoding into three modular layers: a 3.3M-parameter phoneme CTC head, a zero-parameter rule-based word decoder, and an optional language model (LM) for disambiguation. A defining constraint is strict upward information flow, ensuring higher layers modulate but never override lower-level acoustic evidence. Experiments on LibriSpeech dev-clean show that CrystalASR achieves 17.44% WER while requiring 21× fewer trainable parameters and running 14× faster at inference than a comparable end-to-end baseline. Error attribution reveals that word-level errors primarily originate from subtle phoneme inaccuracies amplified by downstream segmentation. Furthermore, a language model weight sweep exposes a sharp phase transition: beyond a narrow tiebreaker role (w_LM > 0.03), WER rises from 17% to 96% as the LM's score scale overwhelms acoustic evidence. These findings suggest that explicitly decoupling acoustic and lexical processing yields interpretable error diagnostics and substantial parameter savings, at a moderate accuracy cost relative to end-to-end models.
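The tiebreaker-versus-dominance behavior described in the w_LM sweep can be illustrated with a toy log-probability combination. This is a minimal sketch under stated assumptions: the hypothesis names, the specific scores, and the linear `combined_score` form are illustrative inventions, not CrystalASR's actual decoder or score scales.

```python
# Toy illustration of LM-weighted hypothesis scoring (assumed form,
# not CrystalASR's implementation): acoustic log-prob plus w_lm times
# LM log-prob. A small w_lm only breaks near-ties; a large w_lm lets
# the LM's score scale override clear acoustic evidence.

def combined_score(acoustic_logprob: float, lm_logprob: float, w_lm: float) -> float:
    """Combined score for one word hypothesis."""
    return acoustic_logprob + w_lm * lm_logprob

# Hypothetical hypotheses: one acoustically clear, one LM-preferred.
hyps = {
    "recognize":    {"ac": -2.0, "lm": -8.0},  # strong acoustic evidence
    "wreck a nice": {"ac": -6.0, "lm": -1.5},  # weak acoustics, fluent LM score
}

def best(w_lm: float) -> str:
    """Return the hypothesis with the highest combined score."""
    return max(hyps, key=lambda w: combined_score(hyps[w]["ac"], hyps[w]["lm"], w_lm))

print(best(0.03))  # tiebreaker regime: acoustics decide -> "recognize"
print(best(1.0))   # LM-dominated regime: LM overrides -> "wreck a nice"
```

The abrupt WER degradation reported in the abstract corresponds to the second regime: once w_LM grows past the narrow tiebreaker range, the ranking of hypotheses is governed by the LM's score scale rather than by acoustic evidence.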

  • automatic speech recognition
  • phoneme CTC
  • modular decoding
  • rule-based word decoder
  • language model weighting
  • interpretability
Cite as (BibTeX)
@misc{lin2026crystalasr,
  title = {CrystalASR: Hierarchical Phoneme-Grounded Speech Decoding},
  author = {Po-Ting Lin},
  year = {2026},
  howpublished = {Zenodo},
  doi = {10.5281/zenodo.19317074}
}
