I realized that the best first step is to research topics in the NLP space. This involves reading papers, which I’ll start outlining here:
Phonemic Similarity Metrics to Compare Pronunciation Methods
This paper outlines a novel algorithm for scoring the similarity of pronunciations, based partly on how biologists compare protein sequences. A score is calculated from the word's length and phonemes, and from the number of blank entries (gaps) inserted to align the two pronunciations. The authors also leverage the alternate pronunciations for words in the CMUDict, splitting out their differing phonemes and using those as a basis for what should be considered similar.
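To make the alignment idea concrete for myself, here is a minimal sketch of a protein-style global alignment (Needleman-Wunsch) over phoneme lists. This is not the paper's actual scoring scheme: the function names `align_score` and `similarity`, the flat match/mismatch/gap values, and the normalization by the longer sequence are all my own assumptions. The phoneme lists are typed by hand in stress-stripped ARPAbet style rather than loaded from the CMUDict.

```python
def align_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score between two phoneme lists.

    The match/mismatch/gap values are placeholder assumptions,
    not the weights used in the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per phoneme.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two phonemes
                score[i - 1][j] + gap,      # blank inserted into b
                score[i][j - 1] + gap,      # blank inserted into a
            )
    return score[-1][-1]


def similarity(a, b):
    """Alignment score divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return align_score(a, b) / longest


# Two illustrative pronunciations of "tomato", differing in one vowel.
tomato_1 = ["T", "AH", "M", "EY", "T", "OW"]
tomato_2 = ["T", "AH", "M", "AA", "T", "OW"]
print(similarity(tomato_1, tomato_2))
```

With a flat mismatch penalty, every substitution costs the same. The alternate pronunciations could presumably be used to soften the penalty for phoneme pairs that often swap for each other (like the two vowels above), but how exactly the paper derives those weights is something I still need to read more closely.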