Analysis 1 - Word Segmentation and Morphology (10/24/2023)
Content:
- What is a "word"?
- Tokenization
- Morphology and morphological analysis
- Unsupervised subword segmentation
Reading Material:
- Required Reading: Words and Transducers, Jurafsky and Martin v2, Chapter 3, through Section 3.9
- Required Reading: Text Normalization, Jurafsky and Martin v3, Section 2.4
- Reference: SentencePiece (Kudo and Richardson 2018)
- Reference: Subword Regularization ("unigram") (Kudo 2018)
Slides: Word Segmentation and Morphology Slides
Software: sentencepiece