Learning 1 - Modeling Long Sequences (11/14/2023)
- Reference: RNN Language Models (Mikolov et al. 2010)
- Reference: Larger Context RNNLMs (Mikolov and Zweig 2012)
- Reference: Self Attention over Previous Sentence (Voita et al. 2018)
- Reference: Self Attention over Previous Vectors (Dai et al. 2019)
- Reference: Compressive Transformer (Rae et al. 2019)
- Reference: Sparse Transformers (Child et al. 2019)
- Reference: Longformer: The Long-Document Transformer (Beltagy et al. 2020)
- Reference: Mistral 7B (Jiang et al. 2023) (see the sliding-window attention sketch at the end of this section)
- Reference: Adaptive Span Transformer (Sukhbaatar et al. 2019)
- Reference: Adaptively Sparse Transformers (Correia et al. 2019)
- Reference: Reformer (Kitaev et al. 2020)
- Reference: Linformer (Wang et al. 2020)
- Reference: Nyströmformer (Xiong et al. 2021)
- Reference: ALiBi (Press et al. 2022) (see the attention-bias sketch at the end of this section)
- Reference: KERPLE (Chi and Fan et al. 2022)
- Reference: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al. 2022)
- Reference: Evaluation: Sentence Scrambling (Barzilay and Lapata 2008)
- Reference: Evaluation: Final Sentence Prediction (Mostafazadeh et al. 2016)
- Reference: Evaluation: Final Word Prediction (Paperno et al. 2016)
- Reference: Long Range Arena (Tay et al. 2020)
- Reference: In the long (context) run (De Vries 2023)
- Reference: In-Context Pretraining: Language Modeling Beyond Document Boundaries (Shi et al. 2023)
- Reference: S4: Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al. 2022)
- Reference: MEGA: Moving Average Equipped Gated Attention (Ma and Zhou et al. 2023)
Slides: Long Sequences Slides
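
Several of the references above (Longformer, Mistral 7B) restrict each token to a local, sliding window of attention so that per-token cost stops growing with context length. As an illustrative aid only, here is a minimal NumPy sketch of a causal sliding-window mask; the function names and shapes are assumptions made for this example, not code from those papers.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # Causal sliding-window mask: position i may attend only to the last `window`
    # positions (including itself), keeping per-token attention cost constant.
    q = np.arange(seq_len)[:, None]   # query positions, shape (L, 1)
    k = np.arange(seq_len)[None, :]   # key positions,   shape (1, L)
    return (k <= q) & (q - k < window)

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Softmax over keys, with disallowed positions set to -inf before normalizing.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

For example, `sliding_window_mask(6, 3)` lets position 5 attend only to positions 3, 4, and 5; Mistral 7B pairs this kind of window with a rolling key/value cache so memory stays bounded by the window size.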
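
The ALiBi entry above drops positional embeddings and instead adds head-specific linear biases to the attention scores, which is what lets it extrapolate to longer sequences at test time. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; the function names and shapes are assumptions, and the slope schedule shown is the paper's geometric schedule for power-of-two head counts.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def causal_alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    # Additive per-head bias: -slope * (query_pos - key_pos), with future keys masked out.
    q = np.arange(seq_len)[:, None]      # query positions, shape (L, 1)
    k = np.arange(seq_len)[None, :]      # key positions,   shape (1, L)
    distance = q - k                     # how far back each key lies
    bias = -alibi_slopes(n_heads)[:, None, None] * distance   # shape (H, L, L)
    bias[:, distance < 0] = -np.inf      # causal mask: no attention to future positions
    return bias

def alibi_attention_weights(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    # q, k: arrays of shape (n_heads, seq_len, head_dim).
    n_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)     # (H, L, L)
    scores = scores + causal_alibi_bias(seq_len, n_heads)     # no positional embeddings used
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Because the bias depends only on the relative distance between query and key, the same function applies unchanged to sequences longer than those seen during training, which is the length-extrapolation property ALiBi targets.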