

https://github.com/allenai/naacl2021-longdoc-tutorial
https://underline.io/events/122/sessions?eventSessionId=4103

Leverage natural hierarchy of the document (words→sentences→paragraphs)


Local LSTM + Global LSTM as encoder: a local LSTM encodes the tokens of each chunk into a chunk representation, and a global LSTM runs over those chunk representations.


This can be used for pre-training.
Goal: update the representations of chunks conditioned on other chunks multiple times
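The local/global split above can be sketched in a few lines of NumPy. A toy tanh RNN stands in for the LSTMs; all names, shapes, and the weight sharing between the two levels are illustrative, not the tutorial's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(x, W, U, b):
    """Toy tanh RNN (stand-in for an LSTM): returns the final hidden state."""
    h = np.zeros(W.shape[1])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W + h @ U + b)
    return h

d, h_dim = 8, 8  # d == h_dim so the same weights can be reused globally
W = rng.normal(size=(d, h_dim)) * 0.1
U = rng.normal(size=(h_dim, h_dim)) * 0.1
b = np.zeros(h_dim)

# A "document": 6 chunks (e.g. sentences) of 5 token embeddings each.
doc = rng.normal(size=(6, 5, d))

# Local encoder: one vector per chunk.
chunk_reprs = np.stack([rnn_encode(chunk, W, U, b) for chunk in doc])

# Global encoder: run over the chunk representations for a document vector.
doc_repr = rnn_encode(chunk_reprs, W, U, b)

print(chunk_reprs.shape, doc_repr.shape)  # (6, 8) (8,)
```

The point of the two levels is that the global RNN sees one state per chunk, not one per token, so the sequential path stays short even for long documents.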



The time complexity of self-attention is O(n^2) in the sequence length n, which makes it expensive for long documents.
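The quadratic cost is visible directly in a vanilla implementation: the score matrix has one entry per query/key pair. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Vanilla scaled dot-product self-attention. The explicit n x n
    score matrix makes both time and memory quadratic in n."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # shape (n, n)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 16, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (16, 4)
```

Every approach below attacks the `scores` matrix: reuse it across segments, compress what it attends to, sparsify it, or avoid materializing it at all.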
Transformer-XL: reuse (cache) the hidden states of previous segments as extra context for the current segment.
Compressive Transformers: keep a compressed history memory
Make the attention pattern sparse:
Content-based patterns
Reformer: locality-sensitive hashing buckets similar queries/keys; attention is computed only within the same hash bucket.
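The compressed-memory idea from the Compressive Transformer can be sketched as a two-tier FIFO: states evicted from the short-term memory are compressed rather than discarded. This sketch assumes mean-pooling as the compression function (the paper considers learned compressors too; all names here are illustrative):

```python
import numpy as np

def update_memories(mem, comp_mem, new_states, mem_size, comp_rate):
    """Compressive-Transformer-style memory update (simplified):
    states evicted from the short-term memory are mean-pooled in
    groups of comp_rate into a second, compressed memory."""
    mem = np.concatenate([mem, new_states])
    if len(mem) > mem_size:
        evicted, mem = mem[:-mem_size], mem[-mem_size:]
        # compress evicted states in groups of comp_rate
        k = len(evicted) // comp_rate * comp_rate
        groups = evicted[:k].reshape(-1, comp_rate, evicted.shape[-1])
        comp_mem = np.concatenate([comp_mem, groups.mean(axis=1)])
    return mem, comp_mem

d = 4
mem, comp_mem = np.zeros((0, d)), np.zeros((0, d))
for _ in range(5):  # process 5 segments of 8 states each
    segment = np.ones((8, d))
    mem, comp_mem = update_memories(mem, comp_mem, segment,
                                    mem_size=8, comp_rate=2)
print(mem.shape, comp_mem.shape)  # (8, 4) (16, 4)
```

Attention then reads from both memories, so old context survives at a coarser resolution instead of falling off the end of the window.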
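Reformer's bucketed attention can be sketched with random-hyperplane LSH: positions whose (tied) query/key vectors land on the same side of a few random hyperplanes share a bucket, and full attention runs only inside each bucket. This is a simplification (no multi-round hashing, chunking, or causal masking); names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lsh_attention(X, Wqk, Wv, n_planes=2):
    """Reformer-style sketch: hash tied query/key vectors with random
    hyperplanes and attend only within each hash bucket."""
    QK, V = X @ Wqk, X @ Wv  # Reformer ties Q and K into one projection
    planes = rng.normal(size=(QK.shape[1], n_planes))
    bits = (QK @ planes > 0).astype(int)
    buckets = bits @ (2 ** np.arange(n_planes))  # bucket id per position
    out = np.zeros_like(V)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = QK[idx] @ QK[idx].T / np.sqrt(QK.shape[1])
        out[idx] = softmax(scores) @ V[idx]
    return out

n, d = 32, 8
X = rng.normal(size=(n, d))
Wqk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = lsh_attention(X, Wqk, Wv)
print(out.shape)  # (32, 8)
```

With roughly even buckets, each position attends to ~n / 2^n_planes others, so the cost drops from n^2 toward n * bucket_size.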

Apply a kernel feature map φ to Q and K; the attention becomes

φ(Q) (φ(K)^T V), up to a per-query normalization.

Computing φ(K)^T V first gives a d×d summary, so the cost is linear in the sequence length instead of quadratic.
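The kernel trick above can be checked numerically: with the feature map φ(x) = elu(x) + 1 (the choice in the linear-attention formulation; any positive map works for this sketch), the linear-time computation matches the explicit n × n kernel-matrix version exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention phi(Q) (phi(K)^T V), normalized per query.
    phi(K)^T V is a d x d summary built once, so the cost is O(n)."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                # (d, d) summary
    Z = Qp @ Kp.sum(axis=0)      # per-query normalizer
    return (Qp @ KV) / Z[:, None]

def quadratic_equivalent(Q, K, V):
    """Same result via the explicit n x n kernel matrix."""
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=1, keepdims=True)) @ V

n, d = 16, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
assert np.allclose(linear_attention(Q, K, V), quadratic_equivalent(Q, K, V))
```

The equivalence holds because φ(Q)(φ(K)^T V) = (φ(Q)φ(K)^T)V by associativity; only the evaluation order, and hence the complexity, changes.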