NLP IL F2F Meetup at Intuit

A recap of the NLP IL meetup at Intuit, featuring talks on long-range reasoning in language models, efficient long-text understanding, and modern speech recognition.

A detailed recap of the NLP IL meetup at Intuit, which included talks on long-range reasoning in language models, efficient long-text understanding with short-text models, and modern speech recognition techniques, along with insights into the latest research and applications in natural language processing.
meetup
nlp
intuit
Author

Oren Bochman

Published

23-01-11

Modified

Monday, May 18, 2026

Keywords

NLP, Intuit, Meetup, Long-Range Reasoning, Efficient Long-Text Understanding, Speech Recognition, SCROLLS, SLED

Introductions

Intuit, Shani Gershtein, NLP.IL, NLP.IL at Intuit, who we are, who we serve

QuickBooks, Credit Karma, Mailchimp, Intuit Israel, Intuit NLP team, DS IL

SCROLLS: Standardized CompaRison Over Long Language Sequences

Paper

Standardized CompaRison Over Long Language Sequences SCROLLS

Abstract

NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.

Speaker

  • Uri Shaham - PhD candidate at Tel Aviv University
  • Uri is a Ph.D. student at the Tel Aviv University NLP lab, working with Omer Levy. His research focuses on conditional language generation, involving model architectures, inference algorithms, and evaluation benchmarks.

Slides

SCROLLS

SOTA in NLU

Problem - Transformers

Problem - Solutions

Evaluation on long texts

Can we do better?
  • Perplexity of next-token prediction
  • Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294, Melbourne, Australia. Association for Computational Linguistics.
  • Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations (ICLR).
  • Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, and Mohit Iyyer. 2021. Do long-range language models actually use long-range context? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 807–822, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5493–5505, Online. Association for Computational Linguistics.

SCROLLS

Building SCROLLS 1

Building SCROLLS 2

Desiderata

Tasks

Example Q&A

Examples require long-range reasoning

Does Context improve performance

Processing the entire input helps

Analysis

Does More context improve performance

Language understanding is crucial

Is more context all you need?

Is more context all you need?

How far is SCROLLS from being solved

Big room for improvement?

Leaderboard

Conclusions

Leaderboard

Notes

  • A few comments about this talk:
  • Met with a company that worked on patents and had lots of issues with long-range context.
  • Most of the points raised were ‘straw men’, so the results were not much of a surprise.

Efficient Long-Text Understanding with Short-Text Models

Paper

Efficient Long-Text Understanding with Short-Text Models

Abstract:

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.

Speaker

  • Maor Ivgi
    • PhD candidate at Tel Aviv University
    • Maor is an NLP researcher and entrepreneur. He has vast experience in implementing state-of-the-art deep learning models for real-world use cases. He received his master's in Computer Science at Tel Aviv University, advised by Prof. Jonathan Berant, focusing on the robustness of NLP models. As a Ph.D. candidate in Prof. Berant’s lab, his research focuses on long-range reasoning in large language models.

Slides

Efficient Long-Text Understanding with Short-Text Models

NLP Papers
  • NLP seems to have reached a new level of maturity for use in industry
    • cf. Attention Is All You Need
    • cf. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Model Timeline

Q&A challenges

Transformers - Good on short text NLU

Long Text NLU Fail

Transformers Quadratic dependency limits

Transformers Attention complexity
  • Transformers have issues with long texts:
    • self-attention is O(n^2)
    • cross-attention is O(nk)

Novel Transformer Architecture Papers
  • Efficient LLM papers are:
    • Hard to understand
    • Hard to generalize (due to platform-specific engineering tricks)
    • Expensive to reproduce
  • Memory is an issue at inference time
  • Training often uses only the beginning of a document, so the model does not see the end
  • Self-attention has a limited window size.
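
To make the O(n^2) and O(nk) figures concrete, here is a back-of-the-envelope sketch that counts attention score entries. It assumes n is the number of input tokens and k the number of decoder positions (the slide does not define them), and the sizes used are illustrative rather than taken from the talk. The chunked variant ignores overlap between chunks.

```python
def self_attention_pairs(n: int) -> int:
    """Score entries for full self-attention over n tokens: n * n."""
    return n * n


def cross_attention_pairs(n: int, k: int) -> int:
    """Score entries when k decoder positions attend over n encoded tokens."""
    return n * k


def chunked_self_attention_pairs(n: int, chunk: int) -> int:
    """Score entries when the input is split into non-overlapping chunks of
    `chunk` tokens and self-attention runs inside each chunk only."""
    full_chunks, remainder = divmod(n, chunk)
    return full_chunks * chunk * chunk + remainder * remainder


if __name__ == "__main__":
    n, k, chunk = 16_384, 128, 256  # illustrative sizes, not from the talk
    print("full self-attention:   ", self_attention_pairs(n))
    print("cross-attention:       ", cross_attention_pairs(n, k))
    print("chunked self-attention:", chunked_self_attention_pairs(n, chunk))
```

With these sizes the chunked count is n * chunk, so it grows linearly in n for a fixed chunk length, while the full count grows quadratically.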

SLED - Locality

SLED - Properties
  • SLED’s Approach
    • Assume locality of information: “In an encoder-decoder architecture, the encoder can effectively contextualize input tokens with local context only, leaving long range dependency to be handled by the decoder.”
    • Split the text into short, fixed-length, overlapping chunks (short contexts).
    • Prepend the prefix/prompt to each chunk
    • The decoder will need to put it all together.
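
The splitting step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SLED implementation: `make_chunks`, `chunk_len`, and `overlap` are hypothetical names, tokens are plain strings, and the encoding of each chunk and the fusion in the decoder are left out.

```python
from typing import List


def make_chunks(tokens: List[str], prefix: List[str],
                chunk_len: int = 8, overlap: int = 2) -> List[List[str]]:
    """Split `tokens` into overlapping windows of at most `chunk_len` tokens
    and prepend `prefix` (e.g. the question or prompt) to every window."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap  # how far the window moves each step
    chunks: List[List[str]] = []
    start = 0
    while True:
        window = tokens[start:start + chunk_len]
        chunks.append(prefix + window)
        if start + chunk_len >= len(tokens):  # this window reached the end
            break
        start += stride
    return chunks


if __name__ == "__main__":
    doc = [f"t{i}" for i in range(10)]
    for chunk in make_chunks(doc, prefix=["Q"], chunk_len=4, overlap=1):
        print(chunk)
```

Each chunk is short enough for a short-text encoder, and because the prefix is repeated in every chunk, each local encoding is conditioned on the question.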

SLED Properties

Model Size effect

SLED Performance Boost

SLED is Competitive with short text models

Analysis
  • This is a great slide!
  • It summarizes lots of info
  • SLED’s Analysis
    • Contextual encoding is crucial
    • Cheating is not enough
    • There is real benefit in fusion

Finding a Needle in a Haystack

Finding a Needle perfectly

Fusing Information Pieces
  • What is cheating?

Cheating is not enough
  • Quantifying SLED’s benefits using relative improvement.

\text{Relative Improvement} = \frac{\text{Score}(\text{SLED}) - \text{Score}(\text{BART})}{\text{Score}(\text{BART})}
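
As a worked example of the formula, the sketch below computes the relative improvement for a pair of scores. The function name and the scores are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the talk or the paper.

```python
def relative_improvement(score_sled: float, score_bart: float) -> float:
    """(Score(SLED) - Score(BART)) / Score(BART), as a fraction."""
    if score_bart == 0:
        raise ValueError("baseline score must be non-zero")
    return (score_sled - score_bart) / score_bart


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    gain = relative_improvement(score_sled=30.0, score_bart=25.0)
    print(f"relative improvement: {gain:.1%}")
```

Dividing by the baseline makes gains comparable across tasks whose metrics live on different scales.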

Gains from longer inputs: formula

Gains from longer inputs: chart

Limitations & Future Work
  • Limits & Future Work
    • Long outputs are still a constraint
    • No explicit global contextualization
    • No explicit global positional information
    • Not applicable to decoder-only architectures
    • (Corrective) pre-training is expected to help

Takeaways
  • Takeaways
    • Individual pieces of information are localized
    • Fusion in decoder works
    • SLED does well on long range tasks.

Questions
  • Main points: They point out that the encoder can usually do an adequate job of understanding the input by looking only at local context, mostly a window of a few surrounding sentences. It uses this to encode the input into a compact representation we call the state. The decoder then leverages this compression, with its “adequate” encodings, to efficiently retrieve results from much longer contexts during inference on different tasks.

An Overview of Modern Speech Recognition

Abstract

Automatic speech recognition has been impacted by advances in related fields like image processing and natural language processing in recent years. One notable achievement in these areas has been the use of self-supervised learning to improve performance in computer vision and NLP tasks. This led to the development of the first self-supervised language model for speech representations, which has demonstrated impressive results in various NLP tasks. In this talk, we will review the key principles of automatic speech recognition and discuss the current progress, research, and challenges in the field.

Speaker

  • Gal Hever
    • Algorithm Developer, Vision Map
    • MSc in Data Science, with over a decade of accumulated expertise in Machine Learning & Data Analytics from 8200, academia, and industry. Deploying algorithms to production by applying data-driven Machine Learning & AI solutions end to end, starting from research to development and testing.

Slides

Overview

Conversational AI

ASR

ASR input challenges

Signal & Noise

Ideal System

ASR Task

WER Metric
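
For readers who have not met the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not code from the talk. It assumes whitespace tokenization with no text normalization, and `word_error_rate` is a name chosen here.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage line, one substitution and one deletion against a six-word reference give a rate of 2/6. Because insertions count as errors, the rate can exceed 1 when the hypothesis is much longer than the reference.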

ASR History

ASR Time Line

Augmentations

WER we are 21

WER we are 2

ASR challenges

diversity challenge

language is dynamic

what’s next

covid understanding challenges

Non verbal communication 1

Non verbal communication 2

DataNights Cohort

QR for ASR Course

Questions

Reflections

  • I’ve read a couple of books on the subject, but this shows more up to date results.
  • Show me the papers?
  • The DataNights course should be worth taking.

Citation

BibTeX citation:
@online{bochman2023,
  author = {Bochman, Oren},
  title = {NLP {IL} {F2F} {Meetup} at {Intuit}},
  date = {2023-01-11},
  url = {https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2023. “NLP IL F2F Meetup at Intuit.” January 11. https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/.