Motivation
In the DeepLearning.AI NLP specialization, we implemented both Q&A and summarization tasks. However, we only looked at these tasks in terms of attention models. It is now time to dive deeper into the summarization task. In this article I review a talk covering a body of research on summarization, which suggests many ideas about features. I consider some of these and ask which aspects are relevant to a modern implementation of the task.
Some of the tasks here can be done with a simple LSTM or Transformer. However, it seems that RL may be useful if we want to plan the summary in a non-linear fashion or to make use of a dynamic programming approach. For instance, identifying the narrative structure might require many more steps than we typically use in a summarization task. For speeches, we may want to capture the gist of the rhetoric.
Ideas
Several ideas surfaced while working on the assignment for building a hybrid generative/abstractive summarizer based on GPT-2[1]. Many ideas came from prior work developing search engines. I had also noticed that some of these issues have plagued summarization since its inception.
1. A Generative Pre-trained Transformer, i.e. a decoder-only transformer.
Coverage: ranking content by importance
The first issue is coverage: deciding how much of the original document to include. Coverage is a much more difficult problem than it first appears.
Consider summarizing a long novel: a single-paragraph summary may give a good idea of the main plot points of the story. Summarizing a similarly sized encyclopedia, on the other hand, might require listing all the top-level headings and subheadings, i.e. a one-paragraph summary would not give a good idea of the content of the document. And yet it should be much easier to generate a summary for the encyclopedia than for the novel, since the encyclopedia is a highly structured document with clear headings, subheadings, and writing conventions that cue the reader about the importance of the content. Summarizing a novel is more challenging: we actually want to omit almost all the details, so deciding which parts to keep is a major challenge with a very sparse signal.
For example, in technical documents like patents, the headings, academic writing cues, and even IR statistics may serve as features that we can feed into a regression to guide us in separating the wheat from the chaff. For movie scripts or novels, on the other hand, we cannot use any of that, and we need to decide what is essential based on the narrative structure, which requires sophisticated processing and extensive domain knowledge to extract. When we want to create a synopsis of a book like War and Peace, specific details stand out to us as essential parts of the narrative structure, and even in 2025 LLMs are not good at picking these out of a large document. Attentive readers with a suitable background, if allowed to double the length of the summary, might add further plot points and then the less significant subplots; doubling it again, they would include more details about characters, their motivations, etc.
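As a minimal sketch of that regression idea for structured documents (my own illustration, not something prescribed by the talk), one could score each sentence with a few cheap features such as relative position, heading depth, and a tf-idf weight, and train a simple classifier to predict whether the sentence belongs in a reference summary:

    # Sketch: rank sentences of a structured document by predicted importance.
    # Assumes `sentences` (list of str), per-sentence `heading_depths`, and binary
    # labels saying whether each sentence appeared in a reference summary.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import TfidfVectorizer

    def sentence_features(sentences, heading_depths):
        tfidf = TfidfVectorizer().fit_transform(sentences)     # (n_sents, vocab)
        tfidf_score = np.asarray(tfidf.mean(axis=1)).ravel()   # crude salience proxy
        position = np.linspace(0.0, 1.0, num=len(sentences))   # relative position
        return np.column_stack([position, heading_depths, tfidf_score])

    def train_ranker(X, y):
        # X: features for sentences with known labels; y: 1 if kept in a human summary.
        return LogisticRegression().fit(X, y)

    def rank_sentences(model, X_new):
        # Higher probability = more likely to belong in the summary.
        return np.argsort(-model.predict_proba(X_new)[:, 1])

The features here are deliberately crude; the point is only that structured documents expose signals a plain regression can already exploit, whereas novels offer little beyond the text itself.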
To conclude, different documents can have radically different structures and cues. These differences suggest different strategies for selecting and ranking the material to include in a summary of a given length. Moreover, when we consider summaries of different lengths, one can see that we are really talking about a hierarchical structure.
Avoiding repetition
A second, pervasive challenge is avoiding repetition. In the generative setting the challenge is tougher, since newly generated sentences may well be equivalent to sentences we have already generated, and we want the model to redirect its attention away from material already covered. Luckily, we have already learned to detect similar sentences in the Q&A task[2].
2. Using Siamese networks for one-shot similarity detection.
Alg. 1 (similarity detection) sketches this idea: before appending a newly generated sentence, check whether it is similar to any sentence already in the summary.

    # Sketch in Python: grow the summary sentence by sentence, skipping any
    # sentence that is too similar to one already in the summary.
    # tokenize, embed, gen_a_sentence and sim are assumed helpers
    # (e.g. sim as cosine similarity between Siamese sentence embeddings).
    summary = []
    tokenized_doc = tokenize(doc)
    embeddings_doc = embed(tokenized_doc)
    target_len = len(tokenized_doc) * summary_ratio  # token budget for the summary

    while sum(len(tokenize(s)) for s in summary) < target_len:
        a = gen_a_sentence(embeddings_doc, summary)  # propose the next sentence
        # keep it only if it is not close to anything already in the summary
        if all(sim(a, s) <= threshold for s in summary):
            summary.append(a)
Context window constraints
The third challenge is technical and relates to the length of the context window, as transformer models have a limit on the number of tokens they can process at once. A second issue here is that the summary itself also needs to fit in the context window. If the structure is linear and incremental, we can chunk the document into many smaller pieces, say by sections. In reality, however, we are often dealing with tree or graph structures, and chunking can erode the model's ability to inspect the hierarchy of the structure it is tasked with summarizing. In Q&A or IR tasks this is less of a problem, since we can process each chunk separately and then combine the results.
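For documents whose structure really is linear, the chunk-and-combine strategy can be sketched as a simple map-reduce over sections. The summarize, split_into_sections, and count_tokens helpers below are placeholders for whatever model and tokenizer one actually uses:

    # Sketch: map-reduce summarization over section-sized chunks.
    # `summarize(text, max_tokens)` stands in for any model call;
    # `split_into_sections` and `count_tokens` are assumed helpers.
    MAX_CONTEXT_TOKENS = 1024                   # e.g. GPT-2's window
    OUTPUT_BUDGET = MAX_CONTEXT_TOKENS // 4     # leave room for the output

    def summarize_long_document(doc, summarize, split_into_sections, count_tokens):
        # Map: summarize each section separately so it fits in the window.
        partial_summaries = [summarize(section, max_tokens=OUTPUT_BUDGET)
                             for section in split_into_sections(doc)]
        combined = "\n".join(partial_summaries)
        # Reduce: if the partial summaries still overflow the window,
        # treat them as a new document and repeat; otherwise summarize once more.
        if count_tokens(combined) > MAX_CONTEXT_TOKENS - OUTPUT_BUDGET:
            return summarize_long_document(combined, summarize,
                                           split_into_sections, count_tokens)
        return summarize(combined, max_tokens=OUTPUT_BUDGET)

The reduce step is exactly where the hierarchy problem shows up: once sections are summarized independently, cross-section structure is only visible in the combining pass.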
If we are not using a generative model, one can imagine a tree data structure that efficiently tracks which parts of the document have already been summarized. If we can use that to steer the attention mechanism, we may be able to reduce the effective window size as we progress with the summary. A more nuanced idea comes from Bayesian search, where we keep moving the probability mass of the attention mechanism towards regions that are significant but have not yet been covered.
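A toy version of that reweighting, under my own assumptions about how importance and coverage might be represented, keeps a salience weight and a coverage counter per chunk and renormalizes after every step:

    # Sketch: shift "attention" mass away from chunks that are already covered.
    # importance[i] is a prior salience score for chunk i; coverage[i] counts how
    # much of it has been used so far (both representations are my own assumptions).
    import numpy as np

    def next_chunk_distribution(importance, coverage):
        # Down-weight chunks in proportion to how much they have been covered,
        # then renormalize so the weights form a probability distribution.
        weights = np.asarray(importance) * np.exp(-np.asarray(coverage))
        return weights / weights.sum()

    # Example: chunk 1 was the most important but has been summarized twice already,
    # so the mass shifts towards the uncovered chunks.
    importance = [0.2, 0.5, 0.3]
    coverage = [0.0, 2.0, 0.0]
    print(next_chunk_distribution(importance, coverage))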
Another challenge is that we may want to grow our summary in a non-linear fashion. We may consider the main narrative structure, then add in the subplots, and then the character motivations. This suggests two approaches: a top-down, planning-based approach[3] and a bottom-up approach where we start with the most important details and then add in the less important ones. In the second case we may want to revise the initial sentences so as to incorporate more details efficiently as we progress (a small sketch of this bottom-up refinement follows the notes below).
3. Using RL or dynamic programming.
4. Reformer layers.
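The bottom-up variant mentioned above could be sketched as iterative expansion: start from an empty summary and, in each round, ask the model to rewrite it under a larger token budget while folding in the next tier of details. The expand helper and the budget schedule here are illustrative assumptions:

    # Sketch: bottom-up summary growth by iterative rewriting.
    # `expand(doc, current_summary, budget)` stands in for a model call that
    # rewrites the current summary within `budget` tokens, adding detail from `doc`.
    def grow_summary(doc, expand, budgets=(50, 100, 200, 400)):
        summary = ""                  # the first pass yields only the main points
        for budget in budgets:
            # each round may rephrase earlier sentences so the extra details
            # fit efficiently, rather than only appending new sentences
            summary = expand(doc, summary, budget)
        return summary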
I also interviewed with a company that told me they often worked with patents, that these documents frequently ran to over a hundred pages, and that they were having lots of problems with the summarization task. This suggests that an efficient transformer[4] would be needed if we are to support context windows spanning hundreds of pages, but also that, beyond the context window, we need a good way to ensure the summary is balanced and not too repetitive.
Compression
My experience with summarization by hand is that most texts can be compressed by 5-10% without losing any information, simply by rephrasing in more concise language. So another idea that should be obvious is that the generated summary should be compressive, i.e., once we generate a good sentence we should consider whether we can make it shorter. This should be fairly easy, as we are only reducing the length of a single sentence. We may, however, be able to do even better by considering a whole section and checking whether two sentences can be merged, or whether they have partial overlap that can be merged or referenced via an anaphora. This is another area where we diverge from Q&A, where we do not have a length constraint and may often want to include all relevant information. This idea can easily be embodied in a loss function that penalizes summaries for each token or character; the same idea can also be used in an RL summarizer.
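As a minimal sketch of that penalty, one can fold a per-token cost into whatever score is used to select or reinforce a candidate summary, whether a decoding score or an RL reward; the quality function and the weight lambda_len below are placeholders:

    # Sketch: length-penalized scoring for candidate summaries.
    # `quality(summary_tokens, doc)` stands in for any content/faithfulness score
    # (ROUGE against a reference, a learned critic, ...); lambda_len is assumed.
    def compressive_score(summary_tokens, doc, quality, lambda_len=0.01):
        # Charge a small price per token, so expressing the same content
        # more concisely scores strictly higher.
        return quality(summary_tokens, doc) - lambda_len * len(summary_tokens)

    # Usage: pick the best of several candidate phrasings of the same content.
    # best = max(candidates, key=lambda c: compressive_score(c, doc, quality))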
Grounding
Generative summarization is a case where we can require that all generated text be grounded in the original document. In reality, generative models have many glitches: tokenization issues, failures of the attention mechanism, out-of-distribution generation, bias learned from the corpus, negative recency bias, and emergent contradictions (where a contradiction is introduced into the context window and cannot be removed). Even if we can avoid all of these, there is still nothing in the model that makes it prefer truthful generations. So it seems essential that the generated text be grounded in the original document. This seems verifiable by visually inspecting the attention mechanism, but in reality there may be multiple heads and multiple layers. This means we need to access the attention mechanism programmatically and store, for each generated token, annotations pointing back to the source document. We may also want to make the abstractive steps of the summarization explicit. C.f. [Extrinsic Hallucinations in LLMs](https://lilianweng.github.io/posts/2024-07-07-hallucination/) by Lilian Weng.
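As a rough sketch of accessing the attention programmatically, Hugging Face transformers can return per-step attentions during generation; averaging over layers and heads and taking the most-attended source position gives a crude per-token annotation. This is my own illustrative recipe (and the averaging is a strong simplification), not a validated grounding method:

    # Sketch: annotate each generated token with the source token it attended to most.
    # Uses GPT-2 via Hugging Face transformers; averaging attention over layers and
    # heads is a simplifying assumption, not a principled attribution method.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    source = "The treaty was signed in 1815 after lengthy negotiations."
    prompt = "\nTL;DR: "
    inputs = tok(source + prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    out = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                         return_dict_in_generate=True, output_attentions=True)

    annotations = []
    for step, step_attn in enumerate(out.attentions):
        # step_attn is a tuple over layers; each tensor has shape
        # (batch, heads, query_len, key_len). Average over layers and heads,
        # keep the last query position (the token generated at this step).
        mean_attn = torch.stack(step_attn).mean(dim=(0, 2))[0, -1]   # (key_len,)
        src_pos = int(mean_attn[:prompt_len].argmax())               # restrict to source
        gen_tok = tok.decode(int(out.sequences[0, prompt_len + step]))
        src_tok = tok.decode(int(inputs["input_ids"][0, src_pos]))
        annotations.append((gen_tok, src_tok))

    print(annotations)  # list of (generated token, most-attended source token)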
Citation
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Summarization {Task}},
  date = {2025-02-10},
  url = {https://orenbochman.github.io/notes-nlp/posts/2025-02-10-summerizer/},
  langid = {en}
}