Probabilistic Modeling with Language Models

Author

Oren Bochman

Published

Sunday, September 14, 2025

Ever since I learned how LLMs are built, I’ve been asking myself: now that I can get a probability distribution over a token given a context, how can I use this in (Bayesian) probabilistic modeling? And this is not the only interesting (and underused) quantity that LLMs can provide; the uses of such probabilities are limited only by one’s imagination and prior knowledge of probabilistic modeling in NLP.

What I find utterly shocking is that this is not a question I’ve heard anyone ask. People want to know about making language models do risky things they were never designed for, like doing math, reasoning, playing chess, or doing customer support. But no one seems very interested in maximizing their abilities on the core NLP tasks, where they can provide lots of benefit at minimal risk.

OK, I realize that I’m stretching my point a bit. But I do think there is a lot of low-hanging fruit in this area, and plenty of mid- and high-hanging fruit as well.

I did hear Dr. Goldberg ask why people use regexes instead of neural parsing to process text. I think that is essentially the same idea I have in mind.

A number of papers have come out claiming that LLMs are knowledge bases or that they are great at common-sense reasoning. In reality, though, LLMs are not designed for this, and while they have impressive facilities for these and many other tasks, often referred to as emergent abilities1, they are notoriously unreliable. Which brings us to the topic of this post: how can we use language models, and the probabilities they provide, to build probabilistic models?

1 Insight: if a large language model has been trained on enough chess books, it incorporates key values of some unrelated function f(x), e.g. the best chess move in position x, and it may seem to be a good player so long as the position is similar enough to what it has seen. Since the model doesn’t actually reason in the sense that a chess program does, it will play miserably outside the moves it knows. For an easier function f(z) it might learn all the cases, or enough cases to be good 80% of the time…

Probabilistic Modeling with LLMs

  • What can an LLM provide us most easily?
  • How about a fixed embedding? (A sketch follows this list.)
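A minimal sketch of the fixed-embedding idea, assuming the Hugging Face transformers and torch packages and GPT-2 as an illustrative backbone (any encoder would do): mean-pool the last hidden states into a fixed-size vector per text.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states into one fixed-size vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

v1 = embed("The cat sat on the mat.")
v2 = embed("A kitten rests on a rug.")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))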

Applications abound. What this article is calling for is:

  1. A systematic study of the applications of LLMs to NLP tasks.
  2. In a broader sense, a study of how unsupervised pre-training (next-token prediction) can be used to pre-train probabilistic models for NLP tasks. I’m thinking of T5 and other models that have “commands” for doing different NLP tasks and share one architecture with different entry and exit points for different tasks and types of data.
  3. When we were starting out with LLMs, many researchers complained that there wasn’t good data to build supervised models for different NLP tasks. LLMs can provide a shortcut to just that. This doesn’t mean that all such dataset creation is now trivial, but it does mean we can use LLMs and smaller specialized models to jump forward in building probabilistic models (see the sketch after this list).
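A minimal sketch of that shortcut, assuming the Hugging Face transformers and scikit-learn packages; the texts and label names are illustrative placeholders. An off-the-shelf zero-shot classifier stands in for “the LLM” as a cheap annotator, and its silver labels are distilled into a small probabilistic classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import pipeline

# A zero-shot model plays the role of the LLM annotator.
teacher = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["positive", "negative"]

unlabeled = [
    "great product, would buy again",
    "arrived broken and late",
    "does exactly what it promises",
    "a complete waste of money",
]
silver = [teacher(t, candidate_labels=labels)["labels"][0] for t in unlabeled]

# Distil the silver labels into a small, fast probabilistic model.
student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(unlabeled, silver)
print(student.predict_proba(["slow shipping but a solid product"]))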

Here are some NLP tasks where such a study could pay off quickly:

  • Contextual spelling correction. Grammarly is very poor at fixing mistakes.
  • Grammar correction or improvement. Grammarly makes many bad suggestions, seemingly blind to the context.
  • Text segmentation
  • Text classification
  • Text summarization
  • Translation. At least into Hebrew it is weak, since the model needs to infer morphological features (e.g. gender and number) that are marked in Hebrew but do not appear in most English sentences. We don’t have good priors due to sparsity (most morphological states are under-represented or even missing in most corpora).
  • Text-to-speech (TTS). In Hebrew it is very error-prone, likely due to compounding issues and poor inference.
  • Query expansion and understanding (asking good follow-up questions)

An LLM most readily gives us a probability distribution over the next token given a context. This is a very powerful primitive.
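A minimal sketch of that primitive, assuming the transformers and torch packages and GPT-2 as an illustrative model: read off the full distribution over the next token given a context.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)     # P(next token | context)

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(tokenizer.decode([idx.item()]), round(p.item(), 3))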

Some ideas:

Train a model extending T5 with support for additional NLP tasks, and add support for tools that give the language model capabilities it lacks or is very poor at.
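A minimal sketch of the shared-architecture idea, assuming the transformers (and sentencepiece) packages and the public t5-small checkpoint: one model, with different tasks selected by text “commands” (prefixes); extending it means fine-tuning with new prefixes for new tasks.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def run(task_prefix: str, text: str) -> str:
    # The task is selected purely by the textual prefix.
    ids = tokenizer(task_prefix + text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(run("translate English to German: ", "The house is wonderful."))
print(run("summarize: ", "Long article text goes here ..."))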

A (non-parametric) model of named entities, e.g. people, places, and organizations. It would also help to indicate whether the name encountered might also be a (common) word, i.e. to flag ambiguity.

Then we can query our LLM for the probability of a name given a context, using prompts like the ones below (scored in the sketch after the list):

  • “The capital of France is …”
  • “The president of the USA is …”
  • “The CEO of Microsoft is …”
  • “The author of ‘War and Peace’ is …”
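A minimal sketch of how such queries could be scored, again assuming transformers, torch, and GPT-2; the candidate names are illustrative. We sum the token-level log-probabilities of each candidate continuation given the prompt, which gives log P(name | context) under the model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def completion_logprob(context: str, completion: str) -> float:
    # Sum log P(token | prefix) over the tokens of `completion`.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    targets = full_ids[0, ctx_len:]                       # completion tokens only
    rows = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[r, t].item() for r, t in zip(rows, targets))

context = "The capital of France is"
for name in [" Paris", " Lyon", " Berlin"]:
    print(name, completion_logprob(context, name))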

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Probabilistic {Modeling} with {Language} {Models}},
  date = {2025-09-14},
  url = {https://orenbochman.github.io/posts/2025/2025-09-14-bayes-for-llm/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Probabilistic Modeling with Language Models.” September 14, 2025. https://orenbochman.github.io/posts/2025/2025-09-14-bayes-for-llm/.