vLLM with the Transformers Modelling Backend

A deep dive into Harry Mellor’s talk on vLLM with the Transformers Modelling Backend, exploring the integration of vLLM with HuggingFace Transformers for efficient model deployment.

odsc

talk

tutorial

transformers

vLLM

Harry Mellor
- vLLM with the Transformers Modelling Backend
- LinkedIn
- HuggingFace
- tutorial

Notes

Tools:
- Transformers
- torchtitan
- Axolotl
- TRL
- Unsloth

Transformers - covers transformers serve, from_pretrained and generate_batch.
vLLM - covers vllm serve and the LLM class.
Transformers backend - how vLLM can run a Transformers model implementation without reimplementing it from scratch.
- cf. Building a compatible model backend for inference
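
As a rough sketch of what opting into the Transformers backend looks like from the vLLM side: this assumes the model_impl option described in the vLLM documentation. The helper names (transformers_backend_kwargs, generate_once) are hypothetical and exist only for this illustration, and the model id is whatever checkpoint you supply.

```python
from typing import Any


def transformers_backend_kwargs(model_id: str) -> dict[str, Any]:
    """Keyword arguments for vllm.LLM that request the Transformers backend.

    Assumption: model_impl="transformers" forces the Transformers
    implementation; without it vLLM picks an implementation itself.
    """
    return {"model": model_id, "model_impl": "transformers"}


def generate_once(model_id: str, prompt: str) -> str:
    """Run one offline generation (needs vLLM installed and a supported device)."""
    # Imported lazily so this sketch can be loaded without vLLM present.
    from vllm import LLM, SamplingParams

    llm = LLM(**transformers_backend_kwargs(model_id))
    outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
    # Each request yields a RequestOutput; take the first completion's text.
    return outputs[0].outputs[0].text
```

Calling generate_once with a model id and a prompt returns the generated text. For online serving, the equivalent is to pass --model-impl transformers to vllm serve.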
Bring your own transformers model to vLLM
- When making a model compatible with the Transformers backend, watch out for:
  - Missing kwargs at any level: the most common issue. If OlmoeModel.forward accepted **kwargs but OlmoeDecoderLayer.forward didn’t, attention_instances would be silently dropped.
  - Custom attention not using ALL_ATTENTION_FUNCTIONS: models that compute attention inline can’t be dispatched to vLLM’s kernels. The model must use the standard dispatch pattern.
  - Incorrect TP plans: misspecifying “colwise” vs “rowwise” will silently produce wrong results. Remember: projections that increase dimension (Q, K, V, gate, up) are typically “colwise”, and projections that decrease dimension (O, down) are “rowwise”.
  - Non-standard attention mask handling: models that manipulate attention weights directly (e.g., adding positional bias to attention scores after softmax) may not be expressible through the standard attention interface.
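
The first two pitfalls can be illustrated with a toy sketch in plain Python. Nothing here is real Transformers or vLLM code: ATTENTION_FUNCTIONS stands in for the ALL_ATTENTION_FUNCTIONS registry, and the classes, the "backend" key, and the returned strings are all hypothetical. It only shows why every level has to forward **kwargs and why attention has to go through a registry lookup.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for the ALL_ATTENTION_FUNCTIONS registry.
ATTENTION_FUNCTIONS: Dict[str, Callable[..., str]] = {}


def backend_attention(layer_idx: int, **kwargs: Any) -> str:
    """Pretend backend kernel that needs per-layer state passed via kwargs."""
    instances = kwargs.get("attention_instances")
    if instances is None:
        return f"layer {layer_idx}: no attention_instances, backend state lost"
    return f"layer {layer_idx}: using {instances[layer_idx]}"


ATTENTION_FUNCTIONS["backend"] = backend_attention


class Attention:
    def __init__(self, layer_idx: int, impl: str = "backend") -> None:
        self.layer_idx = layer_idx
        self.impl = impl

    def forward(self, hidden: str, **kwargs: Any) -> str:
        # Standard dispatch: look the function up instead of computing inline.
        attn_fn = ATTENTION_FUNCTIONS[self.impl]
        return attn_fn(self.layer_idx, **kwargs)


class GoodDecoderLayer:
    def __init__(self, layer_idx: int) -> None:
        self.attn = Attention(layer_idx)

    def forward(self, hidden: str, **kwargs: Any) -> str:
        return self.attn.forward(hidden, **kwargs)  # kwargs forwarded


class LeakyDecoderLayer(GoodDecoderLayer):
    def forward(self, hidden: str, **kwargs: Any) -> str:
        return self.attn.forward(hidden)  # kwargs dropped, with no error


class Model:
    def __init__(self, layers: List[GoodDecoderLayer]) -> None:
        self.layers = layers

    def forward(self, hidden: str, **kwargs: Any) -> List[str]:
        # The top level accepts kwargs and hands them to every layer.
        return [layer.forward(hidden, **kwargs) for layer in self.layers]
```

Running Model([GoodDecoderLayer(0), LeakyDecoderLayer(1)]).forward("h", attention_instances={0: "attn0", 1: "attn1"}) raises nothing, yet the second layer never sees its state. This is the kind of silent failure the list above warns about.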

Citation

BibTeX citation:

@online{bochman2026,
  author = {Bochman, Oren},
  title = {vLLM with the {Transformers} {Modelling} {Backend}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk3.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2026. “vLLM with the Transformers Modelling Backend.” April 28. https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk3.html.