vLLM with the Transformers Modelling Backend

A deep dive into Harry Mellor’s talk on vLLM with the Transformers Modelling Backend, exploring the integration of vLLM with HuggingFace Transformers for efficient model deployment.
odsc
Author

Oren Bochman

Published

Tuesday, April 28, 2026

Modified

Monday, May 18, 2026

Keywords

vLLM, Transformers, HuggingFace, Model Deployment, Tutorial


Notes
  • Tools:
    • Transformers
    • torchtitan
    • Axolotl
    • TRL
    • Unsloth
  1. Transformers - covers transformers serve, from_pretrained and generate_batch.
  2. vLLM - covers vllm serve and the LLM class.
  3. Transformers backend - how vLLM can run a Transformers model implementation without reimplementing it from scratch.
  4. Bring your own transformers model to vLLM
    • When making a model compatible with the Transformers backend, watch out for:
      • Missing kwargs at any level: the most common issue. Every forward method between the model and the attention call has to accept and pass on **kwargs. If OlmoeModel.forward accepted **kwargs but OlmoeDecoderLayer.forward didn’t, attention_instances would be silently dropped.
      • Custom attention not using ALL_ATTENTION_FUNCTIONS: models that compute attention inline can’t be dispatched to vLLM’s kernels. The model must use the standard dispatch pattern.
      • Incorrect TP plans: misspecifying “colwise” vs “rowwise” silently produces wrong results. Remember: projections that increase dimension (Q, K, V, gate, up) are typically “colwise”, and projections that decrease dimension (O, down) are “rowwise”.
      • Non-standard attention mask handling: models that manipulate attention weights directly (e.g., adding positional bias to attention scores after softmax) may not be expressible through the standard attention interface.
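
The missing-kwargs pitfall is easy to reproduce without any real model. The sketch below is a toy illustration only: ToyModel, BrokenLayer, FixedLayer and ToyAttention are hypothetical stand-ins, not Transformers or vLLM classes, and the dictionary passed as attention_instances is a placeholder for whatever the backend actually supplies. It shows a simplified variant of the problem, where one intermediate layer swallows the keyword arguments instead of forwarding them.

```python
# Toy stand-ins for a model, a decoder layer and an attention module.
# These are hypothetical classes for illustration, not real library code.

class ToyAttention:
    def forward(self, hidden, **kwargs):
        # The backend-supplied object only arrives here if every caller
        # above forwarded **kwargs. Return it so we can inspect what arrived.
        return kwargs.get("attention_instances")


class BrokenLayer:
    def __init__(self):
        self.attn = ToyAttention()

    def forward(self, hidden, **kwargs):
        # Bug: kwargs are accepted but never passed on, so they vanish here.
        return self.attn.forward(hidden)


class FixedLayer:
    def __init__(self):
        self.attn = ToyAttention()

    def forward(self, hidden, **kwargs):
        # Fix: forward everything the caller supplied.
        return self.attn.forward(hidden, **kwargs)


class ToyModel:
    def __init__(self, layer):
        self.layer = layer

    def forward(self, hidden, **kwargs):
        return self.layer.forward(hidden, **kwargs)


# Placeholder for the per-layer attention objects a backend might inject.
instances = {0: "backend attention for layer 0"}

broken_result = ToyModel(BrokenLayer()).forward([1.0], attention_instances=instances)
fixed_result = ToyModel(FixedLayer()).forward([1.0], attention_instances=instances)

print("broken layer received:", broken_result)
print("fixed layer received:", fixed_result)
```

No exception is raised in the broken case, which is what makes the problem hard to spot: the attention module simply never sees the argument.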
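
The colwise/rowwise rule for TP plans can be checked with a small numerical sketch. The code below is not vLLM or Transformers code: it uses plain Python lists, a made-up two-layer MLP with a ReLU in between, and a simple element-wise sum standing in for the all-reduce across two ranks. Under those assumptions it shows why splitting the up projection by columns and the down projection by rows reproduces the unsharded result.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU, so it can be applied to each shard independently."""
    return [[max(0.0, v) for v in row] for row in m]


def add(a, b):
    """Element-wise sum; stands in for an all-reduce over two ranks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up input and weights: hidden size 2, intermediate size 4.
x = [[1.0, 2.0]]
w_up = [[1.0, 2.0, -1.0, 0.5],
        [0.5, -2.0, 2.0, 1.0]]
w_down = [[1.0, 0.0],
          [0.0, 1.0],
          [2.0, -1.0],
          [-1.0, 2.0]]

# Unsharded reference: down(relu(up(x))).
reference = matmul(relu(matmul(x, w_up)), w_down)

# "colwise" on the up projection: each rank owns a slice of the columns.
up_shards = [[row[:2] for row in w_up], [row[2:] for row in w_up]]
# "rowwise" on the down projection: each rank owns the matching rows.
down_shards = [w_down[:2], w_down[2:]]

# Each rank works only on its own shards, then the partial outputs are summed.
partials = [matmul(relu(matmul(x, up)), down)
            for up, down in zip(up_shards, down_shards)]
sharded = add(partials[0], partials[1])

print(sharded == reference)  # True
```

The column split leaves each rank with a complete slice of the intermediate activations, so the element-wise nonlinearity can run locally, and the row split turns the final reduction into a single sum over ranks. A plan that shards these two matrices along other axes would break that pairing.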

Citation

BibTeX citation:
@online{bochman2026,
  author = {Bochman, Oren},
  title = {vLLM with the {Transformers} {Modelling} {Backend}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk3.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2026. “vLLM with the Transformers Modelling Backend.” April 28. https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk3.html.