tiny-vllm

High-performance minimal LLM inference engine, a younger sibling of vLLM, written from scratch in C++ and CUDA

I built the project based on the vLLM paper, "Efficient Memory Management for Large Language Model Serving with PagedAttention"

load an LLM from safetensors
full LLM forward pass
CUDA kernels for attention, etc.
KV cache as described in the paper
PagedAttention
batching
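
To illustrate the idea behind the paged KV cache and PagedAttention listed above, here is a minimal host-side sketch of a block table: each sequence maps its logical token positions to fixed-size physical blocks taken from a shared pool. All names (`BlockAllocator`, `PhysicalSlot`, `append_token`) and the block size are hypothetical and chosen for illustration; this is not the project's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only.
struct PhysicalSlot {
    int32_t block;   // index of the physical block in the KV pool
    int32_t offset;  // token slot inside that block
};

class BlockAllocator {
public:
    BlockAllocator(int32_t num_blocks, int32_t block_size)
        : block_size_(block_size) {
        assert(num_blocks > 0 && block_size > 0);
        // Push in reverse so that block 0 is handed out first.
        for (int32_t b = num_blocks - 1; b >= 0; --b) free_.push_back(b);
    }

    // Reserve room for one more token of sequence `seq`; return where it lives.
    PhysicalSlot append_token(int32_t seq) {
        SeqState& s = seqs_[seq];
        if (s.length % block_size_ == 0) {  // last block is full, or none yet
            if (free_.empty()) throw std::runtime_error("KV cache out of blocks");
            s.table.push_back(free_.back());
            free_.pop_back();
        }
        PhysicalSlot slot{s.table.back(), s.length % block_size_};
        ++s.length;
        return slot;
    }

    // Translate a logical token position into a physical slot.
    PhysicalSlot lookup(int32_t seq, int32_t pos) const {
        const SeqState& s = seqs_.at(seq);
        assert(pos >= 0 && pos < s.length);
        return {s.table[pos / block_size_], pos % block_size_};
    }

    // Return all blocks of a finished sequence to the free list.
    void release(int32_t seq) {
        auto it = seqs_.find(seq);
        if (it == seqs_.end()) return;
        for (int32_t b : it->second.table) free_.push_back(b);
        seqs_.erase(it);
    }

    int32_t free_blocks() const { return static_cast<int32_t>(free_.size()); }

private:
    struct SeqState {
        std::vector<int32_t> table;  // logical block index -> physical block
        int32_t length = 0;          // tokens stored so far
    };
    int32_t block_size_;
    std::vector<int32_t> free_;
    std::unordered_map<int32_t, SeqState> seqs_;
};
```

In a sketch like this, an attention kernel would receive each sequence's block table and read K and V through `table[pos / block_size]` instead of assuming one contiguous buffer per sequence, so memory is only claimed one block at a time as sequences grow.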

External libraries:

cuBLAS for all GEMMs
tokenizer from HuggingFace Transformers

Main design choices:

Tested on Llama 3.2 1B
BF16 (because Llama 3.2 1B uses it)
Single GPU (tested on RTX 5090 32GB)
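
For readers unfamiliar with BF16: it keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE float32. The sketch below shows a host-side conversion with round-to-nearest-even. The function names are hypothetical, and it is an assumption that anything like this is needed here at all; on the device, CUDA's `cuda_bf16.h` provides the `__nv_bfloat16` type.

```cpp
#include <cstdint>
#include <cstring>

// Convert a float32 to raw bfloat16 bits using round-to-nearest-even.
inline uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate and force a mantissa bit so it stays a NaN.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit, so exact ties round to even.
    uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Widen raw bfloat16 bits back to float32 (exact, no rounding needed).
inline float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Because BF16 shares float32's exponent range, widening is a plain 16-bit shift, which makes it cheap to accumulate in float32 and store results in BF16.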

use

python python/tokenizer.py "The capital of France is" | ./tiny-vllm

model architecture

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
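
The shapes printed above are enough to derive two numbers that matter for an inference engine: the parameter count and the KV cache footprint per token. The sketch below does that arithmetic. The struct and function names are hypothetical, and the `tied_lm_head` flag reflects an assumption that `lm_head` shares its weights with `embed_tokens`; the module printout alone does not show this.

```cpp
#include <cstdint>

// Dimensions read off the module printout above.
struct LlamaDims {
    int64_t vocab = 128256;
    int64_t hidden = 2048;
    int64_t kv_dim = 512;        // out_features of k_proj and v_proj
    int64_t intermediate = 8192;
    int64_t layers = 16;
};

// Weights in one decoder layer: q/o, k/v, gate/up/down, and two RMSNorms.
inline int64_t params_per_layer(const LlamaDims& d) {
    return 2 * d.hidden * d.hidden
         + 2 * d.hidden * d.kv_dim
         + 3 * d.hidden * d.intermediate
         + 2 * d.hidden;
}

// Total weights; with tied_lm_head the output projection is not counted again.
inline int64_t total_params(const LlamaDims& d, bool tied_lm_head) {
    int64_t embed = d.vocab * d.hidden;
    int64_t total = embed + d.layers * params_per_layer(d) + d.hidden;  // + final norm
    if (!tied_lm_head) total += embed;
    return total;
}

// Bytes of K and V stored per token across all layers (2 bytes per BF16 value).
inline int64_t kv_cache_bytes_per_token(const LlamaDims& d, int64_t bytes_per_elem = 2) {
    return 2 * d.layers * d.kv_dim * bytes_per_elem;
}
```

With the default dimensions, `total_params(LlamaDims{}, true)` gives 1,235,814,400 and `kv_cache_bytes_per_token(LlamaDims{})` gives 32,768 bytes, i.e. 32 KiB of KV cache per token in BF16.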

Jędrzej Maczan, 2026, Apache License 2.0

