Forge

Live

Personal project - a measured vLLM serving benchmark for Llama 3.1 8B on a RunPod RTX A5000. BF16 reached 2,169 total tok/s at $0.035 per 1M tokens, and AWQ-INT4 retained 99.4% of BF16's mean quality.

Role: Solo - Python, vLLM, benchmarking, infra
Period: May 2026
Stack: Python 3.12, uv, vLLM, AWQ-INT4 (Marlin kernels), Llama 3.1 8B, lm-evaluation-harness, Prometheus, Grafana, RunPod RTX A5000, Matplotlib, Ruff, mypy (strict), pytest, GitHub Actions
Links: Repo

What it is

A focused engineering artifact, not a SaaS. The shipped pieces:

Serving - env-driven vLLM config with continuous batching, KV cache, and an OpenAI-compatible streaming API, plus native Prometheus metrics and Grafana provisioning.
Benchmark harness - a wrapper around vllm bench serve driven by a ShareGPT trace, sweeping concurrency levels 1, 4, 16, 32, and 64 with 256 prompts at each level.
Model comparison - BF16 meta-llama/Llama-3.1-8B-Instruct versus hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4.
Quality eval - lm-evaluation-harness against the running vLLM server on MMLU, GSM8K, and HellaSwag.
Cost model - measured throughput turned into dollars per million tokens, compared with GPT-4o, GPT-4o mini, Claude Sonnet, and Claude Haiku.
Charts and CI - result JSON drives the chart pipeline; CI covers parsers, config validation, cost math, and chart data shaping.
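
The concurrency sweep described in the benchmark harness item can be sketched as a small command builder. This is a minimal sketch, not the repository's actual harness: the function names, the dataset path, and the result directory are hypothetical, and the flags are the ones vllm bench serve accepts as far as I know, so check them against the installed vLLM version. It assumes a vLLM server is already running on the default host and port.

```python
from __future__ import annotations

# Concurrency levels and prompt count as stated in the harness description.
CONCURRENCY_LEVELS = (1, 4, 16, 32, 64)
NUM_PROMPTS = 256


def build_bench_command(
    model: str,
    concurrency: int,
    dataset_path: str,  # hypothetical local path to the ShareGPT JSON file
    result_dir: str,    # hypothetical output directory for result JSON
) -> list[str]:
    """Build one `vllm bench serve` invocation as an argv list.

    Flag names are assumed from the vLLM benchmark CLI and may differ
    between vLLM versions.
    """
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "sharegpt",
        "--dataset-path", dataset_path,
        "--num-prompts", str(NUM_PROMPTS),
        "--max-concurrency", str(concurrency),
        "--save-result",
        "--result-dir", result_dir,
    ]


def build_sweep(model: str, dataset_path: str, result_dir: str) -> list[list[str]]:
    """Return one command per concurrency level, lowest level first."""
    return [
        build_bench_command(model, level, dataset_path, result_dir)
        for level in CONCURRENCY_LEVELS
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True). Building the argv separately from running it keeps the sweep testable in CI without a GPU.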

Why I built it

To answer a production question with measured numbers: should I self-host an open model instead of paying a commercial API, and does INT4 quantization improve the economics on a cheap 24 GB GPU?

Forge answers both. BF16 Llama 3.1 8B on the RunPod A5000 was much cheaper than hosted frontier APIs on raw token cost: about 181x cheaper than GPT-4o blended pricing, counting compute only. AWQ-INT4 kept quality but did not improve throughput or cost in this particular vLLM/A5000 setup.

Measured result

BF16 was the throughput winner. At concurrency 64, it served 2,169 total tokens per second, compared with 1,017 total tokens per second for AWQ-INT4 on the same A5000 pod.

Figure: line chart comparing BF16 and AWQ total-token throughput across concurrency levels on Forge. BF16 peaked at 2,169 total tokens per second; AWQ peaked at 1,017.

Latency results split by metric. AWQ had the lower p99 time to first token (TTFT) under load: 508 ms at concurrency 64 versus 2,822 ms for BF16. BF16 decoded faster: its mean time per output token (TPOT) at concurrency 64 was 52.7 ms versus 88.0 ms for AWQ.

Figure: line chart comparing p99 time to first token for BF16 and AWQ across concurrency levels. AWQ had lower p99 TTFT under high concurrency in this run.

Figure: line chart comparing mean time per output token for BF16 and AWQ across concurrency levels. BF16 decoded faster, which is why it won throughput and cost.
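
To make the decode numbers concrete, mean time per output token can be inverted into a per-stream decode rate. The sketch below uses only the two concurrency-64 figures quoted above (52.7 ms for BF16, 88.0 ms for AWQ); the function name is illustrative and not taken from the repository.

```python
def decode_rate_per_stream(mean_tpot_ms: float) -> float:
    """Tokens per second that a single request receives while decoding."""
    if mean_tpot_ms <= 0:
        raise ValueError("mean_tpot_ms must be positive")
    return 1000.0 / mean_tpot_ms


# Mean time per output token at concurrency 64, from the measured run.
bf16_rate = decode_rate_per_stream(52.7)
awq_rate = decode_rate_per_stream(88.0)

print(f"BF16: {bf16_rate:.1f} tok/s per stream")               # BF16: 19.0 tok/s per stream
print(f"AWQ:  {awq_rate:.1f} tok/s per stream")                # AWQ:  11.4 tok/s per stream
print(f"BF16 decode advantage: {bf16_rate / awq_rate:.2f}x")   # 1.67x
```

This is a per-stream decode rate only. It is not the same quantity as the total tokens per second reported for the whole server.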

Cost

At $0.27/hr compute, BF16 cost $0.0346 per 1M total tokens at measured peak throughput. AWQ cost $0.0737 per 1M total tokens because it was slower. Against GPT-4o blended pricing of $6.25 per 1M tokens, BF16 works out about 181x cheaper and AWQ about 85x cheaper on compute alone, before storage and operational overhead.
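
The cost figures follow directly from the hourly rate and the measured peak throughput. Here is a minimal sketch of that arithmetic, using only the constants quoted above; the function and variable names are illustrative and not necessarily those used in the repository.

```python
GPU_HOURLY_USD = 0.27            # RunPod A5000 compute rate quoted above
GPT4O_BLENDED_USD_PER_1M = 6.25  # GPT-4o blended price quoted above


def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Compute-only cost of serving one million tokens at a sustained rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


bf16_cost = usd_per_million_tokens(GPU_HOURLY_USD, 2169)  # about 0.0346
awq_cost = usd_per_million_tokens(GPU_HOURLY_USD, 1017)   # about 0.0737

print(f"BF16: ${bf16_cost:.4f} per 1M tokens, "
      f"{GPT4O_BLENDED_USD_PER_1M / bf16_cost:.0f}x cheaper than GPT-4o blended")
print(f"AWQ:  ${awq_cost:.4f} per 1M tokens, "
      f"{GPT4O_BLENDED_USD_PER_1M / awq_cost:.0f}x cheaper than GPT-4o blended")
```

The model assumes the GPU is kept at peak throughput for the whole billed hour, so real costs at lower utilization would be higher.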

Quality

AWQ-INT4 retained 99.4% of BF16's mean quality across the three eval tasks. It lost 1.97 percentage points on MMLU and 0.85 on HellaSwag, and gained 1.52 on GSM8K. That is enough quality retention for many serving experiments, but the benchmark still favored BF16 because AWQ was slower.
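
One plausible way to compute a mean retention figure is to average the per-task ratio of quantized score to baseline score. That definition is an assumption; the text above does not state the exact formula. The scores in the sketch are placeholders with made-up task names, not Forge's eval results.

```python
from statistics import mean


def mean_retention(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Average of candidate/baseline score ratios over the baseline's tasks.

    Assumed definition of "mean quality retention"; the real pipeline
    may aggregate differently.
    """
    if not baseline:
        raise ValueError("baseline must contain at least one task")
    return mean(candidate[task] / baseline[task] for task in baseline)


# Placeholder scores for illustration only (NOT measured results).
baseline_scores = {"task_a": 0.50, "task_b": 0.80}
candidate_scores = {"task_a": 0.49, "task_b": 0.80}

retention = mean_retention(baseline_scores, candidate_scores)
print(f"mean retention: {retention:.1%}")
```

Averaging ratios rather than raw point deltas keeps tasks with different baseline accuracy on a comparable scale.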

How it works

Reproducibility as a first-class constraint

Methodology, hardware, model IDs, exact vLLM setup, workload, pricing constants, and generated results are committed in the Forge repository. The raw benchmark files live under results/bench/full-*, raw eval under results/eval/full, and chart data under results/charts.
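
Because the raw benchmark JSON is committed, headline numbers can be re-derived from the files instead of trusted from the page. The sketch below reads one run directory and returns its peak total-token throughput. The field names max_concurrency and total_token_throughput are assumed from the JSON that vllm bench serve saves and should be checked against the committed files; the function name is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def peak_throughput(result_dir: Path) -> tuple[int, float]:
    """Return (concurrency, total tok/s) for the fastest run in a directory.

    Assumes each *.json file holds one benchmark result with the keys
    "max_concurrency" and "total_token_throughput".
    """
    best: tuple[int, float] | None = None
    for path in sorted(result_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        point = (int(data["max_concurrency"]), float(data["total_token_throughput"]))
        if best is None or point[1] > best[1]:
            best = point
    if best is None:
        raise FileNotFoundError(f"no result JSON found in {result_dir}")
    return best
```

A caller could loop over Path("results/bench").glob("full-*") and apply peak_throughput to each run directory.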

M1 rehearsal gate

The same shell script that runs on RunPod runs locally in rehearsal mode against Qwen/Qwen2.5-0.5B-Instruct. That catches parser, config, chart, and orchestration bugs before paid GPU time starts.
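
A rehearsal gate like this comes down to one switch that selects the model and marks whether the run costs money. The sketch below shows the idea with an environment-driven config. The variable names FORGE_MODE and FORGE_MODEL and the RunConfig type are hypothetical; the actual project drives this from a shell script.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass

REHEARSAL_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
FULL_MODEL = "meta-llama/Llama-3.1-8B-Instruct"


@dataclass(frozen=True)
class RunConfig:
    mode: str
    model: str
    paid: bool  # True only when the run is meant for the rented GPU


def load_run_config(env: Mapping[str, str] | None = None) -> RunConfig:
    """Pick the model from a hypothetical FORGE_MODE environment variable.

    Defaults to rehearsal so that a paid run is always an explicit choice.
    """
    env = os.environ if env is None else env
    mode = env.get("FORGE_MODE", "rehearsal")
    if mode == "rehearsal":
        return RunConfig(mode=mode, model=REHEARSAL_MODEL, paid=False)
    if mode == "full":
        return RunConfig(mode=mode, model=env.get("FORGE_MODEL", FULL_MODEL), paid=True)
    raise ValueError(f"unknown FORGE_MODE: {mode!r}")
```

Everything downstream (parsers, charts, orchestration) takes the same path in both modes, which is what lets the rehearsal catch bugs before the paid run.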

Strategic test coverage

Tests cover the load-bearing utilities around the model: cost math, result parsers, chart-data shaping, config validation, and benchmark metadata. The model output itself is measured in the paid run, not mocked into unit tests.

Status

Done. The core Forge milestone is complete: paid BF16 and AWQ serving runs, quality eval, cost analysis, charts, methodology docs, and reproduction guides have all landed. The important result is narrower and more useful than the original assumption: self-hosted BF16 was extremely cheap on the A5000, while AWQ-INT4 was a quality-retention success but not a throughput win on this setup.

Questions

What is Forge?

Forge is a completed serving benchmark for self-hosted Llama 3.1 8B. It runs BF16 and AWQ-INT4 variants on vLLM, sweeps concurrency with ShareGPT prompts, evaluates quality with lm-evaluation-harness, and converts the measured throughput into cost per million tokens.

What did the paid run prove?

BF16 was the winner on this A5000/vLLM setup: 2,169 total tokens per second and $0.035 per 1M tokens. AWQ-INT4 retained 99.4% mean quality but was slower, at 1,017 total tokens per second and $0.074 per 1M tokens.

Why is the AWQ result useful if it was slower?

It is a useful negative result. AWQ-INT4 kept quality close to BF16, but did not produce the expected throughput or cost win on the measured RunPod RTX A5000 configuration. The page reports that directly instead of turning quantization into a generic speedup claim.

How is the methodology kept defensible?

Hardware, model IDs, vLLM configuration, concurrency levels, prompt counts, raw benchmark JSON, raw eval JSON, and chart JSON are committed in the Forge repository. The full pipeline is rehearsed locally on M1 before paid GPU time is used.