Forge
In development
Personal project: a serving + benchmarking artifact for self-hosted Llama 3.1 8B (AWQ-INT4 on vLLM). Answers "should I self-host an OSS LLM instead of paying a commercial API" with a reproducible methodology.
What it is
A focused engineering artifact, not a SaaS. The shipped pieces:
- Serving: env-driven vLLM config with continuous batching, KV cache, and an OpenAI-compatible streaming API. Native Prometheus metrics are exported and scraped into an auto-provisioned Grafana dashboard.
- Quantization: AWQ-INT4 recipe documented in forge/quantization/awq.py against meta-llama/Llama-3.1-8B-Instruct, served with vLLM’s Marlin kernels.
- Benchmark harness: wraps vllm bench serve with a ShareGPT trace, sweeps concurrency 1/4/16/32/64 (256 prompts per level), and writes structured JSON to results/bench/ with the GPU, vLLM version, and model SHA in each record’s metadata.
- Quality eval: lm-evaluation-harness with the local-completions model type pointed at the running vLLM endpoint. Tasks: MMLU (5-shot, acc), GSM8K (5-shot, exact_match), HellaSwag (5-shot, acc_norm). Retention math lives in forge/eval/.
- Cost model: $/1M tokens = gpu_hourly_usd * 1e6 / (3600 * sustained_throughput * utilization). Headline number uses peak throughput at utilization=1.0; the cost-comparison JSON next to the chart reports the 80%-utilization sensitivity.
- Chart pipeline: five canonical charts (throughput vs concurrency, TTFT vs concurrency, TPOT vs concurrency, cost per 1M tokens, quantization quality retention), regenerated from results/ with one make chart invocation.
- CI: GitHub Actions runs Ruff + mypy (strict) + pytest on every PR. No GPU jobs run in CI; benchmark reproduction is a manual, methodology-driven step on RunPod.
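The cost-model formula in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the Forge repository: the function name and argument validation are assumptions, and the 1,000 tokens/s throughput is a placeholder rather than a measured result. Only the formula itself and the $0.27/hr A5000 price come from this page.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float,
    sustained_throughput: float,
    utilization: float = 1.0,
) -> float:
    """Return USD per 1M generated tokens.

    sustained_throughput is in tokens per second; utilization is the
    fraction of each rented hour spent serving traffic, in (0, 1].
    """
    if gpu_hourly_usd < 0:
        raise ValueError("gpu_hourly_usd must be non-negative")
    if sustained_throughput <= 0:
        raise ValueError("sustained_throughput must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hourly_usd * 1e6 / (3600 * sustained_throughput * utilization)


if __name__ == "__main__":
    # 1000 tokens/s is a placeholder, NOT a measured Forge result.
    placeholder_tps = 1000.0
    headline = cost_per_million_tokens(0.27, placeholder_tps)
    at_80_percent = cost_per_million_tokens(0.27, placeholder_tps, utilization=0.8)
    print(f"headline: ${headline:.4f} per 1M tokens")
    print(f"80% utilization: ${at_80_percent:.4f} per 1M tokens")
```

Because utilization sits in the denominator, dropping from 1.0 to 0.8 at a fixed throughput raises the cost by a factor of 1/0.8 = 1.25, which is why the sensitivity figure is reported next to the headline number.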
Why I built it
To prove I can serve and optimize an open-source LLM end-to-end — not just call a hosted API. The interesting question isn’t “can vLLM run Llama 3.1 8B” (yes, the docs cover that) — it’s “what does the defensible $/1M-tokens number look like once you account for utilization, p99 latency, and quality regression from quantization?” That’s what Forge measures, and it traces every number back to a JSON file you can inspect.
How it works
Reproducibility as a first-class constraint
Methodology, hardware, model SHAs, exact vLLM version, AWQ recipe, and pricing-source dates are committed in docs/methodology.md and the per-run metadata. Anyone with the methodology doc and a RunPod account can reproduce the numbers — that’s the point of the deliverable. Versions are pinned in uv.lock, with the GPU-coupled dependencies (vLLM, lm-eval, transformers) pinned separately in constraints/serve.txt and constraints/eval.txt so the CI image can install them without a GPU.
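As an illustration of what per-run metadata could look like, here is a small sketch that stamps a benchmark result with the provenance fields named above. The class name, field names, and helper are hypothetical and are not taken from the Forge codebase; the version and SHA strings in the usage example are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Provenance fields attached to every result record (hypothetical schema)."""

    gpu: str
    vllm_version: str
    model_sha: str
    run_date: str  # ISO 8601 date string, e.g. "2025-01-31"


def attach_metadata(result: dict, meta: RunMetadata) -> dict:
    """Return a copy of `result` with a `metadata` block; reject empty fields."""
    fields = asdict(meta)
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"metadata fields must not be empty: {missing}")
    return {**result, "metadata": fields}


if __name__ == "__main__":
    # Placeholder values only; real runs would record the actual version and SHA.
    meta = RunMetadata(
        gpu="RTX A5000 24 GB",
        vllm_version="<vllm-version>",
        model_sha="<model-sha>",
        run_date="2025-01-31",
    )
    record = attach_metadata({"concurrency": 16}, meta)
    print(json.dumps(record, indent=2))
```

Refusing to write a record with an empty provenance field is one way to keep every number traceable; a missing SHA fails loudly at write time rather than surfacing later as an unattributable data point.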
M1 rehearsal gate
The same shell script that runs the paid benchmark on RunPod runs in --rehearsal mode on a base-M1 MacBook against Qwen/Qwen2.5-0.5B-Instruct. The rehearsal must pass before any paid GPU is rented. Config typos, parser bugs, and chart-pipeline regressions cost $0 on M1 instead of $0.27/hr on an A5000.
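The gate logic can be pictured roughly as follows. Forge's actual entry point is a shell script, so this Python version is only an illustration of the idea: the function names, the `rehearsal_passed` argument, and the returned dictionary are assumptions. The model IDs, the `--rehearsal` flag, and the $0.27/hr price are taken from this page.

```python
import argparse

REHEARSAL_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
PAID_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
PAID_GPU_HOURLY_USD = 0.27


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Benchmark entry point (sketch)")
    parser.add_argument(
        "--rehearsal",
        action="store_true",
        help="run the full pipeline locally against the tiny model",
    )
    return parser.parse_args(argv)


def plan_run(argv: list[str], rehearsal_passed: bool) -> dict:
    """Pick the model for this run and refuse a paid run without a green rehearsal."""
    args = parse_args(argv)
    if args.rehearsal:
        return {"mode": "rehearsal", "model": REHEARSAL_MODEL, "gpu_hourly_usd": 0.0}
    if not rehearsal_passed:
        raise RuntimeError("rehearsal has not passed; refusing to start a paid run")
    return {"mode": "paid", "model": PAID_MODEL, "gpu_hourly_usd": PAID_GPU_HOURLY_USD}


if __name__ == "__main__":
    print(plan_run(["--rehearsal"], rehearsal_passed=False))
    print(plan_run([], rehearsal_passed=True))
```

Both modes go through the same code path and differ only in the model and the price, which is what lets a local rehearsal exercise the same parsers and chart pipeline as the paid run.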
Strategic test coverage
Tests cover the load-bearing utilities — cost model, results parsers, chart-data shaping, config validators — not LLM outputs. The LLM is the system under test; the tests verify the harness around it.
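To make the distinction concrete, here is a sketch of the kind of harness-level test this describes: a chart-data shaping helper plus two checks on it. The helper, the `output_throughput` field name, and the numbers are all hypothetical and synthetic. Nothing here asserts anything about model output.

```python
def shape_throughput_series(records: list[dict]) -> tuple[list[int], list[float]]:
    """Turn result records into (x, y) series sorted by concurrency level."""
    ordered = sorted(records, key=lambda record: record["concurrency"])
    xs = [record["concurrency"] for record in ordered]
    if len(set(xs)) != len(xs):
        raise ValueError("duplicate concurrency level in results")
    ys = [float(record["output_throughput"]) for record in ordered]
    return xs, ys


def test_series_is_sorted_by_concurrency() -> None:
    # Synthetic records, deliberately out of order.
    records = [
        {"concurrency": 16, "output_throughput": 30},
        {"concurrency": 1, "output_throughput": 10},
        {"concurrency": 4, "output_throughput": 20},
    ]
    xs, ys = shape_throughput_series(records)
    assert xs == [1, 4, 16]
    assert ys == [10.0, 20.0, 30.0]


def test_duplicate_levels_are_rejected() -> None:
    records = [
        {"concurrency": 4, "output_throughput": 20},
        {"concurrency": 4, "output_throughput": 21},
    ]
    try:
        shape_throughput_series(records)
    except ValueError:
        return
    raise AssertionError("expected ValueError for duplicate concurrency levels")


if __name__ == "__main__":
    test_series_is_sorted_by_concurrency()
    test_duplicate_levels_are_rejected()
    print("ok")
```

Tests like these run in seconds on a CPU-only CI runner, which fits the no-GPU-in-CI constraint described earlier.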
Status
The serving config, quantization recipe, benchmark harness, quality eval, cost model, and chart pipeline have landed and been rehearsed end-to-end on M1 against a tiny model. The paid RunPod RTX A5000 run that replaces every illustrative number with a measured one is queued; the final case study writeup follows once those numbers land.
Questions
What is Forge?
Forge answers one question rigorously — should I self-host an open-source LLM instead of paying a commercial API? The deliverable is a reproducible methodology, a benchmark harness on top of vllm bench serve, a quality eval via lm-evaluation-harness, a chart pipeline, and a cost model. Every claim traces back to a JSON file in results/ whose metadata names the GPU, the vLLM version, the model SHA, and the date.
Why AWQ-INT4 on vLLM?
AWQ-INT4 with vLLM-native Marlin kernels is the fastest INT4 path on vLLM and retains ~1–2 percentage points more quality than GPTQ on the standard tasks. vLLM gives continuous batching, a KV cache, native Prometheus metrics, and an OpenAI-compatible streaming API — the stack a real product would use, not a synthetic benchmark.
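Quality comparisons like the one above depend on how retention is defined. One plausible definition, shown here as a sketch, reports both the ratio of quantized to baseline score and the difference in percentage points per task. This is an assumption about the shape of the calculation, not the contents of forge/eval/; the function name and the scores in the usage example are placeholders, not measured results.

```python
def retention_report(
    baseline: dict[str, float],
    quantized: dict[str, float],
) -> dict[str, dict[str, float]]:
    """Compare per-task scores given as fractions in [0, 1].

    Returns, for each task, the retention ratio (quantized / baseline)
    and the change in percentage points (negative means a regression).
    """
    report: dict[str, dict[str, float]] = {}
    for task, base_score in baseline.items():
        if task not in quantized:
            raise KeyError(f"no quantized score for task {task!r}")
        if base_score <= 0:
            raise ValueError(f"baseline score for {task!r} must be positive")
        quant_score = quantized[task]
        report[task] = {
            "retention": quant_score / base_score,
            "delta_points": (quant_score - base_score) * 100.0,
        }
    return report


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    baseline = {"mmlu": 0.50, "gsm8k": 0.40, "hellaswag": 0.80}
    quantized = {"mmlu": 0.49, "gsm8k": 0.38, "hellaswag": 0.79}
    for task, row in retention_report(baseline, quantized).items():
        print(f"{task}: {row['retention']:.1%} retained, {row['delta_points']:+.1f} pp")
```

Reporting both numbers avoids a common ambiguity: a 1-point drop on a task with a 40% baseline is a larger relative loss than the same drop on a task with an 80% baseline.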
How is the methodology kept defensible?
Hardware (RunPod RTX A5000 24 GB at $0.27/hr), model SHAs (meta-llama/Llama-3.1-8B-Instruct BF16 baseline and hugging-quants AWQ-INT4 variant), vLLM version, workload (ShareGPT trace at concurrency 1/4/16/32/64, 256 prompts per level), and pricing-source dates are all committed in docs/methodology.md and the results JSON. The full pipeline is rehearsed locally on a base-M1 MacBook against a tiny model (Qwen 2.5 0.5B) before any paid GPU minute, so the harness is debugged on free hardware.
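The workload described above is small enough to write down as data. The sketch below enumerates the sweep as a plan, one entry per concurrency level. The function name and the output filename pattern are hypothetical; only the concurrency levels, the 256 prompts per level, and the results/bench/ directory come from this page. It deliberately stops short of building a vllm bench serve command line, since the exact flags depend on the pinned vLLM version.

```python
from pathlib import PurePosixPath

CONCURRENCY_LEVELS = (1, 4, 16, 32, 64)
PROMPTS_PER_LEVEL = 256


def build_sweep(
    results_dir: str = "results/bench",
    levels: tuple[int, ...] = CONCURRENCY_LEVELS,
    prompts_per_level: int = PROMPTS_PER_LEVEL,
) -> list[dict]:
    """Return one plan entry per concurrency level, in ascending order."""
    return [
        {
            "concurrency": level,
            "num_prompts": prompts_per_level,
            # Hypothetical naming scheme: zero-padded level in the filename.
            "output_path": str(
                PurePosixPath(results_dir) / f"concurrency-{level:03d}.json"
            ),
        }
        for level in sorted(levels)
    ]


if __name__ == "__main__":
    plan = build_sweep()
    for entry in plan:
        print(entry)
    print("total prompts:", sum(entry["num_prompts"] for entry in plan))
```

With five levels at 256 prompts each, a full sweep sends 1,280 requests per model variant, which helps when estimating how long the paid GPU needs to be rented.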
Why the M1 rehearsal gate?
The same shell script that runs the real benchmark on RunPod runs in --rehearsal mode on M1 against the tiny model. That gate must pass before a paid GPU is rented. It catches a config typo, a parser bug, or a chart pipeline regression for free instead of for $0.27/hr.