vLLM Feasibility Study: 35x Throughput on Edge Hardware
Infrastructure validation for local inference: Can we run high-throughput batch classification on a consumer laptop (RTX 4060)?
OPTIMIZED RESULTS (RTX 4060 Laptop)
- > Throughput: 49.66 requests/second
- > Improvement: 35.2x vs Baseline (1.41 → 49.66 RPS)
- > Latency: Sub-20ms per token (Amortized)
- > Conclusion: Validated architectural viability of air-gapped inference.
Research Process
01 // The Hypothesis
Deploying Large Language Models usually requires expensive H100 clusters. I wanted to verify whether edge hardware (a consumer RTX 4060 Laptop with 8GB VRAM) could support high-throughput text classification workloads by leveraging modern quantization and paging techniques.
"Can we process a 30,000+ item daily workload locally on 8GB VRAM without hitting OOM errors or unacceptable latency?"
02 // Baseline: The Memory Tax
I first benchmarked the industry-standard bitsandbytes (NF4) quantization pipeline. While NF4 is memory-efficient for *loading* the weights, the standard attention implementation processes requests serially and locks up contiguous VRAM blocks for each request's KV cache.
BASELINE RESULTS (BitsAndBytes NF4)
- > Throughput: 1.41 requests/second
- > Bottleneck: OS Display Buffer.
On a laptop, the OS consumes ~1.2GB for the display. This left only ~6.8GB available. The baseline approach fragmented this remaining space, preventing effective batching.
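For reference, the sketch below shows the standard transformers + bitsandbytes NF4 loading path this baseline describes. It is a minimal illustration, not the exact benchmark harness; the model ID and prompt are placeholders I chose for the example.

# Baseline: serial 4-bit (NF4) inference via transformers + bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# One request at a time: each generate() call holds its own contiguous KV cache
inputs = tokenizer("Classify the sentiment of: 'The update broke my workflow.'",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))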
03 // Optimization: vLLM + AWQ
To solve the memory fragmentation, I migrated the pipeline to vLLM to utilize PagedAttention. This allows the KV cache to live in non-contiguous memory blocks, filling the "gaps" left by the OS overhead.
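To put rough numbers on the paging, here is a back-of-envelope sketch of the KV-cache footprint. The layer/head constants are taken from the Llama-3-8B config and the 16-token block size is vLLM's default; both are my assumptions for illustration, not figures from the benchmark.

# Rough KV-cache footprint for Llama-3-8B in fp16 (assumed config values)
N_LAYERS   = 32     # transformer layers
N_KV_HEADS = 8      # GQA: 8 key/value heads
HEAD_DIM   = 128
BYTES_FP16 = 2
BLOCK_SIZE = 16     # vLLM default tokens per KV block

kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16  # K + V
kv_bytes_per_block = kv_bytes_per_token * BLOCK_SIZE

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")       # ~128 KiB
print(f"KV cache per block: {kv_bytes_per_block / 1024**2:.0f} MiB")    # ~2 MiB

Since classification prompts are short, each request occupies only a few of these ~2 MiB blocks, which is what lets many requests share the small slice of VRAM left over after the weights.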
I paired this with AWQ 4-bit Quantization to lower the model weights to 5.34GB. The engineering challenge was tuning the gpu_memory_utilization parameter to find the exact limit before the OS killed the process.
# The "Magic Numbers" for RTX 4060 Stability VLLM_GPU_UTIL = 0.86 # 0.87 crashed the OS. 0.85 crashed the model. VLLM_CONTEXT = 1024 # Truncated context to fit KV cache in remaining 220MB llm = LLM( model="casperhansen/llama-3-8b-instruct-awq", quantization="awq", dtype="float16", gpu_memory_utilization=VLLM_GPU_UTIL, max_model_len=VLLM_CONTEXT, enforce_eager=True )
04 // Final Scorecard
The comparative analysis validated that high-performance AI is feasible on consumer hardware if memory is managed at the page level.
================================================================
FINAL SCORECARD
================================================================
BASELINE (BitsAndBytes): 1.41 req/s
OPTIMIZED (vLLM AWQ): 49.66 req/s
----------------------------------------------------------------
>>> SPEEDUP FACTOR: 35.2x
================================================================
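In practical terms, the measured throughputs translate to the following wall-clock times for the hypothesized 30,000-item daily workload (simple arithmetic on the numbers above):

DAILY_ITEMS   = 30_000
BASELINE_RPS  = 1.41
OPTIMIZED_RPS = 49.66

print(f"Baseline:  {DAILY_ITEMS / BASELINE_RPS / 3600:.1f} hours")    # ~5.9 hours
print(f"Optimized: {DAILY_ITEMS / OPTIMIZED_RPS / 60:.1f} minutes")   # ~10.1 minutes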
Study Parameters
- > Hardware: NVIDIA GeForce RTX 4060 Laptop (8GB)
- > Model: Meta-Llama-3-8B-Instruct (AWQ 4-bit)
- > Optimization: vLLM Kernel + PagedAttention