vLLM Feasibility Study: 35x Throughput on Edge Hardware
Infrastructure validation for local inference: Can we run high-throughput batch classification on a consumer laptop (RTX 4060)?
OPTIMIZED RESULTS (RTX 4060 Laptop)
- > Throughput: 49.66 requests/second
- > Improvement: 35.2x vs Baseline (1.41 → 49.66 RPS)
- > Latency: Sub-20ms per token (Amortized)
- > Conclusion: Validated architectural viability of air-gapped inference.
Research Process
01 // The Hypothesis
Deploying Large Language Models usually requires expensive H100 clusters. I wanted to verify whether edge hardware (a consumer RTX 4060 Laptop with 8GB VRAM) could support high-throughput text classification workloads by leveraging modern quantization and paging techniques.
"Can we process a 30,000+ item daily workload locally on 8GB VRAM without hitting OOM errors or unacceptable latency?"
02 // Baseline: The Memory Tax
I first benchmarked the industry-standard bitsandbytes (NF4) quantization pipeline. While NF4 is memory-efficient for *loading* the weights, the standard attention implementation processes requests serially and locks up contiguous VRAM blocks for each request's KV cache.
BASELINE RESULTS (BitsAndBytes NF4)
- > Throughput: 1.41 requests/second
- > Bottleneck: OS Display Buffer.
On a laptop, the OS consumes ~1.2GB for the display. This left only ~6.8GB available. The baseline approach fragmented this remaining space, preventing effective batching.
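For reference, the sketch below shows the standard transformers + bitsandbytes NF4 loading path this baseline describes. It is a minimal illustration, not the exact benchmark harness; the model ID and prompt are placeholders I chose for the example.

# Baseline: serial 4-bit (NF4) inference via transformers + bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# One request at a time: each generate() call holds its own contiguous KV cache
inputs = tokenizer("Classify the sentiment of: 'The update broke my workflow.'",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))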
03 // Optimization: vLLM + AWQ
To solve the memory fragmentation, I migrated the pipeline to vLLM to utilize PagedAttention. This allows the KV cache to live in non-contiguous memory blocks, filling the "gaps" left by the OS overhead.
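To put rough numbers on the paging, here is a back-of-envelope sketch of the KV-cache footprint. The layer/head constants are taken from the Llama-3-8B config and the 16-token block size is vLLM's default; both are my assumptions for illustration, not figures from the benchmark.

# Rough KV-cache footprint for Llama-3-8B in fp16 (assumed config values)
N_LAYERS   = 32     # transformer layers
N_KV_HEADS = 8      # GQA: 8 key/value heads
HEAD_DIM   = 128
BYTES_FP16 = 2
BLOCK_SIZE = 16     # vLLM default tokens per KV block

kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16  # K + V
kv_bytes_per_block = kv_bytes_per_token * BLOCK_SIZE

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")       # ~128 KiB
print(f"KV cache per block: {kv_bytes_per_block / 1024**2:.0f} MiB")    # ~2 MiB

Since classification prompts are short, each request occupies only a few of these ~2 MiB blocks, which is what lets many requests share the small slice of VRAM left over after the weights.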
I paired this with AWQ 4-bit Quantization to lower the model weights to 5.34GB. The engineering challenge was tuning the gpu_memory_utilization parameter to find the exact limit before the OS killed the process.
# The "Magic Numbers" for RTX 4060 Stability VLLM_GPU_UTIL = 0.86 # 0.87 crashed the OS. 0.85 crashed the model. VLLM_CONTEXT = 1024 # Truncated context to fit KV cache in remaining 220MB llm = LLM( model="casperhansen/llama-3-8b-instruct-awq", quantization="awq", dtype="float16", gpu_memory_utilization=VLLM_GPU_UTIL, max_model_len=VLLM_CONTEXT, enforce_eager=True )
04 // Final Scorecard
The comparative analysis validated that high-performance AI is feasible on consumer hardware if memory is managed at the page level.
================================================================
FINAL SCORECARD
================================================================
BASELINE (BitsAndBytes): 1.41 req/s
OPTIMIZED (vLLM AWQ): 49.66 req/s
----------------------------------------------------------------
>>> SPEEDUP FACTOR: 35.2x
================================================================
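In practical terms, the measured throughputs translate to the following wall-clock times for the hypothesized 30,000-item daily workload (simple arithmetic on the numbers above):

DAILY_ITEMS   = 30_000
BASELINE_RPS  = 1.41
OPTIMIZED_RPS = 49.66

print(f"Baseline:  {DAILY_ITEMS / BASELINE_RPS / 3600:.1f} hours")    # ~5.9 hours
print(f"Optimized: {DAILY_ITEMS / OPTIMIZED_RPS / 60:.1f} minutes")   # ~10.1 minutes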
Study Parameters
- > Hardware: NVIDIA GeForce RTX 4060 Laptop (8GB)
- > Model: Meta-Llama-3-8B-Instruct (AWQ 4-bit)
- > Optimization: vLLM Kernel + PagedAttention