I build high-throughput inference engines, speculative inference pipelines, and autonomous vision systems that run on the edge.
Reduced "Menu Spam" by 45%. A case study in engineering pragmatism: replacing a heavy Kubeflow pipeline with a lightweight Shadow Deployment system.
A high-recall toxicity screener for pharmaceutical auditing. Reached 0.96 recall through fine-grained learning-rate (LR) tuning and was deployed as a distributed Ray Serve API.
A zero-latency voice system designed for consumer hardware. Optimized inference through Speculative Audio Decoding and model quantization.
A physics-informed scoring system for broadcast basketball. Eliminated parallax errors and false positives from net deformation using Geometric Heuristics.
Migrated a fraud-detection system from CPU-bound Python to NVIDIA Triton Inference Server. Reached 1,088 RPS by using the C++ FIL (Forest Inference Library) backend to avoid Python's serialization overhead.
A stress test for the proposed "Real-Time News" feature. Benchmarked vLLM against Hugging Face on a Google Colab T4 GPU to check whether a single GPU, using vLLM's PagedAttention, could keep up with the global news cycle.