BERT at 12 rps? This middleware squeezes GPUs to 1,400
A Python middleware that turns discrete web requests into GPU-friendly mini-batches, multiplying inference throughput without rewriting your model.

What it does
Service Streamer sits between your web framework and your deep-learning model. It queues incoming HTTP requests, groups them into mini-batches, and feeds them to GPU workers. The goal is to keep the GPU saturated instead of idling while processing one lonely sentence at a time. It works with any web framework or DL backend.
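The queue-then-batch idea can be sketched with the standard library alone. This is a minimal illustration, not Service Streamer's actual implementation: `MicroBatcher`, `fake_model`, and the `batch_sizes` bookkeeping are hypothetical names invented for this example, and the model is a stand-in function that upper-cases strings.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Hypothetical sketch of request batching; not the library's code."""

    def __init__(self, predict_fn, batch_size=4, max_latency=0.1):
        self.predict_fn = predict_fn      # takes a list, returns a list
        self.batch_size = batch_size
        self.max_latency = max_latency    # seconds to wait for a batch to fill
        self.requests = queue.Queue()
        self.batch_sizes = []             # recorded only for inspection
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, item):
        """Enqueue one request and return a Future for its result."""
        future = Future()
        self.requests.put((item, future))
        return future

    def _worker(self):
        while True:
            # Block until the first request of the next batch arrives.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.max_latency
            # Keep collecting until the batch is full or the deadline passes.
            while len(batch) < self.batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            items = [item for item, _ in batch]
            self.batch_sizes.append(len(items))
            try:
                outputs = self.predict_fn(items)
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def fake_model(batch):
    # Stand-in for a real batched model call.
    return [text.upper() for text in batch]


batcher = MicroBatcher(fake_model, batch_size=4, max_latency=0.1)
futures = [batcher.submit(s) for s in ["a", "b", "c", "d", "e"]]
results = [f.result(timeout=5) for f in futures]
```

Each caller still sees a one-request-in, one-result-out interface; only the worker thread knows that several requests shared a single model call.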
The interesting bit
The README shows a Flask BERT service jumping from 12.78 requests/sec (naive) to 207.59 with ThreadedStreamer, then 321.70 with multi-process Streamer, and 372.45 with RedisStreamer—all by adding a few lines of wrapper code. The 80× claim for multi-GPU (1,000+ sentences/sec) is mentioned but not benchmarked in the visible portion.
Key highlights
- ThreadedStreamer for single-GPU batching; Streamer for multi-GPU via spawned workers
- ManagedModel class for lazy initialization and pinning workers to specific CUDA devices
- RedisStreamer for distributed web servers with CPU preprocessing and shared GPU workers
- Future-based async API for non-web use cases
- Benchmarks on Titan Xp / CUDA 9.0 / PyTorch 1.1
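To make the lazy-initialization and device-pinning highlight concrete, here is a rough sketch of one common way to do it: assign workers to GPUs round-robin and set `CUDA_VISIBLE_DEVICES` inside the worker before the model loads. The names `assign_devices`, `LazyPinnedModel`, and `loader` are hypothetical and chosen for illustration; the library's ManagedModel may work differently.

```python
import os


def assign_devices(worker_num, cuda_devices):
    """Map each worker index to a CUDA device id, round-robin."""
    return [cuda_devices[i % len(cuda_devices)] for i in range(worker_num)]


class LazyPinnedModel:
    """Hypothetical wrapper: loads the model on first use, on one device."""

    def __init__(self, gpu_id, loader):
        self.gpu_id = gpu_id
        self.loader = loader   # zero-argument callable that builds the model
        self.model = None      # nothing is loaded until the first predict()

    def init_model(self):
        # Must run in the worker process before any CUDA context exists,
        # otherwise the framework has already picked its devices.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        self.model = self.loader()

    def predict(self, batch):
        if self.model is None:
            self.init_model()
        return self.model(batch)


# Eight workers spread over four GPUs.
devices = assign_devices(8, (0, 1, 2, 3))

# A stand-in "model" that returns string lengths, so no GPU is needed here.
worker = LazyPinnedModel(devices[5], lambda: (lambda batch: [len(x) for x in batch]))
lengths = worker.predict(["hi", "hello"])
```

Deferring the load matters with spawned workers: the parent process never holds model weights or a CUDA context, so each child initializes cleanly on its own device.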
Caveats
- The README is truncated mid-sentence during the multi-GPU benchmark section, so the 80× multi-GPU figure is stated but not fully substantiated in the visible text
- Default max_latency of 0.1s means requests may wait up to 100ms to fill a batch; real-time apps need tuning
- The project appears quiet; last visible Travis badge and PyTorch 1.1 suggest it may need updates for modern stacks
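The latency caveat is easier to reason about with some back-of-the-envelope arithmetic. The sketch below assumes evenly spaced arrivals and a batch size of 64, both simplifications chosen for illustration; `batch_wait` is a hypothetical helper, not part of the library.

```python
import math


def batch_wait(arrival_rate, batch_size=64, max_latency=0.1):
    """Estimate the first request's queueing delay and the resulting batch size.

    Assumes requests arrive evenly spaced at `arrival_rate` per second.
    """
    # Time for the remaining batch_size - 1 requests to arrive.
    fill_time = (batch_size - 1) / arrival_rate
    if fill_time <= max_latency:
        # Traffic is heavy enough to fill the batch before the deadline.
        return fill_time, batch_size
    # Otherwise the deadline fires first and a partial batch is sent.
    arrived = 1 + math.floor(arrival_rate * max_latency)
    return max_latency, min(batch_size, arrived)


low_wait, low_batch = batch_wait(arrival_rate=20)
high_wait, high_batch = batch_wait(arrival_rate=2000)
```

Under these assumptions, light traffic pays the full `max_latency` for a tiny batch, while heavy traffic fills the batch well before the deadline. That asymmetry is why the setting deserves tuning for low-traffic, latency-sensitive services.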
Verdict
Worth a look if you’re running GPU inference behind a Python web server and your GPU utilization is in the basement. Skip it if you’re already on a batched gRPC service or Triton/TF Serving.