ShannonAI/service-streamer

BERT at 12 rps? This middleware squeezes GPUs to 1,400

A Python middleware that turns discrete web requests into GPU-friendly mini-batches, multiplying inference throughput without rewriting your model.

1.2k stars Python Inference · Serving

What it does

Service Streamer sits between your web framework and your deep-learning model. It queues incoming HTTP requests, groups them into mini-batches, and feeds them to GPU workers. The goal is to keep the GPU saturated instead of idling while processing one lonely sentence at a time. It works with any web framework or DL backend.
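The queue-then-batch idea is easy to see in miniature. What follows is a toy sketch using only the standard library, not service-streamer's actual implementation: `MicroBatcher`, `fake_model`, and the `batch_sizes` list are hypothetical names made up for illustration, and the parameter names `batch_size` and `max_latency` are chosen to echo the settings this page mentions.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Toy request batcher: groups single inputs into batches for one worker."""

    def __init__(self, predict_batch, batch_size=8, max_latency=0.1):
        self._predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self._batch_size = batch_size
        self._max_latency = max_latency      # seconds to wait for a batch to fill
        self._queue = queue.Queue()
        self.batch_sizes = []                # recorded for inspection only
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, item):
        """Enqueue one input and return a Future for its output."""
        future = Future()
        self._queue.put((item, future))
        return future

    def predict(self, item):
        """Blocking wrapper, the way a web handler would call it."""
        return self.submit(item).result()

    def _worker(self):
        while True:
            pending = [self._queue.get()]    # block until the first request arrives
            deadline = time.monotonic() + self._max_latency
            while len(pending) < self._batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                    # latency budget spent: run a partial batch
                try:
                    pending.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            items = [item for item, _ in pending]
            self.batch_sizes.append(len(items))
            try:
                outputs = self._predict_batch(items)
                for (_, future), output in zip(pending, outputs):
                    future.set_result(output)
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for _, future in pending:
                    future.set_exception(exc)


def fake_model(batch):
    # Stand-in for a batched GPU forward pass.
    return [text.upper() for text in batch]


batcher = MicroBatcher(fake_model, batch_size=4, max_latency=0.05)
futures = [batcher.submit(s) for s in ["a", "b", "c", "d", "e"]]
results = [f.result(timeout=2) for f in futures]
```

Each caller still sees a one-in, one-out call; only the worker sees batches. The trade is explicit in the loop: a request waits at most `max_latency` for company before the batch runs.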

The interesting bit

The README shows a Flask BERT service jumping from 12.78 requests/sec (naive) to 207.59 with ThreadedStreamer, then 321.70 with the multi-process Streamer, and 372.45 with RedisStreamer, all by adding a few lines of wrapper code. The claimed 80× multi-GPU speedup (1,000+ sentences/sec) is stated but not benchmarked in the visible portion of the README.
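For scale, the quoted figures can be turned into speedup ratios with a few lines of arithmetic. This uses only the numbers above; treating the 80× claim as relative to the same 12.78 requests/sec baseline, with one sentence per request, is an assumption.

```python
# Throughput figures quoted above, in requests/sec.
baseline = 12.78
quoted = {
    "ThreadedStreamer": 207.59,
    "Streamer": 321.70,
    "RedisStreamer": 372.45,
}

# Speedup of each variant over the naive service.
speedups = {name: round(rps / baseline, 1) for name, rps in quoted.items()}
print(speedups)
# {'ThreadedStreamer': 16.2, 'Streamer': 25.2, 'RedisStreamer': 29.1}

# Throughput implied by an 80x speedup over the same baseline (assumption).
implied_80x = round(baseline * 80, 1)
print(implied_80x)  # 1022.4
```

So the benchmarked single-GPU variants land at roughly 16× to 29×, and an 80× gain over the same baseline would be about 1,022 per second, which is at least consistent with the "1,000+" figure.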

Key highlights

  • ThreadedStreamer for single-GPU batching; Streamer for multi-GPU via spawned workers
  • ManagedModel class for lazy initialization and pinning workers to specific CUDA devices
  • RedisStreamer for distributed web servers with CPU preprocessing and shared GPU workers
  • Future-based async API for non-web use cases
  • Benchmarks on Titan Xp / CUDA 9.0 / PyTorch 1.1
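The lazy-initialization and device-pinning pattern from the list above can be sketched generically. This is an illustration of the pattern, not the library's ManagedModel: `LazyPinnedModel` and `assign_devices` are hypothetical names, and using the `CUDA_VISIBLE_DEVICES` environment variable as the pinning mechanism is an assumption about how such pinning is commonly done.

```python
import os


class LazyPinnedModel:
    """Hypothetical wrapper: defer model construction until first use."""

    def __init__(self, gpu_id=None):
        self.gpu_id = gpu_id
        self.model = None  # nothing is loaded when the object is created

    def init_model(self):
        # Restrict this process to one device; this has to happen before
        # any CUDA library initializes in the process to take effect.
        if self.gpu_id is not None:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        # Stand-in for loading real weights onto the device.
        self.model = lambda batch: [len(text) for text in batch]

    def predict(self, batch):
        if self.model is None:
            self.init_model()  # first call pays the loading cost
        return self.model(batch)


def assign_devices(worker_num, gpu_ids):
    """Round-robin mapping from worker index to GPU id."""
    return [gpu_ids[i % len(gpu_ids)] for i in range(worker_num)]


device_plan = assign_devices(4, [0, 1])
worker_model = LazyPinnedModel(gpu_id=device_plan[1])
lengths = worker_model.predict(["hello", "gpu"])
```

Deferring construction matters with spawned workers: the parent process only passes around a cheap, unloaded object, and each worker loads its own copy after its device has been chosen.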

Caveats

  • The README is truncated mid-sentence during the multi-GPU benchmark section, so the 80× multi-GPU figure is stated but not fully substantiated in the visible text
  • Default max_latency of 0.1 s means a request may wait up to 100 ms for its batch to fill; latency-sensitive apps will need to tune it
  • The project appears quiet; the Travis CI badge and the PyTorch 1.1 benchmark environment suggest it may need updates for modern stacks

Verdict

Worth a look if you’re running GPU inference behind a Python web server and your GPU utilization is in the basement. Skip it if you’re already on a batched gRPC service or Triton/TF Serving.
