Is service-streamer open source?

Yes — ShannonAI/service-streamer is open source, released under the Apache-2.0 license.

What language is service-streamer written in?

ShannonAI/service-streamer is primarily written in Python.

How popular is service-streamer?

ShannonAI/service-streamer has 1.2k stars on GitHub.

Where can I find service-streamer?

ShannonAI/service-streamer is on GitHub at https://github.com/ShannonAI/service-streamer.

ShannonAI/service-streamer

BERT at 12 rps? This middleware squeezes GPUs to 1,400

A Python middleware that turns discrete web requests into GPU-friendly mini-batches, multiplying inference throughput without rewriting your model.

1.2k stars · Python · Inference · Serving

What it does

Service Streamer sits between your web framework and your deep-learning model. It queues incoming HTTP requests, groups them into mini-batches, and feeds them to GPU workers. The goal is to keep the GPU saturated instead of idling while processing one lonely sentence at a time. It works with any web framework or DL backend.
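The queue-then-batch mechanism described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not service-streamer's own code: the names `MicroBatcher`, `submit`, and `fake_model` are hypothetical, and the `batch_size` parameter is an assumption (only `max_latency` is named on this page).

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Hypothetical sketch: collect single requests into mini-batches."""

    def __init__(self, predict_batch, batch_size=8, max_latency=0.1):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.batch_size = batch_size        # upper bound on items per batch (assumed name)
        self.max_latency = max_latency      # seconds to wait for a batch to fill
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        """Enqueue one request and return a Future for its result."""
        future = Future()
        self._queue.put((item, future))
        return future

    def _run(self):
        while True:
            # Block until at least one request is waiting.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self.max_latency
            # Keep collecting until the batch is full or the deadline passes.
            while len(batch) < self.batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            items = [item for item, _ in batch]
            try:
                outputs = self.predict_batch(items)
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def fake_model(texts):
    """Toy stand-in for a GPU model that processes a whole batch at once."""
    return [text.upper() for text in texts]
```

In a web handler, each request thread would call something like `batcher.submit(text).result()`, so many concurrent requests end up sharing one call to the batch function.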

The interesting bit

The README shows a Flask BERT service jumping from 12.78 requests/sec (naive) to 207.59 with ThreadedStreamer, then 321.70 with the multi-process Streamer, and 372.45 with RedisStreamer, all by adding a few lines of wrapper code. The README also claims an 80× speedup on multiple GPUs (1,000+ sentences/sec), but that figure is not backed by a benchmark in the visible portion.
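To put the quoted single-GPU figures side by side, the short calculation below divides each throughput by the naive baseline. The numbers are the ones cited above; the helper name `speedups` is just for illustration. The ratios come out to roughly 16×, 25×, and 29×, well short of the 80× multi-GPU claim.

```python
# Requests/sec as quoted above from the README's Flask BERT benchmark.
throughput = {
    "naive": 12.78,
    "ThreadedStreamer": 207.59,
    "Streamer": 321.70,
    "RedisStreamer": 372.45,
}


def speedups(results, baseline="naive"):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {name: round(rps / base, 1) for name, rps in results.items()}


for name, factor in speedups(throughput).items():
    print(f"{name}: {factor}x")
```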

Key highlights

ThreadedStreamer for single-GPU batching; Streamer for multi-GPU via spawned workers
ManagedModel class for lazy initialization and pinning workers to specific CUDA devices
RedisStreamer for distributed web servers with CPU preprocessing and shared GPU workers
Future-based async API for non-web use cases
Benchmarks on Titan Xp / CUDA 9.0 / PyTorch 1.1
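Lazy initialization and device pinning, as listed above, can be illustrated with a small sketch. This is an assumption about how such a scheme could work, not the library's ManagedModel implementation: `LazyModel`, `assign_devices`, and `worker_env` are hypothetical names. The only external fact relied on is that CUDA honors the `CUDA_VISIBLE_DEVICES` environment variable when it is set before the process initializes CUDA.

```python
import os


class LazyModel:
    """Hypothetical wrapper: defer loading the model until first use."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds and returns the model
        self._model = None

    def predict(self, batch):
        # Load inside the worker on first call, not in the parent process.
        if self._model is None:
            self._model = self._loader()
        return self._model(batch)


def assign_devices(worker_num, cuda_devices):
    """Round-robin map from worker index to CUDA device id."""
    if not cuda_devices:
        raise ValueError("need at least one device id")
    return {w: cuda_devices[w % len(cuda_devices)] for w in range(worker_num)}


def worker_env(device_id, base_env=None):
    """Environment for a worker process that should see only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return env
```

With eight workers and four devices, `assign_devices(8, (0, 1, 2, 3))` places two workers on each GPU; each worker would then be spawned with the matching `worker_env(...)` and build its model lazily on the first batch.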

Caveats

The README is truncated mid-sentence during the multi-GPU benchmark section, so the 80× multi-GPU figure is stated but not fully substantiated in the visible text
The default max_latency of 0.1 s means a request may wait up to 100 ms for its batch to fill; latency-sensitive apps should tune it down
The project appears quiet; the Travis CI badge and the PyTorch 1.1 benchmarks suggest it may need updates for modern stacks
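The max_latency trade-off can be made concrete with a back-of-the-envelope model. The sketch below assumes a steady arrival rate and a batch that closes either when it is full or when the wait window expires; the function names, the `batch_size` parameter, and the example numbers are all illustrative assumptions, and queueing behind an earlier batch is ignored.

```python
def expected_batch_size(arrival_rate, batch_size, max_latency):
    """Approximate requests per batch under a steady arrival rate (req/s).

    The first request opens the wait window; the others arrive during it.
    """
    collected = 1 + int(arrival_rate * max_latency)
    return min(batch_size, collected)


def worst_case_latency(max_latency, inference_time):
    """Upper bound for one request: full wait window plus one batch inference."""
    return max_latency + inference_time


# Illustrative numbers only: at 5 req/s a 0.1 s window rarely collects more
# than one request, so the wait adds latency without improving batching.
for rate in (5, 200, 2000):
    print(rate, expected_batch_size(rate, batch_size=64, max_latency=0.1))
```

Under this model, low-traffic services pay the full wait for almost no batching benefit, which is the case where lowering max_latency matters most.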

Verdict

Worth a look if you’re running GPU inference behind a Python web server and your GPU utilization is in the basement. Skip it if you’re already on a batched gRPC service or on Triton/TF Serving, which already offer server-side request batching.
