bytedance/effective_transformer
Optimized inference engine for BERT that dynamically removes and restores padding values to reduce memory and computation waste on variable-length sequences.

Effective Transformer is a CUDA-accelerated inference optimization library built on NVIDIA FasterTransformer. Batching variable-length sequences normally means padding them to a uniform length, so part of the memory and computation is spent on padding tokens. Effective Transformer avoids this by computing a prefix sum over the attention mask, which gives the offset of each valid token and lets computation touch only those tokens. Padding is dynamically removed and restored at different computation stages, which reduces execution time and memory consumption, especially for large batches with highly variable sequence lengths.
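
The prefix-sum idea can be illustrated with a minimal sketch in plain Python. This is not the library's CUDA implementation or its API: the function names (`build_offsets`, `remove_padding`, `restore_padding`) and the toy data are assumptions made for illustration only. It shows how an inclusive prefix sum over a flattened 0/1 attention mask maps each valid token to a slot in a packed buffer, and how the same offsets scatter results back into the padded layout.

```python
from itertools import accumulate


def build_offsets(mask):
    """Return (valid_count, offsets) for a batch of 0/1 attention-mask rows.

    offsets[i] is the index, in the flattened padded batch, of the i-th
    valid token. Hypothetical helper, not part of any real library.
    """
    flat = [m for row in mask for m in row]
    prefix = list(accumulate(flat))  # inclusive prefix sum of the mask
    valid_count = prefix[-1] if prefix else 0
    offsets = [0] * valid_count
    for padded_idx, (m, p) in enumerate(zip(flat, prefix)):
        if m:
            # The prefix sum at a valid token is its 1-based packed position.
            offsets[p - 1] = padded_idx
    return valid_count, offsets


def remove_padding(padded, offsets):
    """Gather only the valid tokens into a dense, packed list."""
    return [padded[i] for i in offsets]


def restore_padding(packed, offsets, padded_len, pad_value=0):
    """Scatter packed values back to their original padded positions."""
    out = [pad_value] * padded_len
    for packed_idx, padded_idx in enumerate(offsets):
        out[padded_idx] = packed[packed_idx]
    return out


# Toy batch: 3 sequences padded to length 4 (0 marks a padding position).
mask = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
tokens = [11, 12, 13, 0, 21, 0, 0, 0, 31, 32, 0, 0]

valid_count, offsets = build_offsets(mask)
packed = remove_padding(tokens, offsets)
restored = restore_padding(packed, offsets, len(tokens))

print(valid_count, offsets)
print(packed)
print(restored == tokens)
```

In this toy batch only 6 of the 12 padded positions hold valid tokens, so any per-token computation run on `packed` does half the work of the padded version. That gap widens as sequence lengths within a batch become more uneven.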