meizhong986/WhisperJAV
Speech-to-text pipeline combining Whisper, Qwen3-ASR, TEN-VAD, and speech enhancement models to generate subtitles for noisy Japanese audio.

WhisperJAV is an automated subtitle generator for Japanese Adult Video content, whose audio poses challenges such as non-verbal vocalizations, a low signal-to-noise ratio, and extreme audio dynamics, all of which degrade standard ASR performance. The pipeline chains four stages: TEN-VAD for voice activity detection, Zipformer for speech enhancement, Whisper for initial transcription, and Qwen3-ASR as a local LLM for hallucination correction. It runs on Google Colab, on Kaggle, or locally, with GGUF/MLX quantization support.
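
As a rough illustration of the chaining described above, the sketch below wires four stages together in the order the paragraph lists them and writes the result as SRT text. It is not WhisperJAV's actual code: `Segment`, `run_pipeline`, `format_timestamp`, `to_srt`, and the four stage callables (`vad`, `enhance`, `transcribe`, `correct`) are hypothetical names chosen for this example, and the real project may order or combine the stages differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    """One subtitle cue: start/end in seconds plus its text."""
    start: float
    end: float
    text: str


# Hypothetical stage signatures; real models would sit behind these callables.
VadFn = Callable[[Sequence[float], int], List[Tuple[float, float]]]
EnhanceFn = Callable[[Sequence[float], int], Sequence[float]]
TranscribeFn = Callable[[Sequence[float], int], str]
CorrectFn = Callable[[str], str]


def run_pipeline(
    samples: Sequence[float],
    sample_rate: int,
    vad: VadFn,
    enhance: EnhanceFn,
    transcribe: TranscribeFn,
    correct: CorrectFn,
) -> List[Segment]:
    """Run VAD -> enhancement -> transcription -> correction per speech region."""
    segments: List[Segment] = []
    for start, end in vad(samples, sample_rate):
        # Slice out only the region the VAD marked as speech.
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        cleaned = enhance(chunk, sample_rate)
        draft = transcribe(cleaned, sample_rate)
        final = correct(draft).strip()
        # Drop regions whose text is empty after correction.
        if final:
            segments.append(Segment(start, end, final))
    return segments


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Serialize segments as SRT cues separated by blank lines."""
    cues = [
        f"{index}\n{format_timestamp(seg.start)} --> "
        f"{format_timestamp(seg.end)}\n{seg.text}\n"
        for index, seg in enumerate(segments, start=1)
    ]
    return "\n".join(cues)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model library: each one can be swapped for a stub when testing the control flow, or for a wrapper around a real model when running on audio.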