ByteDance-Seed/m3-agent
A multimodal agent framework from ByteDance that processes visual and auditory inputs, builds episodic and semantic long-term memory, and performs autonomous multi-turn reasoning.

M3-Agent is a multimodal agent system that processes real-time visual and auditory inputs to build and update long-term memory. It combines episodic memory, which records what the agent has experienced, with semantic memory, which accumulates general world knowledge. The agent organizes memory in an entity-centric multimodal format and reasons iteratively over information retrieved from that memory. The repository includes M3-Bench, a benchmark of roughly 1,000 real-world and web videos for evaluating memory effectiveness and multimodal reasoning in agents.
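
To make the memory design concrete, here is a minimal Python sketch of an entity-centric store that holds both episodic and semantic entries, plus a toy multi-turn retrieval loop. It is not taken from the repository: `MemoryEntry`, `EntityMemory`, and `gather_evidence` are hypothetical names, memory is plain text rather than multimodal, and word overlap stands in for whatever retrieval the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    kind: str            # "episodic" (an observed event) or "semantic" (a general fact)
    entity: str          # hypothetical entity id that the entry is attached to
    text: str            # plain-text stand-in for multimodal content
    timestamp: float = 0.0


@dataclass
class EntityMemory:
    # Maps entity id -> list of entries about that entity.
    entries: dict = field(default_factory=dict)

    def add(self, entry):
        self.entries.setdefault(entry.entity, []).append(entry)

    def search(self, query):
        """Return entries ranked by naive word overlap with the query."""
        words = set(query.lower().split())
        scored = []
        for items in self.entries.values():
            for entry in items:
                overlap = len(words & set(entry.text.lower().split()))
                if overlap > 0:
                    scored.append((overlap, entry))
        # Sort on the score only; ties keep insertion order (sort is stable).
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored]


def gather_evidence(memory, question, max_turns=3):
    """Toy iterative loop: retrieve, keep the best new entry, refine the query."""
    evidence = []
    query = question
    for _ in range(max_turns):
        new_hits = [e for e in memory.search(query) if e not in evidence]
        if not new_hits:
            break
        best = new_hits[0]
        evidence.append(best)
        # Next turn also looks for other knowledge about the same entity.
        query = question + " " + best.entity
    return evidence


memory = EntityMemory()
memory.add(MemoryEntry("episodic", "alice", "alice put keys on kitchen table", 12.0))
memory.add(MemoryEntry("semantic", "alice", "alice drinks coffee every morning"))
memory.add(MemoryEntry("episodic", "bob", "bob opened a window", 30.0))

evidence = gather_evidence(memory, "where are keys")
for entry in evidence:
    print(entry.kind, "|", entry.text)
```

In this sketch the first turn finds the episodic event that mentions the keys, and the second turn follows the entity `alice` to pull in a semantic fact about her. Grouping entries by entity is what lets one retrieved item lead to related knowledge in the next turn.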