ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Jianxuan Yang1*†,  Xinyue Guo1*,  Zhi Cheng1,2,  Kai Wang1,2,  Lipan Zhang1,  Jinjie Hu1
Qiang Ji1,  Yihua Cao1,  Yihao Meng1,2,  Zhaoyue Cui1,2,  Mengmei Liu1,  Meng Meng1,  Jian Luan1
1MiLM Plus, Xiaomi Inc.    2Wuhan University
*Equal contribution    †Corresponding author

Abstract

Recent advances in video-to-audio (V2A) generation have enabled high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains a fundamental challenge. In particular, existing methods suffer from two key limitations: weak textual controllability under visual-text semantic conflict, and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks further hinders systematic evaluation of controllability. To address these challenges, we (1) introduce a joint visual encoding paradigm; (2) propose a temporal-timbre decoupling strategy; (3) design a modality-robust training scheme; and (4) present VGGSound-TVC, a benchmark for textual controllability under text-visual semantic conflict.


Main Contributions

We propose ControlFoley, a unified and controllable multimodal V2A framework that enables precise control across video, text, and reference audio. The key contributions of ControlFoley are as follows.

Joint Visual Encoding for Robust Multimodal Control.

We propose a dual-branch visual encoding paradigm that combines CLIP and CAV-MAE-ST representations, capturing both vision-language and audio-visual correlations to mitigate modality conflict and improve textual controllability.
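A minimal sketch of one way such a dual-branch encoder could be assembled (module names, projection widths, and concatenation-based fusion are illustrative assumptions, not the paper's exact design):

import torch
import torch.nn as nn

class DualBranchVisualEncoder(nn.Module):
    """Illustrative dual-branch visual encoder: one branch carries
    vision-language features (e.g., CLIP), the other audio-visual
    features (e.g., CAV-MAE-ST); both are projected to a shared
    width and concatenated along the channel axis."""

    def __init__(self, clip_dim=768, cavmae_dim=768, d_model=1024):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.cavmae_proj = nn.Linear(cavmae_dim, d_model)
        self.norm = nn.LayerNorm(2 * d_model)

    def forward(self, clip_feats, cavmae_feats):
        # clip_feats:   (B, T, clip_dim)   per-frame CLIP embeddings
        # cavmae_feats: (B, T, cavmae_dim) per-frame CAV-MAE-ST embeddings
        joint = torch.cat(
            [self.clip_proj(clip_feats), self.cavmae_proj(cavmae_feats)],
            dim=-1,
        )
        return self.norm(joint)  # (B, T, 2 * d_model) joint visual condition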

Timbre-Focused Reference Audio Control.

We design a reference audio control mechanism that suppresses temporal information and extracts global timbre representations, enabling precise acoustic style control without interfering with video-driven synchronization.
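One way to realize "suppress temporal information, keep global timbre" is to pool a frame-level reference-audio embedding over time so only a clip-level vector conditions generation; the encoder dimensions and mean pooling below are assumptions for illustration:

import torch
import torch.nn as nn

class TimbreConditioner(nn.Module):
    """Illustrative timbre extractor: pools frame-level reference-audio
    embeddings over the time axis so the resulting condition carries
    acoustic style but no event timing."""

    def __init__(self, audio_dim=768, d_model=1024):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)

    def forward(self, ref_audio_feats):
        # ref_audio_feats: (B, T, audio_dim) frame-level embeddings
        # Averaging over T discards the temporal ordering of the reference,
        # leaving a single global timbre vector per clip.
        timbre = ref_audio_feats.mean(dim=1)   # (B, audio_dim)
        return self.proj(timbre)               # (B, d_model)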

Modality-Robust Training with Unified Alignment.

We introduce an all-modality dropout strategy and a unified REPA alignment objective, improving robustness under varying modality combinations and enhancing multimodal consistency.
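The all-modality dropout idea can be sketched as independently replacing each condition stream (text, video, reference audio) with a learned null embedding during training, so the model remains usable under any subset of modalities; the drop probability and null-token handling below are illustrative assumptions:

import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """Illustrative all-modality dropout: with probability p, an entire
    condition stream is swapped for a learned null embedding, so the
    model learns to generate under any combination of modalities."""

    def __init__(self, d_model=1024, p=0.1):
        super().__init__()
        self.p = p
        self.null_text = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_video = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_audio = nn.Parameter(torch.zeros(1, 1, d_model))

    def maybe_drop(self, feats, null_token):
        # feats: (B, T, d_model); drop whole streams per sample, train-time only
        if not self.training:
            return feats
        keep = torch.rand(feats.size(0), 1, 1, device=feats.device) > self.p
        return torch.where(keep, feats, null_token)

    def forward(self, text_feats, video_feats, audio_feats):
        return (
            self.maybe_drop(text_feats, self.null_text),
            self.maybe_drop(video_feats, self.null_video),
            self.maybe_drop(audio_feats, self.null_audio),
        )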

VGGSound-TVC Benchmark.

We construct a benchmark for evaluating textual controllability under visual-text semantic conflicts, providing a standardized testbed for TC-V2A.

Extensive experiments demonstrate that ControlFoley achieves state-of-the-art performance across multiple V2A tasks, including TV2A, TC-V2A, and AC-V2A, while significantly improving controllability and robustness under challenging multimodal conditions.

Samples

TV2A

Text-Guided Video-to-Audio

Generates temporally synchronized audio for video under textual guidance.


TC-V2A

Text-Controlled Video-to-Audio

Audio generation under video–text conflicts, with semantics consistent with the text prompt and timing synchronized with the video content.


AC-V2A

Audio-Controlled Video-to-Audio

Audio generation conditioned on a reference audio, with timbre consistent with the reference and timing synchronized with the video content.


Performance

We evaluate ControlFoley on the TV2A task across three benchmarks: VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench, covering both in-distribution and out-of-distribution scenarios.

[Figure: TV2A results on VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench]

ControlFoley achieves state-of-the-art performance across all benchmarks, consistently obtaining the highest CLAP scores and the lowest DeSync. It also significantly improves audio quality, achieving up to a 27% relative gain in IS (22.08 vs. 17.36 on VGGSound).
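For reference, the CLAP score reported here is typically the cosine similarity between CLAP embeddings of the generated audio and the text prompt; the helper below assumes precomputed embeddings from a CLAP-style model (the encoders themselves are not shown):

import torch
import torch.nn.functional as F

def clap_score(audio_embed: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired audio and text embeddings.

    audio_embed, text_embed: (N, D) tensors from a CLAP-style audio
    encoder and text encoder. Returns an (N,) tensor of per-pair
    similarities; the benchmark metric is usually the mean over the set.
    """
    return F.cosine_similarity(audio_embed, text_embed, dim=-1)

# Usage (embeddings from a hypothetical CLAP model):
# scores = clap_score(encode_audio(generated_wavs), encode_text(prompts))
# print(scores.mean())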

Benchmark

We propose VGGSound-TVC, a benchmark for assessing text controllability under varying levels of visual-text semantic conflict. Example samples from VGGSound-TVC are as follows.

[Figure: Example samples from VGGSound-TVC]

We systematically modify textual descriptions of videos to introduce controlled semantic discrepancies with visual content, forcing models to balance competing modalities. We define four conflict levels (L0 to L3), ranging from no conflict to strong conflict, enabling systematic analysis of modality dominance as conflict increases.
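As a rough illustration, a benchmark entry could pair the original caption with an edited prompt and its conflict level; the field names and example values below are hypothetical, not the released file format:

from dataclasses import dataclass

@dataclass
class TVCSample:
    """Hypothetical VGGSound-TVC record layout (field names are assumed)."""
    video_id: str
    original_caption: str   # caption matching the visual content
    edited_prompt: str      # text with an injected semantic discrepancy
    conflict_level: int     # 0 = no conflict ... 3 = strong conflict

# Example: one video paired with prompts of increasing conflict.
samples = [
    TVCSample("dog_0001", "a dog barking", "a dog barking", 0),
    TVCSample("dog_0001", "a dog barking", "a large dog growling", 1),
    TVCSample("dog_0001", "a dog barking", "a cat meowing", 3),
]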

Citation

@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
  title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
  author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
  year={2026},
  eprint={2604.15086},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2604.15086},
}