ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Jianxuan Yang1*†,  Xinyue Guo1*,  Zhi Cheng1,2,  Kai Wang1,2,  Lipan Zhang1,  Jinjie Hu1
Qiang Ji1,  Yihua Cao1,  Yihao Meng1,2,  Zhaoyue Cui1,2,  Mengmei Liu1,  Meng Meng1,  Jian Luan1
1MiLM Plus, Xiaomi Inc.    2Wuhan University
*Equal contribution    †Corresponding author

Abstract

Recent advances in video-to-audio (V2A) generation have enabled high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains a fundamental challenge. In particular, existing methods suffer from two key limitations: weak textual controllability under visual-text semantic conflict, and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks further hinders systematic evaluation of controllability. To address these challenges, we (1) introduce a joint visual encoding paradigm; (2) propose a temporal-timbre decoupling strategy; (3) design a modality-robust training scheme; and (4) present VGGSound-TVC, a benchmark for textual controllability under text-visual semantic conflict.


Main Contributions

We propose ControlFoley, a unified and controllable multimodal V2A framework that enables precise control across video, text, and reference audio. The key contributions of ControlFoley are as follows.

Joint Visual Encoding for Robust Multimodal Control.

We propose a dual-branch visual encoding paradigm that combines CLIP and CAV-MAE-ST representations, capturing both vision-language and audio-visual correlations to mitigate modality conflict and improve textual controllability.
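A minimal sketch of one way such a dual-branch encoder could be assembled (module names, projection widths, and concatenation-based fusion are illustrative assumptions, not the paper's exact design):

import torch
import torch.nn as nn

class DualBranchVisualEncoder(nn.Module):
    """Illustrative dual-branch visual encoder: one branch carries
    vision-language features (e.g., CLIP), the other audio-visual
    features (e.g., CAV-MAE-ST); both are projected to a shared
    width and concatenated along the channel axis."""

    def __init__(self, clip_dim=768, cavmae_dim=768, d_model=1024):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.cavmae_proj = nn.Linear(cavmae_dim, d_model)
        self.norm = nn.LayerNorm(2 * d_model)

    def forward(self, clip_feats, cavmae_feats):
        # clip_feats:   (B, T, clip_dim)   per-frame CLIP embeddings
        # cavmae_feats: (B, T, cavmae_dim) per-frame CAV-MAE-ST embeddings
        joint = torch.cat(
            [self.clip_proj(clip_feats), self.cavmae_proj(cavmae_feats)],
            dim=-1,
        )
        return self.norm(joint)  # (B, T, 2 * d_model) joint visual condition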

Timbre-Focused Reference Audio Control.

We design a reference audio control mechanism that suppresses temporal information and extracts global timbre representations, enabling precise acoustic style control without interfering with video-driven synchronization.
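One way to realize "suppress temporal information, keep global timbre" is to pool a frame-level reference-audio embedding over time so only a clip-level vector conditions generation; the encoder dimensions and mean pooling below are assumptions for illustration:

import torch
import torch.nn as nn

class TimbreConditioner(nn.Module):
    """Illustrative timbre extractor: pools frame-level reference-audio
    embeddings over the time axis so the resulting condition carries
    acoustic style but no event timing."""

    def __init__(self, audio_dim=768, d_model=1024):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)

    def forward(self, ref_audio_feats):
        # ref_audio_feats: (B, T, audio_dim) frame-level embeddings
        # Averaging over T discards the temporal ordering of the reference,
        # leaving a single global timbre vector per clip.
        timbre = ref_audio_feats.mean(dim=1)   # (B, audio_dim)
        return self.proj(timbre)               # (B, d_model)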

Modality-Robust Training with Unified Alignment.

We introduce an all-modality dropout strategy and a unified REPA alignment objective, improving robustness under varying modality combinations and enhancing multimodal consistency.
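The all-modality dropout idea can be sketched as independently replacing each condition stream (text, video, reference audio) with a learned null embedding during training, so the model remains usable under any subset of modalities; the drop probability and null-token handling below are illustrative assumptions:

import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """Illustrative all-modality dropout: with probability p, an entire
    condition stream is swapped for a learned null embedding, so the
    model learns to generate under any combination of modalities."""

    def __init__(self, d_model=1024, p=0.1):
        super().__init__()
        self.p = p
        self.null_text = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_video = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_audio = nn.Parameter(torch.zeros(1, 1, d_model))

    def maybe_drop(self, feats, null_token):
        # feats: (B, T, d_model); drop whole streams per sample, train-time only
        if not self.training:
            return feats
        keep = torch.rand(feats.size(0), 1, 1, device=feats.device) > self.p
        return torch.where(keep, feats, null_token)

    def forward(self, text_feats, video_feats, audio_feats):
        return (
            self.maybe_drop(text_feats, self.null_text),
            self.maybe_drop(video_feats, self.null_video),
            self.maybe_drop(audio_feats, self.null_audio),
        )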

VGGSound-TVC Benchmark.

We construct a benchmark for evaluating textual controllability under visual-text semantic conflicts, providing a standardized testbed for TC-V2A.

Extensive experiments demonstrate that ControlFoley achieves state-of-the-art performance across multiple V2A tasks, including TV2A, TC-V2A, and AC-V2A, while significantly improving controllability and robustness under challenging multimodal conditions.

Samples

TV2A

Text-Guided Video-to-Audio

Generates temporally synchronized audio for video under textual guidance.


TC-V2A

Text-Controlled Video-to-Audio

Audio generation under video–text conflicts, with semantics consistent with the text prompt and timing synchronized with the video content.


AC-V2A

Audio-Controlled Video-to-Audio

Audio generation conditioned on a reference audio, with timbre consistent with the reference and timing synchronized with the video content.


Performance

We evaluate ControlFoley on the TV2A task across three benchmarks: VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench, covering both in-distribution and out-of-distribution scenarios.

[Figure: TV2A results on VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench]

ControlFoley achieves state-of-the-art performance across all benchmarks, consistently obtaining the highest CLAP scores and the lowest DeSync. It also significantly improves audio quality, achieving up to a 27% relative gain in IS (22.08 vs. 17.36 on VGGSound).
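For reference, the CLAP score reported here is typically the cosine similarity between CLAP embeddings of the generated audio and the text prompt; the helper below assumes precomputed embeddings from a CLAP-style model (the encoders themselves are not shown):

import torch
import torch.nn.functional as F

def clap_score(audio_embed: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired audio and text embeddings.

    audio_embed, text_embed: (N, D) tensors from a CLAP-style audio
    encoder and text encoder. Returns an (N,) tensor of per-pair
    similarities; the benchmark metric is usually the mean over the set.
    """
    return F.cosine_similarity(audio_embed, text_embed, dim=-1)

# Usage (embeddings from a hypothetical CLAP model):
# scores = clap_score(encode_audio(generated_wavs), encode_text(prompts))
# print(scores.mean())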

Benchmark

We propose VGGSound-TVC, a benchmark for assessing text controllability under varying levels of visual-text semantic conflict. Example samples from VGGSound-TVC are as follows.

[Figure: Example samples from VGGSound-TVC]

We systematically modify textual descriptions of videos to introduce controlled semantic discrepancies with visual content, forcing models to balance competing modalities. We define four conflict levels (L0 to L3), ranging from no conflict to strong conflict, enabling systematic analysis of modality dominance as conflict increases.
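As a rough illustration, a benchmark entry could pair the original caption with an edited prompt and its conflict level; the field names and example values below are hypothetical, not the released file format:

from dataclasses import dataclass

@dataclass
class TVCSample:
    """Hypothetical VGGSound-TVC record layout (field names are assumed)."""
    video_id: str
    original_caption: str   # caption matching the visual content
    edited_prompt: str      # text with an injected semantic discrepancy
    conflict_level: int     # 0 = no conflict ... 3 = strong conflict

# Example: one video paired with prompts of increasing conflict.
samples = [
    TVCSample("dog_0001", "a dog barking", "a dog barking", 0),
    TVCSample("dog_0001", "a dog barking", "a large dog growling", 1),
    TVCSample("dog_0001", "a dog barking", "a cat meowing", 3),
]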

Citation

@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
  title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
  author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
  year={2026},
  eprint={2604.15086},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2604.15086},
}