Publications

Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Published in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

Introduces a multi-agent debate framework for LLM judges with adaptive stability detection to improve evaluation reliability.

Recommended citation: T. Hu, Z. Tan, S. Wang, H. Qu, and T. Chen. Multi-Agent Debate for LLM Judges with Adaptive Stability Detection. The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Published as an arXiv preprint, arXiv:2512.02624, 2025

Presents PPTBench, a benchmark for holistic evaluation of large language models on PowerPoint layout and design understanding.

Recommended citation: Z. Huang, X. Liu, T. Hu, K. Zhang, and Y. Liu. PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding. arXiv preprint arXiv:2512.02624, 2025. https://arxiv.org/abs/2512.02624