Noise-Aware Direct Preference Optimization for RLAIF
Toleu A., Tolegen G., Pak A., Jaxylykova A.
October 2025
Multidisciplinary Digital Publishing Institute (MDPI)
Applied Sciences (Switzerland)
2025, Volume 15, Issue 19
Reinforcement Learning from Human Feedback (RLHF) produces powerful instruction-following models but relies on a preference-labeling process that is both costly and slow. An effective alternative, Reinforcement Learning from AI Feedback (RLAIF), uses large language models as teachers for relabeling; however, this introduces substantial label noise. In our setting, we found that AI teachers flipped approximately 50% of the original human preferences on the dataset, a condition that degrades the performance of standard direct preference optimization (DPO). We propose noise-robust DPO (nrDPO) and nrDPO-gated, two drop-in variants that make DPO resilient to noisy preferences. nrDPO reweights each pair by (i) a margin-confidence term from a frozen reference policy (base or SFT), (ii) a context-stability term that penalizes preferences that change under truncated histories, and (iii) a length correction to curb verbosity bias. nrDPO-gated further filters low-confidence pairs via a simple threshold on the reference margin. On a dataset with heavy synthetic noise (30% flips), nrDPO-gated improves the preference accuracy by +3.8% over vanilla DPO; in a realistic RLAIF setting, nrDPO-gated is the only configuration that recovers competitive alignment, reaching ≈60% on a 5k relabeled set (vs. ≈49–50% for vanilla DPO) and approaching RLHF baselines.
DPO, LLM, noise robustness, preference optimization, RLAIF, RLHF
School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, 050000, Kazakhstan
AI Research Laboratory, Satbayev University, Almaty, 050040, Kazakhstan