What is it about?
This paper proposes AlignCP, a lightweight, interpretable, data-aware preference alignment framework designed to mitigate the negative impact of noise in preference data (e.g., label flipping, inconsistent judgments) on DPO-style preference optimization. Through empirical analysis on datasets such as Anthropic-HH, we find that approximately 25% of preference pairs exhibit significant inconsistency between reward model predictions and human annotations, and such samples often fail to provide useful training signal or even degrade model behavior.

Based on this observation, AlignCP constructs two interpretable signals from reward model outputs: Confidence, which measures the certainty of the preference decision, and Polarity, which characterizes whether the reward ranking direction agrees with the human annotation. By jointly designing sample weights from these two signals, AlignCP concentrates training on preference pairs with high confidence and consistent polarity while suppressing low-confidence or direction-conflicting samples, thereby filtering out noise without requiring any data relabeling.

Experimental results demonstrate that AlignCP significantly outperforms DPO and several of its variants on benchmarks such as Anthropic-HH, achieving consistent improvements in both helpfulness and safety metrics and exhibiting stronger robustness under noisy conditions such as label flipping. Overall, AlignCP provides an automated, interpretable, and efficient quality-control and reweighting strategy for preference data, offering a practical route to robustly improving LLM preference alignment.
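The confidence/polarity reweighting idea can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact formulation: the function names, the Bradley-Terry-style confidence mapping, and the `conflict_scale` parameter for suppressing direction-conflicting pairs are all assumptions made for clarity here.

```python
import math

def aligncp_weight(r_chosen: float, r_rejected: float,
                   tau: float = 1.0, conflict_scale: float = 0.1) -> float:
    """Hypothetical per-pair weight from reward-model scores for the
    human-chosen and human-rejected responses.

    Confidence: how far the implied Bradley-Terry win probability is
    from an uninformative 0.5 (scaled into [0, 1)).
    Polarity: +1 when the reward ranking agrees with the human label,
    -1 when it conflicts.
    """
    margin = r_chosen - r_rejected
    p_win = 1.0 / (1.0 + math.exp(-margin / tau))
    confidence = abs(2.0 * p_win - 1.0)
    polarity = 1.0 if margin >= 0 else -1.0
    # Upweight confident, consistent pairs; strongly damp conflicting ones.
    return confidence if polarity > 0 else conflict_scale * confidence

def weighted_dpo_loss(logit_margins, weights, beta: float = 0.1) -> float:
    """DPO-style sigmoid loss with per-sample weights (sketch only).
    logit_margins: policy/reference log-ratio differences per pair."""
    losses = [-math.log(1.0 / (1.0 + math.exp(-beta * m)))
              for m in logit_margins]
    total_w = sum(weights) or 1.0
    return sum(w * l for w, l in zip(weights, losses)) / total_w
```

Under this sketch, a pair the reward model confidently ranks the same way as the annotator (large positive margin) receives a weight near 1, while a pair it confidently ranks the opposite way is damped by `conflict_scale`, so noisy labels contribute little gradient.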
Why is it important?
It eliminates the need for extensive human intervention and incurs only minimal additional computational overhead.
Read the Original
This page is a summary of: AlignCP: Noise-Aware Preference Alignment for LLMs via Confidence and Polarity Reweighting, April 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3774904.3792972.