What Works in Synthetic Alignment: Evidence and Scorecard
Part 3 of 4 in the series “From Human Feedback to Synthetic Alignment”
← Part 2: Architecture | Series Overview | Part 4: Research Frontiers →
In Part 1, we established RLHF’s fundamental limitations. In Part 2, we mapped the design space of the synthetic alignment methods built to overcome them. Now, it’s time for the verdict.
This article examines the empirical evidence to deliver a scorecard on synthetic alignment: which of RLHF’s problems have been solved, which remain, and what new challenges have emerged? We’ll ground this assessment in six key insights from recent research that are shaping the future of the field.
The Verdict: A Scorecard for Synthetic Alignment
We began this series in Part 1 by identifying the core limitations of RLHF. Now, after exploring the architecture of synthetic alignment, we can return to those original problems and assess the progress. The scorecard below provides a high-level verdict, which we will explore in detail in the subsequent analysis.
| RLHF Limitation | Status | Key Improvements | Remaining Challenges |
|---|---|---|---|
| Scalability & Cost | ✅ Solved | Automated preference generation eliminates human annotation bottleneck; methods generate tens of thousands of preference pairs without human cost | None—the economic constraint is decisively addressed |
| Research Velocity | ✅ Solved | Iteration cycles reduced from months to days; rapid algorithmic exploration enabled | None—temporal constraints eliminated |
| Distribution Shift | ✅ Solved (at computational cost) | On-policy training maintains data-policy alignment; demonstrably more stable than offline methods | Requires constant data regeneration and judge interaction—high computational expense |
| Reward Hacking | ⚠️ Partially Addressed | DPO methods eliminate explicit reward model exploitation; reduced staleness | Gaming shifts to judge scoring functions; self-judgment creates feedback loops; policy still optimizes proxies |
| Human Inconsistency | ⚠️ Partially Solved | Highly consistent AI judge evaluations eliminate annotator disagreement and noise | Judge models introduce systematic biases (rather than random noise); consistency ≠ correctness |
| Value Alignment & Representation | ⚠️ Improved Transparency | Constitutional principles make values explicit and modifiable; fine-grained control over optimization targets | Value authorship problem remains; cross-cultural representation unsolved; single value set scaled globally |
| Post-Deployment Adaptation | ⚠️ Easier to Iterate | Friction reduced for running new alignment iterations; no human coordination needed | Deployed models still frozen; no continual learning from user interactions; alignment still happens at training time, not deployment time |
Detailed Analysis of the Scorecard
✅ Scalability, Cost, and Research Velocity: Solved
The transformation here is unambiguous. Synthetic alignment eliminates the human resource and time bottlenecks that constrain RLHF. Methods like UltraFeedback (Cui et al., 2024) generate vast preference datasets using automated judges at a scale that would be prohibitively expensive and slow with human annotators. The iterative self-improvement demonstrated by Yuan et al. (2025), Wu et al. (2024), and Dong et al. (2024), which generates tens of thousands of preference pairs over multiple rounds, takes days rather than months.
The economic constraint that Casper et al. (2023) called “a major bottleneck” and the “research velocity problem” identified by Bowman et al. (2022) have been decisively resolved.
✅ Distribution Shift: Solved (at Computational Cost)
The evidence here is among the strongest in the entire analysis, and it leads to our first key insight.
Insight 1: On-Policy Training Dominates Offline Approaches. The most effective methods generate training data dynamically from the current policy rather than using a fixed, offline dataset. Guo et al. (2024) demonstrate that online DPO is not only more stable but also wins human preference comparisons 58% of the time against offline DPO and RLAIF.
By continuously regenerating data from the evolving policy, methods from Yuan et al. (2025) to Guo et al. (2024) solve the distribution mismatch and reward model staleness that plague RLHF. The trade-off is computational—on-policy training requires constant data regeneration and judge interaction—but in an era of abundant compute, it’s a winning trade.
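As a concrete illustration, here is a minimal sketch of such an on-policy loop, assuming the caller supplies callables for sampling from the current policy, querying the AI judge, and applying a DPO-style update; the function names are placeholders for illustration, not any specific paper's API.

```python
from typing import Callable

def online_preference_round(
    prompts: list[str],
    sample_two: Callable[[str], tuple[str, str]],               # two candidates from the current policy
    judge_prefers_first: Callable[[str, str, str], bool],       # AI judge: (prompt, a, b) -> prefers a?
    dpo_update: Callable[[list[tuple[str, str, str]]], None],   # consumes (prompt, chosen, rejected) triples
) -> None:
    """One round: sample from the *current* policy, label with the AI judge,
    and update on the fresh pairs so the data never drifts from the policy."""
    pairs: list[tuple[str, str, str]] = []
    for prompt in prompts:
        a, b = sample_two(prompt)                     # on-policy generations
        chosen, rejected = (a, b) if judge_prefers_first(prompt, a, b) else (b, a)
        pairs.append((prompt, chosen, rejected))
    dpo_update(pairs)                                 # DPO-style step against a frozen reference
```

Because every pair is regenerated from the policy being trained, the data distribution tracks the policy by construction; the price is a full round of generation and judging before every update.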
⚠️ Reward Hacking: Partially Addressed
Direct preference optimization (DPO) methods eliminate the explicit reward model, removing the most direct target for proxy gaming. However, optimization still finds shortcuts: the policy can learn to exploit the judge model's scoring function, a problem that becomes more complex in iterative self-play. This leads to two more insights.
Insight 2: Explicit Judge Training Outperforms Emergent Capabilities. You can’t just hope a good generator will be a good evaluator. Wu et al. (2024) surpassed Yuan et al. (2025) on AlpacaEval 2.0 by explicitly training the judge model in each iteration. This prevents the judge from becoming stale and demonstrates that dedicated training for the evaluator role is critical.
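A rough sketch of the general idea, not Wu et al.'s exact pipeline: sample two judgments of the same response, let a meta-judge pick the better one, and turn the outcome into training data for the judge role itself. All names here are illustrative.

```python
from typing import Callable

def collect_judge_training_pairs(
    items: list[tuple[str, str]],                            # (prompt, response) pairs
    judge: Callable[[str, str], str],                        # produces a written judgment
    meta_judge_prefers_first: Callable[[str, str, str, str], bool],
) -> list[tuple[str, str, str]]:
    """Return (input, chosen_judgment, rejected_judgment) triples so the judge
    role is explicitly trained each iteration instead of left to go stale."""
    pairs = []
    for prompt, response in items:
        j1, j2 = judge(prompt, response), judge(prompt, response)  # two sampled judgments
        if meta_judge_prefers_first(prompt, response, j1, j2):
            pairs.append((f"{prompt}\n{response}", j1, j2))
        else:
            pairs.append((f"{prompt}\n{response}", j2, j1))
    return pairs
```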
Insight 3: Self-Improvement Hits Diminishing Returns After 3-4 Iterations. A consistent pattern has emerged: performance gains diminish after a few rounds. Wu et al. (2024) found judge scores clustering around 5, suggesting the policy had learned to produce homogeneous responses that satisfied the judge, eliminating the strong preference signal needed for further improvement. This suggests current methods face an inherent ceiling.
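One pragmatic way to operationalize this observation is to monitor the spread of judge scores each round and stop (or change strategy) once the preference signal collapses. The threshold below is illustrative, not a value reported in the literature.

```python
from statistics import pstdev

def preference_signal_collapsed(judge_scores: list[float],
                                min_std: float = 0.5) -> bool:
    """True when judge scores cluster too tightly to rank responses reliably."""
    return pstdev(judge_scores) < min_std

# Scores bunched around 5 trip the check; a wide spread does not.
print(preference_signal_collapsed([4.9, 5.0, 5.0, 5.1, 4.8]))  # True
print(preference_signal_collapsed([2.0, 7.5, 4.0, 9.0, 5.5]))  # False
```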
While DPO is an improvement, the fundamental challenge of proxy gaming persists, and self-judgment introduces new failure modes like feedback loops and premature convergence.
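To make the "no explicit reward model" point concrete, the standard DPO objective optimizes the policy π_θ directly on preference pairs (x, y_w, y_l) against a frozen reference policy π_ref, with σ the logistic function and β a scaling coefficient:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The reward signal survives only implicitly in the log-probability ratios, which is exactly why gaming shifts to whatever process, human or AI judge, produces the pairs (y_w, y_l).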
⚠️ Human Inconsistency: Partially Solved
Synthetic alignment replaces the random noise of human disagreement with the systematic bias of an AI judge. The judge applies its criteria far more consistently than a pool of human annotators, which dramatically improves the signal-to-noise ratio. However, consistency is not correctness. The judge imposes a single, monolithic perspective, inheriting the values and biases of its training data. This brings us to another crucial insight.
Insight 4: Data Quality is Multifaceted, Not Unidimensional. What constitutes “high-quality” data is complex. Shi et al. (2024) found that prompts generated from real-world harmful text were more effective for safety training than purely synthetic prompts. Dong et al. (2024) discovered that prompt diversity was the most important factor for improvement. And Cheng et al. (2025) showed that investing more compute in data generation via tree-search refinement yields superior results.
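To make the diversity facet tangible, here is one cheap, illustrative proxy, the ratio of distinct word bigrams across the prompt pool; it is not the metric used by Dong et al. (2024), just a sketch of how such a facet can be measured at all.

```python
def distinct_bigram_ratio(prompts: list[str]) -> float:
    """Share of word bigrams in the prompt pool that are unique (0.0 to 1.0)."""
    total, distinct = 0, set()
    for prompt in prompts:
        tokens = prompt.lower().split()
        bigrams = list(zip(tokens, tokens[1:]))
        total += len(bigrams)
        distinct.update(bigrams)
    return len(distinct) / total if total else 0.0

# Near-duplicate prompts score low; varied prompts approach 1.0.
print(distinct_bigram_ratio(["write a poem about cats",
                             "write a poem about dogs"]))    # 0.625
print(distinct_bigram_ratio(["summarize this news article",
                             "debug my python function"]))   # 1.0
```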
The problem shifts from managing noisy human labels to understanding and auditing the systematic biases of AI judges and the subtle qualities of the data they train on.
⚠️ Value Alignment and Representation: Improved Transparency, Same Fundamental Challenge
Constitutional AI (Bai et al., 2022) makes the values being optimized for explicit and modifiable. This is a significant step forward in transparency compared to the opaque values encoded in an RLHF reward model. However, it doesn’t solve the representation problem. Where RLHF scaled the values of a small group of annotators, Constitutional AI scales the values of whoever writes the constitution. The challenge of value pluralism—how to build systems that respect diverse global preferences—remains unsolved.
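For concreteness, here is a high-level sketch of the critique-and-revise pattern that makes those values explicit and editable; the example principles and the ask_model callable are illustrative stand-ins, not Anthropic's constitution or implementation.

```python
from typing import Callable

EXAMPLE_PRINCIPLES = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(ask_model: Callable[[str], str],
                        prompt: str,
                        draft: str,
                        principles: list[str] = EXAMPLE_PRINCIPLES) -> str:
    """Apply each written principle as a critique, then revise the draft."""
    revised = draft
    for principle in principles:
        critique = ask_model(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {revised}\n"
            "Critique the response against the principle."
        )
        revised = ask_model(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {revised}\n"
            f"Critique: {critique}\nRewrite the response to satisfy the principle."
        )
    return revised
```

Editing the principle list changes the optimization target without relabeling any data, which is the transparency gain; it does nothing, however, to settle who gets to write the list.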
⚠️ Post-Deployment Adaptation: Easier to Iterate, Fundamentally Still Static
Synthetic alignment dramatically reduces the friction of updating a model. Running a new alignment iteration with a revised constitution or an improved judge model is now feasible without a massive human coordination effort. However, the paradigm remains the same: the model is aligned during training and then frozen for deployment. True continual learning—where a model adapts to user interactions and evolving norms after deployment—is not yet supported by these methods.
Six Key Takeaways from the Evidence
Distilling the analysis above, six empirical takeaways emerge that should guide both practitioners and researchers in the field.
- On-Policy Training is Superior: Dynamic, on-policy data generation is demonstrably more stable and effective than training on static, offline datasets.
- Train Your Judge Explicitly: Don’t rely on emergent evaluation capabilities. Co-evolving the judge via dedicated training leads to better performance.
- Data Quality is Complex: The best data isn’t just “preferred,” it’s also authentic, diverse, and the product of computational effort.
- Self-Improvement Plateaus: Current iterative methods hit a ceiling after 3-4 rounds, suggesting a need for new techniques to sustain long-term improvement.
- Alignment Doesn’t Have to Hurt Capabilities: Comprehensive regression testing by Dong et al. (2024) and others shows that careful alignment can be achieved without degrading performance on general capabilities.
- Controlled Comparisons are Crucial: The most valuable research isolates specific design choices. The field needs more controlled experiments to attribute progress correctly.
From Assessment to Frontiers
The scorecard reveals both genuine progress and persistent challenges. Synthetic alignment has decisively solved RLHF’s scalability and velocity problems, and on-policy training has addressed distribution shift. However, judge model biases, the limits of self-improvement, and value pluralism remain thorny issues.
The evidence suggests that synthetic alignment hasn’t “solved” alignment; it has shifted and refined the challenge. Judge model biases replace human biases. Computational costs replace human costs. New failure modes have emerged. These aren’t just engineering hurdles—they are conceptual frontiers that will define the next chapter of alignment research.
In Part 4, the final article in this series, we will explore these frontiers in depth, examining the five most critical research questions that will shape the future of self-improving AI systems.
References
Askell, A., Bai, Y., Chen, A., et al., 2021. A General Language Assistant as a Laboratory for Alignment. https://doi.org/10.48550/arXiv.2112.00861
Bai, Y., Kadavath, S., Kundu, S., et al., 2022. Constitutional AI: Harmlessness from AI Feedback. https://doi.org/10.48550/arXiv.2212.08073
Bowman, S.R., Hyun, J., Perez, E., et al., 2022. Measuring Progress on Scalable Oversight for Large Language Models. https://doi.org/10.48550/arXiv.2211.03540
Casper, S., Davies, X., Shi, C., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. https://doi.org/10.48550/arXiv.2307.15217
Cheng, J., Liu, X., Wang, C., et al., 2025. SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models. https://doi.org/10.48550/arXiv.2412.11605
Cui, G., Yuan, L., Ding, N., et al., 2024. UltraFeedback: Boosting Language Models with Scaled AI Feedback. https://doi.org/10.48550/arXiv.2310.01377
Dong, Q., Dong, L., Zhang, X., Sui, Z., Wei, F., 2024. Self-Boosting Large Language Models with Synthetic Preference Data. https://doi.org/10.48550/arXiv.2410.06961
Gao, L., Schulman, J., Hilton, J., 2022. Scaling Laws for Reward Model Overoptimization. https://doi.org/10.48550/arXiv.2210.10760
Guo, S., Zhang, B., Liu, T., et al., 2024. Direct Language Model Alignment from Online AI Feedback. https://doi.org/10.48550/arXiv.2402.04792
Kim, S., Bae, S., Shin, J., et al., 2023. Aligning Large Language Models through Synthetic Feedback. https://doi.org/10.48550/arXiv.2305.13735
Kundu, S., Bai, Y., Kadavath, S., et al., 2023. Specific versus General Principles for Constitutional AI. https://doi.org/10.48550/arXiv.2310.13798
Lee, H., Phatale, S., Mansoor, H., et al., 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. https://doi.org/10.48550/arXiv.2309.00267
Ouyang, L., Wu, J., Jiang, X., et al., 2022. Training language models to follow instructions with human feedback. https://doi.org/10.48550/arXiv.2203.02155
Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., Hashimoto, T., 2023. Whose Opinions Do Language Models Reflect? https://doi.org/10.48550/arXiv.2303.17548
Shi, T., Chen, K., Zhao, J., 2024. Safer-Instruct: Aligning Language Models with Automated Preference Data. https://doi.org/10.48550/arXiv.2311.08685
Stiennon, N., Ouyang, L., Wu, J., et al., 2022. Learning to summarize from human feedback. https://doi.org/10.48550/arXiv.2009.01325
Wu, T., Yuan, W., Golovneva, O., et al., 2024. Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. https://doi.org/10.48550/arXiv.2407.19594
Yu, T., Zhang, H., Li, Q., et al., 2025. RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness. https://doi.org/10.48550/arXiv.2405.17220
Yuan, W., Pang, R.Y., Cho, K., et al., 2025. Self-Rewarding Language Models. https://doi.org/10.48550/arXiv.2401.10020
Zweiger, A., Pari, J., Guo, H., et al., 2025. Self-Adapting Language Models. https://doi.org/10.48550/arXiv.2506.10943