What Works in Synthetic Alignment: Evidence and Scorecard

Nov 13, 2025 · Jean Michel A. Sarr · 9 min read

Part 3 of 4 in the series “From Human Feedback to Synthetic Alignment”

← Part 2: Architecture | Series Overview | Part 4: Research Frontiers →

In Part 1, we established RLHF’s fundamental limitations. In Part 2, we mapped the design space of the synthetic alignment methods built to overcome them. Now, it’s time for the verdict.

This article examines the empirical evidence to deliver a scorecard on synthetic alignment: which of RLHF’s problems have been solved, which remain, and what new challenges have emerged? We’ll ground this assessment in six key insights from recent research that are shaping the future of the field.


The Verdict: A Scorecard for Synthetic Alignment

We began this series in Part 1 by identifying the core limitations of RLHF. Now, after exploring the architecture of synthetic alignment, we can return to those original problems and assess the progress. The scorecard below provides a high-level verdict, which we will explore in detail in the subsequent analysis.

| RLHF Limitation | Status | Key Improvements | Remaining Challenges |
|---|---|---|---|
| Scalability & Cost | ✅ Solved | Automated preference generation eliminates the human annotation bottleneck; methods generate tens of thousands of preference pairs without human cost | None; the economic constraint is decisively addressed |
| Research Velocity | ✅ Solved | Iteration cycles reduced from months to days; rapid algorithmic exploration enabled | None; temporal constraints eliminated |
| Distribution Shift | ✅ Solved (at computational cost) | On-policy training maintains data-policy alignment; demonstrably more stable than offline methods | Requires constant data regeneration and judge interaction, which is computationally expensive |
| Reward Hacking | ⚠️ Partially Addressed | DPO methods eliminate explicit reward model exploitation; reduced staleness | Gaming shifts to judge scoring functions; self-judgment creates feedback loops; the policy still optimizes proxies |
| Human Inconsistency | ⚠️ Partially Solved | Perfectly consistent AI judge evaluations eliminate annotator disagreement and noise | Judge models introduce systematic biases (vs. random noise); consistency ≠ correctness |
| Value Alignment & Representation | ⚠️ Improved Transparency | Constitutional principles make values explicit and modifiable; fine-grained control over optimization targets | Value authorship problem remains; cross-cultural representation unsolved; a single value set is scaled globally |
| Post-Deployment Adaptation | ⚠️ Easier to Iterate | Friction reduced for running new alignment iterations; no human coordination needed | Deployed models remain frozen; no continual learning from user interactions; the paradigm is still training-time, not deployment-time |

Detailed Analysis of the Scorecard

✅ Scalability, Cost, and Research Velocity: Solved

The transformation here is unambiguous. Synthetic alignment fundamentally eliminates the human resource and time bottlenecks that constrain RLHF. Methods like UltraFeedback (Cui et al., 2024) generate vast preference datasets using automated judges at a scale that would be prohibitively expensive and slow with human annotators. The iterative self-improvement demonstrated by Yuan et al. (2025), Wu et al. (2024), and Dong et al. (2024), which produces tens of thousands of preference pairs over multiple rounds, takes days rather than months.
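To make the mechanism concrete, here is a minimal sketch of this kind of automated preference generation, in the spirit of UltraFeedback: sample several candidate responses, score them with an AI judge, and turn score gaps into preference pairs. The `generate` method and the `judge` callable are hypothetical stand-ins for whatever sampling and scoring interfaces you use; the point is the structure, not the names.

```python
import random
from itertools import combinations

def build_preference_pairs(prompts, policy_models, judge, n_samples=4):
    """Sketch of automated preference-pair generation with an AI judge.

    `policy_models` is a pool of models used to sample candidate responses;
    `judge(prompt, response)` is assumed to return a scalar quality score.
    """
    pairs = []
    for prompt in prompts:
        # Sample candidate responses from several policies for diversity.
        candidates = [random.choice(policy_models).generate(prompt)
                      for _ in range(n_samples)]
        # Score every candidate with the judge model instead of a human annotator.
        scored = [(resp, judge(prompt, resp)) for resp in candidates]
        # Any score gap yields a (chosen, rejected) pair; ties carry no signal.
        for (resp_a, score_a), (resp_b, score_b) in combinations(scored, 2):
            if score_a == score_b:
                continue
            chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

No human touches this loop, which is exactly why the cost and velocity constraints disappear.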

The economic constraint that Casper et al. (2023) called “a major bottleneck” and the “research velocity problem” identified by Bowman et al. (2022) have been decisively resolved.

✅ Distribution Shift: Solved (at Computational Cost)

The evidence here is among the strongest in the entire analysis, and it leads to our first key insight.

Insight 1: On-Policy Training Dominates Offline Approaches. The most effective methods generate training data dynamically from the current policy rather than using a fixed, offline dataset. Guo et al. (2024) demonstrate that online DPO is not only more stable but also wins human preference comparisons 58% of the time against offline DPO and RLAIF.

By continuously regenerating data from the evolving policy, methods from Yuan et al. (2025) to Guo et al. (2024) solve the distribution mismatch and reward model staleness that plague RLHF. The trade-off is computational—on-policy training requires constant data regeneration and judge interaction—but in an era of abundant compute, it’s a winning trade.
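As a rough illustration of what "on-policy" means in practice, the sketch below pairs a standard DPO loss with a training step that samples fresh responses from the current policy and ranks them with an AI judge on every iteration. It is a schematic under assumed helpers (`policy.sample`, `policy.logprob`, `judge`), not Guo et al.'s implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of (chosen, rejected) log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def online_dpo_step(policy, ref_policy, judge, prompts, optimizer, beta=0.1):
    """One on-policy iteration: sample fresh pairs from the *current* policy,
    rank them with the AI judge, then take a DPO gradient step.

    `policy.sample`, `policy.logprob`, and `judge` are hypothetical interfaces.
    """
    batch = []
    for prompt in prompts:
        # Fresh, on-policy data: both candidates come from the policy being trained.
        resp_a, resp_b = policy.sample(prompt), policy.sample(prompt)
        chosen, rejected = (resp_a, resp_b) if judge(prompt, resp_a) >= judge(prompt, resp_b) else (resp_b, resp_a)
        batch.append((prompt, chosen, rejected))

    loss = dpo_loss(
        torch.stack([policy.logprob(p, c) for p, c, _ in batch]),
        torch.stack([policy.logprob(p, r) for p, _, r in batch]),
        torch.stack([ref_policy.logprob(p, c) for p, c, _ in batch]),
        torch.stack([ref_policy.logprob(p, r) for p, _, r in batch]),
        beta=beta,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The computational cost noted above lives in that inner loop: every step regenerates responses and queries the judge before any gradient is taken.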

⚠️ Reward Hacking: Partially Addressed

Direct preference optimization (DPO) methods eliminate the explicit reward model, preventing agents from directly gaming a proxy score. However, optimization still finds shortcuts. The policy can learn to exploit the judge model’s scoring function, a problem that becomes more complex in iterative self-play. This leads to two more insights.

Insight 2: Explicit Judge Training Outperforms Emergent Capabilities. You can’t just hope a good generator will be a good evaluator. Wu et al. (2024) surpassed Yuan et al. (2025) on AlpacaEval 2.0 by explicitly training the judge model in each iteration. This prevents the judge from becoming stale and demonstrates that dedicated training for the evaluator role is critical.

Insight 3: Self-Improvement Hits Diminishing Returns After 3-4 Iterations. A consistent pattern has emerged: performance gains diminish after a few rounds. Wu et al. (2024) found judge scores clustering around 5, suggesting the policy had learned to produce homogeneous responses that satisfied the judge, eliminating the strong preference signal needed for further improvement. This suggests current methods face an inherent ceiling.
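A hypothetical outer loop ties these two insights together: retrain the judge alongside the policy each round, and stop when the judge's scores collapse toward a single value and the preference signal disappears. The `sample`, `score`, and `train_on` methods are assumed interfaces, not any paper's code.

```python
import statistics

def iterative_self_improvement(policy, judge, prompts, n_rounds=6,
                               min_score_spread=0.5):
    """Sketch of an iterative self-improvement loop with a co-trained judge."""
    for round_idx in range(n_rounds):
        pairs, scores = [], []
        for prompt in prompts:
            candidates = [policy.sample(prompt) for _ in range(4)]
            scored = sorted(((judge.score(prompt, c), c) for c in candidates),
                            reverse=True)
            scores.extend(score for score, _ in scored)
            pairs.append({"prompt": prompt,
                          "chosen": scored[0][1], "rejected": scored[-1][1]})

        # Insight 3: if the judge gives nearly identical scores to everything,
        # the preference signal is gone and further rounds add little.
        if statistics.pstdev(scores) < min_score_spread:
            print(f"Stopping at round {round_idx}: judge scores have collapsed.")
            break

        policy = policy.train_on(pairs)   # hypothetical DPO update on fresh pairs
        judge = judge.train_on(pairs)     # Insight 2: explicitly retrain the evaluator
    return policy, judge
```

In this framing, Insight 3 shows up as an early-stopping condition rather than a fixed round count.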

While DPO is an improvement, the fundamental challenge of proxy gaming persists, and self-judgment introduces new failure modes like feedback loops and premature convergence.

⚠️ Human Inconsistency: Partially Solved

Synthetic alignment replaces the random noise of human disagreement with the systematic bias of an AI judge. The AI judge is perfectly consistent, which dramatically improves the signal-to-noise ratio. However, consistency is not correctness. The judge imposes a single, monolithic perspective, inheriting the values and biases of its training data. This brings us to another crucial insight.

Insight 4: Data Quality is Multifaceted, Not Unidimensional. What constitutes “high-quality” data is complex. Shi et al. (2024) found that prompts generated from real-world harmful text were more effective for safety training than purely synthetic prompts. Dong et al. (2024) discovered that prompt diversity was the most important factor for improvement. And Cheng et al. (2025) showed that investing more compute in data generation via tree-search refinement yields superior results.

The problem shifts from managing noisy human labels to understanding and auditing the systematic biases of AI judges and the subtle qualities of the data they train on.
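What does auditing a judge's systematic bias look like in practice? One simple example is position bias in pairwise comparisons: re-ask the judge with the two responses swapped and measure how often the verdict follows the ordering rather than the content. The `pairwise_judge` callable below is a hypothetical interface for illustration.

```python
def position_bias_rate(pairwise_judge, examples):
    """Estimate how often a pairwise judge's verdict depends on presentation
    order rather than content. `pairwise_judge(prompt, first, second)` is
    assumed to return "first" or "second"; `examples` is a list of
    (prompt, response_a, response_b) triples.
    """
    inconsistent = 0
    for prompt, resp_a, resp_b in examples:
        verdict_ab = pairwise_judge(prompt, resp_a, resp_b)
        verdict_ba = pairwise_judge(prompt, resp_b, resp_a)
        # A content-based judge should pick the same response both times.
        picked_same = (
            (verdict_ab == "first" and verdict_ba == "second") or
            (verdict_ab == "second" and verdict_ba == "first")
        )
        if not picked_same:
            inconsistent += 1
    return inconsistent / len(examples)
```

A high rate here is a systematic bias in exactly the sense above: the judge is consistent, but consistently wrong about what it is rewarding.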

⚠️ Value Alignment and Representation: Improved Transparency, Same Fundamental Challenge

Constitutional AI (Bai et al., 2022) makes the values being optimized for explicit and modifiable. This is a significant step forward in transparency compared to the opaque values encoded in an RLHF reward model. However, it doesn’t solve the representation problem. Where RLHF scaled the values of a small group of annotators, Constitutional AI scales the values of whoever writes the constitution. The challenge of value pluralism—how to build systems that respect diverse global preferences—remains unsolved.
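The practical upshot of this transparency is that the constitution is just editable text. A minimal sketch of the critique-and-revision step described by Bai et al. (2022), with a hypothetical `llm` completion callable and an illustrative two-principle constitution, makes that concrete: changing the optimization target means editing strings, not relabeling a preference dataset.

```python
import random

CONSTITUTION = [
    # The constitution is plain, editable text: whoever writes these strings
    # decides which values get scaled.
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and transparent.",
]

def critique_and_revise(llm, prompt, response, n_passes=2):
    """Sketch of the Constitutional AI critique-and-revision step:
    the model critiques its own answer against a sampled principle,
    then revises it. `llm(text)` is a hypothetical completion call."""
    for _ in range(n_passes):
        principle = random.choice(CONSTITUTION)
        critique = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response with respect to the principle."
        )
        response = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response
```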

⚠️ Post-Deployment Adaptation: Easier to Iterate, Fundamentally Still Static

Synthetic alignment dramatically reduces the friction of updating a model. Running a new alignment iteration with a revised constitution or an improved judge model is now feasible without a massive human coordination effort. However, the paradigm remains the same: the model is aligned during training and then frozen for deployment. True continual learning—where a model adapts to user interactions and evolving norms after deployment—is not yet supported by these methods.


Six Key Takeaways from the Evidence

Distilling the analysis above, we arrive at six empirical insights that should guide practitioners and researchers in the field.

  1. On-Policy Training is Superior: Dynamic, on-policy data generation is demonstrably more stable and effective than training on static, offline datasets.
  2. Train Your Judge Explicitly: Don’t rely on emergent evaluation capabilities. Co-evolving the judge via dedicated training leads to better performance.
  3. Data Quality is Complex: The best data isn’t just “preferred”; it’s also authentic, diverse, and the product of computational effort.
  4. Self-Improvement Plateaus: Current iterative methods hit a ceiling after 3-4 rounds, suggesting a need for new techniques to sustain long-term improvement.
  5. Alignment Doesn’t Have to Hurt Capabilities: Comprehensive regression testing by Dong et al. (2024) and others shows that careful alignment can be achieved without degrading performance on general capabilities.
  6. Controlled Comparisons are Crucial: The most valuable research isolates specific design choices. The field needs more controlled experiments to attribute progress correctly.

From Assessment to Frontiers

The scorecard reveals both genuine progress and persistent challenges. Synthetic alignment has decisively solved RLHF’s scalability and velocity problems, and on-policy training has addressed distribution shift. However, judge model biases, the limits of self-improvement, and value pluralism remain thorny issues.

The evidence suggests that synthetic alignment hasn’t “solved” alignment; it has shifted and refined the challenge. Judge model biases replace human biases. Computational costs replace human costs. New failure modes have emerged. These aren’t just engineering hurdles—they are conceptual frontiers that will define the next chapter of alignment research.

In Part 4, the final article in this series, we will explore these frontiers in depth, examining the five most critical research questions that will shape the future of self-improving AI systems.


References

Askell, A., Bai, Y., Chen, A., et al., 2021. A General Language Assistant as a Laboratory for Alignment. https://doi.org/10.48550/arXiv.2112.00861

Bai, Y., Kadavath, S., Kundu, S., et al., 2022. Constitutional AI: Harmlessness from AI Feedback. https://doi.org/10.48550/arXiv.2212.08073

Bowman, S.R., Hyun, J., Perez, E., et al., 2022. Measuring Progress on Scalable Oversight for Large Language Models. https://doi.org/10.48550/arXiv.2211.03540

Casper, S., Davies, X., Shi, C., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. https://doi.org/10.48550/arXiv.2307.15217

Cheng, J., Liu, X., Wang, C., et al., 2025. SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models. https://doi.org/10.48550/arXiv.2412.11605

Cui, G., Yuan, L., Ding, N., et al., 2024. UltraFeedback: Boosting Language Models with Scaled AI Feedback. https://doi.org/10.48550/arXiv.2310.01377

Dong, Q., Dong, L., Zhang, X., Sui, Z., Wei, F., 2024. Self-Boosting Large Language Models with Synthetic Preference Data. https://doi.org/10.48550/arXiv.2410.06961

Gao, L., Schulman, J., Hilton, J., 2022. Scaling Laws for Reward Model Overoptimization. https://doi.org/10.48550/arXiv.2210.10760

Guo, S., Zhang, B., Liu, T., et al., 2024. Direct Language Model Alignment from Online AI Feedback. https://doi.org/10.48550/arXiv.2402.04792

Kim, S., Bae, S., Shin, J., et al., 2023. Aligning Large Language Models through Synthetic Feedback. https://doi.org/10.48550/arXiv.2305.13735

Kundu, S., Bai, Y., Kadavath, S., et al., 2023. Specific versus General Principles for Constitutional AI. https://doi.org/10.48550/arXiv.2310.13798

Lee, H., Phatale, S., Mansoor, H., et al., 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. https://doi.org/10.48550/arXiv.2309.00267

Ouyang, L., Wu, J., Jiang, X., et al., 2022. Training language models to follow instructions with human feedback. https://doi.org/10.48550/arXiv.2203.02155

Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., Hashimoto, T., 2023. Whose Opinions Do Language Models Reflect? https://doi.org/10.48550/arXiv.2303.17548

Shi, T., Chen, K., Zhao, J., 2024. Safer-Instruct: Aligning Language Models with Automated Preference Data. https://doi.org/10.48550/arXiv.2311.08685

Stiennon, N., Ouyang, L., Wu, J., et al., 2022. Learning to summarize from human feedback. https://doi.org/10.48550/arXiv.2009.01325

Wu, T., Yuan, W., Golovneva, O., et al., 2024. Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. https://doi.org/10.48550/arXiv.2407.19594

Yu, T., Zhang, H., Li, Q., et al., 2025. RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness. https://doi.org/10.48550/arXiv.2405.17220

Yuan, W., Pang, R.Y., Cho, K., et al., 2025. Self-Rewarding Language Models. https://doi.org/10.48550/arXiv.2401.10020

Zweiger, A., Pari, J., Guo, H., et al., 2025. Self-Adapting Language Models. https://doi.org/10.48550/arXiv.2506.10943