Jean Michel A. Sarr
(he/him)

Research Engineer

I build synthetic data infrastructure for training frontier language models at scale.

As a Research Software Engineer at Google, I was a core contributor to Gemini’s multilingual capabilities. I architected the end-to-end synthetic data pipeline that scaled instruction-following to 25 languages. My work sits at the intersection of research velocity and production scale: I build systems that cleanly separate experimental logic from infrastructure, enabling constant-overhead dataset expansion and fully reproducible training.

My technical expertise covers synthetic data generation for the critical post-training phases, from Supervised Fine-Tuning (SFT) to preference learning. I view synthetic data not merely as a cost-saving measure, but as the key lever for post-training scaling. I have also surveyed the latest trends in synthetic alignment methods (including the limitations of, and alternatives to, RLHF/RLAIF) in a comprehensive research series.

This engineering philosophy is grounded in my PhD research at Sorbonne University, where I developed methods using synthetic data to predict model behavior under distribution shift—principles I now apply to designing robust, large-scale production data systems.

I write about the shift from human to synthetic labeling on jmamath.github.io.