Developing methods with provable guarantees that advanced AI systems remain beneficial and aligned with human values as they scale in capability. Our work spans reward modeling, scalable oversight, debate-based training, and formal verification of alignment properties under distribution shift.
Reverse-engineering the internal representations of neural networks to build tools that let humans understand and predict model behavior.
Designing institutional frameworks and evaluation benchmarks for responsible frontier AI deployment across jurisdictions.
Rigorously measuring and predicting emergent behaviors in scaled systems, including detecting dangerous capabilities.
Our residency program brings together PhD researchers, postdoctoral scholars, and independent investigators to collaborate on the most pressing problems in AI safety. Residents have full access to compute infrastructure, mentorship from senior researchers, and a cross-disciplinary community.