Designing Intelligence That Remains Aligned

Our research spans four interconnected domains

Primary Focus

AI Alignment

Developing provable methods to ensure advanced AI systems remain beneficial and aligned with human values as they scale in capability. Our work spans reward modeling, scalable oversight, debate-based training, and formal verification of alignment properties under distribution shift.

23 active projects · 14 researchers

Interpretability

Mechanistic Interpretability

Reverse-engineering the internal representations of neural networks to build tools that let humans understand and predict model behavior.

14 publications this year

Governance

Governance & Policy

Designing institutional frameworks and evaluation benchmarks for responsible frontier AI deployment across jurisdictions.

8 policy briefs released

Evaluation

Capabilities Evaluation

Rigorous measurement and prediction of emergent behaviors in scaled systems, including dangerous capability detection.

6 benchmarks released

Advancing the Field

NeurIPS 2024

Recursive Oversight Through Hierarchical Verification

Chen, Wei, Park

ICML 2024

Geometric Interpretability of Latent Representations

Nakamura, Reyes

AAAI 2024

Cross-Jurisdictional Governance Frameworks for Frontier AI

Okonkwo, Müller

ICLR 2024

Quantifying Emergent Capabilities in Scaled Architectures

Lindqvist, Das

arXiv 2024

Debate-Based Alignment via Red-Team Evaluation

Torres, Gupta, Kim

NeurIPS 2023

Reward Model Verification Under Distribution Shift

Zhai, Andersen

SAFEML 2024

Formal Methods for Constitutional AI Constraints

Petrov, Liu

Join the Lab

Our residency program brings together PhD researchers, postdoctoral scholars, and independent investigators to collaborate on the most pressing problems in AI safety. Residents have full access to compute infrastructure, mentorship from senior researchers, and a cross-disciplinary community.

  • Access to a 10,000+ GPU cluster
  • Dedicated mentorship pairing
  • Publication support and review
  • Travel funding for conferences
  • Cross-institutional collaboration
  • Flexible remote participation

Apply for Residency

3–6 months · Rolling applications · Remote-friendly