Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)
· Existential Risk

Inria researcher Carina Prunkl discusses why AI evaluation science struggles to keep pace with general-purpose systems, covering jagged capabilities, gaps between pre-deployment tests and real-world behavior, misuse risks, de-skilling, red teaming, and layered safeguards.


Show Notes

Carina Prunkl is a researcher at Inria. She joins the podcast to discuss how to assess the capabilities and risks of general-purpose AI. We examine why systems can solve hard coding and math problems yet still fail at simple tasks, why pre-deployment tests often miss real-world behavior, and how faster capability gains can increase misuse risks. The conversation also covers de-skilling, red teaming, layered safeguards, and warning signs that AIs might undermine oversight.

CHAPTERS:

(00:00) Episode Preview

(01:04) Introducing the report

(02:10) Jagged frontier capabilities

(05:29) Formal reasoning progress

(12:36) Risks and evaluation science

(19:00) Funding evaluation capacity

(24:03) Autonomy and de-skilling

(31:32) Authenticity and AI companions

(41:00) Defense in depth methods

(48:34) Loss of control risks

(53:16) Where to read the report

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://podcast.futureoflife.org

Twitter (FLI): https://x.com/FLI_org

Twitter (Gus): https://x.com/gusdocker

LinkedIn: https://www.linkedin.com/company/future-of-life-institute/

YouTube: https://www.youtube.com/channel/UC-rCCy3FQ-GItDimSR9lhzw/

Apple: https://geo.itunes.apple.com/us/podcast/id1170991978

Spotify: https://open.spotify.com/show/2Op1WO3gwVwCrYHg4eoGyP
