This talk will attempt to demystify, for a non-technical audience, the current state of neural network explainability and interpretability, and to trace the boundaries of what is in principle achievable. We will first set up the background needed to discuss interpretability methods with stakeholders in mind, define basic concepts, and explain distinctions such as inner interpretability versus explainability. Along the way, we will touch on issues of relevance to various stakeholders: for instance, the role of interpretability in explaining how large language models generate text, in revealing the reasons for model biases, and in model distillation.
Throughout, we will use a particular lens to demystify what AI interpretability is and which goals are within or beyond its reach: instead of focusing on the promises of (algorithmic) solutions for interpretability, we will focus on the properties of the (computational) problems they attempt to solve. This lens, which we call computational meta-theory, will allow us to put stakeholders’ goals at the centre and to reason about the adequacy of interpretability ‘hammers’ for hitting practically meaningful ‘nails’.