Interpretability by Design

LEXplain: Improving Model Explanations via Lexicon Supervision

Model explanations that shed light on a model's predictions are becoming a desired additional output of NLP models, alongside the predictions themselves. Challenges in creating these explanations include making them trustworthy and faithful to the model's predictions. In this work, we propose a novel framework for guiding model explanations by supervising them explicitly. To this end, our method, LEXplain, uses task-related lexicons to directly supervise model explanations. This approach consistently improves the model's explanations without sacrificing performance on the task, as we demonstrate on sentiment analysis and toxicity detection.
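To make the idea of lexicon supervision concrete, here is a minimal sketch of how an auxiliary loss could push a model's per-token explanation scores toward task-lexicon membership. The names, the binary cross-entropy formulation, and the weighting term `alpha` are illustrative assumptions, not LEXplain's exact objective.

```python
import torch
import torch.nn.functional as F

def lexicon_supervised_loss(logits, labels, explanation_scores, lexicon_mask,
                            attention_mask, alpha=0.1):
    """Task loss plus an explanation-supervision term (illustrative sketch).

    explanation_scores: (batch, seq_len) per-token explanation weights in [0, 1]
    lexicon_mask:       (batch, seq_len) 1 where the token appears in a
                        task-related lexicon (e.g., a sentiment word list)
    attention_mask:     (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # Standard classification loss (e.g., sentiment or toxicity detection).
    task_loss = F.cross_entropy(logits, labels)

    # Supervise explanations: encourage explanation mass on lexicon tokens.
    expl_loss = F.binary_cross_entropy(
        explanation_scores.clamp(1e-6, 1 - 1e-6),
        lexicon_mask.float(),
        reduction="none",
    )
    expl_loss = (expl_loss * attention_mask).sum() / attention_mask.sum()

    return task_loss + alpha * expl_loss
```

The key design point is that the explanation signal enters training directly, rather than being computed post hoc, so the model is optimized to keep its explanations aligned with task-relevant vocabulary while still minimizing the classification loss.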

SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers

We introduce SelfExplain, a novel self-explaining framework that explains a text classifier’s predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance.
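The locally interpretable layer can be pictured as a module that scores each candidate phrase by how much it contributes to the predicted label. The sketch below uses a subtraction-based contribution (classifier output with versus without a phrase representation); this formulation and all names are assumptions for illustration, not necessarily SelfExplain's exact relevance computation.

```python
import torch
import torch.nn as nn

class LocalConceptRelevance(nn.Module):
    """Illustrative local-interpretability layer over phrase-based concepts."""

    def __init__(self, hidden_dim, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, sentence_repr, phrase_reprs):
        # sentence_repr: (batch, hidden)              pooled sentence vector
        # phrase_reprs:  (batch, num_phrases, hidden) one vector per phrase concept
        full_logits = self.classifier(sentence_repr)          # (batch, labels)
        pred = full_logits.argmax(dim=-1)                     # predicted label

        # Sentence representation with each phrase's contribution removed.
        without_phrase = sentence_repr.unsqueeze(1) - phrase_reprs
        reduced_logits = self.classifier(without_phrase)      # (batch, phrases, labels)

        # Relevance of a phrase = drop in the predicted label's logit
        # when that phrase is removed from the sentence representation.
        pred_idx = pred.view(-1, 1, 1).expand(-1, phrase_reprs.size(1), 1)
        full_pred_logit = full_logits.gather(1, pred.unsqueeze(1)).unsqueeze(1)
        relevance = full_pred_logit - reduced_logits.gather(2, pred_idx)
        return full_logits, relevance.squeeze(-1)             # (batch, phrases)
```

A globally interpretable counterpart would instead compare each input concept against concept representations drawn from the training set to surface the most influential training concepts for a given sample.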