2.3 Taxonomy of Interpretability Methods

2.3.1 Intrinsic or Post-hoc

This criterion is based on whether the original model is interpretable by design (intrinsic) or is a black-box model to which a separate interpretability method is applied after training (post-hoc).
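A minimal sketch of the contrast, using assumed synthetic data: a shallow decision tree is interpretable by design (intrinsic), while a random forest is treated as a black box and explained after training with permutation feature importance (post-hoc).

```python
# Intrinsic vs. post-hoc interpretability (illustrative synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Intrinsic: the shallow tree's learned rules can be read directly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))

# Post-hoc: the forest is opaque, so an extra method is applied after
# training; here, permutation feature importance.
forest = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```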

2.3.2 Model-specific or Model-agnostic

This criterion is based on whether the interpretability method is designed for a specific class of models or can be applied to any model. The interpretation of intrinsically interpretable models is always model-specific; for instance, interpreting the regression weights of a linear regression model is model-specific. Model-agnostic interpretability methods are usually post-hoc methods; for instance, explaining a black-box model with LIME [7] is model-agnostic. A model-agnostic method is unaware of the model's internal structure and uses only the inputs and the model's prediction results.
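As a concrete illustration, the sketch below applies LIME [7] to a black-box classifier; it assumes the `lime` package is installed, and the data and model are illustrative choices, not prescribed by the method. Note that LIME only calls `model.predict_proba` and never inspects the model's internals, which is what makes it model-agnostic.

```python
# Model-agnostic, post-hoc explanation with LIME (illustrative data).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=[f"x{i}" for i in range(4)], mode="classification"
)
# LIME perturbs the instance, queries the black box, and fits a local
# linear surrogate; the surrogate's weights are the explanation.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())
```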

2.3.3 Local or Global

This criterion is based on whether the interpretability method explains a single instance or the model as a whole.

  • Global on a Holistic Level: the method explains the whole model at once. For instance, the regression weights of a linear regression model explain the entire model (see the sketch after this list).
  • Global on a Modular Level: the method explains the whole model by examining one of its parts, such as a single weight or a single tree in an ensemble.
  • Local for a Single Instance: the method explains the model's prediction for a single instance.
  • Local for a Group of Instances: the method explains the model's predictions for a group of instances.
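Below is a minimal sketch of a global, holistic interpretation: the fitted weights of a linear regression model describe the whole model's behavior at once. The data and feature names are illustrative assumptions.

```python
# Global, holistic interpretation: each fitted weight is a statement
# about the whole model, valid everywhere in input space.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 3 synthetic (illustrative) features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
for name, w in zip(["x0", "x1", "x2"], model.coef_):
    # A one-unit increase in the feature changes the prediction by w,
    # regardless of which instance is being considered.
    print(f"{name}: weight = {w:+.3f}")
```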

2.3.4 Result-based

This criterion is based on the result that the interpretability method produces.

  • Feature Importance: the result is the importance of each feature to the model's prediction.
  • Model Internals: the result is the internal structure of the model, such as its learned weights or rules.
  • Data Points: the result is a set of data points (existing or newly created) that are used to explain the model.
    • Instance-based: the data points are existing instances from the training dataset (e.g., a similar instance with the same output as the instance of interest).
    • Counterfactual-based: the data points are counterfactual instances, i.e., instances with some features changed so that the outcome changes (see the sketch after this list).
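To make the counterfactual idea concrete, here is a deliberately naive, illustrative sketch: a greedy search that perturbs one feature at a time until the predicted class flips. It is not a published counterfactual algorithm, and the model and data are assumptions for illustration.

```python
# Naive counterfactual search (illustrative only): change one feature
# at a time until the model's predicted class flips.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

x = X[0].copy()
original = model.predict([x])[0]

done = False
for i in range(x.shape[0]):
    for step in np.linspace(-3.0, 3.0, 61):  # grid of candidate offsets
        candidate = x.copy()
        candidate[i] += step
        if model.predict([candidate])[0] != original:
            print(f"Changing feature {i} by {step:+.2f} flips the "
                  f"prediction away from class {original}.")
            done = True
            break
    if done:
        break  # report only the first counterfactual found
```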

References

[7] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.