As far as I know, this paper is the first to propose explaining black-box models by learning a mask. It also presents an interesting perspective, which views explaining black-box models as a form of meta-learning: an explanation is a rule that predicts the black-box model's output for a given input.

A significant advantage of formulating explanation as meta-learning is that the fidelity of an explanation can be measured as prediction accuracy.

They consider two approaches to compactly summarize the effect of deleting image regions in order to explain the behavior of the black box.

One method is the 'deletion game', which aims to find the smallest deletion mask $m$ that causes the score to drop significantly, i.e. $f_{c}(\Phi(x_0; m)) \ll f_{c}(x_0)$, where $c$ is the target class:

$$ m^*=\underset{m \in[0,1]^{\Lambda}}{\operatorname{argmin}} \lambda\|\mathbf{1}-m\|_1+f_c\left(\Phi\left(x_0 ; m\right)\right) $$
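The deletion objective can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the perturbation operator $\Phi$ here simply fades deleted pixels toward the image mean (the paper uses blur, constant, or noise perturbations), and the toy black box `f_c` is a hypothetical stand-in.

```python
import numpy as np

def perturb(x0, m):
    """A minimal Phi(x0; m): keep pixels where m is 1 and fade them toward
    the image mean (a stand-in for the paper's blur/noise perturbations)
    where m is 0."""
    baseline = np.full_like(x0, x0.mean())
    return m * x0 + (1.0 - m) * baseline

def deletion_objective(m, x0, f_c, lam=0.1):
    """lambda * ||1 - m||_1 + f_c(Phi(x0; m)): masks that delete little
    but destroy the class score achieve a low (better) value."""
    return lam * np.abs(1.0 - m).sum() + f_c(perturb(x0, m))

# Toy black box: the "class score" is just the mean intensity.
f_c = lambda x: float(x.mean())
x0 = np.ones((4, 4))
full_mask = np.ones_like(x0)  # delete nothing: objective reduces to f_c(x0)
print(deletion_objective(full_mask, x0, f_c))
```

In practice the mask is optimized by gradient descent on this objective (with $m$ kept in $[0,1]^\Lambda$), which requires a differentiable framework rather than plain NumPy.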

The other method is the 'preservation game', where the goal is to find the smallest preservation mask $m$ that must be retained to preserve the score, $f_{c}(\Phi(x_0;m))\ge f_{c}(x_0)$:

$$ m^*=\underset{m \in[0,1]^{\Lambda}}{\operatorname{argmin}} \lambda\|m\|_1-f_c\left(\Phi\left(x_0 ; m\right)\right) $$
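The preservation objective differs from the deletion game only in the sparsity term ($\|m\|_1$ instead of $\|\mathbf{1}-m\|_1$) and the sign of the score term. A minimal sketch, again with a mean-baseline stand-in for $\Phi$ and a hypothetical toy score function:

```python
import numpy as np

def preservation_objective(m, x0, f_c, lam=0.1):
    """lambda * ||m||_1 - f_c(Phi(x0; m)): retain as little of the image
    as possible while keeping the class score high.
    Phi blends toward the image mean where m is 0 (an assumption; the
    paper uses blur/constant/noise perturbations)."""
    baseline = np.full_like(x0, x0.mean())
    perturbed = m * x0 + (1.0 - m) * baseline
    return lam * np.abs(m).sum() - f_c(perturbed)

f_c = lambda x: float(x.mean())  # toy black box
x0 = np.ones((4, 4))
print(preservation_objective(np.ones_like(x0), x0, f_c))  # retains everything
```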

To address the issue of artifacts, they propose two remedies. First, effective explanations should generalize well, so they apply the mask under small random perturbations (jitter) of the input and optimize the expected score. Second, they argue that simple, regular masks are less likely to cause artifacts, which they enforce by regularizing the mask with the total-variation (TV) norm.

$$ \mathbb{E}_\tau\left[f_c\left(\Phi\left(x_0(\cdot-\tau) ; m\right)\right)\right] $$
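This expectation is typically approximated by Monte Carlo sampling. The sketch below estimates it with random integer translations; the translation range, sample count, and mean-baseline $\Phi$ are all illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def expected_jittered_score(x0, m, f_c, num_samples=8, max_shift=2, seed=0):
    """Monte-Carlo estimate of E_tau[f_c(Phi(x0(. - tau); m))]: the mask is
    applied to randomly translated (jittered) copies of the input, so it
    cannot exploit pixel-exact artifacts.
    Phi blends toward the image mean where m is 0 (an assumption; the
    paper uses blur/constant/noise perturbations)."""
    rng = np.random.default_rng(seed)
    baseline = np.full_like(x0, x0.mean())
    total = 0.0
    for _ in range(num_samples):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        jittered = np.roll(x0, (int(dy), int(dx)), axis=(0, 1))
        total += f_c(m * jittered + (1.0 - m) * baseline)
    return total / num_samples
```

With `max_shift=0` the estimate reduces exactly to $f_c(\Phi(x_0; m))$, which is a convenient sanity check.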

Typically, the TV norm is calculated as follows (see specifically Real Time Image Saliency for Black Box Classifiers):

$$ TV(m) = \sum_{i=1}^{N} \sum_{j=1}^{M-1}|m_{i, j+1} - m_{i, j}| + \sum_{i=1}^{N-1} \sum_{j=1}^{M}|m_{i+1, j} - m_{i, j}| $$
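The anisotropic TV norm above is straightforward to vectorize; a minimal sketch:

```python
import numpy as np

def total_variation(m):
    """Anisotropic TV of a 2-D mask: the sum of absolute differences
    between horizontally and vertically adjacent entries. A constant
    mask scores 0; a rapidly varying mask scores high."""
    horiz = np.abs(m[:, 1:] - m[:, :-1]).sum()
    vert = np.abs(m[1:, :] - m[:-1, :]).sum()
    return float(horiz + vert)

smooth = np.ones((4, 4))                       # constant mask
checker = np.indices((4, 4)).sum(axis=0) % 2   # alternating 0/1 pattern
print(total_variation(smooth), total_variation(checker))  # → 0.0 24.0
```

Adding this term to the objective penalizes irregular, high-frequency masks, which is how the regularizer discourages adversarial artifacts.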

References

Fong, R. C., & Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE international conference on computer vision (pp. 3429-3437).

Dabkowski, P., & Gal, Y. (2017). Real time image saliency for black box classifiers. Advances in neural information processing systems, 30.