This paper establishes theoretical guarantees for the robustness of removal-based feature attribution methods to input and model perturbations, and then validates these results on synthetic and real-world datasets.
Notably, it decomposes removal-based feature attribution into two parts: (1) how feature information is removed from the model, and (2) how the algorithm summarizes each feature's influence.
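The two-part decomposition can be made concrete with a minimal sketch. The helper below is hypothetical (not the paper's exact algorithm): it instantiates part (1) with baseline replacement as the removal mechanism, and part (2) with a leave-one-out summary of each feature's influence.

```python
import numpy as np

def leave_one_out_attributions(f, x, baseline):
    """Sketch of a removal-based attribution: replace each feature with a
    baseline value (removal, part 1), then record the resulting change in
    model output (leave-one-out summarization, part 2)."""
    full_output = f(x)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_removed = x.copy()
        x_removed[i] = baseline[i]          # "remove" feature i
        attributions[i] = full_output - f(x_removed)  # feature i's influence
    return attributions

# Toy linear model f(x) = 2*x0 + 3*x1; attributions recover the weights.
f = lambda x: 2 * x[0] + 3 * x[1]
attrs = leave_one_out_attributions(f, np.array([1.0, 1.0]), np.zeros(2))
print(attrs)  # [2. 3.]
```

Other choices for each part (e.g., marginalizing removed features, or Shapley-value summarization) fit the same template.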
The main theoretical framework they use is the $L$-Lipschitz continuity property, which states that if a model $f$ is globally $L$-Lipschitz continuous, then
$$ \left|f(x) - f(x')\right| \leq L \cdot \Vert x - x' \Vert_2, \quad \forall x, x' \in \mathbb{R}^d. $$
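The Lipschitz bound can be checked numerically on a simple model. The sketch below (my own illustration, not from the paper) uses a linear model $f(x) = w^\top x$, which is globally Lipschitz with constant $L = \Vert w \Vert_2$ by the Cauchy-Schwarz inequality, and verifies the inequality on random input pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = w @ x is globally L-Lipschitz with L = ||w||_2, since
# |f(x) - f(x')| = |w @ (x - x')| <= ||w||_2 * ||x - x'||_2 (Cauchy-Schwarz).
w = rng.normal(size=5)
f = lambda x: w @ x
L = np.linalg.norm(w)

# Empirically verify |f(x) - f(x')| <= L * ||x - x'||_2 on random pairs.
for _ in range(1000):
    x, x_prime = rng.normal(size=5), rng.normal(size=5)
    assert abs(f(x) - f(x_prime)) <= L * np.linalg.norm(x - x_prime) + 1e-9
print("Lipschitz bound holds on all sampled pairs")
```

This property is what lets the paper translate a bound on the input perturbation $\Vert x - x' \Vert_2$ into a bound on the change in model output, and hence in the attributions.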
References

On the robustness of removal-based feature attributions. NeurIPS 2023.