This paper provides theoretical guarantees on the robustness of removal-based feature attributions to input and model perturbations. It then validates the theoretical results on synthetic and real-world datasets.
Notably, it characterizes removal-based feature attribution in two parts: (1) how feature information is removed from the model, and (2) how the method summarizes each feature's influence. A minimal sketch of this two-part template follows.
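To make the two-part template concrete, here is a minimal sketch, not the paper's actual algorithm: it uses baseline replacement as the removal step and leave-one-out prediction differences as the summary step. All function and variable names are illustrative.

```python
import numpy as np

def leave_one_out_attributions(model, x, baseline):
    """Removal-based attribution sketch: (1) remove each feature by
    replacing it with a baseline value, then (2) summarize the feature's
    influence as the resulting change in the model's prediction."""
    full_pred = model(x)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_removed = x.copy()
        x_removed[i] = baseline[i]                       # (1) removal step
        attributions[i] = full_pred - model(x_removed)   # (2) summary step
    return attributions

# Toy example with a linear model; weights are illustrative.
weights = np.array([1.0, -2.0, 0.5])
model = lambda x: float(weights @ x)
x = np.array([0.3, 1.2, -0.7])
baseline = np.zeros_like(x)  # e.g., the mean of the training data
print(leave_one_out_attributions(model, x, baseline))
```

Other choices for each step (e.g., marginalizing features out with a distribution, or Shapley-style summaries) instantiate the same template.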
The main assumption they rely on is Lipschitz continuity: if a model $f$ is globally $L$-Lipschitz continuous, then
$$ \left|f(x) - f(x')\right| \leq L \cdot \Vert x - x' \Vert_2, \quad \forall x, x' \in \mathbb{R}^d. $$
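As a sanity check on this definition (not a method from the paper), one can empirically lower-bound $L$ by sampling input pairs and taking the largest observed ratio $|f(x) - f(x')| / \Vert x - x' \Vert_2$; sampling can only ever underestimate the true constant. The helper below is a hypothetical illustration.

```python
import numpy as np

def lipschitz_lower_bound(f, d, n_pairs=10_000, scale=1.0, seed=0):
    """Lower-bound the global Lipschitz constant of f by sampling
    random input pairs and recording the largest difference ratio."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x = rng.normal(scale=scale, size=d)
        x_prime = rng.normal(scale=scale, size=d)
        gap = np.linalg.norm(x - x_prime)
        if gap > 0:
            best = max(best, abs(f(x) - f(x_prime)) / gap)
    return best

# For a linear model f(x) = w @ x, the true constant is ||w||_2,
# so the estimate should approach np.linalg.norm(w) from below.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(w @ x)
print(lipschitz_lower_bound(f, d=3), np.linalg.norm(w))
```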
References

Lin, Chris, Ian Covert, and Su-In Lee. "On the robustness of removal-based feature attributions." Advances in Neural Information Processing Systems 36 (2023).