They argue that the trade-off in this formulation has no clear meaning. In particular, different choices of $\lambda$ and $\beta$ yield different masks, with no principled way of comparing them.
$$ m_{a} = \argmax_{m: \|m\|_{1} = a |\Omega|, m \in \mathcal{M}} \Phi(m \otimes x) $$
They point out that the resulting mask is then a function of the chosen area $a$ only.
$$ a^* = \min\{a: \Phi(m_a \otimes x) \ge \Phi_{0}\} $$
The mask $m_{a^*}$ is extremal because with any smaller area $a$ the perturbed input would fail to bring the model output up to the lower bound $\Phi_0$.
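The definition of $a^*$ can be read as a search over candidate areas. The following is a minimal sketch rather than the authors' procedure: `score_at_area` is a hypothetical callable standing in for $\Phi(m_a \otimes x)$ after the mask has been optimized for area $a$, and the scores in the usage lines are made-up numbers for illustration only.

```python
def smallest_sufficient_area(score_at_area, areas, phi_0):
    """Return the smallest a in `areas` whose score reaches phi_0, or None.

    score_at_area: hypothetical callable a -> Phi(m_a * x), assumed to wrap
                   the mask optimization for a fixed area a.
    areas:         iterable of candidate area fractions in (0, 1].
    phi_0:         lower bound on the model output.
    """
    for a in sorted(areas):
        if score_at_area(a) >= phi_0:
            return a
    return None  # no candidate area reaches the bound


# Toy usage with illustrative (not measured) scores per area.
toy_scores = {0.05: 0.1, 0.1: 0.4, 0.2: 0.8, 0.4: 0.9}
a_star = smallest_sufficient_area(toy_scores.get, toy_scores.keys(), phi_0=0.5)
```

Scanning a fixed grid of areas keeps the sketch simple; it only approximates the minimum over all $a$ to the resolution of the grid.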
$$ R_{a}(m) = \|\operatorname{vecsort}(m) - r_a\|^{2} $$
$$ m_a = \argmax_{m\in\mathcal{M}} \Phi(m \otimes x) - \lambda R_{a}(m) $$
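To make the regularizer concrete, here is a small NumPy sketch of $R_a(m)$. It assumes that $\operatorname{vecsort}$ sorts the mask values in ascending order and that $r_a$ is the template with $(1-a)|\Omega|$ zeros followed by $a|\Omega|$ ones; the text above does not spell out $r_a$, so treat that as an assumption. The function name `area_regularizer` is illustrative.

```python
import numpy as np


def area_regularizer(mask, area):
    """Compute R_a(m) = ||vecsort(m) - r_a||^2 for a mask with values in [0, 1]."""
    # vecsort: flatten the mask and sort its values in ascending order.
    m_sorted = np.sort(np.asarray(mask, dtype=float).ravel())
    n = m_sorted.size
    n_ones = int(round(area * n))
    # Assumed template r_a: (1 - a)|Omega| zeros followed by a|Omega| ones.
    r_a = np.zeros(n)
    r_a[n - n_ones:] = 1.0
    return float(np.sum((m_sorted - r_a) ** 2))


# A binary mask that keeps exactly a quarter of the pixels has zero penalty.
binary_mask = np.zeros((4, 4))
binary_mask[0, :] = 1.0
penalty = area_regularizer(binary_mask, area=0.25)
```

Under these assumptions the penalty vanishes only when the mask is binary with exactly the requested area, so a large $\lambda$ pushes the optimization toward masks that satisfy the area constraint.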