G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller
Explaining NonLinear Classification Decisions with Deep Taylor Decomposition
Pattern Recognition, 65:211–222, 2017
Deep Taylor decomposition is a method for explaining individual neural network predictions in terms of input variables. It operates by running a backward pass on the network using a predefined set of rules, and produces as a result a decomposition of the neural network output onto the input variables. The method can be used as a visualization tool, or as part of a more complex analysis pipeline. Deep Taylor decomposition has been used, for example, to explain state-of-the-art neural networks for computer vision.
Before diving into the mathematical aspects of deep Taylor decomposition, we will first look at the problem of explanation conceptually and consider the simple example of an image predicted by a machine learning classifier to belong to the class "shark".
The machine learning classifier can be, for example, a convolutional neural network such as AlexNet or GoogLeNet (pretrained networks are available online). When the trained model is given an image, it provides a prediction for that image, but no explanation associated with it.
A possible explanation can be obtained by determining which input variables (here, pixels) have contributed to the image classification, in particular, which pixels in the image are relevant for the prediction outcome. We call an assignment of scores onto pixels a heatmap. Such a heatmap can be visualized as an image of the same size as the image to classify:
In the heatmap, relevant portions of the image are highlighted in red. Here, the contour of the shark and its dorsal fin are considered relevant for classification. The machine learning model is therefore no longer a black box.
A simple technique for analyzing a neural network prediction is gradient-based sensitivity analysis. In its simplest form, it computes for a given data point the gradient of the neural network output function with respect to the input variables. In the example above, the gradient would tell us which pixels, if their color were changed, would make the image look more or less like a shark. However, we make a distinction between the following two questions:

1. What makes this image more or less a shark?
2. What makes this image a shark?
The second question is better answered by decomposition techniques. A decomposition seeks to redistribute the whole predicted output onto the input pixels. Decomposition techniques therefore explain the prediction in its entirety, rather than only measuring a differential effect. The difference between sensitivity and decomposition is more easily understood by looking at a simple two-dimensional example (a sum of two nonlinear functions, each operating on one variable of the input space).
The blue regions have high function value, and the white regions have function value zero. The vector fields represent the magnitude of each component of the analysis at various locations in the input space. We can observe that sensitivity and decomposition lead to qualitatively very different results.
To summarize, sensitivity analysis measures a local effect whereas decomposition measures a global effect. Both sensitivity analysis and decomposition methods have an additive structure. Sensitivity analysis satisfies the equation $$ {\textstyle \sum_p} (\partial f / \partial x_p )^2 = \|\nabla_{\!\boldsymbol{x}} f (\boldsymbol{x})\|^2 $$ where the left-hand side sums over all input variables (i.e. the analysis redistributes the total local function variation onto the input variables). Similarly, decomposition methods must satisfy the equation $$ {\textstyle \sum_p} [f(\boldsymbol{x})]_p = f(\boldsymbol{x}) $$ (i.e. the whole function value needs to be redistributed onto the input variables). In practice, we would like to enforce an additional positivity constraint on the decomposed terms in order to avoid decompositions of the type \((-9999) + (10000) = 1\), where the magnitude of the decomposed terms can become arbitrarily high. Note that sensitivity analysis already structurally incorporates such a constraint through the squaring function.
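The contrast between the two analyses can be checked numerically. The sketch below uses one illustrative choice of a sum of two nonlinear functions, \(f(x) = x_1^2 + x_2^2\) (our own choice, not from the paper), and verifies the two additive properties stated above:

```python
import numpy as np

# A sum of two nonlinear functions, each acting on one input variable,
# as in the two-dimensional example in the text. The particular choice
# f(x) = x1^2 + x2^2 is an illustrative assumption.
def f(x):
    return x[0]**2 + x[1]**2

x = np.array([1.0, 2.0])

# Sensitivity analysis: squared partial derivatives, which sum to the
# squared gradient norm.
sensitivity = np.array([2 * x[0], 2 * x[1]])**2   # (df/dx_p)^2
grad_norm2 = (2 * x[0])**2 + (2 * x[1])**2        # ||grad f(x)||^2
assert np.isclose(sensitivity.sum(), grad_norm2)

# Decomposition: redistribute the whole function value onto the inputs.
decomposition = np.array([x[0]**2, x[1]**2])      # [f(x)]_p
assert np.isclose(decomposition.sum(), f(x))      # conservation
assert np.all(decomposition >= 0)                 # positivity
```

Note that the sensitivity terms describe how fast the function changes at \(x\), while the decomposition terms account for the full value \(f(x)\).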
In the previous section, we explained how decomposition analysis differs conceptually from sensitivity analysis and why decompositions should be preferred for the problem of explaining predictions. This section introduces deep Taylor decomposition, a recently proposed practical method for decomposing the predictions of large deep neural networks.
The method takes inspiration from the way error backpropagation operates: (1) dissociating the overall computation into a set of localized neuron computations, and (2) recombining these local computations in an appropriate manner (in the case of error backpropagation, using the chain rule for derivatives).
We consider a deep network with \((x_p)_p\) denoting the input neurons and \(x_f\) the output neuron. We would like to produce a decomposition \(([x_f]_p)_p\) of the output in terms of input variables, where \([x_f]_p\) is the share of the function value \(x_f\) assigned to a particular input variable \(x_p\). The decomposition should satisfy the conservation property $$ {\textstyle \sum}_p [x_f]_p = x_f, $$ and be positive $$ \forall_p: [x_f]_p \geq 0. $$ We adopt a propagation approach to this problem, where intermediate neurons are assigned a share of \(x_f\) and must redistribute this quantity onto their predecessor neurons. Viewed globally, this redistribution process leads to a decomposition of the neural network output onto its input variables.
Suppose that we have computed all activations in a forward pass, and that we have redistributed the neural network output from the top-layer down to the layer associated to neuron \(x_j\). The scenario is illustrated in the following diagram:
We denote by \(x_j\) an arbitrary neuron in the network and by \((x_i)_i\) the collection of neurons that it receives as input. We ask which share of the function value \([x_f]_j\) associated with this neuron should be redistributed to each neuron \((x_i)_i\) in the previous layer. We denote these shares \([[x_f]_j]_i\). This problem is difficult, because there is a priori no known direct mapping \((x_i)_i \mapsto [x_f]_j\) from which we could derive a decomposition strategy. Instead, we will make the assumption that \([x_f]_j\) can always be expressed as a product of the neuron activation \(x_j\) and a positive locally constant term \(c_{j} \geq 0\): $$ [x_f]_j = c_j x_j $$ The variable \(c_j\) can be interpreted as the degree to which the activation \(x_j\) influences the neural network output locally. Note that the top-level neuron \(x_f\) readily has this product structure (\([x_f]_f = 1 \cdot x_f\)). The advantage of this product structure is that the problem of decomposing \([x_f]_j\) reduces in essence to decomposing the neuron function \((x_i)_i \mapsto x_j\), which we know, and which, as we show in the next section, yields decompositions with the desired properties. Importantly, the redistribution resulting from the decomposition of this neuron function should also produce decomposed terms with a product structure (\([x_f]_i = x_i c_i\)), so that by inductive reasoning the function value can be backpropagated through arbitrarily many layers.
In this section, we describe a possible propagation mechanism based on Taylor expansions for ReLU neurons. The propagation mechanism ensures the desired properties of a decomposition (the additive property, and the positivity of decomposed terms), and guarantees that the product structure of propagated terms is approximately preserved from one layer to the previous one (a necessary condition for backward propagation).
We consider for a given ReLU neuron \(x_j\) the mapping from its inputs \((x_i)_i\) to its redistributed function value \([x_f]_j\) through the nonlinear function \begin{align*} [x_f]_j &= c_j x_j\\ &= c_j \max\big(0,{\textstyle \sum_i} x_i w_{ij} + b_j\big) \end{align*} with \(b_j \leq 0\) and where \(c_j\) is assumed positive and constant on the relevant domain. We would like to decompose \([x_f]_j\) onto the input neurons. The function is shown below for a two-dimensional input domain:
By looking at this plot, we remark that the function is linear on the set \(\{(x_i)_i : [x_f]_j > 0\}\). If we choose a point \((\widetilde x_i)_i\), called a "root point", at the border of this set where the function is infinitesimally small, then the function can always be written as a first-order Taylor expansion $$ [x_f]_j = \sum_i c_j \frac{\partial x_j}{\partial x_i} \Big|_{(x_i)_i = (\widetilde x_i)_i} \cdot (x_i - \widetilde x_i), $$ with no zero-order term, nor second- or higher-order terms. Identifying the elements of the sum with their respective input variables, we get the decomposition onto input variables: $$ [[x_f]_j]_i = c_j \frac{\partial x_j}{\partial x_i} \Big|_{(x_i)_i = (\widetilde x_i)_i} \cdot (x_i - \widetilde x_i) $$
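The exactness of this first-order expansion on the linear region can be verified numerically. In the sketch below, the weights, bias, input point, and the constant \(c_j\) are arbitrary illustrative values; the root point is obtained by moving from the input toward the hyperplane where the neuron activation vanishes:

```python
import numpy as np

# A ReLU neuron x_j = max(0, sum_i x_i w_ij + b_j) with b_j <= 0, at a point
# (x_i)_i inside the activated (linear) region. All values are illustrative.
w = np.array([1.0, -0.5])
b = -0.25
x = np.array([1.0, 0.5])
c_j = 2.0                                   # assumed positive, locally constant

xj = max(0.0, w @ x + b)
assert xj > 0                               # we are in the linear region

# Root point: move from x along a search direction v (here v = w, one
# possible choice) until the pre-activation reaches zero.
v = w
t = (w @ x + b) / (w @ v)
x_root = x - t * v
assert np.isclose(w @ x_root + b, 0.0)      # the activation vanishes there

# First-order Taylor expansion at the root point: since the function is
# linear on the active set, the expansion is exact, with no zero-order or
# higher-order terms.
terms = c_j * w * (x - x_root)              # [[x_f]_j]_i
assert np.isclose(terms.sum(), c_j * xj)    # recovers [x_f]_j exactly
```

The check in the last line is the conservation property at the level of a single neuron: the decomposed terms sum back to \([x_f]_j = c_j x_j\).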
In practice, it will not be necessary to define this root point explicitly. Instead, the original paper on deep Taylor decomposition showed that the root point of the neuron activation can always be expressed implicitly as the intersection of the hyperplane \(\{(x_i)_i : x_j = 0\}\) containing all possible root points, and a "search direction" \((v_{ij})_i\) starting at the actual neuron input \((x_i)_i\). The relation between the root point and the search direction is shown geometrically in the diagram below:
The decomposed terms are then given in closed form as $$ [[x_f]_j]_i = \frac{v_{ij} w_{ij}}{\sum_{i'} v_{i'j} w_{i'j}} [x_f]_j. $$ Finally, the neuron \(x_i\) receives messages \([[x_f]_j]_i\) from a number of neurons \(x_j\) to which it contributes. Thus, the total share of \(x_f\) received by \(x_i\) is given by: $$ [x_f]_i = \sum_j \frac{v_{ij} w_{ij}}{\sum_{i'} v_{i'j} w_{i'j}} [x_f]_j $$