Linear operators
Hessian
- class curvlinops.HessianLinearOperator(*args, **kwargs)
Hessian as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The Hessian matrix is\[\nabla^2_{\mathbf{\theta}} \mathcal{L} = c \sum_{n=1}^{N} \nabla_{\mathbf{\theta}}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\,.\]- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)
Linear operator for DNN matrices.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader.progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
Noneassumes(D, D)whereDis the total number of parameters
- Raises:
RuntimeError – If the check for deterministic behavior fails.
Generalized Gauss-Newton
- class curvlinops.GGNLinearOperator(*args, **kwargs)
GGN as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The GGN matrix is\[c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)^\top \left( \nabla_{f_\mathbf{\theta}(\mathbf{x}_n)}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)\,.\]- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)
Linear operator for DNN matrices.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader.progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
Noneassumes(D, D)whereDis the total number of parameters
- Raises:
RuntimeError – If the check for deterministic behavior fails.
Fisher (approximate)
- class curvlinops.FisherMCLinearOperator(*args, **kwargs)
Monte-Carlo approximation of the Fisher as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. Let \(\ell(\mathbf{f}, \mathbf{y}) = - \log q(\mathbf{y} \mid \mathbf{f})\) be a negative log-likelihood loss. Denoting \(\mathbf{f}_n = f_{\mathbf{\theta}}(\mathbf{x}_n)\), the Fisher information matrix is\[\begin{split}c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \nabla_{\mathbf{f}_n}^2 \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ = c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ \approx c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \frac{1}{M} \sum_{m=1}^M \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)\end{split}\]with sampled targets \(\mathbf{\tilde{y}}_{n}^{(m)} \sim q( \cdot \mid \mathbf{f}_n)\). The expectation over the model’s likelihood is approximated via a Monte-Carlo estimator with \(M\) samples.
The linear operator represents a deterministic sample from this MC Fisher estimator. To generate different samples, you have to create instances with varying random seed argument.
- __init__(model_func: Callable[[Tensor], Tensor], loss_func: MSELoss | CrossEntropyLoss, params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, mc_samples: int = 1)
Linear operator for the MC approximation of the Fisher.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:
True. This is a safeguard, only turn it off if you know what you are doing.seed – Seed used to construct the internal random number generator used to draw samples at the beginning of each matrix-vector product. Default:
2147483647mc_samples – Number of samples to use. Default:
1.
- Raises:
NotImplementedError – If the loss function differs from
MSELossorCrossEntropyLoss.
- class curvlinops.KFACLinearOperator(*args, **kwargs)
Linear operator to multiply with the Fisher/GGN’s KFAC approximation.
KFAC approximates the per-layer Fisher/GGN with a Kronecker product: Consider a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) in a single layer. The layer’s Fisher \(\mathbf{F}(\mathbf{\theta})\) for
\[\begin{split}\mathbf{\theta} = \begin{pmatrix} \mathrm{vec}(\mathbf{W}) \\ \mathbf{b} \end{pmatrix}\end{split}\]where \(\mathrm{vec}\) denotes column-stacking is approximated as
\[\mathbf{F}(\mathbf{\theta}) \approx \mathbf{A}_{(\text{KFAC})} \otimes \mathbf{B}_{(\text{KFAC})}\](see
curvlinops.FisherMCLinearOperatorfor the Fisher’s definition). Loosely speaking, the first Kronecker factor is the un-centered covariance of the inputs to a layer. The second Kronecker factor is the un-centered covariance of ‘would-be’ gradients w.r.t. the layer’s output. Those ‘would-be’ gradients result from sampling labels from the model’s distribution and computing their gradients.Kronecker-Factored Approximate Curvature (KFAC) was originally introduced for MLPs in
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored
approximate curvature. International Conference on Machine Learning (ICML),
extended to CNNs in
Grosse, R., & Martens, J. (2016). A kronecker-factored approximate Fisher matrix for
convolution layers. International Conference on Machine Learning (ICML),
and generalized to all linear layers with weight sharing in
Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., Hennig, P. (2023).
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures (NeurIPS).
- _SUPPORTED_LOSSES
Tuple of supported loss functions.
- _SUPPORTED_MODULES
Tuple of supported layers.
- __init__(model_func: Module, loss_func: MSELoss, params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, seed: int = 2147483647, fisher_type: str = 'mc', mc_samples: int = 1, kfac_approx: str = 'expand', loss_average: None | str = 'batch', separate_weight_and_bias: bool = True)
Kronecker-factored approximate curvature (KFAC) proxy of the Fisher/GGN.
Warning
If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.
Warning
- This is an early proto-type with limitations:
Only Linear and Conv2d modules are supported.
- Parameters:
model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through KFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to
False.check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True.shape – The shape of the linear operator. If
None, it will be inferred from the parameters. Defaults toNone.seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to
2147483647.fisher_type – The type of Fisher/GGN to approximate. If ‘type-2’, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. If
'mc', the expectation is approximated by samplingmc_sampleslabels from the model’s predictive distribution. If'empirical', the empirical gradients are used which corresponds to the uncentered gradient covariance, or the empirical Fisher. Defaults to'mc'.mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to
1whenfisher_type != 'mc'. Defaults to1.kfac_approx – A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g.
Conv2dmodules orLinearmodules that process matrix- or higher-dimensional features. Possible values are'expand'and'reduce'. See Eschenhagen et al., 2023 for an explanation of the two approximations.loss_average – Whether the loss function is a mean over per-sample losses and if yes, over which dimensions the mean is taken. If
"batch", the loss function is a mean over as many terms as the size of the mini-batch. If"batch+sequence", the loss function is a mean over as many terms as the size of the mini-batch times the sequence length, e.g. in the case of language modeling. IfNone, the loss function is a sum. This argument is used to ensure that the preconditioner is scaled consistently with the loss and the gradient. Default:"batch".separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to
True.
- Raises:
ValueError – If the loss function is not supported.
ValueError – If the loss average is not supported.
ValueError – If the loss average is
Noneand the loss function’s reduction is not'sum'.ValueError – If the loss average is not
Noneand the loss function’s reduction is'sum'.ValueError – If
fisher_type != 'mc'andmc_samples != 1.NotImplementedError – If a parameter is in an unsupported layer.
Uncentered gradient covariance (empirical Fisher)
- class curvlinops.EFLinearOperator(*args, **kwargs)
Uncentered gradient covariance as SciPy linear operator.
The uncentered gradient covariance is often called ‘empirical Fisher’ (EF).
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The uncentered gradient covariance matrix is\[c \sum_{n=1}^{N} \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right)^\top\,.\]Note
Multiplication with the empirical Fisher is currently implemented with an inefficient for-loop.
- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)
Linear operator for DNN matrices.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader.progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
Noneassumes(D, D)whereDis the total number of parameters
- Raises:
RuntimeError – If the check for deterministic behavior fails.
Jacobians
- class curvlinops.JacobianLinearOperator(*args, **kwargs)
Linear operator for the Jacobian.
Can be used with SciPy.
- __init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True)
Linear operator for the Jacobian as SciPy linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s Jacobian \(\mathbf{J}_\mathbf{\theta}\mathbf{f}\) is an \(NC \times D\) matrix with elements
\[\left[ \mathbf{J}_\mathbf{\theta}\mathbf{f} \right]_{(n,c), d} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
- class curvlinops.TransposedJacobianLinearOperator(*args, **kwargs)
Linear operator for the transpose Jacobian.
Can be used with SciPy.
- __init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True)
Linear operator for the transpose Jacobian as SciPy linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s transpose Jacobian \((\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top\) is an \(D \times NC\) matrix with elements
\[\left[ (\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top \right]_{d, (n,c)} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
Inverses
- class curvlinops.CGInverseLinearOperator(*args, **kwargs)
Class for inverse linear operators via conjugate gradients.
- __init__(A: LinearOperator)
Store the linear operator whose inverse should be represented.
- Parameters:
A – Linear operator whose inverse is formed. Must be symmetric and positive-definite
- set_cg_hyperparameters(x0=None, tol=1e-05, maxiter=None, M=None, callback=None, atol=None)
Store hyperparameters for CG.
They will be used to approximate the inverse matrix-vector products.
For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.cg.html
# noqa: DAR101
- class curvlinops.NeumannInverseLinearOperator(*args, **kwargs)
Class for inverse linear operators via truncated Neumann series.
# noqa: B950
Motivated by (referred to as lorraine2020optimizing in the following)
Lorraine, J., Vicol, P., & Duvenaud, D. (2020). Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Warning
The Neumann series can be non-convergent. In this case, the iterations will become numerically unstable, leading to ``NaN``s.
Warning
The Neumann series can converge slowly. Use
curvlinops.CGInverLinearOperatorfor better accuracy.- __init__(A: LinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)
Store the linear operator whose inverse should be represented.
The Neumann series for an invertible linear operator \(\mathbf{A}\) is
\[\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \left(\mathbf{I} - \mathbf{A} \right)^k\,.\]and is convergent if all eigenvalues satisfy \(0 < \lambda(\mathbf{A}) < 2\).
By re-rescaling the matrix by
scale(\(\alpha\)), we have:\[\mathbf{A}^{-1} = \alpha (\alpha \mathbf{A})^{-1} = \alpha \sum_{k=0}^{\infty} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,,\]which and is convergent if \(0 < \lambda(\mathbf{A}) < \frac{2}{\alpha}\).
Additionally, we truncate the series at
num_terms(\(K\)):\[\mathbf{A}^{-1} \approx \alpha \sum_{k=0}^{K} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,.\]- Parameters:
A – Linear operator whose inverse is formed.
num_terms – Number of terms in the truncated Neumann series. Default:
100.scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series (details above). Default:
1.0.check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default:
True.
- set_neumann_hyperparameters(num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)
Store hyperparameters for the truncated Neumann series.
- Parameters:
num_terms – Number of terms in the truncated Neumann series. Default:
100.scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series. Default:
1.0.check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default:
True.
Sub-matrices
- class curvlinops.SubmatrixLinearOperator(*args, **kwargs)
Class for sub-matrices of linear operators.
- __init__(A: LinearOperator, row_idxs: List[int], col_idxs: List[int])
Store the linear operator and indices of its sub-matrix.
Represents the sub-matrix
A[row_idxs, :][col_idxs, :].- Parameters:
A – A linear operator.
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- set_submatrix(row_idxs: List[int], col_idxs: List[int])
Define the sub-matrix.
Internally sets the linear operator’s shape.
- Parameters:
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- Raises:
ValueError – If the index lists contain duplicate values, non-integers, or out-of-bounds indices.
Spectral density approximation
- curvlinops.lanczos_approximate_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 3.0, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01) Tuple[ndarray, ndarray]
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Implements algorithm 2 (
LanczosApproxSpec) of Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappaof the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3. Taken from papyan2020traces, Section D.2.boundaries – Estimates of the minimum and maximum eigenvalues of
A. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
- curvlinops.lanczos_approximate_log_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 1.04, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01, epsilon: float = 1e-05) Tuple[ndarray, ndarray]
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of log(|A| + εI) ∈ Rᵈˣᵈ.
Follows the idea of Section C.7 in Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).
Here, log denotes the natural logarithm (i.e. base e).
Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappaof the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
1.04. Obtained by tweaking while reproducing Fig. 15b from papyan2020traces (not specified by the paper).boundaries – Estimates of the minimum and maximum eigenvalues of
|A|. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.epsilon – Shift to increase numerical stability. Default:
1e-5. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of log(|A| + εI).
- class curvlinops.LanczosApproximateSpectrumCached(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)
Class to approximate the spectral density of p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Caches Lanczos iterations to efficiently produce spectral density approximations with different hyperparameters.
- __init__(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)
Initialize.
- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
boundaries – Estimates of the minimum and maximum eigenvalues of
A. If left unspecified, they will be estimated internally.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- approximate_spectrum(num_repeats: int = 1, num_points: int = 1024, kappa: float = 3.0, margin: float = 0.05) Tuple[ndarray, ndarray]
Approximate the spectal density of A.
- Parameters:
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.num_points – Resolution. Default:
1024.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3. From papyan2020traces, Section D.2.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
Trace approximation
- class curvlinops.HutchinsonTraceEstimator(A: LinearOperator)
Class to perform trace estimation with Hutchinson’s method.
For details, see
Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics—Simulation and Computation.
Let \(\mathbf{A}\) be a square linear operator. We can approximate its trace \(\mathrm{Tr}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[a := \mathbf{v}^\top \mathbf{A} \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a] = \mathrm{Tr}(\mathbb{E}[\mathbf{v}^\top\mathbf{A} \mathbf{v}]) = \mathrm{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v} \mathbf{v}^\top]) = \mathrm{Tr}(\mathbf{A} \mathbf{I}) = \mathrm{Tr}(\mathbf{A})\,.\]Example
>>> from numpy import trace, mean, round >>> from numpy.random import rand, seed >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> tr_A = trace(A) # exact trace as reference >>> estimator = HutchinsonTraceEstimator(A) >>> # one- and multi-sample approximations >>> tr_A_low_precision = estimator.sample() >>> tr_A_high_precision = mean([estimator.sample() for _ in range(1_000)]) >>> assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision) >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (4.4575, 6.6796, 4.3886)
- __init__(A: LinearOperator)
Store the linear operator whose trace will be estimated.
- Parameters:
A – Linear square-shaped operator whose trace will be estimated.
- Raises:
ValueError – If the operator is not square.
- sample(distribution: str = 'rademacher') float
Draw a sample from the trace estimator.
Multiple samples can be combined into a more accurate trace estimation via averaging.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'or'normal'. Default is'rademacher'.- Returns:
Sample from the trace estimator.
- class curvlinops.HutchPPTraceEstimator(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')
Class to perform trace estimation with the Huch++ method.
In contrast to vanilla Hutchinson, Hutch++ has lower variance, but requires more memory.
For details, see
Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2020). Hutch++: optimal stochastic trace estimation.
Let \(\mathbf{A}\) be a square linear operator whose trace we want to approximate. First, we compute an orthonormal basis \(\mathbf{Q}\) of a sub-space spanned by \(\mathbf{A} \mathbf{S}\) where \(\mathbf{S}\) is a tall random matrix with i.i.d. elements. Then, we compute the trace in the sub-space and apply Hutchinson’s estimator in the remaining space spanned by \(\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top\): We can draw a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[a := \mathrm{Tr}(\mathbf{Q}^\top \mathbf{A} \mathbf{Q}) + \mathbf{v}^\top (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top)^\top \mathbf{A} (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top) \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased, \(\mathbb{E}[a] = \mathrm{Tr}(\mathbf{A})\), as the first term is constant and the second part is Hutchinson’s estimator in a sub-space.
Example
>>> from numpy import trace, mean, round >>> from numpy.random import rand, seed >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> tr_A = trace(A) # exact trace as reference >>> estimator = HutchPPTraceEstimator(A) >>> # one- and multi-sample approximations >>> tr_A_low_precision = estimator.sample() >>> tr_A_high_precision = mean([estimator.sample() for _ in range(998)]) >>> # assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision) >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (4.4575, 2.4085, 4.5791)
- __init__(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')
Store the linear operator whose trace will be estimated.
- Parameters:
A – Linear square-shaped operator whose trace will be estimated.
basis_dim – Dimension of the subspace used for exact trace estimation. Can be at most the linear operator’s dimension. By default, its size will be 1% of the matrix’s dimension, but at most
10. This assumes that we are working with very large matrices and we can only afford storing a small number of columns at a time.basis_distribution – Distribution of the vectors used to construct the subspace. Either
'rademacher'` or ``'normal'. Default is'rademacher'.
- Raises:
ValueError – If the operator is not square or the basis dimension is too large.
Note
If you are planning to perform a fair (i.e. same computation budget) comparison with vanilla Hutchinson,
basis_dimshould bes / 3wheresis the number of samples used by vanilla Hutchinson. Ifs / 3requires storing a too large matrix, you can pickbasis_dim = s1and draws2samples from Hutch++ such that2 * s1 + s2 = s.
- sample(distribution: str = 'rademacher') float
Draw a sample from the trace estimator.
Multiple samples can be combined into a more accurate trace estimation via averaging.
Note
Calling this function for the first time will also compute the sub-space and its trace. Future calls will be faster as the latter are cached internally.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'or'normal'. Default is'rademacher'.- Returns:
Sample from the trace estimator.
Diagonal approximation
- class curvlinops.HutchinsonDiagonalEstimator(A: LinearOperator)
Class to perform diagonal estimation with Hutchinson’s method.
For details, see
Martens, J., Sutskever, I., & Swersky, K. (2012). Estimating the hessian by back-propagating curvature. International Conference on Machine Learning (ICML).
Let \(\mathbf{A}\) be a square linear operator. We can approximate its diagonal \(\mathrm{diag}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[\mathbf{a} := \mathbf{v} \odot \mathbf{A} \mathbf{v} \approx \mathrm{diag}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a_i] = \sum_j \mathbb{E}[v_i A_{i,j} v_j] = \sum_j A_{i,j} \mathbb{E}[v_i v_j] = \sum_j A_{i,j} \delta_{i, j} = A_{i,i}\,.\]Example
>>> from numpy import diag, mean, round >>> from numpy.random import rand, seed >>> from numpy.linalg import norm >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> diag_A = diag(A) # exact diagonal as reference >>> estimator = HutchinsonDiagonalEstimator(A) >>> # one- and multi-sample approximations >>> diag_A_low_precision = estimator.sample() >>> samples = [estimator.sample() for _ in range(1_000)] >>> diag_A_high_precision = mean(samples, axis=0) >>> # compute residual norms >>> error_low_precision = norm(diag_A - diag_A_low_precision) >>> error_high_precision = norm(diag_A - diag_A_high_precision) >>> assert error_low_precision > error_high_precision >>> round(error_low_precision, 4), round(error_high_precision, 4) (5.7268, 0.1525)
- __init__(A: LinearOperator)
Store the linear operator whose diagonal will be estimated.
- Parameters:
A – Linear square-shaped operator whose diagonal will be estimated.
- Raises:
ValueError – If the operator is not square.
- sample(distribution: str = 'rademacher') ndarray
Draw a sample from the diagonal estimator.
Multiple samples can be combined into a more accurate diagonal estimation via averaging.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'or'normal'. Default is'rademacher'.- Returns:
A Sample from the diagonal estimator.
Experimental
The API of experimental features may be subject to changes, or they might become deprecated.
- class curvlinops.experimental.ActivationHessianLinearOperator(*args, **kwargs)
Hessian of the loss w.r.t. hidden features in a neural network.
Consider the empirical risk on a single mini-batch \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N\}\), \(\mathbf{y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N\}\):
\[\mathcal{L}(\mathbf{\theta}) = \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]Let \(\mathbf{Z} = \{ \mathbf{z}_1, \dots, \mathbf{z}_N \}\) denote a batch of some intermediate feature produced inside the neural network’s forward pass. The Hessian w.r.t. the flattened activations that is represented by this linear operator is
\[\nabla^2_{\mathbf{Z}} \mathcal{L} = \nabla^2_{\mathbf{Z}} \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]and has dimension \(\mathrm{dim}(\mathbf{Z}) = N \mathrm{dim}(\mathbf{z})\).
- __init__(model_func: Module, loss_func: Callable[[Tensor, Tensor], Tensor], activation: Tuple[str, str, int], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)
Linear operator for the loss Hessian w.r.t. intermediate features.
- Parameters:
model_func – A neural network.
loss_func – A loss function.
activation – A tuple specifying w.r.t. what intermediate feature the Hessian shall be computed. Has three entries. The first entry is the layer’s name inside the model (any string from
model_func.named_modules). The second entry can either be"output"or"input"and specifies whether the intermediate feature is the layer’s output or the layer’s input. The third entry specifies what tensor should be taken from the tuple of input tensors or the tuple of output tensors. For most layers that process a single Tensor into another single Tensor, this value is0.data – An iterable of mini-batches. Must have a single batch, otherwise the activation Hessian is not defined.
progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
None, this dimension will be inferred at the cost of one forward pass through the model.
- Raises:
ValueError – If
datacontains more than one batch.
Example
>>> from numpy import eye, allclose >>> from torch import manual_seed, rand >>> from torch.nn import Linear, MSELoss, Sequential, ReLU >>> >>> loss_func = MSELoss() >>> model = Sequential(Linear(4, 3), ReLU(), Linear(3, 2)) >>> [name for name, _ in model.named_modules()] # available layer names ['', '0', '1', '2'] >>> data = [(rand(10, 4), rand(10, 2))] >>> >>> hessian = ActivationHessianLinearOperator( # Hessian w.r.t. ReLU input ... model, loss_func, ("1", "input", 0), data ... ) >>> hessian.shape # batch size * feature dimension (10 * 3) (30, 30) >>> >>> # The ReLU's input is the first Linear's output, let's check that >>> hessian2 = ActivationHessianLinearOperator( # Hessian w.r.t. first output ... model, loss_func, ("0", "output", 0), data ... ) >>> I = eye(hessian.shape[1]) >>> allclose(hessian @ I, hessian2 @ I) True