Linear operators
Hessian
- class curvlinops.HessianLinearOperator(*args, **kwargs)[source]
Hessian as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'
and \(c=1\) forreduction='sum'
. The Hessian matrix is\[\nabla^2_{\mathbf{\theta}} \mathcal{L} = c \sum_{n=1}^{N} \nabla_{\mathbf{\theta}}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\,.\]- SUPPORTS_BLOCKS
Whether the linear operator supports block operations. Default is
True
.- Type:
bool
- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)
Linear operator for DNN matrices.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None
, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]
or a torchDataLoader
. Note thatX
could be adict
orUserDict
; this is useful for custom models. In this case, you must (i) specify thebatch_size_fn
argument, and (ii) take care of preprocessing likeX.to(device)
inside of yourmodel.forward()
function.progressbar – Show a progressbar during matrix-multiplication. Default:
False
.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_last
or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True
. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
None
assumes(D, D)
whereD
is the total number of parametersnum_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of
nn.Parameter``s forming a block. Entries must sum to ``len(params)
. For instance[len(params)]
considers the full matrix, while[1, 1, ...]
corresponds to a block diagonal approximation where each parameter forms its own block.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- Raises:
RuntimeError – If the check for deterministic behavior fails.
ValueError – If
block_sizes
is specified but the linear operator does not support blocks.ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If
X
is not a tensor andbatch_size_fn
is not specified.
Generalized Gauss-Newton
- class curvlinops.GGNLinearOperator(*args, **kwargs)[source]
GGN as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'
and \(c=1\) forreduction='sum'
. The GGN matrix is\[c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)^\top \left( \nabla_{f_\mathbf{\theta}(\mathbf{x}_n)}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)\,.\]- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)
Linear operator for DNN matrices.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None
, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]
or a torchDataLoader
. Note thatX
could be adict
orUserDict
; this is useful for custom models. In this case, you must (i) specify thebatch_size_fn
argument, and (ii) take care of preprocessing likeX.to(device)
inside of yourmodel.forward()
function.progressbar – Show a progressbar during matrix-multiplication. Default:
False
.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_last
or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True
. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
None
assumes(D, D)
whereD
is the total number of parametersnum_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of
nn.Parameter``s forming a block. Entries must sum to ``len(params)
. For instance[len(params)]
considers the full matrix, while[1, 1, ...]
corresponds to a block diagonal approximation where each parameter forms its own block.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- Raises:
RuntimeError – If the check for deterministic behavior fails.
ValueError – If
block_sizes
is specified but the linear operator does not support blocks.ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If
X
is not a tensor andbatch_size_fn
is not specified.
Fisher (approximate)
- class curvlinops.FisherMCLinearOperator(*args, **kwargs)[source]
Monte-Carlo approximation of the Fisher as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'
and \(c=1\) forreduction='sum'
. Let \(\ell(\mathbf{f}, \mathbf{y}) = - \log q(\mathbf{y} \mid \mathbf{f})\) be a negative log-likelihood loss. Denoting \(\mathbf{f}_n = f_{\mathbf{\theta}}(\mathbf{x}_n)\), the Fisher information matrix is\[\begin{split}c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \nabla_{\mathbf{f}_n}^2 \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ = c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ \approx c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \frac{1}{M} \sum_{m=1}^M \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)\end{split}\]with sampled targets \(\mathbf{\tilde{y}}_{n}^{(m)} \sim q( \cdot \mid \mathbf{f}_n)\). The expectation over the model’s likelihood is approximated via a Monte-Carlo estimator with \(M\) samples.
The linear operator represents a deterministic sample from this MC Fisher estimator. To generate different samples, you have to create instances with varying random seed argument.
- __init__(model_func: Callable[[Tensor], Tensor], loss_func: MSELoss | CrossEntropyLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, mc_samples: int = 1, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]
Linear operator for the MC approximation of the Fisher.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]
or a torchDataLoader
. Note thatX
could be adict
orUserDict
; this is useful for custom models. In this case, you must (i) specify thebatch_size_fn
argument, and (ii) take care of preprocessing likeX.to(device)
inside of yourmodel.forward()
function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).progressbar – Show a progressbar during matrix-multiplication. Default:
False
.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:
True
. This is a safeguard, only turn it off if you know what you are doing.seed – Seed used to construct the internal random number generator used to draw samples at the beginning of each matrix-vector product. Default:
2147483647
mc_samples – Number of samples to use. Default:
1
.num_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- Raises:
NotImplementedError – If the loss function differs from
MSELoss
, BCEWithLogitsLoss, orCrossEntropyLoss
.
- class curvlinops.KFACLinearOperator(*args, **kwargs)[source]
Linear operator to multiply with the Fisher/GGN’s KFAC approximation.
KFAC approximates the per-layer Fisher/GGN with a Kronecker product: Consider a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) in a single layer. The layer’s Fisher \(\mathbf{F}(\mathbf{\theta})\) for
\[\begin{split}\mathbf{\theta} = \begin{pmatrix} \mathrm{vec}(\mathbf{W}) \\ \mathbf{b} \end{pmatrix}\end{split}\]where \(\mathrm{vec}\) denotes column-stacking is approximated as
\[\mathbf{F}(\mathbf{\theta}) \approx \mathbf{A}_{(\text{KFAC})} \otimes \mathbf{B}_{(\text{KFAC})}\](see
curvlinops.FisherMCLinearOperator
for the Fisher’s definition). Loosely speaking, the first Kronecker factor is the un-centered covariance of the inputs to a layer. The second Kronecker factor is the un-centered covariance of ‘would-be’ gradients w.r.t. the layer’s output. Those ‘would-be’ gradients result from sampling labels from the model’s distribution and computing their gradients.Kronecker-Factored Approximate Curvature (KFAC) was originally introduced for MLPs in
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored
approximate curvature. International Conference on Machine Learning (ICML),
extended to CNNs in
Grosse, R., & Martens, J. (2016). A kronecker-factored approximate Fisher matrix for
convolution layers. International Conference on Machine Learning (ICML),
and generalized to all linear layers with weight sharing in
Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., Hennig, P. (2023).
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures (NeurIPS).
- _SUPPORTED_LOSSES
Tuple of supported loss functions.
- _SUPPORTED_MODULES
Tuple of supported layers.
- __init__(model_func: Module, loss_func: MSELoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, seed: int = 2147483647, fisher_type: str = 'mc', mc_samples: int = 1, kfac_approx: str = 'expand', loss_average: None | str = 'batch', num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]
Kronecker-factored approximate curvature (KFAC) proxy of the Fisher/GGN.
Warning
If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.
Warning
- This is an early proto-type with limitations:
Only Linear and Conv2d modules are supported.
If
deterministic_checks
is turned on (as is by default), this will compute the KFAC matrices on CPU, even if all passed arguments live on the GPU.
- Parameters:
model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through KFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to
False
.check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True
.shape – The shape of the linear operator. If
None
, it will be inferred from the parameters. Defaults toNone
.seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to
2147483647
.fisher_type – The type of Fisher/GGN to approximate. If ‘type-2’, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. If
'mc'
, the expectation is approximated by samplingmc_samples
labels from the model’s predictive distribution. If'empirical'
, the empirical gradients are used which corresponds to the uncentered gradient covariance, or the empirical Fisher. If'forward-only'
, the gradient covariances will be identity matrices, see the FOOF method in Benzing, 2022 or ISAAC in Petersen et al., 2023. Defaults to'mc'
.mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to
1
whenfisher_type != 'mc'
. Defaults to1
.kfac_approx – A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g.
Conv2d
modules orLinear
modules that process matrix- or higher-dimensional features. Possible values are'expand'
and'reduce'
. See Eschenhagen et al., 2023 for an explanation of the two approximations.loss_average – Whether the loss function is a mean over per-sample losses and if yes, over which dimensions the mean is taken. If
"batch"
, the loss function is a mean over as many terms as the size of the mini-batch. If"batch+sequence"
, the loss function is a mean over as many terms as the size of the mini-batch times the sequence length, e.g. in the case of language modeling. IfNone
, the loss function is a sum. This argument is used to ensure that the preconditioner is scaled consistently with the loss and the gradient. Default:"batch"
.num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have
num_data * num_per_example_loss_terms * C
entries, whereC
is the dimension of the random variable we define the likelihood over – for theCrossEntropyLoss
it will be the number of classes, for theMSELoss
andBCEWithLogitsLoss
it will be the size of the last dimension of the the model outputs/targets (our convention here). IfNone
,num_per_example_loss_terms
is inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults toNone
.separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to
True
.num_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- Raises:
RuntimeError – If the check for deterministic behavior fails.
ValueError – If the loss function is not supported.
ValueError – If the loss average is not supported.
ValueError – If the loss average is
None
and the loss function’s reduction is not'sum'
.ValueError – If the loss average is not
None
and the loss function’s reduction is'sum'
.ValueError – If
fisher_type != 'mc'
andmc_samples != 1
.ValueError – If
X
is not a tensor andbatch_size_fn
is not specified.
- property det: Tensor
Determinant of the KFAC approximation.
Will call
_compute_kfac
if it has not been called before and will cache the determinant until_compute_kfac
is called again. Uses the property of the Kronecker product that \(\det(A \otimes B) = \det(A)^{m} \det(B)^{n}\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).- Returns:
Determinant of the KFAC approximation.
- property frobenius_norm: Tensor
Frobenius norm of the KFAC approximation.
Will call
_compute_kfac
if it has not been called before and will cache the Frobenius norm until_compute_kfac
is called again. Uses the property of the Kronecker product that \(\|A \otimes B\|_F = \|A\|_F \|B\|_F\).- Returns:
Frobenius norm of the KFAC approximation.
- property logdet: Tensor
Log determinant of the KFAC approximation.
More numerically stable than the
det
property. Will call_compute_kfac
if it has not been called before and will cache the log determinant until_compute_kfac
is called again. Uses the property of the Kronecker product that \(\log \det(A \otimes B) = m \log \det(A) + n \log \det(B)\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).- Returns:
Log determinant of the KFAC approximation.
- to_device(device: device)[source]
Load the linear operator to another device.
- Parameters:
device – The device to which the linear operator should be moved.
- torch_matmat(M_torch: Tensor | List[Tensor]) Tensor | List[Tensor] [source]
Apply KFAC to a matrix (multiple vectors) in PyTorch.
This allows for matrix-matrix products with the KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.
- Parameters:
M_torch – Matrix for multiplication. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size
K
for the columns, i.e.[(K,) + p1.shape), (K,) + p2.shape, ...]
. If tensor, has shape[D, K]
with someK
.- Returns:
Matrix-multiplication result
KFAC @ M
. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of sizeK
for the columns, i.e.[(K,) + p1.shape, (K,) + p2.shape, ...]
. If tensor, has shape[D, K]
with someK
.
- torch_matvec(v_torch: Tensor | List[Tensor]) Tensor | List[Tensor] [source]
Apply KFAC to a vector in PyTorch.
This allows for matrix-vector products with the KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.
- Parameters:
v_torch – Vector for multiplication. If list of tensors, each entry has the same shape as a parameter, i.e.
[p1.shape, p2.shape, ...]
. If tensor, has shape[D]
.- Returns:
Matrix-multiplication result
KFAC @ v
. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter, i.e.[p1.shape, p2.shape, ...]
. If tensor, has shape[D]
.- Raises:
ValueError – If the input tensor has the wrong data type.
- property trace: Tensor
Trace of the KFAC approximation.
Will call
_compute_kfac
if it has not been called before and will cache the trace until_compute_kfac
is called again. Uses the property of the Kronecker product that \(\text{tr}(A \otimes B) = \text{tr}(A) \text{tr}(B)\).- Returns:
Trace of the KFAC approximation.
Uncentered gradient covariance (empirical Fisher)
- class curvlinops.EFLinearOperator(*args, **kwargs)[source]
Uncentered gradient covariance as SciPy linear operator.
The uncentered gradient covariance is often called ‘empirical Fisher’ (EF).
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'
and \(c=1\) forreduction='sum'
. The uncentered gradient covariance matrix is\[c \sum_{n=1}^{N} \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right)^\top\,.\]Note
Multiplication with the empirical Fisher is currently implemented with an inefficient for-loop.
- __init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]
Linear operator for the uncentered gradient covariance/empirical Fisher (EF).
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]
or a torchDataLoader
. Note thatX
could be adict
orUserDict
; this is useful for custom models. In this case, you must (i) specify thebatch_size_fn
argument, and (ii) take care of preprocessing likeX.to(device)
inside of yourmodel.forward()
function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).progressbar – Show a progressbar during matrix-multiplication. Default:
False
.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:
True
. This is a safeguard, only turn it off if you know what you are doing.num_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- Raises:
NotImplementedError – If the loss function differs from
MSELoss
, BCEWithLogitsLoss, orCrossEntropyLoss
.
Jacobians
- class curvlinops.JacobianLinearOperator(*args, **kwargs)[source]
Linear operator for the Jacobian.
Can be used with SciPy.
- __init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]
Linear operator for the Jacobian as SciPy linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s Jacobian \(\mathbf{J}_\mathbf{\theta}\mathbf{f}\) is an \(NC \times D\) matrix with elements
\[\left[ \mathbf{J}_\mathbf{\theta}\mathbf{f} \right]_{(n,c), d} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
- class curvlinops.TransposedJacobianLinearOperator(*args, **kwargs)[source]
Linear operator for the transpose Jacobian.
Can be used with SciPy.
- __init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]
Linear operator for the transpose Jacobian as SciPy linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s transpose Jacobian \((\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top\) is an \(D \times NC\) matrix with elements
\[\left[ (\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top \right]_{d, (n,c)} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If
None
, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X
’s indata
are nottorch.Tensor
, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdata
and return their batch size.
Inverses
- class curvlinops.CGInverseLinearOperator(*args, **kwargs)[source]
Class for inverse linear operators via conjugate gradients.
- __init__(A: LinearOperator)[source]
Store the linear operator whose inverse should be represented.
- Parameters:
A – Linear operator whose inverse is formed. Must be symmetric and positive-definite.
- set_cg_hyperparameters(x0: ndarray | None = None, maxiter: int | None = None, M: ndarray | LinearOperator | None = None, callback: Callable | None = None, atol: float | None = None, tol: float | None = 1e-05)[source]
Store hyperparameters for CG.
They will be used to approximate the inverse matrix-vector products.
For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.cg.html.
# noqa: DAR101
- class curvlinops.LSMRInverseLinearOperator(*args, **kwargs)[source]
Class for inverse linear operators via LSMR.
See https://arxiv.org/abs/1006.0758 for details on the LSMR algorithm.
- __init__(A: LinearOperator)[source]
Store the linear operator whose inverse should be represented.
- Parameters:
A – Linear operator whose inverse is formed.
- matvec_with_info(x: ndarray) Tuple[ndarray, int, int, float, float, float, float, float] [source]
Multiply x by the inverse of A and return additional information.
- Parameters:
x – Vector for multiplication.
- Returns:
Result of inverse matrix-vector multiplication,
A⁻¹ @ x
with additional information; see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html for details (same return values).
- set_lsmr_hyperparameters(damp: float = 0.0, atol: float | None = 1e-06, btol: float | None = 1e-06, conlim: float | None = 1e-08, maxiter: int | None = None, show: bool | None = False, x0: ndarray | None = None)[source]
Store hyperparameters for LSMR.
They will be used to approximate the inverse matrix-vector products.
For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html.
# noqa: DAR101
- class curvlinops.NeumannInverseLinearOperator(*args, **kwargs)[source]
Class for inverse linear operators via truncated Neumann series.
# noqa: B950
Motivated by (referred to as lorraine2020optimizing in the following)
Lorraine, J., Vicol, P., & Duvenaud, D. (2020). Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Warning
The Neumann series can be non-convergent. In this case, the iterations will become numerically unstable, leading to ``NaN``s.
Warning
The Neumann series can converge slowly. Use
curvlinops.CGInverLinearOperator
for better accuracy.- __init__(A: LinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]
Store the linear operator whose inverse should be represented.
The Neumann series for an invertible linear operator \(\mathbf{A}\) is
\[\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \left(\mathbf{I} - \mathbf{A} \right)^k\,.\]and is convergent if all eigenvalues satisfy \(0 < \lambda(\mathbf{A}) < 2\).
By re-rescaling the matrix by
scale
(\(\alpha\)), we have:\[\mathbf{A}^{-1} = \alpha (\alpha \mathbf{A})^{-1} = \alpha \sum_{k=0}^{\infty} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,,\]which and is convergent if \(0 < \lambda(\mathbf{A}) < \frac{2}{\alpha}\).
Additionally, we truncate the series at
num_terms
(\(K\)):\[\mathbf{A}^{-1} \approx \alpha \sum_{k=0}^{K} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,.\]- Parameters:
A – Linear operator whose inverse is formed.
num_terms – Number of terms in the truncated Neumann series. Default:
100
.scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series (details above). Default:
1.0
.check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default:
True
.
- set_neumann_hyperparameters(num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]
Store hyperparameters for the truncated Neumann series.
- Parameters:
num_terms – Number of terms in the truncated Neumann series. Default:
100
.scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series. Default:
1.0
.check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default:
True
.
- class curvlinops.KFACInverseLinearOperator(*args, **kwargs)[source]
Class to invert instances of the
KFACLinearOperator
.- __init__(A: KFACLinearOperator, damping: float | Tuple[float, float] = 0.0, use_heuristic_damping: bool = False, min_damping: float = 1e-08, use_exact_damping: bool = False, cache: bool = True, retry_double_precision: bool = True)[source]
Store the linear operator whose inverse should be represented.
- Parameters:
A –
KFACLinearOperator
whose inverse is formed.damping – Damping value(s) for all input and gradient covariances. If tuple, the first value is used for the input covariances and the second value for the gradient covariances. Note that if heuristic or exact damping is used the damping cannot be a tuple. Default:
0.
.use_heuristic_damping – Whether to use a heuristic damping strategy by Martens and Grosse, 2015 (Section 6.3). For input covariances \(A \in \mathbb{R}^{n \times n}\) and gradient covariances \(B \in \mathbb{R}^{m \times m}\), we define \(\pi := \sqrt{\frac{m\; \text{tr}(A)}{n\; \text{tr}(B)}}\) and set the damping for the input covariances \(A\) to \(\max(\pi\; \sqrt{\text{damping}}, \text{min_damping})\) and for the gradient covariances \(B\) to \(\max(\frac{1}{\pi}\; \sqrt{\text{damping}}, \text{min_damping})\). Default:
False
.min_damping – Minimum damping value. Only used if
use_heuristic_damping
isTrue
. Default:1e-8
.use_exact_damping – Whether to use exact damping, i.e. to invert \((A \otimes B) + \text{damping}\; \mathbf{I}\). This is implemented via eigendecompositions of the Kronecker factors, e.g. see equation (21) in Grosse et al., 2023. Note that the eigendecomposition synchronizes the device with the CPU. Default:
False
.cache – Whether to cache the inverses of the Kronecker factors. Default:
True
.retry_double_precision – Whether to retry Cholesky decomposition used for inversion in double precision. Default:
True
.
- Raises:
ValueError – If the linear operator is not a
KFACLinearOperator
.ValueError – If both heuristic and exact damping are selected.
ValueError – If heuristic or exact damping is used and the damping value is a tuple.
- torch_matmat(M_torch: Tensor | List[Tensor]) Tensor | List[Tensor] [source]
Apply the inverse of KFAC to a matrix (multiple vectors) in PyTorch.
This allows for matrix-matrix products with the inverse KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.
- Parameters:
M_torch – Matrix for multiplication. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size
K
for the columns, i.e.[(K,) + p1.shape), (K,) + p2.shape, ...]
. If tensor, has shape[D, K]
with someK
.- Returns:
Matrix-multiplication result
KFAC⁻¹ @ M
. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of sizeK
for the columns, i.e.[(K,) + p1.shape, (K,) + p2.shape, ...]
. If tensor, has shape[D, K]
with someK
.
- torch_matvec(v_torch: Tensor | List[Tensor]) Tensor | List[Tensor] [source]
Apply the inverse of KFAC to a vector in PyTorch.
This allows for matrix-vector products with the inverse KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.
- Parameters:
v_torch – Vector for multiplication. If list of tensors, each entry has the same shape as a parameter, i.e.
[p1.shape, p2.shape, ...]
. If tensor, has shape[D]
.- Returns:
Matrix-multiplication result
KFAC⁻¹ @ v
. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter, i.e.[p1.shape, p2.shape, ...]
. If tensor, has shape[D]
.- Raises:
ValueError – If the input tensor has the wrong data type.
Sub-matrices
- class curvlinops.SubmatrixLinearOperator(*args, **kwargs)[source]
Class for sub-matrices of linear operators.
- __init__(A: LinearOperator, row_idxs: List[int], col_idxs: List[int])[source]
Store the linear operator and indices of its sub-matrix.
Represents the sub-matrix
A[row_idxs, :][col_idxs, :]
.- Parameters:
A – A linear operator.
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- set_submatrix(row_idxs: List[int], col_idxs: List[int])[source]
Define the sub-matrix.
Internally sets the linear operator’s shape.
- Parameters:
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- Raises:
ValueError – If the index lists contain duplicate values, non-integers, or out-of-bounds indices.
Spectral density approximation
- curvlinops.lanczos_approximate_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 3.0, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01) Tuple[ndarray, ndarray] [source]
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Implements algorithm 2 (
LanczosApproxSpec
) of Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappa
of the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1
. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3
. Taken from papyan2020traces, Section D.2.boundaries – Estimates of the minimum and maximum eigenvalues of
A
. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05
. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundaries
are not specified). Relative accuracy used to estimate the spectral boundary.0
implies machine precision. Default:1e-2
, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
- curvlinops.lanczos_approximate_log_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 1.04, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01, epsilon: float = 1e-05) Tuple[ndarray, ndarray] [source]
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of log(|A| + εI) ∈ Rᵈˣᵈ.
Follows the idea of Section C.7 in Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).
Here, log denotes the natural logarithm (i.e. base e).
Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappa
of the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1
. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
1.04
. Obtained by tweaking while reproducing Fig. 15b from papyan2020traces (not specified by the paper).boundaries – Estimates of the minimum and maximum eigenvalues of
|A|
. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05
. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundaries
are not specified). Relative accuracy used to estimate the spectral boundary.0
implies machine precision. Default:1e-2
, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.epsilon – Shift to increase numerical stability. Default:
1e-5
. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of log(|A| + εI).
- class curvlinops.LanczosApproximateSpectrumCached(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)[source]
Class to approximate the spectral density of p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Caches Lanczos iterations to efficiently produce spectral density approximations with different hyperparameters.
- __init__(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)[source]
Initialize.
- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
boundaries – Estimates of the minimum and maximum eigenvalues of
A
. If left unspecified, they will be estimated internally.boundaries_tol – (Only relevant if
boundaries
are not specified). Relative accuracy used to estimate the spectral boundary.0
implies machine precision. Default:1e-2
, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- approximate_spectrum(num_repeats: int = 1, num_points: int = 1024, kappa: float = 3.0, margin: float = 0.05) Tuple[ndarray, ndarray] [source]
Approximate the spectal density of A.
- Parameters:
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1
. Taken from papyan2020traces, Section D.2.num_points – Resolution. Default:
1024
.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3
. From papyan2020traces, Section D.2.margin – Relative margin added around the spectral boundary. Default:
0.05
. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
Trace approximation
- class curvlinops.HutchinsonTraceEstimator(A: LinearOperator)[source]
Class to perform trace estimation with Hutchinson’s method.
For details, see
Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics—Simulation and Computation.
Let \(\mathbf{A}\) be a square linear operator. We can approximate its trace \(\mathrm{Tr}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[a := \mathbf{v}^\top \mathbf{A} \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a] = \mathrm{Tr}(\mathbb{E}[\mathbf{v}^\top\mathbf{A} \mathbf{v}]) = \mathrm{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v} \mathbf{v}^\top]) = \mathrm{Tr}(\mathbf{A} \mathbf{I}) = \mathrm{Tr}(\mathbf{A})\,.\]Example
>>> from numpy import trace, mean, round >>> from numpy.random import rand, seed >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> tr_A = trace(A) # exact trace as reference >>> estimator = HutchinsonTraceEstimator(A) >>> # one- and multi-sample approximations >>> tr_A_low_precision = estimator.sample() >>> tr_A_high_precision = mean([estimator.sample() for _ in range(1_000)]) >>> assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision) >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (4.4575, 6.6796, 4.3886)
- __init__(A: LinearOperator)[source]
Store the linear operator whose trace will be estimated.
- Parameters:
A – Linear square-shaped operator whose trace will be estimated.
- Raises:
ValueError – If the operator is not square.
- sample(distribution: str = 'rademacher') float [source]
Draw a sample from the trace estimator.
Multiple samples can be combined into a more accurate trace estimation via averaging.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'
or'normal'
. Default is'rademacher'
.- Returns:
Sample from the trace estimator.
- class curvlinops.HutchPPTraceEstimator(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')[source]
Class to perform trace estimation with the Huch++ method.
In contrast to vanilla Hutchinson, Hutch++ has lower variance, but requires more memory.
For details, see
Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2020). Hutch++: optimal stochastic trace estimation.
Let \(\mathbf{A}\) be a square linear operator whose trace we want to approximate. First, we compute an orthonormal basis \(\mathbf{Q}\) of a sub-space spanned by \(\mathbf{A} \mathbf{S}\) where \(\mathbf{S}\) is a tall random matrix with i.i.d. elements. Then, we compute the trace in the sub-space and apply Hutchinson’s estimator in the remaining space spanned by \(\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top\): We can draw a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[a := \mathrm{Tr}(\mathbf{Q}^\top \mathbf{A} \mathbf{Q}) + \mathbf{v}^\top (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top)^\top \mathbf{A} (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top) \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased, \(\mathbb{E}[a] = \mathrm{Tr}(\mathbf{A})\), as the first term is constant and the second part is Hutchinson’s estimator in a sub-space.
Example
>>> from numpy import trace, mean, round >>> from numpy.random import rand, seed >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> tr_A = trace(A) # exact trace as reference >>> estimator = HutchPPTraceEstimator(A) >>> # one- and multi-sample approximations >>> tr_A_low_precision = estimator.sample() >>> tr_A_high_precision = mean([estimator.sample() for _ in range(998)]) >>> # assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision) >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (4.4575, 2.4085, 4.5791)
- __init__(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')[source]
Store the linear operator whose trace will be estimated.
- Parameters:
A – Linear square-shaped operator whose trace will be estimated.
basis_dim – Dimension of the subspace used for exact trace estimation. Can be at most the linear operator’s dimension. By default, its size will be 1% of the matrix’s dimension, but at most
10
. This assumes that we are working with very large matrices and we can only afford storing a small number of columns at a time.basis_distribution – Distribution of the vectors used to construct the subspace. Either
'rademacher'` or ``'normal'
. Default is'rademacher'
.
- Raises:
ValueError – If the operator is not square or the basis dimension is too large.
Note
If you are planning to perform a fair (i.e. same computation budget) comparison with vanilla Hutchinson,
basis_dim
should bes / 3
wheres
is the number of samples used by vanilla Hutchinson. Ifs / 3
requires storing a too large matrix, you can pickbasis_dim = s1
and draws2
samples from Hutch++ such that2 * s1 + s2 = s
.
- sample(distribution: str = 'rademacher') float [source]
Draw a sample from the trace estimator.
Multiple samples can be combined into a more accurate trace estimation via averaging.
Note
Calling this function for the first time will also compute the sub-space and its trace. Future calls will be faster as the latter are cached internally.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'
or'normal'
. Default is'rademacher'
.- Returns:
Sample from the trace estimator.
Diagonal approximation
- class curvlinops.HutchinsonDiagonalEstimator(A: LinearOperator)[source]
Class to perform diagonal estimation with Hutchinson’s method.
For details, see
Martens, J., Sutskever, I., & Swersky, K. (2012). Estimating the hessian by back-propagating curvature. International Conference on Machine Learning (ICML).
Let \(\mathbf{A}\) be a square linear operator. We can approximate its diagonal \(\mathrm{diag}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator
\[\mathbf{a} := \mathbf{v} \odot \mathbf{A} \mathbf{v} \approx \mathrm{diag}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a_i] = \sum_j \mathbb{E}[v_i A_{i,j} v_j] = \sum_j A_{i,j} \mathbb{E}[v_i v_j] = \sum_j A_{i,j} \delta_{i, j} = A_{i,i}\,.\]Example
>>> from numpy import diag, mean, round >>> from numpy.random import rand, seed >>> from numpy.linalg import norm >>> seed(0) # make deterministic >>> A = rand(10, 10) >>> diag_A = diag(A) # exact diagonal as reference >>> estimator = HutchinsonDiagonalEstimator(A) >>> # one- and multi-sample approximations >>> diag_A_low_precision = estimator.sample() >>> samples = [estimator.sample() for _ in range(1_000)] >>> diag_A_high_precision = mean(samples, axis=0) >>> # compute residual norms >>> error_low_precision = norm(diag_A - diag_A_low_precision) >>> error_high_precision = norm(diag_A - diag_A_high_precision) >>> assert error_low_precision > error_high_precision >>> round(error_low_precision, 4), round(error_high_precision, 4) (5.7268, 0.1525)
- __init__(A: LinearOperator)[source]
Store the linear operator whose diagonal will be estimated.
- Parameters:
A – Linear square-shaped operator whose diagonal will be estimated.
- Raises:
ValueError – If the operator is not square.
- sample(distribution: str = 'rademacher') ndarray [source]
Draw a sample from the diagonal estimator.
Multiple samples can be combined into a more accurate diagonal estimation via averaging.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'
or'normal'
. Default is'rademacher'
.- Returns:
A Sample from the diagonal estimator.
Frobenius norm approximation
- class curvlinops.HutchinsonSquaredFrobeniusNormEstimator(A: LinearOperator)[source]
Estimate the squared Frobenius norm of a matrix using Hutchinson’s method.
Let \(\mathbf{A} \in \mathbb{R}^{M \times N}\) be some matrix. It’s Frobenius norm \(\lVert\mathbf{A}\rVert_\text{F}\) is defined via:
\[\lVert\mathbf{A}\rVert_\text{F}^2 = \sum_{m=1}^M \sum_{n=1}^N \mathbf{A}_{n,m}^2 = \text{Tr}(\mathbf{A}^\top \mathbf{A}).\]Due to the last equality, we can use Hutchinson-style trace estimation to estimate the squared Frobenius norm.
Example
>>> from numpy import mean, round >>> from numpy.linalg import norm >>> from numpy.random import rand, seed >>> seed(0) # make deterministic >>> A = rand(5, 5) >>> fro2_A = norm(A, ord='fro')**2 # exact squared Frobenius norm as reference >>> estimator = HutchinsonSquaredFrobeniusNormEstimator(A) >>> # one- and multi-sample approximations >>> fro2_A_low_prec = estimator.sample() >>> fro2_A_high_prec = mean([estimator.sample() for _ in range(1_000)]) >>> assert abs(fro2_A - fro2_A_low_prec) > abs(fro2_A - fro2_A_high_prec) >>> round(fro2_A, 4), round(fro2_A_low_prec, 4), round(fro2_A_high_prec, 4) (10.7192, 8.3257, 10.6406)
- __init__(A: LinearOperator)[source]
Store the linear operator whose squared Frobenius norm will be estimated.
- Parameters:
A – Linear operator whose squared Frobenius norm will be estimated.
- sample(distribution: str = 'rademacher') float [source]
Draw a sample from the squared Frobenius norm estimator.
Multiple samples can be combined into a more accurate squared Frobenius norm estimation via averaging.
- Parameters:
distribution – Distribution of the vector along which the linear operator will be evaluated. Either
'rademacher'
or'normal'
. Default is'rademacher'
.- Returns:
Sample from the squared Frobenius norm estimator.
Experimental
The API of experimental features may be subject to changes, or they might become deprecated.
- class curvlinops.experimental.ActivationHessianLinearOperator(*args, **kwargs)[source]
Hessian of the loss w.r.t. hidden features in a neural network.
Consider the empirical risk on a single mini-batch \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N\}\), \(\mathbf{y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N\}\):
\[\mathcal{L}(\mathbf{\theta}) = \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]Let \(\mathbf{Z} = \{ \mathbf{z}_1, \dots, \mathbf{z}_N \}\) denote a batch of some intermediate feature produced inside the neural network’s forward pass. The Hessian w.r.t. the flattened activations that is represented by this linear operator is
\[\nabla^2_{\mathbf{Z}} \mathcal{L} = \nabla^2_{\mathbf{Z}} \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]and has dimension \(\mathrm{dim}(\mathbf{Z}) = N \mathrm{dim}(\mathbf{z})\).
- __init__(model_func: Module, loss_func: Callable[[Tensor, Tensor], Tensor], activation: Tuple[str, str, int], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)[source]
Linear operator for the loss Hessian w.r.t. intermediate features.
- Parameters:
model_func – A neural network.
loss_func – A loss function.
activation – A tuple specifying w.r.t. what intermediate feature the Hessian shall be computed. Has three entries. The first entry is the layer’s name inside the model (any string from
model_func.named_modules
). The second entry can either be"output"
or"input"
and specifies whether the intermediate feature is the layer’s output or the layer’s input. The third entry specifies what tensor should be taken from the tuple of input tensors or the tuple of output tensors. For most layers that process a single Tensor into another single Tensor, this value is0
.data – An iterable of mini-batches. Must have a single batch, otherwise the activation Hessian is not defined.
progressbar – Show a progressbar during matrix-multiplication. Default:
False
.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_last
or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True
. This is a safeguard, only turn it off if you know what you are doing.shape – Shape of the represented matrix. If
None
, this dimension will be inferred at the cost of one forward pass through the model.
- Raises:
ValueError – If
data
contains more than one batch.
Example
>>> from numpy import eye, allclose >>> from torch import manual_seed, rand >>> from torch.nn import Linear, MSELoss, Sequential, ReLU >>> >>> loss_func = MSELoss() >>> model = Sequential(Linear(4, 3), ReLU(), Linear(3, 2)) >>> [name for name, _ in model.named_modules()] # available layer names ['', '0', '1', '2'] >>> data = [(rand(10, 4), rand(10, 2))] >>> >>> hessian = ActivationHessianLinearOperator( # Hessian w.r.t. ReLU input ... model, loss_func, ("1", "input", 0), data ... ) >>> hessian.shape # batch size * feature dimension (10 * 3) (30, 30) >>> >>> # The ReLU's input is the first Linear's output, let's check that >>> hessian2 = ActivationHessianLinearOperator( # Hessian w.r.t. first output ... model, loss_func, ("0", "output", 0), data ... ) >>> I = eye(hessian.shape[1]) >>> allclose(hessian @ I, hessian2 @ I) True