Linear operators

Hessian

class curvlinops.HessianLinearOperator(*args, **kwargs)[source]

Hessian as SciPy linear operator.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The Hessian matrix is

\[\nabla^2_{\mathbf{\theta}} \mathcal{L} = c \sum_{n=1}^{N} \nabla_{\mathbf{\theta}}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\,.\]
SUPPORTS_BLOCKS

Whether the linear operator supports block operations. Default is True.

Type:

bool

__init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)

Linear operator for DNN matrices.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:
  • model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.

  • loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If None, there is no loss function and the represented matrix is independent of the loss function.

  • params – List of differentiable parameters used by the prediction function.

  • data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function.

  • progressbar – Show a progressbar during matrix-multiplication. Default: False.

  • check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

  • shape – Shape of the represented matrix. If None, assumes (D, D), where D is the total number of parameters.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of nn.Parameters forming a block. Entries must sum to len(params). For instance, [len(params)] considers the full matrix, while [1, 1, ...] corresponds to a block diagonal approximation where each parameter forms its own block.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:
  • RuntimeError – If the check for deterministic behavior fails.

  • ValueError – If block_sizes is specified but the linear operator does not support blocks.

  • ValueError – If the sum of blocks does not equal the number of parameters.

  • ValueError – If any block size is not positive.

  • ValueError – If X is not a tensor and batch_size_fn is not specified.
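
The following minimal sketch illustrates the intended usage on a toy regression model (the architecture, batch, and variable names are illustrative assumptions). Since the operator implements the SciPy LinearOperator interface, it multiplies onto NumPy vectors:

>>> from numpy import random
>>> from torch import rand
>>> from torch.nn import Linear, MSELoss, Sequential
>>> from curvlinops import HessianLinearOperator
>>>
>>> model = Sequential(Linear(4, 3), Linear(3, 2))
>>> loss_func = MSELoss()
>>> data = [(rand(8, 4), rand(8, 2))]  # a single mini-batch of 8 samples
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> H = HessianLinearOperator(model, loss_func, params, data)
>>>
>>> D = sum(p.numel() for p in params)  # total number of parameters
>>> v = random.rand(D)
>>> Hv = H @ v  # Hessian-vector product, a NumPy array of shape (D,)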

Generalized Gauss-Newton

class curvlinops.GGNLinearOperator(*args, **kwargs)[source]

GGN as SciPy linear operator.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The GGN matrix is

\[c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)^\top \left( \nabla_{f_\mathbf{\theta}(\mathbf{x}_n)}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)\,.\]
__init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)

Linear operator for DNN matrices.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:
  • model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.

  • loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If None, there is no loss function and the represented matrix is independent of the loss function.

  • params – List of differentiable parameters used by the prediction function.

  • data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function.

  • progressbar – Show a progressbar during matrix-multiplication. Default: False.

  • check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

  • shape – Shape of the represented matrix. If None, assumes (D, D), where D is the total number of parameters.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of nn.Parameters forming a block. Entries must sum to len(params). For instance, [len(params)] considers the full matrix, while [1, 1, ...] corresponds to a block diagonal approximation where each parameter forms its own block.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:
  • RuntimeError – If the check for deterministic behavior fails.

  • ValueError – If block_sizes is specified but the linear operator does not support blocks.

  • ValueError – If the sum of blocks does not equal the number of parameters.

  • ValueError – If any block size is not positive.

  • ValueError – If X is not a tensor and batch_size_fn is not specified.
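
Because the operator fulfills SciPy's LinearOperator interface, it can be passed directly to SciPy routines. A sketch (the toy model and sizes are assumptions) that estimates the GGN's leading eigenvalue with scipy.sparse.linalg.eigsh:

>>> from scipy.sparse.linalg import eigsh
>>> from torch import rand
>>> from torch.nn import Linear, MSELoss, Sequential
>>> from curvlinops import GGNLinearOperator
>>>
>>> model = Sequential(Linear(6, 4), Linear(4, 2))
>>> loss_func = MSELoss()
>>> data = [(rand(16, 6), rand(16, 2))]
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> GGN = GGNLinearOperator(model, loss_func, params, data)
>>>
>>> # the GGN is symmetric PSD, so eigsh applies; k=1 returns the top eigenpair
>>> eigenvalues, eigenvectors = eigsh(GGN, k=1, which="LM")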

Fisher (approximate)

class curvlinops.FisherMCLinearOperator(*args, **kwargs)[source]

Monte-Carlo approximation of the Fisher as SciPy linear operator.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. Let \(\ell(\mathbf{f}, \mathbf{y}) = - \log q(\mathbf{y} \mid \mathbf{f})\) be a negative log-likelihood loss. Denoting \(\mathbf{f}_n = f_{\mathbf{\theta}}(\mathbf{x}_n)\), the Fisher information matrix is

\[\begin{split}c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \nabla_{\mathbf{f}_n}^2 \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ = c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ \approx c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \frac{1}{M} \sum_{m=1}^M \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)\end{split}\]

with sampled targets \(\mathbf{\tilde{y}}_{n}^{(m)} \sim q( \cdot \mid \mathbf{f}_n)\). The expectation over the model’s likelihood is approximated via a Monte-Carlo estimator with \(M\) samples.

The linear operator represents a deterministic sample from this MC Fisher estimator. To generate different samples, create instances with different values of the seed argument.

__init__(model_func: Callable[[Tensor], Tensor], loss_func: MSELoss | CrossEntropyLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, mc_samples: int = 1, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]

Linear operator for the MC approximation of the Fisher.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:
  • model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.

  • loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.

  • params – List of differentiable parameters used by the prediction function.

  • data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).

  • progressbar – Show a progressbar during matrix-multiplication. Default: False.

  • check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

  • seed – Seed used to construct the internal random number generator used to draw samples at the beginning of each matrix-vector product. Default: 2147483647

  • mc_samples – Number of samples to use. Default: 1.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

NotImplementedError – If the loss function differs from MSELoss, BCEWithLogitsLoss, or CrossEntropyLoss.
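
A sketch of how distinct MC-Fisher samples are obtained through the seed argument (the toy classification model and the values of mc_samples and seed are arbitrary assumptions):

>>> from numpy import random
>>> from torch import rand, randint
>>> from torch.nn import CrossEntropyLoss, Linear
>>> from curvlinops import FisherMCLinearOperator
>>>
>>> model = Linear(5, 3)
>>> loss_func = CrossEntropyLoss()
>>> data = [(rand(8, 5), randint(3, (8,)))]  # inputs and integer class labels
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> F1 = FisherMCLinearOperator(model, loss_func, params, data, mc_samples=4, seed=0)
>>> F2 = FisherMCLinearOperator(model, loss_func, params, data, mc_samples=4, seed=1)
>>>
>>> v = random.rand(sum(p.numel() for p in params))
>>> Fv_1, Fv_2 = F1 @ v, F2 @ v  # two different draws of the MC Fisher estimator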

class curvlinops.KFACLinearOperator(*args, **kwargs)[source]

Linear operator to multiply with the Fisher/GGN’s KFAC approximation.

KFAC approximates the per-layer Fisher/GGN with a Kronecker product: Consider a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) in a single layer. The layer’s Fisher \(\mathbf{F}(\mathbf{\theta})\) for

\[\begin{split}\mathbf{\theta} = \begin{pmatrix} \mathrm{vec}(\mathbf{W}) \\ \mathbf{b} \end{pmatrix}\end{split}\]

where \(\mathrm{vec}\) denotes column-stacking, is approximated as

\[\mathbf{F}(\mathbf{\theta}) \approx \mathbf{A}_{(\text{KFAC})} \otimes \mathbf{B}_{(\text{KFAC})}\]

(see curvlinops.FisherMCLinearOperator for the Fisher’s definition). Loosely speaking, the first Kronecker factor is the un-centered covariance of the inputs to a layer. The second Kronecker factor is the un-centered covariance of ‘would-be’ gradients w.r.t. the layer’s output. Those ‘would-be’ gradients result from sampling labels from the model’s distribution and computing their gradients.

Kronecker-Factored Approximate Curvature (KFAC) was originally introduced for MLPs in

  • Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. International Conference on Machine Learning (ICML),

extended to CNNs in

  • Grosse, R., & Martens, J. (2016). A Kronecker-factored approximate Fisher matrix for convolution layers. International Conference on Machine Learning (ICML),

and generalized to all linear layers with weight sharing in

  • Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., & Hennig, P. (2023). Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures. Advances in Neural Information Processing Systems (NeurIPS).

_SUPPORTED_LOSSES

Tuple of supported loss functions.

_SUPPORTED_MODULES

Tuple of supported layers.

__init__(model_func: Module, loss_func: MSELoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None, seed: int = 2147483647, fisher_type: str = 'mc', mc_samples: int = 1, kfac_approx: str = 'expand', loss_average: None | str = 'batch', num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]

Kronecker-factored approximate curvature (KFAC) proxy of the Fisher/GGN.

Warning

If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.

Warning

This is an early prototype with limitations:
  • Only Linear and Conv2d modules are supported.

  • If check_deterministic is turned on (as it is by default), this will compute the KFAC matrices on CPU, even if all passed arguments live on the GPU.

Parameters:
  • model_func – The neural network. Must consist of modules.

  • loss_func – The loss function.

  • params – The parameters defining the Fisher/GGN that will be approximated through KFAC.

  • data – A data loader containing the data of the Fisher/GGN.

  • progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to False.

  • check_deterministic – Whether to check that the linear operator is deterministic. Defaults to True.

  • shape – The shape of the linear operator. If None, it will be inferred from the parameters. Defaults to None.

  • seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to 2147483647.

  • fisher_type – The type of Fisher/GGN to approximate. If 'type-2', the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. If 'mc', the expectation is approximated by sampling mc_samples labels from the model’s predictive distribution. If 'empirical', the empirical gradients are used, which corresponds to the uncentered gradient covariance, or the empirical Fisher. If 'forward-only', the gradient covariances will be identity matrices, see the FOOF method in Benzing, 2022 or ISAAC in Petersen et al., 2023. Defaults to 'mc'.

  • mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to 1 when fisher_type != 'mc'. Defaults to 1.

  • kfac_approx – A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g. Conv2d modules or Linear modules that process matrix- or higher-dimensional features. Possible values are 'expand' and 'reduce'. See Eschenhagen et al., 2023 for an explanation of the two approximations.

  • loss_average – Whether the loss function is a mean over per-sample losses and if yes, over which dimensions the mean is taken. If "batch", the loss function is a mean over as many terms as the size of the mini-batch. If "batch+sequence", the loss function is a mean over as many terms as the size of the mini-batch times the sequence length, e.g. in the case of language modeling. If None, the loss function is a sum. This argument is used to ensure that the preconditioner is scaled consistently with the loss and the gradient. Default: "batch".

  • num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have num_data * num_per_example_loss_terms * C entries, where C is the dimension of the random variable we define the likelihood over – for the CrossEntropyLoss it will be the number of classes, for the MSELoss and BCEWithLogitsLoss it will be the size of the last dimension of the model outputs/targets (our convention here). If None, num_per_example_loss_terms is inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults to None.

  • separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to True.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:
  • RuntimeError – If the check for deterministic behavior fails.

  • ValueError – If the loss function is not supported.

  • ValueError – If the loss average is not supported.

  • ValueError – If the loss average is None and the loss function’s reduction is not 'sum'.

  • ValueError – If the loss average is not None and the loss function’s reduction is 'sum'.

  • ValueError – If fisher_type != 'mc' and mc_samples != 1.

  • ValueError – If X is not a tensor and batch_size_fn is not specified.
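
A minimal construction sketch (the toy classification model and argument values are illustrative assumptions). Like the other operators, the KFAC approximation multiplies onto NumPy vectors; the Kronecker factors are computed and cached during the first product:

>>> from numpy import random
>>> from torch import rand, randint
>>> from torch.nn import CrossEntropyLoss, Linear, ReLU, Sequential
>>> from curvlinops import KFACLinearOperator
>>>
>>> model = Sequential(Linear(10, 8), ReLU(), Linear(8, 3))
>>> loss_func = CrossEntropyLoss()
>>> data = [(rand(16, 10), randint(3, (16,)))]
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> KFAC = KFACLinearOperator(model, loss_func, params, data, fisher_type="mc", mc_samples=1)
>>>
>>> v = random.rand(sum(p.numel() for p in params))
>>> KFACv = KFAC @ v  # triggers computation of the Kronecker factors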

property det: Tensor

Determinant of the KFAC approximation.

Will call _compute_kfac if it has not been called before and will cache the determinant until _compute_kfac is called again. Uses the property of the Kronecker product that \(\det(A \otimes B) = \det(A)^{m} \det(B)^{n}\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).

Returns:

Determinant of the KFAC approximation.

property frobenius_norm: Tensor

Frobenius norm of the KFAC approximation.

Will call _compute_kfac if it has not been called before and will cache the Frobenius norm until _compute_kfac is called again. Uses the property of the Kronecker product that \(\|A \otimes B\|_F = \|A\|_F \|B\|_F\).

Returns:

Frobenius norm of the KFAC approximation.

property logdet: Tensor

Log determinant of the KFAC approximation.

More numerically stable than the det property. Will call _compute_kfac if it has not been called before and will cache the log determinant until _compute_kfac is called again. Uses the property of the Kronecker product that \(\log \det(A \otimes B) = m \log \det(A) + n \log \det(B)\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).

Returns:

Log determinant of the KFAC approximation.

to_device(device: device)[source]

Load the linear operator to another device.

Parameters:

device – The device to which the linear operator should be moved.

torch_matmat(M_torch: Tensor | List[Tensor]) Tensor | List[Tensor][source]

Apply KFAC to a matrix (multiple vectors) in PyTorch.

This allows for matrix-matrix products with the KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.

Parameters:

M_torch – Matrix for multiplication. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size K for the columns, i.e. [(K,) + p1.shape, (K,) + p2.shape, ...]. If tensor, has shape [D, K] with some K.

Returns:

Matrix-multiplication result KFAC @ M. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size K for the columns, i.e. [(K,) + p1.shape, (K,) + p2.shape, ...]. If tensor, has shape [D, K] with some K.

torch_matvec(v_torch: Tensor | List[Tensor]) Tensor | List[Tensor][source]

Apply KFAC to a vector in PyTorch.

This allows for matrix-vector products with the KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.

Parameters:

v_torch – Vector for multiplication. If list of tensors, each entry has the same shape as a parameter, i.e. [p1.shape, p2.shape, ...]. If tensor, has shape [D].

Returns:

Matrix-multiplication result KFAC @ v. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter, i.e. [p1.shape, p2.shape, ...]. If tensor, has shape [D].

Raises:

ValueError – If the input tensor has the wrong data type.

property trace: Tensor

Trace of the KFAC approximation.

Will call _compute_kfac if it has not been called before and will cache the trace until _compute_kfac is called again. Uses the property of the Kronecker product that \(\text{tr}(A \otimes B) = \text{tr}(A) \text{tr}(B)\).

Returns:

Trace of the KFAC approximation.
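
Continuing the construction sketch from above (KFAC and params as assumed there), the cached scalar quantities and the PyTorch multiplication routines could be used as follows:

>>> tr = KFAC.trace                     # scalar torch.Tensor
>>> log_det = KFAC.logdet               # prefer over KFAC.det for numerical stability
>>> fro_norm = KFAC.frobenius_norm
>>>
>>> # multiply in parameter format without leaving PyTorch
>>> v_torch = [rand(*p.shape) for p in params]
>>> KFACv_torch = KFAC.torch_matvec(v_torch)  # list of tensors, same shapes as params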

Uncentered gradient covariance (empirical Fisher)

class curvlinops.EFLinearOperator(*args, **kwargs)[source]

Uncentered gradient covariance as SciPy linear operator.

The uncentered gradient covariance is often called ‘empirical Fisher’ (EF).

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The uncentered gradient covariance matrix is

\[c \sum_{n=1}^{N} \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right)^\top\,.\]

Note

Multiplication with the empirical Fisher is currently implemented with an inefficient for-loop.

__init__(model_func: Callable[[Tensor], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]

Linear operator for the uncentered gradient covariance/empirical Fisher (EF).

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:
  • model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.

  • loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.

  • params – List of differentiable parameters used by the prediction function.

  • data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).

  • progressbar – Show a progressbar during matrix-multiplication. Default: False.

  • check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

NotImplementedError – If the loss function differs from MSELoss, BCEWithLogitsLoss, or CrossEntropyLoss.
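
For tiny problems, the represented matrix can be materialized by multiplying the operator onto an identity matrix. A sketch for the empirical Fisher (toy model; only feasible when D is small):

>>> from numpy import eye
>>> from torch import rand
>>> from torch.nn import Linear, MSELoss
>>> from curvlinops import EFLinearOperator
>>>
>>> model = Linear(4, 2)
>>> loss_func = MSELoss()
>>> data = [(rand(8, 4), rand(8, 2))]
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> EF = EFLinearOperator(model, loss_func, params, data)
>>>
>>> D = sum(p.numel() for p in params)
>>> EF_dense = EF @ eye(D)  # dense D x D empirical Fisher as a NumPy array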

Jacobians

class curvlinops.JacobianLinearOperator(*args, **kwargs)[source]

Linear operator for the Jacobian.

Can be used with SciPy.

__init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]

Linear operator for the Jacobian as SciPy linear operator.

Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s Jacobian \(\mathbf{J}_\mathbf{\theta}\mathbf{f}\) is an \(NC \times D\) matrix with elements

\[\left[ \mathbf{J}_\mathbf{\theta}\mathbf{f} \right]_{(n,c), d} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]

Note that the data must be supplied in deterministic order.

Parameters:
  • model_func – Neural network function.

  • params – Neural network parameters.

  • data – Iterable of batched input-target pairs.

  • progressbar – Show progress bar.

  • check_deterministic – Check if model and data are deterministic.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.
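
A sketch of the operator's shape and action (the toy model and sizes are assumptions; no loss function is required since the Jacobian only involves the model):

>>> from numpy import random
>>> from torch import rand
>>> from torch.nn import Linear
>>> from curvlinops import JacobianLinearOperator
>>>
>>> model = Linear(4, 3)                # C = 3 outputs, D = 15 parameters
>>> data = [(rand(6, 4), rand(6, 3))]   # N = 6 samples
>>> params = [p for p in model.parameters() if p.requires_grad]
>>> J = JacobianLinearOperator(model, params, data)
>>>
>>> num_rows, num_cols = J.shape        # (N * C, D) = (18, 15)
>>> Jv = J @ random.rand(num_cols)      # NumPy array with N * C entries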

class curvlinops.TransposedJacobianLinearOperator(*args, **kwargs)[source]

Linear operator for the transpose Jacobian.

Can be used with SciPy.

__init__(model_func: Callable[[Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None)[source]

Linear operator for the transpose Jacobian as SciPy linear operator.

Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s transpose Jacobian \((\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top\) is a \(D \times NC\) matrix with elements

\[\left[ (\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top \right]_{d, (n,c)} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]

Note that the data must be supplied in deterministic order.

Parameters:
  • model_func – Neural network function.

  • params – Neural network parameters.

  • data – Iterable of batched input-target pairs.

  • progressbar – Show progress bar.

  • check_deterministic – Check if model and data are deterministic.

  • num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.

  • batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.
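
Since both Jacobian operators satisfy the SciPy LinearOperator interface, they compose. A hypothetical sketch (reusing model, params, data, J, and the NumPy import from the previous sketch) of forming the D x D Gauss-Newton-like product J^T J:

>>> from curvlinops import TransposedJacobianLinearOperator
>>>
>>> JT = TransposedJacobianLinearOperator(model, params, data)
>>> JTJ = JT @ J                        # composed SciPy operator of shape (D, D)
>>> JTJv = JTJ @ random.rand(J.shape[1])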

Inverses

class curvlinops.CGInverseLinearOperator(*args, **kwargs)[source]

Class for inverse linear operators via conjugate gradients.

__init__(A: LinearOperator)[source]

Store the linear operator whose inverse should be represented.

Parameters:

A – Linear operator whose inverse is formed. Must be symmetric and positive-definite.

set_cg_hyperparameters(x0: ndarray | None = None, maxiter: int | None = None, M: ndarray | LinearOperator | None = None, callback: Callable | None = None, atol: float | None = None, tol: float | None = 1e-05)[source]

Store hyperparameters for CG.

They will be used to approximate the inverse matrix-vector products.

For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.cg.html.

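A typical use is solving a damped curvature-vector system, e.g. for a Newton or natural-gradient step. A sketch under the assumption that GGN is a curvlinops GGNLinearOperator and g is a NumPy gradient vector of matching size (the damping value is arbitrary):

>>> from scipy.sparse import identity
>>> from scipy.sparse.linalg import aslinearoperator
>>> from curvlinops import CGInverseLinearOperator
>>>
>>> damping = 1e-3
>>> damped_GGN = GGN + damping * aslinearoperator(identity(GGN.shape[0], dtype=GGN.dtype))
>>> inverse = CGInverseLinearOperator(damped_GGN)
>>> inverse.set_cg_hyperparameters(maxiter=100)
>>> step = inverse @ g  # approximately solves (GGN + damping * I) @ step = g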

class curvlinops.LSMRInverseLinearOperator(*args, **kwargs)[source]

Class for inverse linear operators via LSMR.

See https://arxiv.org/abs/1006.0758 for details on the LSMR algorithm.

__init__(A: LinearOperator)[source]

Store the linear operator whose inverse should be represented.

Parameters:

A – Linear operator whose inverse is formed.

matvec_with_info(x: ndarray) Tuple[ndarray, int, int, float, float, float, float, float][source]

Multiply x by the inverse of A and return additional information.

Parameters:

x – Vector for multiplication.

Returns:

Result of inverse matrix-vector multiplication, A⁻¹ @ x with additional information; see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html for details (same return values).

set_lsmr_hyperparameters(damp: float = 0.0, atol: float | None = 1e-06, btol: float | None = 1e-06, conlim: float | None = 1e-08, maxiter: int | None = None, show: bool | None = False, x0: ndarray | None = None)[source]

Store hyperparameters for LSMR.

They will be used to approximate the inverse matrix-vector products.

For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html.

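A brief sketch of querying the solver diagnostics via matvec_with_info (A is assumed to be a SciPy LinearOperator and b a NumPy vector of compatible size):

>>> from curvlinops import LSMRInverseLinearOperator
>>>
>>> inverse = LSMRInverseLinearOperator(A)
>>> inverse.set_lsmr_hyperparameters(atol=1e-8, btol=1e-8, maxiter=200)
>>> x, istop, itn, *residual_info = inverse.matvec_with_info(b)  # x approximates A^-1 b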

class curvlinops.NeumannInverseLinearOperator(*args, **kwargs)[source]

Class for inverse linear operators via truncated Neumann series.

See https://en.wikipedia.org/w/index.php?title=Neumann_series&oldid=1131424698#Approximate_matrix_inversion.

Motivated by (referred to as lorraine2020optimizing in the following)

  • Lorraine, J., Vicol, P., & Duvenaud, D. (2020). Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Warning

The Neumann series can be non-convergent. In this case, the iterations will become numerically unstable, leading to NaNs.

Warning

The Neumann series can converge slowly. Use curvlinops.CGInverseLinearOperator for better accuracy.

__init__(A: LinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]

Store the linear operator whose inverse should be represented.

The Neumann series for an invertible linear operator \(\mathbf{A}\) is

\[\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \left(\mathbf{I} - \mathbf{A} \right)^k\,.\]

It is convergent if all eigenvalues satisfy \(0 < \lambda(\mathbf{A}) < 2\).

By rescaling the matrix with scale (\(\alpha\)), we have:

\[\mathbf{A}^{-1} = \alpha (\alpha \mathbf{A})^{-1} = \alpha \sum_{k=0}^{\infty} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,,\]

which is convergent if \(0 < \lambda(\mathbf{A}) < \frac{2}{\alpha}\).

Additionally, we truncate the series at num_terms (\(K\)):

\[\mathbf{A}^{-1} \approx \alpha \sum_{k=0}^{K} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,.\]
Parameters:
  • A – Linear operator whose inverse is formed.

  • num_terms – Number of terms in the truncated Neumann series. Default: 100.

  • scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series (details above). Default: 1.0.

  • check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default: True.

set_neumann_hyperparameters(num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]

Store hyperparameters for the truncated Neumann series.

Parameters:
  • num_terms – Number of terms in the truncated Neumann series. Default: 100.

  • scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series. Default: 1.0.

  • check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default: True.
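
A sketch of choosing scale from a spectral estimate (A is assumed to be a symmetric positive-definite LinearOperator whose largest eigenvalue lambda_max has been estimated, e.g. with scipy.sparse.linalg.eigsh, and b a compatible NumPy vector): any scale below 2 / lambda_max makes the scaled series convergent.

>>> from curvlinops import NeumannInverseLinearOperator
>>>
>>> scale = 1.0 / lambda_max  # safely inside (0, 2 / lambda_max)
>>> approx_inverse = NeumannInverseLinearOperator(A, num_terms=50, scale=scale)
>>> x = approx_inverse @ b    # rough approximation of A^-1 b for few terms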

class curvlinops.KFACInverseLinearOperator(*args, **kwargs)[source]

Class to invert instances of the KFACLinearOperator.

__init__(A: KFACLinearOperator, damping: float | Tuple[float, float] = 0.0, use_heuristic_damping: bool = False, min_damping: float = 1e-08, use_exact_damping: bool = False, cache: bool = True, retry_double_precision: bool = True)[source]

Store the linear operator whose inverse should be represented.

Parameters:
  • A – KFACLinearOperator whose inverse is formed.

  • damping – Damping value(s) for all input and gradient covariances. If tuple, the first value is used for the input covariances and the second value for the gradient covariances. Note that if heuristic or exact damping is used the damping cannot be a tuple. Default: 0.0.

  • use_heuristic_damping – Whether to use a heuristic damping strategy by Martens and Grosse, 2015 (Section 6.3). For input covariances \(A \in \mathbb{R}^{n \times n}\) and gradient covariances \(B \in \mathbb{R}^{m \times m}\), we define \(\pi := \sqrt{\frac{m\; \text{tr}(A)}{n\; \text{tr}(B)}}\) and set the damping for the input covariances \(A\) to \(\max(\pi\; \sqrt{\text{damping}}, \text{min_damping})\) and for the gradient covariances \(B\) to \(\max(\frac{1}{\pi}\; \sqrt{\text{damping}}, \text{min_damping})\). Default: False.

  • min_damping – Minimum damping value. Only used if use_heuristic_damping is True. Default: 1e-8.

  • use_exact_damping – Whether to use exact damping, i.e. to invert \((A \otimes B) + \text{damping}\; \mathbf{I}\). This is implemented via eigendecompositions of the Kronecker factors, e.g. see equation (21) in Grosse et al., 2023. Note that the eigendecomposition synchronizes the device with the CPU. Default: False.

  • cache – Whether to cache the inverses of the Kronecker factors. Default: True.

  • retry_double_precision – Whether to retry Cholesky decomposition used for inversion in double precision. Default: True.

Raises:
  • ValueError – If the linear operator is not a KFACLinearOperator.

  • ValueError – If both heuristic and exact damping are selected.

  • ValueError – If heuristic or exact damping is used and the damping value is a tuple.
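
A sketch of a KFAC-preconditioned (natural-gradient-like) step in PyTorch, assuming KFAC is a KFACLinearOperator built over params as in the sketch further above and that a backward pass has populated the parameter gradients (the damping value is arbitrary; see torch_matvec below):

>>> from curvlinops import KFACInverseLinearOperator
>>>
>>> KFAC_inv = KFACInverseLinearOperator(KFAC, damping=1e-3, use_exact_damping=True)
>>> gradients = [p.grad for p in params]                 # parameter-shaped tensors
>>> natural_gradient = KFAC_inv.torch_matvec(gradients)  # same shapes as params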

torch_matmat(M_torch: Tensor | List[Tensor]) Tensor | List[Tensor][source]

Apply the inverse of KFAC to a matrix (multiple vectors) in PyTorch.

This allows for matrix-matrix products with the inverse KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.

Parameters:

M_torch – Matrix for multiplication. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size K for the columns, i.e. [(K,) + p1.shape, (K,) + p2.shape, ...]. If tensor, has shape [D, K] with some K.

Returns:

Matrix-multiplication result KFAC⁻¹ @ M. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter with an additional leading dimension of size K for the columns, i.e. [(K,) + p1.shape, (K,) + p2.shape, ...]. If tensor, has shape [D, K] with some K.

torch_matvec(v_torch: Tensor | List[Tensor]) Tensor | List[Tensor][source]

Apply the inverse of KFAC to a vector in PyTorch.

This allows for matrix-vector products with the inverse KFAC approximation in PyTorch without converting tensors to numpy arrays, which avoids unnecessary device transfers when working with GPUs and flattening/concatenating.

Parameters:

v_torch – Vector for multiplication. If list of tensors, each entry has the same shape as a parameter, i.e. [p1.shape, p2.shape, ...]. If tensor, has shape [D].

Returns:

Matrix-multiplication result KFAC⁻¹ @ v. Return type is the same as the type of the input. If list of tensors, each entry has the same shape as a parameter, i.e. [p1.shape, p2.shape, ...]. If tensor, has shape [D].

Raises:

ValueError – If the input tensor has the wrong data type.

Sub-matrices

class curvlinops.SubmatrixLinearOperator(*args, **kwargs)[source]

Class for sub-matrices of linear operators.

__init__(A: LinearOperator, row_idxs: List[int], col_idxs: List[int])[source]

Store the linear operator and indices of its sub-matrix.

Represents the sub-matrix A[row_idxs, :][:, col_idxs].

Parameters:
  • A – A linear operator.

  • row_idxs – The sub-matrix’s row indices.

  • col_idxs – The sub-matrix’s column indices.

set_submatrix(row_idxs: List[int], col_idxs: List[int])[source]

Define the sub-matrix.

Internally sets the linear operator’s shape.

Parameters:
  • row_idxs – The sub-matrix’s row indices.

  • col_idxs – The sub-matrix’s column indices.

Raises:

ValueError – If the index lists contain duplicate values, non-integers, or out-of-bounds indices.
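
A sketch of slicing out a principal sub-block, e.g. the curvature block of the first 10 parameters (A is assumed to be a square curvature LinearOperator with at least 10 rows and columns):

>>> from numpy import random
>>> from curvlinops import SubmatrixLinearOperator
>>>
>>> idxs = list(range(10))                    # first 10 rows and columns
>>> A_block = SubmatrixLinearOperator(A, idxs, idxs)
>>> block_times_v = A_block @ random.rand(10)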

Spectral density approximation

curvlinops.lanczos_approximate_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 3.0, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01) Tuple[ndarray, ndarray][source]

Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.

Implements algorithm 2 (LanczosApproxSpec) of Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).

Internally rescales the operator spectrum to the interval [-1; 1] such that the width kappa of the Gaussian bumps used to approximate the delta peaks need not be tweaked.

Parameters:
  • A – Symmetric linear operator.

  • ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).

  • num_points – Resolution.

  • num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.

  • kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 3. Taken from papyan2020traces, Section D.2.

  • boundaries – Estimates of the minimum and maximum eigenvalues of A. If left unspecified, they will be estimated internally.

  • margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.

  • boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.

Returns:

Grid points λ and approximated spectral density p(λ) of A.
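
A sketch of approximating a curvature operator's spectral density (H is assumed to be a symmetric curvlinops operator, e.g. a HessianLinearOperator; the hyperparameter values are illustrative):

>>> from curvlinops import lanczos_approximate_spectrum
>>>
>>> grid, density = lanczos_approximate_spectrum(H, ncv=64, num_points=512, num_repeats=4)
>>> # grid holds the lambda values, density the approximate p(lambda) on that grid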

curvlinops.lanczos_approximate_log_spectrum(A: LinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 1.04, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01, epsilon: float = 1e-05) Tuple[ndarray, ndarray][source]

Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of log(|A| + εI) ∈ Rᵈˣᵈ.

Follows the idea of Section C.7 in Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).

Here, log denotes the natural logarithm (i.e. base e).

Internally rescales the operator spectrum to the interval [-1; 1] such that the width kappa of the Gaussian bumps used to approximate the delta peaks need not be tweaked.

Parameters:
  • A – Symmetric linear operator.

  • ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).

  • num_points – Resolution.

  • num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.

  • kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 1.04. Obtained by tweaking while reproducing Fig. 15b from papyan2020traces (not specified by the paper).

  • boundaries – Estimates of the minimum and maximum eigenvalues of |A|. If left unspecified, they will be estimated internally.

  • margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.

  • boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.

  • epsilon – Shift to increase numerical stability. Default: 1e-5. Taken from papyan2020traces, Section D.2.

Returns:

Grid points λ and approximated spectral density p(λ) of log(|A| + εI).

class curvlinops.LanczosApproximateSpectrumCached(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)[source]

Class to approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.

Caches Lanczos iterations to efficiently produce spectral density approximations with different hyperparameters.

__init__(A: LinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01)[source]

Initialize.

Parameters:
  • A – Symmetric linear operator.

  • ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).

  • boundaries – Estimates of the minimum and maximum eigenvalues of A. If left unspecified, they will be estimated internally.

  • boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.

approximate_spectrum(num_repeats: int = 1, num_points: int = 1024, kappa: float = 3.0, margin: float = 0.05) Tuple[ndarray, ndarray][source]

Approximate the spectral density of A.

Parameters:
  • num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.

  • num_points – Resolution. Default: 1024.

  • kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 3. From papyan2020traces, Section D.2.

  • margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.

Returns:

Grid points λ and approximated spectral density p(λ) of A.

Trace approximation

class curvlinops.HutchinsonTraceEstimator(A: LinearOperator)[source]

Class to perform trace estimation with Hutchinson’s method.

For details, see

  • Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics—Simulation and Computation.

Let \(\mathbf{A}\) be a square linear operator. We can approximate its trace \(\mathrm{Tr}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator

\[a := \mathbf{v}^\top \mathbf{A} \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]

This estimator is unbiased,

\[\mathbb{E}[a] = \mathrm{Tr}(\mathbb{E}[\mathbf{v}^\top\mathbf{A} \mathbf{v}]) = \mathrm{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v} \mathbf{v}^\top]) = \mathrm{Tr}(\mathbf{A} \mathbf{I}) = \mathrm{Tr}(\mathbf{A})\,.\]

Example

>>> from numpy import trace, mean, round
>>> from numpy.random import rand, seed
>>> seed(0) # make deterministic
>>> A = rand(10, 10)
>>> tr_A = trace(A) # exact trace as reference
>>> estimator = HutchinsonTraceEstimator(A)
>>> # one- and multi-sample approximations
>>> tr_A_low_precision = estimator.sample()
>>> tr_A_high_precision = mean([estimator.sample() for _ in range(1_000)])
>>> assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision)
>>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4)
(4.4575, 6.6796, 4.3886)
__init__(A: LinearOperator)[source]

Store the linear operator whose trace will be estimated.

Parameters:

A – Linear square-shaped operator whose trace will be estimated.

Raises:

ValueError – If the operator is not square.

sample(distribution: str = 'rademacher') float[source]

Draw a sample from the trace estimator.

Multiple samples can be combined into a more accurate trace estimation via averaging.

Parameters:

distribution – Distribution of the vector along which the linear operator will be evaluated. Either 'rademacher' or 'normal'. Default is 'rademacher'.

Returns:

Sample from the trace estimator.

class curvlinops.HutchPPTraceEstimator(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')[source]

Class to perform trace estimation with the Hutch++ method.

In contrast to vanilla Hutchinson, Hutch++ has lower variance, but requires more memory.

For details, see

  • Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2020). Hutch++: optimal stochastic trace estimation.

Let \(\mathbf{A}\) be a square linear operator whose trace we want to approximate. First, we compute an orthonormal basis \(\mathbf{Q}\) of a sub-space spanned by \(\mathbf{A} \mathbf{S}\) where \(\mathbf{S}\) is a tall random matrix with i.i.d. elements. Then, we compute the trace in the sub-space and apply Hutchinson’s estimator in the remaining space spanned by \(\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top\): We can draw a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator

\[a := \mathrm{Tr}(\mathbf{Q}^\top \mathbf{A} \mathbf{Q}) + \mathbf{v}^\top (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top)^\top \mathbf{A} (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top) \mathbf{v} \approx \mathrm{Tr}(\mathbf{A})\,.\]

This estimator is unbiased, \(\mathbb{E}[a] = \mathrm{Tr}(\mathbf{A})\), as the first term is constant and the second part is Hutchinson’s estimator in a sub-space.

Example

>>> from numpy import trace, mean, round
>>> from numpy.random import rand, seed
>>> seed(0) # make deterministic
>>> A = rand(10, 10)
>>> tr_A = trace(A) # exact trace as reference
>>> estimator = HutchPPTraceEstimator(A)
>>> # one- and multi-sample approximations
>>> tr_A_low_precision = estimator.sample()
>>> tr_A_high_precision = mean([estimator.sample() for _ in range(998)])
>>> # assert abs(tr_A - tr_A_low_precision) > abs(tr_A - tr_A_high_precision)
>>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4)
(4.4575, 2.4085, 4.5791)
__init__(A: LinearOperator, basis_dim: int | None = None, basis_distribution: str = 'rademacher')[source]

Store the linear operator whose trace will be estimated.

Parameters:
  • A – Linear square-shaped operator whose trace will be estimated.

  • basis_dim – Dimension of the subspace used for exact trace estimation. Can be at most the linear operator’s dimension. By default, its size will be 1% of the matrix’s dimension, but at most 10. This assumes that we are working with very large matrices and we can only afford storing a small number of columns at a time.

  • basis_distribution – Distribution of the vectors used to construct the subspace. Either 'rademacher' or 'normal'. Default is 'rademacher'.

Raises:

ValueError – If the operator is not square or the basis dimension is too large.

Note

If you are planning to perform a fair (i.e. same computation budget) comparison with vanilla Hutchinson, basis_dim should be s / 3 where s is the number of samples used by vanilla Hutchinson. If a basis of size s / 3 would be too large to store, you can pick basis_dim = s1 and draw s2 samples from Hutch++ such that 2 * s1 + s2 = s.

sample(distribution: str = 'rademacher') float[source]

Draw a sample from the trace estimator.

Multiple samples can be combined into a more accurate trace estimation via averaging.

Note

Calling this function for the first time will also compute the sub-space and its trace. Future calls will be faster as the latter are cached internally.

Parameters:

distribution – Distribution of the vector along which the linear operator will be evaluated. Either 'rademacher' or 'normal'. Default is 'rademacher'.

Returns:

Sample from the trace estimator.

Diagonal approximation

class curvlinops.HutchinsonDiagonalEstimator(A: LinearOperator)[source]

Class to perform diagonal estimation with Hutchinson’s method.

For details, see

  • Martens, J., Sutskever, I., & Swersky, K. (2012). Estimating the hessian by back-propagating curvature. International Conference on Machine Learning (ICML).

Let \(\mathbf{A}\) be a square linear operator. We can approximate its diagonal \(\mathrm{diag}(\mathbf{A})\) by drawing a random vector \(\mathbf{v}\) which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and sample from the estimator

\[\mathbf{a} := \mathbf{v} \odot \mathbf{A} \mathbf{v} \approx \mathrm{diag}(\mathbf{A})\,.\]

This estimator is unbiased,

\[\mathbb{E}[a_i] = \sum_j \mathbb{E}[v_i A_{i,j} v_j] = \sum_j A_{i,j} \mathbb{E}[v_i v_j] = \sum_j A_{i,j} \delta_{i, j} = A_{i,i}\,.\]

Example

>>> from numpy import diag, mean, round
>>> from numpy.random import rand, seed
>>> from numpy.linalg import norm
>>> seed(0) # make deterministic
>>> A = rand(10, 10)
>>> diag_A = diag(A) # exact diagonal as reference
>>> estimator = HutchinsonDiagonalEstimator(A)
>>> # one- and multi-sample approximations
>>> diag_A_low_precision = estimator.sample()
>>> samples = [estimator.sample() for _ in range(1_000)]
>>> diag_A_high_precision = mean(samples, axis=0)
>>> # compute residual norms
>>> error_low_precision = norm(diag_A - diag_A_low_precision)
>>> error_high_precision = norm(diag_A - diag_A_high_precision)
>>> assert error_low_precision > error_high_precision
>>> round(error_low_precision, 4), round(error_high_precision, 4)
(5.7268, 0.1525)
__init__(A: LinearOperator)[source]

Store the linear operator whose diagonal will be estimated.

Parameters:

A – Linear square-shaped operator whose diagonal will be estimated.

Raises:

ValueError – If the operator is not square.

sample(distribution: str = 'rademacher') ndarray[source]

Draw a sample from the diagonal estimator.

Multiple samples can be combined into a more accurate diagonal estimation via averaging.

Parameters:

distribution – Distribution of the vector along which the linear operator will be evaluated. Either 'rademacher' or 'normal'. Default is 'rademacher'.

Returns:

Sample from the diagonal estimator.

Frobenius norm approximation

class curvlinops.HutchinsonSquaredFrobeniusNormEstimator(A: LinearOperator)[source]

Estimate the squared Frobenius norm of a matrix using Hutchinson’s method.

Let \(\mathbf{A} \in \mathbb{R}^{M \times N}\) be some matrix. Its Frobenius norm \(\lVert\mathbf{A}\rVert_\text{F}\) is defined via:

\[\lVert\mathbf{A}\rVert_\text{F}^2 = \sum_{m=1}^M \sum_{n=1}^N \mathbf{A}_{m,n}^2 = \text{Tr}(\mathbf{A}^\top \mathbf{A}).\]

Due to the last equality, we can use Hutchinson-style trace estimation to estimate the squared Frobenius norm.

Example

>>> from numpy import mean, round
>>> from numpy.linalg import norm
>>> from numpy.random import rand, seed
>>> seed(0) # make deterministic
>>> A = rand(5, 5)
>>> fro2_A = norm(A, ord='fro')**2 # exact squared Frobenius norm as reference
>>> estimator = HutchinsonSquaredFrobeniusNormEstimator(A)
>>> # one- and multi-sample approximations
>>> fro2_A_low_prec = estimator.sample()
>>> fro2_A_high_prec = mean([estimator.sample() for _ in range(1_000)])
>>> assert abs(fro2_A - fro2_A_low_prec) > abs(fro2_A - fro2_A_high_prec)
>>> round(fro2_A, 4), round(fro2_A_low_prec, 4), round(fro2_A_high_prec, 4)
(10.7192, 8.3257, 10.6406)
__init__(A: LinearOperator)[source]

Store the linear operator whose squared Frobenius norm will be estimated.

Parameters:

A – Linear operator whose squared Frobenius norm will be estimated.

sample(distribution: str = 'rademacher') float[source]

Draw a sample from the squared Frobenius norm estimator.

Multiple samples can be combined into a more accurate squared Frobenius norm estimation via averaging.

Parameters:

distribution – Distribution of the vector along which the linear operator will be evaluated. Either 'rademacher' or 'normal'. Default is 'rademacher'.

Returns:

Sample from the squared Frobenius norm estimator.

Experimental

The API of experimental features may be subject to changes, or they might become deprecated.

class curvlinops.experimental.ActivationHessianLinearOperator(*args, **kwargs)[source]

Hessian of the loss w.r.t. hidden features in a neural network.

Consider the empirical risk on a single mini-batch \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N\}\), \(\mathbf{y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N\}\):

\[\mathcal{L}(\mathbf{\theta}) = \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]

Let \(\mathbf{Z} = \{ \mathbf{z}_1, \dots, \mathbf{z}_N \}\) denote a batch of some intermediate feature produced inside the neural network’s forward pass. The Hessian w.r.t. the flattened activations that is represented by this linear operator is

\[\nabla^2_{\mathbf{Z}} \mathcal{L} = \nabla^2_{\mathbf{Z}} \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]

and has dimension \(\mathrm{dim}(\mathbf{Z}) = N \mathrm{dim}(\mathbf{z})\).

__init__(model_func: Module, loss_func: Callable[[Tensor, Tensor], Tensor], activation: Tuple[str, str, int], data: Iterable[Tuple[Tensor, Tensor]], progressbar: bool = False, check_deterministic: bool = True, shape: Tuple[int, int] | None = None)[source]

Linear operator for the loss Hessian w.r.t. intermediate features.

Parameters:
  • model_func – A neural network.

  • loss_func – A loss function.

  • activation – A tuple specifying w.r.t. what intermediate feature the Hessian shall be computed. Has three entries. The first entry is the layer’s name inside the model (any string from model_func.named_modules). The second entry can either be "output" or "input" and specifies whether the intermediate feature is the layer’s output or the layer’s input. The third entry specifies what tensor should be taken from the tuple of input tensors or the tuple of output tensors. For most layers that process a single Tensor into another single Tensor, this value is 0.

  • data – An iterable of mini-batches. Must have a single batch, otherwise the activation Hessian is not defined.

  • progressbar – Show a progressbar during matrix-multiplication. Default: False.

  • check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

  • shape – Shape of the represented matrix. If None, this dimension will be inferred at the cost of one forward pass through the model.

Raises:

ValueError – If data contains more than one batch.

Example

>>> from numpy import eye, allclose
>>> from torch import manual_seed, rand
>>> from torch.nn import Linear, MSELoss, Sequential, ReLU
>>>
>>> loss_func = MSELoss()
>>> model = Sequential(Linear(4, 3), ReLU(), Linear(3, 2))
>>> [name for name, _ in model.named_modules()] # available layer names
['', '0', '1', '2']
>>> data = [(rand(10, 4), rand(10, 2))]
>>>
>>> hessian = ActivationHessianLinearOperator( # Hessian w.r.t. ReLU input
...     model, loss_func, ("1", "input", 0), data
... )
>>> hessian.shape # batch size * feature dimension (10 * 3)
(30, 30)
>>>
>>> # The ReLU's input is the first Linear's output, let's check that
>>> hessian2 = ActivationHessianLinearOperator( # Hessian w.r.t. first output
...     model, loss_func, ("0", "output", 0), data
... )
>>> I = eye(hessian.shape[1])
>>> allclose(hessian @ I, hessian2 @ I)
True