Linear operators

Hessian

class curvlinops.HessianLinearOperator( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the Hessian of an empirical risk.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The Hessian matrix is

\[\nabla^2_{\mathbf{\theta}} \mathcal{L} = c \sum_{n=1}^{N} \nabla_{\mathbf{\theta}}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\,.\]

SUPPORTS_BLOCKS

Whether the linear operator supports block operations. Default is True.

Type:: bool

SELF_ADJOINT

Whether the linear operator is self-adjoint (True for Hessians).

Type:: bool

__init__( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )

Linear operator for curvature matrices of empirical risks.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:

model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If None, there is no loss function and the represented matrix is independent of the loss function.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function.
progressbar – Show a progressbar during matrix-multiplication. Default: False.
check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of nn.Parameter``s forming a block. Entries must sum to ``len(params). For instance [len(params)] considers the full matrix, while [1, 1, ...] corresponds to a block diagonal approximation where each parameter forms its own block.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

ValueError – If block_sizes is specified but the linear operator does not support blocks.
ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If X is not a tensor and batch_size_fn is not specified.

Generalized Gauss-Newton

class curvlinops.GGNLinearOperator( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the generalized Gauss-Newton matrix of an empirical risk.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The GGN matrix is

\[c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)^\top \left( \nabla_{f_\mathbf{\theta}(\mathbf{x}_n)}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)\,.\]

SELF_ADJOINT

Whether the linear operator is self-adjoint. True for GGNs.

Type:: bool

__init__( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, block_sizes: List[int] | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )

Linear operator for curvature matrices of empirical risks.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:

model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If None, there is no loss function and the represented matrix is independent of the loss function.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function.
progressbar – Show a progressbar during matrix-multiplication. Default: False.
check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of nn.Parameter``s forming a block. Entries must sum to ``len(params). For instance [len(params)] considers the full matrix, while [1, 1, ...] corresponds to a block diagonal approximation where each parameter forms its own block.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

ValueError – If block_sizes is specified but the linear operator does not support blocks.
ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If X is not a tensor and batch_size_fn is not specified.

Fisher (approximate)

class curvlinops.FisherMCLinearOperator( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: MSELoss | CrossEntropyLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, mc_samples: int = 1, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None, )[source]

Monte-Carlo approximation of the Fisher as SciPy linear operator.

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. Let \(\ell(\mathbf{f}, \mathbf{y}) = - \log q(\mathbf{y} \mid \mathbf{f})\) be a negative log-likelihood loss. Denoting \(\mathbf{f}_n = f_{\mathbf{\theta}}(\mathbf{x}_n)\), the Fisher information matrix is

\[\begin{split}c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \nabla_{\mathbf{f}_n}^2 \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ = c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ \approx c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \frac{1}{M} \sum_{m=1}^M \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)\end{split}\]

with sampled targets \(\mathbf{\tilde{y}}_{n}^{(m)} \sim q( \cdot \mid \mathbf{f}_n)\). The expectation over the model’s likelihood is approximated via a Monte-Carlo estimator with \(M\) samples.

The linear operator represents a deterministic sample from this MC Fisher estimator. To generate different samples, you have to create instances with varying random seed argument.

SELF_ADJOINT

Whether the operator is self-adjoint. True for the Fisher.

Type:: bool

supported_losses: Supported loss functions.

FIXED_DATA_ORDER

Whether the data order must be fix. True for MC-Fisher.

Type:: bool

__init__( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: MSELoss | CrossEntropyLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, mc_samples: int = 1, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None, )[source]

Linear operator for the Monte-Carlo approximation of the type-I Fisher.

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:

model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).
progressbar – Show a progressbar during matrix-multiplication. Default: False.
check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.
seed – Seed used to construct the internal random number generator used to draw samples at the beginning of each matrix-vector product. Default: 2147483647
mc_samples – Number of samples to use. Default: 1.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

NotImplementedError – If the loss function differs from MSELoss, BCEWithLogitsLoss, or CrossEntropyLoss.

class curvlinops.KFACLinearOperator( model_func: Module, loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, fisher_type: str = FisherType.MC, mc_samples: int = 1, kfac_approx: str = KFACType.EXPAND, num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator to multiply with the Fisher/GGN’s KFAC approximation.

KFAC approximates the per-layer Fisher/GGN with a Kronecker product: Consider a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) in a single layer. The layer’s Fisher \(\mathbf{F}(\mathbf{\theta})\) for

\[\begin{split}\mathbf{\theta} = \begin{pmatrix} \mathrm{vec}(\mathbf{W}) \\ \mathbf{b} \end{pmatrix}\end{split}\]

where \(\mathrm{vec}\) denotes column-stacking is approximated as

\[\mathbf{F}(\mathbf{\theta}) \approx \mathbf{A}_{(\text{KFAC})} \otimes \mathbf{B}_{(\text{KFAC})}\]

(see curvlinops.FisherMCLinearOperator for the Fisher’s definition). Loosely speaking, the first Kronecker factor is the un-centered covariance of the inputs to a layer. The second Kronecker factor is the un-centered covariance of ‘would-be’ gradients w.r.t. the layer’s output. Those ‘would-be’ gradients result from sampling labels from the model’s distribution and computing their gradients.

Kronecker-Factored Approximate Curvature (KFAC) was originally introduced for MLPs in

Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored

approximate curvature. International Conference on Machine Learning (ICML),

extended to CNNs in

Grosse, R., & Martens, J. (2016). A kronecker-factored approximate Fisher matrix for

convolution layers. International Conference on Machine Learning (ICML),

and generalized to all linear layers with weight sharing in

Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., Hennig, P. (2023).

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures (NeurIPS).

_SUPPORTED_LOSSES: Tuple of supported loss functions.

_SUPPORTED_MODULES: Tuple of supported layers.

_SUPPORTED_FISHER_TYPE

Enum of supported Fisher types.

Type:: curvlinops.kfac.FisherType

_SUPPORTED_KFAC_APPROX

Enum of supported KFAC approximation types.

Type:: curvlinops.kfac.KFACType

SELF_ADJOINT

Whether the operator is self-adjoint. True for KFAC.

Type:: bool

__init__( model_func: Module, loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, fisher_type: str = FisherType.MC, mc_samples: int = 1, kfac_approx: str = KFACType.EXPAND, num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Kronecker-factored approximate curvature (KFAC) proxy of the Fisher/GGN.

Warning

If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.

Warning

This is an early proto-type with limitations:

Only Linear and Conv2d modules are supported.

Parameters:

model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through KFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to False.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to True.
seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to 2147483647.
fisher_type – The type of Fisher/GGN to approximate. If FisherType.TYPE2, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. If FisherType.MC, the expectation is approximated by sampling mc_samples labels from the model’s predictive distribution. If FisherType.EMPIRICAL, the empirical gradients are used which corresponds to the uncentered gradient covariance, or the empirical Fisher. If FisherType.FORWARD_ONLY, the gradient covariances will be identity matrices, see the FOOF method in Benzing, 2022 or ISAAC in Petersen et al., 2023. Defaults to FisherType.MC.
mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to 1 when fisher_type != FisherType.MC. Defaults to 1.
kfac_approx – A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g. Conv2d modules or Linear modules that process matrix- or higher-dimensional features. Possible values are KFACType.EXPAND and KFACType.REDUCE. See Eschenhagen et al., 2023 for an explanation of the two approximations. Defaults to KFACType.EXPAND.
num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have num_data * num_per_example_loss_terms * C entries, where C is the dimension of the random variable we define the likelihood over – for the CrossEntropyLoss it will be the number of classes, for the MSELoss and BCEWithLogitsLoss it will be the size of the last dimension of the the model outputs/targets (our convention here). If None, num_per_example_loss_terms is inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults to None.
separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to True.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

ValueError – If the loss function is not supported.
ValueError – If fisher_type != FisherType.MC and mc_samples != 1.
ValueError – If X is not a tensor and batch_size_fn is not specified.

property det: Tensor

Determinant of the KFAC approximation.

Will call compute_kronecker_factors if it has not been called before and will cache the determinant until compute_kronecker_factors is called again. Uses the property of the Kronecker product that \(\det(A \otimes B) = \det(A)^{m} \det(B)^{n}\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).

Returns:: Determinant of the KFAC approximation.

property frobenius_norm: Tensor

Frobenius norm of the KFAC approximation.

Will call compute_kronecker_factors if it has not been called before and will cache the Frobenius norm until compute_kronecker_factors is called again. Uses the property of the Kronecker product that \(\|A \otimes B\|_F = \|A\|_F \|B\|_F\).

Returns:: Frobenius norm of the KFAC approximation.

classmethod from_state_dict( state_dict: Dict[str, Any], model_func: Module, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], check_deterministic: bool = True, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, ) → KFACLinearOperator[source]

Load a KFAC linear operator from a state dictionary.

Parameters:

state_dict – State dictionary.
model_func – The model function.
params – The model’s parameters that KFAC is computed for.
data – A data loader containing the data of the Fisher/GGN.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to True.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Returns:

Linear operator of KFAC approximation.

load_state_dict(state_dict: Dict[str, Any])[source]

Load the state of the KFAC linear operator.

Warning

Loading a state dict will overwrite the parameters of the model underlying the linear operator!

Parameters:

state_dict – State dictionary.

Raises:

ValueError – If the loss function does not match the state dict.
ValueError – If the loss function reduction does not match the state dict.

property logdet: Tensor

Log determinant of the KFAC approximation.

More numerically stable than the det property. Will call compute_kronecker_factors if it has not been called before and will cache the log determinant until compute_kronecker_factors is called again. Uses the property of the Kronecker product that \(\log \det(A \otimes B) = m \log \det(A) + n \log \det(B)\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).

Returns:: Log determinant of the KFAC approximation.

state_dict() → Dict[str, Any][source]

Return the state of the KFAC linear operator.

Returns:: State dictionary.

property trace: Tensor

Trace of the KFAC approximation.

Will call compute_kronecker_factors if it has not been called before and will cache the trace until compute_kronecker_factors is called again. Uses the property of the Kronecker product that \(\text{tr}(A \otimes B) = \text{tr}(A) \text{tr}(B)\).

Returns:: Trace of the KFAC approximation.

class curvlinops.EKFACLinearOperator( model_func: Module, loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, fisher_type: str = FisherType.MC, mc_samples: int = 1, kfac_approx: str = KFACType.EXPAND, num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None, )[source]

Linear operator to multiply with the Fisher/GGN’s EKFAC approximation.

Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EKFAC) was originally introduced in

George, T., Laurent, C., Bouthillier, X., Ballas, N., Vincent, P. (2018).

Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis (NeurIPS)

and concurrently in the context of continual learning in

Liu, X., Masana, M., Herranz, L., Van de Weijer, J., Lopez, A., Bagdanov, A. (2018). Rotate your networks: Better weight consolidation and less catastrophic forgetting (ICPR).

_SUPPORTED_FISHER_TYPE

Tuple with supported Fisher types.

Type:: Tuple[curvlinops.kfac.FisherType]

__init__( model_func: Module, loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, seed: int = 2147483647, fisher_type: str = FisherType.MC, mc_samples: int = 1, kfac_approx: str = KFACType.EXPAND, num_per_example_loss_terms: int | None = None, separate_weight_and_bias: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping], int] | None = None, )[source]

Eigenvalue-corrected KFAC (EKFAC) proxy of the Fisher/GGN.

Warning

If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.

Warning

This is an early proto-type with limitations:

Only Linear and Conv2d modules are supported.
Only models with 2d output are supported.

Parameters:

model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through EKFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to False.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to True.
seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to 2147483647.
fisher_type – The type of Fisher/GGN to approximate. If FisherType.TYPE2, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. If FisherType.MC, the expectation is approximated by sampling mc_samples labels from the model’s predictive distribution. If FisherType.EMPIRICAL, the empirical gradients are used which corresponds to the uncentered gradient covariance/empirical Fisher. Defaults to FisherType.MC.
mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to 1 when fisher_type != FisherType.MC. Defaults to 1.
kfac_approx –
A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g. Conv2d modules or Linear modules that process matrix- or higher-dimensional features. Possible values are KFACType.EXPAND and KFACType.REDUCE. See Eschenhagen et al., 2023 for an explanation of the two approximations. Defaults to KFACType.EXPAND.
num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have num_data * num_per_example_loss_terms * C entries, where C is the dimension of the random variable we define the likelihood over – for the CrossEntropyLoss it will be the number of classes, for the MSELoss and BCEWithLogitsLoss it will be the size of the last dimension of the the model outputs/targets (our convention here). If None, num_per_example_loss_terms is inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults to None.
separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to True.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

property det: Tensor

Determinant of the EKFAC approximation.

Will call compute_kronecker_factors and compute_eigenvalue_correction if either of them has not been called before and will cache the determinant until one of them is called again.

Returns:: Determinant of the EKFAC approximation.

property frobenius_norm: Tensor

Frobenius norm of the EKFAC approximation.

Will call compute_kronecker_factors and compute_eigenvalue_correction if either of them has not been called before and will cache the Frobenius norm until one of them is called again.

Returns:: Frobenius norm of the EKFAC approximation.

classmethod from_state_dict( state_dict: Dict[str, Any], model_func: Module, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], check_deterministic: bool = True, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, ) → KFACLinearOperator

Load a KFAC linear operator from a state dictionary.

Parameters:

state_dict – State dictionary.
model_func – The model function.
params – The model’s parameters that KFAC is computed for.
data – A data loader containing the data of the Fisher/GGN.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to True.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Returns:

Linear operator of KFAC approximation.

load_state_dict(state_dict: Dict[str, Any])[source]

Load the state of the EKFAC linear operator.

Parameters:: state_dict – State dictionary.

property logdet: Tensor

Log determinant of the EKFAC approximation.

More numerically stable than the det property. Will call compute_kronecker_factors and compute_eigenvalue_correction if either of them has not been called before and will cache the logdet until one of them is called again.

Returns:: Log determinant of the EKFAC approximation.

state_dict() → Dict[str, Any][source]

Return the state of the EKFAC linear operator.

Returns:: State dictionary.

property trace: Tensor

Trace of the EKFAC approximation.

Will call compute_kronecker_factors and compute_eigenvalue_correction if either of them has not been called before and will cache the trace until one of them is called again.

Returns:: Trace of the EKFAC approximation.

Uncentered gradient covariance (empirical Fisher)

class curvlinops.EFLinearOperator( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Uncentered gradient covariance as PyTorch linear operator.

The uncentered gradient covariance is often called ‘empirical Fisher’ (EF).

Consider the empirical risk

\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]

with \(c = \frac{1}{N}\) for reduction='mean' and \(c=1\) for reduction='sum'. The uncentered gradient covariance matrix is

\[c \sum_{n=1}^{N} \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right)^\top\,.\]

SELF_ADJOINT

Whether the linear operator is self-adjoint. True for empirical Fisher.

Type:: bool

__init__( model_func: Callable[[Tensor | MutableMapping], Tensor], loss_func: Callable[[Tensor, Tensor], Tensor] | None, params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the uncentered gradient covariance/empirical Fisher (EF).

Note

f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.

Parameters:

model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches [(X, y), ...] or a torch DataLoader. Note that X could be a dict or UserDict; this is useful for custom models. In this case, you must (i) specify the batch_size_fn argument, and (ii) take care of preprocessing like X.to(device) inside of your model.forward() function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).
progressbar – Show a progressbar during matrix-multiplication. Default: False.
check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Raises:

NotImplementedError – If the loss function differs from MSELoss, BCEWithLogitsLoss, or CrossEntropyLoss.

Jacobians

class curvlinops.JacobianLinearOperator( model_func: Callable[[MutableMapping | Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator of the Jacobian.

FIXED_DATA_ORDER

Whether the data order must be fix. True for Jacobians.

Type:: bool

__init__( model_func: Callable[[MutableMapping | Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the Jacobian as PyTorch linear operator.

Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s Jacobian \(\mathbf{J}_\mathbf{\theta}\mathbf{f}\) is an \(NC \times D\) matrix with elements

\[\left[ \mathbf{J}_\mathbf{\theta}\mathbf{f} \right]_{(n,c), d} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]

Note that the data must be supplied in deterministic order.

Parameters:

model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

class curvlinops.TransposedJacobianLinearOperator( model_func: Callable[[MutableMapping | Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the transpose Jacobian.

FIXED_DATA_ORDER

Whether the data order must be fix. True for Jacobians.

Type:: bool

__init__( model_func: Callable[[MutableMapping | Tensor], Tensor], params: List[Parameter], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, num_data: int | None = None, batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None, )[source]

Linear operator for the transpose Jacobian as PyTorch linear operator.

Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s transpose Jacobian \((\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top\) is an \(D \times NC\) matrix with elements

\[\left[ (\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top \right]_{d, (n,c)} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]

Note that the data must be supplied in deterministic order.

Parameters:

model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If None, it is inferred from the data at the cost of one traversal through the data loader.
batch_size_fn – If the X’s in data are not torch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates from data and return their batch size.

Inverses

class curvlinops.CGInverseLinearOperator(A: PyTorchLinearOperator, **cg_hyperparameters)[source]

Class for inverse linear operators via conjugate gradients.

Note

Internally, this operator uses GPyTorch’s implementation of CG.

__init__(A: PyTorchLinearOperator, **cg_hyperparameters)[source]

Store the linear operator whose inverse should be represented.

Parameters:

A – PyTorch linear operator whose inverse is formed. Must represent a symmetric and positive-definite matrix.
cg_hyperparameters – Keyword arguments for GPyTorch’s CG implementation. For details, see the documentation of the linear_cg function in https://github.com/cornellius-gp/linear_operator/blob/main/linear_operator/utils/linear_cg.py.

class curvlinops.LSMRInverseLinearOperator(A: PyTorchLinearOperator, **lsmr_hyperparameters)[source]

Class for inverse PyTorch linear operators via LSMR.

See https://arxiv.org/abs/1006.0758 for details on the LSMR algorithm.

Note

Internally, this operator uses SciPy’s CPU implementation of LSMR as PyTorch currently does not offer an LSMR interface that purely relies on matrix-vector products.

__init__(A: PyTorchLinearOperator, **lsmr_hyperparameters)[source]

Store the linear operator whose inverse should be represented.

Parameters:

A – Linear operator whose inverse is formed.
lsmr_hyperparameters – The hyper-parameters that will be passed to the LSMR implementation in SciPy. For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html.

class curvlinops.NeumannInverseLinearOperator(A: PyTorchLinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]

Class for inverse linear operators via truncated Neumann series.

# noqa: B950

See https://en.wikipedia.org/w/index.php?title=Neumann_series&oldid=1131424698#Approximate_matrix_inversion.

Motivated by

Lorraine, J., Vicol, P., & Duvenaud, D. (2020). Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Warning

The Neumann series can be non-convergent. In this case, the iterations will become numerically unstable, leading to ``NaN``s.

Warning

The Neumann series can converge slowly. Use curvlinops.CGInverLinearOperator for better accuracy.

__init__(A: PyTorchLinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]

Store the linear operator whose inverse should be represented.

The Neumann series for an invertible linear operator \(\mathbf{A}\) is

\[\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \left(\mathbf{I} - \mathbf{A} \right)^k\,.\]

and is convergent if all eigenvalues satisfy \(0 < \lambda(\mathbf{A}) < 2\).

By re-rescaling the matrix by scale (\(\alpha\)), we have:

\[\mathbf{A}^{-1} = \alpha (\alpha \mathbf{A})^{-1} = \alpha \sum_{k=0}^{\infty} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,,\]

which and is convergent if \(0 < \lambda(\mathbf{A}) < \frac{2}{\alpha}\).

Additionally, we truncate the series at num_terms (\(K\)):

\[\mathbf{A}^{-1} \approx \alpha \sum_{k=0}^{K} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,.\]

Parameters:

A – Linear operator whose inverse is formed.
num_terms – Number of terms in the truncated Neumann series. Default: 100.
scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series (details above). Default: 1.0.
check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default: True.

class curvlinops.KFACInverseLinearOperator( A: KFACLinearOperator, damping: float | Tuple[float, float] = 0.0, use_heuristic_damping: bool = False, min_damping: float = 1e-08, use_exact_damping: bool = False, cache: bool = True, retry_double_precision: bool = True, )[source]

Class to invert instances of the KFACLinearOperator.

SELF_ADJOINT

Whether the operator is self-adjoint. True for KFAC-inverse.

Type:: bool

__init__( A: KFACLinearOperator, damping: float | Tuple[float, float] = 0.0, use_heuristic_damping: bool = False, min_damping: float = 1e-08, use_exact_damping: bool = False, cache: bool = True, retry_double_precision: bool = True, )[source]

Store the linear operator whose inverse should be represented.

Parameters:

A – KFACLinearOperator whose inverse is formed.
damping – Damping value(s) for all input and gradient covariances. If tuple, the first value is used for the input covariances and the second value for the gradient covariances. Note that if heuristic or exact damping is used the damping cannot be a tuple. Default: 0..
use_heuristic_damping – Whether to use a heuristic damping strategy by Martens and Grosse, 2015 (Section 6.3). For input covariances \(A \in \mathbb{R}^{n \times n}\) and gradient covariances \(B \in \mathbb{R}^{m \times m}\), we define \(\pi := \sqrt{\frac{m\; \text{tr}(A)}{n\; \text{tr}(B)}}\) and set the damping for the input covariances \(A\) to \(\max(\pi\; \sqrt{\text{damping}}, \text{min_damping})\) and for the gradient covariances \(B\) to \(\max(\frac{1}{\pi}\; \sqrt{\text{damping}}, \text{min_damping})\). Default: False.
min_damping – Minimum damping value. Only used if use_heuristic_damping is True. Default: 1e-8.
use_exact_damping – Whether to use exact damping, i.e. to invert \((A \otimes B) + \text{damping}\; \mathbf{I}\). This is implemented via eigendecompositions of the Kronecker factors, e.g. see equation (21) in Grosse et al., 2023. Note that the eigendecomposition synchronizes the device with the CPU. Default: False.
cache – Whether to cache the inverses of the Kronecker factors. Default: True.
retry_double_precision – Whether to retry Cholesky decomposition used for inversion in double precision. Default: True.

Raises:

ValueError – If the linear operator is not a KFACLinearOperator.
ValueError – If both heuristic and exact damping are selected.
ValueError – If heuristic or exact damping is used and the damping value is a tuple.

Sub-matrices

class curvlinops.SubmatrixLinearOperator(A: PyTorchLinearOperator, row_idxs: List[int], col_idxs: List[int])[source]

Class for sub-matrices of linear operators.

__init__(A: PyTorchLinearOperator, row_idxs: List[int], col_idxs: List[int])[source]

Store the linear operator and indices of its sub-matrix.

Represents the sub-matrix A[row_idxs, :][col_idxs, :].

Parameters:

A – A linear operator.
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.

set_submatrix(row_idxs: List[int], col_idxs: List[int])[source]

Define the sub-matrix.

Internally sets the linear operator’s shape.

Parameters:

row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.

Raises:

ValueError – If the index lists contain duplicate values, non-integers, or out-of-bounds indices.

Spectral density approximation

Note

This functionality currently expects SciPy ``LinearOperator``s.

curvlinops.lanczos_approximate_spectrum( A: PyTorchLinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 3.0, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01, ) → Tuple[Tensor, Tensor][source]

Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.

Implements algorithm 2 (LanczosApproxSpec) of Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).

Internally rescales the operator spectrum to the interval [-1; 1] such that the width kappa of the Gaussian bumps used to approximate the delta peaks need not be tweaked.

Parameters:

A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.
kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 3. Taken from papyan2020traces, Section D.2.
boundaries – Estimates of the minimum and maximum eigenvalues of A. If left unspecified, they will be estimated internally.
margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.
boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.

Returns:

Grid points λ and approximated spectral density p(λ) of A.

curvlinops.lanczos_approximate_log_spectrum( A: PyTorchLinearOperator, ncv: int, num_points: int = 1024, num_repeats: int = 1, kappa: float = 1.04, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, margin: float = 0.05, boundaries_tol: float = 0.01, epsilon: float = 1e-05, ) → Tuple[Tensor, Tensor][source]

Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of log(|A| + εI) ∈ Rᵈˣᵈ.

Follows the idea of Section C.7 in Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).

Here, log denotes the natural logarithm (i.e. base e).

Internally rescales the operator spectrum to the interval [-1; 1] such that the width kappa of the Gaussian bumps used to approximate the delta peaks need not be tweaked.

Parameters:

A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.
kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 1.04. Obtained by tweaking while reproducing Fig. 15b from papyan2020traces (not specified by the paper).
boundaries – Estimates of the minimum and maximum eigenvalues of |A|. If left unspecified, they will be estimated internally.
margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.
boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
epsilon – Shift to increase numerical stability. Default: 1e-5. Taken from papyan2020traces, Section D.2.

Returns:

Grid points λ and approximated spectral density p(λ) of log(|A| + εI).

class curvlinops.LanczosApproximateSpectrumCached( A: PyTorchLinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01, )[source]

Class to approximate the spectral density of p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.

Caches Lanczos iterations to efficiently produce spectral density approximations with different hyperparameters.

__init__( A: PyTorchLinearOperator, ncv: int, boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None, boundaries_tol: float = 0.01, )[source]

Initialize.

Parameters:

A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
boundaries – Estimates of the minimum and maximum eigenvalues of A. If left unspecified, they will be estimated internally.
boundaries_tol – (Only relevant if boundaries are not specified). Relative accuracy used to estimate the spectral boundary. 0 implies machine precision. Default: 1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.

approximate_spectrum(num_repeats: int = 1, num_points: int = 1024, kappa: float = 3.0, margin: float = 0.05) → Tuple[Tensor, Tensor][source]

Approximate the spectal density of A.

Parameters:

num_repeats – Number of Lanczos quadratures to average the density over. Default: 1. Taken from papyan2020traces, Section D.2.
num_points – Resolution. Default: 1024.
kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default: 3. From papyan2020traces, Section D.2.
margin – Relative margin added around the spectral boundary. Default: 0.05. Taken from papyan2020traces, Section D.2.

Returns:

Grid points λ and approximated spectral density p(λ) of A.

Trace approximation

curvlinops.hutchinson_trace(A: Tensor | PyTorchLinearOperator, num_matvecs: int, distribution: str = 'rademacher') → Tensor[source]

Estimate a linear operator’s trace using the Girard-Hutchinson method.

For details, see

Girard, D. A. (1989). A fast ‘monte-carlo cross-validation’ procedure for large least squares problems with noisy data. Numerische Mathematik.
Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics—Simulation and Computation.

Let \(\mathbf{A}\) be a square linear operator. We can approximate its trace \(\mathrm{Tr}(\mathbf{A})\) by drawing \(N\) random vectors \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution that satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and compute

\[a := \frac{1}{N} \sum_{n=1}^N \mathbf{v}_n^\top \mathbf{A} \mathbf{v}_n \approx \mathrm{Tr}(\mathbf{A})\,.\]

This estimator is unbiased,

\[\mathbb{E}[a] = \mathrm{Tr}(\mathbb{E}[\mathbf{v}^\top\mathbf{A} \mathbf{v}]) = \mathrm{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v} \mathbf{v}^\top]) = \mathrm{Tr}(\mathbf{A} \mathbf{I}) = \mathrm{Tr}(\mathbf{A})\,.\]

Parameters:

A – A square linear operator whose trace is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost).
distribution – Distribution of the random vectors used for the trace estimation. Can be either 'rademacher' or 'normal'. Default: 'rademacher'.

Returns:

The estimated trace of the linear operator.

Example

>>> from torch import manual_seed, rand
>>> _ = manual_seed(0) # make deterministic
>>> A = rand(50, 50)
>>> tr_A = A.trace().item() # exact trace as reference
>>> # one- and multi-sample approximations
>>> tr_A_low_precision = hutchinson_trace(A, num_matvecs=1).item()
>>> tr_A_high_precision = hutchinson_trace(A, num_matvecs=40).item()
>>> # compute the relative errors
>>> rel_error_low_precision = abs(tr_A - tr_A_low_precision) / abs(tr_A)
>>> rel_error_high_precision = abs(tr_A - tr_A_high_precision) / abs(tr_A)
>>> assert rel_error_low_precision > rel_error_high_precision
>>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4)
(23.7836, -10.0279, 20.8427)

curvlinops.hutchpp_trace(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') → Tensor[source]

Estimate a linear operator’s trace using the Hutch++ method.

In contrast to vanilla Hutchinson, Hutch++ has lower variance, but requires more memory. The method is presented in

Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2020). Hutch++: optimal stochastic trace estimation.

Let \(\mathbf{A}\) be a square linear operator whose trace we want to approximate. First, using one third of the available matrix-vector products, we compute an orthonormal basis \(\mathbf{Q}\) of a sub-space spanned by \(\mathbf{A} \mathbf{S}\) where \(\mathbf{S}\) is a tall random matrix with i.i.d. elements. Then, using one third of the available matrix-vector products, we compute the trace in the sub-space. Finally, we apply Hutchinson’s estimator in the remaining space spanned by \(\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top\). Let \(3N\) denote the total number of matrix-vector products. We can draw \(2N\) random vectors \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\), compute \(\mathbf{Q}\) from the first \(N\) vectors, and use the remaining to compute the estimator

\[a := \mathrm{Tr}(\mathbf{Q}^\top \mathbf{A} \mathbf{Q}) + \frac{1}{N} \sum_{n = N+1}^{2N} \mathbf{v}_n^\top (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top)^\top \mathbf{A} (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top) \mathbf{v}_n \approx \mathrm{Tr}(\mathbf{A})\,.\]

This estimator is unbiased, \(\mathbb{E}[a] = \mathrm{Tr}(\mathbf{A})\), as the first term is the exact trace in the space spanned by \(\mathbf{Q}\), and the second part is Hutchinson’s unbiased estimator in the complementary space.

Parameters:

A – A square linear operator whose trace is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost), and divisible by 3.
distribution – Distribution of the random vectors used for the trace estimation. Can be either 'rademacher' or 'normal'. Default: 'rademacher'.

Returns:

The estimated trace of the linear operator.

Example

>>> from torch import manual_seed, rand
>>> _ = manual_seed(0) # make deterministic
>>> A = rand(50, 50)
>>> tr_A = A.trace().item() # exact trace as reference
>>> # one- and multi-sample approximations
>>> tr_A_low_precision = hutchpp_trace(A, num_matvecs=3).item()
>>> tr_A_high_precision = hutchpp_trace(A, num_matvecs=30).item()
>>> # compute the relative errors
>>> rel_error_low_precision = abs(tr_A - tr_A_low_precision) / abs(tr_A)
>>> rel_error_high_precision = abs(tr_A - tr_A_high_precision) / abs(tr_A)
>>> assert rel_error_low_precision > rel_error_high_precision
>>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4)
(23.7836, 15.7879, 19.6381)

curvlinops.xtrace(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') → Tensor[source]

Estimate a linear operator’s trace using the XTrace algorithm.

The method is presented in this paper:

Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). Xtrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications (SIMAX).

It combines the variance reduction from Hutch++ with the exchangeability principle.

Parameters:

A – A square linear operator.
num_matvecs – Total number of matrix-vector products to use. Must be even and less than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost).
distribution – Distribution of the random vectors used for the trace estimation. Can be either 'rademacher' or 'normal'. Default: 'rademacher'.

Returns:

The estimated trace of the linear operator.

Diagonal approximation

curvlinops.hutchinson_diag(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') → Tensor[source]

Estimate a linear operator’s diagonal using Hutchinson’s method.

For details, see

Bekas, C., Kokiopoulou, E., & Saad, Y. (2007). An estimator for the diagonal of a matrix. Applied Numerical Mathematics.

Let \(\mathbf{A}\) be a square linear operator. We can approximate its diagonal \(\mathrm{diag}(\mathbf{A})\) by drawing random vectors \(N\) \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution \(\mathbf{v}\) that satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\), and compute the estimator

\[\mathbf{a} := \frac{1}{N} \sum_{n=1}^N \mathbf{v}_n \odot \mathbf{A} \mathbf{v}_n \approx \mathrm{diag}(\mathbf{A})\,.\]

This estimator is unbiased,

\[\mathbb{E}[a_i] = \sum_j \mathbb{E}[v_i A_{i,j} v_j] = \sum_j A_{i,j} \mathbb{E}[v_i v_j] = \sum_j A_{i,j} \delta_{i, j} = A_{i,i}\,.\]

Parameters:

A – A square linear operator whose diagonal is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true diagonal directly at the same cost).
distribution – Distribution of the random vectors used for the diagonal estimation. Can be either 'rademacher' or 'normal'. Default: 'rademacher'.

Returns:

The estimated diagonal of the linear operator.

Example

>>> from torch import manual_seed, rand
>>> from torch.linalg import vector_norm
>>> _ = manual_seed(0) # make deterministic
>>> A = rand(40, 40)
>>> diag_A = A.diag() # exact diagonal as reference
>>> # one- and multi-sample approximations
>>> diag_A_low_precision = hutchinson_diag(A, num_matvecs=1)
>>> diag_A_high_precision = hutchinson_diag(A, num_matvecs=30)
>>> # compute residual norms
>>> error_low_precision = (vector_norm(diag_A - diag_A_low_precision) / vector_norm(diag_A)).item()
>>> error_high_precision = (vector_norm(diag_A - diag_A_high_precision) / vector_norm(diag_A)).item()
>>> assert error_low_precision > error_high_precision
>>> round(error_low_precision, 4), round(error_high_precision, 4)
(3.2648, 0.9253)

curvlinops.xdiag(A: PyTorchLinearOperator | Tensor, num_matvecs: int) → Tensor[source]

Estimate a linear operator’s diagonal using the XDiag algorithm.

The method is presented in this paper:

Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). Xtrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications (SIMAX).

It combines the variance reduction from Diag++ with the exchangeability principle.

Parameters:

A – A square linear operator.
num_matvecs – Total number of matrix-vector products to use. Must be even and less than the dimension of the linear operator (because otherwise one can evaluate the true diagonal directly at the same cost).

Returns:

The estimated diagonal of the linear operator.

Frobenius norm approximation

class curvlinops.hutchinson_squared_fro(A: Tensor | PyTorchLinearOperator, num_matvecs: int, distribution: str = 'rademacher')[source]

Estimate the squared Frobenius norm of a matrix using Hutchinson’s method.

Let \(\mathbf{A} \in \mathbb{R}^{M \times N}\) be some matrix. It’s Frobenius norm \(\lVert\mathbf{A}\rVert_\text{F}\) is defined via:

\[\lVert\mathbf{A}\rVert_\text{F}^2 = \sum_{m=1}^M \sum_{n=1}^N \mathbf{A}_{n,m}^2 = \text{Tr}(\mathbf{A}^\top \mathbf{A}).\]

Due to the last equality, we can use Hutchinson-style trace estimation to estimate the squared Frobenius norm.

Parameters:

A – A matrix whose squared Frobenius norm is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the minimum dimension of the matrix.
distribution – Distribution of the random vectors used for the trace estimation. Can be either 'rademacher' or 'normal'. Default: 'rademacher'.

Returns:

The estimated squared Frobenius norm of the matrix.

Raises:

ValueError – If the matrix is not two-dimensional or if the number of matrix- vector products is greater than the minimum dimension of the matrix (because then you can evaluate the true squared Frobenius norm directly atthe same cost).

Example

>>> from torch.linalg import matrix_norm
>>> from torch import rand, manual_seed
>>> _ = manual_seed(0) # make deterministic
>>> A = rand(40, 40)
>>> fro2_A = matrix_norm(A).item()**2 # reference: exact squared Frobenius norm
>>> # one- and multi-sample approximations
>>> fro2_A_low_prec = hutchinson_squared_fro(A, num_matvecs=1).item()
>>> fro2_A_high_prec = hutchinson_squared_fro(A, num_matvecs=30).item()
>>> assert abs(fro2_A - fro2_A_low_prec) > abs(fro2_A - fro2_A_high_prec)
>>> round(fro2_A, 1), round(fro2_A_low_prec, 1), round(fro2_A_high_prec, 1)
(530.9, 156.7, 628.9)

Experimental

Experimental features may be subject to changes or become deprecated.

class curvlinops.experimental.ActivationHessianLinearOperator( model_func: Module, loss_func: Callable[[MutableMapping | Tensor, Tensor], Tensor], activation: Tuple[str, str, int], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, )[source]

Hessian of the loss w.r.t. hidden features in a neural network.

Consider the empirical risk on a single mini-batch \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N\}\), \(\mathbf{y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N\}\):

\[\mathcal{L}(\mathbf{\theta}) = \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]

Let \(\mathbf{Z} = \{ \mathbf{z}_1, \dots, \mathbf{z}_N \}\) denote a batch of some intermediate feature produced inside the neural network’s forward pass. The Hessian w.r.t. the flattened activations that is represented by this linear operator is

\[\nabla^2_{\mathbf{Z}} \mathcal{L} = \nabla^2_{\mathbf{Z}} \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]

and has dimension \(\mathrm{dim}(\mathbf{Z}) = N \mathrm{dim}(\mathbf{z})\).

SELF_ADJOINT

Whether the linear operator is self-adjoint. True for the activation Hessian.

Type:: bool

__init__( model_func: Module, loss_func: Callable[[MutableMapping | Tensor, Tensor], Tensor], activation: Tuple[str, str, int], data: Iterable[Tuple[Tensor | MutableMapping, Tensor]], progressbar: bool = False, check_deterministic: bool = True, )[source]

Linear operator for the loss Hessian w.r.t. intermediate features.

Parameters:

model_func – A neural network.
loss_func – A loss function.
activation – A tuple specifying w.r.t. what intermediate feature the Hessian shall be computed. Has three entries. The first entry is the layer’s name inside the model (any string from model_func.named_modules). The second entry can either be "output" or "input" and specifies whether the intermediate feature is the layer’s output or the layer’s input. The third entry specifies what tensor should be taken from the tuple of input tensors or the tuple of output tensors. For most layers that process a single Tensor into another single Tensor, this value is 0.
data – An iterable of mini-batches. Must have a single batch, otherwise the activation Hessian is not defined.
progressbar – Show a progressbar during matrix-multiplication. Default: False.
check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default: True. This is a safeguard, only turn it off if you know what you are doing.

Raises:

ValueError – If data contains more than one batch.

Example

>>> from torch import manual_seed, rand, eye, allclose
>>> from torch.nn import Linear, MSELoss, Sequential, ReLU
>>>
>>> loss_func = MSELoss()
>>> model = Sequential(Linear(4, 3), ReLU(), Linear(3, 2))
>>> [name for name, _ in model.named_modules()] # available layer names
['', '0', '1', '2']
>>> data = [(rand(10, 4), rand(10, 2))]
>>>
>>> hessian = ActivationHessianLinearOperator( # Hessian w.r.t. ReLU input
...     model, loss_func, ("1", "input", 0), data
... )
>>> hessian.shape # batch size * feature dimension (10 * 3)
(30, 30)
>>>
>>> # The ReLU's input is the first Linear's output, let's check that
>>> hessian2 = ActivationHessianLinearOperator( # Hessian w.r.t. first output
...     model, loss_func, ("0", "output", 0), data
... )
>>> I = eye(hessian.shape[1])
>>> allclose(hessian @ I, hessian2 @ I)
True