Linear operators
Hessian
- class curvlinops.HessianLinearOperator(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- block_sizes: List[int] | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the Hessian of an empirical risk.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The Hessian matrix is\[\nabla^2_{\mathbf{\theta}} \mathcal{L} = c \sum_{n=1}^{N} \nabla_{\mathbf{\theta}}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\,.\]- SUPPORTS_BLOCKS
Whether the linear operator supports block operations. Default is
True.- Type:
bool
- SELF_ADJOINT
Whether the linear operator is self-adjoint (
Truefor Hessians).- Type:
bool
- __init__(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- block_sizes: List[int] | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for curvature matrices of empirical risks.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader. Note thatXcould be adictorUserDict; this is useful for custom models. In this case, you must (i) specify thebatch_size_fnargument, and (ii) take care of preprocessing likeX.to(device)inside of yourmodel.forward()function.progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of
nn.Parameter``s forming a block. Entries must sum to ``len(params). For instance[len(params)]considers the full matrix, while[1, 1, ...]corresponds to a block diagonal approximation where each parameter forms its own block.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Raises:
ValueError – If
block_sizesis specified but the linear operator does not support blocks.ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If
Xis not a tensor andbatch_size_fnis not specified.
Generalized Gauss-Newton
- class curvlinops.GGNLinearOperator(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- block_sizes: List[int] | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the generalized Gauss-Newton matrix of an empirical risk.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The GGN matrix is\[c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)^\top \left( \nabla_{f_\mathbf{\theta}(\mathbf{x}_n)}^2 \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \mathbf{J}_{\mathbf{\theta}} f_{\mathbf{\theta}}(\mathbf{x}_n) \right)\,.\]- SELF_ADJOINT
Whether the linear operator is self-adjoint.
Truefor GGNs.- Type:
bool
- __init__(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- block_sizes: List[int] | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for curvature matrices of empirical risks.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value. If
None, there is no loss function and the represented matrix is independent of the loss function.params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader. Note thatXcould be adictorUserDict; this is useful for custom models. In this case, you must (i) specify thebatch_size_fnargument, and (ii) take care of preprocessing likeX.to(device)inside of yourmodel.forward()function.progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.block_sizes – This argument will be ignored if the linear operator does not support blocks. List of integers indicating the number of
nn.Parameter``s forming a block. Entries must sum to ``len(params). For instance[len(params)]considers the full matrix, while[1, 1, ...]corresponds to a block diagonal approximation where each parameter forms its own block.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Raises:
ValueError – If
block_sizesis specified but the linear operator does not support blocks.ValueError – If the sum of blocks does not equal the number of parameters.
ValueError – If any block size is not positive.
ValueError – If
Xis not a tensor andbatch_size_fnis not specified.
Fisher (approximate)
- class curvlinops.FisherMCLinearOperator(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: MSELoss | CrossEntropyLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- mc_samples: int = 1,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping], int] | None = None,
Monte-Carlo approximation of the Fisher as SciPy linear operator.
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. Let \(\ell(\mathbf{f}, \mathbf{y}) = - \log q(\mathbf{y} \mid \mathbf{f})\) be a negative log-likelihood loss. Denoting \(\mathbf{f}_n = f_{\mathbf{\theta}}(\mathbf{x}_n)\), the Fisher information matrix is\[\begin{split}c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \nabla_{\mathbf{f}_n}^2 \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ = c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \mathbb{E}_{\mathbf{\tilde{y}}_n \sim q( \cdot \mid \mathbf{f}_n)} \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_n) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right) \\ \approx c \sum_{n=1}^{N} \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)^\top \frac{1}{M} \sum_{m=1}^M \left[ \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right) \left( \nabla_{\mathbf{f}_n} \ell(\mathbf{f}_n, \mathbf{\tilde{y}}_{n}^{(m)}) \right)^{\top} \right] \left( \mathbf{J}_{\mathbf{\theta}} \mathbf{f}_n \right)\end{split}\]with sampled targets \(\mathbf{\tilde{y}}_{n}^{(m)} \sim q( \cdot \mid \mathbf{f}_n)\). The expectation over the model’s likelihood is approximated via a Monte-Carlo estimator with \(M\) samples.
The linear operator represents a deterministic sample from this MC Fisher estimator. To generate different samples, you have to create instances with varying random seed argument.
- SELF_ADJOINT
Whether the operator is self-adjoint.
Truefor the Fisher.- Type:
bool
- supported_losses
Supported loss functions.
- FIXED_DATA_ORDER
Whether the data order must be fix.
Truefor MC-Fisher.- Type:
bool
- __init__(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: MSELoss | CrossEntropyLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- mc_samples: int = 1,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping], int] | None = None,
Linear operator for the Monte-Carlo approximation of the type-I Fisher.
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader. Note thatXcould be adictorUserDict; this is useful for custom models. In this case, you must (i) specify thebatch_size_fnargument, and (ii) take care of preprocessing likeX.to(device)inside of yourmodel.forward()function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:
True. This is a safeguard, only turn it off if you know what you are doing.seed – Seed used to construct the internal random number generator used to draw samples at the beginning of each matrix-vector product. Default:
2147483647mc_samples – Number of samples to use. Default:
1.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Raises:
NotImplementedError – If the loss function differs from
MSELoss, BCEWithLogitsLoss, orCrossEntropyLoss.
- class curvlinops.KFACLinearOperator(
- model_func: Module,
- loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- fisher_type: str = FisherType.MC,
- mc_samples: int = 1,
- kfac_approx: str = KFACType.EXPAND,
- num_per_example_loss_terms: int | None = None,
- separate_weight_and_bias: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator to multiply with the Fisher/GGN’s KFAC approximation.
KFAC approximates the per-layer Fisher/GGN with a Kronecker product: Consider a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) in a single layer. The layer’s Fisher \(\mathbf{F}(\mathbf{\theta})\) for
\[\begin{split}\mathbf{\theta} = \begin{pmatrix} \mathrm{vec}(\mathbf{W}) \\ \mathbf{b} \end{pmatrix}\end{split}\]where \(\mathrm{vec}\) denotes column-stacking is approximated as
\[\mathbf{F}(\mathbf{\theta}) \approx \mathbf{A}_{(\text{KFAC})} \otimes \mathbf{B}_{(\text{KFAC})}\](see
curvlinops.FisherMCLinearOperatorfor the Fisher’s definition). Loosely speaking, the first Kronecker factor is the un-centered covariance of the inputs to a layer. The second Kronecker factor is the un-centered covariance of ‘would-be’ gradients w.r.t. the layer’s output. Those ‘would-be’ gradients result from sampling labels from the model’s distribution and computing their gradients.Kronecker-Factored Approximate Curvature (KFAC) was originally introduced for MLPs in
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored
approximate curvature. International Conference on Machine Learning (ICML),
extended to CNNs in
Grosse, R., & Martens, J. (2016). A kronecker-factored approximate Fisher matrix for
convolution layers. International Conference on Machine Learning (ICML),
and generalized to all linear layers with weight sharing in
Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., Hennig, P. (2023).
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures (NeurIPS).
- _SUPPORTED_LOSSES
Tuple of supported loss functions.
- _SUPPORTED_MODULES
Tuple of supported layers.
- _SUPPORTED_FISHER_TYPE
Enum of supported Fisher types.
- Type:
curvlinops.kfac.FisherType
- _SUPPORTED_KFAC_APPROX
Enum of supported KFAC approximation types.
- Type:
curvlinops.kfac.KFACType
- SELF_ADJOINT
Whether the operator is self-adjoint.
Truefor KFAC.- Type:
bool
- __init__(
- model_func: Module,
- loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- fisher_type: str = FisherType.MC,
- mc_samples: int = 1,
- kfac_approx: str = KFACType.EXPAND,
- num_per_example_loss_terms: int | None = None,
- separate_weight_and_bias: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Kronecker-factored approximate curvature (KFAC) proxy of the Fisher/GGN.
Warning
If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.
Warning
- This is an early proto-type with limitations:
Only Linear and Conv2d modules are supported.
- Parameters:
model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through KFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to
False.check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True.seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to
2147483647.fisher_type – The type of Fisher/GGN to approximate. If
FisherType.TYPE2, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. IfFisherType.MC, the expectation is approximated by samplingmc_sampleslabels from the model’s predictive distribution. IfFisherType.EMPIRICAL, the empirical gradients are used which corresponds to the uncentered gradient covariance, or the empirical Fisher. IfFisherType.FORWARD_ONLY, the gradient covariances will be identity matrices, see the FOOF method in Benzing, 2022 or ISAAC in Petersen et al., 2023. Defaults toFisherType.MC.mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to
1whenfisher_type != FisherType.MC. Defaults to1.kfac_approx – A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g.
Conv2dmodules orLinearmodules that process matrix- or higher-dimensional features. Possible values areKFACType.EXPANDandKFACType.REDUCE. See Eschenhagen et al., 2023 for an explanation of the two approximations. Defaults toKFACType.EXPAND.num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have
num_data * num_per_example_loss_terms * Centries, whereCis the dimension of the random variable we define the likelihood over – for theCrossEntropyLossit will be the number of classes, for theMSELossandBCEWithLogitsLossit will be the size of the last dimension of the the model outputs/targets (our convention here). IfNone,num_per_example_loss_termsis inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults toNone.separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to
True.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Raises:
ValueError – If the loss function is not supported.
ValueError – If
fisher_type != FisherType.MCandmc_samples != 1.ValueError – If
Xis not a tensor andbatch_size_fnis not specified.
- property det: Tensor
Determinant of the KFAC approximation.
Will call
compute_kronecker_factorsif it has not been called before and will cache the determinant untilcompute_kronecker_factorsis called again. Uses the property of the Kronecker product that \(\det(A \otimes B) = \det(A)^{m} \det(B)^{n}\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).- Returns:
Determinant of the KFAC approximation.
- property frobenius_norm: Tensor
Frobenius norm of the KFAC approximation.
Will call
compute_kronecker_factorsif it has not been called before and will cache the Frobenius norm untilcompute_kronecker_factorsis called again. Uses the property of the Kronecker product that \(\|A \otimes B\|_F = \|A\|_F \|B\|_F\).- Returns:
Frobenius norm of the KFAC approximation.
- classmethod from_state_dict(
- state_dict: Dict[str, Any],
- model_func: Module,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- check_deterministic: bool = True,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Load a KFAC linear operator from a state dictionary.
- Parameters:
state_dict – State dictionary.
model_func – The model function.
params – The model’s parameters that KFAC is computed for.
data – A data loader containing the data of the Fisher/GGN.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Returns:
Linear operator of KFAC approximation.
- load_state_dict(state_dict: Dict[str, Any])[source]
Load the state of the KFAC linear operator.
Warning
Loading a state dict will overwrite the parameters of the model underlying the linear operator!
- Parameters:
state_dict – State dictionary.
- Raises:
ValueError – If the loss function does not match the state dict.
ValueError – If the loss function reduction does not match the state dict.
- property logdet: Tensor
Log determinant of the KFAC approximation.
More numerically stable than the
detproperty. Will callcompute_kronecker_factorsif it has not been called before and will cache the log determinant untilcompute_kronecker_factorsis called again. Uses the property of the Kronecker product that \(\log \det(A \otimes B) = m \log \det(A) + n \log \det(B)\), where \(A \in \mathbb{R}^{n \times n}\) and \(B \in \mathbb{R}^{m \times m}\).- Returns:
Log determinant of the KFAC approximation.
- state_dict() Dict[str, Any][source]
Return the state of the KFAC linear operator.
- Returns:
State dictionary.
- property trace: Tensor
Trace of the KFAC approximation.
Will call
compute_kronecker_factorsif it has not been called before and will cache the trace untilcompute_kronecker_factorsis called again. Uses the property of the Kronecker product that \(\text{tr}(A \otimes B) = \text{tr}(A) \text{tr}(B)\).- Returns:
Trace of the KFAC approximation.
- class curvlinops.EKFACLinearOperator(
- model_func: Module,
- loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- fisher_type: str = FisherType.MC,
- mc_samples: int = 1,
- kfac_approx: str = KFACType.EXPAND,
- num_per_example_loss_terms: int | None = None,
- separate_weight_and_bias: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping], int] | None = None,
Linear operator to multiply with the Fisher/GGN’s EKFAC approximation.
Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EKFAC) was originally introduced in
George, T., Laurent, C., Bouthillier, X., Ballas, N., Vincent, P. (2018).
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis (NeurIPS)
and concurrently in the context of continual learning in
Liu, X., Masana, M., Herranz, L., Van de Weijer, J., Lopez, A., Bagdanov, A. (2018). Rotate your networks: Better weight consolidation and less catastrophic forgetting (ICPR).
- _SUPPORTED_FISHER_TYPE
Tuple with supported Fisher types.
- Type:
Tuple[curvlinops.kfac.FisherType]
- __init__(
- model_func: Module,
- loss_func: MSELoss | CrossEntropyLoss | BCEWithLogitsLoss,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- seed: int = 2147483647,
- fisher_type: str = FisherType.MC,
- mc_samples: int = 1,
- kfac_approx: str = KFACType.EXPAND,
- num_per_example_loss_terms: int | None = None,
- separate_weight_and_bias: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping], int] | None = None,
Eigenvalue-corrected KFAC (EKFAC) proxy of the Fisher/GGN.
Warning
If the model’s parameters change, e.g. during training, you need to create a fresh instance of this object. This is because, for performance reasons, the Kronecker factors are computed once and cached during the first matrix-vector product. They will thus become outdated if the model changes.
Warning
- This is an early proto-type with limitations:
Only Linear and Conv2d modules are supported.
Only models with 2d output are supported.
- Parameters:
model_func – The neural network. Must consist of modules.
loss_func – The loss function.
params – The parameters defining the Fisher/GGN that will be approximated through EKFAC.
data – A data loader containing the data of the Fisher/GGN.
progressbar – Whether to show a progress bar when computing the Kronecker factors. Defaults to
False.check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True.seed – The seed for the random number generator used to draw labels from the model’s predictive distribution. Defaults to
2147483647.fisher_type – The type of Fisher/GGN to approximate. If
FisherType.TYPE2, the exact Hessian of the loss w.r.t. the model outputs is used. This requires as many backward passes as the output dimension, i.e. the number of classes for classification. This is sometimes also called type-2 Fisher. IfFisherType.MC, the expectation is approximated by samplingmc_sampleslabels from the model’s predictive distribution. IfFisherType.EMPIRICAL, the empirical gradients are used which corresponds to the uncentered gradient covariance/empirical Fisher. Defaults toFisherType.MC.mc_samples – The number of Monte-Carlo samples to use per data point. Has to be set to
1whenfisher_type != FisherType.MC. Defaults to1.kfac_approx –
A string specifying the KFAC approximation that should be used for linear weight-sharing layers, e.g.
Conv2dmodules orLinearmodules that process matrix- or higher-dimensional features. Possible values areKFACType.EXPANDandKFACType.REDUCE. See Eschenhagen et al., 2023 for an explanation of the two approximations. Defaults toKFACType.EXPAND.num_per_example_loss_terms – Number of per-example loss terms, e.g., the number of tokens in a sequence. The model outputs will have
num_data * num_per_example_loss_terms * Centries, whereCis the dimension of the random variable we define the likelihood over – for theCrossEntropyLossit will be the number of classes, for theMSELossandBCEWithLogitsLossit will be the size of the last dimension of the the model outputs/targets (our convention here). IfNone,num_per_example_loss_termsis inferred from the data at the cost of one traversal through the data loader. It is expected to be the same for all examples. Defaults toNone.separate_weight_and_bias – Whether to treat weights and biases separately. Defaults to
True.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- property det: Tensor
Determinant of the EKFAC approximation.
Will call
compute_kronecker_factorsandcompute_eigenvalue_correctionif either of them has not been called before and will cache the determinant until one of them is called again.- Returns:
Determinant of the EKFAC approximation.
- property frobenius_norm: Tensor
Frobenius norm of the EKFAC approximation.
Will call
compute_kronecker_factorsandcompute_eigenvalue_correctionif either of them has not been called before and will cache the Frobenius norm until one of them is called again.- Returns:
Frobenius norm of the EKFAC approximation.
- classmethod from_state_dict(
- state_dict: Dict[str, Any],
- model_func: Module,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- check_deterministic: bool = True,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Load a KFAC linear operator from a state dictionary.
- Parameters:
state_dict – State dictionary.
model_func – The model function.
params – The model’s parameters that KFAC is computed for.
data – A data loader containing the data of the Fisher/GGN.
check_deterministic – Whether to check that the linear operator is deterministic. Defaults to
True.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Returns:
Linear operator of KFAC approximation.
- load_state_dict(state_dict: Dict[str, Any])[source]
Load the state of the EKFAC linear operator.
- Parameters:
state_dict – State dictionary.
- property logdet: Tensor
Log determinant of the EKFAC approximation.
More numerically stable than the
detproperty. Will callcompute_kronecker_factorsandcompute_eigenvalue_correctionif either of them has not been called before and will cache the logdet until one of them is called again.- Returns:
Log determinant of the EKFAC approximation.
Uncentered gradient covariance (empirical Fisher)
- class curvlinops.EFLinearOperator(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Uncentered gradient covariance as PyTorch linear operator.
The uncentered gradient covariance is often called ‘empirical Fisher’ (EF).
Consider the empirical risk
\[\mathcal{L}(\mathbf{\theta}) = c \sum_{n=1}^{N} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n)\]with \(c = \frac{1}{N}\) for
reduction='mean'and \(c=1\) forreduction='sum'. The uncentered gradient covariance matrix is\[c \sum_{n=1}^{N} \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right) \left( \nabla_{\mathbf{\theta}} \ell(f_{\mathbf{\theta}}(\mathbf{x}_n), \mathbf{y}_n) \right)^\top\,.\]- SELF_ADJOINT
Whether the linear operator is self-adjoint.
Truefor empirical Fisher.- Type:
bool
- __init__(
- model_func: Callable[[Tensor | MutableMapping], Tensor],
- loss_func: Callable[[Tensor, Tensor], Tensor] | None,
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the uncentered gradient covariance/empirical Fisher (EF).
Note
f(X; θ) denotes a neural network, parameterized by θ, that maps a mini-batch input X to predictions p. ℓ(p, y) maps the prediction to a loss, using the mini-batch labels y.
- Parameters:
model_func – A function that maps the mini-batch input X to predictions. Could be a PyTorch module representing a neural network.
loss_func – Loss function criterion. Maps predictions and mini-batch labels to a scalar value.
params – List of differentiable parameters used by the prediction function.
data – Source from which mini-batches can be drawn, for instance a list of mini-batches
[(X, y), ...]or a torchDataLoader. Note thatXcould be adictorUserDict; this is useful for custom models. In this case, you must (i) specify thebatch_size_fnargument, and (ii) take care of preprocessing likeX.to(device)inside of yourmodel.forward()function. Due to the sequential internal Monte-Carlo sampling, batches must be presented in the same deterministic order (no shuffling!).progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use drop_last or data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:
True. This is a safeguard, only turn it off if you know what you are doing.num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- Raises:
NotImplementedError – If the loss function differs from
MSELoss,BCEWithLogitsLoss, orCrossEntropyLoss.
Jacobians
- class curvlinops.JacobianLinearOperator(
- model_func: Callable[[MutableMapping | Tensor], Tensor],
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator of the Jacobian.
- FIXED_DATA_ORDER
Whether the data order must be fix.
Truefor Jacobians.- Type:
bool
- __init__(
- model_func: Callable[[MutableMapping | Tensor], Tensor],
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the Jacobian as PyTorch linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s Jacobian \(\mathbf{J}_\mathbf{\theta}\mathbf{f}\) is an \(NC \times D\) matrix with elements
\[\left[ \mathbf{J}_\mathbf{\theta}\mathbf{f} \right]_{(n,c), d} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
- class curvlinops.TransposedJacobianLinearOperator(
- model_func: Callable[[MutableMapping | Tensor], Tensor],
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the transpose Jacobian.
- FIXED_DATA_ORDER
Whether the data order must be fix.
Truefor Jacobians.- Type:
bool
- __init__(
- model_func: Callable[[MutableMapping | Tensor], Tensor],
- params: List[Parameter],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
- num_data: int | None = None,
- batch_size_fn: Callable[[MutableMapping | Tensor], int] | None = None,
Linear operator for the transpose Jacobian as PyTorch linear operator.
Consider a model \(f(\mathbf{x}, \mathbf{\theta}): \mathbb{R}^M \times \mathbb{R}^D \to \mathbb{R}^C\) with parameters \(\mathbf{\theta}\) and input \(\mathbf{x}\). Assume we are given a data set \(\mathcal{D} = \{ (\mathbf{x}_n, \mathbf{y}_n) \}_{n=1}^N\) of input-target pairs via batches. The model’s transpose Jacobian \((\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top\) is an \(D \times NC\) matrix with elements
\[\left[ (\mathbf{J}_\mathbf{\theta}\mathbf{f})^\top \right]_{d, (n,c)} = \frac{\partial [f(\mathbf{x}_n, \mathbf{\theta})]_c}{\partial \theta_d}\,.\]Note that the data must be supplied in deterministic order.
- Parameters:
model_func – Neural network function.
params – Neural network parameters.
data – Iterable of batched input-target pairs.
progressbar – Show progress bar.
check_deterministic – Check if model and data are deterministic.
num_data – Number of data points. If
None, it is inferred from the data at the cost of one traversal through the data loader.batch_size_fn – If the
X’s indataare nottorch.Tensor, this needs to be specified. The intended behavior is to consume the first entry of the iterates fromdataand return their batch size.
Inverses
- class curvlinops.CGInverseLinearOperator(A: PyTorchLinearOperator, **cg_hyperparameters)[source]
Class for inverse linear operators via conjugate gradients.
Note
Internally, this operator uses GPyTorch’s implementation of CG.
- __init__(A: PyTorchLinearOperator, **cg_hyperparameters)[source]
Store the linear operator whose inverse should be represented.
- Parameters:
A – PyTorch linear operator whose inverse is formed. Must represent a symmetric and positive-definite matrix.
cg_hyperparameters – Keyword arguments for GPyTorch’s CG implementation. For details, see the documentation of the
linear_cgfunction in https://github.com/cornellius-gp/linear_operator/blob/main/linear_operator/utils/linear_cg.py.
- class curvlinops.LSMRInverseLinearOperator(A: PyTorchLinearOperator, **lsmr_hyperparameters)[source]
Class for inverse PyTorch linear operators via LSMR.
See https://arxiv.org/abs/1006.0758 for details on the LSMR algorithm.
Note
Internally, this operator uses SciPy’s CPU implementation of LSMR as PyTorch currently does not offer an LSMR interface that purely relies on matrix-vector products.
- __init__(A: PyTorchLinearOperator, **lsmr_hyperparameters)[source]
Store the linear operator whose inverse should be represented.
- Parameters:
A – Linear operator whose inverse is formed.
lsmr_hyperparameters – The hyper-parameters that will be passed to the LSMR implementation in SciPy. For more detail, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html.
- class curvlinops.NeumannInverseLinearOperator(A: PyTorchLinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]
Class for inverse linear operators via truncated Neumann series.
# noqa: B950
Motivated by
Lorraine, J., Vicol, P., & Duvenaud, D. (2020). Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Warning
The Neumann series can be non-convergent. In this case, the iterations will become numerically unstable, leading to ``NaN``s.
Warning
The Neumann series can converge slowly. Use
curvlinops.CGInverLinearOperatorfor better accuracy.- __init__(A: PyTorchLinearOperator, num_terms: int = 100, scale: float = 1.0, check_nan: bool = True)[source]
Store the linear operator whose inverse should be represented.
The Neumann series for an invertible linear operator \(\mathbf{A}\) is
\[\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \left(\mathbf{I} - \mathbf{A} \right)^k\,.\]and is convergent if all eigenvalues satisfy \(0 < \lambda(\mathbf{A}) < 2\).
By re-rescaling the matrix by
scale(\(\alpha\)), we have:\[\mathbf{A}^{-1} = \alpha (\alpha \mathbf{A})^{-1} = \alpha \sum_{k=0}^{\infty} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,,\]which and is convergent if \(0 < \lambda(\mathbf{A}) < \frac{2}{\alpha}\).
Additionally, we truncate the series at
num_terms(\(K\)):\[\mathbf{A}^{-1} \approx \alpha \sum_{k=0}^{K} \left(\mathbf{I} - \alpha \mathbf{A} \right)^k\,.\]- Parameters:
A – Linear operator whose inverse is formed.
num_terms – Number of terms in the truncated Neumann series. Default:
100.scale – Scale applied to the matrix in the Neumann iteration. Crucial for convergence of Neumann series (details above). Default:
1.0.check_nan – Whether to check for NaNs while applying the truncated Neumann series. Default:
True.
- class curvlinops.KFACInverseLinearOperator(
- A: KFACLinearOperator,
- damping: float | Tuple[float, float] = 0.0,
- use_heuristic_damping: bool = False,
- min_damping: float = 1e-08,
- use_exact_damping: bool = False,
- cache: bool = True,
- retry_double_precision: bool = True,
Class to invert instances of the
KFACLinearOperator.- SELF_ADJOINT
Whether the operator is self-adjoint.
Truefor KFAC-inverse.- Type:
bool
- __init__(
- A: KFACLinearOperator,
- damping: float | Tuple[float, float] = 0.0,
- use_heuristic_damping: bool = False,
- min_damping: float = 1e-08,
- use_exact_damping: bool = False,
- cache: bool = True,
- retry_double_precision: bool = True,
Store the linear operator whose inverse should be represented.
- Parameters:
A –
KFACLinearOperatorwhose inverse is formed.damping – Damping value(s) for all input and gradient covariances. If tuple, the first value is used for the input covariances and the second value for the gradient covariances. Note that if heuristic or exact damping is used the damping cannot be a tuple. Default:
0..use_heuristic_damping – Whether to use a heuristic damping strategy by Martens and Grosse, 2015 (Section 6.3). For input covariances \(A \in \mathbb{R}^{n \times n}\) and gradient covariances \(B \in \mathbb{R}^{m \times m}\), we define \(\pi := \sqrt{\frac{m\; \text{tr}(A)}{n\; \text{tr}(B)}}\) and set the damping for the input covariances \(A\) to \(\max(\pi\; \sqrt{\text{damping}}, \text{min_damping})\) and for the gradient covariances \(B\) to \(\max(\frac{1}{\pi}\; \sqrt{\text{damping}}, \text{min_damping})\). Default:
False.min_damping – Minimum damping value. Only used if
use_heuristic_dampingisTrue. Default:1e-8.use_exact_damping – Whether to use exact damping, i.e. to invert \((A \otimes B) + \text{damping}\; \mathbf{I}\). This is implemented via eigendecompositions of the Kronecker factors, e.g. see equation (21) in Grosse et al., 2023. Note that the eigendecomposition synchronizes the device with the CPU. Default:
False.cache – Whether to cache the inverses of the Kronecker factors. Default:
True.retry_double_precision – Whether to retry Cholesky decomposition used for inversion in double precision. Default:
True.
- Raises:
ValueError – If the linear operator is not a
KFACLinearOperator.ValueError – If both heuristic and exact damping are selected.
ValueError – If heuristic or exact damping is used and the damping value is a tuple.
Sub-matrices
- class curvlinops.SubmatrixLinearOperator(A: PyTorchLinearOperator, row_idxs: List[int], col_idxs: List[int])[source]
Class for sub-matrices of linear operators.
- __init__(A: PyTorchLinearOperator, row_idxs: List[int], col_idxs: List[int])[source]
Store the linear operator and indices of its sub-matrix.
Represents the sub-matrix
A[row_idxs, :][col_idxs, :].- Parameters:
A – A linear operator.
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- set_submatrix(row_idxs: List[int], col_idxs: List[int])[source]
Define the sub-matrix.
Internally sets the linear operator’s shape.
- Parameters:
row_idxs – The sub-matrix’s row indices.
col_idxs – The sub-matrix’s column indices.
- Raises:
ValueError – If the index lists contain duplicate values, non-integers, or out-of-bounds indices.
Spectral density approximation
Note
This functionality currently expects SciPy ``LinearOperator``s.
- curvlinops.lanczos_approximate_spectrum(
- A: PyTorchLinearOperator,
- ncv: int,
- num_points: int = 1024,
- num_repeats: int = 1,
- kappa: float = 3.0,
- boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None,
- margin: float = 0.05,
- boundaries_tol: float = 0.01,
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Implements algorithm 2 (
LanczosApproxSpec) of Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappaof the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3. Taken from papyan2020traces, Section D.2.boundaries – Estimates of the minimum and maximum eigenvalues of
A. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
- curvlinops.lanczos_approximate_log_spectrum(
- A: PyTorchLinearOperator,
- ncv: int,
- num_points: int = 1024,
- num_repeats: int = 1,
- kappa: float = 1.04,
- boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None,
- margin: float = 0.05,
- boundaries_tol: float = 0.01,
- epsilon: float = 1e-05,
Approximate the spectral density p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of log(|A| + εI) ∈ Rᵈˣᵈ.
Follows the idea of Section C.7 in Papyan, 2020 (https://jmlr.org/papers/v21/20-933.html).
Here, log denotes the natural logarithm (i.e. base e).
Internally rescales the operator spectrum to the interval [-1; 1] such that the width
kappaof the Gaussian bumps used to approximate the delta peaks need not be tweaked.- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
num_points – Resolution.
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
1.04. Obtained by tweaking while reproducing Fig. 15b from papyan2020traces (not specified by the paper).boundaries – Estimates of the minimum and maximum eigenvalues of
|A|. If left unspecified, they will be estimated internally.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.epsilon – Shift to increase numerical stability. Default:
1e-5. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of log(|A| + εI).
- class curvlinops.LanczosApproximateSpectrumCached(
- A: PyTorchLinearOperator,
- ncv: int,
- boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None,
- boundaries_tol: float = 0.01,
Class to approximate the spectral density of p(λ) = 1/d ∑ᵢ δ(λ - λᵢ) of A ∈ Rᵈˣᵈ.
Caches Lanczos iterations to efficiently produce spectral density approximations with different hyperparameters.
- __init__(
- A: PyTorchLinearOperator,
- ncv: int,
- boundaries: Tuple[float, float] | Tuple[float, None] | Tuple[None, float] | None = None,
- boundaries_tol: float = 0.01,
Initialize.
- Parameters:
A – Symmetric linear operator.
ncv – Number of Lanczos vectors (number of nodes/weights for the quadrature).
boundaries – Estimates of the minimum and maximum eigenvalues of
A. If left unspecified, they will be estimated internally.boundaries_tol – (Only relevant if
boundariesare not specified). Relative accuracy used to estimate the spectral boundary.0implies machine precision. Default:1e-2, from https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html#examples.
- approximate_spectrum(num_repeats: int = 1, num_points: int = 1024, kappa: float = 3.0, margin: float = 0.05) Tuple[Tensor, Tensor][source]
Approximate the spectal density of A.
- Parameters:
num_repeats – Number of Lanczos quadratures to average the density over. Default:
1. Taken from papyan2020traces, Section D.2.num_points – Resolution. Default:
1024.kappa – Width of the Gaussian used to approximate delta peaks in [-1; 1]. Must be greater than 1. Default:
3. From papyan2020traces, Section D.2.margin – Relative margin added around the spectral boundary. Default:
0.05. Taken from papyan2020traces, Section D.2.
- Returns:
Grid points λ and approximated spectral density p(λ) of A.
Trace approximation
- curvlinops.hutchinson_trace(A: Tensor | PyTorchLinearOperator, num_matvecs: int, distribution: str = 'rademacher') Tensor[source]
Estimate a linear operator’s trace using the Girard-Hutchinson method.
For details, see
Girard, D. A. (1989). A fast ‘monte-carlo cross-validation’ procedure for large least squares problems with noisy data. Numerische Mathematik.
Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics—Simulation and Computation.
Let \(\mathbf{A}\) be a square linear operator. We can approximate its trace \(\mathrm{Tr}(\mathbf{A})\) by drawing \(N\) random vectors \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution that satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\) and compute
\[a := \frac{1}{N} \sum_{n=1}^N \mathbf{v}_n^\top \mathbf{A} \mathbf{v}_n \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a] = \mathrm{Tr}(\mathbb{E}[\mathbf{v}^\top\mathbf{A} \mathbf{v}]) = \mathrm{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v} \mathbf{v}^\top]) = \mathrm{Tr}(\mathbf{A} \mathbf{I}) = \mathrm{Tr}(\mathbf{A})\,.\]- Parameters:
A – A square linear operator whose trace is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost).
distribution – Distribution of the random vectors used for the trace estimation. Can be either
'rademacher'or'normal'. Default:'rademacher'.
- Returns:
The estimated trace of the linear operator.
Example
>>> from torch import manual_seed, rand >>> _ = manual_seed(0) # make deterministic >>> A = rand(50, 50) >>> tr_A = A.trace().item() # exact trace as reference >>> # one- and multi-sample approximations >>> tr_A_low_precision = hutchinson_trace(A, num_matvecs=1).item() >>> tr_A_high_precision = hutchinson_trace(A, num_matvecs=40).item() >>> # compute the relative errors >>> rel_error_low_precision = abs(tr_A - tr_A_low_precision) / abs(tr_A) >>> rel_error_high_precision = abs(tr_A - tr_A_high_precision) / abs(tr_A) >>> assert rel_error_low_precision > rel_error_high_precision >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (23.7836, -10.0279, 20.8427)
- curvlinops.hutchpp_trace(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') Tensor[source]
Estimate a linear operator’s trace using the Hutch++ method.
In contrast to vanilla Hutchinson, Hutch++ has lower variance, but requires more memory. The method is presented in
Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2020). Hutch++: optimal stochastic trace estimation.
Let \(\mathbf{A}\) be a square linear operator whose trace we want to approximate. First, using one third of the available matrix-vector products, we compute an orthonormal basis \(\mathbf{Q}\) of a sub-space spanned by \(\mathbf{A} \mathbf{S}\) where \(\mathbf{S}\) is a tall random matrix with i.i.d. elements. Then, using one third of the available matrix-vector products, we compute the trace in the sub-space. Finally, we apply Hutchinson’s estimator in the remaining space spanned by \(\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top\). Let \(3N\) denote the total number of matrix-vector products. We can draw \(2N\) random vectors \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution which satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\), compute \(\mathbf{Q}\) from the first \(N\) vectors, and use the remaining to compute the estimator
\[a := \mathrm{Tr}(\mathbf{Q}^\top \mathbf{A} \mathbf{Q}) + \frac{1}{N} \sum_{n = N+1}^{2N} \mathbf{v}_n^\top (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top)^\top \mathbf{A} (\mathbf{I} - \mathbf{Q} \mathbf{Q}^\top) \mathbf{v}_n \approx \mathrm{Tr}(\mathbf{A})\,.\]This estimator is unbiased, \(\mathbb{E}[a] = \mathrm{Tr}(\mathbf{A})\), as the first term is the exact trace in the space spanned by \(\mathbf{Q}\), and the second part is Hutchinson’s unbiased estimator in the complementary space.
- Parameters:
A – A square linear operator whose trace is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost), and divisible by 3.
distribution – Distribution of the random vectors used for the trace estimation. Can be either
'rademacher'or'normal'. Default:'rademacher'.
- Returns:
The estimated trace of the linear operator.
Example
>>> from torch import manual_seed, rand >>> _ = manual_seed(0) # make deterministic >>> A = rand(50, 50) >>> tr_A = A.trace().item() # exact trace as reference >>> # one- and multi-sample approximations >>> tr_A_low_precision = hutchpp_trace(A, num_matvecs=3).item() >>> tr_A_high_precision = hutchpp_trace(A, num_matvecs=30).item() >>> # compute the relative errors >>> rel_error_low_precision = abs(tr_A - tr_A_low_precision) / abs(tr_A) >>> rel_error_high_precision = abs(tr_A - tr_A_high_precision) / abs(tr_A) >>> assert rel_error_low_precision > rel_error_high_precision >>> round(tr_A, 4), round(tr_A_low_precision, 4), round(tr_A_high_precision, 4) (23.7836, 15.7879, 19.6381)
- curvlinops.xtrace(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') Tensor[source]
Estimate a linear operator’s trace using the XTrace algorithm.
The method is presented in this paper:
Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). Xtrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications (SIMAX).
It combines the variance reduction from Hutch++ with the exchangeability principle.
- Parameters:
A – A square linear operator.
num_matvecs – Total number of matrix-vector products to use. Must be even and less than the dimension of the linear operator (because otherwise one can evaluate the true trace directly at the same cost).
distribution – Distribution of the random vectors used for the trace estimation. Can be either
'rademacher'or'normal'. Default:'rademacher'.
- Returns:
The estimated trace of the linear operator.
Diagonal approximation
- curvlinops.hutchinson_diag(A: PyTorchLinearOperator | Tensor, num_matvecs: int, distribution: str = 'rademacher') Tensor[source]
Estimate a linear operator’s diagonal using Hutchinson’s method.
For details, see
Bekas, C., Kokiopoulou, E., & Saad, Y. (2007). An estimator for the diagonal of a matrix. Applied Numerical Mathematics.
Let \(\mathbf{A}\) be a square linear operator. We can approximate its diagonal \(\mathrm{diag}(\mathbf{A})\) by drawing random vectors \(N\) \(\mathbf{v}_n \sim \mathbf{v}\) from a distribution \(\mathbf{v}\) that satisfies \(\mathbb{E}[\mathbf{v} \mathbf{v}^\top] = \mathbf{I}\), and compute the estimator
\[\mathbf{a} := \frac{1}{N} \sum_{n=1}^N \mathbf{v}_n \odot \mathbf{A} \mathbf{v}_n \approx \mathrm{diag}(\mathbf{A})\,.\]This estimator is unbiased,
\[\mathbb{E}[a_i] = \sum_j \mathbb{E}[v_i A_{i,j} v_j] = \sum_j A_{i,j} \mathbb{E}[v_i v_j] = \sum_j A_{i,j} \delta_{i, j} = A_{i,i}\,.\]- Parameters:
A – A square linear operator whose diagonal is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the dimension of the linear operator (because otherwise one can evaluate the true diagonal directly at the same cost).
distribution – Distribution of the random vectors used for the diagonal estimation. Can be either
'rademacher'or'normal'. Default:'rademacher'.
- Returns:
The estimated diagonal of the linear operator.
Example
>>> from torch import manual_seed, rand >>> from torch.linalg import vector_norm >>> _ = manual_seed(0) # make deterministic >>> A = rand(40, 40) >>> diag_A = A.diag() # exact diagonal as reference >>> # one- and multi-sample approximations >>> diag_A_low_precision = hutchinson_diag(A, num_matvecs=1) >>> diag_A_high_precision = hutchinson_diag(A, num_matvecs=30) >>> # compute residual norms >>> error_low_precision = (vector_norm(diag_A - diag_A_low_precision) / vector_norm(diag_A)).item() >>> error_high_precision = (vector_norm(diag_A - diag_A_high_precision) / vector_norm(diag_A)).item() >>> assert error_low_precision > error_high_precision >>> round(error_low_precision, 4), round(error_high_precision, 4) (3.2648, 0.9253)
- curvlinops.xdiag(A: PyTorchLinearOperator | Tensor, num_matvecs: int) Tensor[source]
Estimate a linear operator’s diagonal using the XDiag algorithm.
The method is presented in this paper:
Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). Xtrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications (SIMAX).
It combines the variance reduction from Diag++ with the exchangeability principle.
- Parameters:
A – A square linear operator.
num_matvecs – Total number of matrix-vector products to use. Must be even and less than the dimension of the linear operator (because otherwise one can evaluate the true diagonal directly at the same cost).
- Returns:
The estimated diagonal of the linear operator.
Frobenius norm approximation
- class curvlinops.hutchinson_squared_fro(A: Tensor | PyTorchLinearOperator, num_matvecs: int, distribution: str = 'rademacher')[source]
Estimate the squared Frobenius norm of a matrix using Hutchinson’s method.
Let \(\mathbf{A} \in \mathbb{R}^{M \times N}\) be some matrix. It’s Frobenius norm \(\lVert\mathbf{A}\rVert_\text{F}\) is defined via:
\[\lVert\mathbf{A}\rVert_\text{F}^2 = \sum_{m=1}^M \sum_{n=1}^N \mathbf{A}_{n,m}^2 = \text{Tr}(\mathbf{A}^\top \mathbf{A}).\]Due to the last equality, we can use Hutchinson-style trace estimation to estimate the squared Frobenius norm.
- Parameters:
A – A matrix whose squared Frobenius norm is estimated.
num_matvecs – Total number of matrix-vector products to use. Must be smaller than the minimum dimension of the matrix.
distribution – Distribution of the random vectors used for the trace estimation. Can be either
'rademacher'or'normal'. Default:'rademacher'.
- Returns:
The estimated squared Frobenius norm of the matrix.
- Raises:
ValueError – If the matrix is not two-dimensional or if the number of matrix- vector products is greater than the minimum dimension of the matrix (because then you can evaluate the true squared Frobenius norm directly atthe same cost).
Example
>>> from torch.linalg import matrix_norm >>> from torch import rand, manual_seed >>> _ = manual_seed(0) # make deterministic >>> A = rand(40, 40) >>> fro2_A = matrix_norm(A).item()**2 # reference: exact squared Frobenius norm >>> # one- and multi-sample approximations >>> fro2_A_low_prec = hutchinson_squared_fro(A, num_matvecs=1).item() >>> fro2_A_high_prec = hutchinson_squared_fro(A, num_matvecs=30).item() >>> assert abs(fro2_A - fro2_A_low_prec) > abs(fro2_A - fro2_A_high_prec) >>> round(fro2_A, 1), round(fro2_A_low_prec, 1), round(fro2_A_high_prec, 1) (530.9, 156.7, 628.9)
Experimental
Experimental features may be subject to changes or become deprecated.
- class curvlinops.experimental.ActivationHessianLinearOperator(
- model_func: Module,
- loss_func: Callable[[MutableMapping | Tensor, Tensor], Tensor],
- activation: Tuple[str, str, int],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
Hessian of the loss w.r.t. hidden features in a neural network.
Consider the empirical risk on a single mini-batch \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N\}\), \(\mathbf{y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N\}\):
\[\mathcal{L}(\mathbf{\theta}) = \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]Let \(\mathbf{Z} = \{ \mathbf{z}_1, \dots, \mathbf{z}_N \}\) denote a batch of some intermediate feature produced inside the neural network’s forward pass. The Hessian w.r.t. the flattened activations that is represented by this linear operator is
\[\nabla^2_{\mathbf{Z}} \mathcal{L} = \nabla^2_{\mathbf{Z}} \ell(f_{\mathbf{\theta}}(\mathbf{X}), \mathbf{y})\]and has dimension \(\mathrm{dim}(\mathbf{Z}) = N \mathrm{dim}(\mathbf{z})\).
- SELF_ADJOINT
Whether the linear operator is self-adjoint.
Truefor the activation Hessian.- Type:
bool
- __init__(
- model_func: Module,
- loss_func: Callable[[MutableMapping | Tensor, Tensor], Tensor],
- activation: Tuple[str, str, int],
- data: Iterable[Tuple[Tensor | MutableMapping, Tensor]],
- progressbar: bool = False,
- check_deterministic: bool = True,
Linear operator for the loss Hessian w.r.t. intermediate features.
- Parameters:
model_func – A neural network.
loss_func – A loss function.
activation – A tuple specifying w.r.t. what intermediate feature the Hessian shall be computed. Has three entries. The first entry is the layer’s name inside the model (any string from
model_func.named_modules). The second entry can either be"output"or"input"and specifies whether the intermediate feature is the layer’s output or the layer’s input. The third entry specifies what tensor should be taken from the tuple of input tensors or the tuple of output tensors. For most layers that process a single Tensor into another single Tensor, this value is0.data – An iterable of mini-batches. Must have a single batch, otherwise the activation Hessian is not defined.
progressbar – Show a progressbar during matrix-multiplication. Default:
False.check_deterministic – Probe that model and data are deterministic, i.e. that the data does not use
drop_lastor data augmentation. Also, the model’s forward pass could depend on the order in which mini-batches are presented (BatchNorm, Dropout). Default:True. This is a safeguard, only turn it off if you know what you are doing.
- Raises:
ValueError – If
datacontains more than one batch.
Example
>>> from torch import manual_seed, rand, eye, allclose >>> from torch.nn import Linear, MSELoss, Sequential, ReLU >>> >>> loss_func = MSELoss() >>> model = Sequential(Linear(4, 3), ReLU(), Linear(3, 2)) >>> [name for name, _ in model.named_modules()] # available layer names ['', '0', '1', '2'] >>> data = [(rand(10, 4), rand(10, 2))] >>> >>> hessian = ActivationHessianLinearOperator( # Hessian w.r.t. ReLU input ... model, loss_func, ("1", "input", 0), data ... ) >>> hessian.shape # batch size * feature dimension (10 * 3) (30, 30) >>> >>> # The ReLU's input is the first Linear's output, let's check that >>> hessian2 = ActivationHessianLinearOperator( # Hessian w.r.t. first output ... model, loss_func, ("0", "output", 0), data ... ) >>> I = eye(hessian.shape[1]) >>> allclose(hessian @ I, hessian2 @ I) True