.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "basic_usage/example_huggingface.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_basic_usage_example_huggingface.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_basic_usage_example_huggingface.py:

Usage with Huggingface LLMs
===============================

This example demonstrates how to work with Huggingface (HF) language models.

As always, let's first import the required functionality.
Remember to run :code:`pip install -U transformers datasets`

.. GENERATED FROM PYTHON SOURCE LINES 9-30

.. code-block:: Python


    from collections import UserDict
    from collections.abc import MutableMapping

    import torch.utils.data as data_utils
    from datasets import Dataset
    from torch import Tensor, bfloat16, eye, manual_seed, no_grad
    from torch.nn import CrossEntropyLoss, Module
    from transformers import (
        DataCollatorWithPadding,
        GPT2Config,
        GPT2ForSequenceClassification,
        GPT2Tokenizer,
        PreTrainedTokenizer,
    )

    from curvlinops import GGNLinearOperator

    # make deterministic
    manual_seed(0)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <torch._C.Generator object at 0x71fa6dd06810>


.. GENERATED FROM PYTHON SOURCE LINES 31-36

Data
----

We will use synthetic data for simplicity. But obviously this can
be replaced with any HF dataloader.

.. GENERATED FROM PYTHON SOURCE LINES 37-65

.. code-block:: Python


    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token_id = tokenizer.eos_token_id

    data = [
        {"text": "Today is hot, but I will manage!!!!", "label": 1},
        {"text": "Tomorrow is cold", "label": 0},
        {"text": "Carpe diem", "label": 1},
        {"text": "Tempus fugit", "label": 1},
    ]
    dataset = Dataset.from_list(data)


    def tokenize(row):
        """Tokenize a dataset row for GPT-2.

        Returns:
            Tokenized dictionary for the given row.
        """
        return tokenizer(row["text"])


    dataset = dataset.map(tokenize, remove_columns=["text"])
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    dataloader = data_utils.DataLoader(
        dataset, batch_size=100, collate_fn=DataCollatorWithPadding(tokenizer)
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
    Map:   0%|          | 0/4 [00:00<?, ? examples/s]    Map: 100%|██████████| 4/4 [00:00<00:00, 166.24 examples/s]


.. GENERATED FROM PYTHON SOURCE LINES 66-69

Let's check the batch emitted by HF. We will see that it is a :code:`UserDict`,
containing the input and label tensors. Note that :code:`UserDict` is
:code:`MutableMapping`, so it is compatible with :code:`curvlinops`.

.. GENERATED FROM PYTHON SOURCE LINES 70-77

.. code-block:: Python


    data = next(iter(dataloader))
    print(f"Is the data a UserDict? {isinstance(data, UserDict)}")
    for k, v in data.items():
        print(k, v.shape)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Is the data a UserDict? True
    input_ids torch.Size([4, 9])
    attention_mask torch.Size([4, 9])
    labels torch.Size([4])


.. GENERATED FROM PYTHON SOURCE LINES 78-86

Model
-----

Curvlinops supports general :code:`UserDict` inputs. However, everything must
be handled inside the :code:`forward` function of the model. This gives
the users the most flexibility, without much overhead.

Let's wrap the HF model to conform this requirement then.

.. GENERATED FROM PYTHON SOURCE LINES 87-138

.. code-block:: Python


    class MyGPT2(Module):
        """Huggingface LLM wrapper.

        Args:
            tokenizer: The tokenizer used for preprocessing the text data. Needed
                since the model needs to know the padding token id.
        """

        def __init__(self, tokenizer: PreTrainedTokenizer) -> None:
            """Initialize the wrapper with a tokenizer."""
            super().__init__()
            config = GPT2Config.from_pretrained("gpt2")
            config.pad_token_id = tokenizer.pad_token_id
            config.num_labels = 2
            self.hf_model = GPT2ForSequenceClassification.from_pretrained(
                "gpt2", config=config
            )

            # For simplicity, only enable grad for the last layer
            for p in self.hf_model.parameters():
                p.requires_grad = False

            for p in self.hf_model.score.parameters():
                p.requires_grad = True

        def forward(self, data: MutableMapping) -> Tensor:
            """Run the model forward pass and move inputs to the correct device.

            Args:
                data: A dict-like data structure with `input_ids` inside.
                    This is the default data structure assumed by Huggingface
                    dataloaders.

            Returns:
                logits: An `(batch_size, n_classes)`-sized tensor of logits.
            """
            device = next(self.parameters()).device
            input_ids = data["input_ids"].to(device)
            output_dict = self.hf_model(input_ids)
            return output_dict.logits


    model = MyGPT2(tokenizer).to(bfloat16)

    with no_grad():
        logits = model(data)
        print(f"Logits shape: {logits.shape}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading weights:   0%|          | 0/148 [00:00<?, ?it/s]    Loading weights: 100%|██████████| 148/148 [00:00<00:00, 4464.82it/s]
    [transformers] GPT2ForSequenceClassification LOAD REPORT from: gpt2
    Key          | Status  | 
    -------------+---------+-
    score.weight | MISSING | 

    Notes:
    - MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
    [transformers] We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
    You may ignore this warning if your `pad_token_id` (50256) is identical to the `bos_token_id` (50256), `eos_token_id` (50256), or the `sep_token_id` (None), and your input is not padded.
    Logits shape: torch.Size([4, 2])


.. GENERATED FROM PYTHON SOURCE LINES 139-146

Curvlinops
----------

We are now ready to compute the curvature of this HF model using Curvlinops.
For this, we need to define a function to tell Curvlinops how to get the
batch size of the :code:`UserDict` input batch. Everything else is unchanged
from the standard usage of Curvlinops!

.. GENERATED FROM PYTHON SOURCE LINES 147-174

.. code-block:: Python


    def batch_size_fn(x: MutableMapping):
        """Return the batch size for a dict-like batch.

        Returns:
            Batch size for the input batch.
        """
        return x["input_ids"].shape[0]


    params = {n: p for n, p in model.named_parameters() if p.requires_grad}

    ggn = GGNLinearOperator(
        model,
        CrossEntropyLoss(),
        params,
        [(data, data["labels"])],  # We still need to input a list of "(X, y)" pairs!
        check_deterministic=False,
        batch_size_fn=batch_size_fn,  # Remember to specify this!
    )

    G = ggn @ eye(ggn.shape[0], device=next(iter(params.values())).device)

    print(f"GGN shape: {G.shape}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    GGN shape: torch.Size([1536, 1536])


.. GENERATED FROM PYTHON SOURCE LINES 175-181

Conclusion
----------

This :code:`UserDict` (or any other dict-like data structure) specification
is very flexible. This doesn't stop at HF models. You can leverage this
for any custom models!


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 29.323 seconds)


.. _sphx_glr_download_basic_usage_example_huggingface.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: example_huggingface.ipynb <example_huggingface.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: example_huggingface.py <example_huggingface.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: example_huggingface.zip <example_huggingface.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_