Data Format

This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information.

Hugging Face Chat Datasets

Hugging Face chat datasets are expected to have the following structure: each example in the dataset should be a dictionary with a messages key, where messages is a list of dictionaries, each with a role key and a content key. The role typically takes one of the following values: system, user, or assistant. For example:

{
    "messages": [
        {
            "role": "system",
            "content": "This is a helpful system message."
        },
        {
            "role": "user",
            "content": "This is a user's question"
        },
        {
            "role": "assistant",
            "content": "This is the assistant's response."
        }
    ]
}
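
If the examples are already available in memory, a dataset with this structure can be built directly with the Hugging Face datasets library. The following is a minimal sketch; the content strings are purely illustrative:

from datasets import Dataset

# Build a small chat dataset in the structure described above.
# The example content here is illustrative.
examples = [
    {
        "messages": [
            {"role": "system", "content": "This is a helpful system message."},
            {"role": "user", "content": "This is a user's question."},
            {"role": "assistant", "content": "This is the assistant's response."},
        ]
    }
]
chat_dataset = Dataset.from_list(examples)
assert chat_dataset[0]["messages"][1]["role"] == "user"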

Chat Templates

Formatting the data in this way allows us to take advantage of the apply_chat_template functionality of Hugging Face tokenizers to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the Hugging Face apply_chat_template documentation for details.
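
For instance, if a tokenizer ships with its own chat template, the messages can be combined with a single call. The sketch below assumes an instruction-tuned checkpoint whose tokenizer defines a chat_template; the model name is illustrative:

from transformers import AutoTokenizer

# Assumes the checkpoint's tokenizer has an associated chat_template
# (typically true of instruction-tuned models).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]

# With no chat_template argument, the tokenizer's built-in template is used.
print(tokenizer.apply_chat_template(messages, tokenize=False))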

By default, apply_chat_template applies the chat_template associated with the tokenizer, as above. In some cases, however, users may want to specify their own chat template. Note also that many tokenizers do not have an associated chat_template at all, in which case an explicit chat template is required. An explicit chat template is written as a string in Jinja format and passed to apply_chat_template through the chat_template argument. The following is an example using a simple template that prepends a role header to each turn:

from transformers import AutoTokenizer

# For each message, prepend a role header and append an end-of-turn token.
example_template = "{% for message in messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}{{ content }}{% endfor %}"

example_input = [
    {
        'role': 'user',
        'content': 'Hello!'
    },
    {
        'role': 'assistant',
        'content': 'Hi there!'
    }
]
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Pass the explicit template; tokenize=False returns the formatted
# string rather than token IDs.
output = tokenizer.apply_chat_template(
    example_input, chat_template=example_template, tokenize=False
)

# This is the output string we expect
expected_output = '<|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there!<|eot_id|>'
assert output == expected_output
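
In practice, the template is usually applied to every example in a dataset rather than to a single conversation. A minimal sketch using the datasets library's map, reusing example_template and the illustrative chat_dataset constructed earlier:

# Format every example, storing the result in a new "text" column.
def format_example(example):
    example["text"] = tokenizer.apply_chat_template(
        example["messages"], chat_template=example_template, tokenize=False
    )
    return example

formatted_dataset = chat_dataset.map(format_example)
print(formatted_dataset[0]["text"])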

For more details on creating chat templates, refer to the Hugging Face documentation.