Using Local Tokenizers Without HuggingFace
AIPerf can be configured to use local tokenizers without requiring a connection to HuggingFace. This is particularly useful in environments where direct access to HuggingFace is blocked or restricted.
This guide shows you how to run AIPerf using locally stored tokenizer files instead of downloading them from HuggingFace.
Prerequisites
Before you begin, ensure you have:
- AIPerf installed
- Tokenizer files available locally (e.g., tokenizer.json, vocab.txt, config.json)
- A directory containing your tokenizer files in HuggingFace-compatible format
Prepare Your Local Tokenizer
1. Organize Tokenizer Files
Make sure your tokenizer files are stored in a local directory. A typical tokenizer directory structure looks like this:
2. Verify File Format
Ensure your tokenizer files match the HuggingFace/tokenizers format. The files should be compatible with the transformers library’s tokenizer loading mechanism.
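A quick pre-flight check can catch obvious problems before AIPerf is launched. The sketch below is only an illustration: the function name inspect_tokenizer_dir is hypothetical, and the list of file names is taken from the files this guide mentions, so a valid tokenizer may legitimately ship only a subset of them. It reports which of those files exist and whether the JSON ones parse.

```python
import json
from pathlib import Path

# File names mentioned in this guide; a real tokenizer may ship only a subset.
KNOWN_FILES = ("tokenizer.json", "vocab.txt", "vocab.json", "config.json")


def inspect_tokenizer_dir(path: str) -> dict:
    """Report which known tokenizer files exist and whether JSON files parse."""
    root = Path(path)
    report = {"found": [], "missing": [], "invalid_json": []}
    for name in KNOWN_FILES:
        candidate = root / name
        if not candidate.is_file():
            report["missing"].append(name)
            continue
        report["found"].append(name)
        # Only attempt to parse files that are supposed to be JSON.
        if candidate.suffix == ".json":
            try:
                json.loads(candidate.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                report["invalid_json"].append(name)
    return report
```

An empty invalid_json list only means the files are well-formed JSON. It does not prove that the transformers library will accept them.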
Run AIPerf with Local Tokenizer
Use the --tokenizer parameter to specify the path to your local tokenizer directory or file:
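If you launch AIPerf from a script, one possible approach is to resolve the tokenizer path to an absolute path first and then assemble the command. In the sketch below only --tokenizer comes from this guide. The profile subcommand, the --model flag, the model name, and the helper build_aiperf_command are assumptions made for illustration, so adapt them to the options your AIPerf version actually accepts.

```python
import shlex
from pathlib import Path


def build_aiperf_command(model: str, tokenizer_path: str) -> list:
    """Build an argv list that points AIPerf at a local tokenizer."""
    # Resolve to an absolute path so the command works from any directory.
    resolved = Path(tokenizer_path).expanduser().resolve()
    # ASSUMPTION: `profile` and `--model` are illustrative placeholders;
    # only `--tokenizer` is taken from this guide.
    return ["aiperf", "profile", "--model", model, "--tokenizer", str(resolved)]


cmd = build_aiperf_command("my-model", "./tokenizers/my-model")
# Print a shell-quoted version for copy/paste; nothing is executed here.
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues when the path contains spaces.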
Example: Using a Local Llama Tokenizer
Sample Output (Successful Run):
Example: Using a Local Qwen Tokenizer
Using Custom Tokenizers
If you are using a custom tokenizer (one that is not a standard pretrained model from HuggingFace), you can still use it with AIPerf as long as it meets the requirement below.
Requirement: HuggingFace Format
Crucial: Your custom tokenizer MUST be saved in the HuggingFace transformers format. AIPerf relies on the transformers library to load tokenizers, so a tokenizer that transformers cannot load will not work with AIPerf.
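As a rough sketch of what saving in the HuggingFace transformers format can look like, the example below wraps a toy word-level tokenizer built with the tokenizers library in PreTrainedTokenizerFast and calls save_pretrained. The function name save_wordlevel_tokenizer and the vocabulary are hypothetical, and your own tokenizer will be constructed differently. It assumes the transformers and tokenizers packages are installed.

```python
from pathlib import Path


def save_wordlevel_tokenizer(vocab: dict, out_dir: str) -> list:
    """Wrap a toy word-level tokenizer and save it in HuggingFace format."""
    # Imported lazily so this module can be imported without these packages.
    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace
    from transformers import PreTrainedTokenizerFast

    # The vocabulary must contain the unknown token used below.
    raw = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
    raw.pre_tokenizer = Whitespace()

    # Wrapping the raw tokenizer gives access to save_pretrained().
    wrapped = PreTrainedTokenizerFast(tokenizer_object=raw, unk_token="[UNK]")
    wrapped.save_pretrained(out_dir)

    # Return the names of the files that were written.
    return sorted(p.name for p in Path(out_dir).iterdir())
```

For example, save_wordlevel_tokenizer({"[UNK]": 0, "hello": 1, "world": 2}, "./my-tokenizer") writes a directory whose path can then be passed to --tokenizer.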
Important Notes
No HuggingFace Connection Required
- When you specify a local path with --tokenizer, AIPerf loads the tokenizer directly from your local files
- No files will be downloaded from HuggingFace when using a local tokenizer path
- No internet connection to HuggingFace servers is required
Fully Air-Gapped Environments
For strictly air-gapped environments where you want to explicitly forbid any connection attempts, you can set the following environment variables:
This ensures that the underlying transformers library operates in offline mode.
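When AIPerf is started from a Python wrapper script, the offline switches can be applied to the child process environment instead of the whole shell. The sketch below assumes HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, the offline variables documented by the huggingface_hub and transformers libraries. The helper name offline_environment is hypothetical.

```python
import os

# ASSUMPTION: these are the offline switches read by huggingface_hub and
# transformers; setting them to "1" disables network access in those libraries.
OFFLINE_ENV = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}


def offline_environment(base=None) -> dict:
    """Return a copy of the environment with HuggingFace offline mode enabled."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLINE_ENV)
    return env


env = offline_environment()
# Pass `env=env` to subprocess.run(...) when launching AIPerf from a script.
```

Building a copy keeps the parent process environment untouched, which is convenient when the same script also runs steps that do need network access.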
File Format Compatibility
- Your local tokenizer directory structure and files must match the HuggingFace/tokenizers format
- The tokenizer files should include standard files like tokenizer.json, vocab.txt, or vocab.json
- AIPerf uses the same tokenizer loading mechanism as the transformers library
No Extra Flags Needed
- You do not need to set any additional flags unless your tokenizer requires custom code execution
- The --tokenizer parameter accepts both directory paths and direct file paths
Troubleshooting
Tokenizer Not Found
If you encounter errors about missing tokenizer files:
- Verify the path you provided is correct
- Check that the directory contains the required tokenizer files
- Ensure file permissions allow AIPerf to read the files
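The three checks above can be scripted. The following is a minimal sketch using only the Python standard library. The function name diagnose_tokenizer_path is hypothetical, and the checks cover path existence and read permissions only, not tokenizer validity.

```python
import os
from pathlib import Path


def diagnose_tokenizer_path(path: str) -> list:
    """Return human-readable problems with a tokenizer path (empty if none)."""
    target = Path(path).expanduser()
    if not target.exists():
        return [f"path does not exist: {target}"]

    # Directories need read + execute permission to be listed.
    needed = (os.R_OK | os.X_OK) if target.is_dir() else os.R_OK
    if not os.access(target, needed):
        return [f"insufficient permissions: {target}"]

    problems = []
    if target.is_dir():
        files = [p for p in target.iterdir() if p.is_file()]
        if not files:
            problems.append(f"directory contains no files: {target}")
        # Flag individual files the current user cannot read.
        problems += [f"file is not readable: {p}" for p in files
                     if not os.access(p, os.R_OK)]
    return problems
```

Run it as the same user that runs AIPerf, since permission results depend on who performs the check.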
Incompatible Tokenizer Format
If the tokenizer fails to load:
- Verify your tokenizer files are in HuggingFace-compatible format
- Check that all required files are present (tokenizer.json, config.json, etc.)
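Because AIPerf relies on the transformers library to load tokenizers, one way to isolate a format problem is to load the same path with transformers directly and read the error it reports. The sketch below uses AutoTokenizer.from_pretrained with local_files_only=True. Whether AIPerf calls exactly this function is an assumption, and the helper name try_load_tokenizer is hypothetical.

```python
def try_load_tokenizer(path: str):
    """Return (tokenizer, None) on success or (None, error message) on failure."""
    try:
        # Imported lazily; a missing transformers install is reported as an error.
        from transformers import AutoTokenizer

        # local_files_only=True prevents any download attempt.
        tokenizer = AutoTokenizer.from_pretrained(path, local_files_only=True)
        return tokenizer, None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
```

If this call fails outside AIPerf as well, the problem most likely lies in the tokenizer files rather than in the AIPerf configuration.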