Curation Troubleshooting#
Symptom |
Likely Cause |
Fix |
|---|---|---|
|
Path does not exist in the current host, container, or shared mount |
Use an absolute path or verify the mount. For local tiny runs, override the packaged |
Missing FastText model |
|
Set |
|
Only one word-count bound was set |
Set both |
Output is empty or much smaller than expected |
Filters are too strict or applied before corpus shape is understood |
Re-run with |
Ray worker starts a new |
Local |
Export |
Large local file causes memory pressure |
Input shard is too large for the available Ray worker memory |
Split large JSONL files into smaller shards before running Curator. |
Domain classifier downloads repeatedly |
Hugging Face cache path is not persistent |
Set |
Debug Checklist#
Run the local tiny command with filters disabled.
Confirm
uv sync --extra curatecompleted in the same repository clone.Confirm the input path exists where the command runs.
Confirm
output_diris writable.Add language, word-count, and domain filters one at a time.