Input and Output Format#
Expectations for input_path layouts, shard layout under output_dir, output_mode, chat reconstruction, and FAITH columns.
Inputs#
Supported layouts include JSON Lines (JSONL), with one JSON object per line, or Apache Parquet columnar files.
JSONL records contain arbitrary JSON objects per line. The
text_fieldsetting selects which stringsTranslationStagevisits. Wildcards expand across arrays such asmessagesitems.Parquet inputs share the same logical schema. The Curator
ParquetReaderpartitions row groups according tofiles_per_partitionandblocksizeoverrides.
Avoid pointing input_path at directories that mix .jsonl and .parquet shards when input_format=auto. The reader raises an error instead of guessing.
Outputs#
The JsonlWriter and ParquetWriter emit multiple shards under output_dir because Curator pipelines partition work for parallelism. Downstream steps should treat output_dir as the artifact, not individual filenames.
output_mode#
Mode |
Producer behavior |
|---|---|
|
Source strings at |
|
Original strings remain. Translations and metadata appear in auxiliary columns defined by |
|
Emits replaced views and retains intermediate structures. This is the starter default for auditing. |
Chat Reconstruction#
When reconstruct_messages=true, expect parallel arrays such as translated_messages mirroring the messages layout but with translated content entries. This simplifies quality assurance review without sacrificing structured tool-call payloads.
FAITH Annotations#
When faith_eval.enabled=true and merge_scores=true, each record carries score blobs aligned with segment boundaries. When filter_enabled=true, low-trust rows disappear entirely from output_dir shards.
Sampling Outputs#
Use find, head, and python3 -m json.tool on any emitted shard to inspect one translated row:
find ./output_dir -name '*.jsonl' | head -n 1 | xargs head -n 1 | python3 -m json.tool --no-ensure-ascii