Machine‑Learning Operations (ais ml)

Introduced in v3.30, the ml namespace is intended for commands that target ML‑centric data workflows — bulk extraction of training samples, cross‑bucket collation of model artifacts, and manifest‑driven slicing of large corpora.

In this document:

  • ais ml get-batch – one‑shot consolidation of objects (and archived sub‑objects) into a single TAR/TGZ/ZIP/TAR.LZ4.

  • ais ml lhotse-get-batch – higher‑level driver that reads Lhotse cut manifests and spawns one or many get-batch jobs on your behalf.

Subcommands

$ ais ml <TAB><TAB>
get-batch          # Consolidate objects / archived files into one archive
lhotse-get-batch   # Drive multiple get-batch runs from Lhotse manifests

--help is available at every level:

$ ais ml get-batch --help
$ ais ml lhotse-get-batch --help

ais ml get-batch

Fetch one set of inputs (objects, archived files, or byte ranges), possibly spanning multiple buckets and providers, and package them into a single output archive. The default format is TAR; to switch, give the destination name a different supported extension such as .tgz, .zip, or .tar.lz4.

Typical uses:

  • Build a reproducible training snapshot from disparate buckets.
  • Extract a handful of files from a huge shard without rewriting the shard.
  • Stream an archive directly into a training job (--streaming).

Usage

ais ml get-batch [SRC ...] DST_ARCHIVE [flags]

# SRC can be:
# - bucket or virtual directory (ais://speech/)
# - single object (s3://models/chkpt-001.pth)
# - archived path inside an object (ais://data/shard.tar/file.wav)

Examples

| Goal | One-liner |
|---|---|
| Package two objects as training.tar | ais ml get-batch ais://dataset --list "audio1.wav,audio2.wav" training.tar |
| Build compressed range archive | ais ml get-batch ais://models --template "checkpoint-{001..100}.pth" model-v2.tgz |
| Extract a single file from a shard | ais ml get-batch ais://data/shard.tar/file.wav dataset.zip |
| Drive via JSON spec on disk | ais ml get-batch training.tar --spec batch.json |
| Inline YAML spec, stream to disk | ais ml get-batch /tmp/out.tar --spec '{ in: [...], streaming_get: true }' --streaming |
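Once get-batch has written an archive such as training.tar, a training job can walk its members without unpacking them to disk. The following is a minimal sketch using Python's standard tarfile module; the helper name iter_samples is illustrative and not part of AIStore. It covers TAR and TGZ output only (ZIP and TAR.LZ4 would need other readers).

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_samples(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, payload) pairs from a TAR stream, in archive order.

    Mode "r|*" reads strictly sequentially (no seeking) and auto-detects
    gzip compression, so it also works on pipes and sockets.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and special entries
            extracted = tar.extractfile(member)
            if extracted is not None:
                # Read the payload before advancing to the next member.
                yield member.name, extracted.read()


# Typical use (assumes training.tar exists in the working directory):
#   with open("training.tar", "rb") as f:
#       for name, payload in iter_samples(f):
#           print(name, len(payload))
```

Because the reader never seeks, the same function can consume an archive that is still being received, which is the situation --streaming is meant for.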

Key Options

| Flag | Purpose |
|---|---|
| --spec, -f PATH | JSON or YAML request; overrides CLI flags. |
| --list OBJ1,OBJ2,... | Explicit comma-separated object list. |
| --template STR | Brace-expansion template (supports ranges, steps). |
| --prefix DIR/ | Only objects whose names start with the prefix. |
| --omit-src-bck | Strip bucket name inside the resulting archive. |
| --streaming | Stream archive as it is built (lower latency, constant memory). |
| --cont-on-err | Keep going on per-object errors (best-effort). |
| --yes, -y | Assume "yes" to all prompts (non-interactive). |
| --non-verbose, --nv | Minimal progress output. |
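To preview which names a --template value selects, the expansion can be approximated locally. The sketch below handles numeric {start..end} and {start..end..step} ranges with zero-padding taken from the start value. It is an illustration only: expand_template is a hypothetical helper, and AIStore's own expansion may treat edge cases (for instance the empty or '*' template) differently.

```python
import itertools
import re
from typing import List

# Matches {start..end} or {start..end..step}; digits only.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)(?:\.\.(\d+))?\}")


def expand_template(template: str) -> List[str]:
    """Expand numeric brace ranges, keeping the zero-padding of the start value."""
    literals = []  # text preceding each range
    choices = []   # expanded values for each range
    pos = 0
    for m in _RANGE.finditer(template):
        literals.append(template[pos:m.start()])
        start, end = m.group(1), m.group(2)
        step = int(m.group(3)) if m.group(3) else 1
        if step < 1:
            raise ValueError("step must be positive")
        width = len(start)
        # The end of the range is inclusive.
        values = range(int(start), int(end) + 1, step)
        choices.append([str(n).zfill(width) for n in values])
        pos = m.end()
    tail = template[pos:]
    names = []
    # Cartesian product: the leftmost range varies slowest.
    for combo in itertools.product(*choices):
        names.append("".join(lit + val for lit, val in zip(literals, combo)) + tail)
    return names
```

For example, expand_template("checkpoint-{001..100}.pth") yields 100 names, from checkpoint-001.pth through checkpoint-100.pth.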

Full proto definition: api/apc/ml.go.


ais ml lhotse-get-batch

Consumes a Lhotse cuts.jsonl[.gz | .lz4] manifest and spawns one or many get-batch transactions. Ideal for speech/ASR pipelines where a manifest describes thousands of time‑offsets across many recordings.

Key distinction:

| Command | Output archives | Driving data |
|---|---|---|
| get-batch | exactly one | CLI flags or JSON/YAML spec |
| lhotse-get-batch | one or many | Lhotse cut manifest |

Lhotse Manifest

In a Lhotse manifest, each cut JSON line lists one or more recording sources (URI + byte/time offsets).

Manifests may be:

  • Plain text – cuts.jsonl
  • Gzip‑compressed – cuts.jsonl.gz or cuts.jsonl.gzip
  • LZ4‑compressed – cuts.jsonl.lz4

Each line is an independent cut. See the Lhotse docs for the full schema.
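Since every line is a self-contained JSON object, a manifest can be inspected before it is handed to lhotse-get-batch. The sketch below reads plain and gzip-compressed manifests with the Python standard library only; iter_cuts is a hypothetical helper, it does not interpret any cut fields, and LZ4 input is deliberately left out because it needs a third-party package.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_cuts(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a plain or gzip-compressed manifest."""
    if path.endswith(".lz4"):
        # LZ4 is not part of the Python standard library.
        raise ValueError("LZ4 manifests are not handled by this sketch")
    opener = gzip.open if path.endswith((".gz", ".gzip")) else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


# Typical use: count cuts to choose a sensible --batch-size.
#   total = sum(1 for _ in iter_cuts("cuts.jsonl.gz"))
```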

Usage

ais ml lhotse-get-batch --cuts manifest.jsonl[.gz | .lz4] [DST] [flags]

# With --output-template you may omit DST; the template expands per batch.

$ ais ml lhotse-get-batch --help
NAME:
   ais ml lhotse-get-batch - Get multiple objects from Lhotse manifests and package into consolidated archive(s).
     Returns TAR by default; supported formats include: .tar, .tgz or .tar.gz, .zip, .tar.lz4.
     Supports chunking, filtering, and multi-output generation from Lhotse cut manifests.
     Lhotse manifest format: each line contains a single cut JSON object with recording sources;
     Lhotse manifest may be plain (`.jsonl`), gzip-compressed (`.jsonl.gz` / `.gzip`), or LZ4-compressed (`.jsonl.lz4`).
     Examples:
     - 'ais ml lhotse-get-batch --cuts manifest.jsonl.gz output.tar' - entire manifest as single TAR;
     - 'ais ml lhotse-get-batch --cuts cuts.jsonl --sample-rate 16000 output.tar' - with sample rate conversion;
     - 'ais ml lhotse-get-batch --cuts m.jsonl.lz4 --batch-size 1000 --output-template "a-{001..999}.tar"' - generate 999 'a-*.tar' batches (1000 cuts each)
     See [Lhotse docs](https://lhotse.readthedocs.io/) for manifest format details.

USAGE:
   ais ml lhotse-get-batch [BUCKET[/NAME_or_TEMPLATE] ...] [DST_ARCHIVE] --spec [JSON_SPECIFICATION|YAML_SPECIFICATION] [command options]

OPTIONS:
   batch-size       number of cuts per output file
   cont-on-err      Keep running archiving xaction (job) in presence of errors in any given multi-object transaction
   cuts             path to Lhotse cuts.jsonl or cuts.jsonl.gz or cuts.jsonl.lz4
   list             Comma-separated list of object or file names, e.g.:
                    --list 'o1,o2,o3'
                    --list "abc/1.tar, abc/1.cls, abc/1.jpeg"
                    or, when listing files and/or directories:
                    --list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
   non-verbose,nv   Non-verbose (quiet) output, minimized reporting, fewer warnings
   omit-src-bck     When set, strip source bucket names from paths inside the archive (ie., use object names only)
   output-template  template for multiple output files (e.g. 'batch-{001..999}.tar')
   prefix           Select virtual directories or objects with names starting with the specified prefix, e.g.:
                    '--prefix a/b/c' - matches names 'a/b/c/d', 'a/b/cdef', and similar;
                    '--prefix a/b/c/' - only matches objects from the virtual directory a/b/c/
   sample-rate      audio sample-rate (Hz); used to convert sample offsets (in seconds) to byte offsets
   spec,f           Path to JSON or YAML request specification
   streaming        stream the resulting archive prior to finalizing it in memory
   template         Template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
                    (with optional steps and gaps), e.g.:
                    --template "" # (an empty or '*' template matches everything)
                    --template 'dir/subdir/'
                    --template 'shard-{1000..9999}.tar'
                    --template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
                    and similarly, when specifying files and directories:
                    --template '/home/dir/subdir/'
                    --template "/abc/prefix-{0010..9999..2}-suffix"
   yes,y            Assume 'yes' to all questions
   help, h          Show help

Examples

| Goal | Command |
|---|---|
| Single TAR from entire manifest | ais ml lhotse-get-batch --cuts manifest.jsonl.gz output.tar |
| Convert time offsets to byte offsets at 16 kHz | ais ml lhotse-get-batch --cuts cuts.jsonl --sample-rate 16000 output.tar |
| Chunk into 1,000-cut archives | ais ml lhotse-get-batch --cuts manifest.jsonl.lz4 --batch-size 1000 --output-template "batch-{001..999}.tar" |
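The --sample-rate flag is needed because cut offsets are expressed in seconds while range reads need byte positions. As an illustration of that arithmetic only, the sketch below assumes headerless, uncompressed PCM with a fixed sample width and channel count. Those assumptions, the defaults, and the function name are ours; the conversion AIStore actually applies may differ.

```python
from typing import Tuple


def seconds_to_byte_range(
    start_sec: float,
    duration_sec: float,
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM
    channels: int = 1,          # assumption: mono
) -> Tuple[int, int]:
    """Return (byte_offset, byte_length) for a time span in raw PCM audio."""
    frame_size = bytes_per_sample * channels      # bytes per sampling instant
    first_frame = round(start_sec * sample_rate)  # seconds -> frame index
    num_frames = round(duration_sec * sample_rate)
    return first_frame * frame_size, num_frames * frame_size
```

With these assumptions, a cut starting at 1.5 s and lasting 2.0 s at 16000 Hz covers 32000 frames beginning at frame 24000, i.e. 64000 bytes starting at byte 48000.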

Key Options

| Flag | Purpose |
|---|---|
| --cuts PATH | Required. Path to cuts.jsonl or .jsonl.gz or .jsonl.lz4. |
| --batch-size N | Number of cuts per output archive (default: all). |
| --output-template STR | Brace-expansion template for multi-file output. |
| --sample-rate HZ | Convert time offsets (sec) → byte ranges for this rate. |
| --spec, -f PATH | Optional JSON/YAML spec (applies to every batch). |
--spec, -f PATHOptional JSON/YAML spec (applies to every batch).

Remaining flags (--list, --template, --prefix, --omit-src-bck, --streaming, --cont-on-err, --yes, --nv) behave exactly as they do in get-batch.
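The interplay of --batch-size and --output-template comes down to simple bookkeeping: consecutive groups of N cuts are paired, in order, with the names the template produces. The sketch below models that pairing. plan_batches is a hypothetical helper, and raising an error when the template yields too few names is this sketch's choice, not documented AIStore behavior.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def plan_batches(
    cuts: Sequence[T], batch_size: int, names: Sequence[str]
) -> List[Tuple[str, Sequence[T]]]:
    """Pair consecutive groups of `batch_size` cuts with output names, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    needed = math.ceil(len(cuts) / batch_size)  # the last batch may be short
    if needed > len(names):
        raise ValueError(
            f"template yields {len(names)} names but {needed} batches are needed"
        )
    return [
        (names[i], cuts[i * batch_size:(i + 1) * batch_size])
        for i in range(needed)
    ]
```

A quick check of this kind before launching a large job shows whether a template such as "batch-{001..999}.tar" has enough slots for the manifest at the chosen batch size.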


Tips & Best Practices

  • Streaming (--streaming) vs multipart (buffered). The flag only toggles how AIStore emits the response; nothing changes in the CLI itself.

    | Mode | AIS behavior on the wire | When it helps |
    |---|---|---|
    | --streaming | AIStore delivers the next set of requested samples as soon as possible; the resulting (received) archive grows incrementally; there is no leading apc.MossResp JSON message. | You require early-data/low-latency playback; AIStore shows a low-memory alert (via ais show cluster or Grafana) |
    | Multipart (flag omitted) | AIS first assembles the entire resulting archive in memory. The HTTP reply then has two parts: (1) apc.MossResp JSON (ordered one-to-one with the request); (2) TAR/TGZ/.tar.lz4/ZIP payload. | Plenty of memory on the server side; you want the header (sizes, order) upfront; maximum bulk throughput on a fast LAN |

  • Leaving bucket/provider blank in a spec entry

    • Pass a default bucket when you call the API, e.g. ais ml get-batch ais://my-bck --spec …. Any entry with an empty bucket field inherits my-bck.
    • If provider is omitted, the cluster assumes ais://. For S3, GCS, Azure, OCI, etc., always spell out "provider":"s3", "gcp", … per entry.
    • Mixed‑provider requests are fine; just be explicit for every non‑AIS object.
  • Manifests at scale. lhotse-get-batch --batch-size divides the manifest on the client side—no need to pre‑split the file.

  • Check specs into Git. Treat JSON/YAML batch specs as immutable artifacts for reproducibility.
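The bucket/provider defaulting rules from the tips above can be modeled on the client side to sanity-check a spec before submitting it. This is a sketch of the stated rules only: resolve_entry is a hypothetical helper, the bucket and provider keys follow the text above, and the objname key is an assumed field name used purely for illustration.

```python
from typing import Any, Dict, Optional


def resolve_entry(
    entry: Dict[str, Any], default_bucket: Optional[str] = None
) -> Dict[str, Any]:
    """Return a copy of a spec entry with blank bucket/provider filled in."""
    resolved = dict(entry)  # leave the caller's dict untouched
    if not resolved.get("bucket"):
        if default_bucket is None:
            raise ValueError("entry has no bucket and no default bucket was given")
        resolved["bucket"] = default_bucket  # inherit the default bucket
    if not resolved.get("provider"):
        resolved["provider"] = "ais"  # an omitted provider means ais://
    return resolved
```

Running every entry of a checked-in spec through such a function makes implicit defaults visible in review, which is particularly useful for mixed-provider requests.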


See Also