# Machine‑Learning Operations (`ais ml`)
Introduced in v3.30, the ml namespace is intended for commands that target ML‑centric data
workflows — bulk extraction of training samples, cross‑bucket collation of model
artifacts, and manifest‑driven slicing of large corpora.
In this document:
- `ais ml get-batch` – one‑shot consolidation of objects (and archived sub‑objects) into a single TAR/TGZ/ZIP/TAR.LZ4.
- `ais ml lhotse-get-batch` – higher‑level driver that reads Lhotse cut manifests and spawns one or many `get-batch` jobs on your behalf.
Jump straight to the section you need:

## Table of Contents

- [Subcommands](#subcommands)
- [`ais ml get-batch`](#ais-ml-get-batch)
- [`ais ml lhotse-get-batch`](#ais-ml-lhotse-get-batch)
- [Tips & Best Practices](#tips--best-practices)
- [See Also](#see-also)
## Subcommands
`--help` is available at every level:
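For example:

```console
$ ais ml --help
$ ais ml get-batch --help
$ ais ml lhotse-get-batch --help
```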
## `ais ml get-batch`
Fetch one set of inputs—objects, archived files, or byte ranges—possibly
spanning multiple buckets and providers, and package them into a single
output archive.
Default format is TAR; name the destination with a `.tgz`, `.zip`,
`.tar.lz4`, etc. extension to switch.
Typical uses:
- Build a reproducible training snapshot from disparate buckets.
- Extract a handful of files from a huge shard without rewriting the shard.
- Stream an archive directly into a training job (`--streaming`).
### Usage
### Examples
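A minimal sketch — `batch.json` is an illustrative spec-file name; the default-bucket `--spec` pattern and the `--streaming` flag are described under Tips below, and the exact argument order is an assumption:

```console
# Consolidate everything listed in batch.json into a single TAR;
# entries with an empty "bucket" field default to ais://my-bck
$ ais ml get-batch ais://my-bck output.tar --spec batch.json

# Same inputs, gzip'ed archive, streamed as it is produced
$ ais ml get-batch ais://my-bck output.tgz --spec batch.json --streaming
```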
### Key Options
Full proto definition: `api/apc/ml.go`.
## `ais ml lhotse-get-batch`
Consumes a Lhotse `cuts.jsonl[.gz|.lz4]` manifest and spawns one or many
`get-batch` transactions. Ideal for speech/ASR pipelines where a manifest
describes thousands of time‑offsets across many recordings.
Key distinction: the list of inputs is derived from the manifest rather than spelled out in an explicit batch spec.
### Lhotse Manifest
In a Lhotse manifest, each cut JSON line lists one or more recording sources (URI + byte/time offsets).
Manifests may be:
- Plain text – `cuts.jsonl`
- Gzip‑compressed – `cuts.jsonl.gz` or `cuts.jsonl.gzip`
- LZ4‑compressed – `cuts.jsonl.lz4`
Each line is an independent cut. See the Lhotse docs for the full schema.
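Because a manifest is plain JSON Lines (optionally compressed), it can be inspected without Lhotse installed. A minimal sketch covering the plain and gzip variants (the LZ4 variant would additionally need the third-party `lz4` package):

```python
import gzip
import json
from pathlib import Path

def iter_cuts(path: Path):
    """Yield one cut (as a dict) per line of a cuts.jsonl or cuts.jsonl.gz manifest."""
    opener = gzip.open if path.suffix in (".gz", ".gzip") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Example (hypothetical manifest path):
# cuts = list(iter_cuts(Path("cuts.jsonl.gz")))
# print(len(cuts), sum(c.get("duration", 0) for c in cuts))
```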
### Usage
### Examples
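A hypothetical invocation — the flag that points at the manifest is a stand-in name, not confirmed by this document; `--batch-size` and compressed-manifest support are described elsewhere on this page:

```console
# Read a gzip'ed Lhotse manifest and split the work into batches of 1000 cuts
# (--cuts is an assumed flag name; check `ais ml lhotse-get-batch --help`)
$ ais ml lhotse-get-batch ais://speech out.tar --cuts cuts.jsonl.gz --batch-size 1000
```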
### Key Options
Remaining flags (`--list`, `--template`, `--prefix`, `--omit-src-bck`, `--streaming`, `--cont-on-err`, `--yes`, `--nv`) behave exactly as in `get-batch`.
## Tips & Best Practices
- **Streaming (`--streaming`) vs multipart (buffered).** The flag only toggles how AIStore emits the response — nothing changes in the CLI itself.
- **Leaving `bucket`/`provider` blank in a spec entry.**
  - Pass a default bucket when you call the API — e.g. `ais ml get-batch ais://my-bck --spec …`. Any entry with an empty `bucket` field inherits `my-bck`.
  - If `provider` is omitted, the cluster assumes `ais://`. For S3, GCS, Azure, OCI, etc., always spell out `"provider":"s3"`, `"gcp"`, … per entry.
  - Mixed‑provider requests are fine; just be explicit for every non‑AIS object.
- **Manifests at scale.** `lhotse-get-batch --batch-size` divides the manifest on the client side — no need to pre‑split the file.
- **Check specs into Git.** Treat JSON/YAML batch specs as immutable artifacts for reproducibility.
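The per-entry defaulting rules above can be made concrete with a small sketch that builds and serializes a spec. The `bucket` and `provider` field names come from the rules above; the `"in"` key and `"objname"` field are illustrative assumptions, not the documented schema:

```python
import json

spec = {
    "in": [
        # Empty bucket -> inherits the default bucket passed on the command line
        # (e.g. `ais ml get-batch ais://my-bck --spec spec.json`)
        {"bucket": "", "objname": "train/sample-0001.wav"},
        # Non-AIS providers must be spelled out per entry
        {"bucket": "corpus", "provider": "s3", "objname": "train/sample-0002.wav"},
        {"bucket": "corpus", "provider": "gcp", "objname": "train/sample-0003.wav"},
    ]
}

# Check specs into Git: serialize deterministically so diffs stay reviewable
print(json.dumps(spec, indent=2, sort_keys=True))
```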
## See Also
- API reference: `api/apc/ml.go`
- Lhotse docs: https://lhotse.readthedocs.io/
- Tutorial: Building a speech‑dataset snapshot with `ais ml lhotse-get-batch` (WIP)
- General CLI search: `ais search`