Start, Stop, and monitor distributed parallel sorting (dSort)
Start, Stop, and monitor distributed parallel sorting (dSort)
Start, Stop, and monitor distributed parallel sorting (dSort)
For background and in-depth presentation, please see this document.
ais dsort [SRC_BUCKET] [DST_BUCKET] —spec [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [command options]
This example simply runs ais/test/scripts/dsort-ex1-spec.json specification. The source and destination buckets - ais://src and ais://dst, respectively - must exist.
Further, the source buckets must have at least 10 shards with names that match input_format (see below).
Notice the -v (--verbose) switch as well.
ais archive gen-shards "BUCKET/TEMPLATE.EXT"
Put randomly generated shards into a bucket. The main use case for this command is dSort testing. Further reference for this command can be found here.
ais start dsort --spec JOB_SPEC or ais start dsort -f <PATH_TO_JOB_SPEC>
Start new dSort job with the provided specification.
Specification should be provided by either argument or -f flag - providing both argument and flag will result in error.
Upon creation, JOB_ID of the job is returned - it can then be used to abort it or retrieve metrics.
The following table describes JSON/YAML keys which can be used in the specification.
There’s also the possibility to override some of the values from global distributed_sort config via job specification.
All values are optional - if empty, the value from global distributed_sort config will be used.
For more information refer to configuration.
Command defined below starts (alphanumeric) sorting job with extended metrics for input shards with names shard-0.tar, shard-1.tar, …, shard-9.tar.
Each of the output shards will have at least 10240 bytes (10KB) and will be named new-shard-0000.tar, new-shard-0001.tar, …
Assuming that dsort_spec.json contains:
You can start dSort job with:
Command defined below starts basic shuffle job for input shards with names shard-0.tar, shard-1.tar, …, shard-9.tar.
Each of the output shards will have at least 10240 bytes (10KB) and will be named new-shard-0000.tar, new-shard-0001.tar, …
One of the key features of the dSort is that user can specify the exact mapping from the record key to the output shard.
To use this feature output_format should be empty and ekm_file, as well as ekm_file_sep, must be set.
The output shards will be created with provided template format.
Assuming that ekm_file (URL: http://website.web/static/ekm_file.txt) has content:
or if ekm_file (URL: http://website.web/static/ekm_file.json, notice .json extension) and has content:
or, you can also use regex as the record identifier. The ekm_file can contain regex patterns as keys to match multiple records that fit the regex pattern to provided format.
and content of the input shards looks more or less like this:
You can run:
After the run, the output shards will look more or less like this (the number of records in given shard depends on provided output_shard_size):
EKM also supports template syntax to express output shard names.
For example, if ekm_file has content:
After running dsort, the output would be look like this:
ais show job dsort [JOB_ID]
Retrieve the status of the dSort with provided JOB_ID which is returned upon creation.
Lists all dSort jobs if the JOB_ID argument is omitted.
Shows all dSort jobs with descriptions starting with sort prefix.
Save newly fetched metrics of the dSort job with ID 5JjIuGemR to /tmp/dsort_run.txt file every 500 milliseconds
Show running status of meta sorting phase for all targets.
Show created shards in each target along with the target ids.
ais stop dsort JOB_ID
Stop the dSort job with given JOB_ID.
ais job rm dsort JOB_ID
Remove the finished dSort job with given JOB_ID from the job list.
ais wait dsort JOB_ID
or, same:
ais wait JOB_ID
Wait for the dSort job with given JOB_ID to finish.