Start, Stop, and monitor distributed parallel sorting (dSort)
Start, Stop, and monitor distributed parallel sorting (dSort)
For background and in-depth presentation, please see this document.
- Usage
- Example
- Generate Shards
- Start dSort job
- Show dSort jobs and job status
- Stop dSort job
- Remove dSort job
- Wait for dSort job
Usage
ais dsort [SRC_BUCKET] [DST_BUCKET] —spec [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [command options]
Example
This example simply runs ais/test/scripts/dsort-ex1-spec.json specification. The source and destination buckets - ais://src and ais://dst, respectively - must exist.
Further, the source buckets must have at least 10 shards with names that match input_format (see below).
Notice the -v (--verbose) switch as well.
Generate Shards
ais archive gen-shards "BUCKET/TEMPLATE.EXT"
Put randomly generated shards into a bucket. The main use case for this command is dSort testing. Further reference for this command can be found here.
Start dSort job
ais start dsort --spec JOB_SPEC or ais start dsort -f <PATH_TO_JOB_SPEC>
Start new dSort job with the provided specification.
Specification should be provided by either argument or -f flag - providing both argument and flag will result in error.
Upon creation, JOB_ID of the job is returned - it can then be used to abort it or retrieve metrics.
The following table describes JSON/YAML keys which can be used in the specification.
There’s also the possibility to override some of the values from global distributed_sort config via job specification.
All values are optional - if empty, the value from global distributed_sort config will be used.
For more information refer to configuration.
Examples
Sort records inside the shards
Command defined below starts (alphanumeric) sorting job with extended metrics for input shards with names shard-0.tar, shard-1.tar, …, shard-9.tar.
Each of the output shards will have at least 10240 bytes (10KB) and will be named new-shard-0000.tar, new-shard-0001.tar, …
Assuming that dsort_spec.json contains:
You can start dSort job with:
Shuffle records
Command defined below starts basic shuffle job for input shards with names shard-0.tar, shard-1.tar, …, shard-9.tar.
Each of the output shards will have at least 10240 bytes (10KB) and will be named new-shard-0000.tar, new-shard-0001.tar, …
Pack records into shards with different categories - EKM (External Key Map)
One of the key features of the dSort is that user can specify the exact mapping from the record key to the output shard.
To use this feature output_format should be empty and ekm_file, as well as ekm_file_sep, must be set.
The output shards will be created with provided template format.
Assuming that ekm_file (URL: http://website.web/static/ekm_file.txt) has content:
or if ekm_file (URL: http://website.web/static/ekm_file.json, notice .json extension) and has content:
or, you can also use regex as the record identifier. The ekm_file can contain regex patterns as keys to match multiple records that fit the regex pattern to provided format.
and content of the input shards looks more or less like this:
You can run:
After the run, the output shards will look more or less like this (the number of records in given shard depends on provided output_shard_size):
EKM also supports template syntax to express output shard names.
For example, if ekm_file has content:
After running dsort, the output would be look like this:
Show dSort jobs and job status
ais show job dsort [JOB_ID]
Retrieve the status of the dSort with provided JOB_ID which is returned upon creation.
Lists all dSort jobs if the JOB_ID argument is omitted.
Options
Examples
Show dSort jobs with description matching provided regex
Shows all dSort jobs with descriptions starting with sort prefix.
Save metrics to log file
Save newly fetched metrics of the dSort job with ID 5JjIuGemR to /tmp/dsort_run.txt file every 500 milliseconds
Show only json metrics
Show only json metrics filtered by daemon id
Using jq to filter out the json formatted metric output
Show running status of meta sorting phase for all targets.
Show created shards in each target along with the target ids.
Stop dSort job
ais stop dsort JOB_ID
Stop the dSort job with given JOB_ID.
Remove dSort job
ais job rm dsort JOB_ID
Remove the finished dSort job with given JOB_ID from the job list.
Wait for dSort job
ais wait dsort JOB_ID
or, same:
ais wait JOB_ID
Wait for the dSort job with given JOB_ID to finish.