Run and Monitor a Job (Run:ai CLI)
This section shows how to submit a job and check its status using the Run:ai CLI.
Log in to Run:ai using the login option. Run the following command and enter the username and password.
runai login
Example Result:
researcher1@basepod-head1:~/runai_dir$ runai login
Username: researcher1@nvidia.com
Password:
INFO[0022] Logged in successfully
Submit a distributed training job.
runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
Example Result:
researcher1@basepod-head1:~/runai_dir$ runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
Job dist-job1 submitted successfully.
Check the job status using runai describe.
runai describe job dist-job1 -p researcher1
Example Result:
researcher1@basepod-head1:~/runai_dir$ runai describe job dist-job1 -p researcher1
Name: dist-job1
Namespace: runai-researcher1
Type: Train
Status: Running
Duration: 5m0
GPUs: 0
Total Requested GPUs: 0
Allocated GPUs: 2
Allocated GPUs memory: 0
Running PODs: 3
Pending PODs: 0
Parallelism: 1
Completions: 1
Succeeded PODs: 0
Failed PODs: 0
Is Distributed Workload: true
Service URLs:
Command Line: runai submit-mpi dist-job1 --processes=2 -g 1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -e RUNAI_SLEEP_SECS=60
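When the status check needs to be scripted, the Status field can be pulled out of the describe output with standard shell tools. The following is a minimal sketch, not part of the Run:ai CLI: job_status is a hypothetical helper name, and it assumes the output keeps the "Status: <value>" layout shown in the example result above. The sample text is copied from that result so the sketch runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not a Run:ai command): read `runai describe job`
# output on stdin and print the value of the first "Status:" line.
# Assumes the "Field: value" layout shown in the example result.
job_status() {
  awk '/^[[:space:]]*Status:/ {
         sub(/^[^:]*:[[:space:]]*/, "")   # drop the field name and separator
         print
         exit                             # only the first Status line is used
       }'
}

# Sample text copied from the example result, so no cluster is required.
sample='Name: dist-job1
Namespace: runai-researcher1
Type: Train
Status: Running'

status=$(printf '%s\n' "$sample" | job_status)
echo "$status"
```

On a system where the CLI is installed and logged in, the same helper could presumably be fed live output, for example: runai describe job dist-job1 -p researcher1 | job_status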