Run and Monitor a Job (Run:ai CLI)

This section shows how to submit a job and check its status using the Run:ai CLI.

  1. Log in to Run:ai with the runai login command.

    Run the following command, then enter your username and password when prompted.

    runai login
    

    Example Result:

    researcher1@basepod-head1:~/runai_dir$ runai login
    Username: researcher1@nvidia.com
    Password:
    INFO[0022] Logged in successfully
    
  2. Submit a distributed training job.

    runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
    

    Example Result:

    researcher1@basepod-head1:~/runai_dir$ runai submit-mpi dist-job1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -g 4 -p researcher1
    Job dist-job1 submitted successfully.
    
  3. Check the job status using runai describe.

    runai describe job dist-job1 -p researcher1
    

    Example Result:

    researcher1@basepod-head1:~/runai_dir$ runai describe job dist-job1 -p researcher1
    Name: dist-job1
    Namespace: runai-researcher1
    Type: Train
    Status: Running
    Duration: 5m0
    GPUs: 0
    Total Requested GPUs: 0
    Allocated GPUs: 2
    Allocated GPUs memory: 0
    Running PODs: 3
    Pending PODs: 0
    Parallelism: 1
    Completions: 1
    Succeeded PODs: 0
    Failed PODs: 0
    Is Distributed Workload: true
    Service URLs:
    Command Line: runai submit-mpi dist-job1 --processes=2 -g 1 -i gcr.io/run-ai-demo/quickstart-distributed:tf-2.1.0 -e RUNAI_SLEEP_SECS=60
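
The status check in step 3 can also be scripted. The following is a minimal sketch, not part of the Run:ai CLI: job_status and wait_for_status are hypothetical helper names, the retry count and delay are arbitrary defaults, and the parsing assumes the describe output contains a "Status:" line like the one in the example result. The only Run:ai command it calls is the runai describe job command from step 3.

```shell
#!/usr/bin/env bash
# Sketch: poll `runai describe job` until the job reports a wanted status.
# Assumption: the output contains a "Status: <value>" line, as in the
# example result for step 3.

# Read describe output on stdin and print the value of the first "Status:" line.
job_status() {
  sed -n 's/^[[:space:]]*Status:[[:space:]]*//p' | head -n 1
}

# Hypothetical helper: wait_for_status JOB PROJECT WANTED [TRIES] [DELAY_SECS]
# Returns 0 once the job reports WANTED, or 1 if it never does within TRIES polls.
wait_for_status() {
  local job=$1 project=$2 wanted=$3 tries=${4:-30} delay=${5:-10}
  local i=1 status
  while [ "$i" -le "$tries" ]; do
    status=$(runai describe job "$job" -p "$project" | job_status)
    echo "poll $i: status=${status:-unknown}"
    if [ "$status" = "$wanted" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

After sourcing the file in a session that is already logged in, a call such as wait_for_status dist-job1 researcher1 Running would poll up to 30 times, 10 seconds apart, and return as soon as the job reports Running.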