Performance
===========

Below are measured performance results for the Jarvis ASR, NLP, and TTS services on NVIDIA T4, V100 SXM2 16GB, and NVIDIA A100 SXM4 40GB GPUs. CPU specifications for each system can be found here:

- :download:`lscpu_t4 <./perf_data/lscpu_t4.txt>`
- :download:`lscpu_v100 <./perf_data/lscpu_v100.txt>`
- :download:`lscpu_a100 <./perf_data/lscpu_a100.txt>`

ASR
---

The latency numbers below were measured using the streaming recognition mode, with the BERT-based punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3200 ms, depending on the server configuration). The Jarvis streaming client ``jarvis_streaming_asr_client``, provided in the Jarvis client image, was used with the ``--simulate_realtime`` flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav). The command used was:

.. prompt:: bash
   :substitutions:

   jarvis_streaming_asr_client --chunk_duration_ms= --simulate_realtime=true --automatic_punctuation=true --num_parallel_requests= --word_time_offsets=true --print_transcripts=false --interim_results=false --num_iterations=<5*num_streams> --audio_file=1272-135031-0000.wav --output_filename=/tmp/output.json;

NVIDIA A100 GPU
^^^^^^^^^^^^^^^

100ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_a100_100ms.csv

800ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_a100_800ms.csv

3200ms chunk
""""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_a100_3200ms.csv

NVIDIA V100 GPU
^^^^^^^^^^^^^^^

100ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_v100_100ms.csv

800ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_v100_800ms.csv

3200ms chunk
""""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_v100_3200ms.csv

NVIDIA T4
^^^^^^^^^

100ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_t4_100ms.csv

800ms chunk
"""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_t4_800ms.csv

3200ms chunk
""""""""""""

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_t4_3200ms.csv

NLP
---

Performance of the Jarvis named entity recognition (NER) service (using a BERT-base model with a sequence length of 128) and the Jarvis question answering (QA) service (using a BERT-large model with a sequence length of 384) was measured in Jarvis. Batch size 1 latency and maximum throughput were measured.

NVIDIA A100 GPU
^^^^^^^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_a100_nlp.csv

NVIDIA V100 GPU
^^^^^^^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_v100_nlp.csv

NVIDIA T4
^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_t4_nlp.csv

TTS
---

Performance of the Jarvis text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured.

NVIDIA A100 GPU
^^^^^^^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_a100_tts.csv

NVIDIA V100 GPU
^^^^^^^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_v100_tts.csv

NVIDIA T4
^^^^^^^^^

.. csv-table::
   :header-rows: 2
   :file: ./perf_data/perf_t4_tts.csv

Performance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~

When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been completely generated and its inference slot freed. This is done to maximize throughput for the TTS service and allow for real-time interaction.
NVIDIA does not recommend making more than 8-10 simultaneous requests with the models provided in Jarvis 1.0.0 beta.
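One way to stay within such a limit is to cap in-flight requests on the client side with a semaphore. The sketch below is a generic illustration of that pattern, not Jarvis client code: ``synthesize`` is a hypothetical stand-in for whatever TTS request function your client uses, and the cap of 8 simply mirrors the recommendation above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Cap on simultaneous requests, mirroring the 8-10 recommendation above.
MAX_IN_FLIGHT = 8
_slots = threading.Semaphore(MAX_IN_FLIGHT)

def synthesize(text):
    # Hypothetical stand-in for an actual TTS request to the server.
    return f"audio for: {text}"

def synthesize_limited(text):
    # Block until one of the MAX_IN_FLIGHT slots frees up, so the server
    # never sees more concurrent requests than the configured cap.
    with _slots:
        return synthesize(text)

# Even with 16 worker threads, at most 8 requests are in flight at once.
texts = [f"sentence {i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(synthesize_limited, texts))
```

Because excess callers simply block on the semaphore rather than sending requests, the server-side queue stays short and requests are less likely to time out under load.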