{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "# ==============================================================================" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Python API Examples\n", "\n", "This notebook walks through the basics of the Jarvis Speech and Language AI Services.\n", "\n", "## Overview\n", "\n", "NVIDIA Jarvis is a platform for building and deploying AI applications that fuse vision, speech and other sensors. It offers a complete workflow to build, train and deploy AI systems that can use visual cues such as gestures and gaze along with speech in context. With the Jarvis platform, you can:\n", "\n", "- Build speech and visual AI applications using pretrained NVIDIA Neural Modules ([NeMo](https://github.com/NVIDIA/NeMo)) available at NVIDIA GPU Cloud ([NGC](https://ngc.nvidia.com/catalog/models?orderBy=modifiedDESC&query=%20label%3A%22NeMo%2FPyTorch%22&quickFilter=models&filters=)).\n", "\n", "- Transfer learning: re-train your model on domain-specific data, with NVIDIA [NeMo](https://github.com/NVIDIA/NeMo). NeMo is a toolkit and platform that enables researchers to define and build new state-of-the-art speech and natural language processing models.\n", "\n", "- Optimize neural network performance and latency using NVIDIA TensorRT \n", "\n", "- Deploy AI applications with TensorRT Inference Server:\n", " - Support multiple network formats: ONNX, TensorRT plans, PyTorch TorchScript models.\n", " - Deployement on multiple platforms: from datacenter to edge servers, via Helm to K8s cluster, on NVIDIA Volta/Turing GPUs or Jetson Xavier platforms.\n", "\n", "See the below video for a demo of Jarvis capabilities." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "\n", "# Jarvis Youtube demo video\n", "IFrame(\"https://www.youtube.com/embed/r264lBi1nMU?rel=0&controls=0&showinfo=0\", width=\"560\", height=\"315\", frameborder=\"0\", allowfullscreen=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more detailed information on Jarvis, please refer to the [Jarvis developer documentation](https://developer.nvidia.com/).\n", "\n", "## Introduction the Jarvis Speech and Natural Languages services\n", "\n", "Jarvis offers a rich set of speech and natural language understanding services such as:\n", "\n", "- Automated speech recognition (ASR)\n", "- Text-to-Speech synthesis (TTS)\n", "- A collection of natural language understanding services such as named entity recognition (NER), punctuation, intent classification." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives\n", "\n", "- Understand how interact with Jarvis Speech and Natural Languages APIs, services and use cases\n", "\n", "## Requirements and setup\n", "\n", "To execute this notebook, please follow the setup steps in [README](./README.md).\n", "\n", "We first generate some required libraries." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import io\n", "import librosa\n", "from time import time\n", "import numpy as np\n", "import IPython.display as ipd\n", "import grpc\n", "import requests\n", "\n", "# NLP proto\n", "import jarvis_api.jarvis_nlp_core_pb2 as jcnlp\n", "import jarvis_api.jarvis_nlp_core_pb2_grpc as jcnlp_srv\n", "import jarvis_api.jarvis_nlp_pb2 as jnlp\n", "import jarvis_api.jarvis_nlp_pb2_grpc as jnlp_srv\n", "\n", "# ASR proto\n", "import jarvis_api.jarvis_asr_pb2 as jasr\n", "import jarvis_api.jarvis_asr_pb2_grpc as jasr_srv\n", "\n", "# TTS proto\n", "import jarvis_api.jarvis_tts_pb2 as jtts\n", "import jarvis_api.jarvis_tts_pb2_grpc as jtts_srv\n", "import jarvis_api.audio_pb2 as ja" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Jarvis clients and connect to Jarvis Speech API server\n", "\n", "The below URI assumes a local deployment of the Jarvis Speech API server on the default port. In case the server deployment is on a different host or via Helm chart on Kubernetes, the user should use an appropriate URI." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "channel = grpc.insecure_channel('localhost:50051')\n", "\n", "jarvis_asr = jasr_srv.JarvisASRStub(channel)\n", "jarvis_nlp = jnlp_srv.JarvisNLPStub(channel)\n", "jarvis_cnlp = jcnlp_srv.JarvisCoreNLPStub(channel)\n", "jarvis_tts = jtts_srv.JarvisTTSStub(channel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Content\n", "1. [Offline ASR Example](#1)\n", "1. [Core NLP Service Examples](#2)\n", "1. [TTS Service Example](#3)\n", "1. [Jarvis NLP Service Examples](#4)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 1. Offline ASR Example\n", "\n", "Jarvis Speech API supports `.wav` files in PCM format, `.alaw`, `.mulaw` and `.flac` formats with single channel in this release. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example uses a .wav file with LINEAR_PCM encoding.\n", "# read in an audio file from local disk\n", "path = \"/work/wav/sample.wav\"\n", "audio, sr = librosa.core.load(path, sr=None)\n", "with io.open(path, 'rb') as fh:\n", " content = fh.read()\n", "ipd.Audio(path)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ASR Transcript: What is natural language processing? \n", "\n", "\n", "Full Response Message:\n", "results {\n", " alternatives {\n", " transcript: \"What is natural language processing? 
\"\n", " confidence: -8.908161163330078\n", " }\n", " channel_tag: 1\n", " audio_processed: 6.400000095367432\n", "}\n", "\n" ] } ], "source": [ "# Set up an offline/batch recognition request\n", "req = jasr.RecognizeRequest()\n", "req.audio = content # raw bytes\n", "req.config.encoding = ja.AudioEncoding.LINEAR_PCM # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings\n", "req.config.sample_rate_hertz = sr # Audio will be resampled if necessary\n", "req.config.language_code = \"en-US\" # Ignored, will route to correct model in future release\n", "req.config.max_alternatives = 1 # How many top-N hypotheses to return\n", "req.config.enable_automatic_punctuation = True # Add punctuation when end of VAD detected\n", "req.config.audio_channel_count = 1 # Mono channel\n", "\n", "response = jarvis_asr.Recognize(req)\n", "asr_best_transcript = response.results[0].alternatives[0].transcript\n", "print(\"ASR Transcript:\", asr_best_transcript)\n", "\n", "print(\"\\n\\nFull Response Message:\")\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 2. Core NLP Service Examples\n", "\n", "All of the Core NLP Services support batched requests. The maximum batch size,\n", "if any, of the underlying models is hidden from the end user and automatically\n", "batched by the Jarvis and TRTIS servers.\n", "\n", "The Core NLP API provides three methods currently:\n", "\n", " 1. TransformText - map an input string to an output string\n", " \n", " 2. ClassifyText - return a single label for the input string\n", " \n", " 3. ClassifyTokens - return a label per input token" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TransformText Output:\n", " Add punctuation to this sentence.\n", " Do you have any red Nvidia shirts?\n", " I need one cpu, four gpus and lots of memory for my new computer. It's going to be very cool.\n" ] } ], "source": [ "# Use the TextTransform API to run the punctuation model\n", "req = jcnlp.TextTransformRequest()\n", "req.model.model_name = \"jarvis_punctuation\"\n", "req.text.append(\"add punctuation to this sentence\")\n", "req.text.append(\"do you have any red nvidia shirts\")\n", "req.text.append(\"i need one cpu four gpus and lots of memory \"\n", " \"for my new computer it's going to be very cool\")\n", "\n", "nlp_resp = jarvis_cnlp.TransformText(req)\n", "print(\"TransformText Output:\")\n", "print(\"\\n\".join([f\" {x}\" for x in nlp_resp.text]))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Named Entities:\n", " jensen huang (PER)\n", " nvidia corporation (ORG)\n", " santa clara (LOC)\n", " california (LOC)\n" ] } ], "source": [ "# Use the TokenClassification API to run a Named Entity Recognition (NER) model\n", "# Note: the model configuration of the NER model indicates that the labels are\n", "# in IOB format. Jarvis, subsequently, knows to:\n", "# a) ignore 'O' labels\n", "# b) Remove B- and I- prefixes from labels\n", "# c) Collapse sequences of B- I- ... 
"\n", "req = jcnlp.TokenClassRequest()\n", "req.model.model_name = \"jarvis_ner\" # If you have deployed a custom model with the domain_name \n", " # parameter in ServiceMaker's `jarvis-build` command then you should use \n", " # \"jarvis_ner_<your_domain_name>\" where <your_domain_name> \n", " # is the name you provided to the domain_name parameter.\n", "\n", "req.text.append(\"Jensen Huang is the CEO of NVIDIA Corporation, \"\n", " \"located in Santa Clara, California\")\n", "resp = jarvis_cnlp.ClassifyTokens(req)\n", "\n", "print(\"Named Entities:\")\n", "for result in resp.results[0].results:\n", "    print(f\" {result.token} ({result.label[0].class_name})\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "results {\n", " labels {\n", " class_name: \"weather\"\n", " score: 0.9975590109825134\n", " }\n", "}\n", "results {\n", " labels {\n", " class_name: \"meteorology\"\n", " score: 0.984375\n", " }\n", "}\n", "results {\n", " labels {\n", " class_name: \"personality\"\n", " score: 0.984375\n", " }\n", "}\n", "\n" ] } ], "source": [ "# Submit a TextClassRequest for text classification.\n", "# Jarvis NLP comes with a default text_classification domain called \"domain_misty\" which consists of \n", "# 4 classes: meteorology, personality, weather and nomatch\n", "\n", "request = jcnlp.TextClassRequest()\n", "request.model.model_name = \"jarvis_text_classification_domain\" # If you have deployed a custom model \n", " # with the `--domain_name` parameter in ServiceMaker's `jarvis-build` command \n", " # then you should use \"jarvis_text_classification_<your_domain_name>\"\n", " # where <your_domain_name> is the name you provided to the \n", " # domain_name parameter. In this case, the domain_name is \"domain\".\n", "request.text.append(\"Is it going to snow in Burlington, Vermont tomorrow night?\")\n", "request.text.append(\"What causes rain?\")\n", "request.text.append(\"What is your favorite season?\")\n", "ct_response = jarvis_cnlp.ClassifyText(request)\n", "print(ct_response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"3\"></a>\n", "\n", "## 3. TTS Service Example\n", "\n", "Subsequent releases will include additional features, such as model registration to support multiple languages/voices with the same API. Support for resampling to alternative sampling rates will also be added." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req = jtts.SynthesizeSpeechRequest()\n", "req.text = \"Is it recognize speech or wreck a nice beach?\"\n", "req.language_code = \"en-US\" # currently required to be \"en-US\"\n", "req.encoding = ja.AudioEncoding.LINEAR_PCM # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings\n", "req.sample_rate_hz = 22050 # ignored, audio returned will be 22.05 kHz\n", "req.voice_name = \"ljspeech\" # ignored\n", "\n", "resp = jarvis_tts.Synthesize(req)\n", "audio_samples = np.frombuffer(resp.audio, dtype=np.float32)\n", "ipd.Audio(audio_samples, rate=22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"4\"></a>\n", "\n", "## 4. Jarvis NLP Service Examples\n", "\n", "The NLP Service contains higher-level, more application-specific NLP APIs. This\n", "guide demonstrates how the AnalyzeIntent API can be used for queries across\n", "both known and unknown domains. As a warm-up, the next cell takes a brief look at the PunctuateText convenience API."
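] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below is a minimal sketch of PunctuateText, one of the task-specific convenience calls in the Jarvis NLP service; the same call is used in the latency benchmark later in this notebook. The sketch assumes the default punctuation model is deployed and that the response carries the same repeated `text` field as the Core NLP TransformText response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal PunctuateText sketch (assumes the default punctuation model is deployed).\n", "# PunctuateText takes the same TextTransformRequest used by the Core NLP TransformText API;\n", "# here we assume its response also carries a repeated `text` field.\n", "punct_req = jcnlp.TextTransformRequest()\n", "punct_req.text.append(\"can you tell me the weather in santa clara tomorrow\")\n", "\n", "punct_resp = jarvis_nlp.PunctuateText(punct_req)\n", "print(\"\\n\".join(punct_resp.text))"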
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "intent {\n", " class_name: \"weather.humidity\"\n", " score: 0.983601987361908\n", "}\n", "slots {\n", " token: \"san francisco\"\n", " label {\n", " class_name: \"weatherplace\"\n", " score: 0.9822959899902344\n", " }\n", "}\n", "slots {\n", " token: \"?\"\n", " label {\n", " class_name: \"weatherplace\"\n", " score: 0.6474800109863281\n", " }\n", "}\n", "domain_str: \"weather\"\n", "domain {\n", " class_name: \"weather\"\n", " score: 1.0\n", "}\n", "\n" ] } ], "source": [ "# The AnalyzeIntent API can be used to query a Intent Slot classifier. The API can leverage a\n", "# text classification model to classify the domain of the input query and then route to the \n", "# appropriate intent slot model.\n", "\n", "# Lets first see an example where the domain is known. This skips execution of the domain classifier\n", "# and proceeds directly to the intent/slot model for the requested domain.\n", "\n", "req = jnlp.AnalyzeIntentRequest()\n", "req.query = \"How is the humidity in San Francisco?\"\n", "req.options.domain = \"weather\" # The is appended to \"jarvis_intent_\" to look for a \n", " # model \"jarvis_intent_\". So in this e.g., the model \"jarvis_intent_weather\"\n", " # needs to be preloaded in jarvis server. If you would like to deploy your \n", " # custom Joint Intent and Slot model use the `--domain_name` parameter in \n", " # ServiceMaker's `jarvis-build intent_slot` command.\n", "\n", "resp = jarvis_nlp.AnalyzeIntent(req)\n", "print(resp)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "intent {\n", " class_name: \"weather.rainfall\"\n", " score: 0.9661880135536194\n", "}\n", "slots {\n", " token: \"tomorrow\"\n", " label {\n", " class_name: \"weatherforecastdaily\"\n", " score: 0.5325539708137512\n", " }\n", "}\n", "slots {\n", " token: \"?\"\n", " label {\n", " class_name: \"weatherplace\"\n", " score: 0.6895459890365601\n", " }\n", "}\n", "domain_str: \"weather\"\n", "domain {\n", " class_name: \"weather\"\n", " score: 0.9975590109825134\n", "}\n", "\n" ] } ], "source": [ "# Below is an example where the input domain is not provided.\n", "\n", "req = jnlp.AnalyzeIntentRequest()\n", "req.query = \"Is it going to rain tomorrow?\"\n", "\n", " # The input query is first routed to the a text classification model called \"jarvis_text_classification_domain\"\n", " # The output class label of \"jarvis_text_classification_domain\" is appended to \"jarvis_intent_\"\n", " # to get the appropriate Intent Slot model to execute for the input query.\n", " # Note: The model \"jarvis_text_classification_domain\" needs to be loaded into Jarvis server and have the appropriate\n", " # class labels that would invoke the corresponding intent slot model.\n", "\n", "resp = jarvis_nlp.AnalyzeIntent(req)\n", "print(resp)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[weather.cloudy]\tIs it currently cloudy in Tokyo?\n", "[weather.rainfall]\tWhat is the annual rainfall in Pune?\n", "[weather.humidity]\tWhat is the humidity going to be tomorrow?\n" ] } ], "source": [ "# Some weather Intent queries\n", "queries = [\n", " \"Is it currently cloudy in Tokyo?\",\n", " \"What is the annual rainfall in Pune?\",\n", " \"What is the humidity going to be tomorrow?\"\n", "]\n", "for q in queries:\n", " req 
"    req = jnlp.AnalyzeIntentRequest()\n", "    req.query = q\n", "    start = time()\n", "    resp = jarvis_nlp.AnalyzeIntent(req)\n", "\n", "    print(f\"[{resp.intent.class_name}]\\t{req.query}\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to complete 10 synchronous requests: 0.05957150459289551\n", "Time to complete 10 asynchronous requests: 0.020952463150024414\n", "\n" ] } ], "source": [ "# Demonstrate latency by calling repeatedly.\n", "# NOTE: this is a synchronous API call, so request #N will not be sent until\n", "# response #N-1 is returned. This means latency and throughput will be negatively\n", "# impacted by long-distance & VPN connections.\n", "\n", "req = jcnlp.TextTransformRequest()\n", "req.text.append(\"i need one cpu four gpus and lots of memory for my new computer it's going to be very cool\")\n", "\n", "iterations = 10\n", "# Demonstrate synchronous performance\n", "start_time = time()\n", "for _ in range(iterations):\n", "    nlp_resp = jarvis_nlp.PunctuateText(req)\n", "end_time = time()\n", "print(f\"Time to complete {iterations} synchronous requests: {end_time-start_time}\")\n", "\n", "# Demonstrate async performance\n", "start_time = time()\n", "futures = []\n", "for _ in range(iterations):\n", "    futures.append(jarvis_nlp.PunctuateText.future(req))\n", "for f in futures:\n", "    f.result()\n", "end_time = time()\n", "print(f\"Time to complete {iterations} asynchronous requests: {end_time-start_time}\\n\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 5. Go deeper into Jarvis capabilities\n", "\n", "Now that you have a basic introduction to the Jarvis APIs, you may want to try out:\n", "\n", "### 1. Sample apps\n", "\n", "Jarvis comes with various sample apps that demonstrate how to use the APIs to build interesting applications, such as a [chatbot](https://dl.gitlab-master-pages.nvidia.com/sw-ai-app-design/jarvis-api/samples/weather.html), a domain-specific speech recognition or [keyword (entity) recognition system](http://docs.jarvis-ai.nvidia.com/release-1-0/samples/callcenter.html), and [SpeechSquad](https://dl.gitlab-master-pages.nvidia.com/sw-ai-app-design/jarvis-api/samples/speechsquad.html), which shows how Jarvis scales out to handle massive numbers of simultaneous requests.\n", "Have a look at the Sample Application section in the [Jarvis developer documentation](https://developer.nvidia.com/) for all the sample apps.\n", "\n", "\n", "### 2. Fine-tune your own domain-specific Speech or NLP model and deploy it into Jarvis\n", "\n", "Train the latest state-of-the-art speech and natural language processing models on your own data using [NeMo](https://github.com/NVIDIA/NeMo) or the [Transfer Learning ToolKit](https://developer.nvidia.com/transfer-learning-toolkit) and deploy them on Jarvis using the [Jarvis ServiceMaker tool](http://docs.jarvis-ai.nvidia.com/release-1-0/model-servicemaker.html).\n", "\n", "\n", "### 3. Further resources\n", "\n", "Explore the details of each of the APIs and their functionalities in the [docs](https://dl.gitlab-master-pages.nvidia.com/sw-ai-app-design/jarvis-api/protobuf-api/protobuf-api-root.html)."
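] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional housekeeping step, once you are done experimenting you can close the gRPC channel opened at the beginning of this notebook. Note that the stubs created from that channel can no longer be used afterwards." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional cleanup: close the gRPC channel to the Jarvis server.\n", "# The jarvis_asr, jarvis_nlp, jarvis_cnlp and jarvis_tts stubs are unusable after this.\n", "channel.close()"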
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }