Deployment Troubleshooting#
Pre-deployment Setup Issues#
NGC API Keys: Legacy vs Personal Keys#
Warning
Legacy API keys have been deprecated and their use is no longer encouraged. Use personal API keys instead.
Legacy API Key: This was the initial type of API key provided by NVIDIA NGC. It grants access to various NGC services but lacks advanced management features.
Personal API Key: Introduced to enhance security and management, Personal API Keys are tied to individual users within an NGC organization. They offer several advantages over the original API keys:
Role-Based Access Control (RBAC): Personal API Keys support RBAC, allowing administrators to define specific permissions for each user, thereby controlling access to different NGC services and resources.
Lifecycle Management: These keys are linked to the user’s lifecycle within the NGC organization. If a user leaves the organization or their role changes, their Personal API Key can be revoked or adjusted accordingly.
Key Management Features: Users can set expiration dates, revoke, delete, or rotate their Personal API Keys as needed, providing greater control and security.
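Regardless of the key type, the key is used the same way when authenticating to the NGC container registry to pull Tokkio assets. A minimal sketch of a registry login (the $oauthtoken username is NGC's standard registry convention; the environment variable name is illustrative):
export NGC_API_KEY=<your-personal-api-key>
echo "${NGC_API_KEY}" | docker login nvcr.io --username '$oauthtoken' --password-stdin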
Application Instance Host Name and K8S Node Labels#
Observed in Bare-metal deployments. When deploying the Tokkio chart on Bare-metal, K8S node labels may fail to apply correctly if the node name contains invalid characters. This commonly occurs when the application instance’s hostname contains uppercase letters, which violates the Kubernetes node naming requirements specified in RFC 1123.
For example, the following error may occur during deployment:
TASK [apply node labels] *******************************************************
failed: [app-master] (item={'key': 'my-host-2xA40', 'value': {}}) => {"ansible_loop_var": "item", "changed": false, "item": {"key": "my-host-2xA40", "value": {}}, "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Node \\\\\"my-host-2xA40\\\\\" is invalid: metadata.name: Invalid value: \\\\\"my-host-2xA40\\\\\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, \\'-\\' or \\'.\\', and must start and end with an alphanumeric character (e.g. \\'example.com\\', regex used for validation is \\'[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\\\\\\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\\')\",\"reason\":\"Invalid\",\"details\":{\"name\":\"my-host-2xA40\",\"kind\":\"Node\",\"causes\":[{\"reason\":\"FieldValueInvalid\",\"message\":\"Invalid value: \\\\\"my-host-2xA40\\\\\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, \\'-\\' or \\'.\\', and must start and end with an alphanumeric character (e.g. \\'example.com\\', regex used for validation is \\'[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\\\\\\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\\')\",\"field\":\"metadata.name\"}]},\"code\":422}\\n'", "reason": "Unprocessable Entity"}
PLAY RECAP *********************************************************************
app-master : ok=5 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
To resolve this issue, ensure your hostname follows the RFC 1123 requirements:
- Use only lowercase alphanumeric characters, hyphens (-), or periods (.)
- Start and end with an alphanumeric character
- Avoid uppercase letters
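If the hostname is the culprit, one way to fix it is to rename the host to an RFC 1123 compliant name before re-running the deployment. A minimal sketch, assuming a systemd-based distribution (the example hostname is illustrative):
# Check the current hostname
hostnamectl status
# Rename the host to a lowercase, RFC 1123 compliant name
sudo hostnamectl set-hostname my-host-2xa40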
Deployment Issues#
Using default K8s Namespace#
While bringing up the Tokkio Workflow, if you set spec.app.configs.app_settings.k8s_namespace = default in the config file, the application comes up fine. However, when uninstalling only the app component (--component app), the uninstallation script throws the error below.
fatal: [app-master]: FAILED! => {"changed": false, "error": 403, "msg": "Namespace default: Failed to delete object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"namespaces \\\\\"default\\\\\" is forbidden: this namespace may not be deleted\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"default\",\"kind\":\"namespaces\"},\"code\":403}\\n'", "reason": "Forbidden", "status": 403}
As a recommendation, avoid using the default Kubernetes namespace for spec.app.configs.app_settings.k8s_namespace.
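Before installing, you can quickly confirm that the configured namespace is not default. A minimal sketch (the config file path is whatever you pass to envbuild.sh; the tokkio value is only an example):
grep -n "k8s_namespace" <path-to-config-file>
# Expected: a non-default value, for example: k8s_namespace: tokkio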
Deployment Fails with Exhausted attempts waiting for inventory to be ready Message#
Observed in Bare-metal deployments. When running the envbuild.sh script, it may exit with the following message:
.
.
[INFO] Extracting inventory
[INFO] Checking inventory
[INFO] Waiting for inventory to be ready
[INFO] Waiting for inventory to be ready
[INFO] Waiting for inventory to be ready
[INFO] Waiting for inventory to be ready
[INFO] Waiting for inventory to be ready
.
.
.
[INFO] Waiting for inventory to be ready
[INFO] Waiting for inventory to be ready
[INFO] Exhausted attempts waiting for inventory to be ready
This issue indicates that the Controller instance is unable to connect to the Application instance. Most of the time, this is due to missing prerequisites on the Controller instance. Ensure that you have established Passwordless SSH between the Controller and Application instances.
You can verify this by running the commands below on the Controller instance:
source <path-to-env-file/my-env-file.env>
ssh -i <path-to-ssh-private-key-used-in-config-template> ${APP_HOST_SSH_USER}@${APP_HOST_IPV4_ADDR}
This should connect you to the Application instance without a password prompt. If you are prompted for a password, Passwordless SSH has not been established between the Controller and Application instances. Follow the instructions at Common Setup Procedures to resolve the prerequisites.
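If passwordless SSH has not been set up yet, one common way to establish it is to generate a key pair on the Controller instance and copy the public key to the Application instance (the key path is illustrative; use the key referenced in your config template):
# On the Controller instance: generate a key pair if one does not already exist
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# Copy the public key to the Application instance (prompts for the password one last time)
ssh-copy-id -i ~/.ssh/id_ed25519.pub ${APP_HOST_SSH_USER}@${APP_HOST_IPV4_ADDR}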
Deployment Fails with Cannot extract inventory when there are IaC changes Message#
Observed in Cloud deployments. When running the envbuild.sh script, it may exit with the following message:
[INFO] Preparing artifacts
[INFO] Extracting inventory
[INFO] Cannot extract inventory when there are IaC changes
=================================================================================================================================================================================================================================================================================
[INFO] Logs for the execution are available at: /home/my-user-name/<path-to-logs>/logs/envbuild-sh-1745021862.log
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This error occurs when a mismatch is detected between the IaC (Infrastructure as Code) files and the current state of the environment while you are trying to update the application. To verify the IaC changes reported by the script, run the following --dry-run command:
./envbuild.sh install --component infra --dry-run --tf-binary terraform --config-file <path-to-config-file>
You should see the changes that IaC found to be out of sync with the current state of the environment.
Terraform will perform the following actions:
.
.
.
# module.star_cloudfront_certificate.aws_acm_certificate_validation.this must be replaced
-/+ resource "aws_acm_certificate_validation" "this" {
~ id = "2025-04-07 20:13:38.348 +0000 UTC" -> (known after apply)
~ validation_record_fqdns = [ # forces replacement
- "_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.example.com",
] -> (known after apply) # forces replacement
# (1 unchanged attribute hidden)
}
# module.star_cloudfront_certificate.aws_route53_record.this["*.example.com"] will be created
+ resource "aws_route53_record" "this" {
+ allow_overwrite = true
+ fqdn = (known after apply)
+ id = (known after apply)
+ name = "_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.example.com"
+ records = [
+ "_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxx.acm-validations.aws.",
]
+ ttl = 60
+ type = "CNAME"
+ zone_id = "ZXXXXXXXXXXXXX"
}
Plan: 4 to add, 2 to change, 2 to destroy.
As you can see, there are changes to the IaC state that need to be synced first. To resolve this, run the following command to sync the environment with the IaC state, and then continue with the application update.
./envbuild.sh install --component infra --tf-binary terraform --config-file <path-to-config-file>
Infrastructure Management#
Azure CDN Cache Invalidation#
Observed in Azure deployments. At times, reinstalling the UI with config changes, or installing a new UI over an existing one, is not reflected when browsing the UI endpoint. This happens because Azure CDN caches UI content, so the old UI content remains visible while browsing. To forcefully clear the Azure CDN cache, invalidate it using the commands below.
source my-env-file.env
az cdn endpoint purge --resource-group '<replace-with-actual-rg-name>' --profile-name '<replace-with-cdn-profile-name>' --name '<replace-with-actual-cdn-endpoint-hostname>' --content-paths '/*'
Once the above command runs successfully, you can try accessing the UI endpoint and it should reflect the latest UI changes.
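If you are unsure of the CDN profile or endpoint names to pass to the purge command, you can look them up first (the resource group is the one used for your deployment):
az cdn profile list --resource-group '<replace-with-actual-rg-name>' --output table
az cdn endpoint list --resource-group '<replace-with-actual-rg-name>' --profile-name '<replace-with-cdn-profile-name>' --output table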
AWS ACM Certificate Changes#
Observed in AWS deployments. You might notice AWS ACM certificate changes appearing in your Terraform plan even when you haven’t made any changes to your infrastructure. This typically happens due to the following reasons:
Certificate Renewal: AWS ACM certificates have a validity period, and AWS automatically renews them before expiration. When this happens, Terraform detects the change in the certificate’s metadata and validation records.
Validation Record Updates: When certificates are renewed or revalidated, AWS generates new validation records. These changes are reflected in the Terraform plan as new Route53 records.
These changes are normal and don’t indicate any problems with your infrastructure. You can safely apply these changes as they are part of AWS’s certificate management process.
To handle these changes:
Run the infrastructure update to apply the certificate changes
After applying the changes, proceed with your application deployment
These changes won’t affect your application’s availability
./envbuild.sh install --component infra --tf-binary terraform --config-file <path-to-config-file>
Application Customization#
Customizations Are Not Reflecting in the Deployed Application#
If the performed customizations are not reflected in the deployed experience, there are a couple of checks you can do to debug:
Check whether the performed customizations were made in the override values files or in the app specs. When deploying the Tokkio chart, the values present in the override values files are used to update the app specs. Ensure that the intended customization is referenced in the correct file and is not overridden by an entry in the override values file. See the sketch after this list for a quick way to inspect the applied values.
Check if the intended customization needs a chart rebuild. Refer to the Customization Options for a list of the common customizations and indicators of whether a chart rebuild is necessary.
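For the first check, a quick way to see which values were actually applied to the deployed release is to ask Helm directly. A minimal sketch (the release name and namespace are illustrative; use the ones from your deployment):
# List releases to find the Tokkio release name
helm list -n <your-app-namespace>
# Show only the user-supplied (override) values for the release
helm get values <tokkio-release-name> -n <your-app-namespace>
# Show the fully merged values, including chart defaults
helm get values <tokkio-release-name> -n <your-app-namespace> --all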
Miscellaneous#
Updating secrets will not propagate the modified data to the running pods and will not restart them; you must restart the pods manually.
Upgrading the app does not restart all the UE pods; you must restart all the UE pods so that streams are provided on all three clients.
An update to the VMS ConfigMap will not restart the pod; you must restart the pod manually.
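For the cases above that require a manual restart, a rollout restart of the affected workloads is usually the simplest approach. A minimal sketch (resource names, label selector, and namespace are illustrative):
# Restart a specific deployment so its pods pick up updated secrets or ConfigMaps
kubectl rollout restart deployment <deployment-name> -n <your-app-namespace>
# Alternatively, delete the pods and let their controllers recreate them
kubectl delete pod -l <label-selector> -n <your-app-namespace>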