NMX Controller (NMX-C) Release Notes v0.9.0

Known Issues

RM #

Issue

4295690

Description: On rare occasions, after a system upgrade, NMX-C may fail to start due to corruption of the AOF (Redis persistence file). This issue can be diagnosed using the NVOS CLI command 'nv show cluster app'.

The issue is indicated when the NMX-C status is displayed as 'not ok request timeout' for an extended period (longer than 2-3 minutes) and the command 'nv show system log | grep ERR' returns output like the following:

Feb 10 14:00:06.315951 juliet-ariel ERR clusterd[21793]: Timeout on healthcheck to cluster app nmx-controller

Feb 10 14:00:09.470491 juliet-ariel ERR clusterd[21793]: gRPC failure, reason: StatusCode.UNAVAILABLE, details: failed to connect to all addresses
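The log check described above can be sketched as a small Python helper for operators who collect the output of 'nv show system log' into a file or string. This is an illustrative sketch only: the function name and the choice to require both error lines are assumptions, and the patterns are derived solely from the two sample lines in this note. A match shows the symptoms, not proof of AOF corruption.

```python
import re

# Signatures taken from the two sample log lines in this release note.
# The PID in brackets varies, so it is matched as any run of digits.
HEALTHCHECK_TIMEOUT = re.compile(
    r"ERR clusterd\[\d+\]: Timeout on healthcheck to cluster app nmx-controller"
)
GRPC_UNAVAILABLE = re.compile(
    r"ERR clusterd\[\d+\]: gRPC failure, reason: StatusCode\.UNAVAILABLE"
)


def looks_like_aof_startup_failure(log_text: str) -> bool:
    """Return True when both error signatures appear in the log text.

    Assumption: requiring both lines is a conservative choice made for
    this sketch; the release note does not state that both must appear.
    """
    lines = log_text.splitlines()
    saw_timeout = any(HEALTHCHECK_TIMEOUT.search(line) for line in lines)
    saw_unavailable = any(GRPC_UNAVAILABLE.search(line) for line in lines)
    return saw_timeout and saw_unavailable
```

If the helper returns True and 'nv show cluster app' has reported 'not ok request timeout' for more than 2-3 minutes, the workaround that follows is the documented next step.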

Workaround:

  1. Stop NMX-C: nv action stop cluster apps nmx-controller

  2. Manually delete the AOF files (as root):

    # rm -rf /var/log/nmx/nmx-c/redis_aof

  3. Start NMX-C: nv action start cluster apps nmx-controller

Keywords: Upgrade, AOF

Discovered in Version: 0.9.0

4210527

Description: Partition delete requests intermittently take a long time (approximately 15 seconds). The deletion nonetheless completes successfully.

Workaround: Increase the timeouts so that calls do not fail by exceeding the gRPC deadline.
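As a rough illustration of this workaround for a Python client, the sketch below passes an explicit per-call deadline to an RPC callable. It relies only on the fact that gRPC Python stub methods accept a `timeout` keyword in seconds. The helper names, the 2x safety margin, and the resulting 30-second value are assumptions for illustration, not values recommended by this release note.

```python
from typing import Any, Callable

# The slow case reported in this note is roughly 15 seconds.
OBSERVED_SLOW_DELETE_S = 15.0
# Assumption: a 2x margin over the observed slow case; tune for your site.
SAFETY_FACTOR = 2.0


def delete_timeout_s(observed_s: float = OBSERVED_SLOW_DELETE_S,
                     factor: float = SAFETY_FACTOR) -> float:
    """Pick a per-call deadline comfortably above the observed slow case."""
    return observed_s * factor


def call_with_timeout(rpc: Callable[..., Any], request: Any) -> Any:
    """Invoke an RPC callable with an explicit deadline in seconds.

    `rpc` is any callable that accepts a `timeout` keyword, which is how
    gRPC Python stub methods take a per-call deadline.
    """
    return rpc(request, timeout=delete_timeout_s())
```

With a real client, `rpc` would be the stub method that issues the partition delete request; its name depends on the NMX-C API definition and is not given here.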

Keywords: Partition deletion, gRPC

Discovered in Version: 0.8.0

© Copyright 2025, NVIDIA. Last updated on Mar 5, 2025.