Enterprise Lifecycle Integration#

How to Use This Information#

This section provides information on the anticipated uses and limitations of this information.

Intended Audience#

This information is written for the following audience:

  • Enterprise IT administrators and endpoint operations teams

  • Platform engineering or SRE teams operating fleets at scale

  • Security operations teams that require evidence-driven workflows

  • Support and escalation teams that need standardized diagnostics artifacts

Document Scope#

The following are within the scope of this document.

  • Lifecycle-driven operational playbooks. For example, the following workflow:

Procure → Provision → Monitor → Maintain → Respond → Retire

  • Agentless remote execution model (SSH) with a strict stdout JSON contract

  • Evidence minimization and artifact handling patterns

  • Wrapper patterns for enterprise platforms (for example, job systems, orchestration, and configuration management)

The following are not within the scope of this document.

  • Prescriptive identity architecture (for example, IAM design, key management, and PKI)

  • Organization-specific network segmentation and firewall policy

  • Business or legal policy decisions (such as retention durations and legal hold workflows)

  • Detailed application-layer configuration beyond manageability requirements

  • Detailed ISO repack, USB OEMDATA layout, and Cloud-init seed procedures (see Custom Installation with Cloud-Init)

Document Conventions#

This document uses the following conventions:

  • Collectors are read-only tools safe to run frequently and at high concurrency.

  • Controllers change state and must be gated by changing windows, rings or waves, and validation.

  • Tools return exactly one bounded JSON document on stdout per invocation.

  • Large outputs and deep evidence are stored as artifacts; stdout returns pointers only.

When you see inline code, it represents literal strings, such as file names, flags, fields, or paths.

Operational Guardrails#

Note: Read this before running controllers.

The following are guardrails to consider.

  • Run controllers only in approved change windows with staged rollouts (for example pilot → waves → broad).

  • Keep stdout JSON bounded to avoid platform truncation.

  • Treat artifact creation as on-demand; store and retain artifacts per policy.

  • Separate transport or auth failures (platform layer) from tool failures and endpoint health failures.

Support Boundaries and Operational Responsibilities#

This section clarifies what is provided and supported as part of the DGX Spark manageability baseline versus what remains the responsibility of the enterprise IT environment. It is intended to reduce ambiguity during deployment, escalation, and long-term operations.

Supported Baseline Posture#

Supported Baseline (high-level)#

DGX Spark enterprise manageability is designed and validated against the supported baseline OS image (DGX OS).

Baseline Topic

Guidance

OS baseline

DGX OS is the supported baseline for the guide’s operational contract.

Driver stack

Driver behavior, versions, and compatible combinations are validated relative to the baseline.

Firmware alignment

Firmware inventory and update posture are interpreted relative to the baseline and platform capabilities.

Tooling

Production tools installed into DGX_spark_management/bin/ are intended for fleet automation.

Reference scripts

Landscape scripts are reference implementations and may require adaptation for production governance.

Management Platform Support Posture#

Canonical Landscape is the primary recommended management platform for DGX Spark, and its license is included as part of the Ubuntu Pro entitlement. Landscape provides user management, policy deployment, and OS, firmware, and driver updates from a local repository. For enrollment steps, see Ubuntu Pro and Landscape client enrollment.

In addition, ecosystem tools such as Ansible, Puppet, Tanium, and others may offer ARM64‑capable agents that can be used for broader fleet management, including approval workflows and rollback strategies. These alternatives are supported solely by their respective vendors. Any integration examples referenced in this document do not imply NVIDIA support or validation.

Reimaging Posture#

Reimaging may be required in certain customer workflows, but it has consequences for predictability and supportability.

Reimaging Posture#

Reimaging away from the supported baseline OS can change the behavior across drivers and firmware.

Scenario

Operational Expectation

Baseline DGX OS

Predictable behavior: Spark Manageability Guide playbooks apply directly.

Baseline and approved updates

Predictable within validated update channels; confirm before and after evidence.

Non-baseline OS image

Higher variance; validate collectors and controllers; adjust wrappers and evidence paths as needed.

Mixed fleet baselines

Increase drift and support complexity; maintain baseline identity records per ring or group.

Division of Responsibility (NVIDIA vs Enterprise IT)#

NVIDIA Provides (Typical)

Enterprise IT Owns (Typical)

DGX OS baseline image and guidance

Identity strategy, authentication, RBAC, and secrets management

Reference tools for inventory, diagnostics and update control

Orchestration platform selection, job packaging, scheduling, and rings and waves

Reference scripts demonstrating platform patterns (Landscape)

Change management, approvals, and maintenance windows

Evidence paths and recommended minimization model

Artifact storage, retention policies, legal hold, and ticket linkage

Escalation guidance and support interfaces

Network segmentation, firewall rules, and access pathways

Controller Class

Minimum Governance Expectation

Update control (spark_updatectl.py)

Change window with ring rollout and precheck and postcheck evidence. Rollback awareness is retained.

Diagnostics bundles (spark_diagctl.py bundle modes)

On-demand only. Artifacts are handled through evidence store. Ticket linkage is recommended.

DGX Spark Overview, Enterprise Challenges, Lifecycle Backbone, and the JSON/Agentless SSH#

DGX Spark is being introduced into environments that already have mature endpoint operations. This section establishes a lifecycle-following framework and a simple operational contract so enterprise tooling can manage DGX Spark consistently at fleet scale.

Key Takeaways

  • Treat DGX Spark as a enterprise endpoint with an appliance mindset.

  • Standardize on SSH for remote execution and JSON on stdout as the integration contract.

  • Use artifacts only when deep evidence is required; keep routine results small and bounded.

Capabilities Summary#

This matrix summarizes the current DGX Spark manageability capabilities exposed by this repository. It is intended to help an Enterprise IT administrator quickly understand the following:

  • What exists today and how it is delivered (installed tool vs reference script)

  • What evidence is produced (stdout JSON vs artifacts).

How to Read This Table

  • Production (installed): Fleet-ready tooling installed into DGX_spark_management/bin/.

  • Reference (in-place): Example scripts designed for Canonical Landscape constraints; not installed by default.

  • Evidence model: stdout JSON is the primary contract; artifacts are generated only when deep evidence is required.

DGX Spark manageability capabilities (summary)#

Capability area

Integration overview

Delivery

Primary commands

Evidence produced

Identity and asset acceptance

Capture stable identifiers and an “as-received” acceptance snapshot.

Production

device_identity.py, os_build_identity.py

stdout JSON; optional on-device JSON record

Hardware configuration inventory

Enumerate key hardware configuration (CPU/GPU/SSD/NIC/memory).

Production

hardware_config.py

stdout JSON; optional on-device JSON record

Firmware inventory

Report firmware versions (UEFI/BIOS, NIC, SSD, GPU as available).

Production

firmware_reporter.py

stdout JSON; optional on-device JSON record

Driver inventory

Report key driver versions (GPU/NIC/storage/USB as implemented).

Production

driver_inventory_reporter.py

stdout JSON; optional on-device JSON record

Software inventory

Enumerate installed software inventories for drift and compliance contexts.

Production

software_inventory_reporter.py

stdout JSON; optional on-device JSON record

UEFI-backed tags (optional)

Read/write UEFI-backed metadata tags for fleet workflows (if enabled).

Production

NVAIAread, NVAIAwrite

stdout JSON (read/write result); optional on-device record

Health posture and diagnostics (L1)

Return bounded health posture signals for monitoring and triage.

Production

spark_diagctl.py

stdout JSON summary

Deep evidence bundles (L2)

Generate targeted or full diagnostics bundles when escalation requires evidence.

Production

spark_diagctl.py

stdout JSON + artifact pointer(s)

Reset / reboot context

Explain reboot/reset reasons for stability correlation and incident triage.

Production

reset_reason_reporter.py

stdout JSON; optional on-device JSON record

Controlled SW/FW updates

Expose update posture and controlled update operations within maintenance windows.

Production

spark_updatectl.py

stdout JSON; optional update evidence artifacts depending on mode

Landscape reference execution

Run platform-friendly checks and store per-run evidence under a run directory.

Reference (landscape_* scripts)

landscape_* scripts

Short stdout + on-device evidence under <runid.log>

DGX Spark Overview#

DGX Spark is deployed as an enterprise-managed, locally accelerated AI companion device, often alongside standard PC fleets, to provide on-prem or edge AI compute where latency, privacy, data gravity, or disconnected operations make cloud-only workflows impractical.

In enterprise terms, DGX Spark behaves less like a general-purpose end-user PC and more like a managed endpoint appliance:

  • It runs NVIDIA’s base OS (DGX OS).

  • It is typically administered remotely, at scale, with minimal local interaction.

  • It should integrate into existing IT operations: inventory, monitoring, patching, incident response, and retirement.

Documentation Principle: Treat DGX Spark as a first-class enterprise endpoint but manage it with an appliance mindset: baseline control, automation, and evidence-driven operations.

image0

Figure 1: DGX Spark context (PC fleet + AI workloads + enterprise tooling).

Enterprise Management Challenges#

Enterprise IT teams managing PCs globally typically rely on:

  • Rich device identity and hardware inventory

  • Standardized patching and reboot orchestration

  • Predictable driver and firmware servicing channels

  • Mature remote diagnostics and support bundles

  • CMDB integration, drift detection, and reporting. Linux-based devices often break those assumptions:

  • Management approaches differ across distros and package managers

  • Endpoint agents vary by platform and may not be desirable in regulated environments

  • Ad-hoc scripting without consistent outputs leads to fragile automations

  • Troubleshooting evidence collection can be inconsistent, large, and noisy. DGX Spark’s manageability approach addresses these challenges by providing:

    • Bounded, machine-ingestible outputs (JSON)

    • An agentless control-plane model (SSH)

    • Optional artifact bundles when deep evidence is required

    • Lifecycle-aligned playbooks that map to standard IT operations

PC-Centric Expectation

DGX Spark Operational Equivalent

Identity and inventory are always queryable

Standard SSH collectors emit bounded JSON snapshots for ingestion.

Patching and reboots are orchestrated

Maintenance-window playbooks: precheck → update and reboot → postcheck, all returning JSON.

Driver and firmware servicing is predictable

Standardize on a supported baseline (DGX OS) and validate drift through inventory JSON.

Support bundles are standardized

Generate evidence artifacts only when needed and reference them from the JSON result.

Drift detection and reporting are continuous

Record baseline OS build, firmware, drivers, and critical packages. Compare them over time.

image1

Figure 2: Manageability assumptions vs Linux endpoint realities.

Lifecycle Backbone (How Enterprise IT Actually Runs Fleets)#

The information in this section is organized around a lifecycle backbone that enterprise IT already understands. It deliberately starts before the device is powered on, because procurement decisions (SKUs, support, warranty, spares, standards alignment, and so on) drive long-term operational cost.

Procurement and Receiving

  • Define standard SKUs and configurations, support entitlements, and regional logistics.

  • Establish asset identity expectations (serial and UUID conventions, labeling, CMDB records).

  • Plan sparing, RMA flows, and replacement pools aligned to global sites.

Initial Provisioning

  • Establish device identity, hardware, firmware, software inventory, and baseline state.

  • Apply initial configuration required for remote management (SSH reachability, time sync, and so on).

  • Record enrollment metadata and ownership tags (as defined by your IT processes).

  • Apply automated first-boot configuration with Cloud-init and repacked BaseOS media where required. See Cloud-init for DGX Spark and Custom Installation with Cloud-Init.

Ongoing Monitoring

  • Run health checks and alerting signals.

  • Detect drift from known good baselines (such as OS build, firmware set, driver set, critical packages).

  • Analyze reset reasons and stability indicators to catch systemic issues early.

Maintenance Windows

  • Coordinate controlled updates and reboots within change windows.

  • Validate update outcomes and preserve rollback safety.

  • Enforce staged rollouts (rings and waves) to reduce fleet risk.

Incident Response

  • Collect targeted evidence or full diagnostics bundles when needed.

  • Package artifacts for escalation (such as, engineering, NVIDIA support, or OEM workflows).

  • Execute remediation consistently and record outcomes for postmortems.

End-of-Life

  • Execute factory reset aligned to retirement policy.

  • Produce retirement evidence (such as method, timestamps, success and failure, and artifact references).

  • Handle redeploy, transfer, and disposal with chain-of-custody documentation.

image2

Figure 3: Lifecycle backbone (Procure → Provision → Monitor → Maintain → Respond → Retire).

JSON-First, Agentless SSH Execution Model#

The information in this section standardizes on one universal remote control plane:

  • Remote execution: SSH (works from Windows and from enterprise management platforms)

  • Output contract: JSON on stdout (small, bounded; or machine-ingestible)

  • Deep evidence: Artifacts (tarballs and log bundles) referenced by JSON and pulled only when needed. This model intentionally avoids requiring a resident management agent on the DGX Spark endpoint.

Existing platforms orchestrate SSH execution and ingest JSON results.

image3

Figure 4: Orchestrator → SSH → Tool → stdout JSON → parse or ingest (plus optional artifact pull).

image4

Figure 5: JSON result envelope (fields) and artifact pointer model.

Simple Operations#

IT teams can run the entire control plane from endpoints and servers using built-in OpenSSH capabilities:

  • Direct SSH command execution

  • Scripting and scheduling

  • JSON capture and forwarding into CMDB, SIEM, and monitoring pipelines

image5

Figure 6: Admin workstation or server → SSH → DGX Spark → JSON ingestion targets.

image6

Figure 7: Admin host SSH invocation example flow

Artifact Strategy#

Use two tiers of operational evidence:

  • stdout JSON: Bounded, predictable, and easy to store and index

  • artifacts: Collected selectively (incident response and escalation), stored with retention controls; and CMDB stores pointers only

image7

Figure 8: Decision split—small JSON vs large artifacts (pointers + retention).

Artifact Type

Typical Retention

Notes

Routine health snapshots

30–90 days

Prefer JSON-only; avoid artifacts unless needed.

Incident diagnostics bundles

90–180 days

Retain for root cause analysis and escalation cycles.

Security audit evidence

180–365+ days

Align with compliance and legal hold policy.

Retirement proofs

Per policy

Retention period set by chain-of-custody or enterprise governance.

Operational UX Details#

This subsection describes the practical operator experience, such as what to run, what you get back, and how to handle evidence in a way that scales in enterprise job systems.

Operator Workflow in One Screen#

  • Pick the lifecycle intent (such as inventory, drift check, health posture, update window, incident evidence, and retire).

  • Run the corresponding installed command over SSH.

  • Capture stdout JSON as the authoritative record.

  • Retrieve artifacts only when JSON reports they exist or escalation requires deeper evidence.

Quick start: The four most common actions#

Intent

Command to Run

What to Store

Acceptance snapshot (as-received)

device_identity.py os_build_identity.py

stdout JSON for CMDB; optional on-device JSON record

Baseline inventories (provisioning)

hardware_config.py firmware_reporter.py driver_i nventory_reporter.py software_i nventory_reporter.py

stdout JSON for drift anchors; retain build + driver + firmware fields

Monitoring (routine posture)

spark_diagctl.py rese t_reason_reporter.py

stdout JSON for health/stability signals

Incident escalation (deep evidence)

spark_diagctl.py (bundle modes)

stdout JSON + artifact pointer(s); retrieve bundle only when needed

Stdout JSON is the API (contract)#

Stdout JSON is the integration contract. Recommended capture: stdout (verbatim), rc, stderr, ts, host. See JSON result capture (stdout as the API) for the standard envelope fields.

Evidence Handling: Summary First, Artifacts on Demand#

Apply the rule in Artifact Strategy. If the result is small, keep it in stdout JSON. If the result is large (such as logs or bundles), generate or retrieve it as an artifact. Store artifacts in your normal evidence store and link them back to the run record or ticket.

Scenario

Operational Handling

Routine monitoring

stdout JSON only; run frequently; bounded outputs

Drift investigation

stdout JSON summaries; artifact only if a deep diff is required

Incident response L1

stdout JSON + minimal targeted log excerpts

Incident response L2

stdout JSON + artifact bundle; retrieve and attach to ticket

Retirement evidence

stdout JSON “certificate” + optional artifact evidence per policy

Common Failure Classes#

When automation fails at scale, the fastest recovery comes from classifying failures consistently:

Failure Class

What it Usually Means

First Action

Transport/auth

SSH could not connect or authenticate; device unreachable; key or policy issue.

Fix connectivity or credentials in the orchestration layer; rerun a collector.

Privilege

Command requires sudo or access to privileged system interfaces.

Grant least-privilege sudo for the tool; re-run; validate no interactive prompts.

Tool execution

Missing dependency, unexpected OS state, or tool internal error.

Check stderr + tool log directory; compare against supported baseline.

Endpoint degraded

Tool ran successfully but returned warning or failure status due to device health signals.

Collect targeted diagnostics; escalate to L2 bundle if needed.

Integration Patterns: SSH Execution from Management Platforms#

This section describes how enterprise platforms execute remote actions over SSH, capture results, and feed them into downstream systems. The goal is portability. The same operational pattern should work whether you trigger jobs from Windows, a jump host, or a central orchestration tier.

Key takeaways: See JSON-first, agentless SSH execution model and Artifact Strategy for the execution and evidence model. Design for fleet realities: output limits, timeouts, concurrency, and predictable privilege boundaries.

The Universal Remote Execution Model#

At scale, enterprise tooling converges on a consistent execution loop:

  1. Select targets (static groups, rings, or dynamic inventory queries).

  2. Execute a remote command over SSH (often through a job, package, or policy).

  3. Capture stdout, stderr, and exit code.

  4. Parse stdout JSON into normalized fields.

  5. Ingest results into CMDB, monitoring, and ITSM workflows.

  6. (Optional) Retrieve artifacts when JSON indicates they exist.

image8

Figure 9: Universal model — platform → SSH → device tool → stdout JSON → ingest; optional artifact pull.

SSH Execution#

Many enterprise IT teams prefer a simple operational model.

  • Admin devices includes an OpenSSH client for interactive use and automation.

  • PowerShell provides a natural surface for fan-out, scheduling, and JSON capture.

  • Output should be stored verbatim and parsed off-box.

image9

Figure 10: SSH invocation capturing stdout JSON to file.

ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new nvidia@DGX_HOST “sudo /usr/ local/bin/dgx-mgmt/spark_diagctl.py” | Out-File -Encoding utf8 .\ spark_diagctl.py $doc = Get-Content .\ spark_diagctl.py -Raw | ConvertFrom-Json

$doc.status

SSH Access Patterns for Enterprise Operations (Non-Prescriptive)#

The information in this section is intentionally non-prescriptive about identity and IAM. In practice, most enterprises converge on one of these connectivity patterns:

  • Direct SSH per device (common in labs and smaller networks).

  • Bastion or jump-host mediated SSH (typical for segmented networks).

  • Brokered SSH execution (platform runs jobs from a central execution tier).

Operational considerations that matter at fleet scale:

  • Stable addressing or an authoritative inventory source

  • Non-interactive authentication and execution

  • Predictable sudo permissions for tools

  • Separation between read-only collectors and state-changing controllers

Pattern

Strengths

Tradeoffs

Direct per-device SSH

Simple and transparent; minimal infrastructure

Harder at scale; network exposure; credential sprawl risk

Bastion or jump host

Fits segmentation; centralizes access control and logging

Capacity planning; potential bottleneck

Brokered execution

Central scheduling, targeting, and reporting; easier fleet automation

Depends on platform constraints; adds abstraction

image10

Figure 11: SSH connectivity patterns — direct vs bastion vs brokered execution.

JSON Result Capture (stdout as the API)#

(Model described in JSON-first, agentless SSH execution model and Artifact Strategy) To keep integrations uniform across platforms, it is assumed that:

  • Orchestration wrappers store stdout verbatim (or as close as possible)

  • Parsing happens off-box (in the platform, not on the device)

Operational requirements:

  • stdout JSON is the authoritative contract

  • stderr is retained for debugging, but is not the primary data plane

  • If a tool fails before emitting JSON, the wrapper should emit a standard failure envelope

Field

Purpose

tool

Stable identifier of the command or tool invoked

ts

UTC timestamp for correlation

host

Target identity used by the orchestrator (hostname or asset ID)

status

ok

rc

Exit code from the tool execution

duration_ms

Runtime to support SLOs and debugging

summary

Small bounded summary suitable for indexing

warnings

Optional list of non-fatal issues

artifacts

Optional list of artifact pointers (ok, data, errors, meta)

{

“tool”: “spark_diagctl.py “,

“ts”: “2026-01-12T21:17:00Z”,

“host”: “DGX_HOST”,

“status”: “ok”,

“rc”: 0,

“duration_ms”: 842,

“summary”: {

“disk”: “ok”,

“network”: “ok”, “drivers”: “ok”

},

“warnings”: [], “artifacts”: []

}

Artifact Retrieval Patterns#

Artifacts are used when:

  • A health signal indicates anomaly

  • Incident response requires evidence

  • Escalation requires a standardized support bundle

  • Audits require retained evidence beyond JSON summaries

Common patterns:

  • Pull model: Orchestrator retrieves artifact through scp/sftp after job completion.

  • Store-and-link model: Orchestrator uploads artifact to internal object storage and stores a link in the job record or ticket.

  • Retention model: Artifacts are retained per policy and then deleted automatically.

Artifact Type

Typical Trigger

Notes

Diagnostics bundle

Incident or escalation

Prefer store-and-link; include hashes; keep sizes predictable

Focused logs

Single failing signal

Smaller than full bundles; faster retrieval and triage

Configuration snapshot

Drift investigation

Pair with a baseline record for comparisons

Update evidence

Maintenance window validation

Attach to change ticket when required

image11

Figure 12: Artifact lifecycle — generate → retrieve → store → link → expire.

Implementation#

This section is meant to help an Enterprise IT admin (or integrator) quickly answer:

  • What tooling does NVIDIA provide?

  • Where is the source code and documentation for each tool?

  • What gets installed on-device and what runs in-place as reference code?

  • Where do outputs and logs land on the DGX Spark device?

  • Which tools or scripts are relevant to each lifecycle stage

For top-level directory layout and command-to-source-path mapping, see Repository layout reference.

image12

Repository Content#

This repository contains two distinct implementation types:

Production Tools (11)#

These are production-ready, stdlib-only Python tools designed for fleet automation with:

  • JSON-first output

  • Configuration defaults with CLI override support

  • Safety guardrails for state-changing operations

  • Comprehensive documentation and test coverage

The following is the installed command location on the device:

  • DGX_spark_management/bin/

The following are the production tools functional areas and source roots:

  • clear_asset_information/ (7 tools)

  • controlled_sw_fw_updates/ (1 tool)

  • remote_ops_remediation/ (2 tools)

  • data_protection_privacy/ (1 tool)

Each production tool follows a consistent structure:

  • {functional_area}/{tool_name}/src/

  • {functional_area}/{tool_name}/config/

  • {functional_area}/{tool_name}/install.sh

  • Installed entry point(s) land in DGX_spark_management/bin/

Canonical Landscape Reference Scripts (8)#

These are reference implementations designed for Canonical Landscape “remote script execution” constraints. The device must be enrolled in Landscape first; see Ubuntu Pro and Landscape client enrollment.

  • Minimal error handling (examples, not production frameworks)

  • stdout output intended to be short and platform-friendly

  • Detailed evidence stored locally on the device per run

Reference Scripts Run In-Place

  • They are not installed into bin/ by default.

  • They live under their functional area folders with landscape_ prefix naming.

Reference Script Functional Areas

  • attestable_conformance_regulatory/ (1 script)

  • resilience_recovery_rollback/ (4 scripts)

  • network_enterprise_connectivity/ (2 scripts)

  • security_posture_vuln_response/ (1 script)

Reference script structure:

  • {functional_area}/landscape_{script_name}/README.md

  • {functional_area}/landscape_{script_name}/{script_name}.sh

Script Path (Repo)

Primary Script

Purpose

Stdout + Evidence

attestable_conformance_regulatory/ landscape_signing_verification/

signing_verification.sh

APT signing verification

Short stdout; detailed evidence stored per run on device

resilience_recovery_rollback/ landscape_verified_boot_integrity/

verified_boot_integrity.sh

Verified boot integrity reference

Short stdout; per-run evidence under run directory

resilience_recovery_rollback/ landscape_recovery_backup_levels/

recovery_backup_levels.sh

Recovery/backup level reference

Short stdout; per-run evidence under run directory

resilience_recovery_rollback/ landscape_factory_reset_reprovision/

factory_reset_reprovision.sh

Factory reset + reprovision

Short stdout; per-run evidence under run directory

resilience_recovery_rollback/ landscape_health_watchdogs/

health_watchdogs.sh

Health watchdog reference

Short stdout; per-run evidence under run directory

network_enterprise_connectivity/ landscape_collect_package/

collect_package.sh

Support bundle collection reference

Short stdout; evidence bundle stored on device

network_enterprise_connectivity/ landscape_retrieve_logs_stdout/

retrieve_logs_stdout.sh

Bounded log retrieval reference

Short stdout; bounded excerpts; detailed evidence on device

security_posture_vuln_response/ landscape_encryption_at_rest/

encryption_at_rest.sh

Encryption-at-rest reference

Short stdout; evidence stored per run on device

When to Use Reference Scripts#

Use reference scripts when management platform constraints are the dominant design factor (for example, strict stdout limits or “run script remotely” paradigms). For broad enterprise automation, use production tools installed in bin/.

Primary Entry Points#

In-repo documentation provides the authoritative directory tree, reading order, and per-tool references. Start here:

  • docs/00_INDEX.md — Documentation index and reading order

  • docs/01_PROJECT_README.md — Overview, quick start, naming conventions

  • docs/02_PROJECT_STRUCTURE.md — Directory tree and runtime layout

  • docs/03_LANDSCAPE_REFERENCE_SCRIPTS_SETUP.md — Landscape reference script framework

I want to…

Read this first

Get a quick start view

docs/01_PROJECT_README.md

Understand folder structure and runtime layout

docs/02_PROJECT_STRUCTURE.md

Integrate Landscape reference scripts

docs/03_LANDSCAPE_REFERENCE_SCRIPTS_SETUP.md

Tool Deep-Dive Documentation#

Each production tool has the following:

  • A tool folder README at {functional_area}/{tool_name}/README.md

  • Centralized “production tools” documentation referenced by docs/00_INDEX.md. Tool locations and reference code map points to these documents per tool and summarizes the following:

  • Command synopsis and options

  • JSON output structure and key fields

  • Troubleshooting and known limitations

  • Integration notes

Installed Commands on the DGX Spark#

This subsection provides information about the installed commands on the DGX Spark.

Installed Command Directory#

When deployed, production tools are installed here:

  • DGX_spark_management/bin/ Expected installed commands:

  • bin/device_identity.py

  • bin/hardware_config.py

  • bin/firmware_reporter.py

  • bin/os_build_identity.py

  • bin/driver_inventory_reporter.py

  • bin/software_inventory_reporter.py

  • bin/NVAIAwrite

  • bin/NVAIAread

  • bin/spark_updatectl.py

  • bin/spark_diagctl.py

  • bin/reset_reason_reporter.py

image13

Figure 15: Example listing of DGX_spark_management/bin/ on a device.

Runtime Outputs and Logs#

Production tools and reference scripts write runtime output to:

  • /var/lib/dgx_spark_management/{functional_area}/{tool_or_script_name}/ Examples:

  • /var/lib/dgx_spark_management/clear_asset_information/hardware_inventory_collector/ device_identity.json

  • /var/lib/dgx_spark_management/controlled_sw_fw_updates/update_control_plane/ status.json

  • /var/lib/dgx_spark_management/remote_ops_remediation/ diagnostic_collector/diagnostics_full.json

Production logs are found under:

  • /var/log/dgx_spark/{functional_area}/{tool_name}/

Examples:

  • /var/log/dgx_spark/clear_asset_information/ hardware_inventory_collector/device_identity.log

  • /var/log/dgx_spark/controlled_sw_fw_updates/update_control_plane/spark_updatectl.log

  • /var/log/dgx_spark/remote_ops_remediation/diagnostic_collector/spark_diagctl.log

Landscape reference scripts store per-run evidence under:

  • /var/lib/dgx_spark_management/{functional_area}/{landscape_script}/run_<UTC>/

Category

Location Pattern

Production runtime outputs

/var/lib/dgx_spark_management/ {functional_area}/{tool_name}/

Production logs

/var/log/dgx_spark/ {functional_area}/{tool_name}/

image14

Figure 16: Evidence flow — stdout JSON vs on-device evidence directory vs platform attachment.

Repository Layout Reference#

Use this subsection to locate folders and source files in the DGX_spark_management/ repository. For the full directory tree and runtime layout, see docs/02_PROJECT_STRUCTURE.md.

Top-Level Repository Layout#

Path

Purpose

bin/

Installed production tool entrypoints (what operators run on endpoints)

docs/

Architecture, structure, deployment, and tool references

clear_asset_information/

Production tools: identity, hardware, firmware, OS build, drivers, software, asset tags

controlled_sw_fw_updates/

Production tool: update control plane (reboot coordination, rollback reporting)

remote_ops_remediation/

Production tools: diagnostics collector, reset reason reporter

data_protection_privacy/

Production tool: data protection and privacy

attestable_conformance_regulatory/

Landscape script: APT signing verification

resilience_recovery_rollback/

Landscape scripts: boot integrity, backup levels, factory reset, watchdogs

network_enterprise_connectivity/

Landscape scripts: support bundle collection, log retrieval

security_posture_vuln_response/

Landscape script: encryption-at-rest reporting

enrollment_identity_access/

Planned functional area placeholder

integration_automation/

Planned functional area placeholder

lifecycle_business_ops/

Planned functional area placeholder

packaging/

Packaging artifacts (for example, systemd units)

tools/

Development tooling helpers

Where operators spend time

  • Operators and integrators typically use bin/ (installed commands) and docs/ (usage and integration guidance).

  • Developers work primarily under functional area folders.

Production Tools: Command-to-Source Mapping#

Each production tool folder includes README.md, src/, config/, and install.sh (installs into bin/).

Clear Asset Information (Seven Tools)#

Installed command

Source root

Notes

device_identity.py

clear_asset_information/hardware_inventory_collector/src/device_identity.py

Stable identifier through SMBIOS/DMI logic

hardware_config.py

clear_asset_information/hardware_inventory_collector/src/hardware_config.py

CPU/GPU/SSD/NIC/memory enumeration

firmware_reporter.py

clear_asset_information/firmware_version_reporter/src/firmware_reporter.py

BIOS/UEFI/NIC/SSD/GPU firmware enumeration

os_build_identity.py

clear_asset_information/os_build_identity_reporter/src/os_build_identity.py

OS build + DGX identity

driver_inventory_reporter.py

clear_asset_information/driver_inventory_reporter/src/driver_inventory_reporter.py

GPU/NIC/storage/USB drivers

software_inventory_reporter.py

clear_asset_information/software_inventory_reporter/src/software_inventory_reporter.py

dpkg/snap/pip/docker enumeration

NVAIAwrite / NVAIAread

clear_asset_information/asset_tag_manager/src/

UEFI-backed metadata store (read/write)

Controlled Software/Firmware Updates (One Tool)#

Installed command

Source root

spark_updatectl.py

controlled_sw_fw_updates/update_control_plane/src/spark_updatectl.py

Remote Ops and Remediation (Two Tools)#

Installed command

Source root

spark_diagctl.py

remote_ops_remediation/diagnostic_collector/src/spark_diagctl.py

reset_reason_reporter.py

remote_ops_remediation/reset_reason_reporter/src/reset_reason_reporter.py

Lifecycle Mapping to the Actual Code in This Repository#

This section maps the lifecycle found in Lifecycle backbone (how enterprise IT actually runs fleets) to the tools and scripts in this repository.

Procurement and Receiving#

Relevant production tools:

  • bin/device_identity.py

  • bin/os_build_identity.py

  • bin/hardware_config.py

  • bin/firmware_reporter.py

Stage Objective

Recommended Tools

Outputs to Capture

Acceptance snapshot

Identity + build + HW + FW

stdout JSON + stored on-device JSON paths

Initial Provisioning#

Relevant production tools:

  • Identity, hardware, firmware, OS build, driver inventory, and software inventory

  • optional: bin/NVAIAwrite and bin/NVAIAread if UEFI-backed tags are used

Baseline Set

Primary Tools

Drift Anchor Fields

Identity + build + inventories

device_identity.py os_build_identity.py hardware_config.py firmware_reporter.py driver_i nventory_reporter.py software_i nventory_reporter.py

OS build identity; firmware set; driver set; and critical package inventory hashes and counts

Optional first-boot automation uses Cloud-init with repacked BaseOS media; see Cloud-init for DGX Spark.

Cloud-init for DGX Spark#

This section explains Cloud-init basics and how it supports automated first-boot provisioning of DGX Spark in enterprise deployments.

What is Cloud-init#

Cloud-init is a widely used initialization framework that runs during the early boot process of Linux systems. It allows system configuration and provisioning tasks to be automated at first boot using simple configuration files. Originally developed for cloud environments, Cloud-init is equally effective for provisioning physical devices such as DGX Spark systems.

For DGX Spark deployments, Cloud-init enables administrators to apply configuration policies automatically when a device is first powered on, eliminating the need for manual setup.

Typical uses include:

  • Setting the system hostname and identity

  • Creating administrator accounts and installing SSH keys

  • Configuring networking or Wi-Fi profiles

  • Installing packages or applying update policies

  • Installing corporate certificates or proxy settings

  • Registering the system with enterprise management platforms

Cloud-init executes during the early stages of the boot process and normally runs once per instance, making it well suited for automated provisioning workflows.

DGX Spark Procedures#

On DGX Spark, Cloud-init implements first-boot provisioning during the Initial Provisioning lifecycle stage. Site-specific configuration (users, packages, out-of-box experience behavior, and optional local package or firmware mirrors) is delivered through repacked BaseOS media, OEM Cloud-init seeds, and optional OEMDATA USB layout.

For step-by-step procedures, USB partitioning, repack scripts, example user-data files, and verification workflows, see Custom Installation with Cloud-Init.

For general Cloud-init module reference and behavior, see https://docs.cloud-init.io/en/latest/.

Ongoing Monitoring#

The following are relevant production tools:

  • bin/spark_diagctl.py (health posture and targeted collection modes)

  • bin/reset_reason_reporter.py

  • (Optional) Periodic drift re-checks using os_build_identity, driver_inventory, and firmware_reporter

Signal

Source Tool

Artifacts (if any)

Health posture

spark_diagctl.py

Optional bundle modes

Reset context

reset_reason_reporter.py

None

Maintenance Windows#

The following are relevant production tools:

  • bin/spark_updatectl.py

  • Validation tools: os_build_identity, spark_diagctl, optional driver or firmware tools

Phase

Tools

Evidence

Pre-check

Build identity + health

stdout JSON

Update control

spark_updatectl.py

stdout JSON + optional artifact

Post-check

Build identity + health

stdout JSON

Incident Response#

Relevant production tools:

  • Level 1: Targeted triage: spark_diagctl health, reset_reason_reporter, plus identity, build, and drivers as needed

  • Level 2 Deep evidence: spark_diagctl bundle collection modes. Relevant reference scripts (Landscape environments) include log retrieval and support bundle collection scripts under network_enterprise_connectivity/

Level

Tools and Scripts

Output Handling Expectation

Level 1

Bounded collectors

stdout JSON only

Level 2

Bundle modes / reference scripts

stdout JSON + artifacts/evidence

Cascade and Redeployment#

Cascade and redeployment is a re-provisioning motion:

  • The device changes user, function, or environment

  • The lifecycle loop returns to provisioning steps while preserving linkage to prior history, if desired.

Relevant production tools:

  • Provisioning baseline set (identity, build, hardware, firmware, drivers, and software)

  • Optional tag tools, if used to mark new function or owner. Relevant reference scripts (Landscape environments) include factory reset and re-provisioning examples under resilience_recovery_rollback/

Motion

Recommended Steps

Evidence to Capture

Re-provision

Baseline re-capture or optional reset

New baseline JSON plus optional reset evidence

Tool Locations and Reference Code Map#

This section explains the tools and example implementations inside the project repository. The goal is to make it easy to locate the correct scripts, understand what they do, and embed them into enterprise platform wrappers without copying large files.

Repository Layout Conventions Used in This Section#

It is assumed that the repository organizes content into:

  • Tools: Scripts or binaries invoked remotely (collectors and controllers).

  • Reference code: Examples used to demonstrate platform integration patterns.

  • Platform wrappers: Example job scripts for specific management platforms.

  • Artifacts: Optional bundles (diagnostics or logs) created on demand.

To avoid repeating paths everywhere, define a single canonical base path and reuse it.

Note: In the provided examples, the tool directory is referenced as a single variable (for example, dgx_tool_dir). Replace it with your repo’s actual path on the device.

image15

Figure 17: Repo layout — tools, reference code, wrappers, and artifacts.

Canonical Tool Directory and Naming Scheme#

For readability and consistency, it is assumed that:

  • Collectors are named *.py and emit one JSON envelope on stdout

  • Controllers are also *.json but are gated and validated

  • Artifact generators indicate artifact creation and return pointers in artifacts

Category

Convention

Collector

Name ends in .json; read-only; safe to run frequently; bounded output

Controller

Name ends in .json; changes state; gated; includes precheck= and postcheck guidance

Artifact generator

Name indicates bundle creation; stdout JSON returns artifact pointer list

image16

Figure 18: Naming scheme — collectors vs controllers vs artifact generators.

Reference Code Map#

See Canonical Landscape reference scripts (8) for the full list of Landscape reference scripts and their purposes. Enrollment is required before running them; see Ubuntu Pro and Landscape client enrollment.

How to Use These References#

Treat these directories as examples of platform wrapper integration and evidence minimization. For production automation, use the guide’s standardized collector and controller tools and keep wrapper scripts thin.

Tool Invocation Patterns#

Across all platforms, it is assumed that the same invocation pattern is used:

  • SSH executes a tool from the canonical tool directory.

  • The tool emits one JSON envelope on stdout.

  • The platform stores stdout JSON verbatim and parses it off-box.

  • Artifacts are created only on demand; stdout returns pointers.

image17

Figure 19: Invocation contract — SSH executes tool → stdout JSON → ingest; optional artifacts by pointer.

Embedding Paths in Examples#

To keep the document readable, examples should not hardcode long paths repeatedly. Use one variable name consistently, and document where the variable is set per platform.

Platform Wrapper

How to Represent the Tool Path

Windows scripts

Set a single variable and call tools via that variable

Ansible

Use a vars: entry such as dgx_tool_dir and reference it in tasks

Landscape jobs

Use the job definition to call a script in-place, or call a canonical tool directory

BigFix/Tanium

Store the SSH wrapper centrally and pass the tool path as a parameter

Recommendation

In later sections, whenever you see dgx_tool_dir, replace it with the actual installed location on the endpoint. Keep the replacement consistent across all playbooks and wrappers.

Evidence Minimization (stdout vs artifacts)#

This repository includes examples that write evidence to disk (for later retrieval). That pattern is correct for large evidence and diagnostics, but stdout must remain bounded.

Use this decision rule:

  • If the result is small and can be summarized, return it in stdout JSON.

  • If the result is large, generate an artifact and return a pointer in stdout JSON.

Case

Preferred Handling

Routine inventory and health signals

stdout JSON summary only

Long logs or large inventories

artifact bundle + pointer in stdout JSON

Incident response deep evidence

artifact bundle + hashes + retrieval hint

Audit or retirement proofs

stdout JSON certificate + artifact evidence as required

For additional information on evidence minimization, see Artifact Strategy and Evidence Handling: Summary First, Artifacts on Demand.

image18

Figure 20: Evidence minimization — bounded stdout JSON vs artifact offload.

Platform Wrapper Patterns and Packaging Examples#

This section shows examples of how to package the guide’s SSH + stdout-JSON tooling model inside common enterprise platforms.

Key takeaways: Consider the same model as found in JSON-first, agentless SSH execution model and Artifact Strategy (execute → capture stdout JSON → optionally pull artifacts). Use collectors (safe, frequent) vs controllers (gated, validated).

Common Packaging Principles (Applies to All Platforms)#

Output Sizing#

Consider the same principle as in Artifact Strategy: keep stdout bounded and use artifacts for large evidence. Most platforms impose stdout or result-size limits. Packaging should treat:

  • stdout JSON as the authoritative small record

  • Large bundles (diagnostics tarball, large logs, or full inventories) as attachments or artifacts

Output Type

Packaging Guidance

stdout JSON

Bounded summary, stable envelope, safe to index and ingest

stderr

Keep for debugging. Do not rely on it for structured data.

Artifacts

Use for large evidence. Attach or store and return pointers in JSON.

Recommendation

If a platform truncates outputs, treat the truncation as a correctness failure. Reduce stdout size or switch the payload to an artifact generator that returns only a pointer on stdout.

Ansible Playbook Examples (Agentless SSH-Native)#

Ansible aligns naturally with this information because it is SSH-native and can fetch files. These examples focus on:

  • Running the installed tools on each host

  • Capturing JSON results per host

  • Optionally fetching artifacts

Inventory Assumption

Conceptual Guidance

Host grouping

Use groups for rings or waves (such as pilot, wave1, wave2, or broad)

Connection

SSH keys and non-interactive execution; bastion if required

Privilege

Use become for controllers; keep collectors unprivileged when possible

Outputs

Write stdout JSON per-host; collect centrally; parse off-box

Example: Provisioning Baseline (Conceptual)#

— name: Provisioning baseline (collectors) hosts: dgx_spark gather_facts: no vars:

dgx_tool_dir: /path/to/tools # set per Part 4 tasks: - name: Collect identity/build

shell: “sudo {{ dgx_tool_dir }}/identity.json” register: identity_out changed_when: false

  • name: Collect hardware/firmware/driver/software inventories shell: “sudo {{ dgx_tool_dir }}/inventory.json” register: inv_out changed_when: false

  • name: Write results (stdout JSON) per-host copy:

content: “{{ identity_out.stdout }}n” dest: “./out/{{ inventory_hostname }}_identity.json” - name: Write inventory JSON per-host copy:

content: “{{ inv_out.stdout }}n”

dest: “./out/{{ inventory_hostname }}_inventory.json”

Example: Monitoring Snapshot#

— name: Monitoring snapshot (collectors) hosts: dgx_spark gather_facts: no vars:

dgx_tool_dir: /path/to/tools # set per Part 4 tasks: - name: Health check

shell: “sudo {{ dgx_tool_dir }}/spark_diagctl.py “ register: health_out changed_when: false

  • name: Reset reason

shell: “sudo {{ dgx_tool_dir }}/reset_reason.json” register: reset_out changed_when: false

  • name: Save JSON outputs copy:

content: “{{ health_out.stdout }}n” dest: “./out/{{ inventory_hostname }}_health.json”

  • name: Save reset reason JSON copy:

content: “{{ reset_out.stdout }}n”

dest: “./out/{{ inventory_hostname }}_reset_reason.json”

Example: Incident Response L2#

— name: Incident response L2 (artifact bundle) hosts: dgx_spark gather_facts: no vars:

dgx_tool_dir: /path/to/tools # set per Part 4 tasks:

  • name: Trigger diagnostics bundle

shell: “sudo {{ dgx_tool_dir }}/diag_collect.json” register: diag_out

  • name: Save diagnostics JSON copy:

content: “{{ diag_out.stdout }}n”

dest: “./out/{{ inventory_hostname }}_diag_collect.json”

  • name: Fetch artifact (conceptual) # Replace with a real fetch that reads the artifact pointer from JSON fetch:

src: “/path/on/device/to/artifact.tar.gz”

dest: “./artifacts/{{ inventory_hostname }}_artifact.tar.gz” flat: yes

Ubuntu Pro and Landscape Client Enrollment#

Complete these steps before using the Landscape reference scripts in Canonical Landscape reference scripts (8) or the job examples in Canonical Landscape script examples.

For the most up-to-date information, see Landscape installation and set-up - Landscape documentation.

Scope: These steps target Landscape SaaS enrollment through Ubuntu Pro (sudo pro enable landscape with Self-hosted server? set to No). Self-hosted Landscape deployments can differ; follow Canonical’s documentation for those environments.

Landscape SaaS enrollment requires Ubuntu Pro. The following steps enable Ubuntu Pro services and enroll a DGX Spark as a Landscape client.

  1. Register for Ubuntu One.

    Go to Ubuntu Pro and register for an Ubuntu One account.

  2. Create a Landscape account.

    1. Open a browser.

    2. Navigate to https://landscape.canonical.com.

    3. Sign in using Ubuntu One.

    4. Create a new organization or join an existing one.

    5. Record your account name (organization name). You need this value during client enrollment.

  3. Attach Ubuntu Pro on the DGX Spark.

    Landscape SaaS enrollment requires Ubuntu Pro. On the DGX Spark, check status:

    sudo pro status
    

    If the system is not attached, run:

    sudo pro attach <YOUR_UBUNTU_PRO_TOKEN>
    

    Obtain the token from https://ubuntu.com/pro, then verify attachment:

    sudo pro status
    

    You should see Attached: yes.

  4. Enable Landscape (interactive mode).

    Run:

    sudo pro enable landscape
    

    When prompted, enter:

    • Self-hosted server?: No

    • Computer title: for example, dgx-spark-01

    • Account name: your Landscape organization name

    This process installs, configures, and starts the Landscape client.

  5. Approve the machine in the Landscape portal.

    1. Return to the Landscape web portal.

    2. Navigate to Pending Machines.

    3. Select the DGX Spark system.

    4. Click Accept.

    The system appears in your managed machines list.

  6. Verify client operation.

    sudo systemctl status landscape-client
    

    You should see Active: active (running).

    If needed, check the logs:

    sudo tail -n 50 /var/log/landscape/client.log
    

After enrollment, you can use Landscape to:

  • Apply tags (for example, DGX, Spark, AI-node)

  • Execute remote scripts

  • Schedule updates

  • Monitor package and security status

  • Create access groups

  • Define update and compliance policies

Canonical Landscape Script Examples#

Complete Ubuntu Pro and Landscape client enrollment before running these jobs.

Canonical Landscape is included because the repository provides reference scripts that illustrate platform job integration. They are not meant to replace production tools. They demonstrate:

  • Integrating health and security checks into Canonical Landscape jobs

  • Producing bounded stdout results

  • Writing evidence to disk for later retrieval

Reference script locations (from Tool locations and reference code map):

  • attestable_conformance_regulatory/landscape_signing_verification/

  • resilience_recovery_rollback/landscape_verified_boot_integrity/

  • resilience_recovery_rollback/landscape_recovery_backup_levels/

  • resilience_recovery_rollback/landscape_factory_reset_reprovision/

  • resilience_recovery_rollback/landscape_health_watchdogs/

  • network_enterprise_connectivity/landscape_collect_package/

  • network_enterprise_connectivity/landscape_retrieve_logs_stdout/

  • security_posture_vuln_response/landscape_encryption_at_rest/

For example, the following is an APT signing verification (conceptual)

# Landscape job (conceptual)

# - run signing verification

# - emit bounded PASS/FAIL/UNKNOWN on stdout # - write evidence to a directory for retrieval

run_signing_verification emit_stdout_summary write_evidence_dir

Example: Verified Boot Integrity#

# Landscape job (conceptual)

# - check verified boot signals

# - emit bounded summary

# - store evidence for later retrieval

check_verified_boot emit_stdout_summary write_evidence_dir

Example: Factory Reset with Reprovision#

# Landscape job (conceptual)

# - customer-gated reset workflow # - emit

retirement/reset certificate JSON on stdout # -

store evidence and logs

validate_gate_conditions perform_reset emit_stdout_certificate write_evidence_dir

image19

Figure 21: Landscape constraints — stdout summary vs evidence-on-disk retrieval pattern.

Tanium Package Patterns#

Tanium often implements operations through:

  • Packages or modules that run scripts

  • Results captured in question outputs or package logs

  • Attachments handled through platform mechanisms

These patterns show:

  • Packaging lifecycle SSH payloads as Tanium packages

  • Storing stdout JSON as package output (small)

  • Storing large bundles as attachments

Tanium Concept

Documentation Mapping

Package

Wrapper payload that runs SSH tools and captures stdout JSON

Question and result

Ingested JSON summary for reporting and filtering

Package logs

stderr and troubleshooting context

Attachments

Artifacts (bundles) referenced by JSON pointers

Pattern: Provisioning Baseline Package (Conceptual)#

# Tanium package (conceptual)

# - run baseline collectors over SSH

# - store stdout JSON per-host

run_ssh_identity run_ssh_inventory store_json_outputs

Pattern: Monitoring Snapshot Package (Conceptual)#

# Tanium package (conceptual)

# - run health/reset reason

# - store minimal JSON

run_ssh_health run_ssh_reset_reason store_json_outputs

Notes for Puppet and Chef (Drift and Scheduled Execution)#

Puppet and Chef are well-suited for:

  • Periodic execution of collectors (inventory or monitoring)

  • Enforcing baseline presence of tooling

  • Scheduled drift detection against known-good baselines

They are not typically used for interactive incident response bundles but can trigger them if desired.

Use Case

Usage

Tooling presence

Ensure the tool directory and dependencies exist on endpoints

Scheduled collectors

Run bounded collectors on a cadence. Forward JSON off-box.

Drift detection

Compare baseline summaries to current summaries and flag divergence

Controllers

Use sparingly. Prefer change-window orchestration platforms for risky actions.

Incident bundles

Possible, but generally better triggered by incident orchestration flows