User Guide (24.04.01)

The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3.

  • Install Databricks CLI

    • Install the Databricks CLI, version 0.200 or later, by following the instructions in "Install the CLI".

    • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following the Databricks CLI authentication instructions.

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by the environment variable DATABRICKS_CONFIG_FILE.

    • If the configuration is not set to the default values, explicitly set the environment variables that the tools command picks up, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. See the description of the variables in the environment variables docs.
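The lookup described above can be sketched as a small pre-flight check. This is a minimal illustration, not part of the tools: the helper names `databricks_config_path` and `has_databricks_credentials` are hypothetical, and the check only mirrors the file and variable names mentioned in this section.

```python
import os
from pathlib import Path


def databricks_config_path(env=None):
    """Return the config file the Databricks CLI is expected to read.

    Hypothetical helper: DATABRICKS_CONFIG_FILE wins if set, otherwise
    the default ~/.databrickscfg is assumed.
    """
    env = os.environ if env is None else env
    override = env.get("DATABRICKS_CONFIG_FILE")
    return Path(override) if override else Path.home() / ".databrickscfg"


def has_databricks_credentials(env=None):
    """Rough check that credentials are discoverable before running the tools."""
    env = os.environ if env is None else env
    # Explicit host and token variables are enough on their own.
    if env.get("DATABRICKS_HOST") and env.get("DATABRICKS_TOKEN"):
        return True
    # Otherwise fall back to the presence of the config file.
    return databricks_config_path(env).is_file()
```

Calling `has_databricks_credentials()` with no arguments inspects the current process environment; passing a dictionary makes the check easy to try out without touching real settings.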

  • Set up the environment to access S3

    • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started.

    • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

    • If the AWS CLI configuration is not set to the default values, explicitly set the environment variables that the tools command picks up, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. See the full list of variables in aws-cli-configure-envvars.

    • Configure the correct region for the S3 bucket being used. If the region is not set, the AWS SDK chooses a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials could not be pulled from the credential files.
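The S3 checks above can be sketched in the same way. This is a hedged illustration under stated assumptions: `aws_setup_report` is a hypothetical helper, it only looks at the environment variables named in this section, and it assumes the default credentials file location ~/.aws/credentials. A region set only in the AWS config file is not detected by this sketch.

```python
import os
from pathlib import Path


def aws_setup_report(env=None):
    """Summarize where AWS credentials and region would come from.

    Hypothetical helper mirroring the order described above: the
    credentials file is consulted first, then the access-key variables.
    """
    env = os.environ if env is None else env
    cred_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    if cred_file.is_file():
        source = "credentials-file"
    elif env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        source = "environment"
    else:
        source = None  # nothing found; the tools would likely fail to authenticate
    return {
        "region": env.get("AWS_DEFAULT_REGION"),  # None means no explicit region
        "credentials_file": cred_file,
        "credential_source": source,
    }
```

A `None` value for `region` or `credential_source` in the returned dictionary is a hint to revisit the configuration before running the tools against an S3 bucket.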

    Note

    To run tools that require SSH on the EMR nodes (i.e., bootstrap):

    • make sure that you have SSH access to the cluster nodes; and

    • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.
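As a rough sketch of the key pair step, the AWS CLI call can be wrapped as follows. The key name and output path are placeholders chosen for illustration, the function names are hypothetical, and the command assumes the AWS CLI is installed and configured as described above.

```python
import os
import subprocess


def create_key_pair_command(key_name):
    """Build the AWS CLI invocation that prints only the private key material."""
    return [
        "aws", "ec2", "create-key-pair",
        "--key-name", key_name,
        "--query", "KeyMaterial",
        "--output", "text",
    ]


def create_key_pair(key_name, pem_path):
    """Run the command and store the private key with owner-only permissions."""
    result = subprocess.run(
        create_key_pair_command(key_name),
        check=True, capture_output=True, text=True,
    )
    # O_EXCL avoids overwriting an existing key file by accident.
    fd = os.open(pem_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(result.stdout)
    return pem_path
```

For example, `create_key_pair("my-tools-key", "my-tools-key.pem")` would request a new key pair in the configured region and save the private key locally; both names here are illustrative only.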

© Copyright 2024, NVIDIA. Last updated on Jun 12, 2024.