spark-rapids/user-guide/latest/partials/tools-setup-db-aws.html
The tool currently only supports event logs stored on S3 (no DBFS paths).
The remote output storage is also expected to be S3.
To get complete event logs for a given run-id, run commands similar to the following:
`
databricks clusters list | grep <run-id>
databricks fs cp -r <databricks-log-location>/<cluster-id-from-above-command> <destination_location>
`
These commands download all the logs associated with a given run. Refer to the latest Databricks documentation
for up-to-date information.
Due to some platform limitations, it is likely that the logs may be incomplete. The qualification tool attempts to
process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv
file calls out the runs that failed due to incomplete logs.
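As a quick check, the status file can be searched for rows that report a failure. The output directory below is a placeholder; substitute the path produced by your qualification run:
`
# <qualification_output_dir> is a placeholder for the tool's output directory
grep -i "fail" <qualification_output_dir>/rapids_4_spark_qualification_output_status.csv
`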
Install Databricks CLI
Install the Databricks CLI version 0.200+. Follow the instructions in Install the CLI.
Set the configuration settings and credentials of the Databricks CLI:
Set up authentication by following these instructions
Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.
If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs.
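For instance, if the credentials live in a non-default location, the relevant variables can be exported before invoking the tools cmd. The host, token, and file path below are placeholders; use the values for your workspace:
`
# All values below are placeholders
export DATABRICKS_CONFIG_FILE=/path/to/custom/.databrickscfg
export DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
export DATABRICKS_TOKEN=<your-personal-access-token>
`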
Set up the environment to access S3
Install the AWS CLI version 2. Follow the instructions in aws-cli-getting-started.
Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.
If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars. It’s important to configure with the correct region for the bucket being used on S3. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the
AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
environment variables if the credentials couldn’t be pulled from the credential files.
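As an illustration, the following exports point the tools cmd at a non-default profile, region, and configuration files. All values are placeholders; adjust them to your environment:
`
# All values below are placeholders
export AWS_PROFILE=<your-profile-name>
export AWS_DEFAULT_REGION=<your-bucket-region>
export AWS_CONFIG_FILE=/path/to/custom/config
export AWS_SHARED_CREDENTIALS_FILE=/path/to/custom/credentials
`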
Note
To run tools that require SSH on the cluster nodes (that is, bootstrap):
make sure that you have SSH access to the cluster nodes; and
create a key pair using Amazon EC2 through the AWS CLI command
aws ec2 create-key-pair
as instructed in aws-cli-create-key-pairs (see the example after this list).
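A minimal sketch of creating such a key pair with the AWS CLI, assuming a hypothetical key name of tools-ssh-key:
`
# "tools-ssh-key" is a hypothetical key name; choose your own
aws ec2 create-key-pair --key-name tools-ssh-key --query 'KeyMaterial' --output text > tools-ssh-key.pem
# Restrict permissions so SSH accepts the private key
chmod 400 tools-ssh-key.pem
`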
Note
Users who have multiple Databricks profiles can switch between them by setting the environment variable RAPIDS_USER_TOOLS_DATABRICKS_PROFILE.
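For example, to point the tools at a profile other than the default one (the profile name below is a placeholder):
`
# <your-profile-name> is a placeholder for a profile defined in ~/.databrickscfg
export RAPIDS_USER_TOOLS_DATABRICKS_PROFILE=<your-profile-name>
`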