Deploy Text Curation on Slurm#
This workflow covers the full text curation pipeline on Slurm, including model download, text cleaning, quality and domain classification, deduplication, and PII redaction.
See also
For details on text container environments and Slurm environment variables, see Container Environments.
Prerequisites#
Create required directories for AWS credentials, NeMo Curator configuration, and local workspace:
mkdir $HOME/.aws
mkdir -p $HOME/.config/nemo_curator
mkdir $HOME/nemo_curator_local_workspace
Prepare configuration files:
$HOME/.aws/credentials (for S3 access):

[default]
aws_access_key_id =
aws_secret_access_key =
region =
$HOME/.config/nemo_curator/config.yaml (for the Hugging Face API key):

huggingface:
  api_key: "<api_token>"
$HOME/nemo_curator_local_workspace/text_config.yaml (for advanced semantic deduplication):

# Semantic deduplication configuration (for programmatic API usage)
# Note: CLI scripts use command-line parameters, not this configuration file
cache_dir: "/tmp/semdedup_cache"

# Embedding settings
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"

# Clustering settings
max_iter: 100
n_clusters: 1000
clustering_save_loc: "clustering_results"

# Deduplication settings
eps_to_extract: 0.01
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
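If you prefer to script the prerequisite setup, a minimal sketch like the following writes the first two files with placeholder values; replace the placeholders before use and keep the files private. The semantic deduplication YAML can be written the same way.

```bash
# Sketch: create the credential and config files with placeholder values.
# Replace every <placeholder> with your real values before running.
cat > "$HOME/.aws/credentials" <<'EOF'
[default]
aws_access_key_id = <your_access_key_id>
aws_secret_access_key = <your_secret_access_key>
region = <your_region>
EOF
chmod 600 "$HOME/.aws/credentials"

cat > "$HOME/.config/nemo_curator/config.yaml" <<'EOF'
huggingface:
  api_key: "<api_token>"
EOF
chmod 600 "$HOME/.config/nemo_curator/config.yaml"
```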
Model Download#
Copy the following script, which downloads all required text processing models, onto the Slurm cluster and save it as 1_curator_download_text_models.sh.
Note
This script downloads models required for the quality classifier and other components. You may need to adjust the model list based on your specific workflow requirements.
#!/bin/bash
#SBATCH --job-name=download_text_models
#SBATCH -p defq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
# Update Me!
#SBATCH --output=/home/<username>/logs/%x_%j.log

USER_DIR="/home/${USER}"
CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
#
LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT}"

export NEMO_CURATOR_RAY_SLURM_JOB=1
export NEMO_CURATOR_LOCAL_DOCKER_JOB=1

# Download Text Processing Models
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB,NEMO_CURATOR_LOCAL_DOCKER_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -c "
from nemo_curator.utils.download_utils import download_models

models = [
    'sentence-transformers/all-MiniLM-L6-v2',
    'huggingface/fasttext-language-identification',
    'nvidia/quality-classifier-deberta',
    'nvidia/domain-classifier-fineweb'
]
download_models(models, '/config/models')
"
Update the SBATCH parameters and paths to match your username and environment, then run the script:

sbatch 1_curator_download_text_models.sh
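After the job finishes, you can verify the download from the head node. The check below assumes the default paths from the script above, where /config/models inside the container maps back to the local workspace on the host:

```bash
# Check the job log, then confirm the models landed in the mounted workspace.
tail -n 20 "$HOME/logs/download_text_models_<jobid>.log"
ls -lh "$HOME/nemo_curator_local_workspace/models"
```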
Text Processing Pipeline#
The workflow consists of four main Slurm scripts, to be run in order:
1. curator_text_clean.sh: Cleans and preprocesses raw text data.
2. curator_text_classify.sh: Applies quality and domain classification.
3. curator_text_dedup.sh: Performs exact and fuzzy deduplication (semantic deduplication requires separate setup).
4. curator_text_redact.sh: Identifies and redacts personally identifiable information.
Note
Semantic Deduplication: The current scripts implement exact and fuzzy deduplication. For semantic deduplication, use the programmatic API or the specialized scripts in nemo_curator.scripts.semdedup. See the Semantic Deduplication documentation for complete setup instructions.
curator_text_clean.sh
- Cleans and preprocesses raw text data.
#!/bin/bash
#SBATCH --job-name=text-clean
#SBATCH -p defq
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --time=04:00:00
# Update Me!
#SBATCH --output=/home/<username>/logs/%x_%j.log
#SBATCH --error=/home/<username>/logs/%x_%j.log
USER_DIR="/home/${USER}"
CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
INPUT_DATA_PATH="s3://your-bucket/raw-text-data/"
OUTPUT_DATA_PATH="s3://your-bucket/cleaned-text-data/"
#
LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
AWS_MOUNT="${HOME}/.aws:/root/.aws"
CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT},${AWS_MOUNT}"
export NEMO_CURATOR_RAY_SLURM_JOB=1
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.text_cleaning \
--input-data-dir "${INPUT_DATA_PATH}" \
--output-clean-dir "${OUTPUT_DATA_PATH}" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu \
--normalize-newlines \
--remove-urls
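Once the cleaning job completes, a quick way to confirm it produced output is to list the destination prefix with the AWS CLI (object names depend on how the job shards its output):

```bash
# Spot-check the cleaning job: list a few objects in the output prefix.
aws s3 ls "s3://your-bucket/cleaned-text-data/" --human-readable | head
```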
curator_text_classify.sh
- Applies quality and domain classification.
#!/bin/bash
#SBATCH --job-name=text-classify
#SBATCH -p defq
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --time=06:00:00
# Update Me!
#SBATCH --output=/home/<username>/logs/%x_%j.log
#SBATCH --error=/home/<username>/logs/%x_%j.log
USER_DIR="/home/${USER}"
CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
INPUT_DATA_PATH="s3://your-bucket/cleaned-text-data/"
OUTPUT_DATA_PATH="s3://your-bucket/classified-text-data/"
#
LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
AWS_MOUNT="${HOME}/.aws:/root/.aws"
CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT},${AWS_MOUNT}"
export NEMO_CURATOR_RAY_SLURM_JOB=1
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.classifiers.quality_classifier_inference \
--input-data-dir "${INPUT_DATA_PATH}" \
--output-data-dir "${OUTPUT_DATA_PATH}" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu \
--input-text-field "text" \
--batch-size 256
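To get a rough feel for the label distribution, you can stream one output shard and count the classifier's predictions. This is only a sketch: the shard name and the quality_pred field are assumptions, so adjust them to the schema your classifier version actually writes.

```bash
# Count predicted quality labels in one output shard (JSONL assumed).
# "<shard>.jsonl" and the "quality_pred" field are placeholders -- adjust as needed.
aws s3 cp "s3://your-bucket/classified-text-data/<shard>.jsonl" - \
  | jq -r '.quality_pred' | sort | uniq -c
```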
curator_text_dedup.sh
- Performs exact and fuzzy deduplication (semantic deduplication requires separate setup).
#!/bin/bash
#SBATCH --job-name=text-dedup
#SBATCH -p defq
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --time=12:00:00
# Update Me!
#SBATCH --output=/home/<username>/logs/%x_%j.log
#SBATCH --error=/home/<username>/logs/%x_%j.log
USER_DIR="/home/${USER}"
CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
INPUT_DATA_PATH="s3://your-bucket/classified-text-data/"
OUTPUT_DATA_PATH="s3://your-bucket/deduplicated-text-data/"
CACHE_DIR="s3://your-bucket/dedup-cache/"
#
LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
AWS_MOUNT="${HOME}/.aws:/root/.aws"
CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT},${AWS_MOUNT}"
export NEMO_CURATOR_RAY_SLURM_JOB=1
# Step 1: Exact Deduplication
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.find_exact_duplicates \
--input-data-dirs "${INPUT_DATA_PATH}" \
--output-dir "${CACHE_DIR}/exact_dedup" \
--input-json-id-field "id" \
--input-json-text-field "text" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu
# Step 2: Fuzzy Deduplication - Compute MinHashes
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.fuzzy_deduplication.compute_minhashes \
--input-data-dirs "${CACHE_DIR}/exact_dedup" \
--output-minhash-dir "${CACHE_DIR}/fuzzy_dedup/minhashes" \
--input-json-id-field "id" \
--input-json-text-field "text" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu \
--char-ngram 24 \
--minhash-length 260
# Step 3: Fuzzy Deduplication - LSH
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.fuzzy_deduplication.minhash_lsh \
--input-data-dirs "${CACHE_DIR}/fuzzy_dedup/minhashes" \
--output-bucket-dir "${CACHE_DIR}/fuzzy_dedup/buckets" \
--input-json-id-field "id" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu \
--num-bands 20 \
--buckets-per-shuffle 5
# Note: Semantic deduplication requires a separate configuration file and multi-step process
# See the semantic deduplication documentation for the complete workflow
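Deduplication intermediates (exact-duplicate listings, MinHashes, LSH buckets) can be large. Before moving on, it is worth checking how much space the cache prefix is consuming:

```bash
# Summarize the size of the deduplication cache prefix.
aws s3 ls "s3://your-bucket/dedup-cache/" --recursive --summarize --human-readable | tail -n 2
```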
curator_text_redact.sh
- Identifies and redacts personally identifiable information.
#!/bin/bash
#SBATCH --job-name=text-redact
#SBATCH -p defq
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --time=08:00:00
# Update Me!
#SBATCH --output=/home/<username>/logs/%x_%j.log
#SBATCH --error=/home/<username>/logs/%x_%j.log
USER_DIR="/home/${USER}"
CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
INPUT_DATA_PATH="s3://your-bucket/deduplicated-text-data/"
OUTPUT_DATA_PATH="s3://your-bucket/final-text-data/"
#
LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
AWS_MOUNT="${HOME}/.aws:/root/.aws"
CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT},${AWS_MOUNT}"
export NEMO_CURATOR_RAY_SLURM_JOB=1
srun \
--mpi=none \
--container-writable \
--no-container-remap-root \
--export=NEMO_CURATOR_RAY_SLURM_JOB \
--container-image "${CONTAINER_IMAGE}" \
--container-mounts "${CONTAINER_MOUNTS}" \
-- python3 -m nemo_curator.scripts.find_pii_and_deidentify \
--input-data-dir "${INPUT_DATA_PATH}" \
--output-data-dir "${OUTPUT_DATA_PATH}" \
--scheduler-address "127.0.0.1" \
--scheduler-port 8786 \
--device gpu \
--supported-entities "PERSON,EMAIL_ADDRESS,PHONE_NUMBER" \
--anonymize-action replace \
--batch-size 1000 \
--text-field "text"
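As a sanity check on the redaction output, you can stream a sample shard and scan it for raw email addresses; with the replace action you would expect placeholder tokens rather than real addresses. The shard name below is a placeholder:

```bash
# Scan a sample output shard for anything that still looks like an email address.
aws s3 cp "s3://your-bucket/final-text-data/<shard>.jsonl" - \
  | grep -E -o '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' | head
```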
Update all # Update Me! sections in the scripts for your environment (paths, usernames, S3 buckets, and so on), then submit each job with sbatch; a dependency-chained variant is shown after the list.
sbatch curator_text_clean.sh
sbatch curator_text_classify.sh
sbatch curator_text_dedup.sh
sbatch curator_text_redact.sh
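Because each stage reads the previous stage's output, you can also chain the submissions with Slurm job dependencies so that each job starts only after its predecessor completes successfully:

```bash
# Chain the four stages; each job starts only if the previous one exits OK.
clean_id=$(sbatch --parsable curator_text_clean.sh)
classify_id=$(sbatch --parsable --dependency=afterok:${clean_id} curator_text_classify.sh)
dedup_id=$(sbatch --parsable --dependency=afterok:${classify_id} curator_text_dedup.sh)
sbatch --dependency=afterok:${dedup_id} curator_text_redact.sh
```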
Monitoring and Logs#
Check job status:
squeue
View logs:
tail -f /path/to/logs/<jobname>_<jobid>.log
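If Slurm accounting is enabled on your cluster, sacct provides a per-job summary (state, elapsed time, peak memory) after a job finishes:

```bash
# Per-job summary; requires Slurm accounting to be enabled on the cluster.
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS,NodeList
```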
Performance Considerations#
Memory Requirements: Text deduplication can be memory-intensive. Consider increasing node memory or reducing batch sizes for large datasets.
Storage: Intermediate deduplication files can be large. Ensure sufficient storage in your cache directory.
Scaling: Semantic deduplication benefits from more GPU nodes. Consider 8+ nodes for datasets >1TB.
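For example, if the deduplication stage runs short of memory, you can request all of each node's memory and additional nodes in its SBATCH header. The values below are only a sketch and depend entirely on your cluster's node sizes:

```bash
# Example header adjustments for a larger deduplication run; tune to your cluster.
#SBATCH --nodes=16          # more nodes for very large (>1 TB) datasets
#SBATCH --mem=0             # request all memory available on each node
#SBATCH --time=24:00:00     # allow extra wall time for the larger shuffle
```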