NeMo-Curator

stages.text.download.wikipedia

Submodules

  • stages.text.download.wikipedia.download
  • stages.text.download.wikipedia.extract
  • stages.text.download.wikipedia.iterator
  • stages.text.download.wikipedia.stage
  • stages.text.download.wikipedia.url_generation


Copyright © 2025 NVIDIA Corporation.