
Copyright © 2026, NVIDIA Corporation.


nemo_curator.stages.text.download.common_crawl.extract

Module Contents

Classes

Name | Description
CommonCrawlHTMLExtractor | -

API

class nemo_curator.stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor(
algorithm: nemo_curator.stages.text.download.html_extractors.HTMLExtractorAlgorithm | str | None = None,
algorithm_kwargs: dict | None = None,
stop_lists: dict[str, frozenset[str]] | None = None
)

Bases: DocumentExtractor

nemo_curator.stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor.extract(
record: dict[str, typing.Any]
) -> dict[str, typing.Any] | None

Extract text from HTML content in the record.

Takes a record dict containing a “content” field with HTML and returns a new dict containing only the output columns: url, warc_id, source_id, language, text.

nemo_curator.stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor.input_columns() -> list[str]
nemo_curator.stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor.output_columns() -> list[str]