nemo_curator.stages.text.download.common_crawl.extract
Module Contents
Classes
API
Bases: DocumentExtractor
Extract text from HTML content in the record.
Takes a record dict containing a “content” field with HTML and returns a new dict with only the output columns: url, warc_id, source_id, language, text.
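The record-in, record-out contract described above can be illustrated with a minimal standalone sketch. This is not the library's implementation: `extract_record`, `_TextCollector`, and `OUTPUT_COLUMNS` are hypothetical names, the text extraction uses only the standard-library `html.parser`, the `language` value is a caller-supplied placeholder rather than a detected language, and it is assumed that `url`, `warc_id`, and `source_id` are already present on the input record. Returning `None` for pages with no text is also an assumption.

```python
from html.parser import HTMLParser

# Columns named in the description above; every other input key is dropped.
OUTPUT_COLUMNS = ("url", "warc_id", "source_id", "language", "text")


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_record(record, language="UNKNOWN"):
    """Sketch of the contract: HTML in "content" -> dict with output columns.

    `language` is a placeholder argument (assumption); no detection is done.
    """
    content = record["content"]
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")

    parser = _TextCollector()
    parser.feed(content)
    parser.close()
    text = "\n".join(parser.chunks)
    if not text:
        return None  # assumption: records without text are dropped

    merged = {**record, "language": language, "text": text}
    return {key: merged.get(key) for key in OUTPUT_COLUMNS}
```

Called on a record such as `{"url": ..., "warc_id": ..., "source_id": ..., "content": "<p>Hello</p>"}`, the sketch returns a new dict holding exactly the five output columns; the original `content` field and any other input keys are not carried over.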