nemo_curator.stages.text.download.utils

Module Contents

Functions

| Name | Description |
| --- | --- |
| check_s5cmd_installed | Check if s5cmd is installed. |
| decode_html | - |
| detect_language | Detect language using cld2. |
| lang_detect | Detect language from text. |
| remove_control_characters | Remove control characters from text. |
| try_decode_with_detected_encoding | - |

API

nemo_curator.stages.text.download.utils.check_s5cmd_installed() -> bool

Check if s5cmd is installed.

s5cmd is a command-line tool for interacting with S3-compatible storage. This function checks if it’s available in the system PATH.

Returns: bool

True if s5cmd is installed and accessible, False otherwise.
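A PATH check of this kind can be approximated with the standard library. The sketch below is only an assumption about how such a check might be written, not the library's confirmed implementation; `is_tool_on_path` is a hypothetical helper name.

```python
import shutil


def is_tool_on_path(tool_name: str) -> bool:
    """Return True if an executable called tool_name can be found on the PATH."""
    # shutil.which returns the resolved path, or None when nothing matches.
    return shutil.which(tool_name) is not None


if __name__ == "__main__":
    # Hypothetical usage: report the problem early instead of failing mid-download.
    if not is_tool_on_path("s5cmd"):
        print("s5cmd not found on PATH; install it before downloading from S3.")
```

Checking up front keeps a missing binary from surfacing later as a harder-to-read subprocess error.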

nemo_curator.stages.text.download.utils.decode_html(
html_bytes: bytes
) -> str | None

nemo_curator.stages.text.download.utils.detect_language(
text: str
) -> tuple[bool, int, list[tuple[str, str, float, int]]]

Detect language using cld2.

Returns: tuple[bool, int, list[tuple[str, str, float, int]]]

The detection result reported by cld2.

nemo_curator.stages.text.download.utils.lang_detect(
text: str
) -> str

Detect language from text.

Parameters:

text
str

Text to detect language from.

Returns: str

The most likely language code.

nemo_curator.stages.text.download.utils.remove_control_characters(
text: str
) -> str

Remove control characters from text. Control characters are non-printable characters in the Unicode standard that control how text is displayed or processed.
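One common way to filter such characters is by Unicode general category. The sketch below assumes a rule that drops every character whose category starts with "C"; the library's exact rule is not stated here, and `strip_control_characters` is a hypothetical name.

```python
import unicodedata


def strip_control_characters(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'C'."""
    # Categories starting with "C" are Cc (control), Cf (format), Cs (surrogate),
    # Co (private use) and Cn (unassigned). Note that "\n" and "\t" are Cc.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


print(strip_control_characters("Hello\x00 wor\x07ld\n"))  # prints "Hello world"
```

Under this rule newlines and tabs are removed as well, so a caller that needs to preserve line structure would have to exempt them explicitly.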

nemo_curator.stages.text.download.utils.try_decode_with_detected_encoding(
html_bytes: bytes
) -> str | None