nemo_curator.stages.text.download.utils
Module Contents
Functions
API
Check if s5cmd is installed.
s5cmd is a command-line tool for interacting with S3-compatible storage. This function checks if it’s available in the system PATH.
Returns: bool
True if s5cmd is installed and accessible, False otherwise.
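A PATH check of this kind can be sketched with the standard library. This is a minimal illustration, not the module's actual implementation; the function name `s5cmd_available` is hypothetical and chosen only for this example.

```python
import shutil


def s5cmd_available() -> bool:
    """Return True if an executable named s5cmd is found on PATH.

    Hypothetical helper for illustration; the library's own function
    may perform the check differently.
    """
    # shutil.which returns the resolved path, or None when nothing matches.
    return shutil.which("s5cmd") is not None


if __name__ == "__main__":
    if s5cmd_available():
        print("s5cmd found on PATH")
    else:
        print("s5cmd not found on PATH")
```

A caller would typically run such a check once, before choosing an s5cmd-based download path, so that a missing binary surfaces early rather than as a failed subprocess call.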
Detect language using cld2.
Returns: tuple[bool, int, list[tuple[str, str, float, int]]]
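A result of this shape could be consumed as sketched below. The field meanings are an assumption, not confirmed by this page: the bool is read as a reliability flag, the int as the number of text bytes examined, and each list entry as (language name, language code, percent, score). The helper `top_language_code` and its `default` value are hypothetical.

```python
# Type alias matching the documented return annotation.
Detection = tuple[bool, int, list[tuple[str, str, float, int]]]


def top_language_code(result: Detection, default: str = "unknown") -> str:
    """Pick the language code with the highest percent from a detection result.

    Assumes (not confirmed by the source) that the tuple is
    (is_reliable, bytes_found, details) and that each details entry is
    (language_name, language_code, percent, score).
    """
    is_reliable, _bytes_found, details = result
    if not is_reliable or not details:
        # Fall back when the detector was unsure or returned no candidates.
        return default
    # Select by percent instead of relying on the list being pre-sorted.
    _name, code, _percent, _score = max(details, key=lambda entry: entry[2])
    return code


if __name__ == "__main__":
    sample: Detection = (True, 42, [("ENGLISH", "en", 95.0, 1200), ("FRENCH", "fr", 5.0, 300)])
    print(top_language_code(sample))
```

Selecting by percent avoids depending on the order of the candidate list, which this page does not specify.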
Detect language from text.
Parameters:
text
Text to detect language from.
Returns: str
The most likely language code.
Remove control characters from text. Control characters are non-printable characters in the Unicode standard that control how text is displayed or processed.
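One common way to do this in Python is to filter on Unicode general categories. The sketch below drops every character in the "C" group (Cc, Cf, Cs, Co, Cn); whether the library's function removes the whole group or only a subset is not stated here, so treat the exact category set as an assumption. The name `strip_control_characters` is hypothetical.

```python
import unicodedata


def strip_control_characters(text: str) -> str:
    """Drop characters whose Unicode general category starts with "C".

    Illustrative only: this also removes tabs, newlines, and format
    characters such as zero-width spaces, which may be broader than
    what the documented function removes.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


if __name__ == "__main__":
    dirty = "a\x00b\tc\u200bd"
    print(strip_control_characters(dirty))
```

Because newlines and tabs fall in category Cc, a filter like this flattens line structure; a caller that needs to keep line breaks would have to exempt them explicitly.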