stages.text.download.utils
Module Contents
Functions
decode_html |
detect_language | Detect language using cld2.
lang_detect | Detect language from text.
remove_control_characters | Remove control characters from text. Control characters are non-printable characters in the Unicode standard that control how text is displayed or processed.
try_decode_with_detected_encoding |
API
- stages.text.download.utils.decode_html(html_bytes: bytes) -> str | None
- stages.text.download.utils.detect_language(text: str) -> tuple[bool, int, list[tuple[str, str, float, int]]]
Detect language using cld2.
Returns: tuple[bool, int, list[tuple[str, str, float, int]]]: A tuple of three items:
  is_reliable (bool): True if the detection is high confidence.
  textBytesFound (int): The number of bytes of text found.
  details (list[tuple[str, str, float, int]]): A list of up to three detected languages, each a tuple containing:
    language name (str)
    language code (str)
    percent (float): the percentage of the text that is in this language
    score (int): how confident the detection is
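The return value described above can be unpacked as in the following minimal sketch. It does not call the library: `summarize_detection` is a hypothetical helper, and `sample` is a hand-written value that only mirrors the documented shape, not real detector output.

```python
from typing import Optional

# Documented shape: (is_reliable, textBytesFound, details), where each
# details entry is (language name, language code, percent, score).
Detail = tuple[str, str, float, int]
Result = tuple[bool, int, list[Detail]]


def summarize_detection(result: Result) -> Optional[str]:
    """Hypothetical helper: describe the language with the largest share."""
    is_reliable, text_bytes_found, details = result
    if not is_reliable or not details:
        return None  # low-confidence or empty result
    # Pick the entry with the highest percent rather than assuming an order.
    name, code, percent, _score = max(details, key=lambda d: d[2])
    return f"{code} ({name}): {percent:.0f}% of {text_bytes_found} bytes"


# Illustrative value only; not produced by cld2.
sample: Result = (
    True,
    58,
    [("ENGLISH", "en", 98.0, 1234), ("Unknown", "un", 0.0, 0)],
)
print(summarize_detection(sample))
```

Checking `is_reliable` before reading `details` keeps low-confidence results from being treated as a definite language.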
- stages.text.download.utils.lang_detect(text: str) -> str
Detect language from text.
Args: text (str): Text to detect language from.
Returns: str: The most likely language code.
- stages.text.download.utils.remove_control_characters(text: str) -> str
Remove control characters from text. Control characters are non-printable characters in the Unicode standard that control how text is displayed or processed.
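One common way to do this kind of filtering is to drop every character whose Unicode general category starts with "C". The sketch below uses only the standard library; `strip_control_characters` is a hypothetical name, and whether the library function applies exactly this test is an assumption.

```python
import unicodedata


def strip_control_characters(text: str) -> str:
    """Drop characters whose Unicode general category starts with 'C'.

    This covers Cc (control) and Cf (format), and also surrogates,
    private-use, and unassigned code points. It is an assumed approach,
    not necessarily what remove_control_characters does.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


print(strip_control_characters("a\x00b\tc\n"))
```

Note that this test also removes tabs and newlines, since both are in category Cc.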
- stages.text.download.utils.try_decode_with_detected_encoding(html_bytes: bytes) -> str | None
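Both decoding functions are documented as returning `str | None`, so callers need to handle the `None` case. The sketch below illustrates that contract with the standard library only. `decode_or_none` and its fixed candidate list are hypothetical; the sketch does no encoding detection and is not the library's implementation.

```python
from typing import Optional, Sequence


def decode_or_none(
    data: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Optional[str]:
    """Hypothetical helper: try each candidate encoding, return None if all fail."""
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    return None


html = decode_or_none(b"<p>caf\xc3\xa9</p>")
if html is None:
    print("could not decode; skipping document")
else:
    print(html)
```

Returning `None` instead of raising lets a download pipeline skip undecodable pages without wrapping every call in a try block.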