nemo_curator.stages.text.utils.text_utils


Module Contents

Functions

| Name | Description |
| --- | --- |
| get_comments | Returns a string containing all comments. |
| get_comments_and_docstring | - |
| get_docstrings | Parse Python source code from file or string and return its docstrings. |
| get_ngrams | - |
| get_paragraphs | - |
| get_sentences | - |
| get_word_splitter | For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. |
| get_words | - |
| is_paragraph_indices_in_top_or_bottom_only | - |
| parse_docstrings | Parse Python source code and yield a tuple of ast node instance, name, and docstring for each function/method, class and module. |
| remove_punctuation | - |

Data

NODE_TYPES

API

nemo_curator.stages.text.utils.text_utils.get_comments(
s: str,
clean: bool = False
) -> str

Returns a string containing all comments.
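To illustrate what collecting comments from source text can look like, here is a minimal sketch built on the standard library's `tokenize` module. It assumes the input is Python source; `extract_comments` is a hypothetical helper, not this module's implementation, and it does not model the `clean` flag.

```python
import io
import tokenize


def extract_comments(source: str) -> str:
    """Return all '#' comments in `source`, one per line (hypothetical helper)."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading '#' characters and surrounding whitespace.
            comments.append(tok.string.lstrip("#").strip())
    return "\n".join(comments)


example = "x = 1  # set x\n# standalone note\ny = '# not a comment'\n"
print(extract_comments(example))
# set x
# standalone note
```

Tokenizing rather than searching for `#` keeps hash characters inside string literals out of the result.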

nemo_curator.stages.text.utils.text_utils.get_comments_and_docstring(
source: str,
comments: bool = True,
clean_comments: bool = False
) -> tuple[str, str]

nemo_curator.stages.text.utils.text_utils.get_docstrings(
source: str,
module: str = '<string>'
) -> list[str]

Parse Python source code from file or string and return its docstrings as a list.

nemo_curator.stages.text.utils.text_utils.get_ngrams(
input_list: list[str],
n: int
) -> list[tuple[str, ...]]
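
The signature above maps a token list and a window size to a list of tuples. One common way to build such n-grams is sketched below; `ngrams` is a hypothetical stand-in that only mirrors the signature, and the library's behavior on edge cases may differ.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens (hypothetical helper)."""
    # Zip n staggered views of the list; zip stops at the shortest view,
    # so an input with fewer than n tokens yields an empty list.
    return list(zip(*(tokens[i:] for i in range(n))))


print(ngrams(["the", "quick", "brown", "fox"], 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```
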
nemo_curator.stages.text.utils.text_utils.get_paragraphs(
document: str
) -> list[str]

nemo_curator.stages.text.utils.text_utils.get_sentences(
document: str
) -> list[str]

nemo_curator.stages.text.utils.text_utils.get_word_splitter(
language: str
) -> collections.abc.Callable[[str], list[str]]

For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

Returns: A function which can be used to parse the words of a string into a list.

Parameters:

language
str

An ISO 639-1 language code. For example, “en” for English, “zh” for Chinese, and “ja” for Japanese.
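
The dispatch described above can be sketched as a function that returns a different callable per language code. `make_word_splitter` is a hypothetical name, and the per-character split used for "zh" and "ja" is only a stand-in for an external segmentation library, which this sketch does not call.

```python
from collections.abc import Callable


def make_word_splitter(language: str) -> Callable[[str], list[str]]:
    """Return a splitter for an ISO 639-1 code (hypothetical helper)."""
    if language.lower() in {"zh", "ja"}:
        # Stand-in only: a real splitter would call a segmentation library.
        # Splitting into characters just shows that whitespace is not enough.
        return lambda text: [ch for ch in text if not ch.isspace()]
    # Whitespace-separated languages: str.split() also collapses runs of spaces.
    return lambda text: text.split()


split_en = make_word_splitter("en")
print(split_en("hello  big world"))
# ['hello', 'big', 'world']
```

Returning a callable lets the caller choose the language once and reuse the splitter across many documents.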

nemo_curator.stages.text.utils.text_utils.get_words(
text: str
) -> tuple[list[str], list[int]]

nemo_curator.stages.text.utils.text_utils.is_paragraph_indices_in_top_or_bottom_only(
boilerplate_paragraph_indices: list[int],
num_paragraphs: int
) -> bool

nemo_curator.stages.text.utils.text_utils.parse_docstrings(
source: str
) -> list[tuple[ast.AST, str | None, str]]

Parse Python source code and yield a tuple of ast node instance, name, and docstring for each function/method, class and module.
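
The traversal described above can be sketched with the standard library's `ast` module. This is an illustration under assumptions, not the module's code: `iter_docstrings` and `DOC_NODES` are hypothetical names, and only the three node types visible in the truncated `NODE_TYPES` listing are covered.

```python
import ast

# Assumed subset of node types; the module's NODE_TYPES is shown truncated.
DOC_NODES = (ast.Module, ast.ClassDef, ast.FunctionDef)


def iter_docstrings(source: str):
    """Yield (node, name, docstring) for each module, class, and function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DOC_NODES):
            # Module nodes have no `name` attribute, hence the None default.
            yield node, getattr(node, "name", None), ast.get_docstring(node)


sample = (
    '"""Module doc."""\n\n'
    "class A:\n"
    '    """Class doc."""\n\n'
    "    def f(self):\n"
    '        """Method doc."""\n'
)
found = [(name, doc) for _, name, doc in iter_docstrings(sample)]
print(found)
# [(None, 'Module doc.'), ('A', 'Class doc.'), ('f', 'Method doc.')]
```

In this sketch, nodes without a docstring are still yielded, with `None` in the third position.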

nemo_curator.stages.text.utils.text_utils.remove_punctuation(
str_in: str
) -> str

nemo_curator.stages.text.utils.text_utils.NODE_TYPES = {ast.ClassDef: 'Class', ast.FunctionDef: 'Function/Method', ast.Module: 'Module'...