utils.text_utils#

Module Contents#

Functions#

get_comments

Returns a string containing all comments

get_comments_and_docstring

Extract all natural text in source: comments + docstrings. The extraction fails if the source contains syntax errors.

get_docstrings

Parse Python source code from file or string and return its docstrings.

get_ngrams

get_paragraphs

get_sentences

get_word_splitter

For Chinese and Japanese text, we use external libraries to split the text, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.

get_words

is_paragraph_indices_in_top_or_bottom_only

parse_docstrings

Parse Python source code and return a list of (ast node instance, name, docstring) tuples, one for each function/method, class, and module.

remove_punctuation

Data#

API#

utils.text_utils.NODE_TYPES#

None

utils.text_utils.get_comments(s: str, clean: bool = False) str#

Returns a string containing all comments
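The library's implementation is not shown in this reference, but a minimal sketch of such a comment collector, assuming it is built on the standard `tokenize` module, could look like:

```python
import io
import tokenize

def get_comments(s: str, clean: bool = False) -> str:
    """Collect every comment token in Python source `s` into one string."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(s).readline):
        if tok.type == tokenize.COMMENT:
            text = tok.string
            if clean:
                # Strip the leading '#' and surrounding whitespace.
                text = text.lstrip("#").strip()
            comments.append(text)
    return "\n".join(comments)

# get_comments("x = 1  # set x\n", clean=True) → "set x"
```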

utils.text_utils.get_comments_and_docstring(
source: str,
comments: bool = True,
clean_comments: bool = False,
) tuple[str, str]#

Extract all natural text in source: comments + docstrings. The extraction fails if the source contains syntax errors.

Args:

source: the code to parse

comments: if True, extract comments too

clean_comments: if True, remove # from the extracted comments

Returns:

A tuple of two strings: the concatenated docstrings and the concatenated comments.
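A plausible sketch of this combined extractor, assuming it pairs `ast` (for docstrings) with `tokenize` (for comments) as the signature suggests, might be:

```python
import ast
import io
import tokenize

def get_comments_and_docstring(
    source: str,
    comments: bool = True,
    clean_comments: bool = False,
) -> tuple[str, str]:
    """Return (docstrings, comments) extracted from Python source.

    Raises a SyntaxError on malformed source, matching the documented
    behaviour that extraction fails on syntax errors.
    """
    docs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs.append(doc)

    comment_text = ""
    if comments:
        parts = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                text = tok.string
                if clean_comments:
                    text = text.lstrip("#").strip()
                parts.append(text)
        comment_text = "\n".join(parts)

    return "\n".join(docs), comment_text
```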

utils.text_utils.get_docstrings(source: str, module: str = '<string>') list[str]#

Parse Python source code from file or string and return its docstrings.
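Given the signature, a sketch of this helper built on the standard `ast` module (an assumption; the real implementation is not shown) could be:

```python
import ast

def get_docstrings(source: str, module: str = "<string>") -> list[str]:
    """Return the docstrings of the module and of every class/function in it."""
    tree = ast.parse(source, filename=module)
    docs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                docs.append(doc)
    return docs
```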

utils.text_utils.get_ngrams(input_list: list[str], n: int) list[tuple[str, ...]]#
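`get_ngrams` has no docstring, but its signature suggests the standard sliding-window n-gram construction; a minimal sketch:

```python
def get_ngrams(input_list: list[str], n: int) -> list[tuple[str, ...]]:
    """Return the contiguous n-grams over input_list as tuples."""
    # Zip n staggered views of the list; zip stops at the shortest view,
    # so a list shorter than n yields no n-grams.
    return list(zip(*(input_list[i:] for i in range(n))))

# get_ngrams(["to", "be", "or", "not"], 2) → [("to", "be"), ("be", "or"), ("or", "not")]
```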
utils.text_utils.get_paragraphs(document: str) list[str]#
utils.text_utils.get_sentences(document: str) list[str]#
utils.text_utils.get_word_splitter(
language: str,
) collections.abc.Callable[[str], list[str]]#

For Chinese and Japanese text, we use external libraries to split the text, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.

Args:

language (str): An ISO 639-1 language code. For example, "en" for English, "zh" for Chinese, and "ja" for Japanese.

Returns:

A function which splits a string into a list of words.
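The documented behaviour can be sketched as a small dispatcher. The external segmentation libraries used for Chinese and Japanese are not named in this reference, so the sketch leaves them as a placeholder and only implements the whitespace path:

```python
from collections.abc import Callable

def get_word_splitter(language: str) -> Callable[[str], list[str]]:
    """Return a word-splitting function for the given ISO 639-1 code."""
    if language in ("zh", "ja"):
        # Placeholder: the real implementation delegates to an external
        # segmentation library, which this documentation does not name.
        raise NotImplementedError(
            f"external segmenter needed for {language!r}"
        )
    # Default path: whitespace-delimited languages such as English.
    return str.split

# get_word_splitter("en")("the quick fox") → ["the", "quick", "fox"]
```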

utils.text_utils.get_words(text: str) tuple[list[str], list[int]]#
utils.text_utils.is_paragraph_indices_in_top_or_bottom_only(
boilerplate_paragraph_indices: list[int],
num_paragraphs: int,
) bool#
utils.text_utils.parse_docstrings(source: str) list[tuple[ast.AST, str | None, str]]#

Parse Python source code and return a list of (ast node instance, name, docstring) tuples, one for each function/method, class, and module.

utils.text_utils.remove_punctuation(str_in: str) str#
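`remove_punctuation` carries no docstring; a straightforward sketch matching its name and signature, assuming ASCII punctuation as defined by the standard `string` module, is:

```python
import string

def remove_punctuation(str_in: str) -> str:
    """Delete every ASCII punctuation character from str_in."""
    # str.maketrans with a third argument maps each listed character to None.
    return str_in.translate(str.maketrans("", "", string.punctuation))

# remove_punctuation("Hello, world!") → "Hello world"
```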