utils.text_utils#

Module Contents#

Functions#

get_comments

Returns a string containing all comments

get_comments_and_docstring

Extract all natural text in source: comments + docstrings. The extraction fails if the source contains syntax errors.

get_docstrings

Parse Python source code from file or string and return its docstrings.

get_ngrams

get_paragraphs

get_sentences

get_word_splitter

For Chinese and Japanese text, we use external libraries to split the text, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.

get_words

is_paragraph_indices_in_top_or_bottom_only

parse_docstrings

Parse Python source code and return a list of (ast node instance, name, docstring) tuples, one for each function/method, class, and module.

remove_punctuation

Data#

API#

utils.text_utils.NODE_TYPES#

None

utils.text_utils.get_comments(s: str, clean: bool = False) str#

Returns a string containing all comments
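The library's implementation is not shown in this reference, but a minimal sketch of such a comment collector, assuming it is built on the standard `tokenize` module, could look like:

```python
import io
import tokenize

def get_comments(s: str, clean: bool = False) -> str:
    """Collect every comment token in Python source `s` into one string."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(s).readline):
        if tok.type == tokenize.COMMENT:
            text = tok.string
            if clean:
                # Strip the leading '#' and surrounding whitespace.
                text = text.lstrip("#").strip()
            comments.append(text)
    return "\n".join(comments)

# get_comments("x = 1  # set x\n", clean=True) → "set x"
```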

utils.text_utils.get_comments_and_docstring(
source: str,
comments: bool = True,
clean_comments: bool = False,
) tuple[str, str]#

Extract all natural text in source: comments + docstrings. The extraction fails if the source contains syntax errors.

Args:

source: the code to parse

comments: if True, extract comments too

clean_comments: if True, remove # from the extracted comments

Returns:

A tuple of two strings: the concatenated docstrings and the concatenated comments.
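A plausible sketch of this combined extractor, assuming it pairs `ast` (for docstrings) with `tokenize` (for comments) as the signature suggests, might be:

```python
import ast
import io
import tokenize

def get_comments_and_docstring(
    source: str,
    comments: bool = True,
    clean_comments: bool = False,
) -> tuple[str, str]:
    """Return (docstrings, comments) extracted from Python source.

    Raises a SyntaxError on malformed source, matching the documented
    behaviour that extraction fails on syntax errors.
    """
    docs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs.append(doc)

    comment_text = ""
    if comments:
        parts = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                text = tok.string
                if clean_comments:
                    text = text.lstrip("#").strip()
                parts.append(text)
        comment_text = "\n".join(parts)

    return "\n".join(docs), comment_text
```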

utils.text_utils.get_docstrings(source: str, module: str = '<string>') list[str]#

Parse Python source code from file or string and return its docstrings.
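Given the signature, a sketch of this helper built on the standard `ast` module (an assumption; the real implementation is not shown) could be:

```python
import ast

def get_docstrings(source: str, module: str = "<string>") -> list[str]:
    """Return the docstrings of the module and of every class/function in it."""
    tree = ast.parse(source, filename=module)
    docs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                docs.append(doc)
    return docs
```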

utils.text_utils.get_ngrams(input_list: list[str], n: int) list[tuple[str, ...]]#
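`get_ngrams` has no docstring, but its signature suggests the standard sliding-window n-gram construction; a minimal sketch:

```python
def get_ngrams(input_list: list[str], n: int) -> list[tuple[str, ...]]:
    """Return the contiguous n-grams over input_list as tuples."""
    # Zip n staggered views of the list; zip stops at the shortest view,
    # so a list shorter than n yields no n-grams.
    return list(zip(*(input_list[i:] for i in range(n))))

# get_ngrams(["to", "be", "or", "not"], 2) → [("to", "be"), ("be", "or"), ("or", "not")]
```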
utils.text_utils.get_paragraphs(document: str) list[str]#
utils.text_utils.get_sentences(document: str) list[str]#
utils.text_utils.get_word_splitter(
language: str,
) collections.abc.Callable[[str], list[str]]#

For Chinese and Japanese text, we use external libraries to split the text, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.

Args:

language (str): An ISO 639-1 language code. For example, "en" for English, "zh" for Chinese, and "ja" for Japanese.

Returns:

A function which splits a string into a list of words.
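The documented behaviour can be sketched as a small dispatcher. The external segmentation libraries used for Chinese and Japanese are not named in this reference, so the sketch leaves them as a placeholder and only implements the whitespace path:

```python
from collections.abc import Callable

def get_word_splitter(language: str) -> Callable[[str], list[str]]:
    """Return a word-splitting function for the given ISO 639-1 code."""
    if language in ("zh", "ja"):
        # Placeholder: the real implementation delegates to an external
        # segmentation library, which this documentation does not name.
        raise NotImplementedError(
            f"external segmenter needed for {language!r}"
        )
    # Default path: whitespace-delimited languages such as English.
    return str.split

# get_word_splitter("en")("the quick fox") → ["the", "quick", "fox"]
```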

utils.text_utils.get_words(text: str) tuple[list[str], list[int]]#
utils.text_utils.is_paragraph_indices_in_top_or_bottom_only(
boilerplate_paragraph_indices: list[int],
num_paragraphs: int,
) bool#
utils.text_utils.parse_docstrings(source: str) list[tuple[ast.AST, str | None, str]]#

Parse Python source code and return a list of (ast node instance, name, docstring) tuples, one for each function/method, class, and module.

utils.text_utils.remove_punctuation(str_in: str) str#
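`remove_punctuation` carries no docstring; a straightforward sketch matching its name and signature, assuming ASCII punctuation as defined by the standard `string` module, is:

```python
import string

def remove_punctuation(str_in: str) -> str:
    """Delete every ASCII punctuation character from str_in."""
    # str.maketrans with a third argument maps each listed character to None.
    return str_in.translate(str.maketrans("", "", string.punctuation))

# remove_punctuation("Hello, world!") → "Hello world"
```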