morpheus.utils.cudf_subword_helper
Wrapper around cudf’s subword tokenizer
Functions

- create_tokenizer(vocab_hash_file, do_lower_case)
  Create a cudf subword tokenizer from a vocabulary hash file.
- create_vocab_table(vocabpath)
  Create Vocabulary tables from the vocab.txt file.
- get_cached_tokenizer(vocab_hash_file, ...)
  Get cached subword tokenizer.
- tokenize_text_series(vocab_hash_file, ...)
  Tokenize a text series using the BERT subword tokenizer and vocab-hash.

Classes

- Feature(input_ids, input_mask, segment_ids)
  Named tuple holding the tokenizer output (input_ids, input_mask, segment_ids).
- create_tokenizer(vocab_hash_file, do_lower_case)[source]
Create a cudf subword tokenizer from a vocabulary hash file.
- Parameters
- vocab_hash_file
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Subword tokenizer
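For illustration, a minimal sketch of calling create_tokenizer; the file name "bert-base-uncased-hash.txt" is only an example path and should point to a hash file produced by cudf.utils.hash_vocab_utils.hash_vocab for your vocabulary.

```python
# Example sketch: build a SubwordTokenizer from a pre-built vocabulary hash file.
# "bert-base-uncased-hash.txt" is a hypothetical path, not a bundled resource.
from morpheus.utils.cudf_subword_helper import create_tokenizer

tokenizer = create_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
# tokenizer is a cudf.core.subword_tokenizer.SubwordTokenizer and can be
# called directly on a cudf.Series of strings.
```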
- create_vocab_table(vocabpath)[source]
Create Vocabulary tables from the vocab.txt file
- Parameters
- vocabpath
Path of vocabulary file.
- Returns
- np.array
id2vocab: np.array, dtype=<U5
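A minimal sketch of building the id-to-token lookup array from a plain vocab.txt file (one token per line); "vocab.txt" is an example path.

```python
# Example sketch: load the vocabulary into an id -> token numpy array.
from morpheus.utils.cudf_subword_helper import create_vocab_table

id2vocab = create_vocab_table("vocab.txt")  # "vocab.txt" is a hypothetical path
print(id2vocab[:5])  # first five tokens in the vocabulary
```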
- get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]
Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.
- Parameters
- vocab_hash_file
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Cached subword tokenizer
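A minimal sketch of the caching behavior described above: repeated calls with the same arguments are expected to return the already-created tokenizer rather than re-reading the hash file. The path is an example only.

```python
# Example sketch: repeated calls should reuse the cached tokenizer.
from morpheus.utils.cudf_subword_helper import get_cached_tokenizer

tok_a = get_cached_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
tok_b = get_cached_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
# Expected to hold given the caching described above.
assert tok_a is tok_b
```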
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]
Tokenize a text series using the BERT subword tokenizer and vocab-hash.
- Parameters
- vocab_hash_file
Vocabulary hash file to use (created using perfect_hash.py with the compact flag).
- do_lower_case
If set to true, original text will be lowercased before encoding.
- text_ser
Text Series to tokenize.
- seq_len
Sequence length to use (two special tokens are added for the NER classification job).
- stride
Stride for the tokenizer.
- truncation
If set to true, strings will be truncated and padded to max_length. Each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when the max_length is smaller than the number of generated tokens.
- add_special_tokens
Whether or not to encode the sequences with the special tokens of the BERT classification model.
- Returns
- collections.namedtuple
A Feature namedtuple with the fields input_ids, input_mask, and segment_ids.
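A minimal sketch of tokenizing a cudf string Series; the hash file path and the sequence parameters are example values, not defaults of the function, and keyword arguments are used to make the call self-describing.

```python
# Example sketch: tokenize a small cudf Series with illustrative parameters.
import cudf

from morpheus.utils.cudf_subword_helper import tokenize_text_series

text_ser = cudf.Series([
    "Morpheus tokenizes text on the GPU.",
    "Subword tokenization splits rare words.",
])

feature = tokenize_text_series(
    vocab_hash_file="bert-base-uncased-hash.txt",  # hypothetical path
    do_lower_case=True,
    text_ser=text_ser,
    seq_len=128,
    stride=64,
    truncation=True,
    add_special_tokens=True,
)

# feature is the Feature namedtuple documented above; its fields are expected
# to be device arrays of token ids, attention mask, and segment metadata.
print(feature.input_ids.shape, feature.input_mask.shape, feature.segment_ids.shape)
```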