morpheus.utils.cudf_subword_helper#
Wrapper around cudf’s subword tokenizer
Functions

| Function | Summary |
|---|---|
| create_tokenizer(vocab_hash_file, do_lower_case) | Create a subword tokenizer from a vocabulary hash file |
| create_vocab_table(vocabpath) | Create Vocabulary tables from the vocab.txt file |
| get_cached_tokenizer(vocab_hash_file, do_lower_case) | Get cached subword tokenizer |
| tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens) | Tokenizes a text series using the BERT subword_tokenizer and vocab-hash |
- create_tokenizer(vocab_hash_file, do_lower_case)[source]#
Create a subword tokenizer from the supplied vocabulary hash file.
- Parameters:
- vocab_hash_file : str
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- Returns:
- cudf.core.subword_tokenizer.SubwordTokenizer
Subword tokenizer
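A minimal usage sketch, assuming a vocabulary hash file is already available on disk (the path below is a placeholder):

```python
# Sketch only: the hash-file path is a placeholder; such a file can be produced
# from a raw vocab.txt with cudf.utils.hash_vocab_utils.hash_vocab.
from morpheus.utils.cudf_subword_helper import create_tokenizer

tokenizer = create_tokenizer(
    vocab_hash_file="bert-vocab-hash.txt",  # placeholder path
    do_lower_case=True,
)
# Returns a cudf.core.subword_tokenizer.SubwordTokenizer instance.
```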
- create_vocab_table(vocabpath)[source]#
Create Vocabulary tables from the vocab.txt file
- Parameters:
- vocabpath : str
Path of vocabulary file
- Returns:
- np.array
id2vocab: array mapping each token id to its vocabulary string (dtype=<U5)
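A short sketch of loading the id-to-token table, assuming a plain-text vocab.txt with one token per line (the path is a placeholder):

```python
# Sketch only: the vocab.txt path is a placeholder.
from morpheus.utils.cudf_subword_helper import create_vocab_table

id2vocab = create_vocab_table("bert-vocab.txt")  # placeholder path
print(id2vocab[:5])  # first few vocabulary entries, indexed by token id
```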
- get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]#
Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.
- Parameters:
- vocab_hash_file : str
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- Returns:
- cudf.core.subword_tokenizer.SubwordTokenizer
Cached subword tokenizer
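A sketch of the caching behavior: a repeated call with the same arguments is expected to reuse the tokenizer built on the first call rather than re-reading the hash file (the path is a placeholder):

```python
# Sketch only: the hash-file path is a placeholder.
from morpheus.utils.cudf_subword_helper import get_cached_tokenizer

tokenizer = get_cached_tokenizer("bert-vocab-hash.txt", do_lower_case=True)
# A second call with the same arguments should return the cached tokenizer.
tokenizer_again = get_cached_tokenizer("bert-vocab-hash.txt", do_lower_case=True)
```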
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]#
This function tokenizes a text series using the BERT subword_tokenizer and vocab-hash.
- Parameters:
- vocab_hash_file : str
Path of the vocab_hash_file to use (created using perfect_hash.py with the compact flag).
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- text_ser : cudf.Series
Text Series to tokenize
- seq_len : int
Sequence length to use (two special tokens are added for the NER classification job)
- stride : int
Stride for the tokenizer
- truncation : bool
If set to true, strings will be truncated and padded to max_length; each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when max_length is smaller than the number of generated tokens.
- add_special_tokens : bool
Whether or not to encode the sequences with the special tokens of the BERT classification model.
- Returns:
- collections.namedtuple
A named tuple with the fields input_ids, input_mask, and segment_ids.
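A minimal end-to-end sketch, assuming a vocabulary hash file is available; the path, seq_len, and stride values are illustrative only:

```python
# Sketch only: the hash-file path is a placeholder and the lengths are illustrative.
import cudf

from morpheus.utils.cudf_subword_helper import tokenize_text_series

text_ser = cudf.Series(["GPU accelerated inference", "hello world"])

features = tokenize_text_series(
    vocab_hash_file="bert-vocab-hash.txt",  # placeholder path
    do_lower_case=True,
    text_ser=text_ser,
    seq_len=128,
    stride=64,
    truncation=True,          # exactly one output sequence per input string
    add_special_tokens=True,  # include the BERT special tokens
)

# Per the docs above, the result is a named tuple with input_ids, input_mask,
# and segment_ids fields.
print(features.input_ids)
```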