morpheus.utils.cudf_subword_helper
Wrapper around cudf’s subword tokenizer
Functions
<a href="#morpheus.utils.cudf_subword_helper.create_tokenizer">create_tokenizer</a> (vocab_hash_file, do_lower_case) |
Create a subword tokenizer from a vocabulary hash file. |
<a href="#morpheus.utils.cudf_subword_helper.create_vocab_table">create_vocab_table</a> (vocabpath) |
Create a vocabulary table from the vocab.txt file |
<a href="#morpheus.utils.cudf_subword_helper.get_cached_tokenizer">get_cached_tokenizer</a> (vocab_hash_file, ...) |
Get cached subword tokenizer. |
<a href="#morpheus.utils.cudf_subword_helper.tokenize_text_series">tokenize_text_series</a> (vocab_hash_file, ...) |
This function tokenizes a text series using the BERT subword tokenizer and vocab-hash |
Classes
<a href="morpheus.utils.cudf_subword_helper.Feature.html#morpheus.utils.cudf_subword_helper.Feature">Feature</a> (input_ids, input_mask, segment_ids) |
Named tuple of tokenizer output: input_ids, input_mask, and segment_ids. |
- create_tokenizer(vocab_hash_file, do_lower_case)[source]
Create a subword tokenizer from a vocabulary hash file.
- Parameters
- vocab_hash_file : str
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Subword tokenizer
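A minimal usage sketch (file names are placeholders; the hash file is produced from a raw vocab.txt with cudf.utils.hash_vocab_utils.hash_vocab, as noted above):

```python
from cudf.utils.hash_vocab_utils import hash_vocab

from morpheus.utils.cudf_subword_helper import create_tokenizer

# Placeholder file names: a raw BERT-style vocabulary and the hash file
# generated from it.
hash_vocab("vocab.txt", "vocab_hash.txt")

# Build a cudf SubwordTokenizer that lowercases text before encoding.
tokenizer = create_tokenizer("vocab_hash.txt", do_lower_case=True)
```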
- create_vocab_table(vocabpath)[source]
Create a vocabulary table from the vocab.txt file
- Parameters
- vocabpath : str
Path of the vocabulary file
- Returns
- np.array
id2vocab : np.array with dtype <U5, mapping token ids to vocabulary tokens
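A short sketch, assuming vocab.txt is a standard BERT-style vocabulary file with one token per line (the path is a placeholder):

```python
from morpheus.utils.cudf_subword_helper import create_vocab_table

# Placeholder path to a vocab.txt file (one token per line).
id2vocab = create_vocab_table("vocab.txt")

# id2vocab[i] is the token string whose token-id is i.
print(id2vocab[:5])
```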
- get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]
Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.
- Parameters
- vocab_hash_file : str
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Cached subword tokenizer
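A brief sketch of the caching behavior (placeholder path; the assertion assumes the cache is keyed on the vocab hash file and the casing flag):

```python
from morpheus.utils.cudf_subword_helper import get_cached_tokenizer

# Placeholder path to an existing vocabulary hash file.
tok_a = get_cached_tokenizer("vocab_hash.txt", do_lower_case=True)
tok_b = get_cached_tokenizer("vocab_hash.txt", do_lower_case=True)

# Assuming the cache is keyed on (vocab_hash_file, do_lower_case),
# the second call reuses the tokenizer created by the first.
assert tok_a is tok_b
```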
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]
This function tokenizes a text series using the BERT subword tokenizer and vocab-hash
- Parameters
- vocab_hash_file : str
Path of the vocabulary hash file to use (created using perfect_hash.py with the compact flag)
- do_lower_case : bool
If set to true, original text will be lowercased before encoding.
- text_ser : cudf.Series
Text series to tokenize
- seq_len : int
Sequence length to use (two special tokens are added for the NER classification job)
- stride : int
Stride for the tokenizer
- truncation : bool
If set to true, strings will be truncated and padded to max_length; each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when the max_length is smaller than the number of generated tokens.
- add_special_tokens : bool
Whether or not to encode the sequences with the special tokens of the BERT classification model.
- Returns
- collections.namedtuple
A named tuple with the fields input_ids, input_mask, and segment_ids
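An illustrative sketch (the hash file path is a placeholder; seq_len and stride values are arbitrary):

```python
import cudf

from morpheus.utils.cudf_subword_helper import tokenize_text_series

text_ser = cudf.Series(["the quick brown fox", "jumps over the lazy dog"])

# Placeholder hash-file path; seq_len/stride are chosen for illustration only.
tokens = tokenize_text_series(vocab_hash_file="vocab_hash.txt",
                              do_lower_case=True,
                              text_ser=text_ser,
                              seq_len=128,
                              stride=64,
                              truncation=True,
                              add_special_tokens=True)

# The returned named tuple exposes the three token tensors.
print(tokens.input_ids.shape)
print(tokens.input_mask.shape)
print(tokens.segment_ids.shape)
```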