morpheus.utils.cudf_subword_helper
Wrapper around cudf’s subword tokenizer
Functions

- create_tokenizer(vocab_hash_file, do_lower_case)
  Create a cudf subword tokenizer from a vocabulary hash file.
- create_vocab_table(vocabpath)
  Create Vocabulary tables from the vocab.txt file.
- get_cached_tokenizer(vocab_hash_file, ...)
  Get cached subword tokenizer.
- tokenize_text_series(vocab_hash_file, ...)
  Tokenize a text series using the BERT subword tokenizer and vocab-hash.

Classes

- Feature(input_ids, input_mask, segment_ids)
  Named tuple holding the tokenizer output (input_ids, input_mask, segment_ids).
- create_tokenizer(vocab_hash_file, do_lower_case)[source]
Create a cudf subword tokenizer from a vocabulary hash file.
- Parameters
- vocab_hash_file
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Subword tokenizer
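For illustration, a minimal sketch of calling create_tokenizer; the file name "bert-base-uncased-hash.txt" is only an example path and should point to a hash file produced by cudf.utils.hash_vocab_utils.hash_vocab for your vocabulary.

```python
# Example sketch: build a SubwordTokenizer from a pre-built vocabulary hash file.
# "bert-base-uncased-hash.txt" is a hypothetical path, not a bundled resource.
from morpheus.utils.cudf_subword_helper import create_tokenizer

tokenizer = create_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
# tokenizer is a cudf.core.subword_tokenizer.SubwordTokenizer and can be
# called directly on a cudf.Series of strings.
```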
- create_vocab_table(vocabpath)[source]
Create Vocabulary tables from the vocab.txt file
- Parameters
- vocabpath
Path of vocabulary file.
- Returns
- np.array
id2vocab: np.array, dtype=<U5
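A minimal sketch of building the id-to-token lookup array from a plain vocab.txt file (one token per line); "vocab.txt" is an example path.

```python
# Example sketch: load the vocabulary into an id -> token numpy array.
from morpheus.utils.cudf_subword_helper import create_vocab_table

id2vocab = create_vocab_table("vocab.txt")  # "vocab.txt" is a hypothetical path
print(id2vocab[:5])  # first five tokens in the vocabulary
```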
- get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]
Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.
- Parameters
- vocab_hash_file
Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case
If set to true, original text will be lowercased before encoding.
- Returns
- cudf.core.subword_tokenizer.SubwordTokenizer
Cached subword tokenizer
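A minimal sketch of the caching behavior described above: repeated calls with the same arguments are expected to return the already-created tokenizer rather than re-reading the hash file. The path is an example only.

```python
# Example sketch: repeated calls should reuse the cached tokenizer.
from morpheus.utils.cudf_subword_helper import get_cached_tokenizer

tok_a = get_cached_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
tok_b = get_cached_tokenizer("bert-base-uncased-hash.txt", do_lower_case=True)
# Expected to hold given the caching described above.
assert tok_a is tok_b
```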
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]
Tokenize a text series using the BERT subword tokenizer and vocab-hash.
- Parameters
- vocab_hash_file
Vocabulary hash file to use (created using perfect_hash.py with the compact flag).
- do_lower_case
If set to true, original text will be lowercased before encoding.
- text_ser
Text Series to tokenize.
- seq_len
Sequence length to use (two special tokens are added for the NER classification job).
- stride
Stride for the tokenizer.
- truncation
If set to true, strings will be truncated and padded to max_length. Each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when the max_length is smaller than the number of generated tokens.
- add_special_tokens
Whether or not to encode the sequences with the special tokens of the BERT classification model.
- Returns
- collections.namedtuple
A Feature namedtuple with the fields input_ids, input_mask, and segment_ids.
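A minimal sketch of tokenizing a cudf string Series; the hash file path and the sequence parameters are example values, not defaults of the function, and keyword arguments are used to make the call self-describing.

```python
# Example sketch: tokenize a small cudf Series with illustrative parameters.
import cudf

from morpheus.utils.cudf_subword_helper import tokenize_text_series

text_ser = cudf.Series([
    "Morpheus tokenizes text on the GPU.",
    "Subword tokenization splits rare words.",
])

feature = tokenize_text_series(
    vocab_hash_file="bert-base-uncased-hash.txt",  # hypothetical path
    do_lower_case=True,
    text_ser=text_ser,
    seq_len=128,
    stride=64,
    truncation=True,
    add_special_tokens=True,
)

# feature is the Feature namedtuple documented above; its fields are expected
# to be device arrays of token ids, attention mask, and segment metadata.
print(feature.input_ids.shape, feature.input_mask.shape, feature.segment_ids.shape)
```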