Functions

- create_tokenizer(vocab_hash_file, do_lower_case)
  Create a subword tokenizer from a vocabulary hash file.
- create_vocab_table(vocabpath)
  Create vocabulary tables from the vocab.txt file.
- get_cached_tokenizer(vocab_hash_file, do_lower_case)
  Get cached subword tokenizer.
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)
  Tokenize a text series using the BERT subword tokenizer and vocab-hash.
- create_tokenizer(vocab_hash_file, do_lower_case)[source]
  Create a subword tokenizer from a vocabulary hash file.
  - Parameters
    - vocab_hash_file : str
      Path to the hash file containing the vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
    - do_lower_case : bool
      If set to true, original text will be lowercased before encoding.
  - Returns
    - cudf.core.subword_tokenizer.SubwordTokenizer
      Subword tokenizer
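A minimal usage sketch follows. The module exposing create_tokenizer is not named on this page, so the import path tokenizer_utils is a placeholder assumption; file paths are illustrative.

```python
import cudf
from cudf.utils.hash_vocab_utils import hash_vocab

from tokenizer_utils import create_tokenizer  # placeholder module path

# Build the vocab-hash file once from a raw BERT vocab.txt (paths illustrative).
hash_vocab("vocab.txt", "vocab-hash.txt")

tokenizer = create_tokenizer("vocab-hash.txt", do_lower_case=True)

# The returned cudf SubwordTokenizer is callable on a cudf.Series of strings.
output = tokenizer(
    cudf.Series(["GPU tokenization is fast"]),
    max_length=32,
    max_num_rows=1,
    truncation=True,
)
print(output["input_ids"].shape)  # cupy array of shape (max_num_rows, max_length)
```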
- create_vocab_table(vocabpath)[source]
  Create vocabulary tables from the vocab.txt file.
  - Parameters
    - vocabpath : str
      Path of the vocabulary file
  - Returns
    - np.array
      id2vocab: np.array, dtype=<U5
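Since this page gives only the signature, here is a minimal sketch of the equivalent behaviour, assuming the standard one-token-per-line vocab.txt format used by BERT; the actual implementation may differ.

```python
import numpy as np

def create_vocab_table_sketch(vocabpath: str) -> np.ndarray:
    """Sketch: read one token per line; the row index doubles as the token-id."""
    with open(vocabpath, encoding="utf-8") as f:
        id2vocab = np.array([line.strip() for line in f])
    return id2vocab  # dtype becomes '<UN', where N is the longest token length

id2vocab = create_vocab_table_sketch("vocab.txt")
print(id2vocab[:3], id2vocab.dtype)
```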
- get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]
  Get cached subword tokenizer. Creates the tokenizer and caches it if it does not already exist.
  - Parameters
    - vocab_hash_file : str
      Path to the hash file containing the vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
    - do_lower_case : bool
      If set to true, original text will be lowercased before encoding.
  - Returns
    - cudf.core.subword_tokenizer.SubwordTokenizer
      Cached subword tokenizer
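The caching behaviour can be pictured with functools.lru_cache; this is a sketch of the semantics described above, not the module's actual implementation.

```python
from functools import lru_cache

from cudf.core.subword_tokenizer import SubwordTokenizer

@lru_cache(maxsize=None)
def get_cached_tokenizer_sketch(vocab_hash_file: str, do_lower_case: bool) -> SubwordTokenizer:
    # First call per argument pair constructs the tokenizer; later calls reuse it.
    return SubwordTokenizer(vocab_hash_file, do_lower_case=do_lower_case)

tok_a = get_cached_tokenizer_sketch("vocab-hash.txt", True)
tok_b = get_cached_tokenizer_sketch("vocab-hash.txt", True)
assert tok_a is tok_b  # the second call hits the cache
```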
- tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]
  Tokenize a text series using the BERT subword tokenizer and vocab-hash.
  - Parameters
    - vocab_hash_file : str
      Hash file to use (created using perfect_hash.py with the compact flag)
    - do_lower_case : bool
      If set to true, original text will be lowercased before encoding.
    - text_ser : cudf.Series
      Text series to tokenize
    - seq_len : int
      Sequence length to use (two special tokens are added for the NER classification job)
    - stride : int
      Stride for the tokenizer
    - truncation : bool
      If set to true, strings will be truncated and padded to max_length, and each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when max_length is smaller than the number of generated tokens.
    - add_special_tokens : bool
      Whether or not to encode the sequences with the special tokens of the BERT classification model.
  - Returns
    - collections.namedtuple
      A named tuple with the fields {'input_ids', 'input_mask', 'segment_ids'}
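An illustrative call, again assuming the placeholder import path tokenizer_utils and an existing vocab-hash file built with perfect_hash.py:

```python
import cudf

from tokenizer_utils import tokenize_text_series  # placeholder module path

text_ser = cudf.Series(["The quick brown fox", "jumps over the lazy dog"])

tokens = tokenize_text_series(
    vocab_hash_file="vocab-hash.txt",
    do_lower_case=True,
    text_ser=text_ser,
    seq_len=64,
    stride=48,           # overlap between consecutive output sequences
    truncation=True,     # exactly one output sequence per input string
    add_special_tokens=True,
)

# Fields per the Returns section above.
print(tokens.input_ids.shape, tokens.input_mask.shape, tokens.segment_ids.shape)
```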