morpheus.utils.cudf_subword_helper


Wrapper around cudf’s subword tokenizer

Functions

create_tokenizer(vocab_hash_file, do_lower_case) Create a subword tokenizer from a vocabulary hash file.
create_vocab_table(vocabpath) Create vocabulary tables from the vocab.txt file
get_cached_tokenizer(vocab_hash_file, ...) Get cached subword tokenizer.
tokenize_text_series(vocab_hash_file, ...) Tokenize a text series using the BERT subword tokenizer and vocabulary hash

Classes

Feature(input_ids, input_mask, segment_ids)


create_tokenizer(vocab_hash_file, do_lower_case)[source]

Create a subword tokenizer from a vocabulary hash file.

Parameters
vocab_hash_file

Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.

do_lower_case

If set to true, original text will be lowercased before encoding.

Returns
cudf.core.subword_tokenizer.SubwordTokenizer

Subword tokenizer
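A minimal usage sketch, assuming an uncased BERT vocab.txt on disk (the file paths are illustrative, not part of the Morpheus API). The raw vocabulary is first converted to a hash file with cudf.utils.hash_vocab_utils.hash_vocab and then passed to create_tokenizer:

from cudf.utils.hash_vocab_utils import hash_vocab
from morpheus.utils.cudf_subword_helper import create_tokenizer

# Build the vocabulary hash file from a raw BERT vocab.txt (example paths).
hash_vocab("vocab.txt", "vocab-hash.txt")

# Create the subword tokenizer; lowercasing matches an uncased vocabulary.
tokenizer = create_tokenizer("vocab-hash.txt", do_lower_case=True)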

create_vocab_table(vocabpath)[source]

Create vocabulary tables from the vocab.txt file

Parameters
vocabpath

Path of the vocabulary file

Returns
np.array

id2vocab: np.array, dtype=<U5
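A short sketch of building the table, assuming vocab.txt is a standard one-token-per-line BERT vocabulary file (the path is illustrative):

from morpheus.utils.cudf_subword_helper import create_vocab_table

# id2vocab maps a token id (the array index) to its vocabulary string.
id2vocab = create_vocab_table("vocab.txt")
print(id2vocab[100])  # e.g. the token string stored at id 100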

get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]

Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.

Parameters
vocab_hash_file

Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.

do_lower_case

If set to true, original text will be lowercased before encoding.

Returns
cudf.core.subword_tokenizer.SubwordTokenizer

Cached subword tokenizer
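A sketch of the caching behavior described above (the hash-file path is an assumed example). Repeated calls with the same arguments are expected to return the tokenizer created on the first call rather than rebuilding it:

from morpheus.utils.cudf_subword_helper import get_cached_tokenizer

tok_a = get_cached_tokenizer("vocab-hash.txt", do_lower_case=True)
tok_b = get_cached_tokenizer("vocab-hash.txt", do_lower_case=True)

# Expected to hold given the caching described above.
assert tok_a is tok_b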

tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]

Tokenize a text series using the BERT subword tokenizer and vocabulary hash file.

Parameters
vocab_hash_file

Path of the vocabulary hash file to use (created with perfect_hash.py using the compact flag).

do_lower_case

If set to true, original text will be lowercased before encoding.

text_ser

Text Series to tokenize

seq_len

Sequence length to use (two special tokens are added for the NER classification task).

stride

Stride for the tokenizer

truncation

If set to true, strings will be truncated and padded to seq_len. Each input string will result in exactly one output sequence. If set to false, there may be multiple output sequences when seq_len is smaller than the number of generated tokens.

add_special_tokens

Whether or not to encode the sequences with the special tokens of the BERT classification model.

Returns
collections.namedtuple

A named tuple (Feature) with the fields input_ids, input_mask, and segment_ids.
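A minimal end-to-end sketch, assuming a prepared vocabulary hash file (the path and parameter values are illustrative):

import cudf

from morpheus.utils.cudf_subword_helper import tokenize_text_series

text_ser = cudf.Series(["Hello world", "Morpheus tokenizes text on the GPU"])

features = tokenize_text_series(
    vocab_hash_file="vocab-hash.txt",
    do_lower_case=True,
    text_ser=text_ser,
    seq_len=128,
    stride=64,
    truncation=True,
    add_special_tokens=True,
)

# With truncation=True there is exactly one output sequence per input row;
# each field is a GPU array with one row per output sequence.
print(features.input_ids.shape)
print(features.input_mask.shape)
print(features.segment_ids.shape)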
