morpheus.utils.cudf_subword_helper


Functions

create_tokenizer(vocab_hash_file, do_lower_case)

Create a subword tokenizer from a vocabulary hash file.

create_vocab_table(vocabpath)

Create a vocabulary table from the vocab.txt file.

get_cached_tokenizer(vocab_hash_file, ...)

Get cached subword tokenizer.

tokenize_text_series(vocab_hash_file, ...)

Tokenizes a text series using the BERT subword tokenizer and a vocab hash file.

Classes

Feature(input_ids, input_mask, segment_ids)

Attributes

create_tokenizer(vocab_hash_file, do_lower_case)[source]

Create a subword tokenizer from a vocabulary hash file.

Parameters
vocab_hash_file : str

Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.

do_lower_case : bool

If set to true, original text will be lowercased before encoding.

Returns
cudf.core.subword_tokenizer.SubwordTokenizer

Subword tokenizer

create_vocab_table(vocabpath)[source]

Create a vocabulary table from the vocab.txt file.

Parameters
vocabpath : str

Path of the vocabulary file.

Returns
np.array

id2vocab: np.array, dtype=<U5
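
The helper's job can be sketched in a few lines: read the vocab.txt file (one token per line, token id equal to line number) into an array indexed by id. This is an illustrative sketch, not the Morpheus implementation; the function name `create_vocab_table_sketch` is hypothetical.

```python
import numpy as np


def create_vocab_table_sketch(vocabpath):
    """Illustrative sketch (not the Morpheus implementation): build an
    id-to-token lookup array from a vocab.txt file, one token per line."""
    with open(vocabpath, encoding="utf-8") as f:
        tokens = [line.strip() for line in f]
    # Token ids are the line numbers, so a plain array indexed by id suffices.
    return np.array(tokens)
```

With such a table, `id2vocab[token_id]` recovers the token string for any id produced by the tokenizer.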

get_cached_tokenizer(vocab_hash_file, do_lower_case)[source]

Get cached subword tokenizer. Creates tokenizer and caches it if it does not already exist.

Parameters
vocab_hash_file : str

Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.

do_lower_case : bool

If set to true, original text will be lowercased before encoding.

Returns
cudf.core.subword_tokenizer.SubwordTokenizer

Cached subword tokenizer
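
The caching behavior described above ("creates tokenizer and caches it if it does not already exist") follows a common memoized-factory pattern. A minimal sketch, assuming a memoization keyed on the `(vocab_hash_file, do_lower_case)` pair; `_build_tokenizer` is a hypothetical stand-in for the real cudf tokenizer construction, and the actual Morpheus cache may be implemented differently:

```python
import functools


def _build_tokenizer(vocab_hash_file, do_lower_case):
    # Hypothetical stand-in: the real helper constructs a
    # cudf.core.subword_tokenizer.SubwordTokenizer here.
    return ("tokenizer-for", vocab_hash_file, do_lower_case)


@functools.lru_cache(maxsize=None)
def get_cached_tokenizer_sketch(vocab_hash_file, do_lower_case):
    """Return the same tokenizer object for repeated (file, flag) pairs,
    building it only on the first call."""
    return _build_tokenizer(vocab_hash_file, do_lower_case)
```

Repeated calls with the same arguments return the identical cached object, so the (relatively expensive) tokenizer construction happens at most once per distinct configuration.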

tokenize_text_series(vocab_hash_file, do_lower_case, text_ser, seq_len, stride, truncation, add_special_tokens)[source]

Tokenizes a text series using the BERT subword tokenizer and a vocab hash file.

Parameters
vocab_hash_file : str

Path of the vocab hash file to use (created using perfect_hash.py with the compact flag).

do_lower_case : bool

If set to true, original text will be lowercased before encoding.

text_ser : cudf.Series

Text Series to tokenize

seq_len : int

Sequence length to use (two special tokens are added for the NER classification job).

stride : int

Stride for the tokenizer

truncation : bool

If set to true, strings will be truncated and padded to seq_len; each input string will result in exactly one output sequence. If set to false, an input may produce multiple output sequences when it generates more tokens than seq_len.

add_special_tokens : bool

Whether or not to encode the sequences with the special tokens of the BERT classification model.

Returns
collections.namedtuple

A named tuple with the fields input_ids, input_mask, and segment_ids.
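
The shape of the return value, and the interaction between truncation and stride described above, can be illustrated without a GPU. This is a sketch under assumed semantics (each additional window advances by `stride` tokens when truncation is disabled), not the actual cudf kernel; `num_output_sequences` is a hypothetical helper:

```python
import math
from collections import namedtuple

# The return type described above: three aligned arrays/tensors.
Feature = namedtuple("Feature", ["input_ids", "input_mask", "segment_ids"])


def num_output_sequences(num_tokens, seq_len, stride, truncation):
    """Illustrative sketch (assumed semantics, not the cudf kernel):
    with truncation=True every input yields exactly one sequence; with
    truncation=False, each extra window advances by `stride` tokens."""
    if truncation or num_tokens <= seq_len:
        return 1
    return 1 + math.ceil((num_tokens - seq_len) / stride)
```

For example, under these assumptions an input of 17 tokens with seq_len=16 yields a single (truncated) sequence when truncation is true, but two overlapping sequences when truncation is false.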

© Copyright 2023, NVIDIA. Last updated on Apr 11, 2023.