Struct PreprocessNLPStageInterfaceProxy
Defined in File preprocess_nlp.hpp
-
struct PreprocessNLPStageInterfaceProxy
Interface proxy, used to insulate Python bindings.
Public Static Functions
-
static std::shared_ptr<mrc::segment::Object<PreprocessNLPStage<MultiMessage, MultiInferenceMessage>>> init_multi(mrc::segment::Builder &builder, const std::string &name, std::string vocab_hash_file, uint32_t sequence_length, bool truncation, bool do_lower_case, bool add_special_token, int stride = -1, std::string column = "data")
Create and initialize a PreprocessNLPStage that receives MultiMessage and emits MultiInferenceMessage, and return the result.
- Parameters
builder – Pipeline context object reference.
name – Name of the stage.
vocab_hash_file – Path to the hash file containing the vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
sequence_length – Sequence length to use (special tokens are added for the NER classification job).
truncation – If set to true, strings are truncated and padded to max_length, and each input string results in exactly one output sequence. If set to false, there may be multiple output sequences when max_length is smaller than the number of generated tokens.
do_lower_case – If set to true, the original text is lowercased before encoding.
add_special_token – Whether or not to encode the sequences with the special tokens of the BERT classification model.
stride – If truncation is false and the tokenized string is longer than max_length, the sequences containing the overflowing token-ids can contain duplicated token-ids from the main sequence. If max_length is equal to stride, there are no duplicated token-ids. If stride is 80% of max_length, 20% of the first sequence is repeated in the second sequence, and so on until the entire sentence is encoded.
column – Name of the string column to operate on; defaults to “data”.
- Returns
std::shared_ptr<mrc::segment::Object<PreprocessNLPStage<MultiMessage, MultiInferenceMessage>>>
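The interaction between stride and max_length described above can be illustrated with a small standalone sketch. It is a simplified model only: window_ranges is a hypothetical helper, not part of Morpheus or cuDF, and it ignores special tokens and padding. It assumes each output sequence starts stride tokens after the previous one and covers at most max_length tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a Morpheus or cuDF function): returns the
// [begin, end) token ranges covered by each output sequence when a string of
// num_tokens tokens is split without truncation.
std::vector<std::pair<std::size_t, std::size_t>> window_ranges(std::size_t num_tokens,
                                                               std::size_t max_length,
                                                               std::size_t stride)
{
    // Assumption of this sketch: 0 < stride <= max_length.
    assert(max_length > 0 && stride > 0 && stride <= max_length);

    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t begin = 0;
    while (true)
    {
        // Each window covers at most max_length tokens.
        const std::size_t end = std::min(begin + max_length, num_tokens);
        ranges.emplace_back(begin, end);
        if (end >= num_tokens)
        {
            break;  // The whole token sequence has been covered.
        }
        // Consecutive windows overlap by (max_length - stride) tokens.
        begin += stride;
    }
    return ranges;
}
```

Under this model, 250 tokens with max_length = 100 and stride = 80 give the ranges [0, 100), [80, 180) and [160, 250), so 20 tokens (20% of max_length) are shared between consecutive sequences. With stride = 100 the ranges do not overlap.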
-
static std::shared_ptr<mrc::segment::Object<PreprocessNLPStage<ControlMessage, ControlMessage>>> init_cm(mrc::segment::Builder &builder, const std::string &name, std::string vocab_hash_file, uint32_t sequence_length, bool truncation, bool do_lower_case, bool add_special_token, int stride = -1, std::string column = "data")
Create and initialize a PreprocessNLPStage that receives ControlMessage and emits ControlMessage, and return the result.
- Parameters
builder – Pipeline context object reference.
name – Name of the stage.
vocab_hash_file – Path to the hash file containing the vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
sequence_length – Sequence length to use (special tokens are added for the NER classification job).
truncation – If set to true, strings are truncated and padded to max_length, and each input string results in exactly one output sequence. If set to false, there may be multiple output sequences when max_length is smaller than the number of generated tokens.
do_lower_case – If set to true, the original text is lowercased before encoding.
add_special_token – Whether or not to encode the sequences with the special tokens of the BERT classification model.
stride – If truncation is false and the tokenized string is longer than max_length, the sequences containing the overflowing token-ids can contain duplicated token-ids from the main sequence. If max_length is equal to stride, there are no duplicated token-ids. If stride is 80% of max_length, 20% of the first sequence is repeated in the second sequence, and so on until the entire sentence is encoded.
column – Name of the string column to operate on; defaults to “data”.
- Returns
std::shared_ptr<mrc::segment::Object<PreprocessNLPStage<ControlMessage, ControlMessage>>>
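The truncation = true case ("truncated and padded to max_length", exactly one output sequence per input string) can be sketched as follows. This is an illustration rather than the stage's implementation: truncate_and_pad is a hypothetical helper, and the padding id of 0 is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Morpheus or cuDF function): produces exactly one
// sequence of max_length token-ids from an arbitrary-length input.
std::vector<std::int32_t> truncate_and_pad(std::vector<std::int32_t> token_ids,
                                           std::size_t max_length,
                                           std::int32_t pad_id = 0)  // pad_id = 0 is assumed
{
    assert(max_length > 0);

    // resize() drops trailing ids when the input is too long and appends
    // pad_id when it is too short, so the result always has max_length ids.
    token_ids.resize(max_length, pad_id);
    return token_ids;
}
```

For example, the ids {1, 2, 3, 4, 5} with max_length = 3 become {1, 2, 3}, while {1, 2} with max_length = 4 become {1, 2, 0, 0}.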