ez_transfer.preprocessors

Base Preprocessor

easytransfer.preprocessors.preprocessor.truncate_seq_pair(tokens_a, tokens_b, max_length)
    Truncates a sequence pair in place to the maximum length.
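This helper matches the one in the BERT reference code, which this library's preprocessors appear to follow; a minimal sketch, assuming the longer sequence is trimmed from its tail one token at a time:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncate a pair of token lists in place until their combined
    length fits within max_length."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Trim the currently longer sequence, so both sequences keep a
        # proportionate share of the length budget.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

seq_a = list("abcdef")   # 6 tokens
seq_b = list("xyz")      # 3 tokens
truncate_seq_pair(seq_a, seq_b, 6)
# seq_a and seq_b now hold 3 tokens each
```

Trimming one token at a time from the longer list is what the BERT reference implementation does; truncating each sequence by a fixed proportion would be a simpler but less balanced alternative.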
Classification/Regression Preprocessor

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessorConfig(**kwargs)

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessor(config, **kwargs)
    Preprocessor for classification/regression tasks.

    config_class
        alias of ClassificationRegressionPreprocessorConfig
Pre-train Preprocessor

class easytransfer.preprocessors.pretrain_preprocessor.MaskedLmInstance(index, label)

    index
        Alias for field number 0

    label
        Alias for field number 1
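"Alias for field number 0/1" is what Sphinx emits for `collections.namedtuple` fields, so `MaskedLmInstance` is presumably defined along these lines:

```python
import collections

# A (position, original token) pair recording one masked-out token.
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])

inst = MaskedLmInstance(index=4, label="cat")
# Fields are reachable by name (inst.index) or by position (inst[0]).
```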
easytransfer.preprocessors.pretrain_preprocessor.create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, do_whole_word_mask, rng)
    Creates the predictions for the masked LM objective.
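A simplified sketch of the standard BERT masking scheme this function implements (80% `[MASK]`, 10% kept, 10% random vocab word). Whole-word masking (`do_whole_word_mask`) and the real implementation's exact candidate selection are omitted, so treat the details below as assumptions:

```python
import collections
import random

MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
    """Pick random positions and corrupt them BERT-style, returning
    (output_tokens, masked positions, original labels)."""
    # Special tokens are never masked.
    cand_indexes = [i for i, t in enumerate(tokens)
                    if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand_indexes)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output_tokens = list(tokens)
    masked = []
    for index in cand_indexes[:num_to_predict]:
        if rng.random() < 0.8:
            replacement = "[MASK]"                     # 80%: mask out
        elif rng.random() < 0.5:
            replacement = tokens[index]                # 10%: keep original
        else:                                          # 10%: random word
            replacement = vocab_words[rng.randint(0, len(vocab_words) - 1)]
        masked.append(MaskedLmInstance(index=index, label=tokens[index]))
        output_tokens[index] = replacement
    masked.sort(key=lambda x: x.index)
    return (output_tokens,
            [m.index for m in masked],
            [m.label for m in masked])

rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "down", "[SEP]"]
out, positions, labels = create_masked_lm_predictions(
    tokens, 0.3, 2, ["apple", "pear"], rng)
```

The labels keep the original tokens so the loss can be computed only at the masked positions.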
class easytransfer.preprocessors.pretrain_preprocessor.TrainingInstance(tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next)
    A single training instance (sentence pair).
Sequence Labeling Preprocessor

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessorConfig(**kwargs)

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessor(config, **kwargs)
    Preprocessor for sequence labeling.

    config_class
        alias of SequenceLabelingPreprocessorConfig
Text Comprehension Preprocessor

easytransfer.preprocessors.comprehension_preprocessor.whitespace_tokenize(text)
    Runs basic whitespace cleaning and splitting on a piece of text.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessorConfig(**kwargs)

class easytransfer.preprocessors.comprehension_preprocessor.Example(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, is_impossible=False)
    A single training/test example for simple sequence classification.
    For examples without an answer, the start and end position are -1.
class easytransfer.preprocessors.comprehension_preprocessor.InputFeatures(unique_id, qas_id, example_index, doc_span_index, doc_tokens, tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, is_impossible=None)
    A single set of features of data.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessor(config, thread_num=1, **kwargs)
    Preprocessor for single-turn text comprehension.

    config_class
        alias of ComprehensionPreprocessorConfig
class easytransfer.preprocessors.comprehension_preprocessor.CQAExample(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, history_answer_marker=None, metadata=None)
    A single training/test example for multi-turn comprehension.

class easytransfer.preprocessors.comprehension_preprocessor.CQAInputFeatures(qas_id, unique_id, example_index, doc_span_index, tokens, doc_tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, history_answer_marker=None, metadata=None)
    A single set of features of data for multi-turn comprehension.

class easytransfer.preprocessors.comprehension_preprocessor.MultiTurnComprehensionPreprocessor(config, **kwargs)
    Preprocessor for multi-turn text comprehension.

    config_class
        alias of ComprehensionPreprocessorConfig
Deep Text Preprocessor

easytransfer.preprocessors.deeptext_preprocessor.get_pretrained_embedding(stoi, pretrained_w2v_path, init='random')
Tokenization

easytransfer.preprocessors.tokenization.encode_pieces(sp_model, text, return_unicode=True, sample=False)
    Turns sentences into word pieces.
easytransfer.preprocessors.tokenization.convert_to_unicode(text)
    Converts text to Unicode (if it's not already), assuming utf-8 input.
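The docstring describes the standard BERT-style normalization helper; on Python 3 it plausibly reduces to decoding `bytes` and passing `str` through:

```python
def convert_to_unicode(text):
    """Return text as str, decoding UTF-8 bytes if necessary."""
    if isinstance(text, bytes):
        # "ignore" drops undecodable byte sequences rather than raising.
        return text.decode("utf-8", "ignore")
    return text
```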
easytransfer.preprocessors.tokenization.printable_text(text)
    Returns text encoded in a way suitable for print or tf.logging.
easytransfer.preprocessors.tokenization.load_vocab(vocab_file)
    Loads a vocabulary file into a dictionary.
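A sketch of the presumed behavior (one token per line, mapped to its 0-based line index, insertion order preserved), demonstrated on a throwaway file:

```python
import collections
import os
import tempfile

def load_vocab(vocab_file):
    """Map each token (one per line) to its 0-based line index."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab

# Demo on a throwaway three-token vocab file.
with tempfile.NamedTemporaryFile("w", suffix=".vocab",
                                 delete=False) as f:
    f.write("[PAD]\n[UNK]\nhello\n")
vocab = load_vocab(f.name)
os.remove(f.name)
```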
easytransfer.preprocessors.tokenization.convert_by_vocab(vocab, items)
    Converts a sequence of [tokens|ids] using the vocab.
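The "[tokens|ids]" wording suggests a single lookup helper shared by token-to-id and id-to-token conversion; a plausible sketch:

```python
def convert_by_vocab(vocab, items):
    # One helper for both directions: vocab may map tokens -> ids
    # or ids -> tokens; each item is simply looked up.
    return [vocab[item] for item in items]

token_to_id = {"[UNK]": 0, "hello": 1, "world": 2}
ids = convert_by_vocab(token_to_id, ["hello", "world"])   # [1, 2]
id_to_token = {v: k for k, v in token_to_id.items()}
back = convert_by_vocab(id_to_token, ids)                 # ["hello", "world"]
```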
easytransfer.preprocessors.tokenization.whitespace_tokenize(text)
    Runs basic whitespace cleaning and splitting on a piece of text.
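"Basic whitespace cleaning and splitting" fits in a few lines; this mirrors the BERT reference helper, which this module appears to be based on:

```python
def whitespace_tokenize(text):
    """Strip surrounding whitespace, then split on any whitespace run."""
    text = text.strip()
    if not text:
        return []
    return text.split()

whitespace_tokenize("  hello \t world\n")   # ["hello", "world"]
```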
class easytransfer.preprocessors.tokenization.FullTokenizer(vocab_file=None, do_lower_case=True, spm_model_file=None)
    Runs end-to-end tokenization.
class easytransfer.preprocessors.tokenization.BasicTokenizer(do_lower_case=True)
    Runs basic tokenization (punctuation splitting, lower casing, etc.).
class easytransfer.preprocessors.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)
    Runs WordPiece tokenization.

    tokenize(text)
        Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform
        tokenization using the given vocabulary.

        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]

        Parameters: text -- A single token or whitespace-separated tokens.
            This should have already been passed through BasicTokenizer.
        Returns: A list of wordpiece tokens.
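The greedy longest-match-first procedure described above can be sketched per token as follows (the real method also whitespace-splits its input first and uses the tokenizer's own vocab and unk_token; the function name here is illustrative):

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]",
                       max_input_chars_per_word=200):
    """Greedily split one token into the longest matching word pieces."""
    if len(token) > max_input_chars_per_word:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        cur_piece = None
        # Shrink the window from the right until a vocab entry matches.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces are prefixed
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]         # no piece matched: token is unknown
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
wordpiece_tokenize("unaffable", vocab)   # ["un", "##aff", "##able"]
```

Because each step takes the longest matching piece, a single character missing from the vocabulary makes the whole token fall back to `unk_token`.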