easytransfer.preprocessors
Base Preprocessor

easytransfer.preprocessors.preprocessor.truncate_seq_pair(tokens_a, tokens_b, max_length)[source]
    Truncates a sequence pair in place to the maximum length.
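The function body is not reproduced in this reference. A minimal sketch, assuming the standard BERT-style behavior (trim one token at a time from the end of whichever sequence is currently longer, modifying both lists in place), might look like:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncate a pair of token lists in place until their total length fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Trim from the longer sequence so both sides keep a fair share.
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = list("abcdef")   # 6 tokens
b = list("xyz")      # 3 tokens
truncate_seq_pair(a, b, 6)
print(a, b)  # ['a', 'b', 'c'] ['x', 'y', 'z']
```

Because the longer sequence is trimmed first, a short second sentence is never discarded just to fit a long first one.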
Classification/Regression Preprocessor

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessorConfig(**kwargs)[source]

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessor(config, **kwargs)[source]
    Preprocessor for classification/regression tasks

    config_class
        alias of ClassificationRegressionPreprocessorConfig
-
Pre-train Preprocessor

class easytransfer.preprocessors.pretrain_preprocessor.MaskedLmInstance(index, label)
    index
        Alias for field number 0
    label
        Alias for field number 1
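The "Alias for field number 0/1" entries are the documentation that `collections.namedtuple` generates for its fields, so MaskedLmInstance can be understood as a plain two-field record pairing a masked position with the original token; a minimal sketch:

```python
from collections import namedtuple

# A masked-LM record: `index` is the position that was masked in the token
# sequence, `label` is the original token the model must predict there.
MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

inst = MaskedLmInstance(index=7, label="jumps")
print(inst.index, inst[1])  # 7 jumps
```

Field access works both by name (`inst.label`) and by position (`inst[1]`), matching the field-number aliases above.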
easytransfer.preprocessors.pretrain_preprocessor.create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, do_whole_word_mask, rng)[source]
    Creates the predictions for the masked LM objective.
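As an illustration only, a simplified helper (hypothetical name `simple_masked_lm`; it omits whole-word masking and other details of the real function) can show the core BERT-style idea: select up to `max_predictions_per_seq` positions at rate `masked_lm_prob`, then replace 80% with `[MASK]`, 10% with a random vocabulary word, and keep 10% unchanged.

```python
import random

def simple_masked_lm(tokens, masked_lm_prob, max_predictions_per_seq,
                     vocab_words, rng):
    """Simplified sketch of BERT-style masked-LM candidate selection."""
    # Special tokens are never masked.
    cand = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    out = list(tokens)
    positions, labels = [], []
    for i in sorted(cand[:num_to_predict]):
        positions.append(i)
        labels.append(tokens[i])       # the prediction target is the original token
        p = rng.random()
        if p < 0.8:
            out[i] = "[MASK]"          # 80%: replace with the mask token
        elif p < 0.9:
            out[i] = rng.choice(vocab_words)  # 10%: random word
        # else: 10%: keep the original token
    return out, positions, labels

rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
out, positions, labels = simple_masked_lm(tokens, 0.15, 2, ["dog", "tree"], rng)
```

The real function additionally supports `do_whole_word_mask`, which forces all wordpieces of one word (`run`, `##ning`) to be masked together.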
class easytransfer.preprocessors.pretrain_preprocessor.TrainingInstance(tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next)[source]
    A single training instance (sentence pair).
Sequence Labeling Preprocessor

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessorConfig(**kwargs)[source]

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessor(config, **kwargs)[source]
    Preprocessor for sequence labeling

    config_class
        alias of SequenceLabelingPreprocessorConfig
-
Text Comprehension Preprocessor

easytransfer.preprocessors.comprehension_preprocessor.whitespace_tokenize(text)[source]
    Runs basic whitespace cleaning and splitting on a piece of text.
class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessorConfig(**kwargs)[source]

class easytransfer.preprocessors.comprehension_preprocessor.Example(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, is_impossible=False)[source]
    A single training/test example for text comprehension.
    For examples without an answer, the start and end position are -1.
class easytransfer.preprocessors.comprehension_preprocessor.InputFeatures(unique_id, qas_id, example_index, doc_span_index, doc_tokens, tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, is_impossible=None)[source]
    A single set of features of data.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessor(config, thread_num=1, **kwargs)[source]
    Preprocessor for single-turn text comprehension

    config_class
        alias of ComprehensionPreprocessorConfig
class easytransfer.preprocessors.comprehension_preprocessor.CQAExample(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]
    A single training/test example for multi-turn comprehension.

class easytransfer.preprocessors.comprehension_preprocessor.CQAInputFeatures(qas_id, unique_id, example_index, doc_span_index, tokens, doc_tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]
    A single set of features of data for multi-turn comprehension

class easytransfer.preprocessors.comprehension_preprocessor.MultiTurnComprehensionPreprocessor(config, **kwargs)[source]
    Preprocessor for multi-turn text comprehension

    config_class
        alias of ComprehensionPreprocessorConfig
Deep Text Preprocessor

easytransfer.preprocessors.deeptext_preprocessor.get_pretrained_embedding(stoi, pretrained_w2v_path, init='random')[source]
Tokenization

easytransfer.preprocessors.tokenization.encode_pieces(sp_model, text, return_unicode=True, sample=False)[source]
    Turns sentences into word pieces.
easytransfer.preprocessors.tokenization.convert_to_unicode(text)[source]
    Converts text to Unicode (if it's not already), assuming utf-8 input.
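Under Python 3 this amounts to decoding UTF-8 bytes and passing `str` through unchanged; a minimal sketch of that behavior, not necessarily the library's exact code:

```python
def convert_to_unicode(text):
    """Return `text` as a Unicode string, decoding UTF-8 bytes if needed."""
    if isinstance(text, str):
        return text            # already Unicode in Python 3
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    raise ValueError("Unsupported string type: %r" % type(text))

print(convert_to_unicode(b"caf\xc3\xa9"))  # café
```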
easytransfer.preprocessors.tokenization.printable_text(text)[source]
    Returns text encoded in a way suitable for print or tf.logging.

easytransfer.preprocessors.tokenization.load_vocab(vocab_file)[source]
    Loads a vocabulary file into a dictionary.
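BERT-style vocabulary files hold one token per line, with the line number serving as the token id. A minimal sketch of such a loader, shown as an illustration rather than the library's exact code:

```python
import collections
import tempfile

def load_vocab(vocab_file):
    """Load a one-token-per-line vocab file into a token -> id mapping."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab

# Write a tiny vocab file and load it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\n")
    path = f.name

vocab = load_vocab(path)
print(vocab["hello"])  # 4
```

An `OrderedDict` preserves file order, so iterating over the vocab yields ids 0, 1, 2, … in sequence.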
easytransfer.preprocessors.tokenization.convert_by_vocab(vocab, items)[source]
    Converts a sequence of [tokens|ids] using the vocab.
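Since the mapping direction is determined by the dict you pass in, one helper covers both tokens-to-ids and ids-to-tokens; a minimal sketch:

```python
def convert_by_vocab(vocab, items):
    """Map each item through `vocab`; works for token->id or id->token dicts."""
    return [vocab[item] for item in items]

tok2id = {"[UNK]": 0, "hello": 1, "world": 2}
ids = convert_by_vocab(tok2id, ["hello", "world"])  # [1, 2]

# Invert the dict to go back from ids to tokens.
id2tok = {v: k for k, v in tok2id.items()}
tokens = convert_by_vocab(id2tok, ids)              # ['hello', 'world']
```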
easytransfer.preprocessors.tokenization.whitespace_tokenize(text)[source]
    Runs basic whitespace cleaning and splitting on a piece of text.
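"Basic whitespace cleaning and splitting" reduces to stripping surrounding whitespace and splitting on any run of spaces, tabs, or newlines; a minimal sketch:

```python
def whitespace_tokenize(text):
    """Strip surrounding whitespace and split on runs of whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()

print(whitespace_tokenize("  hello \t world \n"))  # ['hello', 'world']
```

The empty-string guard matters: `"".split()` is already `[]`, but stripping first keeps whitespace-only input from producing stray tokens under other split styles.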
class easytransfer.preprocessors.tokenization.FullTokenizer(vocab_file=None, do_lower_case=True, spm_model_file=None)[source]
    Runs end-to-end tokenization.
class easytransfer.preprocessors.tokenization.BasicTokenizer(do_lower_case=True)[source]
    Runs basic tokenization (punctuation splitting, lower casing, etc.).
class easytransfer.preprocessors.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)[source]
    Runs WordPiece tokenization.
    tokenize(text)[source]
        Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

        For example: input = "unaffable", output = ["un", "##aff", "##able"]

        Parameters: text -- A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
        Returns: A list of wordpiece tokens.
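The greedy longest-match-first algorithm described above can be sketched as a standalone function (hypothetical name `wordpiece_tokenize`, operating on a single token; the real class also enforces `max_input_chars_per_word` and handles multiple whitespace-separated tokens): repeatedly take the longest prefix of the remaining characters that appears in the vocabulary, prefixing non-initial pieces with `##`.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single token."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        # Shrink the candidate substring from the right until it is in vocab.
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub   # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk_token]     # some character has no match at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Greedy matching is why "unaffable" never comes out as, say, `['u', '##n', ...]`: the longest vocabulary prefix wins at every step.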