easytransfer.preprocessors

Base Preprocessor

easytransfer.preprocessors.preprocessor.truncate_seq_pair(tokens_a, tokens_b, max_length)[source]

Truncates a sequence pair in place to the maximum length.
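The truncation strategy follows the standard BERT helper: it always trims one token from the end of the *longer* sequence, which preserves more information than cutting both sequences proportionally. A minimal sketch of that in-place logic:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim one token at a time from the end of the longer sequence
    # until the combined length fits within max_length. Both lists
    # are modified in place.
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
```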

class easytransfer.preprocessors.preprocessor.PreprocessorConfig(**kwargs)[source]
classmethod from_json_file(**kwargs)[source]
class easytransfer.preprocessors.preprocessor.Preprocessor(config, thread_num=1, input_queue=None, output_queue=None, job_name='DISTPreprocessor', **kwargs)[source]
classmethod get_preprocessor(**kwargs)[source]
set_feature_schema()[source]
convert_example_to_features(items)[source]
call(inputs)[source]
process(inputs)[source]

Classification/Regression Preprocessor

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessor(config, **kwargs)[source]

Preprocessor for classification/regression task

config_class

alias of ClassificationRegressionPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id)
Return type:features (tuple)
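The returned tuple follows the usual BERT single-sentence layout. The sketch below (with a hypothetical toy vocab; the real preprocessor drives a FullTokenizer) illustrates how the four feature arrays line up:

```python
def build_features(tokens, vocab, max_seq_length, label_id):
    # Lay out [CLS] tokens [SEP], then zero-pad every array to
    # max_seq_length. input_mask is 1 for real tokens and 0 for
    # padding; segment_ids is all zeros for single-sentence input.
    tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)
    padding = [0] * (max_seq_length - len(input_ids))
    return (input_ids + padding, input_mask + padding,
            segment_ids + padding, label_id)
```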
class easytransfer.preprocessors.classification_regression_preprocessor.PairedClassificationRegressionPreprocessor(config, **kwargs)[source]

Preprocessor for paired classification/regression task

config_class

alias of ClassificationRegressionPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:
(input_ids_a, input_mask_a, segment_ids_a,
input_ids_b, input_mask_b, segment_ids_b, label_id)
Return type:features (tuple)

Pre-train Preprocessor

class easytransfer.preprocessors.pretrain_preprocessor.MaskedLmInstance(index, label)
index

Alias for field number 0

label

Alias for field number 1

easytransfer.preprocessors.pretrain_preprocessor.create_chinese_subwords(segment)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_int_feature(values)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_float_feature(values)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, do_whole_word_mask, rng)[source]

Creates the predictions for the masked LM objective.
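A simplified sketch of the BERT-style masking procedure this function implements (whole-word masking omitted for brevity): roughly `masked_lm_prob` of the positions are selected, capped at `max_predictions_per_seq`; each selected position becomes `[MASK]` 80% of the time, a random vocabulary word 10% of the time, and is left unchanged 10% of the time.

```python
import random

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq,
                                 vocab_words, rng):
    # Candidate positions exclude the special [CLS]/[SEP] tokens.
    cand_indexes = [i for i, t in enumerate(tokens)
                    if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand_indexes)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output_tokens = list(tokens)
    masked_positions, masked_labels = [], []
    for index in sorted(cand_indexes[:num_to_predict]):
        if rng.random() < 0.8:
            masked_token = "[MASK]"            # 80%: mask
        elif rng.random() < 0.5:
            masked_token = tokens[index]       # 10%: keep original
        else:
            masked_token = rng.choice(vocab_words)  # 10%: random word
        output_tokens[index] = masked_token
        masked_positions.append(index)
        masked_labels.append(tokens[index])
    return output_tokens, masked_positions, masked_labels
```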

class easytransfer.preprocessors.pretrain_preprocessor.TrainingInstance(tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next)[source]

A single training instance (sentence pair).

class easytransfer.preprocessors.pretrain_preprocessor.PretrainPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.pretrain_preprocessor.PretrainPreprocessor(config, **kwargs)[source]
config_class

alias of PretrainPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Sequence Labeling Preprocessor

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessor(config, **kwargs)[source]

Preprocessor for sequence labeling

config_class

alias of SequenceLabelingPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to sequence labeling features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id, tok_to_orig_index)
Return type:features (tuple)

Text Comprehension Preprocessor

easytransfer.preprocessors.comprehension_preprocessor.is_whitespace(c)[source]
easytransfer.preprocessors.comprehension_preprocessor.whitespace_tokenize(text)[source]

Runs basic whitespace cleaning and splitting on a piece of text.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.comprehension_preprocessor.Example(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, is_impossible=False)[source]

A single training/test example for simple sequence classification.

For examples without an answer, the start and end positions are -1.

class easytransfer.preprocessors.comprehension_preprocessor.InputFeatures(unique_id, qas_id, example_index, doc_span_index, doc_tokens, tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, is_impossible=None)[source]

A single set of features of data.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessor(config, thread_num=1, **kwargs)[source]

Preprocessor for single-turn text comprehension

config_class

alias of ComprehensionPreprocessorConfig

convert_example_to_features(items)[source]

Convert a single example to multiple input features

Parameters:items (dict) -- inputs from the reader
Returns:list of InputFeature
Return type:features (list)
call(inputs)[source]
process(inputs)[source]
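One example can yield multiple features because long passages are split into overlapping windows ("doc spans"), as in the standard BERT reading-comprehension setup. A sketch of the windowing (parameter names here are illustrative):

```python
import collections

def make_doc_spans(num_doc_tokens, max_tokens_per_span, doc_stride):
    # Slide a window of up to max_tokens_per_span tokens over the
    # passage, advancing by doc_stride so consecutive spans overlap;
    # each span later becomes one InputFeatures instance.
    DocSpan = collections.namedtuple("DocSpan", ["start", "length"])
    doc_spans, start = [], 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_per_span)
        doc_spans.append(DocSpan(start=start, length=length))
        if start + length == num_doc_tokens:
            break
        start += min(length, doc_stride)
    return doc_spans
```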
class easytransfer.preprocessors.comprehension_preprocessor.CQAExample(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]

A single training/test example for multi-turn comprehension.

class easytransfer.preprocessors.comprehension_preprocessor.CQAInputFeatures(qas_id, unique_id, example_index, doc_span_index, tokens, doc_tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]

A single set of features of data for multi-turn comprehension

class easytransfer.preprocessors.comprehension_preprocessor.MultiTurnComprehensionPreprocessor(config, **kwargs)[source]

Preprocessor for multi-turn text comprehension

config_class

alias of ComprehensionPreprocessorConfig

static convert_examples_to_example_variations(examples, max_considered_history_turns)[source]
convert_example_to_features(example)[source]

Convert a single example to multiple input features

Parameters:example (dict) -- inputs from the reader
Returns:list of CQAInputFeatures
Return type:features (list)
call(inputs)[source]
process(inputs)[source]

Deep Text Preprocessor

easytransfer.preprocessors.deeptext_preprocessor.get_pretrained_embedding(stoi, pretrained_w2v_path, init='random')[source]
class easytransfer.preprocessors.deeptext_preprocessor.DeepTextVocab[source]
has(word)[source]
add_word(word)[source]
add_line(line)[source]
to_idx(word)[source]
to_word(ind)[source]
filter_vocab_to_fix_length(max_vocab_size=50000)[source]
classmethod build_from_file(file_path)[source]
export_to_file(file_path)[source]
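The method list above describes a simple word/index mapping. A hypothetical re-implementation of its core (not the library class itself; the real `to_idx` may map unknown words to an UNK index rather than -1):

```python
class SimpleVocab:
    # Minimal DeepTextVocab-style word <-> index mapping.
    def __init__(self):
        self.stoi, self.itos = {}, []

    def has(self, word):
        return word in self.stoi

    def add_word(self, word):
        if word not in self.stoi:
            self.stoi[word] = len(self.itos)
            self.itos.append(word)

    def add_line(self, line):
        # Treat a line as whitespace-separated words.
        for word in line.strip().split():
            self.add_word(word)

    def to_idx(self, word):
        return self.stoi.get(word, -1)  # -1 for unknown (illustrative)

    def to_word(self, ind):
        return self.itos[ind]
```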
class easytransfer.preprocessors.deeptext_preprocessor.DeepTextPreprocessor(config, **kwargs)[source]

Preprocessor for deep text models such as CNN, DAM, HCNN, etc.

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id)
Return type:features (tuple)

Tokenization

easytransfer.preprocessors.tokenization.encode_pieces(sp_model, text, return_unicode=True, sample=False)[source]

Turns sentences into word pieces.

easytransfer.preprocessors.tokenization.encode_ids(sp_model, text, sample=False)[source]
easytransfer.preprocessors.tokenization.convert_to_unicode(text)[source]

Converts text to Unicode (if it's not already), assuming utf-8 input.

easytransfer.preprocessors.tokenization.printable_text(text)[source]

Returns text encoded in a way suitable for print or tf.logging.

easytransfer.preprocessors.tokenization.load_vocab(vocab_file)[source]

Loads a vocabulary file into a dictionary.

easytransfer.preprocessors.tokenization.convert_by_vocab(vocab, items)[source]

Converts a sequence of [tokens|ids] using the vocab.

easytransfer.preprocessors.tokenization.convert_tokens_to_ids(vocab, tokens)[source]
easytransfer.preprocessors.tokenization.whitespace_tokenize(text)[source]

Runs basic whitespace cleaning and splitting on a piece of text.
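Its behavior is simply strip-then-split on whitespace:

```python
def whitespace_tokenize(text):
    # Strip surrounding whitespace, then split on any run of
    # whitespace; an empty or all-whitespace string yields [].
    text = text.strip()
    if not text:
        return []
    return text.split()
```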

class easytransfer.preprocessors.tokenization.FullTokenizer(vocab_file=None, do_lower_case=True, spm_model_file=None)[source]

Runs end-to-end tokenization.

tokenize(text)[source]
convert_tokens_to_ids(tokens)[source]
convert_ids_to_tokens(ids)[source]
class easytransfer.preprocessors.tokenization.BasicTokenizer(do_lower_case=True)[source]

Runs basic tokenization (punctuation splitting, lower casing, etc.).

tokenize(text)[source]

Tokenizes a piece of text.

class easytransfer.preprocessors.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)[source]

Runs WordPiece tokenization.

tokenize(text)[source]

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Parameters:text -- A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
Returns:A list of wordpiece tokens.
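The greedy longest-match-first algorithm can be sketched as follows (a simplified re-implementation that ignores `max_input_chars_per_word`; the docstring's own "unaffable" example is reproduced in the test):

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]"):
    # Repeatedly take the longest prefix (or "##"-prefixed
    # continuation) that appears in the vocabulary; if no substring
    # matches at some position, the whole token becomes [UNK].
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces
```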