easytransfer.preprocessors

Base Preprocessor

easytransfer.preprocessors.preprocessor.truncate_seq_pair(tokens_a, tokens_b, max_length)[source]

Truncates a sequence pair in place to the maximum length.
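The truncation strategy follows the standard BERT helper: it always trims one token from the end of the *longer* sequence, which preserves more information than cutting both sequences proportionally. A minimal sketch of that in-place logic:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim one token at a time from the end of the longer sequence
    # until the combined length fits within max_length. Both lists
    # are modified in place.
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
```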

class easytransfer.preprocessors.preprocessor.PreprocessorConfig(**kwargs)[source]
classmethod from_json_file(**kwargs)[source]
class easytransfer.preprocessors.preprocessor.Preprocessor(config, thread_num=1, input_queue=None, output_queue=None, job_name='DISTPreprocessor', **kwargs)[source]
classmethod get_preprocessor(**kwargs)[source]
set_feature_schema()[source]
convert_example_to_features(items)[source]
call(inputs)[source]
process(inputs)[source]

Classification/Regression Preprocessor

class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.classification_regression_preprocessor.ClassificationRegressionPreprocessor(config, **kwargs)[source]

Preprocessor for classification/regression task

config_class

alias of ClassificationRegressionPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id)
Return type:features (tuple)
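The returned tuple follows the usual BERT single-sentence layout. The sketch below (with a hypothetical toy vocab; the real preprocessor drives a FullTokenizer) illustrates how the four feature arrays line up:

```python
def build_features(tokens, vocab, max_seq_length, label_id):
    # Lay out [CLS] tokens [SEP], then zero-pad every array to
    # max_seq_length. input_mask is 1 for real tokens and 0 for
    # padding; segment_ids is all zeros for single-sentence input.
    tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)
    padding = [0] * (max_seq_length - len(input_ids))
    return (input_ids + padding, input_mask + padding,
            segment_ids + padding, label_id)
```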
class easytransfer.preprocessors.classification_regression_preprocessor.PairedClassificationRegressionPreprocessor(config, **kwargs)[source]

Preprocessor for paired classification/regression task

config_class

alias of ClassificationRegressionPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:
(input_ids_a, input_mask_a, segment_ids_a,
input_ids_b, input_mask_b, segment_ids_b, label_id)
Return type:features (tuple)

Pre-train Preprocessor

class easytransfer.preprocessors.pretrain_preprocessor.MaskedLmInstance(index, label)
index

Alias for field number 0

label

Alias for field number 1

easytransfer.preprocessors.pretrain_preprocessor.create_chinese_subwords(segment)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_int_feature(values)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_float_feature(values)[source]
easytransfer.preprocessors.pretrain_preprocessor.create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, do_whole_word_mask, rng)[source]

Creates the predictions for the masked LM objective.
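A simplified sketch of the BERT-style masking procedure this function implements (whole-word masking omitted for brevity): roughly `masked_lm_prob` of the positions are selected, capped at `max_predictions_per_seq`; each selected position becomes `[MASK]` 80% of the time, a random vocabulary word 10% of the time, and is left unchanged 10% of the time.

```python
import random

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq,
                                 vocab_words, rng):
    # Candidate positions exclude the special [CLS]/[SEP] tokens.
    cand_indexes = [i for i, t in enumerate(tokens)
                    if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand_indexes)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output_tokens = list(tokens)
    masked_positions, masked_labels = [], []
    for index in sorted(cand_indexes[:num_to_predict]):
        if rng.random() < 0.8:
            masked_token = "[MASK]"            # 80%: mask
        elif rng.random() < 0.5:
            masked_token = tokens[index]       # 10%: keep original
        else:
            masked_token = rng.choice(vocab_words)  # 10%: random word
        output_tokens[index] = masked_token
        masked_positions.append(index)
        masked_labels.append(tokens[index])
    return output_tokens, masked_positions, masked_labels
```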

class easytransfer.preprocessors.pretrain_preprocessor.TrainingInstance(tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next)[source]

A single training instance (sentence pair).

class easytransfer.preprocessors.pretrain_preprocessor.PretrainPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.pretrain_preprocessor.PretrainPreprocessor(config, **kwargs)[source]
config_class

alias of PretrainPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Sequence Labeling Preprocessor

class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.labeling_preprocessor.SequenceLabelingPreprocessor(config, **kwargs)[source]

Preprocessor for sequence labeling

config_class

alias of SequenceLabelingPreprocessorConfig

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to sequence labeling features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id, tok_to_orig_index)
Return type:features (tuple)

Text Comprehension Preprocessor

easytransfer.preprocessors.comprehension_preprocessor.is_whitespace(c)[source]
easytransfer.preprocessors.comprehension_preprocessor.whitespace_tokenize(text)[source]

Runs basic whitespace cleaning and splitting on a piece of text.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessorConfig(**kwargs)[source]
class easytransfer.preprocessors.comprehension_preprocessor.Example(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, is_impossible=False)[source]

A single training/test example for simple sequence classification.

For examples without an answer, the start and end positions are -1.

class easytransfer.preprocessors.comprehension_preprocessor.InputFeatures(unique_id, qas_id, example_index, doc_span_index, doc_tokens, tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, is_impossible=None)[source]

A single set of features of data.

class easytransfer.preprocessors.comprehension_preprocessor.ComprehensionPreprocessor(config, thread_num=1, **kwargs)[source]

Preprocessor for single-turn text comprehension

config_class

alias of ComprehensionPreprocessorConfig

convert_example_to_features(items)[source]

Convert a single example to multiple input features

Parameters:items (dict) -- inputs from the reader
Returns:list of InputFeature
Return type:features (list)
call(inputs)[source]
process(inputs)[source]
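One example can yield multiple features because long passages are split into overlapping windows ("doc spans"), as in the standard BERT reading-comprehension setup. A sketch of the windowing (parameter names here are illustrative):

```python
import collections

def make_doc_spans(num_doc_tokens, max_tokens_per_span, doc_stride):
    # Slide a window of up to max_tokens_per_span tokens over the
    # passage, advancing by doc_stride so consecutive spans overlap;
    # each span later becomes one InputFeatures instance.
    DocSpan = collections.namedtuple("DocSpan", ["start", "length"])
    doc_spans, start = [], 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_per_span)
        doc_spans.append(DocSpan(start=start, length=length))
        if start + length == num_doc_tokens:
            break
        start += min(length, doc_stride)
    return doc_spans
```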
class easytransfer.preprocessors.comprehension_preprocessor.CQAExample(qas_id, question_text, doc_tokens, orig_answer_text=None, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]

A single training/test example for multi-turn comprehension.

class easytransfer.preprocessors.comprehension_preprocessor.CQAInputFeatures(qas_id, unique_id, example_index, doc_span_index, tokens, doc_tokens, token_to_orig_map, token_is_max_context, input_ids, input_mask, segment_ids, start_position=None, end_position=None, history_answer_marker=None, metadata=None)[source]

A single set of features of data for multi-turn comprehension

class easytransfer.preprocessors.comprehension_preprocessor.MultiTurnComprehensionPreprocessor(config, **kwargs)[source]

Preprocessor for multi-turn text comprehension

config_class

alias of ComprehensionPreprocessorConfig

static convert_examples_to_example_variations(examples, max_considered_history_turns)[source]
convert_example_to_features(example)[source]

Convert a single example to multiple input features

Parameters:example (dict) -- inputs from the reader
Returns:list of CQAInputFeatures
Return type:features (list)
call(inputs)[source]
process(inputs)[source]

Deep Text Preprocessor

easytransfer.preprocessors.deeptext_preprocessor.get_pretrained_embedding(stoi, pretrained_w2v_path, init='random')[source]
class easytransfer.preprocessors.deeptext_preprocessor.DeepTextVocab[source]
has(word)[source]
add_word(word)[source]
add_line(line)[source]
to_idx(word)[source]
to_word(ind)[source]
filter_vocab_to_fix_length(max_vocab_size=50000)[source]
classmethod build_from_file(file_path)[source]
export_to_file(file_path)[source]
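The method list above describes a simple word/index mapping. A hypothetical re-implementation of its core (not the library class itself; the real `to_idx` may map unknown words to an UNK index rather than -1):

```python
class SimpleVocab:
    # Minimal DeepTextVocab-style word <-> index mapping.
    def __init__(self):
        self.stoi, self.itos = {}, []

    def has(self, word):
        return word in self.stoi

    def add_word(self, word):
        if word not in self.stoi:
            self.stoi[word] = len(self.itos)
            self.itos.append(word)

    def add_line(self, line):
        # Treat a line as whitespace-separated words.
        for word in line.strip().split():
            self.add_word(word)

    def to_idx(self, word):
        return self.stoi.get(word, -1)  # -1 for unknown (illustrative)

    def to_word(self, ind):
        return self.itos[ind]
```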
class easytransfer.preprocessors.deeptext_preprocessor.DeepTextPreprocessor(config, **kwargs)[source]

Preprocessor for deep text models such as CNN, DAM, HCNN, etc.

set_feature_schema()[source]
convert_example_to_features(items)[source]

Convert a single example to classification/regression features

Parameters:items (dict) -- inputs from the reader
Returns:(input_ids, input_mask, segment_ids, label_id)
Return type:features (tuple)

Tokenization

easytransfer.preprocessors.tokenization.encode_pieces(sp_model, text, return_unicode=True, sample=False)[source]

Turns sentences into word pieces.

easytransfer.preprocessors.tokenization.encode_ids(sp_model, text, sample=False)[source]
easytransfer.preprocessors.tokenization.convert_to_unicode(text)[source]

Converts text to Unicode (if it's not already), assuming utf-8 input.

easytransfer.preprocessors.tokenization.printable_text(text)[source]

Returns text encoded in a way suitable for print or tf.logging.

easytransfer.preprocessors.tokenization.load_vocab(vocab_file)[source]

Loads a vocabulary file into a dictionary.

easytransfer.preprocessors.tokenization.convert_by_vocab(vocab, items)[source]

Converts a sequence of [tokens|ids] using the vocab.

easytransfer.preprocessors.tokenization.convert_tokens_to_ids(vocab, tokens)[source]
easytransfer.preprocessors.tokenization.whitespace_tokenize(text)[source]

Runs basic whitespace cleaning and splitting on a piece of text.
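Its behavior is simply strip-then-split on whitespace:

```python
def whitespace_tokenize(text):
    # Strip surrounding whitespace, then split on any run of
    # whitespace; an empty or all-whitespace string yields [].
    text = text.strip()
    if not text:
        return []
    return text.split()
```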

class easytransfer.preprocessors.tokenization.FullTokenizer(vocab_file=None, do_lower_case=True, spm_model_file=None)[source]

Runs end-to-end tokenization.

tokenize(text)[source]
convert_tokens_to_ids(tokens)[source]
convert_ids_to_tokens(ids)[source]
class easytransfer.preprocessors.tokenization.BasicTokenizer(do_lower_case=True)[source]

Runs basic tokenization (punctuation splitting, lower casing, etc.).

tokenize(text)[source]

Tokenizes a piece of text.

class easytransfer.preprocessors.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)[source]

Runs WordPiece tokenization.

tokenize(text)[source]

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Parameters:text -- A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
Returns:A list of wordpiece tokens.
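The greedy longest-match-first algorithm can be sketched as follows (a simplified re-implementation that ignores `max_input_chars_per_word`; the docstring's own "unaffable" example is reproduced in the test):

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]"):
    # Repeatedly take the longest prefix (or "##"-prefixed
    # continuation) that appears in the vocabulary; if no substring
    # matches at some position, the whole token becomes [UNK].
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces
```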