easynlp.data

Dataset for sequence classification

class easynlp.appzoo.sequence_classification.data.ClassificationDataset(pretrained_model_name_or_path, data_file, max_seq_length, input_schema, first_sequence, label_name=None, second_sequence=None, label_enumerate_values=None, multi_label=False, *args, **kwargs)[source]

Classification Dataset

Parameters:
  • pretrained_model_name_or_path -- name or path of the pretrained model, used to initialize the tokenizer.
  • data_file -- input data file.
  • max_seq_length -- max sequence length of each input instance.
  • first_sequence -- column name of the input text.
  • label_name -- label column name.
  • second_sequence -- column name of the second input text; set to None for single-sequence classification.
  • label_enumerate_values -- a list of label values.
  • multi_label -- set to True to perform multi-label classification, otherwise False.
label_enumerate_values

Returns the label enumerate values.
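
A minimal construction sketch (the model name, data file, and input_schema string below are placeholders, and the exact accepted format of label_enumerate_values may vary between EasyNLP versions):

    from easynlp.appzoo.sequence_classification.data import ClassificationDataset

    # Hypothetical two-column TSV file with columns "sent" and "label".
    train_dataset = ClassificationDataset(
        pretrained_model_name_or_path='bert-base-uncased',   # placeholder tokenizer source
        data_file='train.tsv',
        max_seq_length=128,
        input_schema='sent:str:1,label:str:1',               # assumed schema syntax
        first_sequence='sent',
        label_name='label',
        second_sequence=None,
        label_enumerate_values='0,1',                        # assumed comma-separated labels
        multi_label=False,
    )
    print(train_dataset.label_enumerate_values)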

convert_single_row_to_example(row)[source]

Convert a sample's tokens to indices.

Parameters:
  • row -- the input row, containing the sequence(s) and the label.
  • text_a -- the first sequence in row.
  • text_b -- the second sequence in row, if self.second_sequence is set.
  • label -- the label value, if self.label_name is set.
Returns: encoding -- a single example containing the token indices.
batch_fn(features)[source]

Divide examples into batches.
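
A common usage pattern is to let convert_single_row_to_example run inside the dataset's item access and to pass batch_fn to a PyTorch DataLoader as the collate function. A hedged sketch, reusing train_dataset from the example above:

    from torch.utils.data import DataLoader

    loader = DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True,
        collate_fn=train_dataset.batch_fn,   # batch_fn groups encoded examples into batches
    )
    first_batch = next(iter(loader))         # padded token indices, masks, and labels (assumed contents)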

class easynlp.appzoo.sequence_classification.data.DistillatoryClassificationDataset(user_defined_parameters: dict, *args, **kwargs)[source]
class easynlp.appzoo.sequence_classification.data.FewshotSequenceClassificationDataset(pretrained_model_name_or_path, data_file, max_seq_length, first_sequence, input_schema=None, user_defined_parameters=None, label_name=None, second_sequence=None, label_enumerate_values=None, **kwargs)[source]

Dataset for sequence labeling

class easynlp.appzoo.sequence_labeling.data.InputExample(text_a, text_b=None, label=None, guid=None)[source]

A single training/test example for simple sequence classification.

class easynlp.appzoo.sequence_labeling.data.LabelingFeatures(input_ids, input_mask, segment_ids, all_tokens, label_ids, tok_to_orig_index, seq_length=None, guid=None)[source]

A single set of features of data for sequence labeling.

easynlp.appzoo.sequence_labeling.data.bert_labeling_convert_example_to_feature(example, tokenizer, max_seq_length, label_map=None)[source]

Convert an InputExample into an InputFeature for the sequence labeling task.

Parameters:
  • example (InputExample) -- an input example
  • tokenizer (BertTokenizer) -- BERT Tokenizer
  • max_seq_length (int) -- Maximum sequence length while truncating
  • label_map (dict) -- a map from label_value to label_idx; the task is treated as "regression" if label_map is None, otherwise as "classification".
Returns:
  an input feature
Return type:
  feature (InputFeatures)
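
A hedged sketch of calling this function directly. The tag format of InputExample.label and the tokenizer class are assumptions; EasyNLP's own model zoo tokenizer can be substituted for the Hugging Face one used here:

    from transformers import BertTokenizer
    from easynlp.appzoo.sequence_labeling.data import (
        InputExample, bert_labeling_convert_example_to_feature)

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Assumed format: whitespace-separated tokens aligned one-to-one with tags.
    example = InputExample(text_a='EasyNLP is developed by Alibaba',
                           label='B-ORG O O O B-ORG')
    label_map = {'O': 0, 'B-ORG': 1, 'I-ORG': 2}   # label_value -> label_idx

    feature = bert_labeling_convert_example_to_feature(
        example, tokenizer, max_seq_length=64, label_map=label_map)
    print(feature.input_ids[:10], feature.label_ids[:10])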

class easynlp.appzoo.sequence_labeling.data.SequenceLabelingDataset(pretrained_model_name_or_path, data_file, max_seq_length, first_sequence, label_name=None, label_enumerate_values=None, *args, **kwargs)[source]

Sequence Labeling Dataset

Parameters:
  • pretrained_model_name_or_path -- name or path of the pretrained model, used to initialize the tokenizer.
  • data_file -- input data file.
  • max_seq_length -- max sequence length of each input instance.
  • first_sequence -- input sequence.
  • label_name -- label column name.
  • label_enumerate_values -- the list of label values.
label_enumerate_values
convert_single_row_to_example(row)[source]
batch_fn(features)[source]
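
Construction mirrors ClassificationDataset. A minimal sketch with placeholder file name, schema, and tag set; input_schema is forwarded through **kwargs here, so check the expected syntax against your EasyNLP version:

    from easynlp.appzoo.sequence_labeling.data import SequenceLabelingDataset

    ner_dataset = SequenceLabelingDataset(
        pretrained_model_name_or_path='bert-base-uncased',   # placeholder tokenizer source
        data_file='ner_train.tsv',
        max_seq_length=128,
        first_sequence='content',
        label_name='label',
        label_enumerate_values='O,B-ORG,I-ORG',              # assumed label set and format
        input_schema='content:str:1,label:str:1',            # assumed, passed via **kwargs
    )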

Dataset for language modeling

class easynlp.appzoo.language_modeling.data.LanguageModelingDataset(pretrained_model_name_or_path, data_file, max_seq_length, user_defined_parameters, mlm_mask_prop=0.15, **kwargs)[source]

Whole word mask Language Model Dataset

Parameters:
  • pretrained_model_name_or_path -- name or path of the pretrained model, used to initialize the tokenizer.
  • data_file -- input data file.
  • max_seq_length -- max sequence length of each input instance.
  • mlm_mask_prop -- the proportion of words to be masked.
convert_single_row_to_example(row)[source]
batch_fn(batch)[source]
mask_tokens(inputs, mask_labels)[source]

Prepare masked token inputs/labels for masked language modeling: 80% [MASK], 10% random, 10% original. Setting 'mask_labels' means whole word masking (wwm) is used: the chosen indices are masked directly according to their whole-word reference.
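
The 80/10/10 rule follows the standard BERT masking recipe. The sketch below is a simplified re-implementation for a single sequence, not EasyNLP's exact code, assuming a Hugging Face-style tokenizer and a 0/1 mask_labels tensor marking the whole-word positions to corrupt:

    import torch

    def mask_tokens_sketch(inputs, mask_labels, tokenizer):
        """Simplified 80% [MASK] / 10% random / 10% original masking (illustrative only)."""
        labels = inputs.clone()
        masked = mask_labels.bool()               # positions chosen by whole word masking
        labels[~masked] = -100                    # only masked positions contribute to the loss

        # 80% of the chosen positions are replaced with [MASK].
        replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        inputs[replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

        # Half of the remaining 20% (i.e. 10% overall) get a random token.
        randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
        inputs[randomized] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[randomized]

        # The last 10% keep their original token.
        return inputs, labels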

dkplm_row_data_process(sentence)[source]
align_dkplm_input(max_seq_len, token_ids, ent_pos, relation_id, replaced_entity_id)[source]