Rapidformer Data

rapidformer.data.huggingface_bert_dataset module

class rapidformer.data.huggingface_bert_dataset.HuggingfaceBertDataset(name, indexed_dataset, data_prefix, num_epochs, max_num_samples, masked_lm_prob, max_seq_length, short_seq_prob, seed, binary_head)

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for Huggingface BERT model pretraining.

build_training_sample(sample, target_seq_length, max_seq_length, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, pad_id, masked_lm_prob, np_rng, binary_head)

Build training sample.

Parameters
  • sample -- A list of sentences, where each sentence is a list of token ids.

  • target_seq_length -- Desired sequence length.

  • max_seq_length -- Maximum length of the sequence. All values are padded to this length.

  • vocab_id_list -- List of vocabulary ids. Used to pick a random id.

  • vocab_id_to_token_dict -- A dictionary from vocab ids to text tokens.

  • cls_id -- Start of example id.

  • sep_id -- Separator id.

  • mask_id -- Mask token id.

  • pad_id -- Padding token id.

  • masked_lm_prob -- Probability to mask tokens.

  • np_rng -- Random number generator. Note that this rng state should be numpy and not python, since Python's randint is inclusive for the upper bound whereas numpy's is exclusive.

  • binary_head -- A boolean to specify whether to use a binary head.

Returns

A huggingface bert training sample as a dict:

train_sample = {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'token_type_ids': token_type_ids,
    'labels': labels,
    'next_sentence_label': is_next_random,
}
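To make the returned structure concrete, here is a minimal sketch of a sample with the documented keys; the values are dummy numpy arrays, not output of the real build_training_sample, and the -100 label convention is an assumption based on Huggingface's default ignore index.

```python
import numpy as np

seq_len = 8  # illustrative max_seq_length; real samples are padded to this length

# Illustrative shape of the dict returned by build_training_sample.
train_sample = {
    "input_ids": np.zeros(seq_len, dtype=np.int64),        # padded token ids
    "attention_mask": np.ones(seq_len, dtype=np.int64),    # 1 for real tokens, 0 for padding
    "token_type_ids": np.zeros(seq_len, dtype=np.int64),   # segment A/B ids
    "labels": np.full(seq_len, -100, dtype=np.int64),      # MLM targets; -100 = ignored by HF loss
    "next_sentence_label": 1,                              # is_next_random flag
}

keys = set(train_sample)
print(keys)
```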

class rapidformer.data.huggingface_bert_dataset.MaskedLmInstance(index, label)

Bases: tuple

property index

Alias for field number 0

property label

Alias for field number 1
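Since MaskedLmInstance derives from tuple with index and label as aliases for fields 0 and 1, it behaves like a namedtuple pairing a masked position with its original token. A minimal equivalent sketch (the real class may be defined differently):

```python
from collections import namedtuple

# Sketch of MaskedLmInstance as a namedtuple: field 0 is the position of the
# masked token in the sequence, field 1 is the original token id to predict.
MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

inst = MaskedLmInstance(index=7, label=2057)
print(inst.index)  # alias for field number 0
print(inst.label)  # alias for field number 1
```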

rapidformer.data.huggingface_bert_dataset.build_pretrain_bert_datasets_for_huggingface(data_prefix, data_impl, splits_string, train_valid_test_num_samples, max_seq_length, masked_lm_prob, short_seq_prob, seed, skip_warmup, binary_head=False)

Build pretraining bert datasets for huggingface.

Parameters
  • data_prefix -- Indexed dataset prefix.

  • data_impl -- Indexed dataset implementation; should be mmap.

  • splits_string -- Comma-separated proportions of train/validation/test samples, e.g. --split 980,10,10.

  • train_valid_test_num_samples -- You don't need to set this value; it is supplied by a callback.

  • max_seq_length -- Maximum length of the sequence. All values are padded to this length.

  • masked_lm_prob -- Masked language modeling probability.

  • short_seq_prob -- Short sequence probability.

  • seed -- Random seed.

  • skip_warmup -- A boolean to specify whether to warm up mmap files.

  • binary_head -- A boolean to specify whether to use a binary head.

Returns: train_dataset, valid_dataset, test_dataset
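The splits_string such as "980,10,10" partitions the indexed dataset by proportion. A hypothetical helper sketching how such a string could be normalized into train/valid/test sample boundaries (parse_splits_string is not part of rapidformer; its actual parsing may differ):

```python
def parse_splits_string(splits_string, total_samples):
    """Turn a weight string like '980,10,10' into cumulative sample boundaries."""
    weights = [float(w) for w in splits_string.split(",")]
    total = sum(weights)
    fractions = [w / total for w in weights]  # e.g. 0.98, 0.01, 0.01

    bounds = [0]
    for f in fractions:
        bounds.append(bounds[-1] + int(round(f * total_samples)))
    bounds[-1] = total_samples  # absorb rounding drift into the last split
    return bounds

# Boundaries for a 10,000-sample dataset: train [0, 9800),
# valid [9800, 9900), test [9900, 10000).
bounds = parse_splits_string("980,10,10", 10000)
print(bounds)
```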