Rapidformer Data

rapidformer.data.huggingface_bert_dataset module

class rapidformer.data.huggingface_bert_dataset.HuggingfaceBertDataset(name, indexed_dataset, data_prefix, num_epochs, max_num_samples, masked_lm_prob, max_seq_length, short_seq_prob, seed, binary_head)

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for Huggingface BERT model pretraining.

build_training_sample(sample, target_seq_length, max_seq_length, vocab_id_list, vocab_id_to_token_dict, cls_id, sep_id, mask_id, pad_id, masked_lm_prob, np_rng, binary_head)

Build training sample.

Parameters
  • sample -- A list of sentences, where each sentence is a list of token ids.

  • target_seq_length -- Desired sequence length.

  • max_seq_length -- Maximum length of the sequence. All values are padded to this length.

  • vocab_id_list -- List of vocabulary ids. Used to pick a random id.

  • vocab_id_to_token_dict -- A dictionary from vocab ids to text tokens.

  • cls_id -- Start of example id.

  • sep_id -- Separator id.

  • mask_id -- Mask token id.

  • pad_id -- Padding token id.

  • masked_lm_prob -- Probability to mask tokens.

  • np_rng -- Random number generator. Note that this rng state should be numpy and not python, since Python's randint is inclusive for the upper bound whereas numpy's is exclusive.

  • binary_head -- A boolean to specify whether to use a binary head.

Returns

A huggingface bert training sample as a dict:

train_sample = {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'token_type_ids': token_type_ids,
    'labels': labels,
    'next_sentence_label': is_next_random,
}
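To make the returned structure concrete, here is a minimal sketch of a sample with the documented keys; the values are dummy numpy arrays, not output of the real build_training_sample, and the -100 label convention is an assumption based on Huggingface's default ignore index.

```python
import numpy as np

seq_len = 8  # illustrative max_seq_length; real samples are padded to this length

# Illustrative shape of the dict returned by build_training_sample.
train_sample = {
    "input_ids": np.zeros(seq_len, dtype=np.int64),        # padded token ids
    "attention_mask": np.ones(seq_len, dtype=np.int64),    # 1 for real tokens, 0 for padding
    "token_type_ids": np.zeros(seq_len, dtype=np.int64),   # segment A/B ids
    "labels": np.full(seq_len, -100, dtype=np.int64),      # MLM targets; -100 = ignored by HF loss
    "next_sentence_label": 1,                              # is_next_random flag
}

keys = set(train_sample)
print(keys)
```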

class rapidformer.data.huggingface_bert_dataset.MaskedLmInstance(index, label)

Bases: tuple

property index

Alias for field number 0

property label

Alias for field number 1
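Since MaskedLmInstance derives from tuple with index and label as aliases for fields 0 and 1, it behaves like a namedtuple pairing a masked position with its original token. A minimal equivalent sketch (the real class may be defined differently):

```python
from collections import namedtuple

# Sketch of MaskedLmInstance as a namedtuple: field 0 is the position of the
# masked token in the sequence, field 1 is the original token id to predict.
MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

inst = MaskedLmInstance(index=7, label=2057)
print(inst.index)  # alias for field number 0
print(inst.label)  # alias for field number 1
```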

rapidformer.data.huggingface_bert_dataset.build_pretrain_bert_datasets_for_huggingface(data_prefix, data_impl, splits_string, train_valid_test_num_samples, max_seq_length, masked_lm_prob, short_seq_prob, seed, skip_warmup, binary_head=False)

Build pretraining bert datasets for huggingface.

Parameters
  • data_prefix -- Indexed dataset prefix.

  • data_impl -- Indexed dataset implementation; should be mmap.

  • splits_string -- Comma-separated proportions of train/validation/test samples, e.g. --split 980,10,10.

  • train_valid_test_num_samples -- You don't need to set this value; it is supplied by a callback.

  • max_seq_length -- Maximum length of the sequence. All values are padded to this length.

  • masked_lm_prob -- Masked language modeling probability.

  • short_seq_prob -- Short sequence probability.

  • seed -- Random seed.

  • skip_warmup -- A boolean to specify whether to warm up mmap files.

  • binary_head -- A boolean to specify whether to use a binary head.

Returns: train_dataset, valid_dataset, test_dataset
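The splits_string such as "980,10,10" partitions the indexed dataset by proportion. A hypothetical helper sketching how such a string could be normalized into train/valid/test sample boundaries (parse_splits_string is not part of rapidformer; its actual parsing may differ):

```python
def parse_splits_string(splits_string, total_samples):
    """Turn a weight string like '980,10,10' into cumulative sample boundaries."""
    weights = [float(w) for w in splits_string.split(",")]
    total = sum(weights)
    fractions = [w / total for w in weights]  # e.g. 0.98, 0.01, 0.01

    bounds = [0]
    for f in fractions:
        bounds.append(bounds[-1] + int(round(f * total_samples)))
    bounds[-1] = total_samples  # absorb rounding drift into the last split
    return bounds

# Boundaries for a 10,000-sample dataset: train [0, 9800),
# valid [9800, 9900), test [9900, 10000).
bounds = parse_splits_string("980,10,10", 10000)
print(bounds)
```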