Rapidformer Engine

rapidformer.engine.arguments module

rapidformer.engine.arguments.parse_args(extra_args_provider=None, defaults={}, ignore_unknown_args=False)

Parse all arguments.

Parameters
  • extra_args_provider -- Provider for task-specific args, if needed.

  • defaults -- Default values for arguments.

  • ignore_unknown_args -- A boolean specifying whether to ignore unknown args.

Returns

parser
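
A minimal, illustrative sketch of calling parse_args with a task-specific provider; the provider below and the default value shown are hypothetical, and launcher flags are assumed to arrive on the command line:

    # Hypothetical provider that registers one extra, task-specific flag.
    from rapidformer.engine.arguments import parse_args

    def extra_args_provider(parser):
        group = parser.add_argument_group(title='my task')
        group.add_argument('--my-task-flag', type=int, default=1)
        return parser

    # defaults overrides built-in argument defaults; unknown CLI flags are skipped.
    result = parse_args(extra_args_provider=extra_args_provider,
                        defaults={'micro_batch_size': 4},
                        ignore_unknown_args=True)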

rapidformer.engine.clip_grads module

rapidformer.engine.clip_grads.clip_grad_norm_fp32(parameters, max_norm, norm_type=2)
Clips the gradient norm of an iterable of parameters whose gradients are in fp32.

Parameters
  • parameters (Iterable[Tensor] or Tensor) -- an iterable of Tensors or a single Tensor that will have gradients normalized

  • max_norm (float or int) -- max norm of the gradients

  • norm_type (float or int) -- type of the used p-norm. Can be 'inf' for infinity norm.

Returns

Total norm of the parameters (viewed as a single vector).
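
An illustrative sketch; the model and threshold are placeholders, and the helper generally expects the distributed state set up by the engine/initializer to already be in place:

    import torch
    from rapidformer.engine.clip_grads import clip_grad_norm_fp32

    # Illustrative model and loss; gradients are plain fp32 tensors.
    model = torch.nn.Linear(8, 8).cuda()
    loss = model(torch.randn(2, 8, device='cuda')).sum()
    loss.backward()

    # Clip to a global L2 norm of 1.0; the return value is the total norm
    # of the parameters viewed as a single vector.
    total_norm = clip_grad_norm_fp32(model.parameters(), max_norm=1.0, norm_type=2)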

rapidformer.engine.engine module

class rapidformer.engine.engine.RapidformerEngine(extra_args_provider=None, args_defaults={})

Bases: object

The Rapidformer Engine class that wraps acceleration tricks.

compose(model=None, optimizer=None, lr_scheduler_fn=None, model_optimizer_lrscheduler_provider_func=None)

Generate the wrapped model, optimizer and lr_scheduler.

model, optimizer and lr_scheduler_fn are used by users who run their own training loop (without the trainer).

model_optimizer_lrscheduler_provider_func is used by trainer users.

lr_scheduler_fn can be built with functools.partial, for example:

lr_scheduler_fn = partial(get_linear_schedule_with_warmup, num_warmup_steps=args.lr_warmup_iters, num_training_steps=args.train_iters)

Parameters
  • model -- A HuggingFace, EasyTexminer, or Megatron model object.

  • optimizer -- PyTorch optimizer.

  • lr_scheduler_fn -- lr scheduler function object.

  • model_optimizer_lrscheduler_provider_func -- The function will be used as a callback to build the model, optimizer and lr scheduler.

Returns

model, optimizer, lr_scheduler
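
A hedged end-to-end sketch of the no-trainer path; the scheduler helper comes from the transformers library, a small torch module stands in for a real HuggingFace/EasyTexminer/Megatron model, and the argument names (lr, lr_warmup_iters, train_iters) follow the example above:

    from functools import partial
    import torch
    from transformers import get_linear_schedule_with_warmup
    from rapidformer.engine.engine import RapidformerEngine
    from rapidformer.engine.global_vars import get_args

    engine = RapidformerEngine()      # presumably parses args and sets up the global/distributed state
    args = get_args()

    model = torch.nn.Linear(16, 16)   # stand-in for a real model object
    optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
    lr_scheduler_fn = partial(get_linear_schedule_with_warmup,
                              num_warmup_steps=args.lr_warmup_iters,
                              num_training_steps=args.train_iters)

    model, optimizer, lr_scheduler = engine.compose(model=model,
                                                    optimizer=optimizer,
                                                    lr_scheduler_fn=lr_scheduler_fn)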

rapidformer.engine.global_vars module

rapidformer.engine.global_vars.get_tokenizer()

Return tokenizer.

rapidformer.engine.global_vars.get_args()

Return arguments.

rapidformer.engine.global_vars.get_num_microbatches()

Return the number of microbatches.

rapidformer.engine.global_vars.get_current_global_batch_size()

Return current global batch size.

rapidformer.engine.global_vars.update_num_microbatches(consumed_samples, consistency_check=True)
rapidformer.engine.global_vars.get_timers()

Return timers.

rapidformer.engine.global_vars.get_logger()

Return logger.

rapidformer.engine.global_vars.set_rapidformer_global_variables(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False)

Set args, tokenizer, tensorboard-writer, adlr-autoresume, and timers.
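
Once the globals have been set (by set_rapidformer_global_variables or by the engine/initializer entry points), the accessors above can be used from anywhere in the process; an illustrative sketch:

    from rapidformer.engine import global_vars

    args = global_vars.get_args()
    tokenizer = global_vars.get_tokenizer()
    timers = global_vars.get_timers()
    logger = global_vars.get_logger()

    print('current global batch size:', global_vars.get_current_global_batch_size())
    print('number of micro-batches:', global_vars.get_num_microbatches())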

class rapidformer.engine.global_vars.Timers(logger)

Bases: object

Group of timers.

write(names, writer, iteration, normalizer=1.0, reset=False)

Write timers to a tensorboard writer.

log(names, normalizer=1.0, reset=True)

Log a group of timers.
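
An illustrative timing sketch; only write() and log() are documented above, while the name-based start()/stop() calls are an assumption, following the Megatron convention this module appears to mirror:

    import time
    from torch.utils.tensorboard import SummaryWriter
    from rapidformer.engine.global_vars import get_timers

    timers = get_timers()                    # requires the globals to be set first

    # Assumed Megatron-style usage: calling the group with a name returns a timer.
    timers('data-loading').start()
    time.sleep(0.1)                          # stand-in for real work
    timers('data-loading').stop()

    timers.log(['data-loading'], normalizer=1.0, reset=True)           # documented
    timers.write(['data-loading'], writer=SummaryWriter('./tb'),
                 iteration=0, normalizer=1.0)                          # documented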

class rapidformer.engine.global_vars.Logger(log_file=None, level='info')

Bases: object

level_relations = {'crit': 50, 'debug': 10, 'error': 40, 'info': 20, 'warning': 30}
rapidformer.engine.global_vars.build_num_microbatches_calculator(args)
class rapidformer.engine.global_vars.NumMicroBatchesCalculator

Bases: abc.ABC

get()
get_current_global_batch_size()
abstract update(consumed_samples, consistency_check)
class rapidformer.engine.global_vars.ConstantNumMicroBatches(global_batch_size, micro_batch_size, data_parallel_size)

Bases: rapidformer.engine.global_vars.NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
class rapidformer.engine.global_vars.RampupBatchsizeNumMicroBatches(start_batch_size, batch_size_increment, ramup_samples, global_batch_size, micro_batch_size, data_parallel_size)

Bases: rapidformer.engine.global_vars.NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
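
For reference, the usual relationship is global_batch_size = micro_batch_size * data_parallel_size * num_microbatches. An illustrative sketch with the constant calculator (constructor arguments as listed above; treating update() as a no-op in the constant case is an assumption):

    from rapidformer.engine.global_vars import ConstantNumMicroBatches

    # 64 = 4 (micro batch) * 2 (data-parallel replicas) * 8 (micro-batches)
    calc = ConstantNumMicroBatches(global_batch_size=64,
                                   micro_batch_size=4,
                                   data_parallel_size=2)

    print(calc.get())                            # expected: 8
    print(calc.get_current_global_batch_size())  # expected: 64
    calc.update(consumed_samples=1024, consistency_check=True)  # assumed no-op for the constant case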

rapidformer.engine.initialize module

rapidformer.engine.initialize.initialize_rapidformer(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False, allow_no_cuda=False)

Set global variables, initialize distributed, and set autoresume and random seeds.

Parameters
  • extra_args_provider -- Provider for task-specific args, if needed.

  • args_defaults -- Argument defaults.

  • ignore_unknown_args -- A boolean specifying whether to ignore unknown args.

  • allow_no_cuda -- Should not be set unless using Megatron for CPU-only data processing; in general, do not set this unless you know what you are doing.

Returns

A function to finalize distributed environment initialization (returned only when args.lazy_mpu_init == True).
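
An illustrative sketch of standalone initialization, assuming the script is launched by a distributed launcher (e.g. torchrun) so the usual rank/world-size environment variables are present; the default shown is illustrative:

    from rapidformer.engine.initialize import initialize_rapidformer
    from rapidformer.engine.global_vars import get_args

    # Sets globals, initializes torch.distributed, autoresume and random seeds.
    initialize_rapidformer(args_defaults={'seed': 1234},
                           ignore_unknown_args=True)

    args = get_args()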

rapidformer.engine.optimizer module

Megatron optimizer.

class rapidformer.engine.optimizer.MegatronOptimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp)

Bases: abc.ABC

get_parameters()
clip_grad_norm(clip_grad)
count_zeros()
abstract zero_grad(set_to_none=True)
abstract get_loss_scale()

The output should be a cuda tensor of size 1.

scale_loss(loss)

Simple scaling.

abstract step()
abstract reload_model_params()

Refreshes any internal state from the current model parameters. Call whenever the parameters are changed outside of the optimizer. For example, when we load a model from a checkpoint without loading the optimizer, the model parameters are updated but for fp16 optimizer with main parameters, the main parameters need to also be updated.

abstract state_dict()
abstract load_state_dict(state_dict)
property state
property param_groups
class rapidformer.engine.optimizer.Float16OptimizerWithFloat16Params(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp, bf16, grad_scaler)

Bases: rapidformer.engine.optimizer.MegatronOptimizer

Float16 optimizer for fp16 and bf16 data types.

Parameters
  • optimizer -- base optimizer such as Adam or SGD

  • clip_grad -- Clip gradients with this global L2 norm. Note that clipping is ignored if clip_grad == 0.

  • log_num_zeros_in_grad -- return number of zeros in the gradients.

  • params_have_main_grad -- Flag indicating whether parameters have a main_grad field. If this is set, we assume that the parameters' gradients are stored in the main_grad field instead of the typical grad field. This happens in the DDP cases where there is a contiguous buffer holding the gradients. For example, for bfloat16 we want to do gradient accumulation and all-reduces in float32, so we store those gradients in main_grad. Note that main_grad is not necessarily in float32.

  • bf16 -- if true, the model is running in bfloat16.

  • grad_scaler -- Used for scaling gradients. Note that this can be None, which happens when bf16 = True and no loss scale is used. Note that for bf16 = True we can have a constant gradient scaler, while for bf16 = False a grad scaler is always required.

zero_grad(set_to_none=True)

We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.

get_loss_scale()
reload_model_params()
step()
state_dict()
load_state_dict(state_dict)
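
Because this optimizer owns the loss scale, the usual pattern is to scale the loss before the backward pass and let step() handle unscaling, clipping and the parameter update. An illustrative sketch built only from the methods documented above:

    def train_step(model, optimizer, batch):
        """Illustrative fp16/bf16 step with a MegatronOptimizer-style optimizer."""
        optimizer.zero_grad(set_to_none=True)
        loss = model(batch).float().mean()            # stand-in loss computation
        scaled_loss = optimizer.scale_loss(loss)      # documented: "Simple scaling."
        scaled_loss.backward()
        optimizer.step()                              # unscale, clip (per clip_grad) and update
        return loss.detach(), optimizer.get_loss_scale()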

rapidformer.engine.schedules module

rapidformer.engine.schedules.get_learning_rate_scheduler(optimizer)

Build the learning rate scheduler.

rapidformer.engine.utils module

rapidformer.engine.utils.honor_type(obj, generator)

Cast a generator to the same type as obj (list, tuple or namedtuple).

rapidformer.engine.utils.is_torch_tensor(tensor)
rapidformer.engine.utils.recursively_apply(func, data, *args, test_type=<function is_torch_tensor>, error_on_other_type=False, **kwargs)

Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.

Parameters
  • func (callable) -- The function to recursively apply.

  • data (nested list/tuple/dictionary of main_type) -- The data on which to apply func.

  • *args -- Positional arguments that will be passed to func when applied on the unpacked data.

  • main_type (type, optional, defaults to torch.Tensor) -- The base type of the objects to which to apply func.

  • error_on_other_type (bool, optional, defaults to False) -- Whether to raise an error if, after unpacking data, we encounter an object that is not of type main_type. If False, the function leaves objects of types other than main_type unchanged.

  • **kwargs -- Keyword arguments that will be passed to func when applied on the unpacked data.

Returns

The same data structure as data with func applied to every object of type main_type.
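
An illustrative example applying a function to every tensor in a nested batch:

    import torch
    from rapidformer.engine.utils import recursively_apply

    batch = {'input_ids': torch.ones(2, 4, dtype=torch.long),
             'labels': [torch.zeros(2), torch.zeros(2)]}

    # Halve every tensor in the nested structure; non-tensor leaves are left unchanged.
    halved = recursively_apply(lambda t: t / 2, batch)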

rapidformer.engine.utils.gather(tensor)
rapidformer.engine.utils.send_to_device(tensor, device)

Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.

Parameters
  • tensor (nested list/tuple/dictionary of torch.Tensor) -- The data to send to a given device.

  • device (torch.device) -- The device to send the data to.

Returns

The same data structure as tensor with all tensors sent to the proper device.
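
For example, moving a whole batch dictionary to the current device in one call (illustrative):

    import torch
    from rapidformer.engine.utils import send_to_device

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch = {'input_ids': torch.ones(2, 4, dtype=torch.long),
             'attention_mask': torch.ones(2, 4)}
    batch = send_to_device(batch, device)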

rapidformer.engine.utils.report_memory(name, logger)

Simple GPU memory report.

rapidformer.engine.utils.unwrap_model(model, module_instances=<class 'torch.nn.parallel.distributed.DistributedDataParallel'>)
rapidformer.engine.utils.average_losses_across_data_parallel_group(losses)

Reduce a tensor of losses across all GPUs.
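
For intuition, an illustrative equivalent (not the library's implementation) using torch.distributed directly: each rank contributes its local losses, and the all-reduced sum is divided by the data-parallel world size:

    import torch
    import torch.distributed as dist

    def average_losses_sketch(losses, group=None):
        """Illustrative stand-in for average_losses_across_data_parallel_group."""
        averaged = torch.cat([loss.clone().detach().view(1) for loss in losses])
        dist.all_reduce(averaged, group=group)        # sum across the (data-parallel) group
        averaged /= dist.get_world_size(group=group)  # turn the sum into a mean
        return averaged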