Rapidformer Engine

rapidformer.engine.arguments module

rapidformer.engine.arguments.parse_args(extra_args_provider=None, defaults={}, ignore_unknown_args=False)

Parse all arguments.

Parameters
  • extra_args_provider -- Provider for task-specific args, if needed.

  • defaults -- Default values for arguments.

  • ignore_unknown_args -- A boolean specifying whether to ignore unknown args.

Returns

parser
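
A minimal, illustrative sketch of calling parse_args with a task-specific provider; the provider below and the default value shown are hypothetical, and launcher flags are assumed to arrive on the command line:

    # Hypothetical provider that registers one extra, task-specific flag.
    from rapidformer.engine.arguments import parse_args

    def extra_args_provider(parser):
        group = parser.add_argument_group(title='my task')
        group.add_argument('--my-task-flag', type=int, default=1)
        return parser

    # defaults overrides built-in argument defaults; unknown CLI flags are skipped.
    result = parse_args(extra_args_provider=extra_args_provider,
                        defaults={'micro_batch_size': 4},
                        ignore_unknown_args=True)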

rapidformer.engine.clip_grads module

rapidformer.engine.clip_grads.clip_grad_norm_fp32(parameters, max_norm, norm_type=2)
Clips the gradient norm of an iterable of parameters whose gradients are in fp32.

Parameters
  • parameters (Iterable[Tensor] or Tensor) -- an iterable of Tensors or a single Tensor that will have gradients normalized

  • max_norm (float or int) -- max norm of the gradients

  • norm_type (float or int) -- type of the used p-norm. Can be 'inf' for infinity norm.

Returns

Total norm of the parameters (viewed as a single vector).
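
An illustrative sketch; the model and threshold are placeholders, and the helper generally expects the distributed state set up by the engine/initializer to already be in place:

    import torch
    from rapidformer.engine.clip_grads import clip_grad_norm_fp32

    # Illustrative model and loss; gradients are plain fp32 tensors.
    model = torch.nn.Linear(8, 8).cuda()
    loss = model(torch.randn(2, 8, device='cuda')).sum()
    loss.backward()

    # Clip to a global L2 norm of 1.0; the return value is the total norm
    # of the parameters viewed as a single vector.
    total_norm = clip_grad_norm_fp32(model.parameters(), max_norm=1.0, norm_type=2)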

rapidformer.engine.engine module

class rapidformer.engine.engine.RapidformerEngine(extra_args_provider=None, args_defaults={})

Bases: object

The Rapidformer Engine class that wraps acceleration tricks.

compose(model=None, optimizer=None, lr_scheduler_fn=None, model_optimizer_lrscheduler_provider_func=None)

Generate the wrapped model, optimizer and lr_scheduler.

model, optimizer and lr_scheduler_fn are used by users who run their own training loop (without the trainer).

model_optimizer_lrscheduler_provider_func is used by trainer users.

lr_scheduler_fn can be built with functools.partial, for example:

lr_scheduler_fn = partial(get_linear_schedule_with_warmup, num_warmup_steps=args.lr_warmup_iters, num_training_steps=args.train_iters)

Parameters
  • model -- A HuggingFace, EasyTexminer, or Megatron model object.

  • optimizer -- PyTorch optimizer.

  • lr_scheduler_fn -- lr scheduler function object.

  • model_optimizer_lrscheduler_provider_func -- The function will be used as a callback to build the model, optimizer and lr scheduler.

Returns

model, optimizer, lr_scheduler
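
A hedged end-to-end sketch of the no-trainer path; the scheduler helper comes from the transformers library, a small torch module stands in for a real HuggingFace/EasyTexminer/Megatron model, and the argument names (lr, lr_warmup_iters, train_iters) follow the example above:

    from functools import partial
    import torch
    from transformers import get_linear_schedule_with_warmup
    from rapidformer.engine.engine import RapidformerEngine
    from rapidformer.engine.global_vars import get_args

    engine = RapidformerEngine()      # presumably parses args and sets up the global/distributed state
    args = get_args()

    model = torch.nn.Linear(16, 16)   # stand-in for a real model object
    optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
    lr_scheduler_fn = partial(get_linear_schedule_with_warmup,
                              num_warmup_steps=args.lr_warmup_iters,
                              num_training_steps=args.train_iters)

    model, optimizer, lr_scheduler = engine.compose(model=model,
                                                    optimizer=optimizer,
                                                    lr_scheduler_fn=lr_scheduler_fn)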

rapidformer.engine.global_vars module

rapidformer.engine.global_vars.get_tokenizer()

Return tokenizer.

rapidformer.engine.global_vars.get_args()

Return arguments.

rapidformer.engine.global_vars.get_num_microbatches()

Return the number of microbatches.

rapidformer.engine.global_vars.get_current_global_batch_size()

Return current global batch size.

rapidformer.engine.global_vars.update_num_microbatches(consumed_samples, consistency_check=True)
rapidformer.engine.global_vars.get_timers()

Return timers.

rapidformer.engine.global_vars.get_logger()

Return logger.

rapidformer.engine.global_vars.set_rapidformer_global_variables(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False)

Set args, tokenizer, tensorboard-writer, adlr-autoresume, and timers.
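
Once the globals have been set (by set_rapidformer_global_variables or by the engine/initializer entry points), the accessors above can be used from anywhere in the process; an illustrative sketch:

    from rapidformer.engine import global_vars

    args = global_vars.get_args()
    tokenizer = global_vars.get_tokenizer()
    timers = global_vars.get_timers()
    logger = global_vars.get_logger()

    print('current global batch size:', global_vars.get_current_global_batch_size())
    print('number of micro-batches:', global_vars.get_num_microbatches())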

class rapidformer.engine.global_vars.Timers(logger)

Bases: object

Group of timers.

write(names, writer, iteration, normalizer=1.0, reset=False)

Write timers to a tensorboard writer.

log(names, normalizer=1.0, reset=True)

Log a group of timers.
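
An illustrative timing sketch; only write() and log() are documented above, while the name-based start()/stop() calls are an assumption, following the Megatron convention this module appears to mirror:

    import time
    from torch.utils.tensorboard import SummaryWriter
    from rapidformer.engine.global_vars import get_timers

    timers = get_timers()                    # requires the globals to be set first

    # Assumed Megatron-style usage: calling the group with a name returns a timer.
    timers('data-loading').start()
    time.sleep(0.1)                          # stand-in for real work
    timers('data-loading').stop()

    timers.log(['data-loading'], normalizer=1.0, reset=True)           # documented
    timers.write(['data-loading'], writer=SummaryWriter('./tb'),
                 iteration=0, normalizer=1.0)                          # documented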

class rapidformer.engine.global_vars.Logger(log_file=None, level='info')

Bases: object

level_relations = {'crit': 50, 'debug': 10, 'error': 40, 'info': 20, 'warning': 30}
rapidformer.engine.global_vars.build_num_microbatches_calculator(args)
class rapidformer.engine.global_vars.NumMicroBatchesCalculator

Bases: abc.ABC

get()
get_current_global_batch_size()
abstract update(consumed_samples, consistency_check)
class rapidformer.engine.global_vars.ConstantNumMicroBatches(global_batch_size, micro_batch_size, data_parallel_size)

Bases: rapidformer.engine.global_vars.NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
class rapidformer.engine.global_vars.RampupBatchsizeNumMicroBatches(start_batch_size, batch_size_increment, ramup_samples, global_batch_size, micro_batch_size, data_parallel_size)

Bases: rapidformer.engine.global_vars.NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
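
For reference, the usual relationship is global_batch_size = micro_batch_size * data_parallel_size * num_microbatches. An illustrative sketch with the constant calculator (constructor arguments as listed above; treating update() as a no-op in the constant case is an assumption):

    from rapidformer.engine.global_vars import ConstantNumMicroBatches

    # 64 = 4 (micro batch) * 2 (data-parallel replicas) * 8 (micro-batches)
    calc = ConstantNumMicroBatches(global_batch_size=64,
                                   micro_batch_size=4,
                                   data_parallel_size=2)

    print(calc.get())                            # expected: 8
    print(calc.get_current_global_batch_size())  # expected: 64
    calc.update(consumed_samples=1024, consistency_check=True)  # assumed no-op for the constant case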

rapidformer.engine.initialize module

rapidformer.engine.initialize.initialize_rapidformer(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False, allow_no_cuda=False)

Set global variables, initialize distributed, and set autoresume and random seeds.

Parameters
  • extra_args_provider -- Provider for task-specific args, if needed.

  • args_defaults -- Argument defaults.

  • ignore_unknown_args -- A boolean specifying whether to ignore unknown args.

  • allow_no_cuda -- Should not be set unless using Megatron for CPU-only data processing; in general, do not set this unless you know what you are doing.

Returns

A function to finalize distributed environment initialization (returned only when args.lazy_mpu_init == True).
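
An illustrative sketch of standalone initialization, assuming the script is launched by a distributed launcher (e.g. torchrun) so the usual rank/world-size environment variables are present; the default shown is illustrative:

    from rapidformer.engine.initialize import initialize_rapidformer
    from rapidformer.engine.global_vars import get_args

    # Sets globals, initializes torch.distributed, autoresume and random seeds.
    initialize_rapidformer(args_defaults={'seed': 1234},
                           ignore_unknown_args=True)

    args = get_args()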

rapidformer.engine.optimizer module

Megatron optimizer.

class rapidformer.engine.optimizer.MegatronOptimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp)

Bases: abc.ABC

get_parameters()
clip_grad_norm(clip_grad)
count_zeros()
abstract zero_grad(set_to_none=True)
abstract get_loss_scale()

The output should be a cuda tensor of size 1.

scale_loss(loss)

Simple scaling.

abstract step()
abstract reload_model_params()

Refreshes any internal state from the current model parameters. Call whenever the parameters are changed outside of the optimizer. For example, when we load a model from a checkpoint without loading the optimizer, the model parameters are updated but for fp16 optimizer with main parameters, the main parameters need to also be updated.

abstract state_dict()
abstract load_state_dict(state_dict)
property state
property param_groups
class rapidformer.engine.optimizer.Float16OptimizerWithFloat16Params(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp, bf16, grad_scaler)

Bases: rapidformer.engine.optimizer.MegatronOptimizer

Float16 optimizer for fp16 and bf16 data types.

Parameters
  • optimizer -- base optimizer such as Adam or SGD

  • clip_grad -- Clip gradients with this global L2 norm. Note that clipping is ignored if clip_grad == 0.

  • log_num_zeros_in_grad -- return number of zeros in the gradients.

  • params_have_main_grad -- Flag indicating whether parameters have a main_grad field. If this is set, we assume that the parameters' gradients are stored in the main_grad field instead of the typical grad field. This happens in the DDP cases where there is a contiguous buffer holding the gradients. For example, for bfloat16 we want to do gradient accumulation and all-reduces in float32, so we store those gradients in main_grad. Note that main_grad is not necessarily in float32.

  • bf16 -- if true, the model is running in bfloat16.

  • grad_scaler -- Used for scaling gradients. Note that this can be None, which happens when bf16 = True and no loss scale is used. Note that for bf16 = True we can have a constant gradient scaler, while for bf16 = False a grad scaler is always required.

zero_grad(set_to_none=True)

We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.

get_loss_scale()
reload_model_params()
step()
state_dict()
load_state_dict(state_dict)
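
Because this optimizer owns the loss scale, the usual pattern is to scale the loss before the backward pass and let step() handle unscaling, clipping and the parameter update. An illustrative sketch built only from the methods documented above:

    def train_step(model, optimizer, batch):
        """Illustrative fp16/bf16 step with a MegatronOptimizer-style optimizer."""
        optimizer.zero_grad(set_to_none=True)
        loss = model(batch).float().mean()            # stand-in loss computation
        scaled_loss = optimizer.scale_loss(loss)      # documented: "Simple scaling."
        scaled_loss.backward()
        optimizer.step()                              # unscale, clip (per clip_grad) and update
        return loss.detach(), optimizer.get_loss_scale()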

rapidformer.engine.schedules module

rapidformer.engine.schedules.get_learning_rate_scheduler(optimizer)

Build the learning rate scheduler.

rapidformer.engine.utils module

rapidformer.engine.utils.honor_type(obj, generator)

Cast a generator to the same type as obj (list, tuple or namedtuple).

rapidformer.engine.utils.is_torch_tensor(tensor)
rapidformer.engine.utils.recursively_apply(func, data, *args, test_type=<function is_torch_tensor>, error_on_other_type=False, **kwargs)

Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.

Parameters
  • func (callable) -- The function to recursively apply.

  • data (nested list/tuple/dictionary of main_type) -- The data on which to apply func.

  • *args -- Positional arguments that will be passed to func when applied on the unpacked data.

  • main_type (type, optional, defaults to torch.Tensor) -- The base type of the objects to which to apply func.

  • error_on_other_type (bool, optional, defaults to False) -- Whether to raise an error if, after unpacking data, we encounter an object that is not of type main_type. If False, the function leaves objects of types other than main_type unchanged.

  • **kwargs -- Keyword arguments that will be passed to func when applied on the unpacked data.

Returns

The same data structure as data with func applied to every object of type main_type.
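
An illustrative example applying a function to every tensor in a nested batch:

    import torch
    from rapidformer.engine.utils import recursively_apply

    batch = {'input_ids': torch.ones(2, 4, dtype=torch.long),
             'labels': [torch.zeros(2), torch.zeros(2)]}

    # Halve every tensor in the nested structure; non-tensor leaves are left unchanged.
    halved = recursively_apply(lambda t: t / 2, batch)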

rapidformer.engine.utils.gather(tensor)
rapidformer.engine.utils.send_to_device(tensor, device)

Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.

Parameters
  • tensor (nested list/tuple/dictionary of torch.Tensor) -- The data to send to a given device.

  • device (torch.device) -- The device to send the data to.

Returns

The same data structure as tensor with all tensors sent to the proper device.
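
For example, moving a whole batch dictionary to the current device in one call (illustrative):

    import torch
    from rapidformer.engine.utils import send_to_device

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch = {'input_ids': torch.ones(2, 4, dtype=torch.long),
             'attention_mask': torch.ones(2, 4)}
    batch = send_to_device(batch, device)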

rapidformer.engine.utils.report_memory(name, logger)

Simple GPU memory report.

rapidformer.engine.utils.unwrap_model(model, module_instances=<class 'torch.nn.parallel.distributed.DistributedDataParallel'>)
rapidformer.engine.utils.average_losses_across_data_parallel_group(losses)

Reduce a tensor of losses across all GPUs.
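
For intuition, an illustrative equivalent (not the library's implementation) using torch.distributed directly: each rank contributes its local losses, and the all-reduced sum is divided by the data-parallel world size:

    import torch
    import torch.distributed as dist

    def average_losses_sketch(losses, group=None):
        """Illustrative stand-in for average_losses_across_data_parallel_group."""
        averaged = torch.cat([loss.clone().detach().view(1) for loss in losses])
        dist.all_reduce(averaged, group=group)        # sum across the (data-parallel) group
        averaged /= dist.get_world_size(group=group)  # turn the sum into a mean
        return averaged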