This post records various problems (and their fixes) that come up while babysitting training runs and that have nothing to do with the algorithm itself.

Common Commands

Copying Files with Rsync

rsync -avz -e "ssh -p 2222" --progress user@remote_host:/path/to/remote/folder /path/to/local/destination
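
The same flags work in the other direction as well. A minimal sketch for pushing a local folder to the remote machine, assuming the same non-default SSH port:

rsync -avz -e "ssh -p 2222" --progress /path/to/local/folder user@remote_host:/path/to/remote/destination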

Running Out of CPU Memory During Multi-GPU Auto Resume

At bottom this is just a lack of money, but it can actually be solved painlessly. During a multi-GPU auto resume, every process reads the checkpoint into CPU memory at the same time: resuming a 50 GB ckpt on 8 GPUs needs at least 400 GB of CPU RAM, which a machine with modest memory simply cannot hold. The fix is simple: stagger the processes with a time.sleep() at the start of the resume, and remember to del the state_dict once it has been loaded. That way the transient memory spike is nowhere near as high.

Concretely, here is how it looks in DeepSpeed's load_checkpoint (a more generic PyTorch sketch follows right after it):

def load_checkpoint(model, optimizer, lr_scheduler, args):
    """Load a model checkpoint."""

    iteration, release = get_checkpoint_iteration(args)

    if args.deepspeed:
        # ADD: time.sleep
        if local_rank > 3:
            time.sleep(180)

        # load state_dict
        checkpoint_name, sd = model.load_checkpoint(args.load, iteration)

        # ADD: del & barrier
        del sd
        torch.distributed.barrier()

        if checkpoint_name is None:
            if mpu.get_data_parallel_rank() == 0:
                print("Unable to load checkpoint.")
            return iteration
        else:
            ......
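
The same trick is not DeepSpeed-specific. Below is a minimal sketch for a plain PyTorch resume; the function name, group_size, and delay_s are illustrative choices, not part of the snippet above:

import time

import torch
import torch.distributed as dist


def staggered_load(model, ckpt_path, local_rank, group_size=4, delay_s=180):
    # Assumes torch.distributed has already been initialized.
    # Let the first `group_size` local ranks read the checkpoint first;
    # the remaining ranks wait, roughly halving the transient CPU-RAM peak.
    if local_rank >= group_size:
        time.sleep(delay_s)

    # Load to CPU, copy into the model, then drop the extra in-memory copy.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state_dict)
    del state_dict

    # Make sure every rank has finished loading before training resumes.
    dist.barrier()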

The Many Problems of DeepSpeed

Loss Drops to 0 After a While of Stage 2 Training

See Issue 1231 - haotian-liu/LLaVA. This is a DeepSpeed bug; setting overlap_comm to false in 'zero2.json' (or wherever your ZeRO config lives) fixes it:

"zero_optimization": {
"stage": 2,
"overlap_comm": true, // 改成 false
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
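
For reference, this is the config that gets passed to the training entry point through the --deepspeed flag; the script name and elided arguments below are placeholders, not the exact LLaVA command:

deepspeed train.py \
    --deepspeed ./scripts/zero2.json \
    ...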

Genuinely absurd, man.

Why Does Even LLaMA 3 Have Problems?

NaN During Training

See https://huggingface.co/imone/Llama-3-8B-fixed-special-embedding:

The original Llama 3 8b (base) special token weights are zero, which might cause NaN gradients. This version re-initialized the weights of all the following special tokens to alleviate the problem.

<|eot_id|>
<|start_header_id|>
<|end_header_id|>

We set the weights of these tokens in embed and lm_head to be the mean of all other tokens.
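
Before patching anything, it is easy to verify the problem locally. A minimal sketch (the model path is a placeholder for wherever your copy of the base model lives) that checks whether the embedding rows of these special tokens are all zeros:

import torch
import transformers

model_path = "meta-llama/Meta-Llama-3-8B"  # placeholder: local path or Hub ID of the base model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16
)

for token in ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    row = model.model.embed_tokens.weight[token_id]
    # An all-zero embedding row is what leads to NaN gradients once the token is used.
    print(token, token_id, "all-zero embedding:", bool(torch.all(row == 0)))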

That author patched the special-token weights with the following code:

import argparse
import transformers
import torch


def init_eot_embedding_llama3(model_path, output_dir, special_tokens=["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"], mean_cutoff=128000, dtype=torch.bfloat16):
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
    model = transformers.AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, torch_dtype=dtype)

    assert model.model.embed_tokens.weight.shape[0] >= mean_cutoff
    assert model.lm_head.weight.shape[0] >= mean_cutoff

    with torch.no_grad():
        for token in special_tokens:
            token_id = tokenizer.convert_tokens_to_ids(token)

            print(f"Token {token} ID {token_id}")

            model.model.embed_tokens.weight[token_id] = torch.mean(model.model.embed_tokens.weight[:mean_cutoff].to(torch.float32), dim=0).to(dtype)
            model.lm_head.weight[token_id] = torch.mean(model.lm_head.weight[:mean_cutoff].to(torch.float32), dim=0).to(dtype)

    # Save
    tokenizer.save_pretrained(output_dir)
    model.save_pretrained(output_dir)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        help="Location of model, or HuggingFace repo ID",
    )
    parser.add_argument(
        "--output-dir",
        help="Location to write resulting model and tokenizer",
    )

    init_eot_embedding_llama3(**vars(parser.parse_args()))


if __name__ == "__main__":
    main()
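
Usage is straightforward. Assuming the script above is saved as fix_llama3_special_tokens.py (the filename and output path here are placeholders):

python fix_llama3_special_tokens.py \
    --model-path meta-llama/Meta-Llama-3-8B \
    --output-dir ./llama3-8b-fixed-special-embedding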