This post records various problems (and their fixes) that come up while babysitting training runs and that have nothing to do with the algorithm itself.

Common Commands

Copying Files with Rsync

rsync -avz -e "ssh -p 2222" --progress user@remote_host:/path/to/remote/folder /path/to/local/destination
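
The same flags work in the other direction as well. A minimal sketch for pushing a local folder to the remote machine, assuming the same non-default SSH port:

rsync -avz -e "ssh -p 2222" --progress /path/to/local/folder user@remote_host:/path/to/remote/destination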

Running Out of CPU Memory During Multi-GPU Auto Resume

At bottom this is just a lack of money, but it can actually be solved painlessly. During a multi-GPU auto resume, every process reads the checkpoint into CPU memory at the same time: resuming a 50 GB ckpt on 8 GPUs needs at least 400 GB of CPU RAM, which a machine with modest memory simply cannot hold. The fix is simple: stagger the processes with a time.sleep() at the start of the resume, and remember to del the state_dict once it has been loaded. That way the transient memory spike is nowhere near as high.

Concretely, here is how it looks in DeepSpeed's load_checkpoint (a more generic PyTorch sketch follows right after it):

def load_checkpoint(model, optimizer, lr_scheduler, args):
    """Load a model checkpoint."""

    iteration, release = get_checkpoint_iteration(args)

    if args.deepspeed:
        # ADD: time.sleep
        if local_rank > 3:
            time.sleep(180)

        # load state_dict
        checkpoint_name, sd = model.load_checkpoint(args.load, iteration)

        # ADD: del & barrier
        del sd
        torch.distributed.barrier()

        if checkpoint_name is None:
            if mpu.get_data_parallel_rank() == 0:
                print("Unable to load checkpoint.")
            return iteration
        else:
            ......
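
The same trick is not DeepSpeed-specific. Below is a minimal sketch for a plain PyTorch resume; the function name, group_size, and delay_s are illustrative choices, not part of the snippet above:

import time

import torch
import torch.distributed as dist


def staggered_load(model, ckpt_path, local_rank, group_size=4, delay_s=180):
    # Assumes torch.distributed has already been initialized.
    # Let the first `group_size` local ranks read the checkpoint first;
    # the remaining ranks wait, roughly halving the transient CPU-RAM peak.
    if local_rank >= group_size:
        time.sleep(delay_s)

    # Load to CPU, copy into the model, then drop the extra in-memory copy.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state_dict)
    del state_dict

    # Make sure every rank has finished loading before training resumes.
    dist.barrier()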

The Many Problems of DeepSpeed

Loss Drops to 0 After a While of Stage 2 Training

See Issue 1231 - haotian-liu/LLaVA. This is a DeepSpeed bug; setting overlap_comm to false in 'zero2.json' (or wherever your ZeRO config lives) fixes it:

"zero_optimization": {
"stage": 2,
"overlap_comm": true, // 改成 false
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
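
For reference, this is the config that gets passed to the training entry point through the --deepspeed flag; the script name and elided arguments below are placeholders, not the exact LLaVA command:

deepspeed train.py \
    --deepspeed ./scripts/zero2.json \
    ...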

Genuinely absurd, man.

Why Does Even LLaMA 3 Have Problems?

NaN During Training

See https://huggingface.co/imone/Llama-3-8B-fixed-special-embedding:

The original Llama 3 8b (base) special token weights are zero, which might cause NaN gradients. This version re-initialized the weights of all the following special tokens to alleviate the problem.

<|eot_id|>
<|start_header_id|>
<|end_header_id|>

We set the weights of these tokens in embed and lm_head to be the mean of all other tokens.
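
Before patching anything, it is easy to verify the problem locally. A minimal sketch (the model path is a placeholder for wherever your copy of the base model lives) that checks whether the embedding rows of these special tokens are all zeros:

import torch
import transformers

model_path = "meta-llama/Meta-Llama-3-8B"  # placeholder: local path or Hub ID of the base model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16
)

for token in ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    row = model.model.embed_tokens.weight[token_id]
    # An all-zero embedding row is what leads to NaN gradients once the token is used.
    print(token, token_id, "all-zero embedding:", bool(torch.all(row == 0)))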

That author patched the special-token weights with the following code:

import argparse
import transformers
import torch


def init_eot_embedding_llama3(model_path, output_dir, special_tokens=["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"], mean_cutoff=128000, dtype=torch.bfloat16):
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
    model = transformers.AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, torch_dtype=dtype)

    assert model.model.embed_tokens.weight.shape[0] >= mean_cutoff
    assert model.lm_head.weight.shape[0] >= mean_cutoff

    with torch.no_grad():
        for token in special_tokens:
            token_id = tokenizer.convert_tokens_to_ids(token)

            print(f"Token {token} ID {token_id}")

            model.model.embed_tokens.weight[token_id] = torch.mean(model.model.embed_tokens.weight[:mean_cutoff].to(torch.float32), dim=0).to(dtype)
            model.lm_head.weight[token_id] = torch.mean(model.lm_head.weight[:mean_cutoff].to(torch.float32), dim=0).to(dtype)

    # Save
    tokenizer.save_pretrained(output_dir)
    model.save_pretrained(output_dir)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        help="Location of model, or HuggingFace repo ID",
    )
    parser.add_argument(
        "--output-dir",
        help="Location to write resulting model and tokenizer",
    )

    init_eot_embedding_llama3(**vars(parser.parse_args()))


if __name__ == "__main__":
    main()
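
Usage is straightforward. Assuming the script above is saved as fix_llama3_special_tokens.py (the filename and output path here are placeholders):

python fix_llama3_special_tokens.py \
    --model-path meta-llama/Meta-Llama-3-8B \
    --output-dir ./llama3-8b-fixed-special-embedding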