This post collects the various problems, unrelated to the algorithms themselves, that have come up while babysitting training runs, together with their solutions.
## Common commands

### Copying files with rsync

```bash
rsync -avz -e "ssh -p 2222" --progress user@remote_host:/path/to/remote/folder /path/to/local/destination
```
## Running out of CPU memory during multi-GPU auto resume

At bottom this is just a shortage of money, but it can actually be solved painlessly. During a multi-GPU auto resume, every process reads the checkpoint into CPU memory at the same time: resuming a 50 GB ckpt on 8 GPUs needs at least 400 GB of CPU RAM, which a machine with modest memory simply cannot hold. The fix is simple: stagger the processes with a `time.sleep()` at the start of the resume, and remember to `del` the state_dict once it has been loaded. The instantaneous memory spike then stays much lower.
Concretely, take DeepSpeed's `load_checkpoint` as an example:
```python
def load_checkpoint(model, optimizer, lr_scheduler, args):
    """Load a model checkpoint."""
    iteration, release = get_checkpoint_iteration(args)

    if args.deepspeed:
        # Stagger the later ranks so that not all processes hold the full
        # checkpoint in CPU memory at the same time.
        if local_rank > 3:
            time.sleep(180)

        checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
        # Free the loaded state dict as soon as it has been consumed.
        del sd
        torch.distributed.barrier()

        if checkpoint_name is None:
            if mpu.get_data_parallel_rank() == 0:
                print("Unable to load checkpoint.")
            return iteration
    else:
        ...
```
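The same staggering idea also works for a plain `torch.load`-based resume. Below is a minimal sketch (not from the original post): the function name `staggered_load`, the wave size, and the delay are all illustrative and should be tuned to how long one load of your checkpoint actually takes.

```python
import os
import time

import torch
import torch.distributed as dist


def staggered_load(model, ckpt_path, ranks_per_wave=4, wave_delay_s=120):
    """Load a checkpoint wave by wave so peak CPU RAM stays bounded."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Ranks in later waves wait before touching the checkpoint file.
    time.sleep((local_rank // ranks_per_wave) * wave_delay_s)

    # Assumes the file stores a plain model state_dict.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state_dict)

    # Drop the CPU copy immediately so usage does not pile up across ranks.
    del state_dict

    # Make sure every rank has finished loading before training resumes.
    if dist.is_initialized():
        dist.barrier()
```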
## DeepSpeed's many problems

### Loss drops to 0 after a while of Stage 2 training

According to Issue 1231 - haotian-liu/LLaVA, this is a DeepSpeed bug; setting `overlap_comm` to `false` in `zero2.json` (or wherever your ZeRO config lives) fixes it:
```json
"zero_optimization": {
    "stage": 2,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto"
}
```
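If you build the DeepSpeed config programmatically instead of from a JSON file, the same switch can be flipped in the dict passed to `deepspeed.initialize`. A minimal sketch, using a toy model and a DeepSpeed-managed optimizer just to keep it self-contained:

```python
import deepspeed
import torch

# Tiny stand-in model so the sketch is self-contained.
model = torch.nn.Linear(16, 16)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,  # the workaround for the loss-goes-to-0 bug
        "contiguous_gradients": True,
    },
}

# Run under the `deepspeed` launcher so the distributed env vars are set.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```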
Truly absurd, man.
## LLaMA 3 has problems too?

### NaN during training

See https://huggingface.co/imone/Llama-3-8B-fixed-special-embedding:
> The original Llama 3 8b (base) special token weights are zero, which might cause NaN gradients. This version re-initialized the weights of all the following special tokens to alleviate the problem.
>
> ```
> <|eot_id|>
> <|start_header_id|>
> <|end_header_id|>
> ```
>
> We set the weights of these tokens in `embed` and `lm_head` to be the mean of all other tokens.
The author fixed the special-token weights with the following script:
```python
import argparse

import torch
import transformers


def init_eot_embedding_llama3(
    model_path,
    output_dir,
    special_tokens=["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"],
    mean_cutoff=128000,
    dtype=torch.bfloat16,
):
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        torch_dtype=dtype,
    )

    assert model.model.embed_tokens.weight.shape[0] >= mean_cutoff
    assert model.lm_head.weight.shape[0] >= mean_cutoff

    with torch.no_grad():
        for token in special_tokens:
            token_id = tokenizer.convert_tokens_to_ids(token)
            print(f"Token {token} ID {token_id}")
            # Re-initialize the special-token rows as the mean of the first
            # `mean_cutoff` (non-special) token rows, in float32 for stability.
            model.model.embed_tokens.weight[token_id] = torch.mean(
                model.model.embed_tokens.weight[:mean_cutoff].to(torch.float32), dim=0
            ).to(dtype)
            model.lm_head.weight[token_id] = torch.mean(
                model.lm_head.weight[:mean_cutoff].to(torch.float32), dim=0
            ).to(dtype)

    tokenizer.save_pretrained(output_dir)
    model.save_pretrained(output_dir)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        help="Location of model, or HuggingFace repo ID",
    )
    parser.add_argument(
        "--output-dir",
        help="Location to write resulting model and tokenizer",
    )
    init_eot_embedding_llama3(**vars(parser.parse_args()))


if __name__ == "__main__":
    main()
```
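After running the script, a quick sanity check is to confirm that the special-token rows are no longer all zeros. A sketch, where `./llama3-8b-fixed` is a hypothetical stand-in for whatever `--output-dir` you passed above:

```python
import torch
import transformers

output_dir = "./llama3-8b-fixed"  # hypothetical: whatever --output-dir was used above

model = transformers.AutoModelForCausalLM.from_pretrained(
    output_dir, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(output_dir)

for token in ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    row_norm = model.model.embed_tokens.weight[token_id].float().norm().item()
    print(f"{token}: embedding row norm = {row_norm:.4f}")  # should be > 0 after the fix
```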