English | 简体中文
This model is based on Qwen2-0.5B, with the tokenizer replaced by BilingualTokenizer-8K to reduce the parameter count; the total number of parameters drops from 0.5B to 365M.
To recover some of the lost performance and make fine-tuning for downstream tasks easier, I chose to freeze the backbone parameters after replacing the tokenizer and train only the embedding layer, as sketched below. Training ran for 40,000 steps on wikipedia-zh and cosmopedia-100k.
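The following is a minimal sketch of this freezing setup, assuming the standard Hugging Face `transformers` API; the checkpoint name and tokenizer path are placeholders, not the exact ones used here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers: substitute the actual base model and tokenizer paths.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("path/to/BilingualTokenizer-8K")

# Shrink the embedding matrix to the 8K vocabulary of the new tokenizer.
model.resize_token_embeddings(len(tokenizer))

# Freeze the backbone, then unfreeze only the input embeddings (model.embed_tokens).
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f} M")  # expected to be < 10 M
```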
| Setting | Value |
| --- | --- |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | model.embed_tokens |
| Training Steps | 40,000 |
| Training Dataset | wikipedia-zh, cosmopedia-100k |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Warm-up Ratio | 0.03 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
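For reference, the hyperparameters in the table map onto a `Trainer` configuration roughly like the sketch below; `output_dir` and `train_dataset` are placeholders, and the dataset is assumed to be pre-tokenized into 4096-token sequences.

```python
from transformers import Trainer, TrainingArguments

# Values taken from the table above; paths and dataset objects are placeholders.
args = TrainingArguments(
    output_dir="out/nanolm-embedding-retrain",
    max_steps=40_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    optim="adamw_torch",
    bf16=True,
)

# train_dataset: wikipedia-zh + cosmopedia-100k, tokenized to 4096-token sequences.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```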