superhot-13b-16k-4bit--1g-safetensors
Note: the maximum sequence length (max_seq_len) needs to be set to 16384 (or lower), and the compression factor (compress_pos_emb) to 8.
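For example, when loading the model through ExLlama's Python API these two settings correspond to attributes on ExLlamaConfig. A minimal sketch, assuming ExLlama's example-script module layout (module and attribute names may differ between versions):

```python
# Sketch only: module names (model.py, tokenizer.py) come from the ExLlama repo's examples.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer

model_dir = "/workspace/models/superhot-13b-16k-4bit--1g-safetensors"

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/4bit.safetensors"
config.max_seq_len = 16384      # extended context (16384 or lower)
config.compress_pos_emb = 8.0   # RoPE compression factor for the 16k SuperHOT LoRA

model = ExLlama(config)                                       # loads the quantized weights
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
cache = ExLlamaCache(model)                                   # KV cache sized to max_seq_len
```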
Merged the base LLaMA and the LoRA with alpaca-lora's export_hf_checkpoint.py: https://github.com/tloen/alpaca-lora
Base LLaMA 13B: https://huggingface.co/huggyllama/llama-13b
SuperHOT 13B 16k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-13b-16k-no-rlhf-test
BASE_MODEL=huggyllama_llama-13b LORA=kaiokendev_superhot-13b-16k-no-rlhf-test python export_hf_checkpoint.py
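The same merge can also be expressed directly with Hugging Face PEFT. A rough sketch (not the exact export_hf_checkpoint.py script above), assuming PEFT's merge_and_unload API and enough CPU RAM for the fp16 weights:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

# Load the fp16 base model, apply the LoRA, then fold the LoRA deltas into the base weights.
base = LlamaForCausalLM.from_pretrained("huggyllama/llama-13b", torch_dtype=torch.float16)
lora = PeftModel.from_pretrained(base, "kaiokendev/superhot-13b-16k-no-rlhf-test")
merged = lora.merge_and_unload()

# Save the merged checkpoint (plus tokenizer) as the unquantized intermediate model.
merged.save_pretrained("superhot-13b-16k-safetensors", safe_serialization=True)
LlamaTokenizer.from_pretrained("huggyllama/llama-13b").save_pretrained("superhot-13b-16k-safetensors")
```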
Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ
python quant_with_alpaca.py --pretrained_model_dir superhot-13b-16k-safetensors --quantized_model_dir superhot-13b-16k-4bit--1g-safetensors --bits 4 --group_size -1 --desc_act --num_samples 256 --save_and_reload
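quant_with_alpaca.py (from AutoGPTQ's examples) drives AutoGPTQ's Python API. A stripped-down sketch of the same 4-bit, group_size -1, act-order quantization, with a placeholder calibration prompt instead of the 256 alpaca samples used above:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

src = "superhot-13b-16k-safetensors"
dst = "superhot-13b-16k-4bit--1g-safetensors"

tokenizer = AutoTokenizer.from_pretrained(src, use_fast=False)

# bits=4, group_size=-1 (no grouping) and desc_act=True mirror the flags above.
quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained(src, quantize_config)

# The real run calibrated on 256 alpaca prompts; one placeholder keeps the sketch self-contained.
examples = [tokenizer("Below is an instruction that describes a task. Write a response that completes the request.")]

model.quantize(examples)
model.save_quantized(dst, use_safetensors=True)
tokenizer.save_pretrained(dst)
```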
Perplexity, measured with ExLlama's test_benchmark_inference.py:
CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
-d /workspace/models/superhot-13b-16k-4bit--1g-safetensors \
-ppl \
-ppl_ds datasets/wikitext2.txt \
-l 16384 \
-cpe 8 \
-ppl_cn 40 \
-ppl_cs 16384 \
-ppl_ct 16384
-- Perplexity:
-- - Dataset: datasets/wikitext2.txt
-- - Chunks: 40
-- - Chunk size: 16384 -> 16384
-- - Chunk overlap: 0
-- - Min. chunk size: 50
-- - Key: text
-- Tokenizer: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/tokenizer.model
-- Model config: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/config.json
-- Model: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/4bit.safetensors
-- Sequence length: 16384
-- RoPE compression factor: 8.0
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perplexity']
** Time, Load model: 3.69 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
!! Model has empty group index (discarded)
** VRAM, Model: [cuda:0] 6,974.74 MB
-- Loading dataset...
-- Testing 21 chunks...
** Perplexity: 7.5462
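For context, the figure above is standard token-level perplexity: the exponential of the mean negative log-likelihood of the evaluated tokens. A generic sketch of the calculation (not ExLlama's exact implementation):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities the model assigned to each evaluated token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigned every evaluated token probability 1/7.5462 would score exactly 7.5462.
print(perplexity([math.log(1 / 7.5462)] * 1000))
```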