[huggingface🤗] Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

This post is a loose translation of the Hugging Face blog post 'Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA'.

https://huggingface.co/blog/4bit-transformers-bitsandbytes


You should read the previous post before this one, so please take a look at it first :)

https://heygeronimo.tistory.com/48


Introduction

FP8 format

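The original post starts by introducing data types; I will only briefly cover the ones that were new to me.

E4M3, E5M2: E stands for exponent and M for mantissa.
Which one to use depends on what matters more: E5M2 spends more bits on the exponent (wider range), while E4M3 spends more bits on the mantissa (finer precision).
E4M3: covers the range -448 to 448 and is empirically said to work well for the forward pass.
E5M2: covers the range -57344 to 57344 and is empirically said to work well for the backward pass.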
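
As a rough check of the ranges above, the sketch below recomputes the largest finite value of each format. The exponent bias and the NaN/inf conventions are assumptions taken from the commonly used FP8 definition, not from this post, and `fp8_max` is a hypothetical helper name.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of an FP8 format (assumed conventions, see below).

    ieee_like=True : the all-ones exponent is reserved for inf/NaN (assumed for E5M2).
    ieee_like=False: only "all-ones exponent + all-ones mantissa" is NaN (assumed for E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1  # assumed IEEE-style bias
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return 2.0 ** max_exp * max_frac


print("E4M3 max:", fp8_max(4, 3, ieee_like=False))
print("E5M2 max:", fp8_max(5, 2, ieee_like=True))
```

Under these assumptions both results agree with the ranges quoted in the list.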

FP4 precision

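In the example here, a value consists of 1 sign bit, 2 exponent bits, and 1 fraction (mantissa) bit.
For example, 1101 is -1 * 2^(2) * (1 + 2^-1) = -1 * 4 * 1.5 = -6.
The first bit '1' means negative, the second and third bits '10' are the exponent of 2 (binary 10 = 2), and the fourth bit '1' switches on the 2^-1 term of the fraction, giving 1 + 0.5 = 1.5.
FP4 has no fixed format, so the exponent can take either 3 or 2 bits. According to the original post, 3 exponent bits do slightly better in most cases, though 2 exponent bits plus 1 mantissa bit sometimes perform better.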
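
The decoding rule from the 1101 example can be sketched in a few lines. This follows the simplified formula used in this post (no exponent bias, no special handling of zero or subnormals), so it is an illustration rather than the exact FP4 layout used inside bitsandbytes; `decode_fp4` is a hypothetical helper name.

```python
def decode_fp4(bits: str, exp_bits: int = 2) -> float:
    """Decode a 4-bit string as sign * 2^exponent * (1 + fraction)."""
    assert len(bits) == 4 and set(bits) <= {"0", "1"}
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:1 + exp_bits], 2)
    frac_bits = bits[1 + exp_bits:]
    # Each '1' at position i (1-based) contributes 2^-i to the fraction.
    fraction = sum(2.0 ** -(i + 1) for i, b in enumerate(frac_bits) if b == "1")
    return sign * 2.0 ** exponent * (1.0 + fraction)


print(decode_fp4("1101"))  # the example from the text
```

Passing `exp_bits=3` decodes the same 4 bits with 3 exponent bits and no fraction bit, which is the other layout mentioned above.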

QLoRA, a new way of bringing large models within everyone's reach

QLoRA lowers memory usage while preserving the performance of 16-bit training. A 33B model can be fine-tuned on a single 24GB GPU, and 46GB is enough for a 65B model.
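
A back-of-the-envelope estimate shows why these sizes become feasible. The sketch below counts only the base model weights; it ignores activations, adapter weights, optimizer state, and quantization constants, so the real footprint is higher. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for n_params, name in [(33e9, "33B"), (65e9, "65B")]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit weights {fp16:.1f} GB, 4-bit weights {int4:.1f} GB")
```

With 4-bit storage the weights of both models fit under the GPU sizes quoted above, leaving the remainder for everything the estimate leaves out.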

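More specifically, the weights of the original LLM are frozen (they are not updated even during training), and only a very small number of trainable parameters are added and trained as LoRA (Low-Rank Adapters). QLoRA backpropagates gradients through the frozen 4-bit quantized model into these adapters; for details on the adapters themselves, see the LoRA paper.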

https://arxiv.org/abs/2106.09685
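
To get a feel for how small "very small" is, the sketch below compares the frozen weights of one linear layer with the trainable weights of a LoRA adapter on that layer. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values from this post, and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one linear layer.

    The frozen weight W has shape (d_out, d_in). LoRA adds B (d_out, rank)
    and A (rank, d_in) and trains only those, so the update is B @ A.
    """
    frozen = d_out * d_in
    trainable = rank * (d_out + d_in)
    return frozen, trainable


frozen, trainable = lora_param_counts(4096, 4096, rank=8)
print(f"frozen: {frozen:,}  trainable: {trainable:,}  ratio: {trainable / frozen:.2%}")
```

Only the adapter matrices receive gradients, which is why the optimizer state stays small even for very large base models.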


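QLoRA uses 4-bit NormalFloat as the storage data type for the base model weights and 16-bit BrainFloat as the computation data type. It dequantizes the weights from the storage type (4-bit) to the computation type (16-bit) to run the forward and backward passes. Weight gradients, however, are computed only for the LoRA parameters, which are kept in 16-bit bfloat.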
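
The store-in-4-bit, compute-in-16-bit idea can be sketched with a toy blockwise quantizer. Everything here is an illustrative assumption: the codebook is evenly spaced as a stand-in (the real NF4 levels are derived from quantiles of a normal distribution), plain Python floats stand in for bfloat16, and `quantize_block` / `dequantize_block` are hypothetical names.

```python
# Stand-in 16-level codebook on [-1, 1]; NOT the actual NF4 levels.
CODEBOOK = [-1 + 2 * i / 15 for i in range(16)]


def quantize_block(weights: list[float]) -> tuple[list[int], float]:
    """Map each weight to a 4-bit code (0..15) plus one scale per block."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float) -> list[float]:
    """Recover approximate weights in the compute dtype before the matmul."""
    return [CODEBOOK[c] * absmax for c in codes]


block = [0.5, -1.0, 0.25, 0.0]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
print(codes, restored)
```

Only `codes` and `absmax` need to be kept in memory; the dequantized values exist only for the duration of the computation.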

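Across a wide range of experiments, QLoRA tuning is shown to match 16-bit fine-tuning. The QLoRA paper reports that its Guanaco models reach 99.3% of ChatGPT's performance level on the Vicuna benchmark, which suggests how effective the method is.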

For details, please refer to the QLoRA paper.

https://arxiv.org/abs/2305.14314


Code example

A good example of QLoRA in practice can be found in the KoAlpaca GitHub repository. I am leaving both the GitHub link and the link to the QLoRA example notebook below.

https://github.com/Beomi/KoAlpaca


https://colab.research.google.com/gist/Beomi/a3032e4eaa33b86fdf8de1f47f15a647/2023_05_26_bnb_4bit_koalpaca_v1_1a_on_polyglot_ko_12_8b.ipynb


Code walkthrough

Compared with earlier code, a new config and new functions appear, so let's go through them together.

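import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint (an assumption, not part of the original snippet); use any causal LM.
model_id = "facebook/opt-350m"

# Store weights as 4-bit NF4 with double quantization; run computation in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)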

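BitsAndBytesConfig came out of a collaboration between the bitsandbytes maintainers and Hugging Face. bitsandbytes started out as a library of 8-bit CUDA functions for PyTorch.
load_in_4bit: whether to load the model with 4-bit quantization.
bnb_4bit_quant_type: which 4-bit data type to use, such as nf4 (NormalFloat 4) or fp4. QLoRA uses nf4; note that the default in BitsAndBytesConfig is fp4, so nf4 has to be set explicitly.
bnb_4bit_use_double_quant: a technique from the QLoRA paper that quantizes the quantization constants a second time, lowering the average memory used per parameter.
bnb_4bit_compute_dtype: the weights are stored in 4 bits, but the actual computation (matrix multiplications in the forward and backward passes) runs after dequantizing to a higher-precision type such as float16, bfloat16, or float32. Pick the type that suits your hardware; the default is float32, and a 16-bit compute dtype is faster.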
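
The saving from double quantization can be estimated with simple arithmetic. The block sizes and bit widths below (one 32-bit constant per 64 weights, re-quantized to 8 bits with one 32-bit constant per 256 first-level constants) are taken from the QLoRA paper as I understand it, not from this post, so treat them as assumptions; the function names are hypothetical.

```python
def plain_overhead(block: int = 64, const_bits: int = 32) -> float:
    """Extra bits per weight when each block stores one full-precision constant."""
    return const_bits / block


def double_quant_overhead(
    block: int = 64,
    q_const_bits: int = 8,
    second_block: int = 256,
    second_const_bits: int = 32,
) -> float:
    """Extra bits per weight when the constants themselves are quantized."""
    first_level = q_const_bits / block
    second_level = second_const_bits / (block * second_block)
    return first_level + second_level


plain = plain_overhead()
double = double_quant_overhead()
print(f"plain: {plain:.3f} bits/weight, double quant: {double:.3f} bits/weight")
print(f"saving: {plain - double:.3f} bits/weight")
```

The saving per weight is small, but multiplied by tens of billions of parameters it frees a noticeable amount of GPU memory.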
