Llama 7B was trained on about a trillion tokens. A LoRA isn't extra neurons exactly; it's a pair of small low-rank matrices attached alongside the model's existing weight matrices, and those small matrices are the only parts that get trained on the new data while the original weights stay frozen. It's a form of fine-tuning, just with far less RAM and compute than updating the whole model.
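
Rough sketch of the idea in PyTorch, if it helps (the class name, rank, and layer sizes here are just illustrative, not LLaMA's or any library's actual setup):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        # Low-rank factors: A is (r x in), B is (out x r); B starts at zero
        # so the adapter contributes nothing before training.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Only A and B receive gradients, which is why memory and compute stay
# far below full fine-tuning.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~65k trainable params vs ~16.8M frozen in the base layer
```

Same forward pass shape as the original layer, so the adapter can later be merged back into the base weights or swapped out for a different one.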