
How does training on just 800k pieces of data need 7B parameters?


Because it’s fine-tuning an existing 7B-parameter language model, not training from scratch.


Llama 7B was trained on a trillion tokens. The LoRA adds a small set of low-rank adapter weights alongside the existing layers, and only those adapters get trained on the new data. It's a form of fine-tuning that takes far less RAM and compute than updating the whole model.
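
Roughly what that looks like in code. A minimal PyTorch sketch of a LoRA-style adapter wrapped around one linear layer; the class name and the rank/alpha values are illustrative, not the actual implementation being discussed:

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer plus a trainable low-rank update (sketch)."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False       # the original pretrained weights stay frozen
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
            nn.init.zeros_(self.lora_b.weight)  # start as a no-op, so training begins at the base model
            self.scale = alpha / rank

        def forward(self, x):
            # base output plus the low-rank correction; only lora_a/lora_b receive gradients
            return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

Only the two small adapter matrices (in_features x rank and rank x out_features) are updated, which is why the 800k examples only need to fit a tiny fraction of the 7B parameters.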


The Chinchilla formula demands a 20:1 token-to-parameter ratio
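
For scale, a rough back-of-the-envelope; note that the 20:1 rule is about pretraining from scratch, not a LoRA fine-tune, and the tokens-per-example figure below is just a guess for illustration:

    params = 7e9                        # Llama 7B
    chinchilla_tokens = 20 * params     # ~140B tokens for a compute-optimal pretrain

    finetune_examples = 800_000         # the instruction-tuning set in question
    tokens_per_example = 100            # rough assumption, purely illustrative
    finetune_tokens = finetune_examples * tokens_per_example  # ~80M tokens

    print(f"pretrain budget: {chinchilla_tokens:.0e} tokens")
    print(f"fine-tune data:  {finetune_tokens:.0e} tokens")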


"demands" is great


Hey, I didn't invent Chinchilla, blame Google for that.


DeepMind

(apparently there's some friction between those two!)



