lewtun 19 hours ago

Hi, Lewis here (one of the co-authors). Happy to answer any questions people have about the book :)

  • empiko 3 hours ago

    Really impressive writeup. In your opinion, how long will this stay up to date? The field is constantly evolving, do you plan to keep updating this document?

    • lewtun 3 hours ago

      Thanks! I expect the book will remain relevant as long as the Transformers architecture does. That’s why we mostly focus on topics we think will stand the test of time, but let’s see how that plays out :)

  • danielmarkbruce 18 hours ago

    I'm a little ways through this and it's great so far, nice job.

    One of the reasons people build one themselves, though, is to learn. Most smart folks are well aware that the reality of pre-training a real LLM involves some banging your head against the wall (i.e., things don't go as smoothly as in a "build an LLM from scratch" book), and they want to go through that process.

tsenturk a day ago

Hugging Face is not just an AI information-sharing website; it’s also a great learning platform for all AI learners. This documentation is one of the most impressive hands-on resources I’ve ever read.

  • abossy 21 hours ago

    What others would you recommend that are comparable in quality?

    • pixelmelt 18 hours ago

      Been reading a book by u/fpham, "The Cranky Man's Guide to LoRA and QLoRA", and it's pretty great. The writing quality isn't all there, but the content is valuable for learning to make good fine-tunes.
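
      If you just want the basic mechanics before (or alongside) a book like that, the core LoRA setup with PEFT is only a few lines. A minimal sketch; the model name and hyperparameters here are illustrative defaults, not a recipe from the book:

      ```python
      # Minimal LoRA fine-tuning setup with Hugging Face PEFT.
      from transformers import AutoModelForCausalLM
      from peft import LoraConfig, get_peft_model

      model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

      lora_config = LoraConfig(
          r=8,                                  # rank of the low-rank update matrices
          lora_alpha=16,                        # scaling factor applied to the update
          lora_dropout=0.05,
          target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
          task_type="CAUSAL_LM",
      )

      model = get_peft_model(model, lora_config)
      model.print_trainable_parameters()  # only a small fraction of weights train
      ```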

    • donkeyboy 20 hours ago

      The documentation for common AI packages is pretty good too: for example, the PyTorch, PEFT, and timm docs.

forgingahead 12 hours ago

Where does "Smol" come from? It's supposed to mean "small", right? If so, what's the etymology, and why has it become so popular?

doctorpangloss 12 hours ago

I really like the Hugging Face guys, but...

> Modify one thing at a time

> Change only one variable per ablation while keeping everything else constant. If you change multiple things and performance improves, you won’t know what caused it. Test modifications individually, then combine successful ones and reassess.

This is an unintentional microcosm of what is flawed with the document.

  • CamperBob2 11 hours ago

    What's wrong with it? That's good advice in almost any optimization or troubleshooting context where variables may interact.
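
    In code, the quoted advice boils down to something like the rough sketch below; the config fields and train_and_eval() are hypothetical stand-ins for whatever your training harness actually exposes:

    ```python
    import random

    def train_and_eval(config: dict) -> float:
        # Placeholder so the sketch runs end to end; replace with a real
        # (short) training run on a small proxy model.
        return random.random()

    baseline = {"lr": 3e-4, "warmup_steps": 1000, "tie_embeddings": True}
    candidates = {"lr": 6e-4, "warmup_steps": 2000, "tie_embeddings": False}

    baseline_loss = train_and_eval(baseline)
    wins = {}
    for key, new_value in candidates.items():
        trial = dict(baseline, **{key: new_value})  # change exactly one thing
        if train_and_eval(trial) < baseline_loss:
            wins[key] = new_value

    # Combine only the individually successful changes, then reassess.
    combined_loss = train_and_eval(dict(baseline, **wins))
    ```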

    • yorwba 9 hours ago

      One problem with testing one change at a time is that if each experiment takes many GPU hours, you can only afford a small number of runs, and therefore you can only test a small number of changes. If you can come up with and implement new changes much more easily than you can test them, it would be more efficient to vary several things per run and use some form of Bayesian optimization to find the best combination of changes with as few experiments as possible.
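
      As a rough sketch of what that could look like with Optuna (whose default TPE sampler is one such sequential, Bayesian-style optimizer); the search space and train_and_eval() are made up for illustration:

      ```python
      import random

      import optuna  # pip install optuna

      def train_and_eval(config: dict) -> float:
          # Placeholder so the sketch runs; replace with a real proxy-model training run.
          return random.random()

      def objective(trial):
          config = {
              "lr": trial.suggest_float("lr", 1e-4, 1e-3, log=True),
              "warmup_steps": trial.suggest_int("warmup_steps", 500, 4000),
              "tie_embeddings": trial.suggest_categorical("tie_embeddings", [True, False]),
          }
          return train_and_eval(config)  # several knobs vary in every run

      study = optuna.create_study(direction="minimize")  # default sampler is TPE
      study.optimize(objective, n_trials=16)             # one trial = one expensive run
      print(study.best_params)
      ```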

      • ImageXav 5 hours ago

        Agreed. One-at-a-time testing (OAT) has been considered outdated for almost a century at this point. Factorial and fractional factorial designs have been around for that long, and they give detailed insight into not just the effect of single changes but the interactions between changes, which lets you get far more out of each run, since many variables in DL do in fact interact (see the sketch below).

        Or use more modern Bayesian methods if you're more interested in getting the best result from a given hyperparameter sweep.

        However, that's not to detract from the excellent effort made here and the great science being investigated. Write-ups like this offer so much gold to the community.
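
        As a toy illustration of the half-fraction idea (hypothetical factor names; in a 2^(3-1) design this small, main effects are aliased with two-factor interactions, so a real design would use more factors to make the savings worthwhile):

        ```python
        from itertools import product

        factors = ["higher_lr", "longer_warmup", "untied_embeddings"]  # hypothetical changes

        full_factorial = list(product([-1, +1], repeat=3))  # 2^3 = 8 runs

        # Half fraction 2^(3-1) with defining relation I = ABC: the third factor's
        # level is the product of the first two, so only 4 runs are needed.
        half_fraction = [(a, b, a * b) for a, b in product([-1, +1], repeat=2)]

        for run in half_fraction:
            settings = {name: level == +1 for name, level in zip(factors, run)}
            print(settings)  # hand each of these to the training harness
        ```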

      • empiko an hour ago

        The number of runs you can afford is not enough to perform Bayesian optimization. Count how many different options they explored in the text, then take a guess at how many samples you'd need before you could even start modeling the hyperparameter space.