I've been tinkering with getting Llama-8B to bootstrap its own research skills through self-play. The model generates questions about documents, searches for answers, and then learns from its own successes/failures through RL (hacked up Unsloth's GRPO code). Started with just 23% accuracy on Apollo 13 mission report questions and hit 53% after less than an hour of training. Everything runs locally using open-source models. It's cool to see the model go from completely botching search queries to iteratively researching to get the right answer.
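For anyone curious what the wiring roughly looks like: here's a minimal sketch using TRL's GRPOTrainer (Unsloth's GRPO code builds on the same trainer). The model id, document file, make_qa_pairs helper, prompt wording, and reward shaping are all placeholders I'm filling in for illustration, not the repo's actual code.

    # Rough sketch of the setup: self-generated QA pairs -> GRPO with a
    # correctness reward. Names and values below are placeholders.
    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    def make_qa_pairs(document: str) -> list[dict]:
        # In the real pipeline the model writes its own (question, answer)
        # pairs from the document; hard-coded here to keep the sketch runnable.
        return [{"question": "What caused the Apollo 13 oxygen tank failure?",
                 "answer": "damaged wire insulation"}]

    doc = open("apollo13_mission_report.txt").read()  # placeholder corpus
    train_dataset = Dataset.from_list([
        {"prompt": "Use the search tool, then answer.\nQuestion: " + p["question"],
         "answer": p["answer"]}
        for p in make_qa_pairs(doc)
    ])

    def correctness_reward(completions, answer, **kwargs):
        # 1.0 if the reference answer shows up in the completion, else 0.0.
        # GRPO only needs relative scores within each group of samples.
        return [1.0 if a.lower() in c.lower() else 0.0
                for c, a in zip(completions, answer)]

    trainer = GRPOTrainer(
        model="meta-llama/Llama-3.1-8B-Instruct",  # any local 8B checkpoint
        reward_funcs=correctness_reward,
        args=GRPOConfig(output_dir="grpo-research",
                        num_generations=8,
                        max_completion_length=512),
        train_dataset=train_dataset,
    )
    trainer.train()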
Questions:
-- Does training on one body of data transfer to better performance on subsequent bodies of data, since you should also be training meta-skills?
-- Your benchmark showed growth from 23% to 53% after an hour of training: what happens with further training? If it plateaus, why?
I haven't tried switching datasets, but I'm fairly certain the LLM is learning meta-skills. Most of what the model picks up is how to behave more sensibly: it stops hallucinating and stops misusing tools, rather than memorizing the data in the body of knowledge.
During the first hour of training, Llama grabs most of the low-hanging fruit (it stops botching function calls and stops hallucinating), so learning slows down after that.
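That split between quick formatting gains and slower accuracy gains is roughly what a two-part reward would produce. Here's a hedged sketch of the formatting half, meant to sit alongside the correctness reward from the sketch above (e.g. reward_funcs=[format_reward, correctness_reward]); the tool-call tags and the 0.2 weight are illustrative, not the repo's exact scheme.

    import json
    import re

    # Illustrative reward split: a well-formed tool call is cheap to learn,
    # so it drives the fast early gains; answer correctness is the slower
    # signal left over once formatting is fixed.
    TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

    def format_reward(completions, **kwargs):
        # Small reward for emitting any syntactically valid JSON tool call.
        scores = []
        for text in completions:
            m = TOOL_CALL.search(text)
            if m is None:
                scores.append(0.0)
                continue
            try:
                json.loads(m.group(1))
                scores.append(0.2)
            except json.JSONDecodeError:
                scores.append(0.0)
        return scores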