dude thats sick! i tried it out and it works. theres a couple layers in there that are part of the voidy block that don't do much for the selected answer, so i narrowed it down to L48-53, where the model seems to be mapping out its reasoning strategy, and repeated that block twice. i got a big improvement over the original config (i chose some questions from atropos and claude code made some up, so not exactly a real dataset).
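for anyone who wants to try it, heres roughly what the repeat trick looks like as a layer-index schedule (just a sketch: the 48-53 bounds are from my run, the 64-layer count and everything else are placeholders):

```python
# sketch of the repeat trick as a layer-index schedule. weights are shared,
# so the repeats cost extra compute but zero extra memory.

def build_layer_order(n_layers, block_start, block_end, block_passes):
    """Execution order where layers block_start..block_end (inclusive)
    run block_passes times instead of once."""
    block = list(range(block_start, block_end + 1))
    return (
        list(range(block_start))                 # layers before the block
        + block * block_passes                   # the repeated "reasoning" block
        + list(range(block_end + 1, n_layers))   # layers after the block
    )

def forward_with_repeats(layers, hidden, order):
    """Run the hidden state through layers in the given order,
    reusing the same layer object (same weights) on every pass."""
    for i in order:
        hidden = layers[i](hidden)
    return hidden

# e.g. a hypothetical 64-layer model with layers 48-53 played 3x total:
order = build_layer_order(64, 48, 53, 3)
extra = (len(order) - 64) / 64  # extra compute fraction; 12/64 here
```

the extra compute fraction is just (block size × extra passes) / total layers, so the exact overhead depends on how many layers your model has.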
so thats about 15% more compute per forward pass with 0 extra memory, which is just nuts. for a streaming or disk-based setup its basically free better answers. def wasnt gonna think of this myself.
looks like the model gets a second/third go at figuring out how to approach the problem and it gets better answers.
i tried a matrix of other configurations and stuff gets totally weird. like playing that block through backwards doesnt make much of a difference / order doesnt seem to matter (?!). doubling every layer got a benefit, but doubling every layer and doubling that block on top caused interference. doubling the block where the model is architecting/crystallizing its plans improves reasoning, but at the cost of other stuff. other mixes of blocks showed some improvements for certain kinds of prompts but didnt stand out as much.
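fwiw all those configs can be written as index orderings over the same shared weights, which is how i sweep them (sketch only, block bounds and names made up):

```python
# each variant is just a different layer-index sequence over shared weights,
# so memory stays constant and only compute changes.

def variant_orders(n_layers, block_start, block_end):
    """Layer-index orderings for the configs described above."""
    base = list(range(n_layers))
    block = list(range(block_start, block_end + 1))
    pre, post = base[:block_start], base[block_end + 1:]
    return {
        "baseline": base,
        "block_x2": pre + block * 2 + post,                      # repeat the block once more
        "block_x2_reversed": pre + block + block[::-1] + post,   # second pass runs backwards
        "every_layer_x2": [i for i in base for _ in range(2)],   # double every layer
        # doubling every layer *and* the block (the config that interfered):
        "every_layer_x2_block_x2": [i for i in pre + block * 2 + post for _ in range(2)],
    }

# e.g. a toy 10-layer model with the block at layers 4-6:
orders = variant_orders(10, 4, 6)
```

then each order just gets fed through the same forward loop and scored against the prompt set.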
im kind of wondering like what the ceiling would be on reasoning for something like the 1.5T models with the repeating technique, but they would take a long time to download. i think if you have them already it would take maybe an hour or so to check against a swath of prompts. whats the reasoningest open model at the moment?
my guess is that for large models trained on large corpora theres just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, since text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions, and my guess is we're already kind of there.
I have Deepseek etc, but inferencing on DDR5 would take about 2-3 weeks for a simple scan. I think this works best with dense models, but it also seems ok with MoE.
@everyone: Can someone hook me up with Nvidia sponsorship?
oh neat, ill check that one out. i dont get that much speedup from ssd/128gb unified vs vram if im doing a predefined set of prompts, since i load the model from disk anyway and im just doing one forward pass per prompt, loading part of it at a time. its a bit slower if im doing cpu inferencing but ive only had to do that with one model so far.
but yeah on demand would be a lot of ssd churn so id just do it for testing or getting some hidden state vectors.