You’ll build a working language model from scratch on text you choose โ with your own temperature dial. You’ll leave saying: “I built a language model from scratch.”
Begin โA working bigram model on YOUR corpus, with a temperature dial you code.
Why it’s the mechanism, not a glitch โ using your own model as the exhibit.
Break an LLM, then score your own assistant with a 10-case eval harness.
An LLM plays one game: given everything so far, predict a likely next token (a word-chunk). A bigram model predicts from 1 word of context; a trigram from 2. More context = more coherent โ congratulations, you just discovered context windows.
Temperature reshapes the probabilities before the model picks a word. Low = always grab the most likely word (safe, but loopy). High = flatten everything (wild, then word salad). Drag the dial:
Your 30-line model will produce a sentence that sounds right but was never in your corpus and isn’t true. So does ChatGPT โ for the exact same reason: it’s stitching a plausible continuation. There’s no truth table anywhere inside.
The model is never lying and never honest โ it can’t tell the difference. Checking is YOUR job: accept / revise / reject, forever.
On gandalf.lakera.ai and your own bots only, try to make an AI confidently wrong or spill a secret. Log the technique class: roleplay framing ยท authority claims ยท indirection ยท incremental extraction.
A repeatable, scored test suite: write a spec, 10 cases (2+ must test RULES, not facts), run v1, patch, run v2, v3. Log the pass-rates. This is the difference between “I wrote a prompt” and engineering.
Earn it: working bigram generator with your own temperature dial; a logged red-team technique; eval harness pass-rates across v1โv3.