::synthesis · Tim-Ferriss method

Context windows

::minimum effective dose

A context window is the model's working memory for one turn — every token of system prompt, conversation history, attached files, tool outputs, and the response itself competes for the same fixed budget. When you hit the cap, the oldest tokens get evicted, summarized, or the call errors. That's it. The mechanics that matter: (1) Tokens != words — roughly 4 chars / 0.75 words per token in English, more for code. (2) Cost scales with input tokens too, not just output — a 200K-token prompt is expensive even if the reply is one line. (3) Attention degrades in the middle — models reliably attend to the start and end of the window, lose precision in the middle (the 'lost in the middle' problem, confirmed across Claude, GPT, Gemini). (4) Putting your most important instruction in the LAST 10% of the prompt outperforms putting it in the system message for long contexts. The operational rule: treat context as a budget, not a bucket. Every token in the window should earn its seat. Stop pasting whole docs when a relevant excerpt and a citation works. Compress before you submit, not after.

::DiSSS · deconstruction questions

01What is the exact token cap of the model I'm using right now, and how do I see remaining headroom mid-conversation?
02Where does my model degrade — start, middle, or end of the window — and have I measured it, not assumed it?
03When the window fills, does my tool truncate, summarize, error, or silently drop tokens? (Each behaves differently.)
04What's my per-token input cost vs output cost, and which is the larger line item on my actual bill?
05Can I get the same answer with 10% of the context I'm currently shoving in?

::fear-setting

Cost of not learning this: you'll burn 5-50x what you need to on API bills, get worse answers than someone using 1/10th your tokens, and slowly conclude the model 'isn't smart enough' when really you're feeding it a haystack and asking it to find one needle in the middle — exactly where attention is weakest. You'll also hit hard caps mid-task and lose work. Cost of getting it wrong: catastrophic in agent loops. An agent that doesn't manage context will recursively self-poison — each tool call appends output to the next call, until the window is 95% old tool noise and 5% actual task, then the agent 'hallucinates' a wrong answer that was actually a context-saturation failure. You'll blame the model. The model wasn't the problem.

::80 / 20 cut

SKIP: the academic literature on positional encoding (RoPE, ALiBi, YaRN). You don't need it to use models well. OBSESS OVER: (1) measuring your actual token usage per call — most operators have never looked, (2) the lost-in-the-middle effect — put critical instructions at the END of long prompts, (3) context pruning before submission — strip whitespace, dedupe history, summarize old turns. One hour spent profiling your real token consumption returns more than a week reading transformer papers.

::tribe of mentors · paraphrased stances

Nelson Liu

Stanford, lead author of 'Lost in the Middle' (2023), the foundational empirical paper on middle-context degradation

Liu's stance: don't trust marketing claims about 'million-token context.' Run the needle-in-a-haystack test on YOUR prompt at YOUR position. Performance at token 500K is often not what the benchmark suggests, and the failure mode is silent.

Greg Kamradt

Independent practitioner who ran public needle-in-a-haystack tests on Claude, GPT-4, Gemini at long context

Kamradt's stance: most context-window claims are real for retrieval at the edges and degraded in the middle. Test before you trust. Your actual usable context is often half the advertised number for high-precision tasks.

Anthropic prompt engineering team

Authors of Claude prompt engineering documentation, the most operator-useful long-context guidance in the industry

Anthropic's stance: structure long context with XML tags so the model can address regions by name; put queries AFTER the documents, not before; ask the model to quote before answering when you need precision on long inputs.

Simon Willison

Co-creator of Django, builds LLM tooling daily, writes the most-read operator blog on practical LLM use

Willison's stance: context is a real cost center. He treats every long-context call as 'am I paying for this attention to work?' and routinely solves with retrieval at 1/100th the token cost.

::real-world test · this week

This week: take your most complex working prompt. Run it once. Count input tokens via the API response metadata (every major provider returns this). Then cut the prompt by 50% — strip examples down to one, drop the chatty system message preamble, remove duplicated instructions. Run it again. If output quality holds, you just freed half your bill. If quality drops, you found which 50% was actually load-bearing. Either way, you now know your real signal-to-noise ratio.

::action items · ranked

01Add token-usage logging to every API call you make this week — input, output, and total per call
02Test the lost-in-the-middle effect on your longest prompt by placing a critical instruction in three positions (start, middle, end) and comparing outputs
03Restructure your top three prompts so the actual question lives in the last 10% of the input, not the first
04Set a per-task token budget and log when you exceed it — treat overages like cloud-cost overages
05Replace one document-paste workflow this week with a retrieval-plus-citation pattern to cut tokens by 10-50x