Jason Voorhees
I've been following this for years, and here's my list of Obsidian notes with research papers going back to 2022.
Nothing I say here will be my own words. I've read a lot of these papers, and every objective statement I make in this thread comes with a link to a study that backs it up. At the end I'll give you my take, which you can agree or disagree with. I've condensed all the information into an easy-to-read format: no math formulas, no deep research or technical things, and I've simplified all the graphs to make them less academic.
First, some terms, so this is understandable for someone non-technical:
LLM - Large Language Model. A program trained on a huge pile of text that predicts the next word over and over. That's how it writes sentences.
Parameters - you can think of them as the internal adjustable dials that the model tunes during training. More dials = a bigger model, and generally a better one. When you see 7B written next to a model's name, that means 7 billion of these dials. The largest Claude/GPT models are widely believed to have hundreds of billions of parameters or more (the exact counts aren't published).
Big models vs small models (SLMs) - I'll use the famous analogy. A big model is like a giant general hospital: it knows everything, but it's expensive and slow to get to. A small model is like a local clinic that's fast, cheap, and handles 90% of what you actually walk in with. Small Language Models are roughly 1-14B dials; big models are above that.
Cloud vs local / on-device - Cloud means the AI runs on a company's servers. Local / on-device means it runs on the chip inside your phone or laptop, so your data never leaves your hands and it can run without internet.
Tokens - a token is a chunk of text, roughly a word or a piece of a word.
NPU - Neural Processing Unit, a dedicated AI chip built into phones and laptops these days.
Quantization - compressing a model so it fits on a small device, like how people compress big-ass photos into smaller pictures to send. Some quality is lost, but it's the same model.
Distillation - training a small "student" model to imitate a big "teacher" model. This is how tiny models get surprisingly smart. Remember this word. I've bolded and colored this term and the next one because this is where a lot of the recent innovation has happened.
Pretraining vs test-time (inference) compute - pretraining is the studying the model does in advance. Test-time compute is how much "thinking" it's allowed to do in the moment you ask it something.
AGI - Artificial General Intelligence: the hypothetical AI that can do basically any intellectual task a human can. Nobody agrees on the exact definition, which is actually half the problem.
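To make the Parameters and Quantization entries a bit more concrete, here's a rough back-of-the-envelope sketch in Python. It only counts the memory needed to hold the dials themselves (real runtimes need extra room on top of that), and the function name and the bit-widths I picked are my own illustration, not something taken from the papers.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to store the weights, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

seven_b = 7e9  # "7B" = 7 billion parameters

# Fewer bits per dial = more aggressive quantization = smaller file.
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(seven_b, bits):.1f} GB")
```

So the same 7B model shrinks from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is roughly the difference between not fitting in a phone's memory and fitting.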
Part 1: Local Models
This is the headline story of 2026: the shift toward small and local models.
I built an app for a user, and it was only possible because of the advancements in local LLMs in 2025-26.
You can freely download very capable open models now. The Gemma and Qwen families handle everyday tasks at a quality comparable to cloud models from a year or two earlier.
A small (7B) model costs roughly 10-30x less to run than a large one, and current phone chips can run 8-billion-dial models at conversational speed (20+ tokens/sec).
Deployment has also matured a lot. I was able to do it in little time because running a model locally isn't a weeks-long research project anymore. You have Ollama, LM Studio, llama.cpp, etc.
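To give a feel for how little code "running locally" takes now, here's a minimal sketch that talks to a local Ollama server over its HTTP API using only the Python standard library. It assumes Ollama is installed and running on its default port, and the model tag `gemma3:4b` is just my placeholder (swap in whatever `ollama list` shows on your machine). This is not the code from my app, only an illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address

def build_request(prompt: str, model: str = "gemma3:4b") -> urllib.request.Request:
    """Build (but don't send) a non-streaming generate request.

    The model tag is an assumption; use any model you have pulled locally.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_local(prompt: str, model: str = "gemma3:4b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Usage (needs a running Ollama server with the model pulled):
#   print(ask_local("Explain quantization in one sentence."))
```

Nothing here leaves your machine: the request goes to localhost, which is the whole point of the local setup.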
Part 2: The scaling debate
One thing people noticed in the early days of AI is that when you threw more compute at a model, it gained new abilities and more intelligence on its own.
This is what people call the scaling laws, and in 2026 we're now seeing their limits.
The original scaling-laws result, performance improving predictably as you add size, data, and compute:
Kaplan et al. (2020), "Scaling Laws for Neural Language Models"
The "new abilities appear as you scale" observation:
Wei et al. (2022), "Emergent Abilities of Large Language Models"
By 2026, pretraining scaling (bigger model + more data) has reached the point of diminishing returns. GPT-5 being called dead on arrival is the headline example.
HEC Paris (2025), "AI Beyond the Scaling Laws."
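If you want to see why a "predictable" scaling law still means diminishing returns, here's a tiny Python sketch. It assumes loss follows a pure power law in model size, and the exponent is roughly the parameter-count value from Kaplan et al. as I remember it (about 0.076), so treat the number as illustrative and check the paper. This is just the arithmetic behind the shape of the curve, not the full argument of the HEC Paris piece.

```python
ALPHA_N = 0.076  # approximate exponent for parameter count; illustrative only

def relative_loss(scale_up: float, alpha: float = ALPHA_N) -> float:
    """Loss after multiplying model size by `scale_up`, relative to the starting loss.

    Assumes loss is proportional to size ** (-alpha), i.e. a pure power law.
    """
    return scale_up ** (-alpha)

for factor in (10, 100, 1000):
    drop = (1 - relative_loss(factor)) * 100
    print(f"{factor:>5}x bigger model -> loss down ~{drop:.0f}%")
```

Under these assumptions every 10x in model size buys only about a 16% drop in loss, while the bill for that 10x keeps growing. Same law, worse and worse deal.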
Test-time compute means letting a model reason longer before answering. This is a separate lever entirely, and it has kept yielding gains. Research shows a smaller model given more thinking time can outperform a much larger model that answers instantly; in some cases a reasoning model can beat one ~14x its size.
Snell, Lee, Xu & Kumar (2024), "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters."
(the 14x framing) Vardey (2026), "The Frontier: Reasoning Models, Scaling Laws - What's Actually Coming Next."
So the approach shifted from "make it bigger" to "make it reason", and that's how frontier models kept gaining ability. Claude's thinking effort control is exactly this.
Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."
OpenAI (2024), "Learning to Reason with LLMs" (o1)
A reasoning model comes at a cost: it first generates a hidden chain of thinking tokens, then writes the visible reply. Those thinking tokens are billed at the same output rate even though you never see them.
www.codeant.ai, "Why Output & Reasoning Tokens Inflate LLM Costs (2026 Guide)."
awesomeagents.ai, "Reasoning Model API Pricing Compared - May 2026."
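Here's the billing point about hidden thinking tokens as a small Python sketch. The prices and token counts are made up for illustration (they are not taken from either pricing guide), and the function name is mine.

```python
def reply_cost_usd(input_tokens: int, thinking_tokens: int, visible_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request when hidden thinking tokens are billed as output tokens."""
    billed_output = thinking_tokens + visible_tokens
    return (input_tokens * usd_per_m_input + billed_output * usd_per_m_output) / 1_000_000

# Hypothetical prices and token counts, for illustration only.
PRICE_IN, PRICE_OUT = 3.0, 15.0  # USD per million tokens

plain = reply_cost_usd(1_000, 0, 500, PRICE_IN, PRICE_OUT)
reasoning = reply_cost_usd(1_000, 4_000, 500, PRICE_IN, PRICE_OUT)
print(f"no thinking:   ${plain:.4f}")
print(f"with thinking: ${reasoning:.4f} ({reasoning / plain:.1f}x)")
```

With these made-up numbers the visible reply is identical in both cases, but the request that thinks first costs about 6.7x more.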
The whole market moved this way. The share of tokens served by reasoning models went from a small chunk in early 2025 to over half. Average prompt length roughly quadrupled since early 2024, and completion length nearly tripled, largely because of reasoning.
OpenRouter (2025), "State of AI 2025: 100T Token LLM Usage Study."
In the pure scaling era, you trained one giant model once and you were done. In the reasoning era the cost moved to the meter: every hard query spends more tokens. This is why people these days are saying AI is too expensive.
Silicon Data (2026), "Understanding LLM Cost Per Token: A 2026 Practical Guide."
But on the bigger question experts still disagree:
Some (like Yann LeCun) argue current methods alone won't reach AGI and new breakthroughs are needed.
The International AI Safety Report 2026 states that scaling the key inputs (compute, data, algorithms) is technically feasible to around 2030 before hitting a fundamental bottleneck.
Bengio et al. (2026), "International AI Safety Report 2026."
These are genuinely open and contested questions with strong arguments on both sides, but the one thing pretty much everyone agrees on is that at the frontier this is too expensive.
In April 2026, Claude released Mythos and then said that Fable 5's lead over its other models grows the longer and more complex the task gets. In other words, it is the long-horizon reasoning/agentic lever from Part 2 paying off, but the cost and token usage are too much. It's priced like premium reasoning, which is a big part of why local models are being preferred, e.g. a fine-tuned 7B handling a document at ~$0.02 vs ~$0.30 for a frontier cloud API call.
Part 3: How the small models and the big ones are connected
Remember distillation? The clever little phone-sized model is usually a student trained to imitate a giant teacher model. No frontier teacher, no smart student, so the big models are the factory that makes the small ones good.
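For the curious, here's what "student imitates teacher" can look like in code. This is a simplified sketch of one common formulation (soft-label distillation, where the student is penalized for disagreeing with the teacher's softened next-word probabilities). The post's sources don't say which exact recipe any given model used, and the scores below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened next-word distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]       # made-up scores over a 3-word vocabulary
far_student = [0.5, 1.0, 4.0]   # disagrees with the teacher
near_student = [3.8, 1.1, 0.4]  # almost matches the teacher

print(distillation_loss(teacher, far_student))   # larger value
print(distillation_loss(teacher, near_student))  # value close to zero
```

Training pushes this number toward zero, which is just a formal way of saying the student learns to answer the way the teacher would.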
On top of that, the field keeps finding cleverer training tricks, so the compute needed to hit a given capability keeps dropping, roughly 3x per year by some measures.
"Compute Requirements for Algorithmic Innovation in Frontier AI Models" (2025).
That's why the common 2026 setup isn't local replacing cloud entirely; it's hybrid. The on-device model handles routine stuff and falls back to a cloud model for the genuinely hard queries and more intensive parts.
AI Magicx (2026), "On-Device AI in 2026."
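A hybrid setup like this can be sketched in a few lines of Python. To be clear, this is not the code from my app: the difficulty check is a crude, hypothetical heuristic, and the two "models" are stand-in functions so the sketch runs without any real backend.

```python
from typing import Callable

def looks_hard(prompt: str, max_easy_words: int = 200) -> bool:
    """Crude, hypothetical difficulty check: long prompts or ones asking for multi-step work."""
    hard_markers = ("prove", "step by step", "refactor", "analyze")
    text = prompt.lower()
    return len(text.split()) > max_easy_words or any(m in text for m in hard_markers)

def route(prompt: str,
          local_model: Callable[[str], str],
          cloud_model: Callable[[str], str]) -> str:
    """Try the on-device model first; use the cloud for hard prompts or local failures."""
    if not looks_hard(prompt):
        try:
            return local_model(prompt)
        except Exception:
            pass  # local model unavailable or errored: fall through to the cloud
    return cloud_model(prompt)

# Stand-in models so the sketch runs without any real backend.
local = lambda p: f"[local] {p}"
cloud = lambda p: f"[cloud] {p}"

print(route("What's the capital of France?", local, cloud))
print(route("Analyze this contract step by step.", local, cloud))
```

The design choice is that the cheap path is the default and the expensive path is the exception, so you only pay cloud prices for the queries that actually need it.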
I actually implemented this with the Voice feature of that app I built for the user. This is the pragmatic 2026 approach, instead of fighting ghosts.
Part 4: Meek models shall inherit the earth
There's a 2025 paper literally titled Meek Models Shall Inherit the Earth. Its claim: because returns to raw compute are shrinking, the capability gap between the most expensive models and cheaper ones opens up and then closes, so the little guys catch up. If that holds, frontier-level ability ends up everywhere, in everyone's pocket.
Gundlach, Lynch & Thompson (2025), "Meek Models Shall Inherit the Earth."
The same direction shows up in forecasts of how many models cross each capability bar over time.
Epoch AI (2025), "How Many AI Models Will Exceed Compute Thresholds?"
TLDR-
- Small/local AI is a real, current 2026 trend, driven by efficiency, speed, privacy, and better hardware.
- Pretraining scaling shows diminishing returns; test-time/reasoning compute is still producing results, at the price of higher token usage and bigger bills.
- Frontier capability kept advancing in 2026; whether/when it reaches AGI is contested.
- Small models are generally derived from big ones (distillation) plus steady efficiency gains.
- Everyone agrees frontier inference is getting too expensive per token.
My take
From what I've seen, Silicon Valley companies have taken a hands-off, sideline approach to the frontier dream. Not "we gave up", more like "we're not betting on it happening anytime soon". If you talk to a lot of the founders and CEOs, you mostly get a shrug. AGI might happen, might not, but either way the bills are brutal right now. Because of this, most people are moving towards pocket AI that is cheap per query. That's the strongest trend and the one I'd put the most money on.
The "replaced with AI" story was the headline in 2024 and 2025, but in 2026 the new headline is the walkback. Not because AI failed, but because once you add up the errors, the cleanup, and rehiring people at higher salaries, a lot of those "AI savings" evaporated.
www.washingtontimes.com, "Companies rehire workers after AI replacements fail."
Where I think people overreach is the "scaling is dead, so AGI is far off" line. It's true for "bigger is not always better", but "think longer" is still alive and well. And as I keep saying, the big models are exactly what make the small ones good. So to me this looks less like the moonshot failed and more like the moonshot's results are going straight into your phone, fast. I've always been an AI optimist and always embraced new technology, but feel free to disagree with me. I've laid out all the facts and you can form an opinion for yourself.
