The 2026 AI landscape with sources under every claim

Jason Voorhees · 2026-06-25T01:12:23-0500

I have been following this for years, and here's my list of Obsidian notes with research papers going back to 2022.

Nothing I say here rests on my word alone. I've read a lot of these papers, and every objective statement I make in this thread comes with a link to a study that backs it. At the end I'll give you my take, which you can agree or disagree with. I've condensed all the information into an easy-to-read format: no math formulas, no deep research or technical detail, and I've simplified all the graphs to make them less academic.

First, some terms, so this is understandable for someone non-technical:

LLM (Large Language Model): a program trained on a huge pile of text that predicts the next word over and over. That's how it writes sentences.

Parameters: you can think of them as the internal adjustable dials that the model tunes during training. More dials = a bigger model, and generally a better one. When you see 7B written next to a model name, that means 7 billion of these dials. Frontier models like Claude and GPT are believed to have hundreds of billions of them, though the exact counts aren't published.

Big models vs small models (SLMs): I'll use the famous analogy. A big model is like a giant general hospital: it knows everything, but it's expensive and slow to get to. A small model is like a local clinic that's fast, cheap, and handles 90% of what you actually walk in with. Small Language Models are roughly 1–14B dials; big models are above that.

Cloud vs local / on-device: Cloud means the AI runs on a company's servers. Local / on-device means it runs on the chip inside your phone or laptop. Your data never leaves your hands, and it can run without internet.

Tokens: a token is a chunk of text, roughly a word or a piece of a word.

NPU (Neural Processing Unit): a dedicated AI chip built into phones and laptops these days.

Quantization: compressing a model so it fits on a small device, like how people compress big-ass photos into smaller pictures to send. Some quality is lost, but it's the same model.

Distillation: training a small "student" model to imitate a big "teacher" model. This is how tiny models get surprisingly smart. Remember this word. I've bolded and colored this and the next term because this is where a lot of the recent innovation has happened.

Pretraining vs test-time (inference) compute: pretraining is the studying the model does in advance. Test-time compute is how much "thinking" it's allowed to do in the moment you ask it something.

AGI (Artificial General Intelligence): the hypothetical AI that can do basically any intellectual task a human can. Nobody agrees on the exact definition, which is actually half the problem.
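To make the parameters and quantization entries concrete, here is a rough sketch of the arithmetic. It only counts weight storage (real apps need extra memory on top), and the simple symmetric 8-bit scheme shown is just one illustrative way to quantize, not necessarily what any particular tool does. The function names are made up for this example.

```python
def model_size_gb(params_billions, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats; the rounding error is the 'quality lost'."""
    return [q * scale for q in quantized]


# A 7B model at 16 bits per dial vs the same model squeezed to 4 bits per dial.
print(model_size_gb(7, 16), "GB at 16-bit")
print(model_size_gb(7, 4), "GB at 4-bit")
```

So the same 7B model drops from about 14 GB to about 3.5 GB of weights, which is the difference between not fitting on a phone and fitting.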

Part 1: Local Models

This is the headline feature of 2026: the shift toward small and local models.

This is an app I built for a user. It was only possible because of the advancements in local LLMs in 2025–26.

You can freely download very capable open models now. The Gemma and Qwen families handle everyday tasks at a quality comparable to cloud models from a year or two earlier.

A small (7B) model costs roughly 10–30x less to run than a large one, and current phone chips can run 8-billion-dial models at conversational speed (20+ tokens/sec).

Deployment has also matured a lot. I was able to do it in little time because running a model locally isn't a weeks-long research project anymore. You have Ollama, LM Studio, llama.cpp and so on.
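To show how little code "running a model locally" takes now, here is a minimal sketch that talks to Ollama's local HTTP endpoint using only the Python standard library. It assumes Ollama is already installed and running on its default port and that you have pulled a model; the model name you pass in is whatever tag you pulled, and the helper names are made up for this example.

```python
import json
import urllib.request

# Ollama's default local address; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """JSON body for a single non-streaming completion request."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def ask_local(model, prompt, url=OLLAMA_URL, timeout=60.0):
    """Send the prompt to the local server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

Calling `ask_local("<your-model-tag>", "Explain quantization in one sentence.")` returns the reply as a string, and nothing leaves your machine.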

Part 2: The scaling debate

One thing people noticed in the early days of AI is that when you started throwing more compute at a model, it gained new abilities and more intelligence on its own.

This is called the scaling law, and in 2026 we're now seeing its limits.

The original scaling-laws result (performance improving predictably as you add size, data, and compute):

Kaplan et al. (2020), "Scaling Laws for Neural Language Models"

The "new abilities appear as you scale" observation:

Wei et al. (2022), "Emergent Abilities of Large Language Models"

Pretraining scaling (bigger model + more data) has reached the point of diminishing returns in 2026. GPT-5 being "dead on arrival" is the headline example.

HEC Paris (2025), "AI Beyond the Scaling Laws."

Test-time compute means letting a model reason longer before answering. This is a separate lever entirely, and it has kept yielding gains. Research shows a smaller model given more thinking time can outperform a much larger model that answers instantly; in some cases a reasoning model can beat one ~14x its size.

Snell, Lee, Xu & Kumar (2024), "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters."

(the 14x framing) Vardey (2026), "The Frontier: Reasoning Models, Scaling Laws — What's Actually Coming Next."
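A toy way to see why extra thinking time helps: let the model try several times and keep an answer that a checker accepts (often called best-of-N). The sketch below is an idealized illustration, not the method from the papers above. It assumes every attempt is independent and that the checker is perfect, which real systems only approximate.

```python
import random


def best_of_n_success(p_single, n):
    """Chance that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p_single) ** n


def simulate(p_single, n, trials=20000, seed=0):
    """Monte Carlo check of the formula, using a seeded random generator."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # A trial succeeds if any of the n attempts lands a correct answer.
        if any(rng.random() < p_single for _ in range(n)):
            wins += 1
    return wins / trials


# A weak model that is right 20% of the time per attempt.
for n in (1, 2, 4, 8, 16):
    print(n, round(best_of_n_success(0.2, n), 3))
```

With a 20% single-shot hit rate, eight attempts already get you to roughly 83% under these assumptions. The price is that you pay for eight attempts' worth of tokens.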

So the approach shifted from "make it bigger" to "make it reason", and that's how frontier models kept gaining ability. Claude's thinking effort control is exactly this.

Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."

OpenAI (2024), "Learning to Reason with LLMs" (o1).

A reasoning model comes at a cost: it first generates a hidden chain of thinking tokens, then writes the visible reply. Those thinking tokens are billed at the same rate as output tokens, even though you never see them.

Why Output & Reasoning Tokens Inflate LLM Costs (2026 Guide)

Reasoning tokens can cost 6x more than inputs. Learn what actually drives LLM pricing and how to optimize spend.

www.codeant.ai

Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.

awesomeagents.ai
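The billing rule above is easy to put into numbers. This is a small sketch with made-up prices and token counts, purely for illustration; plug in the real rates from whichever provider you use.

```python
def query_cost_usd(input_tokens, visible_output_tokens, reasoning_tokens,
                   input_price_per_mtok, output_price_per_mtok):
    """Cost of one request when hidden reasoning tokens are billed as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * input_price_per_mtok
             + billed_output * output_price_per_mtok)
    return total / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
plain = query_cost_usd(1000, 500, 0, 2.0, 8.0)
reasoning = query_cost_usd(1000, 500, 4000, 2.0, 8.0)
print(f"no reasoning:   ${plain:.4f}")
print(f"with reasoning: ${reasoning:.4f}")
```

Same question, same visible answer, but with 4,000 hidden thinking tokens the bill in this example is more than six times higher.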

The whole market moved this way. The share of tokens served by reasoning models went from a small chunk in early 2025 to over half. Average prompt length has roughly quadrupled since early 2024, and completion length has nearly tripled, largely because of reasoning.

State of AI 2025: 100T Token LLM Usage Study | OpenRouter

Read OpenRouter's 2025 State of AI report — an empirical 100 trillion token study of real LLM usage, model trends, and developer insights.

openrouter.ai

In the pure scaling era, you trained one giant model once and you were done. In the reasoning era the cost moved to the meter: every hard query spends more tokens. This is why people these days are saying AI is too expensive.

Understanding LLM Cost Per Token: A 2026 Practical Guide - Silicon Data — GPU Performance Data for Companies

A 2026 guide to real-world LLM token costs, model pricing, and proven ways to reduce spend

www.silicondata.com

But on the bigger question experts still disagree:

Some (like Yann LeCun) argue current methods alone won't reach AGI and new breakthroughs are needed.

The International AI Safety Report 2026 states that scaling the key inputs (compute, data, algorithms) is technically feasible to around 2030 before hitting a fundamental bottleneck.

Bengio et al. (2026), "International AI Safety Report 2026."

These are genuinely open-ended and contested questions with strong arguments on both sides, but one thing everyone agrees on is that at the frontier this is too expensive.

In April 2026, Claude released Mythos and then said that Fable 5's lead over its other models grows the longer and more complex the task. In other words, it is the long-horizon reasoning/agentic lever from Part 2 paying off, but the cost and token usage are too much. It's priced like premium reasoning, which is a big part of why local models are being preferred, e.g. a fine-tuned 7B handling a document at ~$0.02 vs ~$0.30 for a frontier cloud API call.

Part 3: How the small models and the big ones are connected

Remember distillation. The clever little phone-sized model is usually a student trained to imitate a giant teacher model. No frontier teacher, no smart student, so the big models are the factory that makes the small ones good.
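For the curious, here is roughly what "imitate the teacher" means in numbers. This sketch uses the classic soft-target idea: the student is penalized by how far its predicted word probabilities are from the teacher's. It is a simplified illustration in plain Python with made-up scores; many LLM pipelines instead train the student on text the teacher generated, and real training uses a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.0]          # made-up scores for three candidate words
close_student = [3.5, 1.0, 0.2]
far_student = [0.0, 1.0, 4.0]
print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student that ranks the words like the teacher gets a loss near zero, the one that disagrees gets a much larger loss, and training pushes that number down.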

On top of that, the field keeps finding cleverer training tricks, so the compute needed to hit a given capability keeps dropping, roughly 3x per year by some measures.

"Compute Requirements for Algorithmic Innovation in Frontier AI Models" (2025).

That's why the common 2026 setup isn't local replacing cloud entirely; it's hybrid. The on-device model handles routine stuff and falls back to a cloud model for the genuinely hard queries and more intensive parts.

AI Magicx (2026), "On-Device AI in 2026."
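A hybrid setup like this can be sketched in a few lines. Everything here is hypothetical: the local model is assumed to return an answer plus some confidence score (how you get that score is its own problem), the 0.7 threshold is arbitrary, and the 10% escalation rate in the cost example is a guess. The $0.02 and $0.30 figures are the per-document numbers quoted earlier in this post.

```python
def route(query, local_model, cloud_model, min_confidence=0.7):
    """Try the on-device model first; escalate to the cloud if it is unsure."""
    answer, confidence = local_model(query)
    if confidence >= min_confidence:
        return answer, "local"
    return cloud_model(query), "cloud"


def expected_cost_per_query(local_cost, cloud_cost, escalation_rate):
    """Every query pays for the local attempt; escalated ones also pay cloud."""
    return local_cost + escalation_rate * cloud_cost


# Assumed: 10% of queries are hard enough to need the cloud model.
print(expected_cost_per_query(0.02, 0.30, 0.10))
```

Under those assumptions the average comes out around $0.05 per query instead of $0.30, while the hard 10% still get the frontier model.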

I actually implemented this in the voice feature of the app I built for that user. This is the pragmatic 2026 approach instead of fighting ghosts.

Part 4: Meek models shall inherit the earth

There's a 2025 paper literally titled "Meek Models Shall Inherit the Earth". Its claim: because returns to raw compute are shrinking, the capability gap between the most expensive models and cheaper ones opens up and then closes, so the little guys catch up. If that holds, frontier-level ability ends up everywhere, in everyone's pocket.

Gundlach, Lynch & Thompson (2025), "Meek Models Shall Inherit the Earth."

The same direction shows up in forecasts of how many models cross each compute threshold over time.

Epoch AI (2025), "How Many AI Models Will Exceed Compute Thresholds?"

TL;DR:

Small/local AI is a real, current 2026 trend, driven by efficiency, speed, privacy, and better hardware.
Pretraining scaling shows diminishing returns; test-time/reasoning compute is still producing results, at the price of higher token usage and cost.
Frontier capability kept advancing in 2026; whether/when it reaches AGI is contested.
Small models are generally derived from big ones (distillation), plus steady efficiency gains.
Everyone agrees frontier inference is getting too expensive per token.

My take

From what I've seen, Silicon Valley companies have taken a hands-off, sideline approach to the frontier dream. Not "we gave up", more like "we're not betting on it happening anytime soon". If you talk to a lot of the founders and CEOs, you mostly get a shrug. AGI might happen, might not, but either way the bills are brutal right now. Because of this, most people are moving towards pocket AI that is cheap per query. That's the strongest trend and the one I'd put the most money on.

"Replaced with AI" was the headline story in 2024 and 2025, but in 2026 the new headline is the walkback. Not because AI failed, but because once you add up the errors, the cleanup, and rehiring people at higher salaries, a lot of those "AI savings" evaporated.

Companies rehire workers after AI replacements fail

Complaints from frustrated customers have prompted e-commerce and financial technology companies to quietly rehire some content writers, software engineers and customer service workers they had replaced with AI bots.

www.washingtontimes.com

Where I think people overreach is the "scaling is dead, so AGI is far off" line. It's true that bigger is not always better, but "think longer" is still alive and well. And as I keep saying, the big models are exactly what make the small ones good. So to me this looks less like the moonshot failed and more like the moonshot's results are going straight into your phone, fast. I've always been an AI optimist and always embraced new technology, but feel free to disagree with me. I've laid out all the facts and you can form an opinion for yourself.

SplashJuice · 2026-06-25T01:14:24-0500

It's too early to predict anything. Only in 20 to 30 years will we see the effect of AI. It's like the early days of the internet.

Jason Voorhees · 2026-06-25T01:14:52-0500

@NorwoodAscender @petsmart @FunnyVALENTINE @xaxanibber @Sycophant

Jason Voorhees · 2026-06-25T01:15:39-0500

@Chadeep @LXR @dhusc @chang cypionate

chang cypionate · 2026-06-25T01:16:26-0500

Jason Voorhees said:
@Chadeep @LXR @dhusc @chang cypionate

Thank you will read after work

Jason Voorhees · 2026-06-25T01:17:57-0500

ㅤ @exvh @primal_shitmuncher @Souleth

xaxanibber · 2026-06-25T01:18:09-0500

will read after cs game

Chadeep · 2026-06-25T01:18:37-0500

Why did they ban Mythos

Jason Voorhees · 2026-06-25T01:19:22-0500

@Rick_bozo @milkshake_addict @hollywoodngl @Aㅤㅤㅤ

exvh · 2026-06-25T01:20:32-0500

nice guide

will read in a min

princeof316 · 2026-06-25T01:25:04-0500

Thank u for the read

Rick_bozo · 2026-06-25T01:25:58-0500

Gud

LTG · 2026-06-25T01:26:31-0500

high iq thread

Jason Voorhees · 2026-06-25T01:29:59-0500

@Centurion Hunter @afroheadluke @callard

Centurion Hunter · 2026-06-25T01:33:57-0500

Jason Voorhees said:
I have been following this since years and here's my list of Obsidian Notes with research papers going back till 2022.

View attachment 5270171

Nothing I say will be my own words, I've read a lot of these papers and every objective statement I make in this thread will come with a link to a study which validates it and in the end I'll give you my take which you can agree or disagree with. I've condensed all the information in an easy to read and understand format no math formulas, no deep research or technical things and I have simplified all the graphs to make them less academic

First, some terms so this is understandable for someone non technical

LLM— a program trained on a huge pile of text that predicts the next word over and over. That's how it writes sentences. LLM= Large Language Model.

Parameters — you can think of them as the internal adjustable dials that it tunes during training More dials = a bigger model and generally better. When you see 7B written nextthat means 7 billion of these dials. Claude/GPT models have hundreds of billions of these parameters..

Big models vs small models (SLMs) — I'll use the famous analogy. A big model is like a giant general hospital knows everything, expensive, slow to get to vs a local clinic that's fast, cheap, and handles 90% of what you actually walk in with. Small Language Models are roughly 1–14B dials, big models above that

Cloud vs local / on-device — Cloud means the AI runs on a company's servers. Local / on-device means it runs on the chip inside your phone or laptop. Your data never leaves your hand can run without internet.

Tokens — a token is a chunk of text, around a word-piece.

NPU — a dedicated AI chip built into phones and laptops these days.

Quantization — compressing a model so it fits on a small device like how people compress big ass photos into smaller pictures to send. Some quality lost, but same model.

Distillation — training a small "student" model to imitate a big "teacher" model. This is how tiny models get surprisingly smart. Remember this word I've bolded and colored this and the next term because this is where a lot of the recent innovation has happened.

View attachment 5270162

Pretraining vs test-time (inference) compute — pretraining is the studying the model does in advance. Test time compute is how much "thinking" it's allowed to do in the moment you ask it something.

View attachment 5270170

AGI— Artificial General Intelligence: the hypothetical AI that can do basically any intellectual task a human can. Nobody agrees on the exact definition which is actually half the problem.

Part 1: Local Models

This is headline feature of 2026, the shift toward small and local models

This app I built for a user. This was only possible because of advancements in local LLMs in 2025–26.

You can freely download very capable open models now, the Gemma and Qwen families handle everyday tasks at a quality comparable to cloud models from a year or two earlier.

A small (7B) model costs roughly 10–30x less to run than a large one, and current phone chips can run 8-billion-dial models at conversational speed (20+ tokens/sec)

Deployment has also matured a lot. I was able to do it in little time because now running a model locally isn't a research project taking weeks anymore. You have Ollama, LM Studio and llama.cpp etc.

Part 2: the scaling debate

One thing people noticed in the early days of AI is that when you started throwing more compute at a model, it gained new abilities and more intelligence on its own.

This is called the scaling law and in 2026 we're now seeing its limits.

The original scaling-laws result performance improving predictably as you add size, data, and compute

Kaplan et al. (2020), "Scaling Laws for Neural Language Models"

The new abilities appear as you scale" observation:

Wei et al. (2022), "Emergent Abilities of Large Language Models"

Pretraining scaling (bigger model + more data) in 2026 has reached the point of diminishing returns. GPT-5 dead on arrival is the headline proof

View attachment 5270158

HEC Paris (2025), "AI Beyond the Scaling Laws."

Test-time compute letting a model reason longer before answering. This is a separate entirely that has kept yielding gains. Research shows a smaller model given more thinking time can outperform a much larger model that answers instantly in some cases a reasoning model can beat one ~14x its size.

Snell, Lee, Xu & Kumar (2024), "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters."

(the 14x framing) Vardey (2026), "The Frontier: Reasoning Models, Scaling Laws — What's Actually Coming Next."

So the approach shifted from make it bigger to make it reason and that's how frontier models kept gaining ability. Claude's thinking effort control is exactly this

Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."

OpenAI (2024), "Learning to Reason with LLMs" (o1)"

View attachment 5270118

A reasoning model comes at a cost, it first generates a hidden chain of thinking tokens, then writes the visible reply. Those thinking tokens are billed at the same output rate even though you never see them.

Why Output & Reasoning Tokens Inflate LLM Costs (2026 Guide)

Reasoning tokens can cost 6x more than inputs. Learn what actually drives LLM pricing and how to optimize spend.

www.codeant.ai

Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.

awesomeagents.ai

The whole market moved this way. The share of tokens served by reasoning models went from small chunk in early 2025 to over half average prompt length roughly quadrupled since early 2024, and completion length nearly tripled largely reasoning

State of AI 2025: 100T Token LLM Usage Study | OpenRouter

Read OpenRouter's 2025 State of AI report — an empirical 100 trillion token study of real LLM usage, model trends, and developer insights.

openrouter.ai

View attachment 5270106

In pure scaling era train one giant model once and done. In the reasoning era the cost moved to the meter: Every hard query spends more tokens this is why people these days are saying AI is too expensive

Understanding LLM Cost Per Token: A 2026 Practical Guide - Silicon Data — GPU Performance Data for Companies

A 2026 guide to real-world LLM token costs, model pricing, and proven ways to reduce spend

www.silicondata.com

But on the bigger question experts still disagree:

Some (Like Yann LeCun) argue current methods alone won't reach AGI and new breakthroughs are needed

The International AI Safety Report 2026 states that scaling the key inputs (compute, data, algorithms) is technically feasible to around 2030 before hitting a fundamental bottleneck.

Bengio et al. (2026), "International AI Safety Report 2026."

These are genuinely open ended and a contested questions with strong arguments from both sides but one thing everyone agrees on is that at the frontier this is too expensive.

In April 2026, claude released mythos and then said the Fable 5's lead over its other models is that it grows the longer and more complex the task. In other words it is the long-horizon reasoning/agentic lever from Part 2 paying off but the cost and token usage is too much. It's priced like premium reasoning which is a big part of why local models are being preferred e.g. a fine-tuned 7B handling a document at ~$0.02 vs ~$0.30 for a frontier cloud API call.

View attachment 5270112
Part 3: How the small models and the big ones are connected

Remember distillation. The clever little phone sized model is usually a student trained to imitate a giant teacher model. No frontier teacher, no smart student, so the big models are the factory that makes the small ones good.

On top of that, the field keeps finding cleverer training tricks, so the compute needed to hit a given capability keeps dropping roughly 3x per year by some measures

"Compute Requirements for Algorithmic Innovation in Frontier AI Models" (2025).

That's why the common 2026 setup isn't local replacing cloud entirely; it's hybrid. The on-device model handles routine stuff and falls back to a cloud model for the genuinely hard queries and more intensive parts.

AI Magicx (2026), "On-Device AI in 2026."

I actually implemented this in the voice feature of the app I built for a user. This is the pragmatic 2026 approach instead of fighting ghosts.
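For anyone wondering what the local-first, cloud-fallback pattern looks like in code, here is a minimal sketch. It is not the code from the app mentioned above: the difficulty check, the marker phrases, and the two stub models are all hypothetical placeholders, and a real setup would call an actual local runtime and a cloud API instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    source: str  # "local" or "cloud"


def looks_hard(prompt: str, max_local_words: int = 200) -> bool:
    """Crude, hypothetical difficulty check: long or multi-step prompts count as hard."""
    hard_markers = ("step by step", "prove", "compare these documents")
    lowered = prompt.lower()
    too_long = len(prompt.split()) > max_local_words
    return too_long or any(marker in lowered for marker in hard_markers)


def route(
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
) -> Reply:
    """Send routine prompts to the on-device model and hard ones to the cloud."""
    if looks_hard(prompt):
        return Reply(cloud_model(prompt), "cloud")
    try:
        return Reply(local_model(prompt), "local")
    except RuntimeError:
        # If the local model fails (e.g. out of memory), fall back to the cloud.
        return Reply(cloud_model(prompt), "cloud")


# Stub models standing in for a real on-device runtime and a real cloud API.
def stub_local(prompt: str) -> str:
    return f"[local] {prompt}"


def stub_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


print(route("Set a timer for ten minutes", stub_local, stub_cloud).source)
print(route("Prove this step by step", stub_local, stub_cloud).source)
```

The point of the design is that the cheap path is the default and the expensive path is the exception, so most queries never leave the device.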

View attachment 5270184

Part 4: Meek models shall inherit the earth

There's a 2025 paper literally titled Meek Models Shall Inherit the Earth. Its claim: because returns to raw compute are shrinking, the capability gap between the most expensive models and cheaper ones opens up and then closes — the little guys catch up. If that holds, frontier-level ability ends up everywhere, in everyone's pocket.

View attachment 5270104

Gundlach, Lynch & Thompson (2025), "Meek Models Shall Inherit the Earth."
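The "opens up and then closes" shape can be reproduced with a toy model. This is only a sketch under my own assumptions, not the paper's actual model or numbers: loss falls as a power law in effective compute, the "meek" model has a fixed budget, the frontier budget grows every year, and both benefit from the same yearly algorithmic efficiency gain. All three constants are hypothetical.

```python
ALPHA = 0.3            # hypothetical power-law exponent (diminishing returns)
ALGO_GAIN = 3.0        # hypothetical yearly algorithmic efficiency gain, shared by both
FRONTIER_GROWTH = 4.0  # hypothetical yearly growth in frontier spending


def reducible_loss(effective_compute: float) -> float:
    """Power-law diminishing returns: each extra unit of compute helps less."""
    return effective_compute ** -ALPHA


def gap(year: int) -> float:
    """How much worse the fixed-budget model is than the frontier model."""
    meek_compute = ALGO_GAIN ** year  # fixed budget, better algorithms
    frontier_compute = (ALGO_GAIN * FRONTIER_GROWTH) ** year  # growing budget too
    return reducible_loss(meek_compute) - reducible_loss(frontier_compute)


gaps = [gap(year) for year in range(16)]
peak_year = max(range(len(gaps)), key=gaps.__getitem__)

for year, value in enumerate(gaps):
    print(f"year {year:2d}: gap {value:.4f}")
print(f"gap is widest in year {peak_year}, then shrinks")
```

With these made-up constants the gap starts at zero, is widest around year 2, and then shrinks toward zero as both models run into diminishing returns.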

The same direction shows up in forecasts of how many models cross each compute threshold over time.

Epoch AI (2025), "How Many AI Models Will Exceed Compute Thresholds?"

TL;DR

Small/local AI is a real, current 2026 trend. It prioritizes efficiency, speed, and privacy, and rides on better hardware.

Pretraining scaling shows diminishing returns; test-time/reasoning compute is still producing gains, at the price of hitting token limits and higher bills.

Frontier capability kept advancing in 2026; whether/when it reaches AGI is contested.

Small models are generally derived from big ones (distillation) plus steady efficiency gains.

Everyone agrees frontier inference is getting too expensive per token.

My take

From what I've seen, companies in Silicon Valley have taken a hands-off, sideline approach to the frontier dream. Not "we gave up", more like "we are not betting on it happening anytime soon". If you talk to a lot of the founders and CEOs, you mostly get a shrug. AGI might happen, might not, but either way the bills are brutal right now. Because of this, most people are moving towards pocket AI that is cheap per query. That's the strongest trend and the one I'd put the most money on.

"Replaced with AI" was the headline story in 2024 and 2025, but in 2026 the new headline is the walkback. Not because AI failed, but because once you add up the errors, the cleanup, and rehiring people at higher salaries, a lot of those "AI savings" evaporated.

Companies rehire workers after AI replacements fail

Complaints from frustrated customers have prompted e-commerce and financial technology companies to quietly rehire some content writers, software engineers and customer service workers they had replaced with AI bots.

www.washingtontimes.com

Where I think people overreach is the "scaling is dead, so AGI is far off" line. It's true for the "make it bigger" lever, but "think longer" is still alive and well. And as I keep saying, the big models are exactly what make the small ones good. So to me this looks less like the moonshot failed and more like the moonshot's results are going straight into your phone, fast. I've always been an AI optimist and have always embraced new technology, but feel free to disagree with me. I've laid out all the facts and you can form an opinion for yourself.

Interesting
To be honest, we shall see how it moves forward. Not much point in trying to predict when many of the AI companies don't even know themselves how much they could optimize their models.

LXR · 2026-06-25T01:37:42-0500

Jason Voorhees said:
I have been following this since years and here's my list of Obsidian Notes with research papers going back till 2022.

My prediction is that Google will win because it is the only company that is still in contention and has proved that it can last. AI is not its most important part, but it certainly can become that. Till then Google Labs will keep pumping out research and finding use cases most will never bother about. Anthropic and OpenAI simply can't compete in those niches because they have no moat apart from AI.

pegorino · 2026-06-25T01:47:33-0500

Nice
Anyone know about Ai bots and the science behind that. Youtube comments

Jason Voorhees · 2026-06-25T01:47:41-0500

@callard @Sayori @

Dr. Mog · 2026-06-25T01:50:41-0500

SplashJuice said:
It's too early to predict anything, only in 20 to 30 years we will see the effect of AI. It's like the early days of the internet.

It’s crazy to think that the person who created the World Wide Web is still alive and well today, meaning it isn’t even that old and historically we are still in the early days of the Internet, yet we now have the addition of AI, which is changing the Internet exponentially.

LXR · 2026-06-25T01:56:26-0500

Dr. Mog said:
It’s crazy to think that the person who created the World Wide Web is still alive and well today, meaning it isn’t even that old and historically we are still in the early days of the Internet, yet we now have the addition of AI, which is changing the Internet exponentially.

In some years we will talk about the people behind the Transformers paper as the originators of the LLM boom

Jason Voorhees · 2026-06-25T02:20:32-0500

@blinkers @Pony

blinkers · 2026-06-25T02:21:18-0500

Mirim the high iq thread always love reading them

Bump

Jason Voorhees · 2026-06-25T02:54:31-0500

blinkers said:
Mirim the high iq thread always love reading them

Bump

Thx man

Jason Voorhees · 2026-06-25T02:59:53-0500

@TheGreatDetective

Jason Voorhees · 2026-06-25T03:25:06-0500

@zennn

Jason Voorhees · 2026-06-25T03:32:11-0500

Jason Voorhees · 2026-06-25T03:36:26-0500

@Mast @buccalfatremoval

IronMike · 2026-06-25T03:37:08-0500

20$ a month, never hit limits, make 100s of photos a day

Jason Voorhees · 2026-06-25T03:41:54-0500

@Gomez @munnabhai

Jason Voorhees · 2026-06-25T03:43:39-0500

@Hess @Luquier

zennn · 2026-06-25T04:15:44-0500

This makes so much sense.
Its why ChatGPT has felt so brain rotted lately, theyre forcing the models to budget tokens :Bruh:

Jason Voorhees · 2026-06-25T04:37:59-0500

zennn said:
This makes so much sense.
Its why ChatGPT has felt so brain rotted lately, theyre forcing the models to budget tokens

Still no rep given

Jason Voorhees · 2026-06-25T04:38:20-0500

@inyerta @socio

zennn · 2026-06-25T04:38:25-0500

Jason Voorhees said:
Still no rep given

whoopsie

Jason Voorhees · 2026-06-25T04:39:17-0500

@ReadBooksEveryday

socio · 2026-06-25T04:44:03-0500

Good thread. I think a mix of local and cloud AI is the most likely future.

Jason Voorhees · 2026-06-25T04:46:24-0500

socio said:
Good thread. I think a mix of local and cloud AI is the most likely future.

Agreed

ReadBooksEveryday · 2026-06-25T04:48:33-0500

With the ramp up on production w Memory (Micron) , Photonics, and capitalization

AI is basically speed running the economy whilst every other industry is stagnating

munnabhai · 2026-06-25T04:59:49-0500

Jason Voorhees said:
I have been following this since years and here's my list of Obsidian Notes with research papers going back till 2022.

Read every last word

Super simple and easy to understand even for a low-IQ guy like myself

When can we get real-world applications like AI traffic lights, bro, that just work based on real-time traffic to eliminate waiting? Stuff like that

Jason Voorhees · 2026-06-25T05:08:46-0500

munnabhai said:
Read every last word

Super simple and easy to understand even for a low-IQ guy like myself

When can we get real-world applications like AI traffic lights, bro, that just work based on real-time traffic to eliminate waiting? Stuff like that

Already exists, bro. Project Green Light. Already in use in Boston and Seattle.

Mast · 2026-06-25T05:10:52-0500

Jason Voorhees said:
I have been following this since years and here's my list of Obsidian Notes with research papers going back till 2022.

View attachment 5270171

Nothing I say will be my own words, I've read a lot of these papers and every objective statement I make in this thread will come with a link to a study which validates it and in the end I'll give you my take which you can agree or disagree with. I've condensed all the information in an easy to read and understand format no math formulas, no deep research or technical things and I have simplified all the graphs to make them less academic

First, some terms so this is understandable for someone non technical

LLM— a program trained on a huge pile of text that predicts the next word over and over. That's how it writes sentences. LLM= Large Language Model.

Parameters — you can think of them as the internal adjustable dials that it tunes during training More dials = a bigger model and generally better. When you see 7B written nextthat means 7 billion of these dials. Claude/GPT models have hundreds of billions of these parameters..

Big models vs small models (SLMs) — I'll use the famous analogy. A big model is like a giant general hospital knows everything, expensive, slow to get to vs a local clinic that's fast, cheap, and handles 90% of what you actually walk in with. Small Language Models are roughly 1–14B dials, big models above that

Cloud vs local / on-device — Cloud means the AI runs on a company's servers. Local / on-device means it runs on the chip inside your phone or laptop. Your data never leaves your hand can run without internet.

Tokens — a token is a chunk of text, around a word-piece.

NPU — a dedicated AI chip built into phones and laptops these days.

Quantization — compressing a model so it fits on a small device like how people compress big ass photos into smaller pictures to send. Some quality lost, but same model.

Distillation — training a small "student" model to imitate a big "teacher" model. This is how tiny models get surprisingly smart. Remember this word I've bolded and colored this and the next term because this is where a lot of the recent innovation has happened.

View attachment 5270162

Pretraining vs test-time (inference) compute — pretraining is the studying the model does in advance. Test-time compute is how much "thinking" it's allowed to do in the moment you ask it something.

View attachment 5270170

AGI — Artificial General Intelligence: the hypothetical AI that can do basically any intellectual task a human can. Nobody agrees on the exact definition, which is actually half the problem.
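To make the "dials" and quantization terms above a bit more concrete, here is a rough sketch of the arithmetic. It only counts the storage for the weights themselves (a real runtime needs extra memory on top), and the 8-bit scheme shown is the simplest textbook version, not what any particular tool actually ships. The function names are made up for illustration.

```python
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights):
    """Simplest symmetric 8-bit quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in q_weights]


# A 7B model at 16 bits per dial vs 4 bits per dial.
print(model_size_gb(7e9, 16))  # 14.0
print(model_size_gb(7e9, 4))   # 3.5

# Round-trip a few toy weights: values come back close, not identical.
original = [0.5, -1.0, 0.25, 0.031]
q, scale = quantize_int8(original)
restored = dequantize(q, scale)
```

That drop from about 14 GB to about 3.5 GB is why a 7B model can fit on a phone at all, and the small round-trip error is the "some quality lost" part.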

Part 1: Local Models

This is the headline feature of 2026: the shift toward small and local models.

This app I built for a user was only possible because of advancements in local LLMs in 2025–26.

You can freely download very capable open models now. The Gemma and Qwen families handle everyday tasks at a quality comparable to cloud models from a year or two earlier.

A small (7B) model costs roughly 10–30x less to run than a large one, and current phone chips can run 8-billion-dial models at conversational speed (20+ tokens/sec).

Deployment has also matured a lot. I was able to do it in little time because running a model locally isn't a weeks-long research project anymore. You have Ollama, LM Studio, llama.cpp, etc.

Part 2: the scaling debate

One thing people noticed in the early days of AI is that when you threw more compute at a model, it gained new abilities and more intelligence on its own.

This is called the scaling law, and in 2026 we're now seeing its limits.

The original scaling-laws result, performance improving predictably as you add size, data, and compute:

Kaplan et al. (2020), "Scaling Laws for Neural Language Models"

The "new abilities appear as you scale" observation:

Wei et al. (2022), "Emergent Abilities of Large Language Models"

Pretraining scaling (bigger model + more data) has reached the point of diminishing returns in 2026. GPT-5 being dead on arrival is the headline proof.

View attachment 5270158

HEC Paris (2025), "AI Beyond the Scaling Laws."

Test-time compute means letting a model reason longer before answering. This is a separate lever entirely, and it has kept yielding gains. Research shows a smaller model given more thinking time can outperform a much larger model that answers instantly; in some cases a reasoning model can beat one ~14x its size.

Snell, Lee, Xu & Kumar (2024), "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters."

(the 14x framing) Vardey (2026), "The Frontier: Reasoning Models, Scaling Laws — What's Actually Coming Next."
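One simple way to picture "more thinking time" is to sample several answers and keep the most common one (usually called majority voting or self-consistency). This is just one illustrative flavor of test-time compute, not the specific method of the papers cited above. `make_fake_sampler` is a hypothetical stand-in for a real model call, with canned answers so the example is repeatable.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Spend more compute at answer time: draw n samples, return the most frequent answer."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(canned: List[str]) -> Callable[[], str]:
    """Hypothetical 'model' that replays canned answers; a real one would be stochastic."""
    it = iter(canned)
    return lambda: next(it)


# A noisy small model: right ("42") most of the time, but its first try is wrong.
canned = ["41", "42", "42", "17", "42"]

one_shot = majority_vote(make_fake_sampler(canned), n_samples=1)
five_shot = majority_vote(make_fake_sampler(canned), n_samples=5)
print(one_shot, five_shot)  # 41 42
```

Same model, same dials; the only thing that changed is how many tokens it was allowed to burn before committing to an answer.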

So the approach shifted from "make it bigger" to "make it reason", and that's how frontier models kept gaining ability. Claude's thinking effort control is exactly this.

Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."

OpenAI (2024), "Learning to Reason with LLMs" (o1).

View attachment 5270118

A reasoning model comes at a cost: it first generates a hidden chain of thinking tokens, then writes the visible reply. Those thinking tokens are billed at the same output rate even though you never see them.

Why Output & Reasoning Tokens Inflate LLM Costs (2026 Guide)

Reasoning tokens can cost 6x more than inputs. Learn what actually drives LLM pricing and how to optimize spend.

www.codeant.ai

Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.

awesomeagents.ai
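Here is a quick sketch of what "thinking tokens are billed as output" does to a single request. The prices and token counts are made-up round numbers for illustration, not any provider's real rates.

```python
def request_cost_usd(
    input_tokens: int,
    reasoning_tokens: int,
    visible_output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one request when hidden reasoning tokens are billed at the output rate."""
    billed_output = reasoning_tokens + visible_output_tokens
    total = input_tokens * input_price_per_mtok + billed_output * output_price_per_mtok
    return total / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
plain = request_cost_usd(1_000, 0, 500, 3.0, 15.0)
reasoning = request_cost_usd(1_000, 4_000, 500, 3.0, 15.0)

print(round(plain, 4))      # 0.0105
print(round(reasoning, 4))  # 0.0705
```

Under these assumed numbers the visible reply is identical, but the request with 4,000 hidden thinking tokens costs almost 7x as much.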

The whole market moved this way. The share of tokens served by reasoning models went from a small chunk in early 2025 to over half. Average prompt length roughly quadrupled since early 2024, and completion length nearly tripled, largely because of reasoning.

State of AI 2025: 100T Token LLM Usage Study | OpenRouter

Read OpenRouter's 2025 State of AI report — an empirical 100 trillion token study of real LLM usage, model trends, and developer insights.

openrouter.ai

View attachment 5270106

In the pure scaling era, you trained one giant model once and you were done. In the reasoning era the cost moved to the meter: every hard query spends more tokens. This is why people these days are saying AI is too expensive.

Understanding LLM Cost Per Token: A 2026 Practical Guide - Silicon Data — GPU Performance Data for Companies

A 2026 guide to real-world LLM token costs, model pricing, and proven ways to reduce spend

www.silicondata.com

But on the bigger question experts still disagree:

Some (like Yann LeCun) argue current methods alone won't reach AGI and new breakthroughs are needed.

The International AI Safety Report 2026 states that scaling the key inputs (compute, data, algorithms) is technically feasible to around 2030 before hitting a fundamental bottleneck.

Bengio et al. (2026), "International AI Safety Report 2026."

These are genuinely open-ended and contested questions with strong arguments on both sides, but one thing everyone agrees on is that at the frontier this is too expensive.

In April 2026, Claude released Mythos and then said that Fable 5's lead over its other models grows the longer and more complex the task gets. In other words, it is the long-horizon reasoning/agentic lever from Part 2 paying off, but the cost and token usage are too much. It's priced like premium reasoning, which is a big part of why local models are being preferred, e.g. a fine-tuned 7B handling a document at ~$0.02 vs ~$0.30 for a frontier cloud API call.
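To see how that per-document gap adds up, here is a small sketch using the ~$0.02 and ~$0.30 figures from the paragraph above. The 10,000 documents a month is an assumed workload picked only to make the numbers easy to read.

```python
LOCAL_COST_PER_DOC = 0.02   # fine-tuned 7B, figure from the post
CLOUD_COST_PER_DOC = 0.30   # frontier cloud API call, figure from the post


def monthly_cost(docs_per_month: int, cost_per_doc: float) -> float:
    """Total monthly spend for a given per-document cost."""
    return docs_per_month * cost_per_doc


docs = 10_000  # assumed workload, for illustration only
local = monthly_cost(docs, LOCAL_COST_PER_DOC)
cloud = monthly_cost(docs, CLOUD_COST_PER_DOC)

print(f"local: ${local:,.0f}  cloud: ${cloud:,.0f}  ratio: {cloud / local:.0f}x")
```

At that assumed volume it is roughly $200 a month against $3,000, a 15x difference, before counting the hardware you need to run the local model.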

View attachment 5270112

Part 3: How the small models and the big ones are connected

Remember distillation? The clever little phone-sized model is usually a student trained to imitate a giant teacher model. No frontier teacher, no smart student, so the big models are the factory that makes the small ones good.

On top of that, the field keeps finding cleverer training tricks, so the compute needed to hit a given capability keeps dropping, roughly 3x per year by some measures.
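For the curious, this is a minimal sketch of the classic "soft target" version of distillation: the student is penalized for how far its word probabilities are from the teacher's. The logits below are invented toy numbers, and real pipelines add more on top (mixing in a normal training loss, or simply training the student on text the teacher wrote), so treat this as the core idea only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened next-word distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy scores over a 3-word vocabulary (made up for illustration).
teacher = [4.0, 1.0, 0.0]
bad_student = [0.0, 1.0, 4.0]    # prefers the wrong word
good_student = [3.5, 1.0, 0.2]   # close to the teacher

print(distillation_loss(teacher, bad_student) > distillation_loss(teacher, good_student))  # True
```

Training pushes this loss toward zero, which is the same thing as the student's guesses lining up with the teacher's.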

"Compute Requirements for Algorithmic Innovation in Frontier AI Models" (2025).
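A rough sketch of what "roughly 3x per year" compounds to, assuming the rate stays constant (which is an assumption, the figure above is "by some measures"):

```python
def compute_needed(initial_compute: float, years: int, yearly_reduction: float = 3.0) -> float:
    """Compute required to reach the same capability after `years` of efficiency gains."""
    return initial_compute / (yearly_reduction ** years)


# Start from 100 units of compute for some fixed capability level.
for year in range(4):
    print(year, round(compute_needed(100.0, year), 1))
```

At a steady 3x per year, the same capability needs about 27x less compute after three years, which is how yesterday's datacenter model ends up as today's laptop model.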

That's why the common 2026 setup isn't local replacing cloud entirely; it's hybrid. The on-device model handles the routine stuff and falls back to a cloud model for the genuinely hard queries and more intensive parts.

AI Magicx (2026), "On-Device AI in 2026."

I actually implemented this with the voice feature of that app I built for the user. This is the pragmatic 2026 approach instead of fighting ghosts.
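The hybrid idea can be sketched in a few lines. Everything here is a hypothetical stand-in: the two "models" are fake functions and the difficulty check is a crude heuristic I made up for the example. A real app would call an actual local runtime and a cloud API, and would likely use a smarter signal for when to escalate.

```python
from typing import Callable, Tuple


def make_hybrid_router(
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    is_hard: Callable[[str], bool],
) -> Callable[[str], Tuple[str, str]]:
    """Build a function that answers locally by default and escalates hard prompts."""

    def answer(prompt: str) -> Tuple[str, str]:
        if is_hard(prompt):
            return "cloud", cloud_model(prompt)
        return "local", local_model(prompt)

    return answer


# Hypothetical stand-ins for real model calls.
def fake_local(prompt: str) -> str:
    return f"[local] {prompt[:20]}"


def fake_cloud(prompt: str) -> str:
    return f"[cloud] {prompt[:20]}"


def looks_hard(prompt: str) -> bool:
    # Crude made-up heuristic: long prompts or explicit multi-step requests go to the cloud.
    return len(prompt.split()) > 50 or "step by step" in prompt.lower()


router = make_hybrid_router(fake_local, fake_cloud, looks_hard)
print(router("set a timer for ten minutes")[0])                   # local
print(router("Explain step by step how to refactor this")[0])    # cloud
```

The design point is that the cheap path is the default and the expensive path is the exception, so most queries never touch the meter.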

View attachment 5270184

Part 4: Meek models shall inherit the earth

There's a 2025 paper literally titled Meek Models Shall Inherit the Earth. Its claim: because returns to raw compute are shrinking, the capability gap between the most expensive models and cheaper ones opens up and then closes — the little guys catch up. If that holds, frontier-level ability ends up everywhere, in everyone's pocket.

View attachment 5270104

Gundlach, Lynch & Thompson (2025), "Meek Models Shall Inherit the Earth."

The same direction shows up in forecasts of how many models cross each capability bar over time.

Epoch AI (2025), "How Many AI Models Will Exceed Compute Thresholds?"

TLDR-

Small/local AI is a real, current 2026 trend, driven by efficiency, speed, privacy, and better hardware.

Pretraining scaling shows diminishing returns; test-time/reasoning compute is still producing results, at the price of heavier token usage and higher costs.

Frontier capability kept advancing in 2026; whether/when it reaches AGI is contested.

Small models are generally derived from big ones (distillation) plus steady efficiency gains.

Everyone agrees frontier inference is getting too expensive per token.

My take

From what I've seen, Silicon Valley companies have taken a hands-off, sideline approach to the frontier dream. Not "we gave up", but more like "we are not betting on it happening anytime soon". If you talk to a lot of the founders and CEOs, you mostly get a shrug. AGI might happen, might not, but either way the bills are brutal right now. Because of this, most people are moving towards pocket AI that is cheap per query. That's the strongest trend and the one I'd put the most money on.

"Replaced with AI" was the headline story in 2024 and 2025, but in 2026 the new headline is the walkback. Not because AI failed, but because once you add up the errors, the cleanup, and rehiring people at higher salaries, a lot of those "AI savings" evaporated.

Companies rehire workers after AI replacements fail

Complaints from frustrated customers have prompted e-commerce and financial technology companies to quietly rehire some content writers, software engineers and customer service workers they had replaced with AI bots.

www.washingtontimes.com

Where I think people overreach is the "scaling is dead, so AGI is far off" line. It's true for "bigger is not always better", but "think longer" is still alive and well, and as I keep saying, the big models are exactly what make the small ones good. So to me this looks less like the moonshot failed and more like the moonshot's results are going straight into your phone, fast. I've always been an AI optimist and always embraced new technology, but feel free to disagree with me. I've laid out all the facts and you can form an opinion for yourself.

Will read later if i don't forget

Aㅤㅤㅤ · 2026-06-25T05:13:28-0500

Fucking masterpiece

Jason Voorhees · 2026-06-25T05:14:19-0500

@diarrhetic

Basin · 2026-06-25T05:21:24-0500

Didn't read everything, too slop-brained atm, but I will say that the amount of money power users make AI companies spend compared to how much the companies get back is so off.

I don't even spend any money on AI atm but still use loopholes to constantly have free access to GPT 5.5, opus 4.8, fable (until it was shut off)

I as an individual probably cost the average AI company 500-1000 dollars a month with no return, and even if you paid for a simple service like $20/month you would still cost the AI companies more than you give them.

I have no clue about economics but this seems unsustainable and off.

diarrhetic · 2026-06-25T05:21:47-0500

Jason Voorhees said:
@diarrhetic

I appreciate you tagging me. I will read the thread a bit later though because I have to study now

ReadBooksEveryday · 2026-06-25T05:28:43-0500

Btw if OpenAI or Anthropic were ever to release their proprietary code, even if the model is 1000 trillion terabytes

I guarantee you some incel will manage to distill it into a 100GB or 500GB model that could run on a decent GPU machine

Human innovativeness ironically is at the 100million+ AI engineers but also on the basement dwelling Incel autist that does it purely for the love of the game

Jason Voorhees · 2026-06-25T05:30:01-0500

ReadBooksEveryday said:
Btw if OpenAI or Anthropic were ever to release their proprietary code, even if the model is 1000 trillion terabytes

I guarantee you some incel will manage to distill it into a 100GB or 500GB model that could run on a decent GPU machine

Human innovativeness ironically is at the 100million+ AI engineers but also on the basement dwelling Incel autist that does it purely for the love of the game

It already happened.

Claude code's source code just got leaked in a mother of leaks

Muh safety first architecture. Closed down. https://thehackernews.com/2026/04/claude-code-tleaked-via-npm-packaging.html?m=1 https://www.businessinsider.com/claude-code-leak-what-happened-recreated-python-features-revealed-2026-4 Within seconds of a the leak. A few cybersecurity experts and...

looksmax.org

Jason Voorhees · 2026-06-25T07:27:03-0500

@HardStuckLTN;

Jason Voorhees · 2026-06-25T07:42:29-0500

@Irrelevance

Jason Voorhees · 2026-06-25T08:01:46-0500

@BeanCelll @AverageCurryEnjoyer
