How to run LLMs and generate AI pics/videos locally

Jason Voorhees · May 2, 2026

Option 1 Mac Studio setup

Buy the Mac studio. Specifically the Max or Ultra chip with as much memory as you can afford. This is the easiest, cheapest and set and forget setup

On a Mac Studio with 192GB or 256GB of unified memory, the GPU can access a lot of memory. This allows you to run massive Large Language Models like Llama-3 70B or Gemma 4 31B entirely on chip.

You also run a high quantization Llama-3 405B. At 4 bit quantization if my math is correct should be around 230GB of VRAM which is basically as good as ChatGPT-4 but it will be slow. 1-2 tokens per second almost the same speed as fast human typing so it's slow but the Main advantage is you can get a very decent setup for $4-5k and run complex models that can do coding, math and don't hallucinate as much

Option 2 Nvidia AI setup

If you want the absolute best performance possible and also plan to do AI training. You want something that is bullet proof. Then one or multiple RTX 5090. Besides being a gaming beast it is also extremely capable in AI workloads. In fact in startups and many low budget workflows they only use RTX 5090s for everything.

Almost every Al research paper and library PyTorch, TensorFlow is written for NVIDIA/CUDA first. GDDR7 memory provides insane bandwidth you generate images and get tokens within seconds and is in 2026 the industry norm.

There's another variant of this called RTX 6000 Ada or the newer Pro 6000 . Big advantage of this GPU besided more VRAM is Error correction memory which is super important if you are doing serious stuff. You can run this GPU at 100% for months and it won't crash which is great for AI training and fine tuning but it only makes sense to buy this GPU if you are doing AI research work and your work makes you money. It alsp costs as much as a car. Like $10k each.

Problem with this setup is that they stupid expensive and at 32-48GB, you are effectively locked out of the most intelligent models unless you use heavy distillation or multiple cards.
You are limited to 8B to 30B still very powerful but the not the full ChatGPT replacement unless you spend $15k+ on multiple GPUs.

Local LLM models recommendation for each setup

For Mac Studio with 128-256 gb vram

meta-llama/Llama-3.1-405B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

As I said Llama-3 405B with 4 bit quantization is the best for general usecase imo.

CohereLabs/c4ai-command-r-plus · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

deepseek-ai/DeepSeek-R1 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

You can also run Command R+ and DeepSeek-V3 / R1 mixture of experts but I would stick to Llama because these are more specialized model for specific use cases.

black-forest-labs/FLUX.1-schnell · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Image and Video Generation models. I know why most of you nighas want to run local LLMs. It is to generate AI porn. For image generation mac studio is decent but not ideal. For generation of pixels you need TFLOPS

and everything like I said is optimized for CUDA and apple uses META which is completely different architecture
so mac studio tries to cope by trying to imitate what nvidia card does through a bunch of complex
processes that I won't get into but the result is long wait times like 30-50 seconds per image. The gold standard rn for images is Flux and for videos it's a nightmare even a 5 second video can take 10-20 minutes on Kling local. If you are on a Mac, you should also look for MLX versions of models on Hugging Face.

Kling AI: Next-Generation AI Creative Studio

Create professional videos and images with Kling AI's state-of-the-art generative AI platform. Our tools support video generation, image creation, and advanced editing capabilities for content creators.

kling.ai

For the Nvidia setup with 32-48 GB RAM

Gemma 4 31B. Released just last month. Somehow very few people I think because it got overshadowed by the Gemini-4 release rumours but it's Google's local masterpiece.

google/gemma-4-31B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

On a 5090, it hits 80+ tokens per second and it demolishes Qwen and other older models. It is basically a super intelligent instant messenger. Takes full advantages of the CUDA and tensor cores to give you blazing fast speed and is extremely good for a model that is only 31B parameters.

And for Image and video Generation. This is where Nvidia and Cuda environment earns it's price tag. same Flux image takes 1.5-3 seconds that's it. You can run also run a brand new thing in the market called Real-time SDXL where the image changes as you type the prompt which is super cool.

For Video Generation. Most models are trained and built specifically for Cuda so nvidia again shines here. A 5-second video clip renders in under 60 seconds.

Only problem is you can't get too much detail without spending a shit ton of money. Like you can't generate an 8k picture without running out of memory

If you tech savvy and understand AI on software level or you are an AI engineer. I also encourage you to fiddle on GGUF files and also do some fine tuning to make the AI behave exactly as you want. Even I'm learning about these things and experimenting but be very careful because you can give the AI a lobotomy easily.

For the ultimate baller setup. You can use both Max studio and the Nvidia GPU. This was a pipe dream since forever but Just a few weeks back a company called Tiny Corp released officially signed drivers (TinyGPU) that allow Apple Silicon Macs to talk to NVIDIA GPUs over Thunderbolt 4/5. So you get the best of both world. Only con is the bandwidth. Thunderbolt is fast but it's still slower than a PCIe but still much better than using metal.

P.S Two more cons I forgot to mention. Mac studio is very efficient barely sips like 230-250W even on full load, is dead silent and doesn't get too hot. Nvidia setup is the opposite you easily would need something like 1600W PSU or something to run them. If you live in Europe or in those older houses in america or apartments, a 1600W PC plus a monitor, a fridge on the same circuit will trip the breaker and give you a massive electricity bill. One more thing is that the fans will ramp up to 100% AI inference is intensive and is a sustained heavy load unlike gaming which is spikes, your PC will sound like a jet engine taking off and your PC and even the room it is in, will get hotter. My mum always yells at me every time I turn it on. Just something to keep in mind if you live with someone or have a small room or something.

psychopathic · May 2, 2026

I'd rather generate videos of Kirk, Diddy, and Epstein playing FNAF on Gemini's dime.

TheGreatDetective · May 2, 2026

Read and learn unc @IronMike

vision_n · May 2, 2026

All this to become a vtuber streamer on twitch

Jason Voorhees · May 2, 2026

@Petsmart @imontheloose @ltnbrownacnecel

TheGreatDetective · May 2, 2026

TheGreatDetective said:
Read and learn unc @IronMike

Jason Voorhees · May 2, 2026

Bryce said:
I'd rather generate videos of Kirk, Diddy, and Epstein playing FNAF on Gemini's dime.

You can generate locally without guard rails and as much degenerate stuff as you want locally

Jason Voorhees · May 2, 2026

@mohito @CollioureViews @Stargazer

Jason Voorhees · May 2, 2026

@Jgns @Sayori @MLP @mcmentalonthemic

Jason Voorhees · May 2, 2026

@dhusc @kisslessvirgin @dawooddX

Sayori · May 2, 2026

Jason Voorhees said:
Option 1 Mac Studio setup

Buy the Mac studio. Specifically the Max or Ultra chip with as much memory as you can afford. This is the easiest, cheapest and set and forget setup

View attachment 4995193

On a Mac Studio with 192GB or 256GB of unified memory, the GPU can access a lot of memory. This allows you to run massive Large Language Models like Llama-3 70B or Gemma 4 31B entirely on chip.

You also run a high quantization Llama-3 405B. At 4 bit quantization if my math is correct should be around 230GB of VRAM which is basically as good as ChatGPT-4 but it will be slow. 1-2 tokens per second almost the same speed as fast human typing so it's slow but the Main advantage is you can get a very decent setup for $4-5k and run complex models that can do coding, math and don't hallucinate as much

Option 2 Nvidia AI setup

If you want the absolute best performance possible and also plan to do AI training. You want something that is bullet proof. Then one or multiple RTX 5090

View attachment 4995197

Almost every Al research paper and library PyTorch, TensorFlow is written for NVIDIA first. GDDR7 memory provides insane bandwidth you generate images and get tokens within seconds and is in 2026 the industry norm.

There's another variant of this called RTX 6000 Ada or the newer Pro 6000 . Big advantage of this GPU besided more VRAM is Error correction memory which is super important if you are doing serious stuff. You can run this GPU at 100% for months and it won't crash which is great for AI training and fine tuning but it only makes sense to buy this GPU if you are doing AI research work and your work makes you money. It alsp costs as much as a car. Like $10k each.

View attachment 4995329
View attachment 4995215

Problem with this setup is that they stupid expensive and at 32-48GB, you are effectively locked out of the most intelligent models unless you use heavy distillation or multiple cards.
You are limited to 8B to 30B still very powerful but the not the full ChatGPT replacement unless you spend $15k+ on multiple GPUs.

Local LLM models recommendation for each setup

For Mac Studio with 128-256 gb vram

meta-llama/Llama-3.1-405B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

As I said Llama-3 405B with 4 bit quantization is the best for general usecase imo.

CohereLabs/c4ai-command-r-plus · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

deepseek-ai/DeepSeek-R1 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

You can also run Command R+ and DeepSeek-V3 / R1 mixture of experts but I would stick to Llama because these are more specialized model for specific use cases.

black-forest-labs/FLUX.1-schnell · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Image and Video Generation models. I know why most of you nighas want to run local LLMs. It is to generate AI porn. For image generation mac studio is decent but not ideal. For generation of pixels you need TFLOPS

View attachment 4995351

and everything like I said is optimized for CUDA and apple uses META which is completely different architecture
so mac studio tries to cope by trying to imitate what nvidia card does through a bunch of complex
processes that I won't get into but the result is long wait times like 30-50 seconds per image. The gold standard rn is Flux and it's a nightmare even a 5 second video can take 10-20 minutes on Kling local. If you are on a Mac, you should also look for MLX versions of models on Hugging Face.

Kling AI: Next-Generation AI Creative Studio

Create professional videos and images with Kling AI's state-of-the-art generative AI platform. Our tools support video generation, image creation, and advanced editing capabilities for content creators.

kling.ai

Models for Nvidia setup. Gemma 4 31B. Released just last month. Somehow very few people know about it because it got overshadowed by the Gemini-4 release but it's Google's local masterpiece.

google/gemma-4-31B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

View attachment 4995276

On a 5090, it hits 80+ tokens per second and it demolishes Qwen and other older models. It is basically a super intelligent instant messenger. Takes full advantages of the CUDA and tensor cores to give you blazing fast speed and is extremely good for a model that is only 31B parameters.

And for Image and video Generation. This is where Nvidia and Cuda environment earns it's price tag. same Flux image takes 1.5-3 seconds that's it. You can run also run a brand new thing in the market called Real-time SDXL where the image changes as you type the prompt which is super cool.

For Video Generation. Most models are trained and built specifically for Cuda so nvidia again shines here. A 5-second video clip renders in under 60 seconds.

Only problem is you can't get too much detail without spending a shit ton of money. Like you can't generate an 8k picture without running out of memory

If you tech savvy and understand AI on software level or you are an AI engineer. I also encourage you to fiddle on GGUF files and also do some fine tuning to make the AI behave exactly as you want but be very careful because you can give the AI a lobotomy very easily.

P.S Two more cons I forgot to mention. Mac studio is very efficient barely sips like 230-250W, is dead silent and doesn't get too hot. Nvidia setup is the opposite you easily would need something like 1600W PSU or something to run them. If you live in Europe or in those older houses in america or apartments, a 1600W PC plus a monitor, a fridge on the same circuit will trip the breaker and give you a massive electricity bill. One more thing is that the fans will ramp up to 100% AI inference is intensive so your PC will sound like a jet engine taking off and your PC and even the room it is in, will get hotter. My mum always yells at me every time I turn it on. Just something to keep in mind if you live with someone or have a small room or something.

Bookmarked will be using this

kisslessvirgin · May 2, 2026

good post - well researched

this is close to what i remember when i looked into it as well

but i don't think anyone here will do it

Jason Voorhees · May 2, 2026

kisslessvirgin said:
good post - well researched

this is close to what i remember when i looked into it as well

but i don't think anyone here will do it

People are too broke or people are too lazy to do it?

kisslessvirgin · May 2, 2026

Jason Voorhees said:
People are too broke or people are too lazy to do it?

mix of both

and some don't have a use case

Jason Voorhees · May 2, 2026

@sub5outsider @Orka @TemporaryName

Jason Voorhees · May 2, 2026

@Hernan @Mess @ecstazy

Jason Voorhees · May 2, 2026

@unknownincel @chang cypionate

Jason Voorhees · May 2, 2026

@Nathan Fielder @yyk117

1966Ford · May 2, 2026

I use those nude AI from the huggingface space

Jason Voorhees · May 2, 2026

@Topkra @tensor @qxdr @davidlaidisme67 @topology

unknownincel · May 2, 2026

im broke

goku21 · May 2, 2026

Noted

Jason Voorhees · May 2, 2026

@Gomez @psltristan1 @Kara

WhateverItTakes1 · May 2, 2026

so mac is better for LLM and nvidia for image generation?

Gomez · May 2, 2026

Jason Voorhees said:
@Gomez @psltristan1 @Kara

I've wanted to have something like this but there is no way i am gonna afford this any time soon

Jason Voorhees · May 2, 2026

WhateverItTakes1 said:
so mac is better for LLM and nvidia for image generation?

Yes

Jason Voorhees · May 2, 2026

Gomez said:
I've wanted to have something like this but there is no way i am gonna afford this any time soon

You can still run smaller models locally on your laptop etc but it will be significantly worse than chatgpt

Jason Voorhees · May 2, 2026

@duhz @lemureater

Jason Voorhees · May 2, 2026

WhateverItTakes1 said:
so mac is better for LLM and nvidia for image generation?

Tbh I personally would still invest in nvidia simply because of mature and stable cuda eco system is. It has been the industry norm for 15 years and still is and it's not changing anytime soon. Mac pro is basically just for consumers and niche ML training. If you want to get work done. You have to go nvidia. You can also buy AMD Radeon Pro GPUs that run on ROCm cheaper than Nvidia and much better than METAL but it only has drivers on Linux and still behind CUDA. I mentioned rtx 5090 because I don't think anyone here is an AI engineer or something. Most people here are probably gamers or professionals/college students who would also like to get a secondary use outside of gaming and work

Jason Voorhees · May 2, 2026

@WhateverItTakes1 if you are investing in some GPU or mac let me know. Id be interested in a discussion

Jason Voorhees · May 2, 2026

@browncurrycel @kababcel

WhateverItTakes1 · May 2, 2026

Jason Voorhees said:
@WhateverItTakes1 if you are investing in some GPU or mac let me know. Id be interested in a discussion

Yeah definitely interested because all the proprietary stuff is becoming dumber and pricier at the same time. my use case is solo game development. I'll probably need an LLM (or multiple) and hardware for the following:

-coding alongside me, so I write most of the code myself but the AI can assist with debugging, turning pseudocode into real code, and showing me how something works so i can use it myself. not vibecoding or agentic

-something creative that can give good ideas or inspo

-can search the web in order to get more than just its hardcoded knowledge

I also want to look into a voice generation AI so I can get voicelines for my game. Ill have a good gaming PC by this point so hardware should be covered, but Im interested in AI softwares for this.

hopefully our discussion doesnt become outdated by the time i can actually afford something

. Earliest would be in a few months

Deleted member 337609 · May 2, 2026

all this to generate porn comissions for $5 a piece on onlyfans

Jason Voorhees · May 2, 2026

WhateverItTakes1 said:
Yeah definitely interested because all the proprietary stuff is becoming dumber and pricier at the same time. my use case is solo game development. I'll probably need an LLM (or multiple) and hardware for the following:

-coding alongside me, so I write most of the code myself but the AI can assist with debugging, turning pseudocode into real code, and showing me how something works so i can use it myself. not vibecoding or agentic

-something creative that can give good ideas or inspo

-can search the web in order to get more than just its hardcoded knowledge

I also want to look into a voice generation AI so I can get voicelines for my game. Ill have a good gaming PC by this point so hardware should be covered, but Im interested in AI softwares for this.

hopefully our discussion doesnt become outdated by the time i can actually afford something . Earliest would be in a few months

In 2026 DeepSeek R1 is the king of this imo. There is this Continue extension in VS Code you can connect it at your Mac Studio's local IP and done. Use Qwen 3-Coder on your gaming pc itself for auto completes and deepseek for more complex stuff. This is essentially the GitHub copilot replacement

Jason Voorhees · May 2, 2026

@Quetila

yyk117 · May 2, 2026

mirin effort

Quetila · May 2, 2026

Good thread. What’s your job?

Jason Voorhees · May 2, 2026

Quetila said:
Good thread. What’s your job?

Can't give my designation because it's very specific to my company but basically DevOps+SWE working remotely. I don't touch AI but work on AI infrastructure.

Jason Voorhees · May 2, 2026

@Bars @ReadBooksEveryday

Jason Voorhees · May 2, 2026

@Divineincel @masai jumps enjoyer

Aㅤㅤㅤ · May 2, 2026

Mirin informative guide, once again. For the love of god Jason, invest in formatting your good guides.

Jason Voorhees · May 2, 2026

@Aox Ofwar @lnceIs

Jason Voorhees · May 2, 2026

Aㅤㅤㅤ said:
Mirin informative guide, once again. For the love of god Jason, invest in formatting your good guides.

Will do thx for input

Jason Voorhees · May 2, 2026

@Idontknow- @jaaba

Deleted member 175289 · May 2, 2026

Jason Voorhees · May 2, 2026

jaaba said:
Bump

Soft dnrd

Idontknow- · May 2, 2026

Idk nothing but bump

Jason Voorhees · May 2, 2026

@dragomog @KIVVV @81xa

dragomog · May 2, 2026

Jason Voorhees said:
@dragomog @KIVVV @81xa

why was i tagged

81xa · May 2, 2026

Jason Voorhees said:
@dragomog @KIVVV @81xa

what am i going to do with this info besides create ai porn
mirin effort

How to run LLMs and generate AI pics/videos locally

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

psychopathic

Banned

TheGreatDetective

The Witch of Truth

vision_n

logical guy

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

TheGreatDetective

The Witch of Truth

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Sayori

ascend or die

kisslessvirgin

Megastardom

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

kisslessvirgin

Megastardom

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

1966Ford

KHHEHTV Autist~Yakubian Fascist~550IQ~Afrocentrist

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

unknownincel

there is more to life

goku21

Chadriguez (LNWO) Latinos hmu for advice

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

WhateverItTakes1

Banned

Gomez

Supreme Thinker

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

WhateverItTakes1

Banned

Deleted member 337609

let go of all this, you don't need it.

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

yyk117

Kraken

Quetila

Kraken

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees

𝕸𝖊𝖗𝖈𝖊𝖓𝖆𝖗𝖞 𝕮𝖔𝖗𝖕 • 𝟐𝟎𝟐𝟒🥇

Jason Voorhees