How to run LLMs and generate AI pics/videos locally

Jason Voorhees

Option 1 Mac Studio setup

Buy the Mac Studio, specifically the Max or Ultra chip with as much memory as you can afford. This is the easiest, cheapest, set-and-forget setup.



On a Mac Studio with 192GB or 256GB of unified memory, the GPU can address most of that memory directly. This lets you run massive large language models like Llama-3 70B or Gemma 4 31B entirely in unified memory on one machine.

You can also run a heavily quantized Llama-3.1 405B. At 4-bit quantization, if my math is correct, it should need around 230GB of memory, and it is basically as good as GPT-4, but it will be slow: 1-2 tokens per second, about the same speed as fast human typing. The main advantage is that you can get a very decent setup for $4-5k and run complex models that can do coding and math and don't hallucinate as much.
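Here is a rough sketch of the memory math behind that 230GB figure. The 15% overhead for KV cache and runtime buffers is my own assumption, not a measured number, and the function names are just for illustration.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9
    return params_billions * bits_per_weight / 8


def estimated_total_gb(
    params_billions: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.15,  # assumed allowance for KV cache and buffers
) -> float:
    """Weights plus an assumed overhead allowance, in GB."""
    return quantized_weight_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)


if __name__ == "__main__":
    for label, size in [("31B", 31), ("70B", 70), ("405B", 405)]:
        weights = quantized_weight_gb(size, 4)
        total = estimated_total_gb(size, 4)
        print(f"{label} at 4-bit: {weights:.1f} GB weights, about {total:.0f} GB with overhead")
```

The weights of a 405B model at 4 bits come out to 202.5 GB, so anything in the 220-240GB range is plausible once you add context and buffers.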

Option 2 Nvidia AI setup

If you want the absolute best performance possible and also plan to do AI training, you want something bulletproof: one or multiple RTX 5090s. Besides being a gaming beast, the 5090 is also extremely capable in AI workloads. In fact, many startups and low-budget workflows use RTX 5090s for everything.



Almost every AI research paper and library (PyTorch, TensorFlow) is written for NVIDIA/CUDA first. GDDR7 memory provides insane bandwidth, so you generate images and get tokens within seconds, and in 2026 CUDA is still the industry norm.

There's another variant of this called the RTX 6000 Ada, or the newer Pro 6000. The big advantage of this GPU besides more VRAM is error-correcting (ECC) memory, which is super important if you are doing serious stuff. You can run this GPU at 100% for months and it won't crash, which is great for AI training and fine-tuning, but it only makes sense to buy it if you are doing AI research work and your work makes you money. It also costs as much as a car, like $10k each.



The problem with this setup is that these cards are stupid expensive, and at 32-48GB of VRAM you are effectively locked out of the most intelligent models unless you use heavily distilled or quantized versions or multiple cards.
You are limited to roughly 8B to 30B models, which are still very powerful but not the full ChatGPT replacement unless you spend $15k+ on multiple GPUs.
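To see roughly where that 8B to 30B range comes from, here is a small sketch that flips the memory math around and asks how many parameters fit in a given VRAM budget. The 4 GB reserve for context, activations and the desktop is an assumption I picked for illustration, so treat the output as a ballpark.

```python
def max_params_billions(
    vram_gb: float,
    bits_per_weight: float,
    reserve_gb: float = 4.0,  # assumed headroom for KV cache, activations, display
) -> float:
    """Largest model (in billions of parameters) whose weights fit in the budget."""
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # usable bytes divided by bytes per weight, expressed in billions
    return usable_gb * 8 / bits_per_weight


if __name__ == "__main__":
    for vram in (32, 48):
        for bits in (16, 8, 4):
            fit = max_params_billions(vram, bits)
            print(f"{vram} GB card at {bits}-bit: up to about {fit:.0f}B parameters")
```

At 8-bit a 32GB card tops out around 28B, which lines up with the upper end of the range. Dropping to 4-bit stretches it further, at some cost in quality.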

Local LLM recommendations for each setup

For Mac Studio with 128-256 GB of unified memory

As I said, Llama-3.1 405B with 4-bit quantization is the best for general use cases imo.

You can also run Command R+ and the DeepSeek-V3 / R1 mixture-of-experts models, but I would stick to Llama because these are more specialized models for specific use cases.




Image and Video Generation models. I know why most of you want to run local models: to generate AI porn. For image generation the Mac Studio is decent but not ideal. For generating pixels you need TFLOPS, and like I said, everything is optimized for CUDA. Apple uses Metal, which is a completely different architecture, so the Mac Studio has to cope by imitating what an Nvidia card does through a bunch of complex processes that I won't get into. The result is long wait times, like 30-50 seconds per image. The gold standard rn for images is Flux. Video is a nightmare: even a 5 second video can take 10-20 minutes on Kling local. If you are on a Mac, you should also look for MLX versions of models on Hugging Face.


For the Nvidia setup with 32-48 GB VRAM

Gemma 4 31B. Released just last month. Somehow very few people know about it, I think because it got overshadowed by the Gemini-4 release rumours, but it's Google's local masterpiece.




On a 5090, it hits 80+ tokens per second and it demolishes Qwen and other older models. It is basically a super intelligent instant messenger. It takes full advantage of the CUDA and tensor cores to give you blazing fast speed and is extremely good for a model that is only 31B parameters.
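A quick sanity check on those speeds: for a dense model, generating each token means reading all the weights from memory once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The bandwidth figures below (about 1792 GB/s for a 5090 and about 800 GB/s for an Ultra-class Mac Studio) are approximate published numbers that I am assuming here, and real speeds land below the ceiling.

```python
def decode_tokens_per_second_ceiling(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on tokens/s for a dense model that is memory-bandwidth bound."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    # Every generated token streams the full set of weights through the GPU once.
    return bandwidth_gb_per_s / weight_gb


if __name__ == "__main__":
    # (label, weights in GB at 4-bit, assumed memory bandwidth in GB/s)
    setups = [
        ("31B on RTX 5090", 15.5, 1792.0),
        ("405B on Mac Studio Ultra", 202.5, 800.0),
    ]
    for label, weight_gb, bandwidth in setups:
        ceiling = decode_tokens_per_second_ceiling(weight_gb, bandwidth)
        print(f"{label}: at most about {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling is a bit over 100 tokens/s for the 31B model on a 5090 and about 4 tokens/s for 405B on the Mac, which is consistent with the 80+ and 1-2 tokens per second quoted in this guide.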

And for image and video generation, this is where Nvidia and the CUDA environment earn their price tag. The same Flux image takes 1.5-3 seconds, that's it. You can also run a brand new thing on the market called Real-time SDXL, where the image changes as you type the prompt, which is super cool.

For video generation, most models are trained and built specifically for CUDA, so Nvidia again shines here. A 5-second video clip renders in under 60 seconds.

The only problem is you can't get too much detail without spending a shit ton of money. Like, you can't generate an 8K picture without running out of memory.

If you are tech savvy and understand AI at the software level, or you are an AI engineer, I also encourage you to fiddle with GGUF files and do some fine-tuning to make the AI behave exactly as you want. Even I'm learning about these things and experimenting, but be very careful because you can easily give the AI a lobotomy.

For the ultimate baller setup, you can use both the Mac Studio and the Nvidia GPU. This was a pipe dream forever, but just a few weeks back a company called Tiny Corp released officially signed drivers (TinyGPU) that allow Apple Silicon Macs to talk to NVIDIA GPUs over Thunderbolt 4/5, so you get the best of both worlds. The only con is the bandwidth. Thunderbolt is fast, but it's slower than a PCIe slot, though still much better than using Metal.
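To put the bandwidth con in numbers, here is a rough comparison of how long it takes to push a model's weights over each link. The rates are nominal peak figures I am assuming (Thunderbolt 4 at 40 Gbit/s is about 5 GB/s, Thunderbolt 5 at 80 Gbit/s is about 10 GB/s, PCIe 4.0 and 5.0 x16 are about 32 and 64 GB/s). Real throughput is lower on all of them.

```python
# Nominal peak link rates in GB/s (assumed round figures, not measurements).
NOMINAL_GB_PER_S = {
    "Thunderbolt 4": 5.0,
    "Thunderbolt 5": 10.0,
    "PCIe 4.0 x16": 32.0,
    "PCIe 5.0 x16": 64.0,
}


def transfer_seconds(payload_gb: float, link: str) -> float:
    """Best-case time to move payload_gb over the named link."""
    return payload_gb / NOMINAL_GB_PER_S[link]


if __name__ == "__main__":
    payload_gb = 20.0  # hypothetical size of a quantized model being loaded
    for link in NOMINAL_GB_PER_S:
        print(f"{link}: about {transfer_seconds(payload_gb, link):.2f} s for {payload_gb:.0f} GB")
```

Once the weights are sitting in the card's VRAM the link matters much less, so the slow part is mostly loading and swapping models.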



P.S. Two more cons of the Nvidia setup I forgot to mention: power and noise. The Mac Studio is very efficient. It barely sips 230-250W even at full load, is dead silent and doesn't get too hot. The Nvidia setup is the opposite: you would easily need something like a 1600W PSU to run it. If you live in Europe, or in one of those older houses or apartments in America, a 1600W PC plus a monitor and a fridge on the same circuit will trip the breaker and give you a massive electricity bill. The other thing is that the fans will ramp up to 100%. AI inference is a sustained heavy load, unlike gaming, which comes in spikes, so your PC will sound like a jet engine taking off, and your PC and even the room it is in will get hotter. My mum always yells at me every time I turn it on. Just something to keep in mind if you live with someone or have a small room.
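If you want to check your own circuit and bill, here is a small sketch of the arithmetic. The breaker ratings (15 A at 120 V, 16 A at 230 V), the 80% rule of thumb for sustained loads, the 4 hours a day and the electricity price are all assumptions for illustration, so plug in your own numbers. Keep in mind a 1600W PSU rating is a ceiling, not what the PC draws all the time.

```python
def circuit_capacity_watts(volts: float, breaker_amps: float, continuous_factor: float = 0.8) -> float:
    """Sustained load a circuit can carry, using an assumed 80% continuous-load factor."""
    return volts * breaker_amps * continuous_factor


def monthly_cost(watts: float, hours_per_day: float, price_per_kwh: float, days: int = 30) -> float:
    """Energy cost for running a constant load for hours_per_day over the given days."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    pc_watts = 1600  # worst case: the PSU running at its full rating
    for label, volts, amps in [("120 V / 15 A", 120, 15), ("230 V / 16 A", 230, 16)]:
        capacity = circuit_capacity_watts(volts, amps)
        verdict = "over" if pc_watts > capacity else "under"
        print(f"{label}: {capacity:.0f} W sustained, a {pc_watts} W load is {verdict} the limit")
    print(f"Cost at 4 h/day and 0.30 per kWh: {monthly_cost(pc_watts, 4, 0.30):.2f} per month")
```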
 
I'd rather generate videos of Kirk, Diddy, and Epstein playing FNAF on Gemini's dime.
 
Read and learn unc @IronMike :feelsrope:
 
All this to become a vtuber streamer on twitch
 
I'd rather generate videos of Kirk, Diddy, and Epstein playing FNAF on Gemini's dime.
You can generate locally without guardrails, as much degenerate stuff as you want
 
Bookmarked will be using this
 
good post - well researched

this is close to what i remember when i looked into it as well

but i don't think anyone here will do it
 
People are too broke or people are too lazy to do it?
 
I use those nude AI from the huggingface space:):)
 
im broke
 
Noted
 
so mac is better for LLM and nvidia for image generation?
 
I've wanted to have something like this but there is no way I am going to afford this any time soon
You can still run smaller models locally on your laptop etc., but it will be significantly worse than ChatGPT
 
so mac is better for LLM and nvidia for image generation?
Tbh I personally would still invest in Nvidia simply because of how mature and stable the CUDA ecosystem is. It has been the industry norm for 15 years and still is, and that's not changing anytime soon. Mac Pro is basically just for consumers and niche ML training. If you want to get work done, you have to go Nvidia. You can also buy AMD Radeon Pro GPUs that run on ROCm. They are cheaper than Nvidia and much better than Metal, but ROCm only has drivers on Linux and is still behind CUDA. I mentioned the RTX 5090 because I don't think anyone here is an AI engineer or something. Most people here are probably gamers or professionals/college students who would also like to get a secondary use out of it outside of gaming and work.
 
@WhateverItTakes1 if you are investing in some GPU or Mac let me know. I'd be interested in a discussion
 
Yeah definitely interested because all the proprietary stuff is becoming dumber and pricier at the same time. My use case is solo game development. I'll probably need an LLM (or multiple) and hardware for the following:

-coding alongside me, so I write most of the code myself but the AI can assist with debugging, turning pseudocode into real code, and showing me how something works so I can use it myself. Not vibecoding or agentic

-something creative that can give good ideas or inspo

-can search the web in order to get more than just its hardcoded knowledge

I also want to look into a voice generation AI so I can get voicelines for my game. I'll have a good gaming PC by this point so hardware should be covered, but I'm interested in AI software for this.

Hopefully our discussion doesn't become outdated by the time I can actually afford something :feelswhy:. Earliest would be in a few months
 
all this to generate porn commissions for $5 a piece on onlyfans
 
In 2026 DeepSeek R1 is the king of this imo. There is this Continue extension for VS Code: you point it at your Mac Studio's local IP and you're done. Use Qwen 3-Coder on your gaming PC itself for autocomplete and DeepSeek for the more complex stuff. This is essentially the GitHub Copilot replacement.
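If you want to test that the Mac is reachable before wiring up the editor, here is a minimal sketch of talking to a model server over the local network. It assumes the Mac is running Ollama on its default port 11434, which is my assumption and not something you have to use. The IP 192.168.1.50 and the model tag are placeholders too.

```python
import json
import urllib.request


def build_request(host: str, model: str, prompt: str, port: int = 11434) -> urllib.request.Request:
    """Build a non-streaming generate request for an Ollama server (assumed setup)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(host: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(host, model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (hypothetical IP and model tag, needs the server to be running):
# print(ask("192.168.1.50", "deepseek-r1", "Explain what a mutex is in two sentences."))
```

The server has to be set to listen on the network and not just localhost, otherwise the request from your gaming PC will be refused.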
 
mirin effort
 
Good thread. What's your job?
 
Can't give my designation because it's very specific to my company but basically DevOps+SWE working remotely. I don't touch AI but work on AI infrastructure.
 
Mirin informative guide, once again. For the love of god Jason, invest in formatting your good guides.
 
Bump
 
Idk nothing but bump
 