Intro

Like everyone else in the world, I wanted to dive deeper into Large Language Models (LLMs), Generative Pretrained Transformers (GPTs), and Artificial Intelligence (AI). The hype around this topic is crazy right now, and unlike Blockchain/NFTs, there is serious non-criminal potential/viability. It can be an accelerator for everything we do, and seeing the fascinating projects happening where I work has been a huge inspiration.

ChatGPT provided some good fun in my first forays, but it just wasn't scratching the itch. I hate "magic", and always want to know what's happening under the hood. I wanted to *run* this stuff, not just use it. I had also played around with Coding Assistants like Copilot and Continue, but they always fell short of expectations. Expectations aside, sending code off to a 3rd party is always a big privacy/compliance no-no. Having a local model running on my machine meant I could avoid any potential issues.

Goals

As always, being intentional with our actions means we can stay focused and get the best results. My goals for this exercise were as follows:

1: A reasonably decent Coding Assistant.

I wanted a way to automate the easy/repetitive stuff so I could focus more on Deep Work (thanks Cal Newport!).

  • Take class A and reframe it so it does B instead.

  • Create an example of design pattern X that does Y.

  • Write a Class/Method/Function that does Z.

  • Tell me exactly what this code does.

  • Generate Unit Tests for this Class/Method/Function.

  • Look for obvious issues / act as a "soft" code reviewer.

2: A reasonably decent General Purpose Chat model.

I wanted to be able to do various ad-hoc tasks, and also get some editing help with writing.

  • Summarizing Text

  • Rewriting Text

  • Expanding on Ideas/Snippets

  • Keeping writing "voice" consistent

Naively, I was hoping for 1 super model, or 1 tool where I could swap between 2 purpose-built models. So we start with a pocket full of dreams and a beefy laptop with a good GPU (4080) 🚀🚀🚀

Searching online led me to LLM Mecca - HuggingFace.

Hugging Face 🤗

The general recipe given for running these models is:

  1. Go to HuggingFace

  2. Be overwhelmed

  3. ????

  4. Profit ✔️

I'd like to consider myself a 'reasonably technical' person, but coming to HF for the first time threw me for a loop. Models, Spaces, Datasets. Transformers, Diffusers, SafeTensors. If you're not savvy on the lingo, you're not going to have a good time.

To be fair, once you find a model, there generally is good documentation. HuggingFace allows you to run everything in the cloud (for a cost), and models usually also provide instructions on how to run them locally. The caveat is that it's all very Python-centric. I really didn't want to have to build/manage something that would run in the console, as I was conscious of time and of getting side-tracked down a dark rabbit-hole. Looking for a quicker solution, I found GPT4ALL.
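For the curious, before moving on: that Python-centric, console-based recipe from a typical model card usually boils down to a handful of lines with the transformers library. A minimal sketch, with the model id as a placeholder rather than any specific model:

```python
# A minimal sketch of the typical "run it locally" recipe from a model card.
# The model id is a placeholder - use whatever the card you're reading names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-7b-model"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs the accelerate package

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```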

GPT4ALL

GPT4ALL is pretty much what it says on the tin: a (great) project to enable everyone to run LLMs.

Install was quick (1 click!), and once it loads it has a small catalogue of models that you can download and swap between. You may need to do a bit of googling to understand the difference between them, but on the whole it's fairly easy and intuitive to get set up.
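I used the desktop app, but for reference the same flow also exists as Python bindings. A rough sketch, with the model filename just an example of what the catalogue offers:

```python
# A rough sketch of the GPT4ALL flow via its Python bindings (I used the desktop app).
# The model filename is just an example from its catalogue - it downloads on first use.
from gpt4all import GPT4All

model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
with model.chat_session():
    print(model.generate("Summarise what a design pattern is, in two sentences.", max_tokens=150))
```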

For me, GPT4ALL ran okay, albeit slowly. This surprised me, as my machine is pretty beefy, with plenty of VRAM! For small models it was serviceable, but as I started to use larger models I was getting very few tokens per second. Needless to say, the more interesting code models are typically quite large, so getting 1 token/word every 5-10 seconds with them was a total deal breaker.

The reason for this slow scaling with model size is that GPT4ALL is built on top of llama.cpp. Llama.cpp allows models to be run on the CPU (which is great!). This does severely limit throughput, though. Even with GPU offloading active, it was still primarily CPU-bound, and I just wasn't getting good results.
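To make that "GPU offloading" idea concrete, here's a sketch using the llama-cpp-python bindings, which sit on the same engine. The model path is a placeholder, and offloading only helps if the library was built with GPU (e.g. CUDA) support:

```python
# A sketch of llama.cpp's GPU offloading via the llama-cpp-python bindings.
# n_gpu_layers controls how many layers are pushed to the GPU (0 = pure CPU, -1 = all layers).
# The model path is a placeholder for whatever GGUF file you have locally.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # offload part of the model to the GPU
    n_ctx=2048,       # context window
)
result = llm("Q: What does a unit test do?\nA:", max_tokens=128)
print(result["choices"][0]["text"])
```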

Some googling/redditing about more GPU-bound processing later led me to…

Oobabooga

Oobabooga's Text Generation WebUI is a fantastic tool. It acts as a web-based frontend for using and managing LLMs. It’s the Rolls-Royce of run-it-yourself Text Model fun.

Installation is a 2-step rocket:

  1. Clone the repo to your machine

  2. Run the appropriate starting script (for me it was start_windows.bat).

    • On first run it will download all the dependencies it needs and then launch. From then on you use the same script to start it; it skips the install step and launches straight away.

Diving through the Models and Parameters tabs revealed a lot of complicated options. Some quick googling helped me make sense of them, and gave me enough confidence to just roll with the defaults for the moment.

I dropped in the models that I had downloaded via GPT4ALL earlier, but I was struggling with similarly slow responses on every model I tried.

GGUF: Back to HuggingFace

A good google deep-dive led me to Reddit posts about GGUF Models. GGUF is a stable, extensible model file format (llama.cpp's successor to the older GGML format) that surpasses the limitations of previous formats. Where do we get GGUF models though? Back to HuggingFace!

One of the top contributors there is a user called TheBloke, who uploads compiled/compressed versions of the most popular models out there. Things can get hairy pretty quickly, as there are many different versions/quantizations of models (see table below!). It took a lot of trial and error (downloading various 4-8GB files 😒) to figure out what worked for my setup. Discovering "Can you run it?" ended up saving the day, allowing me to see if my machine could actually load models as I was searching through.

[Table: the quantization options for a GGUF re-upload, with disk/VRAM requirements]
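If you'd rather script the downloads than click through the browser, the huggingface_hub package can pull a single quantization file from one of those re-uploads. A sketch, with the repo id and filename as placeholders for whatever fits your VRAM:

```python
# A sketch: pull one GGUF quantization file from a HuggingFace repo instead of the whole repo.
# The repo id and filename are placeholders - pick the quantization that fits your VRAM.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/SomeModel-7B-GGUF",      # placeholder, following TheBloke's naming pattern
    filename="somemodel-7b.Q4_K_M.gguf",       # one specific quantization file
    local_dir="text-generation-webui/models",  # the folder Oobabooga loads models from
)
print("Downloaded to", path)
```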

After selecting the right sized models, speeds were definitely better - and I was getting access to more specialist models. However, it just wasn't in a place that I found comfortable to work with. I could ask it a question, go make a coffee, come back and it would be just finishing. For this to be usable, more immediate responses were required 😒

Unearthing GPTQ

Another hour or so of googling/redditing, and we found our way to the promised land! As always, the pro-tips are in the comments. Deep-diving through Reddit posts and articles, someone offhandedly mentioned in a comment that GPTQ models at the 7B size worked for them with 8GB of VRAM. Hmmm... what's GPTQ?

GPTQ is another model format, whereby the models are quantized to 4 bits to allow for GPU inference and improved performance.

Heading back to TheBloke's list of models, and huzzah: they also offer GPTQ versions of most models! By selecting models that hovered around the 8-10GB size and using the ExLlama loaders, I could get models that loaded entirely into VRAM with some space to spare. This left Oobabooga with enough VRAM as scratch space, meaning I could get MUCH faster results.
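You can get most of the way to those sizing decisions with back-of-the-envelope arithmetic (roughly what tools like "Can you run it?" automate for you): at 4 bits per weight, a 13B model's weights alone are about 6.5GB, plus a couple of GB for context and scratch space. A rough sketch, where the overhead figure is a loose assumption rather than the exact behaviour of any loader:

```python
# Back-of-the-envelope VRAM estimate for a quantized model - rough numbers only,
# not the exact behaviour of any particular loader.
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # weights only, in GB
    return weights_gb + overhead_gb                    # plus KV cache / activations / scratch space

for size in (7, 13, 34):
    print(f"{size}B @ 4-bit: roughly {estimate_vram_gb(size):.1f} GB of VRAM")
```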

Using this combination, I'm now clocking in at around 30-40 tokens/words per second. This is a much smoother, usable experience. Result!
I’m sure if I start digging through the configs I can speed things up more, but I’m at a stable place now so will just let sleeping dogs lie…

Oobabooga allows swapping of models without too much trouble or downtime. The 2 models I settled on are:

Summary

Actually getting going took relatively little time, once I understood what I was doing. I'm now using these models daily, and have seen my workflows speed up nicely!

The real time-sink here was my lack of knowledge of the space, which meant I spent far too long going back and forth trying to figure out details/errors that, in hindsight, were totally meaningless. Inconsistent performance, nonsense errors, and bad code (GPU out of memory? Better crash the whole app!) mean the learning curve is much steeper than it needs to be.

The AI space is moving incredibly fast, but the rickety tooling, poor/non-existent docs, and lack of up-to-date tutorials REALLY hurt adoption, and show that the area is RIPE for someone interested in helping newbies out. Now that I can run models locally, I'd like to create my own. Stay tuned!

Warning!

Beware when using extensions in Oobabooga! When you install them, bear in mind that they can perform arbitrary code execution and WILL alter your installation. I've had issues where, after installing various vetted extensions, I could no longer start Oobabooga due to missing packages. Manually adding them didn't help, and even deleting the extension manually had no effect!

I ended up having to delete the installer_files directory and run the start script again so it would reinstall all of the dependencies. This is a grand aul waste of time, so don't be afraid to back up as you go!
