Polarised AI Experiences

After a month of digging into AI, I’m developing some more solid views, and it’s a nuanced situation. As things stand, machine-local hardware isn’t going to give a great experience for most folk, and I think this explains why views on AI are so polarised.

There are two main variables when it comes to LLMs: parameter count, and quantization (i.e. down-scaling). Both correlate with the quality of the output, with parameter count being the biggest influence. This means if you’re using a model that has 120 billion parameters, in 32 bit floating point representation, you’ll probably get a good experience… but you’ll also be looking at around 500GB of memory. For the best experience that would need to be GPU memory, which makes for a ridiculously expensive setup to own, and that’s why cloud providers have a place in this world.
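The memory figure falls straight out of the arithmetic: parameters multiplied by bytes per parameter. A rough sketch (the function name is mine, and real-world usage adds overhead like the KV cache and activations, so treat these as lower bounds):

```python
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of a model's weights in gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# A 120 billion parameter model in 32 bit floats:
print(weights_gb(120, 32))  # 480.0 GB, before any runtime overhead
```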

The first step folk take is to use smaller models but, as with any reduction, it’s a lossy process. If you go from a 120 billion parameter model to a 20 billion one, you start looking at something more feasible for a single-machine setup: 80GB of RAM. You can do this, but you’re going to see more hallucinations, which makes for a worse experience.

The next thing is quantization, which is basically down-scaling the numbers inside the model. It isn’t uncommon to see models downscaled from a 32 bit floating point representation to an 8 bit integer one (Q8), or even 4 bit integers (Q4). This further reduces the memory requirements and, on Nvidia hardware, can vastly increase performance. If you take the 20 billion parameter model and quantize it to 4 bit integers, you’re looking at a model which fits into 10GB of RAM, which is within the realm of being usable on a 16GB consumer GPU you can buy off the shelf.
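To see where the precision goes, here’s a toy sketch of symmetric integer quantization, which is the basic idea behind schemes like Q8 and Q4: map the floats onto a small integer grid via a scale factor, then map back when needed. Real schemes (block-wise scales, k-quants, etc.) are cleverer than this, so take it as an illustration only:

```python
def quantize(values, bits):
    """Map floats onto a signed integer grid; return ints plus the scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Map the integers back to (approximate) floats."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.97, -0.08]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
# The restored values only approximate the originals; that gap is the
# precision the model gives up in exchange for an 8x smaller footprint.
```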

So to get a machine-local configuration you’ve lost over 80% of the parameters (120 billion down to 20), and you’re using a representation that drops 87.5% of the bits spent on each one (32 down to 4).
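Compounding the two reductions makes the overall gap stark, as a quick sanity check shows:

```python
param_loss = 1 - 20 / 120             # fraction of parameters lost
bits_loss = 1 - 4 / 32                # fraction of bits per parameter lost
overall = 1 - (20 * 4) / (120 * 32)   # overall memory footprint reduction

print(f"parameters: {param_loss:.1%} fewer")   # 83.3% fewer
print(f"bits/param: {bits_loss:.1%} fewer")    # 87.5% fewer
print(f"footprint:  {overall:.1%} smaller")    # 97.9% smaller
```

In other words, the local setup is running on roughly 2% of the memory footprint of the big-cloud configuration.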

Which makes a big difference to your experience.

To me, this helps to explain why I see some folk praising cloud-based solutions, where they’re paying tens or hundreds of pounds/euros/dollars for time on a cloud system running models with hundreds of billions of parameters at high precision, while others complain about systems running a small fraction of the parameter count at vastly reduced precision.