Image generated by Dall-E 3, because it’s the best
Between the media’s AI hype cycle blowing up over Gemini making some questionable decisions in image generation, and blowing up over Elon suing OpenAI for living long enough to see itself become the villain, we have a new entrant to the chatbot clickbait wars: Anthropic’s Claude 3 Opus purportedly beating GPT4 and Gemini Ultra on a ton of very important benchmarks!
If you’re as skeptical (pessimistic?) as I am, you probably put performance benchmarks in the same mental category as the Gartner Magic Quadrant and Forbes 30 Under 30: somewhere between good marketing and directionally accurate. That said, I would love to be proven wrong if it means I get an upgrade over the OpenAI Premium subscription I’ve spent the last year with. And spoiler alert, I am happily canceling my OpenAI subscription! But not for the reason you think.
Since this news is almost a week old, I’m already super late to the party and there are hundreds of other — probably better — blog posts about this topic, so don’t bother reading this one. Seriously, I’m not going to be referencing a ton of stats, building my own snazzy new meta benchmark, or doing anything remotely interesting in this blog post, so you might as well stop reading and go back to work.
Instead, I’m going to review the user experience (actually, just my experience) of using these tools both personally and professionally. Your results will vary, which is why you should really stop reading this and try it out for yourself.
ChatGPT vs Claude vs Gemini vs Grok
Let’s get Grok out of the way: from a technical perspective it sucks, don’t use it. If you want to use Grok because you’re edgy and 16 years old, then don’t pay Elon for it. Go to HuggingFace, download some 13B model without a lot of filters and tell it you like Andrew Tate. You can send the $16 per month you saved to me or spend it on OnlyFans tips.
Ok sorry, onto the blog I meant to write.
ChatGPT vs Claude vs Gemini
All three premium models will set you back $22.05 per month after taxes. Each offers a nice UI with dark mode and a worse-but-faster free version with tighter usage limits and less preferential treatment when the servers get overloaded, and all three are generally very strong B2C LLMaaS offerings. They also all suffer from hallucination, incorrect responses, some amount of laziness, and occasional bugs. That said, I would willingly pay for the premium version of any of them; however, since there are three and I only need one, we need to compare their differences.
ChatGPT by OpenAI
I was concerned when I set out to test these models side-by-side that the hundreds of hours I’ve spent with ChatGPT over the last year would unfairly bias me in its favor. I didn’t know if I had subconsciously developed prompting techniques that work better on GPT4 than other models, and I wasn’t sure how much my overall familiarity with the product would bias any attempt at a fair comparison.
Reasons to use ChatGPT:
It’s the only model service that allows the user to configure a personal “system” prompt in settings to customize some of the model’s behavior for each end user
- This is potentially a great feature I would like to see adopted by its competitors
Dall-E 3’s image creation abilities are substantially better than Gemini’s, but if you’re generating more than the occasional meme with it then you’re still using the wrong tool
I like the GPT Store in theory, and have made one myself, but almost never use them in practice
- For professional use there are privacy concerns with linking your company’s tools to some random person’s GPT
- For personal use I just don’t care to do mediocre self guided story based narratives, escape rooms, etc.
It will read its responses out loud to you, which is a nice accessibility feature the others don’t have
Reasons not to use ChatGPT:
GPT4-turbo has been getting noticeably lazier and more reluctant to use Bing search even when it’s instructed to do so
- I assume this is an intentional cost-cutting decision rather than one driven by this long prompt that Twitter discovered and decided it hated for some reason
- I have a conspiratorial belief that GPT4 is actually an LLM cascade that only triggers the full model when it decides your question warrants it and it has the GPU capacity to serve the answer; otherwise it falls back to GPT3.5-turbo or some other cheaper, inferior model than the one you’re paying for. I have only my own user experience to support that
It loves to waffle
- When you ask GPT4 a question you almost never get a straightforward response; it will say something NPR-esque like “while most scientists believe knives are made of steel, some have argued that wet napkin based knives could make a sustainable alternative as they would require less mining and metal refining”. That’s not a real example, but it illustrates my grievance
When explaining things or giving options, it likes to give long lists that degrade in quality from top to bottom, which is not desirable behavior
This is petty, but the answers are too densely formatted for my liking, even when I tell it that shorter, more direct answers are preferred to longer ones
Claude 3 by Anthropic
Anthropic re-asserted its B2C LLMaaS relevance on Monday when it posted some rather impressive benchmark results on its website for Claude 3. However, when actually using it for the types of tasks I’m used to, I was extremely disappointed.
To clarify, I’m not accusing Anthropic of lying; I’m accusing the benchmarks of being bad. They link the git repo, so I’m sure the results are factually correct, but just as a higher GRE score does not mean you’re wiser than someone with a lower score, these benchmark results do not make Claude 3 Opus a better, wiser, or more usable model than its competitors.
Reasons to use Claude:
Claude Sonnet is extremely fast, but not as good as Opus
- Still, there are plenty of situations where end users are willing to sacrifice a little quality for a considerable boost in speed
It includes a dyslexia-friendly font option, which is a great accessibility feature I hope the others adopt
And that’s about it.
Reasons not to use Claude:
It’s bad
- It’s worse at rejecting false assumptions made in prompts
- It’s more prone to hallucination
- Its explanations aren’t as rich as GPT4’s or as intuitive as Gemini’s; you really get the worst of both worlds
- Its code for complex asks is not as complete or as correct
- It doesn’t preserve formatting of copy-pasted inputs, which makes rereading your own prompts a painful experience
It lacks features
- It can’t browse the web
- It can’t make images
- It can’t process images
- It has no ability to run code
Maybe these features are coming, but they’re not here, so I can’t review them. Just as I’m not reviewing Gemini Ultra 1.5, I’m reviewing Gemini Ultra 1.0 for its current features and behavior. Speaking of which…
Gemini by Google
I think the general consensus when Gemini was first released was that the company that accelerated the End Times by inventing the transformer architecture, the company with the researchers, the data, the compute power, and the financial motivation of potentially having its own survival on the line, would do a better job of it. I guess that’s fair, but Google has a long tradition of inventing transformative technologies and letting everyone else profit off them, so why stop now?
That said, I think people ought to give the current version of Gemini Advanced another try, especially if they’re either new to the B2C LLMaaS market or they’ve been using OpenAI for a while and are open to shopping around. Given that Gemini offers a free 2-month trial, there’s really no downside to taking it for a spin.
Reasons to use Gemini:
Possibly the fastest of all premium models
- Responses arrive in full paragraphs rather than token-at-a-time like GPT4’s, which I find less jarring
Most intuitive explanations of complex topics
- While GPT4 tends to give a fat wall of text full of valuable but hard-to-digest information, Gemini breaks its responses into a more manageable flow of a couple of sentences at a time, interspersed with code snippets and sometimes even simple analogies that I find extremely impressive
- I’ve never once seen GPT4 give me an analogy to explain a concept without being explicitly prompted to do so
Code on par with GPT4
- Many people will debate this, and I encourage you to not take me at my word here, but for the types of tasks I was asking for, I got more complete code from Gemini Ultra 1.0 than GPT4-turbo. Sue me.
It has by far the best features integrated into the UI
- After you get a response, in two clicks you can tell it to rewrite it simpler, longer, shorter, more professional, or more casual
- You can even highlight sections of the response and tell it what you didn’t like or want more expansion on
- One-click ability to make it validate its own answer with Google Search
- I’ve never seen it refuse to do this in the way that GPT4 frequently refuses to use Bing
- I also don’t know how good it is, so more testing will come in a later update to this blog
It has a better attitude than GPT4
- Petty maybe, but GPT4-turbo has become an indecisive, argumentative little prick sometimes and Gemini — while still very much capable of being wrong — has a more can-do attitude and seems more eager to give things the old college try
Reasons not to use Gemini:
Marginally less information in the average response
- Could be a pro or con depending on who you ask
- I think the average unit of information from a Gemini Ultra response is at least on par with GPT4-turbo’s, but GPT4 throws more at you on average
- But this also means Gemini gives you less chaff to sift through when looking for grains of information
No ability to add a system prompt to customize your experience
- Again, I really like this feature of OpenAI’s, even if it frequently ignores it
Image generation is much worse than Dall-E 3’s
- Try any prompt side by side and I think you’ll agree; it’s missing the detail, the depth, and the magic that Dall-E 3 brings
- But again, if your primary goal is image generation, none of these are the right tool for your work
I might come back to this blog in a month after the honeymoon period has died down more and I find more things to hate about Gemini, but for now I am making the switch. My default LLM tab is now Gemini Advanced and not ChatGPT Premium.
Winner
And the winner is… Llama-v2-70b-chat!
Ok fine, while I am a staunch proponent of open source, I admit Llama 70b is not the same quality as these premium models without fine-tuning or other serious customization. While it may make sense for many enterprises to use Llama-v2 for their business problems, that doesn’t mean it’s the right tool for individual consumers, which is who this comparison is for.
Closing thoughts
I want to reiterate that I did not evaluate these models using any kind of scientific methodology; I’m simply telling you how I feel about them after using them for the kinds of tasks I would use them for on the daily. My main point in writing this blog that you’re not reading is to encourage people to press X to doubt when they see benchmarks. Just as 9/10 dentists somehow recommend every brand of toothpaste (maybe we’re comparing it to not using toothpaste?), every tech company has a set of benchmarks where it’s better than its competitors. Benchmarks are marketing material, not holy metrics passed down from the singularity.
The only thing I want the three people reading this to take away from it is that the types of tasks people use these tools for on a regular basis do not map 1:1 to the tasks used to produce benchmarks. Therefore, if you want to find out which one is best, you need to test them yourselves during the course of a normal work week. That’s really all I wanted to say, and it only took this last paragraph; the rest was just a waste of your time.