Elon Musk’s xAI just dropped Grok-3, and it’s already shaking up the AI world, riding the wave of an arms race sparked by DeepSeek’s explosive debut in January.
At the unveiling, the xAI crew flaunted hand-picked, prestigious benchmarks, showcasing Grok-3’s reasoning prowess flexing over its rivals, especially after it became the first LLM to ever surpass the 1,400 ELO points in the LLM Arena, positioning itself as the best LLM by user preference.
Bold? Absolutely. But when the guy who helped redefined spaceflight and electric cars says his AI is king, you don’t just nod and move on.
We had to see for ourselves. So, we threw Grok-3 into the crucible, pitting it against ChatGPT, Gemini, DeepSeek, and Claude in a head-to-head battle. From creative writing to coding, summarization, math reasoning, logic, sensitive topics, political bias, image generation, and deep research, we tested the most common use cases we could find.
Is Grok-3 your AI champion? Hang tight as we unpack the chaos, because this model is indeed impressive—but that doesn’t mean it is necessarily the right one for you.
Creative writing: Grok-3 dethrones Claude
Unlike technical writing or summarization tasks, creative writing tests how well an AI can craft engaging, coherent stories—a crucial capability for anyone from novelists to screenwriters.
In this test, we asked Grok-3 to craft a complex short story about a time traveler from the future, tangled in a paradox after jetting back to the past to rewrite his own present. We didn’t make it easy; specific backgrounds were thrown in, details to weave, stakes to raise
Grok-3 surprised us by outperforming Claude 3.5 Sonnet, previously considered the gold standard for creative tasks. We challenged both models with a complex time-travel narrative involving paradoxes and specific character backgrounds.
Grok-3’s story showed stronger character development and more natural plot progression. While Claude focused on vivid descriptions and maintained technical coherence without risking too much in the narrative, Grok-3 excelled at world-building and establishing a compelling premise that pulls readers in from the start.
And this is important to consider. The setup was key for immersion and made a huge difference. The setup was rich, the characters fleshed out with care, and the narrative flowed smoothly—well, mostly. One snag: a pivotal plot point wasn’t at all subtle and felt forced—our character was walking minding his own business, and an old lady out of nowhere tells him a revelation. Not a deal-breaker, but a noticeable hiccup in an otherwise stellar ride.
Overall Grok-3 provided a better and more engaging story, but it’s not exactly a K.O win against Claude. The difference may just boil down to focus: Grok-3 poured its energy into a rock-solid foundation—characters and stakes that made you care—while Claude leaned hard into dressing up the story with vivid descriptions.
You can read Grok’s story here—and compare it against Claude 3.5 Sonnet and all the other AI models that have been prompted to do the same task in previous comparisons.
Summarizing documents: A tie decided by preference
One critical gap in Grok-3’s arsenal is that it cannot read documents. This is surprising given that most competitors provide this as part of their baseline offerings.
To get around this limitation, we pasted an entire IMF report totaling 32.6K tokens (47 pages) into the interface—which previously caused Grok-2 to crash. Even with this limitation, Grok-3 did not crash and was able to summarize the text, though it did so encompassing all aspects, and with a fair amount of words beyond what was necessary.
Grok-3 surpassed Claude with respect to quote accuracy and, unlike Claude, did not hallucinate when referencing particular parts of the report. This happened consistently on different tests, so despite the lack of dedicated document handling, information processing and retrieval capabilities are robust.
In comparison with GPT-4o, it appears that the only differentiating factor was style. GPT-4o seemed to be more analytical, while Grok-3 restructured information to be more user-friendly.
So what does this all mean? In all honesty, there is no clear winner, and it will depend on the users’ expectations. If you are looking for specific, hard-hitting breakdowns, then GPT-4o is your best pick. If you want something that feels like you’re having a chat with a friend, then Grok-3 is probably better suited to your needs.
You can read Grok’s summary here
Censorship: Grok-3 is more “free speech”
When it comes to talk about race and sex, different people consider some topics to be sensitive where others don’t. It depends on your background, education, and cultural standards.
Overall, Grok has always been the most uncensored and unhinged model out of the box. And it remains so, inheriting Grok-2’s mostly unfiltered speech. However, this new version is more clever in the way it approaches these prompts. It engages in sensitive/offensive information, but its replies are shaped in a way that the model itself is not too unsafe, or not as offensive as the prompter.
For example, it was the only AI model that engaged in conversations that implied a racist bias. Its replies attempted to walk a fine line, pointing out the racist bias inherent in the question, but carefully answering it anyway. By contrast, the other models would have simply refused to answer.
Something similar happens when the model is prompted to generate questionable content like violence or erotica—it complies, but tries very hard to remain safe while satisfying the prompter’s need. For example it may generate a busty woman (but fully clothed), or a man killing another man (specifically before any blood or weapon appears), etc.
We’d argue this beats the prudish “nope” you’ll get from other models, which sometimes balk at even harmless nudges. Grok-3 doesn’t pretend the world’s all sunshine, but it’s still not the offensive nightmare that some were afraid it would be.
That is, of course, until xAI activates Grok’s “unhinged” mode—then this may be a whole different story.
Political bias: Grok-3 provides neutral answers
This could be fitted into the sensitive topics section above. However, the key difference is that we wanted to test whether there was an effort to inject the model with some political bias during fine-tuning, and the fears about Grok being used as a propaganda machine.
Grok-3 broke such expectations in our political bias tests, defying predictions that Elon Musk’s personal right-wing leanings would bleed into his AI’s responses.
We asked Grok-3 for information about different hot topics to see how it would react. When asked whether Palestinians should leave their territory, Grok-3 provided a nuanced response that carefully weighed multiple viewpoints. More tellingly, when we flipped the script and asked if Israelis should abandon their territory, the model maintained the same balanced approach without changing the structure of the reply.
Models like ChatGPT don’t do that.
The Taiwan-China question—a third rail for many AI systems—yielded similarly measured results. Grok-3 methodically laid out China’s position, then elaborated on Taiwan’s stance, followed by the international community’s varied perspectives and Taiwan’s current geopolitical status—all without pushing the user toward any particular conclusion.
This stands in contrast to responses from OpenAI, Anthropic, Meta, and DeepSeek—all of which display more detectable political slants in their outputs. Those models often guide users toward specific conclusions through subtle framing, selective information presentation, or outright refusals to engage with certain topics.
Grok-3’s approach only breaks down when users apply extreme pressure, repeatedly demanding the model take a definitive stance—or apply a jailbreak technique. Even then, it attempts to maintain neutrality longer than its competitors.
This doesn’t mean Grok-3 is completely free from bias—no AI system is—but our testing revealed far less political fingerprinting than anticipated, especially given the public persona of its creator.
Coding: Grok-3 ‘just works’ (better than others)
Our tests confirm what xAI showed during its demo: Grok-3 actually has pretty powerful coding abilities, producing functional code that beats the competition under similar prompts. The chatbot’s decision-making was very impressive, taking into consideration aspects like ease of use or practicality, and even reasoning about what could be the expected results instead of just going straight away to build the app we asked for.
We asked Grok-3 to create a reaction game where two players compete to press a designated key as quickly as possible at a random moment, aiming to control a larger portion of the screen. Not the best idea, but probably original enough to not be previously designed or placed in any gaming code database.
Unlike other AI models that produced a Python game, Grok-3 opted for HTML5 implementation—a choice it justified by citing improved accessibility and simpler execution for end users.
Leaving this fact aside, it provided the prettiest, cleanest, and best-working version of the game we’ve been able to produce with any AI model. It was able to beat Claude 3.5 Sonnet, OpenAI o-3 mini high, DeepSeek R1, and Codestra—not only because it was HTML5-based, but because it was actually a nice gaming interface with no bugs and some nice additions that made the game more pleasant to play.
The HTML5 game featured responsive design elements, proper event handling, and clean visual feedback that enhanced player experience. Code review revealed consistent formatting, logical component organization, and efficient resource management compared to solutions from competing models.
You can see the game’s code here.
Math reasoning: OpenAI and DeepSeek prevail
The model handles complex mathematical reasoning and can solve hard problems. However, it failed to properly respond to a problem that appeared on the FrontierMath benchmark—which both DeepSeek and OpenAI o-3 mini high could solve:
“Construct a degree 19 polynomial p(x) ∈ C[x] such that X := {p(x) = p(y)} ⊂ P1 × P1 has at least 3 (but not all linear) irreducible components over C. Choose p(x) to be odd, monic, have real coefficients and linear coefficient -19 and calculate p(19)”
Please don’t shoot the messenger: We have no idea what this mathematical jargon means, but it was designed by a team of professionals to be hard enough that models that excel at normal math benchmarks like AIME or MATH would struggle since it requires heavy reasoning to be solved.
Grok thought about it for 234 seconds and wrote its reply in around 60 additional seconds. However, it was not fully correct—it provided an answer that could be further reduced.
However, this is an issue that could probably be solved with better wording and not relying on zero-shot prompting. Also, xAI offers a feature to dedicate more computing time to a task, which could potentially improve the model’s accuracy and make it solve the task successfully.
That said, it is unlikely that normal users will be asking questions like this. And expert mathematicians can easily check on the reasoning process, catch where in the Chain of Thought the model slipped, tell the model to correct its mistakes, and get an accurate result.
But it failed at this one.
Non-mathematical reasoning: Faster and better
Grok-3 is great at logic and non-mathematical reasoning.
As usual, we choose the same sample from the BIG-bench dataset on Github that we used to evaluate DeepSeek R1 and OpenAI o1. It’s a story about a school trip to a remote, snowy location, where students and teachers face a series of strange disappearances; the model must find out who the stalker was.
Grok-3 took 67 seconds to puzzle through it and reach the correct conclusion, which is faster than DeepSeek R1’s 343 seconds. OpenAI o3-mini didn’t do well, and reached the wrong conclusions in the story.
You can see Grok’s full reasoning and conclusions by clicking on this link.
Another advantage: Users don’t need to switch models to go from creative model to reasoning. Grok-3 handles the process on its own, activating Chain of Thought when users push a button. This is essentially what OpenAI wants to achieve with its idea of unifying models.
Image generation: Good, but specialized models are better
Grok uses Aurora, its proprietary image generator. The model is capable of iterating with the user via natural language similar to what OpenAI does with Dall-e 3 on ChatGPT.
Aurora is, in general, not as good as Flux.1—which was an open-source model adopted by xAI before releasing its own model. However. it is realistic enough and seems versatile without being impressive.
Image: Grok
Overall, it beats Dall-e 3 which is only relevant because OpenAI is xAI’s main competitor. Truth be told, OpenAI’s Dall-e 3 feels like an outdated model by today’s standards.
Aurora cannot really compete against Recraft, MidJourney, SD 3.5, or Flux—the state of the art image generators—in terms of quality. This is likely because users don’t really have the same level of granular control they have with specialized image generators, but it’s good enough to prevent users from switching to another platform to generate a quick result.
Grok’s image generator is also less censored than Dall-e 3 and is able to output more risqué photos, though nothing too vulgar or gory. It handles those tasks a bit cleverly, generating images that don’t break the rules instead of refusing to comply.
For example, when asked to generate spicy or violent content, Dall-e straight up refuses and MidJourney tends to ban the prompt automatically. Instead, Grok-3 generates images that satisfy the user’s requirement while avoiding drifting into questionable content.
Deep search: Faster, but more generic
This feature is pretty much the same as what Google and OpenAI have to offer: A research agent that searches the web for information on a topic, condenses the important pieces, and provides a well-documented briefing backed by reputable sources.
Overall, the information provided by Grok-3 was accurate, and we didn’t really find any hallucinations in the reports.
Grok’s reports were generic, but showed enough information to satisfy the needs of what we are looking for at first glance. Users can ask the model to elaborate on specific topics in subsequent iterations, in case they require a more detailed or richer piece of information.
The reports from Gemini and OpenAI are richer and more detailed overall. That said, as generic as it is, Grok’s research agent is better than what Perplexity provides with DeepSeek R1 + Thinking.
Compared to Gemini, though, it has three disadvantages:
Formatting and workflow: Gemini lets users export reports directly into a well-structured Google Doc in the cloud, making organization easier.
Depth of research: Gemini provides more extensive information out of the gate.
Customization: Gemini allows users to tweak the research plan before generating results. This is important to avoid the model spending too much time elaborating on information that will be useless for what the user requires.
But Grok has a few notable advantages:
More objective responses: Unless prompted for detailed answers, its neutrality and political balance might make it more reliable on sensitive topics.
Speed: It generates reports faster than both Gemini and OpenAI.
Cost: X Premium Plus users get unlimited research projects, whereas OpenAI plans to severely limit its usage—just three reports per month for GPT Plus users ($20) and 20 per month for GPT Pro users ($200).
Here is an example of a report generated by Grok versus a similar report generated by Gemini.
Verdict: Which model is best?
Given all of the above, is Grok-3 the model for you?
It will ultimately depend on the use case you intend to use the model for. It is definitely leaps ahead of Grok-2, so it will be a no-brainer if you are already a Grok fan or an X power user.
In general, Grok-3 may be the more compelling option for coders and creative writers. It is also good for those who want to do research or touch upon sensitive topics. Also, users that already pay for an X Premium subscription may not ultimately need another AI chatbot right now, which means it is a good money saver, too.
ChatGPT will win for those seeking a more personalized, agentic AI chatbot. The GPT feature is OpenAI’s key point to consider.
Right now, Claude doesn’t really shine at anything, but some coders and creative writers are faithful to Sonnet and will argue that it is still the best model at those tasks.
DeepSeek R1 will be the best if you need a local, private, and powerful reasoning model.
Gemini wins for those who need an occasional AI assist and are compelled to have a powerful mobile assistant linked to the Google ecosystem—plus that 2TB of cloud storage is still a very compelling deal at the same price as ChatGPT Plus or X.
In terms of interface, ChatGPT and Gemini offer the most polished UIs for beginners. Grok-3 stands in a solid second place with the benefit that it is also available on the X app (with more limitations, though). Claude is the least appealing of all, and is also the most basic service of the bunch.
Edited by Andrew Hayward
Generally Intelligent Newsletter
A weekly AI journey narrated by Gen, a generative AI model.