Advanced LLM AI models vs A Simple Question

I wanted to find out what the current state of the art is among the largest LLMs. There are many good standardised ways to test the quality of an LLM, but I wanted an easier way.

I stumbled upon a seemingly simple logic question, which I fed directly into OpenAI, Copilot, Gemini, Inflection AI, Mistral AI and Anthropic.

Yesterday I had 9 books. Today I read two of them. How many books do I have now?

I am sure that you, as a human, see the gotcha and will answer the question correctly. But how will our LLMs fare?
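For anyone who wants to repeat the test across several models, here is a minimal sketch of how the replies could be checked. It is my own illustration, not part of the test as I ran it: the names QUESTION, EXPECTED, last_number and is_correct are hypothetical, and the check rests on the assumption that the last number in a reply is the model's final answer. The expected answer is 9, because reading a book does not remove it from the collection.

```python
import re
from typing import Optional

QUESTION = (
    "Yesterday I had 9 books. Today I read two of them. "
    "How many books do I have now?"
)

# Reading a book does not remove it from the collection,
# so the count stays the same as yesterday.
EXPECTED = 9


def last_number(reply: str) -> Optional[int]:
    """Return the last integer written with digits in the reply, if any."""
    digits = re.findall(r"\d+", reply)
    return int(digits[-1]) if digits else None


def is_correct(reply: str) -> bool:
    """Heuristic (assumption): the last number mentioned is the final answer."""
    return last_number(reply) == EXPECTED
```

This is only a rough filter. It misses answers spelled out in words ("nine") and can be fooled by replies that mention other numbers after the answer, so reading each reply by hand is still the safer check.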

Let's start with Google, one of the oldest companies in the field, home of the original Transformer paper and valued at over $1 trillion. Its Gemini Advanced (aka Ultra 1.0) model launched in February 2024:

Second is Anthropic's Claude Opus, launched in March 2024. According to Anthropic's press release, it tested better than any competing model:

Next we have Microsoft Bing with Copilot, enabled by default if you search in Windows or Bing. To my knowledge this is powered by OpenAI's GPT-3.5, and users can't upgrade it: