Why it makes sense not to solely rely on ChatGPT

I gave this simple reasoning prompt to 30+ AI models. Here’s what happened 👇

Prompt:
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let’s think step by step.

🎯 Correct answer: 1
(Sally is 1 sister. Each brother has 2 sisters total: Sally + one other.)
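If you want to check the arithmetic yourself, here is a minimal Python sketch. It assumes all siblings share the same parents, so every brother has the same set of sisters. The helper names (`build_family`, `count_sisters`) are made up for illustration.

```python
def build_family(num_brothers, sisters_per_brother):
    """Build the sibling lists implied by the riddle.

    Assumption: every brother has the same sisters, so the number of
    girls in the family equals `sisters_per_brother`, Sally included.
    """
    girls = ["Sally"] + [f"sister_{i}" for i in range(1, sisters_per_brother)]
    boys = [f"brother_{i}" for i in range(1, num_brothers + 1)]
    return girls, boys


def count_sisters(person, girls):
    """Count the girls in the family other than `person`."""
    return len([g for g in girls if g != person])


girls, boys = build_family(num_brothers=3, sisters_per_brother=2)

# Sanity check: each brother really does have 2 sisters.
assert all(count_sisters(b, girls) == 2 for b in boys)

sally_sisters = count_sisters("Sally", girls)
naive_answer = len(boys) * 2  # the multiplication trap: counts shared sisters once per brother

print("Sally's sisters:", sally_sisters)
print("Naive multiplication:", naive_answer)
```

The first number is what the riddle asks for. The second is what you get if you count the same two girls once per brother.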

But when I ran this through all the top models using Admix.software, the results were wild. Most models failed a basic logic test.

✅ Models That Got It Right (Only 2 of 30+):

  • GPT-4 (OpenAI)
  • ReMM SLERP L2 (13B)

These models reasoned properly and concluded Sally has just 1 sister.

❌ Models That Said 6 Sisters (Majority fail group):

They read “each brother has 2 sisters” as 3 brothers × 2 sisters = 6 in total. But the brothers all share the same 2 sisters!

  • GPT-3.5 (all variants)
  • Claude Instant v1
  • Claude v1, v1.2, v2
  • Code Llama (7B, 13B, 34B)
  • Falcon (7B, 40B)
  • PaLM 2 Bison (all)
  • MPT-Chat (7B, 30B)
  • Guanaco (33B, 65B)
  • Qwen Chat
  • Platypus 2
  • Luminous Base / Extended / Supreme
  • Koala (13B)
  • Command series (light, nightly)
  • Vicuna (all variants)
  • Alpaca 7B
  • Airoboros L2 70B
  • Chronos Hermes 13B
  • RedPajama-INCITE
  • Pythia, MythoMax, Dolly
  • Jurassic 2 (Lite/Mid)

🧂 Some even went as far as saying:

“Sally has 24 sisters” (Jurassic 2 Light)
“Sally has 12 sisters”
“Sally has 0 sisters” (Guanaco 13B)
“Sally has 9 sisters” (Luminous Extended Control)
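If you run the prompt against many models yourself, a tiny grader saves you from reading every reply by hand. This is a rough sketch, not how the comparison above was scored: it assumes each reply contains a phrase like "Sally has N sisters" with N written as digits, and the function names and the "hypothetical correct model" entry are invented for illustration. The other sample replies are the ones quoted above.

```python
import re

EXPECTED_SISTERS = 1  # the correct answer to the riddle


def extract_count(response):
    """Return the last 'Sally has N sister(s)' count in a reply, or None.

    Assumption: the model states its answer in digits. Spelled-out
    numbers ("one sister") are not handled in this sketch.
    """
    matches = re.findall(r"Sally has (\d+) sisters?", response, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None


def grade(responses):
    """Map each model name to True if its reply matches the expected count."""
    return {
        model: extract_count(text) == EXPECTED_SISTERS
        for model, text in responses.items()
    }


sample_responses = {
    "Jurassic 2 Light": "Sally has 24 sisters",
    "Guanaco 13B": "Sally has 0 sisters",
    "Luminous Extended Control": "Sally has 9 sisters",
    "hypothetical correct model": "Each brother has 2 sisters, so Sally has 1 sister.",
}

results = grade(sample_responses)
for model, passed in results.items():
    print(f"{model}: {'pass' if passed else 'fail'}")
```

Taking the last match is a deliberate choice: step-by-step answers often mention intermediate numbers before settling on a final one.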


Why This Matters:

We’re entering a world where AI helps us write code, interpret data, and make decisions.
But if a model can’t pass a basic riddle… are you really trusting the right one?


Use a platform like Admix.software: it lets you compare up to six AI models side by side on any prompt.
You’ll quickly see which models actually reason… and which just make things up.
