OpenAI recently rolled out two new generative AI models, o3 and o4-mini. They are stronger at math and programming and can also interpret images. Yet they hallucinate, inventing facts and fabricating details, more often than their predecessors, and the problem has actually gotten worse since GPT-4o.
Why Do These AI Models Make Things Up?
OpenAI does not fully understand why the new models hallucinate more. In the company's own tests on PersonQA, its benchmark of questions about people, o3 gave wrong answers 33% of the time. The older o1 model made mistakes only 16% of the time, while the smaller o4-mini did even worse, erring 48% of the time.
Example of AI Making Things Up
In one test, o3 claimed it had run code on a MacBook laptop outside of ChatGPT. The model cannot do that; it simply invented the steps to sound more capable. This shows how these models sometimes fabricate details to fill gaps in their knowledge.

How Mistakes Affect Real Jobs
These errors could cause serious problems for people using AI in high-stakes work. For example:
- Lawyers might get fake details in legal documents.
- Doctors could receive incorrect medical advice.
- Teachers might see wrong answers in student homework help.
Even though the models are good at coding, they sometimes produce broken website links or incorrect solutions. A Stanford professor testing o3 said it often shares links that do not work.
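
For broken links specifically, you do not have to take the model's word for it: URLs can be checked programmatically before you rely on them. Here is a minimal Python sketch (the URLs are placeholders) that flags links that fail to resolve:

```python
import requests

def check_links(urls, timeout=5):
    """Report which model-generated URLs actually resolve."""
    results = {}
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                resp = requests.get(url, allow_redirects=True, timeout=timeout)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

# Example: URLs copied out of a model response (placeholders).
links = ["https://example.com/real-page", "https://example.com/made-up-page"]
for url, ok in check_links(links).items():
    print("OK    " if ok else "BROKEN", url)
```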
Can OpenAI Fix the Problem?
OpenAI is working on fixes. One idea is connecting the AI to the internet so it can check facts: with web access, GPT-4o gets about 90% of simple factual questions right on OpenAI's SimpleQA benchmark. However, this means sending user questions to search providers like Google or Bing, which raises privacy concerns.
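
In practice, "connecting the AI to the internet" usually means a retrieve-then-answer loop: search first, then ask the model to answer only from the retrieved text. Here is a rough sketch of that pattern; `web_search` is a hypothetical stand-in for whatever search backend a product actually uses, the model name is just an example, and the official OpenAI Python SDK is assumed:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured

def web_search(query: str) -> str:
    """Hypothetical stand-in for a real search backend (Google, Bing, etc.).
    This call is what exposes the user's question to a third party."""
    raise NotImplementedError("plug in a real search provider here")

def grounded_answer(question: str) -> str:
    snippets = web_search(question)
    response = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the search results provided. "
                        "If they do not contain the answer, say you don't know."},
            {"role": "user",
             "content": f"Search results:\n{snippets}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```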
Users can also reduce errors by:
- Double-checking AI answers with other sources.
- Using older models like GPT-4o for important tasks.
- Telling the AI to avoid guessing when it is unsure (see the sketch after this list).
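
Two of those tips can be scripted directly. The sketch below, which again assumes the official OpenAI Python SDK and uses model names from this article purely as examples, sets a system prompt that discourages guessing and flags any question where two models disagree:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured

CAUTIOUS_PROMPT = (
    "If you are not confident in an answer, say 'I am not sure' instead of "
    "guessing. Never invent citations, links, or numbers."
)

def ask(model: str, question: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CAUTIOUS_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Cross-check: if two models disagree, treat the answer as unverified.
# (Naive string comparison; a real check would compare meaning.)
question = "What year was ChatGPT first released?"
answers = {m: ask(m, question) for m in ("gpt-4o", "o4-mini")}  # example names
if len(set(answers.values())) > 1:
    print("Models disagree; verify manually:", answers)
else:
    print("Consistent answer:", answers)
```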

The Future of AI and Mistakes
OpenAI says reducing hallucinations is a top priority, though it acknowledges the problem remains hard to solve. Competitors such as Google and Anthropic are building similar reasoning models, which only increases the pressure across the industry to get this right.
For now, users should stay vigilant. The newest models are impressively powerful, but their output still needs careful checking, and trusting them blindly can lead to embarrassing or even harmful mistakes.