How do ChatGPT, Gemini & co. choose their sources?
Training knowledge or live sources
On its own, a language model only predicts the statistically most likely next word (more precisely, the next token) based on its training data. Without additional sources, this can lead to hallucinations, that is, plausible-sounding but false statements. For many questions, modern AI systems therefore retrieve current information from the web first. This step is called grounding; the underlying technique is known as Retrieval-Augmented Generation (RAG).
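To make the idea of grounding more concrete, here is a minimal sketch of the retrieve-then-answer pattern. It is an illustration under stated assumptions, not how any particular AI system works: the `DOCUMENTS` corpus, the URLs and the word-overlap scoring are all made up for the example, and the final call to a language model is left out.

```python
import re

# Hypothetical stand-ins for pages a live web search might return.
DOCUMENTS = {
    "example.com/pricing": "Tool A costs 10 euros per month and includes email support.",
    "example.org/review": "Our review found Tool B faster than Tool A for large files.",
    "example.net/history": "The company behind Tool A was founded in Berlin.",
}

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Return the top_k (url, text) pairs sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(question_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]

def build_grounded_prompt(question, documents):
    """Put the retrieved passages in front of the question (the 'grounding' step)."""
    sources = retrieve(question, documents)
    context = "\n".join(f"[{url}] {text}" for url, text in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# The resulting prompt is what a model would receive instead of the bare question.
prompt = build_grounded_prompt("How much does Tool A cost per month?", DOCUMENTS)
print(prompt)
```

The point of the sketch is the order of operations: sources are fetched first, and the model then answers with those sources in front of it rather than from training knowledge alone.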
What influences the selection
For commercially relevant questions (such as "best tool for X"), the system almost always runs a web search. It then assesses which content answers the question most precisely and builds its answer from that content. It prefers sources that are fact-based, clearly structured and topically relevant. What counts is not a ranking position, as on Google, but whether your content can serve directly as the basis for the answer.
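The difference between "position in a result list" and "best basis for the answer" can be shown with a toy example. Everything in it is an assumption for illustration: the `Candidate` class, the example URLs and the simple term-coverage score are hypothetical, and real systems rely on relevance signals that are far more sophisticated and not publicly documented.

```python
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    search_position: int  # position in a hypothetical search result list
    text: str

def terms(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def coverage(question, text):
    """Share of the question's words that also appear in the text (0.0 to 1.0)."""
    question_terms = terms(question)
    if not question_terms:
        return 0.0
    return len(question_terms & terms(text)) / len(question_terms)

def pick_source(question, candidates):
    """Choose the candidate that covers the question best, ignoring search_position."""
    return max(candidates, key=lambda c: coverage(question, c.text))

CANDIDATES = [
    Candidate("brand.example/home", 1,
              "Welcome to our award-winning platform for modern teams."),
    Candidate("portal.example/compare", 3,
              "The best invoicing tool for freelancers is Tool A, because it automates reminders."),
]

QUESTION = "What is the best invoicing tool for freelancers?"
best = pick_source(QUESTION, CANDIDATES)
print(best.url, "at search position", best.search_position)
```

In this sketch the page at position 3 is chosen over the page at position 1, because it states the answer directly while the brand homepage only offers a general slogan.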
Which source types show up often
In practice, AI systems frequently cite third-party sources instead of the brand website: industry portals and news sites, Wikipedia, review platforms, forums such as Reddit and, above all, YouTube. Personal LinkedIn profiles are also cited often. Which source dominates depends heavily on the topic.
What this means for you
If you want to appear in AI answers, you have to be present where the model finds its sources. Optimizing your own website is not enough. You need to know which sources are cited for your topics and deliberately build content and mentions there.
Key takeaways
- AI answers come either from training knowledge or from a live web search (grounding/RAG).
- Commercial questions usually trigger a web search, which makes them worth monitoring.
- Cited sources are often third parties: YouTube, Reddit, industry portals, Wikipedia, LinkedIn.
- Visibility happens where the model pulls its sources, not only on your site.
Frequently asked questions
Does AI always answer from the web?
No. Some answers come purely from training knowledge. But current and commercial questions in particular trigger a live web search.
Why does AI often not cite my website?
For generic questions, AI systems often prefer neutral third-party sources like industry portals, listicles or forums over the brand website.
How do I find out which sources matter for my topic?
Through prompt monitoring: ask the relevant questions repeatedly and analyse which domains are cited in the answers.
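The analysis part of prompt monitoring can be as simple as counting domains. The sketch below assumes you have already collected the cited URLs for each repetition of a question, whether by hand or with a tool of your choice; the `RUNS` data and the function names are hypothetical examples, not real measurements.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Extract the host from a URL and drop a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def count_cited_domains(runs):
    """Count in how many answers each domain was cited (at most once per answer)."""
    counts = Counter()
    for cited_urls in runs:
        counts.update({domain_of(url) for url in cited_urls})
    return counts

# Hypothetical data: the URLs cited in three repetitions of the same question.
RUNS = [
    ["https://www.youtube.com/watch?v=abc", "https://www.reddit.com/r/example/1"],
    ["https://youtube.com/watch?v=def", "https://en.wikipedia.org/wiki/Example"],
    ["https://www.youtube.com/watch?v=ghi", "https://www.youtube.com/watch?v=jkl"],
]

counts = count_cited_domains(RUNS)
for domain, n in counts.most_common():
    print(f"{domain}: cited in {n} of {len(RUNS)} answers")
```

Counting each domain only once per answer keeps a single answer with many links to the same site from distorting the picture. Because answers vary between runs, the counts become more meaningful the more repetitions you collect.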