AI's Most-Cited Sources Are Reddit and Wikipedia

2025-09-04

In June 2025, Semrush conducted an analysis of over 150,000 citation sources used by large language models (LLMs), revealing a clear pattern in the information landscape that AI tools rely on. According to the findings, Reddit accounted for 40.1% of all citations, making it the primary source of information. Wikipedia followed with 26.3%, while in the field of geographical data, Mapbox and OpenStreetMap were the most frequently cited. Taken together, these figures highlight LLMs’ strong reliance on user-generated content when producing responses.
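The percentages above are simple frequency shares of cited domains. As a minimal sketch (with made-up sample data, not Semrush's actual dataset), such a tally could be computed like this:

```python
# Hypothetical sketch of tallying citation shares per domain.
# The sample counts below are illustrative only.
from collections import Counter

# Each entry is the domain cited in an LLM answer (made-up sample).
cited_domains = (
    ["reddit.com"] * 401
    + ["wikipedia.org"] * 263
    + ["other.example"] * 336
)

counts = Counter(cited_domains)
total = sum(counts.values())

# Share of each domain as a percentage of all citations.
shares = {domain: round(100 * n / total, 1) for domain, n in counts.items()}
print(shares)
```

With the sample above, `reddit.com` comes out at 40.1% and `wikipedia.org` at 26.3%, mirroring the reported figures.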

However, this reliance is not without risks. First, information from community platforms, while extensive and diverse, often lacks professional verification, making it prone to spreading misinformation. In one notable case, an AI tool mistakenly recommended bleach as a water purification method, an error that could have serious consequences for users. Second, heavy dependence on platforms like Reddit risks amplifying the "echo chamber effect," where popular narratives and biases are reinforced and recycled, ultimately distorting AI outputs. Finally, in specialized fields, the absence of authoritative review raises concerns about the precision and reliability of AI-generated content, an especially critical issue for research and professional applications that demand accuracy.

This study not only sheds light on the current sourcing habits of LLMs but also underscores the critical thinking and discernment users need when consuming AI-generated information. The key challenge is how users can benefit from the breadth and convenience of AI while avoiding the pitfalls of misinformation and bias. It also points AI developers and researchers in a clear direction: preserving the advantages of drawing on diverse user-generated content while improving the accuracy and credibility of the information. This will be crucial for advancing AI toward greater maturity and trustworthiness.