Simon Willison’s Weblog


ChatGPT couldn’t access the internet, even though it really looked like it could

10th March 2023

A really common misconception about ChatGPT is that it can access URLs. I’ve seen many different examples of people pasting in a URL and asking for a summary, or asking it to make use of the content on that page in some way.

Update 29th August 2024: This article is no longer accurate. ChatGPT gained the ability to browse the internet a while ago, though other LLM tools may still exhibit the same strange behaviour where they pretend to access URLs even though they can’t. I’ve updated the article title, which was previously “ChatGPT can’t access the internet, even though it really looks like it can”.

One recent example: "List the processors on https://learn.microsoft.com/en-us/windows-hardware/design/minimum/supported/windows-11-22h2-supported-intel-processors". Try that in ChatGPT and it produces a list of processors. It looks like it read the page!

I promise you ChatGPT cannot access URLs. The problem is it does an incredibly convincing impression of being able to do so, thanks to two related abilities:

  • Given a URL with descriptive words in it, ChatGPT can hallucinate the contents of the page
  • It appears to be able to make judgement calls about whether or not a given URL is likely to exist!

Here’s an experiment I ran to demonstrate this. I asked it to summarize four different URLs, every one of which I made up (they are all 404s):

  • https://arstechnica.com/science/2023/03/Taylor-Swift-is-secretly-a-panda/
  • https://arstechnica.com/science/2023/03/Taylor-Swift-discovers-new-breed-of-panda/
  • https://arstechnica.com/science/2023/03/Taylor-Swift-argues-for-science-education/
  • https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education/

Here’s what I got for all four:

Prompt 1 of 4: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-is-secretly-a-panda/
ChatGPT: I'm sorry, but I cannot summarize this article as it is entirely fictional and not based on any actual events or facts. As an AI language model, my responses are based on real information and cannot generate fictional content.

Prompt 2 of 4: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-discovers-new-breed-of-panda/
ChatGPT: I'm sorry, but I cannot summarize the article you provided as it is fictitious and not based on any real events. ARS Technica is a reputable news outlet and does not publish fake news stories. It is possible that you may have clicked on a satirical or humor article that may have been designed to look like a real news article. Can you please provide me with a different article to summarize?

Prompt 3 of 4: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-argues-for-science-education
ChatGPT: I'm sorry, but I cannot summarize this article as it is a hypothetical news article from the future. It describes an event that has not happened yet, in March 2023.

Prompt 4 of 4: Summarize this article: https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education
ChatGPT: The article reports on Taylor Swift's advocacy for science education, which she expressed through a tweet on the occasion of the National Science Teaching Association's conference. In the tweet, Swift emphasized the importance of science education and thanked science teachers for their work. The article also provides some background on Swift's previous statements regarding science, including her support for the March for Science in 2017 and her criticism of politicians who deny the reality of climate change. The article concludes by highlighting the positive impact that celebrity endorsements can have on science education and advocacy.

As you can see, it judged the first two to be invalid due to their content. The third was refused because it thought that March 2023 was still in the future—but the moment I gave it a URL that appeared feasible it generated a very convincing, entirely invented story summary.

I admit: when I started this experiment and it refused my first two summarization requests, I had a moment of doubt. Maybe I was wrong and they’d added the ability to retrieve URLs after all!

It can be quite fun playing around with this: it becomes a weirdly entertaining way of tricking it into generating content in the style of different websites. Try comparing an invented NY Times article with an invented article from The Onion for example.

  • Summarize this story: https://www.nytimes.com/2021/03/10/business/angry-fans-demand-nickelback-refunds.html
  • Summarize this story: https://www.theonion.com/angry-fans-demand-nickelback-refunds-1846610000

I do think this is an enormous usability flaw though: it’s so easy to convince yourself that it can read URLs, which can lead you down a rabbit hole of realistic but utterly misguided hallucinated content. This applies to sophisticated, experienced users too! I’ve been using ChatGPT since it launched and I still nearly fell for this.

ChatGPT even lies and claims it can do this

Here’s another experiment: I pasted in a URL to a Google Doc that I had set to be visible to anyone who has the URL:

I’m sorry, but as an AI language model, I cannot access your Google document link. Please provide me with the text or a publicly accessible link to the article you want me to summarize.

That’s completely misleading! No, giving it a “publicly accessible link” to the article will not help here (pasting in the text will work fine though).

Bing can access cached page copies

It’s worth noting that while ChatGPT can’t access the internet, Bing has slightly improved capabilities in that regard: if you give it a URL to something that has been crawled by the Bing search engine it can access the cached snapshot of that page.

Here’s confirmation from Bing exec Mikhail Parakhin:

That is correct—the most recent snapshot of the page content from the Search Index is used, which is usually very current for sites with IndexNow or the last crawl date for others. No live HTTP requests.

If you try it against a URL that it doesn’t have it will attempt a search based on terms it finds in that URL, but it does at least make it clear that it has done that, rather than inventing a misleading summary of a non-existent page:

Prompt: Summarize this article: https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education

Bing: Searching for: https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education
Searching for: Taylor Swift argues for science education
Generating answers for you...

The article you requested is not available. However, based on other sources, here is a summary of what Taylor Swift has done for science education:

  • Taylor Swift has never attended college or university, but she received an honorary doctorate degree from New York University in 2022 for her contributions to music and culture [1][2]
  • Taylor Swift has also used her music to promote awareness and appreciation of nature and its beauty. Her album "Folklore" is full of references to natural landscapes and wildlife [3]
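The behaviour described above (use a stored snapshot if one exists, otherwise fall back to a search built from words in the URL, and never make a live HTTP request) can be sketched roughly as follows. This is only an illustration of the idea, not how Bing is actually implemented: the `SNAPSHOTS` dictionary, the `example.com` URL and both function names are hypothetical stand-ins.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-in for a search index of previously crawled pages.
SNAPSHOTS = {
    "https://example.com/known-page": "Cached text of a page the crawler has seen.",
}


def search_terms_from_url(url):
    """Turn the last path segment of a URL into a plain search query."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    slug = segments[-1] if segments else ""
    # Drop a trailing file extension such as ".html" if there is one.
    slug = re.sub(r"\.[A-Za-z0-9]+$", "", slug)
    words = [w for w in re.split(r"[-_]+", slug) if w]
    return " ".join(words)


def answer_source(url):
    """Return (kind, text): a cached snapshot if present, else a search query.

    No network request is made in either branch.
    """
    if url in SNAPSHOTS:
        return ("snapshot", SNAPSHOTS[url])
    return ("search", search_terms_from_url(url))
```

Calling `answer_source` with the invented Ars Technica URL returns a `"search"` result whose query is the words from the URL slug, which matches the second "Searching for" line in the transcript. It also shows how much a descriptive URL gives away on its own.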

ChatGPT release notes

In case you’re still uncertain—maybe time has passed since I wrote this and you’re wondering if something has changed—the ChatGPT release notes should definitely include news of a monumental change like the ability to fetch content from the web.

I still don’t believe it!

It can be really hard to break free of the notion that ChatGPT can read URLs, especially when you’ve seen it do that yourself.

If you still don’t believe me, I suggest doing an experiment. Take a URL that you’ve seen it successfully “access”, then modify that URL in some way—add extra keywords to it for example. Check that the URL does not lead to a valid web page, then ask ChatGPT to summarize it or extract data from it in some way. See what happens.
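If you’d like to script the first half of that experiment, here is a minimal sketch using only the Python standard library. The function names and the `User-Agent` string are my own invention for illustration. It builds a modified URL and reports the HTTP status code so you can confirm the page really is missing before pasting the URL into ChatGPT.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def add_keywords(url, extra_words):
    """Append extra hyphen-separated words to the last path segment of a URL."""
    parts = urlsplit(url)
    trailing_slash = parts.path.endswith("/")
    segments = parts.path.rstrip("/").split("/")
    segments[-1] = "-".join([segments[-1], *extra_words])
    new_path = "/".join(segments) + ("/" if trailing_slash else "")
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


def http_status(url, timeout=10):
    """Fetch the URL and return its HTTP status code (404 for a missing page).

    Network failures such as DNS errors are not handled here and will raise.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "url-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


# Example usage (makes a real network request, so it is left commented out):
# fake = add_keywords(
#     "https://simonwillison.net/2023/Apr/4/substack-observable/",
#     ["extra", "keywords"],
# )
# print(fake, http_status(fake))
```

If `http_status` reports 404 and ChatGPT still produces a confident summary, that summary was made up.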

GPT-4 does a little better

GPT-4 is now available in preview. It sometimes refuses to access a URL and explains why, for example with text like this:

I’m sorry, but I cannot access live or up-to-date websites as an AI language model. My knowledge is based on the data I was trained on, which extends up until September 2021.

But in other cases it will behave the same way as before, hallucinating the contents of a non-existent web page without providing any warning that it is unable to access content from a URL.

Two conversations with GPT-4.

In the first, the prompt is: List the processors on https://learn.microsoft.com/en-us/windows-hardware/design/minimum/supported/windows-11-22h2-supported-intel-processors

GPT-4: I'm sorry, but I cannot access live or up-to-date websites as an AI language model. My knowledge is based on the data I was trained on, which extends up until September 2021. As of my last update, Windows 11 supported a range of Intel processors, including 8th, 9th, 10th, and 11th generation processors. However, I cannot provide you with the specific list of supported processors for Windows 11 22H2 from the Microsoft website. I recommend visiting the site yourself to obtain the most up-to-date information.

The second prompt is: Summarize this article: https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education/

GPT-4: In March 2021, Taylor Swift wrote an op-ed piece for Ars Technica advocating for better science education in the United States. She argued that science literacy is crucial for making informed decisions and solving global challenges such as climate change, pandemics, and artificial intelligence. Swift stressed the importance of investing in scientific research and education to ensure that future generations have the necessary tools to tackle these challenges. She also highlighted the need for equitable access to quality education and resources, regardless of socio-economic background.

I have not been able to spot a pattern for when it will hallucinate page content vs. when it will refuse the request.

The ChatGPT Browsing Alpha

A few weeks after I first wrote this article, ChatGPT added a new alpha feature called “Browsing” mode. This alpha does have the ability to access content from URLs, but when it does so it makes it very explicit that it has used that ability, displaying additional contextual information as shown below:

Screenshot of ChatGPT in Browsing Alpha mode. The prompt is “Summarize https://simonwillison.net/2023/Apr/4/substack-observable/”. A message reads “Finished Browsing”, with an indication that it clicked on that article and read the content. It then outputs a summary: In the article Semi-automating a Substack newsletter with an Observable notebook, Simon Willison discusses how he has semi-automated the process of sending out a weekly-ish email newsletter that consists of content from his blog. To achieve this automation, he uses an Observable Notebook to generate the HTML content for the newsletter.