Simon Willison’s Weblog

Subscribe
Atom feed for search-engines

37 posts tagged “search-engines”

2025

ChatGPT agent’s user-agent

Visit ChatGPT agent's user-agent

I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it was leaking its URLs to Bingbot and Yandex... but it turned out that was a Cloudflare feature that had nothing to do with ChatGPT.

[... 1,260 words]

In today's example of how Google's AI overviews are the worst form of AI-assisted search (previously, hallucinating Encanto 2), it turns out you can type in any made-up phrase you like and tag "meaning" on the end and Google will provide you with an entirely made-up justification for the phrase.

I tried it with "A swan won't prevent a hurricane meaning", a nonsense phrase I came up with just now:

Google search results for "A swan won't prevent a hurricane meaning". AI overview from Search Labs shows the following: The saying "a swan won't prevent a hurricane" means that no matter how well-intentioned or seemingly harmless a person (or in this case, a swan) might be, they can't stop or prevent a large, destructive event like a hurricane. It highlights the idea that some things are simply too powerful or inevitable to be stopped by individual actions or efforts. Here's a more detailed breakdown: Swan: In this context, a swan is used as a symbol of something small, harmless, and insignificant. Hurricane: A hurricane is a powerful and destructive natural phenomenon, representing something large, overwhelming, and inevitable. The Saying's Meaning: The saying emphasizes that even the best efforts of a seemingly powerless entity (the swan) cannot alter the course or impact of a powerful, destructive event (the hurricane). In essence, "a swan won't prevent a hurricane" is a reminder that sometimes, we need to accept that certain events are simply beyond our control. Then two links represented as cardns: The Next Black Swan Event? Hurricane Milton's... Oct 7, 2024  Massive Flooding: Already saturated land from previou... in Linkedin by Jennifer Gibbs.  Then Coping with Black Swans - Carrier Management from June 4 2014. It finishes with a note that Generative AI is experimental

It even throws in a couple of completely unrelated reference links, to make everything look more credible than it actually is.

I think this was first spotted by @writtenbymeaghan on Threads.

# 23rd April 2025, 7:56 pm / ai-ethics, slop, google, generative-ai, ai, llms, search-engines

AI assisted search-based research actually works now

Visit AI assisted search-based research actually works now

For the past two and a half years the feature I’ve most wanted from LLMs is the ability to take on search-based research tasks on my behalf. We saw the first glimpses of this back in early 2023, with Perplexity (first launched December 2022, first prompt leak in January 2023) and then the GPT-4 powered Microsoft Bing (which launched/cratered spectacularly in February 2023). Since then a whole bunch of people have taken a swing at this problem, most notably Google Gemini and ChatGPT Search.

[... 1,618 words]

Introducing Perplexity Deep Research. Perplexity become the third company to release a product with "Deep Research" in the name.

And now Perplexity Deep Research, announced on February 14th.

The three products all do effectively the same thing: you give them a task, they go out and accumulate information from a large number of different websites and then use long context models and prompting to turn the result into a report. All three of them take several minutes to return a result.

In my AI/LLM predictions post on January 10th I expressed skepticism at the idea of "agents", with the exception of coding and research specialists. I said:

It makes intuitive sense to me that this kind of research assistant can be built on our current generation of LLMs. They’re competent at driving tools, they’re capable of coming up with a relatively obvious research plan (look for newspaper articles and research papers) and they can synthesize sensible answers given the right collection of context gathered through search.

Google are particularly well suited to solving this problem: they have the world’s largest search index and their Gemini model has a 2 million token context. I expect Deep Research to get a whole lot better, and I expect it to attract plenty of competition.

Just over a month later I'm feeling pretty good about that prediction!

# 16th February 2025, 12:46 am / google, search-engines, ai, generative-ai, chatgpt, llms, perplexity, gemini, ai-agents, deep-research, ai-assisted-search

2024

Google search hallucinates Encanto 2. Jason Schreier on Bluesky:

I was excited to tell my kids that there's a sequel to Encanto, only to scroll down and learn that Google's AI just completely made this up

I just replicated the same result by searching Google for encanto 2. Here's what the "AI overview" at the top of the page looked like:

Search Labs | Al Overview. Encanto 2: A New Generation is an animated musical fantasy comedy film that is scheduled for release in the United States on August 25, 2024. lt is the sequel to the 2021 Disney film Encanto. Here are some details about the film: Plot: The film takes place years after the original and centers on a new generation of the Madrigal family, led by an older Mirabel and her grandson, José. Directors: Byron Howard and Jared Bush are directing the film. Show more...

Only when I clicked the "Show more" link did it become clear what had happened:

Writers: Jared Bush and Charise Castro Smith are writing the film. Music: Lin-Manuel Miranda will write original songs for the film, as he did for the
original. Some say that a sequel to Encanto is logical because of the film's huge  investment in the franchise. Jared Bush, who co-directed the original Encanto, has hinted that a sequel may be in the works. He said, "I would love to spend more time in the Madrigal House and return to Encanto.” Generative Al is experimental

The link in that first snippet was to the Encanto 2: A New Generation page on Idea Wiki:

This is a fanon wiki, and just like fan-fiction wikis, this one has a variety of fan created ideas on here! These include potential sequels and new series that have yet to exist.

Other cited links included this article about Instagram fan art and Encanto's Sequel Chances Addressed by Disney Director, a very thin article built around a short quote from Encanto's director at D23 Brazil.

And that August 2024 release date (which the AI summary weirdly lists as "scheduled for release" despite that date being five months in the past)? It's from the Idea Wiki imaginary info box for the film.

This is a particularly clear example of how badly wrong AI summarization can go. LLMs are gullible: they believe what you tell them, and the web is full of misleading information - some of which is completely innocent.

Update: I've had some pushback over my use of the term "hallucination" here, on the basis that the LLM itself is doing what it's meant to: summarizing the RAG content that has been provided to it by the host system.

That's fair: this is not a classic LLM hallucination, where the LLM produces incorrect data purely from knowledge partially encoded in its weights.

I classify this as a bug in Google's larger LLM-powered AI overview system. That system should be able to take the existence of invalid data sources into account - given how common searches for non-existent movie sequels (or TV seasons) are, I would hope that AI overviews could classify such searches and take extra steps to avoid serving misleading answers.

So think this is a "hallucination" bug in the AI overview system itself: it's making statements about the world that are not true.

# 29th December 2024, 1:30 am / ethics, google, search, search-engines, ai, generative-ai, llms, rag, slop, ai-ethics, ai-assisted-search, hallucinations

Google is the only search engine that works on Reddit now thanks to AI deal (via) This is depressing. As of around June 25th reddit.com/robots.txt contains this:

User-agent: *
Disallow: /

Along with a link to Reddit's Public Content Policy.

Is this a direct result of Google's deal to license Reddit content for AI training, rumored at $60 million? That's not been confirmed but it looks likely, especially since accessing that robots.txt using the Google Rich Results testing tool (hence proxied via their IP) appears to return a different file, via this comment, my copy here.

# 24th July 2024, 6:29 pm / google, reddit, search-engines, seo, ai, llms

Some goofy results from ‘AI Overviews’ in Google Search. John Gruber collects two of the best examples of Google’s new AI overviews going horribly wrong.

Gullibility is a fundamental trait of all LLMs, and Google’s new feature apparently doesn’t know not to parrot ideas it picked up from articles in the Onion, or jokes from Reddit.

I’ve heard that LLM providers internally talk about “screenshot attacks”—bugs where the biggest risk is that someone will take an embarrassing screenshot.

In Google search’s case this class of bug feels like a significant reputational threat.

# 24th May 2024, 5:33 am / ethics, google, john-gruber, search-engines, ai, generative-ai, llms, ai-ethics, ai-assisted-search

Google Scholar search: “certainly, here is” -chatgpt -llm (via) Searching Google Scholar for “certainly, here is” turns up a huge number of academic papers that include parts that were evidently written by ChatGPT—sections that start with “Certainly, here is a concise summary of the provided sections:” are a dead giveaway.

# 15th March 2024, 1:43 pm / ethics, google, search-engines, ai, generative-ai, chatgpt, llms, ai-ethics

And now, in Anno Domini 2024, Google has lost its edge in search. There are plenty of things it can’t find. There are compelling alternatives. To me this feels like a big inflection point, because around the stumbling feet of the Big Tech dinosaurs, the Web’s mammals, agile and flexible, still scurry. They exhibit creative energy and strongly-flavored voices, and those voices still sometimes find and reinforce each other without being sock puppets of shareholder-value-focused private empires.

Tim Bray

# 20th January 2024, 12:13 pm / google, search-engines, tim-bray

2023

According to interviews with former employees, publishing executives, and experts associated with the early days of AMP, while it was waxing poetic about the value and future of the open web, Google was privately urging publishers into handing over near-total control of how their articles worked and looked and monetized. And it was wielding the web’s most powerful real estate — the top of search results — to get its way.

David Pierce

# 15th May 2023, 9:55 pm / google, search-engines

2020

Apple now receives an estimated $8 billion to $12 billion in annual payments — up from $1 billion a year in 2014 — in exchange for building Google’s search engine into its products. It is probably the single biggest payment that Google makes to anyone and accounts for 14 to 21 percent of Apple’s annual profits.

Apple, Google and a Deal That Controls the Internet

# 26th October 2020, 3:10 am / apple, google, search-engines

2019

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista provides some insider thoughts on how they lost against Google.

# 16th February 2019, 6:57 pm / computer-history, google, internet-history, search, search-engines

2013

Why is site search so bad on most websites?

It’s not so much that site search is bad, it’s that your expectations have been raised enormously high by the incredible quality of search provided by search engines like Google.

[... 125 words]

Do comments really count for SEO link building?

Most sensible commenting systems will put rel=nofollow on links to discourage comment spam, which will have a significant effect on SEO.

[... 35 words]

2012

Is there such thing as a specific multi-site search?

Yes—Blekko does exactly this, with its “slashtags” feature: http://help.blekko.com/index.php...

[... 27 words]

Is there a free/open-source software source code search engine?

If you want to search through actual code in open source projects, GitHub search is fantastic https://github.com/search—e.g. here’s a search for all Ruby code that mentions oauth https://github.com/search?q=oaut...

[... 71 words]

How can you build a search engine for a website built in PHP/MySQL?

There are a bunch of options.

[... 310 words]

Why does Google use “Allow” in robots.txt, when the standard seems to be “Disallow?”

The Disallow command prevents search engines from crawling your site.

[... 59 words]

What are the best events search engines?

Since I co-founded one I’m certainly not qualified to express an opinion on which ones are best, but here are a few of my favourites:

[... 233 words]

Why is Google indexing & displaying www1 versions of my site and how might I stop this?

You should stop serving your site to the public on multiple subdomains. Configure your site to serve a 301 permanent redirect from www1-www4 to the equivalent page on www—also, make sure that your site accessed without the www redirects to the right place as well.

[... 269 words]

What kind of publicly available search software is able to be purchased or used freely as part of a website, and how good is it?

There are plenty of good open source options—Solr is currently my favourite. It’s extremely powerful but you do need to do some programming on top of it—I use Django and Haystack to build the search UI on most of my projects.

[... 115 words]

2010

Is it not time for Google to redesign its search page by removing the “search” & “I’m Feeling Lucky” buttons since the buttons are now useless with the new “Instant” structure?

I don’t think so. The “Search” button defines their entire purpose. The “I’m Feeling Lucky” button is an important part of their brand.

[... 60 words]

Who are major competitors to Solr?

ElasticSearch is a really interesting one—it’s the same underlying search library (Lucene) and the same integration model (an HTTP interface) but takes quite a different approach. It hasn’t been around for a long time but it looks very impressive: http://www.elasticsearch.com/

[... 95 words]

How do Solr, Lucene, Sphinx and Searchify compare?

Lucene is a Java library for creating and searching through a full text index. If you want to make use of it, you’ll need to write your own Java code that integrates with it.

[... 109 words]

2009

Awkward Suggestions (via) The Google search box “suggest” feature returns very different results depending on the quality of your grammar—“how 2” v.s. “how might one” is particularly illuminating.

# 12th November 2009, 10:31 am / funny, google, search-engines, suggest

Official Google Webmaster Blog: A proposal for making AJAX crawlable. It's horrible! The Google crawler would map url#!state to url?_escaped_fragment_=state, then expect your site to provide rendered HTML that reflects that state (they even go as far as to suggest running a headless browser within your web server to do this). Just stick to progressive enhancement instead, it's far less hideous. It looks like the proposal may have originated with the GWT team.

# 8th October 2009, 5:52 pm / ajax, crawling, google, gwt, javascript, progressive-enhancement, search-engines, seo

PubSubHubbub for Google Alerts. “Think of it as a search API that tells *you* when it finds new results.”

# 23rd September 2009, 9:30 pm / google, google-alerts, pubsubhubbub, search-engines, webhooks

Specify your canonical. You can now use a link rel=“canonical” to tell Google that a page has a canonical URL elsewhere. I’ve run in to this problem a bunch of times—in some sites it really does make sense to have the same content shown in two different places—and this seems like a neat solution that could apply to much more than just metadata for external search engines.

# 14th February 2009, 11:28 am / canonical, google, metadata, relcanonical, search-engines, seo, urls

Powering a Google search. I thought the recent estimate of each Google search producing 7g of CO2 was a little high—Google have responded with a claim that the amount is 0.2g instead.

# 12th January 2009, 11:15 am / co2, google, search-engines