41 items tagged “wikipedia”
2024
Wikidata is a Giant Crosswalk File.
Drew Breunig shows how to take the 140GB Wikidata JSON export, use sed 's/,$//'
to convert it to newline-delimited JSON, then use DuckDB to run queries and extract external identifiers, including a query that pulls out 500MB of latitude and longitude points.
Wikipedia Manual of Style: Linking (via) I started a conversation on Mastodon about the grammar of linking: how to decide where in a phrase an inline link should be placed.
Lots of great (and varied) replies there. The most comprehensive style guide I've seen so far is this one from Wikipedia, via Tom Morris.
qrank (via) Interesting and very niche project by Colin Dellow.
Wikidata has pages for huge numbers of concepts, people, places and things.
One of the many pieces of data they publish is QRank—“ranking Wikidata entities by aggregating page views on Wikipedia, Wikispecies, Wikibooks, Wikiquote, and other Wikimedia projects”. Every item gets a score and these scores can be used to answer questions like “which island nations get the most interest across Wikipedia”—potentially useful for things like deciding which labels to display on a highly compressed map of the world.
QRank is published as a gzipped CSV file.
Colin’s hikeratlas/qrank GitHub repository runs weekly, fetches the latest qrank.csv.gz file and loads it into a SQLite database using SQLite’s “.import” mechanism. Then it publishes the resulting SQLite database as an asset attached to the “latest” GitHub release on that repo—currently a 307MB file.
The database itself has just a single table mapping the Wikidata ID (a primary key integer) to the latest QRank—another integer. You’d need your own set of data with Wikidata IDs to join against this to do anything useful.
I’d never thought of using GitHub Releases for this kind of thing. I think it’s a really interesting pattern.
Become a Wikipedian in 30 minutes (via) A characteristically informative and thoughtful guide to getting started with Wikipedia editing by Molly White—video accompanied by a full transcript.
I found the explanation of Reliable Sources particularly helpful, including why Wikipedia prefers secondary to primary sources.
“The way we determine reliability is typically based on the reputation for editorial oversight, and for factchecking and corrections. For example, if you have a reference book that is published by a reputable publisher that has an editorial board and that has edited the book for accuracy, if you know of a newspaper that has, again, an editorial team that is reviewing articles and issuing corrections if there are any errors, those are probably reliable sources.”
Wikimedia Commons Category:Bach Dancing & Dynamite Society. After creating a new Wikipedia page for the Bach Dancing & Dynamite Society in Half Moon Bay I ran a search across Wikipedia for other mentions of the venue... and found 41 artist pages that mentioned it in a photo caption.
On further exploration it turns out that Brian McMillen, the official photographer for the venue, has been uploading photographs to Wikimedia Commons since 2007 and adding them to different artist pages. Brian has been a jazz photographer based out of Half Moon Bay for 47 years and has an amazing portfolio of images. It’s thrilling to see him share them on Wikipedia in this way.
Wikipedia: Bach Dancing & Dynamite Society (via) I created my first Wikipedia page! The Bach Dancing & Dynamite Society is a really neat live music venue in Half Moon Bay which has been showcasing world-class jazz talent for over 50 years. I attended a concert there for the first time on Sunday and was surprised to see it didn’t have a page yet.
Creating a Wikipedia page is an interesting process. New pages on English Wikipedia created by infrequent editors stay in “draft” mode until they’ve been approved by a member of “WikiProject Articles for creation”—the standards are really high, especially around sources of citations. I spent quite a while tracking down good citation references for the key facts I used in my first draft for the page.
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. This paper describes a really interesting LLM system that runs Retrieval Augmented Generation against Wikipedia to help answer questions, but includes a second step where facts in the answer are fact-checked against Wikipedia again before returning an answer to the user. They claim “97.3% factual accuracy of its claims in simulated conversation” on a GPT-4 backed version, and also see good results when backed by LLaMA 7B.
The implementation is mainly through prompt engineering, and detailed examples of the prompts they used are included at the end of the paper.
2023
Wikimedia Commons: Photographs by Gage Skidmore (via) Gage Skidmore is a Wikipedia legend: this category holds 93,458 photographs taken by Gage and released under a Creative Commons license, including a vast number of celebrities taken at events like San Diego Comic-Con. CC licensed photos of celebrities are generally pretty hard to come by so if you see a photo of any celebrity on Wikipedia there’s a good chance it’s credited to Gage.
Wikipedia search-by-vibes through millions of pages offline (via) Really cool demo by Lee Butterman, who built embeddings of 2 million Wikipedia pages and figured out how to serve them directly to the browser, where they are used to implement “vibes based” similarity search returning results in 250ms. Lots of interesting details about how he pulled this off, using Arrow as the file format and ONNX to run the model in the browser.
2018
Why it took a long time to build that tiny link preview on Wikipedia (via) Wikipedia now shows a little preview card on internal links with an image and summary paragraph of the linked page. As a Wikpedia user I absolutely love this feature—and as an engineer and product designer, it’s fascinating to hear the challenges they overcame to ship it. Of particular interest: actually generating a useful summary of a page, while stripping out the cruft that often accumulates at the beginning of their text. It’s also an impressive scaling challenge: the API they use for this feature is now handling more than 500,000 requests per minute.
2012
Why doesn’t Wikipedia try something other than donations to make money?
Wikipedia is run by a non-profit, and the content is created by volunteers for free. Those volunteers created that content under the understanding that it would be for the benefit of the species. Alternative methods of making money would break that assumed contract with their volunteers, and would likely damage their ability to encourage free contributions in the future.
[... 76 words]2010
What are the best APIs for creating location-based Wikipedia mashups?
GeoNames has a fantastic API for finding Wikipedia articles near a specific latitude/longitude pair:
[... 32 words]List of important publications in computer science (via) Amazingly comprehensive list on Wikipedia.
2009
Authority, historically, gets bestowed on the gatekeepers of information, such as Britannica, universities, newspapers, etc. Everything that can be digitized will be digitized, and will then be available over the internet, which is disruptive, not only to business models, but to authority.
Best of OpenStreetMap (via) I keep on telling people OpenStreetMap is this year’s Wikipedia—at its best, it beats commercially available maps. This “best of” site highlights the areas where OSM really shines (the yellow stars)—the German mapping community in particular have produced some outstanding cartography.
Wikipedia over DNS. Added to my ~/bin/ directory as dns-wikipedia.sh: host -t txt $1.wp.dg.cx
2008
License Hacking. Wikipedia is making the switch to a CC license, by asking the Free Software Foundation to include that as an option in the latest version of the Free Documentation License which Wikipedia currently uses and which includes an auto-upgrade clause. Devious.
It’s a purple world. Stuart Langridge made a purplish map of the US election results, using JSON data from Google and an SVG map of the US from Wikipedia.
Data Scraping Wikipedia with Google Spreadsheets. I hadn’t played with =importHTML in Google spreadsheets, which lets you suck in data from an HTML table or list somewhere on the web. This tutorial takes it further, bringing Wikipedia, Yahoo! Pipes and KML in to the mix.
Google’s Wikipedia and Panoramio layers are now available in the API. I really like their use of reverse domain style identifiers for the layer IDs: map.addOverlay(new GLayer(“org.wikipedia”));
GiantBomb.com. Launched today, powered by Django—a combination of (mostly ex-Gamespot) quality editorial content and a massive structured wiki of every computer game ever released. This is going to be a lot of fun—all of the crazy detailed content that Wikipedia tends to reject.
Comet (programming) on Wikipedia on 4th June 2008 (via) The last useful version (which I had pointed many people to) before it was gutted down to just a couple of paragraphs by infuriating deletionists.
The fatal flaw of deletionism is the mindset of deciding what someone else should find interesting
Wikipedia:Canvassing (via) Apparently it’s considered bad form to tell people about debates occurring on Wikipedia (such as votes for deletion). Looks like a policy designed to discourage the participation of subject experts in favour of the participation of Wikipedia process gnomes.
There are two [Wikipedias]: One is the public-facing reliable-enough-on-average encyclopedia that people read every day, which makes for nice fluff pieces in the media about "these new Web thingamajigs that the kids are building, aren't they neat?". The other is the insular behind-the-scenes bureaucracy, which reads like an improvised performance of the collected writings of Clay Shirky.
Google Maps now shows photos and Wikipedia articles. Click the “More...” button. My first thought was “how do they get so many photo markers on the map?”—Firebug shows that they’re generating tiles on the server containing multiple photo markers, then when you click on one an Ajax call checks which photo is in that particular spot.
MediaWiki API. Wikipedia’s best kept secret?
wikinear.com, OAuth and Fire Eagle
I’m pleased to announce wikinear.com. It’s a simple site that does just one thing: show you a list of the five Wikipedia pages that are geographically closest to your current location. It’s designed (or not-designed) to be used mainly from mobile phones.
[... 1,190 words]Everyone applauds when Google goes after Microsoft's Office monopoly [...] but when they start to go after web non-profits like Wikipedia, you see where the ineluctible logic leads. As Google's growth slows, as inevitably it will, it will need to consume more and more of the web ecosystem, trading against its former suppliers, rather than distributing attention to them.
2007
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. (via) See also: Wikipedia’s “List of linguistic example sentences”.