Simon Willison’s Weblog

10 items tagged “ibm”

2024

Docling. MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI using uvx, which avoids the need to install it first (though since it downloads models, it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.
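If you want to drive that same command from a script, here's a minimal sketch using only the standard library. The helper names build_docling_command and run_docling are hypothetical, and the only CLI flags assumed are the --to json and --to md options shown above:

```python
import shlex
import subprocess
from pathlib import Path


def build_docling_command(pdf_path, formats=("json", "md")):
    """Return the argv list for running Docling through uvx.

    Hypothetical helper: it only relies on the `--to` flag with the
    `json` and `md` values used in the command above.
    """
    command = ["uvx", "docling", str(Path(pdf_path))]
    for fmt in formats:
        # Each output format gets its own --to flag.
        command += ["--to", fmt]
    return command


def run_docling(pdf_path, formats=("json", "md")):
    """Run the CLI and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_docling_command(pdf_path, formats), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, since the first run
    # downloads models and can take a while.
    print(shlex.join(build_docling_command("mydoc.pdf")))
```

Building the arguments as a list (rather than a single shell string) sidesteps quoting problems with filenames that contain spaces.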

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

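from docling.document_converter import DocumentConverter

# Convert the PDF (the first run downloads the models)
converter = DocumentConverter()
result = converter.convert("document.pdf")

# Print each extracted table as a Pandas DataFrame
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)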

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.
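Once the tables are DataFrames, saving them is a small step further. Here's a rough sketch, not something I've run against a real document: save_tables_as_csv is a hypothetical helper of my own, and it assumes each table offers the export_to_dataframe() method used in the loop above and that the result has the usual pandas to_csv() method.

```python
from pathlib import Path


def save_tables_as_csv(tables, out_dir="tables"):
    """Write each table to out_dir/table-N.csv and return the paths written.

    `tables` is any iterable of objects with an export_to_dataframe()
    method, such as result.document.tables from the snippet above.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        df = table.export_to_dataframe()
        target = out_path / f"table-{index}.csv"
        df.to_csv(target, index=False)  # standard pandas DataFrame.to_csv
        written.append(target)
    return written
```

Calling save_tables_as_csv(result.document.tables) after the conversion should leave one CSV file per detected table in a tables/ directory.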

# 3rd November 2024, 4:57 am / ibm, ocr, pdf, python, ai, hugging-face, uv

2009

Scaling Django web apps on Apache. Cool to see this kind of article cropping up on IBM developerWorks, but it’s a shame they don’t mention mod_wsgi.

# 10th April 2009, 9:23 am / apache, django, ibm, modwsgi, python

DB2 support for Django is coming. From IBM, under the Apache 2.0 License. I’m not sure if this makes it hard to bundle it with the rest of Django, which uses the BSD license.

# 18th February 2009, 10:58 pm / antonio-cangiano, bsd, databases, db2, django, ibm, licenses, open-source, orm, python

2008

Damien Katz: New Gig. IBM have employed Damien Katz to work full time on CouchDB. The work will be under the Apache license with the ASF owning the copyright.

# 2nd January 2008, 8:35 pm / apache, asf, couchdb, damien-katz, ibm

2006

Introducing Operator. New microformat-detecting Firefox extension, developed at IBM and released by Mozilla Labs. Examples are from Yahoo! Local, Upcoming and Flickr.

# 18th December 2006, 4:36 pm / extension, firefox, flickr, ibm, microformats, mozilla, mozillalabs, upcoming, yahoo

2005

IBM poop heads say LAMP users need to “grow up”. Ryan blows away a ton of the myths surrounding LAMP.

Nope. We call bullshit. After wasting years of our lives trying to implement physical three tier architectures that "scale" and failing miserably time after time, we're going with something that actually works.

# 30th May 2005, 9:34 am / ibm, lamp, ryan-tomayko

IBM: ’LAMP’ users need to grow up (via) Which is why Friendster switched from JSP to PHP. Pfft.

# 26th May 2005, 11:41 pm / ibm, lamp

IBM to Free Java—Next Week? The question mark means it’s a rumour.

# 19th January 2005, 4:54 pm / ibm, java

2003

Linux on the desktop at IBM

Spotted on Slashdot, IBM’s Open Source Desktop—Directions for today... and Tomorrow presentation includes one slide that really caught my attention:

[... 95 words]

2002

IBM accessibility center

IBM’s Accessibility Center has a plethora of useful information and resources, including a free 30 day trial of their Home Page Reader text-to-speech browser software.