16th July 2025 - Link Blog
common-pile/caselaw_access_project (via) Enormous openly licensed (I believe this is almost all public domain) training dataset of US legal cases:
This dataset contains 6.7 million cases from the Caselaw Access Project and Court Listener. The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 365 years. In addition, Court Listener adds over 900 thousand cases scraped from 479 courts.
It's distributed as gzipped newline-delimited JSON.
This was gathered as part of the Common Pile and used as part of the training dataset for the Comma family of LLMs.
Recent articles
- sqlite-utils 4.0rc1 adds migrations and nested transactions - 21st June 2026
- Datasette Apps: Host custom HTML applications inside Datasette - 18th June 2026
- GLM-5.2 is probably the most powerful text-only open weights LLM - 17th June 2026