16th July 2025 - Link Blog
common-pile/caselaw_access_project (via) Enormous openly licensed (I believe this is almost all public domain) training dataset of US legal cases:
This dataset contains 6.7 million cases from the Caselaw Access Project and Court Listener. The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 365 years. In addition, Court Listener adds over 900 thousand cases scraped from 479 courts.
It's distributed as gzipped newline-delimited JSON.
This was gathered as part of the Common Pile and used as part of the training dataset for the Comma family of LLMs.
Recent articles
- Is Claude Code going to cost $100/month? Probably not - it's all very confusing - 22nd April 2026
- Where's the raccoon with the ham radio? (ChatGPT Images 2.0) - 21st April 2026
- Changes in the system prompt between Claude Opus 4.6 and 4.7 - 18th April 2026