common-pile/caselaw_access_project (via) Enormous openly licensed (I believe this is almost all public domain) training dataset of US legal cases:
This dataset contains 6.7 million cases from the Caselaw Access Project and Court Listener. The Caselaw Access Project consists of nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 365 years. In addition, Court Listener adds over 900 thousand cases scraped from 479 courts.
It's distributed as gzipped newline-delimited JSON.
This was gathered as part of the Common Pile and used as part of the training dataset for the Comma family of LLMs.
Recent articles
- Useful patterns for building HTML tools - 10th December 2025
- Under the hood of Canada Spends with Brendan Samek - 9th December 2025
- Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson - 26th November 2025