Common Crawl
- maintains a free, open repository of web crawl data that can be used by anyone
- as of Oct 2025 it contains over 300 billion web pages collected over the last 18 years, and it adds 3 to 5 billion new pages each month
- each monthly archive runs to hundreds of tebibytes (TiB), with recent crawls exceeding 460 TiB of uncompressed content
- some older estimates state the entire corpus is about 6.4 petabytes (PB)
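
The sizes above mix binary units (TiB) with decimal units (PB), so they are not directly comparable as written. A minimal sketch of the conversion, assuming the standard definitions 1 TiB = 2^40 bytes, 1 TB = 10^12 bytes, and 1 PB = 10^15 bytes (the helper names are made up for illustration):

```python
# Byte counts for each unit (standard definitions, not taken from Common Crawl).
TIB = 2**40   # tebibyte, binary unit
TB = 10**12   # terabyte, decimal unit
PB = 10**15   # petabyte, decimal unit


def tib_to_tb(tib: float) -> float:
    """Convert tebibytes to decimal terabytes."""
    return tib * TIB / TB


def pb_to_tib(pb: float) -> float:
    """Convert decimal petabytes to tebibytes."""
    return pb * PB / TIB


# Figures quoted in the notes above.
print(f"460 TiB is about {tib_to_tb(460):.1f} TB")
print(f"6.4 PB is about {pb_to_tib(6.4):.0f} TiB")
```

The gap between the two unit systems is roughly 10% at this scale, which is enough to matter when budgeting storage or transfer for a full crawl.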