Common Crawl
  • maintains a free, open repository of web crawl data that can be used by anyone
  • as of Oct 2025 it contains over 300 billion web pages spanning the last 18 years and adds 3 to 5 billion new pages each month
  • each monthly archive has a compressed size in the hundreds of tebibytes (TiB), with recent crawls exceeding 460 TiB
  • some older estimates state the entire corpus is about 6.4 petabytes (PB)

Resources