Wayback Web Archive

Wayback Machine (https://archive.org/web/)

# You can also query it from the command line with waybackurls
https://github.com/tomnomnom/waybackurls

cat domains.txt | waybackurls > urls

# Cached pages
http://cachedview.com/
https://www.giftofspeed.com/cache-checker/


Getting PDFs from the Web Archive

# Great resource
https://openfacto.fr/2020/04/19/recuperer-des-fichiers-pdf-en-masse-sur-archive-org/

# Step 1
# By adding '*' at the end of a company URL, you can list all indexed documents
# Then you can filter on "PDF" (search bar on the right)
https://web.archive.org/web/*/https://testcompany.fr/*

# Step 2
# Here you want to get the list of URLs
# Open the Firefox developer tools -> Network tab and reload the page
# Look for the HTTP request that returns a JSON file containing the URLs
# Use "Copy as cURL" on that request and run it to save the file
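
As an alternative to copying the request from the browser, a similar list can usually be pulled from the Wayback Machine CDX API. This is a minimal sketch, assuming the public endpoint https://web.archive.org/cdx/search/cdx and its url, output, fl, filter and collapse parameters. The function names are hypothetical, and testcompany.fr is the placeholder domain from Step 1.

```python
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, mimetype="application/pdf"):
    """Build a CDX query URL listing captures under `domain` (hypothetical helper)."""
    params = [
        ("url", f"{domain}/*"),                  # trailing '*' = prefix match, as in Step 1
        ("output", "json"),                      # JSON array of rows, header row first
        ("fl", "timestamp,original,mimetype"),   # fields to return
        ("filter", f"mimetype:{mimetype}"),      # keep only this MIME type
        ("collapse", "urlkey"),                  # one row per unique URL
    ]
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_captures(domain):
    """Query the archive (needs network access) and return the rows as dicts."""
    with urllib.request.urlopen(build_cdx_query(domain), timeout=60) as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, *data = rows  # the first row holds the field names
    return [dict(zip(header, row)) for row in data]
```

Calling fetch_captures("testcompany.fr") only talks to archive.org, never to the target site itself.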

# Step 3
# OpenRefine can help parse and process the file
# Keep only the PDF entries
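
If OpenRefine is not at hand, the PDF filter can be sketched in a few lines of Python. This assumes the saved JSON is a list of rows whose first row holds the column names, including "original" and optionally "mimetype". That layout is an assumption, so check it against the actual file. The function names and the sample rows are hypothetical.

```python
import json


def load_rows(path):
    """Load the JSON file saved in Step 2 (path is whatever you saved it as)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def filter_pdf_rows(rows):
    """Return the original URLs of the rows that look like PDFs."""
    if not rows:
        return []
    header, *data = rows  # assumed layout: header row first
    url_col = header.index("original")
    mime_col = header.index("mimetype") if "mimetype" in header else None
    pdf_urls = []
    for row in data:
        url = row[url_col]
        is_pdf_mime = mime_col is not None and row[mime_col] == "application/pdf"
        # Ignore any query string before checking the extension
        is_pdf_ext = url.lower().split("?")[0].endswith(".pdf")
        if is_pdf_mime or is_pdf_ext:
            pdf_urls.append(url)
    return pdf_urls


# Hypothetical sample in the assumed layout
sample = [
    ["original", "mimetype", "timestamp"],
    ["http://www.xxx.fr/document.pdf", "application/pdf", "20160102030102"],
    ["http://www.xxx.fr/index.html", "text/html", "20160102030102"],
]
print(filter_pdf_rows(sample))
# ['http://www.xxx.fr/document.pdf']
```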

# Step 4
# NEVER download directly from the target site
# Fetch the archived copy of each document instead
# Prepend this prefix to every URL in the list
https://web.archive.org/web/

# Step 5
# To get the document, replace the '*' in the URL with a capture timestamp
# If the document was captured several times, pick the first or the last capture
# Also append "if_" to the timestamp
https://web.archive.org/web/20160102030102if_/http://www.xxx.fr/document.pdf
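
Steps 4 and 5 can be combined into one small helper. This is a minimal sketch: the function names are hypothetical, and the timestamp and document URL are the example values shown above.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"


def pick_timestamp(timestamps, latest=True):
    """Pick the last (or first) capture; 14-digit timestamps sort as strings."""
    ordered = sorted(timestamps)
    return ordered[-1] if latest else ordered[0]


def archived_url(original_url, timestamp):
    """Build prefix + timestamp + 'if_' + '/' + original URL."""
    return f"{WAYBACK_PREFIX}{timestamp}if_/{original_url}"


print(archived_url("http://www.xxx.fr/document.pdf", "20160102030102"))
# https://web.archive.org/web/20160102030102if_/http://www.xxx.fr/document.pdf
```

Mapping archived_url over the filtered list gives download links that point only at archive.org.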