Public Documents and Metadata

Online Resources

https://www.documentcloud.org

Metagoofil

# Extracts metadata from public documents (pdf, doc, xls, ppt, etc.) available on the target websites

# The tool first performs a Google query for the different filetypes that can carry useful metadata (pdf, doc, xls, ppt, etc.),
# then downloads those documents to disk and extracts their metadata using libraries
# dedicated to parsing each file type (Hachoir, PDFMiner, etc.)

# Options
# -d: domain to search
# -t: filetype to download (pdf,doc,xls,ppt,odp,ods,docx,xlsx,pptx)
# -l: limit of results to search (default 200)
# -h: work with documents already in the directory (use "yes" for local analysis)
# -n: limit of files to download
# -o: working directory (location to save downloaded files)
# -f: output file

metagoofil.py -d domain.com -t doc,pdf -l 10 -n 10 -o /tmp/result -f /tmp/result/result.html
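The query step described above can be sketched as follows (a hypothetical helper for illustration, not metagoofil's own code):

```python
# Sketch of the Google dork metagoofil issues per requested filetype.
# build_queries is a made-up name; metagoofil's internals differ.
def build_queries(domain, filetypes):
    """Return one 'site:... filetype:...' query string per requested type."""
    return ["site:{} filetype:{}".format(domain, ft.strip())
            for ft in filetypes.split(",")]

print(build_queries("domain.com", "doc,pdf"))
# ['site:domain.com filetype:doc', 'site:domain.com filetype:pdf']
```

Each query is sent to the search engine, and the result URLs are what the tool then downloads for metadata extraction.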

Investigate Powerpoint Documents

https://medium.com/@osint/powerpoint-what-data-is-beneath-the-surface-2eb000ef95fb
https://medium.com/week-in-osint/week-in-osint-2020-21-4c92d335116a

# You can potentially get lots of data from PPTX files
# 1/ Obvious metadata such as the author
# 2/ But also from embedded content such as screenshots

# - People often use shapes to hide content. If you can edit the file, you can delete the shapes and read what is underneath
# - The "Crop" feature can also be undone to recover the full original screenshot

# You can also simply unpack the whole document (if it uses the Open XML Format) and get the embedded content
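An Open XML .pptx is just a ZIP container, so the unpack step needs only the standard library; a minimal sketch (the tiny archive built here stands in for a real deck):

```python
import zipfile

def list_embedded_media(pptx_path):
    """Return the paths of embedded media (screenshots, images) in a .pptx.
    Open XML stores them under ppt/media/ inside the ZIP container."""
    with zipfile.ZipFile(pptx_path) as z:
        return [n for n in z.namelist() if n.startswith("ppt/media/")]

# Demo with a minimal stand-in archive (a real deck has many more parts)
with zipfile.ZipFile("deck.pptx", "w") as z:
    z.writestr("ppt/slides/slide1.xml", "<p:sld/>")
    z.writestr("ppt/media/image1.png", b"\x89PNG...")

print(list_embedded_media("deck.pptx"))  # ['ppt/media/image1.png']
```

`ZipFile.extractall()` on the same handle dumps everything, media included, to disk.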

TruffleHog

# Searches through git repositories for secrets, digging deep into commit history and branches.
# This is effective at finding secrets accidentally committed.
https://github.com/dxa4481/truffleHog

truffleHog --regex --entropy=False https://github.com/dxa4481/truffleHog.git

truffleHog --json --max_depth 10 https://github.com/dxa4481/truffleHog.git
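The entropy check toggled by `--entropy` reduces to Shannon entropy measured over a character set; a minimal sketch of that calculation (the ~4.5 base64 cutoff mirrors the one in truffleHog's source; the token below is made up):

```python
import math

def shannon_entropy(data, charset):
    """Shannon entropy of `data` over `charset`, in bits per character."""
    if not data:
        return 0.0
    entropy = 0.0
    for ch in set(charset):
        p = data.count(ch) / len(data)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

BASE64_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz0123456789+/=")

token = "bm9LIHaX5rM2kQ8tZ1uYcE3fJ0vGdT7pWqS4xANz"  # fake secret
print(shannon_entropy(token, BASE64_CHARS))             # ~5.3, flagged
print(shannon_entropy("aaaaaaaaaaaaaaaa", BASE64_CHARS))  # 0.0, ignored
```

High-entropy base64-looking strings in a diff are reported as candidate secrets; repetitive strings score near zero and are skipped.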

Just-Metadata

# Collects metadata about IP addresses
# It has two main functionalities, divided into modules: gather and analyze

# Load IP file
[>] load /path/to/ip.txt

# List all the gather modules
[>] list gather

# You can then use the gather command to collect from any source
# Shodan is the only module that requires an API key (Just-Metadata/module/intelgathering/get_shodan.py)
[>] gather
[>] gather shodan

# List all the analysis modules
[>] list analysis

# Then you can use the analyze command
[>] analyze geoinfo
 
# You can get all gathered info about one IP with the following
[>] ip_info <ip>

# You can save your results and import them back later
[>] save
[>] import /path/to/file.state
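The analyze modules essentially pivot on attributes gathered per IP; a minimal sketch of that idea (a hypothetical helper, not Just-Metadata's code), grouping loaded IPs by /24 network:

```python
import ipaddress
from collections import defaultdict

def group_by_subnet(ips, prefix=24):
    """Group IP addresses by their /prefix network -- the kind of pivot
    an analysis module performs on gathered data."""
    groups = defaultdict(list)
    for ip in ips:
        net = ipaddress.ip_network("{}/{}".format(ip, prefix), strict=False)
        groups[str(net)].append(ip)
    return dict(groups)

ips = ["203.0.113.10", "203.0.113.77", "198.51.100.5"]  # documentation ranges
print(group_by_subnet(ips))
# {'203.0.113.0/24': ['203.0.113.10', '203.0.113.77'],
#  '198.51.100.0/24': ['198.51.100.5']}
```

Real modules pivot the same way on richer attributes (GeoIP country, ASN, open ports from Shodan) instead of the subnet alone.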