NAME
waymore - Tool to discover extensive data from online archives
SYNOPSIS
waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>] [-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>] [-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]
DESCRIPTION
waymore finds known URLs for a target domain from the Wayback Machine (archive.org), Common Crawl, Alien Vault OTX, URLScan and VirusTotal. In addition to returning links, it can also download the archived responses for those URLs, so historical content can be inspected directly.
OPTIONS
-h, --help: Show the help message, describing all available options, and exit.
-i INPUT, --input INPUT: The target domain (or file of domains) to find links for. This can be a domain only, or a domain with a specific path. If passing a domain only (to get everything for that domain), don't prefix it with "www.".
-n, --no-subs: Don't include subdomains of the target domain (only used if the input is not a domain with a specific path).
-mode {U,R,B}: The mode to run: U (retrieve URLs only), R (download Responses only) or B (Both).
-oU OUTPUT_URLS, --output-urls OUTPUT_URLS: The file to save the Links output to, including path if necessary. If the argument is not passed, a "results" directory will be created in the path specified by the DEFAULT_OUTPUT_DIR key in the config.yml file (typically "~/.config/waymore/"). Within that, a directory will be created for the target domain (or domain with path) passed with "-i" (or for each line of a file passed with "-i").
-oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES: The directory to save the response output files to, including path if necessary. If the argument is not passed, a "results" directory will be created in the path specified by the DEFAULT_OUTPUT_DIR key in the config.yml file (typically "~/.config/waymore/"). Within that, a directory will be created for the target domain (or domain with path) passed with "-i" (or for each line of a file passed with "-i").
-f, --filter-responses-only: The initial links from the Wayback Machine will not be filtered (by MIME Type and Response Code); only the responses that are downloaded will be filtered. For example, it may be useful to still see all available paths from the links even if you don't want to check the content.
-fc FC: Filter HTTP status codes for retrieved URLs and responses. Comma-separated list of codes (default: the FILTER_CODE values from config.yml). Passing this argument overrides the value from config.yml.
-mc MC: Only match HTTP status codes for retrieved URLs and responses. Comma-separated list of codes. Passing this argument overrides the config FILTER_CODE and -fc.
-l <signed integer>, --limit <signed integer>: How many responses will be saved (if -mode is R or B). A positive value will get the first N results; a negative value will get the last N results. A value of 0 will get ALL responses (default: 5000).
-from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>: What date to get responses from. If not specified, it will get from the earliest possible results. A partial value can be passed, e.g. 2016, 201805, etc.
-to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>: What date to get responses to. If not specified, it will get to the latest possible results. A partial value can be passed, e.g. 2016, 201805, etc.
-ci {h,d,m,none}, --capture-interval {h,d,m,none}: Filters the search on the Wayback Machine (archive.org) to get at most 1 capture per hour (h), day (d) or month (m). This filter is used for responses only. The default is 'd', but it can also be set to 'none' to not filter anything and get all responses.
-ra REGEX_AFTER, --regex-after REGEX_AFTER: RegEx for filtering purposes, applied to links found from all sources of URLs AND to responses downloaded. Only positive matches will be output.
-url-filename: Set the file name of downloaded responses to the URL that generated the response; otherwise it will be set to the hash value of the response. Using the hash value means multiple URLs that generated the same response will only result in one file being saved for that response.
-xwm: Exclude checks for links from the Wayback Machine (archive.org).
-xcc: Exclude checks for links from commoncrawl.org.
-xav: Exclude checks for links from alienvault.com.
-xus: Exclude checks for links from urlscan.io.
-xvt: Exclude checks for links from virustotal.com.
-lcc LCC: Limit the number of Common Crawl index collections searched, e.g. '-lcc 10' will just search the latest 10 collections (default: 3). As of July 2023 there are 95 collections. Setting it to 0 will search ALL collections. If you don't want to search Common Crawl at all, use the -xcc option.
-lcy LCY: Limit the Common Crawl index collections searched by the year of the index data. The earliest index has data from 2008. Setting it to 0 (default) will search collections of any year (in conjunction with -lcc). For example, if you are only interested in data from 2015 and after, pass -lcy 2015. If you don't want to search Common Crawl at all, use the -xcc option.
-t <seconds>, --timeout <seconds>: For archived responses only. How many seconds to wait for the server to send data before giving up (default: 30 seconds).
-p <integer>, --processes <integer>: Basic multithreading is done when getting requests for a file of URLs. This argument determines the number of processes (threads) used (default: 1).
-r RETRIES, --retries RETRIES: The number of retries for requests that get a connection error or are rate limited (default: 1).
-m <integer>, --memory-threshold <integer>: The memory threshold percentage. If the machine's memory goes above this threshold, the program will be stopped and ended gracefully before running out of memory (default: 95).
-ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]: Only return links and responses that contain keywords you are interested in. This can reduce the time it takes to get results. If you provide the flag with no value, keywords are taken from the comma-separated list under the "FILTER_KEYWORDS" key in the "config.yml" file; otherwise you can pass a specific RegEx value to use, e.g. -ko "admin" to only get links containing the word admin, or -ko "\.js(\?|$)" to only get JS files. The RegEx check is NOT case sensitive.
-lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS: Limit the number of requests that will be made when getting links from a source (this doesn't apply to Common Crawl). Some targets would need a huge number of requests that are just not feasible to make, so this can be used to manage that situation. The default of 0 (zero) means there is no limit.
-ow, --output-overwrite: If the URL output file (default waymore.txt) already exists, it will be overwritten instead of appended to.
-nlf, --new-links-file: If this argument is passed, a .new file will also be written containing the links found in the latest run.
-c CONFIG, --config CONFIG: Path to the YML config file. If not passed, it looks for the file 'config.yml' in the same directory as the runtime file 'waymore.py' (a sketch of this file appears after these options).
-wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry WAYBACK_RATE_LIMIT_RETRY: The number of minutes to wait on a rate-limit pause from the Wayback Machine (archive.org) instead of stopping with a 429 error (default: 3).
-urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry URLSCAN_RATE_LIMIT_RETRY: The number of minutes to wait on a rate-limit pause from URLScan.io instead of stopping with a 429 error (default: 1).
-co, --check-only: Make a few minimal requests to show how many requests would be needed, and roughly how long it could take, to get URLs from the sources and download responses from the Wayback Machine.
-nd, --notify-discord: Send a notification to Discord when waymore completes. Requires WEBHOOK_DISCORD to be provided in the config.yml file.
-v, --verbose: Verbose output.
--version: Show the version number.
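Several of the options above refer to keys in the config.yml file (DEFAULT_OUTPUT_DIR, FILTER_CODE, FILTER_KEYWORDS, WEBHOOK_DISCORD). The following is a minimal sketch of such a file using only those keys; the values shown are illustrative assumptions, not the defaults shipped with waymore:

    # Illustrative config.yml sketch. Keys are those named in the options
    # above; the values are placeholder assumptions, not shipped defaults.
    DEFAULT_OUTPUT_DIR: ~/.config/waymore   # base path under which the "results" directory is created
    FILTER_CODE: 301,302,404                # status codes filtered out by default (overridden by -fc/-mc)
    FILTER_KEYWORDS: admin,login,portal     # keywords used when -ko is passed without a value
    WEBHOOK_DISCORD: https://discord.com/api/webhooks/REPLACE_ME   # required for -nd/--notify-discord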
EXAMPLES
Common usage:

• Example 1: Just get the URLs from all sources for redbull.com (-mode U is just for URLs, so no responses are downloaded):

    $ waymore -i redbull.com -mode U

  The URLs are saved in the same path as config.yml (typically ~/.config/waymore) under results/redbull.com/waymore.txt

• Example 2: Get ALL the URLs from Wayback for redbull.com (no filters are applied in mode U with -f, and no URLs are retrieved from Common Crawl, Alien Vault, URLScan and VirusTotal, because -xcc, -xav, -xus and -xvt are passed respectively). Save the FIRST 200 responses that are found, starting from 2022 (-l 200 -from 2022):

    $ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022

• Example 3: You can pipe waymore to other tools. Any errors are sent to stderr and any links found are sent to stdout. The output file is still created in addition to the links being piped to the next program. However, archived responses are not piped to the next program; they are still written to files. For example:

    $ waymore -i redbull.com -mode U | unfurl keys | sort -u

  You can also pass the input through stdin instead of -i:

    $ cat redbull_subs.txt | waymore

• Example 4: Sometimes you may just want to check how many requests waymore would make, and how long it is likely to take, for a particular domain. You can do a quick check by using the -co/--check-only argument. For example:

    $ waymore -i redbull.com --check-only
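• Example 5 (an illustrative invocation added for completeness; the date range and output directory are placeholders, not taken from the upstream documentation): Download responses only (-mode R) for captures made during 2022 (-from 2022 -to 2022), keep at most one capture per month (-ci m), and save them to a chosen directory (-oR):

    $ waymore -i redbull.com -mode R -from 2022 -to 2022 -ci m -oR ./redbull-responses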
AUTHOR
Aquila Macedo <aquilamacedo@riseup.net>
COPYRIGHT
Expat