Provided by: waymore_3.7-1.1_all

NAME

       waymore - Tool to discover extensive data from online archives

SYNOPSIS

          waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>]
                  [-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>]
                  [-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]

DESCRIPTION

       waymore  is  a  versatile  tool designed to extract comprehensive information from various
       sources including the Wayback  Machine,  Common  Crawl,  Alien  Vault  OTX,  URLScan,  and
       VirusTotal.  Whether  you're  searching  for  historical  web  data  or analyzing security
       threats, waymore provides a seamless experience with its intuitive interface and extensive
       features.

OPTIONS

       -h, --help:
              Display   command  usage  and  options.  Provides  quick  access  to  comprehensive
              assistance, including detailed explanations of available options.

       -i INPUT, --input INPUT:
              The target domain (or file of domains) to find links for. This can be a domain
              only, or a domain with a specific path. To get everything for a domain, pass the
              domain only and don't prefix it with "www."

       -n, --no-subs:
              Don't include subdomains of the target domain (only used if input is not  a  domain
              with a specific path).

       -mode {U,R,B}:
              The mode to run: U (retrieve URLs only), R (download Responses only) or B (Both).

       -oU OUTPUT_URLS, --output-urls OUTPUT_URLS:
              The  file  to  save  the Links output to, including path if necessary. If the "-oR"
              argument is not passed, a "results" directory will be created in the path specified
              by   the   DEFAULT_OUTPUT_DIR   key  in  config.yml  file  (typically  defaults  to
              "~/.config/waymore/").  Within that, a directory will be created with target domain
              (or  domain  with  path)  passed  with "-i" (or for each line of a file passed with
              "-i").

       -oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES:
              The directory to save the response output files to, including path if necessary. If
              the  argument  is  not  passed,  a  "results" directory will be created in the path
              specified by the DEFAULT_OUTPUT_DIR key in config.yml file (typically  defaults  to
              "~/.config/waymore/").  Within that, a directory will be created with target domain
              (or domain with path) passed with "-i" (or for each line  of  a  file  passed  with
              "-i").

       -f, --filter-responses-only:
              The initial links from Wayback Machine will not be filtered (MIME Type and Response
              Code), only the responses that are downloaded, e.g. it may be useful to still see
              all available paths from the links even if you don't want to check the content.

       -fc FC:
              Filter (exclude) HTTP status codes for retrieved URLs and responses. Comma separated
              list of codes (default: the FILTER_CODE values from config.yml). Passing this
              argument will override the value from config.yml.

       -mc MC:
              Only Match HTTP status codes for retrieved URLs and responses. Comma separated list
              of codes. Passing this argument overrides the config FILTER_CODE and -fc.
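
              The precedence between -fc, -mc and the config value can be sketched in Python as
              follows. This is an illustration of the behaviour described above, not waymore's
              own code; the function names and the "404,301,302" stand-in for FILTER_CODE are
              assumptions made for the example.

```python
def parse_codes(value):
    """Turn a comma separated string such as "404, 301" into a set of codes."""
    return {code.strip() for code in value.split(",") if code.strip()}


def keep_status(status, fc=None, mc=None, config_filter_code="404,301,302"):
    """Decide whether a result with this HTTP status should be kept.

    config_filter_code is a placeholder for the FILTER_CODE value in
    config.yml; it is not necessarily the shipped default.
    """
    if mc is not None:
        # -mc wins over both -fc and the config value: keep only listed codes.
        return status in parse_codes(mc)
    # Otherwise exclude the -fc codes, falling back to the config value.
    excluded = parse_codes(fc if fc is not None else config_filter_code)
    return status not in excluded


kept_by_default = keep_status("200")                   # not in the excluded set
dropped_by_default = keep_status("404")                # excluded by the config value
kept_with_fc = keep_status("404", fc="500")            # -fc replaces the config value
kept_with_mc = keep_status("404", fc="404", mc="404")  # -mc overrides -fc
```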

       -l <signed integer>, --limit <signed integer>:
              How many responses will be saved (if -mode is R or B). A positive value will get
              the first N results, a negative value will get the last N results. A value of
              0 will get ALL responses (default: 5000).
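
              The sign convention for the limit maps naturally onto list slicing. A minimal
              Python sketch, assuming the responses are already in the order they were found
              (apply_limit is a hypothetical helper, not part of waymore):

```python
def apply_limit(responses, limit=5000):
    """Select responses according to the sign of the limit."""
    if limit == 0:
        return list(responses)          # 0: keep everything
    if limit > 0:
        return list(responses[:limit])  # positive: first N
    return list(responses[limit:])      # negative: last N


captures = ["c1", "c2", "c3", "c4", "c5"]
first_two = apply_limit(captures, 2)
last_two = apply_limit(captures, -2)
everything = apply_limit(captures, 0)
```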

       -from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>:
              What date to get responses from. If not specified it will  get  from  the  earliest
              possible results. A partial value can be passed, e.g. 2016, 201805, etc.

       -to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>:
              What  date to get responses to. If not specified it will get to the latest possible
              results. A partial value can be passed, e.g. 2016, 201805, etc.
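
              One plausible reading of a partial value is that it stands for the start (for
              -from) or the end (for -to) of the period it names. The Python sketch below
              illustrates that reading only; how waymore and the archive actually interpret
              partial values is not stated here, and both helper names are hypothetical.

```python
def expand_partial(ts, upper=False):
    """Pad a partial yyyyMMddhhmmss value out to 14 digits.

    The upper bound is only used for string comparison, so padding the day
    with 31 is harmless even for shorter months.
    """
    low = "00000101000000"
    high = "99991231235959"
    pad = high if upper else low
    return ts + pad[len(ts):]


def in_range(capture_ts, from_date=None, to_date=None):
    """Check a full 14-digit capture timestamp against optional bounds."""
    if from_date and capture_ts < expand_partial(from_date):
        return False
    if to_date and capture_ts > expand_partial(to_date, upper=True):
        return False
    return True


start_of_2016 = expand_partial("2016")
end_of_may_2018 = expand_partial("201805", upper=True)
```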

       -ci {h,d,m,none}, --capture-interval {h,d,m,none}:
              Filters the search on Wayback Machine (archive.org) to only get at most  1  capture
              per  hour  (h),  day (d) or month (m). This filter is used for responses only.  The
              default is 'd' but can also be set to 'none' to not filter  anything  and  get  all
              responses.
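
              The effect of the interval can be pictured as keeping one capture per timestamp
              prefix: yyyyMMddhh for an hour, yyyyMMdd for a day, yyyyMM for a month. The Python
              sketch below keeps the first capture seen in each interval; that choice, and the
              helper itself, are assumptions for illustration rather than a description of what
              the archive returns.

```python
# Number of leading timestamp characters that identify each interval.
PREFIX_LEN = {"h": 10, "d": 8, "m": 6}


def one_per_interval(timestamps, interval="d"):
    """Keep at most one 14-digit capture timestamp per hour, day or month."""
    if interval == "none":
        return list(timestamps)
    length = PREFIX_LEN[interval]
    seen = set()
    kept = []
    for ts in timestamps:
        key = ts[:length]
        if key not in seen:
            seen.add(key)
            kept.append(ts)
    return kept


captures = [
    "20220101090000",
    "20220101093000",
    "20220101100000",
    "20220102090000",
    "20220201090000",
]
per_hour = one_per_interval(captures, "h")
per_day = one_per_interval(captures, "d")
per_month = one_per_interval(captures, "m")
```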

       -ra REGEX_AFTER, --regex-after REGEX_AFTER:
              RegEx for filtering purposes against links found from all sources of URLs AND
              responses downloaded. Only positive matches will be output.

       -url-filename:
              Set the file name of downloaded responses to the URL that generated  the  response,
              otherwise  it  will be set to the hash value of the response.  Using the hash value
              means multiple URLs that generated the same response will only result in  one  file
              being saved for that response.
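
              The trade-off between the two naming schemes can be shown with a short Python
              sketch. SHA-256 and percent-encoding are stand-ins chosen for the example; the
              hash function and file name format that waymore really uses are not described
              here.

```python
import hashlib
from urllib.parse import quote


def response_filename(url, body, use_url=False):
    """Name a saved response by its URL or by a hash of its content."""
    if use_url:
        # Percent-encode so the URL is safe to use as a single file name.
        return quote(url, safe="")
    return hashlib.sha256(body).hexdigest()


body = b"<html>same page</html>"
urls = ["https://example.com/a", "https://example.com/a?utm=1"]

hash_names = {response_filename(u, body) for u in urls}
url_names = {response_filename(u, body, use_url=True) for u in urls}
```

              With hash naming the two URLs collapse into one file; with URL naming each URL
              gets its own file even though the content is identical.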

       -xwm:  Exclude checks for links from Wayback Machine (archive.org)

       -xcc:  Exclude checks for links from commoncrawl.org

       -xav:  Exclude checks for links from alienvault.com

       -xus:  Exclude checks for links from urlscan.io

       -xvt:  Exclude checks for links from virustotal.com

       -lcc LCC:
              Limit the number of Common Crawl index collections searched, e.g. '-lcc 10' will
              just search the latest 10 collections (default: 3). As of July 2023 there were
              95 collections. Setting to 0 will search ALL collections. If you don't want to
              search Common Crawl at all, use the -xcc option.

       -lcy LCY:
              Limit the number of Common Crawl index collections searched by the year of the
              index data. The earliest index has data from 2008. Setting to 0 (default) will
              search collections of any year (but in conjunction with -lcc). For example, if you
              are only interested in data from 2015 and after, pass -lcy 2015. If you don't want
              to search Common Crawl at all, use the -xcc option.
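
              How -lcc and -lcy combine can be sketched in Python as below. The collection
              labels are made up, the list is assumed to be ordered newest first, and applying
              the year filter before the count is an assumption for illustration.

```python
def select_collections(collections, lcc=3, lcy=0):
    """Pick collections from a newest-first list of (label, year) pairs."""
    # lcy == 0 means any year; otherwise keep that year and later.
    by_year = [c for c in collections if lcy == 0 or c[1] >= lcy]
    # lcc == 0 means no cap on the number of collections.
    return by_year if lcc == 0 else by_year[:lcc]


# Hypothetical labels, not real Common Crawl collection names.
collections = [
    ("idx-2023-b", 2023),
    ("idx-2023-a", 2023),
    ("idx-2022-a", 2022),
    ("idx-2015-a", 2015),
    ("idx-2014-a", 2014),
]
latest_three = select_collections(collections)
since_2015 = select_collections(collections, lcc=0, lcy=2015)
```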

       -t <seconds>, --timeout <seconds>:
              This is for archived responses only! How many seconds to wait  for  the  server  to
              send data before giving up (default: 30 seconds)

       -p <integer>, --processes <integer>:
              Basic  multithreading  is  done  when  getting  requests  for  a file of URLs. This
              argument determines the number of processes (threads) used (default: 1)

       -r RETRIES, --retries RETRIES:
              The number of retries for requests  that  get  connection  error  or  rate  limited
              (default: 1).

       -m <integer>, --memory-threshold <integer>:
              The memory threshold percentage. If the machine's memory usage goes above the
              threshold, the program will be stopped and ended gracefully before running out of
              memory (default: 95).

       -ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]:
              Only return links and responses that contain keywords that you are interested in.
              This can reduce the time it takes to get results. If you provide the flag with no
              value, keywords are taken from the comma separated list in the "config.yml" file
              with the "FILTER_KEYWORDS" key, otherwise you can pass a specific Regex value to
              use, e.g. -ko "admin" to only get links containing the word admin, or -ko
              "\.js(\?|$)" to only get JS files. The Regex check is NOT case sensitive.
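
              To see what a keyword pattern would match before running a long job, it can be
              tried against a few sample links. The Python sketch below uses a case-insensitive
              search, in line with the description above; keep_link and the sample URLs are
              illustrative only.

```python
import re


def keep_link(url, pattern):
    """Return True if the pattern is found anywhere in the URL, ignoring case."""
    return re.search(pattern, url, flags=re.IGNORECASE) is not None


links = [
    "https://example.com/static/app.js",
    "https://example.com/static/APP.JS?v=3",
    "https://example.com/data.json",
    "https://example.com/Admin/login",
]

# ".js" followed by either a query string or the end of the URL.
js_only = [u for u in links if keep_link(u, r"\.js(\?|$)")]
admin_only = [u for u in links if keep_link(u, "admin")]
```

              Note that data.json is not matched by the JS pattern, because ".js" there is
              followed by "on" rather than "?" or the end of the URL.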

       -lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS:
              Limit the number of requests that will be made when getting links from a source
              (this doesn't apply to Common Crawl). Some targets need so many requests that it
              is not feasible to make them all, so this can be used to manage that situation.
              This defaults to 0 (Zero) which means there is no limit.

       -ow, --output-overwrite:
              If the URL output file (default waymore.txt) already exists, it will be overwritten
              instead of being appended to.

       -nlf, --new-links-file:
              If this argument is passed, a .new file will also  be  written  that  will  contain
              links for the latest run.

       -c CONFIG, --config CONFIG:
              Path  to  the YML config file. If not passed, it looks for file 'config.yml' in the
              same directory as runtime file 'waymore.py'.

       -wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry WAYBACK_RATE_LIMIT_RETRY:
              The number of minutes the user wants to wait for a rate limit pause on Wayback
              Machine (archive.org) instead of stopping with a 429 error (default: 3).

       -urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry URLSCAN_RATE_LIMIT_RETRY:
              The  number  of minutes the user wants to wait for a rate limit pause on URLScan.io
              instead of stopping with a 429 error (default: 1).

       -co, --check-only:
              This will make a few minimal requests to show you how many requests, and roughly
              how long it could take, to get URLs from the sources and download responses from
              Wayback Machine.

       -nd, --notify-discord:
              Whether to send a notification to  Discord  when  waymore  completes.  It  requires
              WEBHOOK_DISCORD to be provided in the config.yml file.

       -v, --verbose:
              Verbose output

       --version:
              Show version number

EXAMPLES

       Common usage:

       • Example 1:

       Just  get  the  URLs  from  all  sources  for redbull.com (-mode U is just for URLs, so no
       responses are downloaded):

          $ waymore -i redbull.com -mode U

       The URLs are saved in the same path  as  config.yml  (typically  ~/.config/waymore)  under
       results/redbull.com/waymore.txt

       • Example 2:

       Get  ALL  the URLs from Wayback for redbull.com (no filters are applied in mode U with -f,
       and no URLs are retrieved from Common Crawl, Alien Vault, URLScan and Virus Total, because
       -xcc,  -xav,  -xus,  -xvt  are passed respectively). Save the FIRST 200 responses that are
       found starting from 2022 (-l 200 -from 2022):

          $ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022

       • Example 3:

       You can pipe waymore to other tools. Any errors are sent to stderr and any links found are
       sent  to  stdout. The output file is still created in addition to the links being piped to
       the next program. However, archived responses are not piped to the next program, but  they
       are still written to files. For example:

          $ waymore -i redbull.com -mode U | unfurl keys | sort -u

       You can also pass the input through stdin instead of -i:

          $ cat redbull_subs.txt | waymore
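
       Because links arrive on stdout one per line, the downstream step can be any program that
       reads lines. As a rough Python stand-in for the "unfurl keys | sort -u" stage above,
       the sketch below collects the unique query parameter names from an iterable of URLs
       (unique_keys is a hypothetical helper written for this example):

```python
from urllib.parse import parse_qsl, urlsplit


def unique_keys(lines):
    """Return the sorted, de-duplicated query parameter names found in the URLs."""
    keys = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the piped output
        # keep_blank_values keeps parameters such as "lang=" that have no value.
        for key, _value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
            keys.add(key)
    return sorted(keys)


sample = [
    "https://example.com/?id=1&q=x\n",
    "https://example.com/p?q=y&lang=\n",
    "\n",
]
keys = unique_keys(sample)
```

       In a real pipeline the same function could be given sys.stdin in place of the sample
       list.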

       • Example 4:

       Sometimes  you may just want to check how many requests, and how long waymore is likely to
       take if you ran it for a particular domain.  You  can  do  a  quick  check  by  using  the
       -co/--check-only argument. For example:

          $ waymore -i redbull.com --check-only

AUTHOR

       Aquila Macedo <aquilamacedo@riseup.net>

COPYRIGHT

       Expat

                                            2024-03-22                                 WAYMORE(1)