Provided by: collectl_4.3.1-1_all

NAME

        colmux  -  multiplex  communications  to  multiple systems running collectl from a single
       system

SYNOPSIS

       colmux [-command "collectl-switches... [-p filespec]"] [-address addr1[,addr2,...]|-addr
       filename] [-cols col1[,col2...]] | [-column num]

DESCRIPTION

       This utility gathers up data generated by collectl from multiple systems and multiplexes
       it into a single consolidated format.  It runs in essentially 2 distinct modes.  The first
       is known as real-time, because data is retrieved and displayed in real time.  The second
       is playback mode, because data is played back from existing collectl data files.

       There are also 2 general formats for the data being displayed.  The first is a multi-line
       display in which the data is displayed in the native form that collectl displays it,
       except it is sorted by a distinct column, essentially allowing one to see the TOP producers
       of that data.  The second format is a single-line display in which one or more distinct
       data elements from each source are displayed on the same line.  This latter format is never
       sorted, but rather positionally organized by the name of the system that generated it.

       Collectl will then be executed, using any optional switches specified by -command, on
       each of the systems specified by -address, OR on the addresses read from a file if the
       target of that switch is a filename rather than a list of hosts, OR on the local system if
       -address is not specified.  See collectl for details of the various switches.  In some
       cases certain collectl switches will not make sense in a colmux environment and if chosen
       will generate an error.  Further, if hosts are specified with -address, they should be
       individual addresses or hostnames separated by commas.  In turn, any of them can be in
       what those familiar with pdsh would recognize as -w format.
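
       As an illustration of the bracketed range notation mentioned above, the following is a
       minimal Python sketch of how a list such as n[1-3,7],db1 could be expanded into individual
       hostnames.  It is not taken from colmux or pdsh; the function name expand_hosts is
       hypothetical and only a single bracket group per entry is handled.

```python
import re


def expand_hosts(spec):
    """Expand a simplified pdsh -w style list such as 'n[1-3,7],db1'.

    Hypothetical helper for illustration only; not colmux's own parser.
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for item in re.split(r",(?![^\[]*\])", spec):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", item)
        if not match:
            hosts.append(item)  # plain hostname or address
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single number is a range of one
            # Keep zero padding when the range starts with a leading zero.
            width = len(low) if low.startswith("0") else 0
            for number in range(int(low), int(high) + 1):
                hosts.append(prefix + str(number).zfill(width) + suffix)
    return hosts
```

       For example, expand_hosts("n[1-10,15]") yields n1 through n10 followed by n15.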

       Colmux will then execute the collectl command, gather the results from all sources  for  a
       particular  interval  and display them one result per line, sorted by the specified column
       OR all on the same line in groups specified by -cols.  The number of  lines  displayed  is
       set  to  the size of the terminal window by default, but can be changed using -lines.  The
       one exception is the use of -nosort  which  only  applies  to  the  playback  of  existing
       collectl  raw files.  In this mode all records for a particular interval will be displayed
       and the sorting bypassed, making this a speedy and convenient mechanism for gathering  all
       data from all systems in one place for potential further processing.
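
       The sorted, one-result-per-line display described above can be pictured with a short
       Python sketch.  It is only an illustration under stated assumptions: the function name
       top_lines is hypothetical, fields are taken to be whitespace separated, the column index
       is 0-based, and the largest values are assumed to come first by default.

```python
def top_lines(rows, column, lines, reverse=False):
    """Return at most `lines` rows ordered by the numeric value in `column`.

    rows:   list of whitespace-separated strings, one per host or instance
    column: 0-based field index (an assumption; check real numbering with -test)
    """
    def key(row):
        return float(row.split()[column])

    # Largest values first by default, so the top producers lead the list.
    ordered = sorted(rows, key=key, reverse=not reverse)
    return ordered[:lines]
```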

       Colmux will never modify the size of the terminal window, so to see more or wider lines
       either expand the window or override the number of display lines and run it again.  If the
       number of display lines is set greater than the terminal height or to 0, colmux will no
       longer overlay the previous window and will simply run in a continuous scrolling mode.

       Common Switches

       -address list|pdsh|filename
              Specify any combination of addresses as  hostnames  OR  in  pdsh  -w  format  OR  a
              filename  containing  a  list  of  hostnames/addresses,  1 per line.  You MUST have
              passwordless ssh access to these nodes.  If a different username  is  required,  be
              sure  to  specify  addresses in username@host format noting you do not have to have
              the same username on each host.  If specified, these usernames will override  those
              specified with the -username switch.  rsh access is not supported.
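
              The precedence between per-host usernames and -username can be sketched in Python
              as follows.  This is an illustration only: parse_addresses and default_user are
              hypothetical names, and the input is assumed to be the lines of an address file.

```python
def parse_addresses(lines, default_user=None):
    """Return (user, host) pairs from address-file lines, one entry per line.

    Hypothetical helper; default_user stands in for the -username value.
    """
    targets = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if "@" in entry:
            user, host = entry.split("@", 1)  # per-host username wins
        else:
            user, host = default_user, entry
        targets.append((user, host))
    return targets
```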

       -command switches
              One can specify virtually any collectl switches here, in either real-time or
              playback mode.  Some switches may only be used in one mode or the other and colmux
              will usually let you know if you specify an invalid combination or an otherwise
              restricted switch.  Only those directly affecting colmux are listed below:

              --from, --thru
                     Limit the timeframe for data being played back, noting you can include both
                     the from and thru times with the --from switch if you separate them with a
                     hyphen.

              -o time-format
                     This is a "magic" switch in that it not only tells collectl how to display
                     dates/times (no other options are permitted using -o other than those from
                     the set [dDTm]), it also tells colmux how to display dates/times.

                     In single line mode, the timestamp will either come from the host system  in
                     real-time  mode  OR  the  first host when run in playback mode.  This is the
                     most common use/need for this switch.  But be  careful  in  choosing  column
                     numbers  with  -cols  as  the  position of the data shifts by 1 when time is
                     included and by 2 if date and time are.  Using -test will correctly show the
                     shifted  positions  but  only if you include -o with the command at the same
                     time you use -test.

                     In real-time/top mode this switch is not allowed since colmux simply reports
                     the current time of the system it is running on.

                     When playing back multi-line formatted data from one or more files, a
                     timestamp for each interval is reported, consisting of the time of that
                     interval.  When this switch is included, each line will be tagged with an
                     appropriate timestamp since on rare occasions they may not necessarily all
                     be identical.
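
                     The column shift described above for single-line mode amounts to simple
                     arithmetic, sketched here in Python.  The function name shifted_columns is
                     hypothetical, and the sketch assumes the timestamp fields precede the data
                     so that data columns move to higher numbers; -test remains the reliable way
                     to confirm the real positions.

```python
def shifted_columns(cols, has_time=False, has_date=False):
    """Shift column numbers chosen without timestamps to their new positions.

    Shift is 1 when only a time is shown and 2 when date and time are shown.
    """
    shift = 2 if (has_date and has_time) else 1 if has_time else 0
    return [col + shift for col in cols]
```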

              -p playback-file
                     This switch tells colmux to run in playback mode.  The filename should
                     include the directory location and is usually specified with wild cards,
                     limiting the selected file(s) to a specific date.  When those files are on
                     the same host (-address is not specified), they may be for multiple hosts,
                     but when the files are on remote hosts they must all be for that unique
                     host.  If the file specification includes the string TODAY or YESTERDAY they
                     will be replaced with *yyyymmdd* for that date.
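
                     The TODAY and YESTERDAY substitution can be pictured with the following
                     Python sketch.  It is an illustration rather than colmux's own code; the
                     function name expand_date_keywords and its today parameter are hypothetical.

```python
from datetime import date, timedelta


def expand_date_keywords(filespec, today=None):
    """Replace YESTERDAY and TODAY with a *yyyymmdd* wildcard pattern."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    filespec = filespec.replace("YESTERDAY", f"*{yesterday:%Y%m%d}*")
    return filespec.replace("TODAY", f"*{today:%Y%m%d}*")
```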

              -P
                     Run  collectl  in  plot-format.   This  allows one to specify just about any
                     combination of subsystems since all data is always  displayed  on  a  single
                     line.   However, due to the lack of formatting, this also makes no sense for
                     multi-line displays and is therefore only supported in single-line format.

       -help
              Show a brief help message and exit.

       -hostwidth n
              By default, colmux sets the hostwidth to 8, unless it sees something wider, and for
              most situations this is sufficient.  However, if one specifies hostnames that are
              aliases of the longer hostname, colmux has no way of knowing the real hostlengths
              until after it starts receiving data from collectl and the formatting will be off
              if the hostnames are longer than the default.  To overcome this problem, use this
              switch to force the hostname to be wider.

       -lines
              Change the number of lines that are displayed for each interval in multi-line mode.
              The default will be determined by the terminal size returned by  the  linux  resize
              command if present.  If that command is not present, the size will be initially set
              to 24.  If -lines is greater than the terminal size or 0,  top-like  behavior  will
              not be used when in real-time mode.

              Single-line format controls the number of lines displayed between headers.  A value
              of 0 will only display the header one time.

       -noescape
              Colmux uses brute-force screen formatting, that is, it generates its own VT100
              escape sequences to clear lines and/or move the cursor.  On some occasions you may
              want to disable these sequences if you wish to record the output and do your own
              post-processing of it.  This switch will do just that.

       -port
              Sometimes  a  remote version of collectl is already using the default socket.  This
              allows one to start another instance and override that value.

       -test
              This tells colmux to execute the specified collectl command either  locally  or  on
              the first remote system specified by -address, print the associated header with the
              selected column(s) highlighted and also include each column  name  along  with  its
              ordinal  number,  making  it  fairly  easy  to  make sure you've selected the right
              column(s).

       -username name
              Use this username for ALL ssh commands.  It can be overridden for specific hosts by
              specifying them with the -address switch with the desired hostnames.

       -version
              Display  the  version  and exit.  It will also report if Term::ReadKey is installed
              and if so what its version number is.

       Playback Mode Specific

       The following additional switches only apply to playback mode.   There  are  no  real-time
       mode specific switches.

       -delay seconds
              Introduce a delay between intervals in seconds.  You can specify fractional values.
              Not using this switch will cause the output to be displayed as fast as  it  can  be
              rendered.

       -home
              Move the cursor to the home position (upper left-hand corner) of the display to use
              a top-like display format.  This ONLY applies to multi-line mode when  in  playback
              mode and provides a mechanism for displaying recorded data in a top-like fashion.

       -hostfilter addr[,addr]
              When  playing  back  files for multiple hosts on the local system, sometimes you do
              not want to play back ALL the host files.  This filter allows you to  specify  only
              those  hosts  which  you  want  to process.  The format of the list of addresses is
              specified in the same way as -address except that you cannot specify a filename.

       -nosort
              Intended primarily for output that would be redirected to a file, do  not  sort  or
              include any escape sequences in the output.

       Multi-Line Format

              When there is more output than will fit on the screen, colmux includes the text:
                     Displaying: lines xx thru yy out of zz
              on the right-side of the top line of the display, where xx is typically 1.

              However, once colmux is running, one might want to look at subsequent lines, i.e.
              those below the bottom of the screen and therefore invisible.  If the ReadKey
              module is installed, one can simply use the PageDown key to move down the display
              and the PageUp key to move in the other direction.  If ReadKey is not installed,
              typing the multi-key sequences pd<ENTER> or pu<ENTER> will cause the same thing to
              happen.

       -colhelp
              When you wish to change the sort column and the arrow keys aren't available to you,
              it  may  be  cumbersome to identify the number of the column to type in followed by
              RETURN.  This tells colmux to display the numbers over each column eliminating  the
              need to manually count them and find the one you want.

       -column num
              Set  the  sort  column  to  this number.  The column numbering is determined by the
              columns returned by collectl for the requested command.   Since  date/time  columns
              are  optional  for  non-plot data, their inclusion will change the numbering of the
              columns so if you are not sure you selected the correct column,  you  should  first
              execute your command with -test included.

              You  can also change the column number interactively with the RIGHT/LEFT arrow keys
              IF the ReadKey module is installed (see colmux  -version)  OR  simply  type  it  in
              followed by the <ENTER> key.

       -finalcr
              There is a real odd case in which you might want to pipe colmux real-time output to
              a script for further processing.  However, if you do this you can't read the final
              line with a routine that expects a terminating CR, like python's readline().
              Rather, that last line and the one that follows will be returned as one long
              string.  This switch tells colmux to insert that final CR, which WILL mess up the
              screen under normal operations, so be forewarned.

       -hostformat char:pos
              There are times one has long hostnames which can either take up valuable screen
              real estate or are simply painful to look at.  This switch may evolve over time and
              is currently targeted at hostnames that have repeating parts along with a unique
              part, separated by a character such as a hyphen.  This switch allows you to specify
              a single character followed by the piece of the hostname you'd like to see
              displayed.  For example, if you have a hostname like aaa-bbbb-cccc-dddd,
              -hostformat -:3 will cause the cccc piece to be displayed.
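
              The char:pos selection can be sketched in Python as shown here.  This is an
              illustration based only on the example above; format_hostname is a hypothetical
              name, and returning the full hostname when the requested piece does not exist is
              an assumption, not documented colmux behavior.

```python
def format_hostname(hostname, spec):
    """Apply a 'char:pos' spec, e.g. '-:3', to pick one piece of a hostname."""
    separator, _, position = spec.partition(":")
    pieces = hostname.split(separator)
    index = int(position) - 1  # positions are 1-based in the example above
    # Fall back to the full name when the requested piece does not exist
    # (assumption; the man page does not say what colmux does here).
    return pieces[index] if 0 <= index < len(pieces) else hostname
```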

       -nobold
              Do not highlight the selected column.  This may be useful when  redirecting  output
              to a file and you do not want the associated escape sequences to be written to it.

       -reverse
              Reverse the default sort order.  You can also change the direction of the sort
              interactively with the UP/DOWN arrow keys IF the ReadKey module is installed (see
              colmux -version) OR simply type the r key and <ENTER>.

       -zero
              Do not display any rows with 0 in the sort column.  You can also type z<ENTER>
              interactively.

       Single-Line Format

       -col1000
              Divide each column by 1000 before display.

       -colk
              Divide each column by 1024 before display.

       -collog10
              Remap large numbers to a smaller number of values by taking the log10 of them and
              further transforming by the following mapping: 0,1 to 0, 10 to 10, 100 to 20, 1000
              to 30, 10000 to 40, ... 1e9 to 90.
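
              The listed mapping is consistent with ten times the base-10 logarithm, which the
              following Python sketch reproduces.  The function name collog10 is used here only
              for illustration, and rounding values that fall between powers of ten to the
              nearest integer is an assumption; the man page only lists the powers of ten.

```python
import math


def collog10(value):
    """Map 0 and 1 to 0, 10 to 10, 100 to 20, ... 1e9 to 90."""
    if value <= 1:
        return 0
    # Rounding of in-between values is an assumption of this sketch.
    return round(10 * math.log10(value))
```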

       -cols num,...
              Group all data together for each host by column number(s).  As  with  -column,  you
              can confirm the correct column(s) have been selected by first running with -test.

       -colnodet
              Do not show data for individual hosts, just display the totals.

       -colnodiv num,...
              Do not divide the specified column numbers by 1000 or 1024 when -col1000 or -colk
              is specified, and do not apply the log10 transformation when -collog10 is
              specified.  A typical usage is if you want to look at cpu loads as well as network
              or disk stats, in which case you may want to divide the latter by 1024 but not the
              cpu.
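
              The interaction between a divisor switch and -colnodiv can be sketched in Python
              as follows.  The function name scale_columns and its dictionary input are
              hypothetical, and how colmux rounds or formats the divided values is not covered.

```python
def scale_columns(values, divisor, nodiv=()):
    """Divide each column by `divisor`, except those listed in `nodiv`.

    values: dict of column number -> value
    nodiv:  column numbers left untouched, as with -colnodiv
    """
    return {
        col: value if col in nodiv else value / divisor
        for col, value in values.items()
    }
```

              For instance, scale_columns({2: 35, 6: 2048}, 1024, nodiv={2}) leaves column 2 at
              35 while column 6 becomes 2.0.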

       -colnoinst
              Do not include the instance portion (and surrounding brackets) in totals column
              headers.

       -coltotal
              Include the totals for each column to the right.

       -colwidth
              Set the output columns to this width, typically used in conjunction with -col1000
              or -colk to allow more hosts to fit onto the same line.  It can also be used if the
              host names are too narrow for column headers and you have room to display wider
              names.

       Exception Reporting Specific

       In  single-line format, rather than wait for all hosts to report their data, colmux simply
       reports the last data seen when the time to generate a line of output has come.   In  most
       cases,  these do reflect the most recent data values but in times of load, the data may be
       late getting to colmux and so a previous value may be reported.  If the age of that data
       exceeds a defined number of intervals (the default is currently 2), an exception value of
       -1 will be reported.  At other times kernel/driver bugs have been seen to cause incorrect
       values to be reported as negative numbers, and those values are also reported as -1.  Both
       the age and exception values can be changed with the following switches.

       -age number
              When initially starting up and all hosts have not yet  reported  any  data,  colmux
              will  display  a  -1 to indicate no data has been seen yet.  If during processing a
              host fails to report in -age intervals, the default is 2, colmux will also report a
              -1 indicating the data is stale.

       -negdataval val
              In  some  cases, there could be erroneous data reported as negative numbers (though
              sometimes negative numbers  are  valid).   When  specified,  replace  any  negative
              numbers with this value.

       -nodataval val
              This  switch  allows  you to change the -1 that is normally reported for missing or
              stale data to the specified value, most commonly 0.
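
       The three switches above combine roughly as in the following Python sketch.  It is an
       illustration only: reported_value and its parameter names are hypothetical, and it follows
       the -negdataval description in replacing negative readings only when a replacement value
       has been given.

```python
def reported_value(value, age, max_age=2, nodataval=-1, negdataval=None):
    """Choose what to display for one host in single-line format.

    value: last value seen, or None if the host has not reported yet
    age:   number of intervals since that value arrived
    """
    if value is None or age > max_age:
        return nodataval  # missing or stale data
    if value < 0 and negdataval is not None:
        return negdataval  # negative reading replaced on request
    return value
```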

       Diagnostics

       The following switches are intended more for diagnostic purposes  than  normal  operation,
       though are also worth using on appropriate occasions.

       -debug val
              This switch is for generating diagnostic information at various levels.  It is
              actually a bit mask, whose values are listed at the beginning of colmux itself.
              Perhaps the most useful value is 1, as it will cause colmux to display all the
              remote commands issued to each host in the address list and can often reveal
              problems when things don't seem to be working correctly.

       -nocheck
              This switch was initially included in an earlier version when remote host checking
              was causing problems in some cases and by skipping those checks, colmux would run
              more reliably.  While it is felt that as of V3.2.0 these reachability checks are
              now reliable and should not be skipped, this switch has been left in place.

       -quiet
              By default, and when -nocheck is not specified, colmux checks the versions of all
              collectl instances against that of the first node found to be running collectl and
              if different, reports the mismatch.  This switch suppresses that warning.

              When a connection is received from an unexpected address, a warning is also
              reported and the request promptly ignored.  This switch suppresses those messages
              as well.  For more information on problems connecting, see CONNECTION PROBLEMS.

       -reachable
              By default, when a node is found to not be reachable, colmux will remove it from
              its list of hosts and continue execution.  This switch will tell colmux to exit
              unless all hosts are reachable.

       Miscellaneous

       The following switches have descriptions that don't really fit anywhere else:

       -colbin path
              On  rare  occasions, such as testing a patch to collectl in a copy NOT in /usr/bin,
              you may want to tell colmux to use that copy instead of the standard one.  Use this
              switch  to  point to that copy.  Naturally that copy must exist in that location on
              all systems.

       -keepalive secs
              Colmux uses ssh to start collectl on each remote machine and then communications
              between collectl and colmux occur over a socket.  Normally, ssh is configured to
              time out after an interval of inactivity, such as 30 minutes, which means a
              long-running colmux session will begin to lose connections when this interval is
              reached.  By specifying a keepalive interval, you're telling ssh to send a
              periodic keepalive to the other end so that the connection doesn't get dropped.

       -retaddr addr
              Tell  remote  collectls to open a socket on this address instead of the preselected
              one.  For more details on this, see CONNECTION PROBLEMS.

       -timeout secs
              By default, colmux waits up to 10 seconds for remote instances of collectl to
              connect back.  On slower networks or when a very large number of instances have
              been started, they may fail to connect back in time.  This switch will extend that
              timeout, but it also requires that collectl V3.6.4 be used because earlier versions
              do not support this feature.

       -timerange secs
              When colmux starts up and checks the connectivity to all the machines specified by
              -addr, it also gets their current date/time and using that computes the range of
              system times across all nodes.  If that range is found to be more than -timerange
              seconds, colmux generates a warning as this difference could cause reporting
              problems.  One can increase the range to get rid of the message (not recommended
              unless other factors are preventing nodes from responding quickly enough to the
              date command) OR suppress the warning with -quiet.
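
              The range check described above can be pictured with this Python sketch.  The
              function names and the dictionary of per-node times are hypothetical, and no
              default for -timerange is assumed since none is stated here.

```python
def clock_spread(node_times):
    """Return the range of system times across nodes, in seconds.

    node_times: dict of hostname -> epoch seconds reported by that node
    """
    return max(node_times.values()) - min(node_times.values())


def exceeds_timerange(node_times, timerange):
    """True when the spread is larger than the allowed -timerange value."""
    return clock_spread(node_times) > timerange
```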

PLAYBACK MODE RESTRICTIONS

       All logs being played back must have been collected using the same interval as colmux only
       looks at the first file/host to determine the appropriate value.

       It is assumed all clocks are reasonably well synchronized as colmux uses time to determine
       which data is to be displayed as a set.

       All files must be in the same directory on all systems and that directory must be included
       in the playback file specification.

       All files on a remote host must be for that host only.

EXAMPLES

       Run collectl on 3 nodes, showing CPU, Disk and Network statistics once a second and sorted
       by column 1, which happens to be total cpu.

       colmux -addr abc,def,xyz

       Dynamically display top processes on nodes n1-n10 of a cluster once a  second,  sorted  by
       column 5.

       colmux -addr n[1-10] -command "-sZ -i:1" -column 5

       Do  the  same for yesterday, between the hours of 5AM and 6AM, being sure to stall for 1/2
       second between intervals.  Note, if you leave off -addr you could put all  the  logs  into
       /var/log/collectl on the local host and play them back from there.

       colmux -addr n[1-10] -command "-sZ -p/var/log/collectl/YESTERDAY --from 05:00-06:00" -column 5 -delay .5

       Look at the amount of mapped and slab memory consumed on nodes n1-n10  and  n15  in  real-
       time, every 2 seconds using single-line format.  Include totals and preface each line with
       the time.  Since memory sizes tend to be rather large, divide each by 1024 so  we  see  MB
       rather than KB.  Note that the columns are always displayed in ascending order of column
       number, regardless of their order in -cols.  To be sure, first test the column numbers.

       colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk -test
       colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk

       Display most active disks, based on KB written, on nodes n1, n4 and n5.

       colmux -addr n1,n4,n5 -command "-sD" -column 6

       Here is a cool trick.  Collectl currently lets you look at top processes  with  the  --top
       switch  and  even choose a sort column by name.  However, if you want to change the column
       you need to exit, then rerun collectl with a different sort column name.  But if  you  run
       it  like  this example, you get the power of colmux to dynamically change the sort columns
       with the arrow keys!  You can also use this technique to have  collectl  dynamically  sort
       any  local  multi-line  data  such as slabs or even detail data like CPU, Disk, Lustre and
       Networks too!  Naturally this technique works just as well when playing back data.

       colmux -command "-sZ -i:1"

RESTRICTIONS

       colmux requires passwordless ssh between the node it is running on and those it is
       monitoring.  Also be sure the port you are using for communications (the default is 2655)
       is open.

CONNECTION PROBLEMS

       The way colmux works is to choose an address it wants to communicate over and start up
       one or more remote copies of collectl, telling them to connect back to colmux using that
       address.  The easiest way to see this is to run colmux with -noesc, which tells it NOT to
       issue any escape sequences and therefore not to run in full screen mode.  The additional
       switch of -debug 1 tells it to show the remote collectl startup command.  When there is a
       communications problem you will typically see 'connection timed out' messages displayed.

       There are actually a couple of possibilities here.  One is that a firewall is preventing
       connections, and the easiest way to test this is to run collectl on the local machine like
       this: collectl -Aserver.  This tells collectl to run as a server, listening for
       connections just like colmux.  Then log into a remote machine and run
       /usr/share/collectl/util/client.pl addr-of-server, which tells client.pl to open a socket
       to that copy of collectl.  It should fail just like when it was run via colmux, so try
       opening the firewall and try it again.  If that fixes the problem, it was indeed the
       firewall blocking things and colmux should now work just fine.

       Sometimes  there are multiple interfaces defined on the machine hosting colmux and in some
       cases only some addresses will allow socket connections.  Again, using  client.pl  on  the
       remote  machine try connecting back to collectl over different addresses and when you find
       one that works, tell colmux to use that address for communication via the -retaddr switch.

AUTHOR

       This program was written by Mark Seger (mjseger@gmail.com).
       Copyright 2016 Hewlett-Packard Development Company, L.P.

SEE ALSO

       http://collectl-utils.sourceforge.net/colmux.html