oracular (7) mawk-arrays.7.gz

Provided by: mawk_1.3.4.20240622-2_amd64 bug

NAME

       mawk-arrays - design notes for mawk's array implementation

SYNOPSIS

       This  is  the  documentation for the mawk implementation of awk arrays.  Arrays in awk are
       associations of strings  to  awk  scalar  values.   The  mawk  implementation  stores  the
       associations  in  hash  tables.  The hash table scheme was influenced by and is similar to
       the design presented in Griswold and Townsend, The Design and  Implementation  of  Dynamic
       Hashing Sets and Tables in Icon, Software Practice and Experience, 23, 351-367, 1993.

DATA STRUCTURES

   Array Structure
       The  type  ARRAY is a pointer to a struct array.  The size field is the number of elements
       in the table.  The meaning of the other fields depends on the type field.

       There are three types of arrays and these are distinguished  by  the  type  field  in  the
       structure.  The types are:

       AY_NULL
            The  array  is  empty  and  the  size field is always zero.  The other fields have no
            meaning.

       AY_SPLIT
            The array was created by the AWK built-in split.  The  return  value  from  split  is
            stored  in the size field.  The ptr field points at a vector of CELLs.  The number of
            CELLs is the limit field.  It is always true that sizelimit.  The address of  A[i]
            is (CELL*)A->ptr+i-1 for 1≤ isize.  The hmask field has no meaning.

       Hash Table
            The  array  is  a  hash  table.  If the AY_STR bit in the type field is set, then the
            table is keyed on strings.  If the AY_INT bit in the type  field  is  set,  then  the
            table  is  keyed  on  integers.   Both  bits  can  be  set, and then the two keys are
            consistent, i.e., look up of A[-14] and A["-14"] will return identical CELL  pointers
            although  the look up methods will be different.  In this case, the size field is the
            number of hash nodes in the table.  When insertion of a new element would cause  size
            to  exceed  limit,  the  table  grows  by  doubling  the  number of hash chains.  The
            invariant, (hmask+1)max_ave_list_length=limit is always true.  Max_ave_list_length is
            a tunable constant.

   Hash Tables
       The  hash tables are linked lists of nodes, called ANODEs.  The number of lists is hmask+1
       which is always a power of two.  The ptr field points at a vector of  list  heads.   Since
       there  are potentially two types of lists, integer lists and strings lists, each list head
       is a structure, DUAL_LINK.

       The string lists are chains connected by slinks and the integer lists are chains connected
       by  ilinks.   We  sometimes  refer to these lists as slists and ilists, respectively.  The
       elements on the lists are ANODEs.  The fields of an ANODE are:

       slink The link field for slists.  ilink The link field for ilists.  sval If non-null, then
       sval is a pointer to a string key.  For a given table, if the AY_STR bit is set then every
       ANODE has a non-null sval field and conversely, if AY_STR is  not  set,  then  every  sval
       field is null.

       hval The hash value of sval.  This field has no meaning if sval is null.

       ival The integer key.  The field has no meaning if set to the constant, NOT_AN_IVALUE.  If
       the AY_STR bit is off, then every ANODE will have a valid ival field.  If the  AY_STR  bit
       is on, then the ival field may or may not be valid.

       cell The data field in the hash table.  \ndhitems

       So  the  value of A[expr is stored in the cell field, and if expr is an integer, then expr
       is stored in ival, else it is stored in sval.

ARRAY OPERATIONS

       The functions that operate on arrays are,

       CELL* array_find(ARRAY A, CELL *cp, int create_flag)
            returns a pointer to A[expr] where cp is a pointer to the CELL holding expr.  If  the
            create_flag  is  on and expr is not an element of A, then the element is created with
            value null.

       void array_delete(ARRAY A, CELL *cp)
            removes an element A[expr from the array A.  cp points at the CELL holding expr.

       void array_load(ARRAY A, size_t cnt)
            builds a split array.  The values A[1..cnt] are moved into A from an anonymous buffer
            with transfer_to_array() which is declared in split.h.

       void array_clear(ARRAY A) removes all elements of A.
            The type of A is then AY_NULL.

       STRING** array_loop_vector(ARRAY A, size_t *sizep)
            returns  a  pointer to a linear vector that holds all the strings that are indices of
            A.  The size of the the vector is returned indirectly in  *sizep.   If  A->size0,  a
            null pointer is returned.

       CELL* array_cat(CELL *sp, int cnt)
            concatenates  the elements of sp[1-cnt..0], with each element separated by SUBSEP, to
            compute an array index.  For example, on a reference to A[i,j], array_cat computes  iSUBSEP  j where  denotes concatenation.

   Array Find
       Any  reference to A[expr] creates a call to array_find(A,cp,CREATE) where cp points at the
       cell holding expr.  The test, expr in A, creates a call to array_find(A,cp,NO_CREATE).

       Array_find is a hash-table lookup function that handles two cases:

       1.   If  *cp  is  numeric  and  integer  valued,  then  lookup  by  integer  value   using
            find_by_ival.  If *cp is numeric, but not integer valued, then convert to string with
            sprintf(CONVFMT,...) and go to case~2.

       2.   If *cp is string valued, then lookup by string value using find_by_sval.  \ndlist

       To test whether cp->dval is integer, we convert to the nearest integer by rounding towards
       zero (done by do_to_I) and then cast back to double.  If we get the same number we started
       with, then cp->dval is integer valued.

       When we get to the function find_by_ival, the search has been reduced to lookup in a  hash
       table by integer value.

       When a search by integer value fails, we have to check by string value to correctly handle
       the case insertion by A["123"]  and  later  search  as  A[123].   This  string  search  is
       necessary if and only if the AY_STR bit is set.  An important point is that all ANODEs get
       created with a valid sval if AY_STR is set, because then  creation  of  new  nodes  always
       occurs in a call to find_by_sval.

       Searching by string value is easier because AWK arrays are really string associations.  If
       the array does not have the AY_STR bit set, then we have to convert the array  to  a  dual
       hash table with strings which is done by the function add_string_associations.

       One  Int  value  is  reserved  to show that the ival field is invalid.  This works because
       d_to_I returns a value in [-Max_Int, Max_Int].

       On entry to add_string_associations, we know that the AY_STR bit is not set.   We  convert
       to a dual hash table, then walk all the integer lists and put each ANODE on a string list.

   Array Delete
       The  execution  of  the statement, delete A[expr], creates a call to array_delete(ARRAY A,
       CELL *cp).  Depending on  the  type  of  *cp,  the  call  is  routed  to  find_by_sval  or
       find_by_ival.  Each of these functions leaves its return value on the front of an slist or
       ilist, respectively, and then it is deleted from the front of the list.   The  case  where
       A[expr  is on two lists, e.g., A[12] and A["12"] is checked by examining the sval and ival
       fields of the returned ANODE*.

       Even though we found a node by searching an ilist it might also be on an slist  and  vice-
       versa.

       When  the  size  of  a  hash  table drops below a certain value, it might be profitable to
       shrink the hash table.  Currently we don't do this, because our guess is that it would  be
       a  waste  of  time  for most AWK applications.  However, we do convert an array to AY_NULL
       when the size goes to zero which would resize a large hash table that had been  completely
       cleared by successive deletions.

   Building an Array with Split
       A  simple  operation  is  to  create an array with the AWK primitive split.  The code that
       performs split puts the pieces in an anonymous buffer.  array_load(A, cnt) moves  the  cnt
       elements from the anonymous buffer into A.  This is the only way an array of type AY_SPLIT
       is created.

       If the array A is a split array and big enough then we reuse  it,  otherwise  we  need  to
       allocate a new split array.  When we allocate a block of CELLs for a split array, we round
       up to a multiple of 4.

   Array Clear
       The function array_clear(ARRAY A) converts A to type AY_NULL and frees all storage used by
       A except for the struct array itself.  This function gets called in three contexts:

       (1)  when an array local to a user function goes out of scope,

       (2)  execution of the AWK statement, delete A and

       (3)  when an existing changes type or size from split().

   Constructor and Conversions
       Arrays  are  always  created  as  empty  arrays  of type AY_NULL.  Global arrays are never
       destroyed although they can go empty or have their type change by  conversion.   The  only
       constructor function is a macro.

       Hash  tables  only get constructed by conversion.  This happens in two ways.  The function
       make_empty_table converts an empty array of type AY_NULL to  an  empty  hash  table.   The
       number  of  lists  in the table is a power of 2 determined by the constant STARTING_HMASK.
       The limit size of the table is determined by the constant MAX_AVE_LIST_LENGTH which is the
       largest  average  size  of the hash lists that we are willing to tolerate before enlarging
       the table.  When A->size exceeds A->limit, the hash table grows in size  by  doubling  the
       number of lists.  A->limit is then reset to MAX_AVE_LIST_LENGTH times A->hmask+1.

       The  other  way a hash table gets constructed is when a split array is converted to a hash
       table of type AY_INT.

       To determine the size of the table, we set the initial size to STARTING_HMASK+1  and  then
       double the size until A->sizeA->limit.

   Doubling the Size of a Hash Table
       The  whole  point  of  making  the table size a power of two is to facilitate resizing the
       table.  If the table size is 2**(n) and h is the hash key, then h mod 2**(n) is  the  hash
       chain  index  which  can  be calculated with bit-wise and, h & (2**(n-1)).  When the table
       size doubles, the new bit-mask has one more bit turned on.  Elements of an old hash  chain
       whose hash value have this bit turned on get moved to a new chain.  Elements with this bit
       turned off stay on the same chain.  On average only half the old chain moves  to  the  new
       chain.   If the old chain is at table[i], 0 ≤ i < 2**(n), then the elements that move, all
       move to the new chain at table[i + 2**(n)].

       As we walk an old string list with pointer p, the expression p->hval & new_hmask takes one
       of  two  values.   If  it  is equal to p->hval & old_hmask (which equals i), then the node
       stays otherwise it gets moved to a new string list at j.  The new  string  list  preserves
       order  so  that  the  positions  of  the move-to-the-front heuristic are preserved.  Nodes
       moving to the new list are appended at pointer tail.  The ANODEs, dummy0~and  dummy1,  are
       sentinels that remove special handling of boundary conditions.

       The  doubling  of  the  integer lists is exactly the same except that slink is replaced by
       ilink and hval is replaced by ival.

   Array Loops
       Our mechanism for dealing with execution of the statement,

              for (i in A) { statements }

       is simple.  We allocate a vector of STRING* of size, A->size.  Each element of the  vector
       is  a  string  key  for~A.   Note that if the AY_STR bit of A is not set, then A has to be
       converted to a string hash table, because the index i walks string indices.

       To execute the loop, the only state that needs to be saved is the  address  of  i  and  an
       index  into  the vector of string keys.  Since nothing about A is saved as state, the user
       program can do anything to A inside the body of the loop, even  delete  A,  and  the  loop
       still  works.   Essentially, we have traded data space (the string vector) in exchange for
       implementation simplicity.  On a 32-bit system, each ANODE  is  36  bytes,  so  the  extra
       memory  needed for the array loop is 11 more than the memory consumed by the ANODEs of the
       array.  Note that the large size of the ANODEs is indicative of  our  whole  design  which
       pays data space for integer lookup speed and algorithm simplicity.

       The  only  aspect  of  array  loops  that  occurs in array.c is construction of the string
       vector.  The rest of the implementation is in the file execute.c.

       As we walk over the hash table ANODEs, putting each sval in ret, we need to increment each
       reference  count.   The  user  of  the return value is responsible for these new reference
       counts.

   Concatenating Array Indices
       In AWK, an array expression A[i,j] is equivalent to the expression A[i  SUBSEP  j],  i.e.,
       the  index  is the concatenation of the three elements i, SUBSEP and j.  This is performed
       by the function array_cat.  On entry, sp points at the top of a stack of CELLs.  Cnt cells
       are  popped  off the stack and concatenated together separated by SUBSEP and the result is
       pushed back on the stack.  On entry, the first multi-index is in sp[1-cnt] and the last is
       in  sp[0].   The return value is the new stack top.  (The stack is the run-time evaluation
       stack.  This operation really has nothing to do with array structure,  so  logically  this
       code belongs in execute.c, but remains here for historical reasons.)

       We  make  a  copy of SUBSEP which we can cast to string in the unlikely event the user has
       assigned a number to SUBSEP.

       Set sp and top so the cells to concatenate are inclusively between sp and top.

       The total_len is the sum of the lengths of the cnt strings and the cnt-1 copies of subsep.

       The return value is sp and it is already set correctly.  We just need to free the  strings
       and set the contents of sp.