
awk - pattern scanning and processing language
awk [-F ERE][-v assignment] ... program [argument ...]
awk [-F ERE] -f progfile ... [-v assignment] ...[argument ...]
The awk utility shall execute programs written in the awk programming
language, which is specialized for textual data manipulation. An awk
program is a sequence of patterns and corresponding actions. When input
is read that matches a pattern, the action associated with that pattern
is carried out.
Input shall be interpreted as a sequence of records. By default, a
record is a line, less its terminating <newline>, but this can be
changed by using the RS built-in variable. Each record of input shall
be matched in turn against each pattern in the program. For each
pattern matched, the associated action shall be executed.
The awk utility shall interpret each input record as a sequence of
fields where, by default, a field is a string of non-<blank>s. This
default white-space field delimiter can be changed by using the FS
built-in variable or -F ERE. The awk utility shall denote the first
field in a record $1, the second $2, and so on. The symbol $0 shall
refer to the entire record; setting any other field causes the
re-evaluation of $0. Assigning to $0 shall reset the values of all other
fields and the NF built-in variable.
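As an informal illustration of the field rules above (a sketch only, with
made-up sample input), the following shell commands split a record with
-F, assign to a field so that $0 is re-evaluated, and assign to $0 so
that NF is reset. The variable names first, rebuilt, and count are
arbitrary.

```shell
# Split a colon-separated record and print the first field.
first=$(printf 'alice:x:1000\n' | awk -F: '{ print $1 }')

# Assigning to $2 causes $0 to be rebuilt; the fields are joined with OFS,
# which is a single space by default.
rebuilt=$(printf 'alice:x:1000\n' | awk -F: '{ $2 = "hidden"; print $0 }')

# Assigning to $0 re-splits the record and resets NF.
count=$(printf 'one two  three\n' | awk '{ $0 = "a b"; print NF }')

echo "$first"    # alice
echo "$rebuilt"  # alice hidden 1000
echo "$count"    # 2
```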
The awk utility shall conform to the Base Definitions volume of
IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
The following options shall be supported:
-F ERE
Define the input field separator to be the extended regular
expression ERE, before any input is read; see Regular
Expressions.
-f progfile
Specify the pathname of the file progfile containing an awk
program. If multiple instances of this option are specified, the
concatenation of the files specified as progfile in the order
specified shall be the awk program. The awk program can
alternatively be specified in the command line as a single
argument.
-v assignment
The application shall ensure that the assignment argument is in
the same form as an assignment operand. The specified variable
assignment shall occur prior to executing the awk program,
including the actions associated with BEGIN patterns (if any).
Multiple occurrences of this option can be specified.
The following operands shall be supported:
program
If no -f option is specified, the first operand to awk shall be
the text of the awk program. The application shall supply the
program operand as a single argument to awk. If the text does
not end in a <newline>, awk shall interpret the text as if it
did.
argument
Either of the following two types of argument can be intermixed:
file
A pathname of a file that contains the input to be read, which
is matched against the set of patterns in the program. If no
file operands are specified, or if a file operand is '-', the
standard input shall be used.
assignment
An operand that begins with an underscore or alphabetic
character from the portable character set (see the table in the
Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1,
Portable Character Set), followed by a sequence of underscores,
digits, and alphabetics from the portable character set,
followed by the '=' character, shall specify a variable
assignment rather than a pathname. The characters before the '='
represent the name of an awk variable; if that name is an awk
reserved word (see Grammar) the behavior is undefined. The
characters following the equal sign shall be interpreted as if
they appeared in the awk program preceded and followed by a
double-quote ('"') character, as a STRING token (see Grammar),
except that if the last character is an unescaped backslash, it
shall be interpreted as a literal backslash rather than as the
first character of the sequence "\"". The variable shall be
assigned the value of that STRING token and, if appropriate,
shall be considered a numeric string (see Expressions in awk),
in which case the variable shall also be assigned its numeric
value. Each such
variable assignment shall occur just prior to the processing of
the following file, if any. Thus, an assignment before the first
file argument shall be executed after the BEGIN actions (if
any), while an assignment after the last file argument shall
occur before the END actions (if any). If there are no file
arguments, assignments shall be executed before processing the
standard input.
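The timing difference between -v assignments and assignment operands can
be sketched as follows. This is an illustration only; the variable names
a, b, and c and the one-line input are invented for the example.

```shell
# a is set with -v, so it is visible inside BEGIN.
# b=2 is an operand before the file operand "-", so it is set after BEGIN
# but before standard input is read.
# c=3 follows the last file operand, so it is set before the END actions.
timing=$(printf 'x\n' | awk -v a=1 '
BEGIN { print "BEGIN:", a, "[" b "]" }
      { print "main:", a, b }
END   { print "END:", c }
' b=2 - c=3)

echo "$timing"
# BEGIN: 1 []
# main: 1 2
# END: 3
```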
The standard input shall be used only if no file operands are
specified, or if a file operand is '-'; see the INPUT FILES section.
If the awk program contains no actions and no patterns, but is
otherwise a valid awk program, standard input and any file operands
shall not be read and awk shall exit with a return status of zero.
Input files to the awk program from any of the following sources shall
be text files:
* Any file operands or their equivalents, achieved by modifying the
awk variables ARGV and ARGC
* Standard input in the absence of any file operands
* Arguments to the getline function
Whether the variable RS is set to a value other than a <newline> or
not, for these files, implementations shall support records terminated
with the specified separator up to {LINE_MAX} bytes and may support
longer records.
If -f progfile is specified, the application shall ensure that the
files named by each of the progfile option-arguments are text files and
their concatenation, in the same order as they appear in the arguments,
is an awk program.
The following environment variables shall affect the execution of awk:
LANG Provide a default value for the internationalization variables
that are unset or null. (See the Base Definitions volume of
IEEE Std 1003.1-2001, Section 8.2, Internationalization
Variables for the precedence of internationalization variables
used to determine the values of locale categories.)
LC_ALL If set to a non-empty string value, override the values of all
the other internationalization variables.
LC_COLLATE
Determine the locale for the behavior of ranges, equivalence
classes, and multi-character collating elements within regular
expressions and in comparisons of string values.
LC_CTYPE
Determine the locale for the interpretation of sequences of
bytes of text data as characters (for example, single-byte as
opposed to multi-byte characters in arguments and input files),
the behavior of character classes within regular expressions,
the identification of characters as letters, and the mapping of
uppercase and lowercase characters for the toupper and tolower
functions.
LC_MESSAGES
Determine the locale that should be used to affect the format
and contents of diagnostic messages written to standard error.
LC_NUMERIC
Determine the radix character used when interpreting numeric
input, performing conversions between numeric and string values,
and formatting numeric output. Regardless of locale, the period
character (the decimal-point character of the POSIX locale) is
the decimal-point character recognized in processing awk
programs (including assignments in command line arguments).
NLSPATH
Determine the location of message catalogs for the processing of
LC_MESSAGES.
PATH Determine the search path when looking for commands executed by
system(expr), or input and output pipes; see the Base
Definitions volume of IEEE Std 1003.1-2001, Chapter 8,
Environment Variables.
In addition, all environment variables shall be visible via the awk
variable ENVIRON.
Default.
The nature of the output files depends on the awk program.
The standard error shall be used only for diagnostic messages.
Overall Program Structure
An awk program is composed of pairs of the form:
pattern { action }
Either the pattern or the action (including the enclosing brace
characters) can be omitted.
A missing pattern shall match any record of input, and a missing action
shall be equivalent to:
{ print }
Execution of the awk program shall start by first executing the actions
associated with all BEGIN patterns in the order they occur in the
program. Then each file operand (or standard input if no files were
specified) shall be processed in turn by reading data from the file
until a record separator is seen ( <newline> by default). Before the
first reference to a field in the record is evaluated, the record shall
be split into fields, according to the rules in Regular Expressions,
using the value of FS that was current at the time the record was read.
Each pattern in the program then shall be evaluated in the order of
occurrence, and the action associated with each pattern that matches
the current record executed. The action for a matching pattern shall be
executed before evaluating subsequent patterns. Finally, the actions
associated with all END patterns shall be executed in the order they
occur in the program.
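The execution order described above can be sketched with a small program
that uses each form of pattern-action pair. The input data and the
variable names total, n, and summary are invented for the illustration.

```shell
summary=$(printf 'apple 3\nbanana 0\ncherry 7\n' | awk '
BEGIN   { print "start" }       # executed before any input is read
$2 > 0  { total += $2 }         # pattern with an action
/an/                            # pattern only: equivalent to { print }
        { n++ }                 # action only: matches every record
END     { print n, total }      # executed after all input is read
')

echo "$summary"
# start
# banana 0
# 3 10
```

Note that a pattern and its action must start on the same line; the
action on the line after /an/ is therefore a separate pair with a missing
pattern.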
Expressions in awk
Expressions describe computations used in patterns and actions. In the
following table, valid expression operations are given in groups from
highest precedence first to lowest precedence last, with equal-
precedence operators grouped between horizontal lines. In expression
evaluation, where the grammar is formally ambiguous, higher precedence
operators shall be evaluated before lower precedence operators. In this
table expr, expr1, expr2, and expr3 represent any expression, while
lvalue represents any entity that can be assigned to (that is, on the
left side of an assignment operator). The precise syntax of expressions
is given in Grammar.
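Table: Expressions in Decreasing Precedence in awk

Syntax                 Name                      Type of Result   Associativity
-------------------------------------------------------------------------------
( expr )               Grouping                  Type of expr     N/A
-------------------------------------------------------------------------------
$expr                  Field reference           String           N/A
-------------------------------------------------------------------------------
++ lvalue              Pre-increment             Numeric          N/A
-- lvalue              Pre-decrement             Numeric          N/A
lvalue ++              Post-increment            Numeric          N/A
lvalue --              Post-decrement            Numeric          N/A
-------------------------------------------------------------------------------
expr ^ expr            Exponentiation            Numeric          Right
-------------------------------------------------------------------------------
! expr                 Logical not               Numeric          N/A
+ expr                 Unary plus                Numeric          N/A
- expr                 Unary minus               Numeric          N/A
-------------------------------------------------------------------------------
expr * expr            Multiplication            Numeric          Left
expr / expr            Division                  Numeric          Left
expr % expr            Modulus                   Numeric          Left
-------------------------------------------------------------------------------
expr + expr            Addition                  Numeric          Left
expr - expr            Subtraction               Numeric          Left
-------------------------------------------------------------------------------
expr expr              String concatenation      String           Left
-------------------------------------------------------------------------------
expr < expr            Less than                 Numeric          None
expr <= expr           Less than or equal to     Numeric          None
expr != expr           Not equal to              Numeric          None
expr == expr           Equal to                  Numeric          None
expr > expr            Greater than              Numeric          None
expr >= expr           Greater than or equal to  Numeric          None
-------------------------------------------------------------------------------
expr ~ expr            ERE match                 Numeric          None
expr !~ expr           ERE non-match             Numeric          None
-------------------------------------------------------------------------------
expr in array          Array membership          Numeric          Left
( index ) in array     Multi-dimension array     Numeric          Left
                       membership
-------------------------------------------------------------------------------
expr && expr           Logical AND               Numeric          Left
-------------------------------------------------------------------------------
expr || expr           Logical OR                Numeric          Left
-------------------------------------------------------------------------------
expr1 ? expr2 : expr3  Conditional expression    Type of selected Right
                                                 expr2 or expr3
-------------------------------------------------------------------------------
lvalue ^= expr         Exponentiation assignment Numeric          Right
lvalue %= expr         Modulus assignment        Numeric          Right
lvalue *= expr         Multiplication assignment Numeric          Right
lvalue /= expr         Division assignment       Numeric          Right
lvalue += expr         Addition assignment       Numeric          Right
lvalue -= expr         Subtraction assignment    Numeric          Right
lvalue = expr          Assignment                Type of expr     Right
-------------------------------------------------------------------------------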
Each expression shall have either a string value, a numeric value, or
both. Except as stated for specific contexts, the value of an
expression shall be implicitly converted to the type needed for the
context in which it is used. A string value shall be converted to a
numeric value by the equivalent of the following calls to functions
defined by the ISO C standard:
setlocale(LC_NUMERIC, "");
numeric_value = atof(string_value);
A numeric value that is exactly equal to the value of an integer (see
Concepts Derived from the ISO C Standard) shall be converted to a
string by the equivalent of a call to the sprintf function (see String
Functions) with the string "%d" as the fmt argument and the numeric
value being converted as the first and only expr argument. Any other
numeric value shall be converted to a string by the equivalent of a
call to the sprintf function with the value of the variable CONVFMT as
the fmt argument and the numeric value being converted as the first and
only expr argument. The result of the conversion is unspecified if the
value of CONVFMT is not a floating-point format specification. This
volume of IEEE Std 1003.1-2001 specifies no explicit conversions
between numbers and strings. An application can force an expression to
be treated as a number by adding zero to it, or can force it to be
treated as a string by concatenating the null string ( "" ) to it.
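The two forcing idioms and the role of CONVFMT can be illustrated with
the sketch below. It assumes the POSIX locale (set here with LC_ALL=C) so
that the radix character is a period; the variable names are arbitrary.

```shell
conv=$(LC_ALL=C awk 'BEGIN {
    x = "3.0"            # a string constant
    print (x + 0)        # adding zero forces a numeric value: 3
    y = 0.1 + 0.2
    s = y ""             # concatenating "" converts through CONVFMT ("%.6g")
    print s              # 0.3
    CONVFMT = "%.2f"
    z = 3.14159
    print (z "")         # 3.14
    w = 17
    print (w "")         # integer values convert as with "%d": 17
}')

echo "$conv"
```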
A string value shall be considered a numeric string if it comes from
one of the following:
1. Field variables
2. Input from the getline() function
3. FILENAME
4. ARGV array elements
5. ENVIRON array elements
6. Array elements created by the split() function
7. A command line variable assignment
8. Variable assignment from another numeric string variable
and after all the following conversions have been applied, the
resulting string would lexically be recognized as a NUMBER token as
described by the lexical conventions in Grammar:
* All leading and trailing <blank>s are discarded.
* If the first non-<blank> is '+' or '-', it is discarded.
* Each occurrence of the decimal point character from the current
locale is changed to a period.
If a '-' character is ignored in the preceding description, the numeric
value of the numeric string shall be the negation of the numeric value
of the recognized NUMBER token. Otherwise, the numeric value of the
numeric string shall be the numeric value of the recognized NUMBER
token. Whether or not a string is a numeric string shall be relevant
only in contexts where that term is used in this section.
When an expression is used in a Boolean context, if it has a numeric
value, a value of zero shall be treated as false and any other value
shall be treated as true. Otherwise, a string value of the null string
shall be treated as false and any other value shall be treated as true.
A Boolean context shall be one of the following:
* The first subexpression of a conditional expression
* An expression operated on by logical NOT, logical AND, or logical OR
* The second expression of a for statement
* The expression of an if statement
* The expression of the while clause in either a while or do...while
statement
* An expression used as a pattern (as in Overall Program Structure)
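The truth rules above can be checked informally with conditional
expressions, whose first subexpression is a Boolean context. This is a
sketch; the name unset stands for any variable that is never assigned.

```shell
bools=$(awk 'BEGIN {
    print (0      ? "true" : "false")   # numeric zero: false
    print (0.0001 ? "true" : "false")   # any other numeric value: true
    print (""     ? "true" : "false")   # null string: false
    print ("0"    ? "true" : "false")   # non-null string constant: true
    print (unset  ? "true" : "false")   # uninitialized value: false
}')

echo "$bools"
```

The string constant "0" is true because a string constant has only a
string value, and that value is not the null string.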
All arithmetic shall follow the semantics of floating-point arithmetic
as specified by the ISO C standard (see Concepts Derived from the ISO C
Standard).
The value of the expression:
expr1 ^ expr2
shall be equivalent to the value returned by the ISO C standard
function call:
pow(expr1, expr2)
The expression:
lvalue ^= expr
shall be equivalent to the ISO C standard expression:
lvalue = pow(lvalue, expr)
except that lvalue shall be evaluated only once. The value of the
expression:
expr1 % expr2
shall be equivalent to the value returned by the ISO C standard
function call:
fmod(expr1, expr2)
The expression:
lvalue %= expr
shall be equivalent to the ISO C standard expression:
lvalue = fmod(lvalue, expr)
except that lvalue shall be evaluated only once.
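A brief sketch of the exponentiation and modulus operators follows. It
assumes the POSIX locale (LC_ALL=C) for output formatting; the comments
give the results expected from the pow() and fmod() equivalences.

```shell
arith=$(LC_ALL=C awk 'BEGIN {
    print 2 ^ 10          # pow(2, 10): 1024
    print 2 ^ 3 ^ 2       # right associative, 2 ^ (3 ^ 2): 512
    print -7 % 3          # fmod keeps the sign of the dividend: -1
    print 7.5 % 2         # fmod accepts non-integers: 1.5
    x = 5; x ^= 2
    print x               # 25
}')

echo "$arith"
```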
Variables and fields shall be set by the assignment statement:
lvalue = expression
and the type of expression shall determine the resulting variable type.
The assignment includes the arithmetic assignments ("+=", "-=", "*=",
"/=", "%=", "^=", "++", "--"), all of which shall produce a
numeric result. The left-hand side of an assignment and the target of
increment and decrement operators can be one of a variable, an array
with index, or a field selector.
The awk language supplies arrays that are used for storing numbers or
strings. Arrays need not be declared. They shall initially be empty,
and their sizes shall change dynamically. The subscripts, or element
identifiers, are strings, providing a type of associative array
capability. An array name followed by a subscript within square
brackets can be used as an lvalue and thus as an expression, as
described in the grammar; see Grammar. Unsubscripted array names can
be used in only the following contexts:
* A parameter in a function definition or function call
* The NAME token following any use of the keyword in as specified in
the grammar (see Grammar); if the name used in this context is not
an array name, the behavior is undefined
A valid array index shall consist of one or more comma-separated
expressions, similar to the way in which multi-dimensional arrays are
indexed in some programming languages. Because awk arrays are really
one-dimensional, such a comma-separated list shall be converted to a
single string by concatenating the string values of the separate
expressions, each separated from the other by the value of the SUBSEP
variable. Thus, the following two index operations shall be
equivalent:
var[expr1, expr2, ... exprn]
var[expr1 SUBSEP expr2 SUBSEP ... SUBSEP exprn]
The application shall ensure that a multi-dimensioned index used with
the in operator is parenthesized. The in operator, which tests for the
existence of a particular array element, shall not cause that element
to exist. Any other reference to a nonexistent array element shall
automatically create it.
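The equivalence of comma-separated and SUBSEP-joined indices, and the
difference between the in operator and an ordinary reference, can be
sketched as follows. The array names grid and seen and the variable names
key and dummy are invented for the example.

```shell
arr=$(awk 'BEGIN {
    grid[1, 2] = "a"                 # stored under the key 1 SUBSEP 2
    key = 1 SUBSEP 2
    if ((1, 2) in grid) print "comma index found"
    if (key in grid)    print "SUBSEP index found"
    if (!("missing" in seen)) print "in did not create the element"
    dummy = seen["missing"]          # any other reference creates it
    if ("missing" in seen) print "reference created the element"
}')

echo "$arr"
```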
Comparisons (with the '<', "<=", "!=", "==", '>', and ">="
operators) shall be made numerically if both operands are numeric, if
one is numeric and the other has a string value that is a numeric
string, or if one is numeric and the other has the uninitialized value.
Otherwise, operands shall be converted to strings as required and a
string comparison shall be made using the locale-specific collation
sequence. The value of the comparison expression shall be 1 if the
relation is true, or 0 if the relation is false.
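The sketch below contrasts numeric and string comparison as commonly
implemented. It assumes the POSIX locale (LC_ALL=C), in which the
character '1' collates before '9'; the input line is invented.

```shell
cmp=$(printf '10 9\n' | LC_ALL=C awk '{
    print ($1 < $2)                 # fields that look numeric: 10 < 9 gives 0
    print (($1 "") < ($2 ""))       # forced to strings: "10" before "9" gives 1
    print ("10" < "9")              # string constants compare as strings: 1
}')

echo "$cmp"
```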
Variables and Special Variables
Variables can be used in an awk program by referencing them. With the
exception of function parameters (see User-Defined Functions), they
are not explicitly declared. Function parameter names shall be local to
the function; all other variable names shall be global. The same name
shall not be used as both a function parameter name and as the name of
a function or a special awk variable. The same name shall not be used
both as a variable name with global scope and as the name of a
function. The same name shall not be used within the same scope both as
a scalar variable and as an array. Uninitialized variables, including
scalar variables, array elements, and field variables, shall have an
uninitialized value. An uninitialized value shall have both a numeric
value of zero and a string value of the empty string. Evaluation of
variables with an uninitialized value, to either string or numeric,
shall be determined by the context in which they are used.
Field variables shall be designated by a '$' followed by a number or
numerical expression. The effect of the field number expression
evaluating to anything other than a non-negative integer is
unspecified; uninitialized variables or string values need not be
converted to numeric values in this context. New field variables can be
created by assigning a value to them. References to nonexistent fields
(that is, fields after $NF), shall evaluate to the uninitialized value.
Such references shall not create new fields. However, assigning to a
nonexistent field (for example, $(NF+2)=5) shall increase the value of
NF; create any intervening fields with the uninitialized value; and
cause the value of $0 to be recomputed, with the fields being separated
by the value of OFS. Each field variable shall have a string value or
an uninitialized value when created. Field variables shall have the
uninitialized value when created from $0 using FS and the variable does
not contain any characters. If appropriate, the field variable shall be
considered a numeric string (see Expressions in awk).
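The difference between referencing and assigning a nonexistent field can
be sketched as follows. OFS is set to '-' only to make the recomputed $0
easy to read; the input and the variable names are invented.

```shell
fields=$(printf 'a b c\n' | awk -v OFS=- '{
    x = $(NF + 5)        # a reference does not create new fields
    print NF             # 3
    $(NF + 2) = "e"      # an assignment raises NF and creates an empty $4
    print NF             # 5
    print $0             # a-b-c--e
}')

echo "$fields"
```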
Implementations shall support the following other special variables
that are set by awk:
ARGC The number of elements in the ARGV array.
ARGV An array of command line arguments, excluding options and the
program argument, numbered from zero to ARGC-1.
The arguments in ARGV can be modified or added to; ARGC can be altered.
As each input file ends, awk shall treat the next non-null element of
ARGV, up to the current value of ARGC-1, inclusive, as the name of the
next input file. Thus, setting an element of ARGV to null means that it
shall not be treated as an input file. The name '-' indicates the
standard input. If an argument matches the format of an assignment
operand, this argument shall be treated as an assignment rather than a
file argument.
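The following sketch lists ARGV and then skips an input file by setting
its ARGV element to null. The names first.txt, n=5, and second.txt are
placeholders that are never opened, because the first program has only a
BEGIN action. The second part assumes that the mktemp utility with the -d
option is available, which this specification does not require.

```shell
# Options and the program text are excluded from ARGV; ARGV[0] (the
# command name) is skipped here because its value varies.
listing=$(awk 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++) print i, ARGV[i]
}' first.txt n=5 second.txt)

# Setting ARGV[1] to the null string removes that file from the input.
tmp=$(mktemp -d)
printf 'A\n' > "$tmp/a"
printf 'B\n' > "$tmp/b"
skipped=$(awk 'BEGIN { ARGV[1] = "" } { print }' "$tmp/a" "$tmp/b")
rm -r "$tmp"

echo "$listing"
echo "$skipped"   # B
```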
CONVFMT
The printf format for converting numbers to strings (except for
output statements, where OFMT is used); "%.6g" by default.
ENVIRON
An array representing the value of the environment, as described
in the exec functions defined in the System Interfaces volume of
IEEE Std 1003.1-2001. The indices of the array shall be strings
consisting of the names of the environment variables, and the
value of each array element shall be a string consisting of the
value of that variable. If appropriate, the environment variable
shall be considered a numeric string (see Expressions in awk);
the array element shall also have its numeric value.
In all cases where the behavior of awk is affected by environment
variables (including the environment of any commands that awk executes
via the system function or via pipeline redirections with the print
statement, the printf statement, or the getline function), the
environment used shall be the environment at the time awk began
executing; it is implementation-defined whether any modification of
ENVIRON affects this environment.
FILENAME
A pathname of the current input file. Inside a BEGIN action the
value is undefined. Inside an END action the value shall be the
name of the last input file processed.
FNR The ordinal number of the current record in the current file.
Inside a BEGIN action the value shall be zero. Inside an END
action the value shall be the number of the last record
processed in the last file processed.
FS Input field separator regular expression; a <space> by default.
NF The number of fields in the current record. Inside a BEGIN
action, the use of NF is undefined unless a getline function
without a var argument is executed previously. Inside an END
action, NF shall retain the value it had for the last record
read, unless a subsequent, redirected, getline function without
a var argument is performed prior to entering the END action.
NR The ordinal number of the current record from the start of
input. Inside a BEGIN action the value shall be zero. Inside an
END action the value shall be the number of the last record
processed.
OFMT The printf format for converting numbers to strings in output
statements (see Output Statements); "%.6g" by default. The
result of the conversion is unspecified if the value of OFMT is
not a floating-point format specification.
OFS The print statement output field separator; a <space> by default.
ORS The print statement output record separator; a <newline> by
default.
RLENGTH
The length of the string matched by the match function.
RS The first character of the string value of RS shall be the input
record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null,
then records are separated by sequences consisting of a
<newline> plus one or more blank lines, leading or trailing
blank lines shall not result in empty records at the beginning
or end of the input, and a <newline> shall always be a field
separator, no matter what the value of FS is.
RSTART The starting position of the string matched by the match
function, numbering from 1. This shall always be equivalent to
the return value of the match function.
SUBSEP The subscript separator string for multi-dimensional arrays; the
default value is implementation-defined.
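The behavior of a null RS described above can be sketched with input made
of blank-line-separated blocks. The sample data is invented; note the
leading and trailing blank lines, which do not produce empty records, and
that <newline> acts as a field separator even though FS keeps its default
value.

```shell
paras=$(printf '\n\nname: a\nsize: 1\n\n\nname: b\nsize: 2\n\n' | awk '
BEGIN { RS = "" }
      { print NR ": " $2 "/" $4 " (" NF " fields)" }
')

echo "$paras"
# 1: a/1 (4 fields)
# 2: b/2 (4 fields)
```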
Regular Expressions
The awk utility shall make use of the extended regular expression
notation (see the Base Definitions volume of IEEE Std 1003.1-2001,
Section 9.4, Extended Regular Expressions) except that it shall allow
the use of C-language conventions for escaping special characters
within the EREs, as specified in the table in the Base Definitions
volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation ('\\',
'\a', '\b', '\f', '\n', '\r', '\t', '\v') and the following
table; these escape sequences shall be recognized both inside and
outside bracket expressions. Note that records need not be separated
by <newline>s and string constants can contain <newline>s, so even the
"\n" sequence is valid in awk EREs. Using a slash character within an
ERE requires the escaping shown in the following table.
Table: Escape Sequences in awk

Escape
Sequence  Description                      Meaning
--------  -------------------------------  -------------------------------
\"        Backslash quotation-mark         Quotation-mark character
\/        Backslash slash                  Slash character
\ddd      A backslash character followed   The character whose encoding
          by the longest sequence of       is represented by the one,
          one, two, or three octal-digit   two, or three-digit octal
          characters (01234567). If all    integer. Multi-byte characters
          of the digits are 0 (that is,    require multiple, concatenated
          representation of the NUL        escape sequences of this type,
          character), the behavior is      including the leading '\' for
          undefined.                       each byte.
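The escape sequences in the table can be illustrated with the sketch
below. The octal example assumes an ASCII-compatible encoding, in which
octal 101 and 102 encode 'A' and 'B'; the input and the variable names
are invented.

```shell
# A slash inside an ERE delimited by slashes must be written as \/.
slash=$(printf 'usr/bin\n' | awk '/usr\/bin/ { print "slash matched" }')

# A quotation mark inside a string constant is written as \".
quote=$(awk 'BEGIN { print "say \"hi\"" }')

# \ddd names a character by its octal encoding.
octal=$(awk 'BEGIN { print "\101\102" }')

echo "$slash"   # slash matched
echo "$quote"   # say "hi"
echo "$octal"   # AB
```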