Awk help Wed Feb 20 16:11:34 EST 2008 Structure of an AWK program: =========================== An awk program is a sequence of pattern-action statements pattern { action } pattern { action } "pattern" is a regular expression, numeric expression, string expression or combination; "action" is executable code, similar to C Operation: for each file for each input line for each pattern if pattern matches input line do the action Usage: awk 'program' [ file1 file2 ... ] awk -f progfile [ file1 file2 ... ] The special pattern BEGIN matches before any input has been read; END matches after all input has been read. AWK features: ================================================== input is read automatically across multiple files lines are split into fields ($1, ..., $NF; $0 for whole line) default split is by white space changing FS to some other value (string or RE) affects split variables contain string or numeric values no declarations type determined by context and use initialized to 0 and empty string operators work on strings or numbers coerce type according to context built-in variables for frequently-used values associative arrays (arbitrary subscripts) regular expressions (like egrep) control flow statements similar to C if-else, while, for, do, but no switch for (i in array) goes through each subscriptof associative array next: start next iteration of main loop exit: leave main loop, go to END block built-in and user-defined functions arithmetic, string, regular expression, text edit, ... printf for formatted output, as in C getline for input from files or processes Basic AWK programs: =================================================== These are all one-liners: { print NR, $0 } precede each line by line number { $1 = NR; print } replace first field by line number { print $2, $1 } print field 2, then field 1 { temp = $1; $1 = $2; $2 = temp; print } flip $1, $2 { $2 = ""; print } zap field 2 { print $NF } print last field NF > 0 print non-empty lines NF > 4 print if more than 4 fields $NF > 4 print if last field greater than 4 NF > 0 {print $1, $2} print two fields of non-empty lines /regexpr/ print matching lines (egrep) $1 ~ /regexpr/ print lines where first field matches END { print NR } line count A couple of two-liners: { nc += length($0) + 1; nw += NF } wc command END { print NR, "lines", nw, "words", nc, "characters" } length($1) > max { max = length($1); maxline = $0 } print longest line END { print max, maxline } Associative arrays: ================================================== AWK only provides associative arrays: subscripts are arbitrary strings or numbers. Add up name-value pairs: { amount[$1] += $2 } END { for (name in amount) print name, amount[name] } Test whether a[s] exists without creating it (normally, referring to an array element creates it): if (s in a) ... To delete an element or a whole array: delete a[s] delete a To split a string into an array: n = split(s, a, re) This splits s into a[1]..a[n] with re as delimiter. If there is no re argument, the operation is the same as field splitting on input. Built-in variables: ================================================= This isn't a complete list. Some variables can be set, others are maintained automatically (notably NR, NF). FS input field separator; controls field splitting OFS output field separator; placed between output exprs NF number of fields in current record $1..$NF input fields $0 entire input record before splitting into fields NR current input record number overall FNR current input record in current input file FILENAME current input filename ENVIRON shell environment variables ARGV command line arguments; can be set ARGC number of command line arguments; can be set Setting ARGV[i] to "" prevents that file from being examined at all. Changing $0 recomputes $1..$NF; changing $n recomputes $0. Built-in functions: =================================================== Awk strings and string functions are 1-origin; be careful. length(s) n = index(s, f) returns index of f in s, or 0 if not there n = match(s, re) index where re matched in s, or 0 if not nsub = sub(re, repl, target) replaces first instance of re in target by repl returns 0 if no match nsub = gsub(re, repl, target) replaces all instances of re in target by repl returns 0 if no match, number of replacements otherwise str = substr(s, start, length) returns substring of s starting at start, up to length characters (default is rest of string). works sensibly if you go off the ends. note: origin is 1. s = toupper(str) s = tolower(str) map case s = sprintf("...", expr list) formats expressions, returns string result There are also all the usual math functions: int, sqrt, exp, log, sin, cos, atan2, rand (uniform between 0 and 1), srand(new_seed). Functions: ===================================================== Functions are defined as function name(arg list) { statements } statements can be any sequence of statements as in actions. return [expr] returns. arg list is zero or more parameter names. if there are more names than the function was called with, the extra parameters are local variables. (this is a *terrible* design; be careful.) arguments are passed by value for scalars; arrays are in effect call by reference since the function can change the array contents. Input and Output: ============================================== Besides the automatic I/O of the main loop, print e, e, e prints the list of expressions, separated by OFS print prints $0 printf formats output, as in C print or printf > "string" sends output to file print or printf | "string" sends output to process given by string, which is created on first reference. getline x