Awk is a programming language designed to make many common information retrieval and text manipulation tasks easy to state and to perform. The name awk comes from the names of its creators Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. A GNU version of awk, called gawk, is also in wide use (and is recommended). (Also, for most practical purposes, the awk programming language has been superseded by Larry Wall's perl language.)
The basic operation of awk is to scan a set of input lines in order, searching for lines which match any of a set of patterns which the user has specified. For each pattern, an action can be specified; this action will be performed on each line that matches the pattern.
Awk's pattern matching is more general than grep's, and the actions allowed are more involved than merely printing the matching line. For example, the awk program
{print $3, $2}
prints the third and second columns of a table in that order.
The program
$2 ~ /A|B|C/
prints all input lines with an A, B, or C in the second field.
The program
$1 != prev { print; prev = $1 }
prints all lines in which the first field is different from the previous first field.
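To see the last program at work, here is a small run. The input lines are made up for illustration and are piped in with printf rather than read from a file:

```shell
# Hypothetical input: the first field repeats on consecutive lines.
printf 'a 1\na 2\nb 3\nb 4\nc 5\n' |
awk '$1 != prev { print; prev = $1 }'
# prints the first line of each run: "a 1", "b 3", "c 5"
```

The variable prev starts out empty, so the very first line always differs from it and is printed.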
The simplest way to run an awk program is to give it on the command line, inside single quotes:
awk '{print $0}' filename
The single quotes protect almost everything from the shell. In csh or tcsh, you still have to watch out for exclamation marks, but other than that, you're safe.
Error messages from awk can be terse. For example, a stray comma gives:
awk '{print $0,}' filename
awk: syntax error near line 1
awk: illegal statement near line 1
gawk generally has better error messages. At least it tells you where
in the line something went wrong:
gawk '{print $0,}' filename
gawk: cmd. line:1: {print $0,}
gawk: cmd. line:1: ^ parse error
So, if you're having problems getting awk syntax correct, switch to
gawk for a while.
Some Samples:
Perhaps the quickest way of learning awk is to look at some sample
programs. The one above will print the file in its entirety, just
like cat(1). Here are some others, along with a quick description of
what they do.
awk '{print $2,$1}' filename
will print the second field, then the first. All other fields are ignored.
awk '{print $1,$2,sin($3/$2)}' filename
will print the first and second fields, and then the sine of the third field divided by the second. So the second and third fields had better be numbers. Awk has other built-in math functions like sine; read the manpage to see which ones.
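A quick way to try these two samples without a data file is to pipe in a line with printf. The input line here is invented for the purpose:

```shell
# Hypothetical one-line input: a label, then two numbers.
printf 'x 2 0\n' | awk '{print $2,$1}'
# prints "2 x"
printf 'x 2 0\n' | awk '{print $1,$2,sin($3/$2)}'
# prints "x 2 0", since sin(0/2) is 0
```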
"I still say awk '{print $1}' a lot."
the inventor of PERL, Larry Wall (lwall@netlabs.com)
What if you don't want to apply the program to each line of the file? Say, for example, that you only wanted to process lines that had the first field greater than the second. The following program will do that:
awk '$1 > $2 {print $1,$2,$1-$2}' filename
The part outside the curly braces is called the "pattern", and the part inside is the "action". The comparison operators are the ones from C, and C's conditional operator is available too:
== != < > <= >= ?:
If no pattern is given, then the action applies to all lines. This fact was used in the sample programs above. If no action is given, then the entire line is printed. If "print" is used all by itself, the entire line is printed. Thus, the following are equivalent:
awk '$1 > $2' filename
awk '$1 > $2{print}' filename
awk '$1 > $2{print $0}' filename
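The equivalence is easy to check on a few made-up lines. This sketch keeps the sample data in a shell variable called data (a name chosen here, not part of awk):

```shell
# Hypothetical two-column input.
data='5 3
1 9
7 2'
printf '%s\n' "$data" | awk '$1 > $2'
printf '%s\n' "$data" | awk '$1 > $2{print}'
printf '%s\n' "$data" | awk '$1 > $2{print $0}'
# each of the three commands prints the lines "5 3" and "7 2"
```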
The various fields in a line can also be treated as strings instead of
numbers. To compare a field to a string, use the following method:
awk '$1=="foo"{print $2}' filename
A regular expression between slashes can also serve as the pattern:
awk '/foo.*bar/{print $1,$3}' filename
This will print the first and third fields of every line containing "foo" and then later "bar". If you want only those lines where "foo" occurs in the second field, use the ~ ("contains") operator:
awk '$2~/foo/{print $3,$1}' filename
If you want lines where "foo" does not occur in the second field, use the negated ~ operator, !~
awk '$2!~/foo/{print $3,$1}' filename
This operator can be read as "does not contain".
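Here is a small run of both operators on invented input. Note that "food" in the second field counts as a match, because ~ only asks whether the pattern occurs somewhere in the field:

```shell
# Hypothetical three-column input.
printf 'x food 1\ny bar 2\nz foo 3\n' | awk '$2~/foo/{print $3,$1}'
# prints "1 x" and "3 z"
printf 'x food 1\ny bar 2\nz foo 3\n' | awk '$2!~/foo/{print $3,$1}'
# prints "2 y"
```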
Booleans
You can produce complicated patterns with the boolean operators from
C, which are ! for "not", && for "and", and || for "or". Parentheses
can be used for grouping.
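As a rough illustration, the following combines a numeric test with a string test. The three input lines and the words "keep" and "skip" are made up:

```shell
# Hypothetical input: two numbers and a tag.
printf '5 3 keep\n1 9 keep\n7 2 skip\n' | awk '$1 > $2 && $3 == "keep"'
# prints "5 3 keep"
printf '5 3 keep\n1 9 keep\n7 2 skip\n' | awk '!($1 > $2) || $3 == "skip"'
# prints "1 9 keep" and "7 2 skip"
```

The second pattern is the logical opposite of the first, so between them the two commands print every input line exactly once.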
Start and End
There are three special forms of patterns that do not fit the above
descriptions. One is the start-end pair of regular expressions. For
example, to print all lines between and including lines that contained
"foo" and "bar", you would use
awk '/foo/,/bar/' filename
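On a five-line made-up input, the range pattern picks out the middle three lines:

```shell
# Hypothetical input, one word per line.
printf 'one\nfoo\ntwo\nbar\nthree\n' | awk '/foo/,/bar/'
# prints "foo", "two", "bar"
```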
Begin and End
The other two special forms are similar; they are the BEGIN and END
patterns. Any action associated with the BEGIN pattern will happen
before any line-by-line processing is done. Actions with the END
pattern will happen after all lines are processed.
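But how do you put more than one pattern-action pair into an awk program? There are several choices.
One is to put them all on the command line:
awk 'BEGIN{print"fee"} $1=="foo"{print"fi"}END{print"fo fum"}' filename
Another is to put the program in a file, one pattern-action pair per line:
BEGIN{print"fee"}
$1=="foo"{print"fi"}
END{print"fo fum"}
Let's say that's in the file giant.awk. Now, run it using the "-f"
flag to awk:
awk -f giant.awk filename
A third is to make the file itself an executable script (adjust the path on the first line if awk lives elsewhere on your system). Put this in a file, say giant2.awk:
#!/usr/bin/awk -f
BEGIN{print"fee"}
$1=="foo"{print"fi"}
END{print"fo fum"}
Then make it executable:
chmod u+x giant2.awk
and then just call it like so:
./giant2.awk filename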
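The "-f" route can be tried from start to finish like this. The sketch writes the program to a temporary file made with mktemp instead of a file named giant.awk, and the three input lines are invented:

```shell
# Write the awk program to a temporary file (stands in for giant.awk).
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { print "fee" }
$1 == "foo" { print "fi" }
END { print "fo fum" }
EOF

# Hypothetical input: two of the three lines start with "foo".
printf 'foo 1\nbar 2\nfoo 3\n' | awk -f "$script"
# prints "fee", then "fi" twice, then "fo fum"

rm -f "$script"
```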
awk has variables that can be either real numbers or strings. For example, the following code prints a running total of the fifth column:
awk '{print x+=$5,$0 }' filename
This can be used when looking at file sizes from an "ls -l". It is also useful for balancing one's checkbook, if the amount of the check is kept in one column.
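A short run shows the running total building up. The five-column lines are placeholders, not real "ls -l" output:

```shell
# Hypothetical input with a number in the fifth column.
printf 'a b c d 10\na b c d 5\na b c d 7\n' | awk '{print x+=$5,$0 }'
# prints:
#   10 a b c d 10
#   15 a b c d 5
#   22 a b c d 7
```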
Awk variables
awk variables are initialized to either zero or the empty string the
first time they are used. Which one depends on how they are used, of
course.
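A tiny check of this, using two variables (n and s, names picked for the example) that are never assigned:

```shell
# n is used as a number, s as a string; neither was ever set.
awk 'BEGIN { print n + 1; print "[" s "]" }'
# prints "1" and then "[]"
```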
Variables are also useful for keeping intermediate values. This example also introduces the use of semicolons for separating statements:
awk '{d=($2-($1-4));s=($2+$1);print d/sqrt(s),d*d/s }' filename
Note that the final statement, a "print" in this case, does not need a semicolon. It doesn't hurt to put it in, though.
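Worked through on one made-up line, with 5 in the first field and 20 in the second, d comes out as 20-(5-4) = 19 and s as 25:

```shell
# Hypothetical input: $1 = 5, $2 = 20.
printf '5 20\n' | awk '{d=($2-($1-4));s=($2+$1);print d/sqrt(s),d*d/s }'
# prints "3.8 14.44"  (19/5 and 361/25)
```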
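A variable can also hold a field number. This prints the field whose number is given in the first field:
awk '{imp=$1; print $imp }' filename
The built-in variable NF holds the number of fields in the current line, so $NF is the last field:
awk '{print $1,$NF }' filename
NR holds the number of the current line, so this numbers the lines of a file:
awk '{print NR,$0 }' filename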
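The field separator can be changed with the -F flag (or the FS variable). This prints the first and third colon-separated fields of the password file:
awk -F: '{print $1,$3 }' /etc/passwd
This separator can actually be set to any regular expression, in the manner of egrep(1).
The various fields are also variables, and you can assign things to them. If you wanted to delete the 10th field from each line, you could do it by printing fields 1 through 9, and then from 11 on using a for-loop (see below). But this will blank it out very easily (the separators on either side of the field remain):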
awk '{$10=""; print }' filename
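The built-in variables and field assignment can be tried on a few invented lines. In the last command the second field is blanked instead of the tenth to keep the input short, and tr turns spaces into underscores so the leftover separators are visible:

```shell
# NR is the line number, NF the number of fields, $NF the last field.
printf 'a b c\nd e\n' | awk '{print NR,NF,$NF }'
# prints "1 3 c" and "2 2 e"

# -F: splits on colons (hypothetical passwd-style line).
printf 'root:x:0:0\n' | awk -F: '{print $1,$3 }'
# prints "root 0"

# Blanking a field keeps the separators on both sides of it.
printf 'a b c\n' | awk '{$2=""; print }' | tr ' ' '_'
# prints "a__c"
```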
In many ways, awk is like C. The "for", "while", "do-while", and "if" constructs all exist. Statements can be grouped with curly braces. This script will print each field of each record on its own line.
awk '{for(i=1;i<=NF;i++) print $i }' filename
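Here is that loop on one made-up line, followed by a sketch that uses "while" and "if" together to count the fields that consist only of digits. The counter n and the digit test are choices made for this example:

```shell
# One field per output line.
printf 'a b c\n' | awk '{for(i=1;i<=NF;i++) print $i }'
# prints "a", "b", "c"

# Count all-digit fields across the whole input.
printf '1 x 22\ny 3\n' |
awk '{ i = 1
       while (i <= NF) {
           if ($i ~ /^[0-9]+$/) n++
           i++
       }
     }
     END { print n + 0 }'
# prints "3"
```

Printing n + 0 rather than n makes sure a count of zero shows up as 0 instead of an empty line.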
If you want output that is a little better formatted than the "print" statement gives you, you can use "printf" just like in C. Here is an example that treats the first field as a string, and then does some numeric formatting with the second and third:
awk '{printf("%s %03d %02d %.15g\n",$1,$2,$3,$3/$2); }' filename
Note that with printf, you need the explicit newline character.
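Run on one invented line, the format string pads the two integers with zeros and prints the quotient in its shortest form:

```shell
# Hypothetical input: a string, then two numbers.
printf 'abc 4 2\n' | awk '{printf("%s %03d %02d %.15g\n",$1,$2,$3,$3/$2); }'
# prints "abc 004 02 0.5"
```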
We can use "printf" to print stuff without the newline, which is useful in a for loop. This script prints the fields of each record in reverse order, run together with no separator between them. Ok, so it isn't very useful.
awk '{for(i=NF;i > 0;i--) printf("%s",$i); printf("\n"); }' filename
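If the reversed fields should stay separated, one possible variation (a sketch, not the only way to do it) prints a space after every field except the last one written, and a newline after that one:

```shell
# Reverse the fields, keeping single spaces between them.
printf 'a b c\n' |
awk '{for (i = NF; i > 0; i--) printf("%s%s", $i, (i > 1 ? " " : "\n"))}'
# prints "c b a"
```

The ?: conditional picks the trailing text for each field, so no separate printf("\n") is needed.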