
Unix for Advanced Users

15. Advanced Commands and Usage

15.7. awk

Getting Started with awk

Awk is a programming language designed to make many common information retrieval and text manipulation tasks easy to state and to perform. The name awk comes from the names of its creators: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. A GNU version of awk, called gawk, is also in wide use (and is recommended). (For many practical purposes, the awk programming language has been superseded by Larry Wall's Perl language.)

The basic operation of awk is to scan a set of input lines in order, searching for lines which match any of a set of patterns which the user has specified. For each pattern, an action can be specified; this action will be performed on each line that matches the pattern.

Awk's patterns can be more general than those in grep, and its actions can do more than merely print the matching line. For example, the awk program

{print $3, $2}

prints the third and second columns of a table in that order.

The program

$2 ~ /A|B|C/

prints all input lines with an A, B, or C in the second field.

The program

$1 != prev { print; prev = $1 }

prints all lines in which the first field is different from the previous first field.
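As a rough illustration, here are the three programs above run against a small made-up table. The sample data and the shell variable names are assumptions for this sketch, not part of the original text.

```shell
# Hypothetical sample table: three whitespace-separated columns.
data='x A 1
x D 2
y B 3
y C 4'

# Third and second columns, in that order.
swapped=$(printf '%s\n' "$data" | awk '{print $3, $2}')

# Lines with an A, B, or C in the second field.
matched=$(printf '%s\n' "$data" | awk '$2 ~ /A|B|C/')

# Lines whose first field differs from the previous line's first field.
changed=$(printf '%s\n' "$data" | awk '$1 != prev { print; prev = $1 }')

printf '%s\n' "$swapped" "---" "$matched" "---" "$changed"
```

Note that the last program works on the very first line because prev starts out empty, so the first comparison always succeeds.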


Some basics:


Some Samples:

Perhaps the quickest way of learning awk is to look at some sample programs. The simplest, awk '{print}' filename, prints the file in its entirety, just like cat(1). Here are some others, along with a quick description of what they do.

awk '{print $2,$1}' filename

will print the second field, then the first. All other fields are ignored.

awk '{print $1,$2,sin($3/$2)}' filename

will print the first and second fields, and then the sine of the third field divided by the second. So, the second and third field had better be numbers. Awk has other built in math functions like sine; read the manpage to see which ones.
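A minimal sketch of the built-in math functions, using made-up one-line inputs chosen so the results are exact. Besides sin, it uses sqrt, int, and length, which are standard awk built-ins; the data and variable names are assumptions.

```shell
# sin($3/$2) with $3 = 0, so the sine is exactly 0.
sines=$(printf '%s\n' 'x 2 0' | awk '{print $1, $2, sin($3/$2)}')

# Square root of field 2, integer part of field3/field2, length of field 1.
others=$(printf '%s\n' 'abc 16 50' | awk '{print sqrt($2), int($3/$2), length($1)}')

printf '%s\n' "$sines" "$others"
```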

"I still say awk '{print $1}' a lot."
the inventor of Perl, Larry Wall (lwall@netlabs.com)

What if you don't want to apply the program to each line of the file? Say, for example, that you only wanted to process lines that had the first field greater than the second. The following program will do that:

awk '$1 > $2 {print $1,$2,$1-$2}' filename

The part outside the curly braces is called the "pattern", and the part inside is the "action". The comparison operators are the ones from C:

	== != < > <= >=

C's conditional operator, ?:, is also available in expressions.

If no pattern is given, then the action applies to all lines. This fact was used in the sample programs above. If no action is given, then the entire line is printed. If "print" is used all by itself, the entire line is printed. Thus, the following are equivalent:

	awk '$1 > $2'           filename
	awk '$1 > $2{print}'    filename
	awk '$1 > $2{print $0}' filename

The various fields in a line can also be treated as strings instead of numbers. To compare a field to a string, use the following method:

awk '$1=="foo"{print $2}' filename
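The two kinds of comparison can be sketched on made-up input as follows. The data and variable names are assumptions. The numeric example includes the pair 10 and 9 to show that fields that look like numbers are compared as numbers, not character by character.

```shell
# Hypothetical numeric data: two columns.
nums='5 3
2 9
10 9'

# Only lines whose first field is greater than the second.
bigger=$(printf '%s\n' "$nums" | awk '$1 > $2 {print $1, $2, $1-$2}')

# Hypothetical labelled data: compare the first field to a string.
names='foo 1
bar 2
foo 3'
foos=$(printf '%s\n' "$names" | awk '$1=="foo"{print $2}')

printf '%s\n' "$bigger" "---" "$foos"
```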


Using regular expressions

What if you want lines in which a certain string is found? Just put a regular expression (in the manner of egrep(1)) into the pattern, like so:

awk '/foo.*bar/{print $1,$3}' filename

This will print the first and third fields of every line containing the string "foo" followed later by the string "bar". If you want only those lines where "foo" occurs in the second field, use the ~ ("contains") operator:

awk '$2~/foo/{print $3,$1}' filename

If you want lines where "foo" does not occur in the second field, use the negated ~ operator, !~:

awk '$2!~/foo/{print $3,$1}' filename

This operator can be read as "does not contain".
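Here is a small sketch of the three regular-expression patterns above on made-up input (the data and variable names are assumptions). The second field "food" still matches /foo/, which is why "contains" is a fair reading of ~.

```shell
# Hypothetical three-column input.
data='1 foo x
2 bar foo
3 food bar'

# Whole-line match: "foo" followed later by "bar".
whole=$(printf '%s\n' "$data" | awk '/foo.*bar/{print $1,$3}')

# Second field contains "foo".
has=$(printf '%s\n' "$data" | awk '$2~/foo/{print $3,$1}')

# Second field does not contain "foo".
hasnot=$(printf '%s\n' "$data" | awk '$2!~/foo/{print $3,$1}')

printf '%s\n' "$whole" "---" "$has" "---" "$hasnot"
```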


Booleans

You can produce complicated patterns with the boolean operators from C, which are ! for "not", && for "and", and || for "or". Parentheses can be used for grouping.
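A brief sketch of compound patterns, assuming a made-up table with two numbers and a tag per line (the data, the tags "keep" and "skip", and the variable names are all hypothetical):

```shell
data='5 3 keep
2 9 keep
7 1 skip
4 4 keep'

# "and": first field larger AND tagged keep.
both=$(printf '%s\n' "$data" | awk '$1 > $2 && $3 == "keep"')

# "or": first field larger OR tagged skip.
either=$(printf '%s\n' "$data" | awk '$1 > $2 || $3 == "skip"')

# "not": first field is NOT larger than the second.
negated=$(printf '%s\n' "$data" | awk '!($1 > $2)')

# Parentheses group the "or" before the "and" is applied.
grouped=$(printf '%s\n' "$data" | awk '($1 > $2 || $1 == $2) && $3 != "skip"')

printf '%s\n' "$both" "---" "$either" "---" "$negated" "---" "$grouped"
```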


Start and End

There are three special forms of patterns that do not fit the above descriptions. One is the start-end pair of regular expressions. For example, to print every line from one containing "foo" through the next one containing "bar", inclusive, you would use

awk '/foo/,/bar/' filename
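The start-end pair can be sketched on a made-up input like this (the data and variable names are assumptions). The input deliberately contains a second "foo" with no closing "bar", to show that an unclosed range runs to the end of the input.

```shell
data='one
foo start
two
bar end
three
foo again
four'

# Print each block that starts at a "foo" line and ends at a "bar" line.
ranges=$(printf '%s\n' "$data" | awk '/foo/,/bar/')

printf '%s\n' "$ranges"
```

The lines "one" and "three" fall outside any range and are dropped.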


Begin and End

The other two special forms are similar; they are the BEGIN and END patterns. Any action associated with the BEGIN pattern will happen before any line-by-line processing is done. Actions with the END pattern will happen after all lines are processed.

But how do you put more than one pattern-action pair into an awk program? There are several choices.

  1. One is to just mash them together, like so:

    awk 'BEGIN{print"fee"} $1=="foo"{print"fi"}END{print"fo fum"}' filename

  2. Another choice is to put the program into a file, like so:
    	BEGIN{print"fee"}
    	$1=="foo"{print"fi"}
    	END{print"fo fum"}
    
    Let's say that's in the file giant.awk. Now, run it using the "-f" flag to awk:

    awk -f giant.awk filename

  3. A third choice is to create a file that calls awk all by itself. The following form will do the trick:
    	#!/usr/bin/awk -f
    	BEGIN{print"fee"}
    	$1=="foo"{print"fi"}
    	END{print"fo fum"} 
    
If we call this file giant2.awk, we can run it by first giving it execute permissions,

chmod u+x giant2.awk

and then just call it like so:

./giant2.awk filename
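To see the order in which BEGIN, the per-line pattern, and END fire, here is a sketch of the second choice (a program file run with -f) on made-up input. The temporary file created with mktemp stands in for giant.awk, and the input lines are assumptions.

```shell
# Write the program to a temporary file (a stand-in for giant.awk).
prog=$(mktemp)
cat > "$prog" <<'EOF'
BEGIN { print "fee" }
$1 == "foo" { print "fi" }
END { print "fo fum" }
EOF

# Two of the three hypothetical input lines start with "foo".
out=$(printf '%s\n' 'foo 1' 'bar 2' 'foo 3' | awk -f "$prog")
rm -f "$prog"

printf '%s\n' "$out"
```

"fee" appears once before any input is read, "fi" once per matching line, and "fo fum" once after the last line.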

awk has variables that can be either real numbers or strings. For example, the following code prints a running total of the fifth column:

awk '{print x+=$5,$0 }' filename

This can be used when looking at file sizes from an "ls -l". It is also useful for balancing one's checkbook, if the amount of the check is kept in one column.
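A variation on the running total, sketched on made-up five-column input: it keeps the sum in a variable named total (a hypothetical name) and also prints the grand total from an END action.

```shell
data='a b c d 10
a b c d 20
a b c d 5'

# Running total of column 5 in front of each line, grand total at the end.
out=$(printf '%s\n' "$data" |
  awk '{ total += $5; print total, $0 } END { print "total", total }')

printf '%s\n' "$out"
```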


Awk variables

awk variables are initialized to either zero or the empty string the first time they are used. Which one depends on how they are used, of course.
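The initialization rule can be sketched as follows. The variable names x, n, and s are arbitrary, and the input is made up. The same unset variable acts as 0 in arithmetic and as the empty string when concatenated.

```shell
# An unset variable: 0 as a number, empty as a string.
unset_var=$(awk 'BEGIN { print x + 0, "[" x "]" }')

# n is used as a number (a line counter), s as a string (an accumulator).
counted=$(printf '%s\n' a b c | awk '{ n = n + 1; s = s $1; print n, s }')

printf '%s\n' "$unset_var" "$counted"
```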

Variables are also useful for keeping intermediate values. This example also introduces the use of semicolons for separating statements:

awk '{d=($2-($1-4));s=($2+$1);print d/sqrt(s),d*d/s }' filename

Note that the final statement, a "print" in this case, does not need a semicolon. It doesn't hurt to put it in, though.

A counter variable that is incremented on every line is one way to put line numbers on a file. Of course, there are a myriad of other ways to do that using the various UNIX utilities; these are left as an exercise for the reader.

The various fields are also variables, and you can assign things to them. If you wanted to delete the 10th field from each line, you could do it by printing fields 1 through 9, and then from 11 on using a for-loop (see below). But this will blank it out very easily, leaving an empty field where the 10th one was:

awk '{$10=""; print }' filename
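A short sketch of the difference between blanking a field and dropping it, using the second field of a made-up three-field line to keep the example small. The variable names line and sep are hypothetical.

```shell
# Assigning "" leaves an empty field, so two separators end up side by side.
blanked=$(printf '%s\n' 'a b c' | awk '{ $2 = ""; print }')

# Rebuilding the line in a loop skips the field entirely.
dropped=$(printf '%s\n' 'a b c' | awk '{
  line = ""; sep = ""
  for (i = 1; i <= NF; i++)
    if (i != 2) { line = line sep $i; sep = " " }
  print line
}')

printf '%s\n' "$blanked" "$dropped"
```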

In many ways, awk is like C. The "for", "while", "do-while", and "if" constructs all exist. Statements can be grouped with curly braces. This script will print each field of each record on its own line.

awk '{for(i=1;i<=NF;i++) print $i }' filename
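The other C-like constructs can be sketched the same way. This example, on a made-up line of three numbers, runs the for-loop above and then uses "while" and "if" to count the positive fields; the variable name pos is hypothetical.

```shell
# Each field on its own line.
perline=$(printf '%s\n' '3 -1 4' | awk '{ for (i = 1; i <= NF; i++) print $i }')

# Number of fields, then how many of them are greater than zero.
counts=$(printf '%s\n' '3 -1 4' | awk '{
  pos = 0; i = 1
  while (i <= NF) {
    if ($i > 0) pos++
    i++
  }
  print NF, pos
}')

printf '%s\n' "$perline" "$counts"
```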

If you want output that is a little better formatted than what the "print" statement gives you, you can use "printf" just like in C. Here is an example that treats the first field as a string, and then does some numeric stuff:

awk '{printf("%s %03d %02d %.15g\n",$1,$2,$3,$3/$2); }' filename

Note that with printf, you need the explicit newline character.
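As an illustration, here is the printf example run on one made-up line, followed by a second, hypothetical format that pads the name to a fixed width and prints the ratio with two decimals.

```shell
# Zero-padded integers and a general-format quotient (10/4).
padded=$(printf '%s\n' 'widget 4 10' |
  awk '{printf("%s %03d %02d %.15g\n",$1,$2,$3,$3/$2); }')

# Left-justified name in 8 columns, ratio in a 5-wide field with 2 decimals.
aligned=$(printf '%s\n' 'widget 4 10' |
  awk '{printf("%-8s|%5.2f\n", $1, $3/$2); }')

printf '%s\n' "$padded" "$aligned"
```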

We can use "printf" to print stuff without the newline, which is useful in a for loop. This script prints the fields of each record in reverse order, run together because the format contains no separator. Ok, so it isn't very useful.

awk '{for(i=NF;i > 0;i--) printf("%s",$i); printf("\n"); }' filename
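A sketch of the same loop on a made-up line, next to a variant that puts a space between fields and a newline after the last one. The variant is one possible approach, not the original author's.

```shell
# Original form: fields reversed and run together.
joined=$(printf '%s\n' 'a b c' |
  awk '{for(i=NF;i > 0;i--) printf("%s",$i); printf("\n"); }')

# Variant: a space after every field except the last, then a newline.
spaced=$(printf '%s\n' 'a b c' |
  awk '{for(i=NF;i > 0;i--) printf("%s%s", $i, (i > 1 ? " " : "\n")); }')

printf '%s\n' "$joined" "$spaced"
```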


Punctuation guide:

{}
used around the action, and to group statements in the action.

$
denotes a field. $1 is the first field, $0 is the whole record.

~
the "contains" operator. "foobar"~"foo" is true. The right-hand side is treated as a regular expression.

!~
the "does not contain" operator. The right-hand side is treated as a regular expression.

==
the equality operator. Works for numbers or strings.

< > <= >= !=
inequality operators. Work for numbers or strings.

#
the begin-comment character

,
separates things in a "print" or "printf" statement.

;
separates statements.

//
used around a regular expression

&&
Boolean and

||
Boolean or

!
Boolean not

()
used for grouping Boolean expressions, passing arguments to functions, and around conditions for "for","while", etc.
