Before you start:
cd into the data-temp folder
You will find a file called [Link]
Slides 1-4 video Lecture6-awk-specify-conditions-9min
1
The AWK programming language
AWK is a programming language designed for text processing and typically used as a data
extraction and reporting tool.
awk 'condition {print action}' filename
You can extract lines based on conditions that you specify on fields
You can print specific fields {print action}
Fields are specified as follows:
$1 first field.
$2 second field.
$n nth field.
By default, awk separates fields on whitespace: any run of spaces or tabs.
Print only lines that contain keyword ATOM in the 1st field:
awk '$1 == "ATOM" {print}' [Link]
Strings are enclosed between double quotes
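A minimal sketch of this first example on made-up data. The three-line sample and the temporary file are assumptions for illustration, not the course file:

```shell
# Hypothetical sample written to a temporary file (not the course data).
demo=$(mktemp)
cat > "$demo" <<'EOF'
HEADER    SAMPLE
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
HETATM    2  O   HOH A 101      10.000  11.000  12.000  1.00 20.00           O
EOF

# Keep only the lines whose 1st field is exactly the string ATOM.
atom_lines=$(awk '$1 == "ATOM" {print}' "$demo")
echo "$atom_lines"

rm -f "$demo"
```

Only the second line of the sample is printed, because it is the only one whose first field equals ATOM.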
2
Practice
awk '$1 == "ATOM" {print}' [Link]
1. Print only lines that contain keyword ATOM in the 1st field and pipe that into the
head command.
2. Use grep to extract all lines containing keyword ATOM and pipe that into the
head command. Can you spot the difference between awk and grep?
3. Print only lines that contain keyword ATOM in the 1st field and save the output in
a file called [Link].
3
Specifying conditions in awk
awk 'condition {print action}' filename
Operators for numbers
== is equal to
!= is not equal to
< less than
> greater than
<= less than or equal
>= greater than or equal
Syntax to define condition
$field == number
$field >= number
Operators for strings
== is equal to
!= is not equal to
Syntax to define condition (strings are enclosed within double quotes)
$field != "string"
$field == "string"
Print only lines that contain keyword HIS in the 4th field:
awk '$4 == "HIS" {print}' [Link]
Print only lines of [Link] that contain a number greater than 190 in 2nd field:
awk '$2 > 190 {print}' [Link]
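The following sketch shows why numbers are written without quotes in a condition. The sample lines are made up for illustration. When the right-hand side is a quoted string, awk compares the field as text rather than as a number:

```shell
# Hypothetical sample with atom numbers 9, 10 and 11 in the 2nd field.
demo=$(mktemp)
cat > "$demo" <<'EOF'
ATOM      9  N   HIS A   2      11.000  12.000  13.000  1.00 10.00           N
ATOM     10  CA  HIS A   2      12.000  13.000  14.000  1.00 10.00           C
ATOM     11  C   LEU A   3      13.000  14.000  15.000  1.00 10.00           C
EOF

# Unquoted 9 is a number: the comparison is numeric, so atoms 10 and 11 match.
numeric=$(awk '$2 > 9 {print}' "$demo")
echo "$numeric"

# Quoted "9" is a string: the comparison is character by character,
# and both "10" and "11" sort before "9", so nothing matches.
as_string=$(awk '$2 > "9" {print}' "$demo")
echo "matches with quoted 9: [$as_string]"

rm -f "$demo"
```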
4
Practice
4. Print only lines where the residue number (in the 6th field) in file [Link] is greater
than or equal to 28
5. Print only lines of [Link] that do not contain carbon atoms in the 12th field
(field 12 should not be equal to C)
Slides 5-6 video Lecture6-awk-conditions-AND-OR-7min
5
To specify multiple conditions, use logical AND and OR
AWK uses the following logical operators:
&& (AND)
|| (OR)
conditionA && conditionB
conditionA || conditionB
Examples
Print all lines of [Link] that contain LEU OR MET in their 4th field:
awk '$4 == "LEU" || $4 == "MET" {print}' [Link]
Print all lines of [Link] that contain LEU in their 4th field AND whose residue number (6th field) is greater
than 15:
awk '$4 == "LEU" && $6 > 15 {print}' [Link]
If both operators appear in the same condition, && is evaluated before ||, unless you enclose the || expression in parentheses ()
Example
Print all lines of atoms.pdb that contain LEU OR MET in their 4th field AND whose residue number (6th field)
is greater than 20. Watch out for the order of operations:
awk '($4 == "LEU" || $4 == "MET") && $6 > 20 {print}' [Link]
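To see what the parentheses change, here is a small sketch on four made-up lines (the sample data is an assumption for illustration). The same condition is run with and without parentheses:

```shell
# Hypothetical sample: two LEU and two MET residues with different residue numbers.
demo=$(mktemp)
cat > "$demo" <<'EOF'
ATOM      1  CA  LEU A  10      11.000  12.000  13.000  1.00 10.00           C
ATOM      2  CA  LEU A  25      12.000  13.000  14.000  1.00 10.00           C
ATOM      3  CA  MET A   5      13.000  14.000  15.000  1.00 10.00           C
ATOM      4  CA  MET A  30      14.000  15.000  16.000  1.00 10.00           C
EOF

# Without parentheses, && binds first: LEU (any residue number) OR (MET AND > 20).
no_parens=$(awk '$4 == "LEU" || $4 == "MET" && $6 > 20 {print}' "$demo")
echo "$no_parens"

echo "---"

# With parentheses: (LEU OR MET) AND residue number > 20.
with_parens=$(awk '($4 == "LEU" || $4 == "MET") && $6 > 20 {print}' "$demo")
echo "$with_parens"

rm -f "$demo"
```

Without parentheses, the LEU line with residue number 10 is also printed, because the test on the 6th field applies only to MET.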
6
Practice
6. Print lines of [Link] that contain N in the 3rd field and LYS in the 4th field,
and when the 6th field is equal to 9
7. Print lines of [Link] that contain LYS in the 4th field and when the 6th field is
equal to 9 or 28
Slides 7-9 video Lecture6-awk-print action-7min
7
Print only specific fields
awk 'condition {print action}' filename
If a condition is not specified, awk will match all lines in the input file, and perform the print on
each one.
awk '{print $2, $6}' [Link] #print 2nd and 6th fields of all lines
If a condition is specified, awk will extract lines matching that condition, and perform the print on
those lines
awk '$4 == "HIS" {print $2, $6}' [Link] #print 2nd and 6th fields of lines containing HIS in the 4th field
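The comma between fields matters. A short sketch on one made-up line (an assumption for illustration): with a comma, awk prints the fields separated by a space, and without it the fields are joined together.

```shell
# Hypothetical single line of input.
line="ATOM      5  CB  HIS A   8      24.500  23.100   4.000  1.00 12.00           C"

# With a comma, the two fields are printed separated by a space.
with_comma=$(echo "$line" | awk '$4 == "HIS" {print $2, $6}')
echo "$with_comma"

# Without a comma, the two fields are concatenated with nothing in between.
no_comma=$(echo "$line" | awk '$4 == "HIS" {print $2 $6}')
echo "$no_comma"
```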
8
Arithmetic operations on fields
You can perform arithmetic operations on fields in the {print action}
+ addition
- subtraction
* multiplication
/ division
x^y exponentiation (some awk versions also accept x**y)
Examples
Print the sum of 7th, 8th, and 9th fields of all lines:
awk '{print $7 + $8 + $9}' [Link]
Print the sum of the 7th and 8th fields divided by 2, for the lines matching the condition:
awk '$4 == "LEU" && $6 > 15 {print ($7 + $8)/2}' [Link]
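A small sketch of these operators on one made-up line, with coordinates chosen so the results are easy to check by hand (the data is an assumption for illustration):

```shell
# Hypothetical line with x = 1.500, y = 2.500, z = 5.000 in fields 7, 8 and 9.
line="ATOM      1  CA  LEU A  16       1.500   2.500   5.000  1.00 10.00           C"

# Sum of the three coordinates: 1.5 + 2.5 + 5.0
total=$(echo "$line" | awk '{print $7 + $8 + $9}')

# Mean of the three coordinates; the parentheses make the sum happen before the division.
mean=$(echo "$line" | awk '{print ($7 + $8 + $9) / 3}')

# Square of the 7th field: 1.5 * 1.5
square=$(echo "$line" | awk '{print $7 ^ 2}')

echo "$total $mean $square"
```

Without the parentheses, only $9 would be divided by 3, because division is performed before addition.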
9
Practice
8. Use awk to extract lines with the keyword MET in 4th field and print the 2nd,
3rd, and 6th field
9. Use awk to print the sum of the 7th and 8th fields divided by 10 of the
lines having the keyword MET in the 4th field
10
Some more fun with awk
You can add text to the print action by enclosing it in double quotes.
Separate text and fields with commas:
awk '$4 == "LEU" && $6 > 15 {print "X:", $7}' [Link]
awk '$4 == "LEU" && $6 < 15 {print "X:", $7, "Y:", $8}' [Link]
Slides 11-12 video Lecture6-awk-printf-4min
11
You can use printf instead of print
printf in awk uses the same syntax for defining the format as printf in bash.
The only difference is how arguments are listed: in awk, the format and each argument are separated by commas.
Unlike print, printf does not add a newline, so end the format with \n.
awk 'condition {printf "format", $field}' filename
awk '$4 == "LEU" && $6 > 15 {print $7}' [Link]
awk '$4 == "LEU" && $6 > 15 {printf "%.2f\n", $7}' [Link]
awk '$4 == "LEU" && $6 > 15 {print "X:", $7}' [Link]
awk '$4 == "LEU" && $6 > 15 {printf "%s %.2f\n", "X:", $7}' [Link]
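A side-by-side sketch of the two printf forms, using a made-up number as input (an assumption for illustration). The format string is identical, and only the way arguments are listed changes:

```shell
# Shell printf: arguments follow the format, separated by spaces.
shell_out=$(printf "%s %.2f\n" "X:" 2.73861)
echo "$shell_out"

# awk printf: arguments follow the format, separated by commas.
awk_out=$(echo "2.73861" | awk '{printf "%s %.2f\n", "X:", $1}')
echo "$awk_out"
```

Both commands round the value to two decimal places and print the same line.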
12
Practice
10. Modify this awk code to obtain the formatted output reported below:
awk '$4 == "HIS" {print "X:", $7}' [Link]
X: 2.74
X: 3.73
X: 4.84
X: 5.17
X: 4.24
X: 4.98
X: 6.34
X: 4.54
X: 6.70
X: 5.63
Slide 13 video Lecture6-awk-specify field separator-2min
13
Specify field separator
If a file uses a field separator other than whitespace, you have to specify it with
the -F option
awk -F SEP 'condition { program actions }' filename
SEP = field separator
For example, if the field separator is a colon:
Print the 1st field (station ID) and 2nd field (state code) of all the lines having the 3rd
field (temperature) greater than 1.5
awk -F: '$3 > 1.5 {print $1,$2}' [Link]
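A sketch of what -F changes, on two made-up colon-separated lines. The station IDs, state codes and temperatures are assumptions for illustration, not the course file:

```shell
# Hypothetical colon-separated sample: station ID, state code, temperature.
demo=$(mktemp)
cat > "$demo" <<'EOF'
ST001:CA:2.31
ST002:NY:0.87
EOF

# Without -F there is no whitespace to split on, so $1 is the whole line.
no_sep=$(awk '{print $1}' "$demo")
echo "$no_sep"

# With -F: each line splits into three fields, and the condition on $3 works.
with_sep=$(awk -F: '$3 > 1.5 {print $1, $2}' "$demo")
echo "$with_sep"

rm -f "$demo"
```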
Practice
11. Use awk to print the state and station ID where the recorded temperature is
the largest, and format the temperature value to 2 decimal digits. You should use printf
in awk.
14
Some more fun with awk
awk '$4 == "HIS" && $6 < 9 {print}' [Link]
awk '$4 == "HIS" && $6 < 9 {print NR, $0}' [Link]
awk '$4 == "HIS" && $6 < 9 {print NR, NF, $0}' [Link]
NR gives the number of the current line (record) in the input
NF gives the number of fields in the current line
Adding text to the print statement:
awk '$4 == "HIS" && $6 < 9 {print "Line number", NR}' [Link]
awk '$4 == "HIS" && $6 < 9 {print "Number of fields is:", NF}' [Link]
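A sketch on four made-up lines (an assumption for illustration) showing that NR counts every line read from the input, not only the lines that match the condition:

```shell
# Hypothetical sample: HIS appears on input lines 2 and 4.
demo=$(mktemp)
cat > "$demo" <<'EOF'
HEADER    SAMPLE
ATOM      1  N   HIS A   8      11.000  12.000  13.000  1.00 10.00           N
ATOM      2  CA  MET A   9      12.000  13.000  14.000  1.00 10.00           C
ATOM      3  CA  HIS A   8      13.000  14.000  15.000  1.00 10.00           C
EOF

# Print the input line number and the number of fields for each HIS line.
his_info=$(awk '$4 == "HIS" {print NR, NF}' "$demo")
echo "$his_info"

rm -f "$demo"
```

The line numbers printed are 2 and 4, the positions of the HIS lines in the input, and each of those lines has 12 fields.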
15
Calculate the sum of a numeric field with awk
awk -F SEP '{sum += $field} END {print sum}' filename
The -F SEP tells awk that the field separator for the input is SEP (for example, -F',' for a comma).
The {sum += $field} adds the value of $field (for example, $4) to a running total for every line.
The END {print sum} tells awk to print the contents of sum after all lines are read.
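A minimal sketch of the running total on three made-up comma-separated lines. The file contents and the choice of the 3rd field are assumptions for illustration:

```shell
# Hypothetical comma-separated sample: station ID, state code, temperature.
demo=$(mktemp)
cat > "$demo" <<'EOF'
ST001,CA,2.31
ST002,NY,0.87
ST003,TX,1.50
EOF

# Add the 3rd field of every line to sum, then print sum once at the end.
total=$(awk -F, '{sum += $3} END {print sum}' "$demo")
echo "$total"

rm -f "$demo"
```

The variable sum does not need to be declared: awk treats it as 0 the first time it is used in an addition.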