Six Essential awk Lessons for Computational Biologists

Ming Tommy Tang

4/12/2023

What is awk?

AWK is a scripting language and command-line tool for text processing and data extraction. Its
name derives from the initials of its creators, Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan,
who developed it at AT&T Bell Laboratories in 1977. AWK scans files for text that matches a pattern
and then performs an action on that text. It is frequently used for text replacement, data formatting
and printing, and string and number manipulation.
There are three main variants of AWK:

• AWK is the original AWK.

• NAWK is the new AWK.
• GAWK is GNU AWK. Most Linux distributions ship GAWK, although some (for example Debian and
Ubuntu) use mawk as the default awk. GAWK is compatible with AWK and NAWK.

Lesson 1 Awk Command Syntax

awk -Fs '/pattern/ {action}' input-file

# or
awk -Fs '{action}' input-file

• -F sets the field separator (the s in -Fs stands for the separator). If you don’t specify it, awk splits fields on runs of spaces and tabs.
• The /pattern/ and the {action} should be enclosed inside single quotes.
• /pattern/ is optional. If you don’t provide it, awk will process all the records from the input file.
If you specify a pattern, it will process only those records from the input-file that match the given
pattern.
• {action} - These are the awk commands, which can be one or multiple. The whole action block,
including all the awk commands together, should be enclosed between { and }

cat test.bed

## chr1 100 200
## chr1 300 500
## chr2 240 440
## chr2 400 600
## chr3 0 150
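
The commands in this tutorial assume that a tab-separated test.bed exists in the working directory. A minimal way to recreate it, assuming exactly the five records shown above with tabs between the columns, might be:

```shell
# Recreate the example file; \t writes a literal tab between columns.
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# Show the file to confirm its contents.
cat test.bed
```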

Print out the first column if the line matches chr1

awk -F '\t' '/chr1/ {print $1}' test.bed

## chr1
## chr1

Print out the second column if the line matches chr2

awk -F '\t' '/chr2/ {print $2}' test.bed

## 240
## 400

By default, awk uses runs of spaces and tabs as the separator, so you can omit the -F '\t':

awk '/chr2/ {print $2}' test.bed

## 240
## 400

$1 means the first column, $2 means the second column, and so on.
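
One thing worth knowing about /pattern/ is that it matches anywhere in the line. The sketch below uses a made-up file, demo.bed (not part of the original examples), to show that /chr1/ also selects chr10, while comparing the first field exactly does not:

```shell
# Hypothetical two-record file with chr1 and chr10 (tab-separated).
printf 'chr1\t100\t200\nchr10\t50\t80\n' > demo.bed

# /chr1/ matches anywhere in the line, so the chr10 record is printed too.
loose=$(awk '/chr1/ {print $1}' demo.bed)

# Comparing the first field exactly keeps only chr1.
exact=$(awk '$1 == "chr1" {print $1}' demo.bed)

printf 'pattern match:\n%s\nexact match:\n%s\n' "$loose" "$exact"
```

For chromosome names, the exact comparison is usually the safer choice.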

Lesson 2 Awk Program Structure (BEGIN, body, END block)
An awk program has the following three blocks: BEGIN, body, and END.

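awk 'BEGIN { awk-commands } \
/pattern/ {action} \
END { awk-commands }' input-file

The first line is the BEGIN block, the second is the body block, and the third is the END block.
You can use \ to split a command over multiple lines. The \ must be the last character on its line,
so nothing (not even a comment) may follow it.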

The begin block executes only once at the beginning, before awk starts executing the body block for all the
lines in the input file.

• The begin block is a good place to print report headers, and initialize variables.

• You can have one or more awk commands in the begin block.

• The keyword BEGIN should be specified in upper case.

• Begin block is optional.

The body block gets executed once for every line in the input file.

• If the input file has 10 records, the commands in the body block will be executed 10 times (once for
each record in the input file).

• There is no keyword for the body block. I discussed pattern and action previously.

The end block gets executed only once at the end, after awk completes executing the body block for all the
lines in the input-file.

• The end block is a good place to print a report footer and do any clean-up activities.

• You can have one or more awk commands in the end block.

• The keyword END should be specified in upper case.

• End block is optional.

Let’s see an example:

awk 'BEGIN {FS="\t"; print "----header ----"} \
/chr2/ {print $0} \
END {print "----end------"}' test.bed

## ----header ----
## chr2 240 440
## chr2 400 600
## ----end------

$0 means the whole line.

The BEGIN and END blocks are optional:

awk '/chr2/ {print $0} \
END {print "----end------"}' test.bed

## chr2 240 440
## chr2 400 600
## ----end------

awk 'BEGIN {FS="\t"; print "----header ----"} \
/chr2/ {print $0}' test.bed

## ----header ----
## chr2 240 440
## chr2 400 600
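
The three blocks can also cooperate through a variable. The following is a small sketch, assuming the same tab-separated test.bed as above: BEGIN initializes a counter, the body increments it for each matching record, and END reports it. The variable name n is made up for illustration.

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# Rules on separate lines inside the single quotes need no trailing backslash.
report=$(awk 'BEGIN { n = 0; print "chr2 records:" }
     /chr2/ { n++; print $0 }
     END    { print "count: " n }' test.bed)

echo "$report"
```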

Lesson 3 AWK built-in variables

• FS - Input Field Separator

There are two ways to specify the field separator:

awk -F '\t' '{print $2, $3}' test.bed

## 100 200
## 300 500
## 240 440
## 400 600
## 0 150

awk 'BEGIN {FS="\t"} {print $2, $3}' test.bed

## 100 200
## 300 500
## 240 440
## 400 600
## 0 150

Note that the default field separator is not just a single space. It matches one or more
whitespace characters (spaces or tabs).
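
To make the difference concrete, here is a small sketch with a made-up input line that mixes three spaces and a tab. With the default separator the line has three fields; with FS set to a single tab, the spaces stay inside the first field.

```shell
# One line with three spaces, then a tab, between values (made-up input).
line=$(printf 'a   b\tc')

# Default FS: any run of spaces/tabs is one separator, so $2 is "b".
default_second=$(printf '%s\n' "$line" | awk '{print $2}')

# FS set to a single tab: "a   b" stays together as field 1.
tab_first=$(printf '%s\n' "$line" | awk -F '\t' '{print $1}')

echo "default \$2: $default_second"
echo "tab \$1: $tab_first"
```

This matters for files whose values contain spaces, where an explicit -F '\t' keeps each column intact.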

• OFS - Output Field Separator

You can reformat the output by:

awk -F '\t' '{print $1, ":" $2, ":", $3}' test.bed

## chr1 :100 : 200
## chr1 :300 : 500
## chr2 :240 : 440
## chr2 :400 : 600
## chr3 :0 : 150

awk -F '\t' '{print $2, "," $1, ",", $3}' test.bed

## 100 ,chr1 , 200
## 300 ,chr1 , 500
## 240 ,chr2 , 440
## 400 ,chr2 , 600
## 0 ,chr3 , 150

The fields are still separated by a space, which is the default OFS. Now, set OFS explicitly:

awk -F '\t' 'BEGIN { OFS=":" } \
{ print $1, $2, $3 }' test.bed

## chr1:100:200
## chr1:300:500
## chr2:240:440
## chr2:400:600
## chr3:0:150

Now, the output is separated by “:”.

Note the subtle difference between including and omitting the comma in a print statement that
prints multiple values. When you put a comma between the print values, awk inserts the OFS
between them.

awk '{print $1, $2, $3 }' test.bed

## chr1 100 200
## chr1 300 500
## chr2 240 440
## chr2 400 600
## chr3 0 150

awk '{print $1 $2 $3 }' test.bed

## chr1100200
## chr1300500
## chr2240440
## chr2400600
## chr30150

Everything squeezes together.

awk '{print $1, "test", $2, "test", $3 }' test.bed

## chr1 test 100 test 200
## chr1 test 300 test 500
## chr2 test 240 test 440
## chr2 test 400 test 600
## chr3 test 0 test 150
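
As a further illustration of OFS, the sketch below converts the tab-separated test.bed to comma-separated output. It also shows a detail not covered above: print $0 emits the record unchanged, and OFS is only applied to $0 after a field has been assigned (the $1 = $1 idiom).

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# Commas in print insert OFS between the values.
csv=$(awk 'BEGIN { FS = "\t"; OFS = "," } { print $1, $2, $3 }' test.bed)

# print $0 prints the record as read, so the tabs remain.
raw=$(awk 'BEGIN { FS = "\t"; OFS = "," } { print $0 }' test.bed)

# Assigning to a field makes awk rebuild $0 using OFS.
rebuilt=$(awk 'BEGIN { FS = "\t"; OFS = "," } { $1 = $1; print $0 }' test.bed)

echo "$csv"
```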

• RS - Record Separator

cat test.bed | tr "\n" ":"

## chr1 100 200:chr1 300 500:chr2 240 440:chr2 400 600:chr3 0 150:

Let’s specify : as the record separator

cat test.bed | tr "\n" ":" | \
awk 'BEGIN {RS=":" } {print $1,$2}'

## chr1 100
## chr1 300
## chr2 240
## chr2 400
## chr3 0

If we do not specify the RS:

cat test.bed | tr "\n" ":" | \
awk '{print $1,$2}'

## chr1 100

• ORS - Output Record Separator

awk 'BEGIN {ORS=","} {print $1,$2}' test.bed

## chr1 100,chr1 300,chr2 240,chr2 400,chr3 0,

awk 'BEGIN {ORS="\n---\n"} {print $1,$2}' test.bed

## chr1 100
## ---
## chr1 300
## ---
## chr2 240
## ---
## chr2 400
## ---
## chr3 0
## ---

• NR - Number of Records

When used inside the BODY block, this gives the line number. When used in the END block, this gives the
total number of records in the file.

awk '{print "line",NR,"chromosome is",$1;} \
END {print "Total number of records:",NR}' test.bed

## line 1 chromosome is chr1
## line 2 chromosome is chr1
## line 3 chromosome is chr2
## line 4 chromosome is chr2
## line 5 chromosome is chr3
## Total number of records: 5
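
NR can also be used as a condition to select records by position. The sketch below, assuming the same test.bed, skips the first record (as one might skip a header line) and then prints only records 2 through 3 together with their record numbers.

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# Skip the first record, e.g. to drop a header line.
no_first=$(awk 'NR > 1 {print $0}' test.bed)

# Print only records 2 through 3, prefixed with the record number.
middle=$(awk 'NR >= 2 && NR <= 3 {print NR, $1, $2}' test.bed)

echo "$middle"
```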

Lesson 4 More built-in variables

• FILENAME – Current File Name

cp test.bed test2.bed

awk '{print FILENAME}' test.bed test2.bed

## test.bed
## test.bed
## test.bed
## test.bed
## test.bed
## test2.bed
## test2.bed
## test2.bed
## test2.bed
## test2.bed

This is useful when you want to combine multiple files and add an extra column recording which
file each line came from:

awk 'BEGIN {OFS="\t"} {print $0, FILENAME}' test.bed test2.bed

## chr1 100 200 test.bed
## chr1 300 500 test.bed
## chr2 240 440 test.bed
## chr2 400 600 test.bed
## chr3 0 150 test.bed
## chr1 100 200 test2.bed
## chr1 300 500 test2.bed
## chr2 240 440 test2.bed
## chr2 400 600 test2.bed
## chr3 0 150 test2.bed

Merge all bed files and add a column for the filename:

awk '{print $0 "\t" FILENAME}' *.bed

• FNR - File Number of Records (the record number within the current file)

NR is the “Number of Records” (or “Number of the Record”): the number of the record that is
currently being processed. How does NR behave when we give awk two input files? NR keeps growing
across files. When the body block starts processing the second file, NR is not reset to 1; instead,
it continues from the last NR value of the previous file.

awk '{print FILENAME ": record number",NR,"is",$1;} \
END {print "Total number of records:",NR}' test.bed test2.bed

## test.bed: record number 1 is chr1
## test.bed: record number 2 is chr1
## test.bed: record number 3 is chr2
## test.bed: record number 4 is chr2
## test.bed: record number 5 is chr3
## test2.bed: record number 6 is chr1
## test2.bed: record number 7 is chr1
## test2.bed: record number 8 is chr2
## test2.bed: record number 9 is chr2
## test2.bed: record number 10 is chr3
## Total number of records: 10

Let’s use FNR instead:

awk '{print FILENAME ": record number",FNR,"is",$1;} \
END {print "Total number of records:",NR}' test.bed test2.bed

## test.bed: record number 1 is chr1
## test.bed: record number 2 is chr1
## test.bed: record number 3 is chr2
## test.bed: record number 4 is chr2
## test.bed: record number 5 is chr3
## test2.bed: record number 1 is chr1
## test2.bed: record number 2 is chr1
## test2.bed: record number 3 is chr2
## test2.bed: record number 4 is chr2
## test2.bed: record number 5 is chr3
## Total number of records: 10

Merge multiple files that share the same header, keeping only the header of the first file. I
usually do this in R, but I like this quick solution:

awk 'FNR==1 && NR!=1{next;}{print}' *.csv
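
To see why this works, here is a small sketch with two made-up CSV files (a.csv and b.csv, with hypothetical contents) in a temporary directory. FNR==1 is true on the first line of every file, NR!=1 excludes the very first line overall, and next skips the rest of the rules for that record, so only the repeated headers are dropped.

```shell
# Two hypothetical CSV files sharing the same header line.
dir=$(mktemp -d)
printf 'gene,count\nTP53,10\n' > "$dir/a.csv"
printf 'gene,count\nMYC,7\n'   > "$dir/b.csv"

# Drop the header of every file except the first, print everything else.
merged=$(awk 'FNR == 1 && NR != 1 { next } { print }' "$dir"/*.csv)

echo "$merged"

# Clean up the temporary directory.
rm -r "$dir"
```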

Lesson 5 Variables

Awk variable names must begin with a letter or an underscore; the remaining characters can be letters,
digits, or underscores. Keywords cannot be used as variable names. Unlike in many other programming
languages, you don’t need to declare a variable before using it. If you wish to initialize a variable, it is
better to do so in the BEGIN block, which is executed only once. Awk has no type declarations: whether a
variable is treated as a number or a string depends on the context in which it is used.

awk 'BEGIN {total=0} \
{print $2; total=total+$2} \
END {print "total is " total}' test.bed

## 100
## 300
## 240
## 400
## 0
## total is 1040

Calculate the total number of mapped reads in a BAM file (samtools idxstats reports mapped reads
in its third column):

samtools idxstats example.bam | cut -f3 | \
awk 'BEGIN {total=0} {total += $1} END {print total}'
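
The same accumulate-then-report pattern works for other summaries. As a sketch, assuming the tab-separated test.bed from Lesson 1, the following computes the mean interval length (end minus start) by summing in the body block and dividing by NR in the END block:

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# Sum the interval lengths, then divide by the record count at the end.
mean=$(awk 'BEGIN { total = 0 }
     { total += $3 - $2 }
     END { if (NR > 0) print total / NR }' test.bed)

echo "mean interval length: $mean"
```

The NR > 0 check avoids a division by zero when the input is empty.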

Lesson 6 Arithmetic Operators

• + Addition
• - Subtraction
• * Multiplication
• / Division
• % Modulo Division

awk 'BEGIN {OFS="\t"} {print $1, $2 + 100, $3 }' test.bed

## chr1 200 200
## chr1 400 500
## chr2 340 440
## chr2 500 600
## chr3 100 150

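This is useful. If you have a transcription start site (TSS) bed file, you can use

awk 'BEGIN {OFS="\t"} {print $1, $2 - 1000, $3 }' tss.bed

to extend each interval 1000 bp upstream of its start. Note that this ignores strand and can
produce negative coordinates for sites within 1000 bp of the chromosome start.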
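
A slightly more defensive variant clamps the new start at 0 so that it never becomes negative. This is only a sketch: tss_demo.bed and its two records are made up for illustration, and strand is ignored (all sites are treated as plus-strand).

```shell
# Hypothetical TSS records (tab-separated); the second lies near the chromosome start.
printf 'chr1\t5000\t5001\nchr1\t300\t301\n' > tss_demo.bed

# Subtract 1000 from the start and clamp the result at 0.
upstream=$(awk 'BEGIN { OFS = "\t" }
     { start = $2 - 1000
       if (start < 0) start = 0   # BED coordinates cannot be negative
       print $1, start, $3 }' tss_demo.bed)

echo "$upstream"
```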

• String Operator: a space between two values concatenates them as strings. We have seen it in the
previous examples.

This operator is why you must separate the values in a print statement with a comma if you want to print
the OFS in between. If you do not include a comma to separate the values, the values are concatenated
instead.

awk 'BEGIN {OFS="\t"} {print "chromosome is "$1, "start is " $2 , "end is " $3 }' test.bed

## chromosome is chr1 start is 100 end is 200
## chromosome is chr1 start is 300 end is 500
## chromosome is chr2 start is 240 end is 440
## chromosome is chr2 start is 400 end is 600
## chromosome is chr3 start is 0 end is 150

• Comparison Operators
• > Is greater than
• >= Is greater than or equal to
• < Is less than
• <= Is less than or equal to
• == Is equal to
• != Is not equal to
• && Both the conditional expressions are true
• || Either one of the conditional expressions is true

A note on the following examples: If you don’t specify any action, awk will print the whole record if it
matches the conditional comparison.

awk '$1 == "chr1"' test.bed

## chr1 100 200
## chr1 300 500

awk '$2 > 100' test.bed

## chr1 300 500
## chr2 240 440
## chr2 400 600

awk '$2 > 100 && $3 >=500' test.bed

## chr1 300 500
## chr2 400 600

awk '$2 > 100 || $3 >=300' test.bed

## chr1 300 500
## chr2 240 440
## chr2 400 600

awk '$2 > 100 {print $2, $1}' test.bed

## 300 chr1
## 240 chr2
## 400 chr2
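
Arithmetic and comparison operators can be combined in one condition. As a sketch, assuming the tab-separated test.bed from Lesson 1, the following keeps only intervals that are at least 200 bp long and appends the length as a fourth value:

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# The subtraction is evaluated before the comparison.
wide=$(awk '$3 - $2 >= 200 {print $1, $2, $3, $3 - $2}' test.bed)

echo "$wide"
```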

• Regular Expression Operators

• ~ Match operator
• !~ No Match operator

When you use the == condition, awk requires the whole string to match exactly.

awk '$0 == "chr1"' test.bed

This prints nothing because no whole line is equal to chr1.

awk '$0 ~ "chr1"' test.bed

## chr1 100 200
## chr1 300 500
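
The match operators can also be applied to a single field rather than the whole line. The sketch below, assuming the tab-separated test.bed from Lesson 1, uses ~ with an anchored regular expression to select chr2 and chr3, and !~ to exclude every record whose first field contains chr1:

```shell
# Recreate the example file (tab-separated, same records as in Lesson 1).
printf 'chr1\t100\t200\nchr1\t300\t500\nchr2\t240\t440\nchr2\t400\t600\nchr3\t0\t150\n' > test.bed

# ~ with anchors: the first field must be exactly chr2 or chr3.
hits=$(awk '$1 ~ /^chr[23]$/ {print $1, $2}' test.bed)

# !~ inverts the match: keep records whose first field does not contain chr1.
not_chr1=$(awk '$1 !~ /chr1/ {print $1, $2}' test.bed)

echo "$hits"
```

On this file both commands select the same three records.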

Bonus:

• AWK GTF! How to Analyze a Transcriptome Like a Pro

• bioawk
• seqkit
