AWK for Data Processing Enthusiasts

- AWK is a programming language for text processing and data extraction tasks. It allows selecting, validating, transforming, and rearranging data through pattern matching and actions.
- An AWK program consists of pattern-action statements that are applied to each line of input. Patterns can be regular expressions and actions are executable code similar to C.
- AWK features include built-in variables, operators, control flow statements, functions, and arrays that allow flexible data manipulation and reporting. It is useful for tasks like report generation, validation, and simple transformations on text data.

Uploaded by

Andy Hidayat

AWK

a language for pattern scanning and processing

Al Aho, Brian Kernighan, Peter Weinberger
Bell Labs, ~1977

Intended for simple data processing:

selection, validation:
"Print all lines longer than 80 characters"
length > 80

transforming, rearranging:
"Replace the 2nd field by its logarithm"
{ $2 = log($2); print }

report generation:
"Add up the numbers in the first field,
then print the sum and average"
{ sum += $1 }
END { print sum, sum/NR }

Structure of an AWK program:

a sequence of pattern-action statements

pattern { action }
pattern { action }

"pattern" is a regular expression, numeric expression, string expression or combination
"action" is executable code, similar to C

Operation:
for each file
for each input line
for each pattern
if pattern matches input line
do the action

Usage:
awk 'program' [ file1 file2 ... ]
awk -f progfile [ file1 file2 ... ]
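
The usage line above can be tried directly from a shell. The following is a minimal sketch, assuming a POSIX shell and any POSIX awk; the sample data and the function names sum_avg and long_lines are made up for illustration, and the length threshold is lowered from 80 to 6 so the tiny sample shows a difference.

```shell
# Hypothetical sample data: two fields per line.
data='10 100
20 1000
30 10'

# report generation: sum and average of the first field
sum_avg() { awk '{ sum += $1 } END { print sum, sum/NR }'; }

# selection: print only lines longer than 6 characters
long_lines() { awk 'length > 6'; }

printf '%s\n' "$data" | sum_avg      # prints: 60 20
printf '%s\n' "$data" | long_lines   # prints: 20 1000
```

Wrapping each program in a shell function keeps the awk text in single quotes, so the shell never touches $1 or $0.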

AWK features:

input is read automatically across multiple files
    lines split into fields ($1, ..., $NF; $0 for whole line)
variables contain string or numeric values
    no declarations
    type determined by context and use
    initialized to 0 and empty string
built-in variables for frequently-used values
operators work on strings or numbers
    coerce type according to context
associative arrays (arbitrary subscripts)
regular expressions (like egrep)
control flow statements similar to C
    if-else, while, for, do
built-in and user-defined functions
    arithmetic, string, regular expression, text edit, ...
printf for formatted output
getline for input from files or processes
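
The points about undeclared variables and context-dependent types can be seen in a few lines. This is only a sketch, assuming a POSIX awk; the function name coerce_demo and the variable names are illustrative.

```shell
coerce_demo() {
    awk 'BEGIN {
        print n + 0          # uninitialized variable used as a number: 0
        print "[" s "]"      # uninitialized variable used as a string: empty
        x = "3" + 4          # string coerced to a number by +
        y = 3 " " 4          # numbers coerced to strings by concatenation
        print x, y
    }'
}
coerce_demo
# prints:
# 0
# []
# 7 3 4
```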

Basic AWK programs:

{ print NR, $0 }                           # precede each line by line number
{ $1 = NR; print }                         # replace first field by line number
{ print $2, $1 }                           # print field 2, then field 1
{ temp = $1; $1 = $2; $2 = temp; print }   # flip $1, $2
{ $2 = ""; print }                         # zap field 2
{ print $NF }                              # print last field

NF > 0                                     # print non-empty lines
NF > 4                                     # print if more than 4 fields
$NF > 4                                    # print if last field greater than 4

NF > 0 { print $1, $2 }                    # print two fields of non-empty lines

/regexpr/                                  # print matching lines (egrep)
$1 ~ /regexpr/                             # print lines where first field matches

END { print NR }                           # line count

# wc command
    { nc += length($0) + 1; nw += NF }
END { print NR, "lines", nw, "words", nc, "characters" }

# print the largest first field and the line that contains it
$1 > max { max = $1; maxline = $0 }
END      { print max, maxline }

Awk text formatter

#!/bin/sh
# f - format text into 60-char lines

awk '
/./  { for (i = 1; i <= NF; i++)
           addword($i) }
/^$/ { printline(); print "" }
END  { printline() }

function addword(w) {
    if (length(line) + length(w) > 60)
        printline()
    line = line space w
    space = " "
}

function printline() {
    if (length(line) > 0)
        print line
    line = space = ""
}
' "$@"

Arrays

Usual case: array subscripts are integers

Reverse a file:

{ x[NR] = $0 } # put each line into array x

END { for (i = NR; i > 0; i--)
print x[i] }

Making an array:
n = split(string, array, separator)
splits "string" into array[1] ... array[n]
returns number of elements
optional "separator" can be any regular expression
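
A short sketch of split with both kinds of separator, assuming a POSIX awk; the strings, array names and the function name split_demo are made up for illustration. The regular expression is passed as a string here.

```shell
split_demo() {
    awk 'BEGIN {
        n = split("2003-11-22", part, "-")        # single-character separator
        print n, part[1], part[3]
        m = split("a1b22c333d", word, "[0-9]+")   # regular expression separator
        print m, word[1], word[4]
    }'
}
split_demo
# prints:
# 3 2003 22
# 4 a d
```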

Associative Arrays

array subscripts can have any value

not limited to integers
canonical example: adding up name-value pairs

Input:
pizza 200
beer 100
pizza 500
beer 50

Output:
pizza 700
beer 150

program:

    { amount[$1] += $2 }
END { for (name in amount)
          print name, amount[name] | "sort -k2,2nr"
    }
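
The name-value example can be run end to end as follows. This is a sketch, assuming a POSIX shell, awk and sort; the function name totals is illustrative, and the sort key is given in the POSIX -k form.

```shell
totals() {
    awk '
        { amount[$1] += $2 }            # accumulate a total per name
    END { for (name in amount)          # iteration order is unspecified,
              print name, amount[name] | "sort -k2,2nr"   # so sort by total
        }'
}
printf 'pizza 200\nbeer 100\npizza 500\nbeer 50\n' | totals
# prints:
# pizza 700
# beer 150
```

Because the order of a for (name in amount) loop is not defined, piping the output through sort is what makes the report deterministic.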

Assembler & simulator for toy machine

hypothetical RISC machine (tiny SPARC)
10 instructions, 1 accumulator, 1K memory

# print sum of input numbers (terminated by zero)

        ld    zero   # initialize sum to zero
        st    sum
loop    get          # read a number
        jz    done   # no more input if number is zero
        add   sum    # add in accumulated sum
        st    sum    # store new value back in sum
        j     loop   # go back and read another number

done    ld    sum    # print sum
        put
        halt

zero    const 0
sum     const

assignment: write an assembler and simulator

Assembler and simulator/interpreter

# asm - assembler and interpreter for simple computer
#   usage: awk -f asm program-file data-files...

BEGIN {
    srcfile = ARGV[1]
    ARGV[1] = ""                    # remaining files are data
    tempfile = "asm.temp"
    n = split("const get put ld st add sub jpos jz j halt", x)
    for (i = 1; i <= n; i++)        # create table of op codes
        op[x[i]] = i-1

# ASSEMBLER PASS 1
    FS = "[ \t]+"
    while ((getline <srcfile) > 0) {
        sub(/#.*/, "")              # strip comments
        symtab[$1] = nextmem        # remember label location
        if ($2 != "") {             # save op, addr if present
            print $2 "\t" $3 >tempfile
            nextmem++
        }
    }
    close(tempfile)

# ASSEMBLER PASS 2
    nextmem = 0
    while ((getline <tempfile) > 0) {
        if ($2 !~ /^[0-9]*$/)       # if symbolic addr,
            $2 = symtab[$2]         # replace by numeric value
        mem[nextmem++] = 1000 * op[$1] + $2   # pack into word
    }

# INTERPRETER
    for (pc = 0; pc >= 0; ) {
        addr = mem[pc] % 1000
        code = int(mem[pc++] / 1000)
        if      (code == op["get"])  { getline acc }
        else if (code == op["put"])  { print "\t" acc }
        else if (code == op["st"])   { mem[addr] = acc }
        else if (code == op["ld"])   { acc = mem[addr] }
        else if (code == op["add"])  { acc += mem[addr] }
        else if (code == op["sub"])  { acc -= mem[addr] }
        else if (code == op["jpos"]) { if (acc > 0) pc = addr }
        else if (code == op["jz"])   { if (acc == 0) pc = addr }
        else if (code == op["j"])    { pc = addr }
        else if (code == op["halt"]) { pc = -1 }
        else                         { pc = -1 }
    }
}

Anatomy of a compiler

[Figure: input -> lexical analysis -> tokens -> syntax analysis (with symbol table) -> intermediate form -> code generation -> object file -> linking -> a.out; at run time, input data -> a.out -> output]

Anatomy of an interpreter

[Figure: input -> lexical analysis -> tokens -> syntax analysis (with symbol table) -> intermediate form -> execution; input data -> execution -> output]

Parsing by recursive descent

expr: term | expr + term | expr - term

term: factor | term * factor | term / factor
factor: NUMBER | ( expr )

NF > 0 {
    f = 1
    e = expr()
    if (f <= NF) printf("error at %s\n", $f)
    else printf("\t%.8g\n", e)
}

function expr(  e) {        # term | term [+-] term
    e = term()
    while ($f == "+" || $f == "-")
        e = $(f++) == "+" ? e + term() : e - term()
    return e
}

function term(  e) {        # factor | factor [*/] factor
    e = factor()
    while ($f == "*" || $f == "/")
        e = $(f++) == "*" ? e * factor() : e / factor()
    return e
}

function factor(  e) {      # number | (expr)
    if ($f ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)$/) {
        return $(f++)
    } else if ($f == "(") {
        f++
        e = expr()
        if ($(f++) != ")")
            printf("error: missing ) at %s\n", $f)
        return e
    } else {
        printf("error: expected number or ( at %s\n", $f)
        return 0
    }
}

YACC and LEX

languages for building bigger languages

YACC: "yet another compiler compiler" (S. C. Johnson, ~1972)
    converts a grammar and semantic actions into a parser for that grammar

LEX: lexical analyzer generator (M. E. Lesk, ~1974)
    converts regular expressions for tokens into a lexical analyzer that recognizes those tokens

When to think of using them:

real grammatical structures (e.g., recursively defined)
complicated lexical structures
rapid development time is important
language design might change

YACC overview

YACC converts grammar rules and semantic actions into a parsing function yyparse()
    yyparse parses programs written in that grammar
    and performs the semantic actions as grammatical constructs are recognized

yyparse calls yylex each time it needs another input token
    yylex returns a token type and stores a token value in an external variable (yylval) for yyparse to find

semantic actions usually build a parse tree
    but could just execute on the fly:

YACC-based calculator

%{
#include <stdio.h>
#include <ctype.h>
#define YYSTYPE double   /* data type of yacc stack */
int lineno = 1;
int yylex(void);
void yyerror(const char *s);
%}
%token NUMBER
%left '+' '-'   /* left associative, same precedence */
%left '*' '/'   /* left assoc., higher precedence */
%%
list:     expr '\n'        { printf("\t%.8g\n", $1); }
        | list expr '\n'   { printf("\t%.8g\n", $2); }
        ;
expr:     NUMBER           { $$ = $1; }
        | expr '+' expr    { $$ = $1 + $3; }
        | expr '-' expr    { $$ = $1 - $3; }
        | expr '*' expr    { $$ = $1 * $3; }
        | expr '/' expr    { $$ = $1 / $3; }
        | '(' expr ')'     { $$ = $2; }
        ;
%%
/* end of grammar */

int main(void) {   /* calculator */
    return yyparse();
}

int yylex(void) {   /* calculator lexical analysis */
    int c;

    while ((c = getchar()) == ' ' || c == '\t')
        ;
    if (c == EOF)
        return 0;
    if (c == '.' || isdigit(c)) {   /* number */
        ungetc(c, stdin);
        scanf("%lf", &yylval);      /* lexical value */
        return NUMBER;              /* lexical type */
    }
    if (c == '\n')
        lineno++;
    return c;
}

void yyerror(const char *s) {   /* called for yacc syntax error */
    fprintf(stderr, "%s near line %d\n", s, lineno);
}

YACC overview, continued

semantic actions usually build a parse tree
    each node represents a particular syntactic type
    children represent components
code generator walks the tree to generate code
    may rewrite tree as part of optimization
an interpreter could
    run directly from the program (TCL)
    interpret directly from the tree (AWK, Perl?):
        at each node,
            interpret children
            do operation of node itself
            return result to caller
    generate byte code output to run elsewhere (Java)
        or other virtual machine instructions
    generate internal byte code (Perl??, Python?, ...)
    generate C or something else

compiled code runs faster
    but compilation takes longer, needs object files, less portable
interpreters start faster, but run slower
    for 1- or 2-line programs, interpreter is better
on the fly / just in time compilers merge these

Grammar specified in YACC

grammar rules give syntax
action part of a rule gives semantics
    usually used to build a parse tree

statement:
    IF ( expression ) statement
        create node(IF, expr, stmt, 0)
    IF ( expression ) statement ELSE statement
        create node(IF, expr, stmt1, stmt2)
    WHILE ( expression ) statement
        create node(WHILE, expr, stmt)
    variable = expression
        create node(ASSIGN, var, expr)

expression:
    expression + expression
    expression - expression
    ...

YACC creates a parser from this
when the parser runs, it creates a parse tree

Excerpt from a real grammar

term:
      term '/' ASGNOP term      { $$ = op2(DIVEQ, $1, $4); }
    | term '+' term             { $$ = op2(ADD, $1, $3); }
    | term '-' term             { $$ = op2(MINUS, $1, $3); }
    | term '*' term             { $$ = op2(MULT, $1, $3); }
    | term '/' term             { $$ = op2(DIVIDE, $1, $3); }
    | term '%' term             { $$ = op2(MOD, $1, $3); }
    | term POWER term           { $$ = op2(POWER, $1, $3); }
    | '-' term %prec UMINUS     { $$ = op1(UMINUS, $2); }
    | '+' term %prec UMINUS     { $$ = $2; }
    | NOT term %prec UMINUS     { $$ = op1(NOT, notnull($2)); }
    | BLTIN '(' patlist ')'     { $$ = op2(BLTIN, itonp($1), $3); }
    | DECR var                  { $$ = op1(PREDECR, $2); }
    | INCR var                  { $$ = op1(PREINCR, $2); }
    | var DECR                  { $$ = op1(POSTDECR, $1); }
    | var INCR                  { $$ = op1(POSTINCR, $1); }

Excerpts from a LEX analyzer
"++" { yylval.i = INCR; RET(INCR); }
"--" { yylval.i = DECR; RET(DECR); }

([0-9]+(\.?)[0-9]*|\.[0-9]+)([eE](\+|-)?[0-9]+)? {
yylval.cp = setsymtab(yytext, tostring(yytext),
atof(yytext), CON|NUM, symtab);
RET(NUMBER); }

while { RET(WHILE); }
for { RET(FOR); }
do { RET(DO); }
if { RET(IF); }
else { RET(ELSE); }
return {
if (!infunc)
ERROR "return not in function" SYNTAX;
RET(RETURN);
}

. { RET(yylval.i = yytext[0]);
/* everything else */
}

Whole process

[Figure: grammar -> YACC -> y.tab.c (parser); lexical rules -> Lex (or other) -> lex.yy.c (analyzer); parser, analyzer and other C code -> C compiler -> a.out]

AWK implementation

source code is about 6000 lines of C and YACC
    compiles without change on Unix/Linux, Windows, Mac

parse tree nodes:

typedef struct Node {
    int type;               /* ARITH, */
    struct Node *next;
    struct Node *child[4];
} Node;

leaf nodes (values):

typedef struct Cell {
    int type;               /* VAR, FLD, */
    struct Cell *next;
    char *name;
    char *sval;             /* string value */
    double fval;            /* numeric value */
    int state;              /* STR | NUM | ARR */
} Cell;

Testing

700-1000 tests in regression test suite
record of all bug fixes since August 1987

Nov 22, 2003: fixed a bug in regular expressions
that dates (so help me) from 1977; it's been there
from the beginning. an anchored longest match
that was longer than the number of states
triggered a failure to initialize the machine
properly. many thanks to moinak ghosh for not only
finding this one but for providing a fix, in some of
the most mysterious code known to man.

fixed a storage leak in call() that appears to have
been there since 1983 or so -- a function without an
explicit return that assigns a string to a parameter
leaked a Cell. thanks to moinak ghosh for spotting
this very subtle one.

and some not yet fixed:

"Consider the awk program:
awk '{print $40000000000000}'
which exhausts memory on the system. this actually
occurred in the program:
awk '{i += $2}
END {print $i}'
where the simple typing error crashed the system."

Using awk for testing RE code

regular expression tests are described in a very small specialized language:

^a.$    ~     ax
              aa
        !~    xa
              aaa
              axy

each test is converted into a command that exercises awk:

echo 'ax' | awk '!/^a.$/ { print "bad" }'

illustrates
    little languages
    programs that write programs
    mechanization
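
The conversion from test descriptions to commands might be sketched as below. This is not the actual test harness: the tab-separated three-column layout, the variable spec and the function name gen_tests are assumptions made for illustration, and a POSIX shell and awk are assumed.

```shell
# Hypothetical tab-separated test spec: pattern, operator, string.
# An empty pattern or operator field means "same as the line above".
spec=$(printf '^a.$\t~\tax\n\t\taa\n\t!~\txa\n\t\taaa\n\t\taxy\n')

# gen_tests: turn each spec line into one shell command that prints
# "bad" when awk's matching disagrees with the expected outcome.
gen_tests() {
    awk -F'\t' -v q="'" '
    $1 != "" { re = $1 }                  # remember the current pattern
    $2 != "" { op = $2 }                  # remember the current operator
    $3 != "" {
        neg = (op == "~") ? "!" : ""      # a match is expected: complain if none
        printf "echo %s%s%s | awk %s%s/%s/ { print \"bad\" }%s\n",
               q, $3, q, q, neg, re, q
    }'
}

printf '%s\n' "$spec" | gen_tests        # show the generated commands
printf '%s\n' "$spec" | gen_tests | sh   # run them; silence means all tests pass
```

One awk program writes the commands and a second awk, started by sh, executes them, which is the "programs that write programs" idea in miniature. The sketch does no quoting of patterns or strings that contain a single quote or a slash.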

Lessons

people use tools in unexpected, perverse ways
    compiler writing
    implementing languages, etc.
    object language
    first programming language

existence of a language encourages programs to generate it
    machine generated inputs stress differently than people do

mistakes are inevitable and hard to change
    concatenation syntax
    ambiguities, especially with >
    function syntax
    creeping featurism from user pressure
    difficulty of changing a "standard"

"One thing [the language designer] should not do is to include untried ideas of his own."
(C. A. R. Hoare, Hints on Programming Language Design, 1973)
