# Awk

Awk (named for its creators Aho, Weinberger, and Kernighan) is a programming
language for text processing and data wrangling (sometimes called data munging).

A great resource: Learn X in Y Minutes where X = awk

- A note on examples: Some of the examples in these notes assume that awk is
  being run from the command line, e.g.:
  ```bash
  awk 'pattern {action;}' file.txt
  ```
  The starting `awk` command, input file, and single quotes are assumed for any
  example that doesn't start with `awk`.
## Table of Contents

- Syntax
- Field and Record Separators
- Simple Examples
- Variables in Awk
- Patterns and Actions
- Useful Builtin Functions
- Control Structures in Awk
- Conditionals in Awk
- Logical Operators
- Regular Expression Match
- Conditional (Ternary) Operator
- The BEGIN Keyword
- The END Keyword
- Examples Using `BEGIN` and `END`
- Passing External Variables
- Setting Multiple Field Separators
- Builtin Functions
- Looping over a Single Line
- Using Awk as an Interpreter
- Removing Duplicate Lines with Awk
## Syntax

### Basic Syntax of Awk on the Command Line

The basic syntax of using `awk` on the command line:

```bash
awk [options] 'program' input-file(s)
```

- `options`: Command-line options (e.g., `-f` to specify a file containing an
  `awk` script).
- `program`: Instructions for `awk` to execute, typically enclosed in single quotes.
- `input-file(s)`: The file(s) `awk` will process.
    - If no files are given, `awk` reads from standard input.
### Basic Syntax of an Awk Script

Awk is also an interpreter for a programming language.
That means it can be used in the shebang line of a script.

The structure of an Awk script:

```awk
#!/bin/awk -f
# This is a comment
BEGIN { print "START" }
      { print }
END   { print "STOP" }
```

- Notice the `-f` option following `#!/bin/awk`.
    - The `-f` option specifies the Awk file containing the instructions.
    - This is used on the command line when you use Awk to execute a file
      directly (`awk -f filename`).
- The `BEGIN` and `END` blocks are optional.
    - `BEGIN { ... }`: Executed before processing the first line of input.
    - `{ ... }`: The pattern and action blocks to execute for each line of input.
    - `END { ... }`: Executed after processing the last line of input.
## Field and Record Separators

Awk recognizes two types of separators:

- The field separator, which is the character or string that separates fields
  (columns) in a record.
    - Also called the delimiter.
- The record separator, which is the character or string that separates
  records (lines) in a file.
- By default, `awk` uses any whitespace as the field separator and a newline
  as the record separator.
- You can specify a field separator using the `-F` option:
  ```bash
  awk -F: '{print($1)}' /etc/passwd
  ```
    - Note: Parentheses are optional for the `print` function.
- You can also specify a new field separator or delimiter from within the
  `awk` program, using the `BEGIN` block:
  ```bash
  awk 'BEGIN { FS=":" } {print($1)}' /etc/passwd
  ```
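The record separator can be changed the same way, via the `RS` variable. A
minimal sketch (using inline `printf` input rather than a real file) that
treats each colon-separated chunk as its own record:

```bash
# With RS=":", each colon-delimited chunk becomes a separate record,
# so NR counts chunks instead of lines.
printf 'alpha:beta:gamma' | awk 'BEGIN { RS=":" } { print NR, $0 }'
# Output:
# 1 alpha
# 2 beta
# 3 gamma
```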
## Simple Examples

### Print the First Column of a Text File

```bash
awk '{print $1}' file.txt
```

- `{print $1}` is the `awk` program that instructs `awk` to print the first
  field (`$1`) of each record (typically, a line in the file).

### Searching for a Pattern in the Entire Line

```bash
awk '/pattern/ {print $0}' file.txt
```

### Printing Multiple Fields of a Line

```bash
awk 'BEGIN {FS=":"} {print $1, $2}' /etc/passwd
```

- This sets the field separator to `:` and prints the first two fields of each
  line in `/etc/passwd`.
## Variables in Awk

### Built-in Variables

These are builtin variables in awk:

- `FS`: Field separator variable (default is whitespace).
- `OFS`: Output field separator (default is a space).
- `NR`: Number of the current record (line).
- `NF`: Number of fields in the current record.
    - This can be used to print the last field in a line when we don't know
      how many fields there are:
      ```bash
      awk '{ print $NF }' somefile
      ```
    - If there are 7 fields, `NF` holds the value 7. So, `$NF` is equivalent
      to `$7`.
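A quick sketch of these variables working together, using inline input for
reproducibility:

```bash
# Print the record number, field count, and last field of each line,
# joining the output fields with " | " via OFS.
printf 'a b c\nd e\n' | awk 'BEGIN { OFS=" | " } { print NR, NF, $NF }'
# Output:
# 1 | 3 | c
# 2 | 2 | e
```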
### Line Variables (Field Variables)

`awk` processes text data as a series of records, which are, by default,
individual lines in the input text.

- Each record is automatically split into fields based on a field separator
  (whitespace by default), which can be changed with the `-F` option.
    - The separator can also be changed with the `FS` variable from inside `awk`.
- Fields in a record are accessed using `$1`, `$2`, `$3`, etc.
    - `$1` is the first field, `$2` is the second field, and so on.
    - `$0` is the entire line.
### Declaring Variables

When you use a variable in `awk`, it is automatically initialized.
That means it does not need to be explicitly declared.

```bash
awk 'BEGIN {var = "value"} {print(var)}'
```
## Patterns and Actions

`awk` programs follow a pattern-action model:

```awk
pattern { action }
```

E.g., to print lines where the first field is greater than 10:

```bash
awk '$1 > 10 {print}' file.txt
```
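Since the default action is to print the whole record, a bare pattern works as
a filter on its own. A small sketch with inline input:

```bash
# Lines whose first field is numerically greater than 10 pass through.
printf '5\n42\n7\n100\n' | awk '$1 > 10'
# Output:
# 42
# 100
```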
## Useful Builtin Functions

- `printf()`: Prints a format string to stdout.
- `length()`: Returns the length of a string.
- `split()`: Splits a string into an array.
- `substr()`: Returns a substring of a string.

### Basic Usage Example

Print the length of the second field:

```bash
awk '{print length($2)}' file.txt
```
## Control Structures in Awk

`awk` supports common control structures like `if-else`, `while`, `for`, and
`do-while`.

Example: Print fields greater than 10:

```bash
awk '{for (i = 1; i <= NF; i++) if ($i > 10) print $i}' file.txt
```

- The iterator uses `NF` (the number of fields/columns) as its max index.
- When the value of a field is greater than 10, that field is printed.

### Example: Loop over the fields of a line

Use a `for` loop to print each field in a line:

```bash
awk '{ for(i=1; i<=NF; i++) print($i); }' file.txt
```
## Conditionals in Awk

### Relational Operators

Relational operators compare two values or expressions.
awk supports the following relational operators:

- `<`
- `<=`
- `==`
- `!=`
- `>=`
- `>`
- `~` (regex match)
- `!~` (regex not match)

### Conditionals Examples

- Numeric Comparison:
  ```bash
  awk '$1 > 100 { print $0 }' data.txt
  ```
    - This prints lines where the value in the first field is greater than 100.
- String Equality:
  ```bash
  awk '$2 == "admin" { print $0 }' /etc/passwd
  ```
    - This prints lines from the `/etc/passwd` file where the second field is "admin".
- Not Equal:
  ```bash
  awk 'NF != 5 { print $0 }' data.txt
  ```
    - This prints lines that don't have exactly 5 fields.
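The files above (`data.txt`, `/etc/passwd`) vary by system, so here is the
same string-equality test against inline stand-in input:

```bash
# Print the first field of lines whose second field is exactly "admin".
printf 'alice admin\nbob user\n' | awk '$2 == "admin" { print $1 }'
# Output:
# alice
```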
## Logical Operators

Logical operators are used to combine multiple conditions. `awk` supports
logical AND (`&&`), OR (`||`), and NOT (`!`).

### Logical Examples

- Logical AND:
  ```bash
  awk '$1 > 100 && $2 < 200 { print $0 }' data.txt
  ```
    - This prints lines where the first field is greater than 100 and the
      second field is less than 200.
- Logical OR:
  ```bash
  awk '$1 == "John" || $2 >= 50 { print $0 }' data.txt
  ```
    - This prints lines where the first field is "John" or the second field is
      50 or more.
- Logical NOT:
  ```bash
  awk '!($1 == "admin") { print $0 }' /etc/passwd
  ```
    - This prints lines where the first field is not "admin".
## Regular Expression Match

Regex is available in awk: the `~` (match) and `!~` (not match) operators are
used with regular expressions to test whether a field or string matches a
given pattern.

### Regex Examples

- Regular Expression Match:
  ```bash
  awk '$1 ~ /^admin/ { print $0 }' /etc/passwd
  ```
    - This prints lines where the first field starts with "admin".
- Regular Expression Not Match:
  ```bash
  awk '$1 !~ /^root/ { print $0 }' /etc/passwd
  ```
    - This prints lines where the first field does not start with "root".
- Field Match:
  ```bash
  awk '$3 ~ /[0-9]+/ { print $0 }' data.txt
  ```
    - This prints lines where the third field contains one or more digits.
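A reproducible sketch of the `~` operator against inline input (note that
`/^admin/` matches any field that *starts* with "admin", not just the exact
word):

```bash
printf 'admin1 x\nroot y\nadminsys z\n' | awk '$1 ~ /^admin/ { print $1 }'
# Output:
# admin1
# adminsys
```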
## Conditional (Ternary) Operator

The ternary operator `?:` is used to choose between two values based on a
condition.
It is the only ternary operator in `awk`.

### Ternary Examples

- Inline Conditions:
  ```bash
  awk '{ print ($1 > 50) ? "High" : "Low" }' data.txt
  ```
    - This prints "High" if the first field is greater than 50, and "Low" otherwise.
- Field Selection:
  ```bash
  awk '{ print ($1 > $2) ? $1 : $2 }' data.txt
  ```
    - This prints the larger of the first two fields.
- Adjust Output Based on Conditions:
  ```bash
  awk '{ printf("%s - %s\n", $1, ($2 > 100) ? "Expensive" : "Cheap") }' prices.txt
  ```
    - This prints each item's name and categorizes it as "Expensive" or
      "Cheap" based on the second field.
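A reproducible version of the inline-condition example, with the ternary
wrapped in an extra set of parentheses to avoid any ambiguity in `print`'s
argument list:

```bash
printf '60\n40\n' | awk '{ print (($1 > 50) ? "High" : "Low") }'
# Output:
# High
# Low
```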
## The BEGIN Keyword

The `BEGIN` keyword introduces a block of code that is executed before the
main program starts processing input.
It's used to initialize variables and to perform any other setup tasks that
need to be done before the first record is processed.

- Purpose: The `BEGIN` block in `awk` is executed one time, before any input
  lines are processed.
    - It's the perfect place to initialize variables or print headers in your output.
- Usage: You might use `BEGIN` to set the field separator (`FS`) to parse CSV
  files, or to print a title row for a report.

Example:

```bash
awk 'BEGIN {FS=" "; count=0;} { count++; printf("Line number: %d\n", count) }' myfile
```
## The END Keyword

The `END` keyword is similar to `BEGIN`, but its block runs at the end of the
program.
It's used to perform any cleanup or summary tasks after all the input lines
have been processed.

- Purpose: The `END` block is executed one time, after all input lines have
  been processed.
    - It's ideal for summarizing data, such as calculating averages or totals.
- Usage: Use `END` to perform actions that should only occur after all input
  has been read.
    - E.g., displaying a total count of processed records.

Example:

```bash
awk '{ count++ } END { printf("Total Records: %d\n", count) }' myfile
```

- Here, the `count` variable is implicitly initialized to 0.
    - This means that variable declaration is not required.
    - While declaration is not required, it is encouraged.
- Then, when the input is finished being processed, the `END` block is run.
    - Here, it prints the total number of records processed (the number of lines).
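A reproducible sketch of the `END` pattern, summing a column and reporting a
total and average after all input is read:

```bash
# Accumulate a running total per line; summarize once in END.
printf '10\n20\n30\n' | awk '{ total += $1 } END { printf("Total: %d Avg: %d\n", total, total/NR) }'
# Output:
# Total: 60 Avg: 20
```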
## Examples Using `BEGIN` and `END`

### Output the Number of Headers in a Markdown File

Using a pattern to count the number of headers in a markdown file:

```bash
awk 'BEGIN {FS=" "; count=0;} /^#/ {count++} END {print(count)}' ./conditionals_in_bash.md
```

```bash
awk 'BEGIN {count = 0;}
/^#/ {
    count++;
    print("Header number " count " found.");
}
END {
    print(count);
}' ./conditionals_in_bash.md
```

- Note: Opening braces (`{`) must be on the same line as their corresponding
  block declaration, i.e., the `BEGIN` or `END` keyword, patterns, etc.
    - If they're not, the block will be treated as the main program.
### Loop Over the Fields of a Line

Loop over the fields of a line and perform a substitution on each field.
Set the field separator to a space (the default, but set here for example
purposes):

```bash
awk 'BEGIN {FS=" "}
/^#/ { for(i=1; i<=NF; i++) {
        gsub("a", "x", $i);
        print($i)
    }
}' ./conditionals_in_bash.md
```

- This replaces every `a` with `x` in each field of each header line, then
  prints that field.

Let's add some conditionals:

```bash
awk 'BEGIN {FS=" "} /^#/ {
    for(i=1; i<=NF; i++) {
        if ( i == 1 ) {
            print("-X-")
        }
        else {
            gsub("a", "x", $i)
            print($i);
        }
    }
}' ./conditionals_in_bash.md
```

- This prints `-X-` instead of the first field.
### Loop Over the Fields of Header Lines in a Markdown File

Loop over the fields of each header line and perform a substitution on each
field.
Keep track of how many header lines are found, and how many substitutions are
made:

```bash
awk 'BEGIN {
    FS=" "; headers=0; subs=0;
}
/^#/ {
    for(i=1; i<=NF; i++) {
        subs+=gsub("a", "x", $i);
        printf("Header: %d - %s Subs: %d\n", headers, $i, subs)
    }
    headers++
}
END {
    printf("Total Substitutions: %d\nTotal Headers: %d\n", subs, headers)
}' ./conditionals_in_bash.md
```
## Passing External Variables

Use the `-v` option to pass external variables to `awk`:

```bash
awk -v var="value" '{print var, $1}' file.txt
```
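A reproducible sketch using a hypothetical `env` variable (the name and value
are made up for illustration):

```bash
# The shell variable is injected into awk as "env" and prefixed onto each line.
printf 'web1\nweb2\n' | awk -v env="prod" '{ print env, $1 }'
# Output:
# prod web1
# prod web2
```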
## Setting Multiple Field Separators

Usually when using `awk`, you print columns that are separated by a space or
another delimiter that you specify.

```bash
printf "one, two three\n" | awk '{ print $2 }'
# output:
# two
```

The default field separator in `awk` is whitespace.

Specify a different field separator with either the `-F` command line option,
or by setting the `FS` variable in the `BEGIN` block.

```bash
printf "one, two three - four\n" | awk -F ',' '{ print $2 }'
# or
printf "one, two three - four\n" | awk 'BEGIN { FS="," } { print $2 }'
# Output:
#  two three - four
```

But, what if we wanted to specify multiple field separators?
We can do this by giving `-F` a regular expression.
Square brackets (`[ ... ]`) define a character class, and any character in
that class is treated as a field separator.

```bash
printf "one, two three - four\n" | awk -F '[,-]' '{ print $2 }'
# or
printf "one, two three - four\n" | awk 'BEGIN { FS="[,-]" } { print $2 }'
# Output:
#  two three
```

The field separators here are a comma `,` and a dash `-`, which gives us three
columns in total:

| Column 1 | Column 2 | Column 3 |
| --- | --- | --- |
| `one` | ` two three ` | ` four` |

Note the spaces in the second and third columns. Those aren't stripped if
we're splitting fields based on characters other than whitespace.
The only characters that are stripped are the field separators themselves.

```bash
printf "one, two three - four\n" | awk -F '[,-]' '{ print $1 $2 $3 }'
# Output:
# one two three  four
```

If we didn't want to include spaces in the fields, we could do a couple of
different things.
We could include the surrounding spaces in the field separator's regular
expression.
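For example, letting the separator regex consume any spaces around the comma
or dash strips them from the fields:

```bash
# ' *[,-] *' matches optional spaces, then a comma or dash, then optional spaces,
# so the surrounding whitespace is swallowed by the separator.
printf 'one, two three - four\n' | awk -F ' *[,-] *' '{ print $2 }'
# Output:
# two three
```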
## Builtin Functions

Awk, as a programming language in and of itself, has builtin functions that
can be used to parse text.
Below are some of the builtin functions that are available to `awk` by
default.

Note: Anything in square brackets `[ ]` is optional.
### Awk String Functions

- `gsub(r, s [, t])`: Globally substitutes `s` for each match of the regular
  expression `r` in the string `t`.
    - If `t` is not supplied, operates on `$0`.
    - Several chars/strings can be given to `r` with the OR `|` operator: `"a|b"`
- `index(s, t)`: Returns the index of the substring `t` in the string `s`.
    - Returns `0` if `t` is not found.
- `length([s])`: Returns the length of string `s`.
    - Returns the length of `$0` if `s` is not provided.
- `match(s, r)`: Tests if the string `s` contains a substring matched by the
  regex `r`.
    - Returns the index of the first match.
- `split(s, a [, r])`: Splits the string `s` into the array `a` on the regex `r`.
    - Returns the number of fields.
    - If `r` is not supplied, splits on `FS`.
- `sprintf(fmt, ...)`: Returns the string resulting from formatting the
  arguments according to the C-style format string `fmt`.
    - `printf` can be used to print a format string directly.
- `sub(r, s [, t])`: Substitutes `s` for the first match of the regular
  expression `r` in the string `t`.
    - If `t` is not specified, operates on `$0`.
- `substr(s, i [, n])`: Returns the substring of `s` starting at index `i`,
  with length `n`.
    - If `n` is not supplied, returns the substring from `i` to the end of `s`.
- `tolower(s)`: Returns an all-lowercase copy of the string `s`.
- `toupper(s)`: Returns an all-caps copy of the string `s`.
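A quick tour of several of these string functions in a `BEGIN` block (no input
file needed):

```bash
awk 'BEGIN {
    s = "Hello World";
    print length(s);            # 11
    print index(s, "World");    # 7 (1-based index)
    print substr(s, 1, 5);      # Hello
    print toupper(s);           # HELLO WORLD
    n = split("a:b:c", parts, ":");
    print n, parts[2];          # 3 b
    sub(/World/, "Awk", s);     # replace first match, in place
    print s;                    # Hello Awk
}'
```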
### Awk Numeric Functions

- `atan2(y, x)`: Returns the arctangent of `y/x` in radians.
- `cos(x)`: Returns the cosine of `x`, where `x` is in radians.
- `exp(x)`: Returns the exponential of `x`.
- `int(x)`: Returns the integer part of `x`, truncating towards zero.
- `log(x)`: Returns the natural logarithm of `x`.
- `rand()`: Returns a random number `n`, where `0 <= n < 1`.
- `sin(x)`: Returns the sine of `x`, where `x` is in radians.
- `sqrt(x)`: Returns the square root of `x`.
- `srand([x])`: Sets the seed for `rand()` to `x` and returns the previous seed.
    - If `x` is omitted, uses the system time to set the seed.
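A few of the deterministic numeric functions in a `BEGIN` block:

```bash
awk 'BEGIN {
    print int(3.9);                  # 3 (truncates toward zero)
    print int(-3.9);                 # -3 (not -4; this is not floor)
    print sqrt(16);                  # 4
    print exp(0);                    # 1
    printf("%.4f\n", atan2(0, -1));  # pi, to four decimal places
}'
```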
### Awk Time Functions (GNU `awk`)

- `strftime([format [, timestamp]])`: Formats the timestamp according to
  `format`.
    - Without arguments, formats the current time.
- `systime()`: Returns the current time as a timestamp (number of seconds
  since the "epoch", `1970-01-01 00:00:00 UTC`).
### Function Examples

For example, to get only the numbers (IP, day, port number, etc.) from invalid
SSH attempts:

```bash
journalctl -u ssh | grep invalid | awk 'BEGIN {FS=" "} { gsub("[^0-9 .]", "", $0); print($0);}'
```

```bash
journalctl -u ssh |
    grep invalid |
    awk '
    BEGIN {FS=" "}
    {
        gsub("[^0-9 .]", "", $0);
        printf("IP Address: %s - Port: %d\n", $4, $5);
    }'
```

- Piping through `grep` and then to `awk` is a common practice, but it's not
  necessary.
- Instead of piping to `grep`, then piping to `awk`, we could use a
  `/pattern/` inside the `awk` invocation. This avoids spawning an extra
  process.
  ```bash
  journalctl -u ssh | awk '
  BEGIN {FS=" "}
  /invalid/ {
      printf("Date: %s %d \n\t", $1, $2);
      gsub("[^0-9 .]", "", $0);
      printf("IP Address: %s - Port: %d\n", $4, $5);
  }'
  ```
    - I added another `printf` here to extract the date as well, but the
      functionality otherwise remains the same.
## Looping over a Single Line

Use a `for` loop to loop over a single line when piping through `awk`:

```bash
echo "$thing" | awk -F '[ =]' '{
    for(i=1; i<NF; i++)
        if($i == "ansible_host") print $(i+1) }'
```

- `-F '[ =]'`: Use either a space or an equals sign as the field separator.
### Example: Extracting the Node IP from an Ansible Host File

```bash
while read -r l; do
    # E.g., Extracting the node IP from an ansible host file:
    NODE=$(printf "%s" "$l" | awk -F '[ =]' '{ for(i=1; i<NF; i++) if($i == "ansible_host") print $(i+1) }')
done < ./hosts
```

- `-F '[ =]'`: Use either a space or an equals sign as the field separator.
- `{ for(i=1; i<NF; i++)`:
    - The `{` opens the main block, meaning it will start processing the input.
    - `for(i=1; i<NF; i++)`: A C-style loop that goes from `1` up to the
      number of fields (separated by either spaces or `=`).
- `if($i == "ansible_host")`: Checks if the current field (held in `$i`) is
  `"ansible_host"`.
- `print $(i+1)`: Prints the field right after `"ansible_host"`.
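The same field-scanning technique against a hypothetical inventory line (the
hostname and IP below are made up, standing in for a real `./hosts` file):

```bash
# Split on spaces and "=", scan for the "ansible_host" key, print the value
# in the field immediately after it.
printf 'node1 ansible_host=10.0.0.5 ansible_user=deploy\n' |
    awk -F '[ =]' '{ for(i=1; i<NF; i++) if($i == "ansible_host") print $(i+1) }'
# Output:
# 10.0.0.5
```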
## Using Awk as an Interpreter

When running `awk` without feeding it any input, either via pipe or file, it
will run the given program interactively.
It processes user input as records (lines), as if it were reading from a file.

E.g.:

```bash
awk 'BEGIN {FS=" "} {gsub("a|e|i", "x", $0); printf("%s - %d\n", $0, length($0))}'
```

- This reads each line you type, replaces every `a`, `e`, or `i` with `x`, and
  outputs the result along with its length.
## Removing Duplicate Lines with Awk

You can use `awk` to remove all duplicate lines from a file by using an
associative array and creating keys out of the lines.

```bash
awk '!seen[$0]++' file > file.deduped
```

- `seen[$0]`: This uses the entire line as a key in the associative array `seen`.
- `++`: Increments the count for that line (from `0` to `1`).
- `!`: Negates the value, so the condition is true only when the count was `0`.
- `!seen[$0]++`: Evaluates to `true` only the first time a line is seen (the
  old value is tested before incrementing).
    - This will print only the first occurrence of each unique line.

By default, when using a condition like this, awk will print the whole line
(`$0`) if no action block is given.

If you wanted to provide an action for this condition, you could do so:

```bash
awk '!seen[$0]++ { print $1 }'
```

This follows the same `pattern { action }` format that any other awk program
does.
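A reproducible sketch of the deduplication idiom with inline input:

```bash
# Only the first occurrence of each line passes through.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# Output:
# a
# b
# c
```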