sed & awk Workshop

This is a somewhat crude transcript of a sed & awk workshop held for the Linux User Group Bolzano-Bozen-Bulsan on 25 January 2003.

Regular Expressions

Atoms

Atoms are the basic components of an RE

x
the character 'x' itself
\X
if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*')
\123
the character with octal value 123
\xe5
the character with hexadecimal value e5
.
any character (byte) except newline
[xyz]
a character class: x OR y OR z
[ako-sP]
a character class with a range in it; matches an 'a', a 'k', any letter from 'o' through 's', or a 'P'
[^A-Z]
a negated character class: i.e., any character but those in the class. In our example, any character except an uppercase letter
[:str:]
a character class expression: Allowed only within another character class. The valid contents of str are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit

Pieces

Pieces are used to concatenate one or more REs, or to specify how often the preceding piece must be repeated

(r)
the RE r itself
rs
the RE r followed by the RE s
r|s
the RE r OR the RE s
r*
the RE r zero or more times
r+
the RE r one or more times
r?
the RE r zero or one time
r{2,6}
the RE r anywhere from two to six times
r{2,}
the RE r two or more times
r{,6}
the RE r up to six times
r{4}
the RE r exactly four times

Regular Examples

The RE (x|y|z) is equivalent to the RE [xyz].

And the RE (a|b) is equivalent to (b|a).

The RE (B|F)al{2} matches both the strings Ball and Fall.

Regular Examples to match real numbers

Real numbers (simple)

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

Real numbers (character class)

[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?

Problem: numbers like 3. are accepted, but not .3.

Real numbers (catch all)

(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
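The catch-all RE can be checked by feeding candidate strings to grep -E -x, which prints only the lines matched in full. A minimal sketch; the candidate list and the variable names are made up for illustration:

```shell
# Catch-all RE for real numbers, copied from the text above.
re='(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?'

# -E selects extended REs, -x requires the RE to match the whole line.
accepted=$(printf '%s\n' 3.14 3. .3 1.5e-7 42 . e5 | grep -E -x "$re")
printf '%s\n' "$accepted"
```

Both 3. and .3 are accepted now; 42 is still rejected because every variant requires a decimal point.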

Basic REs

Real numbers as Extended RE

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

Real numbers as Basic RE

[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}

Real numbers as Basic RE, written in the shell

\[0-9\]\[0-9\]\*\\.\[0-9\]\*\\\(\[eE\]\[+-\]\\\{0,1\\\}\[0-9\]\[0-9\]\*\\\)\\\{0,1\\\}
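Single quotes spare most of the backslashes of the last form. The sketch below checks, on a made-up list of candidates, that the extended and the basic RE accept the same strings (grep without -E uses basic REs):

```shell
ere='[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?'
bre='[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}'

# Hypothetical test candidates, one per line.
candidates=$(printf '%s\n' 3.14 3. .3 2.5E+10 7)

with_ere=$(printf '%s\n' "$candidates" | grep -E -x "$ere")
with_bre=$(printf '%s\n' "$candidates" | grep -x "$bre")
printf 'ERE:\n%s\nBRE:\n%s\n' "$with_ere" "$with_bre"
```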

sed

sed Synopsis

bash$ sed [options] program [inputfile]

This simple program consists of the command 'd'. It tells sed to delete the pattern buffer.

bash$ sed -e 'd' /etc/hosts

Another command is 'p'. It tells sed to print the pattern buffer. (Every line is printed twice, because sed also prints the pattern buffer automatically at the end of each cycle.)

bash$ sed -e 'p' /etc/hosts
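The two commands can be tried on a tiny input instead of /etc/hosts. A small sketch; the two-line input is made up, and the -n option (which switches off the automatic printing) is added for comparison:

```shell
input=$(printf '%s\n' one two)

deleted=$(printf '%s\n' "$input" | sed -e 'd')          # nothing is left
doubled=$(printf '%s\n' "$input" | sed -e 'p')          # p plus automatic print
printed_once=$(printf '%s\n' "$input" | sed -n -e 'p')  # only the explicit p

printf 'd: [%s]\np:\n%s\n-n p:\n%s\n' "$deleted" "$doubled" "$printed_once"
```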

We don't always want to work on the whole document --> There must be a mechanism to address a line or several lines

Addresses

n
selects the line n
$
selects the last line
/re/
selects the lines matching the RE re
\crec
selects the lines matching the RE re. The c may be any character
first~step
GNU extension! Selects every step'th line starting with line first
addr1,addr2
Address range: selects all input lines which match the inclusive range of lines starting from the first address and continuing to the second address
addr!
selects the lines that addr does not match
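The address forms can be tried out with the p command and the -n option. A sketch on a made-up five-line input; pick is a hypothetical helper, not part of sed, and the GNU-only first~step form is left out:

```shell
input=$(printf '%s\n' alpha beta gamma delta epsilon)

# Hypothetical helper: run one sed program on the sample input.
pick() { printf '%s\n' "$input" | sed -n -e "$1"; }

second=$(pick '2p')          # line number
last=$(pick '$p')            # last line
with_l=$(pick '/l/p')        # lines matching an RE
same_re=$(pick '\,l,p')      # the same RE with ',' as delimiter
range=$(pick '2,/delta/p')   # from line 2 through the next match
outside=$(pick '2,4!p')      # everything outside lines 2 to 4

printf '%s\n' "$second" "$last" "$with_l" "$range" "$outside"
```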

Examples

The command = prints the current line number. A substitute program for wc -l might be:

bash$ sed -n -e '$='

This one emulates head:

bash$ sed -n -e '1,10p'
bash$ sed -e '10q'

sed commands

Eliminate comments

bash$ sed -e 's/#.*//' /etc/inetd

The substitute command:

s/re/repl/flags

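flags is zero or more of the characters

g
replace all matches in the line, not just the first one
p
print the pattern space if a substitution was made
N
(a number) replace only the Nth match
w file
write the pattern space to file if a substitution was made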

s/abc/abc/g

This is not an endless loop! The replacement text is not scanned again for further matches.

s/otto/o/g

The string ottotto will be changed to otto, not to o.
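Both claims are easy to verify on the command line. A short sketch with made-up variable names:

```shell
# The replacement is not rescanned, so this terminates and changes nothing.
same=$(echo abcabc | sed -e 's/abc/abc/g')

# Matches do not overlap: after the first otto only "tto" is left to scan.
shortened=$(echo ottotto | sed -e 's/otto/o/g')

printf '%s\n%s\n' "$same" "$shortened"
```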

Eliminate comments and empty lines

bash$ sed -e 's/#.*//;/^$/d' /etc/inetd

Have a 133t prompt

bash$ ls -l | sed -e 's/o/0/;s/l/1/;s/e/3/'
bash$ ls -l | sed -e 's/o/0/g;s/l/1/g;s/e/3/g'
bash$ ls -l | sed -e 'y/ole/013/'
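The difference between the three variants shows on a fixed string instead of the ls output. A sketch; the sample text is made up. Note that y translates every occurrence by itself and takes no flags:

```shell
text='hello world'

first_only=$(echo "$text" | sed -e 's/o/0/;s/l/1/;s/e/3/')     # one hit per command
all=$(echo "$text" | sed -e 's/o/0/g;s/l/1/g;s/e/3/g')         # every hit
translated=$(echo "$text" | sed -e 'y/ole/013/')               # character by character

printf '%s\n' "$first_only" "$all" "$translated"
```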

Convert a file from DOS to UNIX and back

# Under UNIX: convert DOS newlines (CR/LF) to Unix format

bash$ sed 's/.$//' file     # assumes that all lines end with CR/LF
bash$ sed 's/^M$//' file    # in bash/tcsh, press Ctrl-V then Ctrl-M

# Under DOS: convert Unix newlines (LF) to DOS format

C:\> sed "s/$//" file    # method 1
C:\> sed -n p file       # method 2

Or use the utilities dos2unix and unix2dos, or the command

tr -d '\r' < inputfile > outputfile

for a conversion from DOS to UNIX, or

:set fileformat=dos
:set fileformat=unix

from within vim, or...
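In a script the carriage return can be produced with printf instead of Ctrl-V Ctrl-M. A sketch that compares byte counts before and after the conversion; dos_text is a hypothetical stand-in for a real DOS file:

```shell
cr=$(printf '\r')                          # a literal carriage return

# Hypothetical sample: two lines with CR/LF endings (10 bytes).
dos_text() { printf 'one\r\ntwo\r\n'; }

dos_size=$(dos_text | wc -c)
sed_size=$(dos_text | sed -e "s/$cr\$//" | wc -c)   # strip a trailing CR
tr_size=$(dos_text | tr -d '\r' | wc -c)            # delete every CR

echo "$dos_size $sed_size $tr_size"
```

Each converted variant is two bytes shorter: one CR per line is gone.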

The character # starts a comment that runs to the end of the line. It is treated as a command (which cannot have any address)

This is useful if the sed program is stored in a file. The whole program can be executed with

bash$ sed -f programfile < inputdata

The { and } commands group different commands. } is a command --> it must be preceded by a semicolon.

bash$ sed -ne '/gimme this line number/{=;q;}'

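The command n prints the pattern space (unless -n is given) and then replaces it with the next line of input. Note that d ends the current cycle immediately, so a command placed after d in the same group is never reached.

/skip this line/n
 # do some nasty stuff
 ...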

REs are greedy

eliminating HTML tags from a file

bash$ sed -e 's/<.*>//g' text.html

If the file contains a line like:

This <b> is </b> a <i>example</i>.

then the result will be:

This .

Solution:

bash$ sed -e 's/<[^>]*>//g' text.html
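Both programs can be compared on the example line without an HTML file. A sketch; the brackets in the output only make the remaining blanks visible:

```shell
line='This <b> is </b> a <i>example</i>.'

greedy=$(echo "$line" | sed -e 's/<.*>//g')        # from the first < to the last >
careful=$(echo "$line" | sed -e 's/<[^>]*>//g')    # one tag at a time

printf '[%s]\n[%s]\n' "$greedy" "$careful"
```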

References

The elleff-Language:

Every vowel c in a word is substituted with clcfc.

--> The ampersand (&) holds the matched string:

bash$ sed -e 's/[aeiou]\+/&l&f&/g'

Referencing a sub-string

Sub-strings enclosed with \( and \) can be referenced with \n (n is a digit from 1 to 9)

bash$ sed -e 's/\([^ ]\+\)  *\([^ ]\+\)  *\([^ ]\+\)/\3 \2 \1/'
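Applied to three words, the program reverses their order. A sketch with a made-up input; \+ is a GNU extension in basic REs, so the portable spelling [^ ][^ ]* is used here instead:

```shell
swapped=$(echo 'one two three' |
    sed -e 's/\([^ ][^ ]*\)  *\([^ ][^ ]*\)  *\([^ ][^ ]*\)/\3 \2 \1/')
echo "$swapped"
```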

The elleff back-transform

The RE [aeiou]l[aeiou]f[aeiou] also matches strings which are not elleff vowels, such as alefi.

Basic REs can use the back-reference in the RE itself!

bash$ sed -e 's/\([aeiou]\+\)l\1f\1/\1/g'
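A round trip shows the transform and its inverse together. A sketch on the made-up word hello, again with the portable [aeiou][aeiou]* in place of the GNU \+:

```shell
# Encode: every run of vowels c becomes clcfc.
encoded=$(echo hello | sed -e 's/[aeiou][aeiou]*/&l&f&/g')

# Decode: \1 inside the RE forces the three vowel groups to be identical.
decoded=$(echo "$encoded" | sed -e 's/\([aeiou][aeiou]*\)l\1f\1/\1/g')

printf '%s\n%s\n' "$encoded" "$decoded"
```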

Space Balls

D
Delete text in the pattern space up to the first newline
N
Add a newline to the pattern space, then append the next line of input to the pattern space
P
Print out the portion of the pattern space up to the first newline
h
Replace the contents of the hold space with the contents of the pattern space
H
Append a newline to the contents of the hold space, and then append the contents of the pattern space to that of the hold space
g
Replace the contents of the pattern space with the contents of the hold space
G
Append a newline to the contents of the pattern space, and then append the contents of the hold space to that of the pattern space
x
Exchange the contents of the hold and pattern spaces

Space Balls: Example

Print the first line as last

bash$ sed -n -e '1h;1!p;${g;p;}'
h
hold space <- pattern space
g
pattern space <- hold space
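On a three-line input the effect is easy to follow. A sketch with made-up input lines:

```shell
# Line 1 goes to the hold space, all other lines are printed,
# and on the last line the held line is fetched and printed.
rotated=$(printf '%s\n' first second third | sed -n -e '1h;1!p;${g;p;}')
printf '%s\n' "$rotated"
```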

Emulation of tac

bash$ sed -n -e 'G;h;$p'
G
pattern space <<- '\n' hold space

Problem:

The output ends with an extra empty line: G appends a newline followed by the content of the hold space to the pattern space, even on the first line (which is printed last), when the hold space is still empty.

tac improved

bash$ sed -n -e 'G;h;$s/.$//p'
bash$ sed -n -e '1!G;h;$p'
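Counting the output lines with wc -l makes the extra line visible. A sketch; abc is a hypothetical helper that produces a three-line input:

```shell
abc() { printf '%s\n' a b c; }    # hypothetical three-line input

naive_lines=$(abc | sed -n -e 'G;h;$p' | wc -l)     # one line too many
fixed_lines=$(abc | sed -n -e '1!G;h;$p' | wc -l)   # as many lines as the input
reversed=$(abc | sed -n -e '1!G;h;$p')

echo "$naive_lines $fixed_lines"
printf '%s\n' "$reversed"
```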

A simple counter in sed

/^[[:digit:]][[:digit:]]*$/!b;         # lines that are not all digits are left unchanged
x;s/.*//;x;                            # clear the hold space
: add
/9$/{s/9$//;x;s/.*/0&/;x;b add;};      # eliminate the last 9 from the p.s.
                                       # and add a 0 in front of the h.s.
s/8$/9/
s/7$/8/
s/6$/7/
s/5$/6/
s/4$/5/
s/3$/4/
s/2$/3/
s/1$/2/
s/0$/1/
s/^$/1/
G;s/\n//g;            # add the content of the h.s to the p.s
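The counter can be tried on a few numbers. The sketch below is a variant of the program written one command per line for portability; as an assumption, lines that are not plain numbers are passed through unchanged with b, and the temporary file created by mktemp is only there for the demonstration:

```shell
script=$(mktemp)
cat > "$script" <<'EOF'
# pass lines that are not plain numbers through unchanged
/^[0-9][0-9]*$/!b
# clear the hold space
x
s/.*//
x
:add
# move each trailing 9 to the hold space as a 0
/9$/{
s/9$//
x
s/.*/0&/
x
b add
}
s/8$/9/
s/7$/8/
s/6$/7/
s/5$/6/
s/4$/5/
s/3$/4/
s/2$/3/
s/1$/2/
s/0$/1/
s/^$/1/
# append the zeros collected in the hold space
G
s/\n//
EOF

incremented=$(printf '%s\n' 0 9 19 99 1234 text | sed -f "$script")
rm -f "$script"
printf '%s\n' "$incremented"
```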

Branches

: label
Definition of label (up to 8 characters)
b label
unconditionally branch to label
t label
branch to label only if there has been a successful 's'ubstitution since the last input line was read or 't' branch was taken

If label is omitted in the b or t command, sed branches to the end of the program, so the next cycle is started.
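A typical use of t is to repeat a substitution until it no longer matches. A sketch that groups the digits of a number in thousands; the label name a and the input are chosen freely:

```shell
# Each pass inserts one comma before the last group of three digits
# that has no comma yet; t jumps back to :a as long as s succeeds.
grouped=$(echo 1234567 |
    sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta')
echo "$grouped"
```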

Eliminate K/K++ comments

#!/bin/sed -f

# delete K++ comments
/^[[:blank:]]*kk.*/d
s/kk.*//

# If no comment is found, then start a new cycle
: test
/ko/!b

# Append new lines to the pattern space until an entire K-comment is in the
# pattern space
: append
/ok/!{N;b append;}

# delete every K-comment (but don't be greedy!)
s/ko\([^o]\|o[^k]\)*o\?ok//g

t test

awk

Program Structure

organisation of an awk program

pattern { action }

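A pattern can be:

BEGIN
matches before the first input record is read
END
matches after the last input record has been read
/re/
matches the records that match the RE re
expression
matches the records for which expression is true (non-zero or non-empty)
pattern1, pattern2
a range: matches from a record matching pattern1 through the next record matching pattern2
(empty)
matches every record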

A simple program

BEGIN { print "START" }
{ print }
END { print "STOP" }

A simple program with quit command

BEGIN { print "START" }
/quit/{ exit }
{ print }
END { print "STOP" }
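The second program can be run on a short input. A sketch with made-up input lines; it also shows that the END rule still runs after exit:

```shell
output=$(printf '%s\n' one quit two |
    awk 'BEGIN { print "START" } /quit/ { exit } { print } END { print "STOP" }')
printf '%s\n' "$output"
```

The line two is never read, because exit stops the input loop at quit.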

Use of variables

bash$ awk '{ a++ } END{ print a, "lines." }'

Example: Slicing the input

bash$ ls -lg | awk '{ print $3, ":", $7 }'

Who tells awk which character to take as field separator?

And why are there spaces between the fields in the output string?

The field separator can be specified with the FS variable.

BEGIN { FS=":"; OFS=""; }
{ print $1, "'s name is: ", $5 }

called as

awk -f programfile /etc/passwd
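The two questions can be answered with one line of input: blanks in the line itself survive once FS is ":", and the commas in print insert OFS, which is a blank by default. A sketch; the passwd entry is invented:

```shell
line='alice:x:1000:1000:Alice Example:/home/alice:/bin/sh'   # made-up entry

# Default OFS: the comma in print becomes one blank.
with_space=$(echo "$line" | awk 'BEGIN { FS=":" } { print $1, $5 }')

# Empty OFS: the strings are glued together as written.
no_sep=$(echo "$line" | awk 'BEGIN { FS=":"; OFS="" } { print $1, " is ", $5 }')

printf '%s\n%s\n' "$with_space" "$no_sep"
```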

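Some built-in variables

NR
number of records read so far
FNR
number of records read so far from the current file
NF
number of fields in the current record
FS
input field separator (default: blank)
OFS
output field separator (default: blank)
RS
input record separator (default: newline)
ORS
output record separator (default: newline)
FILENAME
name of the current input file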

Example: Emulation of wc -w

BEGIN{ w=0 }
{ w+= NF }
END{ print w }

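or

END{ print w }; { w+=NF }; BEGIN{ w=0 }

would work too: BEGIN and END rules run before the first and after the last record, wherever they appear in the program.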

Example: String manipulation

bash$ awk '{ sub(/[^ ]* */,""); print $0 }'

A simple calculator

BEGIN{ print "type a number" }
{ print $1, "squared =", $1*$1 }

Example: Rotating the input column

{ j=1+j%3; print $j }

Control structures

Conditions

if (expr) statement
if (expr) statement else statement

Loops

while (expr) statement
do statement while (expr)
for (opt_expr ; opt_expr ; opt_expr) statement
for (var in array) statement

and also

continue
break

Arrays

One-dimensional Arrays

printing all elements of an array:

for (i in A)
    print A[i]

Multidimensional Arrays

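The loop variable must be a single name, so for ( (i,j) in A ) is a syntax error; (i,j) in A is valid only as a membership test. Loop over the combined keys and split them at SUBSEP:

for (k in A) { split(k, idx, SUBSEP); print A[idx[1], idx[2]] }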
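awk stores A[i,j] under the single key i SUBSEP j. A sketch with one made-up element that recovers both subscripts and uses the (i,j) in A membership test:

```shell
cells=$(awk 'BEGIN {
    A["x", 1] = "first"
    for (key in A) {
        split(key, idx, SUBSEP)          # recover the two subscripts
        print idx[1], idx[2], A[key]
    }
    if (("x", 1) in A) print "found"     # membership test with two subscripts
}')
printf '%s\n' "$cells"
```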

Functions

A function is defined as

function name( args ) { statements }

and can return a value

return expression

All variables are global..

function set_n(i)
{ n=i; }

BEGIN{ n=6; set_n(1); print n }

.. but arguments are local

function set_n(i,   n)
{ n=i; }

BEGIN{ n=6; set_n(1); print n }
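Running both variants side by side shows the difference. A sketch; the extra parameter n in the second function is never passed by the caller and only serves as a local variable:

```shell
with_global=$(awk 'function set_n(i) { n = i }
                   BEGIN { n = 6; set_n(1); print n }')

with_local=$(awk 'function set_n(i,   n) { n = i }
                  BEGIN { n = 6; set_n(1); print n }')

echo "$with_global $with_local"
```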

I/O

print
writes $0 ORS to standard output.
print expr1, expr2, ..., exprn
writes expr1 OFS expr2 OFS ... exprn ORS to standard output.
printf format, expr-list
duplicates the printf C library function writing to standard output.
print > file
writes $0 ORS to file
getline
reads into $0, updates the fields, NF, NR and FNR
getline < file
reads into $0 from file, updates the fields and NF
getline var
reads the next record into var, updates NR and FNR
getline var < file
reads the next record of file into var
command | getline
pipes a record from command into $0 and updates the fields and NF
command | getline var
pipes a record from command into var
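Three of the getline forms in one BEGIN rule. A sketch; the input lines and the echo command are made up, and close() is called so that the command could be run again later:

```shell
result=$(printf '%s\n' one two | awk '
    BEGIN {
        getline                    # first input record -> $0
        first = $0
        getline second             # next record -> second, $0 unchanged
        seen = NR                  # two records read so far
        "echo from-command" | getline cmd_line
        close("echo from-command")
        print first, second, seen, cmd_line
    }')
echo "$result"
```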

Dividing even/odd pages of a Text (RFC)

Assuming the pages are pre-formatted and separated by ^L (0x0c)

BEGIN{ job = 1 }

{ print > ("txt.out." job) }

/^\x0c$/ { job = job % 2 + 1 }

Passing variables to awk

Let us suppose we have a shell variable $SearchString, and we want to pass it to an awk program (which emulates grep)

First Try

awk '/$SearchString/{ print }' textfile.txt

This doesn't work, because the shell inhibits variable expansion between single quotes (').

awk /$SearchString'/{ print }' textfile.txt

What happens if $SearchString contains a space?

Second Try

awk /"$SearchString"'/{ print }' textfile.txt

Another solution

awk -v ss="$SearchString" '$0 ~ ss { print }' textfile.txt
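The -v variant also copes with a blank in the search string. A sketch; sample is a hypothetical stand-in for textfile.txt:

```shell
SearchString='two words'

# Hypothetical stand-in for textfile.txt.
sample() { printf '%s\n' 'only one' 'these two words match' 'twowords'; }

# Single quotes: awk sees the text $SearchString, not its value.
quoted=$(sample | awk '/$SearchString/ { print }')

# -v hands the value over as the awk variable ss.
found=$(sample | awk -v ss="$SearchString" '$0 ~ ss { print }')

printf '[%s]\n[%s]\n' "$quoted" "$found"
```

ss is still used as an RE here, so characters such as . or * keep their special meaning.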

The -v option is available with POSIX compliant awk implementations. mawk and gawk support it, but oawk does not. Some of the nawk implementations support it, some do not.

Statistics of password generators

BEGIN { bytes = 0 }
{
    n=length($0)
    for(i=1; i<=n; i++)
        A[substr($0,i,1)]++
    bytes+=n
}
END {
    n = 0
    med = 0
    print bytes, "bytes"
    for(i in A)
    {
        med+=A[i]
        n++
    }
    print n, "chars"
    med/=n
    print "average frequency =", med
    var = 0;
    for(i in A)
        var+=(A[i]-med)^2/n
    print "variance =", var
    print "std. dev =", sqrt(var)
}

base64-encoded random values

bash$ (base64-encode < /dev/urandom | tr -d +/\\n | \
head -c "${1:-8000}" 2> /dev/null ; echo) | awk -f stat
8000 bytes
62 chars
average frequency = 129.032
variance = 108.354

passwords generated with pwgen

bash$ pwgen -c -n 8 1000|awk -f stat
8000 bytes
56 chars
average frequency = 142.857
variance = 38521.9

ed

The form of an ed command

[address [,address]]command[parameters]

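The addresses are like those of sed with many extensions:

.
the current line
+n, -n
the nth line after or before the current line
?re?
the previous line matching the RE re (backward search)
'x
the line previously marked x with the k command
,
shorthand for 1,$ (the whole file)

Some of the commands are

a
append text after the addressed line
i
insert text before the addressed line
c
change (replace) the addressed lines
d
delete the addressed lines
p
print the addressed lines
s/re/repl/
substitute repl for the first match of re in the addressed lines
g/re/command
execute command on every line matching re
w
write the buffer to the file
q
quit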

Examples

Invocation of ed

bash$ ed textfile.txt < commandfile
bash$ ed textfile.txt <<EOF
a
This is now the last line.
.
wq
EOF

Appends a line at the end of the file (after startup, the current line is the last line).

A line containing only a period ends input mode.

The notation <<string\n...\nstring is a here-document of the shell.

Print all lines matching a RE

bash$ ed textfile.txt <<EOF
g/re/p
q
EOF
