Most text processing can be done with awk and sed. Sed is a non-interactive stream editor that lets you specify all editing instructions in one place and execute them in a single pass through the file. Awk is a pattern-matching programming language.
Using sed and awk requires some understanding of regular expressions. Here are the basic regular expression signs:
sign | operation |
^ | matches beginning of line |
$ | matches end of line |
. | matches any single character (wildcard) |
* | repeats the previous token zero or more times |
+ | repeats the previous token one or more times |
? | repeats the previous token zero or one time |
[…] | matches any one of the characters enclosed between the brackets. ^ as the first character negates the match; - is used to indicate a range of characters |
() | groups regex |
| | either preceding or following regex can be matched |
{n,m} | matches a range of occurrences of the token that immediately precedes it: {n} matches exactly n occurrences, {n,} matches at least n occurrences, and {n,m} matches any number of occurrences between n and m |
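As a quick check of the table above, here is a minimal sketch that tries a few of these signs with grep -E (extended syntax, so +, ?, | and {n} need no backslashes). The helper name matches and the sample strings are made up for illustration.

```shell
# Hypothetical helper: print "yes" if the ERE in $1 matches the string in $2, else "no".
matches() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo yes
    else
        echo no
    fi
}

matches '^ab*c$' 'ac'            # yes: * allows zero b's
matches '^ab+c$' 'ac'            # no:  + needs at least one b
matches '^colou?r$' 'color'      # yes: ? makes the u optional
matches '^[^0-9]+$' 'abc1'       # no:  [^0-9] rejects the digit
matches '^(ab|cd){2}$' 'abcd'    # yes: grouping, alternation and {n}
```

The ^ and $ anchors force the whole string to match; without them grep reports a match as soon as any part of the line fits.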
Common expressions
exp | interpretation |
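[^0-9] | matches any single character that is not a digit |
[15]00* | matches “10”, “50”, “100”, “500”, “1000”, “5000”. Here the first 0 is literal, the second is modified by *, see the table above |
.* | any number (including 0) of any character |
<.*> | matches HTML tags; since .* is greedy, the match runs from the first < to the last > on the line |
" book " | matches book with a preceding and a following space (the quotes only mark where the spaces are) |
books* | matches “book”, “books”, “bookss” and so on, because * applies only to the s. With a literal space after it, it no longer matches “book.” or “book?” |
"book.* " | matches book, followed by any number of characters or none, followed by a space (again, the quotes only mark the trailing space) |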
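Two of the expressions above are easy to misread, so here is a small sketch that runs them on made-up input. The variable names and sample strings are assumptions for illustration only.

```shell
# -x makes grep accept a line only if the whole line matches the expression.
whole=$(printf '%s\n' 5 10 50 100 150 5000 | grep -x '[15]00*')
echo "$whole"      # 10, 50, 100 and 5000, one per line

# .* is greedy: the match runs from the first < to the last > on the line.
greedy=$(printf '%s\n' '<b>bold</b> text' | sed 's/<.*>/X/')
echo "$greedy"     # X text

# [^>]* cannot cross a >, so each tag is matched separately.
each=$(printf '%s\n' '<b>bold</b> text' | sed 's/<[^>]*>/X/g')
echo "$each"       # XboldX text
```

When the goal is to strip individual tags, the [^>]* form is usually the safer choice.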
Note that regular expressions come in several different flavours, which can be confusing and frustrating. There are DFA (Deterministic Finite Automaton) based engines and NFA (Non-deterministic Finite Automaton) based engines:
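- NFA-based (backtracking) engines can “go back” in the regex, which is what makes features such as backreferences possible. They are used in Perl, Python and vim, and sed and GNU grep rely on one when a pattern contains backreferences.
- DFA-based engines cannot “go back” in the regex. awk and BSD grep are traditionally implemented this way.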
Standard | IEEE POSIX BRE | IEEE POSIX ERE | PCRE |
Detail | Basic Regular Expressions | Extended Regular Expressions, which add repetition and alternation operators on top of BRE | Perl Compatible Regular Expressions |
Engine | DFA | DFA | NFA |
GNU grep | grep by default, or grep -G | egrep or grep -E | grep -P |
BSD grep | grep | egrep or grep -E | |
GNU sed | sed | sed -r or sed -E | N/A, just use perl |
BSD sed | sed | sed -E | N/A, just use perl |
awk | | awk | |
The best way to check is to consult the BSD and GNU manuals.
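The practical difference between BRE and ERE shows up mostly in which characters need a backslash. A minimal sketch with grep, using two made-up sample lines (the helper name sample and the variable names are mine):

```shell
# Two sample lines: the string "aa" and the literal text "a{2}".
sample() { printf '%s\n' 'aa' 'a{2}'; }

bre_escaped=$(sample | grep 'a\{2\}')   # BRE: \{2\} is an interval, so "aa" matches
bre_plain=$(sample | grep 'a{2}')       # BRE: { and } are literal, so "a{2}" matches
ere_plain=$(sample | grep -E 'a{2}')    # ERE: {2} is an interval, so "aa" matches

printf '%s\n' "$bre_escaped" "$bre_plain" "$ere_plain"   # aa, a{2}, aa
```

The same rule applies to sed: plain sed uses the BRE spelling, sed -E the ERE one.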
Here are several examples of sed and awk I came across at work:
Find and remove duplicate lines:
awk '!x[$0]++' input_file.txt > output_file.txt
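This one-liner is terse, so here is a sketch of why it works, run on made-up input. The function name dedupe and the array name seen are my own choices; the logic is the same as above.

```shell
# seen[$0]++ yields 0 the first time a line appears and a positive count afterwards,
# so !seen[$0]++ is true only for first occurrences, and a true pattern prints the line.
dedupe() { awk '!seen[$0]++' "$@"; }

printf '%s\n' apple banana apple cherry banana | dedupe
# prints apple, banana, cherry (first occurrences, in their original order)
```

Unlike sort -u, this keeps the original order, at the cost of holding every distinct line in memory.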
Remove whitespace at the beginning and end of each line (this also squeezes runs of whitespace between fields into single spaces):
awk '{$1=$1}1' input_file.txt > output_file.txt
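A small sketch of what the field assignment does, next to a sed variant that touches only the ends of the line. The sample input and variable names are made up for illustration.

```shell
# Assigning to a field makes awk rebuild $0 from its fields joined by single spaces,
# so leading, trailing and repeated inner whitespace all disappear.
squeezed=$(printf '  a   b  c  \n' | awk '{$1=$1}1')
echo "[$squeezed]"   # [a b c]

# Stripping only the ends with sed leaves the inner spacing untouched.
trimmed=$(printf '  a   b  c  \n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
echo "[$trimmed]"    # [a   b  c]
```

Which one to use depends on whether the spacing between columns carries meaning in your file.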
Print selected fields from input with multiple delimiters (semicolon, comma and pipe):
awk -F '[;,|]' '{print $1, $3, $5}'
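To make the field splitting concrete, here is a sketch on one made-up line; the sample data and variable names are assumptions.

```shell
# Inside [...] the characters ; , and | are literal, so any one of them ends a field.
fields=$(printf '%s\n' 'id;name,city|zip;country' | awk -F '[;,|]' '{print $1, $3, $5}')
echo "$fields"    # id city country

# OFS controls what the comma in print inserts between the printed fields.
joined=$(printf '%s\n' 'id;name,city|zip;country' | awk -F '[;,|]' -v OFS=':' '{print $1, $3, $5}')
echo "$joined"    # id:city:country
```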
Print with calculation between columns
awk '{res=$1-$2;print res,$0}'
Print rows conditionally
awk '$1>20{print;}'
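Since a pattern without an action prints the line, the {print;} part is optional. A short sketch on made-up rows (the data and variable names are mine):

```shell
# $1 looks like a number on every line, so the comparison is numeric (9 < 20 < 100).
big=$(printf '%s\n' '9 low' '25 mid' '100 high' | awk '$1 > 20')
echo "$big"       # 25 mid and 100 high

# Conditions can be combined with && and ||.
mid=$(printf '%s\n' '9 low' '25 mid' '100 high' | awk '$1 > 20 && $1 < 50')
echo "$mid"       # 25 mid
```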
Prefix each line of a file
awk '$0="PREFIX|"$0' input.txt > prefix.input.txt
Replace every occurrence of the string original with new, editing the file in place (GNU sed syntax; BSD sed needs -i ''):
sed -i 's/original/new/g' file.txt
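The -i flag is not portable: GNU sed accepts it on its own, while BSD sed (as on macOS) expects a backup suffix after it. When a script has to run on both, one option is to skip -i and go through a temporary file. A minimal sketch, where the helper name replace_in_file is hypothetical:

```shell
# Hypothetical helper: replace every "original" with "new" in the file named by $1.
replace_in_file() {
    tmp=$(mktemp) || return 1
    # Write the edited text to a temporary file, then copy it back over the original.
    sed 's/original/new/g' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

demo=$(mktemp)
printf '%s\n' 'original text, original idea' > "$demo"
replace_in_file "$demo"
cat "$demo"       # new text, new idea
rm -f "$demo"
```

Copying back with cat rather than mv keeps the permissions of the original file.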
Remove multiple patterns
sed 's/pattern1\|pattern2\|pattern3//g'
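Note that \| in basic syntax is a GNU extension. Two spellings that should work with both GNU and BSD sed are sketched below on a made-up string; foo and bar stand in for the real patterns.

```shell
# Extended syntax (-E): a plain | means alternation.
ext=$(printf '%s\n' 'xfooybarz' | sed -E 's/foo|bar//g')
echo "$ext"        # xyz

# Basic syntax: one substitution per pattern, chained with -e.
chained=$(printf '%s\n' 'xfooybarz' | sed -e 's/foo//g' -e 's/bar//g')
echo "$chained"    # xyz
```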
Delete only the first match on each line
sed 's/pattern//'
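Without the g flag the substitution applies once per line, not once per file. The sketch below shows that on made-up input, plus one possible awk way to remove only the very first match in the whole input; the flag name hit is my own.

```shell
# Without the g flag, sed removes the first match on every line.
per_line=$(printf '%s\n' 'aXbXc' 'Xd' | sed 's/X//')
echo "$per_line"    # abXc and d

# Removing only the first match in the whole input, sketched in awk:
# sub() returns 1 when it replaced something, which sets the flag for later lines.
first_only=$(printf '%s\n' 'aXbXc' 'Xd' | awk '!hit && sub(/X/, "") { hit = 1 } 1')
echo "$first_only"  # abXc and Xd
```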
Remove blank lines:
sed '/^$/d' input_file.txt > output_file.txt
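The pattern /^$/ only matches lines that are completely empty. Lines that contain just spaces or tabs look blank but are kept, which the following sketch on made-up input illustrates (variable names are mine):

```shell
# /^$/ matches only lines with nothing in them, so the line holding two spaces survives.
strict=$(printf 'a\n\n  \nb\n' | sed '/^$/d')
printf '%s\n' "$strict" | wc -l     # 3 lines: a, the two spaces, b

# Allowing optional whitespace also removes lines that merely look blank.
loose=$(printf 'a\n\n  \nb\n' | sed '/^[[:space:]]*$/d')
echo "$loose"                       # a and b
```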
Copy lines 100 to 500 of the input file to the output file
sed -n '100,500p' input.log > output.log
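On a large log, sed keeps reading to the end of the file even after the last wanted line. Adding a q command at the end of the range avoids that. A minimal sketch, with a hypothetical helper name line_range and a short made-up input:

```shell
# Hypothetical helper: print lines $1 through $2 of standard input.
# The trailing q makes sed quit at line $2 instead of reading the rest of the input.
line_range() { sed -n "${1},${2}p;${2}q"; }

printf '%s\n' one two three four five | line_range 2 4
# prints two, three, four
```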
Merge every three lines:
sed 'N;N;s/\n/ /g'
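Each N appends one more input line to the pattern space, so joining three lines takes two N commands. The sketch below checks that on made-up input and adds an awk variant, under the hypothetical name merge3, for inputs whose line count is not a multiple of three.

```shell
# Two N commands collect three lines; the substitution turns both newlines into spaces.
merged=$(printf '%s\n' 1 2 3 4 5 6 | sed 'N;N;s/\n/ /g')
echo "$merged"      # 1 2 3 and 4 5 6

# An awk sketch that also copes with leftover lines at the end of the input:
# print a newline before every group of three (except the first), a space otherwise.
merge3() {
    awk '{ printf "%s%s", (NR % 3 == 1 ? (NR > 1 ? "\n" : "") : " "), $0 }
         END { if (NR) print "" }' "$@"
}

printf '%s\n' 1 2 3 4 5 6 7 | merge3    # 1 2 3, then 4 5 6, then 7
```

How sed treats an incomplete final group differs between implementations, which is the main reason to reach for the awk version.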