Shell tutorial part 2

From LXF Wiki

Revision as of 14:25, 29 May 2008

TEXT FORMATTING

Shell secrets

PART 2 Time-saving tips for modifying and processing text from the command line. By Marco Fioretti.

Much of our shell tutorial this month concerns metacharacters, the text symbols sprinkled (at random, it often seems) throughout command line instructions. If you can find out what they do and learn how to use them, you’ll be able to create powerful programs for finding, inserting and scrubbing out text.

Our first example will help you explain to a program how to recognise a certain piece of text and what to do with that text afterwards. The standard description of the structure of a string of text is called a regular expression – or regex. These are dark, mysterious beasts, but easy to use once you’ve tamed them. In regular expressions, the characteristics of complex text patterns are defined by a vast array of metacharacters:

/linux/                                                                
/^linux/                                                               
/linux$/                                                                   
/^linux.*format$/
                                                                                          

Weird, huh? But don’t be afraid – come closer. The first regex here simply means that we’re looking for any line containing the string ‘linux’, whether it stands alone or is part of a longer word. Matching is case-sensitive unless you ask otherwise, so ‘Linux’ would not count. The second and third are a bit more specific: they’ll match only lines beginning (^) or ending ($) with that string. The last regex describes all lines that start with the ‘linux’ string, end with ‘format’ and have any number (*) of any character (.) in between. In other words it will match with:

linuxformat
linux format
linux users love linux format

A case-insensitive search would also accept ‘Linux users love Linux Format’.
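
To try these patterns out, here is a minimal sketch using grep, which takes the regex without the surrounding slashes. The sample lines and the variable names are made up for illustration; grep matches case-sensitively unless it is given -i.

```shell
# Sample input, invented for this illustration: one line per test case.
sample='linuxformat
linux format
Linux users love Linux Format
I run linux
format linux'

# grep prints only the lines that match the regex.
contains=$(printf '%s\n' "$sample" | grep 'linux')
starts=$(printf '%s\n' "$sample" | grep '^linux')
ends=$(printf '%s\n' "$sample" | grep 'linux$')
both=$(printf '%s\n' "$sample" | grep '^linux.*format$')

# -c counts matching lines instead of printing them; -i ignores case.
nocase=$(printf '%s\n' "$sample" | grep -ci '^linux.*format$')

printf 'contains linux:\n%s\n' "$contains"
printf 'starts with linux:\n%s\n' "$starts"
printf 'ends with linux:\n%s\n' "$ends"
printf 'starts with linux, ends with format:\n%s\n' "$both"
printf 'same, ignoring case: %s lines\n' "$nocase"
```

The capitalised line only shows up in the final count, because it is the -i option that makes grep ignore case.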

Regular expressions are also used to substitute some text patterns for others:

s/linux/Linux/                                                         
s/\d\d\d\d-12-25/2005-12-25/                                           

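Here the first regex capitalises the first occurrence of linux on each line (add a g after the final slash to change them all), and the second one replaces all the dates of Christmas past with that of the next one: ‘\d’ is another metacharacter, meaning ‘any digit’, so four of them will match any year expressed in that form. Be aware that ‘\d’ comes from Perl-style regexes; SED and AWK expect ‘[0-9]’ instead.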
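
The two substitutions can be tried with sed, sketched below on invented sample text. Two assumptions to note: sed does not understand \d as ‘any digit’, so the digits are written as [0-9], and a trailing g is shown alongside the plain form to change every occurrence on a line rather than only the first.

```shell
# Sample text invented for this illustration.
text='linux is fun, linux is free
Christmas was on 1999-12-25'

# Without g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/linux/Linux/')

# With g, every match on the line is replaced.
every=$(printf '%s\n' "$text" | sed 's/linux/Linux/g')

# [0-9] stands in for \d, which sed does not treat as a digit class.
dated=$(printf '%s\n' "$text" | sed 's/[0-9][0-9][0-9][0-9]-12-25/2005-12-25/')

printf '%s\n---\n%s\n---\n%s\n' "$first_only" "$every" "$dated"
```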

The role of the interpreter

In practice, regular expressions are fed as arguments to applications, or interpreters, that can put them into practice. The location of the interpreter inside the file system is written right after the shebang. Not sure what the shebang is? Simple: it’s Unix lingo for the two characters at the very beginning of every script – the charming ‘#!’ couplet. They mark the rest of the file as a script – in other words, a series of executable commands meant for an interpreter. Therefore, the first line #! /bin/bash declares that you want the program bash in the /bin directory to execute your commands (note the space after ‘!’, which is optional).

If the file mentioned after the shebang doesn’t exist, or is not an interpreter, the system will simply return a ‘Command Not Found’ error and quit. Some Unix variants place a tight limit on the length of the shebang line, truncating everything to 32 characters or so. What this means in practice is that you may get the ‘Command Not Found’ error even if you’ve entered a valid interpreter file – what’s happened is that its path runs past the point where the system stops reading (‘Not Found’ is not the same as ‘Not There’).

Two interpreters you’re likely to use for your regexes are AWK and SED. They have been around since the early days of Unix and although there are several other interpreters (chief among them Perl) that can do much more, the original two are faster and, for this reason, still widely used in boot-time scripts.
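
As a minimal sketch of the shebang at work, the commands below write a two-line script to a temporary file, mark it executable and run it. The file name comes from mktemp, the message is invented, and /bin/sh is assumed to exist (as it does on practically every Unix-like system); the temporary directory must also allow files to be executed.

```shell
# Create a throwaway script file (the path is chosen by mktemp).
script=$(mktemp)

# First line: the shebang naming the interpreter. Second line: the command.
printf '%s\n' '#!/bin/sh' 'echo "run by the shell named in the shebang"' > "$script"

# Make the file executable, then run it directly; the system reads the
# shebang and hands the file to /bin/sh.
chmod +x "$script"
output=$("$script")
echo "$output"

# Clean up the temporary file.
rm -f "$script"
```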

Using SED and AWK

SED works on streams of text (the name SED is just a contraction of stream editor). It loads one line at a time, edits it according to the commands it has received, and prints it to standard output.

cat somefile | sed '/^0/d'

The command above will delete all lines beginning with 0.

AWK gets its, er, awkward name from the surnames of its creators: Aho, Weinberger and Kernighan. It is a bit more powerful than SED, but works in the same way – one input record at a time. By default, each line is a separate record, referred to as $0. Records are made of (typically) space-separated fields, accessible as $1, $2 and so on.
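
Both tools can be tried on the same made-up input. In this sketch the sample lines and variable names are invented: sed drops the lines that start with 0, and awk prints the whole record ($0) or individual fields ($1, $2, $3).

```shell
# Sample input invented for this illustration: code, name, extension.
data='0 header ignore
1 alice 1234
2 bob 5678'

# sed: delete every line beginning with 0 and print the rest.
kept=$(printf '%s\n' "$data" | sed '/^0/d')

# awk: $2 is the second space-separated field of each record.
names=$(printf '%s\n' "$data" | awk '{ print $2 }')

# awk: a pattern selects records; $0 is the whole record.
bob=$(printf '%s\n' "$data" | awk '/bob/ { print $0 }')

# awk: fields can be reordered or combined freely.
swapped=$(printf '%s\n' "$data" | awk '/^1/ { print $3, $1 }')

printf '%s\n' "$kept" "$names" "$bob" "$swapped"
```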

awk '/fax/ { print }' bin/*

Here we found and printed all lines containing the ‘fax’ string in all the files of the bin directory.

So far our examples have concerned individual phrases of text, be it finding them, formatting them or deleting them. But there are ways to use the shell to locate whole sections of text. What do you do when you find a classified ad in a newspaper page that you want to keep in your wallet? You cut it out with scissors and discard everything else. You can program the command line to do exactly the same thing with text streams.