Shell tutorial part 2

From LXF Wiki

Revision as of 16:20, 29 May 2008


Shell secrets

PART 2 Time-saving tips for modifying and processing text from the command line. By Marco Fioretti.

Much of our shell tutorial this month concerns metacharacters, the text symbols sprinkled (at random, it often seems) throughout command line instructions. If you can find out what they do and learn how to use them, you’ll be able to create powerful programs for finding, inserting and scrubbing out text.

Our first example will help you explain to a program how to recognise a certain piece of text and what to do with that text afterwards. The standard description of the structure of a string of text is called a regular expression – or regex. These are dark, mysterious beasts, but easy to use once you’ve tamed them. In regular expressions, the characteristics of complex text patterns are defined by a vast array of metacharacters:


Weird, huh? But don’t be afraid – come closer. The first regex here simply means that we’re looking for any line containing the string ‘linux’ (regardless of its case, or if it’s part of a longer word). The second and third are a bit more specific: they’ll match only lines beginning (^) or ending ($) with that string. The last regex describes all lines that start with the ‘linux’ string, end with ‘format’ and have any (*) number of any character (.) in between. In other words it will match:

linux Format                                                            
Linux users love Linux Format                                              
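If you’d like to try the four patterns described above, grep makes a convenient test bench. What follows is a sketch rebuilt from the description rather than the original listing: the sample lines are invented, and the -i switch (ignore case) is an assumption applied to all four patterns.

```shell
# Invented sample input, one candidate line per row.
sample='linux Format
Linux users love Linux Format
I read about Linux
Format for linux'

# -i makes every match case-insensitive (an assumption, see above).
printf '%s\n' "$sample" | grep -i 'linux'            # contains the string anywhere
printf '%s\n' "$sample" | grep -i '^linux'           # begins with the string
printf '%s\n' "$sample" | grep -i 'linux$'           # ends with the string
printf '%s\n' "$sample" | grep -i '^linux.*format$'  # starts with linux, ends with format
```

The last command prints only the first two sample lines, the same two shown above.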

Regular expressions are also used to substitute some text patterns for others:


Here the first regex capitalises all occurrences of linux, and the second one replaces all the dates of Christmas past with that of the next one: ‘\d’ is another metacharacter, meaning ‘any digit’, so four of them will match any year expressed in that form.
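The two substitutions can be sketched with sed’s s command. This is an illustration, not the original listing: the date format (25/12/YYYY) and the target year are assumptions, and because sed does not understand the Perl-style \d, the bracket expression [0-9] stands in for ‘any digit’.

```shell
# Capitalise every occurrence of linux; the trailing g means "all matches on
# the line", not just the first one.
echo 'linux users love linux' | sed 's/linux/Linux/g'

# Replace any Christmas date with an arbitrary later one.
# The # characters act as the s command's delimiter so the slashes in the
# date need no escaping; -E enables the {4} repetition syntax.
echo 'Parties on 25/12/1999 and 25/12/2003' | sed -E 's#25/12/[0-9]{4}#25/12/2008#g'
```

With an interpreter that does accept \d, such as Perl, four of them in a row would do the same job as [0-9]{4} here.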


The role of the interpreter

In practice, regular expressions are passed as arguments to applications, or interpreters, that know how to apply them. The location of the interpreter inside the file system is written right after the shebang. Not sure what the shebang is? Simple: it’s Unix lingo for the two characters at the very beginning of every script – the charming ‘#!’ couplet. They mark the rest of the file as a script – in other words, a series of executable commands meant for an interpreter. Therefore, the first line #! /bin/bash declares that you want the program bash in the /bin directory to execute your commands (the space after ‘!’ is allowed but optional). If the file mentioned after the shebang doesn’t exist, or is not an interpreter, the system will simply return a ‘Command Not Found’ error and quit.

Some Unix variants place a tight limit on the length of the shebang line, truncating everything to 32 characters or so. What this means in practice is that you may get the ‘Command Not Found’ error even if you’ve entered a valid interpreter file – what’s happened is that it’s just too far from the shebang for the system to recognise (‘Not Found’ is not the same as ‘Not There’).

Two interpreters you’re likely to use for your regexes are AWK and SED. They have been around since the very beginning of Unix and although there are several other interpreters (chief among them Perl) that can do much more, the original two are faster and, for this reason, still widely used in boot-time scripts.
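To watch the shebang at work, here is a small sketch that writes a throwaway script, runs it, then points the first line at an interpreter that doesn’t exist. The path /no/such/interpreter is made up for the purpose, and the exact error text differs between shells and systems, so the sketch only reports the exit status.

```shell
# Write a two-line script whose first line names its interpreter.
script=$(mktemp)
printf '%s\n' '#!/bin/bash' 'echo "run by bash $BASH_VERSION"' > "$script"
chmod +x "$script"
first_run=$("$script")
echo "$first_run"

# Same file, but the shebang now names a path that does not exist (made up).
printf '%s\n' '#!/no/such/interpreter' 'echo never reached' > "$script"
status=0
"$script" || status=$?
echo "exit status with a bad interpreter: $status"

rm -f "$script"
```

The second run fails before a single command in the file is executed, because the system cannot start the program named on the first line.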

Using SED and AWK

SED works on streams of text (the name SED is just a contraction of stream editor). It loads one line at a time, edits it according to the commands it has received, and prints it to standard output.

cat somefile | sed '/^0/d'

The command above will delete all lines beginning with 0. AWK gets its, er, awkward name from the surnames of its creators: Aho, Weinberger and Kernighan. It is a bit more powerful than SED, but works in the same way – one input record at a time. By default, each line is a separate record, referred to as $0. Records are made of (typically) space-separated fields, accessible as $1, $2 and so on.
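A quick way to get a feel for both tools is to feed them a few lines by hand. In this sketch the input data is invented; NF is AWK’s built-in count of fields in the current record.

```shell
# sed: drop every line that begins with 0, pass the rest through unchanged.
printf '%s\n' '01' '10' '02' | sed '/^0/d'

# awk: each line is a record ($0) and each whitespace-separated word a field.
printf '%s\n' 'alice 42 london' 'bob 17 paris' |
    awk '{ print "record: " $0 "; second field: " $2 }'

# NF holds the number of fields, so $NF is always the last one.
printf '%s\n' 'one two three' | awk '{ print NF, $NF }'
```

The sed command leaves only the line 10, and the final awk command prints 3 three.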

awk '/fax/ { print }' bin/*

Here we found and printed all lines containing the ‘fax’ string in all the files of the bin directory. So far our examples have concerned individual phrases of text, be it finding them, formatting them or deleting them. But there are ways to use the shell to locate whole sections of text. What do you do when you find a classified ad in a newspaper page that you want to keep in your wallet? You cut it out with scissors and discard everything else. You can program the command line to do exactly the same thing with text streams.

When you need to find and extract only the relevant rows and columns of characters, it can be very convenient to visualise the terminal window (or a whole text stream) in the same way, as if it were a sheet or roll of paper.

Extracting blocks of text

The four most useful utilities for this task are the programs tail, head, cut and grep. The first two return, respectively, the last or first few lines of a text stream. This is how you would get the 16th to 20th line of somefile.txt:

head -n 20 somefile.txt | tail -n 5
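If you don’t have a suitable file to hand, the seq command can stand in for one. This sketch numbers 30 lines so that you can see at a glance which ones survive; the sed line is an alternative way of asking for the same range, not something the tutorial itself relies on.

```shell
# seq prints 1 to 30, one number per line, in place of somefile.txt.
# head keeps the first 20 lines, tail keeps the last 5 of those.
seq 1 30 | head -n 20 | tail -n 5

# The same range with sed: -n suppresses normal output, 16,20p prints lines 16 to 20.
seq 1 30 | sed -n '16,20p'
```

Both commands print the numbers 16 to 20.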

The cut command does the same thing, but vertically:

cut -c20-23 somefile.txt
ls -lrt | cut -c44-                                                   

The first example returns only columns 20 to 23 of each line of somefile.txt. The second takes a detailed file listing and strips everything but the modification date and file name (the exact column to start from depends on how wide the ls output is on your system).

Last but not least is the grep family. These are, on Linux, three separate commands (grep, egrep, fgrep) that can extract from files all the lines matching a given pattern; egrep and fgrep behave like grep -E and grep -F. Each grep variant has several options and understands a limited set of regular expression constructs. In all cases, regex matches cannot span multiple lines. Here are some classic uses of grep:

grep Linux *.txt
grep -i -v Windows *.txt
egrep 'Euro|Sterling' invoice*.txt

Executing these commands would first of all return all the lines containing the string Linux in all files with a .txt extension. The second would give you all the lines from the same files that do NOT contain (-v) the word Windows, regardless of its case (-i). Finally, the last example shows all the lines containing either Euro or Sterling in all invoice files.
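The cut and grep commands above can be tried without any files at all by piping in a little test data. Everything in this sketch is invented for the purpose; note how a dash in the cut column list means a range, while a comma means individual columns.

```shell
# cut: select characters by column position.
line='abcdefghijklmnopqrstuvwxyz'
echo "$line" | cut -c20-23    # columns 20 to 23: tuvw
echo "$line" | cut -c20,23    # columns 20 and 23 only: tw

# grep: select whole lines by pattern.
notes='Linux is fun
windows are made of glass
Paid in Euro
Paid in Sterling'
printf '%s\n' "$notes" | grep Linux                 # lines containing Linux
printf '%s\n' "$notes" | grep -i -v Windows         # lines without Windows, any case
printf '%s\n' "$notes" | grep -E 'Euro|Sterling'    # lines with either word
```

grep -E is used here in place of egrep; the two accept the same patterns.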

The ‘here documents’ tool

Still working with long blocks of text, we move to here documents. They exploit a great feature of working within the shell, namely that you don’t have to put templates in external files. With here documents, you can place a block of text, possibly containing some variables, straight into a script, and use it either as the standard input of a command or for a variable assignment.

Here documents use a dedicated operator, <<, to define the block of text. The syntax is very simple:

cat <<END_OF_EMBEDDED_TEXT
your account is past due.
Please send $INVOICE to Linux Format today
END_OF_EMBEDDED_TEXT

As you can see, the string right after the << operator (END_OF_EMBEDDED_TEXT) is the same one that marks the end of the here document. Now imagine that the code above is in a loop, going over the contents of a text database. The code would create a series of payment requests with the actual names and outstanding payments of every customer. Printing or emailing them would be easy. Another good use of here documents is to create temporary files or to feed sequences of instructions to interactive programs like FTP.
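The loop idea might look something like this. It is a sketch under assumptions: the payment_request function, the $NAME variable, the ‘Dear’ line and the two sample customers are all invented for illustration, and cat is simply one convenient command to hand the here document to.

```shell
# Print one payment request; both arguments are made-up sample values.
payment_request() {
    local NAME=$1 INVOICE=$2
    # The text up to the closing delimiter becomes cat's standard input,
    # with $NAME and $INVOICE replaced by their values.
    cat <<END_OF_EMBEDDED_TEXT
Dear $NAME,
your account is past due.
Please send $INVOICE to Linux Format today
END_OF_EMBEDDED_TEXT
}

# A stand-in "text database": customer name and amount owed on each line.
printf '%s\n' 'Alice 30' 'Bob 45' | while read -r name amount; do
    payment_request "$name" "$amount"
    echo    # blank line between letters
done
```

If you quote the delimiter (<<'END_OF_EMBEDDED_TEXT'), the shell leaves the variables alone and prints the text exactly as written, which is handy for templates that contain dollar signs.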

How to find broken bookmarks

The last part of this tutorial is a handy script. We bet you have hundreds – if not thousands – of links in your web bookmark files. Chances are, a good percentage of those links are broken: web pages move and disappear all the time. You can find out which links are dead with the script below. It was made for Mozilla bookmarks, but adapting it to other browser formats should be pretty straightforward. To fully understand the script, refer to earlier parts of this tutorial to remind yourself what the various metacharacters do.

#! /bin/bash
\rm url_list
\rm url_control_tmp
touch url_control_tmp
grep '<A HREF="' $1 | cut '-d"' -f2 > url_list
for URL in `cat url_list`
do
  echo -n $URL >> url_control_tmp
  curl --head $URL 2>/dev/null | grep 'Not Found' >> url_control_tmp
done
awk '{print $1}' url_control_tmp | sort | cat -n
exit

The first three commands simply remove (rm) any temporary file created by previous runs and then create (touch) a new one, for reasons that will become clear later. Then the fun starts. The bookmark file is passed to the script as first argument, so its name is contained in the $1 variable. In the Mozilla bookmark file the lines that hold links contain the <A HREF=" string. The script extracts them with grep and then, using the double quote character as separator (cut '-d"'), discards everything but the second field (-f2); that is, the actual URL. In this way all the links and nothing else end up, one per line, in the url_list file.

The for line iterates over every entry in the url_list file, provided courtesy of the cat command. Inside the for loop, the echo instruction simply appends the current URL to another file, without a newline (-n). The >> operator would create that file if it were missing, but touching it at the beginning guarantees that it exists even if the loop never runs. Remember now?

Curl is a command line utility that automatically retrieves all kinds of documents from the internet. In this example it is launched once for every URL, but it only downloads the page’s HTTP headers (--head). The headers contain bits of data associated with each document, like this:

HTTP/1.1 200 OK
Date: Fri, 04 Feb 2005 23:09:54 GMT
Server: Apache/1.3.27 (Unix) (Red Hat/Linux)
Content-Type: text/html

The relevant line is the first one: 200 OK means that the page is available. A non-existent page would have returned something like 404 Not Found. When curl is launched its error messages are ignored: STDERR has the I/O stream number 2 (0 is input, 1 is output), so 2>/dev/null means that this stream must be sent to the fake device (/dev/null) provided by Unix for cases just like this. The grep part of the command saves only the lines containing the HTTP status text Not Found to the url_control_tmp file. The instruction starting with awk prints only the URL value (first field, $1) to its standard output. The resulting list is then sorted and printed with a serial number (cat -n). When I tested the script, the result was just such a numbered list.
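One caveat with the listing as printed: echo -n writes every URL without a newline, so a URL that answers normally runs straight into the next one on the same line of url_control_tmp. The sketch below is one possible way around that, not the author’s script. The function names are invented, the ten-second timeout is an arbitrary choice, and it tests for the 404 status code instead of the words Not Found, because some servers send the number without the accompanying text.

```shell
#!/bin/bash
# Sketch only: extract_urls and is_not_found are names invented for this example.

# Print the quoted URL from every line of the file that contains <A HREF="...".
extract_urls() {
    grep '<A HREF="' "$1" | cut '-d"' -f2
}

# Read HTTP headers on standard input; succeed if the status line reports 404.
is_not_found() {
    head -n 1 | grep -q ' 404'
}

# Only touch the network when a bookmark file is given as the first argument.
if [ -n "${1:-}" ] && [ -f "$1" ]; then
    extract_urls "$1" | while read -r url; do
        # --silent hides the progress meter; --max-time stops slow sites hanging the loop.
        if curl --head --silent --max-time 10 "$url" | is_not_found; then
            echo "$url"    # one dead link per line, so nothing runs together
        fi
    done | sort | cat -n
fi
```

Saved under a name of your choosing and run with the bookmark file as its only argument, it prints a sorted, numbered list of the links that answered with 404. No temporary files are needed, because each URL is written only after its check has finished.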


This neat script shows that learning shell commands can enhance your browsing pleasure as well as help your coding, and it’s a nice note to end this month’s tutorial on. LXF