# Shell tutorial part 2

(Original version written by Marco Fioretti for LXF issue 66.)

## Shell secrets

PART 2: Time-saving tips for modifying and processing text from the command line. By ''Marco Fioretti''.

Much of our shell tutorial this month concerns metacharacters, the text symbols sprinkled (at random, it often seems) throughout command line instructions. If you can find out what they do and learn how to use them, you’ll be able to create powerful programs for finding, inserting and scrubbing out text.
Our first example will help you explain to a program how to recognise a certain piece of text and what to do with that text afterwards. The standard description of the structure of a string of text is called a regular expression – or regex. These are dark, mysterious beasts, but easy to use once you’ve tamed them. In regular expressions, the characteristics of complex text patterns are defined by a vast array of metacharacters:

/linux/
/^linux/
/linux$/
/^linux.*format$/

Weird, huh? But don’t be afraid – come closer. The first regex here simply means that we’re looking for any line containing the string ‘linux’, whether on its own or as part of a longer word (and, with case-insensitive matching enabled, in any mix of case). The second and third are a bit more specific: they’ll match only lines beginning (^) or ending ($) with that string. The last regex describes all lines that start with the ‘linux’ string, end with ‘format’ and have any (*) number of any character (.) in between. In other words it will match:

linuxformat
linux Format
Linux users love Linux Format


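You can test these patterns directly with grep, which we’ll meet again below. A small sketch: the -i flag switches on the case-insensitive matching that the description above assumes, -E enables extended regex syntax, and the file name lines.txt is invented for this demonstration.

```shell
# Save the three candidate lines to a scratch file.
printf 'linuxformat\nlinux Format\nLinux users love Linux Format\n' > lines.txt

# Case-insensitive match: lines starting with 'linux', ending with
# 'format', with anything at all in between. All three lines match.
grep -i -E '^linux.*format$' lines.txt
```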
Regular expressions are also used to substitute some text patterns for others:

s/linux/Linux/
s/\d\d\d\d-12-25/2005-12-25/


Here the first regex capitalises occurrences of linux (append the g modifier, s/linux/Linux/g, to replace every occurrence on a line rather than just the first), and the second one replaces all the dates of Christmas past with that of the next one: ‘\d’ is another metacharacter, meaning ‘any digit’, so four of them will match any year expressed in that form.
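Here is how those substitutions look when handed to sed, introduced properly below. Two practical notes: sed’s basic regexes don’t understand the Perl-style \d shorthand, so this sketch spells the digits out as [0-9], and the file demo.txt and its contents are made up for the example.

```shell
# Two sample lines to transform.
printf 'linux users run linux\nlast Christmas was 2004-12-25\n' > demo.txt

# Capitalise every 'linux' (note the g flag) and update the date.
sed -e 's/linux/Linux/g' -e 's/[0-9][0-9][0-9][0-9]-12-25/2005-12-25/' demo.txt
```

This prints ‘Linux users run Linux’ followed by ‘last Christmas was 2005-12-25’.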

## The role of the interpreter

In practice, regular expressions are fed as arguments to applications, or interpreters, that can put them into practice. The location of the interpreter inside the file system is written right after the shebang. Not sure what the shebang is? Simple: it’s Unix lingo for the two characters at the very beginning of every script – the charming ‘#!’ couplet. They mark the rest of the file as a script – in other words, a series of executable commands meant for an interpreter. Therefore, the first line

#! /bin/bash

declares that you want the program bash in the /bin directory to execute your commands (note the space after ‘!’).

If the file mentioned after the shebang doesn’t exist, or is not an interpreter, the system will simply return a ‘Command Not Found’ error and quit. Some Unix variants place a tight limit on the length of the shebang line, truncating everything to 32 characters or so. What this means in practice is that you may get the ‘Command Not Found’ error even if you’ve entered a valid interpreter file – what’s happened is that it’s just too far from the shebang for the system to recognise (‘Not Found’ is not the same as ‘Not There’).

Two interpreters you’re likely to use for your regexes are AWK and SED. They have been around since the very beginning of Unix, and although there are several other interpreters (chief among them Perl) that can do much more, the original two are faster and, for this reason, still widely used in boot-time scripts.
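To see the shebang at work, save a two-line script and run it. The file name hello.sh is an arbitrary choice for this sketch:

```shell
# Create a minimal script whose first line names its interpreter.
printf '#! /bin/bash\necho hello from bash\n' > hello.sh

# Mark it executable, then run it: the kernel reads the shebang
# and hands the rest of the file to /bin/bash.
chmod +x hello.sh
./hello.sh
```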

## Using SED and AWK

SED works on streams of text (the name SED is just a contraction of stream editor). It loads one line at a time, edits it according to the commands it has received, and prints it to standard output.

cat somefile | sed '/^0/d'


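The d command deletes every line matching the regex, so ‘/^0/d’ strips all lines that begin with a 0. A self-contained version of the idea, with made-up sample lines:

```shell
# Only the line that does not start with 0 survives the filter.
printf '0 failed attempt\n1 normal entry\n0 another failure\n' | sed '/^0/d'
```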
## The ‘here documents’ tool

Still working with long blocks of text, we move to here documents. They exploit a great feature of working within the shell, namely that you don’t have to put templates in external files. With here documents, you can place a block of text, possibly containing some variables, straight into a script, and use it either as the standard input of a command or for a variable assignment. A template embedded this way might end like this:

Please send $INVOICE to Linux Format today
END_OF_EMBEDDED_TEXT

As you can see, the string right after the << operator (END_OF_EMBEDDED_TEXT) is the same one that marks the end of the here document. Now imagine that the code above is in a loop, going over the contents of a text database. The code would create a series of payment requests with the actual names and outstanding payments of every customer. Printing or emailing them would be easy. Another good use of here documents is to create temporary files, or to feed sequences of instructions to interactive programs like FTP.

## How to find broken bookmarks

'''RESOURCES''' Inspired? Here’s where to go next. The ultimate source of reference for regular expressions is the book Mastering Regular Expressions, published by O’Reilly ([http://www.oreilly.com/catalog/regex2/]). Introductory tutorials are at [http://www.zvon.org/other/PerlTutorial/Output/contents.html], and you’ll find a brief introduction to the SED and AWK interpreters at [http://www.faqs.org/docs/abs/HTML/sedawk.html]. All the other commands mentioned in this article have detailed Unix man pages. Just type: man <command_name>

The last part of this tutorial is a handy script. We bet you have hundreds – if not thousands – of links in your web bookmark files. Chances are, a good percentage of those links are broken: web pages move and disappear all the time. You can immediately find out which links are dead with the script below. It was made for Mozilla bookmarks, but modifying it for other browser formats should be pretty straightforward. To fully understand the script, refer to earlier parts of this tutorial to remind yourself what the various metacharacters do.

#! /bin/bash
\rm url_list
\rm url_control_tmp
touch url_control_tmp
grep '<A HREF="' $1 | cut '-d"' -f2 > url_list
for URL in `cat url_list`
do
echo -n "$URL " >> url_control_tmp
curl --head $URL 2>/dev/null | grep 'Not Found' >> url_control_tmp
done
awk '{print $1}' url_control_tmp | sort | cat -n
exit

The first three commands simply remove (rm) any temporary file created by previous runs and then create (touch) a new one, for reasons that will become clear later.

Then the fun starts. The bookmark file is passed to the script as its first argument, so its name is contained in the $1 variable. In the Mozilla bookmark file the lines that contain links start with the <DT><A HREF=" string. The script extracts them with grep and then, using the double quote character as separator (cut '-d"'), discards everything but the second field (-f2); that is, the actual URL. In this way all the links, and nothing else, end up, one per line, in the url_list file.

The for line iterates over every line of the url_list file, provided courtesy of the cat command. Inside the for loop, the echo instruction simply appends to another file, without newline (-n), the current URL. For the append operation to work, the file must already exist. That’s why it was created (or touched) at the beginning. Remember now?

Curl is a nice web browsing utility that works from the command line to automatically retrieve all kinds of documents from the internet. In this example it is launched once for every URL, but it only downloads the page HTTP headers (--head). The headers contain bits of data associated with each document, like this:

HTTP/1.1 200 OK
Date: Fri, 04 Feb 2005 23:09:54 GMT
Server: Apache/1.3.27 (Unix) (Red Hat/Linux)
Content-Type: text/html


The relevant line is the first one: 200 OK means that the page is available. A non-existent page would have returned something like 404 Not Found. When curl is launched, its error messages are ignored: STDERR has the I/O stream number 2 (0 is input, 1 is output), so 2>/dev/null means that this stream must be sent to the fake device (/dev/null) provided by Unix for cases just like this. The grep part of the command saves only the lines containing the HTTP return code ‘Not Found’ to the url_control_tmp file.

The instruction starting with awk prints only the URL value (first field, $1) to its standard output. The resulting list is then sorted and printed with a serial number (cat -n). When I tested the script, the result started with these lines:

1 http://analogbubblebath.net/~chris/misc/doc/xultut/allofit.html.
2 http://au2.php.net/manual/en/install.configure.php.
3 http://netmail.tiscalinet.it/servizi/netmail.

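The grep-and-cut extraction step at the heart of the script can be tried without touching the network. The two-line bookmarks.html below is a made-up stand-in for a real Mozilla bookmark file:

```shell
# Fake bookmark file with two links in the Mozilla format.
printf '<DT><A HREF="http://example.com/">Example</A>\n<DT><A HREF="http://example.org/">Another</A>\n' > bookmarks.html

# Keep only the link lines, then take field 2, with '"' as the
# separator: that field is the bare URL.
grep '<A HREF="' bookmarks.html | cut '-d"' -f2
```

This prints the two URLs, one per line, just as the script collects them in url_list.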

This neat script shows that learning shell commands can enhance your browsing pleasure as well as help your coding, and it’s a nice note to end this month’s tutorial on. LXF