PHP - Handling tar files part 1

From LXF Wiki

Practical PHP Programming

(Original version written by Paul Hudson for Linux Format magazine issue 42.)

We continue production of a PHP extension to handle tar files.

If you were reading the last tutorial, you'll remember that our extension can now be compiled into PHP correctly. From here on, the process is simply a matter of editing the tar.c and tar.h files, and recompiling.

So, as this extension is designed to handle tar files, the first step is to create the functions available to PHP users. Although there are lots of things you can do with tar files, space is limited here, so we'll stick to three functions: tar_list(), tar_add(), and tar_extract(). These names are in line with the PHP coding standards (see CODING_STANDARDS included with the PHP source code) which dictate that functions should be lowercase, with words separated by an underscore, and should start with the "family" name of the function, in this case "tar". Here are the PHP function prototypes we'll work to:

  • array tar_list(string tarfile)
  • bool tar_add(string tarfile, string file_to_add)
  • bool tar_extract(string tarfile, string location)

It would be technically better to have an object-oriented solution, as it wouldn't require using "string tarfile" in each function. However, it would also require much more explanation than there is space for here.

Listing tar contents

This is the easiest part of manipulating tar files, which is why it's first. As you can see in the PHP prototype, the function should accept a string of the tar file to read from, and should return an array of the files in that tar file.

Open up tar.h into your favourite text editor (one that includes C syntax highlighting is preferable). Look for the line "PHP_FUNCTION(confirm_tar_compiled);" and change it to "PHP_FUNCTION(tar_list);". Now that we know our extension compiles into PHP properly, we can replace the debug function confirm_tar_compiled with our own function, tar_list.

(Screenshot: Using a text editor that includes C syntax highlighting will make your life much easier.)

Now open up tar.c, and scan to line 42 or thereabouts. Change the line "PHP_FE(confirm_tar_compiled, NULL)" to "PHP_FE(tar_list, NULL)". This merely mimics the change made in tar.h, and both are used to inform PHP of what functions this module offers.

Further down the file, you'll see PHP_FUNCTION(confirm_tar_compiled). This is where the actual confirm_tar_compiled function lies. Change that line, and the two lines above it, to read:

/* {{{ proto array tar_list(string arg)
   Return an array of files in a given tar file */
PHP_FUNCTION(tar_list)

As you can see, PHP source code is self-documenting - a habit that's good to keep.

Those changes above merely rename the confirm_tar_compiled function to tar_list(). As you can guess, each function needs an appropriate PHP_FUNCTION() line in tar.h, a PHP_FE() line near the top of tar.c, and also a PHP_FUNCTION() line somewhere in tar.c.
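The three-part pattern described above can be sketched in plain C, without the PHP headers. Everything here is a hypothetical stand-in: SK_FUNCTION() and SK_FE() only imitate the roles of PHP_FUNCTION() and PHP_FE(), and the real Zend macros carry more information than a name and a pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PHP_FUNCTION() and PHP_FE(); the real Zend
   macros take more arguments and carry more information. */
typedef const char *(*sk_handler)(void);

typedef struct {
    const char *name;     /* name that scripts would call */
    sk_handler  handler;  /* C function implementing it */
} sk_function_entry;

#define SK_FUNCTION(name) const char *sk_fn_##name(void)
#define SK_FE(name)       { #name, sk_fn_##name }

/* 1. Declaration, the role played by the line in tar.h */
SK_FUNCTION(tar_list);

/* 2. Table entry, the role played by the line near the top of tar.c */
static const sk_function_entry sk_functions[] = {
    SK_FE(tar_list),
    { NULL, NULL }        /* end-of-table marker */
};

/* 3. Definition, the role played by the function body in tar.c */
SK_FUNCTION(tar_list)
{
    return "tar_list called";
}

/* Find a handler by its script-visible name, or NULL if it is not listed. */
sk_handler sk_find(const char *name)
{
    const sk_function_entry *e;

    for (e = sk_functions; e->name != NULL; e++) {
        if (strcmp(e->name, name) == 0) {
            return e->handler;
        }
    }
    return NULL;
}
```

Leaving out any one of the three parts breaks the chain: without the table entry, a lookup such as sk_find("tar_list") finds nothing even though the function itself exists.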

Save your changes, cd to the PHP source directory, and run make and make install again. make should take substantially less time now, because only part of the project has changed.

Once the new version of PHP is installed, execute this command:

php -r "echo tar_list('foo'), \"\n\";"

That will call the tar_list function, passing in the parameter "foo". As you can see by the output, the actual contents of the function remain the same as confirm_tar_compiled - let's change that.

The C implementation

Here's the new code for tar_list:

1 PHP_FUNCTION(tar_list)
2 {
3	TAR *t;
4	char *tarfile = NULL;
5	int tarfile_len;
6	int i;

7	if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &tarfile, &tarfile_len) == FAILURE) {
8		return;
9	}

10	if (array_init(return_value)==FAILURE) RETURN_FALSE;

11	if (tar_open(&t, tarfile, NULL, O_RDONLY, 0, TAR_GNU) == -1)
12	{
13		php_error(E_WARNING, "%s", strerror(errno));
14		RETURN_FALSE;
15	}

16	while ((i = th_read(t)) == 0)
17	{
18		add_next_index_string(return_value, th_get_pathname(t), 1);

19		if (TH_ISREG(t) && tar_skip_regfile(t) != 0)
20		{
21			php_error(E_WARNING, "%s", strerror(errno));
22			tar_close(t);
23			RETURN_FALSE;
24		}
25	}

26	if (tar_close(t) != 0)
27	{
28		php_error(E_WARNING, "%s", strerror(errno));
29		RETURN_FALSE;
30	}

31 }

If the code seems confusing at first, relax - half the new code is PHP-related and half the code is tar-related; you're learning two new things at once here.

How it works

Lines three to six define our variables for this function. t is a pointer to a tar file, which is defined as type TAR (the built-in libtar definition of a tar file). tarfile is a pointer to the filename we're going to load - this is passed in as our first parameter. Finally, we have two integers - tarfile_len, to store the length of the filename to load, and i, which is used later.

Lines seven to nine deal with parameters being passed into the function, using an approach that was inspired by Python (see? Competition is never bad!). The zend_parse_parameters() function takes the parameters passed into our function, then places them into C variables we define, automatically performing type conversion where possible.

zend_parse_parameters() takes a variable number of parameters. The first parameter is nearly always "ZEND_NUM_ARGS() TSRMLS_CC", which is a little confusing in itself. The first part, "ZEND_NUM_ARGS()", is quite straightforward - it returns the number of parameters actually passed in by the end-user to our function. TSRMLS_CC, however, is a little bit of magic - it's a macro standing for Thread Safe Resource Manager Local Storage Call with Comma, and ensures thread safety for your extension. The reason it's "with Comma" is that the macro's expansion begins with a comma, so although no comma appears in our source, it's actually passed in as parameter /two/ to zend_parse_parameters(). However, it's easier just to treat "ZEND_NUM_ARGS() TSRMLS_CC" as one parameter, and let Zend perform the magic.
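The "with Comma" trick is easier to see in a stripped-down form. The sketch below uses made-up names (ts_context, TS_DC, TS_CC) rather than the real Zend macros, and assumes a build where the extra argument is always present.

```c
/* Hypothetical per-thread context, standing in for the resource manager data. */
typedef struct {
    int thread_id;
} ts_context;

/* "With comma" macros: the comma is part of the expansion, so the macro is
   written with no comma in front of it. In a build without thread safety
   both could be defined as empty and the extra argument would vanish. */
#define TS_DC , ts_context *ctx   /* used in declarations */
#define TS_CC , ctx               /* used in calls */

/* Reads as one parameter, compiles as two. */
int ts_parse(int num_args TS_DC)
{
    return num_args * 100 + ctx->thread_id;
}

int ts_caller(ts_context *ctx, int num_args)
{
    /* Expands to ts_parse(num_args, ctx) */
    return ts_parse(num_args TS_CC);
}
```

Putting the comma inside the macro is what lets the same source line compile either with or without the extra argument, which is why it is never written by hand.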

The third and subsequent parameters are the important part of the function. Parameter three is a list of the types of parameters you expect to receive in mnemonic format - l is a long, d is a double, b is a boolean, a is an array, r is a resource, etc. "s" is special, because it stands for "string" and receives /two/ parameters - a char* string of characters and also the length of the string as an integer.

So, our third, fourth, and fifth parameters are "s" (we want to receive a string), "&tarfile" (the location of our char* ready to receive the contents of the string parameter), and "&tarfile_len" (the location of our int ready to receive the length of the string parameter). If we were to receive a string and a boolean, we would use "sb", then add a variable to store the boolean. If we request a string and the user passes an integer, zend_parse_parameters() will automatically convert the integer to a string before giving it to us, and it will perform similar conversions for most other types. Important exceptions to this are arrays, objects, and references, which cannot be converted because of their arbitrary nature.
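To make the format-string idea concrete, here is a small stand-alone sketch of a parser in the same spirit. It is not Zend's implementation: pp_arg, pp_parse_args() and the two supported letters are assumptions made for illustration, and the only conversion it performs is integer to string.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

typedef enum { PP_LONG, PP_STRING } pp_type;

/* Hypothetical script-side argument: a type tag plus storage for both forms. */
typedef struct {
    pp_type type;
    long    lval;
    char    sval[32];   /* text form; filled in on demand for longs */
} pp_arg;

/* Returns 0 on success, -1 on failure. */
int pp_parse_args(pp_arg *args, int num_args, const char *spec, ...)
{
    va_list ap;
    int i, result = 0;

    if ((int)strlen(spec) != num_args) {
        return -1;                      /* wrong number of arguments */
    }

    va_start(ap, spec);
    for (i = 0; i < num_args && result == 0; i++) {
        switch (spec[i]) {
        case 's': {                     /* "s" fills TWO outputs */
            char **str = va_arg(ap, char **);
            int   *len = va_arg(ap, int *);
            if (args[i].type == PP_LONG) {   /* convert integer to string */
                snprintf(args[i].sval, sizeof args[i].sval, "%ld", args[i].lval);
                args[i].type = PP_STRING;
            }
            *str = args[i].sval;
            *len = (int)strlen(args[i].sval);
            break;
        }
        case 'l': {                     /* "l" fills one output */
            long *out = va_arg(ap, long *);
            if (args[i].type != PP_LONG) {
                result = -1;            /* this sketch does not convert to long */
                break;
            }
            *out = args[i].lval;
            break;
        }
        default:
            result = -1;                /* unknown type letter */
        }
    }
    va_end(ap);
    return result;
}
```

Calling pp_parse_args(args, 1, "s", &str, &len) with an integer argument hands back its text form and length, mirroring the way "s" consumes two output variables in the tutorial's code.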

Note that zend_parse_parameters() returns FAILURE on failure, and SUCCESS on success. In our example, we bail out with "return" if the function call fails. It's important to note that zend_parse_parameters() will automatically flag up warnings if the incorrect number of parameters is received or if a type conversion isn't possible.

Continuing on with the code, line 10 sets up the value we wish to return from the function call, which is an array. Returning values in PHP is remarkably simple on the whole; however, returning arrays and objects is trickier because they are more complex. Returning values from functions is handled by a special zval called return_value. /zvals/ are how the Zend Engine stores variables - see the box "Anatomy of a zval" for detailed information. The simple definition of a zval is "a multi-type variable that handles refcounts".

When we use PHP_FUNCTION(somefunc), we're actually using the PHP_FUNCTION macro, which internally sets each function up with basic information. Particularly, it ensures that each function accepts several parameters, one of which is a zval called return_value. Also passed, just so you get a more complete picture, is an integer called return_value_used - if this is 1, the calling script needs a return value, and it's 0 if no return value is needed (ie function return values aren't assigned to a variable or used in another function).

So, line ten sets up return_value to be an array through the array_init() macro, which sets the zval to be an array and allocates it some space.

Lines 11 to 30 deal with the libtar part of the extension, with a little Zend code in there. As the focus of this tutorial is PHP, I'll have to keep libtar-specific information to a minimum - sorry!

These lines of code can be split into three distinct sections: opening a tar file, reading from the tar file, and closing the tar file. This functionality is split over lines 11 to 15, 16 to 25, and 26 to 30 respectively.

php_error() is a macro pointing to the zend_error() function, which automatically outputs various errors to users. It takes the kind of error to issue, followed by the text of the error as a printf()-style format string and any values that string needs (which is why our code passes "%s" and strerror(errno)). Internally, zend_error() sets up extra information such as the line number where the error occurred, the file in which it occurred, etc.

Generally speaking, only three error types should be issued: E_ERROR, E_WARNING, and E_NOTICE. The difference between the three is that E_ERROR halts execution of the script, and E_NOTICE is generally disabled as it's relatively minor. E_WARNING should be the most commonly used error type.

RETURN_FALSE is one of the many ways to return values to users, and simply returns the value "false". Other basic return types include RETURN_BOOL, RETURN_NULL, and RETURN_STRING. As you can see, sending non-array/object values back to users is remarkably easy!
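The way a hidden return_value parameter and the RETURN_* shortcuts fit together can be sketched in a few lines of stand-alone C. The names below (rv_value, RV_FUNCTION, RV_RETURN_FALSE, RV_RETURN_LONG) are invented for this illustration and are far simpler than the real macros; the extra int argument is also an assumption added so the example has something to work on.

```c
typedef enum { RV_NULL, RV_BOOL, RV_LONG } rv_type;

/* Hypothetical, much-simplified stand-in for a zval. */
typedef struct {
    rv_type type;
    long    lval;
} rv_value;

/* Every function declared this way silently receives return_value
   and return_value_used, much as the text describes. */
#define RV_FUNCTION(name) \
    void rv_fn_##name(int arg, rv_value *return_value, int return_value_used)

/* Fill in the return slot, then leave the function. */
#define RV_RETURN_FALSE \
    do { return_value->type = RV_BOOL; return_value->lval = 0; return; } while (0)
#define RV_RETURN_LONG(l) \
    do { return_value->type = RV_LONG; return_value->lval = (l); return; } while (0)

/* Returns arg / 2 for even numbers and false for odd ones. */
RV_FUNCTION(half)
{
    (void)return_value_used;   /* not needed in this sketch */

    if (arg % 2 != 0) {
        RV_RETURN_FALSE;
    }
    RV_RETURN_LONG(arg / 2);
}
```

The C function itself returns void; the "return value" the script sees is whatever was written into the slot before the function exited.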

Line 11 calls the libtar function tar_open(), which takes six parameters: the address to save the tar file handle to, the pathname to the tar file to open, NULL, how the file should be opened (O_RDONLY for readonly or O_WRONLY for write only), 0, then special flags that can be ORed together.

Parameter one needs to be the address of a TAR* variable (see line 3, and note the &t on line 11), and the special flags can consist of one or more of TAR_GNU (enables GNU extensions to the tar format), TAR_VERBOSE (send status messages to stdout), TAR_IGNORE_CRC (skip validation of the tar CRC), and others. tar_open() will return 0 on success, or -1 on failure.

So, lines 11 to 15 translate as "if opening the tar file as read-only fails, flag up a PHP warning, then exit the function". If the function succeeds, our tar file will be ready to work with in the variable t.

Moving on, lines 16 to 25 contain the code to handle reading files from the tar file, and adding the name of each file to the array we intend to return. th_read() is a function that reads one file header block from the TAR variable passed as parameter one. A file header block describes each file inside a tar file, and there is one header block for each file. th_read() returns -1 on error, 0 on success, and 1 when it reaches the end of the tar file, so we use a loop to continue iterating through file header blocks until a value other than 0 is returned.

add_next_index_string() is a function that handles adding strings to an array in an ordered fashion. The "next_index" part of the function name means that the string is added in the next available numerical slot in the array, so, as we're using this function exclusively to add array elements, our array will contain elements numbered from 0 to n-1, where n is the number of files in the tar file.

The function takes three parameters - the array to add to, the char* to add, and a 1 or a 0 depending on whether the string needs to be copied into Zend's memory (1) or whether the Zend Engine can use the existing pointer (0). 1 is generally used, owing to 0's tendency towards instability. th_get_pathname() extracts the pathname of the current tar file header, and returns it in char* format - perfect for parameter two.
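The difference between copying (1) and borrowing (0) is easy to demonstrate with a toy list. This is only a sketch under assumed names (arr_list, arr_add_next_index_string()) and a fixed capacity; it says nothing about how Zend's hash tables are really implemented.

```c
#include <stdlib.h>
#include <string.h>

#define ARR_MAX 16

/* Hypothetical fixed-size list of strings with numeric indexes. */
typedef struct {
    char *items[ARR_MAX];
    int   owned[ARR_MAX];   /* 1 if the list made its own copy */
    int   count;
} arr_list;

void arr_init(arr_list *arr)
{
    arr->count = 0;
}

/* Append str at the next free index. Returns that index, or -1 on failure. */
int arr_add_next_index_string(arr_list *arr, char *str, int duplicate)
{
    if (arr->count >= ARR_MAX) {
        return -1;
    }
    if (duplicate) {
        size_t n = strlen(str) + 1;
        char *copy = malloc(n);
        if (copy == NULL) {
            return -1;
        }
        memcpy(copy, str, n);          /* the list now owns a private copy */
        arr->items[arr->count] = copy;
    } else {
        arr->items[arr->count] = str;  /* the list merely borrows the pointer */
    }
    arr->owned[arr->count] = duplicate;
    return arr->count++;
}

/* Release only the strings the list copied for itself. */
void arr_free(arr_list *arr)
{
    int i;

    for (i = 0; i < arr->count; i++) {
        if (arr->owned[i]) {
            free(arr->items[i]);
        }
    }
    arr->count = 0;
}
```

If the caller's buffer is reused after the call, a borrowed entry changes underneath the list while a copied entry does not, which is the kind of instability that makes 1 the safer choice.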

Line 19 calls tar_skip_regfile() to move onto the next file in the tar archive - this "just works" and is best left alone. However, there is a little bit of error handling in there - if libtar is unable to move onto the next file in the archive, it will return -1, and hence run our standard error reporting code. Note, though, that this time we close the tar file with tar_close() (see later) so that we clean up properly.
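For the curious, the read-a-header-then-skip-the-data loop can be sketched without libtar at all. This is not how libtar is written; it is a minimal illustration that assumes the classic tar layout (512-byte blocks, the name in the first 100 bytes of a header, the size as octal text at offset 124) and works on an archive already loaded into memory. The function name tr_tar_list() is made up, and long names, checksums and other header fields are ignored.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TR_BLOCK 512

/* Collect up to max_names member names from an in-memory tar image.
   Returns the number of names found, or -1 if a member's data runs
   past the end of the buffer. */
int tr_tar_list(const unsigned char *buf, size_t len,
                char names[][101], int max_names)
{
    size_t pos = 0;
    int count = 0;

    while (pos + TR_BLOCK <= len && count < max_names) {
        const unsigned char *hdr = buf + pos;
        char size_field[13];
        unsigned long size;

        if (hdr[0] == '\0') {
            break;                          /* empty block: end of archive */
        }

        memcpy(names[count], hdr, 100);     /* name: bytes 0 to 99 */
        names[count][100] = '\0';
        count++;

        memcpy(size_field, hdr + 124, 12);  /* size: 12 bytes of octal text */
        size_field[12] = '\0';
        size = strtoul(size_field, NULL, 8);

        /* Step over the header, then over the data rounded up to whole
           blocks: the job that skipping a regular file performs. */
        pos += TR_BLOCK + ((size + TR_BLOCK - 1) / TR_BLOCK) * TR_BLOCK;
    }
    return pos > len ? -1 : count;
}
```

A five-byte file therefore costs two blocks in the archive, one for its header and one for its padded data, and the next header starts 1024 bytes after the previous one.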

We're onto the last chunk of code now, at last. tar_close() takes a TAR file parameter, and closes it down and frees any associated memory. If libtar fails to close the tar file for some reason, it will return -1, which will thereby run our standard error handling code.

Phew!

If you've made it this far, you're very brave indeed! Hopefully I've gone into enough depth to give you a full understanding of not just /how/ things work, but also /why/ they work that way. Anyway, the hard work is now done - let's make use of the new extension!

The first step is to rebuild PHP by executing make from the PHP source directory - this will only rebuild the new extension, so it shouldn't take long.

Once it's built, run make install as root, then we're ready to test. Here's an example PHP script that makes use of the new function. You'll need to find or create a tar file for testing purposes - make sure it's just a tar file (.tar) and not a gzipped tar file (.tar.gz).

  <?php
  $result = tar_list('/path/to/your.tar');

  echo "Result is: $result", "\n\n";

  foreach($result as $var => $val) {
    echo "Item $var: $val\n";
  }
  ?>

Save that as tartest.php, then run:

php -f tartest.php

If you've followed my steps precisely, you should see something similar to the screenshot, although, of course, the files inside the tar will be different.

(Screenshot: It works! All the blood, sweat, and RSI was worth it!)


Scanning over this article and part 1, you should be able to see that there's actually not all that much code involved - for instance, editing the m4 file is something that you just get used to and is mostly the same for every extension you write. Furthermore, learning how to make use of buildconf, configure, and make for PHP extensions is really just a one-off - it'll become second nature given practice.

This tutorial has hopefully taught you how PHP extensions work, how their code is laid out, how Zend works with variables, and also perhaps just a little of how PHP and Zend work internally.

Modules made easy

How to use ext_skel for maximum efficiency

Extension writing for PHP needn't be a hassle, and it's important not to make your life harder than it ought to be. Using ext_skel is the best way to start an extension, as it creates lots of code for you to handle all sorts of eventualities - using global variables, reading php.ini values, adding entries to phpinfo() and more.

ext_skel can be passed parameters to make it produce a more customised extension skeleton, and these options are generally used by veteran extension writers.

Particularly of interest will be:

  • --xml; generate XML documentation that can be added to the PHP documentation
  • --no-help; don't add comments throughout the files, and also don't create the initial test function
  • --stubs=file; leave out all module-specific information and just write function stubs

Anatomy of a zval

Each zval contains a value, a type, a reference count, and a flag saying whether or not it's a reference to another variable. The reference count is used for garbage collection - the count is incremented for each reference that exists to this variable, and decremented when a reference is lost. When the count hits zero, the zval is automatically freed.

The value is a union structure in itself, so it's a little more complicated. A value can either be a long (lval), a double (dval), a string and its length (str), a hash table value (ht, for arrays), or an object (obj). How the value is read from a variable depends on what kind of variable it is, naturally. However, it's important to remember that reading one type of a union that had its data set as another type will result in garbage being returned.

To read a value from a zval that holds a string, use Zend's built-in macros, Z_STRVAL(zval) and Z_STRLEN(zval). While you can access the value of a zval directly, it's much easier just to use the macro. In the same way, Z_LVAL(zval) reads the long value of the zval, Z_BVAL(zval) reads the boolean value of the zval, etc.

Writing to a zval can also be done by hand, but, because of the need to set the /type/ field of zvals, it's easier again to use the Zend macros for setting variables: ZVAL_LONG, ZVAL_DOUBLE, ZVAL_STRING, etc. In the case of setting a string variable, the Zend macro also calculates the length of the string for you.

ZVAL_LONG, ZVAL_DOUBLE, etc, all use the same format for setting their value - parameter one is the zval to set, and parameter two is the value to assign to it, eg:

ZVAL_LONG(new_long, 10);

Owing to the more memory-reliant nature of strings, the ZVAL_STRING macro takes a third parameter, which should be a 1 if the string should be duplicated before being stored, or a 0 if not. Generally this should be a 1, otherwise data will go out of scope and cause problems. The primary occasion where a 0 needs to be passed is when you want to create a new variable referring to a string that's already allocated in Zend's internal memory.

There's one final thing you need to know about zvals before getting started using them, and that's how they are initialised. MAKE_STD_ZVAL is a special macro that should be called after each zval has been declared, and does basic house-keeping tasks such as allocating memory for the zval, setting the reference count to one, etc.

So, to conclude, a complete zval declaration and definition would be:

zval *myint;
MAKE_STD_ZVAL(myint);
ZVAL_LONG(myint, 10);
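To tie the pieces of this box together, here is a toy version of the same idea in stand-alone C: a tagged union, a reference count, and macros that set the type and value in one go. It is a sketch only. The zv_ names are invented, the layout is a guess at the general shape rather than a copy of Zend's structure, and strings, arrays and objects are left out.

```c
#include <stddef.h>
#include <stdlib.h>

typedef enum { ZV_IS_NULL, ZV_IS_LONG, ZV_IS_DOUBLE } zv_type;

/* Hypothetical miniature zval: value union, type, refcount, reference flag. */
typedef struct {
    union {
        long   lval;
        double dval;
    } value;
    zv_type       type;
    unsigned int  refcount;
    unsigned char is_ref;
} zv_zval;

/* Setters write the type and the value together, so they cannot disagree. */
#define ZV_SET_LONG(z, l) \
    do { (z)->type = ZV_IS_LONG; (z)->value.lval = (l); } while (0)
#define ZV_SET_DOUBLE(z, d) \
    do { (z)->type = ZV_IS_DOUBLE; (z)->value.dval = (d); } while (0)

/* Readers hide the union member names. */
#define ZV_TYPE(z) ((z)->type)
#define ZV_LVAL(z) ((z)->value.lval)
#define ZV_DVAL(z) ((z)->value.dval)

/* Allocate and initialise: refcount starts at one, not a reference. */
zv_zval *zv_make(void)
{
    zv_zval *z = calloc(1, sizeof *z);
    if (z != NULL) {
        z->type = ZV_IS_NULL;
        z->refcount = 1;
    }
    return z;
}

void zv_addref(zv_zval *z)
{
    z->refcount++;
}

/* Drop one reference; frees the zval and returns 1 when the count hits zero. */
int zv_delref(zv_zval *z)
{
    if (--z->refcount == 0) {
        free(z);
        return 1;
    }
    return 0;
}
```

Checking ZV_TYPE() before reading is what protects against the garbage that results from reading a union through the wrong member.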

(Screenshot: The PHP source code is, of course, the easiest way to see how things work. Well worth a read.)