PHP: SimpleXML and XPath

(Original version written by Paul Hudson for LXF issue 71.)

In another cheap attempt to give you all a leg-up in our Sudoku bounty, Paul Hudson delves into the use of XML files for fun and profit.

One of the many Latin proverbs attributed to Julius Caesar is, "Beati Hispani quibus vivere bibere est". Loosely translated, he said: "Blessed are the Spanish, for whom living is drinking." And who would disagree? Little did he know that the same phrase would be similar in modern-day Spanish, where you would say, "Dichosos los españoles, para quienes vivir es beber." Any Spanish speaker could look at the Latin and guess its meaning, thanks to the fact that Spanish is a Romance language: not because it's the language of love, but because it comes from the colloquial Latin spoken by the Romans in later years.

Surprisingly, we have a similar degree of standardisation in computer science. Yes, even though we can't agree how many buttons a mouse should have, we managed to create a system for sharing data in an easy-to-understand format. XML (the eXtensible Markup Language) is a text-based format that allows you to load and save data easily, as well as share that data with others, even over the web. Like Latin and the Romance tongues, XML schemas (the rules by which the files are constructed) differ, but not enough to render them incomprehensible if you're only fluent in one of them. In this tutorial we'll use XML to store crosswords: it will store them, load them, and present them on the screen. These won't be real puzzles, though—that's a different problem entirely!

Why use XML?

The reason we have so many file formats, even in an age when XML is everywhere, is that XML isn't perfect. It's very verbose, it's typeless, and being human-readable text makes it slower to process than binary formats. However, it has advantages: a document can be checked for well-formedness without knowing its schema, it supports Unicode, it is largely self-documenting, and its strict syntax rules leave parsers little room for ambiguity.

Overall, XML came to power despite its problems because it can easily be shared across programs and sent over the web. We'll be using it to store crosswords, and by doing so we're opening our file format to other programs that want to use it. If you read the Ajax feature in last month's issue [Playtime On The Web, page 68], you'll also be aware of the possibilities that XML brings to client-side programming. Unless you have specific needs, XML should be your file format of choice for most projects.

Making XML simple

XML support has had a rocky time in PHP. Since it was first introduced, there have been various implementations, several rewrites and a number of extensions that all help you read and write XML. The latest is SimpleXML, so named because it's designed to take the pain out of XML processing by allowing you to treat an XML document as a set of PHP variables.

Here's an XML example, which I've called requiem.xml:

<requiem>
 <line>
  <latin>Confutatis maledictis</latin>
  <english>When the wicked are confounded</english>
 </line>
 <line>
  <latin>Flammis acribus addictis</latin>
  <english>Doomed to flames of woe unbounded</english>
 </line>
</requiem>

In that example, the root element is requiem. There are two 'line' elements, and each line contains a latin and an english element—nothing too tricky at this point. We can parse that XML and print it out using two lines of PHP:

 <?php
   $file = simplexml_load_file("requiem.xml");
   var_dump($file);
 ?>

Using the var_dump() function is a great way to learn how things work, as you can see at a glance exactly what data is in an object. Here's the output:

object(SimpleXMLElement)#1 (1) {
 ["line"]=>
 array(2) {
   [0]=>
   object(SimpleXMLElement)#2 (2) {
     ["latin"]=>
     string(21) "Confutatis maledictis"
     ["english"]=>
     string(30) "When the wicked are confounded"
   }
   [1]=>
   object(SimpleXMLElement)#3 (2) {
     ["latin"]=>
     string(24) "Flammis acribus addictis"
     ["english"]=>
     string(33) "Doomed to flames of woe unbounded"
    }
  }
}

Understanding what that output means is crucial to mastering SimpleXML. At the root of the output is a SimpleXMLElement object, which contains one variable, line. That's actually an array, which contains two elements (0 and 1, for our two <line> elements) that are also SimpleXMLElement objects in their own right. Each of these objects contains the strings latin and english with the data from the XML. So, what we get back from our call to simplexml_load_file() is a mix of objects and arrays that hold the structure of the XML, with normal variables for the data itself. Using var_dump() to output the return value to the screen is a bit of a cop-out, so let's rewrite the code to output only English text:

<?php
 $file = simplexml_load_file("requiem.xml");
 foreach($file->line as $line) {
   echo $line->english, "\n";
 }
?>

Note how $file->line and $line->english are used, as both $file and $line are objects and so you must access their variables using the -> operator. You can of course treat objects like arrays—using them inside foreach loops will cause PHP to iterate over the variables in there as if they were array elements. That said, it is best to keep intact the mental separation between objects and arrays, because SimpleXML uses array-style accesses to make attributes available. For example, if you amend the first line of the XML to this:

<requiem key="D">

you can now amend your parsing code to this:

$file = simplexml_load_file("requiem.xml");
print $file["key"];

This will read the key attribute from the root element. Using $file->key wouldn't work, as it would look for a child element called <key>, which doesn't exist. Yes, this is black magic, as it's not how you would expect the array operator to work—but that's why it uses SimpleXMLElement objects rather than just flat arrays.
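To see the two access styles side by side, here is a minimal sketch. It assumes the XML is held in a string and loaded with simplexml_load_string() rather than read from requiem.xml, and the variable names are just for illustration. The (string) casts turn the SimpleXMLElement objects into plain strings, which is handy when you want to compare or store the values.

```php
<?php
// Inline, shortened copy of requiem.xml with the key attribute added (an assumption for this sketch).
$xml = <<<XML
<requiem key="D">
 <line>
  <latin>Confutatis maledictis</latin>
  <english>When the wicked are confounded</english>
 </line>
</requiem>
XML;

$file = simplexml_load_string($xml);

// Array-style access reads an attribute of the element.
$key = (string) $file["key"];

// Object-style access reads a child element.
$first_latin = (string) $file->line[0]->latin;

// There is no <key> child element, so this cast gives an empty string.
$missing = (string) $file->key;

echo "Key: $key\n";
echo "First line: $first_latin\n";
```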

XML in, objects out

As we've seen, the magic of SimpleXML is that it lets you work with your data files as PHP objects and arrays, which means you can ignore XML semantics and concentrate on getting your code right. This ease of use extends neatly to setting data inside the variables, because once you have them as PHP objects and arrays you can do what you want with them. More interestingly, you can make your changes and export the data back into XML. To demonstrate this we'll use some new XML:

<park>
 <squirrel name="Squirly">
  <nuts>320</nuts>
 </squirrel>
 <squirrel name="Nick">
  <nuts>0</nuts>
 </squirrel>
</park>

So we have two squirrels called Squirly and Nick, in a park. Squirly has lots of nuts and Nick has none. To solve this we need SimpleXML's ability to change values on the fly, and, it being 'Simple' XML, doing so is as simple as assigning to it, like this:

<?php
 $park = simplexml_load_file("squirrels.xml");
 $park->squirrel[1]->nuts = 10;
 print $park->asXML();
?>
 

This script loads the squirrels into the usual set of objects and arrays, then accesses one of the squirrels (the second one; remember it's all zero-based, so squirrel[1] is Nick) and reinstates his nuts. However, the important part is the asXML() call, which converts our modified SimpleXML data back into XML, and it actually gives us better XML than we started with! Here's the output:

<?xml version="1.0"?>
<park>
 <squirrel name="Squirly">
   <nuts>320</nuts>
 </squirrel>
 <squirrel name="Nick">
   <nuts>10</nuts>
 </squirrel>
</park>

Nick has his nuts, and the whole XML also starts with a nice version tag at the beginning, which our original didn't have. The asXML() function is available to any SimpleXMLElement object, which means you could use $park->squirrel[1]->asXML() to print only the XML for Nick.

Note, though, that if you use asXML() with anything other than the root element, the XML declaration isn't prepended; you just get a fragment of XML. Once you have the output as XML, use the file_put_contents() function to save it to a file of your choice.
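Here is a rough sketch of that save step, under a few assumptions: the squirrel XML is loaded from a string instead of squirrels.xml, and the output path (a file in the system temp directory) is a made-up location you would swap for your own. It writes the modified document, reads it back to check the change stuck, and grabs Nick's fragment to show the missing declaration.

```php
<?php
// Inline copy of the squirrel data (an assumption; the tutorial loads squirrels.xml).
$xml = '<park><squirrel name="Squirly"><nuts>320</nuts></squirrel>'
     . '<squirrel name="Nick"><nuts>0</nuts></squirrel></park>';

$park = simplexml_load_string($xml);
$park->squirrel[1]->nuts = 10;

// Hypothetical output location: any writable path will do.
$path = sys_get_temp_dir() . "/squirrels_updated.xml";
file_put_contents($path, $park->asXML());

// Load the saved file again to confirm the new value was written.
$reloaded = simplexml_load_file($path);
$nick_nuts = (int) $reloaded->squirrel[1]->nuts;
echo "Nick now has $nick_nuts nuts\n";

// Calling asXML() on a non-root element gives a fragment with no XML declaration.
$document = $park->asXML();
$fragment = $park->squirrel[1]->asXML();

unlink($path); // tidy up the temporary file
```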

A number of extra standards have been developed to extend the power of XML, of which the most notable is XPath. This makes XML more SQL-like: rather than just use it to store information, you can query it too. The basics of XPath allow you to pull out specific parts of your XML using a filename-style path. In PHP you do this with the xpath() method, which returns an array of SimpleXMLElement objects that matched your query.

First steps on the XPath

Here's some new XML, books.xml:

<books>     
 <author nationality="British" name="Jane Austen">       
  <book>Pride And Prejudice</book>        
  <book>Sense And Sensibility</book>      
 </author>      
 <author nationality="Colombian" name="Gabriel Garcia Marquez">       
  <book>Cien anos de soledad</book>        
  <book>El coronel no tiene quien le escriba</book>      
 </author>      
 <author nationality="British" name="David Baddiel">       
  <book>Time For Bed</book>        
  <book>The Secret Purposes</book>      
 </author>    
</books>            

This time there are three <author> elements, of which two are British; and all three have books associated with them. Using XPath we could pull out a list of all the books in one query, using this script:

<?php
 $authors = simplexml_load_file("books.xml");
 $books = $authors->xpath("//book");
 foreach($books as $book) {
  echo $book, "\n";
 }
?>

The //book query looks for book elements anywhere in the document. The // part is the 'look anywhere' path, which means the xpath() method will find book elements at the root, book elements inside author parents, and indeed book elements anywhere else, and return them all as an array. We then pass that array into a foreach loop, printing each one out. This is another curious thing: if you use var_dump($book) rather than just printing it out, you'll see that it's a SimpleXMLElement object. This is part of the same magic that has SimpleXML overriding the array operator: in this case it uses the __toString() magic method so you can use print rather than resort to periphrasis.
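Because each match is an object rather than a string, it can be worth casting the results if you plan to sort or compare them. The following is a small sketch of that idea, assuming a cut-down copy of the book data loaded from a string; the $titles array is an illustrative name, not part of the original script.

```php
<?php
// Cut-down inline copy of books.xml (an assumption for this sketch).
$xml = <<<XML
<books>
 <author nationality="British" name="Jane Austen">
  <book>Sense And Sensibility</book>
  <book>Pride And Prejudice</book>
 </author>
</books>
XML;

$authors = simplexml_load_string($xml);

// Cast each matched SimpleXMLElement to a plain string before storing it.
$titles = array();
foreach ($authors->xpath("//book") as $book) {
    $titles[] = (string) $book;
}

// Ordinary array functions now work on ordinary strings.
sort($titles);
echo implode(", ", $titles), "\n";
```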

If you want to grab specific books, XPath can help you there too. For example, add this XML to books.xml, just before the </books> line:

<library name="British Library">     
 <book>       
  <title>The Peloponnesian War</title>        
  <author>Donald Kagan</author>
 </book>      
 <book>       
  <title>The Peloponnesian War</title>       
  <author>Thucydides</author>
 </book>    
</library>      

Now we have more <book> elements, except that they are different—the author books were ones they had written, and the library books are ones they have available. However, if we use the //book search, it will return them all, regardless of their placement in the data—not ideal if you're interested only in library books rather than all books.

The solution is to run a more specific search: you can specify XML hierarchy, like this:

 $books = $authors->xpath("/books/library/book");

This will pick out the <book> elements that are in libraries, not books written by the authors. Keep in mind that this still returns SimpleXMLElement objects, which means that you'll get back library books that have title and author variables—we could have used /books/library/book/title if we wanted that exact element, but as each book has a title and an author we should really grab the book and use normal object-oriented programming methods to pull out the child data. For example:

<?php
 $all_books = simplexml_load_file("books.xml");
 $library_books = $all_books->xpath("/books/library/book");
 foreach($library_books as $book) {
  echo "{$book->title} was written by {$book->author}\n";
 }
?>

Although XML isn't fast, XPath is: at this point the XML has already been parsed and converted into an internal data structure, and that's what gets searched when you use xpath().

Divide and query

XPath can also filter based on values using a limited set of operators. For example, we can use an XPath query to filter authors so that only British authors get returned:

<?php
 $all_books = simplexml_load_file("books.xml");
 $british_authors = $all_books->xpath('/books/author[@nationality="British"]');
 foreach($british_authors as $author) {
  echo "{$author["name"]} is British.\n";
 }
?>

The key part is inside the square brackets. The path itself says that we want <author> elements with <books> as their parent, and the square brackets contain the XPath filter applied to them. The @nationality part selects the attribute of that name, and as you can see we have used ="British" to limit the search to British authors only. The @ sign is key: without it, XPath will search for child elements that have that name, rather than attributes. For example, /books/author[book="Cien anos de soledad"] would return the <author> element for Gabriel Garcia Marquez.
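The difference the @ sign makes is easy to check for yourself. This sketch assumes an inline copy of the author data loaded with simplexml_load_string(), and runs the same filter three ways: on an attribute, on a child element, and on a child element that doesn't exist.

```php
<?php
// Inline copy of the author part of books.xml (an assumption for this sketch).
$xml = <<<XML
<books>
 <author nationality="British" name="Jane Austen">
  <book>Pride And Prejudice</book>
 </author>
 <author nationality="Colombian" name="Gabriel Garcia Marquez">
  <book>Cien anos de soledad</book>
 </author>
 <author nationality="British" name="David Baddiel">
  <book>Time For Bed</book>
 </author>
</books>
XML;

$all_books = simplexml_load_string($xml);

// With @, the filter tests the nationality attribute.
$by_attribute = $all_books->xpath('/books/author[@nationality="Colombian"]');

// Without @, the filter tests a child element, here <book>.
$by_child = $all_books->xpath('/books/author[book="Cien anos de soledad"]');

// There is no <nationality> child element, so nothing matches.
$no_match = $all_books->xpath('/books/author[nationality="Colombian"]');

echo count($by_attribute), " match by attribute\n";
echo count($by_child), " match by child element\n";
echo count($no_match), " matches without the @ sign\n";
```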

Along with =, you'll find the usual suspects: <, >, <=, >= and != all work as expected, but you also have and and or, thus:

$club_1830_eligible = $holidaymakers->xpath('/books/author[@age>=18 and @age<=30]');

If you have multi-part expressions, express them using brackets, like this: /books/author[@age>=18 and (@name="Jim" or @name="Bob")].
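It's worth trying and, or and brackets on a tiny data set, because a slip between and and or changes the result completely. The sketch below uses an invented <people> document with name and age attributes (none of it comes from books.xml) and counts what each filter returns.

```php
<?php
// Invented sample data with name and age attributes (an assumption for this sketch).
$xml = <<<XML
<people>
 <person name="Jim" age="17" />
 <person name="Bob" age="25" />
 <person name="Ann" age="29" />
 <person name="Jim" age="45" />
</people>
XML;

$people = simplexml_load_string($xml);

// Both conditions must hold: only ages from 18 to 30 inclusive.
$in_range = $people->xpath('/people/person[@age>=18 and @age<=30]');

// Either condition may hold: every age satisfies at least one of them.
$either = $people->xpath('/people/person[@age>=18 or @age<=30]');

// Brackets group the name test so it is combined with the age test as a whole.
$grouped = $people->xpath('/people/person[@age>=18 and (@name="Jim" or @name="Bob")]');

echo count($in_range), " in range, ", count($either), " with or, ", count($grouped), " grouped\n";
```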

As well as comparison operators, it's possible to run some basic calculations on the values and filter by the result. For this purpose you'd use +, -, *, div and mod alongside the others. Here are some examples of XPath queries that filter results:

$blessed = $people->xpath('//person[nationality="Spanish"]');
$meaning_of_life = $earth->xpath('//monkeys[@favourite_number = 7 * 6]');
$a_grade_freshmen = $university->xpath('/students/student[@year = 1 and grades > 80.0]');
$dangerous = $people->xpath('//adults[@iq = @shoesize]'); // note how we can compare one attribute against another
$squirrels_with_comedy_oversized_tails = $animals->xpath('//squirrels[@tail > @body-length * 6]');
$offenders = $people->xpath('//people[@outstanding_penalty = true()]');
// this returns records that have an outstanding_penalty attribute, empty or otherwise
$good_wines = $wines->xpath('/drinks/alcoholic/wines/wine[@year mod 2 = 1 and (@country="Australia" or @country="France")]');

Looking at that last line, I think you'll agree that complex expressions are hard to read in XPath, so they are best avoided!
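If you do need a calculated filter, testing it against a handful of records is a quick sanity check. This sketch assumes a made-up <wines> document with year and country attributes, then runs a mod filter and a multiplication filter against it.

```php
<?php
// Made-up wine list (an assumption for this sketch, not data from the tutorial).
$xml = <<<XML
<wines>
 <wine name="A" year="2001" country="France" />
 <wine name="B" year="2002" country="France" />
 <wine name="C" year="2003" country="Chile" />
 <wine name="D" year="2005" country="Australia" />
</wines>
XML;

$wines = simplexml_load_string($xml);

// Odd years only, and only from Australia or France.
$good_wines = $wines->xpath('/wines/wine[@year mod 2 = 1 and (@country="Australia" or @country="France")]');

// The right-hand side is calculated by XPath before the comparison: 667 * 3 is 2001.
$computed = $wines->xpath('//wine[@year = 667 * 3]');

foreach ($good_wines as $wine) {
    echo "{$wine["name"]} ({$wine["year"]}, {$wine["country"]})\n";
}
```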

Crossed words

Now you have an idea of how SimpleXML works, it's time to turn our attention to a new problem: storing a crossword puzzle. Crosswords use large grids, sort of like in Sudoku, where each box is filled in with a value by the solver as they progress, also like Sudoku. This is, of course, completely coincidental.

There are two ways this data can be stored: using a 'pure' solution, where every square has its own element; and an 'impure' solution, where the grid has an element that's made up of characters that represent the grid. We'll turn to an impure solution first, as it's simplest. Here's how it might look in XML:

<crossword>
 <title>My Excellent Crossword</title>
  <grid>
-l--t-
-linux
-a--l-
-m--i-
-agape
------
  </grid>
</crossword>
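As a rough sketch of reading this format, the following assumes the XML above is held in a string, that rows are separated by whitespace, and that a dash marks a black square. It splits the <grid> text into rows and then into single characters, giving a two-dimensional array in $grid (an illustrative name).

```php
<?php
// Inline copy of the character-based crossword.
$xml = <<<XML
<crossword>
 <title>My Excellent Crossword</title>
  <grid>
-l--t-
-linux
-a--l-
-m--i-
-agape
------
  </grid>
</crossword>
XML;

$crossword = simplexml_load_string($xml);

// Trim the surrounding whitespace, then split the text into one string per row.
$rows = preg_split('/\s+/', trim((string) $crossword->grid));

// Split each row into single characters, one per square.
$grid = array();
foreach ($rows as $row) {
    $grid[] = str_split($row);
}

// Print the grid back out, showing black squares as '#'.
foreach ($grid as $row) {
    echo str_replace("-", "#", implode("", $row)), "\n";
}
```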

To load this, you would pull out the <grid> element and parse each character into a crossword square. This isn't ideal: how do you store the crossword numbers for squares? How do you attach clues? How can people save their progress? These problems need 'proper' XML: every square has to be an element. These square elements will have all the attributes we need:

  • type (black or white).
  • number (if applicable).
  • direction (down, across or both, if applicable).
  • downclue (if applicable).
  • acrossclue (if applicable).
  • correct (the letter that should be in there).
  • current (the letter that the solver has put in there).
  • guessed (a letter the solver isn't sure about).

The elements will be sorted so that the first row comes first, then the second row, and so on until the end. The grid will store settings: author, difficulty (a number between 1 and 4, 1 being easiest), and grid size (6 would be six squares across and down).

To save space here, we'll use a 3x3 grid, like this one:

<grid author="Paul Hudson" difficulty="1" size="3">
 <square type="white" number="1" direction="both" downclue="Water stopper" acrossclue="Four-legged cat-hater" correct="d" current="" guessed="" />
 <square type="white" correct="o" current="" guessed="" />
 <square type="white" number="2" direction="down" downclue="Deity" correct="g" current=""  guessed="" />
 <square type="white" correct="a" current="" guessed="" />
 <square type="black" />
 <square type="white" correct="o" current="" guessed="" />
 <square type="white" number="3" direction="across" acrossclue="Crazily annoyed" correct="m" current="" guessed="" />
 <square type="white" correct="a" current="" guessed="" />
 <square type="white" correct="d" current="" guessed="" />
</grid>

If that were saved in the file crossword.xml, we could use the following PHP to print it out with all the correct answers.

<?php
 $crossword = simplexml_load_file("crossword.xml");
 $i = 0; // square counter
 foreach($crossword->square as $square) {
  if ($square["type"] == "white") {
    print $square["correct"];
  } else {
    print " ";
  }
  ++$i;
  // cast the size attribute to an integer before doing arithmetic with it
  if ($i % (int) $crossword["size"] == 0) print "\n";
 }
?>
    

The if ($i % ...) line is what inserts line breaks: the XML elements are loaded sequentially, so we need to insert a line break each time the square counter hits $crossword["size"] or a multiple of it.
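The same counter can tell you where each square sits, which is useful when listing clues. Here's a sketch that assumes a trimmed copy of the 3x3 grid (only the attributes needed here are kept) and builds a list of across clues with their row and column, counting from 1. The $clues array and its wording are illustrative.

```php
<?php
// Trimmed inline copy of the 3x3 grid: only the attributes this sketch needs.
$xml = <<<XML
<grid author="Paul Hudson" difficulty="1" size="3">
 <square type="white" number="1" acrossclue="Four-legged cat-hater" correct="d" />
 <square type="white" correct="o" />
 <square type="white" number="2" correct="g" />
 <square type="white" correct="a" />
 <square type="black" />
 <square type="white" correct="o" />
 <square type="white" number="3" acrossclue="Crazily annoyed" correct="m" />
 <square type="white" correct="a" />
 <square type="white" correct="d" />
</grid>
XML;

$crossword = simplexml_load_string($xml);
$size = (int) $crossword["size"];

$i = 0; // square counter, as in the printing script
$clues = array();
foreach ($crossword->square as $square) {
    // Only squares that start an across answer carry an acrossclue attribute.
    if (isset($square["acrossclue"])) {
        $row = (int) floor($i / $size) + 1;
        $col = $i % $size + 1;
        $clues[] = "{$square["number"]} across (row $row, column $col): {$square["acrossclue"]}";
    }
    ++$i;
}

echo implode("\n", $clues), "\n";
```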

Is that it? Well, our effort only displays puzzles rather than allowing people to solve them, it doesn't generate puzzles (hint: snag one of the many free dictionaries from the web), and it doesn't have a pretty GUI attached to it. I'm leaving the first two for you to fix yourself, but it just so happens that the next tutorial will help you with the third.

SUDOKU HINTS, PART 2

Here are some more tips to help you win our Sudoku bounty. Am I nice or what?

  • Create a function called solve() that takes a square as its parameter and tries to place a number into a given box.
  • Inside the solve() function, call various other functions that try one individual method of solving, for example solveTwin(), solveBrute() and so on. This makes it easy to order them for maximum performance later on.
  • Use a serialised array for the 'guessed' attribute—your puzzlers can make multiple guesses, remember.
  • Write three functions that return an array of all elements in the same box, row, and column as the current square. You'll use this often, so get it right in functions for easy re-use.
  • Don't try to write a complicated multi-threaded solution. Unless you have some serious brainfarts, your code will be plenty fast enough.
  • For the most basic solution, use random numbers to determine where you start. To get an idea of difficulty, run the program three times and average the number of moves required to solve it.
  • For a more advanced solution, examine what's on the board and intelligently choose somewhere to start. This is harder: get the random number solver working first, then go from there—it's better to have a basic solver than no solver at all!
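To make the row, column and box hint a little more concrete, here is one possible sketch. It assumes a standard 9x9 puzzle stored as a flat list of 81 squares numbered 0 to 80, row by row, and it returns square numbers rather than the squares themselves. The function names are invented for this example; your own design may well differ.

```php
<?php
// Assumption: squares are numbered 0..80, left to right, top to bottom.

// Numbers of all squares in the same row as $index.
function row_indexes($index) {
    $start = $index - $index % 9;
    return range($start, $start + 8);
}

// Numbers of all squares in the same column as $index.
function column_indexes($index) {
    $col = $index % 9;
    return range($col, $col + 72, 9);
}

// Numbers of all squares in the same 3x3 box as $index.
function box_indexes($index) {
    $row = (int) floor($index / 9);
    $col = $index % 9;
    $top = $row - $row % 3;   // first row of the box
    $left = $col - $col % 3;  // first column of the box
    $result = array();
    for ($r = $top; $r < $top + 3; ++$r) {
        for ($c = $left; $c < $left + 3; ++$c) {
            $result[] = $r * 9 + $c;
        }
    }
    return $result;
}

// Square 40 is the centre of the board.
echo implode(" ", box_indexes(40)), "\n";
```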

That's enough hints for now, but you might want to refer to the last tutorial for more. If you're lucky, you'll get a third set of clues in the next tutorial, but you should be on your way to a solution by then!

QUICK TIP: Multiple expressions

It's possible to chain multiple expressions in one query. For example, you might want the names of the heads of state of countries in North America, but only if the head of state has children:

$na_hos_with_kids = $countries->xpath('/countries/country[continent="North America"]/headofstate[haschildren = true()]/name');

Yes, it works, and no, it's not easily read; but it's all we have!
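Here is a sketch of that query running against some invented data; the country and leader names are placeholders, and the layout of the <countries> document is an assumption based on the path in the query. One thing to keep in mind: comparing an element with true() tests whether the element exists, so in this sketch a head of state "has children" simply when a <haschildren> element is present.

```php
<?php
// Invented data: names and structure are placeholders for this sketch.
$xml = <<<XML
<countries>
 <country>
  <name>Country A</name>
  <continent>North America</continent>
  <headofstate><name>Leader A</name><haschildren /></headofstate>
 </country>
 <country>
  <name>Country B</name>
  <continent>North America</continent>
  <headofstate><name>Leader B</name></headofstate>
 </country>
 <country>
  <name>Country C</name>
  <continent>Europe</continent>
  <headofstate><name>Leader C</name><haschildren /></headofstate>
 </country>
</countries>
XML;

$countries = simplexml_load_string($xml);

// Filter countries first, then heads of state, then step down to the name.
$na_hos_with_kids = $countries->xpath(
    '/countries/country[continent="North America"]/headofstate[haschildren = true()]/name'
);

$names = array();
foreach ($na_hos_with_kids as $name) {
    $names[] = (string) $name;
}
echo implode(", ", $names), "\n";
```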