PHP - The Curl library

From LXF Wiki


PHP: Come Curling

(Original version written by Paul Hudson for LXF issue 68.)


Sockets are shiny, and FTP is efficient, but why must you treat them separately? We get on the ice to show off Curl.


PHP 5.1 looms on the horizon like the Super Star Destroyer in The Empire Strikes Back. Of course, it has the PHP development team at its helm rather than Darth Vader, but let's not quibble over details: it's coming, and should enter beta in the next few months. We'll be giving it a thorough examination once the beta process starts, but for now we recommend you visit the PHP snapshots site at http://snaps.php.net and compile the source yourself to see what's changing.

This issue our main focus is the Curl library, whose name is short for Client URL Library. It is a unifying system that groups HTTP, FTP, Telnet and other protocols under one roof, meaning you needn't worry about protocol-specific PHP functions. Basic Curl usage is easy, but there are quite a few special constants to learn - more on that later!


Getting started

The first thing to try with Curl is to initialise the library, download a URL, then close the library. Curl is really just a state machine, which means you need to set its configuration options completely before asking it to execute. Although the Curl extension has quite a few functions, the main four are:

  • curl_init() - creates the Curl instance and returns it for you to store in a variable
  • curl_setopt() - sets configuration options for a given Curl instance
  • curl_exec() - runs the Curl instance
  • curl_close() - closes the Curl instance and frees up the memory

We need all four to get one complete Curl script, like so:

<?php
   $curl = curl_init();
   curl_setopt ($curl, CURLOPT_URL, "http://www.worldcurlingfederation.org");
   curl_exec ($curl);
   curl_close ($curl);
?>

So, we squirrel away the return value from curl_init() for use in later functions - this is crucial, as with most PHP extensions. Then, curl_setopt() is called, passing in that Curl instance along with CURLOPT_URL and a URL. It's a bit of a no-brainer what that does: it simply tells Curl to use the third parameter as the URL to visit. The next line uses curl_exec() to execute the Curl request, then curl_close() frees up the memory. The important part, as you can see, is the curl_setopt() call: that's where you set all the options for your Curl request, and it can drastically change what Curl does.

By default, curl_exec() runs the request, then outputs whatever it received back directly to the screen. This can be changed with curl_setopt(), using the CURLOPT_RETURNTRANSFER setting - it's set to 0 by default, which prints to the screen, but changing it to 1 will cause curl_exec() to return the received data rather than just print it out. We can rewrite the above script like this:

<?php
   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, "http://www.worldcurlingfederation.org");
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
   $return = curl_exec($curl);
   curl_close($curl);
   print $return;
?>

Although the two scripts are functionally identical, the latter allows us to post-process the page before printing it, if it gets printed at all - there's no reason it couldn't be saved to a file or sent back over the wire someplace else. If you do plan to save the data to a file, Curl can handle that for you too, with a different constant: CURLOPT_FILE. Its value should be a file pointer opened for writing, like this:

<?php
   $curl = curl_init();
   $file = fopen("output.txt", "w");
   curl_setopt($curl, CURLOPT_URL, "http://www.worldcurlingfederation.org");
   curl_setopt($curl, CURLOPT_FILE, $file);
   curl_exec($curl);
   curl_close($curl);
   fclose($file);
?>

Now you have seen curl_setopt() taking a string, an integer, and a file pointer as its third parameter - its variability means you only need the one function for setting options.
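
Because every option is just a constant paired with a value, the four calls can be wrapped in a small helper. What follows is a minimal sketch rather than part of the original tutorial: the function name fetch_url() and its $options parameter are made up for illustration. It also checks the return value of curl_exec(), which is false when the transfer fails.

```php
<?php
   // Hypothetical helper: fetch a URL and return the response body as a string.
   // $options is an optional array of extra CURLOPT_* constants => values.
   function fetch_url($url, $options = array()) {
      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

      // apply any caller-supplied options, one curl_setopt() call each
      foreach ($options as $option => $value) {
         curl_setopt($curl, $option, $value);
      }

      $body = curl_exec($curl);
      if ($body === false) {
         // curl_exec() returns false on failure; fetch the reason before closing
         $message = curl_error($curl);
         curl_close($curl);
         throw new RuntimeException("Curl request failed: $message");
      }

      curl_close($curl);
      return $body;
   }
```

With that in place, the second script in this section shrinks to a single line: print fetch_url("http://www.worldcurlingfederation.org");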

By default, Curl sends HTTP requests using GET. This can be changed through two other constants: CURLOPT_POST, which enables HTTP POST mode, and CURLOPT_POSTFIELDS, which is where you specify the fields you want to send over POST. To test this, we need to create a second script that will output the data we send to it with the first script, so save this code as postreceive.php:

<?php
  var_dump($_REQUEST);
?>

We then need to modify the original script to this:

<?php
   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, "http://localhost/postreceive.php");
   curl_setopt($curl, CURLOPT_POST, 1);
   curl_setopt($curl, CURLOPT_POSTFIELDS, "This=Test&That=Test&Suxxors=Nick");
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
   $result = curl_exec($curl);
   curl_close($curl);
   print $result;
?>

That sends three fields over HTTP POST: "This", with a value of "Test"; "That", which also has a value of "Test"; and "Suxxors", which has a value of "Nick". If you saved your postreceive.php file into your public HTML directory and have Apache configured correctly, you should get the following output:

array(3) {
   ["This"] =>
   string(4) "Test"
   ["That"] =>
   string(4) "Test"
   ["Suxxors"] =>
   string(4) "Nick"
}

Because the POST fields are delimited using = and & symbols, you should be careful that your field names (and their values) don't use either of these; if they might, pass each one through urlencode() first.
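
One way to sidestep the problem is to let PHP build the string for you. The sketch below assumes the built-in http_build_query() function, which URL-encodes every name and value; the helper names encode_post_fields() and post_fields() and the "Tricky" field are hypothetical, chosen only to illustrate the idea.

```php
<?php
   // Turn an array of fields into a POST body; any = or & inside
   // a name or value is escaped rather than treated as a delimiter.
   function encode_post_fields($fields) {
      return http_build_query($fields, "", "&");
   }

   // Hypothetical helper: POST the fields to $url and return the
   // response as a string (or false if the request failed).
   function post_fields($url, $fields) {
      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_POST, 1);
      curl_setopt($curl, CURLOPT_POSTFIELDS, encode_post_fields($fields));
      curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
      $result = curl_exec($curl);
      curl_close($curl);
      return $result;
   }

   // "a=b&c" would corrupt a hand-written string; here it is escaped
   print encode_post_fields(array("This" => "Test", "Tricky" => "a=b&c")) . "\n";
   // prints: This=Test&Tricky=a%3Db%26c
```

Calling post_fields("http://localhost/postreceive.php", array("This" => "Test")) would then behave like the script above, but with the encoding taken care of.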


Switching to FTP

Curl unifies protocols, which means we can mix HTTP and FTP using the same functions. Indeed, with WebDAV enabled, there is little difference between an FTP server and an HTTP server, at least from the perspective of the guest. Curl abstracts this nicely: we can use our existing scripts to access FTP servers, simply by changing the protocol. For example:

<?php
   $curl = curl_init();
   curl_setopt ($curl, CURLOPT_URL, "ftp://ftp2.futurenet.co.uk");
   curl_exec ($curl);
   curl_close ($curl);
?>

Other than the URL changing, that's identical to our first Curl script; Curl has allowed us to switch protocols without switching code. We've asked it to connect to the Future FTP server, so Curl will connect to it and list the contents of the base directory - your output should be this:

total 3934
d--x--x--x   2 bin          512 Jan 17  2001 bin
drwxrwx---   2 4001         512 Sep 19  2001 csl
dr-xr-xr-x   2 sys          512 Jan 17  2001 dev
drwxr-xr-x   7 root         512 Mar 13  2000 diskeds
d--x--x--x   2 sys          512 Jan 17  2001 etc
drwx------  14 root        8192 Aug 12  2003 lost+found
drwxrwxr-x   6 4            512 Apr 26  2004 pub
-rw-------   1 other    1985288 Nov  7  2002 restoresymtable
dr-xr-xr-x   4 sys          512 Jan 17  2001 usr

That gives us the ownership and privilege rights, as well as all the files. There are FTP-specific options for Curl if you want to use them, such as the ability to specify custom credentials or change the output. Custom credentials - that is, a specific username and password - are set with the CURLOPT_USERPWD option, which takes a username and password separated by a colon. Without this information, Curl logs on anonymously for you, which works for the majority of FTP servers. We could rewrite the previous script to log onto the Future FTP server with invalid credentials - this will cause an error, and give us a chance to use a new function, curl_error(), that returns the error message from Curl:

<?php
   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, "ftp://ftp2.futurenet.co.uk");
   curl_setopt($curl, CURLOPT_USERPWD, "paul:helloworld");
   curl_exec ($curl);
   echo curl_error($curl);
   curl_close ($curl);
?>

This time we'll get the output "the username and/or the password are incorrect", which is what you would expect given that we made up the credentials. If you change the credentials to "anonymous:your@email.com", the login should work, because it will use the anonymous account. If you want Curl to list the files without their permission information, you can enable the CURLOPT_FTPLISTONLY option. This returns filenames one per line, so you should be able to convert that result into an array and actually use it to download files, like this:

<?php
   $location = "ftp://ftp.gnu.org/gnu/bash/";

   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, $location);
   curl_setopt($curl, CURLOPT_FTPLISTONLY, 1);
   curl_setopt($curl, CURLOPT_USERPWD, "anonymous:your@email.com");
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

   $return = trim(curl_exec($curl));
   $files = explode("\n", $return);
   curl_close ($curl);

   // pick one filename at random from the listing
   $randfile = trim($files[array_rand($files)]);

   // start a second request that saves the chosen file locally
   $curl = curl_init();
   $file = fopen($randfile, "w");

   curl_setopt($curl, CURLOPT_URL, "$location$randfile");
   curl_setopt($curl, CURLOPT_USERPWD, "anonymous:your@email.com");
   curl_setopt($curl, CURLOPT_FILE, $file);
   curl_exec($curl);
   curl_close ($curl);   

   fclose($file);
?>

That script connects to the GNU FTP server, and snags a random file from the Bash directory. It may take a few minutes to run, depending on your connection speed and also what file is randomly chosen. If you don't use CURLOPT_FILE with that script, it will print the file to the screen and look messy - but on the flip side, that means you can pipe the output from the script to something else (perhaps tar, to extract the download?).
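
The step that turns the listing into an array is worth isolating, because FTP servers usually end each line with a carriage return as well as a line feed, and a name supplied by a remote server should not be trusted blindly as a local path. Here is a sketch of one way to do it; the function name parse_name_list() is hypothetical, and the filenames in the example call are made-up sample data rather than a real listing.

```php
<?php
   // Split the result of a CURLOPT_FTPLISTONLY request into clean filenames.
   function parse_name_list($listing) {
      $names = array();

      // accept \r\n, \r or \n as the line ending
      foreach (preg_split('/\r\n|\r|\n/', $listing) as $line) {
         // basename() drops any directory part, so "../x" becomes "x"
         $name = basename(trim($line));

         // skip blank lines and the special directory entries
         if ($name === "" || $name === "." || $name === "..") {
            continue;
         }
         $names[] = $name;
      }
      return $names;
   }

   // sample data only: two names separated by FTP-style line endings
   print_r(parse_name_list("bash-3.0.tar.gz\r\nbash-3.0.tar.gz.sig\r\n"));
```

In the script above, $files = parse_name_list(curl_exec($curl)); could stand in for the trim() and explode() pair.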


Debugging

The sheer amount of automation inside Curl means that it's hard to debug your scripts using echo statements. Some things - particularly FTP - involve a lot of negotiation of credentials, and so tend to feel slow. In these situations, you need the CURLOPT_VERBOSE switch, which forces Curl to output its communications with the server. This debug text is written to standard error by default, separately from the content of the URL itself. That means that if you only want the debug output, you can enable CURLOPT_RETURNTRANSFER then just ignore the return value from curl_exec(), like this:

<?php
   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, "http://www.linuxformat.co.uk");
   curl_setopt($curl, CURLOPT_VERBOSE, 1);
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
   curl_exec($curl);
   curl_close ($curl);
?>

That will request the index page from the LXF homepage, print the verbose debug output, but not print the output from the page. When you run that script, you should see something like this:

* About to connect() to www.linuxformat.co.uk port 80
* Connected to www.linuxformat.co.uk (212.113.202.71) port 80
> GET / HTTP/1.1
Host: www.linuxformat.co.uk
Pragma: no-cache
Accept: */*

< HTTP/1.1 200 OK
< Date: Wed, 13 Apr 2005 10:38:18 GMT
< Server: Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2
< X-Powered-By: PHP/4.1.2
< Set-Cookie: POSTNUKESID=44fca78a116342c31c63befab03d6389; expires=Sat, 16-Apr-05 10:38:18 GMT; path=/
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: cache
< Pragma: no-cache
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=iso-8859-1
* Connection #0 left intact
* Closing connection #0

Lines starting with an asterisk (*) are information, lines starting with a closing angle bracket (>) are sent by Curl, and lines starting with an opening angle bracket (<) are received by Curl. After the initial "about to connect" messages, Curl sends out a standard HTTP request for the index page of the site. This is split across several lines, as dictated by the HTTP standard. The LXF web server then responds with various HTTP headers (the page isn't printed out, as planned), then leaves the connection intact (i.e. Keepalive is enabled). Curl doesn't have any further requests, so it closes the connection and the script ends. Using the CURLOPT_VERBOSE option is really the only way to know what Curl is doing behind your back, as it were, but keep in mind that Curl prints the debug text in the clear, which might be a security risk if you are sending passwords.
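
If the debug text needs to end up somewhere safer than the terminal, it can be captured and cleaned first. The following sketch assumes the CURLOPT_STDERR option, which redirects Curl's debug output to a file pointer of your choosing; the function names and the "[redacted]" marker are hypothetical, and the pattern only covers FTP PASS commands and HTTP Authorization headers, so treat it as a starting point rather than a complete filter.

```php
<?php
   // Mask FTP passwords and HTTP Authorization headers in a verbose log.
   function redact_credentials($log) {
      return preg_replace(
         '/^((?:> )?(?:PASS|Authorization:) )[^\r\n]*/mi',
         '$1[redacted]',
         $log
      );
   }

   // Hypothetical helper: run a request and return Curl's debug text
   // as a string instead of letting it reach the terminal.
   function fetch_verbose_log($url) {
      $log = fopen("php://temp", "w+");   // in-memory scratch file

      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_VERBOSE, 1);
      curl_setopt($curl, CURLOPT_STDERR, $log);        // debug text goes here
      curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);   // page is returned, then ignored
      curl_exec($curl);
      curl_close($curl);

      // read back everything Curl wrote
      rewind($log);
      $text = stream_get_contents($log);
      fclose($log);

      return redact_credentials($text);
   }
```

A call such as print fetch_verbose_log("http://www.linuxformat.co.uk"); should show the same kind of conversation as above, with any password lines masked.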


Assorted options

Curl has many more options than we have covered here, and you should refer to the online PHP manual for the full list. However, there are a few you can try out now to have a play with Curl's powerful features. First up are CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS, which dictate how Curl should respond to HTTP redirects. For example, if the page you request wants to send you to a different page, the server will send a Location header pointing to the new file; if you want Curl to get the new file instead, set CURLOPT_FOLLOWLOCATION to 1. You can control how many redirects Curl should follow with the CURLOPT_MAXREDIRS option; this is recommended just in case there's a problem and the Location redirect points to itself, causing an infinite loop.

Usage of CURLOPT_FOLLOWLOCATION is important when sites use URL redirectors for file downloads. For example, if you download PHP from the PHP site, the URL is "http://uk.php.net/get/php-5.0.4.tar.bz2/from/this/mirror" - clearly not a file! Instead, that redirects you to the correct file. This next script tries to grab that file with CURLOPT_FOLLOWLOCATION disabled and debug output turned on:

<?php
   $curl = curl_init();
   curl_setopt($curl, CURLOPT_URL, "http://uk.php.net/get/php-5.0.4.tar.bz2/from/this/mirror");
   curl_setopt($curl, CURLOPT_VERBOSE, 1);
   curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 0);

   curl_exec($curl);
   curl_close ($curl);
?>

If you run that, you will see that the output contains the line "< Location: http://uk.php.net/distributions/php-5.0.4.tar.bz2" - the server has told us to grab the real file instead. With CURLOPT_FOLLOWLOCATION turned off, Curl will do nothing more, but you can just change the 0 to a 1 in the previous script to have it get the bz2 file as requested.
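
The Location header can also be read by a script rather than by eye. The sketch below is an illustration only: the function names are hypothetical, and it relies on two further options, CURLOPT_HEADER (include the response headers in the output) and CURLOPT_NOBODY (skip the body), so that the redirect target can be inspected without downloading anything.

```php
<?php
   // Pull the target URL out of a block of raw response headers;
   // returns null when there is no Location header.
   function find_redirect_target($headers) {
      if (preg_match('/^Location:[ \t]*(\S+)/mi', $headers, $matches)) {
         return $matches[1];
      }
      return null;
   }

   // Hypothetical helper: ask where $url redirects to, without following it.
   function peek_redirect($url) {
      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_HEADER, 1);           // include headers in the result
      curl_setopt($curl, CURLOPT_NOBODY, 1);           // headers only, no file contents
      curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 0);   // do not chase the redirect
      curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
      $headers = curl_exec($curl);
      curl_close($curl);

      return ($headers === false) ? null : find_redirect_target($headers);
   }
```

A script could use peek_redirect() to decide whether a redirect leads somewhere it trusts before enabling CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS for the real download.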

You can use the CURLOPT_HTTPHEADER setting to pass in an array of HTTP headers you want to send along with your request. This requires some thinking - you ought to be careful with what you pass in, because it needs to be precisely according to the HTTP standard. For example, if you want to enable content compression, you need to use this:

<?php
   $curl = curl_init();
   $http_headers = array("Accept-encoding: gzip");

   curl_setopt($curl, CURLOPT_URL, "http://slashdot.org/");
   curl_setopt($curl, CURLOPT_HTTPHEADER, $http_headers);

   curl_exec($curl);
   curl_close ($curl);
?>

Each element in the $http_headers array needs to be written in the standard HTTP manner, i.e. the name of the header you want to set, a colon, then its value. You can of course send as many headers as you want, but be careful with the example above: because it enables content compression, what you get back will be a jumble of characters that may mess up your terminal (run "reset" immediately afterwards if this happens).
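
Writing the lines by hand invites typos, so one option is to keep the headers in an associative array and generate the "name: value" lines from it. This is a sketch under that assumption; build_header_lines() is a hypothetical name, and the Accept-language header is included purely as a second example.

```php
<?php
   // Build "Name: value" lines suitable for CURLOPT_HTTPHEADER
   // from an associative array of header names and values.
   function build_header_lines($headers) {
      $lines = array();
      foreach ($headers as $name => $value) {
         // a line break inside a header would start a new header, so refuse it
         if (preg_match('/[\r\n]/', $name . $value)) {
            throw new InvalidArgumentException("Header contains a line break: $name");
         }
         $lines[] = "$name: $value";
      }
      return $lines;
   }

   $http_headers = build_header_lines(array(
      "Accept-encoding" => "gzip",
      "Accept-language" => "en"
   ));
   print_r($http_headers);
```

The resulting array can be passed straight to curl_setopt() with CURLOPT_HTTPHEADER. If all you want is compression without the jumble, the CURLOPT_ENCODING option is worth a look: set to "gzip", it asks the server for compressed content and has Curl decompress the response before handing it back.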

There are two cunning Curl options that let you fake parts of your request. CURLOPT_REFERER lets you pretend that the URL was reached through a hyperlink on another page, and CURLOPT_USERAGENT enables you to pretend that Curl is Firefox or some other browser. Both of these have dubious uses at best; we'll leave it up to you to try them out!

Finally, the CURLOPT_RESUME_FROM option lets you specify a position in bytes where an FTP transfer should pick up from. This is directly reliant on the feature being supported by the remote server, which means Curl may not always be able to pick up the transfer.
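
A natural way to choose the resume position is to look at how much of the file is already on disk. The following is a sketch of that idea, not a tested download manager: both function names are hypothetical, and it assumes the partial file contains exactly the first bytes of the remote file.

```php
<?php
   // Number of bytes already on disk, or 0 if the download has not started.
   function resume_offset($path) {
      clearstatcache();   // make sure filesize() is not a stale cached value
      return file_exists($path) ? filesize($path) : 0;
   }

   // Hypothetical helper: continue downloading $url into $path.
   function resume_download($url, $path) {
      $offset = resume_offset($path);
      $file = fopen($path, "a");   // append, so the existing bytes are kept

      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_RESUME_FROM, $offset);
      curl_setopt($curl, CURLOPT_FILE, $file);
      $ok = curl_exec($curl);
      if ($ok === false) {
         // e.g. the server does not support resuming
         print curl_error($curl) . "\n";
      }
      curl_close($curl);
      fclose($file);

      return $ok;
   }
```

Opening the file in append mode is the important design choice here: opening it with "w" would truncate the partial download and leave a file that starts in the middle.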

This has only been a short excursion into the world of Curl, but the functions and options we have covered should be sufficient for most people. PHP 5 introduced the curl_multi_*() functions for running several Curl requests in parallel, but these are quite advanced and probably outside the needs of the majority. There are, as mentioned, many options for curl_setopt() beyond those covered here - the PHP manual is the best place to look for more information about these. Granted, Curl is quite a dry topic, but you should at least be aware of its power now - good luck!


Installing Curl

If you installed PHP as a package through your distro, you should be able to just add the "php-curl" package (the exact name varies between distros) to get Curl support. If you compiled PHP by hand, you will first need libcurl and its development headers, either by installing your distro's package (often called curl-devel or similar) or by downloading and compiling libcurl yourself. With that installed, run your PHP configure line again, this time appending "--with-curl" to the end. You'll need to re-run make and make install to complete the installation.
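
To confirm the installation worked, a short script can ask PHP whether the extension is loaded. This is a minimal check; the function name curl_is_available() is made up for the example, while extension_loaded() and curl_version() are standard PHP functions.

```php
<?php
   // True if the Curl extension is loaded into this copy of PHP.
   function curl_is_available() {
      return extension_loaded("curl") && function_exists("curl_init");
   }

   if (curl_is_available()) {
      // curl_version() describes the libcurl that PHP was built against
      $info = curl_version();
      print "libcurl version: " . $info["version"] . "\n";
      print "Protocols: " . implode(", ", $info["protocols"]) . "\n";
   } else {
      print "The Curl extension is not loaded\n";
   }
```

The protocol list is a quick way to see whether your build of libcurl includes FTP and the other protocols used in this tutorial.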


Curl Dos and Don'ts

  • DO: Always free your Curl resource when you're done
  • DO: Specify HTTP headers exactly as they should be sent
  • DO: Use CURLOPT_VERBOSE for debugging
  • DON'T: Rely on Curl to provide anonymous credentials; be explicit and provide your own
  • DON'T: Pass in a read-only file handle with CURLOPT_FILE, unless you like error messages!
  • DON'T: Use a very high number for CURLOPT_MAXREDIRS, as this could waste lots of CPU time


Living without Curl

If you haven't got Curl, you can still use individual functions that perform the same tasks. For example, fsockopen() allows you to send and receive custom HTTP content using many of the same functions as fopen(). If you want to write to FTP, there are over 30 functions waiting for you. Both of these solutions are much more powerful than Curl, but take more learning and certainly don't allow you to switch protocols without changing your code. If you're looking to do advanced FTP, you will have to bite the bullet and use the individual functions, but most of us should be fine sticking with Curl!
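
To give a feel for the extra work involved, here is a sketch of a plain HTTP GET using fsockopen(). It is deliberately minimal: the function names are hypothetical, the request is HTTP/1.0 on port 80 with a ten-second connection timeout, and the response comes back with its headers still attached, all of which Curl would otherwise handle for you.

```php
<?php
   // Build a minimal HTTP/1.0 GET request by hand.
   function build_get_request($host, $path) {
      return "GET $path HTTP/1.0\r\n"
           . "Host: $host\r\n"
           . "Connection: close\r\n"
           . "\r\n";
   }

   // Hypothetical helper: fetch a page over a raw socket.
   // Returns the headers and body together, or false on failure.
   function socket_fetch($host, $path = "/") {
      $socket = fsockopen($host, 80, $errno, $errstr, 10);
      if (!$socket) {
         print "Connection failed: $errstr ($errno)\n";
         return false;
      }

      fwrite($socket, build_get_request($host, $path));

      // read until the server closes the connection
      $response = "";
      while (!feof($socket)) {
         $response .= fgets($socket, 1024);
      }
      fclose($socket);

      return $response;
   }
```

Something like print socket_fetch("www.linuxformat.co.uk"); would dump the raw response; switching this code to FTP would mean rewriting it around the ftp_*() functions, which is exactly the work Curl saves.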