Downloading remote files in PHP

Wed, Jul 22, 2009 - 2:17pm -- Isaac Sukin

I was recently asked to download a series of images from a remote server. The image URLs were not known ahead of time, so to figure them out, I had to submit a search form, parse the results for a way to determine the file name, and then download the images onto the local server.

Somewhat problematically, the search form was submitted via AHAH and broke when AHAH was turned off. Not only that, but the links in the search results that led to a page with the image URL were not actual links, but JavaScript callbacks. Luckily, the JavaScript callbacks contained keywords that were used in the file names of the images. I also had some problems downloading the files: at first, the script took forever to run, but I eventually figured out that this was because some of the images didn't exist on the remote server, so I just skipped them.

I also had to download the images into folders on the local server that might or might not already exist. Specifically, I needed to save the images in different folders based on which year (from 1880 to 1980) they referred to. This turned out to be easy to solve with PHP's file_exists() and mkdir() functions -- once I figured out that the file path passed to these functions should not start with a forward slash.
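
For instance, the directory check boils down to something like the following (the path here is just an illustration; the full code below builds it from the search query and year, and creates each level separately instead of using mkdir()'s recursive flag):

if (!file_exists('sites/default/files/myquery/1880')) {
  //Passing TRUE as the third argument lets mkdir() create nested directories in one call.
  mkdir('sites/default/files/myquery/1880', 0777, TRUE);
}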

The code I used -- simplified slightly to remove the information relevant to the specific site I needed to grab the images from -- is below. Use it with discretion according to the relevant copyright laws.

// $Id$
 
/**
 * @file
 *   Submits a search form, gets keywords from the results, and uses them to download images to the local server.
 */
 
/**
 * Implementation of hook_cron().
 * We download a limited number of images on each cron run so that we don't end up with a script that runs for a really long time.
 */
function example_cron() {
  //Edit these variables based on your search parameters.
  $query = 'MY_QUERY';
  $num_results = 25;
  //The information you want to get from the search results should be between these.
  $pre = 'PRE';
  $post = 'POST';
 
  //My particular use case was to search by year; this keeps track of the current year
  //and advances the stored value so that the next cron run handles the next year.
  $curyear = variable_get('curyear', 1880);
  //For my use case I wanted the years 1880 to 1980; this exits the function once that range is complete.
  if ($curyear > 1980) {
    return;
  }
  variable_set('curyear', $curyear + 1);
 
  //Run the search query and get the resulting page.
  $contents = example_retrieve($curyear, $num_results, $query);
 
  $lastoffset = 0;
  //Download images.
  for ($i = 1; $i <= $num_results; $i++) {
    //Gets the text between $pre and $post.
    //To use this effectively you need to know what text (usually HTML tags with unique IDs/classes) surrounds the information you want (see beginning of function).
    $result = example_get_between($contents, $pre, $post, $lastoffset);
    if (!$result) {
      break;
    }
    $lastoffset = $result['offset'];
    //You may have to do more processing here to get the exact information you want out of $result['text'].
    //In my particular case, I got text that contained a date, like this: 27-08-1880.
    //I used that information to obtain $date (the file name) and $img (the image URL, which used this pattern:
    //http://[FOREIGN_ROOT]/viewer.php&imageFile=[YEAR]/[MONTH]/[DAY].jpg
    //So you should be able to see how I figured that one out.
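    //For illustration only (the exact parsing depends on your own results, and
    //[FOREIGN_ROOT] stays a placeholder), that processing might look something like:
    //  list($day, $month, $year) = explode('-', $result['text']);
    //  $date = "$year-$month-$day";
    //  $img = "http://[FOREIGN_ROOT]/viewer.php&imageFile=$year/$month/$day.jpg";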
 
    //Download $img (generated as explained in the above comment).
    example_saveimage($img, $query, $curyear, $date);
  }
  watchdog('example', 'Processed %num images for year %year.', array('%num' => $i - 1, '%year' => $curyear));
}
 
/**
 * Gets text between other text.
 *
 * @param $contents
 *   The long string (in this case, read via cURL from HTML) within which to search.
 * @param $pre
 *   The information being searched for comes after this text.
 * @param $post
 *   The information being searched for comes before this text.
 * @param $offset
 *   This allows searching for text multiple times within $contents.
 * @return
 *   An array with 'text' (the text found after $offset and between $pre and $post)
 *   and 'offset' (a position you can pass back in as $offset to find the next
 *   occurrence), or FALSE if nothing more is found.
 */
function example_get_between($contents, $pre, $post, $offset = 0) {
  $start = strpos($contents, $pre, $offset);
  //If the opening tag isn't found, we're at the end of the results.
  if ($start === FALSE) {
    return FALSE;
  }
  $start += strlen($pre);
  //Search for the closing tag starting after the opening tag.
  //Note the strict comparison: strpos() can legitimately return 0.
  $end = strpos($contents, $post, $start);
  if ($end === FALSE) {
    return FALSE;
  }
  $length = $end - $start;
  return array('text' => drupal_substr($contents, $start, $length), 'offset' => $end);
}
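
//To illustrate how the return value works (the markup here is made up; real search
//results will differ):
//  $html = '<td class="hit">27-08-1880</td><td class="hit">03-09-1880</td>';
//  $first = example_get_between($html, '<td class="hit">', '</td>');
//  //$first['text'] is '27-08-1880'; $first['offset'] points at the closing tag.
//  $second = example_get_between($html, '<td class="hit">', '</td>', $first['offset']);
//  //$second['text'] is '03-09-1880'.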
 
/**
 * Run cURL to retrieve the relevant search page.
 *
 * @param $curyear
 *   The year within which to look. You may not need this parameter.
 * @param $results
 *   The number of search results required.
 * @param $query
 *   The text for which to search.
 */
function example_retrieve($curyear, $results, $query) {
  //You will need to figure out ahead of time what these are.
  //$url is the page with the form on it. If the page with the form loads via AHAH
  //(so you don't know what the URL is) you will have to do some digging
  //to figure out from which page the AHAH loads the form.
  //You will need to use external software to figure out what post values to send.
  //I recommend using the Live HTTP Headers add-on for Firefox. You need to open it,
  //submit the search form, and then look for the header with your data in it.
  //Then copy that data here and change whatever you need. It should be in the format
  //parameter=value&parameter2=value2...
  $url = '';
  $post_values = '';
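  //For example (these parameter names and the URL are made up -- yours will come
  //from the headers you captured):
  //  $url = 'http://www.example.com/search/results.php';
  //  $post_values = 'searchText=' . $query . '&year=' . $curyear . '&numHits=' . $results;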
 
  //Run cURL.
  $c = curl_init(); //Creates a cURL handler.
  curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); //Makes cURL return the HTML of the page it gets.
  curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 8); //Gives up if a connection cannot be established within 8 seconds.
  curl_setopt($c, CURLOPT_URL, $url); //Tells cURL which URL to connect to.
  curl_setopt($c, CURLOPT_POST, 1); //Tells cURL to use a POST request.
  curl_setopt($c, CURLOPT_POSTFIELDS, $post_values); //Tells cURL what data to send in the POST request.
  $contents = curl_exec($c); //Runs the query and puts the resulting search page HTML into $contents.
  curl_close($c); //Ends the cURL request.
 
  return $contents;
}
 
/**
 * Saves an image.
 *
 * @param $img
 *   The URL of the image to be downloaded from the remote server onto the local server.
 * @param $query
 *   The results will be saved in a folder with the same name as the search query that was run.
 * @param $year
 *   I saved the images one level deeper, in a subfolder named by the year I was searching.
 * @param $name
 *   The name the image should have (minus the extension) when it gets saved locally.
 */
function example_saveimage($img, $query, $year, $name) {
  //Make sure the image exists before we try to download it or the script will run forever.
  if (!url_exists($img)) {
    watchdog('example', 'Copying image %img to disk as %path failed because the remote image does not exist.', array('%img' => $img, '%path' => "sites/default/files/$query/$year/$name.jpg"), WATCHDOG_ERROR);
    return;
  }
  $query = drupal_strtolower($query);
  //0777 is the default, so it's not necessary to include this, but I wanted to illustrate that the directory needs to be writeable.
  //This script assumes that it will be the only thing creating the directories;
  //otherwise, you need to run is_writeable() on the directories if they exist and chmod() them if they do not allow writing.
  //Also note that the paths here do not have any opening forward slash.
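  //A sketch of that extra check, if you need it (with $dir standing in for whichever directory you're about to use):
  //  if (file_exists($dir) && !is_writable($dir)) {
  //    chmod($dir, 0777);
  //  }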
  if (!file_exists("sites/default/files/$query")) {
    mkdir("sites/default/files/$query", 0777);
  }
  if (!file_exists("sites/default/files/$query/$year")) {
    mkdir("sites/default/files/$query/$year", 0777);
  }
  //This call to copy() is what does the real work in this script.
  //It downloads the file from the remote server onto the local server.
  //Change the file extension as necessary.
  if (!file_exists("sites/default/files/$query/$year/$name.jpg")) {
    copy($img, "sites/default/files/$query/$year/$name.jpg");
  }
  //We assume above (for the sake of speed) that there won't be other files with the same name already in the directory.
  //However, if there are, we deal with that by adding an incrementing number to the end of the filename.
  else {
    $i = 0;
    //For as long as we're finding files that already exist with the name we're testing, increment $i to build a new filename to test.
    while (file_exists("sites/default/files/$query/$year/$name-$i.jpg")) {
      $i++;
      //Prevent an infinite loop -- not that this should ever happen.
      if ($i > 10) {
        watchdog('example', 'Copying image %img to disk as %path failed because too many files with that name already exist.', array('%img' => $img, '%path' => "sites/default/files/$query/$year/$name.jpg"), WATCHDOG_ERROR);
        return;
      }
    }
    copy($img, "sites/default/files/$query/$year/$name-$i.jpg");
  }
  //You may wish to record here (e.g. with watchdog()) that an image has successfully downloaded;
  //however, when downloading thousands of images, this can quickly get unmanageable.
}
 
/**
 * Checks whether a remote file exists. file_exists() only works locally.
 * An alternative method is to use @get_headers() and check the resulting array for a 200-series result code, but I think that is slower.
 *
 * @param $url
 *   The full URL of the file whose existence will be checked.
 * @return
 *   TRUE if the file exists or FALSE if it does not.
 */
function url_exists($url) {
  $c = curl_init($url);
  if ($c === FALSE) {
    return FALSE;
  }
  //Certain sites (Digg is an example) also require that you set CURLOPT_USERAGENT or the request will be denied.
  curl_setopt($c, CURLOPT_HEADER, 0); //Excludes the header from the output.
  curl_setopt($c, CURLOPT_NOBODY, 1); //Excludes the body from the output.
  curl_setopt($c, CURLOPT_FAILONERROR, 1); //Fail silently if the URL cannot be reached (does not exist).
  curl_setopt($c, CURLOPT_RETURNTRANSFER, 0); //Return whether the request succeeded or failed (as opposed to the HTML result of the request).
  $content = curl_exec($c); //$content can now be used to determine whether the request succeeded or failed.
  curl_close($c);
  return $content;
}
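
/**
 * A sketch of the get_headers() alternative mentioned in the docblock above. The
 * function name and the exact check are just an illustration; the cURL version is
 * what I actually used.
 *
 * @param $url
 *   The full URL of the file whose existence will be checked.
 * @return
 *   TRUE if the server answered with a 2xx status code, FALSE otherwise.
 */
function url_exists_via_headers($url) {
  //get_headers() returns FALSE on failure; the first element of the returned array
  //is the status line, e.g. "HTTP/1.1 200 OK".
  $headers = @get_headers($url);
  if ($headers === FALSE) {
    return FALSE;
  }
  return (bool) preg_match('#^HTTP/\S+\s+2\d\d#', $headers[0]);
}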