Deduplication Snapshots on Amazon S3

Deduplication is the practice of storing files by breaking them up into chunks (or slices), computing a hash of each chunk, then storing the chunks along with metadata that explains how to reassemble each file later. This is useful in a backup strategy because you never have to back up the same chunk twice, which pays off when backing up multiple systems that contain the same or similar files. Imagine I back up one system, including common operating system files and other common files. When I go to back up a second machine, I don’t have to upload the chunks it shares with the first; I merely store the metadata about each file and how to reassemble it. Another good use is files that grow: when I back one up a second time, rather than storing the same information again, I only upload the new parts. A third use is taking snapshot backups of the same directory. After the first full snapshot of a directory, each later snapshot only uploads the deltas. In other words, if I take a daily snapshot of a particular directory, instead of storing a full copy of mostly redundant data each day, I only save the new file chunks. The snapshot itself is a point-in-time map of which files existed and which chunks to use to reassemble each file.

The concept: take each file -> break it into 5 MB chunks -> compute an MD4 hash of each chunk (treated as a unique fingerprint) -> compare each hash against the hashes already stored -> upload only the chunks that do not yet exist in the storage area -> save metadata so you can reassemble the files later.
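Here’s a minimal sketch of that pipeline in plain PHP, with an in-memory array standing in for the S3 chunk store (the input file name is hypothetical):

<?php
// Conceptual sketch only: $store stands in for the S3 chunk bucket.
define('CHUNK_SIZE', 1024*1024*5); // 5 MB

$store = array();    // hash => chunk data (the dedup store)
$manifest = array(); // metadata: ordered chunk hashes, per file

$file = '/var/log/messages'; // hypothetical input
$fp = fopen($file, 'r');
if ( !$fp ) die("cannot read $file\n");
$hashes = array();
while ($data = fread($fp, CHUNK_SIZE))
{
  $hash = hash('md4', $data);
  if ( !isset($store[$hash]) ) // only keep chunks we haven't seen before
    $store[$hash] = $data;
  $hashes[] = $hash;
}
fclose($fp);
$manifest[$file] = $hashes; // enough metadata to reassemble the file later
?>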

So, for this proof of concept, I used a Linux system and PHP (no surprise) to take snapshot backups of any directory. I made a set of command line tools:

s3snapshot – takes a snapshot backup of the current working directory (or of a directory given on the command line)
s3snapls – displays information about taken snapshots and their contents
s3snaprestore – restores an entire snapshot, or individual files from a snapshot
s3snaprm – takes a date and a prefix on the command line and deletes snapshots older than the given date. Note: this doesn’t delete the file chunks, just the metadata
s3snapclean – deletes chunks that are no longer in use by any snapshot.

Before I show you the source code, let’s take a look at them in use. Some examples:

s3snapshot example 1

      [root@ip-68-178-172-38 ~]# cd /var/log
      [root@ip-68-178-172-38 log]# s3snapshot
      dir=/var/log
      2 hashes to work with.
      /var/log/messages.8.gz - 1 (512 KB max) chunks, with 0 dedups added to snap.
      /var/log/maillog.4.gz - 1 (512 KB max) chunks, with 0 dedups added to snap.
      [snip long list of files]
      /var/log/cron.3.gz - 1 (512 KB max) chunks, with 0 dedups added to snap.
      /var/log/spooler.9.gz - 1 (512 KB max) chunks, with 1 dedups added to snap.
      Wrote 1731/1731 KB to S3.
      Meta data stored in dedupsnap1/var/log/2008-02-17_14:46:37 consuming 4 KB
      [root@ip-68-178-172-38 log]#

s3snapshot example 2

      [root@ip-68-178-172-38 log]# s3snapshot
      dir=/var/log
      154 hashes to work with.
      [snip long list of files]
      /var/log/spooler.1.gz - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /var/log/cron.3.gz - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /var/log/spooler.9.gz - 1 (512 KB max) chunks, with 1 dedups added to snap.
      Wrote 20/1733 KB to S3.
      Meta data stored in dedupsnap1/var/log/2008-02-17_14:49:54 consuming 4 KB
      [root@ip-68-178-172-38 log]#

s3snapls and s3snaprestore working together to restore a file

      [root@ip-68-178-172-38 log]# mv secure.2.gz test
      [root@ip-68-178-172-38 log]# s3snapls ./
      dir=/var/log
      /var/log/2008-02-17_14:25:21
      /var/log/2008-02-17_14:46:37
      /var/log/2008-02-17_14:49:54
      [root@ip-68-178-172-38 log]# s3snapls /var/log/2008-02-17_14:49:54
      dir=/var/log/2008-02-17_14:49:54
      /var/log/2008-02-17_14:49:54
      [snip]
      secure.2.gz 1 chunks
      messages.3.gz 1 chunks
      messages.7.gz 1 chunks
      [snip]
      [root@ip-68-178-172-38 log]# s3snaprestore /var/log/2008-02-17_14:49:54 secure.2.gz
      Restoring secure.2.gz
      [root@ip-68-178-172-38 log]# diff secure.2.gz test

s3snapshot and s3snapls show off time

      [root@ip-68-178-172-38 tmp]# pwd
      /home/dave/tmp
      [root@ip-68-178-172-38 tmp]# find . -type f
      ./ModPHP_Logo.gif
      ./2/ModPHP_Logo.gif
      ./2/ModPHP_Logo.png
      ./ModPHP_Logo.png
      [root@ip-68-178-172-38 tmp]# s3snapshot
      dir=/home/dave/tmp
      156 hashes to work with.
      /home/dave/tmp/ModPHP_Logo.gif - 1 (512 KB max) chunks, with 0 dedups added to snap.
      /home/dave/tmp/2/ModPHP_Logo.gif - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /home/dave/tmp/2/ModPHP_Logo.png - 1 (512 KB max) chunks, with 0 dedups added to snap.
      /home/dave/tmp/ModPHP_Logo.png - 1 (512 KB max) chunks, with 0 dedups added to snap.
      Wrote 39/46 KB to S3.
      Meta data stored in dedupsnap1/home/dave/tmp/2008-02-17_15:05:07 consuming 0.1 KB
      [root@ip-68-178-172-38 tmp]# s3snapshot
      dir=/home/dave/tmp
      159 hashes to work with.
      /home/dave/tmp/ModPHP_Logo.gif - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /home/dave/tmp/2/ModPHP_Logo.gif - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /home/dave/tmp/2/ModPHP_Logo.png - 1 (512 KB max) chunks, with 1 dedups added to snap.
      /home/dave/tmp/ModPHP_Logo.png - 1 (512 KB max) chunks, with 1 dedups added to snap.
      Wrote 0/46 KB to S3.
      Meta data stored in dedupsnap1/home/dave/tmp/2008-02-17_15:05:23 consuming 0.1 KB
      [root@ip-68-178-172-38 tmp]# s3snapls ./
      dir=/home/dave/tmp
      /home/dave/tmp/2008-02-17_15:05:07
      /home/dave/tmp/2008-02-17_15:05:23
      [root@ip-68-178-172-38 tmp]# s3snapls /home/dave/tmp/2008-02-17_15:05:23
      dir=/home/dave/tmp/2008-02-17_15:05:23
      /home/dave/tmp/2008-02-17_15:05:23
      ModPHP_Logo.gif 1 chunks
      2/ModPHP_Logo.gif 1 chunks
      2/ModPHP_Logo.png 1 chunks
      ModPHP_Logo.png 1 chunks
      [root@ip-68-178-172-38 tmp]#

restoring files from one snapshot to another directory

      [root@ip-68-178-172-38 ~]# mkdir tmp2
      [root@ip-68-178-172-38 ~]# cd tmp2
      [root@ip-68-178-172-38 tmp2]# pwd
      /home/dave/tmp2
      [root@ip-68-178-172-38 tmp2]# ls
      [root@ip-68-178-172-38 tmp2]# s3snaprestore /home/dave/tmp/2008-02-17_15:05:23
      Restoring ModPHP_Logo.gif
      Restoring 2/ModPHP_Logo.gif
      Restoring 2/ModPHP_Logo.png
      Restoring ModPHP_Logo.png
      [root@ip-68-178-172-38 tmp2]# find . -type f
      ./ModPHP_Logo.gif
      ./2/ModPHP_Logo.gif
      ./2/ModPHP_Logo.png
      ./ModPHP_Logo.png
      [root@ip-68-178-172-38 tmp2]#

The beauty is that you can restore a single file from a snapshot, or restore the entire snapshot. And you don’t have to restore to the same location. Essentially, you can browse through your snapshots and selectively restore whatever you want.

I’m skipping the examples of s3snaprm and s3snapclean; they’re used to clean up old data you don’t want anymore. You can see both invoked in backupsnaps.sh at the end of this post.

Here’s the source:

s3snapconfig.php

<?php
define('BUCKET_CHUNKS', 'dedupchunk1');
define('BUCKET_SNAPSHOTS', 'dedupsnap1');
define('CHUNK_SIZE', 1024*1024*5); // 1024*1024 bytes = 1 MB, times 5 = 5 MB
define('KEY_ID', '<your_key>');
define('SECRET_KEY', '<your_secret>');
define('S3_URL', 'https://s3.amazonaws.com/');
define('CACHE_FILE', '/tmp/.s3snapcache_');
define('CACHE_TIME', 60*60*24*7); // 7 days
?>
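The two buckets have to exist before the first snapshot runs; none of the tools below create them. A one-time setup sketch, using the putBucket method from S3.class.php (shown further down):

<?php
// One-time bucket setup (assumption: the snapshot tools never create buckets).
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$s3 = new S3(KEY_ID, SECRET_KEY);
$s3->putBucket(BUCKET_CHUNKS);    // will hold the deduplicated chunks
$s3->putBucket(BUCKET_SNAPSHOTS); // will hold the per-snapshot metadata
?>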

s3snapshot

#!/usr/bin/php -q
<?php
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$dir = $argv[1];
if ( !$dir ) $dir = getcwd();
if ( !ereg("^/", $dir) ) $dir = realpath($dir);
print "dir=$dir\n";
$prefix = ereg_replace("/[^/]+$", "", $dir)."/";
$s3dir = ereg_replace("^/+", "", $dir);

$s3 = new S3(KEY_ID, SECRET_KEY);

// Key list of all hashes so far:
$s3chunks = $s3->getDedupHashes(BUCKET_CHUNKS);
print count($s3chunks)." hashes to work with.\n";

// write files that don't exist on S3, or have changed
$list = `find $dir -type f`;
$localfiles = split("\n", trim($list));
$duped = 0;
$written = 0;
$totalkb = 0;
$snaps = array();
foreach ($localfiles as $file)
{
  $file = realpath(dirname($file))."/".basename($file);
  $fp = fopen($file, "r");
  if ( !$fp )
  {
    print "WARNING: cannot read $file, skip\n";
    continue;
  }
  $chunk_array = array();
  $fduped = 0;
  while ($data = fread($fp, CHUNK_SIZE))
  {
    $hash = hash('md4', $data);
    $kb = strlen($data) / 1024;
    $totalkb += $kb;
    if ( !isset($s3chunks[$hash]) )
    {
      $s3->putObject ($hash, $data, BUCKET_CHUNKS, 'private', 'application/binary');
      $written += $kb;
    }
    else
      $fduped++;
    $s3chunks[$hash] = 1;
    $chunk_array[] = $hash;
    $data = null;
  }
  fclose($fp);

  $file_contents = implode("\n", $chunk_array);
  $snaps[$file] = $file_contents;
  $duped += $fduped;
  print ".";
#print "$file - ".count($chunk_array)." (".round(CHUNK_SIZE/1024)." KB max) chunks, with $fduped dedups added to snap.\n";
}
print "\n";
#file_put_contents(META_CACHE_FILE, serialize($s3chunks));

$snapshot_file = $s3dir . "/" . date("Y-m-d_H:i:s");
$snapshot_data = gzdeflate(serialize($snaps));
$s3->putObject($snapshot_file, $snapshot_data, BUCKET_SNAPSHOTS, 'private', 'application/binary');
$s3->updateCache(BUCKET_CHUNKS, $s3chunks);

print "Wrote ".round($written)."/".round($totalkb)." KB to S3.\nMeta data stored in ".BUCKET_SNAPSHOTS."/$snapshot_file consuming ".round(strlen($snapshot_data)/1024, 1)." KB\n";

?>
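The s3snapls output shown earlier comes straight from this metadata: a snapshot is nothing fancy, just a gzdeflated, serialized PHP array mapping each absolute file path to a newline-joined list of chunk hashes. A minimal sketch of decoding one yourself (the snapshot key below is hypothetical):

<?php
// Sketch: decode a snapshot object the same way s3snaprestore does below.
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$s3 = new S3(KEY_ID, SECRET_KEY);
$data = $s3->getObject('var/log/2008-02-17_14:49:54', BUCKET_SNAPSHOTS);
if ( !$data ) die("snapshot not found\n");
$snaps = unserialize(gzinflate($data)); // file path => newline-joined chunk hashes
foreach ($snaps as $file => $hashes)
  print $file." ".count(explode("\n", $hashes))." chunks\n";
?>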

s3snaprestore

#!/usr/bin/php -q
<?php
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$snap = $argv[1];
$snap = ereg_replace("^/", "", $snap);
$rfile = $argv[2];

$prefix = "/".ereg_replace("/[^/]+$", "", $snap)."/";

$s3 = new S3(KEY_ID, SECRET_KEY);

print "Fetching contents of $snap from ".BUCKET_SNAPSHOTS."\n";

$data = $s3->getObject($snap, BUCKET_SNAPSHOTS);

#print_r(gzinflate($data));

if (!$data)
  die("$snap not found\n");
$parts = unserialize(gzinflate($data));

#print_r($parts);

foreach ( $parts as $file => $chunks )
{

  $file = ereg_replace("^".quotemeta($prefix), "", $file);
#print "Comparing $file to $rfile\n";
  if ( !$rfile || $rfile == $file)
  {
    print "Restoring $file\n";
    $mkdir = ereg_replace("/[^/]+$", "", $file);
    if ( ereg("/", $file) && $mkdir && !is_dir($mkdir) )
    {
      print "mkdir -p $mkdir\n";
      `mkdir -p $mkdir`;
    }
    $fp = fopen($file, "w");
    if ( !$fp )
    {
      print "WARNING: Could not open $file for write\n";
      continue;
    }
    $chunks = explode("\n", $chunks);
    foreach ($chunks as $chunk)
    {
      $filepart = $s3->getObject($chunk, BUCKET_CHUNKS);
      fwrite($fp, $filepart);
    }
    fclose($fp);
  }
}

?>
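One nice property of keying chunks by the MD4 hash of their contents: a restore can verify every chunk it downloads almost for free. The tool above doesn’t do this check; here is a sketch of how it could (the helper name is mine, not part of the tools):

<?php
// Sketch (not part of the tools above): fetch a chunk and verify it against
// its key, which is the MD4 hash of its contents.
function fetch_verified_chunk(&$s3, $hash)
{
  $data = $s3->getObject($hash, BUCKET_CHUNKS);
  if ( hash('md4', $data) != $hash )
    die("ERROR: chunk $hash failed its hash check\n");
  return $data;
}
?>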

s3snaprm

#!/usr/bin/php -q
<?php
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$deldate = strtotime(str_replace("_", " ", $argv[1]));
$deldate = date("Y-m-d H:i:s", $deldate);
if ( $deldate < '2005-01-01' || $deldate >= date("Y-m-d H:i:s") )
  die("'$deldate' is not an acceptable delete before date. Usage: {$_SERVER['_']} '<delete_before_date>' <prefix>\n");

$prefix = $argv[2];
if ( !$prefix )
  die("You need a prefix. Usage: {$_SERVER['_']} '<delete_before_date>' <prefix>\n");

$prefix = ereg_replace("^/+", "", $prefix);

$deldate = strtotime($deldate);
$s3 = new S3(KEY_ID, SECRET_KEY);

$s3files = $s3->getAllObjects(BUCKET_SNAPSHOTS, $prefix);
print count($s3files)." snapshots to examine\n";
$total=0;
$deleted=0;
foreach($s3files as $file => $fdata)
{
  $date = basename($file);
  $date = ereg_replace("_", " ", $date);
  $date = strtotime($date);
  $total++;
  if ( $date < $deldate )
  {
    print "Deleted $file in ".BUCKET_SNAPSHOTS."\n";
    $s3->deleteObject($file, BUCKET_SNAPSHOTS);
    $deleted++;
  }
}
print "Deleted $deleted/$total snapshots\n";

?>

s3snapclean

#!/usr/bin/php -q
<?php
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

print "This will fetch all snapshots and all chunks and cross-examine, purging chunks no longer in use.\n";
print "You have 10 seconds to cancel, then we're proceeding.\n";
sleep(10);
print "Proceeding...\n";

$s3 = new S3(KEY_ID, SECRET_KEY);
$s3files = $s3->getAllObjects(BUCKET_SNAPSHOTS);
print count($s3files)." snapshots to examine\n";

$total=0;
$deleted=0;
$chunks_in_use = array();
foreach($s3files as $file => $data)
{
  $data = $s3->getObject($file, BUCKET_SNAPSHOTS);
  $data = unserialize(gzinflate($data));
  foreach($data as $f => $hashlist)
  {
    // each file maps to a newline-joined list of chunk hashes; mark each one in use
    foreach (explode("\n", $hashlist) as $hash)
      $chunks_in_use[$hash] = 1;
  }
}
print count($chunks_in_use)." chunks in files\n";

$s3chunks = $s3->getAllObjects(BUCKET_CHUNKS);
print count($s3chunks)." chunks in chunks dir\n";

foreach ($s3chunks as $hash =>$data)
{
  $total++;
  if ( !isset($chunks_in_use[$hash]) )
  {
    $deleted++;
    $s3->deleteObject($hash, BUCKET_CHUNKS);
  }
}
print "Deleted $deleted/$total chunks\n";
?>

S3.class.php

<?php
/**
 *  Amazon S3 REST API Implementation
 *
 *  This a generic PHP class that can hook-in to Amazon's S3 Simple Storage Service
 *
 *  Contributions and/or donations are welcome.
 *
 *  Author: Geoffrey P. Gaudreault
 *  http://www.neurofuzzy.net
 *
 *  This code is free, provided AS-IS with no warranty expressed or implied.  Use at your own risk.
 *  If you find errors or bugs in this code, please contact me at interested@zanpo.com
 *  If you enhance this code in any way, please send me an update.  Thank you!
 *
 *  Version: 0.31a
 *  Last Updated: 9/09/2006
 *
 *      NOTE: the API key ID and secret key are now passed in via the constructor
 *
 *   2/10/2008 - Modifications made by David Koopman to:
 *     Move the keyId and secretKey into the constructor
 *     Remove the set of get/set methods
 *     Make $objectdata a pass-by-reference var, since it is likely to be very large
 *     Add methods: parseObjects, getAllObjects, getDedupHashes, updateCache
 *
 */

// REQUIRES PEAR PACKAGE
// get with "pear install Crypt_HMAC"
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'Xml.class.php';

class S3 {

        // The API access point URL
        var $S3_URL = "https://s3.amazonaws.com/";

        // list of valid actions (validation not implemented yet)
        var $verbs = array("GET"=>1, "DELETE"=>1, "PUT"=>1);

        // set to true to echo debug info
        var $_debug = false;

        // -----------------------------------------
        // -----------------------------------------
        // your API key ID
        var $keyId = ""; // to be set in the constructor

        // your API Secret Key
        var $secretKey = ""; // to be set in the constructor
        // -----------------------------------------
        // -----------------------------------------

        // default action
        var $_verb = "GET";

        // default ACL
        var $_acl = "private";

        // default content type
        #var $_contentType = "image/jpeg";
        var $_contentType = "application/binary";

        // default response content type
        var $_responseContentType = "text/xml";

        // bucket object name prefix
        var $prefix = "";

        // bucket list marker (useful for pagination)
        var $marker = "";

        // number of keys to retrieve in a list
        var $max_keys = "";

        // list delimiter
        var $delimiter = "";

        // your default bucket name
        var $bucketname = "modphpbackup";

        // your current object name
        var $objectname = ""; // to be set later


        /*
        * Constructor: Amazon S3 REST API implementation
        */
        function s3($keyId, $secretKey, $options = NULL) {

                // RFC 2616 date format; note PHP >= 5.1.1 predefines DATE_RFC822
                // (with a two-digit year), so we don't rely on define() here
                $this->httpDate = gmdate('D, d M Y H:i:s T');
                $this->keyId = $keyId;
                $this->secretKey = $secretKey;

                $available_options = array("acl", "contentType");

                if (is_array($options)) {

                        foreach ($options as $key => $value) {

                                $this->debug_text("Option: $key");

                                if (in_array($key, $available_options) ) {

                                        $this->debug_text("Valid Config options: $key");
                                        $property = '_'.$key;
                                        $this->$property = $value;
                                        $this->debug_text("Setting $property to $value");

                                } else {

                                        $this->debug_text("ERROR: Config option: $key is not a valid option");

                                }

                        }

                }

                $this->hasher =& new Crypt_HMAC($this->secretKey, "sha1");
        }

        /*
        * Method: sendRequest
        * Sends the request to S3
        *
        * Parameters:
        * resource - the name of the resource to act upon
        * verb - the action to apply to the resource (GET, PUT, DELETE, HEAD)
        * objectdata - the source data (body) of the resource (only applies to objects)
        * acl - the access control policy for the resource
        * contentType - the contentType of the resource (only applies to objects)
        * metadata - any metadata you want to save in the header of the object
        */
        function sendRequest ($resource, $verb = NULL, &$objectdata = NULL, $acl = NULL, $contentType = NULL, $metadata = NULL) {

                if ($verb == NULL) {
                        $verb = $this->_verb;
                }

                if ($acl == NULL) {
                        $aclstring = "";
                } else {
                        $aclstring = "x-amz-acl:$acl\n";
                }

                $contenttypestring = "";

                if ($contentType != NULL && ($verb == "PUT") && ($objectdata != NULL) && ($objectdata != "")) {
                        $contenttypestring = "$contentType";
                }

                // update date / time on each request
                $this->httpDate = gmdate('D, d M Y H:i:s T');

                $httpDate = $this->httpDate;

                $paramstring = "";
                $delim = "?";

                if (strlen($this->prefix)) {

                        $paramstring .= $delim."prefix=".urlencode($this->prefix);
                        $delim = "&";

                }

                if (strlen($this->marker)) {

                        $paramstring .= $delim."marker=".urlencode($this->marker);
                        $delim = "&";

                }

                if (strlen($this->max_keys)) {

                        $paramstring .= $delim."max-keys=".$this->max_keys;
                        $delim = "&";

                }

                if (strlen($this->delimiter)) {

                        $paramstring .= $delim."delimiter=".urlencode($this->delimiter);
                        $delim = "&";

                }

                $this->debug_text("HTTP Request sent to: " . $this->S3_URL . $resource . $paramstring);

                $req =& new HTTP_Request($this->S3_URL . $resource . $paramstring);
                $req->setMethod($verb);

                $contentmd5string = "";

                if (($objectdata != NULL) && ($objectdata != "")) {
                        # NICE FEATURE, BUT A MEMORY HOG: md5() needs a copy of the data, tisk, tisk:
                        #$contentMd5 = $this->hex2b64(md5($objectdata));
                        #$req->addHeader("CONTENT-MD5", $contentMd5);
                        #$this->debug_text("MD5 HASH OF DATA: " . $contentMd5);
                        #$contentmd5string = $contentMd5;
                }

                if (strlen($contenttypestring)) {
                        $this->debug_text("Setting content type to $contentType");
                        $req->addHeader("CONTENT-TYPE", $contentType);
                }

                $req->addHeader("DATE", $httpDate);

                if (strlen($aclstring)) {
                        $this->debug_text("Setting acl string to $acl");
                        $req->addHeader("x-amz-acl", $acl);
                }

                $metadatastring = "";

                if (is_array($metadata)) {

                        ksort($metadata);

                        $this->debug_text("Metadata found.");

                        foreach ($metadata as $key => $value) {

                                $metadatastring .= "x-amz-meta-".$key.":".trim($value)."\n";

                                $req->addHeader("x-amz-meta-".$key, trim($value));

                                $this->debug_text("Setting x-amz-meta-$key to '$value'");

                        }

                }

                if (($objectdata != NULL) && ($objectdata != "")) {

                        $req->setBody($objectdata);

                }

                $stringToSign = "$verb\n$contentmd5string\n$contenttypestring\n$httpDate\n$aclstring$metadatastring/$resource";
                $this->debug_text("Signing String: $stringToSign");
                $signature = $this->hex2b64($this->hasher->hash($stringToSign));
                $req->addHeader("Authorization", "AWS " . $this->keyId . ":" . $signature);

                $req->sendRequest();

                $this->_responseContentType = $req->getResponseHeader("content-type");

                if (strlen($req->getResponseBody())) {

                        $this->debug_text($req->getResponseBody());
                        return $req->getResponseBody();

                } else {

                        $this->debug_text($req->getResponseHeader());
                        return $req->getResponseHeader();

                }

        }


        /*
        * Method: getBuckets
        * Returns a list of all buckets
        */
        function getBuckets () {
                return $this->sendRequest("","GET");
        }


        /*
        * Method: getBucket
        * Gets a list of all objects in the default bucket
        */
        function getBucket ($bucketname = NULL) {

                if ($bucketname == NULL) {

                        return $this->sendRequest($this->bucketname,"GET");

                } else {

                        return $this->sendRequest($bucketname,"GET");

                }

        }


        /*
        * Method: getObjects
        * Gets a list of all objects in the specified bucket
        *
        * Parameters:
        * prefix - (optional) Limits the response to keys which begin with the indicated prefix. You can use prefixes to separate a bucket into different sets of keys in a way similar to how a file system uses folders.
        * marker - (optional) Indicates where in the bucket to begin listing. The list will only include keys that occur lexicographically after marker. This is convenient for pagination: To get the next page of results use the last key of the current page as the marker.
        * max-keys - (optional) The maximum number of keys you'd like to see in the response body. The server may return fewer than this many keys, but will not return more.
        */
        function getObjects ($bucketname, $prefix = NULL, $marker = NULL, $max_keys = NULL, $delimiter = NULL) {

                if ($prefix != NULL) {

                        $this->prefix = $prefix;

                } else {

                        $this->prefix = "";

                }

                if ($marker != NULL) {

                        $this->marker = $marker;

                } else {

                        $this->marker = "";

                }

                if ($max_keys != NULL) {

                        $this->max_keys = $max_keys;

                } else {

                        $this->max_keys = "";

                }

                if ($delimiter != NULL) {

                        $this->delimiter = $delimiter;

                } else {

                        $this->delimiter = "";

                }

                if ($bucketname != NULL) {

                        return $this->sendRequest($bucketname,"GET");

                } else {

                        return false;

                }

        }


        /*
        * Method: getObjectInfo
        * Get header information about the object. The HEAD operation is used to retrieve information about a specific object,
        * without actually fetching the object itself
        *
        * Parameters:
        * objectname - The name of the object to get information about
        * bucketname - (optional) the name of the bucket containing the object.  If none is supplied, the default bucket is used
        */
        function getObjectInfo ($objectname, $bucketname = NULL) {
                if ($bucketname == NULL) {
                        $bucketname = $this->bucketname;
                }
                return $this->sendRequest($bucketname."/".$objectname,"HEAD");
        }


        /*
        * Method: getObject
        * Gets an object from S3
        *
        * Parameters:
        * objectname - the name of the object to get
        * bucketname - (optional) the name of the bucket containing the object.  If none is supplied, the default bucket is used
        */
        function getObject ($objectname, $bucketname = NULL) {
                if ($bucketname == NULL) {
                        $bucketname = $this->bucketname;
                }
                return $this->sendRequest($bucketname."/".$objectname,"GET");
        }


        /*
        * Method: putBucket
        * Creates a new bucket in S3
        *
        * Parameters:
        * bucketname - the name of the bucket.  It must be unique.  No other S3 users may have this bucket name
        */
        function putBucket ($bucketname) {
                return $this->sendRequest($bucketname,"PUT");
        }


        /*
        * Method: putObject
        * Puts an object into S3
        *
        * Parameters:
        * objectname - the name of the object to put
        * objectdata - the source data (body) of the resource (only applies to objects)
        * bucketname - (optional) the name of the bucket containing the object.  If none is supplied, the default bucket is used
        * acl - the access control policy for the resource
        * contentType - the contentType of the resource (only applies to objects)
        * metadata - any metadata you want to save in the header of the object
        */
        function putObject ($objectname, &$objectdata, $bucketname = NULL, $acl = NULL, $contentType = NULL, $metadata = NULL) {

                if ($bucketname == NULL) {
                        $bucketname = $this->bucketname;
                }

                if ($acl == NULL || $acl == "") {
                        $acl = $this->_acl;
                }

                if ($contentType == NULL || $contentType == "") {
                        $contentType = $this->_contentType;
                }

                if ($objectdata != NULL) {
                        return $this->sendRequest($bucketname."/".$objectname, "PUT", $objectdata, $acl, $contentType, $metadata);
                } else {
                        return false;
                }

        }


        /*
        * Method: deleteBucket
        * Deletes bucket in S3.  The bucket name will fall into the public domain.
        */
        function deleteBucket ($bucketname) {
                return $this->sendRequest($bucketname, "DELETE");
        }


        /*
        * Method: deleteObject
        * Deletes an object from S3
        *
        * Parameters:
        * objectname - the name of the object to delete
        * bucketname - (optional) the name of the bucket containing the object.  If none is supplied, the default bucket is used
        */
        function deleteObject ($objectname, $bucketname = NULL) {

                if ($bucketname == NULL) {

                        $bucketname = $this->bucketname;

                }

                return $this->sendRequest($bucketname."/".$objectname, "DELETE");

        }


        /*
        * Method: hex2b64
        * Utility function for constructing signatures
        */
        function hex2b64($str) {

                $raw = '';
                for ($i=0; $i < strlen($str); $i+=2) {
                        $raw .= chr(hexdec(substr($str, $i, 2)));
                }
                return base64_encode($raw);

        }


        /*
        * Method: debug_text
        * Echoes debug information to the browser.  Set this->debug to false for production use
        */
        function debug_text($text) {

                if ($this->_debug) {
                        echo("<br>\n");
                        print_r($text);
                        echo("<br><br>\n\n");
                }

                return true;

        }

        /*
        * Method: parseObjects
        * Parses the XML of a bucket listing into an array of
        * key => array(filesize, lastmodified)
        */
        function parseObjects( $objects ) {

                $x = new Xml();
                $x->parse($objects);

                $s3files = array();
                for ($i = 0; $i < count($x->structure); $i++) {
                        $item = $x->structure[$i];
                        if ( $item['tag'] == 'KEY' ) {
                                $filename = $item['data'];
                                $lastmodified = null;
                                $filesize = null;
                                $found = 0;
                                // scan forward for this key's LASTMODIFIED and SIZE elements
                                while ( $i < count($x->structure) - 1 && $found < 2 ) {
                                        $i++;
                                        $item = $x->structure[$i];
                                        if ( $item['tag'] == 'LASTMODIFIED' ) {
                                                $lastmodified = $item['data'];
                                                $found++;
                                        } else if ( $item['tag'] == 'SIZE' ) {
                                                $filesize = $item['data'];
                                                $found++;
                                        }
                                }
                                $s3files[$filename] = array('filesize'=>$filesize, 'lastmodified'=>$lastmodified);
                        }
                }
                return $s3files;
        }

        /*
        * Method: getAllObjects
        * Pages through a bucket listing and returns every object,
        * using the last key of each page as the marker for the next
        */
        function getAllObjects ($bucket, $prefix = NULL) {

                $s3chunks = array();
                $marker = NULL;
                do {
                        $objects = $this->getObjects($bucket, $prefix, $marker);
                        $tmp = $this->parseObjects($objects);
                        $s3chunks = array_merge($s3chunks, $tmp);
                        $keys = array_keys($tmp);
                        $marker = array_pop($keys);
                } while ( count($tmp) > 0 );
                return $s3chunks;
        }

        /*
        * Method: getDedupHashes
        * Returns the full list of chunk hashes, kept in a local cache file
        * (see CACHE_FILE and CACHE_TIME) so we don't have to re-list the
        * whole bucket on every snapshot run
        */
        function getDedupHashes ($bucket) {

                $cache_file = CACHE_FILE . $bucket;
                if ( ! file_exists($cache_file) || filemtime($cache_file) + CACHE_TIME <= time() ) {
                        $hashes = $this->getAllObjects($bucket);
                        print "NOTICE: Writing cache_file '$cache_file'\n";
                        if ( ! file_put_contents($cache_file, serialize($hashes)) )
                                die("ERROR - could not write to cache file '$cache_file'\n");
                        return $hashes;
                }
                $hashes = unserialize(file_get_contents($cache_file));
                return $hashes;
        }

        /*
        * Method: updateCache
        * Rewrites the local hash cache after a snapshot adds new chunks
        */
        function updateCache ($bucket, &$hashes) {

                $cache_file = CACHE_FILE . $bucket;
                print "NOTICE: Updating cache_file '$cache_file'\n";
                if ( ! file_put_contents($cache_file, serialize($hashes)) )
                        die("ERROR - could not write to cache file '$cache_file'\n");
        }

}

?>
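The S3 class is usable on its own, outside the snapshot tools. A minimal sketch (the bucket and object names here are hypothetical):

<?php
require_once 'Crypt/HMAC.php';
require_once 'HTTP/Request.php';
require_once 'S3.class.php';
require_once 's3snapconfig.php';
require_once 'Xml.class.php';

$s3 = new S3(KEY_ID, SECRET_KEY);
$data = "hello, world";
$s3->putObject('greeting.txt', $data, 'mybucket', 'private', 'text/plain');
print $s3->getObject('greeting.txt', 'mybucket'); // prints "hello, world"
$s3->deleteObject('greeting.txt', 'mybucket');
?>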

Xml.class.php

<?php
class Xml  {
    var $parser;
    var $structure = array();
    var $currentTag;
    var $currentAttributes;

    function xml()
    {
        $this->parser = xml_parser_create();

        xml_set_object($this->parser, $this);
        xml_set_element_handler($this->parser, "tag_open", "tag_close");
        xml_set_character_data_handler($this->parser, "cdata");
    }

    function parse($data)
    {
        xml_parse($this->parser, $data);
    }

    function tag_open($parser, $tag, $attributes)
    {
        #var_dump($parser, $tag, $attributes);
        $this->currentTag = $tag;
        $this->currentAttributes = $attributes;
    }

    function cdata($parser, $cdata)
    {
        $this->structure[] = array('tag'=>$this->currentTag, 'attributes'=>$this->currentAttributes, 'data' => $cdata);
        #var_dump($parser, $cdata);
        #print "\n--------------------\n";
        #print_r($parser);
        #print "\n--\n";
        #print_r($cdata);
    }

    function tag_close($parser, $tag)
    {
        #var_dump($parser, $tag);
        $this->currentTag = null;
        $this->currentAttributes = null;
    }

} // end of class xml

?>

backupsnaps.sh

#!/bin/sh -x

# Snapshot cleanups older than 30 days (GNU date):
DELETEDATE=$(date -d "30 days ago" "+%Y-%m-%d_%H:%M:%S")
echo "DELETEDATE=$DELETEDATE"

/scripts/s3snaprm $DELETEDATE /www
/scripts/s3snaprm $DELETEDATE /home/dkoopman
/scripts/s3snaprm $DELETEDATE /scripts
/scripts/s3snaprm $DELETEDATE /var/log
/scripts/s3snaprm $DELETEDATE /backup
/scripts/s3snaprm $DELETEDATE /usr/local/mailman
/scripts/s3snaprm $DELETEDATE /usr/local/swish

# Make new snapshots:
/scripts/s3snapshot /www
/scripts/s3snapshot /home/dkoopman
/scripts/s3snapshot /scripts
#/scripts/s3snapshot /var/log
/scripts/s3snapshot /backup
#/scripts/s3snapshot /usr/local/mailman
#/scripts/s3snapshot /usr/local/swish

# Clean up old data:
/scripts/s3snapclean

A note about CHUNK_SIZE: I started with a chunk size of 512 KB, then moved to 1 MB, then to 5 MB. The problem with a smaller chunk size is that it means more metadata. The more metadata you have, the more processing you have to do, and the more PUT, GET and LIST requests you make against Amazon S3. Amazon charges $0.01 per 10,000 PUT or LIST requests, so it adds up. The downside of a larger chunk size shows up with growing files: say I have a 512 KB file that grows by 512 KB per day. The second day, my file is 1 MB, so I must discard my original 512 KB chunk and upload a new 1 MB chunk, and so on until the file reaches 5 MB. Once I’m at 5 MB, that first chunk stays put. So a 5 MB chunk size is fine, except for slowly growing files, which kind of sucks. The tradeoff is worth it, though; I like the 5 MB chunk size best.
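To put rough numbers on the request side of that tradeoff, here’s a back-of-the-envelope sketch for a hypothetical 10 GB initial backup (assuming S3’s default of 1,000 keys per LIST page):

<?php
// Back-of-the-envelope: one PUT per new chunk; LISTs are needed whenever the
// local hash cache is cold. Smaller chunks mean more of both.
$bytes = 10 * 1024 * 1024 * 1024; // hypothetical 10 GB backup
foreach (array(512, 1024, 5120) as $kb)
{
  $chunks = ceil($bytes / ($kb * 1024));
  $lists = ceil($chunks / 1000);
  printf("%4d KB chunks: %6d PUTs to upload, %3d LISTs per cold run\n",
         $kb, $chunks, $lists);
}
?>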

DaveK
