Expect script to print out IronPort config, showconfig

Logging this here for easy fetching later.

#!/usr/bin/expect
set timeout 30
spawn ssh USERNAME@HOSTNAME
expect_after eof { exit 0 }

## interact with SSH
expect {
    "yes/no" { send "yes\r" }
    -re ".assword:" { send "PASSWORD\r" }
}

expect "> " { send "showconfig\r" }
expect "> " { send "Y\r" }
expect "Press Any …

Parallel Distributed Computing Example

You may have seen the article Hadoop Example, AccessLogCountByHourOfDay, a distributed computing solution built on Hadoop. The purpose of this article is to dive into the theory behind it.

To understand the power of distributed computing, we need to step back and understand the problem. First we'll look at a command line Java program that processes each HTTP log file one at a time, one line at a time, until done. To speed up the job, we'll then look at a multi-threaded approach: we should be able to get the job done faster if we break it up into a set of sub-tasks and run them in parallel. Then we'll come to Hadoop and distributed computing: the same concept of breaking the job up into sub-tasks, but rather than running on one server, we'll run on multiple servers in parallel.
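To make the starting point concrete, here is a minimal sketch of that single-threaded approach (not the exact program discussed in this article): walk each file named on the command line, pull the hour out of each line's timestamp, and accumulate the counts in a single array. The class name and the assumption that timestamps follow the Common Log Format (e.g. [10/Oct/2000:13:55:36 -0700]) are mine.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SingleThreadedHourCount {
    public static void main(String[] args) throws IOException {
        int[] countByHour = new int[24];
        // Process each log file named on the command line, one line at a time.
        for (String file : args) {
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Common Log Format timestamp: [10/Oct/2000:13:55:36 -0700]
                    // The hour is the two digits after the first ':' in the bracketed field.
                    int colon = line.indexOf(':', line.indexOf('['));
                    if (colon > 0 && colon + 3 <= line.length()) {
                        try {
                            int hour = Integer.parseInt(line.substring(colon + 1, colon + 3));
                            if (hour >= 0 && hour < 24) {
                                countByHour[hour]++;
                            }
                        } catch (NumberFormatException ignored) {
                            // Skip malformed lines rather than failing the whole run.
                        }
                    }
                }
            }
        }
        for (int hour = 0; hour < 24; hour++) {
            System.out.printf("%02d\t%d%n", hour, countByHour[hour]);
        }
    }
}

The multi-threaded version follows the same shape: hand each file (or block of files) to a worker thread that keeps its own int[24], then sum the per-thread arrays at the end.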

At first you'd think that Hadoop would be the fastest, but in our basic example you'll see that Hadoop isn't significantly faster. Why? The Hadoop overhead of scheduling the job and tracking the tasks is slowing us down. In order to see the power of Hadoop, we need much larger data sets. Think about our single server approach for a minute. As we ramp up the size and/or number of files to process, there is going to be a point where the server hits resource limitations (CPU, RAM, disk). If we have 4 threads making effective use of 4 cores of our CPU, we may be able to do the job 4 times faster than single threaded. But if we have a terabyte of data to process and it takes, say, 100 seconds per GB, it's going to take 100,000 seconds to finish (that's more than a day). With Hadoop, we can scale out horizontally. What if we had a 1000-node Hadoop cluster? Suddenly the overhead of scheduling the job and tracking the tasks is minuscule in comparison to the whole job. The whole job may complete in 100 seconds or less! We went from over a day to less than 2 minutes. Wow.

Please note: the single threaded and multi-threaded examples in this article are not using the Map/Reduce algorithm. This is intentional. I'm trying to demonstrate the evolution of thought. When we think about how to solve the problem, the first thing that comes to mind is to walk through the files, one line at a time, and accumulate the result. Then we realize we could split the job up into threads and gain some speed. The last evolution is the Map/Reduce algorithm across a distributed computing platform.

Let’s dive in….


Hadoop Example, AccessLogCountByHourOfDay

Inspired by an article written by Tom White, AWS author and developer:
Running Hadoop MapReduce on Amazon EC2 and Amazon S3

Instead of counting by minute of the week, this one counts by hour of the day; I just find that more interesting than which minute of the week is most popular. The output is one line for each hour of the day, the two-digit hour followed by a tab and its count:

00\t
…
23\t

The main reason for writing this, however, is to provide a working example that will compile. I found a number of problems in the original post.
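For a sense of the shape of the job, here is a sketch of the mapper and reducer using the old (0.18-era) org.apache.hadoop.mapred API. This is not the code from the post: the class names are mine, and it assumes Common Log Format timestamps, where the hour is the two digits after the first ':' in the bracketed date field.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AccessLogHourSketch {

    // Emits (hour, 1) for every log line.
    public static class HourMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            String s = line.toString();
            int colon = s.indexOf(':', s.indexOf('['));
            if (colon > 0 && colon + 3 <= s.length()) {
                out.collect(new Text(s.substring(colon + 1, colon + 3)), ONE);
            }
        }
    }

    // Sums the ones for each hour, producing the "HH\tcount" lines shown above.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text hour, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            out.collect(hour, new IntWritable(sum));
        }
    }
}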


hadoop-0.18.3 Could not create the Java virtual machine

I installed Hadoop on a VM and needed to set the Java heap size lower than the default of 1000 MB (-Xmx1000m) to get it to work. I set the HADOOP_HEAPSIZE variable in the conf/hadoop-env.sh file to the lower value, but Hadoop continued to spit out this error:

# hadoop -help
Could not create the Java virtual machine.
Exception in thread "main" java.lang.NoClassDefFoundError: Could_not_reserve_enough_space_for_object_heap
Caused by: java.lang.ClassNotFoundException: Could_not_reserve_enough_space_for_object_heap
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: Could_not_reserve_enough_space_for_object_heap.  Program will exit.

It didn't matter what I set HADOOP_HEAPSIZE to, the problem persisted. I never did find the answer online, so I figured I'd do the world a favor today and make a note about how to fix it. Maybe I'll save someone else the 2 hours it took me to figure this out!
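For reference, the change described above amounts to a line like this in conf/hadoop-env.sh (256 is just an example value, in MB); on its own it did not make the error go away:

# conf/hadoop-env.sh
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=256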

THE SOLUTION:


DbTable and all its glory

The example: http://www.koopman.me/dbtable/

I got the concept of the DbTable class from a book called PHP5 Professional. The idea is an abstract class that lets us quickly make a new class out of any database table. Database tables make good objects. We often name them after objects, like Shopper or Product. It makes sense to create classes that represent these objects, and to have a clean, consistent way to manipulate the data in the table. It also abstracts the database layer from the application logic. If you just looked at example.phps, you'd have no idea whether the database was a flat file, MySQL, PostgreSQL, or even a database at all. Abstraction is a good thing, and one of the principles of object oriented programming.


Deduplication Snapshots on Amazon S3

Deduplication refers to the practice of storing files by breaking them up into chunks (or slices), computing a unique hash for each chunk, then storing the chunks along with metadata that explains how to reassemble each file later. This is useful in a backup strategy, because you never have to back up the same chunk twice. That matters most when backing up multiple systems that contain the same, or similar, files. Imagine I back up one system, including common operating system files and other common files. When I back up a second machine, I don't have to upload the chunks it shares with the first system; I merely store the metadata describing each file and how to reassemble it. Another good use is files that grow: when I back one up a second time, rather than storing the same information again, I only upload the new parts. A third use is taking snapshot backups of the same directory. If I take a full snapshot backup of a directory, then the second time I take a snapshot of that directory I only upload the deltas. In other words, say I take a snapshot of a particular directory every day: instead of storing a full copy of mostly redundant data, I only save the new file chunks. Each snapshot is a point-in-time map of which files existed and which chunks to use to reassemble each file.

The concept: take each file -> break it up into 5 MB chunks -> create a unique md4 hash of each chunk -> compare each hash to the hashes already stored -> upload the chunks that do not yet exist in the storage area -> save metadata so you can re-assemble the files later.
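To make the chunk-and-hash step concrete, here is a minimal sketch of that pipeline. It is an illustration, not the actual tool: it is written in Java to match the other examples on this page, MD5 stands in for md4 (the JDK has no built-in MD4), the set of stored hashes is kept in memory, and the S3 upload is left as a stub.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    private static final int CHUNK_SIZE = 5 * 1024 * 1024; // 5 MB chunks

    // Hashes of chunks already uploaded; a real tool would persist this
    // (and the chunks themselves) in S3 plus a metadata store.
    private final Set<String> storedHashes = new HashSet<>();

    /** Returns the ordered list of chunk hashes needed to reassemble the file. */
    public List<String> backUp(String path) throws IOException, NoSuchAlgorithmException {
        List<String> manifest = new ArrayList<>();
        byte[] buffer = new byte[CHUNK_SIZE];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = readChunk(in, buffer)) > 0) {
                MessageDigest md = MessageDigest.getInstance("MD5"); // stand-in for md4
                md.update(buffer, 0, read);
                String hash = toHex(md.digest());
                if (storedHashes.add(hash)) {
                    // New chunk: this is where the real tool uploads it to S3.
                    uploadChunk(hash, buffer, read);
                }
                manifest.add(hash); // metadata: which chunks, in which order
            }
        }
        return manifest;
    }

    // Fill the buffer as fully as possible; returns bytes read, or -1 at end of file.
    private static int readChunk(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) break;
            total += n;
        }
        return total == 0 ? -1 : total;
    }

    private void uploadChunk(String hash, byte[] data, int length) {
        // Placeholder for the actual S3 PUT of the chunk, keyed by its hash.
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

The returned manifest is the per-file metadata: the ordered list of chunk hashes needed to reassemble the file, which is what a snapshot records.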


Amazon S3 Tools, Using PHP

If you haven't heard of Amazon S3, check it out here. It's remote storage for your files, at $0.15 per GB per month. You sign up for an account for free, then pay at the end of each month for the GB stored and data transfer used. It's nice and cheap, and I find it better than an FTP account for backing up files, for a few reasons:

1) It's cheap, you pay only for what you use.
2) The interface is all HTTP REST, making it easy to work with in code.
3) It's cheap.
4) You can make select files publicly readable and available via an HTTP address.
5) There is a Firefox extension, S3 Organizer, that looks like an FTP client; you can move files back and forth from your desktop.
6) It's all the hype right now.
7) It's cheap remote backup.

The HTTP REST interface is easy to use with PHP. I made a few command line utils with PHP:
s3ls <path> – lists file details in S3 account that are prefixed with <path>
s3put <file> – stores file in S3 account in the same dir it’s in on your server
s3get <file> – retrieves file from S3 account – can use absolute path, else it assumes current working directory
s3syncdir <dir> – removes files from S3 that no longer exist on server, then uploads any missing or modified files to S3
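The PHP utilities themselves aren't reproduced in this excerpt. As a rough illustration of the REST interface they sit on top of (sketched in Java to keep this page's examples in one language), here is how an s3ls-style bucket-listing GET is signed under S3's original HMAC-SHA1 Authorization scheme; the credentials and bucket name are placeholders.

import java.text.SimpleDateFormat;
import java.util.Base64;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3SignSketch {
    public static void main(String[] args) throws Exception {
        String accessKey = "AKIAEXAMPLE";      // placeholder credentials
        String secretKey = "SECRETEXAMPLE";
        String bucket = "my-backup-bucket";    // placeholder bucket name

        // RFC 1123 date in GMT, required by the signature scheme.
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String date = fmt.format(new Date());

        // String to sign for a simple bucket-listing GET: the verb, empty
        // Content-MD5 and Content-Type lines, the date, and the canonical resource.
        String stringToSign = "GET\n\n\n" + date + "\n/" + bucket + "/";

        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        String signature = Base64.getEncoder()
                .encodeToString(mac.doFinal(stringToSign.getBytes("UTF-8")));

        // The request line and headers an s3ls-style tool would send.
        System.out.println("GET /" + bucket + "/ HTTP/1.1");
        System.out.println("Host: s3.amazonaws.com");
        System.out.println("Date: " + date);
        System.out.println("Authorization: AWS " + accessKey + ":" + signature);
    }
}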
