Blog - Koopman.ME

Linux Sync Date – Super easy

May 9, 2010September 3, 2009 by Dave Koopman

Expect script to print out IronPort config, showconfig

May 9, 2010July 18, 2009 by Dave Koopman

Logging this here for easy fetching later. #!/usr/bin/expect set timeout 30 spawn ssh USERNAME@HOSTNAME expect_after eof { exit 0 } ## interact with SSH expect { “yes/no” { send “yes\r” } -re “.assword:” { send “PASSWORD\r” } } expect “> ” { send “showconfig\r” } expect “> ” { send “Y\r” } expect “Press Any … Read moreExpect script to print out IronPort config, showconfig

Make a Private/Public Key Pair, Self Signed, No Password

May 9, 2010July 15, 2009 by Dave Koopman

This is take straight from http://devsec.org/info/ssl-cert.html. I’m getting it on my blog, as a reference to myself, so I can make a key pair quickly in the future. Make a new ssl private key: * Generate a new unencrypted rsa private key in PEM format: openssl genrsa -out privkey.pem 2048 You can create an encrypted … Read moreMake a Private/Public Key Pair, Self Signed, No Password

openldap-2.3 on CentOS5 Tutorial

May 9, 2010June 26, 2009 by Dave Koopman

I had to get a dummy openldap setup that had “mail” as one of it’s attributes for the records. I specifically needed all the records to live in the root ou, meaning no Organizational Units, just the root, then all the records. Like this:

dn: cn=1,dc=example,dc=com
cn: 1
objectClass: top
objectClass: dkuser
mail: someemail1@somedomain1.com
mailHost: somesmtphostname1:25

dn: cn=2,dc=example,dc=com
cn: 2
objectClass: top
objectClass: dkuser
mail: someemail2@somedomain2.com
mailHost: somesmtphostname2:25

…. and so on.

It was hard to find a step by step instruction set. So, in this tutorial, I’ll give you command by command steps to install, configure and load openldap on a CentOS5 OS.

First, install the packages with Yum:

yum install openldap openldap-clients openldap-servers nss_ldap python-ldap

Next, set ldap to run at system startup time:

/sbin/chkconfig ldap on

Next, get your password for slapd.conf:

cd /etc/openldap/
/usr/sbin/slappasswd

…. it’ll prompt you for a new password, type it twice. All it does is spit out a password that you can copy paste into slapd. Looks like this:

New password:
Re-enter new password:
{SSHA}zskkuz1hd90SyXA4y+zN4AA0FBQorVEd

Load Balancing Techniques

May 9, 2010June 13, 2009 by Dave Koopman

Load balancing is a term that describes a method to distribute incoming socket connections to different servers. It’s not distributed computing, where jobs are broken up into a series of sub-jobs, so each server does a fraction of the overall work. It’s not that at all. Rather, incoming socket connections are spread out to different servers. Each incoming connection will communicate with the node it was delegated to, and the entire interaction will occur there. Each node is not aware of the other nodes existence.

Why do you need load balancing?
Simple answer: Scalability and Redundancy.

Scalability

If your application becomes busy, resource limits, such as bandwidth, cpu, memory, disk space, disk I/O, and more may reach its limits. In order to remedy such problem, you have two options: scale up, or scale out. Load balancing is a scale out technique. Rather than increasing server resources, you add cost effective, commodity servers, creating a “cluster” of servers that perform the same task. Scaling out is more cost effective, because commodity level hardware provides the most bang for the buck. High end super computers come at a premium, and can be avoided in many cases.

Redundancy

Servers crash, this is the rule, not the exception. Your architecture should be devised in a way to reduce or eliminate single points of failure (SPOF). Load balancing a cluster of servers that perform the same role provides room for a server to be taken out manually for maintenance tasks, without taking down the system. You can also withstand a server crashing. This is called High Availability, or HA for short. Load balancing is a tactic that assists with High Availability, but is not High Availability by itself. To achieve high availability, you need automated monitoring that checks the status of the applications in your cluster, and automates taking servers out of rotation, in response to failure detected. These tools are often bundled into Load Balancing software and appliances, but sometimes need to be programmed independently.

How to perform load balancing?

There are 3 well known ways:

DNS based
Hardware based
Software based

Parallel Distributed Computing Example

May 9, 2010April 24, 2009 by Dave Koopman

You may have seen article, Hadoop Example, AccessLogCountByHourOfDay. This is a distributed computing solution, using Hadoop. The purpose of this article is to dive into the theory behind this.

To understand the power of distributed computing, we need to step back and understand the problem. First we’ll look at a command line java program that will process each http log file, one file at a time, one line at a time, until done. To speed up the job, we’ll then look at another approach: multi-threaded; we should be able to get the job done faster if we break the job up into a set of sub tasks and run them in parallel. Then, we’ll come to Hadoop, distributed computing. Same concept of breaking the job up into a set of sub tasks, but rather than running with one server, we’ll run on multiple servers in parallel.

At first you’d think that Hadoop would be the fastest, but in our basic example, you’ll see that Hadoop takes isn’t significantly faster. Why? The Hadoop overhead of scheduling the job and tracking the tasks is slowing us down. In order to see the power of Hadoop, we need much larger data sets. Think about our single server approach for a minute. As we ramp up the size and/or number of files to process, there is going to be a point where the server will hit resource limitations (cpu, ram, disk). If we have 4 threads making use of 4 cores of our CPU effectively, we may be able to do a job 4 times faster than single threaded. But, if we have a terabyte of data to process and it takes say 100 second per GB, it’s going to take 100,000 seconds to finish (that’s more than 1 day). With Hadoop, we can scale out horizontally. What if we had a 1000 node Hadoop cluster. Suddenly the overhead of scheduling the job and tracking the tasks is minuscule in comparison to the whole job. The whole job may complete in 100 seconds or less! We went from over a day to less than 2 minutes. Wow.

Please note: the single thread and multi-threaded examples in this article are not using the Map/Reduce algorithm. This is intentional. I’m trying to demonstrate the evolution of thought. When we think about how to solve the problem, the first thing that comes to mind is to walk through the files, one line at a time, and accumulate the result. Then, we realize we could split the job up into threads and gain some speed. The last evolution is is the Map/Reduce algorithm across a distributed computing platform.

Let’s dive in….

Hadoop Example, AccessLogCountByHourOfDay

May 9, 2010April 23, 2009 by Dave Koopman

Inspired by an article written by Tom White, AWS author and developer:
“Running Hadoop MapReduce on Amazon EC2 and Amazon S3”

Instead of minute of the week, this one does by Hour Of The Day. I just find this more interesting than the minute of the week that’s most popular. The output is:
00\t
…
23\t

The main reason for writing this, however, is to provide a working example that will compile. I found a number of problems in the original post.

Inspiring MapReduce lectures by Google

May 9, 2010April 23, 2009 by Dave Koopman

Watched a set of 3 lectures run at Google, by Aaron Kimball, on MapReduce was inspiring to me. I feel like I have a much more solid grasp on MapReduce after watching these. I really liked how it started out with some basic functional programming and socket theory, then moved into how MapReduce builds on … Read moreInspiring MapReduce lectures by Google