Posted by boneill42 on December 19, 2011 at 7:22 PM PST
Ever wonder how to kick off a remote Hadoop job programmatically? Here's how.
I'm adding the ability to deploy a Map/Reduce job to a remote Hadoop cluster in Virgil. With this, Virgil allows users to schedule a Hadoop job with a simple REST POST. (pretty handy)
To get this to work properly, Virgil needed to be able to deploy a job remotely. Ordinarily, to run a job against a remote cluster, you issue a command from the shell:
hadoop jar $JAR_FILE $CLASS_NAME
We wanted to do the same thing, but from within the Virgil runtime. It was easy enough to find the class we needed to use: RunJar. RunJar's main() method stages the jar and submits the job. Thus, to achieve the same functionality as the command line, we used the following:
List<String> args = new ArrayList<String>();
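Fleshed out, the call looks something like the sketch below. The jar path and class name are placeholders; RunJar.main() takes the same arguments you would pass on the command line:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.RunJar;

public class RemoteJobLauncher {
    public static void main(String[] cli) throws Throwable {
        // Same arguments as "hadoop jar $JAR_FILE $CLASS_NAME":
        // the jar to stage, then the job's main class (placeholders here).
        List<String> args = new ArrayList<String>();
        args.add("/path/to/job.jar");
        args.add("com.example.WordCount");

        // RunJar.main() unpacks the jar, sets up a classloader,
        // and invokes the job's main class -- just like the CLI does.
        RunJar.main(args.toArray(new String[args.size()]));
    }
}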
That worked just fine, but would result in a local job deployment. To get it to deploy to a remote cluster, we needed Hadoop to load the cluster configuration. For Hadoop, cluster configuration is spread across three files: core-site.xml, hdfs-site.xml, and mapred-site.xml. To get the Hadoop runtime to load the configuration, you need to include these files on your classpath. The key line is found in the Hadoop Configuration Javadoc:
"Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:"
Once we dropped the cluster configuration onto the classpath, everything worked like a charm.
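For what it's worth, if juggling the classpath is awkward, you can also point a Configuration at the site files explicitly. This is just a sketch: the paths are placeholders, and it applies when you submit through the Job/JobClient API rather than RunJar, which picks up configuration from the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class RemoteClusterConfig {
    // Builds a Configuration that points at the remote cluster.
    // The paths below are placeholders for wherever your site files live.
    public static Configuration load() {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/remote/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/remote/hdfs-site.xml"));
        conf.addResource(new Path("/etc/hadoop/remote/mapred-site.xml"));
        return conf;
    }
}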