Oozie: if-else, fork and join, SSH, DistCp and sub-workflow actions

I covered most of the Oozie actions in the previous tutorial. Below are a few additional topics that can be useful.

If-else in a workflow

In programming languages, if-then-else and switch-case statements are usually used to control the flow of execution depending on certain conditions being met or not. Similarly, Oozie workflows use <decision> nodes to determine the actual execution path of a workflow.

The behavior of a <decision> node is best described as an if-then-else-if-then-else sequence, where the first predicate that resolves to true determines the execution path. Unlike a <fork> node, where all execution paths are followed, a <decision> node follows only one execution path.

[xml]

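<workflow-app xmlns="uri:oozie:workflow:0.5" name="decisionNodeWF">
    <start to="decision"/>
    <decision name="decision">
        <switch>
            <!-- Cases are evaluated in order; the first true predicate wins -->
            <case to="mapReduce">${jobType eq "mapReduce"}</case>
            <case to="hive">${jobType eq "hive"}</case>
            <case to="pig">${jobType eq "pig"}</case>
            <!-- Taken when no case matches -->
            <default to="mapReduce"/>
        </switch>
    </decision>

    <action name="mapReduce">
        <!-- action definition omitted -->
        <ok to="done"/>
        <error to="done"/>
    </action>

    <action name="hive">
        <!-- action definition omitted -->
        <ok to="done"/>
        <error to="done"/>
    </action>

    <action name="pig">
        <!-- action definition omitted -->
        <ok to="done"/>
        <error to="done"/>
    </action>

    <end name="done"/>
</workflow-app>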

[/xml]
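
A predicate does not have to compare a job property; it can also call one of Oozie's built-in EL functions. The fragment below is only a sketch of that idea: the inputDir property and the node names checkInput, processData and skipProcessing are hypothetical and would have to exist in your own workflow.

[xml]

<decision name="checkInput">
    <switch>
        <!-- fs:exists returns true when the given HDFS path exists -->
        <case to="processData">${fs:exists(inputDir)}</case>
        <!-- Hypothetical node that runs when the input is missing -->
        <default to="skipProcessing"/>
    </switch>
</decision>

[/xml]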

Fork and join

Simple workflows execute one action at a time. When actions do not depend on each other's results, they can be run in parallel using the <fork> and <join> control nodes to speed up the workflow. When Oozie encounters a <fork> node, it starts all the paths defined by the fork in parallel. These execution paths run independently of each other. All the paths of a <fork> node must converge into a <join> node, and the workflow does not proceed beyond the <join> node until every execution path from the <fork> node has reached it.

[xml]

<workflow-app name="BLOG_WORKFLOW" xmlns="uri:oozie:workflow:0.4">
    <!-- Settings in <global> apply to every action in the workflow -->
    <global>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>

    <start to="PARALLEL_PROCESS_FORK"/>

    <!-- Start the three Pig actions in parallel -->
    <fork name="PARALLEL_PROCESS_FORK">
        <path start="RAW_DATA_PROCESSING_1"/>
        <path start="RAW_DATA_PROCESSING_2"/>
        <path start="RAW_DATA_PROCESSING_3"/>
    </fork>

    <action name="RAW_DATA_PROCESSING_1">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${workflow_root}/hive-site.xml</job-xml>
            <script>data_processing_1.pig</script>
            <file>hive-site.xml#hive-site.xml</file>
        </pig>
        <ok to="joining"/>
        <error to="kill"/>
    </action>

    <action name="RAW_DATA_PROCESSING_2">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${workflow_root}/hive-site.xml</job-xml>
            <script>data_processing_2.pig</script>
            <file>hive-site.xml#hive-site.xml</file>
        </pig>
        <ok to="joining"/>
        <error to="kill"/>
    </action>

    <action name="RAW_DATA_PROCESSING_3">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${workflow_root}/hive-site.xml</job-xml>
            <script>data_processing_3.pig</script>
            <file>hive-site.xml#hive-site.xml</file>
        </pig>
        <ok to="joining"/>
        <error to="kill"/>
    </action>

    <!-- Wait here until all three forked paths have finished -->
    <join name="joining" to="createSuccessMarkerFile"/>

    <action name="createSuccessMarkerFile">
        <fs>
            <!-- Both paths carry the name node URI; a path without a scheme is rejected when the fs action has no <name-node> -->
            <delete path="${nameNode}/user/queue/outputpath"/>
            <mkdir path="${nameNode}/user/queue/success"/>
        </fs>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

[/xml]

Synchronous Versus Asynchronous Actions

The filesystem, email, SSH, and sub-workflow actions are executed by the Oozie server itself and are called synchronous actions. Executing them does not require running any user code, just access to some libraries. The filesystem action, for example, performs lightweight filesystem operations that do not involve data transfers.

The email action sends emails; the Oozie server does this directly via an SMTP server. The sub-workflow action is also executed by the Oozie server, but it just submits a new workflow. The SSH action makes Oozie invoke a secure shell on a remote machine, though the actual shell command does not run on the Oozie server. These actions are all relatively lightweight and hence safe to run synchronously on the Oozie server machine itself.
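
As an illustration of how small these synchronous actions are, here is a sketch of an email action. The recipient address, subject, body and the transition targets are placeholders, and the sketch assumes the SMTP settings (for example oozie.email.smtp.host) are already configured in oozie-site.xml.

[xml]

<action name="notifyByEmail">
    <email xmlns="uri:oozie:email-action:0.1">
        <!-- Placeholder recipient; use a comma-separated list for several addresses -->
        <to>team@example.com</to>
        <subject>Workflow ${wf:id()} finished</subject>
        <body>The workflow ${wf:name()} completed its processing steps.</body>
    </email>
    <ok to="end"/>
    <error to="kill"/>
</action>

[/xml]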

SSH Action

The <ssh> action runs a shell command on a specific remote host using a secure shell. The command should be available in the path on the remote machine, and it is executed in the user's home directory on that machine. The command can be run on the remote host as a different user than the one running the workflow, using the typical ssh syntax user@host. For this to work, the oozie.action.ssh.allow.user.at.host property must be set to true in oozie-site.xml. By default, this variable is false.
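
A minimal sketch of an SSH action could look like the following. The host etl_user@edge01.example.com, the command run_export.sh, its argument and the transition targets are all hypothetical; only the element names come from the ssh-action schema.

[xml]

<action name="runRemoteScript">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <!-- user@host form; needs oozie.action.ssh.allow.user.at.host enabled -->
        <host>etl_user@edge01.example.com</host>
        <!-- Must be on the PATH of the remote user -->
        <command>run_export.sh</command>
        <args>${exportDate}</args>
        <!-- Optional: makes the command's stdout (key=value lines) available to later nodes -->
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
</action>

[/xml]

Passwordless SSH from the Oozie server to the remote host is assumed here, since the action has no way to supply a password.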

DistCp Action

DistCp action supports the Hadoop distributed copy tool, which is typically used to copy data across Hadoop clusters. Users can use it to copy data within the same cluster as well, and to move data between Amazon S3 and Hadoop clusters.
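
The following is a rough sketch of a DistCp action that copies a directory from one cluster to another. The backupNameNode property, both paths and the transition targets are assumptions made for illustration; each <arg> element is passed to the distcp tool in order, source first and destination last.

[xml]

<action name="copyToBackup">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Source on the local cluster -->
        <arg>${nameNode}/user/queue/source</arg>
        <!-- Destination on a second, hypothetical cluster -->
        <arg>${backupNameNode}/user/queue/backup</arg>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>

[/xml]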

Sub-Workflow Action

The sub-workflow action runs a child workflow as part of the parent workflow. You can think of it as an embedded workflow. From the parent's perspective, this is a single action, and the parent proceeds to the next action in its workflow if and only if the sub-workflow has completed in its entirety. The child and the parent have to run in the same Oozie system, and the child workflow application has to be deployed in that Oozie system. The supported tags are app-path (required), propagate-configuration, and configuration.

The properties for the sub-workflow are defined in the <configuration> section. The optional <propagate-configuration> element tells Oozie to pass the parent's job configuration to the sub-workflow. Note that what gets propagated is the parent's job configuration.

[xml]

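<action name="mySubWorkflow">
    <sub-workflow>
        <!-- HDFS directory that contains the child workflow.xml -->
        <app-path>${nameNode}/user/haas_queue/workflows/data_loader/sub_workflow</app-path>
        <!-- Pass the parent's job configuration to the child -->
        <propagate-configuration/>
    </sub-workflow>
    <ok to="success"/>
    <error to="fail"/>
</action>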

[/xml]
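
To show where the <configuration> section described above fits, here is a variant of the same action that also sets a property for the child. This is only a sketch: the subWorkflowPath and inputDir properties are hypothetical names, not something the child workflow is known to expect.

[xml]

<action name="mySubWorkflowWithConfig">
    <sub-workflow>
        <!-- Hypothetical property holding the child's application path -->
        <app-path>${subWorkflowPath}</app-path>
        <propagate-configuration/>
        <configuration>
            <!-- Defined explicitly for the child workflow -->
            <property>
                <name>inputDir</name>
                <value>${inputDir}</value>
            </property>
        </configuration>
    </sub-workflow>
    <ok to="success"/>
    <error to="fail"/>
</action>

[/xml]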