spring-batch/build/reference/html/scalability.html

<html><head>
      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
   <title>7.&nbsp;Scaling and Parallel Processing</title><link rel="stylesheet" type="text/css" href="css/manual-multipage.css"><meta name="generator" content="DocBook XSL Stylesheets V1.78.1"><link rel="home" href="index.html" title="Spring Batch - Reference Documentation"><link rel="up" href="index.html" title="Spring Batch - Reference Documentation"><link rel="prev" href="readersAndWriters.html" title="6.&nbsp;ItemReaders and ItemWriters"><link rel="next" href="repeat.html" title="8.&nbsp;Repeat"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">7.&nbsp;Scaling and Parallel Processing</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="readersAndWriters.html">Prev</a>&nbsp;</td><th width="60%" align="center">&nbsp;</th><td width="20%" align="right">&nbsp;<a accesskey="n" href="repeat.html">Next</a></td></tr></table><hr></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="scalability" href="#scalability"></a>7.&nbsp;Scaling and Parallel Processing</h1></div></div></div><p>Many batch processing problems can be solved with single threaded,
  single process jobs, so it is always a good idea to properly check if that
  meets your needs before thinking about more complex implementations. Measure
  the performance of a realistic job and see if the simplest implementation
  meets your needs first: you can read and write a file of several hundred
  megabytes in well under a minute, even with standard hardware.</p><p>When you are ready to start implementing a job with some parallel
  processing, Spring Batch offers a range of options, which are described in
  this chapter, although some features are covered elsewhere. At a high level
  there are two modes of parallel processing: single process, multi-threaded;
  and multi-process. These break down into categories as well, as
  follows:</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>Multi-threaded Step (single process)</p></li><li class="listitem"><p>Parallel Steps (single process)</p></li><li class="listitem"><p>Remote Chunking of Step (multi process)</p></li><li class="listitem"><p>Partitioning a Step (single or multi process)</p></li></ul></div><p>Next we review the single-process options first, and then the
  multi-process options.</p><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="multithreadedStep" href="#multithreadedStep"></a>7.1&nbsp;Multi-threaded Step</h2></div></div></div><p>The simplest way to start parallel processing is to add a
    <code class="classname">TaskExecutor</code> to your Step configuration, e.g. as an
    attribute of the <code class="literal">tasklet</code>:</p><pre class="programlisting"><span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"loading"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;tasklet</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">&gt;</span>...<span class="hl-tag">&lt;/tasklet&gt;</span>
<span class="hl-tag">&lt;/step&gt;</span></pre><p>In this example the taskExecutor is a reference to another bean
    definition, implementing the <code class="classname">TaskExecutor</code>
    interface. <code class="classname">TaskExecutor</code> is a standard Spring
    interface, so consult the Spring User Guide for details of available
    implementations. The simplest multi-threaded
    <code class="classname">TaskExecutor</code> is a
    <code class="classname">SimpleAsyncTaskExecutor</code>.</p><p>The result of the above configuration will be that the Step
    executes by reading, processing and writing each chunk of items
    (each commit interval) in a separate thread of execution.  Note
    that this means there is no fixed order for the items to be
    processed, and a chunk might contain items that are
    non-consecutive compared to the single-threaded case. In addition
    to any limits placed by the task executor (e.g. if it is backed by
    a thread pool), there is a throttle limit in the tasklet
    configuration which defaults to 4.  You may need to increase this
    to ensure that a thread pool is fully utilised, e.g.</p><pre class="programlisting"><span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"loading"</span><span class="hl-tag">&gt;</span> <span class="hl-tag">&lt;tasklet</span>
    <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span>
    <span class="hl-attribute">throttle-limit</span>=<span class="hl-value">"20"</span><span class="hl-tag">&gt;</span>...<span class="hl-tag">&lt;/tasklet&gt;</span>
<span class="hl-tag">&lt;/step&gt;</span></pre><p>Note also that there may be limits placed on concurrency by
    any pooled resources used in your step, such as
    a <code class="classname">DataSource</code>.  Be sure to make the pool in
    those resources at least as large as the desired number of
    concurrent threads in the step.</p><p>There are some practical limitations of using multi-threaded Steps
    for some common Batch use cases. Many participants in a Step (e.g. readers
    and writers) are stateful, and if the state is not segregated by thread,
    then those components are not usable in a multi-threaded Step. In
    particular most of the off-the-shelf readers and writers from Spring Batch
    are not designed for multi-threaded use. It is, however, possible to work
    with stateless or thread safe readers and writers, and there is a sample
    (parallelJob) in the Spring Batch Samples that show the use of a process
    indicator (see <a class="xref" href="readersAndWriters.html#process-indicator" title="6.12&nbsp;Preventing State Persistence">Section&nbsp;6.12, &#8220;Preventing State Persistence&#8221;</a>) to keep
    track of items that have been processed in a database input table.</p><p>Spring Batch provides some implementations of
	  <code class="classname">ItemWriter</code> and
	<code class="classname">ItemReader</code>.  Usually they say in the
	Javadocs if they are thread safe or not, or what you have to do to
	avoid problems in a concurrent environment.  If there is no
	information in Javadocs, you can check the implementation to see
	if there is any state.  If a reader is not thread safe, it may
	still be efficient to use it in your own synchronizing delegator.
	You can synchronize the call to <code class="literal">read()</code> and as
	long as the processing and writing is the most expensive part of
	the chunk your step may still complete much faster than in a
	single threaded configuration.
</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="scalabilityParallelSteps" href="#scalabilityParallelSteps"></a>7.2&nbsp;Parallel Steps</h2></div></div></div><p>As long as the application logic that needs to be parallelized can
    be split into distinct responsibilities, and assigned to individual steps
    then it can be parallelized in a single process. Parallel Step execution
    is easy to configure and use, for example, to execute steps
    <code class="literal">(step1,step2)</code> in parallel with
    <code class="literal">step3</code>, you could configure a flow like this:</p><pre class="programlisting"><span class="hl-tag">&lt;job</span> <span class="hl-attribute">id</span>=<span class="hl-value">"job1"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;split</span> <span class="hl-attribute">id</span>=<span class="hl-value">"split1"</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">next</span>=<span class="hl-value">"step4"</span><span class="hl-tag">&gt;</span>
        <span class="hl-tag">&lt;flow&gt;</span>
            <span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s1"</span> <span class="hl-attribute">next</span>=<span class="hl-value">"step2"</span><span class="hl-tag">/&gt;</span>
            <span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step2"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s2"</span><span class="hl-tag">/&gt;</span>
        <span class="hl-tag">&lt;/flow&gt;</span>
        <span class="hl-tag">&lt;flow&gt;</span>
            <span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step3"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s3"</span><span class="hl-tag">/&gt;</span>
        <span class="hl-tag">&lt;/flow&gt;</span>
    <span class="hl-tag">&lt;/split&gt;</span>
    <span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step4"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s4"</span><span class="hl-tag">/&gt;</span>
<span class="hl-tag">&lt;/job&gt;</span>

<span class="hl-tag">&lt;beans:bean</span> <span class="hl-attribute">id</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...SimpleAsyncTaskExecutor"</span><span class="hl-tag">/&gt;</span></pre><p>The configurable "task-executor" attribute is used to specify which
    TaskExecutor implementation should be used to execute the individual
    flows. The default is <code class="classname">SyncTaskExecutor</code>, but an
    asynchronous TaskExecutor is required to run the steps in parallel. Note
    that the job will ensure that every flow in the split completes before
    aggregating the exit statuses and transitioning.</p><p>See the section on <a class="xref" href="configureStep.html#split-flows" title="5.3.5&nbsp;Split Flows">Section&nbsp;5.3.5, &#8220;Split Flows&#8221;</a> for more
    detail.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="remoteChunking" href="#remoteChunking"></a>7.3&nbsp;Remote Chunking</h2></div></div></div><p>In Remote Chunking the Step processing is split across multiple
    processes, communicating with each other through some middleware. Here is
    a picture of the pattern in action:</p><div class="mediaobject" align="center"><img src="images/remote-chunking.png" align="middle"></div><p>The Master component is a single process, and the Slaves are
    multiple remote processes. Clearly this pattern works best if the Master
    is not a bottleneck, so the processing must be more expensive than the
    reading of items (this is often the case in practice).</p><p>The Master is just an implementation of a Spring Batch
    <code class="classname">Step</code>, with the ItemWriter replaced with a generic
    version that knows how to send chunks of items to the middleware as
    messages. The Slaves are standard listeners for whatever middleware is
    being used (e.g. with JMS they would be
    <code class="classname">MesssageListeners</code>), and their role is to process
    the chunks of items using a standard <code class="classname">ItemWriter</code> or
    <code class="classname">ItemProcessor</code> plus
    <code class="classname">ItemWriter</code>, through the
    <code class="classname">ChunkProcessor</code> interface. One of the advantages of
    using this pattern is that the reader, processor and writer components are
    off-the-shelf (the same as would be used for a local execution of the
    step). The items are divided up dynamically and work is shared through the
    middleware, so if the listeners are all eager consumers, then load
    balancing is automatic.</p><p>The middleware has to be durable, with guaranteed delivery and
    single consumer for each message. JMS is the obvious candidate, but other
    options exist in the grid computing and shared memory product space (e.g.
    Java Spaces).</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="partitioning" href="#partitioning"></a>7.4&nbsp;Partitioning</h2></div></div></div><p>Spring Batch also provides an SPI for partitioning a Step execution
    and executing it remotely. In this case the remote participants are simply
    Step instances that could just as easily have been configured and used for
    local processing. Here is a picture of the pattern in action:</p><div class="mediaobject" align="center"><img src="images/partitioning-overview.png" align="middle"></div><p>The Job is executing on the left hand side as a sequence of Steps,
    and one of the Steps is labelled as a Master. The Slaves in this picture
    are all identical instances of a Step, which could in fact take the place
    of the Master resulting in the same outcome for the Job. The Slaves are
    typically going to be remote services, but could also be local threads of
    execution. The messages sent by the Master to the Slaves in this pattern
    do not need to be durable, or have guaranteed delivery: Spring Batch
    meta-data in the <code class="classname">JobRepository</code> will ensure that
    each Slave is executed once and only once for each Job execution.</p><p>The SPI in Spring Batch consists of a special implementation of Step
    (the <code class="classname">PartitionStep</code>), and two strategy interfaces
    that need to be implemented for the specific environment. The strategy
    interfaces are <code class="classname">PartitionHandler</code> and
    <code class="classname">StepExecutionSplitter</code>, and their role is show in
    the sequence diagram below:</p><div class="mediaobject" align="center"><img src="images/partitioning-spi.png" align="middle"></div><p>The Step on the right in this case is the "remote" Slave, so
    potentially there are many objects and or processes playing this role, and
    the PartitionStep is shown driving the execution. The PartitionStep
    configuration looks like this:</p><pre class="programlisting"><span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1.master"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;partition</span> <span class="hl-attribute">step</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">partitioner</span>=<span class="hl-value">"partitioner"</span><span class="hl-tag">&gt;</span>
        <span class="hl-tag">&lt;handler</span> <span class="hl-attribute">grid-size</span>=<span class="hl-value">"10"</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">/&gt;</span>
    <span class="hl-tag">&lt;/partition&gt;</span>
<span class="hl-tag">&lt;/step&gt;</span></pre><p>Similar to the multi-threaded step's throttle-limit
    attribute, the grid-size attribute prevents the task executor from
    being saturated with requests from a single step.</p><p>There is a simple example which can be copied and extended in the
    unit test suite for Spring Batch Samples (see
    <code class="classname">*PartitionJob.xml</code> configuration).</p><p>Spring Batch creates step executions for the partitions called
    "step1:partition0", etc., so many people prefer to call the master step
    "step1:master" for consistency. With Spring 3.0 you can do this using an
    alias for the step (specifying the <code class="literal">name</code> attribute
    instead of the <code class="literal">id</code>). </p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="partitionHandler" href="#partitionHandler"></a>7.4.1&nbsp;PartitionHandler</h3></div></div></div><p>The <code class="classname">PartitionHandler</code> is the component that
      knows about the fabric of the remoting or grid environment. It is able
      to send <code class="classname">StepExecution</code> requests to the remote
      Steps, wrapped in some fabric-specific format, like a DTO. It does not
      have to know how to split up the input data, or how to aggregate the
      result of multiple Step executions. Generally speaking it probably also
      doesn't need to know about resilience or failover, since those are
      features of the fabric in many cases, and anyway Spring Batch always
      provides restartability independent of the fabric: a failed Job can
      always be restarted and only the failed Steps will be
      re-executed.</p><p><code class="classname">The PartitionHandler</code> interface can have
      specialized implementations for a variety of fabric types: e.g. simple
      RMI remoting, EJB remoting, custom web service, JMS, Java Spaces, shared
      memory grids (like Terracotta or Coherence), grid execution fabrics
      (like GridGain). Spring Batch does not contain implementations for any
      proprietary grid or remoting fabrics.</p><p>Spring Batch does however provide a useful implementation of
      <code class="classname">PartitionHandler</code> that executes Steps locally in
      separate threads of execution, using the
      <code class="classname">TaskExecutor</code> strategy from Spring. The
      implementation is called
      <code class="classname">TaskExecutorPartitionHandler</code>, and it is the
      default for a step configured with the XML namespace as above. It can
      also be configured explicitly like this:</p><pre class="programlisting"><span class="hl-tag">&lt;step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1.master"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;partition</span> <span class="hl-attribute">step</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">handler</span>=<span class="hl-value">"handler"</span><span class="hl-tag">/&gt;</span>
<span class="hl-tag">&lt;/step&gt;</span>

<span class="hl-tag">&lt;bean</span> <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...TaskExecutorPartitionHandler"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">ref</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">/&gt;</span>
    <span class="hl-tag">&lt;property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"step"</span> <span class="hl-attribute">ref</span>=<span class="hl-value">"step1"</span><span class="hl-tag"> /&gt;</span>
    <span class="hl-tag">&lt;property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"gridSize"</span> <span class="hl-attribute">value</span>=<span class="hl-value">"10"</span><span class="hl-tag"> /&gt;</span>
<span class="hl-tag">&lt;/bean&gt;</span></pre><p>The <code class="literal">gridSize</code> determines the number of separate
      step executions to create, so it can be matched to the size of the
      thread pool in the <code class="classname">TaskExecutor</code>, or else it can
      be set to be larger than the number of threads available, in which case
      the blocks of work are smaller.</p><p>The <code class="classname">TaskExecutorPartitionHandler</code> is quite
      useful for IO intensive Steps, like copying large numbers of files or
      replicating filesystems into content management systems. It can also be
      used for remote execution by providing a Step implementation that is a
      proxy for a remote invocation (e.g. using Spring Remoting).</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="stepExecutionSplitter" href="#stepExecutionSplitter"></a>7.4.2&nbsp;Partitioner</h3></div></div></div><p>The Partitioner has a simpler responsibility: to generate
      execution contexts as input parameters for new step executions only (no
      need to worry about restarts). It has a single method:</p><pre class="programlisting"><span class="hl-keyword">public</span> <span class="hl-keyword">interface</span> Partitioner {
    Map&lt;String, ExecutionContext&gt; partition(<span class="hl-keyword">int</span> gridSize);
}</pre><p>The return value from this method associates a unique name for
      each step execution (the <code class="classname">String</code>), with input
      parameters in the form of an <code class="classname">ExecutionContext</code>.
      The names show up later in the Batch meta data as the step name in the
      partitioned <code class="classname">StepExecutions</code>. The
      <code class="classname">ExecutionContext</code> is just a bag of name-value
      pairs, so it might contain a range of primary keys, or line numbers, or
      the location of an input file. The remote <code class="classname">Step</code>
      then normally binds to the context input using <code class="literal">#{...}</code>
      placeholders (late binding in step scope), as illustrated in the next
      section.</p><p>The names of the step executions (the keys in the
      <code class="classname">Map</code> returned by
      <code class="classname">Partitioner</code>) need to be unique amongst the step
      executions of a Job, but do not have any other specific requirements.
      The easiest way to do this, and to make the names meaningful for users,
      is to use a prefix+suffix naming convention, where the prefix is the
      name of the step that is being executed (which itself is unique in the
      <code class="classname">Job</code>), and the suffix is just a counter. There is
      a <code class="classname">SimplePartitioner</code> in the framework that uses
      this convention.</p><p>An optional interface
      <code class="classname">PartitioneNameProvider</code> can be used to
      provide the partition names separately from the partitions
      themselves.  If a <code class="classname">Partitioner</code> implements
      this interface then on a restart only the names will be queried.
      If partitioning is expensive this can be a useful optimisation.
      Obviously the names provided by the
      <code class="classname">PartitioneNameProvider</code> must match those
      provided by the <code class="classname">Partitioner</code>.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="bindingInputDataToSteps" href="#bindingInputDataToSteps"></a>7.4.3&nbsp;Binding Input Data to Steps</h3></div></div></div><p>It is very efficient for the steps that are executed by the
      PartitionHandler to have identical configuration, and for their input
      parameters to be bound at runtime from the ExecutionContext. This is
      easy to do with the StepScope feature of Spring Batch (covered in more
      detail in the section on <a class="xref" href="configureStep.html#late-binding" title="5.4&nbsp;Late Binding of Job and Step Attributes">Late Binding</a>). For example
      if the <code class="classname">Partitioner</code> creates
      <code class="classname">ExecutionContext</code> instances with an attribute key
      <code class="literal">fileName</code>, pointing to a different file (or
      directory) for each step invocation, the
      <code class="classname">Partitioner</code> output might look like this:</p><div class="table"><a name="d5e3165" href="#d5e3165"></a><p class="title"><b>Table&nbsp;7.1.&nbsp;Example step execution name to execution context provided by
        Partitioner targeting directory processing</b></p><div class="table-contents"><table summary="Example step execution name to execution context provided by&#xA;        Partitioner targeting directory processing" style="border-collapse: collapse;border-top: 0.5pt solid ; border-bottom: 0.5pt solid ; border-left: 0.5pt solid ; border-right: 0.5pt solid ; "><colgroup><col><col></colgroup><tbody><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; "><span class="bold"><strong>Step Execution Name
              (key)</strong></span></td><td style="border-bottom: 0.5pt solid ; "><span class="bold"><strong>ExecutionContext
              (value)</strong></span></td></tr><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; ">filecopy:partition0</td><td style="border-bottom: 0.5pt solid ; ">fileName=/home/data/one</td></tr><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; ">filecopy:partition1</td><td style="border-bottom: 0.5pt solid ; ">fileName=/home/data/two</td></tr><tr><td style="border-right: 0.5pt solid ; ">filecopy:partition2</td><td style="">fileName=/home/data/three</td></tr></tbody></table></div></div><br class="table-break"><p>Then the file name can be bound to a step using late binding to
      the execution context:</p><pre class="programlisting"><span class="hl-tag">&lt;bean</span> <span class="hl-attribute">id</span>=<span class="hl-value">"itemReader"</span> <span class="hl-attribute">scope</span>=<span class="hl-value">"step"</span>
      <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...MultiResourceItemReader"</span><span class="hl-tag">&gt;</span>
    <span class="hl-tag">&lt;property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"resource"</span> <span class="hl-attribute">value</span>=<span class="hl-value">"</span><span class="bold"><strong>#{stepExecutionContext[fileName]}/*</strong></span>"/&gt;
<span class="hl-tag">&lt;/bean&gt;</span></pre></div></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="readersAndWriters.html">Prev</a>&nbsp;</td><td width="20%" align="center">&nbsp;</td><td width="40%" align="right">&nbsp;<a accesskey="n" href="repeat.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">6.&nbsp;ItemReaders and ItemWriters&nbsp;</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">&nbsp;8.&nbsp;Repeat</td></tr></table></div></body></html>