<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>7. Scaling and Parallel Processing</title><link rel="stylesheet" type="text/css" href="css/manual-multipage.css"><meta name="generator" content="DocBook XSL Stylesheets V1.78.1"><link rel="home" href="index.html" title="Spring Batch - Reference Documentation"><link rel="up" href="index.html" title="Spring Batch - Reference Documentation"><link rel="prev" href="readersAndWriters.html" title="6. ItemReaders and ItemWriters"><link rel="next" href="repeat.html" title="8. Repeat"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">7. Scaling and Parallel Processing</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="readersAndWriters.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="repeat.html">Next</a></td></tr></table><hr></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="scalability" href="#scalability"></a>7. Scaling and Parallel Processing</h1></div></div></div><p>Many batch processing problems can be solved with single threaded,
single process jobs, so it is always a good idea to check whether that
meets your needs before thinking about more complex implementations. Measure
the performance of a realistic job and see if the simplest implementation
meets your needs first: you can read and write a file of several hundred
megabytes in well under a minute, even with standard hardware.</p><p>When you are ready to start implementing a job with some parallel
processing, Spring Batch offers a range of options, which are described in
this chapter, although some features are covered elsewhere. At a high level
there are two modes of parallel processing: single process, multi-threaded;
and multi-process. These break down further into the following
categories:</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>Multi-threaded Step (single process)</p></li><li class="listitem"><p>Parallel Steps (single process)</p></li><li class="listitem"><p>Remote Chunking of Step (multi process)</p></li><li class="listitem"><p>Partitioning a Step (single or multi process)</p></li></ul></div><p>We review the single-process options first, and then the
multi-process options.</p><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="multithreadedStep" href="#multithreadedStep"></a>7.1 Multi-threaded Step</h2></div></div></div><p>The simplest way to start parallel processing is to add a
<code class="classname">TaskExecutor</code> to your Step configuration, e.g. as an
attribute of the <code class="literal">tasklet</code>:</p><pre class="programlisting"><span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"loading"</span><span class="hl-tag">></span>
    <span class="hl-tag"><tasklet</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">></span>...<span class="hl-tag"></tasklet></span>
<span class="hl-tag"></step></span></pre><p>In this example the taskExecutor is a reference to another bean
definition, implementing the <code class="classname">TaskExecutor</code>
interface. <code class="classname">TaskExecutor</code> is a standard Spring
interface, so consult the Spring User Guide for details of available
implementations. The simplest multi-threaded
<code class="classname">TaskExecutor</code> is a
<code class="classname">SimpleAsyncTaskExecutor</code>.</p><p>The result of the above configuration will be that the Step
executes by reading, processing and writing each chunk of items
(each commit interval) in a separate thread of execution. Note
that this means there is no fixed order for the items to be
processed, and a chunk might contain items that are
non-consecutive compared to the single-threaded case. In addition
to any limits placed by the task executor (e.g. if it is backed by
a thread pool), there is a throttle limit in the tasklet
configuration which defaults to 4. You may need to increase this
to ensure that a thread pool is fully utilised, e.g.</p><pre class="programlisting"><span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"loading"</span><span class="hl-tag">></span>
    <span class="hl-tag"><tasklet</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span>
             <span class="hl-attribute">throttle-limit</span>=<span class="hl-value">"20"</span><span class="hl-tag">></span>...<span class="hl-tag"></tasklet></span>
<span class="hl-tag"></step></span></pre><p>Note also that there may be limits placed on concurrency by
any pooled resources used in your step, such as
a <code class="classname">DataSource</code>. Be sure to make the pool in
those resources at least as large as the desired number of
concurrent threads in the step.</p><p>There are some practical limitations of using multi-threaded Steps
for some common Batch use cases. Many participants in a Step (e.g. readers
and writers) are stateful, and if the state is not segregated by thread,
then those components are not usable in a multi-threaded Step. In
particular, most of the off-the-shelf readers and writers from Spring Batch
are not designed for multi-threaded use. It is, however, possible to work
with stateless or thread safe readers and writers, and there is a sample
(parallelJob) in the Spring Batch Samples that shows the use of a process
indicator (see <a class="xref" href="readersAndWriters.html#process-indicator" title="6.12 Preventing State Persistence">Section 6.12, “Preventing State Persistence”</a>) to keep
track of items that have been processed in a database input table.</p><p>Spring Batch provides some implementations of
<code class="classname">ItemWriter</code> and
<code class="classname">ItemReader</code>. Usually the Javadocs say
whether they are thread safe or not, or what you have to do to
avoid problems in a concurrent environment. If there is no
information in the Javadocs, you can check the implementation to see
if there is any state. If a reader is not thread safe, it may
still be efficient to use it in your own synchronizing delegator.
You can synchronize the call to <code class="literal">read()</code> and, as
long as the processing and writing is the most expensive part of
the chunk, your step may still complete much faster than in a
single threaded configuration.
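</p><p>As a minimal sketch of such a synchronizing delegator, the following self-contained Java example wraps a stateful reader so that only one thread at a time can call <code class="literal">read()</code>. To keep it runnable without Spring Batch on the classpath it uses a hypothetical <code class="classname">SimpleReader</code> interface in place of <code class="classname">ItemReader</code>; the names <code class="classname">SimpleReader</code>, <code class="classname">ListReader</code>, <code class="classname">SynchronizedReader</code> and <code class="literal">drainConcurrently</code> are illustrative assumptions, not framework classes.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SynchronizedReaderDemo {

    /** Hypothetical stand-in for a reader contract: null signals that the input is exhausted. */
    interface SimpleReader<T> {
        T read() throws Exception;
    }

    /** A stateful reader that is not thread safe: the iterator is shared mutable state. */
    static class ListReader<T> implements SimpleReader<T> {
        private final Iterator<T> iterator;

        ListReader(List<T> items) {
            this.iterator = items.iterator();
        }

        public T read() {
            return iterator.hasNext() ? iterator.next() : null;
        }
    }

    /** Delegator that serializes read(), so only one thread touches the delegate at a time. */
    static class SynchronizedReader<T> implements SimpleReader<T> {
        private final SimpleReader<T> delegate;

        SynchronizedReader(SimpleReader<T> delegate) {
            this.delegate = delegate;
        }

        public synchronized T read() throws Exception {
            return delegate.read();
        }
    }

    /** Drains the reader from several threads and returns every item that was read. */
    static <T> List<T> drainConcurrently(SimpleReader<T> reader, int threads) throws Exception {
        List<T> seen = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                T item;
                while ((item = reader.read()) != null) {
                    seen.add(item); // stands in for the expensive process/write phase
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(i);
        }
        SimpleReader<Integer> reader = new SynchronizedReader<>(new ListReader<>(input));
        List<Integer> seen = drainConcurrently(reader, 4);
        System.out.println("items read: " + seen.size());
    }
}
```

<p>In a real step the delegator would implement <code class="classname">ItemReader</code> and wrap the actual reader. Only <code class="literal">read()</code> is serialized here, so any restart state kept by the delegate is outside the scope of this sketch.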
</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="scalabilityParallelSteps" href="#scalabilityParallelSteps"></a>7.2 Parallel Steps</h2></div></div></div><p>As long as the application logic that needs to be parallelized can
be split into distinct responsibilities and assigned to individual steps,
it can be parallelized in a single process. Parallel Step execution
is easy to configure and use. For example, to execute steps
<code class="literal">(step1,step2)</code> in parallel with
<code class="literal">step3</code>, you could configure a flow like this:</p><pre class="programlisting"><span class="hl-tag"><job</span> <span class="hl-attribute">id</span>=<span class="hl-value">"job1"</span><span class="hl-tag">></span>
    <span class="hl-tag"><split</span> <span class="hl-attribute">id</span>=<span class="hl-value">"split1"</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">next</span>=<span class="hl-value">"step4"</span><span class="hl-tag">></span>
        <span class="hl-tag"><flow></span>
            <span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s1"</span> <span class="hl-attribute">next</span>=<span class="hl-value">"step2"</span><span class="hl-tag">/></span>
            <span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step2"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s2"</span><span class="hl-tag">/></span>
        <span class="hl-tag"></flow></span>
        <span class="hl-tag"><flow></span>
            <span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step3"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s3"</span><span class="hl-tag">/></span>
        <span class="hl-tag"></flow></span>
    <span class="hl-tag"></split></span>
    <span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step4"</span> <span class="hl-attribute">parent</span>=<span class="hl-value">"s4"</span><span class="hl-tag">/></span>
<span class="hl-tag"></job></span>

<span class="hl-tag"><beans:bean</span> <span class="hl-attribute">id</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...SimpleAsyncTaskExecutor"</span><span class="hl-tag">/></span></pre><p>The configurable "task-executor" attribute is used to specify which
TaskExecutor implementation should be used to execute the individual
flows. The default is <code class="classname">SyncTaskExecutor</code>, but an
asynchronous TaskExecutor is required to run the steps in parallel. Note
that the job will ensure that every flow in the split completes before
aggregating the exit statuses and transitioning.</p><p>See <a class="xref" href="configureStep.html#split-flows" title="5.3.5 Split Flows">Section 5.3.5, “Split Flows”</a> for more
detail.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="remoteChunking" href="#remoteChunking"></a>7.3 Remote Chunking</h2></div></div></div><p>In Remote Chunking the Step processing is split across multiple
processes, communicating with each other through some middleware. Here is
a picture of the pattern in action:</p><div class="mediaobject" align="center"><img src="images/remote-chunking.png" align="middle"></div><p>The Master component is a single process, and the Slaves are
multiple remote processes. Clearly this pattern works best if the Master
is not a bottleneck, so the processing must be more expensive than the
reading of items (this is often the case in practice).</p><p>The Master is just an implementation of a Spring Batch
<code class="classname">Step</code>, with the ItemWriter replaced with a generic
version that knows how to send chunks of items to the middleware as
messages. The Slaves are standard listeners for whatever middleware is
being used (e.g. with JMS they would be
<code class="classname">MessageListener</code> implementations), and their role is to process
the chunks of items using a standard <code class="classname">ItemWriter</code> or
<code class="classname">ItemProcessor</code> plus
<code class="classname">ItemWriter</code>, through the
<code class="classname">ChunkProcessor</code> interface. One of the advantages of
using this pattern is that the reader, processor and writer components are
off-the-shelf (the same as would be used for a local execution of the
step). The items are divided up dynamically and work is shared through the
middleware, so if the listeners are all eager consumers, then load
balancing is automatic.</p><p>The middleware has to be durable, with guaranteed delivery and
a single consumer for each message. JMS is the obvious candidate, but other
options exist in the grid computing and shared memory product space (e.g.
Java Spaces).</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="partitioning" href="#partitioning"></a>7.4 Partitioning</h2></div></div></div><p>Spring Batch also provides an SPI for partitioning a Step execution
and executing it remotely. In this case the remote participants are simply
Step instances that could just as easily have been configured and used for
local processing. Here is a picture of the pattern in action:</p><div class="mediaobject" align="center"><img src="images/partitioning-overview.png" align="middle"></div><p>The Job is executing on the left hand side as a sequence of Steps,
and one of the Steps is labelled as a Master. The Slaves in this picture
are all identical instances of a Step, which could in fact take the place
of the Master resulting in the same outcome for the Job. The Slaves are
typically going to be remote services, but could also be local threads of
execution. The messages sent by the Master to the Slaves in this pattern
do not need to be durable, or have guaranteed delivery: Spring Batch
meta-data in the <code class="classname">JobRepository</code> will ensure that
each Slave is executed once and only once for each Job execution.</p><p>The SPI in Spring Batch consists of a special implementation of Step
(the <code class="classname">PartitionStep</code>), and two strategy interfaces
that need to be implemented for the specific environment. The strategy
interfaces are <code class="classname">PartitionHandler</code> and
<code class="classname">StepExecutionSplitter</code>, and their roles are shown in
the sequence diagram below:</p><div class="mediaobject" align="center"><img src="images/partitioning-spi.png" align="middle"></div><p>The Step on the right in this case is the "remote" Slave, so
potentially there are many objects and/or processes playing this role, and
the PartitionStep is shown driving the execution. The PartitionStep
configuration looks like this:</p><pre class="programlisting"><span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1.master"</span><span class="hl-tag">></span>
    <span class="hl-tag"><partition</span> <span class="hl-attribute">step</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">partitioner</span>=<span class="hl-value">"partitioner"</span><span class="hl-tag">></span>
        <span class="hl-tag"><handler</span> <span class="hl-attribute">grid-size</span>=<span class="hl-value">"10"</span> <span class="hl-attribute">task-executor</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">/></span>
    <span class="hl-tag"></partition></span>
<span class="hl-tag"></step></span></pre><p>Similar to the multi-threaded step's throttle-limit
attribute, the grid-size attribute prevents the task executor from
being saturated with requests from a single step.</p><p>There is a simple example which can be copied and extended in the
unit test suite for Spring Batch Samples (see
<code class="classname">*PartitionJob.xml</code> configuration).</p><p>Spring Batch creates step executions for the partitions called
"step1:partition0", etc., so many people prefer to call the master step
"step1:master" for consistency. With Spring 3.0 you can do this using an
alias for the step (specifying the <code class="literal">name</code> attribute
instead of the <code class="literal">id</code>). </p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="partitionHandler" href="#partitionHandler"></a>7.4.1 PartitionHandler</h3></div></div></div><p>The <code class="classname">PartitionHandler</code> is the component that
knows about the fabric of the remoting or grid environment. It is able
to send <code class="classname">StepExecution</code> requests to the remote
Steps, wrapped in some fabric-specific format, like a DTO. It does not
have to know how to split up the input data, or how to aggregate the
result of multiple Step executions. Generally speaking it probably also
doesn't need to know about resilience or failover, since those are
features of the fabric in many cases, and anyway Spring Batch always
provides restartability independent of the fabric: a failed Job can
always be restarted and only the failed Steps will be
re-executed.</p><p>The <code class="classname">PartitionHandler</code> interface can have
specialized implementations for a variety of fabric types: e.g. simple
RMI remoting, EJB remoting, custom web service, JMS, Java Spaces, shared
memory grids (like Terracotta or Coherence), grid execution fabrics
(like GridGain). Spring Batch does not contain implementations for any
proprietary grid or remoting fabrics.</p><p>Spring Batch does however provide a useful implementation of
<code class="classname">PartitionHandler</code> that executes Steps locally in
separate threads of execution, using the
<code class="classname">TaskExecutor</code> strategy from Spring. The
implementation is called
<code class="classname">TaskExecutorPartitionHandler</code>, and it is the
default for a step configured with the XML namespace as above. It can
also be configured explicitly like this:</p><pre class="programlisting"><span class="hl-tag"><step</span> <span class="hl-attribute">id</span>=<span class="hl-value">"step1.master"</span><span class="hl-tag">></span>
    <span class="hl-tag"><partition</span> <span class="hl-attribute">step</span>=<span class="hl-value">"step1"</span> <span class="hl-attribute">handler</span>=<span class="hl-value">"handler"</span><span class="hl-tag">/></span>
<span class="hl-tag"></step></span>

<span class="hl-tag"><bean</span> <span class="hl-attribute">id</span>=<span class="hl-value">"handler"</span> <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...TaskExecutorPartitionHandler"</span><span class="hl-tag">></span>
    <span class="hl-tag"><property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"taskExecutor"</span> <span class="hl-attribute">ref</span>=<span class="hl-value">"taskExecutor"</span><span class="hl-tag">/></span>
    <span class="hl-tag"><property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"step"</span> <span class="hl-attribute">ref</span>=<span class="hl-value">"step1"</span><span class="hl-tag"> /></span>
    <span class="hl-tag"><property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"gridSize"</span> <span class="hl-attribute">value</span>=<span class="hl-value">"10"</span><span class="hl-tag"> /></span>
<span class="hl-tag"></bean></span></pre><p>The <code class="literal">gridSize</code> determines the number of separate
step executions to create, so it can be matched to the size of the
thread pool in the <code class="classname">TaskExecutor</code>, or else it can
be set to be larger than the number of threads available, in which case
the blocks of work are smaller.</p><p>The <code class="classname">TaskExecutorPartitionHandler</code> is quite
useful for IO intensive Steps, like copying large numbers of files or
replicating filesystems into content management systems. It can also be
used for remote execution by providing a Step implementation that is a
proxy for a remote invocation (e.g. using Spring Remoting).</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="stepExecutionSplitter" href="#stepExecutionSplitter"></a>7.4.2 Partitioner</h3></div></div></div><p>The Partitioner has a simpler responsibility: to generate
execution contexts as input parameters for new step executions only (no
need to worry about restarts). It has a single method:</p><pre class="programlisting"><span class="hl-keyword">public</span> <span class="hl-keyword">interface</span> Partitioner {
    Map<String, ExecutionContext> partition(<span class="hl-keyword">int</span> gridSize);
}</pre><p>The return value from this method associates a unique name for
each step execution (the <code class="classname">String</code>), with input
parameters in the form of an <code class="classname">ExecutionContext</code>.
The names show up later in the Batch meta data as the step name in the
partitioned <code class="classname">StepExecutions</code>. The
<code class="classname">ExecutionContext</code> is just a bag of name-value
pairs, so it might contain a range of primary keys, or line numbers, or
the location of an input file. The remote <code class="classname">Step</code>
then normally binds to the context input using <code class="literal">#{...}</code>
placeholders (late binding in step scope), as illustrated in the next
section.</p><p>The names of the step executions (the keys in the
<code class="classname">Map</code> returned by
<code class="classname">Partitioner</code>) need to be unique amongst the step
executions of a Job, but do not have any other specific requirements.
The easiest way to do this, and to make the names meaningful for users,
is to use a prefix+suffix naming convention, where the prefix is the
name of the step that is being executed (which itself is unique in the
<code class="classname">Job</code>), and the suffix is just a counter. There is
a <code class="classname">SimplePartitioner</code> in the framework that uses
this convention.</p><p>An optional interface
<code class="classname">PartitionNameProvider</code> can be used to
provide the partition names separately from the partitions
themselves. If a <code class="classname">Partitioner</code> implements
this interface then on a restart only the names will be queried.
If partitioning is expensive this can be a useful optimisation.
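</p><p>The naming convention and the split between names and partitions can be sketched in plain Java. This is only an illustration under stated assumptions: a <code class="classname">Map</code> of name-value pairs stands in for <code class="classname">ExecutionContext</code> so that the example runs without Spring Batch, and the class name <code class="classname">RangePartitionerSketch</code>, the keys <code class="literal">minId</code> and <code class="literal">maxId</code>, and the even split of a primary key range are hypothetical choices, not framework behaviour.</p>

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Maps stand in for ExecutionContext so this runs without Spring Batch.
class RangePartitionerSketch {

    private final String stepName;
    private final long minId;
    private final long maxId;

    RangePartitionerSketch(String stepName, long minId, long maxId) {
        if (maxId < minId) {
            throw new IllegalArgumentException("maxId must not be smaller than minId");
        }
        this.stepName = stepName;
        this.minId = minId;
        this.maxId = maxId;
    }

    /** Cheap to compute: only the names, using the prefix + counter convention. */
    Collection<String> getPartitionNames(int gridSize) {
        if (gridSize < 1) {
            throw new IllegalArgumentException("gridSize must be at least 1");
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < gridSize; i++) {
            names.add(stepName + ":partition" + i);
        }
        return names;
    }

    /** Maps each partition name to the key range that partition should process. */
    Map<String, Map<String, Object>> partition(int gridSize) {
        Collection<String> names = getPartitionNames(gridSize); // also validates gridSize
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, Map<String, Object>> partitions = new LinkedHashMap<>();
        long start = minId;
        for (String name : names) {
            // Partitions beyond the end of the data get an empty range (minId > maxId).
            long end = Math.min(start + size - 1, maxId);
            Map<String, Object> context = new LinkedHashMap<>();
            context.put("minId", start);
            context.put("maxId", end);
            partitions.put(name, context);
            start = end + 1;
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(new RangePartitionerSketch("step1", 1, 10).partition(3));
    }
}
```

<p>With keys 1 to 10 and a grid size of 3 the sketch yields the ranges 1 to 4, 5 to 8 and 9 to 10. Because <code class="literal">partition</code> takes its keys from <code class="literal">getPartitionNames</code>, the two always agree.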
The names provided by the
<code class="classname">PartitionNameProvider</code> must match those
provided by the <code class="classname">Partitioner</code>.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="bindingInputDataToSteps" href="#bindingInputDataToSteps"></a>7.4.3 Binding Input Data to Steps</h3></div></div></div><p>It is very efficient for the steps that are executed by the
PartitionHandler to have identical configuration, and for their input
parameters to be bound at runtime from the ExecutionContext. This is
easy to do with the StepScope feature of Spring Batch (covered in more
detail in the section on <a class="xref" href="configureStep.html#late-binding" title="5.4 Late Binding of Job and Step Attributes">Late Binding</a>). For example,
if the <code class="classname">Partitioner</code> creates
<code class="classname">ExecutionContext</code> instances with an attribute key
<code class="literal">fileName</code>, pointing to a different file (or
directory) for each step invocation, the
<code class="classname">Partitioner</code> output might look like this:</p><div class="table"><a name="d5e3165" href="#d5e3165"></a><p class="title"><b>Table 7.1. Example step execution name to execution context provided by
Partitioner targeting directory processing</b></p><div class="table-contents"><table summary="Example step execution name to execution context provided by
 Partitioner targeting directory processing" style="border-collapse: collapse;border-top: 0.5pt solid ; border-bottom: 0.5pt solid ; border-left: 0.5pt solid ; border-right: 0.5pt solid ; "><colgroup><col><col></colgroup><tbody><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; "><span class="bold"><strong>Step Execution Name
(key)</strong></span></td><td style="border-bottom: 0.5pt solid ; "><span class="bold"><strong>ExecutionContext
(value)</strong></span></td></tr><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; ">filecopy:partition0</td><td style="border-bottom: 0.5pt solid ; ">fileName=/home/data/one</td></tr><tr><td style="border-right: 0.5pt solid ; border-bottom: 0.5pt solid ; ">filecopy:partition1</td><td style="border-bottom: 0.5pt solid ; ">fileName=/home/data/two</td></tr><tr><td style="border-right: 0.5pt solid ; ">filecopy:partition2</td><td style="">fileName=/home/data/three</td></tr></tbody></table></div></div><br class="table-break"><p>Then the file name can be bound to a step using late binding to
the execution context:</p><pre class="programlisting"><span class="hl-tag"><bean</span> <span class="hl-attribute">id</span>=<span class="hl-value">"itemReader"</span> <span class="hl-attribute">scope</span>=<span class="hl-value">"step"</span>
      <span class="hl-attribute">class</span>=<span class="hl-value">"org.spr...MultiResourceItemReader"</span><span class="hl-tag">></span>
    <span class="hl-tag"><property</span> <span class="hl-attribute">name</span>=<span class="hl-value">"resources"</span> <span class="hl-attribute">value</span>=<span class="hl-value">"</span><span class="bold"><strong>#{stepExecutionContext[fileName]}/*</strong></span>"/>
<span class="hl-tag"></bean></span></pre></div></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="readersAndWriters.html">Prev</a> </td><td width="20%" align="center"> </td><td width="40%" align="right"> <a accesskey="n" href="repeat.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">6. ItemReaders and ItemWriters </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 8. Repeat</td></tr></table></div></body></html> |