<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE appendix PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd">
<appendix id="metaDataSchema">
  <title>Meta-Data Schema</title>

  <section id="metaDataSchemaOverview">
    <title>Overview</title>

    <para>The Spring Batch Meta-Data tables closely match the domain
    objects that represent them in Java. For example,
    <classname>JobInstance</classname>, <classname>JobExecution</classname>,
    <classname>JobParameters</classname>, and
    <classname>StepExecution</classname> map to BATCH_JOB_INSTANCE,
    BATCH_JOB_EXECUTION, BATCH_JOB_EXECUTION_PARAMS, and BATCH_STEP_EXECUTION,
    respectively. <classname>ExecutionContext</classname> maps to both
    BATCH_JOB_EXECUTION_CONTEXT and BATCH_STEP_EXECUTION_CONTEXT. The
    <classname>JobRepository</classname> is responsible for saving
    each Java object into its correct table. This appendix describes
    the meta-data tables in detail, along with many of the design decisions
    that were made when creating them. The data types used in the table
    creation statements below are as generic as possible. Spring Batch
    provides many schemas as examples, whose data types vary because
    individual database vendors handle data types differently. Below is an
    ERD model of all six tables and their relationships to one another:</para>

    <mediaobject>
      <imageobject role="html">
        <imagedata align="center" fileref="images/meta-data-erd.png"
                   scale="100" />
      </imageobject>

      <imageobject role="fo">
        <imagedata align="center" fileref="images/meta-data-erd.png"
                   scale="45" />
      </imageobject>
    </mediaobject>

    <section id="exampleDDLScripts">
      <title>Example DDL Scripts</title>

      <para>The Spring Batch Core JAR file contains example
      scripts to create the relational tables for a number of database
      platforms (which are in turn auto-detected by the job repository factory
      bean or namespace equivalent). These scripts can be used as is, or
      modified with additional indexes and constraints as desired. The file
      names are in the form <literal>schema-*.sql</literal>, where "*" is the
      short name of the target database platform.  The scripts are in
	  the package <literal>org.springframework.batch.core</literal>.</para>
    </section>

    <section id="metaDataVersion">
      <title>Version</title>

      <para>Many of the database tables discussed in this appendix contain a
      version column. This column is important because Spring Batch employs an
      optimistic locking strategy when dealing with updates to the database.
      This means that each time a record is 'touched' (updated) the value in
      the version column is incremented by one. When the repository goes back
      to save the value, if the version number has changed it
      throws <classname>OptimisticLockingFailureException</classname>,
      indicating there has been an error with concurrent access. This check is
      necessary since, even though different batch jobs may be running on
      different machines, they are all using the same database tables.</para>
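
      <para>As a minimal sketch of how such a check can be expressed in SQL
      (the literal values are hypothetical, and the statements issued by the
      framework may differ), an update names the version it last read and
      increments it in the same statement:</para>

      <programlisting language="sql">-- Matches only if the row still carries the version that was read (3 here).
UPDATE BATCH_JOB_EXECUTION
   SET STATUS = 'COMPLETED',
       VERSION = VERSION + 1
 WHERE JOB_EXECUTION_ID = 42
   AND VERSION = 3;</programlisting>

      <para>If another process has already updated the row, no rows match and
      the update count is zero, which the caller can treat as a concurrent
      modification.</para>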
    </section>

    <section id="metaDataIdentity">
      <title>Identity</title>

      <para>BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, and BATCH_STEP_EXECUTION
      each contain columns ending in _ID. These fields act as primary keys for
      their respective tables. However, they are not database-generated keys;
      they are generated by separate sequences. This is necessary
      because, after inserting one of the domain objects into the database, the
      key it is given needs to be set on the actual object so that it can be
      uniquely identified in Java. Newer database drivers (JDBC 3.0 and up)
      support this feature with database-generated keys, but rather than
      requiring it, sequences were used. Each variation of the schema
      contains some form of the following:</para>

      <programlisting language="sql">CREATE SEQUENCE BATCH_STEP_EXECUTION_SEQ;
CREATE SEQUENCE BATCH_JOB_EXECUTION_SEQ;
CREATE SEQUENCE BATCH_JOB_SEQ;</programlisting>

      <para>Many database vendors don't support sequences. In these cases,
      work-arounds are used, such as the following for MySQL:</para>

      <programlisting language="sql">CREATE TABLE BATCH_STEP_EXECUTION_SEQ (ID BIGINT NOT NULL) ENGINE=InnoDB;
INSERT INTO BATCH_STEP_EXECUTION_SEQ values(0);
CREATE TABLE BATCH_JOB_EXECUTION_SEQ (ID BIGINT NOT NULL) ENGINE=InnoDB;
INSERT INTO BATCH_JOB_EXECUTION_SEQ values(0);
CREATE TABLE BATCH_JOB_SEQ (ID BIGINT NOT NULL) ENGINE=InnoDB;
INSERT INTO BATCH_JOB_SEQ values(0);</programlisting>

      <para>In the above case, a table is used in place of each sequence. The
      Spring core class <classname>MySQLMaxValueIncrementer</classname> will
      then increment the one column in this table in order to give similar
      functionality.</para>
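
      <para>The increment can be sketched roughly as follows, assuming the
      usual MySQL <literal>LAST_INSERT_ID</literal> idiom. The exact
      statements issued by the incrementer may differ (for example, if it
      reserves a block of values at a time):</para>

      <programlisting language="sql">-- Bump the single row and remember the new value for this connection.
UPDATE BATCH_JOB_SEQ SET ID = LAST_INSERT_ID(ID + 1);
-- Read the remembered value back; it is tracked per connection.
SELECT LAST_INSERT_ID();</programlisting>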
    </section>
  </section>

  <section id="metaDataBatchJobInstance">
    <title>BATCH_JOB_INSTANCE</title>

    <para>The BATCH_JOB_INSTANCE table holds all information relevant to a
    <classname>JobInstance</classname>, and serves as the top of the overall
    hierarchy. The following generic DDL statement is used to create
    it:</para>

    <programlisting language="sql">CREATE TABLE BATCH_JOB_INSTANCE  (
  JOB_INSTANCE_ID BIGINT  PRIMARY KEY ,
  VERSION BIGINT,
  JOB_NAME VARCHAR(100) NOT NULL ,
  JOB_KEY VARCHAR(2500)
);</programlisting>

    <para>Below are descriptions of each column in the table:</para>

    <itemizedlist>
      <listitem>
        <para>JOB_INSTANCE_ID: The unique id that will identify the instance,
        which is also the primary key. The value of this column should be
        obtainable by calling the <methodname>getId</methodname> method on
        <classname>JobInstance</classname>.</para>
      </listitem>

      <listitem>
        <para>VERSION: See the <link linkend="metaDataVersion">Version</link> section.</para>
      </listitem>

      <listitem>
        <para>JOB_NAME: Name of the job obtained from the
        <classname>Job</classname> object. Because it is required to identify
        the instance, it must not be null.</para>
      </listitem>

      <listitem>
        <para>JOB_KEY: A serialization of the
        <classname>JobParameters</classname> that uniquely identifies separate
        instances of the same job from one another.
        (<classname>JobInstances</classname> with the same job name must have
        different <classname>JobParameters</classname>, and thus, different
        JOB_KEY values).</para>
      </listitem>
    </itemizedlist>
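
    <para>As a sketch of how these columns work together, an existing
    instance can be looked up by its name and key. This mirrors the WHERE
    clause listed for this table in the indexing recommendations at the end
    of this appendix:</para>

    <programlisting language="sql">-- Returns a row only if an instance with this name and key already exists.
SELECT JOB_INSTANCE_ID, JOB_NAME
  FROM BATCH_JOB_INSTANCE
 WHERE JOB_NAME = ?
   AND JOB_KEY = ?;</programlisting>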
  </section>

  <section id="metaDataBatchJobParams">
    <title>BATCH_JOB_EXECUTION_PARAMS</title>

    <para>The BATCH_JOB_EXECUTION_PARAMS table holds all information relevant to the
    <classname>JobParameters</classname> object. It contains 0 or more
    key/value pairs passed to a <classname>Job</classname> and serves as a record of the parameters
    a job was run with. For each parameter that contributes to the generation of a job's identity,
    the IDENTIFYING flag is set to true. Note that the table has been
    denormalized. Rather than creating a separate table for each type, there
    is one table with a column indicating the type:</para>

    <programlisting language="sql">CREATE TABLE BATCH_JOB_EXECUTION_PARAMS  (
	JOB_EXECUTION_ID BIGINT NOT NULL ,
	TYPE_CD VARCHAR(6) NOT NULL ,
	KEY_NAME VARCHAR(100) NOT NULL ,
	STRING_VAL VARCHAR(250) ,
	DATE_VAL DATETIME DEFAULT NULL ,
	LONG_VAL BIGINT ,
	DOUBLE_VAL DOUBLE PRECISION ,
	IDENTIFYING CHAR(1) NOT NULL ,
	constraint JOB_EXEC_PARAMS_FK foreign key (JOB_EXECUTION_ID)
	references BATCH_JOB_EXECUTION(JOB_EXECUTION_ID)
);</programlisting>

    <para>Below are descriptions for each column:</para>

    <itemizedlist>
      <listitem>
        <para>JOB_EXECUTION_ID: Foreign key from the BATCH_JOB_EXECUTION table
        that indicates the job execution the parameter entry belongs to. Note
        that multiple rows (i.e., key/value pairs) may exist for
        each execution.</para>
      </listitem>

      <listitem>
        <para>TYPE_CD: String representation of the type of value stored,
        which can be either a string, date, long, or double. Because the type
        must be known, it cannot be null.</para>
      </listitem>

      <listitem>
        <para>KEY_NAME: The parameter key.</para>
      </listitem>

      <listitem>
        <para>STRING_VAL: Parameter value, if the type is string.</para>
      </listitem>

      <listitem>
        <para>DATE_VAL: Parameter value, if the type is date.</para>
      </listitem>

      <listitem>
        <para>LONG_VAL: Parameter value, if the type is a long.</para>
      </listitem>

      <listitem>
        <para>DOUBLE_VAL: Parameter value, if the type is double.</para>
      </listitem>

      <listitem>
        <para>IDENTIFYING: Flag indicating if the parameter contributed to the identity of the related <classname>JobInstance</classname>.</para>
      </listitem>
    </itemizedlist>
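
    <para>To illustrate the denormalized layout, the following sketch stores
    two parameters for one execution. The execution id, key names, and values
    are hypothetical, and a matching BATCH_JOB_EXECUTION row is assumed to
    exist. The <literal>TYPE_CD</literal> codes and the
    <literal>'Y'</literal>/<literal>'N'</literal> encoding of the IDENTIFYING
    flag are also assumptions:</para>

    <programlisting language="sql">-- A string parameter: only STRING_VAL is populated.
INSERT INTO BATCH_JOB_EXECUTION_PARAMS
  (JOB_EXECUTION_ID, TYPE_CD, KEY_NAME, STRING_VAL, IDENTIFYING)
VALUES (1, 'STRING', 'input.file', 'orders.csv', 'Y');

-- A long parameter: only LONG_VAL is populated.
INSERT INTO BATCH_JOB_EXECUTION_PARAMS
  (JOB_EXECUTION_ID, TYPE_CD, KEY_NAME, LONG_VAL, IDENTIFYING)
VALUES (1, 'LONG', 'run.id', 7, 'N');</programlisting>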

    <para>It is worth noting that there is no primary key for this table. This
    is simply because the framework has no use for one, and thus doesn't
    require it. If a user so chooses, one may be added with a database
    generated key, without causing any issues to the framework itself.</para>
  </section>

  <section id="metaDataBatchJobExecution">
    <title>BATCH_JOB_EXECUTION</title>

    <para>The BATCH_JOB_EXECUTION table holds all information relevant to the
    <classname>JobExecution</classname> object. Every time a
    <classname>Job</classname> is run there will always be a new
    <classname>JobExecution</classname>, and a new row in this table:</para>

    <programlisting language="sql">CREATE TABLE BATCH_JOB_EXECUTION  (
  JOB_EXECUTION_ID BIGINT  PRIMARY KEY ,
  VERSION BIGINT,
  JOB_INSTANCE_ID BIGINT NOT NULL,
  CREATE_TIME TIMESTAMP NOT NULL,
  START_TIME TIMESTAMP DEFAULT NULL,
  END_TIME TIMESTAMP DEFAULT NULL,
  STATUS VARCHAR(10),
  EXIT_CODE VARCHAR(20),
  EXIT_MESSAGE VARCHAR(2500),
  LAST_UPDATED TIMESTAMP,
  JOB_CONFIGURATION_LOCATION VARCHAR(2500) NULL,
  constraint JOB_INSTANCE_EXECUTION_FK foreign key (JOB_INSTANCE_ID)
  references BATCH_JOB_INSTANCE(JOB_INSTANCE_ID)
) ;</programlisting>

    <para>Below are descriptions for each column:</para>

    <itemizedlist>
      <listitem>
        <para>JOB_EXECUTION_ID: Primary key that uniquely identifies this
        execution. The value of this column is obtainable by calling the
        <methodname>getId</methodname> method of the
        <classname>JobExecution</classname> object.</para>
      </listitem>

      <listitem>
        <para>VERSION: See the <link linkend="metaDataVersion">Version</link> section.</para>
      </listitem>

      <listitem>
        <para>JOB_INSTANCE_ID: Foreign key from the BATCH_JOB_INSTANCE table
        indicating the instance to which this execution belongs. There may be
        more than one execution per instance.</para>
      </listitem>

      <listitem>
        <para>CREATE_TIME: Timestamp representing the time that the execution
        was created.</para>
      </listitem>

      <listitem>
        <para>START_TIME: Timestamp representing the time the execution was
        started.</para>
      </listitem>

      <listitem>
        <para>END_TIME: Timestamp representing the time the execution was
        finished, regardless of success or failure. An empty value in this
        column even though the job is not currently running indicates that
        there has been some type of error and the framework was unable to
        perform a last save before failing.</para>
      </listitem>

      <listitem>
        <para>STATUS: Character string representing the status of the
        execution. This may be COMPLETED, STARTED, etc. The object
        representation of this column is the
        <classname>BatchStatus</classname> enumeration.</para>
      </listitem>

      <listitem>
        <para>EXIT_CODE: Character string representing the exit code of the
        execution. In the case of a command line job, this may be converted
        into a number.</para>
      </listitem>

      <listitem>
        <para>EXIT_MESSAGE: Character string representing a more detailed
        description of how the job exited. In the case of failure, this might
        include as much of the stack trace as is possible.</para>
      </listitem>

      <listitem>
        <para>LAST_UPDATED: Timestamp representing the last time this
        execution was persisted.</para>
      </listitem>
    </itemizedlist>
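
    <para>Based on the END_TIME description above, executions that may have
    ended without a final save can be listed with a query along these lines.
    This is a sketch: whether a row is really abandoned still has to be
    checked against what is actually running:</para>

    <programlisting language="sql">-- Executions with no recorded end time, oldest first.
SELECT JOB_EXECUTION_ID, JOB_INSTANCE_ID, STATUS, START_TIME, LAST_UPDATED
  FROM BATCH_JOB_EXECUTION
 WHERE END_TIME IS NULL
 ORDER BY CREATE_TIME;</programlisting>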
  </section>

  <section id="metaDataBatchStepExecution">
    <title>BATCH_STEP_EXECUTION</title>

    <para>The BATCH_STEP_EXECUTION table holds all information relevant to the
    <classname>StepExecution</classname> object. This table is very similar in
    many ways to the BATCH_JOB_EXECUTION table and there will always be at
    least one entry per <classname>Step</classname> for each
    <classname>JobExecution</classname> created:</para>

    <programlisting language="sql">CREATE TABLE BATCH_STEP_EXECUTION  (
  STEP_EXECUTION_ID BIGINT  PRIMARY KEY ,
  VERSION BIGINT NOT NULL,
  STEP_NAME VARCHAR(100) NOT NULL,
  JOB_EXECUTION_ID BIGINT NOT NULL,
  START_TIME TIMESTAMP NOT NULL ,
  END_TIME TIMESTAMP DEFAULT NULL,
  STATUS VARCHAR(10),
  COMMIT_COUNT BIGINT ,
  READ_COUNT BIGINT ,
  FILTER_COUNT BIGINT ,
  WRITE_COUNT BIGINT ,
  READ_SKIP_COUNT BIGINT ,
  WRITE_SKIP_COUNT BIGINT ,
  PROCESS_SKIP_COUNT BIGINT ,
  ROLLBACK_COUNT BIGINT ,
  EXIT_CODE VARCHAR(20) ,
  EXIT_MESSAGE VARCHAR(2500) ,
  LAST_UPDATED TIMESTAMP,
  constraint JOB_EXECUTION_STEP_FK foreign key (JOB_EXECUTION_ID)
  references BATCH_JOB_EXECUTION(JOB_EXECUTION_ID)
) ;</programlisting>

    <para>Below are descriptions for each column:</para>

    <itemizedlist>
      <listitem>
        <para>STEP_EXECUTION_ID: Primary key that uniquely identifies this
        execution. The value of this column should be obtainable by calling
        the <methodname>getId</methodname> method of the
        <classname>StepExecution</classname> object.</para>
      </listitem>

      <listitem>
        <para>VERSION: See the <link linkend="metaDataVersion">Version</link> section.</para>
      </listitem>

      <listitem>
        <para>STEP_NAME: The name of the step to which this execution
        belongs.</para>
      </listitem>

      <listitem>
        <para>JOB_EXECUTION_ID: Foreign key from the BATCH_JOB_EXECUTION table
        indicating the <classname>JobExecution</classname> to which this
        <classname>StepExecution</classname> belongs. There
        may be only one <classname>StepExecution</classname> for a given
        <classname>JobExecution</classname> for a given
        <classname>Step</classname> name.</para>
      </listitem>

      <listitem>
        <para>START_TIME: Timestamp representing the time the execution was
        started.</para>
      </listitem>

      <listitem>
        <para>END_TIME: Timestamp representing the time the execution was
        finished, regardless of success or failure. An empty value in this
        column even though the job is not currently running indicates that
        there has been some type of error and the framework was unable to
        perform a last save before failing.</para>
      </listitem>

      <listitem>
        <para>STATUS: Character string representing the status of the
        execution. This may be COMPLETED, STARTED, etc. The object
        representation of this column is the
        <classname>BatchStatus</classname> enumeration.</para>
      </listitem>

      <listitem>
        <para>COMMIT_COUNT: The number of times the step has
        committed a transaction during this execution.</para>
      </listitem>

      <listitem>
        <para>READ_COUNT: The number of items read during this
        execution.</para>
      </listitem>

      <listitem>
        <para>FILTER_COUNT: The number of items filtered out of this
        execution.</para>
      </listitem>

      <listitem>
        <para>WRITE_COUNT: The number of items written and committed during
        this execution.</para>
      </listitem>

      <listitem>
        <para>READ_SKIP_COUNT: The number of items skipped on read during this
        execution.</para>
      </listitem>

      <listitem>
        <para>WRITE_SKIP_COUNT: The number of items skipped on write during
        this execution.</para>
      </listitem>

      <listitem>
        <para>PROCESS_SKIP_COUNT: The number of items skipped in
        processing during this execution.</para>
      </listitem>

      <listitem>
        <para>ROLLBACK_COUNT: The number of rollbacks during this execution.
        Note that this count includes each time rollback occurs, including
        rollbacks for retry and those in the skip recovery procedure.</para>
      </listitem>

      <listitem>
        <para>EXIT_CODE: Character string representing the exit code of the
        execution. In the case of a command line job, this may be converted
        into a number.</para>
      </listitem>

      <listitem>
        <para>EXIT_MESSAGE: Character string representing a more detailed
        description of how the step exited. In the case of failure, this might
        include as much of the stack trace as possible.</para>
      </listitem>

      <listitem>
        <para>LAST_UPDATED: Timestamp representing the last time this
        execution was persisted.</para>
      </listitem>
    </itemizedlist>
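
    <para>As a sketch of how these counters might be inspected, the following
    query lists the steps of a single job execution in the order they
    started. The execution id 42 is hypothetical:</para>

    <programlisting language="sql">-- One row per step execution belonging to job execution 42.
SELECT STEP_NAME, STATUS, READ_COUNT, FILTER_COUNT, WRITE_COUNT,
       READ_SKIP_COUNT, PROCESS_SKIP_COUNT, WRITE_SKIP_COUNT,
       ROLLBACK_COUNT
  FROM BATCH_STEP_EXECUTION
 WHERE JOB_EXECUTION_ID = 42
 ORDER BY START_TIME;</programlisting>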
  </section>

  <section id="metaDataBatchJobExecutionContext">
    <title>BATCH_JOB_EXECUTION_CONTEXT</title>

    <para>The BATCH_JOB_EXECUTION_CONTEXT table holds all information relevant
    to a <classname>Job</classname>'s
    <classname>ExecutionContext</classname>. There is exactly one
    <classname>Job</classname> <classname>ExecutionContext</classname> per
    <classname>JobExecution</classname>, and it contains all of the job-level
    data that is needed for a particular job execution. This data typically
    represents the state that must be retrieved after a failure so that a
    <classname>JobInstance</classname> can 'start from where it left
    off'.</para>

    <programlisting language="sql">CREATE TABLE BATCH_JOB_EXECUTION_CONTEXT  (
  JOB_EXECUTION_ID BIGINT PRIMARY KEY,
  SHORT_CONTEXT VARCHAR(2500) NOT NULL,
  SERIALIZED_CONTEXT CLOB,
  constraint JOB_EXEC_CTX_FK foreign key (JOB_EXECUTION_ID)
  references BATCH_JOB_EXECUTION(JOB_EXECUTION_ID)
) ;</programlisting>

    <para>Below are descriptions for each column:</para>

    <itemizedlist>
      <listitem>
        <para>JOB_EXECUTION_ID: Foreign key representing the
        <classname>JobExecution</classname> to which the context belongs.
        Because this column is also the primary key, there is exactly one row
        for a given execution.</para>
      </listitem>

      <listitem>
        <para>SHORT_CONTEXT: A string version of the
        SERIALIZED_CONTEXT.</para>
      </listitem>

      <listitem>
        <para>SERIALIZED_CONTEXT: The entire context, serialized.</para>
      </listitem>
    </itemizedlist>
  </section>

  <section id="metaDataBatchStepExecutionContext">
    <title>BATCH_STEP_EXECUTION_CONTEXT</title>

    <para>The BATCH_STEP_EXECUTION_CONTEXT table holds all information
    relevant to a <classname>Step</classname>'s
    <classname>ExecutionContext</classname>. There is exactly one
    <classname>ExecutionContext</classname> per
    <classname>StepExecution</classname>, and it contains all of the data that
    needs to be persisted for a particular step execution. This data typically
    represents the state that must be retrieved after a failure so that a
    <classname>JobInstance</classname> can 'start from where it left
    off'.</para>

    <programlisting language="sql">CREATE TABLE BATCH_STEP_EXECUTION_CONTEXT  (
  STEP_EXECUTION_ID BIGINT PRIMARY KEY,
  SHORT_CONTEXT VARCHAR(2500) NOT NULL,
  SERIALIZED_CONTEXT CLOB,
  constraint STEP_EXEC_CTX_FK foreign key (STEP_EXECUTION_ID)
  references BATCH_STEP_EXECUTION(STEP_EXECUTION_ID)
) ;</programlisting>

    <para>Below are descriptions for each column:</para>

    <itemizedlist>
      <listitem>
        <para>STEP_EXECUTION_ID: Foreign key representing the
        <classname>StepExecution</classname> to which the context belongs.
        Because this column is also the primary key, there is exactly one row
        for a given execution.</para>
      </listitem>

      <listitem>
        <para>SHORT_CONTEXT: A string version of the
        SERIALIZED_CONTEXT.</para>
      </listitem>

      <listitem>
        <para>SERIALIZED_CONTEXT: The entire context, serialized.</para>
      </listitem>
    </itemizedlist>
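
    <para>As a sketch, the persisted contexts for the steps of one job
    execution can be read by joining on STEP_EXECUTION_ID. The execution id
    42 is hypothetical:</para>

    <programlisting language="sql">-- Step name, status, and string form of the context for each step execution.
SELECT S.STEP_NAME, S.STATUS, C.SHORT_CONTEXT
  FROM BATCH_STEP_EXECUTION S
  JOIN BATCH_STEP_EXECUTION_CONTEXT C
    ON C.STEP_EXECUTION_ID = S.STEP_EXECUTION_ID
 WHERE S.JOB_EXECUTION_ID = 42;</programlisting>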
  </section>

  <section id="metaDataArchiving">
    <title>Archiving</title>

    <para>Because there are entries in multiple tables every time a batch job
    is run, it is common to create an archive strategy for the meta-data
    tables. The tables themselves are designed to show a record of what
    happened in the past, and generally won't affect the run of any job, with
    a few notable exceptions pertaining to restart:</para>

    <itemizedlist>
      <listitem>
        <para>The framework will use the meta-data tables to determine if a
        particular <classname>JobInstance</classname> has been run before. If
        it has been run, and the job is not restartable, then an exception
        will be thrown.</para>
      </listitem>

      <listitem>
        <para>If an entry for a <classname>JobInstance</classname> that has
        not completed successfully is removed, the framework will think that
        the job is new, rather than a restart.</para>
      </listitem>

      <listitem>
        <para>If a job is restarted, the framework will use any data that has
        been persisted to the <classname>ExecutionContext</classname> to
        restore the <classname>Job</classname>'s state. Therefore, removing
        entries from the execution context tables for jobs that haven't
        completed successfully will prevent them from starting at the correct
        point if run again.</para>
      </listitem>
    </itemizedlist>
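
    <para>With those exceptions in mind, candidates for archiving can be
    selected along the following lines. This is only a sketch: the cutoff is
    supplied as a parameter, and a project may well need stricter
    criteria:</para>

    <programlisting language="sql">-- Job instances with an execution that completed successfully before the cutoff.
SELECT DISTINCT I.JOB_INSTANCE_ID, I.JOB_NAME
  FROM BATCH_JOB_INSTANCE I
  JOIN BATCH_JOB_EXECUTION E
    ON E.JOB_INSTANCE_ID = I.JOB_INSTANCE_ID
 WHERE E.STATUS = 'COMPLETED'
   AND E.END_TIME &lt; ?;</programlisting>

    <para>Given the foreign keys shown in this appendix, rows for a selected
    instance would have to be removed from the child tables first: the two
    context tables, then BATCH_STEP_EXECUTION and BATCH_JOB_EXECUTION_PARAMS,
    then BATCH_JOB_EXECUTION, and finally BATCH_JOB_INSTANCE.</para>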
  </section>

  <section id="multiByteCharacters">
    <title>International and Multi-byte Characters</title>

    <para>If you are using multi-byte character sets (e.g. Chinese or Cyrillic)
    in your business processing, then those characters might need to be
    persisted in the Spring Batch schema. Many users find that
    simply changing the schema to double the length of the <literal>VARCHAR</literal>
    columns is enough. Others prefer to configure the <link
    linkend="configuringJobRepository"><classname>JobRepository</classname></link>
    with <literal>max-varchar-length</literal> set to half the value of the
    <literal>VARCHAR</literal> column length. Some users have also reported that
    they use <literal>NVARCHAR</literal> in place of <literal>VARCHAR</literal>
    in their schema definitions. The best result will depend on the database
    platform and the way the database server has been configured locally.</para>
  </section>

  <section id="recommendationsForIndexingMetaDataTables">
    <title>Recommendations for Indexing Meta Data Tables</title>

    <para>Spring Batch provides DDL samples for the meta-data tables in the
    Core jar file for several common database platforms. Index declarations
    are not included in that DDL because there are too many variations in how
    users may want to index, depending on their precise platform, local
    conventions, and the business requirements of how the jobs will be
    operated. The table below indicates which columns are
    used in a WHERE clause by the DAO implementations provided by
    Spring Batch, and how frequently they might be used, so that individual
    projects can make up their own minds about indexing.</para>

    <table>
      <title>Where clauses in SQL statements (excluding primary keys) and
      their approximate frequency of use.</title>

      <tgroup cols="3">
        <thead>
          <row>
            <entry>Default Table Name</entry>

            <entry>Where Clause</entry>

            <entry>Frequency</entry>
          </row>
        </thead>

        <tbody>

          <row>
            <entry>BATCH_JOB_INSTANCE</entry>

            <entry>JOB_NAME = ? and JOB_KEY = ?</entry>

            <entry>Every time a job is launched</entry>
          </row>

          <row>
            <entry>BATCH_JOB_EXECUTION</entry>

            <entry>JOB_INSTANCE_ID = ?</entry>

            <entry>Every time a job is restarted</entry>
          </row>

          <row>
            <entry>BATCH_EXECUTION_CONTEXT</entry>

            <entry>EXECUTION_ID = ? and KEY_NAME = ?</entry>

            <entry>On commit interval, a.k.a. chunk</entry>
          </row>

          <row>
            <entry>BATCH_STEP_EXECUTION</entry>

            <entry>VERSION = ?</entry>

            <entry>On commit interval, a.k.a. chunk (and at start and end of
            step)</entry>
          </row>

          <row>
            <entry>BATCH_STEP_EXECUTION</entry>

            <entry>STEP_NAME = ? and JOB_EXECUTION_ID = ?</entry>

            <entry>Before each step execution</entry>
          </row>
        </tbody>
      </tgroup>
    </table>
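
    <para>Based on the table above, a project might start from index
    definitions such as the following. The index names are hypothetical, and
    some platforms limit the length of indexed columns, so the index that
    includes JOB_KEY may need a prefix length or a shorter column:</para>

    <programlisting language="sql">-- Looked up every time a job is launched.
CREATE INDEX JOB_INST_NAME_KEY_IDX ON BATCH_JOB_INSTANCE (JOB_NAME, JOB_KEY);
-- Looked up every time a job is restarted.
CREATE INDEX JOB_EXEC_INST_IDX ON BATCH_JOB_EXECUTION (JOB_INSTANCE_ID);
-- Looked up before each step execution.
CREATE INDEX STEP_EXEC_NAME_JOB_IDX ON BATCH_STEP_EXECUTION (STEP_NAME, JOB_EXECUTION_ID);</programlisting>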
  </section>
</appendix>