spring-batch-extensions

Files

Stefano Cordio fa958e29af Bump spring-boot-starter-parent from 3.4.2 to 3.4.5 (#172 )

Signed-off-by: Stefano Cordio <stefano.cordio@gmail.com>

2025-05-06 07:50:40 +02:00

.mvn

Add maven.config file

2022-12-06 14:19:25 +01:00

src

Update documentation

2024-09-18 16:50:02 +02:00

.editorconfig

Initial version 0.1.0 changes from original repository

2021-03-26 16:53:54 +01:00

.gitignore

Use spring-boot-starter-parent in spring-batch-excel

2024-12-20 02:23:08 +01:00

mvnw

Initial version 0.1.0 changes from original repository

2021-03-26 16:53:54 +01:00

mvnw.cmd

Initial version 0.1.0 changes from original repository

2021-03-26 16:53:54 +01:00

pom.xml

Bump spring-boot-starter-parent from 3.4.2 to 3.4.5 (#172 )

2025-05-06 07:50:40 +02:00

README.adoc

Update documentation

2024-09-18 16:50:02 +02:00

README.adoc

= spring-batch-excel

Spring Batch extension containing an `ItemReader` implementation for Excel based on https://poi.apache.org[Apache POI]. It supports reading both XLS and XLSX files. For the latter, there is also (experimental) streaming support.

The `PoiItemReader` has the most features but is also the most memory intensive and might lead to memory issues with large XLS(X) sheets.

To reduce the memory footprint the `StreamingXlsxItemReader` can be used, this will only keep the current row in memory and discard it afterward. Not everything is supported while streaming the XLSX file. It can be that formulas don't get evaluated or lead to an error.

WARNING: The `ItemReader` classess are **not threadsafe**. The API from https://poi.apache.org/help/faq.html#20[Apache POI] itself isn't threadsafe as well as the https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/item/support/AbstractItemCountingItemStreamItemReader.html[`AbstractItemCountingItemStreamItemReader`] used as a base class for the `ItemReader` classes. Reading from multiple threads is therefore not supported. Using a multi-threaded processor/writer should work as long as you use a single thread for reading.

*Compatibility:* Spring Batch Excel is compatible with Spring Batch 5.x.

== Configuration of `PoiItemReader`

Next to the https://docs.spring.io/spring-batch/reference/html/configureJob.html[configuration of Spring Batch] one needs to configure the `PoiItemReader`.

Configuration of can be done in XML or Java Config.

=== XML

[source,xml]
----
 <bean id="excelReader" class="org.springframework.batch.extensions.excel.poi.PoiItemReader" scope="step">
     <property name="resource" value="file:/path/to/your/excel/file" />
     <property name="rowMapper">
         <bean class="org.springframework.batch.extensions.excel.mapping.PassThroughRowMapper" />
     </property>
 </bean>
----

=== Java Config

[source,java]
----
@Bean
@StepScope
public PoiItemReader excelReader(RowMapper rowMapper) {
    PoiItemReader reader = new PoiItemReader();
    reader.setResource(new FileSystemResource("/path/to/your/excel/file"));
    reader.setRowMapper(rowMapper);
    return reader;
}

@Bean
public RowMapper rowMapper() {
    return new PassThroughRowMapper();
}
----

== Configuration of `StreamingXlsxItemReader`

Configuration can be done in XML or Java Config.

=== XML

[source,xml]
----
 <bean id="excelReader" class="org.springframework.batch.extensions.excel.streaming.StreamingXlsxItemReader" scope="step">
     <property name="resource" value="file:/path/to/your/excel/file" />
     <property name="rowMapper">
         <bean class="org.springframework.batch.extensions.excel.mapping.PassThroughRowMapper" />
     </property>
 </bean>
----

=== Java Config

[source,java]
----
@Bean
@StepScope
public StreamingXlsxItemReader excelReader(RowMapper rowMapper) {
    StreamingXlsxItemReader reader = new StreamingXlsxItemReader();
    reader.setResource(new FileSystemResource("/path/to/your/excel/file"));
    reader.setRowMapper(rowMapper);
    return reader;
}

@Bean
public RowMapper rowMapper() {
    return new PassThroughRowMapper();
}
----

== Configuration properties
[cols="1,1,1,4"]
.Properties for item readers
|===
| Property | Required | Default | Description

| `endAfterBlankLines` | no | 1 | The number of blank lines before stopping to read.
| `linesToSkip` | no | 0 | The number of lines to skip, this applies to each sheet in the Excel file, can be useful if the first couple of lines provide header information.
| `password` | no | `null` | The password used to protect an XLS file. Only works for XLS files not XLSX files (not supported with streaming).
| `resource` | yes | `null` | Location of the excel file to read, can be any resource supported by Spring.
| `rowMapper` | yes | `null` | transforms the rows read from the sheet(s) to an object which you can use in the rest of the process.
| `rowSetFactory` | no | `DefaultRowSetFactory` | For reading rows a `RowSet` abstraction is used. To construct a `RowSet` for the current `Sheet` a `RowSetFactory` is needed. The `DefaultRowSetFactory` constructs a `DefaultRowSet` and `DefaultRowSetMetaData`. For construction of the latter a `ColumnNameExtractor` is needed. At the moment there are 2 implementations
| `skippedRowsCallback` | no | `null` | When rows are skipped an optional `RowCallbackHandler` is called with the skipped row. This comes in handy when one needs to write the skipped rows to another file or create some logging.
| `strict` | no | `true` | This controls wether or not an exception is thrown if the file doesn't exists or isn't readable, by default an exception will be thrown.
| `datesAsIso` | no | `false` | Controls if dates need to be parsed as ISO or to use the format as specified in the excel sheet.
| `userLocale` | no | `null` | Set the `java.util.Locale` to use when formatting dates when there is no explicit format set in the Excel document.
| `dataFormatterCustomizer` | no | `DataFormatterCustomizer.DEFAULT` | To additionally configure the https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html[`DataFormatter`] in use to format the data. The default will set the `useCachedValuesForFormulaCells` property to `true` to use cached values instead of evaluating the formulas.
| `formulaEvaluatorFactory` | no | `FormulaEvaluatorFactory.NOOP` | A factory approach to create a `FormulaEvaluator` used by Apache POI to evaluate the formulas in the, the default implementation will return `null` as the default is to use the cached values.
|===

== ColumnNameExtractors

- `StaticColumnNameExtractor` uses a preset list of column names.
 - `RowNumberColumnNameExtractor` (**the default**) reads a given row (default 0) to determine the column names of the current sheet

== RowMappers
To map a read row a `RowMapper` is needed. Out-of-the-box there are 2 implementations. The `PassThroughRowMapper` and `BeanWrapperRowMapper`.

=== PassThroughRowMapper
Transforms the read row from excel into a `String[]`.

=== BeanWrapperRowMapper
Uses a `BeanWrapper` to convert a given row into an object. Uses the column names of the given `RowSet` to map column to properties of the `targetType` or prototype bean.

[source,xml]
----
<bean id="excelReader" class="org.springframework.batch.extensions.excel.poi.PoiItemReader" scope="step">
    <property name="resource" value="file:/path/to/your/excel/file" />
    <property name="rowMapper">
        <bean class="org.springframework.batch.extensions.excel.mapping.BeanWrapperRowMapper">
            <property name="targetType" value="com.your.package.Player" />
        </bean>
    </property>
</bean>
----

or the same in Java configuration.

[source,java]
----
@Bean
public PoiItemReader excelReader(BeanWrapperRowMapper rowMapper) {
  var excelReader = new PoiItemReader();
  excelReader.setResource(new FileSystemResouce("file:/path/to/your/excel/file"));
  excelReader.setRowMapper(rowMapper);
  return excelReader;
}

@Bean
public BeanWrapperRowMapper rowMapper() {
  var rowMapper = new BeanWrapperRowMapper<Player>();
  rowMapper.setTargetType(Player.class);
  return rowMapper;
}
----

NOTE: When using the `BeanWrapperRowMapper` with the `StreamingXlsxItemReader` it is required to use the `StaticColumnNameExtractor` to provide the column names for mapping purposes. The reason for this is that we cannot read a specific row while streaming the results.

[source,xml]
----
 <bean id="excelReader" class="org.springframework.batch.extensions.excel.streaming.StreamingXlsxItemReader" scope="step">
    <property name="resource" value="file:/path/to/your/excel/file" />
    <property name="rowMapper">
        <bean class="org.springframework.batch.extensions.excel.mapping.BeanWrapperRowMapper">
            <property name="targetType" value="org.springframework.batch.extensions.excel.Player" />
        </bean>
    </property>
    <property name="rowSetFactory">
        <bean class="org.springframework.batch.extensions.excel.support.rowset.DefaultRowSetFactory">
            <property name="columnNameExtractor">
                <bean class="org.springframework.batch.extensions.excel.support.rowset.StaticColumnNameExtractor">
                    <constructor-arg value="id,position,lastName,firstName,birthYear,debutYear,comment" />
                </bean>
            </property>
        </bean>
    </property>
 </bean>
----

And the same in Java Configuration.

[source,java]
----
@Bean
public StreamingXlsxItemReader excelReader(BeanWrapperRowMapper rowMapper) {
  var columns = new String[] {"id", "position", "lastName", "firstName", "birthYear", "debutYear","comment"};
  var columnNameExtractor = new StaticColumnNameExtractor(columns);
  var rowSetFactory = new DefaultRowSetFactory();
  rowSetFactory.setColumnNameExtractor(columnNameExtractor);

  var excelReader = new StreamingXlsxItemReader();
  excelReader.setResource(new FileSystemResouce("file:/path/to/your/excel/file"));
  excelReader.setRowMapper(rowMapper);
  excelReader.set
  return excelReader;
}

@Bean
public BeanWrapperRowMapper rowMapper() {
  var rowMapper = new BeanWrapperRowMapper<Player>();
  rowMapper.setTargetType(Player.class);
  return rowMapper;
}
----

== Frequently Asked Questions

=== Not able to open large Excel
When opening large Excel files or Excel files with large amounts of data in a single cell it might fail with an error

[source]
----
"Unexpected error Tried to allocate an array of length 162,386,364, but the maximum length for this record type is 100,000,000. If the file is not corrupt or large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()"
----

This is due to the maximum lenght for certain datatypes is limited. To prevent this from happening you can use the `IOUtils.setByteArrayMaxOverride()` method to increase the allowed size. It is however important that this is set _before_ anything POI related has been processed/configured.

Ideally, when using Spring Boot, you can set this before launching the application or by putting this in a `static {}` initializer block of the Spring Batch job configuration.

[source,java]
----
import org.apache.poi.util.IOUtils;
@SpringBootApplication
public class MyBatchApplication {

  public static void main(String[] args) {
    IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE);
    SpringApplication.run(MyBatchApplication.class, args);
  }
}
----