Spring Batch 6 Guide Part 3: Transactions · Failure Handling — Skip · Retry · Restart



Introduction

In Part 2 we saw the chunk cycle — read N → process N → write N → commit — and that its boundary is the transaction boundary. But what happens when one item inside a chunk breaks?

Real-world batches are dirty. Three of a million orders are malformed, the database occasionally throws a deadlock, and the external notification API times out now and then. At that moment you have three choices — throw the whole thing away, skip the dirty one, or retry shortly after. And if a job dies midway, it must be able to resume from where it stopped instead of starting over.

Part 3 covers this “toolkit for failure” in five groups: exactly where the chunk transaction boundary rolls back, how to design the faultTolerant Step’s Skip · Retry · NoRollback policies by exception meaning, the six listeners that observe failures, how ExecutionContext preserves the restart position, and finally idempotency that makes re-running the same job safe.

The target reader is a backend engineer who understands Part 2’s chunk mechanism. Transaction and exception-hierarchy fundamentals are assumed.


TL;DR

  • One failed item = the whole chunk rolls back, by default — if one item in a 1,000-item chunk throws, the other 999 roll back too. Previously committed chunks survive. This is why the chunk is the transaction boundary and the basis for restart.
  • Skip · Retry · NoRollback are decided by exception meaning: data problems Skip (malformed rows), transient errors Retry (deadlock, timeout), side-effect failures NoRollback (a notification that doesn’t dirty the data). Everything else fails.
  • Turning on Skip makes the chunk re-process item by item (scan): a failed chunk rolls back and is replayed one item at a time to isolate the culprit. The Processor and Writer get called again for the same items, so they must be idempotent, that is, safe to call repeatedly with the same input.
  • ExecutionContext is the key to restart — the Reader stores “how far it read” in the StepExecution’s ExecutionContext. Re-launching with the same JobParameters skips completed Steps and resumes the failed Step from the saved position.
  • Idempotency is a JobParameters key + Writer upsert — the idempotency key is the business date (targetDate), and the Writer uses PostgreSQL INSERT ... ON CONFLICT DO UPDATE. Running the same date twice yields the same result.

1. The Chunk Transaction Boundary

1.1 The fork between success and failure

A chunk-oriented Step processes one chunk as one transaction. On success it commits, bumps the metadata counters, and moves to the next chunk. On failure it rolls back the entire chunk.

sequenceDiagram
    participant Step
    participant Tx as TransactionManager
    participant RW as Reader · Processor · Writer
    participant Meta as JobRepository

    Note over Step,Meta: Chunk #1 (1000 items) — success
    Step->>Tx: begin
    Step->>RW: read·process·write 1000
    RW-->>Step: OK
    Step->>Tx: commit
    Step->>Meta: write_count += 1000, commit_count += 1

    Note over Step,Meta: Chunk #2 (1000 items) — one item fails
    Step->>Tx: begin
    Step->>RW: read·process·write in progress
    RW-->>Step: exception at item 743
    Step->>Tx: rollback
    Step->>Meta: rollback_count += 1
    Note over Step,Meta: Chunk #1's 1000 items already committed — they survive

Two things matter here.

  • All 1,000 items in the failed chunk disappear — one bad item at position 743 rolls back the preceding 742 too. “Only one failed, so the rest got saved” is wrong.
  • Previously committed chunks are untouched — chunk #1’s 1,000 items were already committed in a separate transaction. That “already finished” point is the checkpoint restart relies on (§4).
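The two points above can be reproduced with a toy model. This is only a sketch in plain Kotlin, not Spring Batch's implementation: the "transaction" is a staging list, and `runInChunks` and `ChunkRunResult` are hypothetical names introduced for illustration.

```kotlin
// Toy model only: a "transaction" is a staging list that is either
// appended to the committed store or thrown away as a whole.
class ChunkRunResult(val committed: List<Int>, val rollbackCount: Int, val failedAt: Int?)

fun runInChunks(items: List<Int>, chunkSize: Int, write: (Int) -> Unit): ChunkRunResult {
    val committed = mutableListOf<Int>()
    for (chunk in items.chunked(chunkSize)) {
        val staged = mutableListOf<Int>()           // begin
        try {
            for (item in chunk) {
                write(item)                         // may throw
                staged += item
            }
            committed += staged                     // commit
        } catch (e: IllegalStateException) {
            // rollback: everything staged in this chunk is discarded,
            // and the step stops (the default, non-fault-tolerant behavior)
            return ChunkRunResult(committed, rollbackCount = 1, failedAt = chunk[staged.size])
        }
    }
    return ChunkRunResult(committed, rollbackCount = 0, failedAt = null)
}
```

Running it over items 1..3000 with a chunk size of 1000 and a failure at item 1743 leaves exactly the first 1,000 items committed; the 742 items staged before the failure are gone.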

1.2 The trail left in metadata

Each chunk’s outcome accumulates in the BATCH_STEP_EXECUTION counters. In operations, these numbers are the first diagnostic when reading job state.

| Counter | Meaning |
| --- | --- |
| read_count | items read by the Reader |
| write_count | items written by the Writer |
| commit_count | committed chunks (transactions) |
| rollback_count | rolled-back chunks |
| skip_count | skipped items (§2) |
| process_skip_count · write_skip_count | skips per phase |

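If read_count and write_count diverge, items are being dropped between read and write. Check skip_count for items dropped by the Skip policy, and filter_count for items the Processor filtered out by returning null.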

1.3 Caution: turning on Skip makes the chunk re-process item by item

The default behavior is “one failed item → roll back the whole chunk → fail the Step.” But once you enable Skip (§2), the behavior gets one layer more complex. Spring Batch rolls back the failed chunk and then re-processes it one item at a time (scan) to find which item is the culprit — so it can skip only that one and commit the rest.

Caution: during the scan, already-read items live in a buffer, so the Reader is not called again. But the Processor and Writer are called again for the same items. So if the Processor calls an external API or increments a counter, that side effect runs twice. The Processor and Writer must produce the same result no matter how many times they are called for the same input — which is exactly why §5 idempotency matters.
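To see why the Processor runs twice, here is a toy replay of the scan in plain Kotlin. It is a simplified sketch under assumptions: the failure is raised in the write phase, one item equals one "transaction" during the scan, and `writeChunkWithScan` and `ScanResult` are hypothetical names. The real framework differs in detail.

```kotlin
class ScanResult(val committed: List<Int>, val skipped: List<Int>, val processCalls: Map<Int, Int>)

fun writeChunkWithScan(chunk: List<Int>, isBad: (Int) -> Boolean): ScanResult {
    val processCalls = mutableMapOf<Int, Int>()
    fun process(item: Int): Int {
        processCalls[item] = (processCalls[item] ?: 0) + 1   // count every Processor call
        return item
    }
    fun write(items: List<Int>) {
        items.firstOrNull(isBad)?.let { throw IllegalArgumentException("bad item $it") }
    }

    // First attempt: the whole chunk in one transaction.
    try {
        val processed = chunk.map { process(it) }
        write(processed)
        return ScanResult(processed, emptyList(), processCalls)
    } catch (e: IllegalArgumentException) {
        // rolled back: nothing from this attempt is kept
    }

    // Scan: replay the buffered items (no re-read), one item per transaction.
    val committed = mutableListOf<Int>()
    val skipped = mutableListOf<Int>()
    for (item in chunk) {
        try {
            val processed = process(item)
            write(listOf(processed))
            committed += processed
        } catch (e: IllegalArgumentException) {
            skipped += item                                   // only the culprit is dropped
        }
    }
    return ScanResult(committed, skipped, processCalls)
}
```

With a five-item chunk where item 3 is bad, four items commit, one is skipped, and every item has passed through `process` twice.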


2. Skip · Retry · NoRollback Policies

2.1 Three decisions — decided by what the exception means

When you hit a failure, the choice is driven by what the exception means, not by row count or frequency.

| Situation | Exception nature | Policy | Example |
| --- | --- | --- | --- |
| The data is dirty | permanent, item-local | Skip | malformed CSV row, failed-validation order |
| Temporarily blocked | transient, clears on retry | Retry | DB deadlock, lock timeout, brief network glitch |
| Data is fine, a side task failed | unrelated to data integrity | NoRollback | a notification send failing |
| Everything else | a real error that must stop the job | Fail | schema mismatch, NPE, config error |

The test is simple. “Does retrying fix it?” → Retry. “If I drop just this item, is the rest fine?” → Skip. “Does this exception dirty the data transaction?” → if not, NoRollback. If none apply, Fail is correct. When in doubt, Fail should be the default — a job that silently moves on is the most dangerous kind.
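The three questions can be written down as a small classifier. This is an illustrative sketch, not a Spring Batch API: the exception classes and `policyFor` are hypothetical stand-ins for the exception families described above, and the `else` branch encodes "when in doubt, Fail".

```kotlin
enum class Policy { SKIP, RETRY, NO_ROLLBACK, FAIL }

// Hypothetical stand-ins for the exception families discussed in this section.
class DirtyDataException(msg: String) : RuntimeException(msg)
class TransientFailureException(msg: String) : RuntimeException(msg)
class SideTaskFailedException(msg: String) : RuntimeException(msg)

fun policyFor(e: Throwable): Policy = when (e) {
    is TransientFailureException -> Policy.RETRY       // "Does retrying fix it?"
    is DirtyDataException -> Policy.SKIP               // "Is the rest fine without this item?"
    is SideTaskFailedException -> Policy.NO_ROLLBACK   // "Does it leave the data transaction clean?"
    else -> Policy.FAIL                                // unknown meaning: stop the job
}
```

Keeping the default branch at `FAIL` means a new, unclassified exception stops the job instead of being silently absorbed.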

2.2 The faultTolerant Step

Skip, Retry, and NoRollback only work on a Step marked faultTolerant(). Layering the policies onto Part 2’s aggregation Step looks like this.

import org.springframework.batch.core.Step
import org.springframework.batch.core.repository.JobRepository
import org.springframework.batch.core.step.builder.StepBuilder
import org.springframework.batch.item.ItemProcessor
import org.springframework.batch.item.ItemWriter
import org.springframework.batch.item.database.JpaPagingItemReader
import org.springframework.batch.item.validator.ValidationException
import org.springframework.context.annotation.Bean
import org.springframework.dao.TransientDataAccessException
import org.springframework.transaction.PlatformTransactionManager

@Bean
fun aggregateStep(
    jobRepository: JobRepository,
    txManager: PlatformTransactionManager,
    orderReader: JpaPagingItemReader<Order>,
    salesProcessor: ItemProcessor<Order, DailySalesLine>,
    salesWriter: ItemWriter<DailySalesLine>,
    skipRecorder: SkippedItemRecorder,
): Step =
    StepBuilder("aggregateStep", jobRepository)
        .chunk<Order, DailySalesLine>(1000, txManager)
        .reader(orderReader)
        .processor(salesProcessor)
        .writer(salesWriter)
        .faultTolerant()
        .skip(ValidationException::class.java)   // data problem → Skip
        .skipLimit(10)                            // fail the Step if more than 10 are skipped
        .retry(TransientDataAccessException::class.java)  // transient error → Retry
        .retryLimit(3)                            // the same item up to 3 times
        .noRollback(NotificationFailedException::class.java)  // side-task failure doesn't roll back
        .listener(skipRecorder)                   // record skipped items (§3)
        .build()

2.3 Skip — tolerable dirtiness

skip(exception) + skipLimit(n) means “drop just this item for this exception and keep going, but if the cumulative count exceeds n, then fail the Step.” The skipLimit is what matters. You can ignore 3 dirty items, but if half are broken, the data source itself is broken and the job should stop.

  • skipLimit is the “line of acceptable normality” — unbounded skipping silently buries errors. Set a realistic threshold like 0.1% of the total.
  • Always record skipped items — just dropping them means you’ll never know why 3 went missing. Use a SkipListener to persist them to a table or log (§3).
  • For fine-grained control, SkipPolicy — if you need “skip read errors unboundedly, write errors up to 5,” implement a SkipPolicy.
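As a rough aid for picking the threshold, the "0.1% of the total" rule can be computed from the expected volume. The helper names below are hypothetical, and integer per-mille arithmetic is used on purpose to avoid floating-point rounding.

```kotlin
// perMille = 1 means 0.1% of the expected volume.
fun skipLimitFor(expectedItems: Long, perMille: Long = 1): Int {
    require(expectedItems >= 0 && perMille in 0..1000)
    return (expectedItems * perMille / 1000).coerceAtMost(Int.MAX_VALUE.toLong()).toInt()
}

// Mirrors the rule in the text: the Step keeps going while the
// cumulative skip count stays within the limit.
fun withinSkipBudget(skipCount: Int, skipLimit: Int): Boolean = skipCount <= skipLimit
```

For a million-row input this yields a limit of 1,000, which could then be passed to `skipLimit(...)`.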

2.4 Retry — transient errors

retry(exception) + retryLimit(n) is for transient errors. A deadlock usually clears when the transaction restarts, and a lock timeout passes a moment later. Putting Retry on a permanent error (validation failure) just repeats the same failure n times for nothing.

Retry and Skip compose. Put one exception in both the retry and skip lists, and it will first retry up to retryLimit and then, if still failing, be skipped.

Note: retry backoff (the interval between attempts) is configured via RetryTemplate/BackOffPolicy. A short fixed interval is typical for deadlock retries, exponential backoff for external APIs. Since chunk-level retry re-processes the entire chunk, it’s better to control external calls at the Writer rather than the Processor to limit side effects.
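The idea of retry with exponential backoff can be sketched in plain Kotlin. This is not the framework's retry machinery: `retryWithBackoff` and `TransientException` are hypothetical, and the `sleep` parameter is injectable only so the delays can be observed without actually waiting.

```kotlin
class TransientException(msg: String) : RuntimeException(msg)

fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelayMs: Long,
    multiplier: Double = 2.0,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: (attempt: Int) -> T,
): T {
    require(maxAttempts >= 1)
    var delay = initialDelayMs
    for (attempt in 1 until maxAttempts) {
        try {
            return block(attempt)
        } catch (e: TransientException) {
            sleep(delay)                          // wait before the next attempt
            delay = (delay * multiplier).toLong() // grow the interval
        }
    }
    return block(maxAttempts)                     // last attempt: let the exception propagate
}
```

Only `TransientException` is retried; any other exception escapes immediately, which matches the advice not to put Retry on permanent errors.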

2.5 NoRollback — moving on without rolling back

By default every exception thrown in a chunk rolls the transaction back. But if the data write succeeded and only a subsequent notification send failed, there’s no reason to undo the write. noRollback(exception) declares “even if this exception is thrown, do not roll back the transaction.”

The precondition is crucial. A NoRollback exception must not dirty the transaction state. Put NoRollback on an exception thrown mid-DB-operation and you can leave half-committed data behind. The only safe candidates are pure side-task failures that happen after the data write is done.
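The precondition is easier to see in a toy transaction. The sketch below is a plain-Kotlin model, not Spring's transaction handling; `ToyTx`, `execute`, and `NotificationFailed` are hypothetical names.

```kotlin
class NotificationFailed(msg: String) : RuntimeException(msg)

class ToyTx(private val store: MutableList<String>) {
    // Runs `work` against a staging buffer and commits it unless an
    // exception that is NOT listed in `noRollbackFor` escapes.
    fun execute(noRollbackFor: Set<Class<out Throwable>>, work: (MutableList<String>) -> Unit): Boolean {
        val staged = mutableListOf<String>()
        try {
            work(staged)
        } catch (e: RuntimeException) {
            if (noRollbackFor.none { it.isInstance(e) }) return false   // rollback
        }
        store += staged                                                 // commit whatever was staged
        return true
    }
}
```

Note what the commit branch does: it keeps whatever was staged at the moment the exception was thrown. If the listed exception can fire halfway through the writes, the half-written state is exactly what gets committed, which is why only failures after the data write are safe candidates.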


3. The Six Listeners That Observe Failures

Once you’ve set Skip/Retry policies, listeners are the window into what they did. Spring Batch lets you hook callbacks at each point of the job lifecycle.

3.1 Quick catalog

| Listener | Key callbacks | Typical use |
| --- | --- | --- |
| StepExecutionListener | beforeStep / afterStep | Step start/end hooks, adjusting ExitStatus |
| ChunkListener | beforeChunk / afterChunk / afterChunkError | chunk-boundary logging, detecting failed chunks |
| ItemReadListener | beforeRead / afterRead / onReadError | diagnosing read/parse errors |
| ItemWriteListener | beforeWrite / afterWrite / onWriteError | diagnosing write / batch-insert errors |
| SkipListener | onSkipInRead / onSkipInProcess / onSkipInWrite | persisting skipped items to a table/log |
| RetryListener | retry lifecycle (open / onError / close) | observing retry counts and causes |

ItemProcessListener (the process phase) and JobExecutionListener (the whole job) exist in the same vein. You don’t need all six — by far the most used in practice is SkipListener.

3.2 SkipListener — recording skipped items

§2 said “always record skipped items.” The tool for that is SkipListener. Its skip callbacks are invoked after the chunk commits, so what you record there is stored safely alongside the good data (it is not swept away by the rollback).

import org.springframework.batch.core.SkipListener
import org.springframework.stereotype.Component

@Component
class SkippedItemRecorder(
    private val skipLogRepository: SkipLogRepository,
) : SkipListener<Order, DailySalesLine> {

    override fun onSkipInProcess(item: Order, t: Throwable) {
        skipLogRepository.save(
            SkipLog(orderId = item.id, phase = "PROCESS", reason = t.message ?: t.javaClass.simpleName),
        )
    }

    override fun onSkipInWrite(item: DailySalesLine, t: Throwable) {
        skipLogRepository.save(
            SkipLog(orderId = item.orderId, phase = "WRITE", reason = t.message ?: t.javaClass.simpleName),
        )
    }

    override fun onSkipInRead(t: Throwable) {
        // A read-phase skip (parse error, etc.) can't identify the source item, so log the message only.
        skipLogRepository.save(SkipLog(orderId = null, phase = "READ", reason = t.message ?: t.javaClass.simpleName))
    }
}

The resulting skip_log table becomes the basis for tracing and re-processing “which orders dropped from yesterday’s job, and why.”


4. ExecutionContext and Restart

4.1 What ExecutionContext is

ExecutionContext is a key-value store holding a job’s intermediate state. There are two of them — one per JobExecution and one per StepExecution — each serialized into the BATCH_JOB_EXECUTION_CONTEXT and BATCH_STEP_EXECUTION_CONTEXT tables (see the six metadata tables in Part 1).

Its core purpose is preserving the restart position. Even if a job dies midway, recording how far it got into the DB lets the next run continue from there.

4.2 How the Reader saves its position

Most Readers implement ItemStream. While the Step runs, update() is called at each chunk boundary, just before the commit, to record “read N items so far” into the StepExecution’s ExecutionContext, so the saved position always matches committed data. On restart, open() reads that value and starts after item N.

  • saveState (default true) — whether the Reader saves state. Turn it off and a restart re-reads from the beginning. Keep it on if you need restart.
  • name() is the key prefix — the Reader’s name() from Part 2 is the prefix that prevents ExecutionContext key collisions. Two Readers in the same Step must have different names.
  • Custom keys too — you can implement ItemStream directly or stash arbitrary values in stepExecution.executionContext from a listener.
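The open / update / read cycle can be modeled with a small reader. This is a simplified, hypothetical shape: the real ItemStream contract works with ExecutionContext rather than a plain Map, and the key name `read.count` is borrowed from the restart diagram in §4.3.

```kotlin
// Hypothetical, simplified reader: a Map stands in for ExecutionContext.
class CountingReader(private val name: String, private val source: List<String>) {
    private var position = 0
    private val key get() = "$name.read.count"   // name is the key prefix

    // Restore the saved position, or start from the beginning.
    fun open(ctx: Map<String, Int>) { position = ctx[key] ?: 0 }

    // Return the next item, or null when the source is exhausted.
    fun read(): String? = source.getOrNull(position)?.also { position++ }

    // Called at a chunk boundary: persist how far we have read.
    fun update(ctx: MutableMap<String, Int>) { ctx[key] = position }
}
```

If three items are read and `update` is called, then a fourth is read without another `update`, a fresh reader opened with the same map resumes at the fourth item: anything read after the last saved position is read again.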

4.3 How restart works

The premise of restart is the same JobInstance — i.e., the identifying JobParameters (e.g., targetDate) must match. Re-running with the same key makes the framework create a new JobExecution but bind it to the same JobInstance and restore the last failure point.

flowchart TD
    A["Run — targetDate=2026-05-16"] --> B{"Step result"}
    B -->|"all succeed"| C["JobExecution COMPLETED"]
    B -->|"fails at item 3200 of 5000"| D["JobExecution FAILED<br/>StepExecution keeps the last committed position (read.count=3000)"]
    D --> E["Re-run with the same JobParameters"]
    E --> F["new JobExecution · same JobInstance"]
    F --> G["COMPLETED Steps are skipped<br/>the failed Step resumes from item 3001"]
    G --> C

The already-committed 3,000 items (three chunks) are not re-processed; it continues from the chunk starting at item 3,001. This is where “previously committed chunks survive” from §1 pays off.
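The arithmetic behind that resume point is short enough to write out. A minimal sketch, assuming items are numbered from 1 and a fixed chunk size; the function names are hypothetical.

```kotlin
// Items are numbered from 1; the chunk containing failedItem is rolled back entirely.
fun committedBefore(failedItem: Int, chunkSize: Int): Int =
    (failedItem - 1) / chunkSize * chunkSize

// The first item that a restart will process.
fun resumeFrom(failedItem: Int, chunkSize: Int): Int =
    committedBefore(failedItem, chunkSize) + 1
```

For a failure at item 3,200 with a chunk size of 1,000, `committedBefore` gives 3000 and `resumeFrom` gives 3001, so items 3,001 through 3,199 are processed a second time on restart.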

4.4 Common traps

  • A COMPLETED JobInstance can’t be re-run with the same parameters — re-running a successfully finished job with the same targetDate throws JobInstanceAlreadyCompleteException. If you need a re-run, change the parameters (Part 4’s run.id) or solve it with §5 idempotency.
  • Keep ExecutionContext small — it is serialized to the DB, so don’t put large objects or collections in it. Only small values like a position.
  • saveState=false makes restart meaningless — sometimes it’s turned off where saving state is hard (multi-threaded Steps, Part 5), but then restart is from scratch.

5. Idempotency Patterns

5.1 Why idempotency is needed even with restart

The reason idempotency is still needed despite restart is that “restart” and “re-run” are different. Restart continues a failed JobInstance, so it doesn’t touch already-written data. But in operations, an intentional re-run — “yesterday’s aggregation looks off, run it again” — is common. Re-loading the same date’s data then can double the revenue figure.

Idempotent = running with the same input any number of times yields the same result. In batch this is guaranteed at two layers — job identity (JobParameters) and the write operation (Writer).

5.2 The business date as the JobParameters idempotency key

An aggregation job’s idempotency key is the business value pointing at what’s being processed: targetDate. “Aggregate sales for 2026-05-16” must be a single JobInstance no matter how many times it’s launched.

  • The identifying parameter is the business date: targetDate=2026-05-16. Same date = same JobInstance = duplicate completion is blocked.
  • Per-run values must be non-identifying — things like an execution timestamp must be excluded from identity (Part 4’s run.id incrementer) to allow re-runs.
  • That alone only blocks “re-running a succeeded job.” To allow a re-run that overwrites data, the Writer must be idempotent.
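The effect of identifying versus non-identifying parameters can be shown with a toy identity function. This is not how the framework computes its key internally; `Param`, `instanceKey`, and the `launchedAt` parameter are hypothetical and only illustrate which values take part in job identity.

```kotlin
data class Param(val value: String, val identifying: Boolean = true)

// Toy job identity: job name plus identifying parameters only, in a stable order.
fun instanceKey(jobName: String, params: Map<String, Param>): String =
    params.filterValues { it.identifying }
        .toSortedMap()
        .entries
        .joinToString(prefix = "$jobName?", separator = "&") { "${it.key}=${it.value.value}" }
```

Two launches for the same targetDate but with different non-identifying `launchedAt` values map to the same key, while a different targetDate yields a different key.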

5.3 Writer upsert — PostgreSQL ON CONFLICT

The canonical write-idempotency is upsert. Reuse the JdbcBatchItemWriter + PostgreSQL INSERT ... ON CONFLICT DO UPDATE from Part 2 §4.3. If the same (sale_date, member_id) already exists, it UPDATEs instead of INSERTing — so running the same date twice doesn’t duplicate rows, it just refreshes values.

import org.springframework.batch.item.database.JdbcBatchItemWriter
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder
import org.springframework.context.annotation.Bean
import javax.sql.DataSource

@Bean
fun salesWriter(dataSource: DataSource): JdbcBatchItemWriter<DailySalesLine> =
    JdbcBatchItemWriterBuilder<DailySalesLine>()
        .dataSource(dataSource)
        .sql(
            """
            INSERT INTO daily_sales (sale_date, member_id, amount)
            VALUES (:saleDate, :memberId, :amount)
            ON CONFLICT (sale_date, member_id)
            DO UPDATE SET amount = EXCLUDED.amount
            """.trimIndent(),
        )
        .beanMapped()
        .build()

For this to work, daily_sales must have a UNIQUE (sale_date, member_id) constraint. Idempotency is guaranteed by that unique constraint, not by the code.
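The difference between an append-only load and an upsert shows up as soon as the same input is loaded twice. The sketch below is an in-memory model, not database code: a map keyed by (saleDate, memberId) plays the role of the UNIQUE constraint, and all type and function names are hypothetical.

```kotlin
data class SalesKey(val saleDate: String, val memberId: Long)
data class SalesLine(val saleDate: String, val memberId: Long, val amount: Long)

// Append-only load: every run adds rows, so a re-run doubles the total.
fun appendLoad(table: MutableList<SalesLine>, lines: List<SalesLine>) {
    table += lines
}

// Upsert load: an existing key is overwritten instead of duplicated.
fun upsertLoad(table: MutableMap<SalesKey, Long>, lines: List<SalesLine>) {
    for (line in lines) {
        table[SalesKey(line.saleDate, line.memberId)] = line.amount
    }
}
```

Loading two lines worth 500 and 300 twice gives a total of 1,600 with the append-only load and 800 with the upsert load.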

5.4 Alternative — delete-then-insert

If upsert feels heavy, or “rewrite that day’s data wholesale” reads more naturally, put a delete Tasklet Step before the aggregation Step.

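@Bean
@JobScope  // org.springframework.batch.core.configuration.annotation.JobScope; required so jobParameters can be late-bound
fun purgeStep(
    jobRepository: JobRepository,
    txManager: PlatformTransactionManager,
    jdbcTemplate: JdbcTemplate,
    @Value("#{jobParameters['targetDate']}") targetDate: LocalDate,
): Step =
    StepBuilder("purgeStep", jobRepository)
        .tasklet({ _, _ ->
            // Remove every row for the business date before it is re-aggregated.
            jdbcTemplate.update("DELETE FROM daily_sales WHERE sale_date = ?", targetDate)
            RepeatStatus.FINISHED
        }, txManager)
        .build()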

Wiring the job as purgeStep → aggregateStep empties that date and refills it on every re-run. The trade-off is clear: upsert is row-level idempotent (strong for partial updates), while delete-then-insert is a “clean slate” (simple to reason about, but it opens an empty window between delete and load). Running them sequentially within one job confines that window to job-execution time.


Recap

The key takeaways from Part 3, one line each:

  • One failed item rolls back the whole chunk — 999 items vanish with it, and only previously committed chunks survive. That boundary is the restart checkpoint.
  • Skip · Retry · NoRollback are decided by exception meaning — data problems Skip, transient errors Retry, side-task failures NoRollback, everything else Fail. When in doubt, Fail is safe.
  • Skip replays the chunk item by item: the Processor and Writer get called again for the same items, so they must be idempotent. Always record skipped items with a SkipListener.
  • ExecutionContext preserves the restart position — with the same JobParameters it binds to the same JobInstance, skipping completed Steps and resuming the failed Step from the saved point.
  • Idempotency is two layers: a JobParameters key + Writer upsert — identify the job by business date, and make same-date re-runs safe with ON CONFLICT DO UPDATE.

Part 4 takes on Job Launch · Scheduling · Operations. So far we’ve been writing jobs; Part 4 is about running them. The difference between JobLauncher and JobOperator, what to trigger with — @Scheduled / Quartz / K8s CronJob / Argo — how to stop the same job running twice across multiple instances (this part’s §5 idempotency is the last safety net there), and the five patterns for where a batch reads its data.


Appendix

A. Full faultTolerant Step example

The full Job skeleton wiring purgeStep → aggregateStep:
import org.springframework.batch.core.Job
import org.springframework.batch.core.job.builder.JobBuilder
import org.springframework.batch.core.repository.JobRepository
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class DailySalesJobConfig {

    @Bean
    fun dailySalesJob(
        jobRepository: JobRepository,
        purgeStep: Step,
        aggregateStep: Step,
    ): Job =
        JobBuilder("dailySalesJob", jobRepository)
            .start(purgeStep)       // empty that date (§5.4)
            .next(aggregateStep)    // re-aggregate and load (§2.2)
            .build()
}

purgeStep is the first layer of idempotency (clean slate), and aggregateStep’s upsert Writer is the second. Either alone is enough, but layering both means no path into a re-run can produce duplicates.

B. Skip / Retry exception cheat sheet

Recommended policy per common exception:

| Exception | Nature | Recommended policy |
| --- | --- | --- |
| org.springframework.batch.item.validator.ValidationException | validation failure (data) | Skip |
| FlatFileParseException | file-row parse failure | Skip |
| org.springframework.dao.DeadlockLoserDataAccessException | DB deadlock (transient) | Retry |
| org.springframework.dao.TransientDataAccessException | transient DB errors in general | Retry |
| java.net.SocketTimeoutException | external-call timeout | Retry (with backoff) |
| org.springframework.dao.DataIntegrityViolationException | constraint violation (integrity) | Fail |
| NullPointerException / config errors | code/environment bug | Fail |

Principle: putting the same exception in both Retry and Skip means “retry, and if it still won’t go, drop it.” A common combo for external-API timeouts.
