Bulk operations in Groovy script tasks
When a Groovy script task calls the Collibra `<Resource>Api` interfaces, such as `assetApi`, `relationApi`, `attributeApi`, or `responsibilityApi`, those calls invoke Spring service beans directly rather than making HTTP requests. All of these beans use Spring's default transaction propagation (`REQUIRED`), which means they join the existing transaction rather than opening their own.
A synchronous script task therefore runs entirely in a single database transaction. Every API call made during the script, including reads and writes, participates in that transaction, which stays open until the script completes.
Impact of bulk operations on transaction management
A script that iterates over a large dataset and writes for each item accumulates all those writes in one transaction. The open transaction holds row-level locks on every row it touches. Under concurrent load, other processes attempting to read or modify the same rows are blocked behind those locks. If the script runs long enough, the lock contention can saturate the database connection pool and make Collibra unresponsive.
Setting script tasks to Asynchronous alone does not resolve this issue. The task is moved to a background job thread, but the transaction boundary is unchanged and the entire script still runs in one transaction.
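To illustrate the problem, consider a sketch like the following, which runs inside a Collibra workflow script task. Here `processOne()` is a hypothetical stand-in for any API write (for example, adding an attribute), not a real Collibra call:

```groovy
// Anti-pattern: one script task, one transaction for all writes.
def items = execution.getVariable("allItems")  // e.g. a List of asset IDs
items.each { id ->
    processOne(id)  // every write joins the script's single open transaction
}
// The transaction commits only here, after the entire loop finishes;
// until then, row locks on everything the loop touched stay held.
```

Whether this task is synchronous or asynchronous, the transaction boundary is the same: one commit at the end of the script.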
Recommended pattern for bulk operations
To avoid long-running transactions, split the workflow into two script tasks connected by a loop-back gateway:
- A collector script task that runs synchronously in one short transaction. It queries the full dataset, builds a list of work items, and stores it as a process variable with `execution.setVariable("workItems", ...)`.
- A processor script task in asynchronous mode, which gives each execution its own transaction. It takes the first batch of items from `workItems`, processes them, updates the process variable with the remaining items, and sets `execution.setVariable("hasMoreWork", !workItems.isEmpty())`.
- An exclusive gateway that routes back to the processor task when `${hasMoreWork}` is true, or proceeds to the end event when `${!hasMoreWork}` is true.
The flow is: collector task → processor task → exclusive gateway, which either loops back to the processor task or continues to the end event.
Each time the async job executor picks up the processor task, it runs in its own transaction. A failure in one batch rolls back only that batch, and the remaining work continues.
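The two script tasks can be sketched as follows, assuming the Collibra workflow runtime (the injected `execution` object); `buildWorkItems()` and `processOne()` are hypothetical stand-ins for the real API reads and writes:

```groovy
// Collector task (synchronous): one short, read-only transaction.
List workItems = buildWorkItems()  // e.g. an assetApi search; reads only
execution.setVariable("workItems", workItems)
execution.setVariable("hasMoreWork", !workItems.isEmpty())
```

```groovy
// Processor task (asynchronous): each pass runs in its own transaction.
int batchSize = 25
List workItems = execution.getVariable("workItems")

// Process only the first batch in this transaction.
workItems.take(batchSize).each { item ->
    processOne(item)  // stand-in for the real API writes
}

// Keep only the unprocessed remainder for the next pass.
List remaining = workItems.drop(batchSize)
execution.setVariable("workItems", remaining)
execution.setVariable("hasMoreWork", !remaining.isEmpty())
```

The exclusive gateway then uses `${hasMoreWork}` as the condition on the loop-back sequence flow and `${!hasMoreWork}` on the flow to the end event.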
Choosing a batch size
A batch of 25 to 50 items is a practical starting point. Smaller batches result in shorter transactions and less lock contention. Larger batches reduce job executor overhead but increase the risk of contention under concurrent load.
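If different deployments need different batch sizes, the size can be read from a process variable rather than hard-coded; the `batchSize` variable name here is an assumption, not a Collibra convention:

```groovy
// In the processor task: fall back to 25 when no batchSize
// variable was supplied at process start.
int batchSize = (execution.getVariable("batchSize") ?: 25) as int
```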
Alternative: parallel multi-instance subprocess
You can wrap the processor in an asynchronous subprocess with the Multi instance type property set to Parallel to process all batches in parallel, which increases throughput. However, all batch jobs compete for database connections simultaneously. Under high concurrent load, this approach may be counterproductive. Use the sequential loop-back pattern unless throughput is a critical requirement.