About the synchronization component
The synchronization component extends the base import component to support import cycles that can be compared with each other, so that assets that are no longer present in the latest cycle can be deleted or marked.
A typical use case for the synchronization component is to keep the data stored in an external system in sync with assets in Collibra, or to keep metadata that describe an external system up to date. More generally, this component allows you to synchronize any set of resources that can be processed together as a group and to keep them up to date. As time goes by, more and more data is added to, modified in or deleted from the external system. Detecting removed data is usually a challenge, but the synchronization component solves this by comparing the current synchronization cycle with the previous one. Whatever is no longer in the current synchronization cycle is considered to be no longer present in the external system. During the finalization step of the cycle, you can choose to delete these missing assets or mark them with a specific status.
In addition, the synchronization component supports the ingestion of large amounts of data by allowing multiple import REST calls to be part of the same cycle. At the end, an explicit finalize REST call completes the cycle.
A synchronization is identified by a synchronization ID. All REST commands that have the same synchronization ID form a single synchronization. Commands with different synchronization IDs are independent of each other. This allows you to have multiple synchronizations running in parallel without interfering with each other.
In general, a synchronization cycle is composed of a series of import steps followed by a finalization step that concludes the cycle. Each import step adds or updates resources the same way as a regular import would do, but using a different REST endpoint. That dedicated endpoint also stores additional metadata behind-the-scenes to allow the synchronization component to track changes from cycle to cycle. To complete the cycle, the finalization step compares the assets that were added or updated in this cycle with those from the previous cycle.
When a cycle starts, all assets that were part of the previous cycle are considered inactive. During the cycle, all assets that are updated become active. At the end of the cycle, during the finalization step, all assets that are still inactive are either deleted or have their status changed, depending on your choice. When you opt for a status change, the new status marks assets as deleted instead of actually deleting them. This has the advantage that no information related to those assets is lost and that the assets can be restored if their absence was transient, for example due to a technical issue or a temporary loss of permissions. When a synchronization cycle updates an asset that was marked as deleted through its status, that asset is restored by automatically changing its status back to the default one, which is the first in the list of statuses supported by that asset type.
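The active and inactive bookkeeping described above can be illustrated with a small in-memory model. This is only a sketch of the concept, not the Collibra implementation and not its REST API: the class `SyncModel`, its methods and the status names `Candidate` and `Obsolete` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field

DEFAULT_STATUS = "Candidate"  # hypothetical: first status supported by the asset type
DELETED_STATUS = "Obsolete"   # hypothetical: status chosen to mark missing assets


@dataclass
class SyncModel:
    """Toy model of a single synchronization (one synchronization ID)."""
    statuses: dict = field(default_factory=dict)  # asset name -> current status
    active: set = field(default_factory=set)      # assets touched in the current cycle

    def import_assets(self, names):
        """Import step: every asset that is added or updated becomes active."""
        for name in names:
            # New assets, and assets previously marked as deleted,
            # get the default status (the latter are thereby restored).
            if name not in self.statuses or self.statuses[name] == DELETED_STATUS:
                self.statuses[name] = DEFAULT_STATUS
            self.active.add(name)

    def finalize(self, mark_with_status=True):
        """Finalization step: handle every asset that is still inactive."""
        inactive = set(self.statuses) - self.active
        for name in inactive:
            if mark_with_status:
                self.statuses[name] = DELETED_STATUS
            else:
                del self.statuses[name]
        # The next cycle starts with all known assets considered inactive.
        self.active.clear()
        return inactive


sync = SyncModel()
sync.import_assets(["customers", "orders"])
sync.finalize()                           # first cycle: nothing is missing

sync.import_assets(["customers"])         # "orders" is absent from this cycle
missing = sync.finalize()                 # "orders" is marked, not deleted

sync.import_assets(["customers", "orders"])
sync.finalize()                           # "orders" reappears and is restored
```

In this model, choosing `mark_with_status=False` removes the missing asset entirely, so a later reappearance creates it again from scratch instead of restoring it.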
It is strongly advised to always check the result of a set of import commands before submitting the finalization command. A failed import in the cycle may lead to assets being deleted by mistake during the finalization step as some operations have been canceled. Currently, the finalization step doesn’t check for failed import commands in the same cycle as it cannot tell whether the failed import was already fixed manually or through a new import submission.
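Because the finalization step does not perform this check itself, a client can apply a simple guard before submitting it. The sketch below assumes the client has already collected one result per import command. The `ImportResult` type and the `COMPLETED` state label are hypothetical and do not reflect the actual response format.

```python
from dataclasses import dataclass


@dataclass
class ImportResult:
    """Hypothetical summary of one import command submitted in the cycle."""
    job_id: str
    state: str  # assumed labels, for example "COMPLETED" or "ERROR"


def safe_to_finalize(results):
    """Return (ok, failed_job_ids) for the import commands of one cycle.

    A cycle without any import is also treated as unsafe here, because
    finalizing it would treat every asset of the previous cycle as missing.
    That rule is a cautious choice of this sketch, not documented behavior.
    """
    failed = [r.job_id for r in results if r.state != "COMPLETED"]
    ok = bool(results) and not failed
    return ok, failed


results = [
    ImportResult("job-1", "COMPLETED"),
    ImportResult("job-2", "ERROR"),
]
ok, failed = safe_to_finalize(results)
if not ok:
    print("Do not finalize yet; fix or resubmit:", failed)
```

Only after the failed imports have been fixed or resubmitted, and the check passes, would the client go on to submit the finalization command.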
The cycles are implicit and do not have any specific identifiers. An import that is submitted with a specific synchronization ID is therefore considered to belong to the current cycle until you submit a finalization step. From that point on, all incoming commands with the same synchronization ID are part of the next cycle. For the moment, the synchronization component doesn’t allow a new cycle to start while a previous one is ongoing: once you submit a finalization, all further requests that refer to the same synchronization ID are automatically rejected until that finalization step is completed.
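The acceptance and rejection rules for one synchronization ID can be pictured as a two-state machine. The following is a conceptual sketch only: `SyncGate`, `CycleState` and the method names are hypothetical, and the real component enforces these rules on the server side.

```python
from enum import Enum


class CycleState(Enum):
    OPEN = "open"              # imports are accepted into the current cycle
    FINALIZING = "finalizing"  # finalization submitted but not yet completed


class SyncGate:
    """Toy model of which requests are accepted per synchronization ID."""

    def __init__(self):
        self._states = {}  # synchronization ID -> CycleState

    def _is_finalizing(self, sync_id):
        return self._states.get(sync_id, CycleState.OPEN) is CycleState.FINALIZING

    def submit_import(self, sync_id):
        """Return True if the import is accepted into the current cycle."""
        if self._is_finalizing(sync_id):
            return False  # rejected until the finalization completes
        self._states[sync_id] = CycleState.OPEN
        return True

    def submit_finalize(self, sync_id):
        """Return True if the finalization is accepted."""
        if self._is_finalizing(sync_id):
            return False
        self._states[sync_id] = CycleState.FINALIZING
        return True

    def finalization_completed(self, sync_id):
        """Called when the finalization step ends; the next cycle may begin."""
        self._states[sync_id] = CycleState.OPEN


gate = SyncGate()
gate.submit_import("crm-sync")           # accepted: part of the current cycle
gate.submit_finalize("crm-sync")         # accepted: the cycle is closing
gate.submit_import("crm-sync")           # rejected: finalization still running
gate.submit_import("erp-sync")           # accepted: a different ID is independent
gate.finalization_completed("crm-sync")
gate.submit_import("crm-sync")           # accepted: first import of the next cycle
```

Note that the model keeps no cycle counter, which mirrors the fact that cycles are implicit and have no identifiers of their own.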
Synchronization component remarks
- The synchronization import steps have dedicated REST endpoints that are different from those used by the regular import component.
- A specific REST endpoint is available to submit a full cycle in one single REST call. You should use that endpoint only for small payloads under 50,000 resources.
- Each synchronization stores some metadata in the database to keep track of the progress and to compare cycles. If you consider that a synchronization is no longer needed, you can delete that data and reclaim the database space using the delete REST command.
- Starting with Collibra version 2023.06, marking assets as deleted using their status is available for both full and batch synchronization operations.
- There is currently nothing that prevents two or more different synchronizations from updating the same assets. However, this could lead to unpredictable behavior when assets are deleted or when synchronizations run in parallel. It is therefore advised not to share assets between different synchronizations.
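The remark about the single-call endpoint suggests a simple client-side decision: submit small payloads as one full cycle, and split larger ones into several import steps followed by a finalization. The sketch below only plans the work and makes no REST calls. The function `plan_cycle` and the default `batch_size` of 10,000 are assumptions for illustration; only the 50,000-resource threshold comes from the remarks above.

```python
FULL_CYCLE_LIMIT = 50_000  # threshold taken from the remark on the single-call endpoint


def plan_cycle(resources, batch_size=10_000):
    """Decide how to submit one cycle.

    Returns ("full", [all_resources]) for small payloads, or
    ("batch", [chunk_1, chunk_2, ...]) when the payload should be split
    into several import steps that are followed by an explicit finalization.
    batch_size is an arbitrary illustrative value, not a recommended setting.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(resources) < FULL_CYCLE_LIMIT:
        return "full", [list(resources)]
    chunks = [
        list(resources[i:i + batch_size])
        for i in range(0, len(resources), batch_size)
    ]
    return "batch", chunks


mode, payloads = plan_cycle([f"asset-{i}" for i in range(120_000)])
print(mode, len(payloads))  # one import step per payload, then one finalization
```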