Cursor-based pagination

The Collibra APIs support offset-based pagination as the main method of browsing data through our REST and Java endpoints. This method has both consistency and performance limitations, which you can overcome by using cursor-based pagination.

Offset-based pagination

In offset-based pagination, you browse the data in pages from a collection of records that is always sorted on one or more of its fields. Each page is a window that starts at an arbitrary index in the sorted collection and spans a given number of records. To define this window, the API exposes the following parameters:

  • offset: the index at which the desired page starts.
  • limit: the size of the desired page.

https://<your_collibra_url>/rest/2.0/assets?offset=3000&limit=1000

The response is similar to the following:

{
    "total": 10000000,
    "offset": 3000,
    "limit": 1000,
    "results": [ ... ]
}

The API consumer can browse pages by incrementing the offset by the page size on each request. When a response contains fewer results than the provided limit, the end of the sorted collection of records has been reached.

Offset-based pagination supports the calculation of the total number of records. This can be a resource-intensive operation, which you can disable by providing the countLimit parameter with a zero value: https://<your_collibra_url>/rest/2.0/assets?offset=3000&limit=1000&countLimit=0
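The browsing loop described above can be sketched as follows. This is a minimal illustration, not an official client: fetch_page is a hypothetical helper that would perform the HTTP GET and return the decoded JSON body, and fake_fetch is an in-memory stand-in so that the sketch runs without a Collibra environment.

```python
from typing import Callable, Iterator


def iter_offset_results(fetch_page: Callable[[int, int], dict],
                        limit: int = 1000) -> Iterator:
    """Yield every record by advancing the offset one page at a time.

    fetch_page(offset, limit) is a hypothetical helper that returns a
    response body with at least a "results" list.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        results = page["results"]
        yield from results
        # A page shorter than the limit marks the end of the collection.
        if len(results) < limit:
            break
        offset += limit


def fake_fetch(offset: int, limit: int) -> dict:
    """Stand-in for the real endpoint: serves slices of a sorted list."""
    data = list(range(7))
    return {
        "total": len(data),
        "offset": offset,
        "limit": limit,
        "results": data[offset:offset + limit],
    }


all_records = list(iter_offset_results(fake_fetch, limit=3))
```

When the collection size is an exact multiple of the limit, this loop issues one extra request that returns an empty page before it stops.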

Limitations of offset-based pagination

Offset-based pagination can result in consistency problems when records are inserted or removed in a page that is located before the page you are currently browsing. This typically occurs when a concurrent user performs such operations while you browse the data. The following example demonstrates such a scenario.

Consider a collection of records such as the following, which is sorted numerically:

[0, 1, 2, 3, 4, 5]

Consider the following sequence of actions:

  1. User A requests a first page with offset=0 and limit=2, returning the expected data: [0, 1].
  2. User B inserts a new record 0.5, making the collection now look like the following: [0, 0.5, 1, 2, 3, 4, 5].
  3. User A requests the next page, starting at the offset that corresponds to the next page to them: offset=2 and limit=2.

The returned data is [1, 2].

In such a situation of concurrent operations, the pages browsed by user A contain duplicate information. Record 1 of this example appears on two different pages.

If we repeat the experiment with a concurrent user removing a record instead of inserting one, we get the inverse consistency problem: records can be missing from the browsed pages.
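Both scenarios can be reproduced with a small in-memory simulation. The get_page function below is a hypothetical stand-in for an offset-based endpoint; it only mimics the windowing behavior and does not call any Collibra API.

```python
def get_page(records, offset, limit):
    """Return one offset-based page from the sorted collection."""
    return sorted(records)[offset:offset + limit]


# Insertion scenario: user B inserts 0.5 between user A's two requests.
records = [0, 1, 2, 3, 4, 5]
first_page = get_page(records, offset=0, limit=2)
records.append(0.5)
second_page = get_page(records, offset=2, limit=2)
duplicates = set(first_page) & set(second_page)  # record 1 appears twice

# Removal scenario: user B removes record 0 between the two requests.
records = [0, 1, 2, 3, 4, 5]
first_page_r = get_page(records, offset=0, limit=2)
records.remove(0)
second_page_r = get_page(records, offset=2, limit=2)
# Record 2 shifted into the first window after it was read, so it is skipped.
skipped = 2 not in first_page_r + second_page_r
```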

Such consistency issues make technical integrations difficult to write in a robust and secure way.

Cursor-based pagination

Cursor-based pagination uses the concept of a cursor, which can be seen as a technical indicator of a record in a sorted collection of records. Instead of indicating an index, as in offset-based pagination, we are directly indicating a record.

Each API call returns the cursor of the record that is next in the collection of records. You can then use this cursor in the next request to indicate from which record to start the next page.

As a result, inserting or deleting records in the previous pages does not have any logical side effects on the subsequent calls.

The API consumer must provide an empty cursor for the initial call:

https://<your_collibra_url>/rest/2.0/assets?limit=1000&cursor=&sortField=ID

The response is similar to the following:

{
   "total": -1,
   "offset": -1,
   "limit": 1000,
   "results": [ ... ],
   "nextCursor": "QUZURVI6aWQ6ODgxZDc3MjgtOGIwNy00Yjk3LWIwN2UtMjVlMDMxMjQ5Y2U4OnNpZ25pZmllcjowMDAwMDBjZi02N2QzLTQ0ZjMtYjViZS0xYjczMGNmYTY2ZmQ="
}

You can then use the nextCursor value and pass it in the next request:

https://<your_collibra_url>/rest/2.0/assets?limit=1000&cursor=QUZURVI6aWQ6ODgxZDc3MjgtOGIwNy00Yjk3LWIwN2UtMjVlMDMxMjQ5Y2U4OnNpZ25pZmllcjowMDAwMDBjZi02N2QzLTQ0ZjMtYjViZS0xYjczMGNmYTY2ZmQ=&sortField=ID

When no more data is available, the response does not contain the nextCursor field, indicating that there are no more pages available.
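The request sequence above can be sketched as a loop that starts with an empty cursor and stops when nextCursor is absent. This is an illustration under assumptions: fetch_page is a hypothetical helper that would perform the HTTP GET, and fake_fetch uses a toy cursor (the value of the next record) in place of the opaque cursor that the real API returns.

```python
from typing import Callable, Iterator

DATA = [0, 1, 2, 3, 4, 5]


def iter_cursor_results(fetch_page: Callable[[str], dict]) -> Iterator:
    """Yield every record, following nextCursor until it is absent."""
    cursor = ""  # the initial call uses an empty cursor
    while True:
        page = fetch_page(cursor)
        yield from page["results"]
        cursor = page.get("nextCursor")
        if not cursor:
            break


def fake_fetch(cursor: str, limit: int = 2) -> dict:
    """Stand-in endpoint. The toy cursor names the first record of the page."""
    ordered = sorted(DATA)
    if cursor:
        ordered = [r for r in ordered if r >= float(cursor)]
    body = {"total": -1, "offset": -1, "limit": limit,
            "results": ordered[:limit]}
    if len(ordered) > limit:
        body["nextCursor"] = str(ordered[limit])
    return body


records = iter_cursor_results(fake_fetch)
seen = [next(records), next(records)]  # consume the first page: 0 and 1
DATA.append(0.5)                       # concurrent insert in an earlier page
seen.extend(records)                   # remaining pages are unaffected
```

In this simulation, the record inserted before the cursor position does not shift the following pages, so the consumer sees no duplicates.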

Cursor-based pagination is always performed on a sorted collection of records. If you do not specify the sortField parameter, the API uses a default sort field, which can vary from one API to another.

Use an optimized sort field where possible. Currently, only the ID sort field is optimized, and it is always available in our APIs that support cursor-based pagination. Sorting on it results in significant performance gains and makes response times uniform across all browsed pages.

Cursor-based and offset-based pagination are mutually exclusive. Using parameters from both results in an error.
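A client can guard against mixing the two styles before sending a request. The sketch below is a hypothetical client-side helper, not part of any Collibra library; it builds the query string with Python's standard urllib.parse.urlencode and rejects calls that pass both offset and cursor.

```python
from typing import Optional
from urllib.parse import urlencode


def build_assets_query(limit: int,
                       offset: Optional[int] = None,
                       cursor: Optional[str] = None,
                       sort_field: Optional[str] = None) -> str:
    """Build the query string for either pagination style, never both."""
    if offset is not None and cursor is not None:
        raise ValueError("offset and cursor are mutually exclusive")
    params = {"limit": limit}
    if offset is not None:
        params["offset"] = offset
    if cursor is not None:
        # An empty string is kept: it marks the initial cursor-based call.
        params["cursor"] = cursor
    if sort_field is not None:
        params["sortField"] = sort_field
    return urlencode(params)


first_call = build_assets_query(limit=1000, cursor="", sort_field="ID")
offset_call = build_assets_query(limit=1000, offset=3000)
```

Note that urlencode percent-encodes the trailing = padding of a cursor value as %3D.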

Performance comparison

The following image shows the response time differences between cursor-based and offset-based pagination in Collibra Data Intelligence Platform version 2022.08 when requesting pages of 10,000 records from a total of 10 million records through the Core REST API Assets resource.