
Run a DQ Job from a PySpark notebook

This section shows you how to run DQ Jobs from a PySpark notebook environment, such as Databricks Notebook, Google Colab, or Jupyter.

Prerequisites

  • You are signed into Collibra DQ.

  • You have permission to run DQ Jobs.

  • You have a PySpark notebook, such as Databricks Notebook, Google Colab, or Jupyter.

Steps

This example is based on a notebook created in Google Colab.

Step 1: Create a notebook

Select your PySpark notebook service of choice and create a new notebook.

Step 2: Install PySpark

To get started running a DQ Job from a PySpark notebook, open your new notebook and install PySpark.

  1. Open your PySpark notebook.

  2. Insert a new code cell into your notebook.

  3. Add the following code:

# Install
!pip install -q pyspark==3.4.1         # installs PySpark library version 3.4.1
!pip install -q findspark              # installs the findspark library

  4. Run the cell.

The PySpark version you install should match the Spark version supplied to you by Collibra DQ. For example, if you received the default 3.4.1 version (as shown in the example above), the first line of code should be !pip install -q pyspark==3.4.1.
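To confirm that the installed version matches before you continue, a short check like the following may help. This is a minimal sketch that uses only the Python standard library; the names EXPECTED_SPARK_VERSION, installed_version, and versions_match are illustrative and not part of Collibra DQ or PySpark.

```python
from importlib import metadata

# Assumption: replace this with the Spark version Collibra DQ supplied to you.
EXPECTED_SPARK_VERSION = "3.4.1"


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(installed, expected):
    """Compare the first three dot-separated parts, ignoring a fourth such as '.post1'."""
    if installed is None:
        return False
    return installed.split(".")[:3] == expected.split(".")[:3]


found = installed_version("pyspark")
if versions_match(found, EXPECTED_SPARK_VERSION):
    print(f"pyspark {found} matches the expected version")
else:
    print(f"pyspark is {found}, expected {EXPECTED_SPARK_VERSION}")
```

If the versions differ, rerun the install cell with the version number that matches your Collibra DQ build.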

Step 3: Import the libraries

  1. Insert a new code cell and add the following code:

  2. Run the cell.
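As a rough illustration of what an import cell for this step might look like, the sketch below first checks that the libraries installed in Step 2 are available and then imports them. It is an assumption about a typical PySpark notebook setup, not the official Collibra DQ cell; missing_modules and REQUIRED_MODULES are illustrative names.

```python
import importlib.util

# The two libraries installed in Step 2.
REQUIRED_MODULES = ("findspark", "pyspark")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Install these packages first:", ", ".join(missing))
else:
    import findspark

    # Locates the Spark installation and adds it to sys.path.
    findspark.init()
    from pyspark.sql import SparkSession
```

Your Collibra DQ build may require additional imports beyond these two libraries.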

Step 4: Add the JAR files

In this step, make sure to add the correct JAR file to your notebook environment according to your Collibra DQ core JAR and Spark versions.
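One way to keep track of the JAR files you added is to collect their paths into the comma-separated form that Spark's spark.jars setting accepts. The following is a minimal sketch under stated assumptions: the directory /content/jars is a hypothetical upload location, and collect_jars is an illustrative helper, not a Collibra DQ function.

```python
from pathlib import Path


def collect_jars(directory):
    """Return a comma-separated list of the .jar files in a directory, sorted by path."""
    jars = sorted(Path(directory).glob("*.jar"))
    return ",".join(str(jar) for jar in jars)


# Hypothetical location: point this at wherever you uploaded the JAR files.
JAR_DIR = "/content/jars"

jar_list = collect_jars(JAR_DIR)
print(jar_list if jar_list else f"No JAR files found in {JAR_DIR}")
```

If the message reports that no JAR files were found, check the upload location before you start the SparkSession.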

Step 5: Add secrets and environment details

  1. Insert a new code cell and add the following code, replacing the placeholder values inside the double ("") and single ('') quotation marks with your own information:

  2. Run the cell.
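If you prefer not to paste secrets directly into a cell, an alternative is to read them from environment variables and fail early when one is missing. This is a sketch of that alternative, not the cell this step describes; the variable names DQ_HOST, DQ_USERNAME, and DQ_PASSWORD are hypothetical placeholders, so substitute the settings your Collibra DQ environment actually requires.

```python
import os

# Hypothetical variable names: use whatever settings your Collibra DQ setup expects.
REQUIRED_SETTINGS = ("DQ_HOST", "DQ_USERNAME", "DQ_PASSWORD")


def load_settings(names=REQUIRED_SETTINGS, environ=None):
    """Read each named setting from the environment and fail early if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

After you set the variables in your notebook service, calling load_settings() returns them as a dictionary, or raises an error that names the ones still missing.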

Step 6: Start the SparkSession

  1. Insert a new code cell and add the following code, replacing the placeholder values inside the double ("") and single ('') quotation marks with your own SparkSession preferences:

  2. Run the cell.
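A SparkSession for this step can be sketched as follows, assuming a local master and a JAR list passed through the spark.jars setting. The helper names session_settings and start_session are illustrative, and the settings your Collibra DQ build needs may differ from the three shown here.

```python
def session_settings(app_name, master="local[*]", jars=""):
    """Build the Spark configuration as a plain dict so it can be reviewed first."""
    settings = {"spark.app.name": app_name, "spark.master": master}
    if jars:
        settings["spark.jars"] = jars  # comma-separated list of JAR paths
    return settings


def start_session(settings):
    """Create (or reuse) a SparkSession configured with the given settings."""
    # Imported here so that session_settings works even where Spark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, start_session(session_settings("dq-notebook", jars="/content/jars/example.jar")) would start a local session; both the application name and the JAR path in that call are hypothetical.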

Step 7: Load the DataFrame

  1. Insert a new code cell and add the following code, replacing the placeholder values inside the double ("") and single ('') quotation marks with your own information:

  2. Run the cell.
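Loading a DataFrame depends on your data source. As one possible illustration, the sketch below reads a CSV file with a header row through an existing SparkSession. It assumes a CSV source, and csv_read_options and load_csv are illustrative helpers rather than part of Collibra DQ.

```python
def csv_read_options(has_header=True, infer_schema=True):
    """Options for Spark's CSV reader, expressed as 'true' or 'false' strings."""
    return {
        "header": str(has_header).lower(),
        "inferSchema": str(infer_schema).lower(),
    }


def load_csv(spark, path, **option_overrides):
    """Load a CSV file into a DataFrame using an existing SparkSession."""
    options = {**csv_read_options(), **option_overrides}
    return spark.read.options(**options).csv(path)
```

With a running session named spark, a call such as load_csv(spark, "/content/sample.csv") would return the DataFrame; the file path in that call is hypothetical.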

Step 8: Run the job

  1. Insert a new code cell and add the following code:

  2. Run the cell.

Step 9: Check the Jobs page in Collibra DQ

In another tab or window with Collibra DQ open, click the Jobs icon and check that your job was submitted to the Jobs queue for processing.
