# About Soda

Soda is a Python library that facilitates data quality assessment, validation, and monitoring within your data pipelines and analytical workflows.

### Key Features

* Data validation: Soda provides 25+ built-in validation rules that help you ensure data conforms to expected formats, ranges, constraints, and integrity rules.
* Data profiling: Soda helps you understand your data by generating descriptive statistics and summaries, such as data distributions, frequency distributions, and null value counts.
* Data monitoring: With Soda, you can set up automated data quality monitoring tasks to regularly assess the health of your data.
* Custom validation rules: Beyond the built-in validation rules, you can create custom rules tailored to your specific data requirements and business logic.
* Integration with data pipelines: Soda integrates with your existing data pipelines so that you can easily check data quality at multiple points during data processing.

## Set up Soda

Refer to the Soda documentation: https://docs.soda.io/soda/quick-start-databricks.html

### Install a Soda Library package with Apache Spark DataFrame

In [None]:
%pip install -i https://pypi.cloud.soda.io soda-spark-df

### Import Scan from Soda Library

A scan is a command that executes checks to extract information about data in a dataset. Soda uses the input you provide to prepare SQL queries that it runs against the data in one or more datasets.

In [4]:
from soda.scan import Scan

### Create a Spark DataFrame, or use the Spark API to read data and create a DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns, providing a structured and tabular representation of data within the Apache Spark framework. 

In [None]:
df = spark.table("delta.`/databricks-datasets/adventureworks/tables/adventureworks`")

### Create a view that SodaCL uses as a dataset

In [None]:
df.createOrReplaceTempView("adventureworks")

### Create a scan object

In [None]:
scan = Scan()

### Set a scan definition

Use a scan definition to configure which data to scan and how to execute the scan.

In [None]:
scan.set_scan_definition_name("Databricks Notebook")
scan.set_data_source_name("spark_df")

### Attach a Spark session

In [None]:
scan.add_spark_session(spark)

### Define checks for datasets

A Soda Check is a test that Soda Library performs when it scans a dataset in your data source. You can define your checks inline in the notebook, or define them in a separate checks.yml file that is accessible by Spark.

In [None]:
checks = """
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
      name: Ensure values are formatted as email addresses
  - missing_count(last_name) = 0:
      name: Ensure there are no null values in the Last Name column
  - duplicate_count(phone) = 0:
      name: No duplicate phone numbers
  - freshness(date_first_purchase) < 7d:
      name: Data in this dataset is less than 7 days old
  - schema:
      warn:
        when schema changes: any
      name: Columns have not been added, removed, or changed
sample datasets:
  datasets:
    - include dim_%
"""

#### OR, define checks in a file accessible via Spark, then use the scan.add_sodacl_yaml_file method to load the checks

In [None]:
scan.add_sodacl_yaml_str(checks)
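
The file-based alternative mentioned above can be sketched as follows. This is a minimal sketch, not Soda's documented workflow: the helper names `write_checks_file` and `load_checks_from_file` are hypothetical, the one-line SodaCL body is a placeholder, and the sketch assumes the Scan object exposes an `add_sodacl_yaml_file(file_path)` method.

```python
import tempfile
from pathlib import Path


def write_checks_file(checks_yaml: str, directory: str) -> Path:
    """Write SodaCL text to <directory>/checks.yml and return the path."""
    path = Path(directory) / "checks.yml"
    path.write_text(checks_yaml, encoding="utf-8")
    return path


def load_checks_from_file(scan, path: Path) -> None:
    """Register a checks file with a Scan-like object."""
    # Assumption: the Scan class exposes add_sodacl_yaml_file(file_path).
    scan.add_sodacl_yaml_file(str(path))


# Hypothetical minimal SodaCL, used only to demonstrate the round trip.
example_checks = "checks for dim_customer:\n  - row_count > 0\n"
checks_path = write_checks_file(example_checks, tempfile.mkdtemp())
```

In the notebook, `load_checks_from_file(scan, checks_path)` would take the place of the `scan.add_sodacl_yaml_str(checks)` call, provided the path is readable from where the scan runs.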

### Add Soda Cloud
Add your Soda Cloud connection configuration using the API keys you create in Soda Cloud. Sign up for a Soda Cloud account with a free, 45-day trial at https://cloud.soda.io/signup.

Set the host value according to your region:

* EU region: cloud.soda.io
* US region: cloud.us.soda.io

In [None]:
config = """
soda_cloud:
  host: cloud.soda.io
  api_key_id: 399b**3c9
  api_key_secret: hNSg7**1Q
"""

#### OR, configure the connection details in a file accessible via Spark, then use the scan.add_configuration_yaml_file method to load the config

In [None]:
scan.add_configuration_yaml_str(config)
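
To avoid pasting API keys directly into a notebook cell, one option is to assemble the configuration string from values held elsewhere. The sketch below is illustrative only: `build_soda_cloud_config` is a hypothetical helper, and the environment variable names `SODA_API_KEY_ID` and `SODA_API_KEY_SECRET` are assumptions, not names that Soda defines. The host values are the two regional hosts listed above.

```python
import os

# Regional hosts as listed in the Soda Cloud section above.
SODA_CLOUD_HOSTS = {"eu": "cloud.soda.io", "us": "cloud.us.soda.io"}


def build_soda_cloud_config(region: str, api_key_id: str, api_key_secret: str) -> str:
    """Return a soda_cloud YAML block for the given region and API keys."""
    host = SODA_CLOUD_HOSTS[region.lower()]  # raises KeyError for unknown regions
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Hypothetical variable names; the placeholders are used when they are unset.
config_from_env = build_soda_cloud_config(
    "eu",
    os.environ.get("SODA_API_KEY_ID", "<api-key-id>"),
    os.environ.get("SODA_API_KEY_SECRET", "<api-key-secret>"),
)
```

The resulting string can be passed to `scan.add_configuration_yaml_str` in the same way as the hand-written `config` value.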

### Execute a scan

In [None]:
scan.execute()
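
The call above is assumed here to return an integer exit code, with the meanings Soda's documentation gives for scan exit codes. A small helper (the name `describe_exit_code` is hypothetical) makes that value easier to read in notebook output:

```python
# Assumed meanings of the scan exit codes, following Soda's documentation.
EXIT_CODE_MEANINGS = {
    0: "all checks passed",
    1: "at least one check warned",
    2: "at least one check failed",
    3: "Soda encountered a runtime error",
}


def describe_exit_code(code: int) -> str:
    """Translate a scan exit code into a short human-readable message."""
    return EXIT_CODE_MEANINGS.get(code, f"unrecognized exit code {code}")
```

For example, `print(describe_exit_code(scan.execute()))` would run the scan and print a one-line summary of the outcome.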

### Check the Scan object for methods to inspect the scan result
The following prints all logs to the console.

In [None]:
print(scan.get_logs_text()) 
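
Beyond printing logs, a pipeline usually needs to stop when data quality is poor. The following sketch shows one way to do that. It assumes the Scan object exposes `has_error_logs()` and `has_check_fails()` in addition to the `get_logs_text()` method used above; `scan_outcome` and `raise_on_bad_scan` are hypothetical helper names.

```python
def scan_outcome(scan) -> str:
    """Classify an executed scan as 'error', 'fail', or 'ok'."""
    # Assumption: the Scan class exposes has_error_logs() and has_check_fails().
    if scan.has_error_logs():
        return "error"
    if scan.has_check_fails():
        return "fail"
    return "ok"


def raise_on_bad_scan(scan) -> None:
    """Raise RuntimeError, including the scan logs, unless the outcome is 'ok'."""
    outcome = scan_outcome(scan)
    if outcome != "ok":
        raise RuntimeError(f"Soda scan outcome: {outcome}\n{scan.get_logs_text()}")
```

Calling `raise_on_bad_scan(scan)` after `scan.execute()` would make the notebook cell fail, and therefore any job that runs the notebook, whenever a check fails or the scan itself errors.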