Spark mode overview
Ploosh supports two execution modes: Native (Pandas) and Spark (PySpark). Spark mode runs inside a Spark environment such as Microsoft Fabric, Databricks, or a local Spark session, which makes it possible to validate large-scale datasets.
Why Spark mode?
| Benefit | Description |
|---|---|
| Distributed processing | Leverage Spark clusters to validate large volumes of data |
| Native platform access | Query Lakehouse tables, KQL databases, and Delta files directly |
| No data movement | Data stays within the platform, avoiding costly exports |
| Integrated execution | Run tests directly from notebooks alongside your data pipelines |
When to use Spark mode?
| Scenario | Recommended mode |
|---|---|
| CI/CD pipeline on a build agent | Native |
| Local development with small datasets | Native |
| Microsoft Fabric notebooks | Spark |
| Databricks notebooks | Spark |
| Large datasets (millions of rows) | Spark |
| Querying Lakehouse/KQL/Delta files on a cluster | Spark |
Spark connectors
Spark mode uses dedicated connectors. You cannot mix Spark and native connectors in the same test case.
| Connector | Type | Description |
|---|---|---|
| csv_spark | File | Read CSV files via Spark |
| json_spark | File | Read JSON files via Spark |
| parquet_spark | File | Read Parquet files via Spark |
| delta_spark | File | Read Delta tables via Spark |
| sql_spark | Query | Execute Spark SQL queries |
| fabric_kql_spark | Database | Query Fabric KQL databases |
| dremio_spark | Database | Query Dremio via Arrow Flight SQL |
| empty_spark | Utility | Return an empty DataFrame |
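The no-mixing rule can be checked before a run. The sketch below is a hypothetical pre-flight helper, not part of Ploosh: it assumes a test case has already been loaded into a Python dict with `source` and `expected` entries, that Spark connectors can be recognised by the `_spark` suffix shared by every name in the table above, and that `csv` is the name of a native connector.

```python
# Hypothetical pre-flight check; not part of the Ploosh API.
# Assumption: Spark connectors are exactly those whose type ends in "_spark".

def is_spark_connector(connector_type: str) -> bool:
    """Return True if the connector type follows the *_spark naming pattern."""
    return connector_type.endswith("_spark")


def mixes_connector_kinds(case: dict) -> bool:
    """Return True if source and expected use different connector kinds."""
    source_is_spark = is_spark_connector(case["source"]["type"])
    expected_is_spark = is_spark_connector(case["expected"]["type"])
    return source_is_spark != expected_is_spark


# Both sides use Spark connectors: allowed.
spark_case = {
    "source": {"type": "sql_spark", "query": "SELECT * FROM lakehouse.employees"},
    "expected": {"type": "csv_spark", "path": "/lakehouse/default/Files/expected/employees.csv"},
}

# Spark source with an (assumed) native "csv" connector: not allowed.
mixed_case = {
    "source": {"type": "sql_spark", "query": "SELECT * FROM lakehouse.employees"},
    "expected": {"type": "csv", "path": "expected/employees.csv"},
}

print(mixes_connector_kinds(spark_case))  # False
print(mixes_connector_kinds(mixed_case))  # True
```

A check like this could run over the parsed case files before they are handed to Ploosh, so that a mixed case is flagged early rather than at execution time.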
Spark comparison engine
The Spark compare engine supports two comparison modes:
| Mode | Description |
|---|---|
| order (default) | Rows are matched by position using a row_number() window function |
| join | Rows are matched on the columns listed in join_keys (Spark only) |
```yaml
My test case:
  options:
    compare_mode: join
    join_keys:
      - employee_id
  source:
    type: sql_spark
    query: SELECT * FROM lakehouse.employees
  expected:
    type: csv_spark
    path: /lakehouse/default/Files/expected/employees.csv
```
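To make the difference between the two modes concrete, here is a minimal plain-Python illustration of positional matching versus key-based matching. It is not Ploosh's Spark implementation: the function names are hypothetical, rows are modelled as dicts, and the sketch only mirrors the matching behaviour described in the table above.

```python
# Plain-Python illustration of the two matching strategies.
# Not Ploosh's Spark implementation; all function names are hypothetical.

def match_by_order(source, expected):
    """Pair rows by position, as the order mode does with row numbers.

    Rows beyond the length of the shorter input are ignored in this sketch.
    """
    return list(zip(source, expected))


def match_by_keys(source, expected, join_keys):
    """Pair each source row with the expected row sharing its join_keys values."""
    def key_of(row):
        return tuple(row[k] for k in join_keys)

    expected_by_key = {key_of(row): row for row in expected}
    # A source row with no counterpart is paired with None.
    return [(row, expected_by_key.get(key_of(row))) for row in source]


def count_mismatches(pairs):
    """Count the pairs whose two rows differ."""
    return sum(1 for left, right in pairs if left != right)


# Same rows, different order.
source = [
    {"employee_id": 2, "name": "Bob"},
    {"employee_id": 1, "name": "Alice"},
]
expected = [
    {"employee_id": 1, "name": "Alice"},
    {"employee_id": 2, "name": "Bob"},
]

print(count_mismatches(match_by_order(source, expected)))                 # 2
print(count_mismatches(match_by_keys(source, expected, ["employee_id"])))  # 0
```

With identical rows in a different order, positional matching reports two mismatches while key-based matching reports none. This suggests `join` is the safer choice when the source query does not guarantee a row order.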
Calling Ploosh from Python
In Spark mode, Ploosh is called programmatically from Python using the execute_cases() function:
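```python
from ploosh import execute_cases

# `spark` is the active SparkSession (predefined in Fabric and Databricks notebooks).
execute_cases(
    cases="/path/to/cases",
    connections="/path/to/connections.yaml",
    spark_session=spark,
    filter="*.yaml",
    path_output="/path/to/output"
)
```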
See the Python API reference for full details.
Platform-specific guides
- Microsoft Fabric setup — Complete guide for Fabric
- Fabric notebook orchestration — Notebook implementation
- Fabric shortcuts strategy — Cross-workspace data access
- Fabric reporting — Power BI dashboards on test results
- Databricks setup — Running Ploosh on Databricks
- Local Spark — Running Ploosh with a local SparkSession