Ploosh Documentation
What is Ploosh?
Ploosh is an automated testing framework designed for data projects. Based on YAML configuration files, it enables teams to quickly and simply define, execute, and report on data validation tests.
Why Ploosh?
Testing tools built for application development are often a poor fit for data and BI projects. Data systems are typically chains of complex workflows with many dependencies, which makes the end-to-end process hard to test. In traditional development, the usual sequence is: CI (build and tests) / deployment / execution. In data projects, tests usually need deployed pipelines and real data, so the sequence becomes: CI (build only) / deployment / execution / tests.
Ploosh fills this gap by providing a dedicated framework for data testing.
Key benefits
- Reduce testing effort: With automated, repeatable tests, teams can focus on development or on writing complex, high-value test cases.
- Reduce regression risks: By continuously running tests, regressions can be detected quickly and fixed before they impact production.
- Increase test quality: When a new bug is detected, new test cases can be added to the framework to prevent recurrence.
- Improve project quality: With fewer regression bugs and a more efficient team, the product's quality improves.
How it works
A test case consists of two parts: a source (the data to validate) and an expected dataset (the reference data). For a test to pass, the source must match the expected dataset.
The framework offers three main components:
- Connectors: Query data sources (databases, files, APIs) and store the result in a homogeneous format (DataFrame).
- Compare engine: For each test case, compare the source data with the expected data in three successive steps: row count comparison, structural equality check, and row-by-row data comparison.
- Exporters: Export test results in different formats (JSON, CSV, TRX) for integration with reporting tools or CI/CD pipelines.
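The three comparison steps can be illustrated with a minimal sketch. This is not Ploosh's actual implementation: the function name `compare`, the returned messages, and the use of plain lists of dictionaries in place of DataFrames are all assumptions made to keep the example dependency-free.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # one row, keyed by column name (stand-in for a DataFrame row)


def compare(source: List[Row], expected: List[Row]) -> Tuple[bool, str]:
    """Hypothetical compare step: returns (passed, reason), stopping at the first failure."""
    # Step 1: row count comparison
    if len(source) != len(expected):
        return False, f"count: {len(source)} rows in source, {len(expected)} expected"

    # Step 2: structural check (same column names, same order), based on the first row
    source_columns = list(source[0]) if source else []
    expected_columns = list(expected[0]) if expected else []
    if source_columns != expected_columns:
        return False, f"structure: columns {source_columns} != {expected_columns}"

    # Step 3: row-by-row data comparison
    for index, (source_row, expected_row) in enumerate(zip(source, expected)):
        if source_row != expected_row:
            return False, f"data: row {index} differs"

    return True, "passed"
```

Ordering the checks from cheapest to most expensive means a mismatch in row count or columns is reported without scanning the data itself.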
Two execution modes
Ploosh provides two execution modes to adapt to different environments:
| Mode | Engine | Best for |
|---|---|---|
| Native | Pandas | Local execution, CI/CD agents, small to medium datasets |
| Spark | PySpark | Microsoft Fabric, Databricks, large distributed datasets |
⚠️ A Spark connector can only be used with another Spark connector. It is not possible to mix Spark and native connectors in the same test case.
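As an illustration of this rule, the sketch below rejects a test case whose source and expected connectors run in different modes. The `Connector` class and `check_modes` function are hypothetical names for this example only and are not part of Ploosh's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical connector description: a name and the mode it runs in."""
    name: str
    is_spark: bool


def check_modes(source: Connector, expected: Connector) -> None:
    """Raise ValueError when one side is a Spark connector and the other is native."""
    if source.is_spark != expected.is_spark:
        raise ValueError(
            f"cannot mix Spark and native connectors: {source.name!r} and {expected.name!r}"
        )
```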
Supported connectors
| Type | Native connectors | Spark connectors |
|---|---|---|
| Databases | BigQuery, Databricks, Snowflake, SQL Server, PostgreSQL, MySQL, ODBC | SQL Spark, Dremio |
| Files | CSV, Excel, JSON, Parquet, Delta | CSV, JSON, Parquet, Delta |
| BI Tools | Analysis Services, Semantic Model (XMLA) | Fabric KQL |
| Utilities | Empty | Empty |
Supported export formats
| Format | Description |
|---|---|
| JSON | JSON file with detailed results |
| CSV | CSV file with flattened results |
| TRX | Visual Studio Test Results XML format, compatible with Azure DevOps Test Plans |
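To make the difference between the JSON and CSV exports concrete, here is a minimal sketch using only the Python standard library. The result fields (`name`, `state`, `error`) and the function names are assumptions for illustration; the actual files produced by Ploosh may use a different layout.

```python
import csv
import io
import json
from typing import Any, Dict, List

Result = Dict[str, Any]  # hypothetical shape: {"name": ..., "state": ..., "error": ...}


def to_json(results: List[Result]) -> str:
    """Serialize results as-is, keeping any nested detail."""
    return json.dumps(results, indent=2)


def to_csv(results: List[Result]) -> str:
    """Flatten each result to one row with a fixed set of columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "state", "error"], lineterminator="\n"
    )
    writer.writeheader()
    for result in results:
        writer.writerow(
            {
                "name": result["name"],
                "state": result["state"],
                "error": result.get("error", ""),  # empty when the test passed
            }
        )
    return buffer.getvalue()
```

JSON preserves whatever structure each result carries, while CSV forces one flat row per test case, which is easier to load into spreadsheets or BI tools.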