DeltaMax Technical Documentation
1. Overview
DeltaMax™ is an AI-powered monitoring platform for enterprise data pipelines. It is designed to move beyond traditional, rigid rule-based SQL checks and automatically detect, diagnose, and resolve silent data quality issues. The system is built for petabyte-scale environments such as Google Cloud's BigQuery and provides a suite of adaptive machine learning models to ensure data trust.
The primary goal of DeltaMax is to illuminate the entire data supply chain, transforming it from a "black box" into a transparent, monitored asset. It addresses the challenge of "unknown unknowns"—subtle issues that pass standard schema checks but can impact business decisions, such as unexpected changes in data volume or widespread record mismatches after a platform migration.
2. Core Pillars
DeltaMax is built on two core pillars: Anomaly Detection and Intelligent Reconciliation.
2.1 Anomaly & Volatility Detection This pillar enables users to stop manually hunting for problems by automatically flagging meaningful deviations from the norm. The platform learns the unique rhythm of each data source, including its typical volume, values, and arrival patterns. This allows it to differentiate between routine fluctuations and true anomalies, reducing alert fatigue.
Anomaly detection is performed on individual datasets (e.g., a previous month's data and a current month's data) using multiple methods; a brief code sketch follows the list:
- IQR Method: Identifies columns and rows that have triggered anomalies based on the Interquartile Range method.
- Machine Learning Isolation Forest: This ML model identifies observations that have a high anomaly score based on a predefined threshold.
- SHAP Explanations: Ranks the columns from the Isolation Forest model based on SHAP (SHapley Additive exPlanations) scores to explain the top anomalies.
- Variance Analysis: Identifies statistical outliers using parameters such as count, mean, standard deviation, and quartile values.
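The following is a minimal sketch of how the IQR and Isolation Forest checks above could be applied to a single numeric dataset. It assumes pandas and scikit-learn; the synthetic data, column names, and thresholds are illustrative assumptions, not DeltaMax defaults.

```python
# Illustrative sketch only: synthetic data, column names, and thresholds
# are assumptions, not DeltaMax defaults.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(100, 10, 1000),
    "txn_count": rng.poisson(50, 1000),
})
numeric_cols = df.select_dtypes("number").columns

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any column.
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
iqr = q3 - q1
outside = (df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)
df["iqr_anomaly"] = outside.any(axis=1)

# Isolation Forest: score every row; lower scores are more anomalous.
# Rows whose score falls below a chosen threshold are flagged.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(df[numeric_cols])
df["if_score"] = model.score_samples(df[numeric_cols])
df["if_anomaly"] = df["if_score"] < df["if_score"].quantile(0.01)
# SHAP (e.g., shap.TreeExplainer on the fitted model) could then be used to
# rank the columns contributing most to the top anomaly scores.
```

The sketch only marks anomalous rows; the SHAP ranking and variance analysis described above would sit on top of these flags.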
2.2 Intelligent Dataset Reconciliation
When comparing large datasets, a simple count of differences is often insufficient. DeltaMax provides deep context for why records are different, drastically reducing investigation time and certifying data integrity during migrations.
Key reconciliation and comparison modules include the following; a sketch of the T-Test, PSI, and data type checks follows the list:
- Interfile Comparisons:
  - T-Tests: Compares two datasets to identify statistically significant differences across numeric columns, using a p-value threshold (e.g., <0.05).
  - Population Stability Index (PSI): Monitors the population stability between two datasets to identify significant changes in data distribution. A PSI greater than or equal to 0.25 indicates a significant change, while a value between 0.1 and 0.25 signals a small change.
  - High-Level Dataset Comparison: Prints preliminary observations and frequency distributions of data types to provide an initial overview of the two datasets being compared.
- Format and Type Mismatch Analysis:
  - Decimal Format Checks: Identifies mismatches in the number of decimal places for common numerical columns between two datasets.
  - String Length Mismatches: Compares the length of string columns and flags any discrepancies.
  - Data Type Mismatch Analysis: Specifically checks for discrepancies in data types between common columns, which is crucial for accurate analysis and integration.
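A minimal sketch of the T-Test, PSI, and data type mismatch checks between two datasets is shown below. It assumes scipy, numpy, and pandas; the synthetic data, column names, and bin count are illustrative assumptions, and the thresholds simply mirror the values quoted above.

```python
# Illustrative sketch: Welch's t-test, PSI, and a data type mismatch check
# between a previous and a current dataset. Column names, bin count, and
# synthetic data are assumptions; thresholds mirror those quoted above.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
prev_df = pd.DataFrame({"revenue": rng.normal(100, 10, 5000), "region": "EU"})
curr_df = pd.DataFrame({"revenue": rng.normal(103, 12, 5000), "region": "EU"})

# T-test on a common numeric column (Welch's variant, no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(prev_df["revenue"], curr_df["revenue"], equal_var=False)
ttest_flag = p_value < 0.05

# Population Stability Index: bucket both samples on the previous month's bins,
# then sum (actual% - expected%) * ln(actual% / expected%).
def psi(expected, actual, bins=10):
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)   # avoid log(0) / divide-by-zero
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

psi_value = psi(prev_df["revenue"], curr_df["revenue"])
psi_flag = psi_value >= 0.25          # >= 0.25 significant; 0.1-0.25 small change

# Data type mismatches across common columns.
common = prev_df.columns.intersection(curr_df.columns)
dtype_mismatches = {c: (prev_df[c].dtype, curr_df[c].dtype)
                    for c in common if prev_df[c].dtype != curr_df[c].dtype}
```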
3. Preprocessing and Business Analysis
To ensure data is clean and ready for advanced analysis, DeltaMax includes several essential modules; a brief sketch of both follows the list:
- Data Preprocessing and Imputation: Handles essential data cleaning steps, including the imputation of missing values in numerical columns using the median value of the respective column.
- Business Unique Analysis (A Not B; B Not A): Provides an additional layer of analysis by identifying businesses that are present in one dataset but not the other. This can be used to highlight customer acquisition or churn between periods.
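A minimal sketch of these two modules follows, assuming pandas and a hypothetical business_id key; the tiny inline datasets are for illustration only.

```python
# Illustrative sketch: median imputation for numeric columns plus the
# "A not B / B not A" business comparison. The business_id key and the
# inline data are assumptions, not DeltaMax defaults.
import pandas as pd

prev_df = pd.DataFrame({"business_id": [1, 2, 3], "revenue": [100.0, None, 95.0]})
curr_df = pd.DataFrame({"business_id": [2, 3, 4], "revenue": [None, 97.0, 88.0]})

# Impute missing numeric values with the median of the respective column.
for df in (prev_df, curr_df):
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Businesses present in one dataset but not the other.
a_not_b = set(prev_df["business_id"]) - set(curr_df["business_id"])  # possible churn
b_not_a = set(curr_df["business_id"]) - set(prev_df["business_id"])  # possible acquisition
```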
4. Outputs and Deliverables
DeltaMax produces actionable outputs designed for Data Quality and engineering teams.
4.1 Merged File Creation
This module combines two monthly CSV files into a single file, aligning the data for each business side-by-side. This is extremely useful for comparing data over time, as it clearly shows continuing, new, and potentially churned businesses in one consolidated view.
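A sketch of the side-by-side merge is shown below, assuming pandas; the file names, the business_id key, and the column suffixes are illustrative assumptions.

```python
# Illustrative sketch: merge two monthly CSVs side-by-side on a business key.
# File names, the business_id key, and the suffixes are assumptions.
import pandas as pd

prev_df = pd.read_csv("previous_month.csv")
curr_df = pd.read_csv("current_month.csv")

merged = prev_df.merge(curr_df, on="business_id", how="outer",
                       suffixes=("_prev", "_curr"), indicator=True)
# The indicator column "_merge" marks "both" (continuing), "right_only" (new),
# and "left_only" (potentially churned) businesses.
merged.to_csv("merged_file.csv", index=False)
```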
4.2 Reason Code File
A critical output is the reason_code.csv file, which contains all rows where data quality metrics have been impacted. The file carries a specific flag for each issue detected, including:
- IQR and Isolation Forest Anomalies
- T-Test significant differences
- PSI significant changes
- Decimal format and string length mismatches
- Data type mismatches
- Anomalies found in records unique to the current dataset (B not A)
Data Quality teams are advised to use this file to reject, manually remediate, or understand the business reasons behind the anomalous data before it is accepted into historical datasets.
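A minimal sketch of how the individual flags could be collected into reason_code.csv is shown below; the boolean flag column names are assumptions standing in for the outputs of the checks listed above.

```python
# Illustrative sketch: collect every row with at least one triggered flag
# into reason_code.csv. The flag column names are assumptions standing in
# for the outputs of the checks listed above.
import pandas as pd

flag_cols = ["iqr_anomaly", "if_anomaly", "ttest_flag", "psi_flag",
             "decimal_mismatch", "strlen_mismatch", "dtype_mismatch",
             "b_not_a_anomaly"]

merged = pd.read_csv("merged_file.csv")
present = [c for c in flag_cols if c in merged.columns]
reason_codes = merged[merged[present].any(axis=1)]
reason_codes.to_csv("reason_code.csv", index=False)
```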
4.3 Visualization
The final step involves visualizing the generated reason codes in a dashboarding tool, such as Looker, for clear and accessible analysis of the findings.
