Cloud data management unlocks the power of transformative technologies, including prescriptive AI solutions, advanced analytics and third-party data enrichment. But you can’t use new technologies or tools on data you can’t trust. Data Quality Watch™ is an automation solution that uses Talend to profile the quality and usability of data within the Snowflake Data Cloud.
In this demo, Passerelle Director of Data Governance Maruthi Govindappa highlights the key features of the Data Quality Watch dashboard and shows how to set up the data profiling process. Organizations can use Data Quality Watch to profile data on Integrity, Completeness, Consistency, Timeliness and Popularity.
Hello everyone. In this session, I will be demoing Passerelle’s Data Quality Watch tool and presenting its functionality to you. Data Quality Watch is a metadata-driven automation tool that programmatically profiles a configured data set and measures its current state of data quality. The quality of the data is measured across five focus areas:
1. Integrity – checks data for uniqueness and referential integrity, and verifies that the data is aligned with its business process.
2. Completeness – checks that the data is complete, with no partial or missing values.
3. Consistency – checks that the data is consistent: for example, that column values fall within an expected range, pass enumeration checks, pattern checks, and so on.
4. Timeliness – having up-to-date information is critical for analytics. The Data Quality Watch process checks that the data is updated at the expected frequency and doesn’t have a data lag.
5. Popularity – Data Quality Watch profiles a data set’s usage to measure its value. The process scans the Snowflake metadata and provides metrics on how frequently and how widely the data is used.
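As an illustrative sketch only (not Passerelle’s actual implementation), two of the metrics described above, completeness and uniqueness, can be computed over in-memory column values like this:

```python
# Illustrative sketch of two profiling metrics from the list above.
# This is NOT the Data Quality Watch implementation; it only shows the
# kind of column-level measurement the tool produces.

def completeness(values):
    """Percentage of non-empty values in a column (Completeness)."""
    if not values:
        return 0.0
    non_empty = sum(1 for v in values if v not in (None, ""))
    return 100.0 * non_empty / len(values)

def uniqueness(values):
    """Percentage of distinct values in a column (an Integrity signal)."""
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)

loan_ids = ["L1", "L2", "L2", None, "L3"]
print(completeness(loan_ids))  # 80.0 (4 of 5 values are non-empty)
print(uniqueness(loan_ids))    # 80.0 (4 distinct values out of 5)
```

In the real tool these measurements run inside Snowflake at scale; the sketch just makes the definitions concrete.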
Data Quality Watch is built using Snowflake and Talend, and the output of the Data Quality Watch process is visualized using either a Power BI or Tableau dashboard. To operate the Data Quality Watch process, a configuration has to be defined. For example, the user specifies which business unit’s data to profile and where that data is stored in terms of database, schema, tables, and so on. If data validation using custom business rules is needed, that input also has to be configured.
The configuration data is stored within Snowflake under the Data Rocket metadata schema. Using these configured data sets, the Data Quality Watch process profiles the enterprise (source) data. It also uses the Snowflake metadata under the information schema to list all the tables of a given database, or all the columns, according to the configuration.
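As a hedged illustration, a configuration record of the kind described might carry fields like the ones below. The field names are assumptions for demonstration; the actual Data Rocket metadata schema is not shown in this demo.

```python
# Hypothetical shape of a Data Quality Watch configuration record.
# Field names are illustrative assumptions, not the real schema.
config = {
    "business_unit": "Loan Management",
    "database": "SUPERBANK",
    "schema": "LOANS",           # optional: omit to profile all schemas
    "tables": ["HOME_LOANS"],    # optional: omit to profile all tables
    "dimension": "Completeness", # one of the five focus areas
    "metric": "empty_data",
    "custom_rule": None,         # optional custom business-rule input
    "enabled": True,             # configurations can be toggled per run
}
print(sorted(config))
```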
The Data Quality Watch process is smart and really flexible: as part of the configuration, a user can choose to profile the data of an entire database, a specific schema, a specific set of tables, or a specific column. This really simplifies the configuration model. You don’t need to explicitly specify all the columns or all the tables; as I said, you can choose the database at the high level or go down to a granular level at the column level.
Using these configuration parameters and the information schema, the process profiles the data sets and records the results in the Snowflake metadata database. The data is also persisted for trend analysis: at any point in time, you can look back at historical runs of the data quality process to understand the state of the data at that point. Once installed, the data is visualized using the Data Quality Watch dashboard in either Power BI or Tableau. Talend is used to orchestrate the entire Data Quality Watch process: it automates scheduling, monitors executions, and surfaces the logs and insights required by the IT team.
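The scoping behavior just described, where a configuration may name only a database or narrow down to a schema, table, or column, can be sketched as expanding the configuration against information-schema-style metadata. The catalog below is a stand-in for Snowflake’s `INFORMATION_SCHEMA.COLUMNS`; names are illustrative.

```python
# Sketch of configuration scope resolution (illustrative only).
# CATALOG stands in for Snowflake's INFORMATION_SCHEMA.COLUMNS view.
CATALOG = [  # (database, schema, table, column)
    ("SUPERBANK", "LOANS", "HOME_LOANS", "LOAN_ID"),
    ("SUPERBANK", "LOANS", "HOME_LOANS", "BALANCE"),
    ("SUPERBANK", "ACCOUNTS", "CUSTOMER", "ACCOUNT_ID"),
]

def resolve_scope(config):
    """Expand a config (database required; schema/table/column optional)
    to the concrete columns it covers."""
    return [
        row for row in CATALOG
        if row[0] == config["database"]
        and config.get("schema") in (None, row[1])
        and config.get("table") in (None, row[2])
        and config.get("column") in (None, row[3])
    ]

# A database-level config covers everything underneath it:
print(len(resolve_scope({"database": "SUPERBANK"})))  # 3 columns
# A table-level config narrows to that table's columns:
print(resolve_scope({"database": "SUPERBANK", "table": "CUSTOMER"}))
```

This is why the user never has to enumerate every table or column explicitly: anything not specified is treated as "all".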
For this demo, let’s look at a sample data set for Super Bank. The operational system has data related to loan management and customer accounts, plus third-party data from Equifax providing information related to home equity lines of credit. We are looking at around seven tables; these tables are wide, with more than 30 columns each. For this first sample we are looking at hundreds of rows, though the volume could be millions or more, depending on the configuration. We’ll go ahead and execute the Data Quality Watch backend process.
Here’s a quick look at what the configuration model looks like for Data Quality Watch. Off the shelf, as I said, there are five focus areas in Data Quality Watch, which we call dimensions. For each dimension, there are profiling metrics. As an example, under completeness the metric is empty data: we check whether the data is partial or complete. Under consistency, we check for outliers, and so on.
In the configuration, the user specifies which data Data Quality Watch needs to profile. As I said, you can specify just a database, and if no further information is provided, the Data Quality Watch process will profile all the schemas, tables, and columns underneath it. Otherwise, you can specify a specific set of tables, as you see here. For the empty-data metric, I’m looking at a couple of specific tables, the loan fact and home loans tables; one of these configurations covers the Equifax data; we want to exclude these two tables; and then we list the columns associated with each.
For consistency, you can also specify additional parameters, such as the expected range of values: a fully configurable and flexible model. Once the profiling configurations are set, each metric has an anomaly definition. As an example, suppose we know that 80% of the data is non-empty. Is that good? You might say no: if less than 80% is non-empty, it becomes an anomaly. So every metric has an anomaly definition associated with it. The definitions are pre-built and can also be customized: for each of the configurations set previously, you can specify a range of threshold values with lower, mid, and higher bounds.
Again, Data Quality Watch is powerful in this regard, because you don’t have to set one standard threshold across all the data sets. A threshold can apply across all data sets, or you can choose what the thresholds look like for specific configurations. For one table you might expect the data to be at least 80% non-empty, whereas for another table you would require 90% or even 100%. So it can be made really flexible.
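The per-configuration threshold idea can be sketched as follows. This is a simplified illustration with hypothetical names: each metric gets a threshold band (lower/mid/higher), set globally or overridden per table.

```python
# Sketch of the anomaly-threshold logic described above (illustrative;
# names and structure are assumptions, not the actual product schema).
DEFAULT_BAND = {"lower": 80.0, "mid": 90.0, "higher": 100.0}
OVERRIDES = {"HOME_LOANS": {"lower": 90.0, "mid": 95.0, "higher": 100.0}}

def classify(table, metric_value):
    """'anomaly' below the lower bound, 'warning' below the mid bound,
    otherwise 'ok'. Per-table overrides beat the default band."""
    band = OVERRIDES.get(table, DEFAULT_BAND)
    if metric_value < band["lower"]:
        return "anomaly"
    if metric_value < band["mid"]:
        return "warning"
    return "ok"

print(classify("CUSTOMER", 85.0))    # warning: 85 >= 80 but < 90
print(classify("HOME_LOANS", 85.0))  # anomaly: this table requires >= 90
```

The same metric value of 85% is acceptable for one table and an anomaly for another, which is the flexibility described above.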
Once the configurations are made, you also have flexibility for any given run. You can set whether Data Quality Watch should process all the configurations that have been set, or disable some of them. As an example, if you don’t want to keep measuring the popularity of the data on a given day’s run, you can disable that configuration so it is not executed. This gives you the flexibility to optimize a run when required; you can keep configurations enabled or disable them as you choose.
If we assemble all these configurations, it looks something like this: for each dimension, the metrics, the data sets those metrics are configured for, whether each is enabled or disabled, and the thresholds, which are fully customizable. Every industry and every business unit has its own business process and its own thresholds, so you can configure these in accordance with the business process. Now that we have set up the configuration, let’s trigger the Data Quality Watch process. To do that, we go to Talend Cloud.
In the management console, go to Management and look at Tasks. We have the data quality task specific to Data Quality Watch; let’s start that. It can also be scheduled to run on a recurring basis. To do that, go to any task and add a trigger: set the trigger type to daily and name it, say, “Daily DQW.” Choose your time zone, set it to repeat every day, and so on; it’s fairly self-explanatory. Then just click Save and Close, and your Data Quality Watch task is ready to execute on a daily trigger, as you see here. You can also run it on demand by clicking “Run now.” It will take around 15 to 20 minutes, because we have many columns and lots of rows to process.
While the Data Quality Watch process is running in the background, let’s look at the dashboard for previously executed runs. As of September 6, the summary view provides quick insights into the state of the data for each of the focus areas. As an example, integrity seems to be off as of September 6: the threshold is supposed to be at 2.5%, whereas it’s just 65%. The same goes for completeness, which also seems to be off; consistency seems to be okay, the data is fresh, and you get the average hits for the selected data, as per the configuration.
Along with the metrics for a specific date, the summary view also shows the trend of the data for each dimension across all periods. As you see, the integrity of the data was improving for six days but then declined on September 6. Data Quality Watch provides a historical view of the data quality results, so you can select any specific day to see what the data looked like then. You can also narrow the insights to a specific business unit, for example by filtering for loan management, or filter by application. You can also filter down to specific data sets; in this configuration we have one database and one schema, along with the summary information.
The dashboard also provides detailed information for each of the dimensions, or focus areas. As an example, under integrity we can see the percentage of unique values for each table and its trend over time. You can then drill down to the column level in this view: for example, on September 6, if I go to loan management, I can see that the loan ID has a really low unique-value percentage, when it is expected to be 80%. As I said, you can also narrow down by business unit or business application, and you can search at a specific column level. For example, if I want to know the uniqueness of loan ID and it is present in multiple tables, I can quickly select loan ID and see its uniqueness across the tables. Completeness is similar to integrity: we measure the percentage of non-empty values by table, and you can drill down to the column level just as we did for integrity. If I drill down here, you can see that Data Quality Watch is configured for a specific table, where it measures the percentage of completeness for each of the columns and reports the metric value at the column level.
Consistency is similar to completeness. As per the configuration, we looked at one data set for consistency, to see what percentage of the values are outliers, and you get the trend of that value over time. There is also a quick summary for a given day of whether this metric is in a good state or not, along with some quick metrics on how many rows were profiled. For completeness, this run processed 68 columns across nearly 175,000 rows. The same concept applies to timeliness, where you can drill down to understand the state of the data. Here, the data for this table was really off because it had not been refreshed; after the process was triggered, it is more up to date in the current run.
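The timeliness check described here amounts to comparing a table’s last-refresh time against its expected refresh frequency. A minimal sketch, assuming a simple hours-based frequency (this is illustrative, not the actual implementation):

```python
from datetime import datetime, timedelta

def is_stale(last_refreshed, expected_frequency_hours, now=None):
    """A table has data lag if it hasn't been refreshed within its
    expected frequency window (Timeliness check, simplified)."""
    now = now or datetime.utcnow()
    return now - last_refreshed > timedelta(hours=expected_frequency_hours)

now = datetime(2022, 9, 6, 12, 0)
print(is_stale(datetime(2022, 9, 4), 24, now))     # True: ~2.5 days old
print(is_stale(datetime(2022, 9, 6, 2), 24, now))  # False: refreshed today
```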
For the popularity configurations, we are looking at the data tables we configured, and for each table the dashboard shows the average hits per day. In this example, the Home Loans table is the most popularly used by average hits per day. You can also drill down to see whether usage has been consistent or is declining over time.
We reviewed all these configurations in the Snowflake metadata, but the dashboard also shows what has been configured, which really helps IT review the setup. For example, if I want to check what we are looking at for integrity, I can select integrity and see that the account name table, and all the tables except these two, are checked for integrity; a consistency check is configured for one of the HELOC tables; and so on.
So that’s pretty much what the Data Quality Watch product is. It’s really an automation tool. Unlike a traditional approach, where you would have to custom-develop the data validation process and data checks, Data Quality Watch is a metadata-driven automation tool with no code required. It really accelerates the work and makes it easier for IT engineers to proactively understand the current state of the data. And if any anomalies are found that would affect the accuracy of analytics, proactive action can be taken to improve the quality of the data over time. I hope this demo provides value to you. Thank you for watching, and I look forward to hearing more from you.