
Demo: Mastered Data Framework

Passerelle’s Senior Engineering Consultant Guillaume Durand walks through Data Rocket’s Data Mastering Framework using a manufacturing scenario in this demo video. 

Guillaume shows how the manufacturer could use the Data Mastering Framework, built on Talend and Snowflake, to create a single version of the truth from three distinct ERP and CRM source systems. Additionally, you will see how the Mastered Data Framework bridges the business and IT divide by automating data deduplication and allowing data stewards to focus on outlying data anomalies.

Learn how you can use Data Rocket’s Mastered Data Framework to:

  • Get started quickly with a ready-to-use and configurable framework. 
  • Apply simple through complex matching by extending the framework with a custom machine-learning model. 
  • Simplify the data matching effort to achieve accurate outcomes. 
  • Comply with industry data governance approaches and best practices.

Ready to get started? Book a complimentary consultation today. 

Hello, everyone, my name is Guillaume Durand. I’m a data engineering consultant at Passerelle. Today I’m going to walk you through how data mastering with Passerelle’s Data Rocket helps you create automated workflows for data deduplication, stewardship and visualization. Let’s start with a little context and background. Our demo use case relates to product sales in a manufacturing and retail distribution channel.

Let’s pretend we are a fictitious organization that manufactures and sells products and materials across North America. This is my Power BI sales dashboard, with multiple KPIs related to sales by month. Our data is managed in three different ERP and CRM source systems: a Salesforce tenant for the Canadian division, a Salesforce tenant for the US division, and a Dynamics 365 ERP.

For this demo, I’m a product sales executive who joined through a company we acquired some time ago. My goals are to increase sales, of course, but also to get accurate insights on top- and bottom-selling products, strategize for upsell, better forecast sales, predict the optimal inventory needed to supply sales demand, and finally get accurate market segmentation and capitalization across all of our sales regions in North America.

Let’s start with our problem statement. Eyeballing the data, I can already see inaccurate product sales insights due to duplication of products within my data. My top- and bottom-selling products do not seem to show the real sales amounts. For example, right now Pastel Paper Rough Color seems to be my best-selling product, but looking down below I can see Colourfix Original appearing multiple times, because the same product is coded with different SKU codes or descriptions across the three different systems, or worse, even within the same system, as we can see with these two instances in Dynamics 365. For example, Colourfix Original is fully spelled out in our U.S. Salesforce tenant, whereas in the Canadian one only Colourfix R appears; in Dynamics 365, these look like real duplicate entries within the same system. Multiple causes can lead to errors in product codes, names and descriptions, and therefore to data duplication: we have three siloed systems with no, or not efficient enough, data integration between them, and duplicate products are also created within the same system due to human and procedural errors, just as we can see here in our Dynamics 365 for Colourfix Original.

If I look at our product inventory quantity, the quantity sold and the sales amount do not always seem to align with the inventory in stock. This could also be caused by duplicate product data, and I might conclude that our inventory cannot fulfil our sales demand. For example, between January and July the sales amount seems higher than the quantity actually available in stock.

Let me switch to after deduplication and look at the mastered data result. Using the master data Golden Record, I can now get accurate KPIs. The dashboard is now telling me that the real top-selling product is in fact Colourfix Original. If I look at the inventory for Colourfix Original, it is now aligned with the actual sales demand. This just prevented me from making the erroneous decision to increase production and manufacturing of this product when the apparent need for more inventory was in fact skewed data.

Resolving data duplication manually can be a tedious and really time-consuming effort, and there may still be errors and inaccuracies due to the lack of an efficient data governance program and workflow within our company. The solution to get rid of all this data fragmentation is the Passerelle Data Rocket Master Data Framework. Our product is an automation framework that provides multiple workflows to aid in the organization and unification of data and the creation of a Golden Record for multiple data entities such as products or customers.

How does it work? The Data Rocket Master Data Framework is a well-defined architectural component that is flexible, efficient and complies with industry data governance best practices. Let’s look at the blueprint of the solution implemented for this fictitious organization and its art products use case.

The Data Rocket Master Framework is really a two-step process. We start by unifying the product data from the three source systems and sending it to our advanced matching algorithm, which is built using Talend Data Quality. All of the products already known to our Master Framework are also looked up, so that each new record can either be assigned to an existing group or create its own new group. The advanced matching algorithm is configurable for each data entity: we assign a matching function to each attribute of the data entity, a distance is calculated between records from the three different source systems, and we also configure a distance threshold. Each product that scores above this threshold is automatically moved to the product master table. Conversely, any product that falls below this confidence threshold is sent to a data steward for manual review. Once the data steward decides whether those products really are duplicates or not, the result is sent back to the product master table.
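
To make the idea concrete, here is a minimal Python sketch of how per-attribute matching functions and weights can be rolled up into a single confidence score. The attribute names, weights and the stand-in similarity function are illustrative assumptions, not the actual Talend Data Quality configuration.

    # Illustrative sketch only: attribute names, weights and the stand-in
    # similarity function are assumptions, not the Data Rocket configuration.
    from difflib import SequenceMatcher

    # Each data entity gets its own attribute-to-weight mapping.
    WEIGHTS = {"brand": 0.30, "category": 0.20, "description": 0.20, "detail": 0.15, "name": 0.15}

    def attribute_similarity(a: str, b: str) -> float:
        """Stand-in match function returning a similarity between 0 and 1."""
        return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

    def match_score(candidate: dict, known_product: dict) -> float:
        """Weighted combination of per-attribute similarities. The resulting score
        is compared against the configured confidence threshold to decide whether
        the record joins an existing group or goes to a data steward."""
        return sum(w * attribute_similarity(candidate[f], known_product[f])
                   for f, w in WEIGHTS.items())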

The second step of the process is the creation of the Golden Record. In our case, we created the dim product dimension for the star schema analytical model that supports the sales dashboard, and we want the product Golden Record to carry a single retail price. Our second process takes each group of products and applies survivorship rules to derive the correct unit price, similar to the first step, with a certain threshold established for this field. Any price that does not meet the survivorship rule, or cannot be derived, is sent to a data steward for final review. The other products, for which we can establish a common unit price, are automatically moved to our dim product. Here again, once the data steward establishes a price, the new product price is sent to the dim product.
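
As a rough illustration of a survivorship rule, the sketch below accepts a group’s unit price automatically only when the proposed prices agree closely; otherwise the group is held for a data steward. The tolerance rule and function names are assumptions for this demo scenario, not Data Rocket’s actual logic.

    # Illustrative price survivorship sketch; the tolerance rule is an assumption.
    from statistics import median

    def surviving_price(prices, tolerance=0.05):
        """Return a surviving unit price when a group's proposed prices agree within
        a relative tolerance of their median; return None so a data steward decides."""
        known = [p for p in prices if p is not None]
        if not known:
            return None                      # no price at all -> steward
        mid = median(known)
        if all(abs(p - mid) <= tolerance * mid for p in known):
            return round(mid, 2)             # consensus price flows into dim product
        return None                          # conflicting prices -> steward review

    print(surviving_price([10.50, 10.50, None]))   # 10.5 -> moved automatically
    print(surviving_price([10.50, 5.50]))          # None -> sent to the data steward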

Now that we’ve reviewed this blueprint, let’s take a look at some key points of the Data Rocket Master Framework implementation. A key part of the advanced matching algorithm workflow is Talend and the creation of an auto-match rule. For our fictitious company that manufactures and sells art products, we created one rule based on five product attributes. Across all three source systems, we found that the product brand, the product category, the product description, the product detail and the product name were reliable enough to match records against one another.

We used different matching functions suited to each attribute’s type and meaning, and we also gave each a weight, because some field matches are more important than others. We first try to match the product brand exactly. For the second and third product attributes, the category and the description, we try to match the records using the fingerprint key algorithm. This matching function normalizes records through the following sequential workflow: it removes leading and trailing whitespace, changes all characters to lowercase, and removes all punctuation and control characters. It then splits the string into whitespace-separated tokens, sorts the tokens, removes the duplicates, and joins the remaining tokens back together. Because the string parts are sorted, the original order of the tokens does not matter. For example, a product category labeled “paint brush” and another labeled “brush paint” would be matched as the same.
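
For readers who want to see those normalization steps end to end, here is a small Python sketch of a fingerprint-style key that follows exactly the sequence described above; Talend’s own implementation may differ in detail.

    # Minimal fingerprint-style key, following the steps described above.
    import re

    def fingerprint_key(value: str) -> str:
        """Trim, lowercase, strip punctuation and control characters, tokenize on
        whitespace, sort, de-duplicate, and join the tokens back together."""
        cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
        tokens = sorted(set(cleaned.split()))
        return " ".join(tokens)

    # Token order no longer matters, so both categories collapse to the same key.
    assert fingerprint_key("Paint Brush") == fingerprint_key("brush, paint")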

For the fourth and fifth attributes, the product detail and the product name, we decided to assign the Levenshtein matching function. Levenshtein calculates the minimum number of edits, whether inserts, deletes or substitutions, required to transform one string into another. For example, the product Colourfix Original could be matched to Colorfix Original even though it was missing a character. We ran multiple experiments and iterations while mining this data to come up with the optimal rule for these products and our company, and we also arrived at a threshold of 80%. We are confident that, for this data, art products that do not match each other at at least 80% can be considered unique; unique products flow directly into the master product table. We also defined our confident match ratio for automatic grouping as 88%, which means all the art products that match as a group at 88% or more are automatically moved as one group into our master product table as auto-matched. That leaves all the remaining record groups that the matching algorithm matched between 80% and 88%; those we call uncertain, and they are sent to data stewardship for manual review.
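
Putting the two cut-offs together, the routing can be summarized as in the sketch below. Only the 80% and 88% thresholds come from the demo; the Levenshtein helper and the single-field comparison are simplified assumptions for illustration.

    # Three-way routing using the 80% / 88% thresholds described above.

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of inserts, deletes or substitutions to turn a into b."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(previous[j] + 1,                 # delete
                                   current[j - 1] + 1,              # insert
                                   previous[j - 1] + (ca != cb)))   # substitute
            previous = current
        return previous[-1]

    def match_ratio(a: str, b: str) -> float:
        """Normalize the edit distance into a 0-1 similarity."""
        longest = max(len(a), len(b)) or 1
        return 1 - levenshtein(a, b) / longest

    def route(ratio: float) -> str:
        if ratio >= 0.88:
            return "auto-matched"   # grouped straight into the master product table
        if ratio >= 0.80:
            return "uncertain"      # sent to data stewardship for manual review
        return "unique"             # treated as a distinct product

    print(route(match_ratio("Colourfix Original", "Colorfix Original")))  # auto-matched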

Let’s move to the data stewardship portal.

Now this is the Talend Data Stewardship portal, and I will take the role of a data steward within the art products manufacturing company. Even though human intervention and a decision are needed, Talend Data Stewardship implements a workflow that prevents a human (me, in this case) from directly editing the data and introducing new errors. Only authorized users can act upon the data, and solid boundaries only allow the steward to tell the system whether the grouping is good: the Yes option if it is, the No option if it is not. I cannot go in and modify the data field by field; you can see every field is marked as locked. As a data steward, I can review the different products in the proposed group, and I can also go back to the original source systems and the product catalog to do further analysis and provide the proper response.

We can see that the Data Rocket Master Framework matching algorithm matched all of those 12 products into the same group. Within my ERP and my Salesforce US tenant I can see all of their different IDs, and I can also review the product codes. I can see that this synthetic brush appears multiple times within the same system and also across my three source systems. We can see that all of those 12 records have a different recommended retail price, even though we suspect they are in fact the same product. After further investigation within the company’s product catalog, I come to the conclusion that yes, the system correctly matched those 12 records: they all belong to the same group. I can then provide a positive response to the question of whether the records in the group are a certain match. Since I know those 12 are actually the same product, I mark them accordingly, and the stewardship platform now shows me in green that this group has been resolved. I will do the same and repeat my analysis for all of the other groups.

Once I’ve provided a response for all of the proposed groups, I can go ahead and validate my choices. Once I validate, another scheduled Talend job automatically brings those records back into the Data Rocket framework. From there, each record is automatically routed based on the decision: the yes records are sent to the master product group table, while the no records are sent back into the system for another pass.

Now that I have mastered all of my products and created single instances of the same product by grouping them together, I want to create my dim product dimension for my star schema analytical data model to support my sales dashboard. Again, as shown earlier, we use Talend to try to automatically create the Golden Record price. All anomalies that cannot be resolved automatically are sent to the data steward. For this art products manufacturing and sales company, the stewardship rules cover all the groups of products that have multiple different proposed retail prices, as well as all of the product records with no price at all. We want a data steward to be able to analyze the different proposed retail prices and decide what the final price should be in our catalog.

Let’s go back to the data stewardship portal to see how this is achieved. This time I’m back in the data stewardship portal, but instead of answering a question with a yes or no, I’m able to actually take action on the data. Again, to prevent manual errors, I cannot edit product fields such as the product name or the product categories. What is proposed to me, for the Series K synthetic brush, is all of the possible prices available, and the data stewardship campaign asks me to submit what the group’s final product price should be. As a data steward, I can go ahead and call my sales manager, review the catalog, and investigate which of the proposed prices should be the actual retail price. The application does not allow me to leave the field blank; I cannot mark this record as complete without providing a decimal value.

After investigation, I realize that the Series K synthetic brush product should have a retail price of $10.50. The same goes for the Onyx series: after investigation and reviews within the art company, I was told it should be $5.50. Now that these values are valid according to the data rule, I can mark them as resolved and validate my choices again. Just like for the deduplication step, there is an automated Talend job on a schedule that is ready to take all of my choices and insert them into my dim product.

Let’s review some of the key benefits of Passerelle’s Data Rocket Master Data Framework.

  • It is a ready-to-use and configurable framework. We import the different Talend jobs into your environment, and we help create the different data models and set up the data stewardship campaigns.
  • It can be used for simple matching through to complex matching by extending it with a custom machine learning model. We work with you to get the optimal results from your data; our team of data engineering consultants will provide suggestions and guidance to come up with the best matching algorithm.
  • It simplifies the data matching effort and helps you achieve accurate outcomes. This workflow is meant to be executed daily on a schedule, and the Master Data Framework bridges technical teams and business teams to get the best data resolution effort. This workflow really makes data governance a team sport: no more siloed tribal knowledge.
  • Finally, it complies with industry data governance approaches and best practices.

Let’s finish this webinar by taking a look at the data quality metrics dashboard that comes out of the box with the Data Rocket Master Data Framework. This dashboard serves as a view to audit and control the automated matching. Let’s go back to our fictitious arts materials manufacturing and selling company.

Our first KPI is a comparison between the distinct number of products from our three sources and the result from our Master Data Framework. In the upper left corner, this donut chart shows me how many records were an automatic match, how many were sent to a data steward for manual review, and how many were unique products across our three source systems. If I select to view my auto-matched products, I can see down below the distinct products from my three source systems, and on the right side I can see the same products within their respective groups, which gives a clearer view of how our data was fragmented before the matching and grouping. In the upper right corner, I can see the price adjustment method: out of my 117 auto-matched products, three were sent to a data steward to get a price adjustment, and the remaining 114 products had a correct source price. In the bottom right corner, I can select a group and see all of the different products scattered across my three source systems. The matching framework was able to regroup these six different instances of a product and identify that they are in fact the exact same product. Now let’s take a look at the price adjustment I did earlier as a data steward: I can do a price comparison and explore this new group, and if I want, I can include only the products that belong to this group.

It’s important to keep in mind that the Data Rocket Master Data Framework is flexible and configurable for any other data entities in your organization. I hope you enjoyed this demo of Passerelle’s Data Rocket Master Data Framework and saw how it can accelerate the resolution of real-world data problems for manufacturing and sales organizations. I wish everybody a good day.
