Your Data Is Lying To You: Why Reconciliation Has To Come Before AI

[00:00:06] Speaker A: Hello and thank you for joining us. This is What Counts, a podcast where we explore real-world challenges and opportunities shaping information governance today. Each episode draws on our experience working across industries, turning proven strategies into practical insights you can apply within your own organization. Whether you're navigating information governance, facing a specific need, or simply curious about issues like email management, retention, contract data, or asset data management, this podcast gives you a clear, actionable perspective on what truly counts in building strong, sustainable governance practices.
This is Lee, and in this episode, Maura and I are picking up right where we left off. Remember last time? If you didn't, go listen to our last episode. We broke down the AI-generated customer support escalation workflow and found that behind the big promises were even bigger assumptions: messy data, disconnected systems, security gaps, hallucinated answers. The AI told us it could do it all in 10 minutes. Maura doubted that one, and I learned to doubt it after a while too. We also introduced the concept of metadata mapping: understanding the keys and identifiers that hold your data objects together. And we ended with a warning. When you store the same information, the same customer ID, across your billing system, your work order system, and your contract management system, you end up with duplicates. Three copies, seven copies, ten copies of the same counterparty. So today we're tackling the next logical step: reconciliation. How do you figure out what's real, what's redundant, or what's just plain wrong before you ever let AI near it? Maura, that was big. So let's set the stage here. We talked about messy data and duplicates, but why does reconciliation have to happen before AI touches anything? Why can't AI just figure it all out?
[00:02:10] Speaker B: Well, a lot of people think it can.
Well, let me tell you, I don't think it can. And the main reason I believe this, and we've seen it, is that all of those separate systems that are holding some of your data were created for a purpose. They were created over time. Data was added to them over time. Somebody knows why, but the system doesn't necessarily know why. And when you just let AI loose, it's going to take the information at face value. So where there's a conflict, it might decide this must be two different things. Ideally, if you're going to involve AI in this step, it will ask you: are these two different things or not? And if it can get that far, on your first review you say, no, those are the same thing. How do you as a person know that? Well, there are a lot of different steps to that, and we're going to get into those.
But basically it comes down to this: the systems that hold all this legacy data don't necessarily carry the history. They don't carry the context of how that data got there and how, for instance, the use of one field changes over time. It changes because when we set this up, we thought there would only ever be one tax ID for a company. Then it turns out tax IDs change for different reasons, and there's no place to put the new one. So do we overwrite the old one to have the new one in that field that says tax ID? But we want to know the old one. So does that go in a note field? What if that note field used to be used for a DBA name, a "doing business as," or a nickname, or an AKA, or something else? Name history is another thing. Names change all the time with companies, and somebody knows the choices they made. Perhaps that somebody no longer works for you, but there's been a sort of legend handed down as the job was handed off, or not. And you're going to have to do some more work on figuring it out.
But there are gaps between those three sets of data that might be the same or might not, and certainly there are overlaps. To understand, out of these three sets, what's the real answer and what the real entities are (customers, in this case), a person is going to fill in those gaps. AI doesn't have the knowledge to fill them, mainly because people are irrational, and the way those changes happened over time is not rational. They're going to be based on circumstance. It's kind of the same reason I think self-driving cars and people-driven cars shouldn't be on the same road.
[00:05:31] Speaker A: Come on, we're talking about hallucinations here. Now you're talking about self-driving cars. We can't have hallucinations in that sense.
[00:05:40] Speaker B: Hallucinations and self-driving cars are a real problem. But yes, that's for another day. So the bottom line here is the data is not rational. We've already said it's messy, there are duplicates, there are overlaps. And a person is going to bring knowledge that an AI doesn't have to the process of reconciliation.
[00:06:04] Speaker A: That makes sense. Okay, so when we say reconciliation or data reconciliation, what are we talking about?
[00:06:14] Speaker B: So, the end goal here: we've got three sets of data that contain some or all of the same customer records. In this case, we're still building on our example from the last few episodes around that customer service challenge that AI says it can solve in 10 minutes. In 10 minutes, right. So we've got the customer database, we've got the work order database, we've got the contract database. And each one of those has a list of customers. By list, I mean it has a set of records that represent customers. And those records are different. They contain different data.
That's because the customer service database, the work order database, and the contract database were created for different reasons, and they need different data. Across those three sets of customers, the same customer might appear in all three databases, but it might be represented differently. There might have been something as simple as a typo. There might have been an inherited set of data where the old system always put everything in all caps and didn't use any punctuation. Or the data might be truncated. We've definitely run into systems with an interesting use of other fields, too, where the name of a company couldn't fit in the company field and would get truncated, and then there would be an overflow field.
[00:07:53] Speaker A: Oh, the long description field.
[00:07:55] Speaker B: Yeah, the long description or the overflow name field. The original system was so old that it had that small of a character limit. But in a migration process, we wanted to bring those names back together, so you had to match all of that up. Anyway, the goal is this: you have these three data sets, and you want to end up with one consolidated data set that contains real records that identify real customers. One record per customer is the goal, matched with validated, accurate, and complete data for that customer.
[00:08:43] Speaker A: Okay, so you're not just talking about taking out the duplicates. You're talking about one record. One trusted record.
[00:08:53] Speaker B: One trusted record. The master data term is golden record. The golden record contains all of the data that we know about this customer. And it's a logical record, which means that all of those data fields might not be in the same system, but we can bring them together, concatenate them, and synthesize them to say, here's everything we know about you.
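A minimal sketch of the golden record as a logical record, assuming three hypothetical in-memory tables that already share a customer ID. The table contents, field names, and the `golden_record` helper are illustrative only and do not come from the episode; a real setup would read from each system rather than from dictionaries.

```python
# Hypothetical per-system records, each keyed by the shared customer ID.
SOURCES = {
    "customer": {"12345": {"name": "Acme Solar, LLC", "phone": "555-0100"}},
    "work_order": {"12345": {"location_id": "XYZ", "site_address": "1 Main St"}},
    "contract": {"12345": {"contract_number": "C-001",
                           "contract_type": "solar maintenance"}},
}


def golden_record(customer_id, sources):
    """Assemble one logical view of a customer from several systems.

    Nothing is copied back into any system: the fields stay where they
    live, and each key is prefixed with the system it came from.
    """
    record = {"customer_id": customer_id}
    for system_name, table in sources.items():
        for field_name, value in table.get(customer_id, {}).items():
            record[f"{system_name}.{field_name}"] = value
    return record


view = golden_record("12345", SOURCES)
```

Prefixing each field with its source system keeps the "where does this data live" question answerable, which matters once the systems disagree with each other.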
So, for instance, a very common problem: you have an ERP, a finance system, that contains a vendor record. And the goal for that vendor record in the ERP is to pay invoices. So the most important data to have in that record is the current name of the entity that's being paid, a complete and legal name; the current tax ID for that entity, possibly a federal and a state, but absolutely a federal tax ID; and payment information. That would be a payment type: are you sending ACH, are you sending a wire, are you sending a check? And depending on which one of those it is, you have where the payment goes. For ACH or wire, you're going to have bank information, a routing number and an account number, which could be different based on whether it's ACH or wire, even at the same bank. If it's a check, then you have to have the address where the check goes. That's the most important information to be in your vendor master file in your ERP, because the goal there is to get people paid.
So that's a very small piece of the data that you probably have in your customer database or in a contract database. In the customer database you might have, for instance (this is an interesting one; it doesn't go with the vendor file, but it goes the other way), a house where you installed solar panels. The original owner bought those and had a long-term contract for care and maintenance and other things, and those solar panels stay with the house when it gets sold. So you need to know how the original contract was created. What obligations does your company have to that customer in the original contract? But who's the customer now, and what did they inherit in that contract? So the customer database also needs up-to-date information, because if there's a maintenance issue, somebody needs to know where to go, how to create a work order, and how to send it to that customer location.
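Returning to the vendor master example, the rule that the required fields depend on the payment type can be sketched as a small completeness check. The field names and the `missing_fields` helper are assumptions made for illustration; they are not taken from any particular ERP.

```python
# Fields every vendor record needs in order to get paid (illustrative names).
ALWAYS_REQUIRED = {"legal_name", "federal_tax_id", "payment_type"}

# Extra fields that depend on how the payment is sent.
REQUIRED_BY_PAYMENT_TYPE = {
    "ach": {"routing_number", "account_number"},
    "wire": {"routing_number", "account_number"},
    "check": {"mailing_address"},
}


def missing_fields(vendor):
    """Return the sorted names of required fields that are absent or empty."""
    required = set(ALWAYS_REQUIRED)
    required |= REQUIRED_BY_PAYMENT_TYPE.get(vendor.get("payment_type"), set())
    return sorted(name for name in required if not vendor.get(name))


check_vendor = {
    "legal_name": "Acme Solar, LLC",
    "federal_tax_id": "00-0000000",
    "payment_type": "check",
}
gaps = missing_fields(check_vendor)
```

Here `gaps` lists only the mailing address, since a check needs somewhere to go but no bank details.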
But you're also looking at: all right, is this work going to be done under warranty, or does the customer have to pay us when we get there, or do we invoice them? That might be in your contract system, and that contract, depending on how it was originally identified, somehow has to be matched to this new customer. So it's more data than you need in that vendor file. The flip side of the vendor file in an ERP is a customer record, which is how you accept money from a customer as opposed to sending money to a supplier. The same set of data comes in there: how that customer is going to pay you, what account numbers, what payment type, et cetera.
[00:12:50] Speaker A: So wait a minute. Sorry, but stop me.
[00:12:54] Speaker B: I feel like I've gone in too many directions.
[00:12:57] Speaker A: Well, I feel like we can talk about a concrete example next time, actually. When you were talking earlier, you said you need people to do interpretation. But now, as you were talking about data and master data management, it sounded more like an IT problem. But this is not only an IT problem.
[00:13:28] Speaker B: No, master data is also a person problem.
[00:13:31] Speaker A: Right, right. So I think we could cover that, more than we could do an example, in the four minutes we have left.
[00:13:39] Speaker B: Okay, there's just so much to talk about. Are you sure we have to stop in four minutes?
[00:13:45] Speaker A: No, we don't have to.
[00:13:47] Speaker B: Because the example is this: all of these systems were created for a reason, and somebody knows what that reason is. And the golden record, which is a logical record but not necessarily a physical record in one system, brings all the pieces together. And the way that it does that, like we talked about last time, is by having a common ID.
So you know that the customer ID field in the customer database matches up with a work location ID in the work order database, because it's about how you send somebody to work. So you have to have a relationship between customer 12345 and location XYZ. And you do that through the association of those two ID fields.
[00:14:37] Speaker A: Okay.
[00:14:38] Speaker B: Okay. Then you have the contract ID. It might be a contract number, and it has a customer ID as one of its fields. So the contract number is its primary identifier, but the customer number is a secondary identifier. The contract doesn't live in a vacuum. It only exists in relationship to the customer. So that customer ID from your customer database is part of the contract record in the contract database. In the work order database, you have that location ID, and you have that customer ID too.
[00:15:16] Speaker A: We should have a map. Oh, a data map.
[00:15:20] Speaker B: You do need a data model, a data map that talks about how these different entities exist, what IDs exist for them, and how they relate. But the reconciliation step is: okay, I have 10 customer IDs that have very similar names, and I have five location IDs, and they don't have customer IDs with them, but they have customer names. So how do I know, from these five location IDs with the five different customer names, how they match up to these 10 customer IDs with similar names? That is the reconciliation step. And we can do this in more depth in a later episode, but there are some computing techniques, some automated techniques, and this is where AI can be helpful as a tool. There's the Levenshtein distance; there's fuzzy matching, like they use in e-discovery. There are other ways to do searches that give you a confidence rating: we think 12345 and 12345.2 are the same, and we're 90% confident of that. A system, an AI tool, can help you get that far.
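As a rough illustration of the Levenshtein approach mentioned here, the sketch below normalizes names, computes the edit distance, and turns it into a similarity score between 0 and 1. The sample names, the 0.9 threshold, and the helper names (`normalize`, `similarity`, `candidate_matches`) are all assumptions for illustration. The score is a simple normalized distance, not a statistical confidence, and its only job is to shortlist pairs for a person to review.

```python
import string


def normalize(name):
    """Fold case, drop punctuation, and collapse runs of whitespace."""
    stripped = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())


def levenshtein(a, b):
    """Count the single-character inserts, deletes, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score two names from 0.0 (nothing alike) to 1.0 (same after cleanup)."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def candidate_matches(location_names, customers, threshold=0.9):
    """Pair each location's customer name with every customer ID whose
    name scores at or above the threshold, best score first."""
    matches = []
    for loc_name in location_names:
        for customer_id, cust_name in customers.items():
            score = similarity(loc_name, cust_name)
            if score >= threshold:
                matches.append((loc_name, customer_id, score))
    return sorted(matches, key=lambda match: match[2], reverse=True)


# Hypothetical data: an all-caps legacy name and two customer records.
customers = {"12345": "Acme Solar, LLC", "67890": "Summit Roofing Inc."}
shortlist = candidate_matches(["ACME SOLAR LLC"], customers)
```

A shortlist like this only says two names look alike. Whether they are the same customer is still the human call described next.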
Then a person might say, okay, we know that these customers exist. They're in all these databases. Let's look at the contract type: are they a solar panel owner? So we know that this customer 12345.2 is a solar panel owner. And that's our challenge here: we're trying to find out whether that's the right one. We have a solar panel purchase contract. Great. We have a work location that also references 12345, but it left off the .2. But that location address matches the address on that purchase contract. Okay, let's bring those two pieces of data together, and then we can triangulate that 12345.2 is the same as customer number 12345. But it's knowing the relationships between all these pieces of data that lets you bring them together.
In a different example, we looked at counterparties, supplier and customer numbers, and contract types over time, because there were a lot of very similar names, but only some of them were still in the supplier ID world. Then we looked at the contract type and realized it was leases, long-term land leases. So the similar names were family, as the land was passed down. And the reason they weren't all in the supplier numbers anymore is that some of them weren't being paid anymore. But you have to know that the contract type has an impact on that in order to say, why do we have 10 of these and only two of them are in supplier, where they're being paid? Let's go back and look at the contracts. So it's bringing all of these different data sets together logically, and it's an iterative process. There's a lot more to that, and I think we should spend another episode on it. But I just want to introduce the last piece, which is: how do we stop this from happening again?
[00:19:19] Speaker A: Okay. How in the world would you do that?
[00:19:22] Speaker B: Why would I want to stop it?
[00:19:24] Speaker A: No, no, no. How would you do that?
[00:19:26] Speaker B: Oh, how would I stop it? So it's about data governance, and it's a combination of policy and process. And systems can help you; technology can help you enforce that. With policy, in this sense, we're really looking at who is authorized, what group is authorized, to create a new customer record, and what the process is for doing that. In which system does that new customer record originate? Does it originate in the customer database, or in the contract database, or in the payment part of the ERP? There's no right or wrong answer. What's wrong is not answering the question and just letting everybody continue to create a new record. So first, everybody involved needs to come together and agree: okay, this is where we're going to create the new record. Then you figure out who's going to do it. But this is the system that's going to own the first step of the customer record.
Then you need to decide: okay, how do updates get made? Because address changes, tax ID changes, name changes, assignments and assumptions, sales and mergers of companies, they happen all the time. And all the systems need to know, but all the groups don't necessarily find out at the same time. So you need to put the process into place. The supplier people, who are trying to get people paid, often hear first, because people also really want to get paid. So they're going to call AP and say, hey, my bank account changed. And that information is going to go first to them, which is okay, because they're actually the only one out of all of these groups that needs the banking information. But suppose the caller says, oh, and also we got bought by somebody else, and now we have a different tax ID and a different name. Well, actually, everybody needs to know that. All of the groups need to know that it's a different location and a different name and a different tax ID. Well, that's not quite true.
The work order team might not need to know about the tax ID, but they do need to know about the name. Because if they show up to do some work and they say, hi, are you Bob's Gas Station? And the clerk says, no, we're Jack's Gas Station. Then they leave, and they haven't fixed the problem. There might be a propane tank problem, and they haven't fixed it. Who knows? So you figure out how updates are coming in. And for that, you look at how they have been coming in. Who's been getting those calls? Is there a form that people are sending? Are they sending emails to a generic customer service address saying, I have new information? Is there a systematic process where your company sends a form to everybody once a year: please verify your information? I've been getting those this week from a number of different places. LinkedIn: can you verify your information, or whatever. So a lot of companies do that, but not everyone. And if they do, it still may not be coordinated on the inside of the company.
So the first step with governance is: which system is going to be the system of record, and how are new records going to be created there? The second step is: who is going to receive information and be authorized to update it? And if somebody who isn't authorized to update it is going to receive information, then how do they forward it to the right group to validate and update it? And the third step is really the validation step: all of the information needs to go through some kind of validation process, potentially matched to a real-world source, an outside source like a Secretary of State registration, or even just the post office, to make sure that this address does exist and the person is real. All of those things have to happen. And part of your process mapping is designating different groups to own those pieces of the process.
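The receive, forward, and notify part of that process can be sketched as a small routing table. Which group owns which field is exactly the decision the organization has to make for itself, so the assignments below, along with the group names, field names, and the `route_change` helper, are assumptions for illustration only.

```python
# Hypothetical: the one group authorized to validate and update each field.
FIELD_OWNERS = {
    "bank_account": "accounts_payable",
    "tax_id": "accounts_payable",
    "legal_name": "customer_master",
    "service_address": "customer_master",
}

# Hypothetical: every group that needs to hear about a change to each field.
# Banking details stay with AP; a name change goes to everyone.
NEEDS_TO_KNOW = {
    "bank_account": {"accounts_payable"},
    "tax_id": {"accounts_payable", "contracts"},
    "legal_name": {"accounts_payable", "contracts",
                   "customer_master", "work_orders"},
    "service_address": {"contracts", "customer_master", "work_orders"},
}


def route_change(received_by, field_name):
    """Decide what the receiving group does with a reported change.

    The owning group validates and updates; any other group forwards
    the change to the owner. Either way, the result lists who else
    must be notified once the change has been validated.
    """
    owner = FIELD_OWNERS[field_name]
    action = "validate_and_update" if received_by == owner else "forward"
    notify = sorted(NEEDS_TO_KNOW[field_name] - {owner})
    return {"action": action, "owner": owner, "notify": notify}


# AP hears about a new bank account and a new company name on the same call.
bank_route = route_change("accounts_payable", "bank_account")
name_route = route_change("accounts_payable", "legal_name")
```

In this sketch AP updates the bank account itself and notifies nobody, but forwards the name change to the owning group, after which the work order team is among those notified.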
At the end of the day, having done the hard work of reconciling everything, you come out with your golden record: here is our set of 10,000 customers that we know are real, the data is accurate, and we understand where that data lives. Then you have your policy and process together: how do new records come in, how do records get updated, what is the validation step? And the last piece: when do you retire a record? When does that customer become inactive? And when is that record no longer needed for retention or other purposes? That is really important, because we're living in an age of concern, reasonable concern, about data privacy and emerging laws around the right to be forgotten by a company or by a site or by a system. Knowing where your data is, how it's getting created, and what your process is for retiring and destroying data in accordance with the retention schedule is going to be a great answer to somebody who says, where's all my data? I want you to forget me.
[00:25:07] Speaker A: Okay, so, amazing run-through. What's amazing is that we've done this for very large organizations, several of them. And of course, I'll just reiterate exactly what you said. It's not an IT problem, it's not a records problem. It becomes a governance problem, which is an enterprise problem.
[00:25:29] Speaker B: It is. It is an enterprise problem. It manifests in records problems. It manifests in IT problems. But it is a governance problem, and it needs people to solve it.
[00:25:42] Speaker A: Okay, that's a wrap. You good?
[00:25:44] Speaker B: I'm good.
[00:25:45] Speaker A: If you have any questions, please send us an email at info@trailblazer.us.com, look us up on the web at www.trailblazer.us.com, or check out our learning academy at trailblazerlearningacademy.com. Thank you for listening. Please tune in to our next episode. If you like this episode, please be a champion and share it with people in your social media network.
As always, we appreciate you, the listeners. Special thanks go to Jason Blake, who created our music.
[00:26:18] Speaker B: All right, thanks, everyone. Tune in next time and we'll talk about how to reconcile 300,000 counterparty records. I know you can't wait.

Your Data Is Lying To You: Why Reconciliation Has To Come Before AI - E132
