Classification of Data based on the Nature of the Data Collected – Introduction to Customer Data Platform-2

Identity resolution is done using identity stitching, also called user stitching. This is the process of collating and tying, or stitching, together matching identifiers, such as email IDs and device IDs, to create a single customer profile. We use the word "stitched" because a user on several devices is normally given a different user ID on each one and is therefore counted as several different users. A CDP tries to stitch these different user IDs into a single profile. Identity stitching relies on identity graphs.

An identity graph is a type of database that stores all identifiers that correlate with individual users.

Figure 2.5: Identity graph (source: https://www.richpanel.com/blog/what-is-an-identity-graph)

An identity graph focuses on the connections between nodes. Relationships are based on different data points connected via nodes. Nodes can be generated for an anonymous session (a user browsing without logging in) or for an authenticated session in which the user has logged in. When a new data point is added, the graph database determines where the new node fits. As more data points are fed in, it matches those points to a customer identifier.
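To make the idea concrete, here is a minimal sketch of an identity graph in Python. It is only an illustration, not how any particular CDP or graph database implements it: the `IdentityGraph` class, its method names, and the identifier strings are all hypothetical. Identifiers are nodes, an edge means two identifiers were seen together, and a customer profile is taken to be everything reachable from a given node.

```python
from collections import defaultdict, deque


class IdentityGraph:
    """Toy identity graph: nodes are identifiers, edges mean 'seen together'."""

    def __init__(self):
        # Adjacency sets: identifier -> identifiers it was observed with.
        self.edges = defaultdict(set)

    def add_observation(self, *identifiers):
        """Record identifiers seen in one event and link each to the first one."""
        first = identifiers[0]
        self.edges[first]  # ensure even a lone identifier becomes a node
        for other in identifiers[1:]:
            self.edges[first].add(other)
            self.edges[other].add(first)

    def profile(self, identifier):
        """Return every identifier reachable from the given one (one profile)."""
        seen = {identifier}
        queue = deque([identifier])
        while queue:
            node = queue.popleft()
            for neighbour in self.edges[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen


graph = IdentityGraph()
# Anonymous session: only a cookie and a device are known.
graph.add_observation("cookie:abc", "device:phone-1")
# Authenticated session on a laptop: the login ties that device to an email.
graph.add_observation("email:joe@example.com", "device:laptop-7")
# The user logs in on the phone, which connects the two groups of nodes.
graph.add_observation("email:joe@example.com", "cookie:abc")

print(sorted(graph.profile("device:phone-1")))
```

Before the third observation, the phone and the laptop belong to two separate profiles. The login event adds one edge, and both devices then resolve to the same profile.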

Identities are stitched using three types of matches:

  • Fuzzy matching: Fuzzy matching is an approximate string-matching technique. It helps to identify two pieces of text, strings, or entries that are similar but not identical. For example, Joe Smith can be matched to Joseph Smith if the device IDs are the same.
  • Deterministic: If there is a clear connection between two data points, like a matching device ID or credit card number, then the data point can be attributed to the relevant user node. This method is called a deterministic match. We can conclusively say that we have matched the data point with the correct user.
  • Probabilistic: For data points where we are less certain of finding a match, we create a model based on predictive algorithms. This is the case, for example, when we use identifiers like IP address, device type, browser IDs, and so on, to find a match. In one house, many users can share the same device or IP address, so this match comes with a certain probability, which is less than 100%.
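The three kinds of match can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production matcher: the record fields, the function names, and the weights in `WEAK_SIGNAL_WEIGHTS` are hypothetical, and a real probabilistic match would use a trained predictive model rather than a fixed weighted sum. The fuzzy score uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def deterministic_match(a, b, keys=("device_id", "email")):
    """True if any strong identifier is present in both records and identical."""
    return any(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def fuzzy_name_score(name_a, name_b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


# Hypothetical weights (in percent) for weak signals that agree.
# They sum to 80, so the score can never reach 100%.
WEAK_SIGNAL_WEIGHTS = {"ip": 40, "device_type": 20, "browser": 20}


def probabilistic_score(a, b):
    """Add up the weights of the weak signals on which both records agree."""
    return sum(
        weight
        for key, weight in WEAK_SIGNAL_WEIGHTS.items()
        if a.get(key) is not None and a.get(key) == b.get(key)
    )


known = {"name": "Joseph Smith", "device_id": "D-42",
         "ip": "203.0.113.7", "device_type": "phone", "browser": "Safari"}
incoming = {"name": "Joe Smith", "device_id": "D-42",
            "ip": "203.0.113.7", "device_type": "phone", "browser": "Chrome"}

# Deterministic: the device IDs are identical.
print(deterministic_match(known, incoming))
# Fuzzy: the names are similar but not identical.
print(fuzzy_name_score(known["name"], incoming["name"]) > 0.8)
# Probabilistic: IP and device type agree, browser does not.
print(probabilistic_score(known, incoming))
```

In this sketch, the fuzzy score alone would not be enough to merge Joe Smith with Joseph Smith. As in the example above, it is the shared device ID that lets the two names be attributed to the same user.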

Table 2.2 shows the differences between these three types of matches:

|            | Fuzzy Matching | Deterministic | Probabilistic |
| ---------- | -------------- | ------------- | ------------- |
| Definition | Relationship created based on approximate string matching | Relationship created based on a clear connection | Relationship created based on predictive algorithms |
| Examples   | Match PII data (for example, name, home address, and so on) | Match cookie ID to email when a user logs into a browser: [email protected] | Match cookie ID and MAID based on common characteristics: Cookie123=IDFA413 |
| Use Case   | CRM, Customer Support | Email Marketing, Social activation | Scale and reach on the open web |

Table 2.2: Differences between different methods of identity stitching

So far, we have only scratched the surface of identity resolution, and there is a lot more to understand. Data engineers who work on identity resolution face many challenges in attributing behavior to a particular user, because the user journey becomes more and more complicated as technology advances. The aim here was to introduce the topic so that the reader is aware of the concept. We will go into more detail in later chapters.
