When Identity Links are Too Good to be True

Overconnected identifiers can stem from oversimplified identity — but sometimes, that isn’t all bad. Here’s how to build the right overconnection approach for your identity program.

Bobby Atefi, Chief Data Scientist
March 23, 2023

Identity is built on connecting signals. Starting with a signal like a pseudonymized email address, an identity provider links it to other signals to work out which additional “addresses”, such as user emails, IP addresses, or devices, belong to the same user or household.

But sometimes you can overestimate the connections, and your identity graph can be marred by linkages that look correct but really aren’t. Read on to learn why this problem of overconnected signals occurs, the havoc it can wreak, and how identity companies can weed overconnected linkages out of their graphs.

The problem of blindly following the link 

Whatever the methodology, identity providers build linkages, and ultimately graphs, on the same basic presumption. If we find two identifiers arriving from the same source, the assumption goes, there’s a chance they belong to the same person or household. For instance, if we see a consented email used to log on to a website, there’s a high likelihood that the device belongs to the email account holder, and the Wi-Fi used to log on belongs to that user’s household. Going forward, we’ll assume any cookies dropped during that session should map to that user as well.

The challenge is that, as with any assumption, following these reasonable ideas blindly can lead identity providers astray. The theoretical user above may in fact have shared their login credentials with a friend, and that friend may have logged on from a work device they brought to a coffee shop, over the shop’s public Wi-Fi. The email address, the device, the Wi-Fi IP address, and the session itself then belong to entirely different people and establishments. However, taking the signals at face value could lead to the wrong conclusion that the original user, the friend, everyone using the coffee shop Wi-Fi, and every device logged on to that Wi-Fi all belong to the same individual or household.

In technical terms, following these assumptions has led to overconnected linkages. Instead of finding only linkages that reflect real-life connections between a user and their identifiers and devices, the process has wound up connecting too much (hence, “overconnecting”). And because identity graphs are built on webs of linkages, even one overconnected linkage can compound into a highly faulty graph – one that maps a highly inaccurate web of connections across people, devices, and identifiers. 
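To see how a single bad linkage compounds, here is a minimal Python sketch. It assumes a graph is nothing more than a list of identifier pairs and groups them with a simple union-find. The identifier names and the `connected_components` helper are hypothetical, made up for illustration rather than taken from any real identity graph.

```python
from collections import defaultdict


def connected_components(edges):
    """Group identifiers into clusters by following every linkage (edge)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical linkages for two unrelated households.
household_a = [("email:alice", "device:alice_phone"), ("device:alice_phone", "ip:home_a")]
household_b = [("email:bob", "device:bob_laptop"), ("device:bob_laptop", "ip:home_b")]
# One overconnected linkage: Alice's login used once on Bob's laptop.
bad_link = [("email:alice", "device:bob_laptop")]

clean = connected_components(household_a + household_b)
merged = connected_components(household_a + household_b + bad_link)
print(len(clean), len(merged))
```

With the one shared-login link added, the two separate three-identifier households collapse into a single six-identifier cluster.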

But while understanding what causes overconnection is straightforward, finding and fixing it at scale is more than a human can manage by hand.

Automating out the overconnection

If you’re looking at just a handful of data points and an extreme case like the example above, it’s relatively simple to spot the obviously mistaken cases of overconnectedness. But in the realities of identity, the errors will likely be far subtler. After all, you’re often dealing with data representing millions of individuals, the households they belong to, and the devices they use (and, as in the case above, sometimes share). Often, you’re also managing data in real time. Amidst all the complexity, the filtering challenge can become immense, which is why automated filtering of data sets is key.

Filtering automation is intricate in practice, but the underlying premise is easy to explain. In any reasonably developed set of connections, you’d expect a given identifier to link to a bounded, predictable number of other identifiers. Once the number of linkages passes that threshold, the automation logic flags the linkage as problematic. For instance, research finds that in 2020 the average American household had 10 connected devices, so an accurate device graph at that time could plausibly have around that many devices linked to a household. Hundreds of devices linked together is probably a case of overconnection, and those links should be marked as less reliable.
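As a rough illustration of that premise, the sketch below counts the devices linked to each household and flags any household over a fixed cutoff. The cutoff of 50 is an assumption chosen for the example (the text only says that around 10 is plausible and hundreds is suspect), and `flag_overconnected` is a hypothetical helper, not any provider’s actual logic.

```python
from collections import defaultdict

# Assumed cutoff for illustration only: well above the ~10-device average,
# well below the "hundreds" that suggest overconnection.
MAX_DEVICES_PER_HOUSEHOLD = 50


def flag_overconnected(linkages, max_links=MAX_DEVICES_PER_HOUSEHOLD):
    """Return {household_id: bool}; True means 'treat these links as less reliable'.

    linkages: iterable of (household_id, device_id) pairs.
    """
    devices = defaultdict(set)
    for household_id, device_id in linkages:
        devices[household_id].add(device_id)  # a set ignores duplicate pairs
    return {hh: len(devs) > max_links for hh, devs in devices.items()}


# Hypothetical data: one ordinary home and one coffee shop Wi-Fi address.
linkages = [("home_1", f"device_{i}") for i in range(10)]
linkages += [("coffee_shop_ip", f"device_{i}") for i in range(300)]

flags = flag_overconnected(linkages)
print(flags)
```

Note that the flag marks links as less reliable rather than deleting them, which leaves the decision about what to do with them to a later step.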

Filtering automation might use publicly available statistics as a baseline for its models. It can also look for outliers within the data set itself, zeroing in on IDs that have far more linkages than is usual for that data set. Whatever the specific design, the algorithm starts with a view of the expected degree of connection as a way to flag overconnection.
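When no outside statistic is available, the expected degree of connection can be estimated from the data set itself. The sketch below is one way to do that, assuming you already have a link count per ID: it uses a modified z-score built on the median and the median absolute deviation. The 0.6745 constant and the 3.5 cutoff are a common rule of thumb for this score, not values from any identity provider, and `overconnected_outliers` is a hypothetical name.

```python
from statistics import median


def overconnected_outliers(link_counts, cutoff=3.5):
    """Return the IDs whose link count is far above what is usual in this data set.

    link_counts: {identifier: number of linked identifiers}.
    The median and median absolute deviation (MAD) are used because a few
    extreme counts cannot distort them the way they distort a mean.
    """
    if not link_counts:
        return set()
    counts = list(link_counts.values())
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    if mad == 0:
        mad = 1.0  # assumption: fall back to a scale of one link when there is no spread
    return {
        ident
        for ident, count in link_counts.items()
        if 0.6745 * (count - med) / mad > cutoff  # one-sided: only unusually high counts
    }


# Hypothetical link counts per ID.
link_counts = {
    "ip:home_a": 8, "ip:home_b": 9, "ip:home_c": 10, "ip:home_d": 10,
    "ip:home_e": 11, "ip:home_f": 12, "ip:coffee_shop": 400,
}
print(overconnected_outliers(link_counts))
```

In this made-up data only the coffee shop address stands out. The cutoff is a tuning knob: a lower value flags more IDs for review, a higher one flags fewer.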

Overconnected isn’t all-or-nothing

Not all highly linked identifiers are necessarily overconnected. For example, a 2021 survey found that the average American had two email addresses, but 28% of them had more than four. Given that finding, would an identifier linked to 5, 10, or even 20 email addresses be an overconnection mistake? Possibly, but not necessarily.

Meanwhile, even clearly overconnected data may not be entirely useless. To offer just one example: A company looking to build out their own graph may still value overconnected linkages for the clues they hold. After all, hundreds of devices sharing a coffee shop Wi-Fi may not belong to the same household – but many of those devices will likely connect to each other in some way. Against the backdrop of further data sets and refining, the company may be able to tease out many of the real signal connections from the noise.

Highly connected linkages may be overconnected linkages, but they aren’t necessarily. Genuinely overconnected linkages may still be valuable for some use cases. With all this in mind, navigating the challenges of overconnected data comes down to setting out the right strategy with your identity partner.
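One way to express “not all-or-nothing” in code is to replace a hard yes/no flag with a graded score. The sketch below is purely illustrative: the scoring rule (full confidence up to an expected link count, then falling off in proportion) and the expected count of 4 email addresses are assumptions made for the example, and `link_confidence` is a hypothetical function.

```python
def link_confidence(link_count, expected_links):
    """Return a score in (0, 1]: 1.0 up to the expected count, lower beyond it."""
    if link_count <= 0 or expected_links <= 0:
        raise ValueError("link_count and expected_links must be positive")
    return min(1.0, expected_links / link_count)


# Assumed expectation for illustration: up to 4 email addresses is unremarkable.
EXPECTED_EMAILS = 4

for emails in (2, 5, 10, 20):
    print(emails, round(link_confidence(emails, EXPECTED_EMAILS), 2))
```

Under these assumptions an identifier with 5 linked emails keeps most of its confidence, while one with 20 is heavily discounted but not discarded.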

Conclusion: Three steps to smarter connections

To get the accuracy and the scale that’s right for you out of your graph, keep these three rules in mind as you plan your identity program and partnerships:

Work with a truth set. Having a truth set of linkages on hand will help you sample the new linkages you receive, to see how effective your provider is at spotting and correcting problematic ones. It’s an especially valuable asset when you’re testing out prospective identity providers.  

Define your use case. Because different use cases call for different types of clarity, start your identity journey by zeroing in on the use case. Some identity use cases, like identity for granular attribution, will need the most pristine data sets possible. By contrast, in the graph-building example above, you may actually prefer less refined, more “raw” data linkages as a stepping-stone to build from. Similarly, in certain reach-focused campaigns you may be willing to dial down precision to dial up data scale.

Decide how much transparency you need. Assuming your identity providers can effectively spot problematic linkages, what should they do with that information? You may want them to simply filter out potentially “bad” linkages entirely. But depending on your use case and in-house data capabilities, you may want your identity company to give you all the linkages they find and also score them by confidence, letting your data teams route the data to different uses by confidence level. When it comes to your data org, would too much information be a case of “drinking from the firehose”, or is more insight always better? How you answer that question will help you guide your identity strategy, and bring you closer to finding the right identity provider for your needs.
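The first and third rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive evaluation method: it assumes the truth set is complete for the identifiers it covers, that links arrive as identifier pairs with a confidence score attached, and that the 0.8 and 0.3 cutoffs and bucket names are placeholders you would set yourself. Both function names are hypothetical.

```python
def precision_on_truth_set(provider_links, truth_links):
    """Share of judgeable provider links that the truth set confirms.

    Assumption: the truth set is complete for the identifiers it covers, so a
    provider link between two covered identifiers that is missing from the
    truth set counts as wrong. Links touching uncovered identifiers are skipped.
    """
    truth = {frozenset(pair) for pair in truth_links}
    covered = set().union(*truth) if truth else set()
    judged = [frozenset(pair) for pair in provider_links if set(pair) <= covered]
    if not judged:
        return None  # no overlap with the truth set, so nothing to judge
    return sum(link in truth for link in judged) / len(judged)


def route_by_confidence(scored_links, high=0.8, low=0.3):
    """Place each (link, confidence) pair in the most demanding use it qualifies for."""
    routes = {"attribution": [], "reach": [], "raw_graph_building": []}
    for link, confidence in scored_links:
        if confidence >= high:
            routes["attribution"].append(link)
        elif confidence >= low:
            routes["reach"].append(link)
        else:
            routes["raw_graph_building"].append(link)
    return routes


# Hypothetical truth set and provider output.
truth_links = [("email:a", "device:1"), ("email:b", "device:2")]
provider_links = [
    ("email:a", "device:1"),  # confirmed by the truth set
    ("email:a", "device:2"),  # both IDs covered, link not in truth set: counted wrong
    ("email:b", "device:2"),  # confirmed
    ("email:c", "device:9"),  # not covered by the truth set: skipped
]
precision = precision_on_truth_set(provider_links, truth_links)

scored = [
    (("email:a", "device:1"), 0.95),
    (("email:a", "device:2"), 0.50),
    (("ip:coffee_shop", "device:7"), 0.05),
]
routes = route_by_confidence(scored)
print(precision, {name: len(links) for name, links in routes.items()})
```

Running the same precision check on samples from each prospective provider gives you a like-for-like comparison, and the routing step shows why scored linkages can be more useful than a pre-filtered feed if your team can handle the extra volume.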

