It seems like every organization has their own arbitrary way of assigning numerical identifiers to people. You have a driver’s license number, a social security number, a tax ID, an employee ID, a voter ID, and on and on. Wouldn’t it be nice if everyone just had one single ID? And wouldn’t it be nice if you didn’t have to look this ID up in some kind of database? What if you could figure it out just from your knowledge of a person, even incomplete knowledge? But at the same time, a person’s identifier doesn’t actually reveal any information about them?
Ok, so privacy experts will probably warn about using a common ID for everything, but let’s just ignore them and think about how this could work.
There’s something called a bloom filter, which is a data structure used to keep tack of membership in a set. In other words, you can add elements to the set, and you can test to see if a given element is in that set (with a high degree of confidence). However, you can’t actually enumerate the items in the set (other than by guess and check).
It works by hashing each element through N different hash functions, so each hash function produces an index value for that element. The index then points into a bit array to specify which bit corresponds to that element. To add an element to the set, simply set of all it’s corresponding bits to one. To test if an element is in the set, check to see if all of its bits are set.
Now a given bit could correspond to multiple different elements, which is why you use multiple hash functions and multiple bits for each one. The idea is that, if you have enough hash functions, then every element will likely correspond to a different set of bits than any other element.
Of course there is still possibility of a complete collision between two elements, or a collision between one element and a set of others. So for instance, maybe if you add A, B, and C to a given set, then all of D’s bits end up set as well, so it will look like D is in the set. But if you make the bit array long enough, and pick the correct number of hash functions, you can generally have high confidence that you won’t get any false positives.
And note that you never turn any bits off (there is no way to remove an element once added), so false negatives aren’t possible; if an element has been added to the set, then it will always test as being in the set.
If you want to get into the details, you can go read the Wikipedia article, but you only need to understand the basics in order to understand where I go with this.
Attribute Sets as Identifiers
In everyday use, we identify people by name. Often just their first name, on occasion their full name if required to disambiguate (or if it’s your kid and they’re really in trouble!).
That usually works well within the circle of a hundred and fifity or so people that you interact closely with on any kind of regular basis. But when you go out into the broader world, it may not work out so well. We all know that names are not unique, even first and last name combinations. How many Kim Lee’s or John Smith’s are there in the world? More than one, that’s for sure.
Now if we narrow our scope down by age, even approximate age, we can drastically reduce the number of people who match our query. How many Kim Lee’s are there in the world? I don’t know, but it’s more than the number of Kim Lee’s who were born between 1970 and 1980.
We can add additional criteria to narrow it down even further. The number of Kim Lee’s born between 1970 and 1980 in Seattle? Well that might just be a handful (it may be none, I honestly have no idea). Factor in the names of the mother and father, and you’ve got a good chance of coming up with one unique person.
So a relatively small set of partially-identifying information can, in combination, be used to uniquely identify a person. The question is, how do you turn this into an ID? And what if you don’t know all of the information?
Bloom Filters as Identifiers
My answer is, you feed it into a bloom filter, that funny little data structure described at the beginning of the article.
For any given piece of partially-identifying information (your name, your parent’s name, your date of birth, etc.), as long as you can find a way to hash it, you can add it to your bloom filter. Once you’ve added in a handful of such attributes, your boom filter should be in a pretty unique state (assuming it was setup appropriately), and you can simply use the bits in the bit array as your personal identifier.
One of the nice things about this is that your identifier is extensible. You can add as many attributes as you want in order to get a more specific identifier, without invalidating the less specific identifiers that are created from subsets of those attributes.
What’s more, adding more attributes to your identifier actually makes it easier for someone to determine your identifier. Or at least, an identifier that applies to you. That’s because you don’t just have one identifier, you actually have an identifier for each subset of attributes that are included in your ID. So a person doesn’t need to have complete knowledge of everything that was added to your ID, as long as they know enough about you to uniquely identify you, they can generate a unique identifier which will be contained by your complete ID.
But your identifier is also anonymous, in that it doesn’t reveal any information about you. Sure, if someone has your identifier and knows what your name is, they can check and see that your name is included in the identifier. But if they have your identifier and don’t know your name, they would need to guess your name and then check your identifier to see if they’re right.
I whipped up a quick proof of concept for this idea, and put it on bitbucket. It’s just a single python file, not well documented, and a lot of the underlying mechanisms are implemented very poorly. But it was really just to see how well it would work.
I used a five-hundred-twelve bit ID, using three hash functions. If I knew more about bloom filters, I could probably figure out the ideal number of hash functions, but this will suffice for a prototype. The hash functions perform over four-thousand iterations of SHA-512 which makes any kind of guess and check attack against the attributes in an identifier a lot slower than just a single hash.
I also put together a demo identifier for myself. I’m not sharing the attributes I put into it because some of it is personal information, but in base64, the identifier looks like this:
Out of the five-hundred-twelve bits, two-hundred-forty-one of them are set, which is a density of about forty-seven percent.
At first, that seemed like a lot. In the extreme case, if I filled up the entire array with set bits, then there’s only one identifier (0xFFFF…..F). But then I did the math, and it turns out there’s something like ten to the power of one-hundred-fifty-two ways for two-hundred-forty-one out of five-hundred-twelve bits to be set. That’s a lot of available identifiers!
For reference, that identifier includes both full and partial names for myself, my wife, my son, my two siblings, my parents, all four of my grandparents, and my wife’s parents, plus birthdays and locations for myself, my son, and my siblings. I think that’s plenty to uniquely identify me, but probably doesn’t provide a lot of alternative paths for finding me.
In the future, I’ll probably add some additional information, like places I’ve worked, schools I’ve attended, places I’ve lived, etc. Some other attributes, like people I’ve known, might have some interesting uses as well. And since I’ve still got a lot of overhead before I start getting into the realm of limited available identifiers, why not?
The more attributes you add to your identifier, the more likely your set will yield false positives. The more you add, the more bits are set, so the more likely that the bits corresponding to some other attribute will also be set by coincidence.
But that’s not actually a bad thing. Anybody who knows information about you can still confirm whether or not an identifier has that information (again, there are no false negatives), but somebody who is trying to guess information about you from your identifier is liable to come up with a bunch of false positives. They may find that my identifier “contains” the name “Brian”, but it may also “contain” the names “Fred”, “Tom”, “Susan”, and “Bangalor-3 of the Trogdorth Empire”.