This post on TechCrunch contains quite a nice analysis of Twitter’s user behavior:
Twitter Data Analysis: An Investor’s Perspective
As the author notes, it was possible to deduce so many key metrics because “Twitter uses auto-incrementing ID numbers (1,2,3,4…) for both users and tweets.”
Auto-incremented ID numbers are a common web development practice. When you run an online business, you and/or your users create data–hopefully lots of data. Each record of data needs a unique identifier that you can use to go find that bit of data for later use. Automatically-incrementing integer IDs are a common implementation choice because they have some desirable properties:
- Compactness – Shorter URLs are generally a good thing. In addition, IDs that can fit into a 32- or 64-bit integer representation allow for efficient storage in a database and efficient manipulation within most programming languages.
- Uniqueness – It is trivially easy to generate unique IDs with an auto-incrementing approach: keep track of the last ID you generated, and add one.
- Efficiency – Generating the next sequential ID value doesn’t take very long. In more precise terms, it takes O(1) time; the speed with which you can generate the next integer is a constant, and this operation will not get slower as your data size grows.
- Convenience – Most databases have a built-in feature that can auto-generate auto-incrementing IDs.
Despite these virtues, auto-incrementing IDs can reveal information about the rate at which you are accumulating customers or content. If your user IDs are sequentially generated, for example, a competitor need only sign up for a new account on your website once a day to determine how fast your user base is growing. You may not mind that risk; the engineering team at Twitter probably discussed this issue when designing their data model and consciously decided not to obscure their ID values.
In cases where you do want to avoid using auto-incrementing ID values, some alternative ID generation strategies are:
- Random – Generate a random number every time you need a new ID. Guaranteeing uniqueness requires a bit more work with this approach, though.
- Random increment – When generating a new ID, instead of adding one to the previously generated ID, add a random number. This approach will exhaust the range of your chosen integer representation more quickly than the traditional auto-increment, though, so it is perhaps best used with a 64-bit representation.
- UUID – Use UUIDs instead of integers as IDs. UUIDs have the properties of efficiency and uniqueness (even with random UUID generation, there are enough bits to make duplicates very rare) but are less compact than a typical auto-incrementing numeric ID. A UUID occupies 36 bytes in human-readable form or 16 bytes (128 bits) in binary form.








Even without signing up every day, with a reasonably small sample, you can get a pretty good estimate of the user base by applying the German Tank Problem. Not really a problem when you actually want people to know how many users you have, but you probably don’t want to tell competitors how many advertisers you have (by using an advertiser id in the URL or query arg).
I like to combine randomness with an autoincremented value: timestamp. Use the most entropic bits (the microseconds) for the high order bits of the id and the less entropic bits (seconds, mins, hours, …) for the low order bits (or simply reverse the timestamp bit order altogether). Then toss on some random bits at the end.
Brian, what flaws, if any, exist by using a hash function to generate the public IDs? Using a hash function like Keccak (http://keccak.noekeon.org/), you could limit the output key to 64 bits (more compact than UUID) and use a private integer as the input key (which could be uniquely, efficiently and conveniently generated by a database) and use a salt (a Keccak feature) to thwart brute force attacks (by computing the has of a series of integers to see which one produces a given hash).
Collisions appear to be a significant issue, but if your user-creation script (perhaps even enforced by the db with a UNIQUE constraint on the ID_HASH column), when a collision is encountered, the penalty is to call the db increment trigger and move to the next integer and hope you don’t get another collision.
The principal downside I can see is that it’s not O(1), but the authors claim “13 cycles per byte”, which is reasonably efficient for modern hardware and the number of bytes being hashed in this application.
Another hash function that could do the trick. Skein: http://www.skein-hash.info/about.