1. Introduction
In this article, we’ll look at what a URL shortener is and at a few different ways to generate shortened URLs.
2. What Is a URL Shortener?
A URL shortener is a mechanism for replacing a long, complicated URL with a much shorter one. This is beneficial where the number of characters is constrained, e.g., SMS or Twitter, as well as where a very long, complicated URL might be copied incorrectly.
Generally speaking, a URL shortener is a web app that runs on a very short domain name, such as t.co or bit.ly, and uses a path that’s as short as possible. When we open this URL in a browser, we’re redirected to the much longer URL that was originally intended. For example, we can use http://bit.ly/3LVvzd0 to reach the much longer https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class.
3. Creating a URL Shortener
Now that we know what a URL shortener is, how would we create one? Our only real requirement is the ability to generate a short string from the original URL in a way that lets us get the original URL back. In addition, we want the strings that represent our URLs to be as short as possible. So let’s look at a few ways we could achieve this.
3.1. String Compression
The obvious answer is to apply a compression algorithm to the URL. If we can find an algorithm that takes a URL and produces a URL-safe string that’s guaranteed to be shorter, then this would be a good start. It would also have the advantage of needing no storage at all, since we could simply reverse the process to get the original URL back.
Unfortunately, most URLs are already short as far as compression algorithms are concerned. Most URLs that people are likely to use will be a few hundred characters long at most, with a de facto limit of around 2,000 characters. This is typically not enough input for a general-purpose compression algorithm to work with, so the resulting strings are likely to be either longer than the original or not significantly shorter.
For example, if we take the URL “https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class”, and then apply both gzip and base64 to it to get a printable string, then the result is “H4sIAKczJWQAAwXBUQqAIBAFwP/uog+VhLrNtiwIKYZvvX8zzf3jDZAzuok2ebpFnQM03cuQcyoV2kzfuR2llnxeCZ1hCN1W0C7k8QN2fj8hSAAAAA==”. This is clearly much longer than the original: it’s 116 characters when the original was only 71. Standard base64 output can also contain “+”, “/” and “=”, so it isn’t strictly URL-safe without a further substitution.
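The comparison above can be reproduced with a minimal sketch using Python’s standard library. It assumes gzip with default settings and the URL-safe base64 alphabet; the exact output bytes may differ from the string shown above, because gzip headers vary between tools.

```python
import base64
import gzip

url = "https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class"

# mtime=0 keeps the gzip header identical between runs (Python 3.8+).
compressed = gzip.compress(url.encode("utf-8"), mtime=0)

# The URL-safe base64 alphabet uses "-" and "_" instead of "+" and "/".
encoded = base64.urlsafe_b64encode(compressed).decode("ascii")

# Reversing both steps recovers the original URL without any storage.
decoded = gzip.decompress(base64.urlsafe_b64decode(encoded)).decode("utf-8")

print(len(url), len(encoded))
```

The round trip works, but the encoded form is longer than the input, because the fixed gzip header and trailer plus the base64 expansion outweigh any savings on such a short string.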
This means that simply compressing the original URL is unlikely to be a good approach, so what else can we do?
3.2. UUIDs
Given that any form of compressing the original string in place is unlikely to be beneficial, we’re left with generating a unique key that we can exchange for the URL. We can store these in some data store, and then upon invoking the shortened URL, we’ll look up the key and redirect the client to the URL it references.
One such form of a unique key that we could use is a UUID. These are easy to generate, are unique for all practical purposes, and are relatively short. If we render them in a form that only uses letters and numbers, our keys will always be 32 characters long. This is a significant improvement over many URLs, and URLs shorter than this are unlikely to need shortening anyway.
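As a rough illustration of the key-and-lookup idea, here is a minimal sketch in Python. The in-memory dictionary and the names `shorten` and `resolve` are hypothetical stand-ins for a real data store and a real redirect handler.

```python
import uuid

# Hypothetical in-memory store; a real service would use a database.
store = {}


def shorten(long_url):
    """Store the URL under a fresh UUID key and return that key."""
    key = uuid.uuid4().hex  # 32 hexadecimal characters, no hyphens
    store[key] = long_url
    return key


def resolve(key):
    """Return the original URL for a key, or None if the key is unknown."""
    return store.get(key)


key = shorten("https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class")
print(key, resolve(key))
```

A web app built on this would answer a request for the key with an HTTP redirect to the value that `resolve` returns.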
3.3. Sequence Numbers
However, we can do better than this. If we have no concerns about an enumeration attack (that is, where a client calls each possible shortened URL in turn), then we can use sequence numbers. These take slightly more effort to generate because we must guarantee that no generated value has been used before. However, the resultant URLs will be significantly shorter: we would need to have generated ten nonillion (a 1 followed by 31 zeros) URLs before our keys reach 32 characters long. And every single one we generated before that point will be shorter than our UUID pattern.
The biggest downside to this pattern is the generation. Whenever we generate a new shortened URL, we need to take the next unused number, and we need to do this in a thread-safe way. If we don’t, we risk assigning the same sequence number to two different URLs, at which point we no longer know how to dereference them. However, most database engines have some support for sequences that we can leverage.
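To make the thread-safety requirement concrete, here is a minimal single-process sketch in Python. The class name `SequenceGenerator` is hypothetical, and in a real deployment with several servers a database sequence or auto-increment column would usually take this role instead.

```python
import itertools
import threading


class SequenceGenerator:
    """Hands out increasing integers, never the same one twice."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        # The lock ensures two threads can never receive the same number.
        with self._lock:
            return next(self._counter)


generator = SequenceGenerator()
print(generator.next_id(), generator.next_id())
```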
3.4. Alternative Numeric Bases
We’ve already reached a point where we have very short URL substitutions: our first million shortened URLs all fit in 6 characters or fewer. But we can do even better than this.
If we follow the same pattern of using sequence numbers, but we represent those numbers in a different numeric base, then we can get more numbers into fewer characters. For example, hexadecimal – base 16 – would let us represent 16,777,216 URLs within 6 characters.
But we can push it much further than this. Theoretically, we can represent our numbers in any numeric base with enough unique symbols. Hexadecimal works by adding 6 additional symbols – A, B, C, D, E and F – to our 10 numbers. So how far can we push this?
What if we used all uppercase and lowercase letters, as well as numbers? That would give us 62 different symbols, which would give us 56,800,235,584 different URLs that we can shorten to 6 characters or fewer.
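A base-62 conversion of a sequence number can be sketched as follows. The symbol ordering (digits, then uppercase, then lowercase) and the function names are assumptions made for this example; any fixed ordering works as long as encoding and decoding agree.

```python
import string

# 10 digits + 26 uppercase + 26 lowercase = 62 symbols (assumed ordering).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)


def encode(number):
    """Render a non-negative integer using the 62-symbol alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(key):
    """Convert a base-62 key back into the integer it represents."""
    number = 0
    for char in key:
        number = number * BASE + ALPHABET.index(char)
    return number


print(encode(1_000_000), decode(encode(1_000_000)))
```

With this scheme, the shortened path is simply `encode` applied to the sequence number, and `decode` recovers the number used for the lookup.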
If we really need to, we can add some extra URL-safe characters. This can feasibly give us another 22 symbols, bringing us up to 84 symbols and allowing us to represent 351,298,031,616 URLs with 6 characters or fewer.
However, at some point, there is less benefit in these extra symbols. Using only numbers and uppercase letters, but increasing to 7 characters or fewer, means we can represent 78,364,164,096 different URLs. So by just adding a single extra character to our shortened URLs, we’ve gained more capacity than we got by allowing mixed-case letters.
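The capacity figures in this section all follow from the same rule: an alphabet of b symbols and a key length of n characters gives b to the power n distinct keys. A short Python check, with a hypothetical helper named `capacity`:

```python
def capacity(symbols, length):
    """Number of distinct keys of the given length over the given alphabet size."""
    return symbols ** length


print(f"{capacity(10, 6):,}")  # decimal digits, 6 characters
print(f"{capacity(16, 6):,}")  # hexadecimal, 6 characters
print(f"{capacity(62, 6):,}")  # digits plus mixed-case letters, 6 characters
print(f"{capacity(84, 6):,}")  # 84 symbols, 6 characters
print(f"{capacity(36, 7):,}")  # digits plus uppercase letters, 7 characters
```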
In fact, a larger alphabet can even make things worse if it includes characters that are difficult to tell apart. For example, the characters “1”, “I”, and “l” can sometimes be hard to distinguish, as can “0”, “o” and “O”.
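One possible response, sketched here as an assumption rather than something this article prescribes, is to drop the look-alike characters from the alphabet and accept the slightly smaller base.

```python
import string

# The look-alike characters named above.
AMBIGUOUS = set("0Oo1Il")

# Digits plus mixed-case letters, minus the ambiguous ones.
SAFE_ALPHABET = "".join(
    char
    for char in string.digits + string.ascii_uppercase + string.ascii_lowercase
    if char not in AMBIGUOUS
)

print(SAFE_ALPHABET, len(SAFE_ALPHABET))
```

This leaves 56 symbols, so each key holds somewhat less than in base 62, in exchange for keys that are easier to read aloud or retype.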
3.5. Vanity URLs
Once we have the ability to transform a shortened URL into the original, we have other options available as well. Vanity URLs, for example, are where the shortened URL is a string that the user provides instead of a seemingly random sequence of characters. For example, the URL http://bit.ly/spring_masterclass is also a link to https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class, but this time the shortened URL itself tells us what it points to.
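Since a vanity URL is just a caller-chosen key in the same lookup table, supporting it mostly means validating the key and rejecting duplicates. A minimal Python sketch, in which the store, the allowed character set, the 64-character limit, and the function name `register_alias` are all assumptions for illustration:

```python
import re

# Hypothetical in-memory store shared with generated keys.
store = {}

# Assumed policy: letters, digits, "_" and "-", at most 64 characters.
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def register_alias(alias, long_url):
    """Register a caller-chosen key, refusing invalid or already-used ones."""
    if not ALIAS_PATTERN.fullmatch(alias):
        raise ValueError("alias contains unsupported characters")
    if alias in store:
        raise ValueError("alias is already taken")
    store[alias] = long_url


register_alias(
    "spring_masterclass",
    "https://sso.teachable.com/secure/22136/checkout/3632591/ls-master-class",
)
print(store["spring_masterclass"])
```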