Just here for the code? Here you go: https://github.com/aarondfrancis/URLcrypt
Mantis is coming along at a pretty slow pace, partially because we have a stakeholder problem. What I mean to say is that the business owners (who decide to use Mantis) really want the information and the accuracy that comes with it. The workers, however, don't want to be bothered to keep their time; they'd rather just continue guesstimating. And in many cases the person who's in Mantis day-to-day is not the business owner but more of an office admin, who therefore has less invested in ensuring that the workers stay compliant.
Bosses want it, workers can't be bothered, admins don't care so much.
We're trying a few different things to address this issue, one of which is delivering daily reports to the business owner detailing yesterday's hours worked, payroll, etc. Push the information to them and get them addicted, right? I remembered reading Thomas Fuchs's write-up about how they did this for their time-tracking tool for freelancers, Freckle. I decided to take the same route for Mantis. We are going to be sending them daily charts as images and needed a way to store secure information in the URL as opposed to the database. Thomas open sourced his code for how he did it in Ruby but since Mantis is PHP, I set off to port it over (not knowing any Ruby, mind you). It was an exercise partially for utility, and partially just to learn a thing or two.
I wanted basically the same thing out of my URLs that Thomas and Amy had for Freckle:
- clean: only alphanumeric characters. Looks prettier that way.
- no vowels: prevents curse words that can get caught in spam filters.
- secure: the contents are encrypted.
Let the Porty Begin
The first thing I needed to do was set up my table of allowable characters: the letters a-z with the vowels swapped out, plus the digits 0-9. This leaves us with exactly 32 characters, which is perfect for what we'll need later down the road. Below is the code used to set up the "table". The vowels a, e, i and o become the numbers 1 through 4, u becomes a capital A, and the leftover digits go at the end.
static $table = "1bcd2fgh3jklmn4pqrstAvwxyz567890";
These are the characters from which we'll be forming our "pretty URL". That's the easy part. The tricky part is squeezing data drawn from a much larger set of values (the encrypted string, where each byte can be any of 256 values) down into a mere 32 characters. Please note that I'll be mostly covering the encoding, as opposed to the encrypting. There are plenty of better, more in-depth articles about encryption.
Fiddling With Bits
Here's where it actually gets fun. (Yes, truly.) Once I wrapped my head around the basic premise, it made a lot of sense, and it's particularly clever. Hats off to Thomas. We're going to loop through our string, character by character, and do the following:
- convert the character to its ASCII representation (a base 10 number)
- convert the base 10 number into its binary representation (0's and 1's)
- pad the binary representation on the left with 0's, ensuring its length is exactly 8
- shove this 8 character string of 1's and 0's on the end of a master string
- go to the next character
Now when we're done, we're left with a master string of 1's and 0's that is 8*[Length of String To Encode] characters long. Put another way: 8 bits for every character in your original string.
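As a rough sketch of the steps above (this is not the code from the repository, and the function name `toBitString` is made up for illustration), the loop might look something like this in PHP:

```php
<?php
// Hypothetical helper: turns a string into its "master string" of bits.
function toBitString(string $input): string
{
    $bits = '';
    for ($i = 0, $n = strlen($input); $i < $n; $i++) {
        $code = ord($input[$i]);   // character -> base 10 number (0-255)
        $binary = decbin($code);   // e.g. 97 -> "1100001" (only 7 digits)
        // Left-pad with 0's so every character contributes exactly 8 bits.
        $bits .= str_pad($binary, 8, '0', STR_PAD_LEFT);
    }
    return $bits;
}

echo toBitString('ab'), "\n"; // 0110000101100010
```

The left-padding matters: `decbin()` drops leading zeros, so without it the 8-bit boundaries between characters would be lost.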
Below is me doing my best to figure out what exactly I was trying to accomplish. Pseudo-code is often very helpful. I suggest you try to understand the concept as well, as it's exceedingly interesting.
Hang On, Friend
You may be asking yourself at this point: WTF, why are you not using proper integers to keep track of all those bits? You should be bit shifting and generally doing a better job. Fair point. Unfortunately, PHP has a max integer value of either about 2 billion (2^31 - 1) or about 9E18 (2^63 - 1), depending on your version of PHP and your OS. (Ruby, seemingly, has no limit.) Either way, both of those values are quite restrictive when you're no longer using the bits to keep track of numbers, as we're not. And therein lies the rub. While it may seem obscene to use a string to keep track of "bits", that's what has to be done, as far as I can tell.
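To put a number on that limit, here is a small sketch using PHP's built-in `PHP_INT_SIZE` constant. The 16-byte payload is an arbitrary example, not something from Mantis:

```php
<?php
// Bits a native integer can hold on this build: 31 on 32-bit, 63 on 64-bit
// (one bit is taken by the sign).
$usableBits = PHP_INT_SIZE * 8 - 1;

// Bits needed to hold even a short 16-byte payload as one number.
$payload = 'sixteen byte str';
$neededBits = strlen($payload) * 8;

printf("usable: %d, needed: %d\n", $usableBits, $neededBits);
var_dump($neededBits > $usableBits); // bool(true)
```

Even a 16-character string needs 128 bits, which is more than either integer size can hold in one go.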
The Magic of 31
Remember when I said, regarding our allowable characters, that 32 was the perfect number? That's because 31 is the highest number that can be represented in 5 bits (0b11111 = 31), meaning that we can fit 32 distinct numbers into 5 bits (0-31). What this means for us, practically, is that we are now going to take that long string of bits, originally 8 per character, and chop them up into 5 bit chunks. Then we'll convert those 5 bits back into a decimal, knowing that it has to result in a number between 0 and 31. After we have it as a base 10 digit (a decimal), we simply go to our table and pick the character that lies at that index. If the decimal is "0", we select the 0th element from our table: "1". If our decimal is "1", then we select the 1st element from our table: "b". (Recalling, of course, that arrays are 0-based, meaning that the first element has an index of 0. The second, 1. Etc.)
Perhaps a picture would help. The example below assumes we are encoding a string "abcde", which yields the encoded string "mfrggzdf". Pretty neat.
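The same example can be traced in code. What follows is a minimal sketch rather than the code from the repository; the names `TABLE` and `encode` are made up for illustration:

```php
<?php
const TABLE = '1bcd2fgh3jklmn4pqrstAvwxyz567890';

// Hypothetical encoder: 8 bits in per character, one table character out per 5 bits.
function encode(string $input): string
{
    // Build the master string of bits, 8 per input character.
    $bits = '';
    for ($i = 0, $n = strlen($input); $i < $n; $i++) {
        $bits .= str_pad(decbin(ord($input[$i])), 8, '0', STR_PAD_LEFT);
    }

    // If the bit count is not a multiple of 5, pad on the right with 0's.
    $bits .= str_repeat('0', (5 - strlen($bits) % 5) % 5);

    // Chop into 5-bit chunks and look each one up in the table.
    $encoded = '';
    for ($i = 0, $n = strlen($bits); $i < $n; $i += 5) {
        $index = bindec(substr($bits, $i, 5)); // always between 0 and 31
        $encoded .= TABLE[$index];
    }
    return $encoded;
}

echo encode('abcde'), "\n"; // mfrggzdf
```

For "abcde" the eight 5-bit chunks come out as 12, 5, 17, 6, 6, 25, 3 and 5, which index into the table as m, f, r, g, g, z, d and f.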
It's true, my example is contrived. I picked a string that was exactly five characters long, such that 5 (characters) * 8 (bits) = 40. Forty is a great number because it can be split evenly into 8 chunks of 5 or 5 chunks of 8. What if your string is six characters long, leaving you with a 48 bit string, a number not evenly divisible by 5? Simple, we just pad it to the right with 0's until we reach 50, a number that is happily divisible by 5. When we decode, we work from left to right and the extraneous 0's are simply ignored. Bingo.
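Decoding runs the same process in reverse. Here is a sketch of how the padding falls away, again with made-up names (`TABLE`, `decode`) and not taken from the repository:

```php
<?php
const TABLE = '1bcd2fgh3jklmn4pqrstAvwxyz567890';

// Hypothetical decoder: 5 bits in per character, one byte out per 8 bits.
function decode(string $encoded): string
{
    // Turn each character back into its 5-bit chunk.
    $bits = '';
    for ($i = 0, $n = strlen($encoded); $i < $n; $i++) {
        $index = strpos(TABLE, $encoded[$i]);
        if ($index === false) {
            throw new InvalidArgumentException("Unexpected character: {$encoded[$i]}");
        }
        $bits .= str_pad(decbin($index), 5, '0', STR_PAD_LEFT);
    }

    // Read whole bytes from the left. Any leftover bits that do not
    // fill a byte are the padding, and are dropped.
    $decoded = '';
    for ($i = 0, $n = strlen($bits); $i + 8 <= $n; $i += 8) {
        $decoded .= chr(bindec(substr($bits, $i, 8)));
    }
    return $decoded;
}

echo decode('mfrggzdf'), "\n";   // abcde
echo decode('mfrggzdfmy'), "\n"; // abcdef
```

In the six-character case the ten encoded characters carry 50 bits; the first 48 become six bytes and the last two 0's are never read.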
The Last Little Bit (Pun Intended)
Now you know how to use PHP to encode information in URLs. You can use this to encode non-sensitive information, or encrypt and encode sensitive information. (This would be a great way to generate URLs with an expiry date, for example.) Keep in mind that URLs should generally not exceed 2,000 characters. Since every 5 bits of input becomes one 8-bit character of output, we can calculate our overhead factor to be 1.6x, meaning that our new string is 1.6 times as long as our old string (8 bits/5 bits = 1.6).
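The arithmetic can be sketched in a couple of lines. Both function names here are hypothetical, and the 2,000 figure is treated as a budget for the encoded part alone, so the real limit is lower once the rest of the URL is counted:

```php
<?php
// Encoded length in characters: one per 5 bits, rounded up for padding.
function encodedLength(int $inputBytes): int
{
    return intdiv($inputBytes * 8 + 4, 5);
}

// Largest input (in bytes) whose encoding fits in the given number of characters.
function maxInputBytes(int $maxEncodedChars): int
{
    return intdiv($maxEncodedChars * 5, 8);
}

echo encodedLength(5), "\n";    // 8
echo encodedLength(6), "\n";    // 10
echo maxInputBytes(2000), "\n"; // 1250
```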
I'm always happy to accept pull requests if you can improve upon what I've done here. This was a great learning experience for me and I'm under no delusions that the code is perfect.