Posted 23-Sep-2013 in Software Development

The Complexity of Permalinks and Slugs with User Generated Content

I'm not referring to the complexity of the permalink or slug itself, but to the complexity that certain formats can introduce into your code, specifically anything that the user can modify. When I was originally building SceneKids.com I wanted to ensure that the user's username was always in the URI when referring to something that belonged to said user. The format for a status update looks something like this:

http://scenekids.com/username/status/status_id

Easy and straightforward, because the status ID is always unique and it results in a simple primary key query to grab the record. Where it gets sketchy is the inclusion of the username, for a couple of reasons. Out of the gate, in this type of scenario you need to look up the user by the username and then check whether the status ID being requested actually belongs to that user. If it doesn't, you can either 404 or redirect to the proper URI. If we were to skip this step (and God knows I have before) we end up with a scenario where you can swap the username portion of the URI and pull up the same content, possibly attributing the content to the other user (if you're using the passed username to pull the user content and not the user ID that's tied to the data itself). I know there are probably still some pages that have this bug. Fortunately it's not that big of a deal, as most folks only know about the correct page and never even go looking for the erroneous ones.
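Here's a rough sketch of that ownership check in Python. The `User` and `Status` classes, the in-memory dictionaries standing in for database tables, and the `resolve_status` function are all made up for illustration and are not the actual SceneKids.com code. The sketch also takes the redirect option; returning a 404 on a mismatch would be just as valid.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    username: str


@dataclass
class Status:
    id: int
    user_id: int
    body: str


# Hypothetical in-memory stand-ins for the users and statuses tables.
USERS = {1: User(1, "alice"), 2: User(2, "bob")}
STATUSES = {42: Status(42, 1, "hello world")}


def resolve_status(username: str, status_id: int) -> tuple[int, str]:
    """Return (http_status, payload) for /username/status/status_id."""
    status = STATUSES.get(status_id)  # simple primary key lookup
    if status is None:
        return 404, "Not Found"

    # The owner comes from the user ID stored on the record,
    # never from the username that was passed in the URI.
    owner = USERS[status.user_id]
    if owner.username != username:
        # Wrong username in the URI: send the visitor to the proper one.
        return 301, f"/{owner.username}/status/{status.id}"

    return 200, f"{owner.username}: {status.body}"
```

Because the content is always attributed through `status.user_id`, swapping the username in the URI can never pin the status on someone else.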

Second issue we have is that the username can be changed at any time on the site. Similar to how GitHub manages username changes, I opted to not maintain any redirects on the content once the username is changed and also allow the old username to become available again. With users that would change their username 10 times a day if I let them (now limited to once every 10 days), the redirects would pile up and would get chaotic pretty quickly, especially with releasing the usernames back into the wild each time. This issue could be solved by not releasing the usernames back into the wild and then maintaining a list of old usernames so that we can track the user down when the old URI is used. That's a very feasible workflow, but it obviously adds complexity to your code.
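To give a feel for the extra moving parts, here's a minimal sketch of the "keep a list of old usernames" approach. Everything in it (the three dictionaries, `resolve_username`, `rename`) is hypothetical and only assumes that retired usernames are never released to other users.

```python
# Hypothetical stand-ins for database tables.
current_usernames = {"alice_new": 1}   # username -> user ID
retired_usernames = {"alice": 1}       # old username -> user ID
usernames_by_id = {1: "alice_new"}     # user ID -> current username


def resolve_username(username):
    """Return (http_status, user_id_or_redirect_location)."""
    if username in current_usernames:
        return 200, current_usernames[username]
    if username in retired_usernames:
        # Old URI: redirect to whatever the user is called today.
        user_id = retired_usernames[username]
        return 301, "/" + usernames_by_id[user_id]
    return 404, None


def rename(user_id, new_username):
    """Change a username, retiring the old one instead of releasing it."""
    previous_owner = retired_usernames.get(new_username)
    taken = new_username in current_usernames
    retired_by_other = previous_owner is not None and previous_owner != user_id
    if taken or retired_by_other:
        raise ValueError("username is not available")

    old_username = usernames_by_id[user_id]
    del current_usernames[old_username]
    retired_usernames[old_username] = user_id
    retired_usernames.pop(new_username, None)  # user reclaimed an old name
    current_usernames[new_username] = user_id
    usernames_by_id[user_id] = new_username
```

Every rename adds a row that has to be kept forever, and every profile request that misses the current usernames needs a second lookup, which is the added complexity in a nutshell.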

Now with every new project, I try to right some wrongs from my previous endeavors. With both my effort earlier in the year on a failed platform that shall not be named and on Clipinary, a recipe site that embraces community modifications, I wanted to be sure that I didn't have to worry much about maintaining any 301 redirects but also not punk out and simply omit the support like I did previously. To do this, I went ahead and adopted what I liked to call "lazy slugs" and more recently "irrelevant slugs" to provide a more reliable way to permalink.

I call them lazy and irrelevant because the magical SEO keyword loading portion of the URL doesn't factor in at all. The structure of a Clipinary recipe is like this:

http://clipinary.com/recipe/recipe_id/slug-generated-from-recipe-name

Now I didn't come up with this format (both StackOverflow and Tumblr use similar formats for their permalinks) but I do favor it for anything but a user's profile. Regarding a user's profile, I did try it out once on that aforementioned failure. The URL looked something like this:

http://fubar/u/user_id/username

Why do I favor this format for some endpoints and not others? It boils down to how the URL is going to be shared. No one is ever going to say "go check out my recipe on Clipinary, it's clipinary dot com forward slash recipe forward slash 123 forward slash name dash of dash my dash recipe". They are going to say "go check out my kick ass recipes on Clipinary, it's clipinary.com slash username". The recipe endpoints are search engine destinations more than anything else. When people converse, they never tell someone to check out their Facebook or Twitter update and give out the exact URL; they tell them to go friend them on Facebook and give them their username. Because of that, I like to keep the URL for the user profile as clean as possible, and I don't feel as compelled to do so on nested endpoints below the user's profile, as those types of URLs are more likely to be copied and pasted than read aloud. It's worth noting that StackOverflow does use this format for their user profiles as well.

Back to the irrelevant slug: the way the URI works is that anything after the last slash is ignored, so it could be anything you want it to be. In my situation on Clipinary, this compensates for any changes to the recipe title while preserving the functionality of the endpoint. In fact, the last part of the URI can be omitted entirely. I use that shorter URL when setting up like and sharing buttons, so that the counts won't ever be reset. You could take this a step further and correct the slug if it doesn't match what you expected it to be for the data being pulled; both Tumblr and StackOverflow do that. I'm unsure if they are doing a redirect or if they are pushing the correct URL to the address bar. Either way, it's a ton easier than maintaining a history to map back to, because the ID is what counts.
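A small sketch of how an endpoint like that might be routed, including the optional "correct the slug" step. The `RECIPES` dictionary, the `slugify` helper and `route_recipe` are invented for the example, and the redirect variant is just one of the two options mentioned above, not a claim about how Tumblr or StackOverflow do it.

```python
import re

# Hypothetical stand-in for the recipes table: recipe ID -> current title.
RECIPES = {123: "Grandma's Apple Pie"}

# /recipe/<id>, optionally followed by /<slug> and a trailing slash.
RECIPE_PATH = re.compile(r"/recipe/([0-9]+)(?:/([^/]*))?/?")


def slugify(title):
    """Lowercase the title and collapse everything else into dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def route_recipe(path):
    """Return (http_status, title_or_redirect_location)."""
    match = RECIPE_PATH.fullmatch(path)
    if match is None:
        return 404, None

    recipe_id = int(match.group(1))
    title = RECIPES.get(recipe_id)  # the ID is the only thing looked up
    if title is None:
        return 404, None

    slug = match.group(2)
    expected = slugify(title)
    if slug and slug != expected:
        # Stale or made-up slug: point at the current one.
        return 301, f"/recipe/{recipe_id}/{expected}"

    # Missing slug or matching slug: serve the recipe as-is.
    return 200, title
```

Note that the slug-less URL is served without a redirect, so it stays stable for things like share counts even if the title changes later.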

Because I'm omitting the username from the URI and making the title slug optional / meaningless, the data lookups become very straightforward. Instead of having to look up a user by username, I can use the user ID stored with the recipe. Instead of looking up the recipe by slug, I can look it up by its ID. I love primary key selects, if you couldn't tell. On top of that, and something not previously mentioned: because we're using unique IDs as part of the URI, there's no need to worry about duplicates in the slug field of the database.
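As an illustration, here's what those lookups could look like against an in-memory SQLite database. The table layout, column names and `load_recipe` function are assumptions made for the example, not Clipinary's real schema. The point is that both queries hit a primary key and the `slug` column carries no unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE recipes (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        title TEXT NOT NULL,
        slug TEXT NOT NULL  -- deliberately no UNIQUE constraint
    );
    INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob');
    -- Two different recipes sharing the same title and slug is fine.
    INSERT INTO recipes (id, user_id, title, slug) VALUES
        (100, 1, 'Apple Pie', 'apple-pie'),
        (101, 2, 'Apple Pie', 'apple-pie');
""")


def load_recipe(recipe_id):
    """Fetch a recipe and its author using primary key selects only."""
    recipe = conn.execute(
        "SELECT id, user_id, title, slug FROM recipes WHERE id = ?",
        (recipe_id,),
    ).fetchone()
    if recipe is None:
        return None

    # The author comes from the user ID stored with the recipe.
    user = conn.execute(
        "SELECT username FROM users WHERE id = ?",
        (recipe[1],),
    ).fetchone()
    return {"id": recipe[0], "title": recipe[2], "slug": recipe[3], "author": user[0]}
```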

One idea I've toyed with but continually talk myself out of is going with more of an Instagram or YouTube type of URI that includes a UUID that is separate from the ID of the record itself; a slug with a predictable length, basically. I don't think it's a terrible idea, but it falls victim to the same pitfalls. Not only do I have to generate a UUID (10 to 11 characters, alphanumeric seems to be the going rate), I need to make sure it's unique across either the user's recipes (if including the username in the URI again) or across all of the recipes. And for what benefit? To have a shorter URL that actually means less because it's not human friendly? Sure, we could tack on some irrelevant slug at the end, but why add another layer of complexity? We have primary keys on tables for a reason, so it makes sense to me to use them. If you wanted to add width to the value, perhaps you could start your auto increment at 100000 or something like that.
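For comparison, a sketch of what generating that kind of identifier involves. `new_short_id` and the `existing` set are hypothetical; in a real application the set would be a unique index on the table, with a retry when the insert collides. This is not how Instagram or YouTube actually generate their IDs, just the generate-and-check idea described above.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 alphanumeric characters


def new_short_id(existing, length=11):
    """Generate a random alphanumeric ID that is not already in use."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # The uniqueness check is the extra step an auto increment
        # primary key gives you for free.
        if candidate not in existing:
            existing.add(candidate)
            return candidate
```

Collisions are unlikely at this length, but the check still has to exist, and the new column needs its own index on top of the primary key.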

So all in all, I'm liking this style of permalink because it's easy to generate, easy to maintain and still gives me some SEO niceties in the URL. In all honesty though, I wonder about the importance of the clean URL in SEO these days when I see Facebook and other large and wildly successful sites still rocking query string variables. Maybe it doesn't really matter anymore?