If you know me then you know I fucking adore Redis. Over the last few years it's slowly worked its way into the majority of the things I have built. Unfortunately, I recently ran into a problem that Redis could easily handle but that would require more RAM than I have available in my server. Because of this limitation, I decided to eat crow and migrate the data back to the ever-reliable MySQL.
The data itself was the dashboard stream of updates from your friends on my social networks. What feels like ages ago, I was gathering the updates from multiple database tables and caching them for 2 minutes. This worked fine, but left users with a very stagnant dashboard that was only a short snapshot of the most recent 25 updates or so. This eventually evolved into more of a pub/sub model where, when a user made an update, the update's UID would be pushed out to each of their friends' sorted sets in Redis. Each sorted set was then truncated to keep the size manageable (anywhere from 100 to 1,000 as I was feeling everything out), and it was all done inside of a pipeline, so the overhead was negligible even if you had 10,000 friends to update.
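To give you an idea, the fan-out was shaped roughly like this. This is just a sketch in Python with redis-py, not the actual code; the key names, the cap, and the function itself are stand-ins:

```python
import time
import redis

r = redis.Redis()

def fan_out_update(update_uid, friend_ids, max_items=1000):
    """Push an update's UID onto each friend's dashboard sorted set,
    then trim each set so it never grows past max_items."""
    now = time.time()
    pipe = r.pipeline(transaction=False)  # queue everything, send in one round trip
    for friend_id in friend_ids:
        key = f"dashboard:{friend_id}"
        pipe.zadd(key, {update_uid: now})               # score by timestamp, newest last
        pipe.zremrangebyrank(key, 0, -(max_items + 1))  # keep only the newest max_items
    pipe.execute()
```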
That approach actually worked out well for a year or so, but for a while now I've wanted to give users the ability to see more and more updates by way of infinite scrolling on the dashboard (more in line with Tumblr). At its peak I was allowing users to scroll back through the last 1,000 posts, which was essentially a full day's worth of data for a user with 1,000 friends or so. With that many posts in each user's sorted set (and another sorted set per user that stored their own updates, to merge in when you make friends), the size of the Redis instance got up to around 4 to 5 gigs. Still quite manageable, until I got the bright idea of attempting to backfill all of the updates since the beginning of time into each user's sorted set.
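That merge-on-friend step went something like this. Again, a rough sketch, assuming a per-user updates sorted set alongside the dashboard set from the sketch above; none of these key names are from the real code:

```python
import redis

r = redis.Redis()

def merge_new_friend(user_id, new_friend_id, max_items=1000):
    """When user_id friends new_friend_id, fold the new friend's own updates
    into user_id's dashboard sorted set, then re-trim it."""
    dashboard_key = f"dashboard:{user_id}"
    updates_key = f"updates:{new_friend_id}"
    # ZUNIONSTORE merges the friend's timestamp-scored updates into the dashboard set
    r.zunionstore(dashboard_key, [dashboard_key, updates_key])
    r.zremrangebyrank(dashboard_key, 0, -(max_items + 1))
```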
After I spent a bit of time writing the logic to do the backfilling, I went ahead and kicked it off. After only a few moments I realized that this was going to be bad news, as the instance was already over 8 gigs and the script was only about 5% of the way through the data. I'm not entirely sure how large the Redis instance would have grown, but I'm pretty sure it would have exceeded the 32 gigs of RAM currently in the machine. Not just that, but the data set would keep growing forever and cause more pain points down the road.
So after I stopped the script and wrote some more code to clean up the data and get it back down to a manageable level, I decided to revisit storing the data in MySQL. In addition, I wanted to solve the issue of a user "bombing" another user's dashboard by posting a bunch of updates in a short period of time. What would happen is that the posting user would occupy a large chunk of your sorted set, and if you unfriended them because of it, you would be left with a sorted set full of data that needed to be skipped (and dropped); in some extreme cases you could end up with a completely empty dashboard (if you had few friends or were a new user).
My thought was, instead of maintaining a table with a row per user for every update from their friends, why not just store a single row for each update and then pull the data based on your friends' unique IDs? My initial concern was performance in the situation where the query's IN clause would contain 10,000 values, which would be a common thing for the handful of users with over 10,000 friends. I did a little research and from what I could tell it would be fine. I mocked up the query and it ran amazingly well (the friend ID column was indexed, mind you). I then backfilled the data (which ended up being around 1.2 million rows) and dumped everything out of Redis, resulting in a 3 or 4 gig drop in the size of my Redis instance, which was now around 1.3 gigs.
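The query was shaped something like this (a sketch with made-up table and column names; the parts that matter are the index on the poster's user ID, the IN clause, and only pulling a small page at a time):

```python
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="app",
                       database="app", autocommit=True)

def fetch_dashboard(friend_ids, limit=10, before_id=None):
    """Pull the newest updates posted by any of friend_ids.
    Assumes a non-empty friend_ids list and an `updates` table
    with an index on user_id (the poster)."""
    placeholders = ",".join(["%s"] * len(friend_ids))
    sql = f"SELECT id, user_id, body, created_at FROM updates WHERE user_id IN ({placeholders}) "
    params = list(friend_ids)
    if before_id is not None:  # infinite scroll: page by the last id the user saw
        sql += "AND id < %s "
        params.append(before_id)
    sql += "ORDER BY id DESC LIMIT %s"
    params.append(limit)
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()
```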
That wasn't the end for Redis though, as I still needed a way to keep track of how many new updates were available. Previously I would set a timestamp when a user last viewed their dashboard and then count how many items in the sorted set had a newer timestamp. I really didn't want to be slamming the database on every page load (the indicator is shown on every page), and while Memcached would have been fine, I wanted something a bit more persistent. Not just that, but I wasn't sure if Memcached had a way to pipeline the way you can with Redis (I still don't). Keep in mind, if a user posts an update and they have 10,000 friends, then 10,000 indicators need to be set.
What I ended up doing was keeping a counter in Redis for each user. When a new update is posted by one of your friends, I write the row to the updates table and then increment each friend's counter by 1. This is all done in the same pipeline I had originally written, and it replaced the 2 commands (the sorted set add and the truncate) that were previously being run. When the user hits the dashboard, their counter is set to 0 again.
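Both sides of the counter are about as simple as Redis usage gets. One more rough sketch, with the same made-up names as above plus a hypothetical unread counter key:

```python
import pymysql
import redis

r = redis.Redis()
conn = pymysql.connect(host="localhost", user="app", password="app",
                       database="app", autocommit=True)

def record_update(author_id, body, friend_ids):
    """Write the update row to MySQL, then bump every friend's
    new-update counter in a single Redis pipeline."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO updates (user_id, body, created_at) VALUES (%s, %s, NOW())",
            (author_id, body),
        )
    pipe = r.pipeline(transaction=False)
    for friend_id in friend_ids:
        pipe.incr(f"unread:{friend_id}")  # one INCR per friend, still one round trip
    pipe.execute()

def clear_indicator(user_id):
    """Hitting the dashboard resets the indicator."""
    r.set(f"unread:{user_id}", 0)
```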
Even though performance was good, my read/write ratio against MySQL went up to 90/10 from around 84/16. I opted not to cache the results of the queries at all because of the volatility of new data being added all of the time and the nightmare I feel it would be to cache the data. I have some ideas for how I'd like to cache it if the day ever arises. Fortunately, the table is very straightforward, has few columns and is properly indexed, plus I'm only pulling 10 updates at a time. If I had to guess, the need to cache will probably show up somewhere around 10 million rows or so. I'd much rather spend my time building cool shit than prematurely optimizing, though.
This also solved the issue of unfriending and having gaps in your dashboard data, as the data is pulled based on your friends list at the time. It also eliminated the step of merging in updates when you make a new friend, since all of the data lives in one place. Please note that I purposefully omitted any insights into the data structure, as I'm thinking a post dedicated to the architecture of a content stream of friends' updates would be a good topic for down the road (covering both the Redis and RDBMS approaches).