Recently I decided to finally tackle the ever-growing S3 bucket for the niche social network I run.

It keeps growing because I never implemented any sort of hard deletion logic. At one point, I was planning to move the images over to soft deletions, but never did. The soft deletions were going to help with the infamous “so I accidentally deleted my account and…” requests.

At this point, the site’s pretty much on life support as it dies a fairly slow death, and the S3 bill just doesn’t go down, so it was time to clean things up.

The structure of the bucket is a series of directories or shards, each containing multiple directories named after the user’s ID. Inside of those user ID directories, there are images that are named UNIXTIMESTAMP.{jpg,gif} as well as background.{jpg,gif}.
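
To make that layout concrete, here is a minimal sketch of splitting an object key into its parts. The function name `parseImageKey` and the exact key shape (`shard/userId/file`, one level of shards, numeric timestamps) are assumptions based on the description above, not code from the actual project:

```php
<?php
// Hypothetical helper: splits a key such as "a1/42/1500000000.jpg" into
// shard, user ID and file parts. The key layout is an assumption.
function parseImageKey(string $key): ?array
{
    // shard / user ID / (Unix timestamp | "background") . (jpg | gif)
    $pattern = '#^([^/]+)/([^/]+)/(\d+|background)\.(jpg|gif)$#';

    if (!preg_match($pattern, $key, $m)) {
        return null; // Not a key we recognize, so leave it alone
    }

    $isBackground = ($m[3] === 'background');

    return [
        'shard' => $m[1],
        'user_id' => $m[2],
        // Background images carry no timestamp
        'timestamp' => $isBackground ? null : (int) $m[3],
        'is_background' => $isBackground,
        'extension' => $m[4],
    ];
}

print_r(parseImageKey('a1/42/1500000000.jpg'));
```

Returning `null` for anything that doesn't match keeps a cleanup script from touching files it doesn't understand.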

Not to bog you down with too much domain information, but the database table for the images is keyed on the user ID + Unix timestamp. Any images that were previously deleted would lack a row in the database table.

The background.* images that belonged to users that had since deleted their accounts also needed to be purged.

I could have looped through every deleted user and deleted their respective images, but that would miss any images deleted by users who still have an active account.

Without any records in the database for the deleted images, I couldn’t accurately generate a list of images that needed to be deleted. Because of this, the logical approach was to loop through every damn file in the S3 bucket, check to see if it had been deleted, and if so, remove it from S3.
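
The "has it been deleted?" check could look roughly like the sketch below. Everything in it is hypothetical: `shouldDelete` is a made-up name, and the two arrays stand in for whatever database lookups the real script performs. It only encodes the two rules described above: a regular image is orphaned when no row exists for its user ID + timestamp, and a background image is orphaned when its user no longer exists.

```php
<?php
// Hypothetical decision rule. $imageRows and $activeUsers are stand-ins
// for database lookups, keyed so isset() can check them quickly.
function shouldDelete(string $userId, ?int $timestamp, array $imageRows, array $activeUsers): bool
{
    if ($timestamp === null) {
        // background.* image: purge only when the account is gone
        return !isset($activeUsers[$userId]);
    }

    // Regular image: purge when no row exists for user ID + timestamp
    return !isset($imageRows[$userId . ':' . $timestamp]);
}

// Sample data: user 42 is active and still has one image row
$activeUsers = ['42' => true];
$imageRows = ['42:1500000000' => true];

var_dump(shouldDelete('42', 1500000000, $imageRows, $activeUsers)); // row exists, keep
var_dump(shouldDelete('42', 1400000000, $imageRows, $activeUsers)); // no row, delete
var_dump(shouldDelete('7', null, $imageRows, $activeUsers));        // user gone, delete
```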

And why would I choose PHP for such a task? To be honest, at this point in my life, PHP wouldn’t have been my first pick. The reason I went with it though, is because the project I was working on was originally built in PHP and it was easy enough to hack together a new script leveraging all of the existing infrastructure.

The logic was straightforward enough, unlike most of the AWS documentation. We create an iterator that uses the ListObjects method and then loop until the cows come home.

The following code is going to assume you already have the AWS SDK installed and configured. I’ve omitted my sanity checking logic as well:

<?php
require './path/to/your/autoload.php';

$region = 'Your Region';
$bucket = 'Your Bucket';
$key = 'Your Key';
$secret = 'Your Secret';
// Include this if you want to loop through a specific directory
// $prefix = 'Your Prefix';

$s3 = new Aws\S3\S3Client([
    'version' => 'latest',
    'region' => $region,
    'credentials' => [
        'key' => $key,
        'secret' => $secret,
    ],
]);

$objects = $s3->getIterator('ListObjects', [
  'Bucket' => $bucket,
  // 'Prefix' => $prefix,
]);

foreach ($objects as $object) {
  print_r($object);

  // Deletes the current file in the iterator
  // (note: the object's key, not the $key credential from above)
  // $s3->deleteObject([
  //   'Bucket' => $bucket,
  //   'Key' => $object['Key'],
  // ]);
}

Love it or hate it, this PHP code isn’t all that bad. If you’re looking to implement this in another language, it should be pretty simple to port to that language’s AWS SDK.
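
If one delete request per file turns out to be too slow, one possible variation is to collect keys and remove them in batches, since S3's DeleteObjects operation accepts up to 1,000 keys per request. The sketch below is only an illustration of that idea, not what the script above does, and `buildDeleteBatches` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: groups object keys into DeleteObjects request
// payloads. S3 accepts at most 1,000 keys per DeleteObjects call.
function buildDeleteBatches(string $bucket, array $keys, int $batchSize = 1000): array
{
    $batches = [];

    foreach (array_chunk($keys, $batchSize) as $chunk) {
        $batches[] = [
            'Bucket' => $bucket,
            'Delete' => [
                // Each entry has the shape ['Key' => 'path/to/object']
                'Objects' => array_map(function ($key) {
                    return ['Key' => $key];
                }, $chunk),
            ],
        ];
    }

    return $batches;
}

// With a configured client, each payload would be sent like this:
// foreach (buildDeleteBatches($bucket, $keysToDelete) as $params) {
//     $s3->deleteObjects($params);
// }
```

The helper only builds the request arrays, so it can be checked without touching a real bucket.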




