Out of the blue this past week, networking between my Docker containers running in my GitLab CI/CD pipeline stopped working.
I bet you’re thinking that I pushed some new code and it broke, and I decided to take the low road and blame my tech stack?
Absolutely not.
After getting bitten by an issue years ago, I started running my test suite on a schedule, every day before I boot up. By doing so, I can see if any weirdness has cropped up, independent of code changes.
I highly recommend this practice. It helps identify scenarios where perhaps you didn’t pin a dependency as hard as you should have. It also helps identify date/time-dependent tests in your suite that break on the first of the month, year, etc.
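If you want to copy the practice in GitLab, here is a minimal sketch of one way to do it. It assumes the schedule itself (for example, a daily cron such as "0 6 * * *") is created in the project's pipeline schedule settings rather than in the file. The job name nightly-test and the make test command are hypothetical placeholders; CI_PIPELINE_SOURCE is the predefined variable GitLab sets to "schedule" for pipelines started by a schedule.

```yaml
# Hypothetical job that only runs in pipelines started by a pipeline schedule.
# The schedule (e.g. the cron expression "0 6 * * *") is configured in the
# project's pipeline schedule settings, not in this file.
nightly-test:
  stage: test
  script:
    - make test   # placeholder for however your suite is run
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
```

Jobs without any rules already run in scheduled pipelines too, so the rules clause only matters for a job you want to run exclusively on the schedule.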
Sadly, I lost two evenings trying to figure out what actually broke. From my observations, the issue seemed related to my Redis instances. They appeared to be going away at some point in the process.
GitLab Runner Services
My particular setup is a bit more involved than most folks’, as I explicitly spin up two Redis instances. My production setup includes a stand-alone Redis instance for shared sessions, shared caching, and tracking global metrics.
Each of my web servers also has Redis available locally. This adds a layer of caching that improves performance and provides resilience when the stand-alone instance isn’t available.
Running multiple instances of the same service in GitLab Runner comes with its own set of challenges, which I overcame ages ago. Specifically, you need to make sure you alias one of the services:
test:
  stage: test
  services:
    - mongo
    - redis
    - name: redis
      alias: cache
.gitlab-ci.yml
Troubleshooting the issue
As mentioned, I lost a few hours over a couple of evenings trying to figure out what was up here.
First, I tried cutting down to a single Redis instance in the tests. I added debugging logic in my codebase as well as in my .gitlab-ci.yml file. I tried a bunch of other shots in the dark, to no avail.
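If you end up in the same spot, one cheap check is to ask each service for a response before the suite runs. This is a minimal sketch, not the debugging I actually added: it assumes the job image ships redis-cli, and it mirrors the Redis services from the job above, so the stand-alone instance answers at the hostname redis and the aliased one at cache. The make test command is a hypothetical placeholder.

```yaml
test:
  stage: test
  services:
    - redis
    - name: redis
      alias: cache
  before_script:
    # Assumes redis-cli is installed in the job image.
    # Each service is addressed by hostname: "redis" by default, "cache" via the alias.
    - redis-cli -h redis ping
    - redis-cli -h cache ping
  script:
    - make test   # placeholder for however your suite is run
```

When the containers can see each other, both commands should print PONG; when they can't, redis-cli reports that it could not connect, which points at the network rather than at your code.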
I knew that GitLab had shipped a release around the same time this problem started. To make matters worse, I couldn’t articulate the problem well enough for my searches to turn up anything useful.
Usually I’d reach for ChatGPT, but I felt this issue was new enough that the ol’ Robot wouldn’t be much help.
Who moved my cheese?
One of the cardinal sins of building projects is changing things out from under your users. After even more research, I finally started to gain some traction and found plenty of folks with issues cropping up out of nowhere this week.
Bless up, it ain’t just me! 🙌
It seems a change was in fact made on GitLab’s end that altered the default networking behavior. No network, no way to connect to your services.
I’m actually shocked this wasn’t a full meltdown scenario on X/Reddit/etc. Probably a testament to the reach GitLab has compared to GitHub.
The fix ended up being incredibly easy: you simply need to set a new variable, FF_NETWORK_PER_BUILD, a GitLab Runner feature flag that creates a dedicated Docker network for each build and, shocker, enables networking again:
variables:
  GIT_STRATEGY: clone
  FF_NETWORK_PER_BUILD: "true"
.gitlab-ci.yml
The time spent on this is time I can’t get back, so I’ll probably be exploring my options, as I only have one project on GitLab at this point.