Registration Races When Scaling Out Self-hosted Runners: How We Fixed Them with a File Lock and Retries


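![Diagram of the GitHub Actions self-hosted runner registration flow. A male character explains the problem, in which Docker containers collide while fetching a registration token and keep retrying, and the fix, in which a file lock and a retry loop let registration succeed.](https://wakatchi.dev/wp-content/uploads/2026/05/github-actions-self-hosted-runner-registration-race-condition-eyecatch.webp)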

When we tried to scale out our GitHub Actions self-hosted runners with Docker Compose, some containers got stuck in an endless restart loop.

```bash
docker compose up -d --scale runner-biz-dev=3
```

Expectation: All three runners register with GitHub and enter an `Idle` state.

Reality: Only one runner survived, while the other two were stuck in a `Restarting (1)` loop.

```
Invalid configuration provided for token. Terminating unattended configuration.
Restarting (1)... Restarting (1)... Restarting (1)...
```

This article documents the three phases of our process: identifying the cause, implementing a fix, and verifying the solution.

## Cause: Concurrent Consumption of the Registration Token

To register a self-hosted runner with GitHub, you obtain a registration token and pass it to `config.sh`.

```bash
# Get a registration token ({org} is a placeholder for the organization name)
REG_TOKEN=$(curl -s -X POST \
  "https://api.github.com/orgs/{org}/actions/runners/registration-token" \
  -H "Authorization: Bearer ${PAT}" | jq -r .token)

# Register the runner (quote the token so it is passed as a single argument)
./config.sh --url https://github.com/{org} --token "${REG_TOKEN}" --replace
```

The problem, as we observed it, is that **a registration token becomes invalid after a single use**. When three containers start simultaneously, their near-simultaneous API calls all receive the same token (the fingerprint logs in Phase 1 confirmed this). Only the first container to run `config.sh` succeeds. The other two fail with "Invalid configuration provided for token."

Docker's restart policy attempts to restart them, but since they don't fetch a new token and keep using the same invalid one, they get stuck in an infinite loop.

## Phase 1: Enhancing Diagnostic Logs

Before changing anything, we first made the failure observable so that we could see exactly what was happening.

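### Token Fingerprint

Logging the full token is a security risk. Instead, we decided to log only the first four and last four characters.

```bash
get_registration_token() {
  local token
  # The request details are elided in this post.
  token=$(curl -s -X POST ... | jq -r .token)

  # jq prints "null" when the field is missing; report that as a failure
  # so that callers using `|| return 1` actually stop.
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    return 1
  fi

  # Log only a fingerprint. This assumes log_step writes to stderr,
  # because stdout is captured by the caller as the token value.
  if [ "${#token}" -ge 8 ]; then
    local fp_head="${token:0:4}"
    local fp_tail="${token: -4}"
    log_step "token" "ok" "fp_head=${fp_head}*** fp_tail=***${fp_tail}"
  fi

  echo "$token"
}
```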

Comparing the logs from all three containers showed the same fingerprint in each. This confirmed that the containers were sharing one token and pointed to concurrent token consumption as the root cause.
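The comparison itself can be scripted. The following is a minimal sketch, assuming the log lines contain the `fp_head=... fp_tail=...` fields in the format logged above; the function name `count_distinct_fingerprints` and the sample lines are hypothetical and not part of our entrypoint.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count distinct token fingerprints in log lines read from stdin.
count_distinct_fingerprints() {
  grep -oE 'fp_head=[^ ]+ fp_tail=[^ ]+' | sort -u | wc -l | tr -d ' '
}

# Made-up sample lines in the assumed log format.
sample_logs='runner-1 | step=token status=ok fp_head=ABCD*** fp_tail=***WXYZ
runner-2 | step=token status=ok fp_head=ABCD*** fp_tail=***WXYZ
runner-3 | step=token status=ok fp_head=ABCD*** fp_tail=***WXYZ'

distinct=$(printf '%s\n' "$sample_logs" | count_distinct_fingerprints)
echo "distinct fingerprints: ${distinct}"
```

In real use the input would come from something like `docker compose logs runner-biz-dev | count_distinct_fingerprints`. A count of 1 across several containers means they all logged the same fingerprint.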

### Compressing `config.sh` Failure Logs

The stderr output from `config.sh` can span multiple lines, making it hard to follow in `docker compose logs`. We compressed the first 20 lines into a single line for logging.

```bash
log_config_failure_head() {
  local log_file="$1"
  local compressed
  # First 20 lines, newlines replaced with '|', capped at 500 characters
  compressed=$(head -n 20 "$log_file" | tr '\n' '|' | cut -c1-500)
  log_step "config" "error" "head=${compressed}"
}
```
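Both helpers above call `log_step`, whose definition is not shown in this post. As a rough sketch only, a compatible helper could look like the following; the timestamp and `key=value` layout are assumptions, and the one property the callers rely on is that it writes to stderr so that it never pollutes a captured stdout value such as the token.

```bash
#!/usr/bin/env bash
# Hypothetical log_step: the output format here is an assumption, not the original.
log_step() {
  local step="$1" status="$2" detail="${3:-}"
  # Write to stderr so command substitutions that capture stdout stay clean.
  printf '%s step=%s status=%s %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$step" "$status" "$detail" >&2
}

log_step "token" "ok" "fp_head=ABCD*** fp_tail=***WXYZ"
```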

## Phase 2: Serialization with File Locks + Retries

### Serializing Registration with `flock`

We set up a lock file for each organization and used `flock(1)` to enforce exclusive access.

```bash
with_register_lock() {
  local lock_dir="${REGISTER_LOCK_DIR:-/var/lock/runner-register}"
  local lock_timeout="${REGISTER_LOCK_TIMEOUT_S:-180}"

  # Skip locking if the lock directory doesn't exist (backward compatibility with single runners)
  if [ ! -d "$lock_dir" ]; then
    "$@"
    return $?
  fi

  local lock_file="${lock_dir}/${ORG_NAME}.lock"
  local t0 t1
  t0=$(date +%s)

  # The subshell keeps FD 200 open on the lock file; the lock goes away when the subshell exits.
  # Note: "$@" runs inside the subshell, so variables it sets are not visible to the caller.
  (
    if ! flock -w "$lock_timeout" 200; then
      log_step "register_lock" "error" "timeout after ${lock_timeout}s"
      exit 1
    fi
    t1=$(date +%s)
    log_step "register_lock" "ok" "acquired (waited=$((t1 - t0))s)"
    "$@"
  ) 200>"$lock_file"
}
```

`flock` takes an advisory lock through an open file descriptor (FD 200 here). The kernel releases the lock automatically when the process holding that descriptor exits, even if it crashes, so no stale lock is left behind.
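To see the serialization in isolation, here is a small self-contained sketch that uses the same FD-200 pattern with two background jobs. It assumes `flock(1)` from util-linux is installed; the function name `critical_section` and the temporary files are made up for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch: two background jobs contend for one lock file.
# If flock serializes them, their start/end lines never interleave.
lock_file=$(mktemp)
out_file=$(mktemp)

critical_section() {
  local name="$1"
  (
    flock -w 10 200 || exit 1
    echo "start ${name}" >>"$out_file"
    sleep 0.2
    echo "end ${name}" >>"$out_file"
  ) 200>"$lock_file"
}

critical_section a &
critical_section b &
wait

# Verify: every "end X" directly follows "start X", and no "start" follows a "start".
serialized=yes
prev=""
while read -r event name; do
  if [ "$event" = "end" ] && [ "$prev" != "start ${name}" ]; then serialized=no; fi
  if [ "$event" = "start" ] && [ "${prev%% *}" = "start" ]; then serialized=no; fi
  prev="${event} ${name}"
done <"$out_file"

lines=$(wc -l <"$out_file" | tr -d ' ')
echo "serialized=${serialized} lines=${lines}"
rm -f "$lock_file" "$out_file"
```

Which job goes first is not deterministic, but the second one only starts after the first has written its `end` line.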

In Docker Compose, we mount a named volume per organization to share the lock file among containers for the same org.

```yaml
services:
  runner-biz-dev:
    environment:
      REGISTER_LOCK_DIR: "/var/lock/runner-register"
    volumes:
      - runner-biz-dev-register-lock:/var/lock/runner-register

volumes:
  runner-biz-dev-register-lock:
```

### Retry Logic

After acquiring the lock, we also implemented a retry mechanism for cases where registration fails.

```bash
register_and_config() {
  local max_attempts="${CONFIG_RETRY_MAX:-3}"
  local attempt=0
  # Assumption: config.sh output is captured in this file so the error can be classified.
  # The variable name CONFIG_LOG and the default path are illustrative.
  local config_log="${CONFIG_LOG:-/tmp/runner-config.log}"

  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))

    # Get a new token every time (since they are single-use)
    local reg_token
    reg_token=$(get_registration_token) || return 1

    if ./config.sh --url ... --token "$reg_token" --replace >"$config_log" 2>&1; then
      log_step "config" "ok" "succeeded (attempt=${attempt})"
      return 0
    fi
    log_config_failure_head "$config_log"

    # Classify error
    if grep -qE 'Bad credentials|Not Found' "$config_log"; then
      log_step "config" "error" "non-retryable error"
      return 1  # Don't retry on authentication errors
    fi

    # Back off before the next attempt (skipped after the final one)
    if [ "$attempt" -lt "$max_attempts" ]; then
      interruptible_sleep "$((attempt * 2))"
    fi
  done

  return 1
}
```

We classified errors into several types.

| Error Pattern | Behavior | Reason |
|---|---|---|
| `Invalid configuration provided for token` | Retry (up to 3 times) | Token has already been used; retry with a new one. |
| `A runner exists with the same name` | Retry (up to 3 times) | Timing conflict; can be resolved with `--replace`. |
| `Bad credentials` / `Not Found` | Fail immediately | Retrying authentication errors is futile. |
| Unknown error | Retry once | To prevent CPU spinning. |
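The table can be expressed as a small classifier. This is only a sketch of the mapping in the table, not the code in our entrypoint; the function name `classify_config_error` and the labels `fatal`, `retry`, and `retry_once` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical classifier: maps a captured config.sh log to a retry decision.
classify_config_error() {
  local log_file="$1"
  if grep -qE 'Bad credentials|Not Found' "$log_file"; then
    echo "fatal"        # authentication errors: fail immediately
  elif grep -qE 'Invalid configuration provided for token|A runner exists with the same name' "$log_file"; then
    echo "retry"        # known transient conflicts: retry up to the configured maximum
  else
    echo "retry_once"   # unknown errors: a single extra attempt
  fi
}

# Example with a made-up log file.
tmp_log=$(mktemp)
echo "Invalid configuration provided for token. Terminating unattended configuration." >"$tmp_log"
decision=$(classify_config_error "$tmp_log")
echo "decision=${decision}"
rm -f "$tmp_log"
```

Checking the fatal patterns first means a log that contains both an authentication error and a token error is never retried.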

### `die()` Cooldown

If `config.sh` ultimately fails, we added a **15-second cooldown** before calling `exit 1`.

```bash
die() {
  local cooldown="${ENTRYPOINT_FAILURE_COOLDOWN_S:-15}"
  log_step "shutdown" "ok" "sleeping ${cooldown}s before exit"
  # Pause before exiting so the restart policy cannot spin the container in a tight loop
  sleep "$cooldown"
  exit 1
}
```

Without this, Docker's restart policy restarts the container almost immediately, and the resulting crash loop burns host CPU.

### interruptible_sleep

This allows the script to receive a SIGTERM (for graceful shutdown) while waiting during cooldowns or retries.

```bash
interruptible_sleep() {
  local secs="$1"
  # Run sleep in the background and wait on it; `wait` can be interrupted by a trapped signal
  sleep "$secs" &
  wait "$!"
}
```

With the `sleep N &` + `wait $!` pattern, a SIGTERM received during the `wait` interrupts it and the trap runs right away. With a plain foreground `sleep N`, bash defers the trap until `sleep` finishes, which delays the deregistration process.
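The behavior can be checked with a short standalone script. This is a sketch under a few assumptions: the trap body, the 0.3-second delay, and the extra `kill` that cleans up the leftover `sleep` are additions for the demonstration and are not in our entrypoint.

```bash
#!/usr/bin/env bash
# Sketch: show that a trapped SIGTERM interrupts `wait` well before the sleep ends.
got_term=0
trap 'got_term=1' TERM

interruptible_sleep() {
  local secs="$1" pid rc
  sleep "$secs" &
  pid=$!
  wait "$pid"
  rc=$?
  # Demo-only addition: stop the leftover sleep if wait returned early.
  kill "$pid" 2>/dev/null
  return "$rc"
}

# Send SIGTERM to this shell shortly after the sleep starts.
( sleep 0.3; kill -TERM "$$" ) &

start=$(date +%s)
interruptible_sleep 5 || true
elapsed=$(( $(date +%s) - start ))
echo "got_term=${got_term} elapsed=${elapsed}s"
```

The script returns from `interruptible_sleep` as soon as the signal arrives instead of waiting the full five seconds.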

## Phase 3: Smoke Test

To ensure the fix continues to work correctly, we added a regression test.

```bash
#!/usr/bin/env bash
# smoke-scale-startup.sh
PASSES="${PASSES:-5}"
SCALE="${SCALE:-3}"
TIMEOUT="${TIMEOUT:-90}"
```

The test verifies the state in two stages:

1. **Container-side:** Check if the expected number of containers are `running` using `docker compose ps --status=running`.
2. **API-side:** Check via the GitHub API if the runners are `online` and have the correct name prefix.

```bash
# Container-side check
running=$(docker compose ps --status=running -q | wc -l)
[ "$running" -eq "$((SCALE * 2))" ]

# API-side check (the prefix is passed with --arg instead of being interpolated into the jq program)
online=$(gh api "orgs/${org}/actions/runners?per_page=100" \
  | jq --arg prefix "$prefix" '[.runners[] | select(.status == "online") | select(.name | startswith($prefix))] | length')
[ "$online" -eq "$SCALE" ]
```

We verify that all containers register successfully every time across 5 restarts at a scale of 3. In our CI, this test is scheduled to run every weekday at 18:00 JST to continuously detect regressions.
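The checks above need to be polled until they pass or `TIMEOUT` elapses. The polling code is not shown in this post, so the following is only a sketch of one way to do it; the helper name `wait_until` and the marker-file demo are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-run a command once per second until it succeeds
# or the timeout (in seconds) is reached. Returns 1 on timeout.
wait_until() {
  local timeout_s="$1"
  shift
  local deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Demo: the condition becomes true after about one second.
marker=$(mktemp -u)
( sleep 1; : >"$marker" ) &
if wait_until 5 test -e "$marker"; then
  result="ready"
else
  result="timeout"
fi
echo "result=${result}"
rm -f "$marker"
```

In the smoke test, the command passed to `wait_until` would be a function that wraps the container-side or API-side check, called with `"$TIMEOUT"` as the first argument.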

## Summary

| Phase | Action | Purpose |
|---|---|---|
| Phase 1 | Diagnostic logs (token fingerprint + error compression) | Identify the root cause |
| Phase 2 | Serialization with flock + retries + cooldown | Implement a structural fix |
| Phase 3 | Smoke test (5 passes × scale=3) | Detect regressions |

The key to our success was following the order of 'observe first, then fix, and finally verify.' Without the diagnostic logs from Phase 1, we wouldn't have been able to pinpoint the token race condition as the cause and might have implemented an ineffective fix.

