claude-mem/SOCKET_DEBUG_HYPOTHESES.md
Alex Newman 2d080b0264 Add a basic Unix socket server using Bun
- Implemented a simple server using the net module.
- The server listens on a specified socket path.
- Added error handling for server errors.
- Included checks to verify the existence of the socket file.
2025-10-16 17:07:14 -04:00


Socket File Not Created - Debug Hypotheses

Problem Statement

Worker process logs "Socket server listening: /Users/alexnewman/.claude-mem/worker-28.sock" but the socket file never appears on the filesystem. All connection attempts fail with ENOENT.

Hypotheses (Ordered by Likelihood)

H1: Worker Process Exits Immediately After Socket Creation

Theory: Worker creates socket, logs message, then crashes/exits before we poll for the file.

Evidence:

  • We see the log message
  • Socket never appears
  • No other worker output after "listening" message

Tests:

  • Check if worker process is running: ps aux | grep worker
  • Add worker exit handlers to see exit code
  • Check if worker.ts crashes after startSocketServer()

Root Cause Possibilities:

  • Database query fails in loadSession() (worker.ts:75)
  • SDK agent initialization crashes
  • Unhandled promise rejection in run()
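
One way to run the "exit handlers" test is to register process-level handlers at the very top of the worker's entry point. The sketch below is an assumption about how that could look, not code taken from worker.ts; `installExitDiagnostics` and its `log` parameter are hypothetical names.

```typescript
// Hypothetical helper: call once, before any other worker code runs.
function installExitDiagnostics(
  log: (msg: string) => void = (msg) => console.error(msg),
): void {
  // Fires on every exit path, including explicit process.exit() calls.
  process.on("exit", (code) => {
    log(`[worker ${process.pid}] exiting with code ${code}`);
  });
  // Synchronous throw that nothing caught.
  process.on("uncaughtException", (err) => {
    log(`[worker ${process.pid}] uncaught exception: ${err.stack ?? String(err)}`);
    process.exit(1);
  });
  // Rejected promise with no handler, e.g. inside an un-awaited run().
  process.on("unhandledRejection", (reason) => {
    log(`[worker ${process.pid}] unhandled rejection: ${String(reason)}`);
    process.exit(1);
  });
}
```

If the worker dies after the "listening" message, one of these three lines should then show up in the captured stderr and narrow down which of the root causes above applies.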

H2: detached: false Kills Worker Prematurely

Theory: With detached: false the worker stays in the replay script's process group, so it may be killed when the replay script exits or when a signal is delivered to that group.

Evidence:

  • Production uses detached: true, stdio: 'ignore'
  • Replay uses detached: false, stdio: ['ignore', 'pipe', 'pipe']
  • Worker might be getting killed by parent process lifecycle

Tests:

  • Change to detached: true, stdio: 'ignore', worker.unref()
  • Check worker persists: ps aux | grep worker after spawn

Expected Fix:

  • Worker should persist independently
  • Socket should remain available
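
The spawn change and the persistence check can be sketched together as follows. This is a minimal illustration assuming Node-compatible `child_process`; `spawnDetached` and `isAlive` are hypothetical helper names, and the options simply mirror the production settings listed above.

```typescript
import { spawn } from "node:child_process";

// Spawn a child with the settings the document attributes to production.
function spawnDetached(command: string, args: string[]): number {
  const child = spawn(command, args, { detached: true, stdio: "ignore" });
  // Report spawn failures (e.g. ENOENT) instead of crashing on an unhandled 'error' event.
  child.once("error", (err) => console.error("spawn failed:", err));
  child.unref(); // let the parent's event loop exit without waiting for the child
  if (child.pid === undefined) throw new Error(`failed to spawn ${command}`);
  return child.pid;
}

// Programmatic equivalent of checking `ps` for a known pid.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 sends nothing; it only checks that the pid exists
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}
```

Calling `isAlive(pid)` a second or two after the spawn gives the same answer as the `ps` check without matching unrelated processes that happen to contain "worker" in their command line.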

H3: stdio Piping Interferes with Socket Creation

Theory: Piping stdout/stderr (stdio: ['ignore', 'pipe', 'pipe']) prevents proper socket file creation or causes worker to hang.

Evidence:

  • Production uses stdio: 'ignore'
  • We're trying to capture output with pipes
  • This might interfere with Unix domain socket operations

Tests:

  • Change to stdio: 'ignore' (no piping)
  • Worker won't output to our console but should work
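
If pipes turn out to be the problem, worker output does not have to be lost: `child_process.spawn` also accepts file descriptors in the `stdio` array, so stdout and stderr can go to a log file instead. A minimal sketch, assuming Node-compatible `child_process`; `spawnWithLogFile` and the log path are hypothetical.

```typescript
import { spawn } from "node:child_process";
import * as fs from "node:fs";

// Spawn a detached child whose stdout and stderr are appended to logPath.
function spawnWithLogFile(command: string, args: string[], logPath: string) {
  const fd = fs.openSync(logPath, "a");
  const child = spawn(command, args, {
    detached: true,
    stdio: ["ignore", fd, fd], // no pipes back to the parent
  });
  fs.closeSync(fd); // the child holds its own copy of the descriptor
  child.once("error", (err) => console.error("spawn failed:", err));
  child.unref();
  return child;
}
```

The worker can then be inspected with `tail -f` on the log file while the spawn configuration stays pipe-free.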

H4: Socket Path Mismatch

Theory: Worker creates socket at different path than replay script expects.

Evidence:

  • getWorkerSocketPath(sessionId) used in both places
  • Both should resolve to ~/.claude-mem/worker-.sock
  • But maybe DATA_DIR differs between environments

Tests:

  • Log actual socketPath in worker: console.error('Creating socket at:', this.socketPath)
  • List all sockets: ls -la ~/.claude-mem/*.sock
  • Check if socket appears elsewhere: find /tmp -name "worker-*.sock"

Root Cause Possibilities:

  • CLAUDE_MEM_DATA_DIR environment variable difference
  • Worker started with different env
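
To illustrate how an environment difference could move the socket, the sketch below resolves the path from an explicit environment object. The real getWorkerSocketPath() is not shown in this document, so `resolveSocketPath` is a hypothetical stand-in, and the assumption that CLAUDE_MEM_DATA_DIR overrides a ~/.claude-mem default is exactly that, an assumption.

```typescript
import * as os from "node:os";
import * as path from "node:path";

type Env = Record<string, string | undefined>;

// Hypothetical stand-in for getWorkerSocketPath(); the override rule is assumed.
function resolveSocketPath(sessionId: number, env: Env = process.env): string {
  const dataDir = env.CLAUDE_MEM_DATA_DIR ?? path.join(os.homedir(), ".claude-mem");
  return path.join(dataDir, `worker-${sessionId}.sock`);
}

// Log the same line from both the worker and the replay script, then compare.
function logSocketPath(role: string, sessionId: number, env: Env = process.env): string {
  const line = `[${role}] socket path: ${resolveSocketPath(sessionId, env)}`;
  console.error(line);
  return line;
}
```

If the two log lines differ, comparing the environment passed to `spawn` with the replay script's own `process.env` should show where the paths diverge.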

H5: Permissions Issue

Theory: Worker can't create socket file due to directory permissions.

Evidence:

  • Socket creation might fail silently
  • Worker logs "listening" before checking if socket file was created

Tests:

  • Check ~/.claude-mem permissions: ls -ld ~/.claude-mem
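  • Try creating socket manually: nc -lU ~/.claude-mem/test.sock (the -l flag makes nc listen and create the socket file; without it nc only tries to connect to an existing socket)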
  • Check worker user vs replay script user

Expected Error:

  • Worker should throw EACCES or EPERM but we might not see it
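
The directory check can also be done from inside the worker before it calls listen(). The following is a sketch under the assumption that the worker runs on a Node-compatible `fs`; `checkSocketDir` is a hypothetical helper.

```typescript
import * as fs from "node:fs";

type DirCheck = { ok: true } | { ok: false; code: string };

// Creating a socket file needs write and search (execute) permission on its directory.
function checkSocketDir(dir: string): DirCheck {
  try {
    fs.accessSync(dir, fs.constants.W_OK | fs.constants.X_OK);
    return { ok: true };
  } catch (err) {
    // Typical codes: ENOENT (directory missing), EACCES (no permission).
    return { ok: false, code: (err as NodeJS.ErrnoException).code ?? "UNKNOWN" };
  }
}
```

Logging the result next to the socket path makes a silent EACCES visible even when the server's own error event is never observed.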

H6: Socket Listen Callback Fires Before File Creation

Theory: The server.listen() callback fires and logs "listening" before the socket file actually appears on filesystem.

Evidence:

  • Node.js/Bun might call callback before filesystem sync
  • We see log but no file

Tests:

  • Add additional wait time after seeing log
  • Add fs.existsSync check inside worker after listen()
  • Increase poll duration/frequency in replay script
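
The existence check inside the worker can be sketched like this. It assumes the worker uses `net.createServer()` as the commit message suggests; `listenAndVerify` is a hypothetical wrapper, not the project's startSocketServer().

```typescript
import * as fs from "node:fs";
import * as net from "node:net";

// Listen on a Unix socket path and report whether the file exists
// at the moment the listen callback runs.
function listenAndVerify(
  socketPath: string,
): Promise<{ server: net.Server; fileExists: boolean }> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject); // surfaces EACCES, EADDRINUSE, etc.
    server.listen(socketPath, () => {
      const fileExists = fs.existsSync(socketPath);
      console.error(`listening on ${socketPath}; file exists: ${fileExists}`);
      resolve({ server, fileExists });
    });
  });
}
```

Under Node.js on Linux and macOS the file should already exist when the callback runs. If the worker logs `file exists: false` under Bun, that would support H6 (or H10); if it logs `true`, the file is being removed or looked up somewhere else.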

H7: CLI Worker Command Routing Broken

Theory: dist/claude-mem.min.js worker <sessionId> doesn't properly route to worker.ts main().

Evidence:

  • cli.ts has .command('worker') handler
  • Handler imports and calls main() from sdk/worker.ts
  • But bundling might break this

Tests:

  • Run directly: dist/claude-mem.min.js worker 28
  • Check if worker main() is actually called
  • Add console.error at top of worker.ts main()

Root Cause Possibilities:

  • Bundle doesn't include worker code
  • Import path broken in minified CLI
  • Commander routing fails

H8: Database Session Not Found by Worker

Theory: Worker can't find session in database, exits early.

Evidence:

  • loadSession() query might return null
  • Code checks if (!session) { exit(1) } (worker.ts:76-79)
  • But we'd expect to see error log

Tests:

  • Verify session exists before spawn: SELECT * FROM sdk_sessions WHERE id = ?
  • Add debug log in loadSession() before query
  • Check DB file path matches

H9: Socket File Created Then Immediately Deleted

Theory: Socket is created but something deletes it (cleanup from previous run, OS, etc).

Evidence:

  • Old socket file might exist and get unlinked (worker.ts:110-112)
  • Maybe multiple workers spawning

Tests:

  • Check for multiple worker processes: ps aux | grep worker
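  • Watch filesystem in real-time: watch ls -la ~/.claude-mem/ (watch is not installed by default on macOS; while true; do ls -la ~/.claude-mem/; sleep 1; done is an equivalent loop)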
  • Add delay before cleanup code runs

H10: Bun vs Node Runtime Issue

Theory: Worker runs under different runtime than expected, causing socket issues.

Evidence:

  • Replay script uses bun: #!/usr/bin/env bun
  • Worker spawned via CLI which uses node: #!/usr/bin/env node
  • Runtime difference might affect socket creation

Tests:

  • Spawn with explicit bun: bun dist/claude-mem.min.js worker 28
  • Or spawn with explicit node: node dist/claude-mem.min.js worker 28
  • Check if runtime matters for Unix sockets
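
To confirm which runtime the worker actually runs under, it can log it at startup. Bun sets `process.versions.bun`, while Node does not; `detectRuntime` is a hypothetical helper name.

```typescript
// Returns "bun" when running under Bun, otherwise "node".
function detectRuntime(): "bun" | "node" {
  return typeof process.versions.bun === "string" ? "bun" : "node";
}

console.error(`runtime=${detectRuntime()} execPath=${process.execPath}`);
```

Running the same line in the replay script and in the worker shows immediately whether the two processes use different runtimes.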

H11: Race Condition in Socket Server Startup

Theory: server.listen() completes but socket isn't ready for connections yet.

Evidence:

  • We poll for 15 seconds
  • Maybe socket file appears but isn't ready
  • Connection attempts might be too early

Tests:

  • Increase wait time after socket found
  • Try connecting with retry logic
  • Check socket file permissions/readiness
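
Retry logic for the connection attempt could look like the sketch below. The attempt count and delay are arbitrary illustration values, and `connectOnce` / `connectWithRetry` are hypothetical helpers rather than functions from the replay script.

```typescript
import * as net from "node:net";

// Single connection attempt to a Unix domain socket.
function connectOnce(socketPath: string): Promise<net.Socket> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection(socketPath);
    socket.once("error", reject);
    socket.once("connect", () => {
      socket.off("error", reject); // later errors belong to the caller
      resolve(socket);
    });
  });
}

// Retry on any error (ENOENT while the file is missing, ECONNREFUSED while
// the file exists but nothing is accepting yet).
async function connectWithRetry(
  socketPath: string,
  attempts = 10,
  delayMs = 200,
): Promise<net.Socket> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await connectOnce(socketPath);
    } catch (err) {
      lastError = err;
      const code = (err as NodeJS.ErrnoException).code ?? "UNKNOWN";
      console.error(`connect attempt ${i}/${attempts} failed: ${code}`);
      if (i < attempts) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastError;
}
```

The logged error codes are useful on their own: a run of ENOENT points back at the missing file (H1 to H10), while ECONNREFUSED would mean the file exists and the readiness race described here is real.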

H12: Worker Logs to Wrong Stream

Theory: Worker logs "listening" to stdout/stderr but then crashes, and we only see initial log.

Evidence:

  • console.error used in worker (worker.ts:86)
  • With stdio: ['ignore', 'pipe', 'pipe'], stderr is piped
  • Maybe crash happens but we don't see it

Tests:

  • Check full worker output captured
  • Look for crash stack traces
  • Add more logging throughout worker.run()
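
To make sure nothing the worker prints is dropped, the replay script can buffer both streams and wait for the `close` event, which fires only after the process has exited and its pipes have been drained. A minimal sketch, assuming the piped stdio configuration described above; `runAndCapture` is a hypothetical helper.

```typescript
import { spawn } from "node:child_process";

interface WorkerRun {
  code: number | null;
  signal: string | null;
  stdout: string;
  stderr: string;
}

// Run a command to completion and return everything it wrote.
function runAndCapture(command: string, args: string[]): Promise<WorkerRun> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    let stdout = "";
    let stderr = "";
    child.stdout?.setEncoding("utf8").on("data", (chunk: string) => { stdout += chunk; });
    child.stderr?.setEncoding("utf8").on("data", (chunk: string) => { stderr += chunk; });
    child.once("error", reject);
    // 'close' (unlike 'exit') guarantees the stdio streams have ended.
    child.once("close", (code, signal) => resolve({ code, signal, stdout, stderr }));
  });
}
```

A crash after the "listening" line then shows up as a non-zero `code` together with the stack trace at the end of `stderr`.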

Recommended Testing Order

  1. Change spawn config to match production exactly

    • detached: true
    • stdio: 'ignore'
    • worker.unref()
    • If the socket still does not appear, this rules out H2 and H3
  2. Check worker process persistence

    • ps aux | grep worker immediately after spawn
    • If not running → H1, H7, H8
    • If running → H4, H5, H6
  3. Check socket file location

    • ls -la ~/.claude-mem/*.sock
    • find /tmp -name "worker-*.sock"
    • If found elsewhere → H4
    • If not found → H1, H5, H6
  4. Run worker directly for debugging

    • Run dist/claude-mem.min.js worker 28 manually
    • See full output
    • Check if socket appears
  5. Add more worker logging

    • Log at start of main()
    • Log after loadSession()
    • Log after startSocketServer() promise resolves
    • Log socket path being used
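
The branching in steps 2 and 3 can be written down as a small lookup. The mapping below is copied from those steps; combining the two observations by intersecting their hypothesis lists is an added assumption, as are the names `candidateHypotheses` and `SocketLocation`.

```typescript
type SocketLocation = "expected" | "elsewhere" | "missing";

// Map the observations from steps 2 and 3 to the hypotheses still in play.
function candidateHypotheses(workerRunning: boolean, socket: SocketLocation): string[] {
  // Step 2: is the worker process still running?
  const byProcess = workerRunning ? ["H4", "H5", "H6"] : ["H1", "H7", "H8"];
  // Step 3: where, if anywhere, was the socket file found?
  const bySocket: Record<SocketLocation, string[] | null> = {
    expected: null, // file is where it should be: step 3 adds no constraint
    elsewhere: ["H4"],
    missing: ["H1", "H5", "H6"],
  };
  const fromSocket = bySocket[socket];
  if (fromSocket === null) return byProcess;
  // Assumption: prefer hypotheses consistent with both observations.
  const both = byProcess.filter((h) => fromSocket.includes(h));
  return both.length > 0 ? both : byProcess.concat(fromSocket);
}
```

For example, a worker that is still running with no socket file anywhere narrows the list to H5 and H6, while a dead worker with no socket file points at H1.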