Improve error handling and logging across worker services (#528)

* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
 Clarified memory_session_id bug fix location (via PR comment)
 Made generator error handler message cleanup atomic
 Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-01-03 18:51:59 -05:00
committed by GitHub
parent e830157e77
commit 817b9e8f27
31 changed files with 4490 additions and 3292 deletions
+3 -3
View File
@@ -3,11 +3,11 @@ import{stdin as y}from"process";var U=JSON.stringify({continue:!0,suppressOutput
${t.stack}`:t.message;if(Array.isArray(t))return`[${t.length} items]`;let r=Object.keys(t);return r.length===0?"{}":r.length<=3?JSON.stringify(t):`{${r.length} keys: ${r.slice(0,3).join(", ")}...}`}return String(t)}formatTool(t,r){if(!r)return t;let e=typeof r=="string"?JSON.parse(r):r;if(t==="Bash"&&e.command)return`${t}(${e.command})`;if(e.file_path)return`${t}(${e.file_path})`;if(e.notebook_path)return`${t}(${e.notebook_path})`;if(t==="Glob"&&e.pattern)return`${t}(${e.pattern})`;if(t==="Grep"&&e.pattern)return`${t}(${e.pattern})`;if(e.url)return`${t}(${e.url})`;if(e.query)return`${t}(${e.query})`;if(t==="Task"){if(e.subagent_type)return`${t}(${e.subagent_type})`;if(e.description)return`${t}(${e.description})`}return t==="Skill"&&e.skill?`${t}(${e.skill})`:t==="LSP"&&e.operation?`${t}(${e.operation})`:t}formatTimestamp(t){let r=t.getFullYear(),e=String(t.getMonth()+1).padStart(2,"0"),n=String(t.getDate()).padStart(2,"0"),o=String(t.getHours()).padStart(2,"0"),i=String(t.getMinutes()).padStart(2,"0"),E=String(t.getSeconds()).padStart(2,"0"),l=String(t.getMilliseconds()).padStart(3,"0");return`${r}-${e}-${n} ${o}:${i}:${E}.${l}`}log(t,r,e,n,o){if(t<this.getLevel())return;let i=this.formatTimestamp(new Date),E=M[t].padEnd(5),l=r.padEnd(6),c="";n?.correlationId?c=`[${n.correlationId}] `:n?.sessionId&&(c=`[session-${n.sessionId}] `);let g="";o!=null&&(o instanceof Error?g=this.getLevel()===0?`
${o.message}
${o.stack}`:` ${o.message}`:this.getLevel()===0&&typeof o=="object"?g=`
`+JSON.stringify(o,null,2):g=" "+this.formatData(o));let O="";if(n){let{sessionId:D,memorySessionId:q,correlationId:z,...m}=n;Object.keys(m).length>0&&(O=` {${Object.entries(m).map(([P,$])=>`${P}=${$}`).join(", ")}}`)}let C=`[${i}] [${E}] [${l}] ${c}${e}${O}${g}`;if(this.logFilePath)try{W(this.logFilePath,C+`
`+JSON.stringify(o,null,2):g=" "+this.formatData(o));let u="";if(n){let{sessionId:D,memorySessionId:q,correlationId:z,...m}=n;Object.keys(m).length>0&&(u=` {${Object.entries(m).map(([P,$])=>`${P}=${$}`).join(", ")}}`)}let C=`[${i}] [${E}] [${l}] ${c}${e}${u}${g}`;if(this.logFilePath)try{W(this.logFilePath,C+`
`,"utf8")}catch(D){process.stderr.write(`[LOGGER] Failed to write to log file: ${D}
`)}else process.stderr.write(C+`
`)}debug(t,r,e,n){this.log(0,t,r,e,n)}info(t,r,e,n){this.log(1,t,r,e,n)}warn(t,r,e,n){this.log(2,t,r,e,n)}error(t,r,e,n){this.log(3,t,r,e,n)}dataIn(t,r,e,n){this.info(t,`\u2192 ${r}`,e,n)}dataOut(t,r,e,n){this.info(t,`\u2190 ${r}`,e,n)}success(t,r,e,n){this.info(t,`\u2713 ${r}`,e,n)}failure(t,r,e,n){this.error(t,`\u2717 ${r}`,e,n)}timing(t,r,e,n){this.info(t,`\u23F1 ${r}`,n,{duration:`${e}ms`})}happyPathError(t,r,e,n,o=""){let c=((new Error().stack||"").split(`
`)[2]||"").match(/at\s+(?:.*\s+)?\(?([^:]+):(\d+):(\d+)\)?/),g=c?`${c[1].split("/").pop()}:${c[2]}`:"unknown",O={...e,location:g};return this.warn(t,`[HAPPY-PATH] ${r}`,O,n),o}},_=new f;import A from"path";import{homedir as G}from"os";import{readFileSync as K}from"fs";var p={DEFAULT:3e5,HEALTH_CHECK:3e4,WORKER_STARTUP_WAIT:1e3,WORKER_STARTUP_RETRIES:300,PRE_RESTART_SETTLE_DELAY:2e3,WINDOWS_MULTIPLIER:1.5};function h(s){return process.platform==="win32"?Math.round(s*p.WINDOWS_MULTIPLIER):s}function I(s={}){let{port:t,includeSkillFallback:r=!1,customPrefix:e,actualError:n}=s,o=e||"Worker service connection failed.",i=t?` (port ${t})`:"",E=`${o}${i}
`)[2]||"").match(/at\s+(?:.*\s+)?\(?([^:]+):(\d+):(\d+)\)?/),g=c?`${c[1].split("/").pop()}:${c[2]}`:"unknown",u={...e,location:g};return this.warn(t,`[HAPPY-PATH] ${r}`,u,n),o}},_=new f;import A from"path";import{homedir as G}from"os";import{readFileSync as K}from"fs";var p={DEFAULT:3e5,HEALTH_CHECK:3e4,WORKER_STARTUP_WAIT:1e3,WORKER_STARTUP_RETRIES:300,PRE_RESTART_SETTLE_DELAY:2e3,WINDOWS_MULTIPLIER:1.5};function h(s){return process.platform==="win32"?Math.round(s*p.WINDOWS_MULTIPLIER):s}function I(s={}){let{port:t,includeSkillFallback:r=!1,customPrefix:e,actualError:n}=s,o=e||"Worker service connection failed.",i=t?` (port ${t})`:"",E=`${o}${i}
`;return E+=`To restart the worker:
`,E+=`1. Exit Claude Code completely
@@ -16,4 +16,4 @@ ${o.stack}`:` ${o.message}`:this.getLevel()===0&&typeof o=="object"?g=`
If that doesn't work, try: /troubleshoot`),n&&(E=`Worker Error: ${n}
${E}`),E}var X=A.join(G(),".claude","plugins","marketplaces","thedotmack"),At=h(p.HEALTH_CHECK),T=null;function u(){if(T!==null)return T;let s=A.join(a.get("CLAUDE_MEM_DATA_DIR"),"settings.json"),t=a.loadFromFile(s);return T=parseInt(t.CLAUDE_MEM_WORKER_PORT,10),T}async function V(){let s=u();return(await fetch(`http://127.0.0.1:${s}/api/readiness`)).ok}function j(){let s=A.join(X,"package.json");return JSON.parse(K(s,"utf-8")).version}async function B(){let s=u(),t=await fetch(`http://127.0.0.1:${s}/api/version`);if(!t.ok)throw new Error(`Failed to get worker version: ${t.status}`);return(await t.json()).version}async function Y(){let s=j(),t=await B();s!==t&&_.debug("SYSTEM","Version check",{pluginVersion:s,workerVersion:t,note:"Mismatch will be auto-restarted by worker-service start command"})}async function N(){for(let r=0;r<75;r++){try{if(await V()){await Y();return}}catch(e){_.debug("SYSTEM","Worker health check failed, will retry",{attempt:r+1,maxRetries:75,error:e instanceof Error?e.message:String(e)})}await new Promise(e=>setTimeout(e,200))}throw new Error(I({port:u(),customPrefix:"Worker did not become ready within 15 seconds."}))}async function J(s){if(await N(),!s)throw new Error("saveHook requires input");let{session_id:t,cwd:r,tool_name:e,tool_input:n,tool_response:o}=s,i=u(),E=_.formatTool(e,n);if(_.dataIn("HOOK",`PostToolUse: ${E}`,{workerPort:i}),!r)throw new Error(`Missing cwd in PostToolUse hook input for session ${t}, tool ${e}`);let l=await fetch(`http://127.0.0.1:${i}/api/sessions/observations`,{method:"POST",headers:{"Content-Type":"application/json"},body:JSON.stringify({contentSessionId:t,tool_name:e,tool_input:n,tool_response:o,cwd:r})});if(!l.ok)throw new Error(`Observation storage failed: ${l.status}`);_.debug("HOOK","Observation sent successfully",{toolName:e}),console.log(U)}var L="";y.on("data",s=>L+=s);y.on("end",async()=>{let s;try{s=L?JSON.parse(L):void 0}catch(t){throw new Error(`Failed to parse hook input: ${t instanceof Error?t.message:String(t)}`)}await J(s)});
${E}`),E}var X=A.join(G(),".claude","plugins","marketplaces","thedotmack"),At=h(p.HEALTH_CHECK),T=null;function O(){if(T!==null)return T;let s=A.join(a.get("CLAUDE_MEM_DATA_DIR"),"settings.json"),t=a.loadFromFile(s);return T=parseInt(t.CLAUDE_MEM_WORKER_PORT,10),T}async function V(){let s=O();return(await fetch(`http://127.0.0.1:${s}/api/readiness`)).ok}function j(){let s=A.join(X,"package.json");return JSON.parse(K(s,"utf-8")).version}async function B(){let s=O(),t=await fetch(`http://127.0.0.1:${s}/api/version`);if(!t.ok)throw new Error(`Failed to get worker version: ${t.status}`);return(await t.json()).version}async function Y(){let s=j(),t=await B();s!==t&&_.debug("SYSTEM","Version check",{pluginVersion:s,workerVersion:t,note:"Mismatch will be auto-restarted by worker-service start command"})}async function N(){for(let r=0;r<75;r++){try{if(await V()){await Y();return}}catch(e){_.debug("SYSTEM","Worker health check failed, will retry",{attempt:r+1,maxRetries:75,error:e instanceof Error?e.message:String(e)})}await new Promise(e=>setTimeout(e,200))}throw new Error(I({port:O(),customPrefix:"Worker did not become ready within 15 seconds."}))}async function J(s){if(await N(),!s)throw new Error("saveHook requires input");let{session_id:t,cwd:r,tool_name:e,tool_input:n,tool_response:o}=s,i=O(),E=_.formatTool(e,n);if(_.dataIn("HOOK",`PostToolUse: ${E}`,{workerPort:i}),!r)throw new Error(`Missing cwd in PostToolUse hook input for session ${t}, tool ${e}`);let l=await fetch(`http://127.0.0.1:${i}/api/sessions/observations`,{method:"POST",headers:{"Content-Type":"application/json"},body:JSON.stringify({contentSessionId:t,tool_name:e,tool_input:n,tool_response:o,cwd:r})});if(!l.ok)throw new Error(`Observation storage failed: ${l.status}`);_.debug("HOOK","Observation sent successfully",{toolName:e}),console.log(U)}var L="";y.on("data",s=>L+=s);y.on("end",async()=>{try{let s;try{s=L?JSON.parse(L):void 0}catch(t){throw new Error(`Failed to parse hook input: ${t instanceof Error?t.message:String(t)}`)}await J(s)}catch(s){_.error("HOOK","save-hook failed",{},s)}finally{process.exit(0)}});