claude-mem/src/shared/worker-utils.ts
Alex Newman ba1ef6c42c fix: Issue Blowout 2026 — 25 bugs across worker, hooks, security, and search (#2080)
* fix: resolve search, database, and docker bugs (#1913, #1916, #1956, #1957, #2048)

- Fix concept/concepts param mismatch in SearchManager.normalizeParams (#1916)
- Add FTS5 keyword fallback when ChromaDB is unavailable (#1913, #2048)
- Add periodic WAL checkpoint and journal_size_limit to prevent unbounded WAL growth (#1956)
- Add periodic clearFailed() to purge stale pending_messages (#1957)
- Fix nounset-safe TTY_ARGS expansion in docker/claude-mem/run.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent silent data loss on non-XML responses, add queue info to /health (#1867, #1874)

- ResponseProcessor: mark messages as failed (with retry) instead of confirming
  when the LLM returns non-XML garbage (auth errors, rate limits) (#1874)
- Health endpoint: include activeSessions count for queue liveness monitoring (#1867)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: cache isFts5Available() at construction time

Addresses Greptile review: avoid DDL probe (CREATE + DROP) on every text
query. Result is now cached in _fts5Available at construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve worker stability bugs — pool deadlock, MCP loopback, restart guard (#1868, #1876, #2053)

- Replace flat consecutiveRestarts counter with time-windowed RestartGuard:
  only counts restarts within 60s window (cap=10), decays after 5min of
  success. Prevents stranding pending messages on long-running sessions. (#2053)

- Add idle session eviction to pool slot allocation: when all slots are full,
  evict the idlest session (no pending work, oldest activity) to free a slot
  for new requests, preventing 60s timeout deadlock. (#1868)

- Fix MCP loopback self-check: use process.execPath instead of bare 'node'
  which fails on non-interactive PATH. Fix crash misclassification by removing
  false "Generator exited unexpectedly" error log on normal completion. (#1876)
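
The time-windowed counting described for RestartGuard can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class name `RestartGuardSketch`, its method, and its constructor parameters are hypothetical, and the success-based decay is left out so that only the sliding window and the cap are shown.

```typescript
// Hypothetical sketch of a time-windowed restart counter.
// Assumption: a restart is allowed while the number of restarts inside
// the window (including the current one) does not exceed the cap.
class RestartGuardSketch {
  private restartTimes: number[] = [];
  private readonly windowMs: number;
  private readonly cap: number;

  constructor(windowMs: number = 60_000, cap: number = 10) {
    this.windowMs = windowMs;
    this.cap = cap;
  }

  /** Records a restart at `now` (ms); returns true if it is within the cap. */
  recordRestart(now: number): boolean {
    // Forget restarts that have fallen out of the window.
    this.restartTimes = this.restartTimes.filter(t => now - t < this.windowMs);
    this.restartTimes.push(now);
    return this.restartTimes.length <= this.cap;
  }
}
```

Unlike a flat consecutive counter, old restarts stop counting once they age out of the window, so a long-running session with occasional restarts would not eventually hit the cap.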

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve hooks reliability bugs — summarize exit code, session-init health wait (#1896, #1901, #1903, #1907)

- Wrap summarize hook's workerHttpRequest in try/catch to prevent exit
  code 2 (blocking error) on network failures or malformed responses.
  Session exit no longer blocks on worker errors. (#1901)

- Add health-check wait loop to UserPromptSubmit session-init command in
  hooks.json. On Linux/WSL where hook ordering fires UserPromptSubmit
  before SessionStart, session-init now waits up to 10s for worker health
  before proceeding. Also wrap session-init HTTP call in try/catch. (#1907)

- Close #1896 as already-fixed: mtime comparison at file-context.ts:255-267
  bypasses truncation when file is newer than latest observation.

- Close #1903 as no-repro: hooks.json correctly declares all hook events.
  Issue was Claude Code 12.0.1/macOS platform event-dispatch bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: security hardening — bearer auth, path validation, rate limits, per-user port (#1932, #1933, #1934, #1935, #1936)

- Add bearer token auth to all API endpoints: auto-generated 32-byte
  token stored at ~/.claude-mem/worker-auth-token (mode 0600). All hook,
  MCP, viewer, and OpenCode requests include Authorization header.
  Health/readiness endpoints exempt for polling. (#1932, #1933)

- Add path traversal protection: watch.context.path validated against
  project root and ~/.claude-mem/ before write. Rejects ../../../etc
  style attacks. (#1934)

- Reduce JSON body limit from 50MB to 5MB. Add in-memory rate limiter
  (300 req/min/IP) to prevent abuse. (#1935)

- Derive default worker port from UID (37700 + uid%100) to prevent
  cross-user data leakage on multi-user macOS. Windows falls back to
  37777. Shell hooks use same formula via id -u. (#1936)
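
The port formula above can be illustrated with a small function. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the UID is passed in as a parameter rather than read from the platform.

```typescript
// Hypothetical names; only the arithmetic (37700 + uid % 100) and the
// 37777 fallback come from the commit message.
const PORT_BASE = 37700;
const FALLBACK_PORT = 37777;

function derivePortFromUid(uid: number | undefined): number {
  // No usable UID (e.g. on Windows): use the fixed fallback port.
  if (uid === undefined || !Number.isInteger(uid) || uid < 0) {
    return FALLBACK_PORT;
  }
  return PORT_BASE + (uid % 100);
}
```

On POSIX systems Node exposes the UID through `process.getuid()`, which is not defined on Windows. Note that two UIDs with the same last two digits (for example 501 and 601) map to the same port under this formula.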

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve search project filtering and import Chroma sync (#1911, #1912, #1914, #1918)

- Fix per-type search endpoints to pass project filter to Chroma queries
  and SQLite hydration. searchObservations/Sessions/UserPrompts now use
  $or clause matching project + merged_into_project. (#1912)

- Fix timeline/search methods to pass project to Chroma anchor queries.
  Prevents cross-project result leakage when project param omitted. (#1911)

- Sync imported observations to ChromaDB after FTS rebuild. Import
  endpoint now calls chromaSync.syncObservation() for each imported
  row, making them visible to MCP search(). (#1914)

- Fix session-init cwd fallback to match context.ts (process.cwd()).
  Prevents project key mismatch that caused "no previous sessions"
  on fresh sessions. (#1918)

- Fix sync-marketplace restart to include auth token and per-user port.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve all CodeRabbit and Greptile review comments on PR #2080

- Fix run.sh comment mismatch (no-op flag vs empty array)
- Gate session-init on health check success (prevent running when worker unreachable)
- Fix date_desc ordering ignored in FTS session search
- Age-scope failed message purge (1h retention) instead of clearing all
- Anchor RestartGuard decay to real successes (null init, not Date.now())
- Add recordSuccess() calls in ResponseProcessor and completion path
- Prevent caller headers from overriding bearer auth token
- Add lazy cleanup for rate limiter map to prevent unbounded growth
- Bound post-import Chroma sync with concurrency limit of 8
- Add doc_type:'observation' filter to Chroma queries feeding observation hydration
- Add FTS fallback to all specialized search handlers (observations, sessions, prompts, timeline)
- Add response.ok check and error handling in viewer saveSettings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve CodeRabbit round-2 review comments

- Use failure timestamp (COALESCE) instead of created_at_epoch for stale purge
- Downgrade _fts5Available flag when FTS table creation fails
- Escape FTS5 MATCH input by quoting user queries as literal phrases
- Escape LIKE metacharacters (%, _, \) in prompt text search
- Add response.ok check in initial settings load (matches save flow)
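
The two escaping fixes above might look roughly like this. The helper names are hypothetical and the sketch assumes the LIKE query is issued with an `ESCAPE '\'` clause; the actual code in the repository may differ.

```typescript
// Hypothetical helper: turn arbitrary user input into a single FTS5
// string literal so operators such as OR, NEAR, or * are not interpreted.
// FTS5 strings are double-quoted; embedded quotes are escaped by doubling.
function quoteFts5Phrase(query: string): string {
  return `"${query.replace(/"/g, '""')}"`;
}

// Hypothetical helper: escape LIKE metacharacters (%, _, \) with a
// backslash. Assumes the SQL uses: ... LIKE ? ESCAPE '\'
function escapeLikePattern(text: string): string {
  return text.replace(/[\\%_]/g, ch => `\\${ch}`);
}
```

The escaped LIKE text would still be bound as a parameter (for instance wrapped in `%...%`), so escaping here only neutralizes wildcards and is not a substitute for parameter binding.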

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve CodeRabbit round-3 review comments

- Include failed_at_epoch in COALESCE for age-scoped purge
- Re-throw FTS5 errors so callers can distinguish failure from no-results
- Wrap all FTS fallback calls in SearchManager with try/catch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 11:42:09 -07:00

245 lines
8.1 KiB
TypeScript

import path from "path";
import { readFileSync } from "fs";
import { logger } from "../utils/logger.js";
import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
import { SettingsDefaultsManager } from "./SettingsDefaultsManager.js";
import { MARKETPLACE_ROOT } from "./paths.js";
import { getAuthToken } from "./auth-token.js";

// Named constants for health checks
// Allow env var override for users on slow systems (e.g., CLAUDE_MEM_HEALTH_TIMEOUT_MS=10000)
const HEALTH_CHECK_TIMEOUT_MS = (() => {
  const envVal = process.env.CLAUDE_MEM_HEALTH_TIMEOUT_MS;
  if (envVal) {
    const parsed = parseInt(envVal, 10);
    if (Number.isFinite(parsed) && parsed >= 500 && parsed <= 300000) {
      return parsed;
    }
    // Invalid env var: log once and use default
    logger.warn('SYSTEM', 'Invalid CLAUDE_MEM_HEALTH_TIMEOUT_MS, using default', {
      value: envVal, min: 500, max: 300000
    });
  }
  return getTimeout(HOOK_TIMEOUTS.HEALTH_CHECK);
})();

/**
 * Fetch with a timeout by racing a setTimeout against the request instead of
 * using AbortSignal.
 * AbortSignal.timeout() causes a libuv assertion crash in Bun on Windows,
 * so we use a racing setTimeout pattern that avoids signal cleanup entirely.
 * On timeout the underlying fetch is not cancelled; the orphaned request is
 * harmless since the process exits shortly after.
 */
export function fetchWithTimeout(url: string, init: RequestInit = {}, timeoutMs: number): Promise<Response> {
  return new Promise((resolve, reject) => {
    const timeoutId = setTimeout(
      () => reject(new Error(`Request timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    fetch(url, init).then(
      response => { clearTimeout(timeoutId); resolve(response); },
      err => { clearTimeout(timeoutId); reject(err); }
    );
  });
}
// Cache to avoid repeated settings file reads
let cachedPort: number | null = null;
let cachedHost: string | null = null;

/**
 * Get the worker port number from settings.
 * Uses CLAUDE_MEM_WORKER_PORT from the settings file or the default
 * (UID-derived per #1936; 37777 on Windows).
 * Caches the port value to avoid repeated file reads.
 */
export function getWorkerPort(): number {
  if (cachedPort !== null) {
    return cachedPort;
  }
  const settingsPath = path.join(SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR'), 'settings.json');
  const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
  cachedPort = parseInt(settings.CLAUDE_MEM_WORKER_PORT, 10);
  return cachedPort;
}

/**
 * Get the worker host address.
 * Uses CLAUDE_MEM_WORKER_HOST from the settings file or the default (127.0.0.1).
 * Caches the host value to avoid repeated file reads.
 */
export function getWorkerHost(): string {
  if (cachedHost !== null) {
    return cachedHost;
  }
  const settingsPath = path.join(SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR'), 'settings.json');
  const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
  cachedHost = settings.CLAUDE_MEM_WORKER_HOST;
  return cachedHost;
}

/**
 * Clear the cached port and host values.
 * Call this when settings are updated to force re-reading from file.
 */
export function clearPortCache(): void {
  cachedPort = null;
  cachedHost = null;
}

/**
 * Build a full URL for a given API path.
 */
export function buildWorkerUrl(apiPath: string): string {
  return `http://${getWorkerHost()}:${getWorkerPort()}${apiPath}`;
}

/**
 * Make an HTTP request to the worker over TCP.
 *
 * This is the preferred way for hooks to communicate with the worker.
 */
export function workerHttpRequest(
  apiPath: string,
  options: {
    method?: string;
    headers?: Record<string, string>;
    body?: string;
    timeoutMs?: number;
  } = {}
): Promise<Response> {
  const method = options.method ?? 'GET';
  const timeoutMs = options.timeoutMs ?? HEALTH_CHECK_TIMEOUT_MS;
  const url = buildWorkerUrl(apiPath);
  const init: RequestInit = { method };
  // Inject bearer token for worker API auth (#1932/#1933)
  // Merge caller headers first, then set Authorization last to prevent override
  const authHeaders: Record<string, string> = {
    ...options.headers,
    'Authorization': `Bearer ${getAuthToken()}`
  };
  init.headers = authHeaders;
  if (options.body) {
    init.body = options.body;
  }
  if (timeoutMs > 0) {
    return fetchWithTimeout(url, init, timeoutMs);
  }
  return fetch(url, init);
}

/**
 * Check if worker HTTP server is responsive.
 * Uses /api/health (liveness) instead of /api/readiness because:
 * - Hooks have 15-second timeout, but full initialization can take 5+ minutes (MCP connection)
 * - /api/health returns 200 as soon as HTTP server is up (sufficient for hook communication)
 * - /api/readiness returns 503 until full initialization completes (too slow for hooks)
 * See: https://github.com/thedotmack/claude-mem/issues/811
 */
async function isWorkerHealthy(): Promise<boolean> {
  const response = await workerHttpRequest('/api/health', { timeoutMs: HEALTH_CHECK_TIMEOUT_MS });
  return response.ok;
}

/**
 * Get the current plugin version from package.json.
 * Returns 'unknown' on ENOENT/EBUSY (shutdown race condition, fix #1042).
 */
function getPluginVersion(): string {
  try {
    const packageJsonPath = path.join(MARKETPLACE_ROOT, 'package.json');
    const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf-8'));
    return packageJson.version;
  } catch (error: unknown) {
    const code = error instanceof Error ? (error as NodeJS.ErrnoException).code : undefined;
    if (code === 'ENOENT' || code === 'EBUSY') {
      logger.debug('SYSTEM', 'Could not read plugin version (shutdown race)', { code });
      return 'unknown';
    }
    throw error;
  }
}

/**
 * Get the running worker's version from the API.
 */
async function getWorkerVersion(): Promise<string> {
  const response = await workerHttpRequest('/api/version', { timeoutMs: HEALTH_CHECK_TIMEOUT_MS });
  if (!response.ok) {
    throw new Error(`Failed to get worker version: ${response.status}`);
  }
  const data = await response.json() as { version: string };
  return data.version;
}

/**
 * Check if worker version matches plugin version.
 * Note: Auto-restart on version mismatch is now handled in worker-service.ts start command (issue #484).
 * This function logs for informational purposes only.
 * Skips comparison when either version is 'unknown' (fix #1042, avoids restart loops).
 */
async function checkWorkerVersion(): Promise<void> {
  let pluginVersion: string;
  try {
    pluginVersion = getPluginVersion();
  } catch (error: unknown) {
    logger.debug('SYSTEM', 'Version check failed reading plugin version', {
      error: error instanceof Error ? error.message : String(error)
    });
    return;
  }
  // Skip version check if plugin version couldn't be read (shutdown race)
  if (pluginVersion === 'unknown') return;
  let workerVersion: string;
  try {
    workerVersion = await getWorkerVersion();
  } catch (error: unknown) {
    logger.debug('SYSTEM', 'Version check failed reading worker version', {
      error: error instanceof Error ? error.message : String(error)
    });
    return;
  }
  // Skip version check if worker version is 'unknown' (avoids restart loops)
  if (workerVersion === 'unknown') return;
  if (pluginVersion !== workerVersion) {
    // Just log debug info - auto-restart handles the mismatch in worker-service.ts
    logger.debug('SYSTEM', 'Version check', {
      pluginVersion,
      workerVersion,
      note: 'Mismatch will be auto-restarted by worker-service start command'
    });
  }
}

/**
 * Ensure worker service is running.
 * Quick health check: returns false if the worker is not healthy (doesn't block).
 * The port might be in use by another process, or the worker might not be started yet.
 */
export async function ensureWorkerRunning(): Promise<boolean> {
  // Quick health check (single attempt, no polling)
  try {
    if (await isWorkerHealthy()) {
      await checkWorkerVersion(); // logs at debug level on mismatch, doesn't restart
      return true; // Worker healthy
    }
  } catch (e) {
    // Not healthy - log for debugging
    logger.debug('SYSTEM', 'Worker health check failed', {
      error: e instanceof Error ? e.message : String(e)
    });
  }
  // Port might be in use by something else, or worker not started
  // Return false but don't throw - let caller decide how to handle
  logger.warn('SYSTEM', 'Worker not healthy, hook will proceed gracefully');
  return false;
}
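
The header merge in `workerHttpRequest` spreads caller headers first and sets `Authorization` last, which covers a caller passing that exact key. HTTP header names are case-insensitive, though, so a caller-supplied `authorization` is a different object key and would survive the spread. A defensive variant could drop any case variant before setting the token. The sketch below is an assumption for illustration only; `mergeAuthHeaders` is a hypothetical helper that does not exist in this file.

```typescript
// Hypothetical helper: merge caller headers with a bearer token so that
// no caller-supplied Authorization header (in any letter case) survives.
function mergeAuthHeaders(
  callerHeaders: Record<string, string> | undefined,
  token: string
): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const [name, value] of Object.entries(callerHeaders ?? {})) {
    // Skip every case variant of the Authorization header.
    if (name.toLowerCase() !== 'authorization') {
      merged[name] = value;
    }
  }
  merged['Authorization'] = `Bearer ${token}`;
  return merged;
}
```

Called as `mergeAuthHeaders({ 'Content-Type': 'application/json', authorization: 'Bearer other' }, token)`, this would keep `Content-Type` and emit a single `Authorization` entry carrying the worker token.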