feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)

* Refactor worker version checks and increase timeout settings

- Updated the default hook timeout from 5000ms to 120000ms for improved stability.
- Modified the worker version check to log a warning instead of restarting the worker on version mismatch.
- Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function.
- Enhanced the worker readiness polling: 25 retries at 200ms intervals (previously 5 retries at 500ms).

* feat: implement worker queue polling to ensure processing completion before proceeding
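
A minimal sketch of the queue-polling idea, assuming the caller can ask the worker how many items are still pending. The `QueueProbe` type, the `waitForQueueDrain` name, and the default retry values are hypothetical and are not taken from this commit:

```typescript
// Hypothetical probe: resolves to the number of items still pending in the worker queue.
type QueueProbe = () => Promise<number>;

// Poll until the queue reports zero pending items or the retry budget is spent.
async function waitForQueueDrain(
  probe: QueueProbe,
  maxRetries = 25,
  intervalMs = 200,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      if ((await probe()) === 0) {
        return true; // queue is empty, safe to proceed
      }
    } catch {
      // Worker unreachable on this attempt; keep polling.
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // retry budget exhausted; the caller decides what to do next
}
```

Returning a boolean rather than throwing leaves the choice between proceeding and failing to the caller.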

* refactor: change worker command from start to restart in hooks configuration

* refactor: remove session management complexity

- Simplify createSDKSession to pure INSERT OR IGNORE
- Remove auto-create logic from storeObservation/storeSummary
- Delete 11 unused session management methods
- Derive prompt_number from user_prompts count
- Keep sdk_sessions table schema unchanged for compatibility
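
The "pure INSERT OR IGNORE" behavior can be illustrated with an in-memory model. This is a sketch of the semantics only, not the project's SessionStore: the `SessionTableModel` class and the row fields are hypothetical, and a `Map` stands in for the `sdk_sessions` table:

```typescript
// Hypothetical row shape; the real sdk_sessions schema is not shown in this commit.
interface SessionRow {
  id: number;
  claudeSessionId: string;
  project: string;
}

// In-memory stand-in for a table with a unique key on the Claude session ID.
class SessionTableModel {
  private rows = new Map<string, SessionRow>();
  private nextId = 1;

  // Mirrors INSERT OR IGNORE followed by a lookup of the row ID:
  // a duplicate key leaves the existing row untouched.
  createSDKSession(claudeSessionId: string, project: string): number {
    if (!this.rows.has(claudeSessionId)) {
      this.rows.set(claudeSessionId, { id: this.nextId++, claudeSessionId, project });
    }
    return this.rows.get(claudeSessionId)!.id;
  }

  getProject(claudeSessionId: string): string | undefined {
    return this.rows.get(claudeSessionId)?.project;
  }
}
```

Because creation is idempotent, callers can invoke it on every hook without first checking whether the session exists.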

* refactor: simplify session management by removing unused methods and auto-creation logic

* Refactor session prompt number retrieval in SessionRoutes

- Updated the method of obtaining the prompt number from the session.
- Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy.
- Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter.
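
A minimal sketch of deriving the prompt number from a count instead of a stored counter. It assumes the current prompt has already been saved, so the count equals the prompt number. The `UserPrompt` shape and the `countPromptsForSession` helper are hypothetical, and an array stands in for the `user_prompts` table:

```typescript
// Hypothetical row shape for the user_prompts table.
interface UserPrompt {
  claudeSessionId: string;
  text: string;
}

// Equivalent in spirit to a SELECT COUNT(*) over user_prompts filtered by session.
function countPromptsForSession(
  prompts: readonly UserPrompt[],
  claudeSessionId: string,
): number {
  return prompts.filter((p) => p.claudeSessionId === claudeSessionId).length;
}
```

Deriving the value on demand means there is no separate counter that can drift out of sync with the stored prompts.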

* refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager

Phase 7 of session management simplification. Updates SessionManager to derive
prompt numbers from user_prompts table count instead of using the deprecated
prompt_counter column.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: simplify SessionCompletionHandler to use direct SQL query

Phase 8: Remove call to findActiveSDKSession() and replace with direct
database query in SessionCompletionHandler.completeByClaudeId().

This removes dependency on the deleted findActiveSDKSession() method
and simplifies the code by using a straightforward SELECT query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionCompleted call from SDKAgent

- Delete call to markSessionCompleted() in SDKAgent.ts
- Session status is no longer tracked or updated
- Part of phase 9: simplifying session management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionComplete method (Phase 10)

- Deleted markSessionComplete() method from DatabaseManager
- Removed markSessionComplete call from SessionCompletionHandler
- Session completion status no longer tracked in database
- Part of session management simplification effort

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: replace deleted updateSDKSessionId calls in import script (Phase 11)

- Replace updateSDKSessionId() calls with direct SQL UPDATE statements
- Method was deleted in Phase 3 as part of session management simplification
- Import script now uses direct database access consistently

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test: add validation for SQL updates in sdk_sessions table

* refactor: enhance worker-cli to support manual and automated runs

* Remove cleanup hook and associated session completion logic

- Deleted the cleanup-hook implementation from the hooks directory.
- Removed the session completion endpoint that was used by the cleanup hook.
- Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies.
- Adjusted the SessionRoutes to reflect the removal of the session completion route.

* fix: update worker-cli command to use bun for consistency

* feat: Implement timestamp fix for observations and enhance processing logic

- Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message.
- Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing.
- Modified `SessionManager` to track the earliest timestamp when yielding messages.
- Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues.
- Verified that all corrupted observations have been repaired and logic for future processing is sound.
- Ensured orphan processing can be safely re-enabled after validation.
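
The timestamp tracking described above might look roughly like the following. This is a sketch under assumptions: only the `earliestPendingTimestamp` field name comes from this commit, while `ActiveSessionModel`, `PendingMessage`, and the method names are hypothetical:

```typescript
// Hypothetical queued message carrying the time it was originally received.
interface PendingMessage {
  payload: string;
  createdAtEpoch: number;
}

class ActiveSessionModel {
  earliestPendingTimestamp: number | null = null;
  private queue: PendingMessage[] = [];

  enqueue(message: PendingMessage): void {
    this.queue.push(message);
  }

  // Hand the next message to the agent and remember the oldest original
  // timestamp seen since the last stored observation.
  takeNext(): PendingMessage | undefined {
    const message = this.queue.shift();
    if (message) {
      this.earliestPendingTimestamp =
        this.earliestPendingTimestamp === null
          ? message.createdAtEpoch
          : Math.min(this.earliestPendingTimestamp, message.createdAtEpoch);
    }
    return message;
  }

  // Timestamp to store on the resulting observation: the original message time
  // when one was tracked, otherwise the current time. Resets the tracker.
  consumeObservationTimestamp(now: number = Date.now()): number {
    const timestamp = this.earliestPendingTimestamp ?? now;
    this.earliestPendingTimestamp = null;
    return timestamp;
  }
}
```

Stamping observations with the original message time keeps late or replayed processing from recording the processing time instead.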

* feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries

* Refactor pending queue processing and add management endpoints

- Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint.
- Renamed the processOrphanedQueues method to processPendingQueues, with improved session handling and detailed return results.
- Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process.
- Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically.
- Enhanced logging and error handling for better monitoring of session processing.
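
A small client sketch for the two endpoints listed above. Only the paths and HTTP methods come from this commit. The `pendingQueueClient` helper, the `HttpCall` type, and the base URL are hypothetical, and response bodies are left as `unknown` because their shape is not documented here:

```typescript
// Minimal HTTP function shape so the sketch does not depend on a specific runtime.
type HttpCall = (
  url: string,
  init: { method: "GET" | "POST" },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

function pendingQueueClient(baseUrl: string, http: HttpCall) {
  async function call(method: "GET" | "POST", path: string): Promise<unknown> {
    const response = await http(`${baseUrl}${path}`, { method });
    if (!response.ok) {
      throw new Error(`${method} ${path} failed with status ${response.status}`);
    }
    return response.json();
  }

  return {
    // Inspect the pending queue without changing it.
    inspect: () => call("GET", "/api/pending-queue"),
    // Explicitly trigger processing; recovery no longer runs on startup.
    process: () => call("POST", "/api/pending-queue/process"),
  };
}
```

Injecting the HTTP function keeps the sketch testable; in a runtime that provides a global `fetch`, passing it as `http` should fit this shape.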

* updated agent sdk

* feat: Add manual recovery guide and queue management endpoints to documentation

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Alex Newman
2025-12-25 15:36:46 -05:00
committed by GitHub
parent 4cf2c1bdb1
commit 266c746d50
47 changed files with 3417 additions and 1026 deletions
+1 -1
@@ -1,5 +1,5 @@
 export const HOOK_TIMEOUTS = {
-  DEFAULT: 5000, // Standard HTTP timeout (up from 2000ms)
+  DEFAULT: 120000, // Standard HTTP timeout (up from 2000ms)
   HEALTH_CHECK: 1000, // Worker health check (up from 500ms)
   WORKER_STARTUP_WAIT: 1000,
   WORKER_STARTUP_RETRIES: 15,
+18 -99
@@ -1,10 +1,8 @@
 import path from "path";
 import { homedir } from "os";
-import { spawnSync } from "child_process";
-import { existsSync, writeFileSync, readFileSync, mkdirSync } from "fs";
+import { readFileSync } from "fs";
 import { logger } from "../utils/logger.js";
 import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
-import { ProcessManager } from "../services/process/ProcessManager.js";
 import { SettingsDefaultsManager } from "./SettingsDefaultsManager.js";
 import { getWorkerRestartInstructions } from "../utils/error-messages.js";
@@ -96,123 +94,44 @@ async function getWorkerVersion(): Promise<string> {
 /**
  * Check if worker version matches plugin version
- * If mismatch detected, restart the worker automatically
+ * Logs a warning if mismatch is detected
  */
-async function ensureWorkerVersionMatches(): Promise<void> {
+async function checkWorkerVersion(): Promise<void> {
   const pluginVersion = getPluginVersion();
   const workerVersion = await getWorkerVersion();
   if (pluginVersion !== workerVersion) {
-    logger.info('SYSTEM', 'Worker version mismatch detected - restarting worker', {
+    logger.warn('SYSTEM', 'Worker version mismatch', {
       pluginVersion,
-      workerVersion
+      workerVersion,
+      hint: 'Restart worker with: claude-mem worker restart'
     });
-    // Give files time to sync before restart
-    await new Promise(resolve => setTimeout(resolve, getTimeout(HOOK_TIMEOUTS.PRE_RESTART_SETTLE_DELAY)));
-    // Restart the worker
-    await ProcessManager.restart(getWorkerPort());
-    // Give it a moment to start
-    await new Promise(resolve => setTimeout(resolve, 1000));
-    // Verify it's healthy
-    if (!await isWorkerHealthy()) {
-      throw new Error(`Worker failed to restart after version mismatch. Expected ${pluginVersion}, was running ${workerVersion}`);
-    }
   }
 }
-/**
- * Start the worker service using ProcessManager
- * Handles both Unix (Bun) and Windows (compiled exe) platforms
- */
-async function startWorker(): Promise<boolean> {
-  // Clean up legacy PM2 (one-time migration)
-  const dataDir = SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR');
-  const pm2MigratedMarker = path.join(dataDir, '.pm2-migrated');
-  // Ensure data directory exists (may not exist on fresh install)
-  mkdirSync(dataDir, { recursive: true });
-  if (!existsSync(pm2MigratedMarker)) {
-    spawnSync('pm2', ['delete', 'claude-mem-worker'], { stdio: 'ignore' });
-    // Mark migration as complete
-    writeFileSync(pm2MigratedMarker, new Date().toISOString(), 'utf-8');
-    logger.debug('SYSTEM', 'PM2 cleanup completed and marked');
-  }
-  const port = getWorkerPort();
-  const result = await ProcessManager.start(port);
-  if (!result.success) {
-    logger.error('SYSTEM', 'Failed to start worker', {
-      platform: process.platform,
-      port,
-      error: result.error,
-      marketplaceRoot: MARKETPLACE_ROOT
-    });
-  }
-  return result.success;
-}
 /**
  * Ensure worker service is running
- * Checks health and auto-starts if not running
- * Also ensures worker version matches plugin version
+ * Polls until worker is ready (assumes worker-cli.js start was called by hooks.json)
  */
 export async function ensureWorkerRunning(): Promise<void> {
-  // Check if already healthy (will throw on fetch errors)
-  let healthy = false;
-  try {
-    healthy = await isWorkerHealthy();
-  } catch (error) {
-    // Worker not running or unreachable - continue to start it
-    healthy = false;
-  }
+  const maxRetries = 25; // 5 seconds total
+  const pollInterval = 200;
-  if (healthy) {
-    // Worker is healthy, but check if version matches
-    await ensureWorkerVersionMatches();
-    return;
-  }
-  // Try to start the worker
-  const started = await startWorker();
-  if (!started) {
-    const port = getWorkerPort();
-    throw new Error(
-      getWorkerRestartInstructions({
-        port,
-        customPrefix: `Worker service failed to start on port ${port}.`
-      })
-    );
-  }
-  // Wait for worker to become responsive after starting
-  // Try up to 5 times with 500ms delays (2.5 seconds total)
-  for (let i = 0; i < 5; i++) {
-    await new Promise(resolve => setTimeout(resolve, 500));
+  for (let i = 0; i < maxRetries; i++) {
     try {
       if (await isWorkerHealthy()) {
-        await ensureWorkerVersionMatches();
+        await checkWorkerVersion(); // logs warning on mismatch, doesn't restart
         return;
       }
-    } catch (error) {
-      // Continue trying
+    } catch {
+      // Continue polling
     }
+    await new Promise(r => setTimeout(r, pollInterval));
   }
-  // Worker started but isn't responding
-  const port = getWorkerPort();
-  logger.error('SYSTEM', 'Worker started but not responding to health checks');
-  throw new Error(
-    getWorkerRestartInstructions({
-      port,
-      customPrefix: `Worker service started but is not responding on port ${port}.`
-    })
-  );
+  throw new Error(getWorkerRestartInstructions({
+    port: getWorkerPort(),
+    customPrefix: 'Worker did not become ready within 5 seconds.'
+  }));
 }