feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)
* Refactor worker version checks and increase timeout settings - Updated the default hook timeout from 5000ms to 120000ms for improved stability. - Modified the worker version check to log a warning instead of restarting the worker on version mismatch. - Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function. - Enhanced polling mechanism for worker readiness with increased retries and reduced interval. * feat: implement worker queue polling to ensure processing completion before proceeding * refactor: change worker command from start to restart in hooks configuration * refactor: remove session management complexity - Simplify createSDKSession to pure INSERT OR IGNORE - Remove auto-create logic from storeObservation/storeSummary - Delete 11 unused session management methods - Derive prompt_number from user_prompts count - Keep sdk_sessions table schema unchanged for compatibility * refactor: simplify session management by removing unused methods and auto-creation logic * Refactor session prompt number retrieval in SessionRoutes - Updated the method of obtaining the prompt number from the session. - Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy. - Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter. * refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager Phase 7 of session management simplification. Updates SessionManager to derive prompt numbers from user_prompts table count instead of using the deprecated prompt_counter column. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * refactor: simplify SessionCompletionHandler to use direct SQL query Phase 8: Remove call to findActiveSDKSession() and replace with direct database query in SessionCompletionHandler.completeByClaudeId(). This removes dependency on the deleted findActiveSDKSession() method and simplifies the code by using a straightforward SELECT query. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * refactor: remove markSessionCompleted call from SDKAgent - Delete call to markSessionCompleted() in SDKAgent.ts - Session status is no longer tracked or updated - Part of phase 9: simplifying session management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * refactor: remove markSessionComplete method (Phase 10) - Deleted markSessionComplete() method from DatabaseManager - Removed markSessionComplete call from SessionCompletionHandler - Session completion status no longer tracked in database - Part of session management simplification effort 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * refactor: replace deleted updateSDKSessionId calls in import script (Phase 11) - Replace updateSDKSessionId() calls with direct SQL UPDATE statements - Method was deleted in Phase 3 as part of session management simplification - Import script now uses direct database access consistently 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * test: add validation for SQL updates in sdk_sessions table * refactor: enhance worker-cli to support manual and automated runs * Remove cleanup hook and associated session completion logic - Deleted the cleanup-hook implementation from the hooks directory. - Removed the session completion endpoint that was used by the cleanup hook. - Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies. - Adjusted the SessionRoutes to reflect the removal of the session completion route. * fix: update worker-cli command to use bun for consistency * feat: Implement timestamp fix for observations and enhance processing logic - Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message. - Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing. - Modified `SessionManager` to track the earliest timestamp when yielding messages. - Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues. - Verified that all corrupted observations have been repaired and logic for future processing is sound. - Ensured orphan processing can be safely re-enabled after validation. * feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries * Refactor pending queue processing and add management endpoints - Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint. - Updated processOrphanedQueues method to processPendingQueues with improved session handling and return detailed results. - Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process. - Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically. - Enhanced logging and error handling for better monitoring of session processing. * updated agent sdk * feat: Add manual recovery guide and queue management endpoints to documentation --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
+18
-99
@@ -1,10 +1,8 @@
|
||||
import path from "path";
|
||||
import { homedir } from "os";
|
||||
import { spawnSync } from "child_process";
|
||||
import { existsSync, writeFileSync, readFileSync, mkdirSync } from "fs";
|
||||
import { readFileSync } from "fs";
|
||||
import { logger } from "../utils/logger.js";
|
||||
import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
|
||||
import { ProcessManager } from "../services/process/ProcessManager.js";
|
||||
import { SettingsDefaultsManager } from "./SettingsDefaultsManager.js";
|
||||
import { getWorkerRestartInstructions } from "../utils/error-messages.js";
|
||||
|
||||
@@ -96,123 +94,44 @@ async function getWorkerVersion(): Promise<string> {
|
||||
|
||||
/**
|
||||
* Check if worker version matches plugin version
|
||||
* If mismatch detected, restart the worker automatically
|
||||
* Logs a warning if mismatch is detected
|
||||
*/
|
||||
async function ensureWorkerVersionMatches(): Promise<void> {
|
||||
async function checkWorkerVersion(): Promise<void> {
|
||||
const pluginVersion = getPluginVersion();
|
||||
const workerVersion = await getWorkerVersion();
|
||||
|
||||
if (pluginVersion !== workerVersion) {
|
||||
logger.info('SYSTEM', 'Worker version mismatch detected - restarting worker', {
|
||||
logger.warn('SYSTEM', 'Worker version mismatch', {
|
||||
pluginVersion,
|
||||
workerVersion
|
||||
workerVersion,
|
||||
hint: 'Restart worker with: claude-mem worker restart'
|
||||
});
|
||||
|
||||
// Give files time to sync before restart
|
||||
await new Promise(resolve => setTimeout(resolve, getTimeout(HOOK_TIMEOUTS.PRE_RESTART_SETTLE_DELAY)));
|
||||
|
||||
// Restart the worker
|
||||
await ProcessManager.restart(getWorkerPort());
|
||||
|
||||
// Give it a moment to start
|
||||
await new Promise(resolve => setTimeout(resolve, 1000));
|
||||
|
||||
// Verify it's healthy
|
||||
if (!await isWorkerHealthy()) {
|
||||
throw new Error(`Worker failed to restart after version mismatch. Expected ${pluginVersion}, was running ${workerVersion}`);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Start the worker service using ProcessManager
|
||||
* Handles both Unix (Bun) and Windows (compiled exe) platforms
|
||||
*/
|
||||
async function startWorker(): Promise<boolean> {
|
||||
// Clean up legacy PM2 (one-time migration)
|
||||
const dataDir = SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR');
|
||||
const pm2MigratedMarker = path.join(dataDir, '.pm2-migrated');
|
||||
|
||||
// Ensure data directory exists (may not exist on fresh install)
|
||||
mkdirSync(dataDir, { recursive: true });
|
||||
|
||||
if (!existsSync(pm2MigratedMarker)) {
|
||||
spawnSync('pm2', ['delete', 'claude-mem-worker'], { stdio: 'ignore' });
|
||||
// Mark migration as complete
|
||||
writeFileSync(pm2MigratedMarker, new Date().toISOString(), 'utf-8');
|
||||
logger.debug('SYSTEM', 'PM2 cleanup completed and marked');
|
||||
}
|
||||
|
||||
const port = getWorkerPort();
|
||||
const result = await ProcessManager.start(port);
|
||||
|
||||
if (!result.success) {
|
||||
logger.error('SYSTEM', 'Failed to start worker', {
|
||||
platform: process.platform,
|
||||
port,
|
||||
error: result.error,
|
||||
marketplaceRoot: MARKETPLACE_ROOT
|
||||
});
|
||||
}
|
||||
|
||||
return result.success;
|
||||
}
|
||||
|
||||
/**
|
||||
* Ensure worker service is running
|
||||
* Checks health and auto-starts if not running
|
||||
* Also ensures worker version matches plugin version
|
||||
* Polls until worker is ready (assumes worker-cli.js start was called by hooks.json)
|
||||
*/
|
||||
export async function ensureWorkerRunning(): Promise<void> {
|
||||
// Check if already healthy (will throw on fetch errors)
|
||||
let healthy = false;
|
||||
try {
|
||||
healthy = await isWorkerHealthy();
|
||||
} catch (error) {
|
||||
// Worker not running or unreachable - continue to start it
|
||||
healthy = false;
|
||||
}
|
||||
const maxRetries = 25; // 5 seconds total
|
||||
const pollInterval = 200;
|
||||
|
||||
if (healthy) {
|
||||
// Worker is healthy, but check if version matches
|
||||
await ensureWorkerVersionMatches();
|
||||
return;
|
||||
}
|
||||
|
||||
// Try to start the worker
|
||||
const started = await startWorker();
|
||||
|
||||
if (!started) {
|
||||
const port = getWorkerPort();
|
||||
throw new Error(
|
||||
getWorkerRestartInstructions({
|
||||
port,
|
||||
customPrefix: `Worker service failed to start on port ${port}.`
|
||||
})
|
||||
);
|
||||
}
|
||||
|
||||
// Wait for worker to become responsive after starting
|
||||
// Try up to 5 times with 500ms delays (2.5 seconds total)
|
||||
for (let i = 0; i < 5; i++) {
|
||||
await new Promise(resolve => setTimeout(resolve, 500));
|
||||
for (let i = 0; i < maxRetries; i++) {
|
||||
try {
|
||||
if (await isWorkerHealthy()) {
|
||||
await ensureWorkerVersionMatches();
|
||||
await checkWorkerVersion(); // logs warning on mismatch, doesn't restart
|
||||
return;
|
||||
}
|
||||
} catch (error) {
|
||||
// Continue trying
|
||||
} catch {
|
||||
// Continue polling
|
||||
}
|
||||
await new Promise(r => setTimeout(r, pollInterval));
|
||||
}
|
||||
|
||||
// Worker started but isn't responding
|
||||
const port = getWorkerPort();
|
||||
logger.error('SYSTEM', 'Worker started but not responding to health checks');
|
||||
throw new Error(
|
||||
getWorkerRestartInstructions({
|
||||
port,
|
||||
customPrefix: `Worker service started but is not responding on port ${port}.`
|
||||
})
|
||||
);
|
||||
throw new Error(getWorkerRestartInstructions({
|
||||
port: getWorkerPort(),
|
||||
customPrefix: 'Worker did not become ready within 5 seconds.'
|
||||
}));
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user