Refactor silent debugging to happy path error handling
- Replaced instances of silentDebug with happy_path_error__with_fallback across multiple files to improve error logging and handling. - Updated the utility function to provide clearer semantics for error handling when expected values are missing. - Introduced a script to find potential silent failures in the codebase that may need to be addressed with the new error handling approach.
This commit is contained in:
@@ -0,0 +1,561 @@
|
||||
# Claude-Mem Smart Install & Plugin Hooks - Comprehensive Analysis
|
||||
|
||||
**Generated:** 2025-12-09
|
||||
**Scope:** Smart install system, all plugin hooks, cross-platform compatibility, error handling, edge cases
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
This report provides a comprehensive analysis of claude-mem's smart install system and plugin hook infrastructure. The analysis focuses on cross-platform compatibility, error handling patterns, artificial blockers, and edge case handling.
|
||||
|
||||
**Key Findings:**
|
||||
- ✅ Overall architecture is well-designed with clear separation of concerns
|
||||
- ⚠️ Multiple cross-platform compatibility issues identified
|
||||
- ⚠️ Several silent failure patterns that hinder debugging
|
||||
- ⚠️ Artificial blockers that could prevent legitimate use cases
|
||||
- ⚠️ Inconsistent timeout values across different components
|
||||
- ✅ No nested try-catch anti-patterns found
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
### Smart Install System Flow
|
||||
|
||||
```
|
||||
User Invokes Hook
|
||||
↓
|
||||
ensureWorkerRunning() [worker-utils.ts]
|
||||
↓
|
||||
isWorkerHealthy() → fetch /health endpoint
|
||||
↓
|
||||
├─ [HEALTHY] → Continue
|
||||
└─ [UNHEALTHY] → startWorker()
|
||||
↓
|
||||
├─ [Windows] → PowerShell Start-Process (hidden window)
|
||||
└─ [Unix] → PM2 start ecosystem.config.cjs
|
||||
↓
|
||||
Wait for health check (15 retries × 1000ms)
|
||||
↓
|
||||
├─ [SUCCESS] → Continue
|
||||
└─ [FAILURE] → Throw error with manual recovery instructions
|
||||
```
|
||||
|
||||
### Plugin Hook Lifecycle
|
||||
|
||||
1. **SessionStart** (context-hook.ts + user-message-hook.ts)
|
||||
- context-hook: Fetches context via HTTP/curl
|
||||
- user-message-hook: Displays context to user via stderr
|
||||
|
||||
2. **UserPromptSubmit** (new-hook.ts)
|
||||
- Creates/retrieves SDK session
|
||||
- Strips privacy tags from prompt
|
||||
- Initializes session via HTTP
|
||||
|
||||
3. **PostToolUse** (save-hook.ts)
|
||||
- Filters skipped tools
|
||||
- Sends observation to worker via HTTP
|
||||
|
||||
4. **Stop** (summary-hook.ts)
|
||||
- Parses transcript JSONL
|
||||
- Extracts last user/assistant messages
|
||||
- Requests summary generation via HTTP
|
||||
|
||||
5. **SessionEnd** (cleanup-hook.ts)
|
||||
- Marks session complete
|
||||
- Fire-and-forget HTTP request
|
||||
|
||||
---
|
||||
|
||||
## Cross-Platform Compatibility Issues
|
||||
|
||||
### 🔴 CRITICAL: curl Dependency (context-hook.ts)
|
||||
|
||||
**Location:** `src/hooks/context-hook.ts:32`
|
||||
|
||||
```typescript
|
||||
const result = execSync(`curl -s "${url}"`, { encoding: "utf-8", timeout: 5000 });
|
||||
```
|
||||
|
||||
**Issues:**
|
||||
1. **Windows Compatibility:** curl is not guaranteed to be available on Windows systems (though included in Windows 10 1803+, it may be missing on older systems or custom installations)
|
||||
2. **Error Handling:** No try-catch around execSync - will throw unhandled exception if curl fails
|
||||
3. **Redundancy:** Uses curl when JavaScript's native `fetch` is already used everywhere else in the codebase
|
||||
|
||||
**Impact:** High - SessionStart hook will crash if curl is unavailable or returns non-zero exit code
|
||||
|
||||
**Edge Cases:**
|
||||
- Corporate proxies blocking curl
|
||||
- Systems without curl in PATH
|
||||
- curl returning non-zero exit with valid output (warnings, etc.)
|
||||
|
||||
**Recommendation:**
|
||||
```typescript
|
||||
// Replace curl with fetch (already used in user-message-hook.ts)
|
||||
const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
|
||||
const result = await response.text();
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🟡 MEDIUM: Platform-Specific Process Spawning (worker-utils.ts)
|
||||
|
||||
**Location:** `src/shared/worker-utils.ts:55-93`
|
||||
|
||||
**Windows Implementation:**
|
||||
```typescript
|
||||
spawnSync('powershell.exe', [
|
||||
'-NoProfile',
|
||||
'-NonInteractive',
|
||||
'-Command',
|
||||
`Start-Process -FilePath 'node' -ArgumentList '${workerScript}' -WorkingDirectory '${MARKETPLACE_ROOT}' -WindowStyle Hidden`
|
||||
])
|
||||
```
|
||||
|
||||
**Issues:**
|
||||
1. **PowerShell Dependency:** Assumes PowerShell is available and in PATH
|
||||
2. **Command Injection Risk:** Worker script path inserted directly into command string without escaping
|
||||
3. **Process Monitoring:** Windows approach launches detached process with no PM2 monitoring - harder to debug/restart
|
||||
4. **Health Check Timeout:** Comment says "Windows needs longer timeouts" but timeout is same for all platforms (500ms)
|
||||
|
||||
**Edge Cases:**
|
||||
- Windows systems with PowerShell execution policy restrictions
|
||||
- Paths containing single quotes or special characters
|
||||
- Windows subsystem for Linux (WSL) environments
|
||||
- Wine/Proton compatibility layers
|
||||
|
||||
**Unix Implementation:**
|
||||
```typescript
|
||||
const localPm2Base = path.join(MARKETPLACE_ROOT, 'node_modules', '.bin', 'pm2');
|
||||
const pm2Command = existsSync(localPm2Base) ? localPm2Base : 'pm2';
|
||||
```
|
||||
|
||||
**Issues:**
|
||||
1. **PM2 Dependency:** Falls back to global pm2 if local not found, but doesn't verify it exists
|
||||
2. **Silent Failure:** If PM2 not installed globally, spawnSync will fail with cryptic ENOENT error
|
||||
|
||||
**Recommendation:**
|
||||
- Add pm2 existence check before spawn
|
||||
- Implement consistent process monitoring across platforms
|
||||
- Add path escaping for Windows command construction
|
||||
- Actually implement longer timeout for Windows if needed
|
||||
|
||||
---
|
||||
|
||||
### 🟡 MEDIUM: Git Dependency (paths.ts)
|
||||
|
||||
**Location:** `src/shared/paths.ts:89-97`
|
||||
|
||||
```typescript
|
||||
export function getCurrentProjectName(): string {
|
||||
try {
|
||||
const gitRoot = execSync('git rev-parse --show-toplevel', {
|
||||
cwd: process.cwd(),
|
||||
encoding: 'utf8',
|
||||
stdio: ['pipe', 'pipe', 'ignore']
|
||||
}).trim();
|
||||
return basename(gitRoot);
|
||||
} catch {
|
||||
return basename(process.cwd());
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Issues:**
|
||||
1. **Git Assumption:** Assumes git is installed and available in PATH
|
||||
2. **Non-Git Projects:** Silently falls back to cwd basename, but this behavior is undocumented
|
||||
|
||||
**Edge Cases:**
|
||||
- Projects not using git
|
||||
- Monorepos where cwd !== git root is desired
|
||||
- Systems without git installed
|
||||
|
||||
**Status:** ✅ Already handled with fallback, but could benefit from debug logging
|
||||
|
||||
---
|
||||
|
||||
## Error Handling Analysis
|
||||
|
||||
### 🔴 CRITICAL: Silent Failures Without Logging
|
||||
|
||||
#### 1. Settings File Loading (early-settings.ts:20-28)
|
||||
|
||||
```typescript
|
||||
try {
|
||||
if (existsSync(SETTINGS_PATH)) {
|
||||
const data = JSON.parse(readFileSync(SETTINGS_PATH, 'utf-8'));
|
||||
const fileValue = data.env?.[key];
|
||||
if (fileValue !== undefined) return fileValue;
|
||||
}
|
||||
} catch {
|
||||
// Fail silently - fall through to env var
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Invalid JSON in settings file fails silently
|
||||
- File read permission errors fail silently
|
||||
- Users have no way to know their settings file is being ignored
|
||||
|
||||
**Impact:** High - Users may think settings are applied when they're actually using defaults
|
||||
|
||||
**Recommendation:**
|
||||
```typescript
|
||||
} catch (error) {
|
||||
logger.warn('SETTINGS', 'Failed to load settings file', { path: SETTINGS_PATH }, error);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### 2. Worker Startup Failure (worker-utils.ts:104-107)
|
||||
|
||||
```typescript
|
||||
try {
|
||||
// ... worker startup logic ...
|
||||
} catch (error) {
|
||||
// Failed to start worker
|
||||
return false;
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Catches ALL errors during worker startup
|
||||
- Returns boolean with no information about what failed
|
||||
- User only gets generic error after all retries exhausted
|
||||
|
||||
**Impact:** High - Makes debugging worker startup issues extremely difficult
|
||||
|
||||
**Recommendation:**
|
||||
```typescript
|
||||
} catch (error) {
|
||||
logger.error('WORKER', 'Failed to start worker', {}, error as Error);
|
||||
return false;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### 3. Worker Health Check (worker-utils.ts:30-40)
|
||||
|
||||
```typescript
|
||||
async function isWorkerHealthy(): Promise<boolean> {
|
||||
try {
|
||||
const port = getWorkerPort();
|
||||
const response = await fetch(`http://127.0.0.1:${port}/health`, {
|
||||
signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS)
|
||||
});
|
||||
return response.ok;
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Network errors, timeouts, and non-200 responses all indistinguishable
|
||||
- No logging at all - completely silent
|
||||
|
||||
**Impact:** Medium - Hard to debug why health checks fail
|
||||
|
||||
**Recommendation:**
|
||||
```typescript
|
||||
} catch (error) {
|
||||
logger.debug('WORKER', 'Health check failed', { port }, error);
|
||||
return false;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### 4. Tool Formatting (logger.ts:122-124)
|
||||
|
||||
```typescript
|
||||
try {
|
||||
const input = typeof toolInput === 'string' ? JSON.parse(toolInput) : toolInput;
|
||||
// ...
|
||||
} catch {
|
||||
return toolName;
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Invalid JSON in tool input fails silently
|
||||
- Could mask data corruption issues
|
||||
|
||||
**Impact:** Low - Only affects log formatting
|
||||
|
||||
**Status:** ✅ Acceptable for log formatting, but could log at DEBUG level
|
||||
|
||||
---
|
||||
|
||||
### 🟢 GOOD: No Nested Try-Catch Anti-Patterns
|
||||
|
||||
Analysis confirmed zero instances of nested try-catch blocks. Error handling is consistently at single level per function.
|
||||
|
||||
---
|
||||
|
||||
## Artificial Blockers & Unnecessary Checks
|
||||
|
||||
### 🔴 CRITICAL: First-Run Detection (user-message-hook.ts:14-40)
|
||||
|
||||
```typescript
|
||||
const nodeModulesPath = join(pluginDir, 'node_modules');
|
||||
|
||||
if (!existsSync(nodeModulesPath)) {
|
||||
// Show first-time setup message
|
||||
console.error(`...`);
|
||||
process.exit(3);
|
||||
}
|
||||
```
|
||||
|
||||
**Problems:**
|
||||
1. **False Positive:** Will trigger if user manually deletes node_modules (e.g., for troubleshooting)
|
||||
2. **Installation Race:** Could fail if installation is still in progress
|
||||
3. **Hook-Level Check:** Runs on EVERY SessionStart, not just actual first run
|
||||
|
||||
**Impact:** High - Prevents usage until node_modules exists, even if dependencies are installed elsewhere
|
||||
|
||||
**Edge Cases:**
|
||||
- User runs `rm -rf node_modules` for troubleshooting
|
||||
- Package manager installation interrupted
|
||||
- Symlinked node_modules (some package managers)
|
||||
|
||||
**Recommendation:**
|
||||
- Use a `.first-run-complete` marker file instead
|
||||
- Move check to npm postinstall script
|
||||
- Make check more robust (check for specific required modules)
|
||||
|
||||
---
|
||||
|
||||
### 🟡 MEDIUM: Overly Specific Validation (paths.ts:117-119)
|
||||
|
||||
```typescript
|
||||
if (!existsSync(join(commandsDir, 'save.md'))) {
|
||||
throw new Error('Package commands directory missing required files');
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Checks for ONE specific file to validate entire directory
|
||||
- Hardcoded filename could break if files reorganized
|
||||
- Error message doesn't specify what's missing
|
||||
|
||||
**Impact:** Medium - Could prevent package from working after internal refactoring
|
||||
|
||||
**Recommendation:**
|
||||
- Remove check entirely (let actual command invocation fail with better error)
|
||||
- Or check all required files if validation is critical
|
||||
|
||||
---
|
||||
|
||||
### 🟡 MEDIUM: Duplicate Health Endpoints
|
||||
|
||||
**Locations:**
|
||||
- `src/services/worker-service.ts:107` - `/api/health`
|
||||
- `src/services/worker/http/routes/ViewerRoutes.ts:27` - `/health`
|
||||
|
||||
**Usage:**
|
||||
- `worker-utils.ts` uses `/health`
|
||||
- `mcp-server.ts` uses `/api/health`
|
||||
|
||||
**Problem:**
|
||||
- Redundant endpoints doing the same thing
|
||||
- Inconsistent usage across codebase
|
||||
- Maintenance burden
|
||||
|
||||
**Impact:** Low - Both work, but creates confusion
|
||||
|
||||
**Recommendation:**
|
||||
- Standardize on `/api/health` (follows REST convention)
|
||||
- Remove `/health` endpoint
|
||||
- Update worker-utils.ts to use `/api/health`
|
||||
|
||||
---
|
||||
|
||||
## Timeout Configuration Issues
|
||||
|
||||
### Inconsistent Timeouts Across Components
|
||||
|
||||
| Component | Timeout | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| Health check | 500ms | worker-utils.ts:13 | Check if worker alive |
|
||||
| Worker startup wait | 1000ms | worker-utils.ts:14 | Wait between health checks |
|
||||
| Worker startup retries | 15x | worker-utils.ts:15 | Max retries (15s total) |
|
||||
| Hook HTTP requests | 2000ms | cleanup-hook.ts:61, save-hook.ts:70, summary-hook.ts:164 | Send data to worker |
|
||||
| New hook session init | 5000ms | new-hook.ts:129 | Initialize session |
|
||||
| Context hook fetch | 5000ms | context-hook.ts:32 | Fetch context via curl |
|
||||
| User message hook | 5000ms | user-message-hook.ts:52 | Fetch context display |
|
||||
|
||||
**Problems:**
|
||||
1. **Health Check Too Aggressive:** 500ms may be too short for loaded systems or slow network
|
||||
2. **No Platform Adjustment:** Comment says "Windows needs longer timeouts" but values are same
|
||||
3. **Hook Timeout Variation:** Some hooks use 2s, others use 5s with no clear reasoning
|
||||
|
||||
**Recommendations:**
|
||||
- Increase health check timeout to 1000ms minimum
|
||||
- Actually implement longer timeouts for Windows
|
||||
- Standardize hook timeouts to 5000ms across the board
|
||||
- Make timeouts configurable via settings
|
||||
|
||||
---
|
||||
|
||||
## Edge Case Analysis
|
||||
|
||||
### Handled Well ✅
|
||||
|
||||
1. **JSONL Parsing:** summary-hook.ts continues on malformed lines (60-64, 117-121)
|
||||
2. **Git Not Available:** paths.ts falls back to cwd basename (89-97)
|
||||
3. **Settings File Missing:** early-settings.ts falls back to env vars and defaults (20-28)
|
||||
4. **Privacy Tags:** new-hook.ts handles fully-private prompts (99-109)
|
||||
5. **Tool Skipping:** save-hook.ts filters low-value tools (24-30)
|
||||
|
||||
### Missing Edge Case Handling ⚠️
|
||||
|
||||
1. **curl Failure:** context-hook.ts has no error handling for curl failures
|
||||
2. **PM2 Not Installed:** worker-utils.ts assumes pm2 exists globally
|
||||
3. **PowerShell Restrictions:** worker-utils.ts doesn't check execution policy
|
||||
4. **Concurrent Worker Starts:** No locking to prevent multiple hooks from starting worker simultaneously
|
||||
5. **Port Already In Use:** No detection or recovery if worker port is taken
|
||||
6. **Zombie Processes:** Windows approach doesn't track PIDs, can't detect/kill zombies
|
||||
|
||||
---
|
||||
|
||||
## Recommendations Summary
|
||||
|
||||
### High Priority 🔴
|
||||
|
||||
1. **Replace curl with fetch** in context-hook.ts
|
||||
- Eliminates external dependency
|
||||
- Consistent with rest of codebase
|
||||
- Better error handling
|
||||
|
||||
2. **Add logging to silent failures**
|
||||
- early-settings.ts: Log when settings file fails to load
|
||||
- worker-utils.ts: Log startup failures with details
|
||||
- worker-utils.ts: Log health check failures at debug level
|
||||
|
||||
3. **Fix first-run detection**
|
||||
- Use marker file instead of node_modules check
|
||||
- More reliable and intentional
|
||||
|
||||
### Medium Priority 🟡
|
||||
|
||||
4. **Verify PM2 availability** before attempting to use it
|
||||
- Check existence before spawn
|
||||
- Provide clear error message if missing
|
||||
|
||||
5. **Implement platform-specific timeouts**
|
||||
- Actually use longer timeouts on Windows as comment suggests
|
||||
- Make timeouts configurable
|
||||
|
||||
6. **Standardize health endpoints**
|
||||
- Remove duplicate `/health` endpoint
|
||||
- Use `/api/health` everywhere
|
||||
|
||||
7. **Add path escaping** for Windows PowerShell commands
|
||||
- Prevent injection issues
|
||||
- Handle paths with special characters
|
||||
|
||||
### Low Priority 🟢
|
||||
|
||||
8. **Standardize HTTP timeouts** across all hooks
|
||||
9. **Add concurrent startup protection** (locking mechanism)
|
||||
10. **Improve error messages** with actionable recovery steps
|
||||
|
||||
---
|
||||
|
||||
## Testing Recommendations
|
||||
|
||||
### Cross-Platform Testing Needed
|
||||
|
||||
1. **Windows Environments:**
|
||||
- Windows 10 (various versions)
|
||||
- Windows 11
|
||||
- Windows Server
|
||||
- WSL/WSL2
|
||||
- PowerShell execution policies (Restricted, RemoteSigned, Unrestricted)
|
||||
|
||||
2. **Unix Environments:**
|
||||
- macOS (Intel + Apple Silicon)
|
||||
- Linux (Ubuntu, Fedora, Arch)
|
||||
- FreeBSD
|
||||
|
||||
3. **Edge Environments:**
|
||||
- Docker containers
|
||||
- CI/CD environments
|
||||
- Systems without git installed
|
||||
- Systems without curl (or with restricted curl)
|
||||
- Corporate networks with proxies
|
||||
- Low-spec systems (slow startup)
|
||||
|
||||
### Test Scenarios
|
||||
|
||||
1. **Cold Start:** First run with no existing data
|
||||
2. **Corrupt Settings:** Invalid JSON in settings.json
|
||||
3. **Missing Dependencies:** No PM2, no git, no curl
|
||||
4. **Port Conflicts:** Worker port already in use
|
||||
5. **Rapid Hook Invocations:** Multiple hooks trying to start worker simultaneously
|
||||
6. **Permission Issues:** Read-only filesystem, restricted execution
|
||||
7. **Network Issues:** Localhost blocked, slow network
|
||||
|
||||
---
|
||||
|
||||
## Code Quality Assessment
|
||||
|
||||
### Strengths ✅
|
||||
|
||||
- Clean separation of concerns (hooks → worker → database)
|
||||
- No nested try-catch anti-patterns
|
||||
- Consistent use of modern async/await
|
||||
- Good use of TypeScript for type safety
|
||||
- Idempotent database operations
|
||||
- Clear documentation in critical sections
|
||||
|
||||
### Weaknesses ⚠️
|
||||
|
||||
- Silent failures hinder debugging
|
||||
- Inconsistent error handling patterns
|
||||
- Platform-specific code not fully tested/documented
|
||||
- Timeout configuration hardcoded and inconsistent
|
||||
- Some artificial blockers prevent legitimate use cases
|
||||
|
||||
### Technical Debt
|
||||
|
||||
- Duplicate health endpoints
|
||||
- curl dependency when fetch available
|
||||
- PM2 dependency on Unix but not Windows (inconsistent monitoring)
|
||||
- First-run detection using node_modules existence
|
||||
- Hardcoded timeout values
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The claude-mem smart install and plugin hook system is architecturally sound with a well-designed separation of concerns. However, several cross-platform compatibility issues and silent failure patterns could cause problems in production, particularly on Windows systems or in edge case scenarios.
|
||||
|
||||
The highest priority improvements are:
|
||||
1. Removing the curl dependency
|
||||
2. Adding proper logging to silent failures
|
||||
3. Fixing the fragile first-run detection
|
||||
4. Verifying external dependencies before use
|
||||
|
||||
These changes would significantly improve debuggability and cross-platform reliability without requiring major architectural changes.
|
||||
|
||||
---
|
||||
|
||||
**Analysis Methodology:**
|
||||
- Systematic review of all TypeScript source files
|
||||
- Static analysis of error handling patterns
|
||||
- Cross-platform compatibility assessment
|
||||
- Edge case identification through code path analysis
|
||||
- Comparison against best practices and KISS principles
|
||||
|
||||
**Files Analyzed:**
|
||||
- src/hooks/*.ts (6 files)
|
||||
- src/services/worker-service.ts
|
||||
- src/services/worker/*.ts (10+ files)
|
||||
- src/servers/mcp-server.ts
|
||||
- src/shared/*.ts (worker-utils, early-settings, paths)
|
||||
- src/utils/*.ts (logger, silent-debug, tag-stripping)
|
||||
Reference in New Issue
Block a user