* feat: add mem-search skill with progressive disclosure architecture Add comprehensive mem-search skill for accessing claude-mem's persistent cross-session memory database. Implements progressive disclosure workflow and token-efficient search patterns. Features: - 12 search operations (observations, sessions, prompts, by-type, by-concept, by-file, timelines, etc.) - Progressive disclosure principles to minimize token usage - Anti-patterns documentation to guide LLM behavior - HTTP API integration for all search functionality - Common workflows with composition examples Structure: - SKILL.md: Entry point with temporal trigger patterns - principles/: Progressive disclosure + anti-patterns - operations/: 12 search operation files 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * docs: add CHANGELOG entry for mem-search skill Document mem-search skill addition in Unreleased section with: - 100% effectiveness compliance metrics - Comparison to previous search skill implementation - Progressive disclosure architecture details - Reference to audit report documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * docs: add mem-search skill audit report Add comprehensive audit report validating mem-search skill against Anthropic's official skill-creator documentation. Report includes: - Effectiveness metrics comparison (search vs mem-search) - Critical issues analysis for production readiness - Compliance validation across 6 key dimensions - Reference implementation guidance Result: mem-search achieves 100% compliance vs search's 67% 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * feat: Add comprehensive search architecture analysis document - Document current state of dual search architectures (HTTP API and MCP) - Analyze HTTP endpoints and MCP search server architectures - Identify DRY violations across search implementations - Evaluate the use of curl as the optimal approach for search - Provide architectural recommendations for immediate and long-term improvements - Outline action plan for cleanup, feature parity, DRY refactoring * refactor: Remove deprecated search skill documentation and operations * refactor: Reorganize documentation into public and context directories Changes: - Created docs/public/ for Mintlify documentation (.mdx files) - Created docs/context/ for internal planning and implementation docs - Moved all .mdx files and assets to docs/public/ - Moved all internal .md files to docs/context/ - Added CLAUDE.md to both directories explaining their purpose - Updated docs.json paths to work with new structure Benefits: - Clear separation between user-facing and internal documentation - Easier to maintain Mintlify docs in dedicated directory - Internal context files organized separately 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Enhance session management and continuity in hooks - Updated new-hook.ts to clarify session_id threading and idempotent session creation. - Modified prompts.ts to require claudeSessionId for continuation prompts, ensuring session context is maintained. - Improved SessionStore.ts documentation on createSDKSession to emphasize idempotent behavior and session connection. - Refined SDKAgent.ts to detail continuation prompt logic and its reliance on session.claudeSessionId for unified session handling. --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Alex Newman <thedotmack@gmail.com>
9.8 KiB
Response to PR Review #47
Executive Summary
Thank you for the thorough review. Most of the "issues" identified are actually intentional architectural decisions made to solve production failures. The comprehensive analysis docs (JUST-FUCKING-RUN-IT.md, LINE-BY-LINE-CASCADING-BULLSHIT.md) document why these changes were necessary.
However, you've identified 2 legitimate issues that need fixing:
- ✅ Race condition in worker startup - Valid concern, needs fixing
- ✅ Watch mode in production - Appears to be unintentional leftover from development
The other concerns are working as intended based on documented architectural decisions.
Detailed Response to Each Concern
⚠️ Issue #1: Race Condition in Worker Health Check - VALID CONCERN
Review Comment: "The spawn() call inside the close event handler is non-blocking, but the function returns immediately. Hooks may attempt HTTP requests before worker has started."
Our Response: You're absolutely right. This is a legitimate race condition we need to fix.
However, the suggested fixes (async/await health check, retry loops) are exactly what we intentionally removed because they were causing production failures (see Observation #3602, #3600).
Proposed Solution:
The hooks already have proper error handling for ECONNREFUSED with actionable user messages:
if (error.cause?.code === 'ECONNREFUSED' || error.name === 'TimeoutError' || error.message.includes('fetch failed')) {
throw new Error("There's a problem with the worker. If you just updated, type `pm2 restart claude-mem-worker` in your terminal to continue");
}
We should either:
- Document this as expected behavior (fire-and-forget spawn)
- Add a single synchronous
pm2 listcheck after spawn to verify startup - Keep the current approach and rely on hook error messages
We will NOT re-add: Retry loops, health check polling, or arbitrary delays. Those caused the 100% failure rate we just fixed.
⚠️ Issue #2: Removed Health Endpoint Information - INTENTIONAL
Review Comment: "This removes useful debugging information. When troubleshooting production issues, knowing the PID, active sessions count, and port would be valuable."
Our Documentation:
- Observation #3616: "Simplified Health Check Endpoint to Minimal Response"
- Observation #3601: "Minimum Parameters = Minimum Bugs"
- Observation #3600: "Comprehensive Analysis of Cascading Architectural Problems"
Why We Did This:
- HTTP 200 = Alive: If the endpoint responds, the worker is healthy. Period.
- Diagnostic fields provided no actionable value: PID, activeSessions, chromaSynced didn't help debug the actual production failures
- Part of 87% code reduction: worker-utils.ts went from 113 lines → 15 lines
- Health checks were hiding real problems: Retry logic masked that startup sequence was broken
Original Problem:
- Worker startup: 4-5 seconds (actual)
- Health check timeout: 3 seconds (configured)
- Result: 100% user failure rate
The detailed health response didn't help diagnose this - fixing the startup sequence (HTTP server first) did.
Response: Will not change. The health endpoint serves one purpose: availability signal. Use PM2 commands for diagnostics:
pm2 list- See PID, status, memorypm2 logs claude-mem-worker- See application logsnpm run worker:logs- Convenience wrapper
⚠️ Issue #3: Auto-Session Creation Without Validation - NEEDS FIXING
Review Comment: "Uses non-null assertion (dbSession!) without checking if dbSession is actually null. If getSessionById() returns null, this will throw at runtime."
Our Response: You're absolutely right. This is a legitimate bug.
Action Required: Add null checks to handleObservation and handleSummarize like already exist in handleInit:
const dbSession = db.getSessionById(sessionDbId);
if (!dbSession) {
db.close();
res.status(404).json({ error: 'Session not found in database' });
return;
}
This needs to be fixed before merge.
⚠️ Issue #4: Removed Observation Counter - INTENTIONAL
Review Comment: "Was this used for generating correlation IDs for logging? If so, is there now no way to correlate observations within a session for debugging?"
Our Documentation:
- Observation #3621-3627: Complete removal of observation counter and correlation IDs
- Observation #3602: "Architectural Decision: Remove Health Checks and Arbitrary Delays"
- Observation #3612: "Worker Service Simplification Strategy"
Why We Removed It:
- Over-engineering: Provided per-observation tracking when session-level identification was sufficient
- Part of cascading complexity: Correlation IDs were monitoring infrastructure for complexity that shouldn't exist
- Session-level debugging is sufficient: Most issues diagnosed by knowing which session, not which observation #5 within that session
- Database IDs provide uniqueness: Once stored, observations have DB IDs for precise identification
The Problem It Was Solving (That No Longer Needs Solving):
- Tracking individual observations through worker pipeline
- Monitoring Chroma sync success/failure per observation
- Detailed per-observation timing metrics
Why That's Unnecessary:
- Session-level logging is sufficient for debugging
- Database IDs provide uniqueness after storage
- The monitoring was masking real problems (startup sequence)
Response: Will not change. This was part of the simplification strategy that fixed production failures.
⚠️ Issue #5: PM2 Watch Mode in Production - VALID CONCERN
Review Comment: "Watch mode causes PM2 to restart the process whenever files change. This is useful during development but potentially problematic in production."
Our Investigation:
- Observation #3631: Documents what watch mode does, but no observation documents WHY we enabled it
- Observation #3611: PM2 config was "drastically simplified" by removing 21 unnecessary parameters
- Watch mode was kept during this aggressive simplification
Conclusion: This appears to be unintentional - likely enabled for development and inadvertently left enabled.
Action Required: Either:
- Disable watch mode (recommended) - Users aren't developing, they're using the plugin
- Document it as intentional if there's a reason we want auto-restart on file changes
This should be addressed before merge - likely by disabling watch mode.
⚠️ Issue #6: Duplicate Port Constant - ACKNOWLEDGED
Review Comment: "FIXED_PORT constant is defined in 5 places. Creates maintenance burden."
Our Response: Fair point. This is technical debt we can clean up.
However, it's low priority because:
- Port is unlikely to change
- All values are currently consistent
- Not causing production issues
Action: Add to backlog for post-merge cleanup. Export from worker-utils.ts and import elsewhere.
Summary of Actions
Must Fix Before Merge:
- ✅ Add null checks to auto-session creation in handleObservation and handleSummarize
- ✅ Decide on watch mode - Disable unless there's documented reason to keep it
Will Not Change (Intentional Decisions):
- ❌ Health endpoint simplification - Part of solving 100% failure rate
- ❌ Removed observation counter - Part of simplification strategy
- ❌ Removed health check system - Was causing production failures
- ❌ Fire-and-forget worker spawn - Hooks have proper error handling
Race Condition Discussion Needed:
- 🤔 Worker startup race condition - Valid concern, but retry loops caused the original failures. Options:
- Keep current approach (hooks handle ECONNREFUSED gracefully)
- Add single synchronous
pm2 listcheck after spawn - Document as expected behavior
Nice to Have (Post-Merge):
- 📋 Consolidate FIXED_PORT constant - Technical debt cleanup
Key Documentation References
The architectural decisions are comprehensively documented in:
-
JUST-FUCKING-RUN-IT.md (Observation #3602)
- Architectural decision to remove health checks
- Philosophy: Trust PM2, let HTTP timeouts be the health check
-
LINE-BY-LINE-CASCADING-BULLSHIT.md (Observation #3600)
- Root cause analysis of how health checks caused 100% failure rate
- Documents cascade from arbitrary 3000ms timeout → retry loops → race conditions
-
MINIMUM-PARAMETERS.md (Observation #3601)
- Quantified impact: 21 unnecessary PM2 parameters, ~160 lines deleted
- Philosophy: "Minimum parameters = minimum bugs"
-
STUPID-SHIT-THAT-BROKE-PRODUCTION.md (Observation #3597)
- 8 critical issues causing 100% user failure rate
- Includes worker crashing on Chroma failures despite data already in SQLite
These documents explain why the simplifications were necessary - they weren't arbitrary removal of useful features, they were targeted fixes for production failures.
Production Context
Before This PR:
- 100% user failure rate after v4.x release
- Worker startup took 4-5 seconds but health checks timed out at 3 seconds
stdio: 'ignore'eliminated all debugging visibility- Worker crashed on Chroma failures despite data safely in SQLite
- ChromaSync initialized in constructor, blocking HTTP server
- 113 lines of health check code with retry loops masking real problems
After This PR:
- HTTP server starts immediately
- Worker stays alive through Chroma failures (graceful degradation)
- Errors are visible (
stdio: 'inherit') - Worker-utils.ts: 113 lines → 15 lines (87% reduction)
- Hooks have proper error handling with actionable user messages
- System works with just SQLite FTS5, Chroma is optional enhancement
The "removed observability" was actually removed complexity that was hiding problems, not helping diagnose them.