basher83 97d565e3cd Replace search skill with mem-search (#91)

2025-11-11 16:15:07 -05:00

Response to PR Review #47

Executive Summary

Thank you for the thorough review. Most of the "issues" identified are actually intentional architectural decisions made to solve production failures. The comprehensive analysis docs (JUST-FUCKING-RUN-IT.md, LINE-BY-LINE-CASCADING-BULLSHIT.md) document why these changes were necessary.

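However, you've identified 3 legitimate issues that need attention:

✅ Race condition in worker startup - Valid concern, needs a decision on the fix
✅ Missing null checks in auto-session creation - Legitimate bug, must be fixed before merge
✅ Watch mode in production - Appears to be unintentional leftover from development

The remaining concerns are either working as intended based on documented architectural decisions or low-priority technical debt (the duplicated port constant).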

Detailed Response to Each Concern

⚠️ Issue #1: Race Condition in Worker Health Check - VALID CONCERN

Review Comment: "The spawn() call inside the close event handler is non-blocking, but the function returns immediately. Hooks may attempt HTTP requests before worker has started."

Our Response: You're absolutely right. This is a legitimate race condition we need to fix.

However, the suggested fixes (async/await health check, retry loops) are exactly what we intentionally removed because they were causing production failures (see Observations #3602 and #3600).

Proposed Solution: The hooks already have proper error handling for ECONNREFUSED with actionable user messages:

```typescript
// Inside the hook's catch block: turn connection failures into an actionable message.
if (error.cause?.code === 'ECONNREFUSED' || error.name === 'TimeoutError' || error.message.includes('fetch failed')) {
  throw new Error("There's a problem with the worker. If you just updated, type `pm2 restart claude-mem-worker` in your terminal to continue");
}
```

We should either:

Document this as expected behavior (fire-and-forget spawn)
Add a single synchronous pm2 list check after spawn to verify startup
Keep the current approach and rely on hook error messages

We will NOT re-add: Retry loops, health check polling, or arbitrary delays. Those caused the 100% failure rate we just fixed.
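
For discussion, the single synchronous pm2 list check mentioned above could look roughly like the sketch below. It assumes the worker is registered in PM2 as `claude-mem-worker` (the name used in the restart message) and that `pm2 jlist` prints a JSON array whose entries carry `name` and `pm2_env.status`. `isWorkerOnline` is a hypothetical helper, not existing code. It performs one check, with no retries and no delays.

```typescript
// Hypothetical helper: decide from `pm2 jlist` output whether the worker is up.
// Assumption: each entry looks like { name: string, pm2_env: { status: string } }.
interface Pm2Process {
  name?: string;
  pm2_env?: { status?: string };
}

function isWorkerOnline(jlistOutput: string, workerName = "claude-mem-worker"): boolean {
  let processes: unknown;
  try {
    processes = JSON.parse(jlistOutput);
  } catch {
    return false; // unparseable output counts as "not verified"
  }
  if (!Array.isArray(processes)) return false;
  return (processes as Pm2Process[]).some(
    (p) => p?.name === workerName && p?.pm2_env?.status === "online",
  );
}

// In a real hook the string would come from one synchronous call, for example
//   spawnSync("pm2", ["jlist"], { encoding: "utf8" }).stdout   (node:child_process)
// Here a hand-written sample stands in for that output.
const sample = JSON.stringify([
  { name: "claude-mem-worker", pm2_env: { status: "online" } },
]);
console.log(isWorkerOnline(sample)); // true
```

If the check returns false, the hook can surface the same actionable message it already uses for ECONNREFUSED rather than waiting or retrying.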

⚠️ Issue #2: Removed Health Endpoint Information - INTENTIONAL

Review Comment: "This removes useful debugging information. When troubleshooting production issues, knowing the PID, active sessions count, and port would be valuable."

Our Documentation:

Observation #3616: "Simplified Health Check Endpoint to Minimal Response"
Observation #3601: "Minimum Parameters = Minimum Bugs"
Observation #3600: "Comprehensive Analysis of Cascading Architectural Problems"

Why We Did This:

HTTP 200 = Alive: If the endpoint responds, the worker is healthy. Period.
Diagnostic fields provided no actionable value: PID, activeSessions, and chromaSynced did not help debug the actual production failures
Part of 87% code reduction: worker-utils.ts went from 113 lines → 15 lines
Health checks were hiding real problems: Retry logic masked the fact that the startup sequence was broken

Original Problem:

Worker startup: 4-5 seconds (actual)
Health check timeout: 3 seconds (configured)
Result: 100% user failure rate

The detailed health response didn't help diagnose this - fixing the startup sequence (HTTP server first) did.
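
The "HTTP server first" ordering can be sketched as follows. `startWorker`, `listen`, and `initChroma` are hypothetical names used only for illustration; the real worker's functions may differ. The point is the order: begin accepting requests before any slow initialization, and treat a Chroma failure as non-fatal.

```typescript
// Sketch only: all names here are illustrative, not the project's actual API.
function startWorker(
  listen: () => void,
  initChroma: () => Promise<void>,
  log: string[] = [],
): string[] {
  // 1. Bind the HTTP server first so hooks can connect right away.
  listen();
  log.push("http:listening");

  // 2. Schedule slow initialization in the background; never block startup on it.
  log.push("chroma:init-scheduled");
  Promise.resolve()
    .then(() => initChroma())
    .then(() => {
      log.push("chroma:ready");
    })
    .catch((err: unknown) => {
      // Chroma is optional: record the failure and keep serving.
      log.push(`chroma:failed (${String(err)})`);
    });

  return log;
}

// Even when Chroma fails, the server was already listening.
const events = startWorker(
  () => {},
  () => Promise.reject(new Error("chroma unavailable")),
);
console.log(events[0]); // http:listening
```

With this ordering, the time until a responding endpoint no longer depends on how long Chroma takes to initialize.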

Response: Will not change. The health endpoint serves one purpose: availability signal. Use PM2 commands for diagnostics:

pm2 list - See PID, status, memory
pm2 logs claude-mem-worker - See application logs
npm run worker:logs - Convenience wrapper

⚠️ Issue #3: Auto-Session Creation Without Validation - NEEDS FIXING

Review Comment: "Uses non-null assertion (dbSession!) without checking if dbSession is actually null. If getSessionById() returns null, this will throw at runtime."

Our Response: You're absolutely right. This is a legitimate bug.

Action Required: Add null checks to handleObservation and handleSummarize, matching the check that already exists in handleInit:

```typescript
// Guard against a missing session before using dbSession.
const dbSession = db.getSessionById(sessionDbId);
if (!dbSession) {
  db.close();
  res.status(404).json({ error: 'Session not found in database' });
  return;
}
```

This needs to be fixed before merge.
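
Because the same check is needed in three handlers, one option is a small shared guard. The sketch below is hypothetical: `requireSession`, `SessionDb`, and `JsonResponse` are illustrative stand-ins for the real database and Express-style response types, which are not shown in full here.

```typescript
// Minimal stand-in types (assumptions, not the project's real interfaces).
interface Session {
  id: number;
}
interface SessionDb {
  getSessionById(id: number): Session | null;
  close(): void;
}
interface JsonResponse {
  status(code: number): JsonResponse;
  json(body: unknown): void;
}

// Returns the session, or sends a 404 and returns null so the caller can bail out.
function requireSession(db: SessionDb, sessionDbId: number, res: JsonResponse): Session | null {
  const dbSession = db.getSessionById(sessionDbId);
  if (!dbSession) {
    db.close();
    res.status(404).json({ error: "Session not found in database" });
    return null;
  }
  return dbSession;
}

// Exercise the guard with fakes: a database that never finds the session.
const sent: { code?: number; body?: unknown } = {};
const fakeRes: JsonResponse = {
  status(code) {
    sent.code = code;
    return fakeRes;
  },
  json(body) {
    sent.body = body;
  },
};
const dbState = { closed: false };
const emptyDb: SessionDb = {
  getSessionById: () => null,
  close: () => {
    dbState.closed = true;
  },
};
const missing = requireSession(emptyDb, 42, fakeRes);
console.log(missing, sent.code); // null 404
```

A handler would then start with `const dbSession = requireSession(db, sessionDbId, res); if (!dbSession) return;`, which removes the need for the non-null assertion.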

⚠️ Issue #4: Removed Observation Counter - INTENTIONAL

Review Comment: "Was this used for generating correlation IDs for logging? If so, is there now no way to correlate observations within a session for debugging?"

Our Documentation:

Observation #3621-3627: Complete removal of observation counter and correlation IDs
Observation #3602: "Architectural Decision: Remove Health Checks and Arbitrary Delays"
Observation #3612: "Worker Service Simplification Strategy"

Why We Removed It:

Over-engineering: Provided per-observation tracking when session-level identification was sufficient
Part of cascading complexity: Correlation IDs were monitoring infrastructure for complexity that shouldn't exist
Session-level debugging is sufficient: Most issues are diagnosed by knowing which session is involved, not which observation (say, #5) within that session
Database IDs provide uniqueness: Once stored, observations have DB IDs for precise identification

The Problem It Was Solving (That No Longer Needs Solving):

Tracking individual observations through worker pipeline
Monitoring Chroma sync success/failure per observation
Detailed per-observation timing metrics

Why That's Unnecessary:

Session-level logging is sufficient for debugging
Database IDs provide uniqueness after storage
The monitoring was masking real problems (startup sequence)

Response: Will not change. This was part of the simplification strategy that fixed production failures.

⚠️ Issue #5: PM2 Watch Mode in Production - VALID CONCERN

Review Comment: "Watch mode causes PM2 to restart the process whenever files change. This is useful during development but potentially problematic in production."

Our Investigation:

Observation #3631: Documents what watch mode does, but no observation documents WHY we enabled it
Observation #3611: PM2 config was "drastically simplified" by removing 21 unnecessary parameters
Watch mode was kept during this aggressive simplification

Conclusion: This appears to be unintentional - likely enabled for development and inadvertently left enabled.

Action Required: Either:

Disable watch mode (recommended) - Users aren't developing, they're using the plugin
Document it as intentional if there's a reason we want auto-restart on file changes

This should be addressed before merge - likely by disabling watch mode.
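
If watch mode is disabled, the PM2 app definition would carry `watch: false`. The sketch below is an assumption about the shape of the config, written as a typed object: `name`, `script`, and `watch` are standard PM2 app options, but the script path is a placeholder and the project's actual ecosystem file may differ.

```typescript
// Illustrative subset of a PM2 app definition (not the project's real config).
interface Pm2App {
  name: string;
  script: string;
  watch: boolean;
}

const pm2Config: { apps: Pm2App[] } = {
  apps: [
    {
      name: "claude-mem-worker",
      script: "./path/to/worker-service.js", // placeholder path, not the real one
      watch: false, // end users are not editing source, so do not restart on file changes
    },
  ],
};

console.log(pm2Config.apps.every((app) => app.watch === false)); // true
```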

⚠️ Issue #6: Duplicate Port Constant - ACKNOWLEDGED

Review Comment: "FIXED_PORT constant is defined in 5 places. Creates maintenance burden."

Our Response: Fair point. This is technical debt we can clean up.

However, it's low priority because:

Port is unlikely to change
All values are currently consistent
Not causing production issues

Action: Add to backlog for post-merge cleanup. Export from worker-utils.ts and import elsewhere.
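
A minimal sketch of that cleanup, under stated assumptions: the port value below is a placeholder (the real value is not given here), and `workerUrl`, the loopback host, and the `/health` path are hypothetical illustrations rather than existing code.

```typescript
// worker-utils.ts (sketch): the single source of truth for the worker port.
export const FIXED_PORT = 8080; // placeholder value; substitute the project's real port

// Hypothetical helper so callers never hard-code the host or port themselves.
export function workerUrl(path: string): string {
  return `http://127.0.0.1:${FIXED_PORT}${path}`;
}

// Other modules would then write:
//   import { FIXED_PORT, workerUrl } from "./worker-utils";
console.log(workerUrl("/health"));
```

With one exported constant, changing the port becomes a one-line edit instead of five.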

Summary of Actions

Must Fix Before Merge:

✅ Add null checks to auto-session creation in handleObservation and handleSummarize
✅ Decide on watch mode - Disable unless there's a documented reason to keep it

Will Not Change (Intentional Decisions):

❌ Health endpoint simplification - Part of solving 100% failure rate
❌ Removed observation counter - Part of simplification strategy
❌ Removed health check system - Was causing production failures
❌ Fire-and-forget worker spawn - Hooks have proper error handling

Race Condition Discussion Needed:

🤔 Worker startup race condition - Valid concern, but retry loops caused the original failures. Options:
- Keep current approach (hooks handle ECONNREFUSED gracefully)
- Add single synchronous pm2 list check after spawn
- Document as expected behavior

Nice to Have (Post-Merge):

📋 Consolidate FIXED_PORT constant - Technical debt cleanup

Key Documentation References

The architectural decisions are comprehensively documented in:

JUST-FUCKING-RUN-IT.md (Observation #3602)
- Architectural decision to remove health checks
- Philosophy: Trust PM2, let HTTP timeouts be the health check
LINE-BY-LINE-CASCADING-BULLSHIT.md (Observation #3600)
- Root cause analysis of how health checks caused 100% failure rate
- Documents cascade from arbitrary 3000ms timeout → retry loops → race conditions
MINIMUM-PARAMETERS.md (Observation #3601)
- Quantified impact: 21 unnecessary PM2 parameters, ~160 lines deleted
- Philosophy: "Minimum parameters = minimum bugs"
STUPID-SHIT-THAT-BROKE-PRODUCTION.md (Observation #3597)
- 8 critical issues causing 100% user failure rate
- Includes worker crashing on Chroma failures despite data already in SQLite

These documents explain why the simplifications were necessary: they were not arbitrary removals of useful features but targeted fixes for production failures.

Production Context

Before This PR:

100% user failure rate after v4.x release
Worker startup took 4-5 seconds but health checks timed out at 3 seconds
stdio: 'ignore' eliminated all debugging visibility
Worker crashed on Chroma failures despite data safely in SQLite
ChromaSync initialized in constructor, blocking HTTP server
113 lines of health check code with retry loops masking real problems

After This PR:

HTTP server starts immediately
Worker stays alive through Chroma failures (graceful degradation)
Errors are visible (stdio: 'inherit')
worker-utils.ts: 113 lines → 15 lines (87% reduction)
Hooks have proper error handling with actionable user messages
System works with just SQLite FTS5; Chroma is an optional enhancement

The "removed observability" was actually removed complexity that was hiding problems, not helping diagnose them.
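
Graceful degradation here means the SQLite write decides success and the Chroma sync is best-effort. A minimal sketch, with hypothetical names (`storeObservation`, `writeToSqlite`, `syncToChroma`) standing in for the real worker code:

```typescript
interface StoreResult {
  stored: boolean;
  synced: boolean;
  error?: string;
}

// Sketch only: the callbacks are placeholders for the real persistence calls.
function storeObservation(
  writeToSqlite: () => void,
  syncToChroma: () => void,
): StoreResult {
  // SQLite is the source of truth: if this throws, the error propagates.
  writeToSqlite();
  try {
    syncToChroma();
    return { stored: true, synced: true };
  } catch (err) {
    // The data is already safe in SQLite, so report the failure and keep running.
    return { stored: true, synced: false, error: String(err) };
  }
}

const result = storeObservation(
  () => {},
  () => {
    throw new Error("chroma down");
  },
);
console.log(result.stored, result.synced); // true false
```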
