claude-mem/private/POSTMORTEM-worker-debug-failure.md
Alex Newman c5e68a17c8 refactor: Clean up search architecture, remove experimental contextualize endpoint (#133)
2025-11-21 18:59:23 -05:00


Postmortem: Worker Debug Failure - 2025-11-17

Incident Summary

Attempted to fix a broken worker service. The worker was in a crash loop (225 restarts), failing with "MCP error -32000: Connection closed". The debugging attempt did not find the cause, and all changes were reverted.

What Went Wrong

1. Jumped to Symptoms, Not Root Cause

  • Saw "MCP connection failed" errors in logs
  • Immediately focused on MCP/Chroma connection code
  • Never asked: "Why is this suddenly broken when it worked before?"
  • Classic symptom chasing instead of root cause analysis

2. Ignored the Build Pipeline

  • Worker file wasn't in the expected location (plugin/worker-service.cjs vs plugin/scripts/worker-service.cjs)
  • Build output existed but search server was producing corrupted/error output
  • Never investigated: "Is the build system broken?"
  • Should have compared built artifacts between main and current branch

3. Tried to Fix by Disabling Instead of Understanding

  • Final approach: comment out Chroma, comment out search server
  • This is the opposite of debugging - it's just making things "work" by removing functionality
  • User called this out as "duct tape around 5 things unrelated to the problem"
  • Violated YAGNI/KISS by adding defensive complexity instead of fixing the actual issue

4. Didn't Compare Working vs Broken State

  • User specifically said "we fixed this before"
  • Should have immediately run: git diff main src/services/worker-service.ts
  • Did this eventually but didn't follow through on the findings
  • The diff showed only search-everything additions - the core worker code was UNCHANGED
  • This should have been a huge red flag: "If the code is the same, why is it broken?"
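
The comparison I skipped can be sketched as two small shell helpers. This is a minimal sketch, assuming `main` is the last known-good branch; the function names `compare_to_base` and `unchanged_since` are hypothetical, and the path in the example is the one named above.

```shell
#!/usr/bin/env bash
# Sketch: see what changed between a known-good branch and the current checkout.
# Assumption: the base branch (default "main") is the last working state.

# compare_to_base [BASE] [PATH...]
compare_to_base() {
  local base="${1:-main}"
  if [ "$#" -gt 0 ]; then shift; fi
  # Commits on the current branch that the base branch does not have.
  git log --oneline "$base..HEAD"
  # Per-file summary of differences between the base branch and the working
  # tree, limited to the given paths when any are passed.
  git diff --stat "$base" -- "$@"
}

# unchanged_since BASE PATH
# Exits 0 when PATH is identical to BASE, non-zero when it differs.
unchanged_since() {
  local base="$1" path="$2"
  git diff --quiet "$base" -- "$path"
}

# Example:
#   compare_to_base main src/services/worker-service.ts
#   unchanged_since main src/services/worker-service.ts && echo "code is the same"
```

If `unchanged_since` succeeds for the core files, the code is the same as the working state, and the next place to look is the build or the environment.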

5. Overcomplicated the Investigation

  • Started reading through ChromaSync implementation
  • Traced through MCP connection code
  • Analyzed startup sequences
  • All of this was unnecessary if the root cause was a build issue

What Should Have Happened

Correct Debug Sequence:

  1. Check worker status (pm2 list) - DONE
  2. Check error logs - DONE
  3. Compare current code to main branch - SKIPPED INITIALLY
  4. Check if built files are correct - SKIPPED
  5. Test the build pipeline - NEVER DONE
  6. Verify dependencies are installed - NEVER CHECKED
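
One way to keep that order is to script it so the checks run cheapest-first and stop at the first failure. The sketch below is an assumption about how that could look, not something that existed during the incident; `run_checks` is a hypothetical helper, and the commands in the example are guesses based on the paths in this postmortem.

```shell
#!/usr/bin/env bash
# Sketch: run diagnostic checks in a fixed order and stop at the first failure,
# so the investigation starts at the earliest broken layer.

# run_checks LABEL COMMAND [LABEL COMMAND ...]
# Each COMMAND is a string executed with `sh -c`; its output is discarded.
run_checks() {
  while [ "$#" -ge 2 ]; do
    local label="$1" cmd="$2"
    shift 2
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "PASS  $label"
    else
      echo "FAIL  $label"
      return 1
    fi
  done
}

# Example (commands are assumptions, adjust to the real project layout):
#   run_checks \
#     "pm2 reachable"           "pm2 list" \
#     "build artifact present"  "test -s plugin/scripts/search-server.mjs" \
#     "dependencies installed"  "test -d plugin/scripts/node_modules" \
#     "build succeeds"          "npm run build"
```

The first FAIL line names the layer to investigate before reading any application code.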

The Real Questions:

  • Is this a code change or a build issue?
  • What changed between working state and broken state?
  • Are the built artifacts corrupted?
  • Is the search server build actually valid?
  • Are there missing dependencies in plugin/scripts/node_modules?

Likely Root Causes (Untested)

Based on evidence:

  1. Build artifacts are corrupted - search-server.mjs threw syntax errors when run
  2. Node modules missing/outdated - plugin/scripts/node_modules may be stale
  3. ESM/CJS bundling issue - esbuild may have produced invalid output
  4. search-everything branch has broken build config - scripts/build-hooks.js may have issues

Key Lessons

KISS/DRY/YAGNI Violations

  • Added complexity (disabling features) instead of removing it
  • Tried to work around symptoms instead of fixing root cause
  • Ignored the principle: "If it worked before and the code is the same, the problem is the environment or the build"

Debugging Anti-Patterns

  1. Symptom Chasing: Following error messages down rabbit holes
  2. Defensive Coding: Commenting out "broken" features instead of fixing them
  3. Ignoring History: Not comparing working vs broken states
  4. Build Blindness: Assuming built artifacts are correct without verification

What Good Debugging Looks Like

  1. Compare working state (main) vs broken state (current branch)
  2. Identify what actually changed (code? deps? build?)
  3. Test the simplest hypothesis first (build issue vs code issue)
  4. Never disable features to "fix" things - that's not fixing

Action Items for Next Attempt

Before Writing Any Code:

  • git diff main for all modified files
  • Check if plugin/scripts/ artifacts are valid JavaScript
  • Compare build process: npm run build output on main vs current branch
  • Verify plugin/scripts/node_modules exists and is current
  • Test search-server.mjs in isolation: node plugin/scripts/search-server.mjs
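
The "valid JavaScript" check can be done without starting the server. A minimal sketch, assuming Node.js is on the PATH: `node --check` parses a file without executing it. The helper name `check_artifact` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify that a built file exists, is non-empty, and parses as JavaScript.
# `node --check` only parses the file; it does not run it.

# check_artifact FILE
check_artifact() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "MISSING OR EMPTY: $file"
    return 1
  fi
  if ! node --check "$file" 2>/dev/null; then
    echo "SYNTAX ERROR: $file"
    return 1
  fi
  echo "OK: $file"
}

# Example, using the artifact path named in this postmortem:
#   check_artifact plugin/scripts/search-server.mjs
#   test -d plugin/scripts/node_modules || echo "node_modules missing"
```

A file that passes this check can still fail at runtime (for example on a missing dependency), so it complements running the file in isolation and does not replace it.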

If Build is Broken:

  • Check scripts/build-hooks.js for recent changes
  • Verify esbuild configuration
  • Test build on main branch, then on current branch
  • Don't modify source code until build is proven working
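
Testing the build on both branches is easier to judge when the two outputs are compared file by file. A rough sketch, assuming the main-branch build lives in a separate checkout; `compare_builds` and the worktree path `../claude-mem-main` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare two build output directories file by file.
# Assumption: one directory was built from main, the other from the current branch.

# compare_builds GOOD_DIR CURRENT_DIR
compare_builds() {
  local good_dir="$1" current_dir="$2"
  # -r recurses into subdirectories, -q reports only which files differ
  # or exist on one side only.
  if diff -rq "$good_dir" "$current_dir"; then
    echo "build outputs are identical"
  else
    echo "build outputs differ"
    return 1
  fi
}

# Example (worktree path is hypothetical):
#   git worktree add ../claude-mem-main main
#   (cd ../claude-mem-main && npm install && npm run build)
#   npm run build
#   compare_builds ../claude-mem-main/plugin/scripts plugin/scripts
```

Differences are expected for files whose source changed on the branch. The suspicious cases are files missing on one side, or artifacts that differ even though their source is unchanged.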

If Code is Broken:

  • Create minimal repro (which specific change broke it?)
  • Fix the actual bug, don't add workarounds
  • Test the fix in isolation

Conclusion

This failure exemplifies "debugging by making changes" instead of "debugging by understanding". The instinct to fix symptoms (MCP errors) instead of investigating root cause (why is it broken now?) led to wasted effort and ultimately no solution.

The user's frustration was justified - I was adding defensive duct tape instead of finding and fixing the real problem. This is exactly what KISS/DRY/YAGNI principles are meant to prevent.

Next time: Compare, verify, understand, THEN fix. Never disable features to make errors go away.