c6f932988a
* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors - Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208) - Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199) - Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707) - Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188) - Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing - Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts) - Add content_hash column via migration 22 with backfill and index - Fix project name collision: getCurrentProjectName() now returns parent/basename - Guard against empty project string with cwd-derived fallback - Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes - Add 12 new tests covering all four fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention - Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic output as error UI (#1181). Restores stderr in finally block for worker-continues case. - Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts so all diagnostics route to log file instead of stderr. - Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784). - Verified session-complete is a recognized event type (fixes #984). - Verified unknown event types return no-op handler with exit 0 (graceful degradation). - Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults, stderr suppression, and standard response constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race - Add PID file mtime guard to prevent concurrent restart storms (#1145): isPidFileRecent() + touchPidFile() coordinate across sessions - Add transparent retry in ChromaMcpManager.callTool() on transport error — reconnects and retries once instead of failing (#1131) - Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042) - Verified ChromaMcpManager.stop() already called on all shutdown paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback - Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199) - Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062) - Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048) - Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791) - Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187) Add post-build verification in build-hooks.js that fails if critical distribution files (skills, hooks, plugin manifest) are missing. Add 10 regression tests covering skill file presence, YAML frontmatter, hooks.json integrity, and package.json files field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column, 6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE). initializeSchema() was gated behind maxApplied===0, so core tables were never created when old versions were present. Fixes: - initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS - Migrations 5-7 check actual DB state (columns/constraints) not just version tracking - Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE) - Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner - Added ON UPDATE CASCADE to FK definitions in initializeSchema() - All changes applied to both runner.ts and SessionStore.ts Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency, version conflicts, crash recovery, FK constraints, and data integrity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards Server tests (12): Added missing workerPath and getAiStatus to ServerOptions mocks after interface expansion. ChromaSync tests (3): Updated to verify transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2): Added memory_ tool skipping and response truncation to prevent recursive loops and oversized payloads. MarkdownFormatter (2): Updated assertions to match current output. SettingsDefaultsManager (1): Used correct default key for getBool test. Logger standards (1): Excluded CLI transcript command from background service check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync Replace readFileSync (loads entire file into memory) with readLastLines() that reads only from the end of the file in expanding chunks (64KB → 10MB cap). Prevents OOM on large log files while preserving the same API response shape. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165) Add .git path guard to all 4 write sites to prevent ref corruption when paths resolve inside .git internals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout Three-layer fix: 1. Added lastGeneratorActivity timestamp to ActiveSession, updated by processAgentResponse (all agents), getMessageIterator (queue yields), and startGeneratorWithProvider (generator launch) 2. Added stale generator detection in ensureGeneratorRunning — if no activity for >30s, aborts stale controller, resets state, restarts 3. Added AbortSignal.timeout(30000) in deleteSession to prevent indefinite hang when awaiting a stuck generator promise Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
813 lines
26 KiB
TypeScript
813 lines
26 KiB
TypeScript
/**
|
|
* ChromaSync Service
|
|
*
|
|
* Automatically syncs observations and session summaries to ChromaDB via MCP.
|
|
* This service provides real-time semantic search capabilities by maintaining
|
|
* a vector database synchronized with SQLite.
|
|
*
|
|
* Uses ChromaMcpManager to communicate with chroma-mcp over stdio MCP protocol.
|
|
* The chroma-mcp server handles its own embedding and persistent storage,
|
|
* eliminating the need for chromadb npm package and ONNX/WASM dependencies.
|
|
*
|
|
* Design: Fail-fast with no fallbacks - if Chroma is unavailable, syncing fails.
|
|
*/
|
|
|
|
import { ChromaMcpManager } from './ChromaMcpManager.js';
|
|
import { ParsedObservation, ParsedSummary } from '../../sdk/parser.js';
|
|
import { SessionStore } from '../sqlite/SessionStore.js';
|
|
import { logger } from '../../utils/logger.js';
|
|
|
|
interface ChromaDocument {
|
|
id: string;
|
|
document: string;
|
|
metadata: Record<string, string | number>;
|
|
}
|
|
|
|
interface StoredObservation {
|
|
id: number;
|
|
memory_session_id: string;
|
|
project: string;
|
|
text: string | null;
|
|
type: string;
|
|
title: string | null;
|
|
subtitle: string | null;
|
|
facts: string | null; // JSON
|
|
narrative: string | null;
|
|
concepts: string | null; // JSON
|
|
files_read: string | null; // JSON
|
|
files_modified: string | null; // JSON
|
|
prompt_number: number;
|
|
discovery_tokens: number; // ROI metrics
|
|
created_at: string;
|
|
created_at_epoch: number;
|
|
}
|
|
|
|
interface StoredSummary {
|
|
id: number;
|
|
memory_session_id: string;
|
|
project: string;
|
|
request: string | null;
|
|
investigated: string | null;
|
|
learned: string | null;
|
|
completed: string | null;
|
|
next_steps: string | null;
|
|
notes: string | null;
|
|
prompt_number: number;
|
|
discovery_tokens: number; // ROI metrics
|
|
created_at: string;
|
|
created_at_epoch: number;
|
|
}
|
|
|
|
interface StoredUserPrompt {
|
|
id: number;
|
|
content_session_id: string;
|
|
prompt_number: number;
|
|
prompt_text: string;
|
|
created_at: string;
|
|
created_at_epoch: number;
|
|
memory_session_id: string;
|
|
project: string;
|
|
}
|
|
|
|
export class ChromaSync {
|
|
private project: string;
|
|
private collectionName: string;
|
|
private collectionCreated = false;
|
|
private readonly BATCH_SIZE = 100;
|
|
|
|
constructor(project: string) {
|
|
this.project = project;
|
|
// Chroma collection names only allow [a-zA-Z0-9._-], 3-512 chars,
|
|
// must start/end with [a-zA-Z0-9]
|
|
const sanitized = project
|
|
.replace(/[^a-zA-Z0-9._-]/g, '_')
|
|
.replace(/[^a-zA-Z0-9]+$/, ''); // strip trailing non-alphanumeric
|
|
this.collectionName = `cm__${sanitized || 'unknown'}`;
|
|
}
|
|
|
|
/**
|
|
* Ensure collection exists in Chroma via MCP.
|
|
* chroma_create_collection is idempotent - safe to call multiple times.
|
|
* Uses collectionCreated flag to avoid redundant calls within a session.
|
|
*/
|
|
private async ensureCollectionExists(): Promise<void> {
|
|
if (this.collectionCreated) {
|
|
return;
|
|
}
|
|
|
|
const chromaMcp = ChromaMcpManager.getInstance();
|
|
try {
|
|
await chromaMcp.callTool('chroma_create_collection', {
|
|
collection_name: this.collectionName
|
|
});
|
|
} catch (error) {
|
|
const message = error instanceof Error ? error.message : String(error);
|
|
if (!message.includes('already exists')) {
|
|
throw error;
|
|
}
|
|
// Collection already exists - this is the expected path after first creation
|
|
}
|
|
|
|
this.collectionCreated = true;
|
|
|
|
logger.debug('CHROMA_SYNC', 'Collection ready', {
|
|
collection: this.collectionName
|
|
});
|
|
}
|
|
|
|
/**
|
|
* Format observation into Chroma documents (granular approach)
|
|
* Each semantic field becomes a separate vector document
|
|
*/
|
|
private formatObservationDocs(obs: StoredObservation): ChromaDocument[] {
|
|
const documents: ChromaDocument[] = [];
|
|
|
|
// Parse JSON fields
|
|
const facts = obs.facts ? JSON.parse(obs.facts) : [];
|
|
const concepts = obs.concepts ? JSON.parse(obs.concepts) : [];
|
|
const files_read = obs.files_read ? JSON.parse(obs.files_read) : [];
|
|
const files_modified = obs.files_modified ? JSON.parse(obs.files_modified) : [];
|
|
|
|
const baseMetadata: Record<string, string | number> = {
|
|
sqlite_id: obs.id,
|
|
doc_type: 'observation',
|
|
memory_session_id: obs.memory_session_id,
|
|
project: obs.project,
|
|
created_at_epoch: obs.created_at_epoch,
|
|
type: obs.type || 'discovery',
|
|
title: obs.title || 'Untitled'
|
|
};
|
|
|
|
// Add optional metadata fields
|
|
if (obs.subtitle) {
|
|
baseMetadata.subtitle = obs.subtitle;
|
|
}
|
|
if (concepts.length > 0) {
|
|
baseMetadata.concepts = concepts.join(',');
|
|
}
|
|
if (files_read.length > 0) {
|
|
baseMetadata.files_read = files_read.join(',');
|
|
}
|
|
if (files_modified.length > 0) {
|
|
baseMetadata.files_modified = files_modified.join(',');
|
|
}
|
|
|
|
// Narrative as separate document
|
|
if (obs.narrative) {
|
|
documents.push({
|
|
id: `obs_${obs.id}_narrative`,
|
|
document: obs.narrative,
|
|
metadata: { ...baseMetadata, field_type: 'narrative' }
|
|
});
|
|
}
|
|
|
|
// Text as separate document (legacy field)
|
|
if (obs.text) {
|
|
documents.push({
|
|
id: `obs_${obs.id}_text`,
|
|
document: obs.text,
|
|
metadata: { ...baseMetadata, field_type: 'text' }
|
|
});
|
|
}
|
|
|
|
// Each fact as separate document
|
|
facts.forEach((fact: string, index: number) => {
|
|
documents.push({
|
|
id: `obs_${obs.id}_fact_${index}`,
|
|
document: fact,
|
|
metadata: { ...baseMetadata, field_type: 'fact', fact_index: index }
|
|
});
|
|
});
|
|
|
|
return documents;
|
|
}
|
|
|
|
/**
|
|
* Format summary into Chroma documents (granular approach)
|
|
* Each summary field becomes a separate vector document
|
|
*/
|
|
private formatSummaryDocs(summary: StoredSummary): ChromaDocument[] {
|
|
const documents: ChromaDocument[] = [];
|
|
|
|
const baseMetadata: Record<string, string | number> = {
|
|
sqlite_id: summary.id,
|
|
doc_type: 'session_summary',
|
|
memory_session_id: summary.memory_session_id,
|
|
project: summary.project,
|
|
created_at_epoch: summary.created_at_epoch,
|
|
prompt_number: summary.prompt_number || 0
|
|
};
|
|
|
|
// Each field becomes a separate document
|
|
if (summary.request) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_request`,
|
|
document: summary.request,
|
|
metadata: { ...baseMetadata, field_type: 'request' }
|
|
});
|
|
}
|
|
|
|
if (summary.investigated) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_investigated`,
|
|
document: summary.investigated,
|
|
metadata: { ...baseMetadata, field_type: 'investigated' }
|
|
});
|
|
}
|
|
|
|
if (summary.learned) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_learned`,
|
|
document: summary.learned,
|
|
metadata: { ...baseMetadata, field_type: 'learned' }
|
|
});
|
|
}
|
|
|
|
if (summary.completed) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_completed`,
|
|
document: summary.completed,
|
|
metadata: { ...baseMetadata, field_type: 'completed' }
|
|
});
|
|
}
|
|
|
|
if (summary.next_steps) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_next_steps`,
|
|
document: summary.next_steps,
|
|
metadata: { ...baseMetadata, field_type: 'next_steps' }
|
|
});
|
|
}
|
|
|
|
if (summary.notes) {
|
|
documents.push({
|
|
id: `summary_${summary.id}_notes`,
|
|
document: summary.notes,
|
|
metadata: { ...baseMetadata, field_type: 'notes' }
|
|
});
|
|
}
|
|
|
|
return documents;
|
|
}
|
|
|
|
/**
|
|
* Add documents to Chroma in batch via MCP
|
|
* Throws error if batch add fails
|
|
*/
|
|
private async addDocuments(documents: ChromaDocument[]): Promise<void> {
|
|
if (documents.length === 0) {
|
|
return;
|
|
}
|
|
|
|
await this.ensureCollectionExists();
|
|
|
|
const chromaMcp = ChromaMcpManager.getInstance();
|
|
|
|
// Add in batches
|
|
for (let i = 0; i < documents.length; i += this.BATCH_SIZE) {
|
|
const batch = documents.slice(i, i + this.BATCH_SIZE);
|
|
|
|
// Sanitize metadata: filter out null, undefined, and empty string values
|
|
// that chroma-mcp may reject (e.g., null subtitle from raw SQLite rows)
|
|
const cleanMetadatas = batch.map(d =>
|
|
Object.fromEntries(
|
|
Object.entries(d.metadata).filter(([_, v]) => v !== null && v !== undefined && v !== '')
|
|
)
|
|
);
|
|
|
|
try {
|
|
await chromaMcp.callTool('chroma_add_documents', {
|
|
collection_name: this.collectionName,
|
|
ids: batch.map(d => d.id),
|
|
documents: batch.map(d => d.document),
|
|
metadatas: cleanMetadatas
|
|
});
|
|
} catch (error) {
|
|
logger.error('CHROMA_SYNC', 'Batch add failed, continuing with remaining batches', {
|
|
collection: this.collectionName,
|
|
batchStart: i,
|
|
batchSize: batch.length
|
|
}, error as Error);
|
|
}
|
|
}
|
|
|
|
logger.debug('CHROMA_SYNC', 'Documents added', {
|
|
collection: this.collectionName,
|
|
count: documents.length
|
|
});
|
|
}
|
|
|
|
/**
|
|
* Sync a single observation to Chroma
|
|
* Blocks until sync completes, throws on error
|
|
*/
|
|
async syncObservation(
|
|
observationId: number,
|
|
memorySessionId: string,
|
|
project: string,
|
|
obs: ParsedObservation,
|
|
promptNumber: number,
|
|
createdAtEpoch: number,
|
|
discoveryTokens: number = 0
|
|
): Promise<void> {
|
|
// Convert ParsedObservation to StoredObservation format
|
|
const stored: StoredObservation = {
|
|
id: observationId,
|
|
memory_session_id: memorySessionId,
|
|
project: project,
|
|
text: null, // Legacy field, not used
|
|
type: obs.type,
|
|
title: obs.title,
|
|
subtitle: obs.subtitle,
|
|
facts: JSON.stringify(obs.facts),
|
|
narrative: obs.narrative,
|
|
concepts: JSON.stringify(obs.concepts),
|
|
files_read: JSON.stringify(obs.files_read),
|
|
files_modified: JSON.stringify(obs.files_modified),
|
|
prompt_number: promptNumber,
|
|
discovery_tokens: discoveryTokens,
|
|
created_at: new Date(createdAtEpoch * 1000).toISOString(),
|
|
created_at_epoch: createdAtEpoch
|
|
};
|
|
|
|
const documents = this.formatObservationDocs(stored);
|
|
|
|
logger.info('CHROMA_SYNC', 'Syncing observation', {
|
|
observationId,
|
|
documentCount: documents.length,
|
|
project
|
|
});
|
|
|
|
await this.addDocuments(documents);
|
|
}
|
|
|
|
/**
|
|
* Sync a single summary to Chroma
|
|
* Blocks until sync completes, throws on error
|
|
*/
|
|
async syncSummary(
|
|
summaryId: number,
|
|
memorySessionId: string,
|
|
project: string,
|
|
summary: ParsedSummary,
|
|
promptNumber: number,
|
|
createdAtEpoch: number,
|
|
discoveryTokens: number = 0
|
|
): Promise<void> {
|
|
// Convert ParsedSummary to StoredSummary format
|
|
const stored: StoredSummary = {
|
|
id: summaryId,
|
|
memory_session_id: memorySessionId,
|
|
project: project,
|
|
request: summary.request,
|
|
investigated: summary.investigated,
|
|
learned: summary.learned,
|
|
completed: summary.completed,
|
|
next_steps: summary.next_steps,
|
|
notes: summary.notes,
|
|
prompt_number: promptNumber,
|
|
discovery_tokens: discoveryTokens,
|
|
created_at: new Date(createdAtEpoch * 1000).toISOString(),
|
|
created_at_epoch: createdAtEpoch
|
|
};
|
|
|
|
const documents = this.formatSummaryDocs(stored);
|
|
|
|
logger.info('CHROMA_SYNC', 'Syncing summary', {
|
|
summaryId,
|
|
documentCount: documents.length,
|
|
project
|
|
});
|
|
|
|
await this.addDocuments(documents);
|
|
}
|
|
|
|
/**
|
|
* Format user prompt into Chroma document
|
|
* Each prompt becomes a single document (unlike observations/summaries which split by field)
|
|
*/
|
|
private formatUserPromptDoc(prompt: StoredUserPrompt): ChromaDocument {
|
|
return {
|
|
id: `prompt_${prompt.id}`,
|
|
document: prompt.prompt_text,
|
|
metadata: {
|
|
sqlite_id: prompt.id,
|
|
doc_type: 'user_prompt',
|
|
memory_session_id: prompt.memory_session_id,
|
|
project: prompt.project,
|
|
created_at_epoch: prompt.created_at_epoch,
|
|
prompt_number: prompt.prompt_number
|
|
}
|
|
};
|
|
}
|
|
|
|
/**
|
|
* Sync a single user prompt to Chroma
|
|
* Blocks until sync completes, throws on error
|
|
*/
|
|
async syncUserPrompt(
|
|
promptId: number,
|
|
memorySessionId: string,
|
|
project: string,
|
|
promptText: string,
|
|
promptNumber: number,
|
|
createdAtEpoch: number
|
|
): Promise<void> {
|
|
// Create StoredUserPrompt format
|
|
const stored: StoredUserPrompt = {
|
|
id: promptId,
|
|
content_session_id: '', // Not needed for Chroma sync
|
|
prompt_number: promptNumber,
|
|
prompt_text: promptText,
|
|
created_at: new Date(createdAtEpoch * 1000).toISOString(),
|
|
created_at_epoch: createdAtEpoch,
|
|
memory_session_id: memorySessionId,
|
|
project: project
|
|
};
|
|
|
|
const document = this.formatUserPromptDoc(stored);
|
|
|
|
logger.info('CHROMA_SYNC', 'Syncing user prompt', {
|
|
promptId,
|
|
project
|
|
});
|
|
|
|
await this.addDocuments([document]);
|
|
}
|
|
|
|
/**
|
|
* Fetch all existing document IDs from Chroma collection via MCP
|
|
* Returns Sets of SQLite IDs for observations, summaries, and prompts
|
|
*/
|
|
private async getExistingChromaIds(projectOverride?: string): Promise<{
|
|
observations: Set<number>;
|
|
summaries: Set<number>;
|
|
prompts: Set<number>;
|
|
}> {
|
|
const targetProject = projectOverride ?? this.project;
|
|
await this.ensureCollectionExists();
|
|
|
|
const chromaMcp = ChromaMcpManager.getInstance();
|
|
|
|
const observationIds = new Set<number>();
|
|
const summaryIds = new Set<number>();
|
|
const promptIds = new Set<number>();
|
|
|
|
let offset = 0;
|
|
const limit = 1000; // Large batches, metadata only = fast
|
|
|
|
logger.info('CHROMA_SYNC', 'Fetching existing Chroma document IDs...', { project: targetProject });
|
|
|
|
while (true) {
|
|
const result = await chromaMcp.callTool('chroma_get_documents', {
|
|
collection_name: this.collectionName,
|
|
limit: limit,
|
|
offset: offset,
|
|
where: { project: targetProject },
|
|
include: ['metadatas']
|
|
}) as any;
|
|
|
|
// chroma_get_documents returns flat arrays: { ids, metadatas, documents }
|
|
const metadatas = result?.metadatas || [];
|
|
|
|
if (metadatas.length === 0) {
|
|
break; // No more documents
|
|
}
|
|
|
|
// Extract SQLite IDs from metadata
|
|
for (const meta of metadatas) {
|
|
if (meta && meta.sqlite_id) {
|
|
const sqliteId = meta.sqlite_id as number;
|
|
if (meta.doc_type === 'observation') {
|
|
observationIds.add(sqliteId);
|
|
} else if (meta.doc_type === 'session_summary') {
|
|
summaryIds.add(sqliteId);
|
|
} else if (meta.doc_type === 'user_prompt') {
|
|
promptIds.add(sqliteId);
|
|
}
|
|
}
|
|
}
|
|
|
|
offset += limit;
|
|
|
|
logger.debug('CHROMA_SYNC', 'Fetched batch of existing IDs', {
|
|
project: targetProject,
|
|
offset,
|
|
batchSize: metadatas.length
|
|
});
|
|
}
|
|
|
|
logger.info('CHROMA_SYNC', 'Existing IDs fetched', {
|
|
project: targetProject,
|
|
observations: observationIds.size,
|
|
summaries: summaryIds.size,
|
|
prompts: promptIds.size
|
|
});
|
|
|
|
return { observations: observationIds, summaries: summaryIds, prompts: promptIds };
|
|
}
|
|
|
|
/**
|
|
* Backfill: Sync all observations missing from Chroma
|
|
* Reads from SQLite and syncs in batches
|
|
* @param projectOverride - If provided, backfill this project instead of this.project.
|
|
* Used by backfillAllProjects() to iterate projects without mutating instance state.
|
|
* Throws error if backfill fails
|
|
*/
|
|
async ensureBackfilled(projectOverride?: string): Promise<void> {
|
|
const backfillProject = projectOverride ?? this.project;
|
|
logger.info('CHROMA_SYNC', 'Starting smart backfill', { project: backfillProject });
|
|
|
|
await this.ensureCollectionExists();
|
|
|
|
// Fetch existing IDs from Chroma (fast, metadata only)
|
|
const existing = await this.getExistingChromaIds(backfillProject);
|
|
|
|
const db = new SessionStore();
|
|
|
|
try {
|
|
// Build exclusion list for observations
|
|
// Filter to validated positive integers before interpolating into SQL
|
|
const existingObsIds = Array.from(existing.observations).filter(id => Number.isInteger(id) && id > 0);
|
|
const obsExclusionClause = existingObsIds.length > 0
|
|
? `AND id NOT IN (${existingObsIds.join(',')})`
|
|
: '';
|
|
|
|
// Get only observations missing from Chroma
|
|
const observations = db.db.prepare(`
|
|
SELECT * FROM observations
|
|
WHERE project = ? ${obsExclusionClause}
|
|
ORDER BY id ASC
|
|
`).all(backfillProject) as StoredObservation[];
|
|
|
|
const totalObsCount = db.db.prepare(`
|
|
SELECT COUNT(*) as count FROM observations WHERE project = ?
|
|
`).get(backfillProject) as { count: number };
|
|
|
|
logger.info('CHROMA_SYNC', 'Backfilling observations', {
|
|
project: backfillProject,
|
|
missing: observations.length,
|
|
existing: existing.observations.size,
|
|
total: totalObsCount.count
|
|
});
|
|
|
|
// Format all observation documents
|
|
const allDocs: ChromaDocument[] = [];
|
|
for (const obs of observations) {
|
|
allDocs.push(...this.formatObservationDocs(obs));
|
|
}
|
|
|
|
// Sync in batches
|
|
for (let i = 0; i < allDocs.length; i += this.BATCH_SIZE) {
|
|
const batch = allDocs.slice(i, i + this.BATCH_SIZE);
|
|
await this.addDocuments(batch);
|
|
|
|
logger.debug('CHROMA_SYNC', 'Backfill progress', {
|
|
project: backfillProject,
|
|
progress: `${Math.min(i + this.BATCH_SIZE, allDocs.length)}/${allDocs.length}`
|
|
});
|
|
}
|
|
|
|
// Build exclusion list for summaries
|
|
const existingSummaryIds = Array.from(existing.summaries).filter(id => Number.isInteger(id) && id > 0);
|
|
const summaryExclusionClause = existingSummaryIds.length > 0
|
|
? `AND id NOT IN (${existingSummaryIds.join(',')})`
|
|
: '';
|
|
|
|
// Get only summaries missing from Chroma
|
|
const summaries = db.db.prepare(`
|
|
SELECT * FROM session_summaries
|
|
WHERE project = ? ${summaryExclusionClause}
|
|
ORDER BY id ASC
|
|
`).all(backfillProject) as StoredSummary[];
|
|
|
|
const totalSummaryCount = db.db.prepare(`
|
|
SELECT COUNT(*) as count FROM session_summaries WHERE project = ?
|
|
`).get(backfillProject) as { count: number };
|
|
|
|
logger.info('CHROMA_SYNC', 'Backfilling summaries', {
|
|
project: backfillProject,
|
|
missing: summaries.length,
|
|
existing: existing.summaries.size,
|
|
total: totalSummaryCount.count
|
|
});
|
|
|
|
// Format all summary documents
|
|
const summaryDocs: ChromaDocument[] = [];
|
|
for (const summary of summaries) {
|
|
summaryDocs.push(...this.formatSummaryDocs(summary));
|
|
}
|
|
|
|
// Sync in batches
|
|
for (let i = 0; i < summaryDocs.length; i += this.BATCH_SIZE) {
|
|
const batch = summaryDocs.slice(i, i + this.BATCH_SIZE);
|
|
await this.addDocuments(batch);
|
|
|
|
logger.debug('CHROMA_SYNC', 'Backfill progress', {
|
|
project: backfillProject,
|
|
progress: `${Math.min(i + this.BATCH_SIZE, summaryDocs.length)}/${summaryDocs.length}`
|
|
});
|
|
}
|
|
|
|
// Build exclusion list for prompts
|
|
const existingPromptIds = Array.from(existing.prompts).filter(id => Number.isInteger(id) && id > 0);
|
|
const promptExclusionClause = existingPromptIds.length > 0
|
|
? `AND up.id NOT IN (${existingPromptIds.join(',')})`
|
|
: '';
|
|
|
|
// Get only user prompts missing from Chroma
|
|
const prompts = db.db.prepare(`
|
|
SELECT
|
|
up.*,
|
|
s.project,
|
|
s.memory_session_id
|
|
FROM user_prompts up
|
|
JOIN sdk_sessions s ON up.content_session_id = s.content_session_id
|
|
WHERE s.project = ? ${promptExclusionClause}
|
|
ORDER BY up.id ASC
|
|
`).all(backfillProject) as StoredUserPrompt[];
|
|
|
|
const totalPromptCount = db.db.prepare(`
|
|
SELECT COUNT(*) as count
|
|
FROM user_prompts up
|
|
JOIN sdk_sessions s ON up.content_session_id = s.content_session_id
|
|
WHERE s.project = ?
|
|
`).get(backfillProject) as { count: number };
|
|
|
|
logger.info('CHROMA_SYNC', 'Backfilling user prompts', {
|
|
project: backfillProject,
|
|
missing: prompts.length,
|
|
existing: existing.prompts.size,
|
|
total: totalPromptCount.count
|
|
});
|
|
|
|
// Format all prompt documents
|
|
const promptDocs: ChromaDocument[] = [];
|
|
for (const prompt of prompts) {
|
|
promptDocs.push(this.formatUserPromptDoc(prompt));
|
|
}
|
|
|
|
// Sync in batches
|
|
for (let i = 0; i < promptDocs.length; i += this.BATCH_SIZE) {
|
|
const batch = promptDocs.slice(i, i + this.BATCH_SIZE);
|
|
await this.addDocuments(batch);
|
|
|
|
logger.debug('CHROMA_SYNC', 'Backfill progress', {
|
|
project: backfillProject,
|
|
progress: `${Math.min(i + this.BATCH_SIZE, promptDocs.length)}/${promptDocs.length}`
|
|
});
|
|
}
|
|
|
|
logger.info('CHROMA_SYNC', 'Smart backfill complete', {
|
|
project: backfillProject,
|
|
synced: {
|
|
observationDocs: allDocs.length,
|
|
summaryDocs: summaryDocs.length,
|
|
promptDocs: promptDocs.length
|
|
},
|
|
skipped: {
|
|
observations: existing.observations.size,
|
|
summaries: existing.summaries.size,
|
|
prompts: existing.prompts.size
|
|
}
|
|
});
|
|
|
|
} catch (error) {
|
|
logger.error('CHROMA_SYNC', 'Backfill failed', { project: backfillProject }, error as Error);
|
|
throw new Error(`Backfill failed: ${error instanceof Error ? error.message : String(error)}`);
|
|
} finally {
|
|
db.close();
|
|
}
|
|
}
|
|
|
|
/**
|
|
* Query Chroma collection for semantic search via MCP
|
|
* Used by SearchManager for vector-based search
|
|
*/
|
|
async queryChroma(
|
|
query: string,
|
|
limit: number,
|
|
whereFilter?: Record<string, any>
|
|
): Promise<{ ids: number[]; distances: number[]; metadatas: any[] }> {
|
|
await this.ensureCollectionExists();
|
|
|
|
try {
|
|
const chromaMcp = ChromaMcpManager.getInstance();
|
|
const results = await chromaMcp.callTool('chroma_query_documents', {
|
|
collection_name: this.collectionName,
|
|
query_texts: [query],
|
|
n_results: limit,
|
|
...(whereFilter && { where: whereFilter }),
|
|
include: ['documents', 'metadatas', 'distances']
|
|
}) as any;
|
|
|
|
// chroma_query_documents returns nested arrays (one per query text)
|
|
// We always pass a single query text, so we access [0]
|
|
const ids: number[] = [];
|
|
const seen = new Set<number>();
|
|
const docIds = results?.ids?.[0] || [];
|
|
const rawMetadatas = results?.metadatas?.[0] || [];
|
|
const rawDistances = results?.distances?.[0] || [];
|
|
|
|
// Build deduplicated arrays that stay index-aligned:
|
|
// Multiple Chroma docs map to the same SQLite ID (one per field).
|
|
// Keep the first (best-ranked) distance and metadata per SQLite ID.
|
|
const metadatas: any[] = [];
|
|
const distances: number[] = [];
|
|
|
|
for (let i = 0; i < docIds.length; i++) {
|
|
const docId = docIds[i];
|
|
// Extract sqlite_id from document ID (supports three formats):
|
|
// - obs_{id}_narrative, obs_{id}_fact_0, etc (observations)
|
|
// - summary_{id}_request, summary_{id}_learned, etc (session summaries)
|
|
// - prompt_{id} (user prompts)
|
|
const obsMatch = docId.match(/obs_(\d+)_/);
|
|
const summaryMatch = docId.match(/summary_(\d+)_/);
|
|
const promptMatch = docId.match(/prompt_(\d+)/);
|
|
|
|
let sqliteId: number | null = null;
|
|
if (obsMatch) {
|
|
sqliteId = parseInt(obsMatch[1], 10);
|
|
} else if (summaryMatch) {
|
|
sqliteId = parseInt(summaryMatch[1], 10);
|
|
} else if (promptMatch) {
|
|
sqliteId = parseInt(promptMatch[1], 10);
|
|
}
|
|
|
|
if (sqliteId !== null && !seen.has(sqliteId)) {
|
|
seen.add(sqliteId);
|
|
ids.push(sqliteId);
|
|
metadatas.push(rawMetadatas[i] ?? null);
|
|
distances.push(rawDistances[i] ?? 0);
|
|
}
|
|
}
|
|
|
|
return { ids, distances, metadatas };
|
|
} catch (error) {
|
|
const errorMessage = error instanceof Error ? error.message : String(error);
|
|
|
|
// Check for connection errors
|
|
const isConnectionError =
|
|
errorMessage.includes('ECONNREFUSED') ||
|
|
errorMessage.includes('ENOTFOUND') ||
|
|
errorMessage.includes('fetch failed') ||
|
|
errorMessage.includes('subprocess closed') ||
|
|
errorMessage.includes('timed out');
|
|
|
|
if (isConnectionError) {
|
|
// Reset collection state so next call attempts reconnect
|
|
this.collectionCreated = false;
|
|
logger.error('CHROMA_SYNC', 'Connection lost during query',
|
|
{ project: this.project, query }, error as Error);
|
|
throw new Error(`Chroma query failed - connection lost: ${errorMessage}`);
|
|
}
|
|
|
|
logger.error('CHROMA_SYNC', 'Query failed', { project: this.project, query }, error as Error);
|
|
throw error;
|
|
}
|
|
}
|
|
|
|
/**
|
|
* Backfill all projects that have observations in SQLite but may be missing from Chroma.
|
|
* Uses a single shared ChromaSync('claude-mem') instance and Chroma connection.
|
|
* Per-project scoping is passed as a parameter to ensureBackfilled(), avoiding
|
|
* instance state mutation. All documents land in the cm__claude-mem collection
|
|
* with project scoped via metadata, matching how DatabaseManager and SearchManager operate.
|
|
* Designed to be called fire-and-forget on worker startup.
|
|
*/
|
|
static async backfillAllProjects(): Promise<void> {
|
|
const db = new SessionStore();
|
|
const sync = new ChromaSync('claude-mem');
|
|
try {
|
|
const projects = db.db.prepare(
|
|
'SELECT DISTINCT project FROM observations WHERE project IS NOT NULL AND project != ?'
|
|
).all('') as { project: string }[];
|
|
|
|
logger.info('CHROMA_SYNC', `Backfill check for ${projects.length} projects`);
|
|
|
|
for (const { project } of projects) {
|
|
try {
|
|
await sync.ensureBackfilled(project);
|
|
} catch (error) {
|
|
logger.error('CHROMA_SYNC', `Backfill failed for project: ${project}`, {}, error as Error);
|
|
// Continue to next project — don't let one failure stop others
|
|
}
|
|
}
|
|
} finally {
|
|
await sync.close();
|
|
db.close();
|
|
}
|
|
}
|
|
|
|
/**
|
|
* Close the ChromaSync instance
|
|
* ChromaMcpManager is a singleton and manages its own lifecycle
|
|
* We don't close it here - it's closed during graceful shutdown
|
|
*/
|
|
async close(): Promise<void> {
|
|
// ChromaMcpManager is a singleton and manages its own lifecycle
|
|
// We don't close it here - it's closed during graceful shutdown
|
|
logger.info('CHROMA_SYNC', 'ChromaSync closed', { project: this.project });
|
|
}
|
|
}
|