feat: replace WASM embeddings with persistent chroma-mcp MCP connection (#1176)

* feat: replace WASM embeddings with persistent chroma-mcp MCP connection

Replace ChromaServerManager (npx chroma run + chromadb npm + ONNX/WASM)
with ChromaMcpManager, a singleton stdio MCP client that communicates with
chroma-mcp via uvx. This eliminates native binary issues, segfaults, and
WASM embedding failures that plagued cross-platform installs.

Key changes:
- Add ChromaMcpManager: singleton MCP client with lazy connect, auto-reconnect,
  connection lock, and Zscaler SSL cert support
- Rewrite ChromaSync to use MCP tool calls instead of chromadb npm client
- Handle chroma-mcp's non-JSON responses (plain text success/error messages)
- Treat "collection already exists" as idempotent success
- Wire ChromaMcpManager into GracefulShutdown for clean subprocess teardown
- Delete ChromaServerManager (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — connection guard leak, timer leak, async reset

- Clear connecting guard in finally block to prevent permanent reconnection block
- Clear timeout after successful connection to prevent timer leak
- Make reset() async to await stop() before nullifying instance
- Delete obsolete chroma-server-manager test (imports deleted class)
- Update graceful-shutdown test to use chromaMcpManager property name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent chroma-mcp spawn storm — zombie cleanup, stale onclose guard, reconnect backoff

Three bugs caused chroma-mcp processes to accumulate (92+ observed):

1. Zombie on timeout: failed connections left subprocess alive because
   only the timer was cleared, not the transport. Now catch block
   explicitly closes transport+client before rethrowing.

2. Stale onclose race: old transport's onclose handler captured `this`
   and overwrote the current connection reference after reconnect,
   orphaning the new subprocess. Now guarded with reference check.

3. No backoff: every failure triggered immediate reconnect. With
   backfill doing hundreds of MCP calls, this created rapid-fire
   spawning. Added 10s backoff on both connection failure and
   unexpected process death.

Also includes ChromaSync fixes from PR review:
- queryChroma deduplication now preserves index-aligned arrays
- SQL injection guard on backfill ID exclusion lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-02-18 18:32:38 -05:00
committed by GitHub
parent 7e57b6e02d
commit 40daf8f3fa
13 changed files with 833 additions and 1189 deletions
@@ -1,139 +0,0 @@
import { describe, it, expect, beforeEach, afterEach, mock, spyOn } from 'bun:test';
import { EventEmitter } from 'events';
import * as childProcess from 'child_process';
import { ChromaServerManager } from '../../../src/services/sync/ChromaServerManager.js';
function createFakeProcess(pid: number = 4242): childProcess.ChildProcess {
const proc = new EventEmitter() as childProcess.ChildProcess & EventEmitter;
let exited = false;
(proc as any).stdout = new EventEmitter();
(proc as any).stderr = new EventEmitter();
(proc as any).pid = pid;
(proc as any).kill = mock(() => {
if (!exited) {
exited = true;
setTimeout(() => proc.emit('exit', 0, 'SIGTERM'), 0);
}
return true;
});
return proc as childProcess.ChildProcess;
}
describe('ChromaServerManager', () => {
const originalFetch = global.fetch;
const originalPlatform = process.platform;
beforeEach(() => {
mock.restore();
ChromaServerManager.reset();
// Avoid macOS cert bundle shelling in tests; these tests only exercise startup races.
Object.defineProperty(process, 'platform', {
value: 'linux',
writable: true,
configurable: true
});
});
afterEach(() => {
global.fetch = originalFetch;
mock.restore();
ChromaServerManager.reset();
Object.defineProperty(process, 'platform', {
value: originalPlatform,
writable: true,
configurable: true
});
});
it('reuses in-flight startup and only spawns one server process', async () => {
const fetchMock = mock(async () => {
// First call: existing server check fails, second call: waitForReady succeeds.
if (fetchMock.mock.calls.length === 1) {
throw new Error('no server yet');
}
return new Response(null, { status: 200 });
});
global.fetch = fetchMock as typeof fetch;
const spawnSpy = spyOn(childProcess, 'spawn').mockImplementation(
() => createFakeProcess() as unknown as ReturnType<typeof childProcess.spawn>
);
const manager = ChromaServerManager.getInstance({
dataDir: '/tmp/chroma-test',
host: '127.0.0.1',
port: 8000
});
const [first, second] = await Promise.all([
manager.start(2000),
manager.start(2000)
]);
expect(first).toBe(true);
expect(second).toBe(true);
expect(spawnSpy).toHaveBeenCalledTimes(1);
});
it('reuses existing reachable server without spawning', async () => {
global.fetch = mock(async () => new Response(null, { status: 200 })) as typeof fetch;
const spawnSpy = spyOn(childProcess, 'spawn').mockImplementation(
() => createFakeProcess() as unknown as ReturnType<typeof childProcess.spawn>
);
const manager = ChromaServerManager.getInstance({
dataDir: '/tmp/chroma-test',
host: '127.0.0.1',
port: 8000
});
const ready = await manager.start(2000);
expect(ready).toBe(true);
expect(spawnSpy).not.toHaveBeenCalled();
});
it('waits for ongoing startup instead of returning early', async () => {
let resolveReady: ((value: Response) => void) | null = null;
const delayedReady = new Promise<Response>((resolve) => {
resolveReady = resolve;
});
const fetchMock = mock(async () => {
// 1st: existing server check -> fail, 2nd: waitForReady -> block until we resolve.
if (fetchMock.mock.calls.length === 1) {
throw new Error('no server yet');
}
return delayedReady;
});
global.fetch = fetchMock as typeof fetch;
spyOn(childProcess, 'spawn').mockImplementation(
() => createFakeProcess() as unknown as ReturnType<typeof childProcess.spawn>
);
const manager = ChromaServerManager.getInstance({
dataDir: '/tmp/chroma-test',
host: '127.0.0.1',
port: 8000
});
const firstStart = manager.start(5000);
let secondResolved = false;
const secondStart = manager.start(5000).then((value) => {
secondResolved = true;
return value;
});
await new Promise((resolve) => setTimeout(resolve, 50));
expect(secondResolved).toBe(false);
resolveReady!(new Response(null, { status: 200 }));
expect(await firstStart).toBe(true);
expect(await secondStart).toBe(true);
});
});