ResearcherX, Technical Architecture

01, Graph Extraction

From raw text to deterministic knowledge

Ingested sources become a typed, interconnected graph rather than a soup of embeddings, so facts can be retrieved deterministically and checked for contradictions.

Parse & chunk. PDFs via MinerU, plus HTML and URLs. Content-aware chunking into manageable segments.
Extract with context. Each chunk is extracted against the existing graph, yielding a dense, interconnected graph instead of disconnected subgraphs.
Index. Nodes merged into Neo4j (idempotent). Descriptions embedded and indexed into LanceDB with Tantivy FTS.
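
The parse-and-chunk step might look roughly like the sketch below. It assumes "content-aware" means never splitting inside a paragraph and that chunks are bounded by a character budget; the function name `chunk_paragraphs` and the `max_chars` default are illustrative, not taken from the project.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars becomes its own oversized chunk
    rather than being split mid-sentence (an assumption of this sketch).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank line that rejoins paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk would then be passed to the extraction step together with the relevant part of the existing graph.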

Model-agnostic via LiteLLM: Ollama (local), Anthropic, Gemini, or Bedrock.

Extraction schema: Python class signatures for graph nodes and edges

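# Extraction schema
from dataclasses import dataclass
from uuid import UUID

@dataclass
class GraphNode:
    id: str                # snake_case identifier
    name: str
    node_type: "NodeType"  # enum of node kinds; its members are not listed here
    description: str
    source_id: str         # provenance

@dataclass
class GraphEdge:
    source_node_id: str    # edge direction: source → target
    target_node_id: str
    description: str
    relationship_id: UUID
    source_id: str         # provenance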

02, Retrieval & Linting

Iterative hop+filter with LLM contradiction check

Two-hop graph walk gated at each step by a k-closest filter, then an LLM judges the retrieved context against the draft. Provenance ({document_id}:{page}:{line_start}-{line_end}) travels with every flag.
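
As an illustration of the provenance format, here is a minimal sketch that parses and re-serializes `{document_id}:{page}:{line_start}-{line_end}`. The `Provenance` class and `parse_provenance` function are hypothetical names, and the sketch assumes the three numeric fields are plain non-negative integers.

```python
import re
from dataclasses import dataclass

# Matches "{document_id}:{page}:{line_start}-{line_end}".
# Assumption: document_id may itself contain colons, so it is matched greedily
# and the three numeric fields are anchored to the end of the string.
_PROVENANCE = re.compile(r"^(?P<doc>.+):(?P<page>\d+):(?P<start>\d+)-(?P<end>\d+)$")


@dataclass(frozen=True)
class Provenance:
    document_id: str
    page: int
    line_start: int
    line_end: int

    def __str__(self) -> str:
        return f"{self.document_id}:{self.page}:{self.line_start}-{self.line_end}"


def parse_provenance(value: str) -> Provenance:
    """Parse a provenance string, raising ValueError on malformed input."""
    match = _PROVENANCE.match(value)
    if match is None:
        raise ValueError(f"not a provenance string: {value!r}")
    prov = Provenance(
        match["doc"], int(match["page"]), int(match["start"]), int(match["end"])
    )
    if prov.line_start > prov.line_end:
        raise ValueError("line_start must not exceed line_end")
    return prov
```

Because the string round-trips through `str()`, a flag can carry it as an opaque field and the editor can still resolve it back to a page and line range.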

Input (query vector) → Hybrid top-20 → Hop 1 (1-hop rels) → Filter (k-closest) → Hop 2 (expand) → Filter (k-closest) → Output (established facts)
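
The hop-and-filter stages of this pipeline can be sketched in a few lines. This is a toy version under stated assumptions: the graph is an in-memory adjacency dict rather than Neo4j, similarity is plain cosine, and the names `cosine`, `k_closest`, and `two_hop_walk` are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_closest(
    candidates: list[str], vectors: dict[str, list[float]], query: list[float], k: int
) -> list[str]:
    """Keep the k candidates whose vectors are most similar to the query."""
    ranked = sorted(candidates, key=lambda n: cosine(vectors[n], query), reverse=True)
    return ranked[:k]


def two_hop_walk(
    seeds: list[str],
    neighbors: dict[str, list[str]],
    vectors: dict[str, list[float]],
    query: list[float],
    k: int,
) -> set[str]:
    """Expand the seed nodes twice, filtering each expansion with k_closest."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(2):  # hop 1, then hop 2
        expanded = {
            m for n in frontier for m in neighbors.get(n, ()) if m not in visited
        }
        # sorted() makes tie-breaking independent of set iteration order.
        frontier = k_closest(sorted(expanded), vectors, query, k)
        visited.update(frontier)
    return visited
```

Filtering before the second hop is what keeps the expansion bounded: a neighbor dropped at hop 1 never contributes its own neighbors at hop 2.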

Hybrid search. Semantic + keyword across nodes and relationships, fused via Reciprocal Rank Fusion.
k-closest filter. Each hop keeps only the top-k neighbors by semantic similarity, trimming noise at every expansion.
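
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard formula: each item scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the commonly used default, not a value stated here, and the function name is illustrative.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_k: int = 20
) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by fused score descending, breaking ties by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_k]
```

An item that ranks well in both the semantic and the keyword list outranks one that tops only a single list, which is the behavior hybrid search relies on.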

Lint data flow: full request path from keystroke to red squiggly

User types in TipTap editor
    ↓ 2s debounce
POST /api/v1/lint
  { text_chunk, document_id, paragraph_hash }
    ↓
    └ embed(chunk) → query_vector
    ↓
    └ hybrid_search(nodes, top_k=20)
    ↓
    └ hop 1 → k-closest filter
    ↓
    └ hop 2 → k-closest filter
    ↓
    └ format: "[src] → description → [tgt]"
    ↓
    └ lightweight_contradiction_check()
    │   LLM judge
    ↓
Response: { conflicts[], paragraph_hash }
    ↓
ProseMirror DecorationSet
  Red squiggly → click for detail
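
The request and response shapes in the flow above could be handled as in this sketch. Several details are assumptions: SHA-256 as the hash behind `paragraph_hash`, the idea that the echoed hash is used to discard responses for paragraphs edited in the meantime, and all function names.

```python
import hashlib


def paragraph_hash(text: str) -> str:
    """Hash a paragraph's text (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_lint_request(text_chunk: str, document_id: str) -> dict:
    """Build the JSON body for POST /api/v1/lint."""
    return {
        "text_chunk": text_chunk,
        "document_id": document_id,
        "paragraph_hash": paragraph_hash(text_chunk),
    }


def format_fact(src: str, description: str, tgt: str) -> str:
    """Render one retrieved relationship as "[src] → description → [tgt]"."""
    return f"[{src}] → {description} → [{tgt}]"


def is_stale(response: dict, current_text: str) -> bool:
    """True if the paragraph changed after the request was sent."""
    return response["paragraph_hash"] != paragraph_hash(current_text)
```

With a check like `is_stale`, a client could ignore late responses instead of decorating text the user has already rewritten.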

03, Architecture

Private by default, local by design

Your drafts and PDFs never leave your machine by default. Everything (source ingestion, the knowledge graph, embeddings, and your writing) runs locally. The backend is stateless: no user sessions, no document storage, no telemetry on your content. The full stack is self-hostable, and you choose whether the LLM is a local model or a cloud provider.

Drafts and PDFs stay in your browser. Persisted to IndexedDB, never uploaded.
Embeddings run locally via our lightweight embedder, no cloud round-trip.
LLM is your choice. Run fully local with Ollama, or swap in Claude / Gemini / Bedrock for speed, with one switch.

LLM backends: 4 options via LiteLLM (Ollama, Claude, Gemini, Bedrock)

Mode	Runs	Trade-off
Ollama	Fully local	Private, slower
Claude	Anthropic	Cloud, fast
Gemini	Google	Cloud, fast
Bedrock	AWS	Cloud, fast
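
One way the "one switch" could work is a mapping from the mode names in the table to LiteLLM's `provider/model` identifiers. The sketch below is an assumption about configuration, not the project's code; `PROVIDER_PREFIX` and `litellm_model_id` are hypothetical names, and the model name is left to the caller.

```python
# Provider prefixes follow LiteLLM's "provider/model" naming convention.
PROVIDER_PREFIX = {
    "ollama": "ollama",      # fully local
    "claude": "anthropic",   # Anthropic cloud
    "gemini": "gemini",      # Google cloud
    "bedrock": "bedrock",    # AWS cloud
}


def litellm_model_id(mode: str, model_name: str) -> str:
    """Turn a mode from the table plus a model name into a LiteLLM model id."""
    try:
        prefix = PROVIDER_PREFIX[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown LLM mode: {mode!r}") from None
    return f"{prefix}/{model_name}"
```

The resulting string would be passed as the `model` argument of `litellm.completion(model=..., messages=[...])`, so switching backends changes one configuration value rather than any calling code.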

Under the hood: three-layer stack for engineers who want the full picture

Client: Browser SPA. Next.js 16 + React 19 · TipTap / ProseMirror editor · 3D knowledge graph (Three.js) · IndexedDB via Dexie.
Server: Stateless FastAPI. /lint · /ingest · /chat · /graph · /sources, with SSE streaming where it matters.
Data: Neo4j 5.15 + LanceDB. Neo4j holds graph structure. LanceDB + Tantivy handles hybrid search. LLM access via LiteLLM.

04, Data Model

Dual-store: structure + semantics

Neo4j holds graph structure only. All semantic data (embeddings, descriptions, provenance) lives in LanceDB. The split keeps graph walks fast and vector search independent.

Neo4j, structural layer: Cypher graph pattern for entities, relationships, and document provenance

(:Entity {id, name, node_type})
-[:RELATION {relationship_id}]->
(:Entity)

(:Entity)
-[:EXTRACTED_FROM {source_id}]->
(:Document {id, text})
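
An idempotent write against this pattern is typically done with Cypher `MERGE`. The sketch below only builds the query text and parameters; the helper names are hypothetical, and the choice to key entities on `id` and relations on `relationship_id` is an assumption drawn from the pattern above.

```python
def merge_entity_query(node: dict) -> tuple[str, dict]:
    """Build a MERGE that creates the entity once and updates it afterwards."""
    query = (
        "MERGE (n:Entity {id: $id}) "
        "SET n.name = $name, n.node_type = $node_type"
    )
    params = {
        "id": node["id"],
        "name": node["name"],
        "node_type": node["node_type"],
    }
    return query, params


def merge_relation_query(
    source_id: str, target_id: str, relationship_id: str
) -> tuple[str, dict]:
    """Build a MERGE for a RELATION edge between two existing entities."""
    query = (
        "MATCH (s:Entity {id: $source_id}), (t:Entity {id: $target_id}) "
        "MERGE (s)-[r:RELATION {relationship_id: $relationship_id}]->(t)"
    )
    params = {
        "source_id": source_id,
        "target_id": target_id,
        "relationship_id": relationship_id,
    }
    return query, params
```

With the official Neo4j Python driver, each pair could be executed as `session.run(query, params)`. Because `MERGE` matches before it creates, re-ingesting the same source leaves the graph unchanged.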

LanceDB, semantic layer: tables, fields, and indexes for vector + keyword search

Table	Key Fields	Index
nodes	node_id, description, vector[300], sources[]	FTS + cosine
relationships	source_node_id → target_node_id, description, vector[300]	FTS + cosine
chunks	text, document_id, node_id, vector[300]	FTS + cosine

The logic layer between you and AI-assisted drafts.

Deterministic verification
for every document an AI touches.
