Architecture

The logic layer between you and AI-assisted drafts.

ResearcherX builds a knowledge graph from your sources, then flags contradictions in your draft as you write them, with line-level provenance back to every claim.

01, Graph Extraction

From raw text to deterministic knowledge

Ingested sources become a typed, interconnected graph, not a soup of embeddings, so facts can be retrieved and contradicted deterministically.

  1. Parse & chunk. PDFs via MinerU, plus HTML and URLs. Content-aware chunking into manageable segments.
  2. Extract with context. Each chunk is extracted against the existing graph, yielding a dense, interconnected graph instead of disconnected subgraphs.
  3. Index. Nodes merged into Neo4j (idempotent). Descriptions embedded and indexed into LanceDB with Tantivy FTS.
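
The "content-aware chunking" in step 1 is not specified further. As a rough sketch of one way it could work, the function below packs whole paragraphs into chunks up to a size limit so that no paragraph is split mid-thought. The function name `chunk_paragraphs` and the `max_chars` parameter are illustrative assumptions, not ResearcherX's actual chunker.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    Hypothetical helper: paragraphs are never split, so a single paragraph
    longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs produced by extra blank lines
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Keep growing the chunk; a lone oversized paragraph stays whole.
            current = candidate
        else:
            # The next paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries is only one reading of "content-aware"; a real chunker might also respect headings, tables, or sentence boundaries.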

Model-agnostic via LiteLLM: Ollama (local), Anthropic, Gemini, or Bedrock.

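Extraction schema: Python class signatures for graph nodes and edges
# Extraction schema
from dataclasses import dataclass
from uuid import UUID

@dataclass
class GraphNode:
    id: str                # snake_case identifier
    name: str
    node_type: "NodeType"  # NodeType is defined elsewhere
    description: str
    source_id: str         # provenance

@dataclass
class GraphEdge:
    source: str            # id of the source node
    target: str            # id of the target node (edge runs source → target)
    description: str
    relationship_id: UUID
    source_id: str         # provenance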

02, Retrieval & Linting

Iterative hop+filter with LLM contradiction check

Two-hop graph walk gated at each step by a k-closest filter, then an LLM judges the retrieved context against the draft. Provenance ({document_id}:{page}:{line_start}-{line_end}) travels with every flag.
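
The provenance format above, `{document_id}:{page}:{line_start}-{line_end}`, can be round-tripped with a small value type. The sketch below is illustrative only: the `Provenance` class and its `parse` method are assumed names, and it assumes page and line numbers are plain integers.

```python
import re
from dataclasses import dataclass

# The greedy document_id group allows ids that themselves contain colons.
_PROVENANCE = re.compile(r"^(?P<doc>.+):(?P<page>\d+):(?P<start>\d+)-(?P<end>\d+)$")


@dataclass(frozen=True)
class Provenance:
    """Hypothetical value type for a {document_id}:{page}:{line_start}-{line_end} reference."""

    document_id: str
    page: int
    line_start: int
    line_end: int

    def __str__(self) -> str:
        return f"{self.document_id}:{self.page}:{self.line_start}-{self.line_end}"

    @classmethod
    def parse(cls, value: str) -> "Provenance":
        match = _PROVENANCE.match(value)
        if match is None:
            raise ValueError(f"not a provenance string: {value!r}")
        return cls(match["doc"], int(match["page"]), int(match["start"]), int(match["end"]))
```

For example, `Provenance.parse("paper_7:3:10-12")` yields a reference to lines 10 through 12 on page 3 of document `paper_7`, and `str()` turns it back into the original string.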

Input: Query Vector
  → Search: Hybrid top-20
  → Hop 1: 1-hop rels
  → Filter: k-closest
  → Hop 2: Expand
  → Filter: k-closest
  → Output: Established Facts
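
The hop+filter loop can be sketched over an in-memory graph. This is a minimal illustration under stated assumptions, not the production retriever: it filters nodes (the real system may filter relationships), ranks by cosine similarity to the query vector, and uses made-up names (`two_hop_retrieve`, `k_closest`, `neighbors`, `vectors`).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_closest(query, candidates, vectors, k):
    """Keep the k candidates most similar to the query (ties broken by node id)."""
    ranked = sorted(candidates, key=lambda n: (-cosine(query, vectors[n]), n))
    return ranked[:k]


def two_hop_retrieve(query, seeds, neighbors, vectors, k=2):
    """Expand from the seed nodes twice, pruning to the k closest after each hop."""
    kept = list(seeds)
    frontier = list(seeds)
    for _ in range(2):  # hop 1, then hop 2
        expanded = {n for node in frontier for n in neighbors.get(node, [])} - set(kept)
        frontier = k_closest(query, expanded, vectors, k)
        kept.extend(frontier)
    return kept
```

Pruning after every hop keeps the candidate set from growing with the graph's branching factor, which is presumably why the filter gates each step rather than running once at the end.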

Lint data flow: full request path from keystroke to red squiggly
User types in TipTap editor
    ↓ 2s debounce
POST /api/v1/lint
  { text_chunk, document_id, paragraph_hash }
    ↓
    └ embed(chunk) → query_vector
    ↓
    └ hybrid_search(nodes, top_k=20)
    ↓
    └ hop 1 → k-closest filter
    ↓
    └ hop 2 → k-closest filter
    ↓
    └ format: "[src] → description → [tgt]"
    ↓
    └ lightweight_contradiction_check()
    │   LLM judge
    ↓
Response: { conflicts[], paragraph_hash }
    ↓
ProseMirror DecorationSet
  Red squiggly → click for detail
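
The response echoes `paragraph_hash` back to the client. One plausible use, sketched below, is discarding results for a paragraph that has changed since the request was sent. The hashing scheme (SHA-256 over whitespace-normalized text) and the helper names are assumptions for illustration; the source does not say how the hash is computed or used.

```python
import hashlib


def paragraph_hash(text: str) -> str:
    """Hypothetical hash: SHA-256 of the paragraph with whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_lint_request(text_chunk: str, document_id: str) -> dict:
    """Build the body for POST /api/v1/lint using the three fields shown above."""
    return {
        "text_chunk": text_chunk,
        "document_id": document_id,
        "paragraph_hash": paragraph_hash(text_chunk),
    }


def is_stale(response: dict, current_text: str) -> bool:
    """True if the paragraph was edited after the request that produced this response."""
    return response["paragraph_hash"] != paragraph_hash(current_text)
```

Under this reading, a client would skip drawing decorations whenever `is_stale` returns true and wait for the next debounced request instead.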

03, Architecture

Private by default, local by design

Your drafts and PDFs never leave your machine by default. Everything (source ingestion, the knowledge graph, embeddings, and your writing) runs locally. The backend is stateless: no user sessions, no document storage, no telemetry on your content. The full stack is self-hostable, and you choose whether the LLM is a local model or a cloud provider.

LLM backends: 4 options via LiteLLM (Ollama, Claude, Gemini, Bedrock)

Mode     Runs         Trade-off
Ollama   Fully local  Private, slower
Claude   Anthropic    Cloud, fast
Gemini   Google       Cloud, fast
Bedrock  AWS          Cloud, fast
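
LiteLLM selects a provider from a prefix on the model string, so switching backends can come down to changing one setting. The sketch below maps the four modes above to LiteLLM's provider prefixes. The helper names are hypothetical and the model names are placeholders; the resulting string is what would be passed as `model=` to `litellm.completion`.

```python
# Mode name (as in the table above) → LiteLLM provider prefix.
PROVIDER_PREFIX = {
    "ollama": "ollama",
    "claude": "anthropic",
    "gemini": "gemini",
    "bedrock": "bedrock",
}

# Backends that run on the user's own machine.
LOCAL_BACKENDS = {"ollama"}


def litellm_model(backend: str, model_name: str) -> str:
    """Build a '<provider>/<model>' string; model_name is supplied by the user."""
    try:
        prefix = PROVIDER_PREFIX[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return f"{prefix}/{model_name}"


def leaves_machine(backend: str) -> bool:
    """True if prompts for this backend are sent to a cloud provider."""
    return backend.lower() not in LOCAL_BACKENDS
```

A check like `leaves_machine` is one way an application could warn before draft text is sent to a cloud provider, in keeping with the private-by-default stance.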

Under the hood: three-layer stack for engineers who want the full picture
  • Client: browser SPA. Next.js 16 + React 19 · TipTap / ProseMirror editor · 3D knowledge graph (Three.js) · IndexedDB via Dexie.
  • Server: stateless FastAPI. /lint · /ingest · /chat · /graph · /sources, with SSE streaming where it matters.
  • Data: Neo4j 5.15 + LanceDB. Neo4j holds graph structure. LanceDB + Tantivy handle hybrid search. LLM access via LiteLLM.

04, Data Model

Dual-store: structure + semantics

Neo4j holds graph structure only. All semantic data (embeddings, descriptions, provenance) lives in LanceDB. The split keeps graph walks fast and vector search independent.

Neo4j, structural layer: Cypher graph pattern for entities, relationships, and document provenance
(:Entity {id, name, node_type})
  -[:RELATION {relationship_id}]->
(:Entity)

(:Entity)
  -[:EXTRACTED_FROM {source_id}]->
(:Document {id, text})
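
Idempotent merging into this pattern is typically done with Cypher's `MERGE` clause, which matches an existing node or relationship before creating a new one. The builders below are a sketch under that assumption; the function names are hypothetical, and each returns a query and parameter dict that could be handed to a Neo4j driver session's `run` method.

```python
def merge_entity(node_id: str, name: str, node_type: str) -> tuple[str, dict]:
    """Upsert an :Entity keyed on id; re-running it does not create duplicates."""
    query = (
        "MERGE (e:Entity {id: $id}) "
        "SET e.name = $name, e.node_type = $node_type"
    )
    return query, {"id": node_id, "name": name, "node_type": node_type}


def merge_relation(source: str, target: str, relationship_id: str) -> tuple[str, dict]:
    """Upsert a :RELATION edge between two existing entities, keyed on relationship_id."""
    query = (
        "MATCH (s:Entity {id: $source}), (t:Entity {id: $target}) "
        "MERGE (s)-[r:RELATION {relationship_id: $relationship_id}]->(t)"
    )
    return query, {
        "source": source,
        "target": target,
        "relationship_id": relationship_id,
    }
```

Passing values as parameters rather than formatting them into the query string avoids injection and lets the database reuse its query plan.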

LanceDB, semantic layer: tables, fields, and indexes for vector + keyword search

Table          Key Fields                                                 Index
nodes          node_id, description, vector[300], sources[]               FTS + cosine
relationships  source_node_id → target_node_id, description, vector[300]  FTS + cosine
chunks         text, document_id, node_id, vector[300]                    FTS + cosine
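
Every table carries both a full-text index and a cosine vector index, so a hybrid search has to merge two ranked lists. The source does not say how the two are fused. Reciprocal rank fusion is one common choice, and the sketch below shows it purely as an illustration; the function name and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; ids ranked highly in more lists come first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by id.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Because it uses only ranks, this approach needs no calibration between keyword scores and cosine similarities, which sit on different scales.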

Deterministic verification for every document an AI touches.

Built into your writing workflow.

GraphRAG · Neo4j · LanceDB · Multi-Hop Retrieval · Contradiction Detection