Benchmarks

MemoryModel is a fully schema-agnostic engine. Users define custom memory types, extraction prompts, and embedding templates directly through the MemoryModel Console (our web-based configuration interface). No code changes required.

For this benchmark, we configured a 4-node topology via the console, optimized for conversational biography extraction from the LoCoMo dataset:

Memory node topology defined for the LoCoMo benchmark

The extraction prompt configured for the temporal_event node:

You are a Senior NLP Specialist and Temporal Reasoning Engine. Your task is to extract events from the conversation and resolve ALL relative time references into ISO 8601 absolute dates (YYYY-MM-DD).
### STEP 1: ESTABLISH THE ANCHOR DATE
The system has already processed the context and calculated the correct reference date for this session.
Current Context Date: {{CURRENT_DATE}}
HIERARCHY OF TRUTH:
1. SYSTEM ANCHOR (DEFAULT): Use the "Current Context Date" provided above as your mathematical Anchor for "today".
2. NARRATIVE OVERRIDE (EXCEPTION): ONLY if the user explicitly changes the timeline in the text (e.g., "Imagine it is 1990", "Back in 2012...", "Assume today is Nov 14"), use that specific narrative date instead.
### STEP 2: EXTRACT AND CALCULATE
Extract every event. For each event involving a time reference:
1. Identify the relative phrase (e.g., "last Tuesday", "three days ago", "next week").
2. Perform date arithmetic using the Anchor Date.
- Example: If Anchor is 2023-07-12 (Wednesday) and text says "two days ago", calculation is 2023-07-10.
- Example: "Tomorrow" = Anchor + 1 day.
OUTPUT FORMAT
Return a valid JSON array ordered CHRONOLOGICALLY.
[
{
"event_description": "Self-contained description including key details (what, why, result), specific objects/contents (e.g., what a sign said), and emotional states. Do NOT resort to vague summaries.",
"absolute_date": "YYYY-MM-DD",
"original_time_expression": "The verbatim relative phrase used in text",
"location": "Location or null",
"participants": ["Name 1", "Name 2"],
"context_evidence": "Verbatim text span"
}
]
### CRITICAL RULES
- ISO 8601 ONLY: The 'absolute_date' MUST be in YYYY-MM-DD format.
- CALCULATE: Do not be lazy. "Three days ago" must become a specific date.
- OUTPUT: Output ONLY the JSON array.
Input Text:
...
The embedding template applied to each extracted event:

Timestamp: {{absolute_date}} (Ref: {{original_time_expression}}) | Event: {{event_description}} | Details: {{context_evidence}} | Participants: {{participants}} | Location: {{location}}
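A minimal TypeScript sketch of how an extracted event could be rendered through this template. The `TemporalEvent` interface and `renderEmbeddingText` helper are illustrative assumptions, not the actual MemoryModel API:

```typescript
// Shape mirrors the JSON output format of the extraction prompt above.
interface TemporalEvent {
  event_description: string;
  absolute_date: string;            // ISO 8601, resolved at ingestion
  original_time_expression: string; // verbatim relative phrase from the text
  location: string | null;
  participants: string[];
  context_evidence: string;
}

// Render one event into the pipe-delimited embedding text.
function renderEmbeddingText(e: TemporalEvent): string {
  return (
    `Timestamp: ${e.absolute_date} (Ref: ${e.original_time_expression})` +
    ` | Event: ${e.event_description}` +
    ` | Details: ${e.context_evidence}` +
    ` | Participants: ${e.participants.join(", ")}` +
    ` | Location: ${e.location ?? "null"}`
  );
}
```

Embedding the resolved absolute date directly in the text means the vector index itself carries the temporal signal, rather than relying on a separate metadata join.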
| System | Overall Accuracy (J Score) | Difference vs MemoryModel |
| --- | --- | --- |
| MemoryModel (Ours) | 74.6% | – |
| Letta | 74.0% | -0.6% |
| Mem0ᵍ (Graph) | 68.4% | -6.2% |
| Mem0 | 66.9% | -7.7% |
| OpenAI Memory | 52.9% | -21.7% |

While systems like Mem0 rely on the LLM to calculate dates at query time (runtime calculation), MemoryModel adopts a “Shift-Left” approach: we resolve relative time expressions (e.g., “three days ago”) into ISO 8601 absolute dates during the ingestion phase. This deterministic pre-computation eliminates the hallucination risks associated with real-time arithmetic in LLMs.
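The ingestion-time resolution can be sketched as deterministic date arithmetic. The `resolveRelativeDate` helper below is an illustrative assumption covering only a few relative expressions, not the actual NLP parser:

```typescript
// "Shift-Left" sketch: resolve a relative phrase against an anchor date
// (YYYY-MM-DD) at ingestion time, returning an ISO 8601 date or null.
function resolveRelativeDate(expression: string, anchorISO: string): string | null {
  const anchor = new Date(anchorISO + "T00:00:00Z");
  const addDays = (d: Date, n: number): string =>
    new Date(d.getTime() + n * 86_400_000).toISOString().slice(0, 10);

  const expr = expression.trim().toLowerCase();
  if (expr === "today") return addDays(anchor, 0);
  if (expr === "tomorrow") return addDays(anchor, 1);
  if (expr === "yesterday") return addDays(anchor, -1);

  const ago = expr.match(/^(\d+|one|two|three) days? ago$/);
  if (ago) {
    const words: Record<string, number> = { one: 1, two: 2, three: 3 };
    const n = words[ago[1]] ?? parseInt(ago[1], 10);
    return addDays(anchor, -n);
  }
  return null; // unresolved: leave to the LLM extraction step
}
```

Because this runs once per event at ingestion, the same input always yields the same date; there is no per-query arithmetic for the LLM to get wrong.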

  • Name: LoCoMo (Long Conversational Memory)
  • Source: snap-research/locomo
  • Size: 50 long conversations (~300 turns, ~9,000 tokens each)
  • Sessions: Up to 35 sessions per conversation
  • Questions: 1,986 questions for evaluation

| Metric | Description |
| --- | --- |
| LLM-as-Judge (J) | A powerful LLM evaluates the correctness of generated answers |
| F1 Score | Balances precision and recall for factual correctness |
| BLEU-1 | Assesses text generation quality against ground truth |
  • Single-Hop: Questions answerable from a single conversational turn/session.
  • Multi-Hop: Questions requiring synthesis across multiple sessions.
  • Temporal: Questions involving time-based reasoning and chronological awareness.
  • Open-Domain: Questions requiring external knowledge integration.
| Aspect | MemoryModel | Mem0 |
| --- | --- | --- |
| Node Architecture | 4 specialized semantic nodes | Generic memory extraction |
| Memory Structure | Typed structured memories (temporal, profile, career, social) | Knowledge Graph |
| Schema Configuration | User-defined via web console | Fixed/hardcoded |
| Embedding Model | Gemini text-embedding | |
| LLM Backend | gemini-2.5-flash | |
| Temperature | 0.1 (extraction) / 0.0 (evaluation) | |

Architectural Approach to Temporal Reasoning

A key differentiator between MemoryModel and Mem0 lies in how temporal information is handled.

Mem0 stores memories with relative time expressions intact (e.g., “last year”, “two months ago”). During answer generation, their benchmark prompt must perform complex temporal reasoning:

# INSTRUCTIONS (from Mem0 benchmark prompt):
5. If there is a question about time references (like "last year", "two months ago",
etc.), calculate the actual date based on the memory timestamp.
6. Always convert relative time references to specific dates, months, or years.
For example, convert "last year" to "2022" or "two months ago" to "March 2023"
based on the memory timestamp.

This approach requires:

  • A ~400-word prompt with step-by-step reasoning instructions
  • The LLM to calculate dates at query time from relative expressions
  • Explicit handling of multi-speaker contexts and contradictory timestamps

MemoryModel’s Approach: Pre-Computed Temporal Indexing

MemoryModel resolves temporal references at ingestion time, not at query time:

Ingestion-time temporal resolution vs query-time retrieval
| Aspect | MemoryModel | Mem0 |
| --- | --- | --- |
| When dates are resolved | At ingestion (once) | At query time (every time) |
| Answer prompt complexity | ~80 words | ~400 words |
| Temporal query strategy | NLP-powered range filter | LLM calculation in prompt |
| Error propagation | Caught during ingestion | Can fail silently at query |
| Consistency | Same date always returned | LLM may calculate differently |

This explains why our simpler answer generation prompt achieves higher accuracy (74.6% vs 66.9%):

The heavy lifting of temporal reasoning is done once during ingestion by the specialized temporal_event node, using NLP date parsing. The retrieval system then uses direct temporal range filters on pre-computed ISO dates, eliminating the need for runtime LLM calculations.
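Because the dates are stored as ISO 8601 strings, a temporal range query reduces to plain string comparison. A minimal sketch (the `Memory` shape and `filterByDateRange` name are assumptions, not the actual retrieval code):

```typescript
interface Memory {
  absolute_date: string; // ISO 8601 (YYYY-MM-DD), pre-computed at ingestion
  event_description: string;
}

// For YYYY-MM-DD strings, lexicographic order equals chronological order,
// so the filter needs no date parsing and no LLM call at query time.
function filterByDateRange(memories: Memory[], from: string, to: string): Memory[] {
  return memories
    .filter((m) => m.absolute_date >= from && m.absolute_date <= to)
    .sort((a, b) => a.absolute_date.localeCompare(b.absolute_date));
}
```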

This approach embodies the “Shift-Left” principle: moving reasoning complexity from query-time (slow, expensive, non-deterministic) to ingestion-time (one-off, deterministic). Unlike rigid memory systems, MemoryModel allows developers to define extraction logic per-node through the console, enabling domain-specific optimizations without code deployment.

The ingestion system processes content through a multi-node extraction architecture:

  • Extraction Engine: Dynamically loads user-defined schemas from the MemoryModel Console and runs them in parallel. For this benchmark, we configured 4 semantic definitions targeting biography extraction.
  • Multi-Node Processing: Each node extracts typed structured memories using its user-defined prompt.
  • Rate Limiting: Built-in retry with exponential backoff for API resilience.
  • Multi-modal Support: Separate processing pipeline for visual memories with reference matching.
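The retry behavior described above can be sketched as follows; `withRetry` is an illustrative helper, not the actual implementation:

```typescript
// Retry with exponential backoff and jitter for transient API failures.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of attempts: surface error
      // Delay doubles each attempt (500ms, 1s, 2s, ...) with random jitter
      // to avoid synchronized retry storms across parallel extraction nodes.
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping each node's extraction call this way lets the parallel pipeline absorb rate-limit responses without failing the whole ingestion batch.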

The retrieval uses a hybrid multi-strategy orchestrator:

| Strategy | Trigger | Description |
| --- | --- | --- |
| Centroid-Aware | High relevance score | Semantic search with meta/specific query detection |
| Direct Lookup | ID patterns, VAT, email | Exact match on metadata fields |
| Entity Anchor | Capitalized names | Pivot search on entity anchors |
| Temporal Range | Date expressions | NLP-powered date parsing and time-travel queries |
| Simple Vector | Fallback | Pure cosine similarity search |
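The trigger column above can be read as a dispatch order. A simplified sketch of such a router; the patterns are assumptions for illustration, not MemoryModel's actual rules:

```typescript
type Strategy = "direct_lookup" | "entity_anchor" | "temporal_range" | "simple_vector";

function routeQuery(query: string): Strategy {
  // Exact-match identifiers (emails, VAT numbers, IDs) → metadata lookup
  if (/\b[\w.+-]+@[\w-]+\.\w+\b/.test(query) || /\bVAT\b|\bID[:\s]/.test(query)) {
    return "direct_lookup";
  }
  // Date expressions → pre-computed temporal range filter
  if (/\b\d{4}-\d{2}-\d{2}\b|last (week|month|year)|days? ago/i.test(query)) {
    return "temporal_range";
  }
  // Capitalized name mid-sentence → entity-anchored pivot search
  if (/\s[A-Z][a-z]+/.test(query)) return "entity_anchor";
  // Fallback: pure cosine similarity search
  return "simple_vector";
}
```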

Relevance Router: a final layer of LLM-based semantic scoring decides dynamically which memory nodes are most relevant to each query.

The evaluation uses gemini-2.5-flash with temperature 0.0 for deterministic answers:

You are a helpful assistant answering questions based on a set of retrieved memory fragments.
Context:
${contextText}
Question: ${question}
Instructions:
1. Answer the question using ONLY the provided context.
2. **Inference Allowed:** You may perform reasonable logical inferences if strongly supported by the text.
3. **Safety:** If the answer is completely missing or cannot be reasonably inferred, strictly say "I don't know".
4. **Style:** Be concise and direct.

We use a semantic judge following Mem0’s evaluation methodology:

Role: You are an impartial semantic judge evaluating a Question Answering system.
Context:
- Question: "${question}"
- Ground Truth: "${truthStr}"
- Predicted Answer: "${predStr}"
Task: Determine if the Predicted Answer conveys the SAME meaning as the Ground Truth.
Evaluation Rules (Be Flexible):
1. **Dates:** Treat "2023-05-07", "May 7th, 2023", "7/5/23" as EQUIVALENT.
2. **Synonyms:** "Happy" == "Joyful", "Scared" == "Afraid".
3. **Verbosity:** If the Prediction is long but contains the correct answer, it is CORRECT.
4. **Lists:** If the Truth is a list, the Prediction must contain the key items.
5. **Negation:** Watch out for "NOT". "He went" != "He did not go".
Output: Respond ONLY with "YES" if correct, or "NO" if incorrect.
Two additional mechanisms complement the judge:

  • A string inclusion check before the LLM judge (normalized, punctuation-stripped)
  • “I don’t know” trap detection to catch abstention failures

The benchmark scripts are open-source and available in the GitHub repository: MatteoTuziMM/memory-model-benchmark.

  • Node.js 18+
  • Your own MemoryModel API key
  • Your own Gemini API key (for evaluation)
  1. Set environment variables

    export MEMORY_API_KEY=your_memorymodel_api_key
    export GEMINI_API_KEY=your_gemini_api_key
  2. Ingest the LoCoMo dataset

    npx ts-node benchmark/benchmark_ingest.ts
  3. Run evaluation

    npx ts-node benchmark/benchmark_eval.ts