
fix(graph_dbs): sanitize FTS query words with mixed content #1259

Open
zerone0x wants to merge 1 commit into MemTensor:main from zerone0x:fix/fts-query-mixed-content

Conversation

@zerone0x
Contributor

Summary

  • FTS queries return empty results when the query string contains mixed content (message IDs with underscores plus Chinese text), because raw words are passed directly to PostgreSQL to_tsquery(), which expects valid tsquery syntax
  • Add _sanitize_tsquery_words() helper that strips tsquery-breaking characters (operators, punctuation) while preserving alphanumeric, CJK unified ideographs, and underscore characters
  • Apply sanitization in both search_by_fulltext and search_by_keywords_tfidf before building the tsquery string

Fixes #1247

Test plan

  • Added unit tests for _sanitize_tsquery_words covering: plain English, Chinese text, mixed content (the original bug scenario), single-quoted inputs, special character removal, deduplication, empty inputs, and tsquery operator stripping
  • All 11 tests pass
  • ruff check passes with no errors

🤖 Generated with Claude Code

FTS queries fail when the query string contains mixed content such as
message IDs with underscores (e.g. `om_x100b544a390604b8c3e1b7d8641f08e`)
combined with Chinese text. The raw words are passed directly to
PostgreSQL `to_tsquery()`, which expects valid tsquery syntax and chokes
on special characters.

Add `_sanitize_tsquery_words()` helper that strips tsquery-breaking
characters while preserving alphanumeric, CJK, and underscore chars.
Apply the sanitization in both `search_by_fulltext` and
`search_by_keywords_tfidf` before building the tsquery string.

Fixes MemTensor#1247

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 17, 2026 07:28
Contributor

Copilot AI left a comment


Pull request overview

Fixes PostgreSQL full-text search (FTS) failures when query tokens contain mixed content (e.g., message IDs with underscores plus CJK text) by sanitizing query words before building a to_tsquery() expression.

Changes:

  • Added _sanitize_tsquery_words() helper in polardb.py to strip tsquery-breaking characters and deduplicate tokens.
  • Applied sanitization in search_by_fulltext and search_by_keywords_tfidf before constructing the to_tsquery() parameter.
  • Added a new unit test module covering expected sanitization behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • src/memos/graph_dbs/polardb.py: Adds and uses _sanitize_tsquery_words() to prevent invalid to_tsquery() inputs for mixed-content tokens.
  • tests/graph_dbs/test_sanitize_tsquery.py: Adds unit tests intended to validate sanitization behavior.


Comment on lines +1 to +27
```python
"""Tests for _sanitize_tsquery_words — standalone, no heavy imports."""

import re

# ---------------------------------------------------------------------------
# Inline the function under test to avoid pulling in the full memos import
# chain (which requires a running logging backend). The canonical copy lives
# in ``memos.graph_dbs.polardb._sanitize_tsquery_words``.
# ---------------------------------------------------------------------------


def _sanitize_tsquery_words(query_words: list[str]) -> list[str]:
    valid_chars_re = re.compile(
        r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
    )
    sanitized: list[str] = []
    seen: set[str] = set()
    for w in query_words:
        w = w.strip().strip("'")
        cleaned = valid_chars_re.sub("", w)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            sanitized.append(cleaned)
    return sanitized
```


Comment on lines +1685 to +1686
```python
# Sanitize and convert query_text to OR query format: "word1 | word2 | word3"
safe_words = _sanitize_tsquery_words(query_words)
```
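Downstream of that sanitization, the OR-format string mentioned in the comment can be built with a plain join. The word list here is illustrative, standing in for the output of the helper:

```python
# Illustrative sanitized words; in the PR these come from _sanitize_tsquery_words().
safe_words = ["om_x100b544a390604b8c3e1b7d8641f08e", "你好"]

# "word1 | word2" is the OR form accepted by PostgreSQL to_tsquery().
tsquery = " | ".join(safe_words)
print(tsquery)  # om_x100b544a390604b8c3e1b7d8641f08e | 你好
```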
Comment on lines +35 to +38
```python
# Keep word characters (letters, digits, underscore) and CJK unified ideographs.
valid_chars_re = re.compile(
    r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
)
```
Comment on lines +25 to +38
```python
"""Sanitize query words for safe use with PostgreSQL to_tsquery().

Strips tsquery operator characters and other special symbols that can
cause parsing errors when mixed content (e.g. message IDs with
underscores, Chinese text) is passed to ``to_tsquery``. Each word is
reduced to its alphanumeric/CJK core so that the jieba text-search
configuration can tokenize it correctly.

Returns a de-duplicated list of non-empty sanitized words.
"""
# Keep word characters (letters, digits, underscore) and CJK unified ideographs.
valid_chars_re = re.compile(
    r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
)
```
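The negated character class quoted above can be exercised directly to see what survives: word characters plus three CJK ideograph ranges (CJK Unified Ideographs, Extension A, and Compatibility Ideographs); everything else is stripped. The sample inputs are illustrative:

```python
import re

# Same pattern as in the PR: remove anything that is not \w or a CJK ideograph.
valid_chars_re = re.compile(r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")

# tsquery operators and punctuation are stripped; underscores survive via \w.
print(valid_chars_re.sub("", "foo|bar&!(baz:*)"))  # foobarbaz
print(valid_chars_re.sub("", "消息_id-123"))        # 消息_id123
```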
@github-actions
Contributor

This PR has been automatically marked as stale due to inactivity.

@github-actions github-actions Bot added the stale A stale issue/PR indicates a long period of inactivity. | 一个长期未更新的issue/PR。 label Apr 17, 2026


Development

Successfully merging this pull request may close these issues.

fix: FTS query failed

2 participants