
fix(graph_dbs): sanitize FTS query words with mixed content #1259

Open
zerone0x wants to merge 1 commit into MemTensor:main from zerone0x:fix/fts-query-mixed-content

Conversation

@zerone0x
Contributor

Summary

  • FTS queries return empty results when the query string contains mixed content (message IDs with underscores plus Chinese text), because raw words are passed directly to PostgreSQL to_tsquery(), which expects valid tsquery syntax
  • Add _sanitize_tsquery_words() helper that strips tsquery-breaking characters (operators, punctuation) while preserving alphanumeric, CJK unified ideographs, and underscore characters
  • Apply sanitization in both search_by_fulltext and search_by_keywords_tfidf before building the tsquery string

Fixes #1247

Test plan

  • Added unit tests for _sanitize_tsquery_words covering: plain English, Chinese text, mixed content (the original bug scenario), single-quoted inputs, special character removal, deduplication, empty inputs, and tsquery operator stripping
  • All 11 tests pass
  • ruff check passes with no errors

🤖 Generated with Claude Code

FTS queries fail when the query string contains mixed content such as
message IDs with underscores (e.g. `om_x100b544a390604b8c3e1b7d8641f08e`)
combined with Chinese text. The raw words are passed directly to
PostgreSQL `to_tsquery()`, which expects valid tsquery syntax and chokes
on special characters.

Add `_sanitize_tsquery_words()` helper that strips tsquery-breaking
characters while preserving alphanumeric, CJK, and underscore chars.
Apply the sanitization in both `search_by_fulltext` and
`search_by_keywords_tfidf` before building the tsquery string.

Fixes MemTensor#1247

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 17, 2026 07:28
Contributor

Copilot AI left a comment


Pull request overview

Fixes PostgreSQL full-text search (FTS) failures when query tokens contain mixed content (e.g., message IDs with underscores plus CJK text) by sanitizing query words before building a to_tsquery() expression.

Changes:

  • Added _sanitize_tsquery_words() helper in polardb.py to strip tsquery-breaking characters and deduplicate tokens.
  • Applied sanitization in search_by_fulltext and search_by_keywords_tfidf before constructing the to_tsquery() parameter.
  • Added a new unit test module covering expected sanitization behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • src/memos/graph_dbs/polardb.py: Adds and uses _sanitize_tsquery_words() to prevent invalid to_tsquery() inputs for mixed-content tokens.
  • tests/graph_dbs/test_sanitize_tsquery.py: Adds unit tests intended to validate sanitization behavior.


Comment on lines +1 to +27
```python
"""Tests for _sanitize_tsquery_words — standalone, no heavy imports."""

import re

# ---------------------------------------------------------------------------
# Inline the function under test to avoid pulling in the full memos import
# chain (which requires a running logging backend). The canonical copy lives
# in ``memos.graph_dbs.polardb._sanitize_tsquery_words``.
# ---------------------------------------------------------------------------


def _sanitize_tsquery_words(query_words: list[str]) -> list[str]:
    valid_chars_re = re.compile(
        r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
    )
    sanitized: list[str] = []
    seen: set[str] = set()
    for w in query_words:
        w = w.strip().strip("'")
        cleaned = valid_chars_re.sub("", w)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            sanitized.append(cleaned)
    return sanitized
```


Comment on lines +1685 to +1686
```python
# Sanitize and convert query_text to OR query format: "word1 | word2 | word3"
safe_words = _sanitize_tsquery_words(query_words)
```
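Downstream of that sanitization, the OR-format string mentioned in the comment can be built with a plain join. The word list here is illustrative, standing in for the output of the helper:

```python
# Illustrative sanitized words; in the PR these come from _sanitize_tsquery_words().
safe_words = ["om_x100b544a390604b8c3e1b7d8641f08e", "你好"]

# "word1 | word2" is the OR form accepted by PostgreSQL to_tsquery().
tsquery = " | ".join(safe_words)
print(tsquery)  # om_x100b544a390604b8c3e1b7d8641f08e | 你好
```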
Comment on lines +35 to +38
```python
# Keep word characters (letters, digits, underscore) and CJK unified ideographs.
valid_chars_re = re.compile(
    r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
)
```
Comment on lines +25 to +38
```python
"""Sanitize query words for safe use with PostgreSQL to_tsquery().

Strips tsquery operator characters and other special symbols that can
cause parsing errors when mixed content (e.g. message IDs with
underscores, Chinese text) is passed to ``to_tsquery``. Each word is
reduced to its alphanumeric/CJK core so that the jieba text-search
configuration can tokenize it correctly.

Returns a de-duplicated list of non-empty sanitized words.
"""
# Keep word characters (letters, digits, underscore) and CJK unified ideographs.
valid_chars_re = re.compile(
    r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]",
)
```
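The negated character class quoted above can be exercised directly to see what survives: word characters plus three CJK ideograph ranges (CJK Unified Ideographs, Extension A, and Compatibility Ideographs); everything else is stripped. The sample inputs are illustrative:

```python
import re

# Same pattern as in the PR: remove anything that is not \w or a CJK ideograph.
valid_chars_re = re.compile(r"[^\w\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")

# tsquery operators and punctuation are stripped; underscores survive via \w.
print(valid_chars_re.sub("", "foo|bar&!(baz:*)"))  # foobarbaz
print(valid_chars_re.sub("", "消息_id-123"))        # 消息_id123
```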
@github-actions
Contributor

This PR has been automatically marked as stale due to inactivity.

@github-actions github-actions Bot added the stale A stale issue/PR indicates a long period of inactivity. | 一个长期未更新的issue/PR。 label Apr 17, 2026


Development

Successfully merging this pull request may close these issues.

fix: FTS query failed

2 participants