
Bugfix for 0 4 0 release #208

Open
bmerkle wants to merge 37 commits into microsoft:main from bmerkle:bugfix-for-0-4-0-release

Conversation

@bmerkle
Contributor

@bmerkle bmerkle commented Feb 27, 2026

Based on #200, I have added some bugfixes for the upcoming 0.4.0 release.

All my commits are dated Feb 27, 2026.
Each commit covers one potential improvement or bug fix; new test cases are added in the same step.
The changelog explains the reason for each change.

gvanrossum-ms and others added 30 commits February 17, 2026 09:02
Reduce lookups by caching the ordinals_of_subset.
Total complexity reduces from O(n*k) to O(n+k).
e.g. didn't match &api-version=... in multi-parameter URLs. Fixed to [?&,].
There was a mutable default argument in collections.py — SemanticRefAccumulator.__init__ used set() as default parameter, shared across all instances.

The code has been rewritten to still track which search terms produced hits in the accumulator but in a way to avoid mutable default arguments.
Now:
__init__ always creates a fresh set() — no parameter

Add a with_term_matches factory that creates a new accumulator with a copy of term matches (makes the copy-vs-share explicit)

Update group_matches_by_type to use a copy instead of sharing the same set object
Update get_matches_in_scope and WhereSemanticRefExpr to use the factory
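The pitfall and the factory pattern described above can be sketched in isolation. The class names below mirror the description but are illustrative stand-ins, not the actual collections.py code:

```python
class Buggy:
    # BUG: the default set() is created once, at function definition time,
    # and is shared by every instance constructed without an argument.
    def __init__(self, term_matches: set[str] = set()):
        self.term_matches = term_matches


class Fixed:
    def __init__(self) -> None:
        # A fresh set per instance; no default parameter to share.
        self.term_matches: set[str] = set()

    @classmethod
    def with_term_matches(cls, source: "Fixed") -> "Fixed":
        # Copy-vs-share is explicit: the new accumulator gets its own copy.
        acc = cls()
        acc.term_matches = set(source.term_matches)
        return acc


a, b = Buggy(), Buggy()
a.term_matches.add("x")
print("x" in b.term_matches)  # True -- the shared default leaks state

c = Fixed()
c.term_matches.add("x")
d = Fixed.with_term_matches(c)
d.term_matches.add("y")
print(sorted(c.term_matches), sorted(d.term_matches))  # ['x'] ['x', 'y']
```

The factory makes the copy visible at the call site, which is the point of the change: a reader of the call site can see that mutations to the new accumulator cannot reach the source.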

Second change:
hit_count=1 for non-exact matches in collections.py MatchAccumulator.add inflated hit counts for related-only matches. Fixed to hit_count=0.

added comments to the else branches

added additional tests
… the object — including inherited methods, dunder methods (__init__, __eq__, …), and class-level descriptors.

The fix switches to vars(self), which returns only the instance's __dict__ — i.e. the actual dataclass field values.

added testcases
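The dir()-vs-vars() difference the fix relies on can be shown with a minimal dataclass. Facet and the helper functions here are hypothetical, not the actual search.py code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Facet:
    name: str
    value: Optional[str] = None
    _cache: int = 0  # private field, filtered out of the repr


def noisy_fields(obj) -> list[str]:
    # dir() returns every attribute name reachable on the object and its
    # class: inherited methods, dunders (__init__, __eq__, ...), descriptors.
    return list(dir(obj))


def clean_repr(obj) -> str:
    # vars() returns only the instance __dict__ -- the actual field values.
    parts = [
        f"{k}={v!r}"
        for k, v in vars(obj).items()
        if not k.startswith("_") and v is not None
    ]
    return f"{type(obj).__name__}({', '.join(parts)})"


f = Facet("color", "blue")
print("__init__" in noisy_fields(f))  # True -- dir() includes dunders
print(clean_repr(f))                  # Facet(name='color', value='blue')
```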

removed unused method from class
— produced dataclass repr strings instead of actual facet values.

Off-by-one in get_enclosing_date_range_for_text_range
— used exclusive end ordinal directly, potentially indexing past the last message.
Fixed to use message_ordinal - 1.

added testcases
…ding a full chunk), the merged result would begin with a spurious separator like "\n\n". The fix adds a guard

added testcases
The coverage package is a dev/test dependency.

"user" role correctly but defaulted everything else to "assistant" — including "system" prompts.
MCP's SamplingMessage only accepts "user" or "assistant" roles.
A "system" prompt section from TypeChat would silently be sent as "assistant", which changes LLM behavior.

added testcases
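One way to handle the role mismatch is to partition prompt sections before building the sampling request, routing "system" text to the request's system-prompt field instead of mislabeling it as an assistant turn. The helper below is a hedged sketch, not the actual server.py code:

```python
from typing import Optional


def split_prompt_sections(
    sections: list[tuple[str, str]],
) -> tuple[Optional[str], list[dict]]:
    """Partition prompt sections into (system_prompt, sampling_messages)."""
    system_parts: list[str] = []
    messages: list[dict] = []
    for role, text in sections:
        if role == "system":
            # System text goes to the request's system-prompt field rather
            # than being silently downgraded to an "assistant" turn.
            system_parts.append(text)
        elif role in ("user", "assistant"):
            # The only roles SamplingMessage accepts.
            messages.append({"role": role, "content": text})
        else:
            raise ValueError(f"Unsupported role: {role!r}")
    system = "\n".join(system_parts) if system_parts else None
    return system, messages


system, msgs = split_prompt_sections(
    [("system", "You are terse."), ("user", "Hi"), ("assistant", "Hello")]
)
print(system)                     # You are terse.
print([m["role"] for m in msgs])  # ['user', 'assistant']
```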
dunder methods (__init__, __eq__, __hash__, …), and descriptors.

The fix switches to vars(self), which returns only the instance's __dict__ — the actual dataclass field values.
Combined with the not key.startswith("_") and is not None filters, the repr is now clean.

added testcases
…buteError.

The fix introduces a local ordinal = 0 counter incremented on each iteration,
which works regardless of the message implementation.

added message.timestamp null guard

Fix related to the changes in SemanticRefAccumulator:
use the with_term_matches classmethod factory, which copies the search-term provenance set,
so mutations to the filtered accumulator's set don't affect the source.

added testcases
Copilot AI review requested due to automatic review settings February 27, 2026 15:48
Contributor

Copilot AI left a comment


Pull request overview

This pull request contains bugfixes and improvements for the upcoming 0.4.0 release. The most significant change is the migration from direct OpenAI SDK usage to a provider-agnostic architecture using pydantic_ai, which enables support for 25+ AI providers (OpenAI, Anthropic, Google, Azure, etc.) through a unified interface. Additionally, the PR fixes numerous bugs across the codebase including string formatting errors, logic bugs, and API design issues.

Changes:

  • Introduced provider-agnostic model adapters using pydantic_ai for chat and embedding models
  • Fixed multiple critical bugs including provenance copying, facet serialization, and timestamp handling
  • Added comprehensive test coverage for new functionality and bug fixes
  • Moved pydantic-ai-slim from dev dependencies to runtime dependencies

Reviewed changes

Copilot reviewed 65 out of 66 changed files in this pull request and generated 1 comment.

File Description
uv.lock Moved pydantic-ai-slim[openai] from dev to runtime dependencies
pyproject.toml Updated dependency specification for pydantic-ai-slim
src/typeagent/aitools/model_adapters.py New module providing provider-agnostic chat and embedding model adapters
src/typeagent/aitools/embeddings.py Refactored to protocol-based interfaces (IEmbedder, IEmbeddingModel)
src/typeagent/aitools/vectorbase.py Updated to support dynamic embedding size discovery and improved error handling
src/typeagent/aitools/utils.py Fixed regex for parsing Azure endpoints with ampersand-separated query params
src/typeagent/knowpro/query.py Fixed manual ordinal counting and timestamp None guards
src/typeagent/knowpro/searchlang.py Removed duplicate methods, fixed repr, added missing action_group append
src/typeagent/knowpro/search.py Fixed repr to use vars() instead of dir()
src/typeagent/knowpro/collections.py Fixed hit_count initialization for non-exact matches and provenance copying
src/typeagent/knowpro/answers.py Fixed facet value stringification and off-by-one error in date range lookup
src/typeagent/knowpro/convknowledge.py Removed deprecated create_typechat_model function
src/typeagent/knowpro/convsettings.py Updated to use new model adapter interfaces
src/typeagent/knowpro/conversation_base.py Updated to use new chat model creation pattern
src/typeagent/knowpro/fuzzyindex.py Fixed deserialization logic
src/typeagent/knowpro/knowledge.py Removed max_retries parameter
src/typeagent/knowpro/serialization.py Added embedding size metadata to file headers
src/typeagent/knowpro/interfaces_storage.py Removed embedding_size field from ConversationMetadata
src/typeagent/knowpro/interfaces_search.py Removed duplicate all definition
src/typeagent/emails/email_import.py Fixed separator prepending in chunk merging
src/typeagent/emails/email_memory.py Updated to use new chat model creation
src/typeagent/mcp/server.py Added coverage import guard and match statement default case
src/typeagent/storage/sqlite/provider.py Removed embedding_size consistency checks, now checks size consistency between indexes
src/typeagent/storage/sqlite/messageindex.py Added empty chunks guard
src/typeagent/transcripts/transcript.py Fixed f-string formatting and reads embedding size from file metadata
src/typeagent/podcasts/podcast.py Same f-string and metadata reading fixes as transcripts
tools/query.py Updated imports to use new model adapters
tools/ingest_vtt.py Updated to use new embedding model creation
tests/test_*.py Comprehensive test updates to use new interfaces and added extensive new test coverage
AGENTS.md Added deprecation guideline
Comments suppressed due to low confidence (1)

src/typeagent/knowpro/searchlang.py:626

  • The duplicate method add_search_term_to_groupadd_entity_name_to_group with the typo in its name has been removed. This method name appears to be a copy-paste error combining two method names, and its implementation is identical to add_entity_name_to_group above it.
    def add_property_term_to_group(
        self,
        property_name: str,
        property_value: str,
        term_group: SearchTermGroup,
        exact_match_value=False,
    ) -> None:
        if not self.is_searchable_string(property_name):
            return
        if not self.is_searchable_string(property_value):
            return
        if self.is_noise_term(property_value):
            return
        # Dedupe any terms already added to the group earlier.
        if not self.dedupe or not self.entity_terms_added.has(

Comment on lines +118 to +126
    async def get_embedding_nocache(self, input: str) -> NormalizedEmbedding:
        result = await self._embedder.embed_documents([input])
        embedding: NDArray[np.float32] = np.array(
            result.embeddings[0], dtype=np.float32
        )
        norm = float(np.linalg.norm(embedding))
        if norm > 0:
            embedding = (embedding / norm).astype(np.float32)
        return embedding

Copilot AI Feb 27, 2026


The get_embedding_nocache method doesn't validate that the input string is non-empty before calling the embedding API. According to the test test_get_embedding_nocache_empty_input, empty strings should raise a ValueError with "Empty input text". Consider adding validation: if not input: raise ValueError("Empty input text").
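A minimal sketch of the guard the review asks for; the function name here is hypothetical, and in the real method the check would precede the embed_documents call:

```python
def validate_embedding_input(text: str) -> str:
    # Reject empty input before any embedding API call is made, matching
    # the ValueError message the test expects.
    if not text:
        raise ValueError("Empty input text")
    return text


print(validate_embedding_input("hello"))  # hello
```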

    if not azure_endpoint:
        raise RuntimeError(f"Environment variable {endpoint_envvar} not found")

    m = re.search(r"[?,]api-version=([\d-]+(?:preview)?)", azure_endpoint)
Collaborator


We don't need the comma here -- I just had the URL format wrong.
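A quick check of the separator classes under discussion: with '&' accepted and the comma dropped per this comment, the pattern still extracts the api-version from an ampersand-separated endpoint. The URL below is illustrative:

```python
import re

# '?' or '&' as separator; the comma alternative is dropped.
pattern = r"[?&]api-version=([\d-]+(?:preview)?)"

url = (
    "https://myhost.openai.azure.com/openai/deployments/gpt-4"
    "?foo=bar&api-version=2024-06-01"
)
m = re.search(pattern, url)
print(m.group(1) if m else None)  # 2024-06-01
```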

"""api-version preceded by comma (alternate separator)."""
monkeypatch.setenv(
"TEST_ENDPOINT",
"https://myhost.openai.azure.com/openai/deployments/gpt-4?foo=bar,api-version=2024-06-01",
Collaborator


This isn't even a valid URL.

Comment on lines +270 to 278
    @classmethod
    def with_term_matches(
        cls, source: "SemanticRefAccumulator"
    ) -> "SemanticRefAccumulator":
        """Create a new accumulator inheriting a copy of *source*'s term-match provenance."""
        acc = cls()
        acc.search_term_matches = set(source.search_term_matches)
        return acc

Collaborator


I'd write this as a clone() method on an instance.

def clone(self):
    acc = self.__class__()
    acc.search_term_matches = set(self.search_term_matches)
    return acc

@@ -516,12 +537,13 @@ def add_ranges(self, text_ranges: "list[TextRange] | TextRangeCollection") -> No
def is_in_range(self, inner_range: TextRange) -> bool:
Collaborator


I realize that is_in_range is a terrible name. It should be contains_range or maybe __contains__.
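The __contains__ rename would let callers write `inner in ranges`. A sketch with stand-in types (TextRange here is not the real knowpro type, and half-open intervals are assumed):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRange:
    start: int
    end: int  # exclusive


class TextRangeCollection:
    def __init__(self) -> None:
        self._ranges: list[TextRange] = []

    def add(self, r: TextRange) -> None:
        self._ranges.append(r)

    def __contains__(self, inner: TextRange) -> bool:
        # True if some stored range fully covers inner (half-open intervals).
        return any(
            outer.start <= inner.start and inner.end <= outer.end
            for outer in self._ranges
        )


ranges = TextRangeCollection()
ranges.add(TextRange(0, 10))
print(TextRange(2, 5) in ranges)   # True
print(TextRange(8, 12) in ranges)  # False
```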


class SemanticRefAccumulator(MatchAccumulator[SemanticRefOrdinal]):
    def __init__(self, search_term_matches: set[str] = set()):
        """Accumulates scored semantic reference matches.
Collaborator


It will take me a bit more time to review this particular commit more carefully. It seems you're on to something, but I think I don't recall how it originally worked or how it's supposed to work.

Comment on lines +372 to +373
        if use_or_max and action_group.terms:
            term_group.terms.append(action_group)
Collaborator


This addition wasn't mentioned in the commit description.

return None
end_timestamp = (
(await messages.get_item(range.end.message_ordinal)).timestamp
(await messages.get_item(range.end.message_ordinal - 1)).timestamp
Collaborator


Everything here is supposed to be half-open intervals, and it really should be the (start) timestamp of the message following the range. But this can raise IndexError if the range ends at the end of the messages array. Maybe we could check for that edge case and use either None or the timestamp of the final message plus 1 second or something like that. But I think this is just wrong.
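The edge case raised here can be sketched with stand-ins (this is not the knowpro API; the timestamps are illustrative strings):

```python
from typing import Optional


def end_timestamp(messages: list[dict], end_ordinal: int) -> Optional[str]:
    # Half-open range [start, end): the natural end bound is the start
    # timestamp of the message at the exclusive end ordinal, if it exists.
    if end_ordinal < len(messages):
        return messages[end_ordinal]["timestamp"]
    # The range ends at the end of the message list: there is no following
    # message, so fall back to None (callers must handle the open bound).
    return None


msgs = [{"timestamp": "T0"}, {"timestamp": "T1"}, {"timestamp": "T2"}]
print(end_timestamp(msgs, 2))  # T2
print(end_timestamp(msgs, 3))  # None
```

Whether the fallback should be None or a synthesized value (final timestamp plus a small epsilon, as the comment suggests) is a design choice this sketch leaves open.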

@@ -0,0 +1,167 @@
# Copyright (c) Microsoft Corporation.
Collaborator


I feel this file's name doesn't have to end in '_unit'.

    if range_start_ordinal >= 0:
        # We have a range, so break.
        break
    ordinal += 1
Collaborator


I'd use enumerate() in the for loop.
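The suggestion, illustrated with a toy message list (query.py iterates real message objects, but the counter bookkeeping is the same):

```python
messages = ["a", "b", "c"]

# Manual counter, as in the fix described earlier in the thread:
ordinal = 0
found_manual = -1
for msg in messages:
    if msg == "b":
        found_manual = ordinal
        break
    ordinal += 1

# enumerate() does the same bookkeeping for free and works for any
# iterable of messages, regardless of the message implementation:
found_enumerate = next(
    (i for i, msg in enumerate(messages) if msg == "b"), -1
)
print(found_manual, found_enumerate)  # 1 1
```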

Comment on lines -107 to -108
range_start_ordinal = message.ordinal
range_end_ordinal = message.ordinal
Collaborator


Huh. I don't think messages ever have an ordinal field... Good catch!
