Use 427 tokenizer #9

Open
JPrevost wants to merge 9 commits into main from use-427-tokenizer

Conversation


@JPrevost JPrevost commented Mar 6, 2026

Why are these changes being introduced:

  • This is the core work to put the tokenizer lambda in place, which is a
    critical component of the overall system. It will be used to tokenize
    queries for use in the search process.

Relevant ticket(s):

How does this address that need:

  • This change introduces the tokenizer lambda
  • It includes the code for the lambda itself, as well as tests and
    documentation for how to use it.

Document any side effects to this change:

  • Adjusted some linting rules from the template to allow better
    understanding of what we are skipping and added some skips as
    appropriate (mostly around tests).
  • Leaves the templated my_function in place as a placeholder. This will
    likely be removed in the near future.
  • Introduces files from a Hugging Face model without an automated way
    to update them. Whether this is temporary or not will be based on how
    often we need to update those files and whether we can find a better
    way to manage them.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-427

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

JPrevost added 2 commits March 6, 2026 15:59
- remove dependabot config
- update pre-commit settings to pre-push and remove audit
- add support for log levels and removed `configure_dev_logger()`
- add dotenv support
- update license
- update makefile linting commands, adds console command
- removes template info from readme

qltysh bot commented Mar 6, 2026

❌ 1 blocking issue (2 total)

Tool | Category | Rule | Count
bandit | Dependency Alert | Unsafe Hugging Face Hub download without revision pinning in from_pretrained() | 1
radarlint-python | Lint | Do not perform equality checks with floating point values. | 1

# Model: opensearch-neural-sparse-encoding-doc-v3-gte, stored in the repo
# to avoid network calls at Lambda cold start
tokenizer_path = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
Unsafe Hugging Face Hub download without revision pinning in from_pretrained() [bandit:B615]

Member Author

I'm not sure whether to disable this rule for this line, since it isn't a real concern here (we store the relevant files locally), or to pin a revision even though it wouldn't really indicate which version of the files has been added to the repo.
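One way to satisfy bandit while also making the no-network guarantee explicit is to pass both `revision` and `local_files_only` to `from_pretrained()`. A minimal sketch; the helper name and the pinned SHA below are hypothetical, not part of this PR:

```python
# Hypothetical sketch: keyword arguments that make AutoTokenizer.from_pretrained()
# deterministic and offline. PINNED_REVISION is a placeholder, not the real
# Hub commit of the vendored files.
TOKENIZER_PATH = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
PINNED_REVISION = "0000000"  # placeholder: the Hub commit the vendored files came from


def tokenizer_kwargs() -> dict:
    """Pin a revision (satisfies bandit B615) and forbid network access at cold start."""
    return {"revision": PINNED_REVISION, "local_files_only": True}


# Usage (requires transformers):
# tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, **tokenizer_kwargs())
```

With `local_files_only=True`, a missing local snapshot fails fast instead of silently reaching out to the Hub during a Lambda cold start.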

should_clauses = result["query"]["bool"]["should"]
assert len(should_clauses) == 1
assert should_clauses[0]["rank_feature"]["field"] == "embedding_full_record.hello"
assert should_clauses[0]["rank_feature"]["boost"] == 1.5
Do not perform equality checks with floating point values. [radarlint-python:python:S1244]
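A common fix for S1244 in tests is a tolerance-based comparison; a sketch using the stdlib (`pytest.approx` would work equally well inside a pytest suite):

```python
import math

# Exact equality on floats is fragile: in binary floating point, 0.1 + 0.2
# is 0.30000000000000004, not 0.3.
assert 0.1 + 0.2 != 0.3

# Tolerance-based comparison is the usual fix flagged by S1244.
boost = 0.1 + 0.2
assert math.isclose(boost, 0.3, rel_tol=1e-9)
```

For the assertion in this diff, that would mean comparing the `boost` value with `math.isclose(..., 1.5)` rather than `== 1.5`.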

Copilot AI left a comment

Pull request overview

This PR introduces a new “tokenizer” AWS Lambda that converts an input query string into OpenSearch rank_feature clauses using a locally-vendored OpenSearch/Hugging Face tokenizer + IDF weights (USE-427).

Changes:

  • Add QueryTokenizer + tokenizer_handler Lambda implementation and update the container entrypoint.
  • Vendor OpenSearch model tokenizer artifacts into the repo and document local usage.
  • Add unit tests and SAM assets; update dependencies and lint/Makefile workflows to support the new Lambda.
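Based on the test assertions quoted later in this review, the handler's output shape can be sketched roughly as follows; the function name and default field prefix are illustrative (inferred from the tests), not the PR's actual implementation:

```python
def build_rank_feature_query(
    token_weights: dict[str, float],
    field_prefix: str = "embedding_full_record",  # prefix inferred from the tests; may differ
) -> dict:
    """Turn token weights into an OpenSearch bool/should query of rank_feature clauses."""
    should = [
        {"rank_feature": {"field": f"{field_prefix}.{token}", "boost": weight}}
        for token, weight in token_weights.items()
    ]
    return {"query": {"bool": {"should": should}}}


# A single token with weight 1.5 yields one rank_feature clause.
result = build_rank_feature_query({"hello": 1.5})
```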

Reviewed changes

Copilot reviewed 17 out of 20 changed files in this pull request and generated 8 comments.

File | Description
lambdas/tokenizer_handler.py | New Lambda handler that validates input and builds an OpenSearch query from token weights.
lambdas/query_tokenizer.py | Implements query tokenization + IDF weighting using a local Hugging Face tokenizer.
lambdas/config.py | Adds dotenv loading and makes log level configurable via env var.
Dockerfile | Points the Lambda CMD to lambdas.tokenizer_handler.lambda_handler.
pyproject.toml | Adds runtime deps (python-dotenv, transformers[torch]) and updates Ruff config.
uv.lock | Locks new ML/dependency graph (transformers/torch/tokenizers/etc.) and dev typing dependency.
tests/test_tokenizer_handler.py | Adds unit tests for handler output shape and input validation behavior.
tests/test_query_tokenizer.py | Adds unit tests for IDF loading and token-to-weight mapping behavior.
tests/sam/template.yaml | Updates SAM template to define the Tokenizer function and required env vars.
tests/sam/event.json | Adds a sample event payload for local SAM invocation.
tests/sam/env.json.template | Updates SAM env template to match the new function name and vars.
tests/sam/template.yaml | Adjusts function naming/architecture and environment variables for local runs.
Makefile | Reworks lint targets around Ruff formatting and adds SAM invoke + console helpers.
README.md | Replaces template README with tokenizer-specific usage and local testing instructions.
LICENSE | Replaces placeholder copyright owner line.
.pre-commit-config.yaml | Changes default stage to pre-push and trims commented-out hooks.
opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte/tokenizer_config.json | Adds tokenizer config artifact from the model.
opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte/README.md | Documents provenance/license for vendored model artifacts.
.github/dependabot.yml (removed) | Removes Dependabot configuration file.


Comment on lines +4 to +8
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from aws_lambda_typing.context import Context


Copilot AI Mar 6, 2026

Context is only imported under TYPE_CHECKING, but it's referenced in the runtime function signature. Without postponed evaluation of annotations, importing this module will raise NameError. Import Context at runtime or annotate as a string (or enable from __future__ import annotations).

Copilot uses AI. Check for mistakes.
Member Author

Is this accurate with Python 3.14?

Member Author

I'm fairly certain this is not accurate with Python 3.14, which this project is using.

https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-deferred-annotations
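In Python 3.14, annotations are deferred by default (PEP 649), so a TYPE_CHECKING-only import used in a runtime signature no longer fails at import time. The same behavior can be demonstrated on earlier versions with the `__future__` import; `Decimal` below is just a stand-in for `Context`:

```python
from __future__ import annotations  # opt-in to the lazy evaluation 3.14 makes default

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from decimal import Decimal  # never imported at runtime, like Context here


def handler(event: dict, context: Decimal) -> dict:
    # The unquoted annotation is stored lazily, so defining and calling this
    # function raises no NameError even though Decimal is not bound at runtime.
    return {"statusCode": 200}
```

Without either deferred evaluation or quoting, defining `handler` would raise `NameError: name 'Decimal' is not defined` on versions before 3.14.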

        idf_path = tokenizer_path + "/idf.json"
        self.idf = self._load_idf(idf_path)

    def _load_idf(self, idf_path: str | os.PathLike[str]) -> torch.Tensor:

Copilot AI Mar 6, 2026

os is only imported under TYPE_CHECKING, but it’s used in the _load_idf type annotation (os.PathLike[...]). Without postponed evaluation of annotations this will raise NameError at import time. Import os at runtime or change the annotation to a string / enable from __future__ import annotations.

Comment on lines +9 to +16
from lambdas.config import Config, configure_logger, configure_sentry
from lambdas.query_tokenizer import QueryTokenizer

# ---------------------------------------
# One-time, Lambda cold start setup
# ---------------------------------------
CONFIG = Config()


Copilot AI Mar 6, 2026

CONFIG = Config() is created at import time but never used in this module. Either remove it, or use it for something tangible (e.g., CONFIG.check_required_env_vars() during cold start) to avoid dead code and unnecessary side effects.

Suggested change

Before:

    from lambdas.config import Config, configure_logger, configure_sentry
    from lambdas.query_tokenizer import QueryTokenizer

    # ---------------------------------------
    # One-time, Lambda cold start setup
    # ---------------------------------------
    CONFIG = Config()

After:

    from lambdas.config import configure_logger, configure_sentry
    from lambdas.query_tokenizer import QueryTokenizer

    # ---------------------------------------
    # One-time, Lambda cold start setup
    # ---------------------------------------
Comment on lines 63 to 64
# allow autofix behavior for specified rules
fixable = ["E", "F", "I", "Q"]


Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s now a dangling comment (“allow autofix behavior for specified rules”) but the corresponding fixable = [...] setting was removed. Either delete the comment or reintroduce the setting so the intent of the Ruff config is clear.
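If the setting was meant to stay, reintroducing it next to the comment might look like the fragment below. This is a sketch: the rule list is taken from the removed line, and the exact table (`[tool.ruff.lint]` vs. the older top-level `[tool.ruff]`) depends on the Ruff version pinned in pyproject.toml.

```toml
[tool.ruff.lint]
# allow autofix behavior for specified rules
fixable = ["E", "F", "I", "Q"]
```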
