release 0.1.0: NeurIPS 2025 #69
Merged
New LLM generation workflow.
* add an empty .env
* refactor OpenAI util class
* use new OpenAI client in main
* assume .env unchanged
* fix: response processing
* use new Gemini client in main
* enable reasoning effort from cli
* document why two gemini wrapper
* add Claude API
* add claude models to supported list
* handle UnionType for Literal ReasoningEffort
* add vLLM support and use it as default option
* fix: use vLLM chat interface instead of gen
* env add vllm api key
* add VLLM_HOST and VLLM_PORT
* add vllm server mode
* add vLLM in dependencies
* doc: instruct to run vllm from uv
* make deprecated ollama a standalone script
* doc: revise ollama
* use 3.12
* add Ollama models
* fix: ollama model name
* fix: ollama model name
* fix: Gemini use its own EFFORT_TOKEN_MAP
* remove unused imports
* fix: google-genai version
* fix: ci with uv run

* update tfb to huggingface with base and pure splits
* feat: load tfbench from huggingface
* remove mandatory path

* answer cannot be None from LM
* move evaluation logic inside the tfbench package
* fix: orjson writes binary, error is not an option
* fix: use pure as parameter in main._eval

* fix: allow generation to fail
* remove unnecessary imports
* fix: OpenAI response add reasoning summary
* fix: load_gen_results_json type
* fix: analysis_saved script
* fix: evaluation benchmark name
* fix: OpenAI response API add summary
* use pydantic-v2
* extract incorrect task-answer pairs
* fix: groundtruth error (#63)
* fix: missing type class and typevar in benchmark
* fix: order of tasks in tfb
* fix: allow load_gen_results to load error
* remove error_cls unused imports
* extract type variables from source code
* add GHC type check by proving type equiv
* fix: cp -> process
* fix: API change for AST
* feat: type prover support new type definition
* test: ghc and type_util
* feat: use prover_evaluate for base split
* test: add real tfbench test cases, which the deprecated evaluation failed
* alt error to syntax parsing error
* feat: typeclass constraints reorder
* fix: AST.get_all_nodes_of_type ignores the root itself
* reorder_constraints using compiler frontend static analysis
* feat: add type definitions for pure tasks
* test: check type equivalence prover after rewriting mono types
* fix: handle type classes alone when adding new definitions
* feat: define new types automatically for pure tasks
* ghc prover remove standalone type class
* doc: detailed docstring for prover_evaluate
* script: analysis_saved run both split
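
The constraint-reordering items above address the fact that two Haskell signatures can list the same type class constraints in a different order, for example `(Ord b, Eq a) => ...` and `(Eq a, Ord b) => ...`, so a plain string comparison would reject a correct answer. The commit list says the repository does this with compiler frontend static analysis. The sketch below swaps in a much simpler string-based approach to show the idea only. The function name `normalize_constraints` is hypothetical, and the comma split assumes no constraint contains a nested comma.

```python
def normalize_constraints(signature: str) -> str:
    """Sort the constraint context of a Haskell signature into a canonical order.

    Simplification for illustration: splits on commas, so constraints that
    themselves contain commas (e.g. tuple arguments) are not supported.
    """
    context, arrow, body = signature.partition("=>")
    if not arrow:
        # No constraint context at all; only trim whitespace.
        return signature.strip()
    context = context.strip()
    # Drop the surrounding parentheses of a constraint tuple, if present.
    if context.startswith("(") and context.endswith(")"):
        context = context[1:-1]
    constraints = sorted(c.strip() for c in context.split(",") if c.strip())
    if len(constraints) == 1:
        return f"{constraints[0]} => {body.strip()}"
    return f"({', '.join(constraints)}) => {body.strip()}"


left = normalize_constraints("(Ord b, Eq a) => a -> b -> Bool")
right = normalize_constraints("(Eq a, Ord b) => a -> b -> Bool")
print(left == right)
```

Normalizing both the ground truth and the model's answer before comparison makes the check insensitive to constraint order. Proving equivalence with GHC, as the commits describe, covers far more cases than this textual normalization.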

* error analysis use prover
* error analysis script
* feat: record model name when doing error analysis
* add plot script for error analysis
* adjust row and column spacing
* update color map

* add transformers generation as default
* remove None option for router
* remove vllm option for ease of dependency
* Update src/tfbench/lm/_hf.py
* Update src/tfbench/lm/_hf.py
* remove unnecessary imports

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add transformers generation as default
* Update src/tfbench/lm/_hf.py
* Update src/tfbench/lm/_hf.py
* remove unnecessary imports
* doc: improve instructions
* fix: unused parameter and import
* enable github actions on main commits
* doc: add badges and images

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.