@EYH0602 EYH0602 commented Sep 30, 2025

No description provided.

EYH0602 and others added 25 commits August 13, 2025 22:04
New LLM generation workflow.

* add an empty .env

* refactor OpenAI util class

* use new OpenAI client in main

* assume .env unchanged

* fix: response processing

* use new Gemini client in main

* enable reasoning effort from cli

* document why there are two Gemini wrappers

* add Claude API

* add claude models to supported list

* handle UnionType for Literal ReasoningEffort

* add vLLM support and use it as default option

* fix: use vLLM chat interface instead of gen

* env add vllm api key

* add VLLM_HOST and VLLM_PORT

* add vllm server mode

* add vLLM in dependencies

* doc: instruct to run vllm from uv

* make deprecated ollama a standalone script

* doc: revise ollama

* use 3.12

* add Ollama models

* fix: ollama model name

* fix: ollama model name

* fix: Gemini use its own EFFORT_TOKEN_MAP

* remove unused imports

* fix: google-genai version

* fix: ci with uv run
* update tfb to huggingface with base and pure splits

* feat: load tfbench from huggingface

* remove mandatory path
* answer cannot be None from LM

* move evaluation logic inside the tfbench package

* fix: orjson writes binary, error is not an option

* fix: use pure as parameter in main._eval
* fix: allow generation to fail

* remove unnecessary imports

* fix: OpenAI response add reasoning summary

* fix: load_gen_results_json type

* fix: analysis_saved script

* fix: evaluation benchmark name

* fix: OpenAI response API add summary

* use pydantic-v2

* extract incorrect task-answer pairs

* fix: groundtruth error (#63)

* fix: missing type class and typevar in benchmark

* fix: order of tasks in tfb

* fix: allow load_gen_results to load error

* remove error_cls unused imports

* extract type variables from source code

* add GHC type check by proving type equiv

* fix: cp -> process

* fix: API change for AST

* feat: type prover support new type definition

* test: ghc and type_util

* feat: use prover_evaluate for base split

* test: add real tfbench test cases, which the deprecated evaluation failed

* alt error to syntax parsing error

* feat: typeclass constraints reorder

* fix: AST.get_all_nodes_of_type ignores the root itself

* reorder_constraints using compiler frontend static analysis

* feat: add type definitions for pure tasks

* test: check type equivalence prover after rewriting mono types

* fix: handle type classes alone when adding new definitions

* feat: define new types automatically for pure tasks

* ghc prover remove standalone type class

* doc: detailed docstring for prover_evaluate

* script: analysis_saved runs both splits
* error analysis use prover

* error analysis script

* feat: record model name when doing error analysis

* add plot script for error analysis

* adjust row and column spacing

* update color map
* add transformers generation as default

* remove None option for router

* remove vllm option for ease of dependency

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove unnecessary imports

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add transformers generation as default

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove unnecessary imports

* doc: improve instructions

* fix: unused parameter and import

* enable github actions on main commits

* doc: add badges and images

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@EYH0602 EYH0602 merged commit 9c09ce2 into main Sep 30, 2025
4 checks passed
@EYH0602 EYH0602 deleted the release-0.1.0 branch November 26, 2025 21:21
2 participants