
Conversation

@michaeltremeer
Contributor

Periodically (every 10s) alerts users when the average number of generated tokens is less than 90% of max_tokens. This is important if users are using the tool to estimate total requests per second based on their real-world context/response sizes, but the models are returning considerably shorter responses in testing (due to different/automated prompts).
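
For illustration, a minimal sketch of the kind of periodic check being described (the function and variable names here are hypothetical, not the tool's actual code):

```python
import logging
import time

UNDERSHOOT_THRESHOLD = 0.9  # warn when avg generated tokens < 90% of max_tokens
WARN_INTERVAL_SECONDS = 10  # emit at most one warning every 10s

def maybe_warn_on_short_completions(generated_token_counts, max_tokens, last_warn_time):
    """Warn if completions are, on average, much shorter than max_tokens.

    generated_token_counts: per-request generated-token counts observed so far.
    Returns the timestamp of the last warning (updated if one was emitted).
    """
    now = time.time()
    if not generated_token_counts or now - last_warn_time < WARN_INTERVAL_SECONDS:
        return last_warn_time
    avg_generated = sum(generated_token_counts) / len(generated_token_counts)
    if avg_generated < UNDERSHOOT_THRESHOLD * max_tokens:
        logging.warning(
            "average generated tokens (%.0f) is below %.0f%% of max_tokens (%d); "
            "requests-per-second estimates may not reflect real-world response sizes",
            avg_generated, UNDERSHOOT_THRESHOLD * 100, max_tokens,
        )
        return now
    return last_warn_time
```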

@technicianted
Contributor

I think a more suitable approach would be to add another metric that measures actual generated tokens, so that with each report the user can see both "attempted" and "actual". They can then decide on the relevance of the overall test results based on their use cases.

Contributor

@technicianted technicianted left a comment

Need a change of approach to a metrics output one.

@michaeltremeer
Contributor Author

michaeltremeer commented Jan 3, 2024

Those metrics should be part of the stats output; let me know what you think of those (I added 10th/90th/avg), and I'm happy to remove some or rename them. I do still think a warning is useful (the stats are easy to overlook if you're just running a single test and haven't dug into the tool). Maybe a single one after the test is finished/terminated?

@technicianted
Contributor

I think p10, p90 and avg should be fine. I'm wondering, though, if it would be better to compute the aggregate over the sliding window rather than per request? That way it would match, and be comparable to, the existing gen_tpm.

@michaeltremeer
Contributor Author

I'm not sure I understand - can you explain what you mean?

@technicianted
Contributor

I'm suggesting we add 3 new aggregate metrics in the sliding window that represent the p10, p90 and avg of actual tokens generated, instead of your proposal. This way users will have metrics directly comparable to the theoretical gen_tpm based on max_tokens.
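
For illustration, a rough sketch of what such per-window aggregates could look like, assuming the window keeps per-request generated-token counts with timestamps; the class and metric names (GeneratedTokensWindow, gen_tokens_p10, and so on) are invented for this example and are not the tool's actual code:

```python
import time
from collections import deque
from statistics import mean, quantiles

class GeneratedTokensWindow:
    """Sliding window of actual generated-token counts per request."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, generated_tokens) pairs

    def record(self, generated_tokens):
        self.samples.append((time.time(), generated_tokens))
        self._evict()

    def _evict(self):
        cutoff = time.time() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def aggregates(self):
        """Return p10/p90/avg of generated tokens over the current window."""
        self._evict()
        counts = [tokens for _, tokens in self.samples]
        if not counts:
            return {"gen_tokens_p10": 0, "gen_tokens_p90": 0, "gen_tokens_avg": 0}
        # quantiles(n=10) returns the 9 deciles; index 0 is p10, index 8 is p90
        deciles = quantiles(counts, n=10) if len(counts) > 1 else [counts[0]] * 9
        return {
            "gen_tokens_p10": deciles[0],
            "gen_tokens_p90": deciles[8],
            "gen_tokens_avg": mean(counts),
        }
```

Reporting these alongside gen_tpm would let a user compare "attempted" (based on max_tokens) against "actual" tokens generated over the same window.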

@yshahin
Contributor

yshahin commented Feb 21, 2024

Practically speaking, when I ran this with the existing contexts I did not see the number change; it is almost always a constant value.
Does it make sense to show the same number three times? I would think avg would be sufficient, maybe a 95th, but I don't see the benefit.

Contributor

@yshahin yshahin left a comment

Rebased and created a new PR here
#50

@yshahin yshahin closed this Mar 1, 2024