feat: Add from_borrowed() constructor #33
- `from_pretrained` now delegates to `from_raw_parts`
- Fixes BPE tokenizer support (`unk_token_id` is now optional)
Pringled
left a comment
Thanks for making this PR @zharinov! I think this is nice functionality to have, and good catch on the `unk_token`. I have two small (but nice-to-have) improvements; if you could implement those, this is good to go. Thanks for updating the tests as well 👍
@zharinov one additional comment: could you also run clippy to fix the formatting issues?
Title changed from "from_raw_parts() constructor" to "from_borrowed() constructor"
Hey, I wanted to support zero-copy initialization with The second attempt transforms Also, I applied the suggestion for
UPD: Once I had benchmarks set up locally, I did some additional research, in case you're interested. Here is the report:
// Before: manual loops
let mut sum = vec![0.0; dim];
for (i, &v) in row.iter().enumerate() {
    sum[i] += v * scale;
}
// After: ndarray vectorized ops
let mut sum = Array1::<f32>::zeros(dim);
sum.scaled_add(scale, &row);

// Before
*m.get(tok).unwrap_or(&tok)
// After
m.get(tok).copied().unwrap_or(tok)

// Before
pub fn encode(&self, sentences: &[String]) -> Vec<Vec<f32>>
// After
pub fn encode<S: AsRef<str>>(&self, sentences: &[S]) -> Vec<Vec<f32>>

// Before
fn pool_ids(&self, ids: Vec<u32>) -> Vec<f32>
// After
fn pool_ids(&self, ids: &[u32], max_length: Option<usize>) -> Vec<f32>

Benchmark Results:
Branch: https://github.com/zharinov/model2vec-rs/tree/opt/all-optimizations
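
To make the signature changes in the report concrete, here is a minimal, self-contained sketch. `ToyModel`, its `remap` table, the word-length "tokenizer", and the mean pooling are all assumptions made up for illustration; they are not the crate's actual types or logic. Only the two signatures and the `copied().unwrap_or(tok)` lookup mirror the snippets above.

```rust
/// Hypothetical stand-in for the model; not the crate's real type.
struct ToyModel {
    /// Illustrative remapping table, indexed by token id.
    remap: Vec<usize>,
}

impl ToyModel {
    /// Accepts any slice whose items can be viewed as `&str`,
    /// so `&[String]` and `&[&str]` both work without conversion.
    fn encode<S: AsRef<str>>(&self, sentences: &[S]) -> Vec<Vec<f32>> {
        sentences
            .iter()
            .map(|s| {
                // Toy "tokenizer" (assumption): one id per word, equal to the word length.
                let ids: Vec<u32> = s
                    .as_ref()
                    .split_whitespace()
                    .map(|w| w.len() as u32)
                    .collect();
                self.pool_ids(&ids, None)
            })
            .collect()
    }

    /// Borrows the ids instead of taking ownership of a `Vec<u32>`.
    /// Returns a one-element vector holding the mean of the remapped ids.
    fn pool_ids(&self, ids: &[u32], max_length: Option<usize>) -> Vec<f32> {
        // Truncate to `max_length` when given, never past the end of `ids`.
        let limit = max_length.map_or(ids.len(), |m| m.min(ids.len()));
        let kept = &ids[..limit];
        if kept.is_empty() {
            return vec![0.0];
        }
        let sum: f32 = kept
            .iter()
            .map(|&id| {
                let tok = id as usize;
                // Same lookup shape as in the report: fall back to `tok` itself.
                self.remap.get(tok).copied().unwrap_or(tok) as f32
            })
            .sum();
        vec![sum / kept.len() as f32]
    }
}
```

With this shape, a caller can pass `&["some text"]` or a `Vec<String>` slice to `encode` without first allocating owned strings, and `pool_ids` no longer forces the caller to give up its id buffer.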
@zharinov thanks for resolving the comments and for adding these features. Everything looks good to me; I'll include this in the next release!
Adds `from_raw_parts()` for constructing models from pre-parsed components; `from_pretrained()` now delegates to it.

Also fixes a bug where loading would fail if the tokenizer doesn't define an `unk_token` (not all tokenizers have one).
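
The pattern described here (a loader that parses its input and then delegates to a constructor taking pre-parsed parts, with the unknown token treated as optional) can be sketched roughly as follows. Everything in this block is hypothetical: `ToyModel`, its fields, the `from_text` loader, and the one-token-per-line text format are invented for illustration and do not reflect the real `from_raw_parts()` or `from_pretrained()` signatures.

```rust
/// Hypothetical model built from pre-parsed components (illustrative fields).
struct ToyModel {
    vocab: Vec<String>,
    embeddings: Vec<f32>, // row-major, `vocab.len() * dim` values
    dim: usize,
    unk_token_id: Option<usize>, // `None` when no unknown token is defined
}

impl ToyModel {
    /// Builds a model from components that are already in memory.
    fn from_raw_parts(
        vocab: Vec<String>,
        embeddings: Vec<f32>,
        dim: usize,
        unk_token: Option<&str>,
    ) -> Result<Self, String> {
        if dim == 0 || embeddings.len() != vocab.len() * dim {
            return Err(format!(
                "expected {} values, got {}",
                vocab.len() * dim,
                embeddings.len()
            ));
        }
        // Resolve the unknown token only if one is given; its absence is not an error.
        let unk_token_id = match unk_token {
            Some(tok) => Some(
                vocab
                    .iter()
                    .position(|t| t == tok)
                    .ok_or_else(|| format!("unk token {tok:?} not in vocab"))?,
            ),
            None => None,
        };
        Ok(Self { vocab, embeddings, dim, unk_token_id })
    }

    /// Stand-in loader: parses lines of `token v1 v2 ...`, then delegates.
    fn from_text(src: &str, unk_token: Option<&str>) -> Result<Self, String> {
        let mut vocab = Vec::new();
        let mut embeddings = Vec::new();
        let mut dim = 0;
        for line in src.lines() {
            let mut parts = line.split_whitespace();
            let Some(token) = parts.next() else { continue }; // skip blank lines
            let row: Vec<f32> = parts
                .map(|p| p.parse::<f32>().map_err(|e| e.to_string()))
                .collect::<Result<_, _>>()?;
            if vocab.is_empty() {
                dim = row.len(); // first row fixes the embedding width
            } else if row.len() != dim {
                return Err(format!("row for {token:?} has {} values, expected {dim}", row.len()));
            }
            vocab.push(token.to_string());
            embeddings.extend(row);
        }
        // All validation and unk-token handling lives in one place.
        Self::from_raw_parts(vocab, embeddings, dim, unk_token)
    }
}
```

Keeping the validation in the parts-based constructor means both entry points behave the same way, and passing `None` for the unknown token loads cleanly instead of failing.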