feat: @mkopcins/gemma4#1162
Conversation
b2b6a24 to
bf62c0b
Compare
bf62c0b to
938bc11
Compare
ec22b0d to
66b3d24
Compare
12f147e to
a606c79
Compare
There was a problem hiding this comment.
What was changed in the binaries? Asking for the record.
| inputs.push_back(llm::make_text_input(prompt.substr(searchPos))); | ||
| } | ||
| size_t imageIdx = 0, audioIdx = 0, pos = 0; | ||
| constexpr int32_t kAudioSampleRate = 16000; |
There was a problem hiding this comment.
What if in the future some other model will accept different sample rate or multiple sample rates? Every time we add such model, we will need to break this file that suppose to be general, for every llm. This is a code smell.
| std::string generateMultimodal( | ||
| std::string prompt, std::shared_ptr<jsi::Function> callback, | ||
| std::vector<std::string> imagePaths = {}, std::string imageToken = "", | ||
| std::vector<std::vector<float>> audioWaveforms = {}, |
There was a problem hiding this comment.
Again, what if in the future multimodal llm will accept video as an input, we need to add another positional argument and break common header. Bad code design that was there and we just extend it. But to be fair, if we plan to refactor whole codebase using this approach proposed by Bartek: see #1208, then I think it doesn't matter as much as if we would keep this codebase as standalone one.
| return module_->is_method_loaded(kAudioEncoderMethod); | ||
| } | ||
|
|
||
| int32_t AudioEncoder::encoderTokenCount() const { return last_token_count_; } |
| static_cast<long long>(n_valid), static_cast<long long>(k_blocks), | ||
| static_cast<long long>(kAudioBlockKMin), | ||
| static_cast<long long>(kAudioBlockKMax), | ||
| static_cast<int>(kSamplesPerBlock), |
There was a problem hiding this comment.
Why do you mix long long and int64_t?
| inline ::executorch::runtime::Result<::executorch::aten::Tensor> | ||
| decode(const ::executorch::runtime::EValue &embeddings, | ||
| const ::executorch::runtime::EValue &ple_tok, int64_t start_pos) { | ||
| auto start_pos_tensor = ::executorch::extension::from_blob( | ||
| &start_pos, {1}, ::executorch::aten::ScalarType::Long); | ||
| auto outputs_result = module_->execute( | ||
| kTextModelMethod, {embeddings, ple_tok, start_pos_tensor}); | ||
| if (!outputs_result.ok()) { | ||
| return outputs_result.error(); | ||
| } | ||
| auto &outputs = *outputs_result; | ||
| ET_CHECK_MSG(outputs.size() == 1, | ||
| "Expected 1 output from text_decoder, got %zu", | ||
| outputs.size()); | ||
| ET_CHECK_MSG(outputs[0].isTensor(), "text_decoder output is not a tensor"); | ||
| return outputs[0].toTensor(); | ||
| } |
There was a problem hiding this comment.
Similar to previous comments, PLE is related to gemma 4 or similar models. If there will be gemma 5 with another quirks, then we need to break this contract again.
| std::make_unique<ProbIndex<T>[]>(vocab_size_); | ||
| T sum = 0; | ||
| for (int i = 0; i < vocab_size_; i++) { | ||
| T e = static_cast<T>(expf(static_cast<float>(logits[i] - max_val))); |
There was a problem hiding this comment.
| T e = static_cast<T>(expf(static_cast<float>(logits[i] - max_val))); | |
| T e = static_cast<T>(std::expf(static_cast<float>(logits[i] - max_val))); |
| if (sum <= T(0)) { | ||
| return; | ||
| } | ||
| for (int i = 0; i < vocab_size_; i++) { |
There was a problem hiding this comment.
And fix other places as such
| for (int i = 0; i < vocab_size_; i++) { | |
| for (size_t i = 0; i < vocab_size_; i++) { |
| Sampler::Sampler(int vocab_size, float temperature, float topp, int32_t topk, | ||
| unsigned long long rng_seed) | ||
| : vocab_size_(vocab_size), | ||
| inv_temperature_((temperature != 0.0f) ? (1.0f / temperature) : 0.0f), | ||
| topp_(topp), min_p_(0.0f), repetition_penalty_(1.0f), topk_(topk), | ||
| rng_state_(rng_seed) {} | ||
|
|
||
| Sampler::Sampler(int vocab_size, float temperature, float topp, int32_t topk) | ||
| : Sampler(vocab_size, temperature, topp, topk, std::time(nullptr)) {} | ||
|
|
||
| Sampler::Sampler(int vocab_size, float temperature, float topp, | ||
| unsigned long long rng_seed) | ||
| : Sampler(vocab_size, temperature, topp, 0, rng_seed) {} | ||
|
|
||
| Sampler::Sampler(int vocab_size, float temperature, float topp) |
There was a problem hiding this comment.
Poor design, can't we somehow avoid 5 different constructors which take or not some subset of arguments?
| const AUDIO_SAMPLES_PER_BLOCK = 7680; | ||
| const AUDIO_TOKENS_PER_BLOCK = 12; |
There was a problem hiding this comment.
Similar to cpp, LLMController is a general file for LLMs, but AUDIO_SAMPLES_PER_BLOCK = 7680 is value specific for gemma 4, shouldn't be there.
There was a problem hiding this comment.
yes, absolutely agreed, those are leftovers from iterating and I missed them in self-review
msluszniak
left a comment
There was a problem hiding this comment.
CI needs to be fixed. Also please add testing instructions.
| for (const auto &input : inputs) { | ||
| if (input.is_image()) { | ||
| ET_CHECK_OR_RETURN_ERROR(image_encoder_ != nullptr, InvalidState, | ||
| "No image encoder registered"); | ||
| const int32_t num_visual = image_encoder_->encoderTokenCount(); | ||
| ET_CHECK_OR_RETURN_ERROR(num_visual > 0, InvalidState, | ||
| "Image encoder reports 0 visual tokens"); | ||
| image_slots.push_back(ImageSlot{&input, static_cast<int64_t>(ids.size()), | ||
| static_cast<int64_t>(num_visual)}); | ||
| ids.insert(ids.end(), static_cast<size_t>(num_visual), 0); | ||
| } else if (input.is_audio()) { | ||
| ET_CHECK_OR_RETURN_ERROR(audio_encoder_ != nullptr, InvalidState, | ||
| "No audio encoder registered"); | ||
| const long t_aud_begin = time_in_ms(); | ||
| auto enc = audio_encoder_->encode(input); | ||
| ET_CHECK_OK_OR_RETURN_ERROR(enc.error(), "Audio encoding failed"); | ||
| audio_encode_ms += time_in_ms() - t_aud_begin; | ||
| audio_calls += 1; | ||
| // Snapshot the encoder output NOW — see AudioSlot comment above for | ||
| // why the returned EValue's tensor metadata can't survive past the | ||
| // next module_->execute(). num_audio and audio_hidden are read from |
There was a problem hiding this comment.
As we dicussed, this must be refactored. It is hard to read, functions are way too long. Here each if block might be a separate function. Please create issue for the refactor if the multimodal prefiller if you don't want to handle this in the PR.
Description
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Test by running apps/llm app on llm screen (for text only model) and multimodal screen (for audio-vision-text model). Text model should work as any other llm model. Multimodal can process up-to-30sec audio chunks as well as image inputs, should be able to transcribe audio, describe pictures or similar.
Screenshots
Related issues
#1062
Checklist
Additional notes