feat: @mkopcins/gemma4 #1162

Open
mkopcins wants to merge 6 commits into main from @mkopcins/gemma4

Collaborator

@mkopcins mkopcins commented May 21, 2026

Description

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Run the apps/llm app and test on the LLM screen (text-only model) and on the multimodal screen (audio-vision-text model). The text model should behave like any other LLM. The multimodal model accepts audio chunks of up to 30 seconds as well as image inputs, and should be able to transcribe audio, describe pictures, and handle similar tasks.

Screenshots

Related issues

#1062

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@mkopcins mkopcins force-pushed the @mkopcins/gemma4 branch 4 times, most recently from b2b6a24 to bf62c0b Compare May 25, 2026 09:18
@mkopcins mkopcins force-pushed the @mkopcins/gemma4 branch from bf62c0b to 938bc11 Compare May 25, 2026 09:41
@mkopcins mkopcins force-pushed the @mkopcins/gemma4 branch from ec22b0d to 66b3d24 Compare June 1, 2026 09:30
@msluszniak msluszniak marked this pull request as ready for review June 1, 2026 15:32
@msluszniak msluszniak added the feature PRs that implement a new feature label Jun 1, 2026
@msluszniak msluszniak linked an issue Jun 1, 2026 that may be closed by this pull request
@msluszniak msluszniak self-requested a review June 1, 2026 15:33
@mkopcins mkopcins force-pushed the @mkopcins/gemma4 branch from 12f147e to a606c79 Compare June 2, 2026 08:16
Contributor

What was changed in the binaries? Asking for the record.

Member

Adding Vulkan to RNE

inputs.push_back(llm::make_text_input(prompt.substr(searchPos)));
}
size_t imageIdx = 0, audioIdx = 0, pos = 0;
constexpr int32_t kAudioSampleRate = 16000;
Member

@msluszniak msluszniak Jun 2, 2026

What if a future model accepts a different sample rate, or multiple sample rates? Every time we add such a model, we will have to modify this file, which is supposed to be general for every LLM. This is a code smell.
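
One possible direction, sketched under assumptions: the shared code receives a per-model description of the accepted audio format instead of owning a `constexpr`. The names `AudioInputSpec`, `supportsSampleRate` and `gemma4AudioSpec` are hypothetical and do not come from the PR; only the value 16000 is taken from the diff above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-model description of the audio input a model accepts.
struct AudioInputSpec {
  std::vector<int32_t> sampleRates; // every rate (Hz) the model can consume
};

// Returns true when the model accepts audio sampled at `rate` Hz.
inline bool supportsSampleRate(const AudioInputSpec &spec, int32_t rate) {
  return std::find(spec.sampleRates.begin(), spec.sampleRates.end(), rate) !=
         spec.sampleRates.end();
}

// Model-specific code owns the constant; 16000 is the value from the diff.
inline AudioInputSpec gemma4AudioSpec() { return AudioInputSpec{{16000}}; }
```

With this shape, a model that accepts several rates only lists them in its own spec, and the general file stays untouched.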

Comment on lines +30 to +33
std::string generateMultimodal(
std::string prompt, std::shared_ptr<jsi::Function> callback,
std::vector<std::string> imagePaths = {}, std::string imageToken = "",
std::vector<std::vector<float>> audioWaveforms = {},
Member

@msluszniak msluszniak Jun 2, 2026

Again, what if a future multimodal LLM accepts video as an input? We would need to add another positional argument and break the common header. This is bad design that was already there, and we are just extending it. To be fair, if we plan to refactor the whole codebase using the approach proposed by Bartek (see #1208), then it matters less than it would if we kept this codebase as a standalone one.
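
To illustrate the concern, here is a minimal sketch of an alternative in which media inputs travel in one list instead of one positional parameter per modality. All type and function names (`ImageInput`, `AudioInput`, `MediaInput`, `MultimodalRequest`, `countOf`) are hypothetical, and this is not the design proposed in #1208.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical input types; none of these names come from the PR.
struct ImageInput { std::string path; };
struct AudioInput { std::vector<float> waveform; };

// Adding a modality (e.g. video) means adding one alternative here,
// not another positional parameter on every generate function.
using MediaInput = std::variant<ImageInput, AudioInput>;

struct MultimodalRequest {
  std::string prompt;
  std::vector<MediaInput> media; // assumed to follow prompt order
};

// Counts the inputs of one modality, e.g. countOf<AudioInput>(request).
template <typename T>
std::size_t countOf(const MultimodalRequest &request) {
  std::size_t n = 0;
  for (const auto &m : request.media) {
    if (std::holds_alternative<T>(m)) {
      ++n;
    }
  }
  return n;
}
```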

return module_->is_method_loaded(kAudioEncoderMethod);
}

int32_t AudioEncoder::encoderTokenCount() const { return last_token_count_; }
Member

@msluszniak msluszniak Jun 2, 2026

This should probably be noexcept.

Comment on lines +77 to +80
static_cast<long long>(n_valid), static_cast<long long>(k_blocks),
static_cast<long long>(kAudioBlockKMin),
static_cast<long long>(kAudioBlockKMax),
static_cast<int>(kSamplesPerBlock),
Member

@msluszniak msluszniak Jun 2, 2026

Why do you mix long long and int64_t?
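
For reference, a small sketch of printing `int64_t` values without casting to `long long`, using the `PRId64` macro from `<cinttypes>`. The function name `formatBlocks` and the message format are made up for illustration, and whether the logging macro used in the PR forwards such format strings unchanged is an assumption.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats two block counts without casting int64_t to long long.
// PRId64 expands to the correct printf conversion for int64_t.
inline std::string formatBlocks(int64_t nValid, int64_t kBlocks) {
  char buf[64]; // large enough for both labels plus two 20-char values
  std::snprintf(buf, sizeof(buf), "n_valid=%" PRId64 " k_blocks=%" PRId64,
                nValid, kBlocks);
  return std::string(buf);
}
```

Keeping every value as `int64_t` and formatting with `PRId64` removes the need for the `static_cast<long long>` calls, so only one integer type appears at the call site.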

Comment on lines +80 to +96
inline ::executorch::runtime::Result<::executorch::aten::Tensor>
decode(const ::executorch::runtime::EValue &embeddings,
       const ::executorch::runtime::EValue &ple_tok, int64_t start_pos) {
  auto start_pos_tensor = ::executorch::extension::from_blob(
      &start_pos, {1}, ::executorch::aten::ScalarType::Long);
  auto outputs_result = module_->execute(
      kTextModelMethod, {embeddings, ple_tok, start_pos_tensor});
  if (!outputs_result.ok()) {
    return outputs_result.error();
  }
  auto &outputs = *outputs_result;
  ET_CHECK_MSG(outputs.size() == 1,
               "Expected 1 output from text_decoder, got %zu",
               outputs.size());
  ET_CHECK_MSG(outputs[0].isTensor(), "text_decoder output is not a tensor");
  return outputs[0].toTensor();
}
Member

Similar to the previous comments: PLE is specific to Gemma 4 and similar models. If a Gemma 5 arrives with other quirks, we will need to break this contract again.

std::make_unique<ProbIndex<T>[]>(vocab_size_);
T sum = 0;
for (int i = 0; i < vocab_size_; i++) {
T e = static_cast<T>(expf(static_cast<float>(logits[i] - max_val)));
Member

Suggested change
- T e = static_cast<T>(expf(static_cast<float>(logits[i] - max_val)));
+ T e = static_cast<T>(std::expf(static_cast<float>(logits[i] - max_val)));
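
For context, the line under review is the exponentiation step of a max-subtracted softmax. The following is a standalone sketch of that technique, not the PR's sampler code; the name `softmaxInPlace` is hypothetical. It uses the `float` overload of `std::exp` from `<cmath>`, which is one way to stay inside namespace `std` without relying on an unqualified `expf`.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax, applied in place.
inline void softmaxInPlace(std::vector<float> &logits) {
  if (logits.empty()) {
    return;
  }
  // Subtracting the maximum keeps every exponent <= 0, avoiding overflow.
  const float maxVal = *std::max_element(logits.begin(), logits.end());
  float sum = 0.0f;
  for (float &x : logits) {
    x = std::exp(x - maxVal); // float overload of std::exp
    sum += x;
  }
  // sum >= 1 here because the maximum element maps to exp(0) == 1.
  for (float &x : logits) {
    x /= sum;
  }
}
```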

if (sum <= T(0)) {
return;
}
for (int i = 0; i < vocab_size_; i++) {
Member

Please fix the other occurrences like this one as well.

Suggested change
- for (int i = 0; i < vocab_size_; i++) {
+ for (size_t i = 0; i < vocab_size_; i++) {

Comment on lines +208 to 222
Sampler::Sampler(int vocab_size, float temperature, float topp, int32_t topk,
                 unsigned long long rng_seed)
    : vocab_size_(vocab_size),
      inv_temperature_((temperature != 0.0f) ? (1.0f / temperature) : 0.0f),
      topp_(topp), min_p_(0.0f), repetition_penalty_(1.0f), topk_(topk),
      rng_state_(rng_seed) {}

Sampler::Sampler(int vocab_size, float temperature, float topp, int32_t topk)
    : Sampler(vocab_size, temperature, topp, topk, std::time(nullptr)) {}

Sampler::Sampler(int vocab_size, float temperature, float topp,
                 unsigned long long rng_seed)
    : Sampler(vocab_size, temperature, topp, 0, rng_seed) {}

Sampler::Sampler(int vocab_size, float temperature, float topp)
Member

@msluszniak msluszniak Jun 2, 2026

Poor design. Can't we avoid five different constructors that each take a different subset of the arguments?
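
One common way to collapse such overloads is a single constructor that takes an options struct with defaults. The sketch below is an assumption-laden illustration, not the PR's code: `SamplerOptions` and `SamplerSketch` are hypothetical names, the defaults for `temperature` and `topp` are placeholders, and only the `topk = 0` fallback, the clock-based seed and the inverse-temperature expression mirror the constructors in the diff.

```cpp
#include <cstdint>
#include <ctime>
#include <optional>

// Hypothetical options struct; callers set only the fields they care about.
struct SamplerOptions {
  float temperature = 1.0f;                  // placeholder default
  float topp = 1.0f;                         // placeholder default
  int32_t topk = 0;                          // 0 as in the delegating ctor
  std::optional<unsigned long long> rngSeed; // empty = seed from the clock
};

class SamplerSketch {
public:
  // One constructor replaces the overload set.
  SamplerSketch(int vocabSize, const SamplerOptions &opts)
      : vocabSize_(vocabSize),
        invTemperature_(opts.temperature != 0.0f ? 1.0f / opts.temperature
                                                 : 0.0f),
        topp_(opts.topp), topk_(opts.topk),
        rngState_(opts.rngSeed.value_or(
            static_cast<unsigned long long>(std::time(nullptr)))) {}

  int vocabSize() const noexcept { return vocabSize_; }
  float invTemperature() const noexcept { return invTemperature_; }
  float topp() const noexcept { return topp_; }
  int32_t topk() const noexcept { return topk_; }
  unsigned long long rngState() const noexcept { return rngState_; }

private:
  int vocabSize_;
  float invTemperature_;
  float topp_;
  int32_t topk_;
  unsigned long long rngState_;
};
```

A caller that only wants a fixed seed would default-construct `SamplerOptions`, set `rngSeed`, and pass it along; new parameters such as `min_p` could then be added as fields without touching existing call sites.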

Comment thread packages/react-native-executorch/src/constants/modelUrls.ts
Comment on lines +23 to +24
const AUDIO_SAMPLES_PER_BLOCK = 7680;
const AUDIO_TOKENS_PER_BLOCK = 12;
Member

@msluszniak msluszniak Jun 2, 2026

As in the C++ code: LLMController is a general file for LLMs, but AUDIO_SAMPLES_PER_BLOCK = 7680 is a value specific to Gemma 4 and shouldn't be there.

Collaborator Author

yes, absolutely agreed, those are leftovers from iterating and I missed them in self-review

Member

@msluszniak msluszniak left a comment

CI needs to be fixed. Also please add testing instructions.

Comment on lines +115 to +135
for (const auto &input : inputs) {
if (input.is_image()) {
ET_CHECK_OR_RETURN_ERROR(image_encoder_ != nullptr, InvalidState,
"No image encoder registered");
const int32_t num_visual = image_encoder_->encoderTokenCount();
ET_CHECK_OR_RETURN_ERROR(num_visual > 0, InvalidState,
"Image encoder reports 0 visual tokens");
image_slots.push_back(ImageSlot{&input, static_cast<int64_t>(ids.size()),
static_cast<int64_t>(num_visual)});
ids.insert(ids.end(), static_cast<size_t>(num_visual), 0);
} else if (input.is_audio()) {
ET_CHECK_OR_RETURN_ERROR(audio_encoder_ != nullptr, InvalidState,
"No audio encoder registered");
const long t_aud_begin = time_in_ms();
auto enc = audio_encoder_->encode(input);
ET_CHECK_OK_OR_RETURN_ERROR(enc.error(), "Audio encoding failed");
audio_encode_ms += time_in_ms() - t_aud_begin;
audio_calls += 1;
// Snapshot the encoder output NOW — see AudioSlot comment above for
// why the returned EValue's tensor metadata can't survive past the
// next module_->execute(). num_audio and audio_hidden are read from
Member

As we discussed, this must be refactored. It is hard to read, and the functions are far too long. Here, each if block could be a separate function. Please create an issue for the refactor of the multimodal prefiller if you don't want to handle it in this PR.
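
A rough sketch of the shape such a refactor could take: the loop body becomes a dispatch and each branch lives in its own small function. Everything here is a simplified stand-in (`Kind`, `Input`, `Slot`, `PrefillPlan`, `reserveSlot`, `addImage`, `addAudio`, `addText`, `buildPlan` are hypothetical); the real encoder calls, timing and error handling from the PR are deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the PR's input and slot types.
enum class Kind { Text, Image, Audio };
struct Input {
  Kind kind;
  int32_t tokenCount; // tokens this input contributes to the sequence
};
struct Slot {
  int64_t start;  // offset of the first placeholder id
  int64_t length; // number of placeholder ids reserved
};
struct PrefillPlan {
  std::vector<int64_t> ids;
  std::vector<Slot> imageSlots;
  std::vector<Slot> audioSlots;
};

// Reserves `count` placeholder ids (value 0) and records where they start.
inline Slot reserveSlot(std::vector<int64_t> &ids, int32_t count) {
  Slot slot{static_cast<int64_t>(ids.size()), count};
  ids.insert(ids.end(), static_cast<std::size_t>(count), 0);
  return slot;
}

inline void addImage(PrefillPlan &plan, const Input &in) {
  plan.imageSlots.push_back(reserveSlot(plan.ids, in.tokenCount));
}
inline void addAudio(PrefillPlan &plan, const Input &in) {
  plan.audioSlots.push_back(reserveSlot(plan.ids, in.tokenCount));
}
inline void addText(PrefillPlan &plan, const Input &in) {
  // Id 1 is a dummy value standing in for real tokenizer output.
  plan.ids.insert(plan.ids.end(), static_cast<std::size_t>(in.tokenCount), 1);
}

// The loop shrinks to a dispatch; each branch can be tested on its own.
inline PrefillPlan buildPlan(const std::vector<Input> &inputs) {
  PrefillPlan plan;
  for (const auto &in : inputs) {
    switch (in.kind) {
    case Kind::Image: addImage(plan, in); break;
    case Kind::Audio: addAudio(plan, in); break;
    case Kind::Text:  addText(plan, in);  break;
    }
  }
  return plan;
}
```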


Labels

feature PRs that implement a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gemma4 support

3 participants