feat(gateway): implement multimodal (image/document/audio) inbound support for LINE and Telegram #757
Pull request overview
This PR adds inbound multimodal support (images, text documents, and audio) for the Custom Gateway’s LINE and Telegram adapters, including shared gateway-side image resizing/compression and Core-side audio transcription via the configured STT.
Changes:
- Add gateway-side media utilities (resize_and_compress, size limits) and wire them into adapters.
- Implement Telegram + LINE inbound attachment downloading/encoding and inclusion in GatewayEvent attachments.
- Extend Core gateway adapter to decode attachments and (optionally) transcribe inbound audio when STT is enabled.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/main.rs | Pass STT config into the Core gateway adapter params. |
| src/gateway.rs | Convert gateway attachments into Core ContentBlocks, including audio transcription support. |
| gateway/src/media.rs | New shared media module for image resize/compress + download size limits. |
| gateway/src/main.rs | Register the new media module; minor formatting changes. |
| gateway/src/adapters/telegram.rs | Add inbound photo/document/audio handling and media download helpers (currently has compile/logic issues). |
| gateway/src/adapters/line.rs | Add inbound image/audio handling and LINE media download helper (currently has compile issues). |
| gateway/src/adapters/feishu.rs | Refactor to reuse shared media module; mostly formatting. |
| gateway/src/adapters/googlechat.rs | Formatting and test fixture updates to include empty attachments. |
| gateway/src/adapters/teams.rs | Formatting only. |
| gateway/Cargo.lock | Bump openab-gateway lockfile version entry. |
| docs/telegram.md | Document Telegram inbound file/image/audio support. |
| docs/line.md | Document LINE inbound image/audio support. |
    let Some(msg) = update.message else {
        return axum::http::StatusCode::OK;
    };
    let Some(text) = msg.text.as_deref() else {
        return axum::http::StatusCode::OK;
    };
    if text.trim().is_empty() {
    let is_voice = msg.voice.is_some();
    let is_audio = msg.audio.is_some();
    let text = msg.text.as_deref().or(msg.caption.as_deref()).unwrap_or("");

    if text.trim().is_empty() && !is_photo && !is_document && !is_voice && !is_audio {
        return axum::http::StatusCode::OK;
    }

    let mut attachments = Vec::new();
    if is_photo || is_document || is_voice || is_audio {

    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };

    caption: Option<String>,
    #[serde(default)]
    entities: Vec<TelegramEntity>,
    #[serde(default)]

    let is_text = msg.message_type == "text";
    let is_audio = msg.message_type == "audio";

    if !is_text && !is_image && !is_audio {
        continue;
    }

    let max_size = if attachment_type == "image" {
        IMAGE_MAX_DOWNLOAD
    } else {
        AUDIO_MAX_DOWNLOAD
    };

    let new_w = (f64::from(w) * ratio) as u32;
    let new_h = (f64::from(h) * ratio) as u32;

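The two lines above scale each side by `ratio` with a truncating cast, so a very thin image could end up with a side of zero. A minimal sketch of one way to guard against that is shown here; `fit_within` and its `max_side` parameter are hypothetical names, not taken from the PR, and the rounding and clamping choices are assumptions.

```rust
/// Scales (w, h) so that the longer side is at most `max_side`,
/// keeping the aspect ratio. Hypothetical helper, not the PR's code.
pub fn fit_within(w: u32, h: u32, max_side: u32) -> (u32, u32) {
    let longest = w.max(h);
    // Already small enough (or degenerate): leave the dimensions alone.
    if longest <= max_side || longest == 0 {
        return (w, h);
    }
    let ratio = f64::from(max_side) / f64::from(longest);
    // Round instead of truncating, and never let a side collapse to zero.
    let new_w = ((f64::from(w) * ratio).round() as u32).max(1);
    let new_h = ((f64::from(h) * ratio).round() as u32).max(1);
    (new_w, new_h)
}
```

Whether rounding or truncation is preferable is a design choice; the clamp to 1 is the part that matters for extreme aspect ratios.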
    "audio" => {
        if stt.enabled {
            use base64::Engine;
            if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(&att.data) {
                let client = reqwest::Client::new();
                if let Some(text) = crate::stt::transcribe(
                    &client,
                    &stt,
                    bytes,
                    att.filename.clone(),
                    &att.mime_type
                ).await {
                    extra_blocks.push(ContentBlock::Text {
                        text: format!("[Audio: {}]", text),
                    });
                }
            }
        }

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.

Screening report

## Intent

PR #757 aims to add inbound multimodal message handling to the OpenAB custom gateway for LINE and Telegram. The user-visible problem is that users on those platforms can currently interact reliably with text, but images, voice/audio, and Telegram text documents are not fully accepted, normalized, and forwarded into the agent/core flow.

## Feat

This is a gateway feature PR. It adds inbound support for:

Outbound multimodal support is explicitly out of scope: text remains supported, while outbound images/audio are not implemented.

## Who It Serves

Primary beneficiaries:

## Rewritten Prompt

Implement inbound multimodal support for LINE and Telegram in the custom gateway. Requirements:

Review carefully for unrelated adapter churn outside LINE, Telegram, shared media, gateway wiring, and docs.

## Merge Pitch

This PR moves OpenAB closer to parity with real messaging-platform usage: users commonly send screenshots, photos, voice notes, and small documents instead of plain text. Inbound-only support is a sensible first step because it unlocks richer agent context without committing to platform-specific outbound media delivery.

Risk profile is moderate. The main reviewer concern should be scope control: the file list includes substantial changes to Feishu, Google Chat, and Teams even though the PR title is LINE and Telegram focused. Review should determine whether those changes are necessary gateway-interface adjustments or unrelated refactor noise.

## Best-Practice Comparison

OpenClaw principles that apply:

OpenClaw principles that are less central here:

Hermes Agent principles that apply:

Hermes Agent principles that are less central here:

Overall, the most relevant best practices are explicit routing, bounded media handling, failure logging, and keeping expensive processing isolated from fragile webhook request paths.

## Implementation Options

Option 1: Conservative, LINE/Telegram-only inbound support

Option 2: Balanced, shared gateway media abstraction

Option 3: Ambitious, durable async media pipeline

## Comparison Table

## Recommendation

Advance this item, but steer review toward the balanced option. The feature is valuable and user-facing, but the merge discussion should focus on scope discipline: confirm that non-LINE/Telegram adapter changes are required by a shared gateway interface rather than incidental churn. If they are not required, split them out.

Recommended sequencing:
Force-pushed from 890bba5 to cf22796.
## 🔃 Review: feat(gateway): implement multimodal inbound support for LINE and Telegram

### What problem does this solve?

LINE and Telegram users can send images, voice notes, and documents, but the gateway previously only forwarded text messages. This PR enables multimodal inbound: images are resized/compressed, audio is passed through for STT transcription, and text documents are read and forwarded as content.

### How does it solve it?

### What was considered?

### Verdict

🟡 CHANGES REQUESTED: Good feature, solid implementation, but a few issues need addressing before merge.

### Detailed notes

🟢 INFO (good patterns):

🔴 SUGGESTED CHANGES:

🟡 NIT (non-blocking):
- Shared reqwest::Client from AppState in gateway main loop.
- Refactored Feishu adapter to use shared media utility module.
- Cleaned up formatting noise in feishu, googlechat, and teams adapters.
- Fixed Google Chat test fixtures for schema changes.
JARVIS-coding-Agent left a comment
🔍 Code Review — PR #757
Thanks for the solid work on multimodal inbound support! After review, we found 6 issues that should be addressed before merge.
1. [High] Empty event guard missing
If media download fails, access token is missing, or image decode errors out, LINE/Telegram adapters can still emit a GatewayEvent with empty text + empty attachments. Core gets woken up for nothing.
Fix: Before sending the event, add:
    if text.trim().is_empty() && attachments.is_empty() {
        continue;
    }

Applies to both gateway/src/adapters/line.rs and gateway/src/adapters/telegram.rs.
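To keep the two adapters consistent, the guard could live in one small helper. The following is a minimal sketch under stated assumptions: `Attachment` is a hypothetical stand-in for the gateway's attachment type, and `is_empty_event` is an illustrative name that does not appear in the PR.

```rust
/// Hypothetical stand-in for the gateway's attachment type.
#[derive(Debug, Clone, PartialEq)]
pub struct Attachment {
    pub mime_type: String,
    /// Base64-encoded payload.
    pub data: String,
}

/// Returns true when an event carries neither usable text nor any
/// attachment, in which case it should be dropped before reaching Core.
pub fn is_empty_event(text: &str, attachments: &[Attachment]) -> bool {
    text.trim().is_empty() && attachments.is_empty()
}
```

In the LINE event loop the caller would `continue` when this returns true; the Telegram handler shown earlier returns `axum::http::StatusCode::OK` directly, so there the same check would presumably return early instead.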
2. [Medium] MIME type mismatch after JPEG compression
Both LINE and Telegram image paths ignore the mime returned by resize_and_compress:
    Ok((c, _m)) => (c, content_type, format!("{}.jpg", message_id))
    //                 ^^^^^^^^^^^^ uses original server header, not "image/jpeg"

The filename says .jpg but the mime_type field could be image/png or image/webp. Core routing based on mime_type will break.
Fix: Use the returned mime from resize_and_compress instead of content_type.
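A sketch of what the fix might look like is given here. It assumes resize_and_compress yields the compressed bytes together with the MIME type of the re-encoded output, as the `Ok((c, _m))` pattern suggests; the `Compressed` alias, the `image_attachment_parts` helper, and the error fallback are hypothetical and only illustrate the idea.

```rust
/// Assumed shape of a successful `resize_and_compress` result:
/// compressed bytes plus the MIME type of the re-encoded output.
type Compressed = (Vec<u8>, String);

/// Builds (data, mime_type, filename) for an image attachment.
pub fn image_attachment_parts(
    result: Result<Compressed, String>,
    original: Vec<u8>,
    content_type: &str,
    message_id: &str,
) -> (Vec<u8>, String, String) {
    match result {
        // Compression succeeded: use the MIME reported by the encoder,
        // so that mime_type and the .jpg filename agree.
        Ok((bytes, mime)) => (bytes, mime, format!("{}.jpg", message_id)),
        // Assumed fallback: forward the original bytes with the
        // server-provided content type and no forced extension.
        Err(_) => (original, content_type.to_string(), message_id.to_string()),
    }
}
```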
3. [Medium] Telegram caption_entities not parsed for mentions
Currently only msg.entities is read for bot mentions. When a user sends a photo with a caption containing @bot, the mention lives in caption_entities, not entities. Group mention gate may skip these messages.
Fix: Merge caption_entities into the mention detection logic.
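One possible shape for the merged check is sketched below. The `Entity` and `Message` structs are trimmed-down, hypothetical mirrors of the adapter's Telegram types, and `mentions_bot` is an illustrative name. The sketch relies on the Telegram Bot API convention that entity offsets and lengths are counted in UTF-16 code units, and that `entities` index into `text` while `caption_entities` index into `caption`.

```rust
/// Hypothetical, trimmed-down mirror of a Telegram message entity.
#[derive(Debug, Clone)]
pub struct Entity {
    pub kind: String,  // e.g. "mention"
    pub offset: usize, // in UTF-16 code units
    pub length: usize, // in UTF-16 code units
}

/// Hypothetical, trimmed-down mirror of a Telegram message.
#[derive(Debug, Clone)]
pub struct Message {
    pub text: Option<String>,
    pub caption: Option<String>,
    pub entities: Vec<Entity>,
    pub caption_entities: Vec<Entity>,
}

/// Returns true if either the text or the caption mentions `@bot_username`.
pub fn mentions_bot(msg: &Message, bot_username: &str) -> bool {
    let needle = format!("@{}", bot_username);
    // Pair each body with the entity list that indexes into it.
    let sources = [
        (msg.text.as_deref(), &msg.entities),
        (msg.caption.as_deref(), &msg.caption_entities),
    ];
    sources.iter().any(|(body, entities)| {
        let Some(body) = body else { return false };
        let units: Vec<u16> = body.encode_utf16().collect();
        entities.iter().filter(|e| e.kind == "mention").any(|e| {
            units
                .get(e.offset..e.offset.saturating_add(e.length))
                .map(|s| String::from_utf16_lossy(s).eq_ignore_ascii_case(&needle))
                .unwrap_or(false)
        })
    })
}
```

Out-of-range offsets are treated as "no mention" rather than a panic, which seems the safer default for data coming from a webhook.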
4. [Medium] .env in TEXT_EXTS whitelist — potential secret leakage
    const TEXT_EXTS: &[&str] = &["txt", ..., "env", ...];

.env files typically contain API keys and secrets. Allowing them means user-uploaded .env content gets sent to the LLM.
Fix: Remove "env" from TEXT_EXTS.
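For illustration, an allowlist check without "env" might look like the sketch below. The entries in `TEXT_EXTS` here are assumptions, since the PR's real list is elided above, and `is_allowed_text_file` is a hypothetical helper name.

```rust
use std::path::Path;

/// Illustrative allowlist only; the PR's actual entries are not shown
/// in the review, and "env" is deliberately absent.
const TEXT_EXTS: &[&str] = &["txt", "md", "json", "csv", "log"];

/// Returns true if the filename has an extension on the allowlist
/// (compared case-insensitively).
pub fn is_allowed_text_file(filename: &str) -> bool {
    Path::new(filename)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| TEXT_EXTS.iter().any(|allowed| allowed.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}
```

Note that `Path::extension` returns `None` for a bare `.env`, so both `.env` and `prod.env` are rejected by this check.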
5. [Low] Client::new() per STT request in src/gateway.rs
The audio transcription path creates a new reqwest::Client on every request:
    let client = reqwest::Client::new();

This bypasses connection pooling. The gateway side already has a shared client in AppState.
Fix: Pass the shared client through GatewayParams or use a once_cell/LazyLock static client.
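The static-client variant can be sketched with the standard library alone. `HttpClient` below is a stand-in for `reqwest::Client` so that the example compiles without dependencies, and `shared_client` is a hypothetical accessor; `std::sync::OnceLock` is used here, though `LazyLock` or `once_cell` would follow the same pattern.

```rust
use std::sync::OnceLock;

/// Stand-in for `reqwest::Client`; the real code would store that type.
#[derive(Debug)]
pub struct HttpClient {
    pub user_agent: String,
}

/// Returns a process-wide client, constructing it on first use only.
/// Every later call hands back a reference to the same instance.
pub fn shared_client() -> &'static HttpClient {
    static CLIENT: OnceLock<HttpClient> = OnceLock::new();
    CLIENT.get_or_init(|| HttpClient {
        user_agent: "openab-core".to_string(), // assumed value for illustration
    })
}
```

Passing the client through GatewayParams keeps the dependency explicit and easier to test; the static is the smaller change.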
6. [Low] download_telegram_media missing Content-Length pre-check
download_telegram_document has dual validation (Content-Length header + body size), but download_telegram_media for images/audio only validates after full download. A 20MB audio file gets fully downloaded before rejection.
Fix: Add Content-Length header pre-check to download_telegram_media for consistency (same pattern as download_telegram_document).
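The dual validation can be expressed as two small checks, sketched here with hypothetical names (`SizeError`, `check_declared_len`, `check_body_len`). With reqwest, the declared length would typically come from `Response::content_length()`, which returns an `Option<u64>`; how download_telegram_document actually implements its checks is not shown in this review, so this is only an approximation of the pattern.

```rust
#[derive(Debug, PartialEq)]
pub enum SizeError {
    DeclaredTooLarge { declared: u64, max: u64 },
    BodyTooLarge { actual: u64, max: u64 },
}

/// Cheap pre-check on the declared Content-Length, run before the body
/// is read. A missing header passes, because the body check still applies.
pub fn check_declared_len(content_length: Option<u64>, max: u64) -> Result<(), SizeError> {
    match content_length {
        Some(declared) if declared > max => Err(SizeError::DeclaredTooLarge { declared, max }),
        _ => Ok(()),
    }
}

/// Second check on the bytes actually received, since the header can be
/// absent or inaccurate.
pub fn check_body_len(body: &[u8], max: u64) -> Result<(), SizeError> {
    let actual = body.len() as u64;
    if actual > max {
        Err(SizeError::BodyTooLarge { actual, max })
    } else {
        Ok(())
    }
}
```

The pre-check only saves bandwidth when the server sends an honest header, so the post-download check stays in place as the authoritative limit.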
Reviewed by
This review was conducted collaboratively by:
- JARVIS (coordinator)
- ULTRON
- FRIDAY
🚀 Multimodal Inbound Support for LINE & Telegram
This PR implements end-to-end multimodal support (images, text documents, and audio/voice) for LINE and Telegram integrations via the Custom Gateway.
Closes #690
Implementation Matrix
Key Features:
gateway/src/media.rs to reduce bandwidth and memory pressure.

Discord Discussion URL
https://discord.com/channels/1491295327620169908/1496171374711148665/1499859716409393172
Verification:
cargo check and cargo test.