An iOS app for learning European / Brazilian Portuguese pronunciation through typed input and photo-based OCR, powered by Microsoft Azure Cognitive Services.
Help a learner quickly hear how any Portuguese word or phrase should sound and see its phonetic form — whether typed in by hand or captured from a real-world sign, book, or menu photo.
- User types a Portuguese word or phrase into a text field.
- App displays:
  - the input text,
  - the IPA (or SAPI) phonetic transcription,
  - a playback control that speaks the text aloud.
- User can pick the Portuguese variant (Portugal `pt-PT` vs. Brazil `pt-BR`) and, optionally, a specific neural voice.
- Playback speed can be adjusted (e.g. 0.75×, 1.0×).
- Recent queries are kept in a local history for quick replay.
- User takes a new photo with the camera or picks an image from the library.
- App runs OCR on the image and overlays detected word bounding boxes on top of the photo.
- User can tap a single word, drag to select multiple adjacent words, or tap a whole line to select all of its words.
- For the current selection the app shows:
  - the recognized text,
  - its phonetic transcription,
  - a play button that speaks it aloud.
- Selections and their audio are cached, so re-tapping the same word does not trigger another network request.
- Works offline for anything already cached (playback of previously spoken words, previously OCR'd images).
- Errors (no network, Azure quota exceeded, OCR failure) surface as inline messages, not modal alerts.
- All Azure requests go through a thin service layer so the TTS / OCR providers can be swapped later without touching the UI.
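
For illustration, here is a minimal sketch of what that seam could look like. The protocol and type names (`SpeechSynthesizing`, `TextRecognizing`, `AzureSpeechService`) are hypothetical, not the repo's actual API:

```swift
import Foundation
import CoreGraphics

// Hypothetical provider-agnostic seams; the UI and view models depend only on these protocols.
protocol SpeechSynthesizing {
    /// Returns synthesized audio plus a phonetic transcription for `text`.
    func synthesize(_ text: String, voice: String, rate: Double) async throws -> (audio: Data, phonetic: String)
}

protocol TextRecognizing {
    /// Returns recognized words together with their bounding boxes in image coordinates.
    func recognizeWords(in imageData: Data) async throws -> [(text: String, box: CGRect)]
}

// An Azure-backed implementation conforms to the protocol; swapping the TTS/OCR provider
// later only means adding another conforming type, with no UI changes.
struct AzureSpeechService: SpeechSynthesizing {
    let endpoint: URL
    let subscriptionKey: String

    func synthesize(_ text: String, voice: String, rate: Double) async throws -> (audio: Data, phonetic: String) {
        // The real implementation would call the Azure Speech SDK / REST API here.
        fatalError("Sketch only; not implemented")
    }
}
```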
- Platform: iOS 17+, Swift 5.9+, SwiftUI, Swift Concurrency (`async`/`await`).
- Architecture: feature modules (TextMode, PhotoMode) over a shared `AzureClient` layer; view models expose `@Observable` state (see the sketches after this list).
- Azure Cognitive Services:
  - Speech — Text-to-Speech with neural `pt-PT` / `pt-BR` voices; request SSML with `<mstts:viseme>` / phoneme output to obtain the transcription (an example payload is sketched below).
  - Computer Vision Read (or Document Intelligence) — OCR with word-level bounding boxes.
- Storage: Core Data or SwiftData for history and cached audio/OCR blobs; audio files are cached on disk keyed by a `(text, voice, rate)` hash.
- Secrets: Azure keys are kept out of the repo (loaded from a local `Secrets.xcconfig` that is gitignored, and ultimately from a token-exchange service for production).
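
To make the Speech bullet concrete, here is a rough sketch of the kind of SSML payload such a request could use. The helper name `makeSSML` and the default voice `pt-PT-RaquelNeural` are illustrative assumptions; the app's actual payload may differ:

```swift
/// Builds an SSML payload for a neural pt-PT/pt-BR voice with an adjustable speaking rate.
/// Note: `text` should be XML-escaped before interpolation; omitted here for brevity.
func makeSSML(text: String, voice: String = "pt-PT-RaquelNeural", rate: Double = 1.0) -> String {
    // Derive the locale ("pt-PT" or "pt-BR") from the voice name prefix.
    let locale = String(voice.prefix(5))
    return """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="\(locale)">
      <voice name="\(voice)">
        <mstts:viseme type="redlips_front"/>
        <prosody rate="\(rate)">\(text)</prosody>
      </voice>
    </speak>
    """
}
```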
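
And a hedged sketch of how an `@Observable` view model could sit on top of the service layer while caching audio on disk under a `(text, voice, rate)` hash. The names (`TextModeViewModel`, `audioCacheKey`) are assumptions for illustration, and it builds on the `SpeechSynthesizing` protocol sketched earlier:

```swift
import Foundation
import CryptoKit
import Observation

/// Stable on-disk cache key derived from (text, voice, rate).
/// The app's actual key scheme may differ; this is one reasonable choice.
func audioCacheKey(text: String, voice: String, rate: Double) -> String {
    let digest = SHA256.hash(data: Data("\(text)|\(voice)|\(rate)".utf8))
    return digest.map { String(format: "%02x", $0) }.joined()
}

@Observable
final class TextModeViewModel {
    var phonetic: String = ""
    var errorMessage: String?          // surfaced inline, never as a modal alert
    private let speech: SpeechSynthesizing
    private let cacheDirectory: URL

    init(speech: SpeechSynthesizing, cacheDirectory: URL) {
        self.speech = speech
        self.cacheDirectory = cacheDirectory
    }

    func speak(_ text: String, voice: String, rate: Double) async {
        let file = cacheDirectory.appendingPathComponent(
            audioCacheKey(text: text, voice: voice, rate: rate) + ".wav")
        do {
            if !FileManager.default.fileExists(atPath: file.path) {
                let result = try await speech.synthesize(text, voice: voice, rate: rate)
                try result.audio.write(to: file)
                phonetic = result.phonetic
                // In the real app the transcription would be cached alongside the audio; omitted here.
            }
            // Hand `file` to an audio player (e.g. AVAudioPlayer) for playback; omitted here.
        } catch {
            errorMessage = error.localizedDescription
        }
    }
}
```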
- Translation to other languages.
- Grammar analysis / dictionary lookup.
- User accounts / cloud sync across devices.
- Android or web client.
- Install XcodeGen and CocoaPods:

  ```sh
  brew install xcodegen cocoapods
  ```

- Copy `Secrets.xcconfig.example` to `Secrets.xcconfig` and fill in your Azure keys/regions (Speech, Vision, Translator). `Secrets.xcconfig` is gitignored.
- Generate the Xcode project, then install the Speech SDK pod:

  ```sh
  xcodegen generate
  pod install
  ```

- Open `ptkw.xcworkspace` (not the `.xcodeproj`) in Xcode, or build from the CLI:

  ```sh
  xcodebuild -workspace ptkw.xcworkspace -scheme ptkw \
    -destination 'platform=iOS Simulator,name=iPhone 15' build
  ```

- Run tests:

  ```sh
  xcodebuild test -workspace ptkw.xcworkspace -scheme ptkw \
    -destination 'platform=iOS Simulator,name=iPhone 15'
  ```
The Microsoft Cognitive Services Speech SDK for iOS is distributed only via CocoaPods (`MicrosoftCognitiveServicesSpeech-iOS`). Re-run `pod install` whenever you regenerate the project with `xcodegen generate`.
v1 scaffolding in place: shared Azure layer (Speech, Vision, Translator),
SwiftData history, TextMode/PhotoMode/History/Settings tabs, unit tests for
parsing and view models. Smoke-testable on device once `Secrets.xcconfig` is
filled in.