I spent months building an Android app that runs large language models entirely on-device — no API calls, no cloud inference, no data leaving your phone. This article goes deep on every technical decision: how GGUF quantization actually works, how llama.cpp tokenizes and generates tokens on a mobile GPU, and how the voice pipeline achieves near-real-time streaming from microphone to synthesized speech.
App Screenshots
Here is every major screen in BRAINY.AI — from the main chat interface to system monitoring and AI personality calibration.
- Chat Interface: main AI chat screen with streaming markdown responses and typing indicator
- Live Voice Mode: hands-free voice conversation with real-time waveform visualiser
- Model Hub: browse and download 19+ GGUF models by category
- My Models: manage downloaded models, activate with one tap
- Settings (Performance): GPU acceleration, Low RAM Mode, and memory optimisation controls
- Model Tuning: Temperature, Top-P, Top-K, Max Tokens and context window sliders
- AI Personalities: preset personalities with trait intensity sliders and custom builder
- System Monitor: real-time RAM and CPU stats with usage history
- Benchmark Suite: tokens/sec measurement, prefill latency and device performance grading
- History & Security: chat history logs, biometric lock and local-only privacy settings
01. How GGUF models work
GGUF (GPT-Generated Unified Format) is the file format that makes on-device AI practical. It is a single self-contained binary that packages everything a model needs: architecture metadata, tokenizer vocabulary, quantized weight tensors, and inference configuration — all in one file you can drop on your device.
The GGUF file structure
Every GGUF file is a structured binary: a fixed header, then key-value metadata, then the raw tensor data.
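You can inspect that layout directly. Below is a minimal sketch, using only the Python standard library, that reads the fixed header fields defined by the public GGUF spec: the magic bytes, format version, tensor count, and metadata key-value count. Everything after those fields (the metadata itself and the tensor data) is where a full reader takes over.

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                                     # b"GGUF"
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))             # uint32, little-endian
        tensor_count, = struct.unpack("<Q", f.read(8))        # uint64
        metadata_kv_count, = struct.unpack("<Q", f.read(8))   # uint64
    return {"version": version,
            "tensors": tensor_count,
            "metadata_keys": metadata_kv_count}

# Example: read_gguf_header("llama-3-8b-instruct.Q4_K_M.gguf")
# returns the format version plus how many tensors and metadata entries follow.
```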
What quantization actually does
A standard neural network stores weights as 32-bit floats (FP32), so each parameter takes 4 bytes and a 7-billion parameter model needs 28 GB of weights alone, which is impossible on a phone. Quantization compresses the weights into fewer bits: at 4 bits per weight the same 7B model shrinks to roughly 4 GB, small enough to sit in RAM alongside Android and the KV cache.
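The back-of-the-envelope arithmetic (this is just the bit-width math, so real GGUF files come out slightly larger once per-block scales are included):

```python
# Approximate weight storage for a 7B-parameter model at different bit widths.
PARAMS = 7_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q2_K", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:7s} ~{gb:4.1f} GB")
# FP32 ~28.0 GB, FP16 ~14.0 GB, Q8_0 ~7.0 GB, Q4 ~3.5 GB, Q2 ~1.8 GB
```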
Why K-Quants are better
Legacy Q4_0 stores one scale per 32-weight block and nothing more. K-Quants (Q4_K_M, Q4_K_S) add a second level of scaling over larger super-blocks and keep the most quality-sensitive tensors, such as the attention value projections, at higher precision. The result is noticeably better output at roughly the same bits per weight.
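A toy version of per-block scaling makes the idea concrete. This is a simplified symmetric 4-bit scheme with one scale per 32-weight block, not the exact Q4_K layout (which also quantizes the scales themselves inside super-blocks), but it shows why small blocks preserve more information than a single scale for the whole tensor:

```python
import numpy as np

def quantize_blocks(weights, block_size=32):
    """Simplified symmetric 4-bit quantization with one FP16 scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)   # int range [-7, 7]
    return q, scales.astype(np.float16)

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_blocks(w)
err = np.abs(dequantize_blocks(q, s) - w).mean()
print(f"mean abs error: {err:.4f}")   # stays small because each block gets its own scale
```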
02. The inference pipeline
When you send a message, here is the exact sequence of operations from raw text to streaming tokens on your screen:
Step 1
Prompt templating
Raw user text is wrapped in the model’s expected format. Different architectures expect different templates — getting this wrong breaks the model completely.
// Llama 3 format
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is photosynthesis?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
// Phi-3 format (different!)
<|user|>What is photosynthesis?<|end|><|assistant|>
Step 2
Tokenization
The string is split into tokens using the model’s BPE vocabulary stored inside the GGUF file. Common words become single tokens; rare words split into subwords.
// "photosynthesis" tokenizes as:
// ["photo", "syn", "thesis"] -> token IDs: [7397, 3460, 8071]
// All prompt tokens processed in one parallel pass (prefill)
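For intuition, here is a toy greedy BPE merge loop. The merge table is made up for this example; the real one is read out of the GGUF tokenizer metadata at model load:

```python
def bpe_tokenize(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge rule present in the word."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)                       # start from individual characters
    while len(symbols) > 1:
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, best_i = min(pairs)
        if best_rank == float("inf"):
            break                              # no applicable merges left
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

# Illustrative merge table, not the model's real vocabulary.
merges = [("p", "h"), ("ph", "o"), ("pho", "t"), ("phot", "o"),
          ("s", "y"), ("sy", "n"), ("t", "h"), ("th", "e"),
          ("the", "s"), ("thes", "i"), ("thesi", "s")]
print(bpe_tokenize("photosynthesis", merges))  # ['photo', 'syn', 'thesis']
```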
Step 3
KV cache prefill
All prompt tokens run through the transformer layers in parallel. Attention keys and values are cached in RAM. This is the “time to first token” — the slowest phase.
// KV cache memory estimate (Llama 3 8B, 2048 context, GQA with 8 KV heads):
// 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2048 ctx x 2 bytes (FP16)
// = ~268 MB extra RAM on top of model weights
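The same estimate as a reusable helper. The Llama 3 8B values (32 layers, 8 KV heads from grouped-query attention, head dimension 128) are the published architecture numbers; swap in your own model and context length:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """RAM needed for the FP16 key/value cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value

mb = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=2048) / 1e6
print(f"~{mb:.0f} MB")  # ~268 MB for Llama 3 8B at 2048 context
```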
Step 4
Autoregressive decoding
The model generates one token at a time. Each token is emitted to the UI immediately via callback. Temperature and Top-P sampling shape which token gets picked.
for step in range(max_tokens):
    logits = forward_pass(last_token, kv_cache)   # one decode step through all layers
    logits = temperature_scale(logits, temp=0.8)  # soften or sharpen the distribution
    probs = top_p_filter(logits, p=0.95)          # keep the smallest set covering 95% of the mass
    token = sample(probs)
    emit_to_ui(token)                             # stream immediately
    if token == EOS:
        break
    last_token = token                            # feed the new token into the next step
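To make the sampling step concrete, here is a self-contained sketch of temperature scaling plus Top-P (nucleus) filtering over a raw logits vector. The function mirrors the pseudocode above; it is not the app's actual API:

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.95):
    """Temperature scaling + nucleus (Top-P) sampling over a logits vector."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax

    order = np.argsort(probs)[::-1]                       # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1       # smallest set covering top_p mass
    keep = order[:cutoff]

    kept = probs[keep] / probs[keep].sum()                # renormalise the survivors
    return int(np.random.choice(keep, p=kept))

# e.g. sample_token(model_logits) -> a token ID to feed back into the next decode step
```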
Step 5
Detokenization and render
Token IDs convert back to text. Partial tokens buffer until a word boundary, then stream to the Flutter widget for real-time markdown rendering and syntax highlighting.
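Buffering partial output before rendering can be as simple as holding text until a word boundary appears. A stripped-down illustration (the real widget pipeline also handles markdown and partial UTF-8 sequences):

```python
class StreamBuffer:
    """Accumulate decoded text and flush complete words to the UI."""
    def __init__(self, render):
        self.pending = ""
        self.render = render            # callback that updates the UI

    def push(self, piece):
        self.pending += piece
        # flush everything up to (and including) the last whitespace
        cut = max(self.pending.rfind(" "), self.pending.rfind("\n"))
        if cut >= 0:
            self.render(self.pending[:cut + 1])
            self.pending = self.pending[cut + 1:]

    def close(self):
        if self.pending:
            self.render(self.pending)   # flush whatever is left at end of stream

buf = StreamBuffer(render=print)
for piece in ["photosyn", "thesis is ", "how plants ", "make food."]:
    buf.push(piece)
buf.close()
```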
03. How the code works
The app uses a clean domain-driven architecture. The Flutter UI never talks directly to the inference engine — everything flows through a layered service stack.
Three pieces do most of the work: the LLMService orchestrator, Riverpod state management, and prompt template auto-detection (sketched below).
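The detection logic lives in Dart inside the app; the sketch below only illustrates the idea. It keys off the general.architecture string that every GGUF file carries in its metadata and falls back to a generic instruction format for unknown models. The template strings and the metadata dictionary are illustrative, not the app's real code:

```python
# Map the GGUF "general.architecture" metadata value to a chat template.
TEMPLATES = {
    "llama": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
             "{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n",
    "phi3":  "<|user|>{prompt}<|end|><|assistant|>",
}
FALLBACK = "### Instruction:\n{prompt}\n\n### Response:\n"

def build_prompt(metadata: dict, user_text: str) -> str:
    """Pick the chat template from model metadata; unknown models get a generic format."""
    arch = metadata.get("general.architecture", "")
    template = TEMPLATES.get(arch, FALLBACK)
    return template.format(prompt=user_text)

print(build_prompt({"general.architecture": "phi3"}, "What is photosynthesis?"))
```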
04. GPU acceleration
Without GPU acceleration, a 7B model on a mid-range phone generates tokens at 2–3 per second. With Vulkan offloading, the same model hits 20–40 tokens per second.
How GPU offloading works
llama.cpp offloads transformer layers to the GPU, so their matrix multiplications run on the graphics hardware instead of the CPU. When nGpuLayers=99, every layer is offloaded and the CPU only handles tokenization and sampling. On a Snapdragon 8 Gen 3 with Vulkan, this is 10–15x faster than CPU-only inference.
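Inside the app this goes through the llamadart bindings; the equivalent knob in the widely used llama-cpp-python bindings looks like the sketch below, shown only to illustrate what the layer-offload parameter controls (the file name is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers=99 asks the backend to place every transformer layer on the GPU;
# tokenization and sampling stay on the CPU either way.
llm = Llama(
    model_path="llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=2048,
)
```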
Minimum RAM by model
Device performance tiers
05. Voice mode architecture
Live mode chains three independent real-time systems (speech recognition, LLM inference, and text-to-speech) and streams partial results between them, so the hand-offs add almost no perceptible latency.
Listening
Continuous mic capture. Partial results stream to transcript. 2.5s silence triggers auto-send.
Thinking
LLM processes prompt, KV cache prefill runs. Animated indicator shows inference activity.
Speaking
TTS reads tokens as they generate. User can tap to interrupt mid-response.
Paused
Waiting for user. Mic reactivates after TTS finishes speaking.
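The four states above behave like a small state machine. A hypothetical sketch of the transitions (the state names follow the list above; the event names are illustrative, not the app's real identifiers):

```python
from enum import Enum, auto

class VoiceState(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    PAUSED = auto()

# (current state, event) -> next state
TRANSITIONS = {
    (VoiceState.LISTENING, "silence_2_5s"):   VoiceState.THINKING,   # auto-send on silence
    (VoiceState.THINKING,  "first_token"):    VoiceState.SPEAKING,   # TTS starts as tokens arrive
    (VoiceState.SPEAKING,  "tts_finished"):   VoiceState.PAUSED,
    (VoiceState.SPEAKING,  "user_tap"):       VoiceState.LISTENING,  # interrupt mid-response
    (VoiceState.PAUSED,    "mic_reactivate"): VoiceState.LISTENING,
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)   # ignore events that do not apply
```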
06. App capabilities
Streaming AI chat
Token-by-token streaming with full markdown, code syntax highlighting, and copy-to-clipboard. Multi-turn conversations within the context window.
Live voice mode
Hands-free continuous conversation with auto silence detection, real-time transcript, AI interrupt, and frosted glass UI.
File attachments
Attach images (JPEG, PNG), PDFs, TXT, MD, CSV, and DOCX files. Content is extracted and injected into the prompt context.
Image processing
Filters (grayscale, sepia, vivid), wallpaper setting, gallery save, and text-to-image via Hugging Face cloud fallback.
System monitoring
Real-time RAM and CPU usage in the Android notification bar. Persistent background service updates every 10 seconds.
Benchmark suite
Measures tokens/sec, prefill latency, and peak RAM. Grades your device into one of five performance tiers automatically.
AI personalities
4 preset personalities (Assistant, Code Expert, Comedy Bot, Counselor) with trait intensity sliders and a fully custom builder.
Biometric security
Face unlock and fingerprint lock on launch. All data encrypted in local SQLite. HF tokens in flutter_secure_storage. Zero telemetry.
Model manager
Browse 19+ models, download with progress tracking, pause and resume, SHA-256 checksum verification, import custom GGUF files.
HF cloud fallback
Optional Hugging Face Inference API for models too large for on-device. SSE streaming, token encrypted, opt-in only.
07. Real use cases
Confidential writing and journaling
Write personal journals, draft sensitive documents, or brainstorm private ideas with an AI that genuinely cannot share your data — because it runs entirely on your device.
Offline coding assistant
Use StarCoder2 or CodeQwen on a plane, in a secure environment where cloud API tools are prohibited, or simply with no WiFi. Full code completions and explanations locally.
Offline travel translation
MADLAD-400 and NLLB-200 support 100+ languages and run entirely offline. Translate menus, signs, and conversations with no data coverage.
Private health and wellness journaling
Log symptoms, track moods, and get AI-assisted reflection without that data ever touching a cloud server. Critical in regions with weak data privacy laws.
Maths and reasoning tutor
WizardMath 7B walks through calculus problems, explains proofs, and checks homework step-by-step. No subscription, no internet, no data shared.
Creative writing and storytelling
Use Mistral 7B in creative personality mode to draft fiction, develop characters, or write scripts. Long context (8192 tokens) handles extended sessions without losing the thread.
Hands-free voice assistant
Live mode with continuous listening is genuinely useful for accessibility — users with motor impairments can have a full conversation with an AI using only their voice, fully offline.
08. Honest limitations
RAM is a binary constraint
If the model plus KV cache plus OS do not fit in available RAM, Android will OOM-kill the app. No graceful degradation. Devices with 4 GB or less are limited to sub-2B models.
Android only — no iOS yet
llama.cpp native bindings need separate compilation for iOS. NDK vs Xcode native library differences are non-trivial. iOS support is actively in progress.
Speed gap between device tiers is large
Snapdragon 8 Gen 3 hits 40+ tokens/sec with Vulkan. A CPU-only budget phone hits 2–3 tokens/sec. That is a 15–20x difference in perceived responsiveness.
Q4 quantization loses accuracy
Quantization is lossy. For complex reasoning and factual accuracy, a Q4 model scores measurably lower than FP16 on benchmarks like MMLU. For conversational use the difference is usually imperceptible.
Thermal throttling on sustained sessions
15+ minutes of continuous inference will trigger thermal throttling — the SoC reduces clock speeds. Tokens/sec drops noticeably after prolonged use. Low RAM Mode helps manage thermal load.
No cross-session memory
Each conversation starts fresh. Chat history is stored in SQLite and viewable in the history screen, but is not re-injected into the model context automatically. RAG support is on the roadmap.
Initial model load latency
Loading a 4 GB GGUF model takes 8–20 seconds depending on storage speed (UFS 3.1 vs UFS 4.0). The app shows a progress indicator and the model stays loaded until explicitly unloaded.
No on-device fine-tuning yet
LoRA fine-tuning on-device would let users personalise models for their specific use cases. Planned as a future feature gated on higher-end devices with dedicated NPU support.
09. Full tech stack
| Layer | Technology |
|---|---|
| Framework | Flutter 3.24+ / Dart 3.5+ |
| LLM engine | llamadart 0.6.9 — llama.cpp Flutter bindings |
| Model format | GGUF (Q4_K_M, Q8_0, Q2_K quantizations) |
| GPU backends | Vulkan (Android) · OpenCL (MediaTek) · Metal (iOS, planned) · CUDA (Linux) |
| State mgmt | Riverpod 2.6+ with code generation (@riverpod) |
| Navigation | go_router 14.8+ with ShellRoute + endDrawer pattern |
| Database | Drift 2.19+ (SQLite with type-safe code generation) |
| Secure storage | flutter_secure_storage 10.0+ (AES-256 encrypted) |
| Biometric auth | local_auth 2.3+ (Face ID, fingerprint) |
| Networking | Dio 5.5+ with interceptors (HF cloud only, opt-in) |
| TTS | flutter_tts 4.2+ with pitch, speed, and voice selection |
| STT | speech_to_text 7.0+ with continuous listening and partials |
| Notifications | flutter_local_notifications 21.0+ (persistent system monitor) |
| Image | image 4.3+ · photo_view · gal · async_wallpaper |
| Native build | CMake 3.22+ · NDK 26+ — auto-compiles libllama.so |
| Supported ABIs | armeabi-v7a · arm64-v8a · x86 · x86_64 |
| Min Android | API 29 (Android 10) |
| License | MIT — fully open source |
Building BRAINY.AI convinced me that on-device AI is not a future technology — it is a present one. The models are good enough. The hardware is fast enough. The tooling is mature enough. What is missing is not capability — it is developer awareness and polished UX.
The privacy case is stronger than the performance case. Once someone experiences an AI that genuinely cannot share their data with anyone, the cloud alternative starts to feel like an unnecessary risk.
View on GitHub
MIT licensed. Full source at github.com/Deshan555/BRAINY.AI — star the repo if you find it useful. If you are working on mobile AI, edge inference, or Flutter, drop a comment or connect.