I spent months building an Android app that runs large language models entirely on-device — no API calls, no cloud inference, no data leaving your phone. This article goes deep on every technical decision: how GGUF quantization actually works, how llama.cpp tokenizes and generates tokens on a mobile GPU, and how the voice pipeline achieves near-real-time streaming from microphone to synthesized speech.

App Screenshots

Here is every major screen in BRAINY.AI — from the main chat interface to system monitoring and AI personality calibration.

Chat Interface

Main AI chat screen with streaming markdown responses and typing indicator

Live Voice Mode

Hands-free voice conversation with real-time waveform visualiser

Model Hub

Browse and download 19+ GGUF models by category

My Models

Manage downloaded models, activate with one tap

Settings — Performance

GPU acceleration, Low RAM Mode, and memory optimisation controls

Model Tuning

Temperature, Top-P, Top-K, Max Tokens and context window sliders

AI Personalities

Preset personalities with trait intensity sliders and custom builder

System Monitor

Real-time RAM and CPU stats with usage history

Benchmark Suite

Tokens/sec measurement, prefill latency and device performance grading

History & Security

Chat history logs, biometric lock and local-only privacy settings

01. How GGUF models work

GGUF (GPT-Generated Unified Format) is the file format that makes on-device AI practical. It is a single self-contained binary that packages everything a model needs: architecture metadata, tokenizer vocabulary, quantized weight tensors, and inference configuration — all in one file you can drop on your device.

The GGUF file structure

Every GGUF file is a structured binary with a fixed header followed by key-value metadata and then the raw tensor data:

GGUF binary layout — every field is mandatory

Offset    Field               Contents                                                 Size
0x00      magic               0x46554747 (ASCII "GGUF")                                4 bytes
0x04      version             Format version (currently 3)                             4 bytes
0x08      tensor_count        Number of weight tensors                                 8 bytes
0x10      metadata_kv_count   Number of key-value metadata pairs                       8 bytes
0x18+     metadata_kv[]       Architecture, tokenizer vocab, hyperparams, quant type   variable
aligned   tensor_info[]       Name, shape, type, offset for each tensor                variable
data      tensor_data         Raw quantized weight bytes — bulk of the file            GBs
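
To make that layout concrete, here is a minimal sketch — not the app's actual loader, since llamadart/llama.cpp parse this natively — of reading just the fixed header fields with dart:io. Field names follow the table above.

import 'dart:io';
import 'dart:typed_data';

/// Minimal sketch: read the fixed GGUF header fields from a file.
/// Assumes a little-endian GGUF v3 file; a real loader also parses the
/// metadata key-value section and tensor_info entries that follow.
void readGgufHeader(String path) {
  final file = File(path).openSync();
  final bytes = file.readSync(24); // 4 + 4 + 8 + 8 header bytes
  file.closeSync();

  final data = ByteData.sublistView(bytes);
  final magic = data.getUint32(0, Endian.little);       // 0x46554747 ("GGUF")
  final version = data.getUint32(4, Endian.little);     // currently 3
  final tensorCount = data.getUint64(8, Endian.little); // number of weight tensors
  final kvCount = data.getUint64(16, Endian.little);    // metadata key-value pairs

  if (magic != 0x46554747) {
    throw const FormatException('Not a GGUF file');
  }
  print('GGUF v$version: $tensorCount tensors, $kvCount metadata pairs');
}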

What quantization actually does

A standard neural network stores weights as 32-bit floats (FP32). Each parameter takes 4 bytes. A 7-billion parameter model needs 28 GB — impossible on a phone. Quantization compresses weights into fewer bits:

Concept — quantization tradeoffs

// Original FP32 weight: 4 bytes, full precision
float weight_fp32 = 0.48291634f;   // 32 bits

// Q8_0: 8-bit — 4x smaller, ~0.1% quality loss
int8_t weight_q8 = 62;             // 8 bits + shared scale factor

// Q4_K_M: 4-bit — 8x smaller, ~1-2% quality loss
uint4_t weight_q4 = 7;             // 4 bits + block-level scale + min

// Real-world impact on Llama 3 8B:
// FP32:    32 GB  — impossible on Android
// Q8_0:    8 GB   — needs a 12 GB RAM device
// Q4_K_M:  4.5 GB — usable on 8 GB flagship phones

  Why K-Quants are better

Standard Q4_0 quantizes all weights uniformly. K-Quants (Q4_K_M, Q4_K_S) use block-level scaling — different scale factors for different groups of weights — which preserves more information at the same bit depth. Attention layers that matter most get better precision.
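
A rough Dart sketch of the block-scaling idea, modelled loosely on Q8_0, where each 32-weight block carries its own scale factor. llama.cpp's real kernels are C, and the K-quant variants add per-block minimums and super-block scales on top of this.

import 'dart:math';
import 'dart:typed_data';

/// Loose sketch of Q8_0-style block quantization: a block of 32 FP32
/// weights becomes one scale factor plus 32 signed 8-bit values.
class QuantBlock {
  final double scale;     // per-block scale factor
  final Int8List values;  // 32 quantized weights
  QuantBlock(this.scale, this.values);
}

QuantBlock quantizeBlock(List<double> weights) {
  assert(weights.length == 32);
  final maxAbs = weights.map((w) => w.abs()).reduce(max);
  final scale = maxAbs / 127.0; // map [-maxAbs, maxAbs] onto [-127, 127]
  final q = Int8List.fromList([
    for (final w in weights) (scale == 0 ? 0 : (w / scale).round()),
  ]);
  return QuantBlock(scale, q);
}

/// Dequantize one weight: multiply the stored int back by the block scale.
double dequantize(QuantBlock block, int i) => block.values[i] * block.scale;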

02. The inference pipeline

When you send a message, here is the exact sequence of operations from raw text to streaming tokens on your screen:

  Step 1

Prompt templating

Raw user text is wrapped in the model’s expected format. Different architectures expect different templates — getting this wrong breaks the model completely.

// Llama 3 format
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is photosynthesis?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

// Phi-3 format (different!)
<|user|>What is photosynthesis?<|end|><|assistant|>
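
As a sketch, wrapping a user message in the Llama 3 template looks roughly like this. It is illustrative only — the app's real logic lives in PromptTemplateService, shown in section 03.

/// Illustrative Llama 3-style prompt wrapping with an optional system
/// prompt. Other architectures (Phi-3, ChatML, Mistral) use different
/// special tokens, which is why the template must match the model.
String llama3Format(String userText, {String? system}) {
  final buf = StringBuffer('<|begin_of_text|>');
  if (system != null) {
    buf.write('<|start_header_id|>system<|end_header_id|>\n\n$system<|eot_id|>');
  }
  buf
    ..write('<|start_header_id|>user<|end_header_id|>\n\n$userText<|eot_id|>')
    ..write('<|start_header_id|>assistant<|end_header_id|>\n\n');
  return buf.toString();
}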

  Step 2

Tokenization

The string is split into tokens using the model’s BPE vocabulary stored inside the GGUF file. Common words become single tokens; rare words split into subwords.

// "photosynthesis" tokenizes as: // ["photo", "syn", "thesis"] -> token IDs: [7397, 3460, 8071] // All prompt tokens processed in one parallel pass (prefill)

  Step 3

KV cache prefill

All prompt tokens run through the transformer layers in parallel. Attention keys and values are cached in RAM. This is the “time to first token” — the slowest phase.

// KV cache memory estimate (Llama 3 8B, 2048 context):
// 2 (K+V) x 32 layers x 8 KV heads x 128 head dim x 2048 ctx x 2 bytes (FP16)
// = ~268 MB extra RAM on top of the model weights
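
The same arithmetic as a small helper, with the hyperparameters passed in explicitly. Llama 3 8B uses grouped-query attention: 32 layers, 8 KV heads, head dimension 128.

/// Estimate FP16 KV cache size in bytes: keys + values for every
/// layer, KV head, head dimension, and context position.
int kvCacheBytes({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextLen,
  int bytesPerValue = 2, // FP16
}) {
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerValue;
}

// kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128, contextLen: 2048)
//   = 268,435,456 bytes ≈ 268 MB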

  Step 4

Autoregressive decoding

The model generates one token at a time. Each token is emitted to the UI immediately via callback. Temperature and Top-P sampling shape which token gets picked.

for step in range(max_tokens):
    logits = forward_pass(last_token, kv_cache)
    logits = temperature_scale(logits, temp=0.8)
    probs = top_p_filter(logits, p=0.95)
    token = sample(probs)
    emit_to_ui(token)      # stream to the UI immediately
    if token == EOS:
        break
    last_token = token     # feed the sampled token back in
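
The sampling step in Dart, assuming the raw logits are already available — a sketch of the standard temperature plus nucleus (Top-P) technique, not llamadart's internals.

import 'dart:math';

/// Pick the next token id from raw logits using temperature scaling
/// followed by nucleus (Top-P) sampling.
int sampleToken(List<double> logits, {double temp = 0.8, double topP = 0.95}) {
  final rng = Random();

  // Temperature scaling, then a numerically stable softmax.
  final scaled = logits.map((l) => l / temp).toList();
  final maxL = scaled.reduce(max);
  final exps = scaled.map((l) => exp(l - maxL)).toList();
  final sum = exps.reduce((a, b) => a + b);
  final probs = exps.map((e) => e / sum).toList();

  // Keep the smallest set of most-likely tokens whose mass reaches topP.
  final ids = List<int>.generate(probs.length, (i) => i)
    ..sort((a, b) => probs[b].compareTo(probs[a]));
  var cumulative = 0.0;
  final nucleus = <int>[];
  for (final id in ids) {
    nucleus.add(id);
    cumulative += probs[id];
    if (cumulative >= topP) break;
  }

  // Sample within the nucleus, renormalised to its own probability mass.
  var r = rng.nextDouble() * cumulative;
  for (final id in nucleus) {
    r -= probs[id];
    if (r <= 0) return id;
  }
  return nucleus.last;
}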

  Step 5

Detokenization and render

Token IDs convert back to text. Partial tokens buffer until a word boundary, then stream to the Flutter widget for real-time markdown rendering and syntax highlighting.
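
A sketch of the buffering idea — hold streamed text until a whitespace boundary before handing it to the widget. The names below are illustrative; the real rendering path also has to handle multi-byte UTF-8 sequences that span token boundaries.

/// Buffer streamed token text and emit it to the UI only at word
/// boundaries, so half-words never flash on screen mid-render.
class WordBoundaryBuffer {
  final void Function(String chunk) onEmit;
  final _pending = StringBuffer();

  WordBoundaryBuffer(this.onEmit);

  void add(String tokenText) {
    _pending.write(tokenText);
    final text = _pending.toString();
    final lastSpace = text.lastIndexOf(RegExp(r'\s'));
    if (lastSpace >= 0) {
      onEmit(text.substring(0, lastSpace + 1)); // emit the complete words
      _pending
        ..clear()
        ..write(text.substring(lastSpace + 1)); // keep the partial word
    }
  }

  /// Flush whatever remains once generation ends.
  void flush() {
    if (_pending.isNotEmpty) {
      onEmit(_pending.toString());
      _pending.clear();
    }
  }
}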

03. How the code works

The app uses a clean domain-driven architecture. The Flutter UI never talks directly to the inference engine — everything flows through a layered service stack.

The LLMService orchestrator

Dart — lib/data/datasources/llm_service.dart

class LLMService {
  LlamaCpp? _engine;
  ModelRuntime _runtime = ModelRuntime.cpu;

  /// Load a model and auto-detect the best runtime
  Future<void> loadModel(String modelPath) async {
    final metadata = await ModelMetadataExtractor.extract(modelPath);
    _runtime = await _detectBestRuntime(metadata);
    _engine = LlamaCpp(
      modelPath: modelPath,
      nGpuLayers: _runtime == ModelRuntime.vulkan ? 99 : 0,
      contextSize: settings.contextSize,
      threads: Platform.numberOfProcessors,
    );
    await _engine!.init();
  }

  /// Stream tokens to the UI via callback
  Stream<String> generate(String prompt) async* {
    final formatted = PromptTemplateService.format(
      prompt,
      _currentModel!.architecture,
    );
    await for (final token in _engine!.generateStream(formatted)) {
      yield token;
    }
  }

  Future<ModelRuntime> _detectBestRuntime(ModelMetadata meta) async {
    if (await VulkanDetector.isAvailable()) return ModelRuntime.vulkan;
    if (await OpenCLDetector.isAvailable()) return ModelRuntime.openCL;
    return ModelRuntime.cpu;
  }
}

Riverpod state management

Dart — lib/features/editor/providers/chat_provider.dart

@riverpod
class ChatNotifier extends _$ChatNotifier {
  final _messages = <ChatMessage>[];

  @override
  List<ChatMessage> build() => [];

  Future<void> sendMessage(String text) async {
    _messages.add(ChatMessage.user(text));
    state = [..._messages];

    final assistantMsg = ChatMessage.assistant('');
    _messages.add(assistantMsg);

    final llm = ref.read(llmServiceProvider);
    await for (final token in llm.generate(text)) {
      assistantMsg.content += token;
      state = [..._messages]; // triggers UI rebuild per token
    }

    await ref.read(historyRepositoryProvider).save(_messages);
  }
}

Prompt template auto-detection

Dart — lib/core/services/prompt_template_service.dart

enum PromptTemplate { llama3, phi3, gemma, chatML, mistral }

class PromptTemplateService {
  static String format(String prompt, ModelArchitecture arch) {
    return switch (arch.templateType) {
      PromptTemplate.llama3 => _llama3Format(prompt),
      PromptTemplate.phi3 => _phi3Format(prompt),
      PromptTemplate.gemma => _gemmaFormat(prompt),
      PromptTemplate.chatML => _chatMLFormat(prompt),
      PromptTemplate.mistral => _mistralFormat(prompt),
    };
  }
}

04. GPU acceleration

Without GPU acceleration, a 7B model on a mid-range phone generates tokens at 2–3 per second. With Vulkan offloading, the same model hits 20–40 tokens per second.

  How GPU offloading works

llama.cpp offloads transformer layers to the GPU. Each layer runs matrix multiplications on the GPU. When nGpuLayers=99, all layers run on GPU — CPU only handles tokenization and sampling. On Snapdragon 8 Gen 3 with Vulkan, this is 10–15x faster than CPU-only inference.

Dart — GPU layer configuration

final gpuLayers = switch (detectedRuntime) {
  ModelRuntime.vulkan => 99, // all layers on GPU — Qualcomm/Samsung
  ModelRuntime.openCL => 99, // all layers on GPU — MediaTek
  ModelRuntime.npu    => 32, // partial — NPU has limited VRAM
  ModelRuntime.cpu    => 0,  // CPU fallback
};

// Benchmark results (Llama 3 8B Q4):
// Snapdragon 8 Gen 3 + Vulkan:  ~42 tokens/sec
// Snapdragon 8 Gen 2 + Vulkan:  ~28 tokens/sec
// Dimensity 9200 + OpenCL:      ~22 tokens/sec
// CPU-only (any chip):          ~3 tokens/sec

  Minimum RAM by model

Model                      Minimum RAM
TinyLlama 1.1B             2 GB
Qwen2.5 1.5B               3 GB
Phi-3 Mini / StarCoder2    4 GB
MADLAD / NLLB 3B           4 GB
Mistral 7B Q4              8 GB
Llama 3 8B Q4              8 GB
WizardMath 7B Q4           8 GB

  Device performance tiers

Tier         Throughput
Excellent    50+ t/s
Very good    30–50 t/s
Good         20–30 t/s
Fair         10–20 t/s
Slow         <10 t/s

05. Voice mode architecture

Live mode chains three independent real-time systems — speech recognition, LLM inference, and text-to-speech — with minimal perceptible latency between them.

Listening

Continuous mic capture. Partial results stream to transcript. 2.5s silence triggers auto-send.

Thinking

LLM processes prompt, KV cache prefill runs. Animated indicator shows inference activity.

Speaking

TTS reads tokens as they generate. User can tap to interrupt mid-response.

Paused

Waiting for user. Mic reactivates after TTS finishes speaking.

Dart — lib/features/editor/live_chat/live_chat_controller.dart

class LiveChatController {
  LiveChatState _state = LiveChatState.listening;
  Timer? _silenceTimer;
  static const _silenceThreshold = Duration(milliseconds: 2500);

  void _onSpeechResult(String finalText) {
    _silenceTimer?.cancel();
    _silenceTimer = Timer(_silenceThreshold, () async {
      await _processAndRespond(finalText);
    });
  }

  Future<void> _processAndRespond(String text) async {
    _speechService.stopListening();
    _state = LiveChatState.thinking;

    StringBuffer response = StringBuffer();
    await for (final token in _llm.generate(text)) {
      response.write(token);
      // Speak at sentence boundaries — no need to wait for the full response
      if (_isSentenceBoundary(token)) {
        _state = LiveChatState.speaking;
        await _tts.speak(response.toString());
        response.clear();
      }
    }
    if (response.isNotEmpty) await _tts.speak(response.toString());

    _state = LiveChatState.listening;
    startListening();
  }

  void interrupt() {
    // User taps to stop mid-response
    _tts.stop();
    _llm.cancelGeneration();
    startListening();
  }
}
“The key insight: buffer tokens until a sentence boundary, then speak that sentence while the model is still generating the next one — no waiting for the full response.”
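
One plausible shape for the _isSentenceBoundary helper referenced above — illustrative only; the repo's actual implementation may differ.

/// Illustrative check for the live-voice loop: a token ends a sentence
/// if it finishes with terminal punctuation. A production version would
/// also guard against abbreviations, decimals, and code blocks.
bool _isSentenceBoundary(String token) {
  final t = token.trimRight();
  return t.endsWith('.') || t.endsWith('!') || t.endsWith('?') || t.endsWith('…');
}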

06. App capabilities

Streaming AI chat

Token-by-token streaming with full markdown, code syntax highlighting, and copy-to-clipboard. Multi-turn conversations within the context window.

Live voice mode

Hands-free continuous conversation with auto silence detection, real-time transcript, AI interrupt, and frosted glass UI.

File attachments

Attach images (JPEG, PNG), PDFs, TXT, MD, CSV, and DOCX files. Content is extracted and injected into the prompt context.

Image processing

Filters (grayscale, sepia, vivid), wallpaper setting, gallery save, and text-to-image via Hugging Face cloud fallback.

System monitoring

Real-time RAM and CPU usage in the Android notification bar. Persistent background service updates every 10 seconds.

Benchmark suite

Measures tokens/sec, prefill latency, and peak RAM. Grades your device into one of five performance tiers automatically.

AI personalities

4 preset personalities (Assistant, Code Expert, Comedy Bot, Counselor) with trait intensity sliders and a fully custom builder.

Biometric security

Face ID and fingerprint lock on launch. All data encrypted in local SQLite. HF tokens in flutter_secure_storage. Zero telemetry.

Model manager

Browse 19+ models, download with progress tracking, pause and resume, SHA-256 checksum verification, import custom GGUF files.

HF cloud fallback

Optional Hugging Face Inference API for models too large for on-device. SSE streaming, token encrypted, opt-in only.

07. Real use cases

Privacy

Confidential writing and journaling

Write personal journals, draft sensitive documents, or brainstorm private ideas with an AI that genuinely cannot share your data — because it runs entirely on your device.

Code

Offline coding assistant

Use StarCoder2 or CodeQwen on a plane, in a secure environment where cloud API tools are prohibited, or simply with no WiFi. Full code completions and explanations locally.

Travel

Offline travel translation

MADLAD-400 and NLLB-200 support 100+ languages and run entirely offline. Translate menus, signs, and conversations with no data coverage.

Health

Private health and wellness journaling

Log symptoms, track moods, and get AI-assisted reflection without that data ever touching a cloud server. Critical in regions with weak data privacy laws.

Education

Maths and reasoning tutor

WizardMath 7B walks through calculus problems, explains proofs, and checks homework step-by-step. No subscription, no internet, no data shared.

Creative

Creative writing and storytelling

Use Mistral 7B in creative personality mode to draft fiction, develop characters, or write scripts. Long context (8192 tokens) handles extended sessions without losing the thread.

Access

Hands-free voice assistant

Live mode with continuous listening is genuinely useful for accessibility — users with motor impairments can have a full conversation with an AI using only their voice, fully offline.

08. Honest limitations

Hard

RAM is a binary constraint

If the model plus KV cache plus OS do not fit in available RAM, Android will OOM-kill the app. No graceful degradation. Devices with 4 GB or less are limited to sub-2B models.

Hard

Android only — no iOS yet

llama.cpp native bindings need separate compilation for iOS. NDK vs Xcode native library differences are non-trivial. iOS support is actively in progress.

Friction

Speed gap between device tiers is large

Snapdragon 8 Gen 3 hits 40+ tokens/sec with Vulkan. A CPU-only budget phone hits 2–3 tokens/sec. That is a 15–20x difference in perceived responsiveness.

Friction

Q4 quantization loses accuracy

Quantization is lossy. For complex reasoning and factual accuracy, a Q4 model scores measurably lower than FP16 on benchmarks like MMLU. For conversational use the difference is usually imperceptible.

Friction

Thermal throttling on sustained sessions

15+ minutes of continuous inference will trigger thermal throttling — the SoC reduces clock speeds. Tokens/sec drops noticeably after prolonged use. Low RAM Mode helps manage thermal load.

Minor

No cross-session memory

Each conversation starts fresh. Chat history is stored in SQLite and viewable in the history screen, but is not re-injected into the model context automatically. RAG support is on the roadmap.

Minor

Initial model load latency

Loading a 4 GB GGUF model takes 8–20 seconds depending on storage speed (UFS 3.1 vs UFS 4.0). The app shows a progress indicator and the model stays loaded until explicitly unloaded.

Roadmap

No on-device fine-tuning yet

LoRA fine-tuning on-device would let users personalise models for their specific use cases. Planned as a future feature gated on higher-end devices with dedicated NPU support.

09. Full tech stack

Layer             Technology
Framework         Flutter 3.24+ / Dart 3.5+
LLM engine        llamadart 0.6.9 — llama.cpp Flutter bindings
Model format      GGUF (Q4_K_M, Q8_0, Q2_K quantizations)
GPU backends      Vulkan (Android) · OpenCL (MediaTek) · Metal (iOS, planned) · CUDA (Linux)
State mgmt        Riverpod 2.6+ with code generation (@riverpod)
Navigation        go_router 14.8+ with ShellRoute + endDrawer pattern
Database          Drift 2.19+ (SQLite with type-safe code generation)
Secure storage    flutter_secure_storage 10.0+ (AES-256 encrypted)
Biometric auth    local_auth 2.3+ (Face ID, fingerprint)
Networking        Dio 5.5+ with interceptors (HF cloud only, opt-in)
TTS               flutter_tts 4.2+ with pitch, speed, and voice selection
STT               speech_to_text 7.0+ with continuous listening and partials
Notifications     flutter_local_notifications 21.0+ (persistent system monitor)
Image             image 4.3+ · photo_view · gal · async_wallpaper
Native build      CMake 3.22+ · NDK 26+ — auto-compiles libllama.so
Supported ABIs    armeabi-v7a · arm64-v8a · x86 · x86_64
Min Android       API 29 (Android 10)
License           MIT — fully open source
· · ·

Building BRAINY.AI convinced me that on-device AI is not a future technology — it is a present one. The models are good enough. The hardware is fast enough. The tooling is mature enough. What is missing is not capability — it is developer awareness and polished UX.

The privacy case is stronger than the performance case. Once someone experiences an AI that genuinely cannot share their data with anyone, the cloud alternative starts to feel like an unnecessary risk.

#Flutter #Android #LLM #GGUF #OnDeviceAI #llamacpp #EdgeAI #Vulkan #OpenSource #Privacy #MobileAI #BuildInPublic #DartLang #Riverpod #EdgeComputing

View on GitHub

MIT licensed. Full source at github.com/Deshan555/BRAINY.AI — star the repo if you find it useful. If you are working on mobile AI, edge inference, or Flutter, drop a comment or connect.