Why this matters
- Mobile networks are hostile: variable latency, captive portals, TLS interceptors, and background restrictions.
- A high-quality app treats the network as unreliable and designs a resilient data layer that fails gracefully, retries intelligently, and remains observable in production.
Objectives
- Predictable error classification with app-friendly messages
- Sensible defaults for timeouts, retries, and cancellation
- Request deduplication and backpressure to prevent stampedes
- Clear observability: traces, metrics, structured logs for diagnosis
Contents
- Error taxonomy and surfacing to UI
- Timeouts: connect vs. read vs. whole call budgets
- Retry strategy, jitter, and idempotence
- Cancellation and lifecycles (widgets, navigation, background)
- Request coalescing/deduplication
- Backpressure and concurrency limits
- Caching and staleness semantics
- Observability (traces, logs, metrics)
- Testing strategy (fakes, chaos, property tests)
- Reference design and code outline
Error taxonomy
Categorize errors first; everything else follows.
- Client/Device
- No connectivity, DNS failure, TLS handshake, certificate pinning failure
- Transient Server/Network
- 429 Too Many Requests, 5xx server errors, timeouts, reset by peer
- Permanent Client Errors
- 400/422 validation, authentication/authorization (401/403)
- Business Errors
- Domain-level constraints, quota exceeded, locked resource
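One way to encode this taxonomy is a sealed failure hierarchy that every transport error is mapped into before it reaches callers. A minimal sketch; the class names are illustrative and not part of the reference design later in this section:
sealed class AppFailure implements Exception {}

// No network, DNS failure, TLS handshake or certificate pinning failure.
class ConnectivityFailure extends AppFailure {}

// 429, 5xx, timeouts, connection reset: safe to retry with backoff.
class TransientFailure extends AppFailure {
  final Duration? retryAfter; // from a Retry-After header, if present
  TransientFailure({this.retryAfter});
}

// 400/422 validation and 401/403 auth: retrying will not help.
class PermanentClientFailure extends AppFailure {
  final int statusCode;
  PermanentClientFailure(this.statusCode);
}

// Domain-level constraints: quota exceeded, locked resource, etc.
class BusinessFailure extends AppFailure {
  final String code;
  BusinessFailure(this.code);
}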
UI surfacing
- Client/Transient: “Temporary network issue. Retrying…”
- Permanent Client: show actionable validation/auth error
- Business: error banner with context; consider soft retry/recovery guidance
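With a sealed taxonomy like the sketch above, surfacing becomes one exhaustive mapping. The copy strings here are placeholders:
String surface(AppFailure f) => switch (f) {
  ConnectivityFailure() || TransientFailure() =>
      'Temporary network issue. Retrying…',
  PermanentClientFailure(statusCode: 401 || 403) =>
      'Please sign in again.',
  PermanentClientFailure() =>
      'Please check your input and try again.',
  BusinessFailure(:final code) =>
      'This action is not available right now ($code).',
};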
Timeouts and budgets
- Separate budgets:
- Connect timeout: 2–5s
- Read timeout: 10–30s (per chunk/response)
- Whole-call deadline: 15–60s (use CancellationToken/Deadline)
- Budget propagation:
- Compose budgets downstream (e.g., list → item calls inherit parent deadline minus elapsed)
- Alert on timeouts per endpoint to catch regressions
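A whole-call deadline can be carried as a small value object and narrowed for downstream calls. A sketch; the Deadline/child names mirror the reference design at the end of this section, but the shape here is an assumption:
class Deadline {
  final DateTime _expiry;
  Deadline(Duration budget) : _expiry = DateTime.now().add(budget);

  Duration get remaining => _expiry.difference(DateTime.now());
  bool get expired => remaining.isNegative;

  // A downstream call inherits whatever is left of the parent budget,
  // optionally capped (e.g., per-item fetches inside a list load).
  Deadline child({Duration? cap}) {
    final left = remaining;
    return Deadline((cap != null && cap < left) ? cap : left);
  }
}

// Usage: a 30s whole-call budget; each item call gets at most 5s of what remains.
// final deadline = Deadline(const Duration(seconds: 30));
// final itemDeadline = deadline.child(cap: const Duration(seconds: 5));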
Retry strategy and jitter
- Retry only idempotent methods by default (GET, HEAD). For POST/PUT/PATCH/DELETE, require explicit allowance and include idempotency keys.
- Exponential backoff with full jitter:
Base = 250ms, factor 2; each delay drawn uniformly from [0, min(cap, base * 2^attempt))
- Cap (max backoff) e.g., 5–10s
- Max attempts: 3–5 (context-dependent)
- Honor Retry-After headers and server-provided pacing
- Circuit-breaking (sketched after this list):
- Track rolling error rate per host/route; short-circuit with immediate fallback when above threshold
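The circuit-breaking above can be as small as a per-host failure counter. This sketch uses a consecutive-failure counter rather than a true rolling error rate, with assumed thresholds; it is not part of the reference design below:
class CircuitBreaker {
  final int failureThreshold;
  final Duration cooldown;
  int _failures = 0;
  DateTime? _openedAt;

  CircuitBreaker({this.failureThreshold = 5, this.cooldown = const Duration(seconds: 30)});

  bool get isOpen =>
      _openedAt != null && DateTime.now().difference(_openedAt!) < cooldown;

  Future<T> run<T>(Future<T> Function() fn) async {
    if (isOpen) throw StateError('circuit open: failing fast'); // caller falls back immediately
    try {
      final result = await fn();
      _failures = 0; // success closes the circuit
      _openedAt = null;
      return result;
    } catch (_) {
      if (++_failures >= failureThreshold) _openedAt = DateTime.now();
      rethrow;
    }
  }
}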
Cancellation and lifecycles
- Tie requests to UI lifecycles:
- Cancel when widget is disposed, route is popped, or search query changes
- Prevent “late write” of stale responses:
Attach a generation/version number to callbacks and ignore responses from outdated generations (see the sketch after this list)
- Use autoDispose scopes (Riverpod) or cancellable tokens (BLoC/controller)
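A minimal sketch of the generation approach for search-as-you-type; SearchController and fetch are hypothetical names:
class SearchController {
  int _generation = 0;
  List<String> results = const [];

  Future<void> onQueryChanged(
      String query, Future<List<String>> Function(String) fetch) async {
    final gen = ++_generation;       // every new query invalidates older ones
    final fresh = await fetch(query);
    if (gen != _generation) return;  // a newer query superseded us: drop the stale response
    results = fresh;
  }
}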
Request coalescing and deduplication
- Coalesce identical in-flight GETs (same method+URL+headers+body hash) and fan-out results to all subscribers.
- Useful for “search-as-you-type” (coalesce per debounced query) and details pages accessed from multiple entry points.
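A possible key derivation for coalescing. This is illustrative: only headers that actually affect the response should participate, and hashing the body is a shortcut that tolerates rare collisions:
String dedupKey(String method, Uri uri, Map<String, String> headers, List<int>? body) {
  final headerPart = (headers.entries.toList()
        ..sort((a, b) => a.key.compareTo(b.key)))
      .map((e) => '${e.key.toLowerCase()}=${e.value}')
      .join('&');
  final bodyHash = body == null ? '' : Object.hashAll(body).toString();
  return '$method $uri|$headerPart|$bodyHash';
}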
Backpressure and concurrency limits
- Limit concurrent in-flight requests per host and globally (e.g., 6 per host, 24 global).
- Queue requests beyond the limit; prioritize critical ones (foreground UI > background prefetch).
- Combine with debounce for chattier flows (e.g., filters, search).
- Avoid stampedes by caching negative results briefly and coalescing.
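Per-host and global limits compose by nesting limiters. A usage-style sketch; Limiter is the class from the reference design at the end of this section, and the limits are the example numbers above:
final globalLimiter = Limiter(24);
final hostLimiters = <String, Limiter>{};

Future<T> limited<T>(Uri uri, Future<T> Function() fn) {
  final perHost = hostLimiters.putIfAbsent(uri.host, () => Limiter(6));
  // Both limits must hold; take the per-host slot first, then the global one.
  return perHost.run(() => globalLimiter.run(fn));
}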
Caching and staleness
- Respect HTTP caching headers (ETag, Cache-Control, Last-Modified)
- Serve stale-while-revalidate (SWR) in UI:
- Show cached data immediately
- Kick off background refresh; update UI when fresh arrives
- Define staleness budgets per domain (e.g., 30s for news list, 24h for static catalog)
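A stale-while-revalidate read can be a small wrapper around any fetch. A sketch; SwrCache is a hypothetical name, and the coalescing from the previous section should guard against duplicate background refreshes:
class SwrCache<T> {
  final Duration staleAfter;
  T? _value;
  DateTime? _fetchedAt;

  SwrCache(this.staleAfter);

  // Returns whatever is cached immediately and, if it is older than the
  // staleness budget, kicks off a background refresh that calls onFresh.
  T? get(Future<T> Function() fetch, void Function(T fresh) onFresh) {
    final stale = _fetchedAt == null ||
        DateTime.now().difference(_fetchedAt!) > staleAfter;
    if (stale) {
      fetch().then((fresh) {
        _value = fresh;
        _fetchedAt = DateTime.now();
        onFresh(fresh);
      }, onError: (Object _) {
        // A failed background refresh is dropped; callers keep the stale value.
      });
    }
    return _value;
  }
}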
Observability
- Tracing:
- Trace ID per request (correlate logs across layers)
- Span for DNS, TLS handshake, connect, request write, response read
- Metrics:
- p50/p95 latency per endpoint
- Error rate per class (timeout, 5xx, 4xx, network)
- Retry counts and final outcomes
- Logs:
- Structured (JSON); include endpoint, status, attempt, backoff, cacheHit, dedupHit
Redact secrets; sample successful requests at a fixed rate to keep log volume manageable
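One possible shape for the structured log line; field names are assumptions that match the list above:
import 'dart:convert';

void logAttempt({
  required String endpoint,   // path template, not the raw URL, to limit PII and cardinality
  required int status,
  required int attempt,
  required Duration backoff,
  required bool cacheHit,
  required bool dedupHit,
  required String traceId,
}) {
  // In production this would go to the logging pipeline rather than stdout.
  print(jsonEncode({
    'msg': 'http_attempt',
    'endpoint': endpoint,
    'status': status,
    'attempt': attempt,
    'backoffMs': backoff.inMilliseconds,
    'cacheHit': cacheHit,
    'dedupHit': dedupHit,
    'traceId': traceId,
  }));
}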
Testing strategy
- Unit tests:
- Retry/backoff schedule correctness with fixed seeded RNG
- Idempotent vs. non-idempotent behaviors
- Deduplication map lifecycle and cancellation
- Integration tests:
- Fake server with injected behaviors (429 with Retry-After, 5xx flaps, slow streams)
- Chaos/Property tests:
Randomized interleavings of timeouts, cancellations, and retries to verify invariants (no memory leaks, no double callbacks)
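As one example, the deduplication lifecycle from the list above can be pinned down with a plain unit test against the Deduper sketched in the reference design below (assumes package:test):
import 'package:test/test.dart';

void main() {
  test('identical in-flight requests are coalesced into one call', () async {
    final deduper = Deduper();
    var calls = 0;
    Future<String> fetch() async {
      calls++;
      await Future<void>.delayed(const Duration(milliseconds: 10));
      return 'ok';
    }

    final a = deduper.run('GET /items', fetch);
    final b = deduper.run('GET /items', fetch); // reuses the in-flight future
    expect(await a, 'ok');
    expect(await b, 'ok');
    expect(calls, 1);

    // Once the first call completes, the key is released and a new request runs.
    await deduper.run('GET /items', fetch);
    expect(calls, 2);
  });
}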
Reference design (pseudo-code)
Client Facade
class HttpClientFacade {
  final HttpClient raw;
  final RetryPolicy retry;
  final Budget budget;
  final Deduper deduper;
  final Limiter limiter;
  final Tracer tracer;

  HttpClientFacade(this.raw, this.retry, this.budget, this.deduper, this.limiter, this.tracer);

  Future<HttpResponse> get(Uri uri, {Headers? headers, Deadline? deadline}) {
    return _run(HttpRequest.get(uri, headers: headers), deadline: deadline);
  }

  Future<HttpResponse> _run(HttpRequest req, {Deadline? deadline}) async {
    // Layering, outermost first: concurrency limit -> tracing -> dedup -> retry -> raw send.
    return limiter.run(() async {
      final span = tracer.startSpan('http', attributes: {'url': req.uri.toString()});
      try {
        final key = deduper.keyOf(req);
        return await deduper.run(key, () async {
          return await retry.run(() async {
            // Each attempt runs under the caller's deadline or a fresh per-call budget.
            final t = deadline ?? budget.child();
            return await raw.send(req, timeout: t.remaining, cancelToken: t.token);
          }, isIdempotent: req.isIdempotent, deadline: deadline);
        });
      } on Cancelled {
        span.setStatus('cancelled');
        rethrow;
      } on Timeout {
        span.setStatus('deadline_exceeded');
        rethrow;
      } finally {
        span.end();
      }
    });
  }
}
Retry policy
class RetryPolicy {
  final int maxAttempts;
  final Duration base;
  final Duration cap;
  final Jitter jitter;

  RetryPolicy({required this.maxAttempts, required this.base, required this.cap, required this.jitter});

  Future<T> run<T>(Future<T> Function() fn, {required bool isIdempotent, Deadline? deadline}) async {
    var attempt = 0;
    while (true) {
      if (deadline != null && deadline.remaining.isNegative) throw Timeout();
      try {
        return await fn();
      } catch (e) {
        attempt++;
        // Give up on non-idempotent calls, non-retryable errors, or exhausted attempts.
        if (!isIdempotent || !_isRetryable(e) || attempt >= maxAttempts) rethrow;
        // Exponential backoff with full jitter, capped (see the retry section above).
        final backoff = _computeBackoff(attempt);
        // Never sleep past the caller's deadline.
        if (deadline != null && backoff > deadline.remaining) rethrow;
        await Future.delayed(backoff);
      }
    }
  }
}
Dedup and limiter (sketch)
import 'dart:async';
import 'dart:collection';

class Deduper {
  final _inflight = <String, Future>{}; // key -> in-flight future
  Future<T> run<T>(String key, Future<T> Function() fn) {
    // Reuse the in-flight future for an identical request.
    if (_inflight.containsKey(key)) return _inflight[key] as Future<T>;
    final fut = fn();
    _inflight[key] = fut;
    // Release the key when done; ignore() discards the side future so a
    // failure does not also surface as an unhandled async error.
    fut.whenComplete(() => _inflight.remove(key)).ignore();
    return fut;
  }
}
class Limiter {
  final int max;
  int _running = 0;
  final Queue<Completer<void>> _q = Queue();

  Limiter(this.max);

  Future<T> run<T>(Future<T> Function() fn) async {
    if (_running >= max) {
      // All slots busy: wait until a finishing task hands its slot to us.
      final c = Completer<void>();
      _q.add(c);
      await c.future;
    } else {
      _running++;
    }
    try {
      return await fn();
    } finally {
      if (_q.isNotEmpty) {
        // Hand the slot to the next waiter without releasing it, so a
        // newcomer cannot slip in between and exceed the limit.
        _q.removeFirst().complete();
      } else {
        _running--;
      }
    }
  }
}
UX patterns
- “Retry” affordances for transient failures; “Report” for unexpected permanent errors
- Inline skeletons for SWR; subtle refresh indicators on updated content
- Respect user actions: avoid full-screen blocking spinners for refreshes that can continue in the background
Adoption checklist
- Define error taxonomy and map to UI surfaces
- Establish deadlines and default timeouts
- Implement retry with backoff+jitter and idempotency keys
- Add cancellation tokens tied to UI lifecycles
- Add deduplication for identical inflight GETs
- Introduce concurrency limits and backpressure
- Add SWR caching where it pays off
- Wire tracing, logs, metrics; add dashboards
- Build fakes and chaos tests; enforce invariants in CI
Conclusion
Resilience is a design choice, not an afterthought. A disciplined networking layer yields fewer production incidents, more predictable UX, and observability that speeds root-cause analysis. Start with a clear taxonomy and budgets, then layer in retries, cancellation, dedup, and backpressure, with tests that enforce the contracts.