Robust Networking in Flutter: Errors, Retries, Cancellation, and Backpressure

3/22/2025

Written by Zayin Krige

Flutter developer and founder of Apex Technology

Why this matters

  • Mobile networks are hostile: variable latency, captive portals, TLS interceptors, and background restrictions.
  • A high-quality app treats the network as unreliable and designs a resilient data layer that fails gracefully, retries intelligently, and remains observable in production.

Objectives

  • Predictable error classification with app-friendly messages
  • Sensible defaults for timeouts, retries, and cancellation
  • Request deduplication and backpressure to prevent stampedes
  • Clear observability: traces, metrics, structured logs for diagnosis

Contents

  • Error taxonomy and surfacing to UI
  • Timeouts: connect vs. read vs. whole call budgets
  • Retry strategy, jitter, and idempotence
  • Cancellation and lifecycles (widgets, navigation, background)
  • Request coalescing/deduplication
  • Backpressure and concurrency limits
  • Caching and staleness semantics
  • Observability (traces, logs, metrics)
  • Testing strategy (fakes, chaos, property tests)
  • Reference design and code outline

Error taxonomy

Categorize errors first; everything else follows. A minimal Dart sketch of the categories appears after the list.

  • Client/Device
    • No connectivity, DNS failure, TLS handshake, certificate pinning failure
  • Transient Server/Network
    • 429 Too Many Requests, 5xx server errors, timeouts, reset by peer
  • Permanent Client Errors
    • 400/422 validation, authentication/authorization (401/403)
  • Business Errors
    • Domain-level constraints, quota exceeded, locked resource
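
One way to make the taxonomy concrete is a sealed class hierarchy, so the UI mapping becomes an exhaustive switch. A minimal sketch; the names are illustrative, not a published API:

/// Illustrative taxonomy (a sketch, not a library type).
sealed class AppNetworkError implements Exception {}

/// No connectivity, DNS failure, TLS handshake, pinning failure.
final class DeviceError extends AppNetworkError {}

/// 429, 5xx, timeouts, connection reset: safe to retry with backoff.
final class TransientError extends AppNetworkError {}

/// 400/422 validation or 401/403 auth: retrying will not help.
final class PermanentClientError extends AppNetworkError {
  PermanentClientError(this.statusCode);
  final int statusCode;
}

/// Domain-level failures: quota exceeded, locked resource, etc.
final class BusinessError extends AppNetworkError {
  BusinessError(this.code);
  final String code;
}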

UI surfacing

  • Client/Transient: “Temporary network issue. Retrying…”
  • Permanent Client: show actionable validation/auth error
  • Business: error banner with context; consider soft retry/recovery guidance
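
With a sealed hierarchy like the sketch above, this mapping is a switch expression the compiler checks for exhaustiveness:

String uiMessage(AppNetworkError e) => switch (e) {
      DeviceError() || TransientError() => 'Temporary network issue. Retrying…',
      PermanentClientError(statusCode: 401 || 403) => 'Your session has expired. Please sign in again.',
      PermanentClientError() => 'Please check your input and try again.',
      BusinessError(:final code) => 'Request blocked: $code',
    };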

Timeouts and budgets

  • Separate budgets:
    • Connect timeout: 2–5s
    • Read timeout: 10–30s (per chunk/response)
    • Whole-call deadline: 15–60s (use CancellationToken/Deadline)
  • Budget propagation:
    • Compose budgets downstream (e.g., list → item calls inherit the parent deadline minus elapsed time); see the Deadline sketch after this list
  • Alert on timeouts per endpoint to catch regressions
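
One minimal way to implement budget propagation; Deadline here is a sketch of this article's design, not a Flutter or Dart SDK type, and a production version would also expose a cancellation token:

/// A whole-call budget that child calls inherit minus elapsed time.
class Deadline {
  Deadline(Duration budget) : _expiry = DateTime.now().add(budget);
  final DateTime _expiry;

  Duration get remaining {
    final left = _expiry.difference(DateTime.now());
    return left.isNegative ? Duration.zero : left;
  }

  bool get isExpired => remaining == Duration.zero;

  /// A child call gets whatever is left, optionally capped.
  Deadline child({Duration? cap}) {
    final r = remaining;
    return Deadline(cap != null && cap < r ? cap : r);
  }
}

A list screen might create Deadline(const Duration(seconds: 30)) and hand deadline.child(cap: const Duration(seconds: 10)) to each item fetch, so no single item can exceed the page's overall budget.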

Retry strategy and jitter

  • Retry safe methods (GET, HEAD) by default. PUT and DELETE are idempotent by HTTP semantics, but retry them, and POST/PATCH, only with explicit allowance and idempotency keys.
  • Exponential backoff with full jitter:
    • Base = 250ms, factor 2, jitter [0, base) for first few attempts
    • Cap (max backoff) e.g., 5–10s
    • Max attempts: 3–5 (context-dependent)
  • Honor Retry-After headers and server-provided pace (sketched after this list)
  • Circuit-breaking:
    • Track rolling error rate per host/route; short-circuit with immediate fallback when above threshold
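
Retry-After arrives either as delta-seconds ("120") or as an HTTP date. A minimal sketch of honoring it when scheduling the next attempt; nextDelay is a hypothetical helper, and HttpDate comes from dart:io (not available on web):

import 'dart:io' show HttpDate;

/// Next retry delay: never sooner than the server asks via Retry-After.
Duration nextDelay(Duration jitteredBackoff, String? retryAfter) {
  if (retryAfter == null) return jitteredBackoff;
  final seconds = int.tryParse(retryAfter.trim());
  if (seconds != null) {
    final server = Duration(seconds: seconds);
    return server > jitteredBackoff ? server : jitteredBackoff;
  }
  try {
    // HTTP-date form: wait until the given instant if it is further out.
    final until = HttpDate.parse(retryAfter).difference(DateTime.now());
    return until > jitteredBackoff ? until : jitteredBackoff;
  } catch (_) {
    return jitteredBackoff; // unparseable header: keep our own pace
  }
}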

Cancellation and lifecycles

  • Tie requests to UI lifecycles:
    • Cancel when widget is disposed, route is popped, or search query changes
  • Prevent “late write” of stale responses:
    • Attach a generation/version number to callbacks and ignore responses from outdated generations (sketched below)
  • Use autoDispose scopes (Riverpod) or cancellable tokens (BLoC/controller)
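
A minimal generation-counter sketch; in Riverpod an autoDispose rebuild achieves the same effect, while here the bookkeeping is manual and the Search typedef is an assumption of the sketch:

typedef Search = Future<List<String>> Function(String query);

class SearchController {
  SearchController(this._search);
  final Search _search;

  int _generation = 0;
  List<String> results = const [];

  Future<void> onQueryChanged(String query) async {
    final gen = ++_generation;      // stamp this request
    final fresh = await _search(query);
    if (gen != _generation) return; // stale response: a newer query won
    results = fresh;                // safe to publish to the UI
  }
}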

Request coalescing and deduplication

  • Coalesce identical in-flight GETs (same method+URL+headers+body hash) and fan-out results to all subscribers.
  • Useful for “search-as-you-type” (coalesce per debounced query) and details pages reached from multiple entry points; a debouncer sketch follows.
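
Coalescing pairs naturally with debouncing for search-as-you-type. A minimal Timer-based debouncer sketch (not a package API):

import 'dart:async';

/// Runs the action only after [delay] of quiet; each call resets the timer.
class Debouncer {
  Debouncer(this.delay);
  final Duration delay;
  Timer? _timer;

  void call(void Function() action) {
    _timer?.cancel();
    _timer = Timer(delay, action);
  }

  void dispose() => _timer?.cancel();
}

// Usage with the controller sketched earlier:
//   final debounce = Debouncer(const Duration(milliseconds: 300));
//   onChanged: (q) => debounce(() => controller.onQueryChanged(q));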

Backpressure and concurrency limits

  • Limit concurrent in-flight requests per host and globally (e.g., 6 per host, 24 global).
  • Queue requests beyond the limit; prioritize critical ones (foreground UI > background prefetch).
  • Combine with debounce for chattier flows (e.g., filters, search).
  • Avoid stampedes by caching negative results briefly and coalescing.

Caching and staleness

  • Respect HTTP caching headers (ETag, Cache-Control, Last-Modified)
  • Serve stale-while-revalidate (SWR) in the UI (sketched after this list):
    • Show cached data immediately
    • Kick off background refresh; update UI when fresh arrives
  • Define staleness budgets per domain (e.g., 30s for news list, 24h for static catalog)
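
A minimal SWR read expressed as a stream; the readCache/fetch/writeCache shapes are assumptions of the sketch, and error handling during revalidation is elided:

import 'dart:async';

/// Emits the cached value immediately (if present), then the fresh one.
Stream<T> swr<T>({
  required Future<T?> Function() readCache,
  required Future<T> Function() fetch,
  required Future<void> Function(T value) writeCache,
}) async* {
  final cached = await readCache();
  if (cached != null) yield cached; // instant paint from cache
  final fresh = await fetch();      // background revalidation
  await writeCache(fresh);
  yield fresh;                      // UI updates when fresh data lands
}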

Observability

  • Tracing:
    • Trace ID per request (correlate logs across layers)
    • Span for DNS, TLS handshake, connect, request write, response read
  • Metrics:
    • p50/p95 latency per endpoint
    • Error rate per class (timeout, 5xx, 4xx, network)
    • Retry counts and final outcomes
  • Logs:
    • Structured (JSON); include endpoint, status, attempt, backoff, cacheHit, dedupHit (example below)
    • Redact secrets; sample at rate for success to avoid volume
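
The log record is just a flat map; a sketch of one per-attempt entry, with field names following the bullet above and print standing in for your log pipeline:

import 'dart:convert';

/// One flat JSON object per HTTP attempt.
void logHttpAttempt({
  required String traceId,
  required String endpoint, // path template, never a full URL with secrets
  required int attempt,
  required Duration backoff,
  int? status,
  bool cacheHit = false,
  bool dedupHit = false,
}) {
  print(jsonEncode({
    'event': 'http_attempt',
    'traceId': traceId,
    'endpoint': endpoint,
    'status': status,
    'attempt': attempt,
    'backoffMs': backoff.inMilliseconds,
    'cacheHit': cacheHit,
    'dedupHit': dedupHit,
  }));
}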

Testing strategy

  • Unit tests:
    • Retry/backoff schedule correctness with a fixed-seed RNG (example after this list)
    • Idempotent vs. non-idempotent behaviors
    • Deduplication map lifecycle and cancellation
  • Integration tests:
    • Fake server with injected behaviors (429 with Retry-After, 5xx flaps, slow streams)
  • Chaos/Property tests:
    • Randomized interleavings of timeouts, cancellations, and retries to verify invariants (no memory leaks, no double callbacks)
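
A unit-test sketch of the backoff schedule using package:test and a seeded Random. The reference design keeps _computeBackoff private, so this tests a standalone copy of the same full-jitter formula (an assumption of the sketch):

import 'dart:math';
import 'package:test/test.dart';

// Standalone copy of the full-jitter formula from the reference design.
Duration backoff(Random rng, int attempt, Duration base, Duration cap) {
  final exp = base.inMilliseconds * (1 << (attempt - 1));
  final ceiling = min(cap.inMilliseconds, exp);
  return Duration(milliseconds: rng.nextInt(max(1, ceiling)));
}

void main() {
  test('full jitter stays within [0, min(cap, base * 2^(n-1)))', () {
    final rng = Random(42); // fixed seed -> reproducible schedule
    const base = Duration(milliseconds: 250);
    const cap = Duration(seconds: 10);
    for (var attempt = 1; attempt <= 8; attempt++) {
      final ceiling =
          min(cap.inMilliseconds, base.inMilliseconds * (1 << (attempt - 1)));
      final d = backoff(rng, attempt, base, cap);
      expect(d.inMilliseconds, inInclusiveRange(0, ceiling - 1));
    }
  });
}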

Reference design (pseudo-code)

Client Facade

class HttpClientFacade {
  final HttpClient raw;    // underlying transport
  final RetryPolicy retry; // backoff + jitter + idempotence rules
  final Budget budget;     // default whole-call deadline factory
  final Deduper deduper;   // coalesces identical in-flight requests
  final Limiter limiter;   // per-host/global concurrency cap
  final Tracer tracer;     // span/trace emission

  HttpClientFacade({
    required this.raw,
    required this.retry,
    required this.budget,
    required this.deduper,
    required this.limiter,
    required this.tracer,
  });

  Future<HttpResponse> get(Uri uri, {Headers? headers, Deadline? deadline}) {
    return _run(HttpRequest.get(uri, headers: headers), deadline: deadline);
  }

  Future<HttpResponse> _run(HttpRequest req, {Deadline? deadline}) async {
    // Acquire a concurrency slot first; retries inside one call then
    // occupy a single slot rather than multiplying in-flight work.
    return limiter.run(() async {
      final span = tracer.startSpan('http', attributes: {'url': req.uri.toString()});
      try {
        // One whole-call deadline, shared by every retry attempt.
        final call = deadline ?? budget.child();
        final key = deduper.keyOf(req);
        // Identical in-flight requests share one future (fan-out to subscribers).
        return await deduper.run(key, () async {
          return await retry.run(() async {
            return await raw.send(req, timeout: call.remaining, cancelToken: call.token);
          }, isIdempotent: req.isIdempotent, deadline: call);
        });
      } on Cancelled { // app-defined cancellation exception
        span.setStatus('cancelled');
        rethrow;
      } on Timeout {   // app-defined deadline exception
        span.setStatus('deadline_exceeded');
        rethrow;
      } finally {
        span.end();
      }
    });
  }
}

Retry policy

// Requires dart:math; Deadline, Timeout, TransientHttpError are app-defined.
class RetryPolicy {
  final int maxAttempts; // total tries, including the first
  final Duration base;
  final Duration cap;
  final Random _rng;

  RetryPolicy({
    this.maxAttempts = 4,
    this.base = const Duration(milliseconds: 250),
    this.cap = const Duration(seconds: 10),
    Random? rng, // inject a seeded Random in tests
  }) : _rng = rng ?? Random();

  Future<T> run<T>(Future<T> Function() fn,
      {required bool isIdempotent, Deadline? deadline}) async {
    var attempt = 0;
    while (true) {
      attempt++;
      final remaining = deadline?.remaining;
      if (remaining != null && remaining <= Duration.zero) throw Timeout();
      try {
        return await fn();
      } catch (e) {
        if (!isIdempotent || !_isRetryable(e) || attempt >= maxAttempts) rethrow;
        await Future.delayed(_computeBackoff(attempt));
      }
    }
  }

  // Full jitter: uniform in [0, min(cap, base * 2^(attempt - 1))).
  Duration _computeBackoff(int attempt) {
    final exp = base.inMilliseconds * (1 << (attempt - 1));
    final ceiling = min(cap.inMilliseconds, exp);
    return Duration(milliseconds: _rng.nextInt(max(1, ceiling)));
  }

  bool _isRetryable(Object e) =>
      e is Timeout || e is TransientHttpError; // 429/5xx, resets, timeouts
}

Dedup and limiter (sketch)

import 'dart:async';
import 'dart:collection';

class Deduper {
  final _inflight = <String, Future>{}; // key -> shared in-flight future

  // key = hash of method + URL + headers + body (see coalescing above)
  Future<T> run<T>(String key, Future<T> Function() fn) {
    final existing = _inflight[key];
    if (existing != null) return existing as Future<T>;
    final fut = fn();
    _inflight[key] = fut;
    // ignore() stops the whenComplete-derived future from raising an
    // unhandled async error; callers still see failures via `fut`.
    fut.whenComplete(() => _inflight.remove(key)).ignore();
    return fut;
  }
}

class Limiter {
  Limiter({required this.max});
  final int max;
  int _running = 0;
  final Queue<Completer<void>> _q = Queue();

  Future<T> run<T>(Future<T> Function() fn) async {
    if (_running >= max) {
      final c = Completer<void>();
      _q.add(c);
      await c.future; // a finishing task handed us its slot; don't increment
    } else {
      _running++;
    }
    try {
      return await fn();
    } finally {
      if (_q.isNotEmpty) {
        // Transfer our slot directly to the next waiter, keeping _running
        // unchanged so a newcomer can't jump the queue.
        _q.removeFirst().complete();
      } else {
        _running--;
      }
    }
  }
}
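
Wiring the pieces together might look like the following; the Budget and Tracer constructor shapes, and the example URL, are assumptions of this sketch:

Future<void> main() async {
  final client = HttpClientFacade(
    raw: HttpClient(), // your transport of choice
    retry: RetryPolicy(maxAttempts: 4),
    budget: Budget(const Duration(seconds: 30)), // hypothetical shape
    deduper: Deduper(),
    limiter: Limiter(max: 6),
    tracer: Tracer(),                            // hypothetical shape
  );

  final res = await client.get(
    Uri.parse('https://api.example.com/items'),
    deadline: Deadline(const Duration(seconds: 15)),
  );
  print(res);
}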

UX patterns

  • “Retry” affordances for transient failures; “Report” for unexpected permanent errors
  • Inline skeletons for SWR; subtle refresh indicators on updated content
  • Respect user actions: avoid full-screen blocking spinners for refreshes that could run in the background

Adoption checklist

  • Define error taxonomy and map to UI surfaces
  • Establish deadlines and default timeouts
  • Implement retry with backoff+jitter and idempotency keys
  • Add cancellation tokens tied to UI lifecycles
  • Add deduplication for identical inflight GETs
  • Introduce concurrency limits and backpressure
  • Add SWR caching where it pays off
  • Wire tracing, logs, metrics; add dashboards
  • Build fakes and chaos tests; enforce invariants in CI

Conclusion

Resilience is a design choice, not an afterthought. A disciplined networking layer yields fewer production incidents, more predictable UX, and observability that speeds root-cause analysis. Start with a clear taxonomy and budgets, then layer in retries, cancellation, dedup, and backpressure, with tests that enforce the contracts.