Overview#
A production-grade data pipeline handling event streaming, log aggregation, and analytics data flow across multiple services: it processes real-time events and feeds them into PostgreSQL for operational data and BigQuery for analytics.
**Stack:** Apache Kafka · PostgreSQL · Redis · BigQuery · Docker
The Problem#
As the application grew, several data challenges emerged:
- Tight coupling between services — Direct API calls between services created cascading failures
- No real-time event processing — Changes in one system took minutes to propagate to others
- Analytics queries impacting production — Running analytics on the production database caused performance degradation
- Log data was scattered — No centralized way to aggregate and search logs across services
Architecture#
```mermaid
flowchart TB
    subgraph "Producers"
        A[Application Services]
        B[Backend APIs]
        C[Background Workers]
    end
    subgraph "Kafka Cluster"
        D[Topic: events]
        E[Topic: logs]
        F[Topic: analytics]
    end
    subgraph "Consumers"
        G[Event Processor]
        H[Log Aggregator]
        I[Analytics Pipeline]
    end
    subgraph "Storage"
        J[(PostgreSQL\nOperational DB)]
        K[(Redis\nCache Layer)]
        L[(BigQuery\nAnalytics Warehouse)]
    end
    A --> D
    B --> D & E
    C --> E & F
    D --> G --> J
    E --> H --> J
    F --> I --> L
    G --> K
    J --> |Replication| J2[(PostgreSQL\nReplica)]
```
Kafka Setup#
Cluster Configuration#
- Multi-broker setup for fault tolerance
- Topic partitioning designed around consumer throughput requirements
- Retention policies tuned per topic — events retained for 7 days, logs for 30 days, analytics for 3 days (already persisted to BigQuery)
Topic Design#
| Topic | Partitions | Purpose | Retention |
|---|---|---|---|
| events | Based on consumer count | Service-to-service event streaming | 7 days |
| logs | Based on throughput | Centralized log aggregation | 30 days |
| analytics | Optimized for BigQuery sink | Analytics event stream | 3 days |
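The retention policies above translate directly into per-topic Kafka config. A minimal sketch of what that looks like as data, where the partition counts are placeholder assumptions (the text only says they were sized per consumer throughput) and the retention values come from the stated 7 / 30 / 3 day policies:

```python
# Illustrative per-topic configuration. Partition counts here are
# placeholder assumptions; retention.ms values are the stated
# retention policies converted to milliseconds.
DAY_MS = 24 * 60 * 60 * 1000

TOPIC_CONFIGS = {
    "events":    {"num_partitions": 12, "config": {"retention.ms": str(7 * DAY_MS)}},
    "logs":      {"num_partitions": 24, "config": {"retention.ms": str(30 * DAY_MS)}},
    "analytics": {"num_partitions": 6,  "config": {"retention.ms": str(3 * DAY_MS)}},
}
```

In practice a dict like this would be fed to an admin client at deploy time, keeping topic settings in version control rather than applied by hand.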
Data Flow#
Event Streaming#
Services publish domain events (user actions, state changes, system events) to Kafka. Consumer services pick up events asynchronously — decoupling the producer from downstream processing.
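The produce/consume decoupling can be sketched without a broker. Here an in-memory queue stands in for the Kafka `events` topic, and the event envelope (id, type, timestamp, payload) is an assumed shape, not the project's actual schema:

```python
import json
import queue
import uuid
from datetime import datetime, timezone

# In-memory queue standing in for the Kafka "events" topic, so the
# producer/consumer decoupling is runnable without a broker.
events_topic = queue.Queue()

def publish_event(event_type, payload):
    """Producer side: serialize a domain event and hand it to the topic."""
    event = {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    events_topic.put(json.dumps(event))  # fire-and-forget for the producer

def consume_one():
    """Consumer side: pick up the next event whenever the consumer is ready."""
    return json.loads(events_topic.get())

publish_event("user.signed_up", {"user_id": 42})
event = consume_one()
print(event["type"])  # → user.signed_up
```

The point of the pattern is that `publish_event` returns immediately: the producer never waits on, or even knows about, the downstream processing.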
Log Aggregation#
Application logs are published to a dedicated Kafka topic, consumed by a log aggregator that structures and stores them for search and analysis.
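A sketch of the "structures" step, assuming a hypothetical raw log format of `timestamp LEVEL service message` (the actual format is not specified in this write-up):

```python
import re

# Hypothetical raw format: "2024-01-15T10:00:00Z ERROR payment-svc Timeout calling gateway"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)

def structure_log(raw_line):
    """Parse one raw log line into a structured record ready for storage and search."""
    match = LOG_PATTERN.match(raw_line)
    if match is None:
        # Keep unparseable lines instead of dropping them silently.
        return {"level": "UNKNOWN", "message": raw_line}
    return match.groupdict()

record = structure_log("2024-01-15T10:00:00Z ERROR payment-svc Timeout calling gateway")
print(record["level"], record["service"])  # → ERROR payment-svc
```

Once fields like `level` and `service` are first-class columns, cross-service search ("all ERRORs from payment-svc in the last hour") becomes a simple query.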
Analytics Pipeline#
High-value analytics events flow through Kafka into BigQuery, where they power dashboards and business intelligence queries — completely isolated from the production database.
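Warehouse sinks like BigQuery favor batched writes over row-at-a-time inserts, so a consumer on this path typically micro-batches. A minimal sketch of that buffering logic, with the flush callback (`flush_fn`) and batch size as illustrative assumptions:

```python
class MicroBatcher:
    """Buffer analytics events and flush in batches to cut sink round-trips."""

    def __init__(self, flush_fn, max_batch=500):
        self.flush_fn = flush_fn      # e.g. a function wrapping a BigQuery insert
        self.max_batch = max_batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
batcher = MicroBatcher(batches.append, max_batch=3)
for i in range(7):
    batcher.add({"event_id": i})
batcher.flush()                   # drain the remainder on shutdown/interval
print([len(b) for b in batches])  # → [3, 3, 1]
```

A production version would also flush on a timer so low-traffic periods don't hold events indefinitely.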
PostgreSQL#
- Primary-replica replication for read scaling and disaster recovery
- Automated backups with tested restore procedures
- Connection pooling to handle high-concurrency workloads
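The connection-pooling point above can be sketched generically: a bounded pool hands out pre-opened connections and blocks new borrowers when all are in use, capping the concurrent load on PostgreSQL. This is a minimal illustration of the pattern, not the project's actual pooler:

```python
import queue

class ConnectionPool:
    """Minimal pool sketch: connections are created up front and checked
    in/out, bounding how many are open against the database at once."""

    def __init__(self, connect_fn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect_fn())

    def acquire(self, timeout=5.0):
        # Blocks (up to timeout) when every connection is checked out.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(lambda: object(), size=2)  # stand-in for a real connect()
conn = pool.acquire()
pool.release(conn)
print(pool._pool.qsize())  # → 2
```

Real deployments usually reach for an existing pooler (a client-side library or a proxy like PgBouncer) rather than hand-rolling this, but the bounded-checkout idea is the same.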
Redis#
- Caching layer for frequently accessed data — reducing PostgreSQL load
- Session management for application state
- Rate limiting counters for API endpoints
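The rate-limiting-counter use case above is commonly built on Redis `INCR` with an expiring per-window key. A runnable sketch of that fixed-window logic, with a plain dict standing in for Redis so no server is needed; the key shape and limits are illustrative assumptions:

```python
import time

# Fixed-window rate limiter. A dict stands in for Redis: the tuple key
# plays the role of a Redis key like f"rl:{client_id}:{window}" that
# would be INCR'd and given a TTL of one window.
counters = {}

def allow_request(client_id, limit=5, window_seconds=60, now=None):
    now = time.time() if now is None else now
    window = int(now // window_seconds)       # which fixed window we are in
    key = (client_id, window)
    counters[key] = counters.get(key, 0) + 1  # Redis: INCR + EXPIRE
    return counters[key] <= limit

results = [allow_request("api-key-1", limit=3, now=100.0) for _ in range(5)]
print(results)  # → [True, True, True, False, False]
```

Fixed windows allow short bursts at window boundaries; sliding-window or token-bucket variants smooth that out at the cost of slightly more state.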
Monitoring#
Key Kafka metrics tracked:
- Consumer lag — How far behind consumers are from the latest offset
- Throughput — Messages per second across topics
- Broker health — Disk usage, replication status, partition leadership
- End-to-end latency — Time from produce to consume
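Consumer lag, the first metric above, is just arithmetic over offsets: per partition, the log-end offset minus the consumer group's committed offset. A sketch with hypothetical offset values:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: how far the committed offset trails the log end.

    A partition with no committed offset yet is treated as fully behind.
    """
    return {
        partition: latest_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in latest_offsets
    }

# Hypothetical snapshot for a 3-partition topic.
latest = {0: 1000, 1: 850, 2: 920}
committed = {0: 990, 1: 850, 2: 700}
print(consumer_lag(latest, committed))  # → {0: 10, 1: 0, 2: 220}
```

In this snapshot partition 2 is the one to investigate: steady lag growth on a single partition often points at a hot key or a stuck consumer rather than overall under-capacity.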
Lessons Learned#
- Partition count matters more than you think — Under-partitioned topics become bottlenecks; over-partitioned topics waste resources. Start conservative, scale up based on consumer throughput
- Consumer group management is critical — Rebalancing during deployments can cause message processing delays; use cooperative rebalancing where possible
- Separate analytics from production early — Running BigQuery-style queries on PostgreSQL will eventually bring production down. Kafka makes the separation clean
- Monitor consumer lag religiously — It’s the first signal that something is wrong in the pipeline
- Redis is not a database — Use it for caching and ephemeral data, not as a primary store. Always have a fallback to PostgreSQL