Skip to main content
  1. Projects/

Kafka & Data Pipeline

Author
Karthik B Hegde

Overview
#

Production-grade data pipeline handling event streaming, log aggregation, and analytics data flow across multiple services — processing real-time events and feeding them into PostgreSQL for operational data and BigQuery for analytics.

Apache Kafka PostgreSQL Redis BigQuery Docker


The Problem
#

As the application grew, several data challenges emerged:

  • Tight coupling between services — Direct API calls between services created cascading failures
  • No real-time event processing — Changes in one system took minutes to propagate to others
  • Analytics queries impacting production — Running analytics on the production database caused performance degradation
  • Log data was scattered — No centralized way to aggregate and search logs across services

Architecture
#

flowchart TB
    subgraph "Producers"
        A[Application Services]
        B[Backend APIs]
        C[Background Workers]
    end

    subgraph "Kafka Cluster"
        D[Topic: events]
        E[Topic: logs]
        F[Topic: analytics]
    end

    subgraph "Consumers"
        G[Event Processor]
        H[Log Aggregator]
        I[Analytics Pipeline]
    end

    subgraph "Storage"
        J[(PostgreSQL\nOperational DB)]
        K[(Redis\nCache Layer)]
        L[(BigQuery\nAnalytics Warehouse)]
    end

    A --> D
    B --> D & E
    C --> E & F

    D --> G --> J
    E --> H --> J
    F --> I --> L

    G --> K
    J --> |Replication| J2[(PostgreSQL\nReplica)]

Kafka Setup
#

Cluster Configuration
#

  • Multi-broker setup for fault tolerance
  • Topic partitioning designed around consumer throughput requirements
  • Retention policies tuned per topic — events retained for 7 days, logs for 30 days, analytics for 3 days (already persisted to BigQuery)

Topic Design
#

TopicPartitionsPurposeRetention
eventsBased on consumer countService-to-service event streaming7 days
logsBased on throughputCentralized log aggregation30 days
analyticsOptimized for BigQuery sinkAnalytics event stream3 days

Data Flow
#

Event Streaming
#

Services publish domain events (user actions, state changes, system events) to Kafka. Consumer services pick up events asynchronously — decoupling the producer from downstream processing.

Log Aggregation
#

Application logs are published to a dedicated Kafka topic, consumed by a log aggregator that structures and stores them for search and analysis.

Analytics Pipeline
#

High-value analytics events flow through Kafka into BigQuery, where they power dashboards and business intelligence queries — completely isolated from the production database.

PostgreSQL
#

  • Primary-replica replication for read scaling and disaster recovery
  • Automated backups with tested restore procedures
  • Connection pooling to handle high-concurrency workloads

Redis
#

  • Caching layer for frequently accessed data — reducing PostgreSQL load
  • Session management for application state
  • Rate limiting counters for API endpoints

Monitoring
#

Key Kafka metrics tracked:

  • Consumer lag — How far behind consumers are from the latest offset
  • Throughput — Messages per second across topics
  • Broker health — Disk usage, replication status, partition leadership
  • End-to-end latency — Time from produce to consume

Lessons Learned
#

  • Partition count matters more than you think — Under-partitioned topics become bottlenecks; over-partitioned topics waste resources. Start conservative, scale up based on consumer throughput
  • Consumer group management is critical — Rebalancing during deployments can cause message processing delays; use cooperative rebalancing where possible
  • Separate analytics from production early — Running BigQuery-style queries on PostgreSQL will eventually bring production down. Kafka makes the separation clean
  • Monitor consumer lag religiously — It’s the first signal that something is wrong in the pipeline
  • Redis is not a database — Use it for caching and ephemeral data, not as a primary store. Always have a fallback to PostgreSQL