CDC / ETL solution for Oracle, MongoDB, and Elasticsearch, powered by Kafka and Confluent
As the service grows, we need to distribute data across multiple domains to achieve performance and availability (CQRS, etc.), and that process is called ETL (Extract, Transform, Load).
There are several considerations:
- Support for different types of data sources and data sinks with low maintenance effort.
- Consistency: no duplication or omission.
- SLA: close to real time, less than ? seconds for sync.
- Data processing capability: data filtering, data transformations and conversions, data enrichment with joins, data manipulation with scalar functions, and data analysis with stateful processing, aggregations, and windowing operations.
Surely this could be developed in-house, but it would be hard to maintain, and since we should focus more on the business domain, we decided instead to investigate a more mature, already-invented wheel: the Confluent Platform.



Extract (CDC)
Oracle CDC
Query-based CDC
- JDBC Connector with Flashback: Kafka Connect polls the database through the JDBC Connector for new or changed data based on an incrementing ID column and/or an update timestamp, and by enabling Flashback Database on Oracle it can use the Flashback logs to detect every event that happens to a table, including 'Delete' operations. However, it requires additional space for the log files and has an impact on performance, and our organization retains Flashback logs for only one day, which is quite a short lifetime. FYI, Flashback Best Practices. A minimal connector config sketch is shown below.
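To make the polling option concrete, here is a minimal sketch of registering Confluent's JDBC source connector through the Kafka Connect REST API. The table name, ORDER_ID / UPDATED_AT columns, hosts, and credentials are placeholders, not our actual schema:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // timestamp+incrementing mode: new rows are found via ORDER_ID,
        // updated rows via UPDATED_AT (both columns are hypothetical here).
        // Java 15+ text block, Java 11+ HttpClient.
        String config = """
            {
              "name": "oracle-orders-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCL",
                "connection.user": "etl_user",
                "connection.password": "********",
                "mode": "timestamp+incrementing",
                "incrementing.column.name": "ORDER_ID",
                "timestamp.column.name": "UPDATED_AT",
                "table.whitelist": "ORDERS",
                "topic.prefix": "oracle-",
                "poll.interval.ms": "5000"
              }
            }
            """;
        // Register the connector with the Kafka Connect REST API
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://connect-host:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```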
Log-based CDC
- Oracle GoldenGate: there are two options to integrate with Kafka, one using the official GoldenGate Kafka Handler and the other using the Kafka Connect Handler; they differ in functionality.
Others
- Trigger-based CDC
- Database Change Notification-based CDC
I started to focus on the OGG approach since, first, our organization already uses Oracle with OGG anyway, so there seems to be no cost issue; even other proposals for an open-source database to replace Oracle were declined. Also, the other approaches are Oracle-specific solutions that could hardly be maintained by our own hands.
The OGG Kafka Handler has two modes: Transaction Mode, which serializes all operations of a transaction together, and Operation Mode, which serializes each operation into its own message. Transaction Mode can keep the relations between the generated data.

However, the problem with Transaction Mode is that it cannot publish to multiple topics.
Mongo CDC
- MongoDB Change Streams: Mongo Kafka Connect source connectors such as debezium-connector-mongodb and kafka-connect-mongodb.
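Under the hood these connectors tail the change stream that the MongoDB driver exposes directly. A minimal sketch with the Java sync driver, just to show what the source connectors consume; the shop/orders namespace and connection string are placeholders, and change streams require a replica set:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

public class ChangeStreamTail {
    public static void main(String[] args) {
        // Change streams only work against a replica set or sharded cluster
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017/?replicaSet=rs0")) {
            MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");
            // Blocks and prints every insert/update/replace/delete on the collection
            for (ChangeStreamDocument<Document> change : orders.watch()) {
                System.out.println(change.getOperationType() + " : " + change.getFullDocument());
            }
        }
    }
}
```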
Transform
Transformation on the fly:
KSQL vs. developing our own Kafka client with the KStreams API,
simplicity vs. flexibility.
KSQL and Kafka Streams are, simply put, Kafka clients that act as both consumer and producer, and KSQL is essentially KStreams++, adding SQL-like features on top, so it can reduce development time.
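To make the trade-off concrete, here is a minimal sketch of the "own client" option with the Kafka Streams API, filtering an assumed ORDERS topic of JSON strings into a PAID_ORDERS topic; in KSQL the same thing would be roughly a one-liner like CREATE STREAM PAID_ORDERS AS SELECT * FROM ORDERS WHERE STATUS = 'PAID'. Topic names and the JSON layout are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PaidOrdersFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "paid-orders-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("ORDERS");
        // Naive substring check just for illustration; a real topology would use a proper serde
        orders.filter((key, value) -> value != null && value.contains("\"status\":\"PAID\""))
              .to("PAID_ORDERS");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```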
There is a problem with using KSQL/KStreams for denormalization, though, once we think about models with multiple relationships. It is hard to manage in a KTable, since the window range cannot be estimated when updates may occur in many related tables. The solution? Make a KTable for insertions only and a KStream for each single table? Or rely on the OGG Kafka Handler's Transaction Mode, which produces all captured table changes together; however, we need to check the size of the payload, whether tables can be whitelisted or blacklisted, and whether running multiple OGG Kafka Handlers is possible. A stream-table join sketch for the simple single-parent case is shown below.
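For that simple case, a non-windowed KStream-KTable join can do the denormalization. A sketch assuming CDC topics named ORDERS and CUSTOMERS that are both keyed by customer id; the topic names, keys, and string-concatenation "join" are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class OrderEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Fact stream: one record per order change, keyed by customer id
        KStream<String, String> orders = builder.stream("ORDERS");
        // Dimension table: latest state per customer, keyed by customer id
        KTable<String, String> customers = builder.table("CUSTOMERS");

        // Non-windowed stream-table join: each order is enriched with the
        // customer state known at the moment the order record is processed
        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("ORDERS_ENRICHED");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Once several parent tables can change independently, this pattern multiplies into one KTable per table plus repartitioning, which is exactly the maintenance concern raised above.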
Load
Mongo Kafka Connect sink: kafka-connect-mongodb has two write model strategies, ReplaceOneModel or UpdateOneModel; the difference is sketched below.
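A sketch of what the two strategies mean at the MongoDB Java driver level (driver 3.7+ assumed; the shop/orders namespace and field names are placeholders). ReplaceOneModel swaps the entire document, while UpdateOneModel patches only the given fields:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.List;

public class WriteModelDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");

            // ReplaceOneModel: upsert the whole document, last write wins
            ReplaceOneModel<Document> replace = new ReplaceOneModel<>(
                    Filters.eq("_id", "order-1"),
                    new Document("_id", "order-1").append("status", "PAID").append("amount", 42),
                    new ReplaceOptions().upsert(true));

            // UpdateOneModel: patch only the changed fields, other fields survive
            UpdateOneModel<Document> update = new UpdateOneModel<>(
                    Filters.eq("_id", "order-1"),
                    Updates.set("status", "SHIPPED"),
                    new UpdateOptions().upsert(true));

            orders.bulkWrite(List.of(replace, update));
        }
    }
}
```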
Elasticsearch Kafka Connect sink: kafka-connect-elasticsearch; a minimal config sketch follows.
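A config sketch for the Elasticsearch sink, registered through the Connect REST API the same way as the JDBC source above; the topic name, Elasticsearch host, and key/schema handling flags are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterElasticSink {
    public static void main(String[] args) throws Exception {
        // Index the enriched orders topic into Elasticsearch, using record keys as document ids
        String config = """
            {
              "name": "elastic-orders-sink",
              "config": {
                "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                "connection.url": "http://elastic-host:9200",
                "topics": "ORDERS_ENRICHED",
                "type.name": "_doc",
                "key.ignore": "false",
                "schema.ignore": "true"
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://connect-host:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```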



