Streamlining Streaming Data Ingestion: A Guide to Integrating AWS MSK with Adobe Experience Platform Using the Kafka Sink Connector

6/19/24

 

Introduction

In the fast-paced world of data-driven insights, Adobe Experience Platform (AEP) stands out as a powerful tool for consolidating and analyzing diverse fragments of customer data in real time. AEP facilitates the creation of coordinated, coherent, and personalized experiences by generating a Real-Time Customer Profile for each of your unique customers. Streaming ingestion is instrumental in constructing these profiles because it lets you deliver Profile data into the Data Lake with minimal latency from streaming sources such as Amazon Managed Streaming for Apache Kafka (AWS MSK).

In this article, I want to share my experience integrating MSK with AEP using the Kafka Sink Connector. Follow along as I walk through the challenges you may come across along the way. Please note that this is not a step-by-step guide; refer to the product documentation for that. The intent here is to share the most important pieces of the puzzle, which will hopefully help you achieve a smooth and efficient data flow from your MSK cluster to AEP.

Apache Kafka vs AWS MSK?

Competition serves as a catalyst for innovation, much like the dynamic world of smartphones, where various brands and models continually push the boundaries of technology. Similarly, within the realm of Kafka-based businesses, a diverse ecosystem of companies competes for market dominance. In this analogy, Confluent, Cloudera, Red Hat, and Amazon MSK are akin to the leading smartphone manufacturers. Each vendor uniquely contributes to the Kafka ecosystem, with Confluent standing out for its dedicated focus on event streaming. However, it’s crucial to recognize that each vendor has its own strengths and weaknesses.

Moreover, while Confluent may excel in the Kafka-centric space due to its specialized focus, it’s important to acknowledge that AWS, with its comprehensive cloud infrastructure and strategic vision, holds a distinct advantage in the broader cloud computing landscape. While Kafka offers powerful real-time data processing capabilities, Amazon MSK provides a managed service layer on top of Kafka, relieving organizations of the burden of infrastructure management. This managed service model, along with seamless integration with AWS ecosystem services, makes MSK an appealing choice for enterprises seeking to streamline operations and scale effortlessly.

Kafka stands as a formidable tool, albeit one not without its challenges. Scaling clusters to match varying loads, optimizing costs, and maintaining effective load balancing can all present significant difficulties. On an Apache Kafka cluster, addressing these issues requires meticulous attention to detail. For instance, in our case, scaling a Kafka cluster up and down demanded extensive coordination with the support team to ensure seamless execution without downtime or data loss.

However, with an MSK cluster, the process was considerably smoother, highlighting the advantages of leveraging managed services in streamlining complex data workflows.

Why is connecting MSK to AEP more complex than connecting Apache Kafka?

Connecting MSK to AEP presents a more intricate process compared to linking an Apache Kafka cluster. This complexity arises from the necessity for meticulous configuration and synchronization across various services and platforms. Integrating MSK with AEP demands adherence to precise requirements and tailored setups for both MSK and AEP.

[Image 1]

 

When connecting a Kafka or MSK cluster to AEP, an AEP sink connector JAR is essential. This JAR is the utility that moves data from Kafka to AEP. For on-prem or standalone Apache Kafka clusters, integration is relatively straightforward if you run your own Kafka Connect cluster: you simply drop in the AEP Streaming Connector, which is packaged as an uber JAR Kafka Connect plugin. Follow the Kafka Connect documentation on installing plugins; once the plugin is installed, you can run streaming connector instances that send data to Adobe.
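To make the self-managed case concrete, here is a minimal sketch of registering a connector instance through the Kafka Connect REST API (a POST to /connectors). The connector class name and the "aep.endpoint" key are assumptions based on the connector's public README, and the host, topic, and inlet URL are placeholders; check all of them against the JAR version you deploy.

```python
import json
import urllib.request

# Assumed values: replace with your own environment.
CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint (assumed host/port)
AEP_INLET_URL = "https://dcs.adobedc.net/collection/<DATA_INLET_ID>"  # placeholder inlet URL


def build_connector_payload(name: str, topic: str, inlet_url: str) -> dict:
    """Build the JSON body that Kafka Connect expects for a new connector."""
    return {
        "name": name,
        "config": {
            # Class name and aep.* key are assumptions taken from the connector README.
            "connector.class": "com.adobe.platform.streaming.sink.impl.AEPSinkConnector",
            "topics": topic,
            "tasks.max": "1",
            "key.converter.schemas.enable": "false",
            "value.converter.schemas.enable": "false",
            "aep.endpoint": inlet_url,
        },
    }


def build_create_request(payload: dict, connect_url: str = CONNECT_URL) -> urllib.request.Request:
    """Wrap the payload in a POST request; nothing is sent until urlopen is called."""
    return urllib.request.Request(
        connect_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    payload = build_connector_payload("aep-sink-connector", "connect-test", AEP_INLET_URL)
    request = build_create_request(payload)
    # Uncomment to submit the connector to a running Connect worker:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status, response.read().decode())
    print(json.dumps(payload, indent=2))
```

Keeping the payload builder separate from the network call makes it easy to review the configuration before anything is submitted to the Connect cluster.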

Conversely, configuring an MSK cluster for AEP integration involves a multitude of configurations before data can seamlessly flow into AEP. Setting up an MSK cluster entails meticulous attention to detail across various aspects, including establishing VPC in Networking, configuring Security groups, defining IAM roles in Permissions, and crafting custom plugins to leverage the AEP-sink-connector JAR. This intricate setup underscores the complexity inherent in connecting MSK to AEP, requiring comprehensive planning and execution to ensure smooth data streaming.
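The pieces listed above (subnets, security groups, IAM role, custom plugin) all come together in two MSK Connect API calls. The following is a rough sketch using the boto3 "kafkaconnect" client's create_custom_plugin, describe_custom_plugin, and create_connector operations. Every ARN, ID, and name is a placeholder, and the Kafka Connect version, capacity, and authentication settings are assumptions to adapt to your own cluster.

```python
import time


def custom_plugin_request(name: str, bucket_arn: str, jar_key: str) -> dict:
    """Arguments for create_custom_plugin: the connector JAR uploaded to S3."""
    return {
        "name": name,
        "contentType": "JAR",
        "location": {"s3Location": {"bucketArn": bucket_arn, "fileKey": jar_key}},
    }


def connector_request(name, plugin_arn, plugin_revision, bootstrap_servers,
                      subnet_ids, security_group_ids, role_arn, connector_config,
                      auth_type="NONE", encryption_type="TLS") -> dict:
    """Arguments for create_connector, tying plugin, VPC, and IAM role together."""
    return {
        "connectorName": name,
        "kafkaConnectVersion": "2.7.1",  # assumption: pick a version MSK Connect supports
        "capacity": {"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
        "connectorConfiguration": connector_config,  # string-to-string map
        "kafkaCluster": {
            "apacheKafkaCluster": {
                "bootstrapServers": bootstrap_servers,
                "vpc": {"subnets": subnet_ids, "securityGroups": security_group_ids},
            }
        },
        "kafkaClusterClientAuthentication": {"authenticationType": auth_type},
        "kafkaClusterEncryptionInTransit": {"encryptionType": encryption_type},
        "plugins": [{"customPlugin": {"customPluginArn": plugin_arn,
                                      "revision": plugin_revision}}],
        "serviceExecutionRoleArn": role_arn,
    }


def deploy(client, plugin_args: dict, connector_kwargs: dict, poll_seconds: float = 10) -> str:
    """Create the plugin, wait until it is ACTIVE, then create the connector."""
    plugin = client.create_custom_plugin(**plugin_args)
    plugin_arn = plugin["customPluginArn"]
    while True:
        state = client.describe_custom_plugin(customPluginArn=plugin_arn)["customPluginState"]
        if state == "ACTIVE":
            break
        if state == "CREATE_FAILED":
            raise RuntimeError("custom plugin creation failed")
        time.sleep(poll_seconds)
    request = connector_request(plugin_arn=plugin_arn,
                                plugin_revision=plugin["revision"],
                                **connector_kwargs)
    return client.create_connector(**request)["connectorArn"]


# Example usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   deploy(boto3.client("kafkaconnect"),
#          custom_plugin_request("aep-sink-plugin", "arn:aws:s3:::<bucket>", "<connector>.jar"),
#          dict(name="aep-sink", bootstrap_servers="<broker>:9094",
#               subnet_ids=["<private-subnet-id>"], security_group_ids=["<sg-id>"],
#               role_arn="<service-execution-role-arn>",
#               connector_config={"topics": "<topic>", "tasks.max": "1"}))
```

Passing the client in as a parameter keeps the request-building logic independent of AWS, so the configuration can be inspected or unit-tested before a real deployment.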

Challenges Faced and Solutions

1. Creating a compatible subnet

The challenge arose when our Kafka Connect workers could not reach the internet, which resulted in timeouts. To address this issue, we explored two potential solutions:

Option 1: Use public subnets

The first option involves finding the Elastic Network Interface (ENI) of your Kafka Connect deployment and associating it with an Elastic IP address. However, this approach presented some challenges:

  • Since MSK Connect is a managed service, it creates the ENI dynamically, which makes the ENI hard to locate in the console.
  • It was unclear which Elastic IP address should be associated with the ENI.

Option 2: Create a private subnet for Kafka Connect and use a self-referencing security group

Alternatively, we pursued a successful strategy:

  • Configured our connector to operate within private subnets.
  • Established a public NAT gateway or NAT instance within our Virtual Private Cloud (VPC) in a public subnet, which was already provisioned in the region.
  • Allowed outbound traffic from our private subnets to the NAT gateway or instance.

[Image: architecture diagram]

By opting for the second approach, we overcame the challenges and ensured seamless connectivity for our Kafka Connect deployment.
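The second approach can be sketched with the boto3 EC2 client roughly as follows. The subnet, route table, and security group IDs are placeholders, and the helper names are hypothetical; the sketch assumes the public subnet already exists and has a route to an internet gateway.

```python
def route_private_subnets_through_nat(ec2, public_subnet_id: str,
                                      private_route_table_id: str) -> str:
    """Create a NAT gateway in the public subnet and send the private
    subnets' internet-bound traffic through it."""
    # A public NAT gateway needs an Elastic IP.
    allocation = ec2.allocate_address(Domain="vpc")
    nat = ec2.create_nat_gateway(
        SubnetId=public_subnet_id,
        AllocationId=allocation["AllocationId"],
    )
    nat_id = nat["NatGateway"]["NatGatewayId"]
    # The gateway must be available before a route can point at it.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
    # Default route for the private subnets: everything goes to the NAT gateway.
    ec2.create_route(
        RouteTableId=private_route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return nat_id


def allow_self_referencing_traffic(ec2, security_group_id: str) -> None:
    """Add an inbound rule that lets members of the group talk to each other."""
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[{
            "IpProtocol": "-1",  # all protocols
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }],
    )


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   route_private_subnets_through_nat(ec2, "<public-subnet-id>", "<private-route-table-id>")
#   allow_self_referencing_traffic(ec2, "<connector-security-group-id>")
```

With this in place, the connector runs in private subnets without a public IP, and its outbound calls leave the VPC through the NAT gateway.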

2. OAuth support

Initially, during the implementation of the solution, OAuth authentication support was unavailable. As a workaround, we opted for JWT token authentication. Recognizing the need for improved authentication mechanisms, we submitted an enhancement request for the AEP-sink-connector to support OAuth authentication.

Fortunately, the engineering team swiftly addressed our request, providing the necessary support for OAuth authentication, thus enhancing the security and robustness of our integration with Adobe Experience Platform.
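For readers who want to verify their OAuth credentials before wiring them into the connector, the sketch below builds a client-credentials token request against Adobe IMS. The token URL, the form fields, and the comma-separated scope format are assumptions based on Adobe's OAuth Server-to-Server documentation, and the credential values are placeholders. The connector's own OAuth configuration keys are not covered in this post, so this only illustrates the token exchange itself.

```python
import urllib.parse
import urllib.request

# Assumed IMS token endpoint; confirm it in your Adobe Developer Console project.
IMS_TOKEN_URL = "https://ims-na1.adobelogin.com/ims/token/v3"


def build_token_request(client_id: str, client_secret: str, scopes: list) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": ",".join(scopes),  # assumption: comma-separated scopes
    }).encode("ascii")
    return urllib.request.Request(
        IMS_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_token_request("<CLIENT_ID>", "<CLIENT_SECRET>", ["<SCOPE>"])
    # Uncomment to perform the exchange; the JSON response should carry the token.
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode())
    print(request.full_url)
```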

Non-functional Aspects

When setting up an MSK cluster and connecting it to a destination, there are several non-functional aspects you should be mindful of to ensure smooth operation and optimal performance:

Maintainability: Setting up a new MSK cluster is straightforward via the AWS console and takes about 20 minutes. Options include a choice of machine sizes and managed ZooKeeper. Monitoring through CloudWatch is integrated, though some metrics come with additional charges. Documentation and support are reliable, with prompt responses from the support team.

Performance: Initial tests with basic configurations showed a maximum record rate of around 100K records/sec and an average latency of ~330 ms. With the optimized settings recommended in the AWS documentation, we achieved a maximum record rate of ~300K records/sec and an average latency of ~100 ms, which met our performance needs.

Scalability: MSK lacks on-the-fly scalability, requiring cluster recreation for changes like adding or removing broker nodes. This limitation poses challenges for handling sudden spikes in workload, suggesting either overprovisioning or manual cluster migration during peak periods.

Reliability: MSK ensures reliability through multi-AZ setups and automatic recovery mechanisms. Despite occasional failures, proper configuration with multi-AZ replication mitigates downtime, although recovery times can vary, impacting service availability.

Security: MSK offers improved security features, including encryption at rest and in transit. However, limitations exist in client permissions management, hindering granular access control.

Cost: MSK can be costly, especially for high-traffic scenarios. While comparisons to on-prem solutions may favor MSK for its managed services benefits, alternative options like Confluent Managed Kafka may offer more competitive pricing for certain workloads.

Conclusion

This blog details my experience connecting an MSK cluster to Adobe Experience Platform using the AEP-Sink-Connector. I hope it provides valuable insights and helps you with your setup.

 

Happy setup!

 

Special thanks to my team and my manager, Manjeet Singh Nagi, for their valuable insights.

 

Questions? Feedback? Connect with me on LinkedIn or contact me directly at kpareek@adobe.com

 

See you in the next blog!

References

Image 1 reference: https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/streaming/kafka

https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-connectors.html

https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-plugins.html

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html