Scylla Summit 2022: Stream Processing with ScyllaDB - No Message Queue Involved!

Published on Apr 22, 2024

Step-by-Step Tutorial: Stream Processing with ScyllaDB

Introduction:

This tutorial summarizes how ScyllaDB can be used for stream processing, based on the YouTube video "Scylla Summit 2022: Stream Processing with ScyllaDB - No Message Queue Involved!" by Daniel Belenki, a Principal Software Engineer at Palo Alto Networks.

Step 1: Understanding the Challenge

  • Daniel Belenki explains that his team works on a network security product that handles millions of records per second from various data sources.
  • The challenge is to correlate multiple event types from those data sources into "stories" that give a unified view of network sessions.

Step 2: Technology Stack

  • The technology stack includes Golang and Python, with Kubernetes for deployment.

Step 3: Initial Solutions Considered

  • Initially, a relational database was considered for storing and querying normalized data. However, it added operational overhead and its performance was limited.

Step 4: Solution with ScyllaDB

  • The team decided to use ScyllaDB without a message queue for stream processing.
  • Data is stored in ScyllaDB and divided into hundreds of shards so that it can be processed in parallel.
  • Workers fetch data from ScyllaDB based on partition keys and timestamps to compute stories and relations between events.
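The sharding idea above can be illustrated with a minimal sketch. The summary does not say how events are assigned to shards, so the hash-based assignment, the shard count of 256, and the names `shard_for`, `partition_events`, and the `"key"` field are all assumptions made for illustration, not the team's actual scheme.

```python
import hashlib

# Assumed value: the summary only says "hundreds of shards".
NUM_SHARDS = 256


def shard_for(event_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an event key to a stable shard number in [0, num_shards)."""
    # A cryptographic hash is used only because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(event_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_events(events, num_shards: int = NUM_SHARDS):
    """Group events by shard so each worker can own a disjoint subset."""
    by_shard = {}
    for event in events:
        shard = shard_for(event["key"], num_shards)  # "key" is a hypothetical field
        by_shard.setdefault(shard, []).append(event)
    return by_shard
```

Because the shard number is derived only from the key, every producer computes the same shard for the same key, and workers can split the shard range among themselves without coordinating per event.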

Step 5: Implementation Details

  • Workers store read offsets in a separate table to track the last processed timestamp for each shard.
  • Producers insert data into ScyllaDB, while consumers fetch data based on shard and timestamp criteria.
  • Workers continuously run queries on ScyllaDB to process events and publish stories for system components to consume.
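The producer, consumer, and read-offset mechanics described in this step can be sketched without a real cluster. The example below uses an in-memory stand-in for the two tables; the class `InMemoryStore`, the function `poll_shard`, and the millisecond timestamps are hypothetical names and choices for illustration only, and a real deployment would issue equivalent range queries against ScyllaDB.

```python
from collections import defaultdict


class InMemoryStore:
    """Stand-in for two ScyllaDB tables: events and read offsets."""

    def __init__(self):
        self.events = defaultdict(list)  # shard -> list of (timestamp_ms, payload)
        self.read_offsets = {}           # shard -> last processed timestamp_ms

    def insert_event(self, shard, timestamp_ms, payload):
        # Producer side: write the event under its shard.
        self.events[shard].append((timestamp_ms, payload))

    def fetch_events(self, shard, after_ms, up_to_ms):
        # Consumer side: range query on (shard, timestamp), oldest first.
        rows = [row for row in self.events[shard] if after_ms < row[0] <= up_to_ms]
        return sorted(rows, key=lambda row: row[0])


def poll_shard(store, shard, now_ms):
    """Process one shard once: read past the stored offset, then advance it."""
    last_ms = store.read_offsets.get(shard, -1)
    rows = store.fetch_events(shard, last_ms, now_ms)
    # Persist the new offset so the next poll does not re-read these events.
    store.read_offsets[shard] = now_ms
    return [payload for _, payload in rows]
```

A worker would call `poll_shard` in a loop for each shard it owns. Storing the offset per shard means a restarted worker can resume from the last processed timestamp instead of rescanning the shard.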

Step 6: Pros and Cons of the Solution

  • Pros:

    • Higher throughput than the relational database approach.
    • Reduced operational complexity by eliminating the need for a message queue.
    • Improved performance and scalability with ScyllaDB.
  • Cons:

    • Increased code complexity due to custom logic for correlating events.
    • Producers and consumers must keep their clocks synchronized to within the clock resolution, because consumers fetch events by timestamp.
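To see why clock synchronization matters when events are selected by timestamp, consider a consumer that has already advanced its offset past time t: an event stamped t that arrives afterwards would never be read. One common mitigation, shown here purely as an assumption and not as the approach described in the talk, is to read only up to "now minus a safety margin". The function name `safe_read_window` and the default margin of 1000 ms are hypothetical.

```python
def safe_read_window(last_offset_ms: int, now_ms: int, margin_ms: int = 1000):
    """Return the (after, up_to) window a consumer may read, or None if empty.

    The upper bound trails the consumer's clock by margin_ms so that events
    written slightly late, or by a producer whose clock runs slightly behind,
    still fall inside a future window instead of being skipped.
    """
    up_to_ms = now_ms - margin_ms
    if up_to_ms <= last_offset_ms:
        return None  # nothing is safely readable yet
    return (last_offset_ms, up_to_ms)
```

The margin trades latency for safety: a larger margin tolerates more clock skew between producers and consumers but delays when each event becomes visible to workers.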

Step 7: Benefits of Using ScyllaDB

  • By leveraging ScyllaDB for stream processing, the team reduced operational complexity, improved performance, and managed to handle diverse data sources efficiently.

Conclusion:

  • Stream processing with ScyllaDB offers a scalable and efficient solution for handling real-time data processing requirements, especially in scenarios where maintaining multiple message queue deployments is challenging.

These steps summarize the approach to stream processing with ScyllaDB that Daniel Belenki explains in the video.