Aggregation Pattern

The aggregation pattern allows a routine to wait for all expected messages from multiple upstream routines before processing and emitting results. This is useful for scenarios like aggregating search results from multiple sources, collecting data from parallel tasks, or combining outputs from different processors.

Overview

When you need to collect data from multiple sources and process it together, you can use the append merge strategy combined with a counter check in the handler. This pattern ensures that:

All incoming messages are accumulated
The handler is called for each message
Processing only occurs when all expected messages are received
Results are emitted once after all data is collected

Key Concepts

Merge Strategy: “append”: With merge_strategy="append", each incoming message’s values are appended to lists. This allows you to accumulate data over multiple receive operations.
Message Counting: In the handler, check the length of any list field to determine how many messages have been received. When the count reaches the expected number, process all accumulated data.
Single Processing: Use a flag to ensure processing happens only once, even if the handler is called again after the expected count has been reached (for example, when an extra message arrives).

Basic Example

Here’s a simple aggregator that waits for 3 messages:

from routilux import Flow, Routine

class SourceRoutine(Routine):
    def __init__(self, source_id: str):
        super().__init__()
        self.source_id = source_id
        self.output_event = self.define_event("output", ["data", "source_id"])

    def __call__(self, **kwargs):
        super().__call__(**kwargs)
        data = kwargs.get("data", f"data_from_{self.source_id}")
        self.emit("output", data=data, source_id=self.source_id)

class AggregatorRoutine(Routine):
    """Aggregator that waits for all expected messages."""

    def __init__(self, expected_count: int = 3):
        super().__init__()
        self.expected_count = expected_count
        self.set_config(expected_count=expected_count)
        self.processed = False  # Flag to ensure single processing

        # Use append strategy to accumulate data
        self.input_slot = self.define_slot(
            "input",
            handler=self._handle_input,
            merge_strategy="append"  # Key: append strategy
        )
        self.output_event = self.define_event("aggregated", ["all_data", "count"])

    def _handle_input(self, **kwargs):
        """Handle input and check if all messages received."""
        # With append strategy, kwargs contains lists
        # Count messages using any list field
        received_count = 0
        if "source_id" in kwargs and isinstance(kwargs["source_id"], list):
            received_count = len(kwargs["source_id"])
        elif "data" in kwargs and isinstance(kwargs["data"], list):
            received_count = len(kwargs["data"])

        expected_count = self.get_config("expected_count", self.expected_count)

        # Process only when all messages received and not already processed
        if received_count >= expected_count and not self.processed:
            self.processed = True

            # Extract accumulated data
            all_data = []
            if "data" in kwargs and isinstance(kwargs["data"], list):
                all_data = kwargs["data"]

            # Emit aggregated result
            self.emit("aggregated", all_data=all_data, count=len(all_data))

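            # Reset for next aggregation (optional). Clear the flag as well,
            # otherwise the next cycle would never be processed.
            self.input_slot._data = {}
            self.processed = False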

# Create flow
flow = Flow(flow_id="aggregator_demo")

# Create sources
source1 = SourceRoutine("source1")
source2 = SourceRoutine("source2")
source3 = SourceRoutine("source3")

# Create aggregator
aggregator = AggregatorRoutine(expected_count=3)

# Add to flow
id1 = flow.add_routine(source1, "source1")
id2 = flow.add_routine(source2, "source2")
id3 = flow.add_routine(source3, "source3")
agg_id = flow.add_routine(aggregator, "aggregator")

# Connect all sources to aggregator
flow.connect(id1, "output", agg_id, "input")
flow.connect(id2, "output", agg_id, "input")
flow.connect(id3, "output", agg_id, "input")

# Execute sources
flow.execute(id1, entry_params={"data": "data1"})
flow.execute(id2, entry_params={"data": "data2"})
flow.execute(id3, entry_params={"data": "data3"})

# Aggregator will process when all 3 messages are received

How It Works

Append Strategy: When merge_strategy="append" is used, each receive() call appends values to lists in slot._data.
Handler Invocation: The handler is called after each receive() with the accumulated data (where values are lists).
Message Counting: Check the length of any list field in kwargs to count received messages.
Conditional Processing: Only process when the count has reached the expected number and the data has not already been processed (tracked with a flag).
Data Extraction: Extract all accumulated data from the lists and process it together.
Emission: Emit the aggregated result once.
Reset (Optional): Clear slot._data to prepare for the next aggregation cycle.
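
The steps above can be traced without the library. The sketch below is a plain-Python stand-in, not routilux code: `append_merge` and `MiniAggregator` are hypothetical names, and the merge is assumed to behave as the Append Strategy step describes (each value appended to a list under its key).

```python
from typing import Any, Dict, List


def append_merge(accumulated: Dict[str, List[Any]], message: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Append each value of `message` to the list stored under the same key."""
    for key, value in message.items():
        accumulated.setdefault(key, []).append(value)
    return accumulated


class MiniAggregator:
    """Hypothetical stand-in for a routine with one append-strategy slot."""

    def __init__(self, expected_count: int) -> None:
        self.expected_count = expected_count
        self.slot_data: Dict[str, List[Any]] = {}
        self.emitted: List[Dict[str, Any]] = []

    def receive(self, **message: Any) -> None:
        # Append strategy + handler invocation: merge first, then call the
        # handler with the accumulated lists.
        merged = append_merge(self.slot_data, message)
        self.handle(**merged)

    def handle(self, **kwargs: Any) -> None:
        # Message counting: use a field that is present in every message.
        count = len(kwargs.get("source_id", []))
        # Conditional processing: do nothing until the expected count is reached.
        if count < self.expected_count:
            return
        # Data extraction and emission: take the accumulated data and emit once.
        all_data = list(kwargs["data"])
        self.emitted.append({"all_data": all_data, "count": len(all_data)})
        # Reset: clear the accumulated data for the next cycle.
        self.slot_data = {}


agg = MiniAggregator(expected_count=3)
for i in (1, 2, 3):
    agg.receive(data=f"data{i}", source_id=f"source{i}")
print(agg.emitted)  # [{'all_data': ['data1', 'data2', 'data3'], 'count': 3}]
```

Because this stand-in always resets after emitting, it needs no separate "processed" flag; the flag matters when the accumulated data is kept after processing.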

Complete Example: Search Result Aggregation

Here’s a complete example that aggregates search results from multiple search engines:

"""
Aggregator Routine Demo

Demonstrates how to create a routine that waits for all expected messages
before processing and emitting results.
"""

import time

from routilux import Flow, Routine


class SearchTask(Routine):
    """A search task routine that simulates searching."""

    def __init__(self):
        super().__init__()
        self.set_config(task_name="default")

        # Define trigger slot
        self.trigger_slot = self.define_slot("trigger", handler=self._handle_trigger)

        # Define output event
        self.output_event = self.define_event("result", ["query", "results", "task_name"])

    def _handle_trigger(self, query: str = None, **kwargs):
        """Handle search trigger."""
        query = query or kwargs.get("query", "default")

        # Simulate search operation
        time.sleep(0.1)  # Simulate I/O delay

        # Look up the task name from the routine's configuration
        # (it is stored via set_config, not as an instance attribute)
        task_name = self.get_config("task_name", "default")

        # Generate mock results
        results = [
            f"{task_name}_result_1",
            f"{task_name}_result_2",
            f"{task_name}_result_3",
        ]

        # Track operation
        self._track_operation("searches", success=True, results_count=len(results))

        # Emit results
        self.emit("result", query=query, results=results, task_name=task_name)


class ResultAggregator(Routine):
    """Aggregator routine that waits for all expected messages before processing."""

    def __init__(self):
        super().__init__()
        # Set configuration
        self.set_config(expected_count=3)
        self.set_config(timeout=10.0)  # Optional timeout; not enforced in this example

        # Define input slot with append strategy to collect all results
        self.input_slot = self.define_slot(
            "input",
            handler=self._handle_input,
            merge_strategy="append",  # Collect all incoming data
        )

        # Define output event
        self.output_event = self.define_event(
            "aggregated", ["all_results", "total_count", "queries"]
        )

    def _handle_input(self, **kwargs):
        """Handle input and check if we have all expected messages.

        With merge_strategy="append", each receive() call adds to the accumulated data.
        The handler receives the merged data (with lists), so we can check the length
        of any list field to determine how many messages we've received.

        Args:
            **kwargs: Merged data from slot. With append strategy, values are lists.
                For example: {'task_name': ['task1', 'task2'], 'results': [[...], [...]]}
        """
        # With merge_strategy="append", kwargs contains accumulated data where
        # values are lists. We can check the length of any list to count messages.

        # Count how many messages we've received
        # Use task_name list length as the count (since each message has a task_name)
        received_count = 0
        if "task_name" in kwargs and isinstance(kwargs["task_name"], list):
            received_count = len(kwargs["task_name"])
        elif "results" in kwargs and isinstance(kwargs["results"], list):
            # If task_name not available, use results list length
            received_count = len(kwargs["results"])
        elif "query" in kwargs and isinstance(kwargs["query"], list):
            received_count = len(kwargs["query"])
        else:
            # Fallback: count any list field
            for key, value in kwargs.items():
                if isinstance(value, list) and value:
                    received_count = len(value)
                    break

        expected_count = self.get_config("expected_count", 3)

        # Get current task_name from kwargs if available
        current_task = kwargs.get("task_name", "unknown")
        if isinstance(current_task, list) and current_task:
            current_task = current_task[-1]  # Get last one

        print(
            f"Aggregator received message from {current_task}. "
            f"Total received: {received_count}/{expected_count}"
        )

        # Check if we've received all expected messages
        if received_count >= expected_count:
            print(f"✅ All {expected_count} messages received! Processing aggregated results...")

            # Process all accumulated data (kwargs contains the merged data)
            self._process_aggregated_results(kwargs)

            # Reset for next aggregation (optional)
            self.input_slot._data = {}
        else:
            print(f"⏳ Waiting for more messages ({received_count}/{expected_count})...")

    def _process_aggregated_results(self, accumulated_data: dict):
        """Process all aggregated results and emit.

        Args:
            accumulated_data: Dictionary with accumulated data. With append strategy,
                values are lists containing all received values.
        """
        # Track operation
        self._track_operation("aggregations", success=True)

        # Extract all results
        all_results = []
        queries = []
        task_names = []

        if "results" in accumulated_data:
            # results is a list of lists (each search task's results)
            results_list = accumulated_data["results"]
            if isinstance(results_list, list):
                for result_list in results_list:
                    if isinstance(result_list, list):
                        all_results.extend(result_list)
                    else:
                        all_results.append(result_list)

        if "query" in accumulated_data:
            query_list = accumulated_data["query"]
            queries = query_list if isinstance(query_list, list) else [query_list]

        if "task_name" in accumulated_data:
            task_name_list = accumulated_data["task_name"]
            task_names = task_name_list if isinstance(task_name_list, list) else [task_name_list]

        print(f"📊 Aggregated {len(all_results)} results from {len(task_names)} search tasks")

        # Emit aggregated result
        self.emit(
            "aggregated", all_results=all_results, total_count=len(all_results), queries=queries
        )

        # Reset for next aggregation (optional)
        # self.input_slot._data = {}


def demo_aggregator():
    """Demonstrate aggregator routine."""
    print("=" * 70)
    print("Aggregator Routine Demo")
    print("=" * 70)

    # Create flow
    flow = Flow(flow_id="aggregator_demo")

    # Create search tasks
    search1 = SearchTask()
    search1.set_config(task_name="SearchEngine1")
    search2 = SearchTask()
    search2.set_config(task_name="SearchEngine2")
    search3 = SearchTask()
    search3.set_config(task_name="SearchEngine3")

    # Create aggregator (expects 3 results)
    aggregator = ResultAggregator()
    aggregator.set_config(expected_count=3)

    # Add to flow
    id1 = flow.add_routine(search1, "search1")
    id2 = flow.add_routine(search2, "search2")
    id3 = flow.add_routine(search3, "search3")
    agg_id = flow.add_routine(aggregator, "aggregator")

    # Connect all search tasks to aggregator
    flow.connect(id1, "result", agg_id, "input")
    flow.connect(id2, "result", agg_id, "input")
    flow.connect(id3, "result", agg_id, "input")

    # Create a consumer to receive aggregated results
    class ResultConsumer(Routine):
        def __init__(self):
            super().__init__()
            self.received_results = []
            self.input_slot = self.define_slot("input", handler=self._handle_input)

        def _handle_input(self, all_results: list = None, total_count: int = None, **kwargs):
            self.received_results.append({"results": all_results, "count": total_count})
            print(f"📦 Consumer received aggregated result: {total_count} total results")

    consumer = ResultConsumer()
    consumer_id = flow.add_routine(consumer, "consumer")
    flow.connect(agg_id, "aggregated", consumer_id, "input")

    # Create a multi-source trigger routine that triggers all search tasks
    class MultiSourceTrigger(Routine):
        """Trigger routine that emits to multiple search tasks in a single execute()"""

        def __init__(self):
            super().__init__()
            self.trigger_slot = self.define_slot("trigger", handler=self._handle_trigger)
            self.output_event = self.define_event("trigger_search", ["query"])

        def _handle_trigger(self, query: str = None, **kwargs):
            """Trigger all search tasks in a single execute() call"""
            query = query or kwargs.get("query", "test query")
            # Emit to all connected search tasks - they all share the same execution
            # Flow is automatically detected from routine context
            self.emit("trigger_search", query=query)

    trigger = MultiSourceTrigger()
    trigger_id = flow.add_routine(trigger, "trigger")

    # Connect trigger to all search tasks
    flow.connect(trigger_id, "trigger_search", id1, "trigger")
    flow.connect(trigger_id, "trigger_search", id2, "trigger")
    flow.connect(trigger_id, "trigger_search", id3, "trigger")

    print("\nFlow structure:")
    print("  trigger -> search1 -> aggregator -> consumer")
    print("  trigger -> search2 -> aggregator")
    print("  trigger -> search3 -> aggregator")
    print("\nAggregator expects: 3 messages")

    # Execute once - all search tasks will be triggered in the same execution
    print("\n🚀 Executing all search tasks (single execute, multiple emits)...")
    job_state = flow.execute(trigger_id, entry_params={"query": "test query"})

    # Wait for all async tasks to complete
    from routilux.job_state import JobState

    JobState.wait_for_completion(flow, job_state, timeout=2.0)

    print("\n" + "=" * 70)
    print("Results:")
    # Execution state is tracked in the JobState returned by flow.execute()
    # (job_state above), not in routine._stats
    print(f"  Consumer received: {len(consumer.received_results)} aggregated result(s)")
    if consumer.received_results:
        print(f"  Total results in aggregation: {consumer.received_results[0]['count']}")
    print("=" * 70)


if __name__ == "__main__":
    demo_aggregator()

Key Points

Handler is Called for Each Message

The handler is called immediately after each message is received. You check the count inside the handler to decide when to process.

Append Strategy Behavior

With merge_strategy="append":

First message: kwargs = {"data": ["data1"], "source_id": ["source1"]}
Second message: kwargs = {"data": ["data1", "data2"], "source_id": ["source1", "source2"]}
Third message: kwargs = {"data": ["data1", "data2", "data3"], "source_id": ["source1", "source2", "source3"]}

Counting Messages

Use any field that appears in every message to count:

if "source_id" in kwargs and isinstance(kwargs["source_id"], list):
    count = len(kwargs["source_id"])
elif "data" in kwargs and isinstance(kwargs["data"], list):
    count = len(kwargs["data"])

Preventing Duplicate Processing

Use a flag to ensure processing only happens once:

if count >= expected_count and not self.processed:
    self.processed = True
    # Process and emit

Resetting for Next Cycle

After processing, optionally reset the slot data:

self.input_slot._data = {}

Concurrent Execution

The aggregation pattern works the same way in concurrent execution mode. However, be aware that:

Handler calls may be interleaved across threads
Use thread-safe operations if handlers share state
The total number of handler calls is the same as in sequential mode

Example with concurrent execution:

flow = Flow(execution_strategy="concurrent", max_workers=5)

# Same setup as before
# ...

# Execute sources concurrently, keeping the job state returned by each call
job_states = [
    flow.execute(id1, entry_params={"data": "data1"}),
    flow.execute(id2, entry_params={"data": "data2"}),
    flow.execute(id3, entry_params={"data": "data3"}),
]

# Wait for completion of every execution
from routilux.job_state import JobState
for job_state in job_states:
    JobState.wait_for_completion(flow, job_state, timeout=10.0)

# Aggregator will still process when all 3 messages are received
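
When handler calls can interleave, the count check and the flag update need to happen as one atomic step. The following is a minimal standard-library sketch of that idea; `LockedCollector` is a hypothetical class rather than a routilux API, and whether the library already serializes handler calls is not assumed either way.

```python
import threading
from typing import Any, List


class LockedCollector:
    """Hypothetical collector: many threads add items, exactly one emits."""

    def __init__(self, expected_count: int) -> None:
        self.expected_count = expected_count
        self._lock = threading.Lock()
        self._items: List[Any] = []
        self._processed = False
        self.emitted: List[List[Any]] = []

    def add(self, item: Any) -> None:
        # The append, the count check and the flag update share one lock, so
        # two threads cannot both observe "count reached, not yet processed".
        with self._lock:
            self._items.append(item)
            if len(self._items) < self.expected_count or self._processed:
                return
            self._processed = True
            batch = list(self._items)
        # Emit outside the lock so downstream work does not block other threads.
        self.emitted.append(batch)


collector = LockedCollector(expected_count=5)
threads = [threading.Thread(target=collector.add, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(collector.emitted), sorted(collector.emitted[0]))  # 1 [0, 1, 2, 3, 4]
```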

Best Practices

Use Configuration: Store expected_count in _config for flexibility:
```
self.set_config(expected_count=expected_count)
```

Check List Type: Always check if a value is a list before using len():

if "field" in kwargs and isinstance(kwargs["field"], list):
    count = len(kwargs["field"])

Prevent Duplicate Processing: Use a flag to ensure processing happens only once:

if count >= expected_count and not self.processed:
    self.processed = True
    # Process

Reset After Processing: Clear slot data after processing to prepare for the next aggregation cycle:
```
self.input_slot._data = {}
```
Handle Edge Cases: Consider what happens if fewer messages arrive than expected (timeout handling), more messages arrive than expected, or messages arrive out of order.
Thread Safety: In concurrent mode, if you need to share state between handlers, use thread-safe operations (locks, atomic operations).
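
For the edge cases of extra and out-of-order messages, one option is to key the accumulated data by source instead of relying on list length alone. The sketch below assumes each message carries a `source_id` and a `data` field, as in the basic example; `collect_by_source` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Sequence


def collect_by_source(
    kwargs: Dict[str, List[Any]], expected_sources: Sequence[str]
) -> Optional[List[Any]]:
    """Return data ordered like `expected_sources`, or None while a source is missing."""
    latest: Dict[str, Any] = {}
    for source_id, data in zip(kwargs.get("source_id", []), kwargs.get("data", [])):
        latest[source_id] = data  # A repeated source overwrites its earlier value
    if any(source not in latest for source in expected_sources):
        return None
    return [latest[source] for source in expected_sources]


expected = ["source1", "source2", "source3"]

# Three messages arrived, but source2 sent twice and source3 is still missing.
partial = {"source_id": ["source2", "source1", "source2"], "data": ["b1", "a", "b2"]}
print(collect_by_source(partial, expected))   # None

# All sources present, arriving out of order.
complete = {"source_id": ["source3", "source1", "source2"], "data": ["c", "a", "b"]}
print(collect_by_source(complete, expected))  # ['a', 'b', 'c']
```

The first call shows why a plain length check can fire too early: the list has three entries, yet one expected source has not reported.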

Common Patterns

Timeout Handling

Add timeout logic to process even if not all messages arrive:

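import time

def __init__(self, expected_count: int = 3, timeout: float = 10.0):
    # ...
    self.start_time = None  # Set when the first message of a cycle arrives
    self.set_config(timeout=timeout)

def _handle_input(self, **kwargs):
    if self.start_time is None:
        self.start_time = time.monotonic()

    # Count messages (assumes every message carries a "data" field)
    count = len(kwargs.get("data", []))
    expected_count = self.get_config("expected_count", 3)

    timeout = self.get_config("timeout", 10.0)
    elapsed = time.monotonic() - self.start_time

    # Note: this check only runs when a message arrives
    if (count >= expected_count) or (elapsed >= timeout):
        # Process with available data, then restart the clock
        self.emit("aggregated", all_data=kwargs.get("data", []), count=count)
        self.input_slot._data = {}
        self.start_time = None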
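
A check placed inside the handler only runs when a message arrives, so if the remaining sources never send anything, nothing triggers the timeout. One way to cover that case is a timer that flushes on its own. The sketch below uses only the standard library; `DeadlineBuffer` and its `on_flush` callback are hypothetical, and wiring it into a routine is left open.

```python
import threading
from typing import Any, Callable, List, Optional


class DeadlineBuffer:
    """Hypothetical buffer that flushes on count or on a deadline, whichever comes first."""

    def __init__(self, expected_count: int, timeout: float,
                 on_flush: Callable[[List[Any]], None]) -> None:
        self.expected_count = expected_count
        self.timeout = timeout
        self.on_flush = on_flush
        self._lock = threading.Lock()
        self._items: List[Any] = []
        self._timer: Optional[threading.Timer] = None
        self._cycle = 0  # Lets a late timer from an old cycle be ignored

    def add(self, item: Any) -> None:
        with self._lock:
            self._items.append(item)
            if self._timer is None:
                # The first message of a cycle starts the deadline.
                self._timer = threading.Timer(self.timeout, self._flush, args=(self._cycle,))
                self._timer.daemon = True
                self._timer.start()
            full = len(self._items) >= self.expected_count
            cycle = self._cycle
        if full:
            self._flush(cycle)

    def _flush(self, cycle: int) -> None:
        with self._lock:
            if cycle != self._cycle or not self._items:
                return  # This cycle was already flushed by the other path
            batch, self._items = self._items, []
            timer, self._timer = self._timer, None
            self._cycle += 1
        if timer is not None:
            timer.cancel()  # Harmless if the timer itself is the caller
        self.on_flush(batch)


flushed: List[List[Any]] = []
done = threading.Event()


def record(batch: List[Any]) -> None:
    flushed.append(batch)
    done.set()


buf = DeadlineBuffer(expected_count=3, timeout=0.05, on_flush=record)
buf.add("a")
buf.add("b")            # Only 2 of the 3 expected messages arrive
done.wait(timeout=2.0)  # The deadline fires without a third message
print(flushed)          # [['a', 'b']]
```

Note that `on_flush` runs on the timer thread when the deadline fires, so anything it touches must be safe to use from another thread.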

Dynamic Expected Count

Allow the expected count to be set dynamically:

def set_expected_count(self, count: int):
    self.set_config(expected_count=count)
    self.expected_count = count

Multiple Aggregation Cycles

Reset properly to support multiple aggregation cycles:

def _handle_input(self, **kwargs):
    # ... check count ...
    if count >= expected_count and not self.processed:
        self.processed = True
        # Process
        self.input_slot._data = {}
        self.processed = False  # Reset for next cycle