4 Comments

I don't fully follow when you say: "Unlike systems like Spark or Daft that can distribute work at the query execution level (breaking down individual operations like joins or aggregations), smallpond operates at a higher level. It distributes entire partitions to workers, and each worker processes its entire partition using DuckDB."

In Spark too, a partition is processed by a single core, right?


Yes, Spark also processes each partition on an individual core, but the key difference is how the query itself is broken up and distributed.

- Spark compiles a query into stages and tasks, distributing operations like joins and aggregations across multiple workers, with shuffles moving data between stages, so a single operation can span many workers.

- smallpond, on the other hand, assigns entire partitions to separate DuckDB instances running as Ray tasks; each instance processes its partition independently, without the query being broken into smaller tasks.

So while both use partitioning, Spark distributes work at a finer granularity, while smallpond keeps execution isolated per partition. A rough sketch of what that looks like is below.
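For a concrete picture, here is a minimal sketch using smallpond's documented API (init / read_parquet / repartition / partial_sql / write_parquet); the file paths, partition count, and column names are made up for illustration:

```python
import smallpond

# Start a smallpond session (launches the Ray-backed scheduler).
sp = smallpond.init()

# Split the data into 8 hash partitions on a key column
# ("events.parquet" and "user_id" are hypothetical).
df = sp.read_parquet("events.parquet")
df = df.repartition(8, hash_by="user_id")

# This SQL is NOT decomposed into distributed stages. Each of the 8
# partitions is handed whole to its own DuckDB instance (a Ray task),
# which runs the full query over just that one partition.
df = sp.partial_sql("SELECT user_id, count(*) AS n FROM {0} GROUP BY user_id", df)
df.write_parquet("counts/")
```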

Hope that clarifies!


Wouldn't this mean that smallpond isn't meant for complex, dynamic operations (e.g. ones that involve a shuffle)?


That's true. If you have an analytical workload that typically requires joins across partitions, that would be really slow in smallpond's model, because both sides of the join have to be repartitioned (i.e. shuffled) by the join key before each DuckDB instance can join locally.
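To make that concrete, here is a hedged sketch of a cross-partition join, assuming smallpond's read_parquet / repartition / partial_sql API and assuming partial_sql accepts multiple inputs via {0}/{1} placeholders; paths and columns are hypothetical:

```python
import smallpond

sp = smallpond.init()

# Co-partition both inputs on the join key. This repartition is the
# expensive step: it shuffles both datasets so that matching keys end
# up in the same partition number on each side.
orders = sp.read_parquet("orders.parquet").repartition(8, hash_by="user_id")
users = sp.read_parquet("users.parquet").repartition(8, hash_by="user_id")

# Each DuckDB instance then joins its pair of partitions locally; the
# join itself is never distributed at the operator level.
joined = sp.partial_sql(
    "SELECT o.user_id, u.country, o.amount "
    "FROM {0} o JOIN {1} u ON o.user_id = u.user_id",
    orders, users,
)
joined.write_parquet("joined/")
```

The shuffle cost here scales with the full size of both datasets, which is why workloads dominated by cross-partition joins fit Spark's finer-grained execution better.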
