4 Comments

I don't fully follow when you say: "Unlike systems like Spark or Daft that can distribute work at the query execution level (breaking down individual operations like joins or aggregations), smallpond operates at a higher level. It distributes entire partitions to workers, and each worker processes its entire partition using DuckDB."

In Spark too, a partition is processed by a single core, right?


Yes, Spark also processes each partition on an individual core, but the key difference is how the query itself is broken up and distributed.

- Spark compiles a query into stages and tasks, distributing operations like joins and aggregations across multiple workers, with shuffles moving data between stages, so a single operation can span many workers.

- smallpond, on the other hand, assigns entire partitions to separate DuckDB instances running as Ray tasks; each instance processes its partition independently, without the query being broken into smaller tasks.

So while both use partitioning, Spark distributes work at a finer granularity, while smallpond keeps execution isolated per partition. A rough sketch of what that looks like is below.
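For a concrete picture, here is a minimal sketch using smallpond's documented API (init / read_parquet / repartition / partial_sql / write_parquet); the file paths, partition count, and column names are made up for illustration:

```python
import smallpond

# Start a smallpond session (launches the Ray-backed scheduler).
sp = smallpond.init()

# Split the data into 8 hash partitions on a key column
# ("events.parquet" and "user_id" are hypothetical).
df = sp.read_parquet("events.parquet")
df = df.repartition(8, hash_by="user_id")

# This SQL is NOT decomposed into distributed stages. Each of the 8
# partitions is handed whole to its own DuckDB instance (a Ray task),
# which runs the full query over just that one partition.
df = sp.partial_sql("SELECT user_id, count(*) AS n FROM {0} GROUP BY user_id", df)
df.write_parquet("counts/")
```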

Hope that clarifies!


Wouldn't this mean that smallpond isn't meant for complex, dynamic operations (e.g. ones that involve a shuffle)?


That's true. If you have an analytical workload that typically requires joins across partitions, that would be really slow in smallpond's model, because both sides of the join have to be repartitioned (i.e. shuffled) by the join key before each DuckDB instance can join locally.
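To make that concrete, here is a hedged sketch of a cross-partition join, assuming smallpond's read_parquet / repartition / partial_sql API and assuming partial_sql accepts multiple inputs via {0}/{1} placeholders; paths and columns are hypothetical:

```python
import smallpond

sp = smallpond.init()

# Co-partition both inputs on the join key. This repartition is the
# expensive step: it shuffles both datasets so that matching keys end
# up in the same partition number on each side.
orders = sp.read_parquet("orders.parquet").repartition(8, hash_by="user_id")
users = sp.read_parquet("users.parquet").repartition(8, hash_by="user_id")

# Each DuckDB instance then joins its pair of partitions locally; the
# join itself is never distributed at the operator level.
joined = sp.partial_sql(
    "SELECT o.user_id, u.country, o.amount "
    "FROM {0} o JOIN {1} u ON o.user_id = u.user_id",
    orders, users,
)
joined.write_parquet("joined/")
```

The shuffle cost here scales with the full size of both datasets, which is why workloads dominated by cross-partition joins fit Spark's finer-grained execution better.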
