Why not to use IN in Cassandra

在使用 Cassandra 資料庫時,對於分區鍵(partition key)使用 IN 條件並不是一個推薦的做法。當使用 IN 條件進行查詢時,可能需要查詢多個節點,進而影響性能。比如說,假設在一個擁有 30 個節點、複寫因子為 3、一致性等級設置為 LOCAL_QUORUM 的本地數據中心集群中,對單個分區鍵進行查詢會涉及到兩個節點。

然而,如果使用了 IN 條件,操作可能會涉及更多節點,最多可達 20 個,這取決於鍵落在 token 範圍的位置。作者建議對於分區鍵來說,使用 IN 條件不太安全。而對於群集列(clustering columns),使用 IN 條件則相對更穩妥。此外,建議參考《Cassandra Query Patterns: Not using the “in” query for multiple partitions》中的相關邏輯,了解更多關於不使用“in”查詢多個分區的情況。

透過分開的查詢,在Cassandra中能夠有效地發揮分散式的特性。這種方式有許多優點:首先,沒有單一故障點,即使某個節點出現問題,系統仍能正常運作。其次,讀取速度更快,因為數據可以從多個節點同時獲取,而不是集中在單一節點上。同時,分開的查詢也減輕了協調節點的壓力,因為負載被平均分散到整個系統中。最重要的是,當節點發生故障時,這種方式可以提供更好的性能語義,系統不會因為某個節點失效而完全中止,而是能夠繼續運行並提供服務。總的來說,分開的查詢方式充分發揮了Cassandra分散式的特性,提供了更高效、更健壯的系統運行方式。

reference

https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlSelect.html#cqlSelect__selectInNot

Under most conditions, using IN in relations on the partition key is not recommended. To process a list of values, the SELECT may have to query many nodes, which degrades performance.Under most conditions, using IN in relations on the partition key is not recommended. To process a list of values, the SELECT may have to query many nodes, which degrades performance.For example, consider a single local datacenter cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM. A query on a single partition key query goes out to two nodes.But if the SELECT uses the IN condition, the operation can involve more nodes — up to 20, depending on where the keys fall in the token range.

Using IN for clustering columns is safer. See Cassandra Query Patterns: Not using the “in” query for multiple partitions for additional logic about using IN.

https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

With separate queries you get no single point of failure, faster reads, less pressure on the coordinator node, and better performance semantics when you have a nodes failing. It truly embraces the distributed nature of Cassandra.With separate queries you get no single point of failure, faster reads, less pressure on the coordinator node, and better performance semantics when you have a nodes failing. It truly embraces the distributed nature of Cassandra.

Add a Comment

發佈留言必須填寫的電子郵件地址不會公開。