Tuesday, September 17, 2013

What is vbuckets, and should I care?

The really short answer is: Not unless you really want to know the internals of the Couchbase Server. It is more than sufficient to know about buckets and how to add/remove nodes (and their impact on the
system).

The vbuckets exist so that the Couchbase cluster can move data around within the cluster. When you create a Couchbase Bucket, the cluster splits that bucket up into a fixed number of partitions. Each of this partitions is then assigned an id (the vbucket id) and assigned to a node in the cluster. The thing that maps the different partitions to the physical address is then called the vbucket map. So why not call it partitions? There is no reason for not doing so, but we chose vbuckets for "virtual bucket". At the time, it was never intended to be visible outside "the internals" of the server.

Lets walk through an example and you might see what it is used for. Imagine that you would like to access a document stored under the id "foo". The first thing you would do would be to create a hash value for the key, and then use the hash value to look up which vbucket it belongs to. The number of vbuckets is predefined, and will never change for a given cluster (it is currently set to 1024 on linux/Windows or 256 on Mac OS). With the vbucket id in place we consult the vbucket map to see whos responsible for that vbucket. The client will connect to that server and request the document "foo" from the given vbucket. In fact, the vbucket number to use is in the request itself and defined by the client, based on querying the cluster manager. The client's copy of the vbucket map could be obsolete and the vbucket is not located on the server causing it to return "not my vbucket" and the client should try to update the map. If the vbucket is located on the server it will return the document if it exists.

By having such an indirection from the actual partition and where it is currently located, we can easily move data from one node in the cluster to another node in the cluster (this is what happens during rebalance) and then update the map when we're done copying all data to the new node. When you set up the first node in your cluster all of the vbuckets reside on that node. While you add nodes the vbuckets (and the data) will be spread out across all of the nodes. The cluster tries to keep the distribution of vbuckets evenly across all nodes, to avoid some nodes to be overloaded.

Since we already had a way to transfer all of the data from one node to another node, we could use the same logic to keep replicas on other nodes. The same vbucket id is used on the other server, so the vbucket map could look something like:

+------------+---------+---------+
| vbucket id | active  | replica |
+------------+---------+---------+
|     0      | node A  | node B  |
|     1      | node B  | node C  |
|     2      | node C  | node D  |
|     3      | node D  | node A  |
+------------+---------+---------+

This means that node A have two vbuckets: 0 and 3. VBucket 0 is an active vbucket, which means that all get/set request would go to this node. VBucket 3 is on the other hand only used to keep replicas (there is a special command you may used to read replicas).

Lets imagine that one of your coworkers accidentally spilled his coffee into node 3 causing it to crash and never come up again. You as the administrator could now fail out the node, causing vbucket 3 on node A to be promoted to "active" and all read/write requests would go to that node instead.

As you see these are really some "internal guts" of the Couchbase server that you as user of the cluster really don't need to care about. I would say you'd be better off spending the time focusing on your application and ensuring that you don't under/over provision your cluster. It is by far more important to monitor that the IO path of your cluster is scaled according to your applications usage. If you don't have enough nodes to persist the data you might end up in a situation where your cluster is constantly "out of memory" and returns a message to the clients to back off. If you end up in this situation your cluster will be sluggish, and only accept a small number of updates every time it's written documents to disk.

2 comments:

  1. I made it through the entire first paragraph... then I hit an OutOfDepthException.

    ReplyDelete