Nieksand Dev
Cassandra Replication Imbalance

During some recent pre-production testing on a Cassandra cluster, I came across an odd load imbalance:

Nodetool "ring" command.

We’re using the RandomPartitioner and a replication factor of 2 with the NetworkTopologyStrategy.  Replication is working (hence the 200% Effective-Ownership), but there is an obvious load imbalance between the nodes.

I tracked down the problem using the “describering” command:

Nodetool "describering" command.

The “endpoints” indicate which nodes receive row copies for a given key range.  A sketch of the endpoints shows what’s happening:

Bad replication flow.

As expected, we are replicating across our two racks.  However, all of the replica copies are landing on just node 1-a and node 2-a.

Why?  The DataStax documentation gives the answer:

With NetworkTopologyStrategy, replica placement is determined independently within each data center (or replication group). The first replica per data center is placed according to the partitioner (same as with SimpleStrategy).  Additional replicas in the same data center are then determined by walking the ring clockwise until a node in a different rack from the previous replica is found.
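
To make that clockwise walk concrete, here’s a minimal PHP sketch of the rule for a single data center at a replication factor of 2.  The place_replicas function and the ring representation are illustrative inventions for this post, not Cassandra’s actual code:

    <?php
    // Minimal model of NetworkTopologyStrategy's rack-aware walk for one
    // data center with a replication factor of 2. A "ring" is just the
    // nodes listed in token order.

    // Given the ring and the index of the node owning a key's range,
    // return the replica set: the owner, plus the first node found
    // walking clockwise that sits in a different rack.
    function place_replicas(array $ring, $owner)
    {
        $replicas = array($ring[$owner]['name']);
        $count = count($ring);
        for ($step = 1; $step < $count; $step++) {
            $candidate = $ring[($owner + $step) % $count];
            if ($candidate['rack'] !== $ring[$owner]['rack']) {
                $replicas[] = $candidate['name'];
                break;
            }
        }
        return $replicas;
    }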

Going back to the output of the original “ring” command, here’s a visualization of the token assignments:

Bad token assignments.

So if a key belongs to node 1-a, we place the replicated row by walking clockwise along the ring until we hit node 2-a.  However, under the same logic, a key belonging to node 1-b will also replicate to node 2-a!
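
Feeding the bad layout into the place_replicas sketch above makes the pile-up visible (node names match the figures; the array shapes are the same illustrative invention as before):

    // Token order from the bad layout: both rack-1 nodes come first.
    $badRing = array(
        array('name' => '1-a', 'rack' => 'rack1'),
        array('name' => '1-b', 'rack' => 'rack1'),
        array('name' => '2-a', 'rack' => 'rack2'),
        array('name' => '2-b', 'rack' => 'rack2'),
    );

    foreach ($badRing as $i => $node) {
        echo $node['name'], ' -> ', implode(', ', place_replicas($badRing, $i)), "\n";
    }
    // Prints:
    // 1-a -> 1-a, 2-a
    // 1-b -> 1-b, 2-a
    // 2-a -> 2-a, 1-a
    // 2-b -> 2-b, 1-a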

Once we understood what was going on, the imbalance was easy to fix.  We changed our token assignments to alternate between racks:

Good token assignments.

Now node 1-a replicates to node 2-a and node 1-b replicates to node 2-b:

Good replication flow.
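
Running the same sketch over the alternating layout confirms the flow in the figure:

    // Token order from the fixed layout: racks alternate around the ring.
    $goodRing = array(
        array('name' => '1-a', 'rack' => 'rack1'),
        array('name' => '2-a', 'rack' => 'rack2'),
        array('name' => '1-b', 'rack' => 'rack1'),
        array('name' => '2-b', 'rack' => 'rack2'),
    );

    foreach ($goodRing as $i => $node) {
        echo $node['name'], ' -> ', implode(', ', place_replicas($goodRing, $i)), "\n";
    }
    // Prints:
    // 1-a -> 1-a, 2-a
    // 2-a -> 2-a, 1-b
    // 1-b -> 1-b, 2-b
    // 2-b -> 2-b, 1-a
    // Every node now appears exactly twice across the replica sets.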

Each of the four nodes now has 50% effective-ownership of the key space.  Problem solved.

Tuning APC in PHP

While studying our PHP site’s APC behavior using this handy GUI tool, I noticed that the cache was completely flushing every few minutes.

It turns out that leaving APC’s ttl at its default value of zero does exactly that: it dumps the entire cache whenever the cache fills up.  Between opcode caching and a modest pile of user data, we were constantly filling APC’s default 32MB.
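
You don’t need the GUI to catch this, either; a quick look at the ttl setting and the expunge counter tells the same story.  This is a sketch against APC’s user cache, and the exact info-array keys can vary a bit between APC versions:

    <?php
    // Watch for flush-all events: with apc.ttl = 0, APC dumps the whole
    // cache whenever it runs out of room, and the expunge counter climbs.
    $info = apc_cache_info('user', true);  // limited=true skips per-entry data

    echo 'apc.ttl     = ', ini_get('apc.ttl'), "\n";   // 0 means dump-all when full
    echo 'cache start = ', date('r', $info['start_time']), "\n";
    echo 'expunges    = ', $info['expunges'], "\n";    // full-cache dumps so far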

Bumping up the cache size was a pain in the neck.  

Attempt one, increasing SHM_SIZE, wouldn’t work: our Linux box had a maximum shared memory segment size of 32MB.  Making things worse, the APC GUI showed that 64MB was available, even though the cache refused to fill past 32MB.
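
apc_sma_info() reports what APC believes it got from the OS, which is worth comparing against the configured apc.shm_size.  Keep our experience in mind, though: this same reporting claimed 64MB while the cache refused to fill past 32MB, so treat the numbers as a starting point rather than gospel.  The segment count it prints is also how you can confirm the one-segment mmap behavior described below:

    <?php
    // Compare the configured shared memory size against what APC reports
    // actually holding. limited=true returns just the summary numbers.
    $sma = apc_sma_info(true);

    echo 'apc.shm_size = ', ini_get('apc.shm_size'), "\n";
    echo 'segments     = ', $sma['num_seg'], "\n";
    echo 'segment size = ', $sma['seg_size'] / 1048576, " MB\n";
    echo 'available    = ', $sma['avail_mem'] / 1048576, " MB\n";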

Attempt two used the SHM_SEGMENTS parameter.  If you look at the documentation link, it seems like the perfect alternative.  Unfortunately, if you scroll way up to the top of the page you’ll find this little gem:

When APC is compiled with mmap support (Memory Mapping), it will use only one memory segment

So our APC GUI showed only one segment.  The only other sign of things being amiss was a single terse entry in the Apache error log.

Attempt three was the winner.  We increased the shared memory size limit using this howto.  Now we run with a single 64MB segment.  There have been no cache-full events since.
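
For reference, the kernel-side limit that bit us is readable from /proc, so a sanity check before (or after) bumping apc.shm_size might look like this; the path is Linux-specific:

    <?php
    // Read the kernel's maximum shared memory segment size (Linux only).
    // If this is smaller than apc.shm_size, APC quietly gets less memory
    // than you asked for. Raising it is what the howto above walks through.
    $shmmax = (int) trim(file_get_contents('/proc/sys/kernel/shmmax'));
    echo 'kernel.shmmax = ', $shmmax / 1048576, " MB\n";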

Lessons learned:

  • Watch out for the default flush-all behavior of APC.  It can mean nasty load spikes since you’ll end up with a cold cache.
  • APC’s documentation and error handling need a dash of love.
  • The GUI tool for APC is pretty awesome.