Thursday, October 31, 2013

Maintaining Constant Probability of Data Loss with Increased Cluster Size in HDFS

In a conversation with a colleague some months ago I was asked if I knew how to scale the replication factor of a Hadoop Distributed File System (HDFS) cluster as the number of nodes increased in order to keep the probability of experiencing any data loss below a certain threshold. My initial reaction to the question was that it would not be affected, I was naively thinking the data loss probability was a product of the replication factor only.

Thankfully, it didn't take me long to realize I was wrong. What is confusing, is that for a constant replication factor as the cluster grows the probability of data loss increases, but the quantity of data lost decreases (if the quantity of data remains constant).

To see why consider a situation in which we have N nodes in a cluster with replication factor K. We let the probability of a single node failing in a given time period be X. This time period needs to be sufficiently small so that we know that the server administrator will not have enough time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of getting K or more nodes failing. The exact value of which is calculated with the following sum:

Although in general a good approximation (a consistent overestimate) is simply:


Clearly as the size of N increases this probability must get bigger.

This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.

In essence we want to know if the number of nodes in the cluster increases by some value n, then what is the minimum number k such that the probability of data loss remains the same or smaller. Using the approximation from above we can express this as:


Now if we substitute in the formulas for N-choose-K and perform some simplifications we can transform this into:


I optimistically thought that it might be possible to simplify this using Stirling's Approximation, but I am now fairly certain that this is not possible. Ideally we would be able to express k in terms of N,n,K,X, but I do not think that it is possible. If you are reading this and can see that I am wrong please show me how.

In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.

I tried various combinations of X, N and K. Interestingly for a constant X the scaling was fairly robust when you varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probability of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.

You can grab the R code used to generate these plots from my GitHub repository.









Tuesday, October 29, 2013

Philosophical Zombies and the Physical Basis of Consciousness



Given that the Walking Dead is back with a new season and World War Z has just ripped through the public conscious I thought that the philosophical implications of zombies would be a worthwhile subject for a post. I would not be the first person to have thought about what the notion of a zombie means for consciousness, and in fact the zombie has a well entrenched place in a series of arguments about the nature of the relationship between mind and matter.

Before we get under way, it is worth noting that to philosophers versed in the age old tradition of the Thought Experiment, a zombie is not a flesh eating monster that shambles around slowly decomposing and smelling disgusting. A philosophical zombie is merely a person without consciousness. I can hear you all respond "Que?" followed by quizzical silence. The idea is to ask if we can conceive of a person who acts and behaves like we do without the mental qualia of consciousness, that is the internal experience of seeing red, smelling roses and the pain of being pricked by a thorn.

The way this is taken to impact on our understanding of mind and brain relies on a second philosophical trick: the notion of a conceivability argument. This is the idea that if we can conceive of something then it is in some sense at least possible. Usually this is taken as metaphysical possibility, i.e. that it may not be possible in this universe, but in some other universe. If you think this is a pretty slippery way to argue, then you are in good company. Nevertheless, it persists as a philosophical tool, and for the sake of this post I am going to grant it temporary validity.

Ok. So.

The argument goes as follows: physicalist explanations of consciousness require that there be some configuration of matter that corresponds to conscious states. If we can conceive of a zombie, then it is metaphysically possible that a being could exist that can act as we do, yet is not conscious. As that means that the zombie must have the configuration of brain matter that allows the specific conscious-like behavior, therefore that configuration of brain matter cannot be the source of consciousness.

However, even allowing the conceivability argument, this is still an invalid argument. The reason is that just because for homo sapiens we observe certain configurations of brain matter that give rise to the set of behaviors and conscious states, it does not preclude the existence of other arrangements that have the former but not the latter. It is equivalent to observing a bunch of four legged tables and concluding that table-ness and four-legged-ness are a necessary combination. In reality other arrangements of legs can also make tables, and four legs does not always a table make.

Strengthening this objection is the fact that we know that the micro-structure of our brains are different between individuals. In fact, this is the source of our individuality. While the macro-structural features of our brains are shared (thalamus, hypothalamus, corpus callosum, regions of the cerebral cortex and their inter-connectedness), the fine grained structures that control our thoughts and actions are (virtually) unique. This means that in reality there is no a single configuration of brain matter that gives rise to a given set of behaviors and their corresponding conscious states, but rather a family of configurations.

There is nothing preventing this family of configurations being broader than we know them to be, and a certain (as of yet unobserved) set of them having the property of giving rise to behaviors without conscious states. This might seem far-fetched, but as I can conceive of it, it must be meta-physically possible.