Thursday, 7 February 2013

Surfendipity



Let's start with a question. How would you order all websites in order of importance? You could order by most popular, most informative or most authoritative.

None of those are truly satisfactory because the question is subjective. It depends on who is asked. For example I would rank technical, comedy and competitive gaming sites highly conversely fashion, gossip and liberal arts sites would rank extremely low. Yet incomprehensible many (majority?) people think the opposite.

So we alter a websites rank depending on the viewer. This is what Google does now but originally googled ranked websites using the PageRank algorithm (named after one of the founders Larry Page). The algorithm takes an objective approach to solving the problem.

Figure 1. Random surfer

We have all been there, you go on the web with a specific purpose and then through a magical Intertubes journey you end up somewhere totally unpredictable. For example I started today searching for the relationship between static code analysis and the halting problem, and I hate to admit ended up on youtube...

Jeez Louise what on earth does that anecdote have to do with PageRank? Its a nice analogy of how PageRank works. Given a user (for example a monkey) that surfs the web by randomly clicking links. The PageRank of a website is the probability of that user landing on the site after a significant number of clicks. This is illustrated in Figure 2.

Figure 2. Graph with PageRank values.

So you can think of hyper-links as the source website vouching for the destination website. This explains why website B in Figure 2 is ranked so highly because many sites link to it. In contrast site C is ranked highly even though it has only one incoming link. The reason being its got an incoming link from the big dogg B. Since its likely to land on B it is also likely to land on C.

“Its not what you know or who you know – its who knows you.”

Ok that makes sense and if you're like me you must be wondering how on Gods green earth do you calculate those probabilities? That is what the rest of the article is about. First we need Markov chain models. These model the movement through a stochastic state machines. These are just state machines with transitions based on probabilities like so:


Figure 3. Directed graph to illustrated state transitions

So just to clarify we can move from state A to B with probability of 0.5 and from B to C with 1.0 and so on so forth.

The Markov chain is just a sequence of random variables ( X1, X2, … Xn ) that represent a single state. So you can imagine just walking along the graph. So let's say we start at state A. Then the random variable X1 represents our first step so it is

P( X1 = A ) = 0.0
P( X1 = B ) = 0.5
P( X1= C ) = 0.5

The most important feature of the Markov chain is that the transitions probabilities do not depend on the past. In this article we assume the transitions are constant throughout. So it doesn't matter at all what ridiculous journey you took to get to a state. To predicate the future all you need is the current state and the transitions. So just to reiterate the past is irrelevant on Markov chains.

Right let's represent the state transitions using a transition matrix T . Since the transitions are probabilities we call it a stochastic matrix (technically the total probability of moving from one state to the others is 1 so the rows sum to 1).

So the entry at row i and column j represents the probability of moving from state i to state j. For example T23 is the probability of moving from state B to state C i.e. 1.0.

Why did we bother converting the graph to a matrix. Well it turns out it has a few neat properties. Let's say we started at state A and I asked you what's the probability of moving to state C in two steps?

To answer assume the following notation P(IJ) means the probability of moving from state I to state J in one step (just look up the matrix). So for example P(AB) = 0.5. So the probability of going from A to C in two steps is the probability of going from A to every state (A, B, C) then the probability of going from that state to C. i.e.

Notice that this corresponds to the entry T2AC. If you try it out for the rest of the entries it turns out that the product of the transition matrix is the probabilities of going from the row to the column in 2 steps. That is T2ij the probability of going from state i to state j in two steps. If you multiply that by the transition matrix you get the state transitions for 3 steps. So it turns out that Tn is the state transitions for n steps. Really neat huh? I thought so anyway.

Ok let's multiply our transition matrix so we can see the probabilities for two, three, four, five, twenty and one hundred steps:


Something very interesting happens. The entries seem to converge until all the columns values become equivalent. This means that after a certain number of steps it doesn't matter which state you started on. The probability of getting to another state is constant. My mind = blown. So let's try visualise this:


Figure 4. State with converged population distribution.


So let's say we start with a 100 people. Let's use the values of the converged matrix to place people on the graph. So let's put 40% of them on A and another 40% on C. The last 20% can go on B. So move forward one step in the Markov chain.

Half of A will go to B and the other half will go to C. So the new B will have 20% of people. The new C has 20% and the rest of the people from old B which is also 20% therefore the new C has 40%. Finally new state A has all the people from old state C i.e. 40%. Whad'ya know? We are back where we started. So this shows the steady state that the Markov chain converges on. Once we get into this distribution we stay there forever.

So if the small graph used in the example actually represented three websites and the related hyper-links. Then the stabilised values (i.e. the probabilities for each column in converged matrix) represent the PageRank of each website. Sweet!

This blog is already long enough so I'm going to skip over the intuitive proof of convergence. Although Ive put that in the appendix A. I'll just show how we can use the transition matrix to calculate the PageRank.

If we represent the initial distribution (for example of people) using a vector e.g.


That means ¼ of people are in state A and another ¼ in B and the rest ( ½) are in C. If we multiply the transition matrix with the vector. We will get a new vector representing the distribution after the first step. By the same token if we multiply it by an n step transition matrix then the result is the distribution after n steps. So written mathematically, given a transition matrix T, a source distribution s (represented as a row vector) we can calculate the resulting distribution r after n steps using:

r = s . Tn

So we have two very similar concepts but they are not equivalent.
  1. After a number of steps the probability of getting to a state is constant regardless of where you start. So no matter what initial distribution you start with you will end up with stable distribution.
  2. If you apply a single state transition on the stable distribution you will get the stable distribution.

Let's go through both these points in order. To illustrate point 1 let's pick any random distribution and multiply it with the stable matrix (the converged matrix). So let's say we have 0.3, 0.3 and 0.4 in states A,B and C respectively. Represent it as a row vector and multiply with the converged matrix:


Notice the result distribution is the stable distribution. So intuitively what does this mean? It just means that given the initial distribution and taking it through a significant number of steps we will eventually end up at the stable distribution.

The stable distribution is interesting because once you're in it you remain in that distribution forever . This just a rewording of point 2. So let's illustrate this by multiplying the stable distribution with the initial transition matrix:



Notice we get the stable distribution again. That makes sense because the definition of the stable distribution is one that doesn't change when transitioned. If you try any other distribution on the transition matrix you will not get the same result.

So if you recall from linear algebra an eigenvector is a vector that is only scaled when transformed by a matrix. That is, given a matrix T, a vector s (called eigenvector) and a scalar lambda (called eigenvalue) we have


This looks very similar to our answer above. Except we have a value of lambda as 1 and we are using a row vector (we could change that by transposing both). So it turns out that the stable distribution is the eigenvector of the transition matrix with an eigenvalue of 1.

So moving back to our PageRank algorithm where the transition matrix is a representation of the web and the stable distribution is the PageRank of websites. We can say the PageRank is the eigenvector of the stochastic matrix representing the topology of hyper-links. Wow! that sounds complicated but hopefully we now understand what that means.

Let's generate the algorithm. So our input is the stochastic transition matrix representing the Web. We want the stable distribution (i.e. the eigenvector). Remember a multiplication of any distribution with the convergence matrix will be that stable distribution / eigenvector. So all we have to do is multiply any distribution with our convergent matrix to get the answer.

So the problem reduces to finding the convergent matrix. Well we know to do that we just need to multiply repetitively (take to the exponent) the transition matrix until it converges. So let x be our initial distribution. Just set it to 1/n where n is the number of components in the vector (i.e. websites). And we have the following formula to calculate the eigenvector: 


Huh? bear with me. Let's expand this out:


So by doing it in this order we can skip a few operations. We can do vector-matrix product instead of matrix-matrix product which has far fewer operations. So the last and final question is when do we stop? when we hit the stable distribution (eigenvector). That is when: 


So let's convert this into a floating point friendly algorithm:
 /* Inputs are  
 * n = number of websites   
 * T = is a n x n transition matrix  
 */  
 vector x = { 1/n, 1/n, ..., 1/n }    // initial distribution  
 do {  
    old = x   
    x = x * T  
    delta = | x - old |  
 } while( delta > EPISILON );  

So once the algorithm is complete, x will be the stable distribution, eigenvector and the PageRank of the web.

So its been a pretty lengthy article for such little code. Always the way with Maths heavy algorithms. For more information on PageRank read "The PageRank Citation Ranking: Bringing Order to the Web" and on Markov chains I highly recommend "Finite Markov Chains and Algorithmic Applications" by Olle Häggström.

So Appendix A has a high-level explanation for convergence. Then Appendix A shows how Google converts the web graph into a Google matrix so that it has a convergent property.

Right I applied the algorithm to the request for comments (RFC) and used citations as links/edges. The PageRank of all RFCs (at the time of writing) in descending order:

https://github.com/iainkfraser/PageRank/blob/master/rfc_pagerank_with_titles.txt

That's all folks - PEACE!


Appendix A - Convergence

I'm going to very briefly describe state transition restrictions that allow convergence. Then i'll explain intuitively why convergence occurs. If you want to learn more about Markov Chains in more detail then refer to a book, it's a subject in its own right.

Theroem: Any irreducible and aperiodic Markov chain has exactly one stationary distribution. 

Before I explain irreducible and aperiodic let me say the converse isn't true. Not being irreducible and aperiodic does not mean there isn't a stationary distribution it just means there may not be. So irreducible and aperiodic chains are interesting because we are guaranteed to have stationary distribution (and therefore a PageRank).

An irreducible graph means there is always a way to get from one state to another eventually. If you can't you call it reducible. Figure 5. has some example illustrations.


Figure 5. The two chains on the left are reducible and the one on the right is irreducible

By the way its called reducible because you can split into two or more separate graphs and model those using Markov chains. So its quite obvious there can't be a stationary distribution because remember the stationary distribution means: the probability of getting from any state to a certain state is constant. So clearly if you can't get to a state the probabilities can't be the same because some states can reach but other(s) cannot.
Figure 6. The two on the right are aperiodic and the left has a period of of 2.


Aperiodic is a bit more difficult to explain. So I'm going to explain it with an example, again read one of the Markov chains books for the formal definition (quickly its the greatest common divisor of the number of steps to get back to the same state).

So look at example  in Figure 6a. If you place all one hundred people on S1. Then if we do the next step they all move to S2  (so none on S1). Then they all move back to S1 and so on and so forth. So clearly its never going to converge because it oscillates. It has a period of 2, because getting back to state S1 takes a minimum of 2 steps.

Compare that with Figure 6c which is aperiodic. So S1 is trivial because we can get straight back to S1 in one step. But S2 we can get back in 2,3,4,5,7... steps. Why isn't that periodic? Let's run through an instance of a Markov chain. So all 100 people move from S2 to S1. Then 50 go back S2 and the other 50 stay on S1. Now look at this distribution (which originally started with 100 on s2) we can get back to S2 now in 1 step forever. Ergo its not periodic.

So let's get onto a intuitive description of the proof. If the graph is aperiodic and irreducible then eventually (after a number of steps) every value in the transition matrix will be between 0 and 1 exclusively (i.e. not 0 or 1). Because there is a way to get from any state to any other state.


Figure 7. Probability of moving to B in one step

So let's start thinking of a hypothetical graph shown in Figure 7. I was going to use variables for transition probability but I think numbers are simpler to understand. So if we were playing a game and wanted to get to B in one step we would start at C and if we didn't want to get to B we would start at A.  How about two steps? 


Figure 8. Probability of moving to B in two steps

So if I asked you to try get to B, what would you say? B->C->B right. Because we already know ending with C -> B is the optimum for one step. So we want the optimum something to C which is B. So the optimum odds of landing on B for 2 steps is 0.54. Notice that the odds are worse than just the single step.

Now if I asked you try to avoid getting to B in two steps what would you say? Well again we want to pick the most likely of landing on A->B (the least likely single step) so we would choose A->A->B which has overall odds of 0.18. Notice again these odds are greater (so in response to the question worst) than the single step.



Figure 9. Probability of A->B in three steps. Notice the minimum is increasing

This illustrates why convergence works. The maximum and minimum probabilities of getting to B will be in the first step. Because in the next step the probability of getting to the maximum step is not 1 so its going to lose a bit (to the other branches). And the probability of getting to the minimum step is also not 1 so we will gain a bit (by the other branches). 

Through induction (due to recursive nature of the steps) we can show that after each step the minimum will raise and the maximum will fall until they eventually converge. Therefore no matter where you start you have the same probability of getting to the destination (B in this case).

It should now make sense why you don't want 1 or 0 entries in the matrix forever. Because that means the maximum or minimum doesn't have to fall or raise and thus converge. 

Appendix B – Google Matrix

So we need the Markov chain to be irreducible and aperiodic to guarantee that there is a stable distribution (see Appendix 1). An easy way to do that is to have a link from every state to every other state.

So how can we transform the hyper-link graph into this type of graph. Because clearly every website doesn't link every other website. The way Google does it, is to imagine that the random surfer may get bored and when that happens they randomly teleport to random website.

So how do we apply this logic to the transition matrix? Well we need a probability of the user getting bored call d (for damping) which is usually set to 0.85 (the clever dudes at Google figured this number out). Then we can convert the transition matrix to:


So the dT just dampens all the links due to the chance of getting more bored. Then there is a chance of ( 1 – d ) divided by the number of sites of randomly teleporting to another site. So we need to add that to every entry in the matrix which is what eeT (generates a n x n matrix of all ones using outer product).



Wednesday, 30 January 2013

Cache Money Hoes


Before we begin, the underlined words are briefly described in the glossary at the end. If there are any terms you don't understand let me know and I'll place them in the glossary as well.

As the wise contemporary philosophers the Wu-Tang Clan once said: 
"Cache Rules Everything Around Me ( C.R.E.A.M ) "
Many people believe this is just typical rappers exhibiting the materialistic nature of modern life. However, they were in fact visionaries describing the importance of memory hierarchy utilisation on processor performance.

Traditionally the time complexity of an algorithm was based on the RAM model.  The model assumes that access to all memory locations takes equivalent constant time.  In reality a cache hit is an order of magnitude faster than a cache miss.

Therefore it should be possible to significantly improve an algorithm by reordering its memory access to be cache friendly. This isn't a particularly novel idea and it has a name: cache aware algorithms.

The problem with cache aware algorithms is that they depend on the cache size. The algorithm must be tuned for every different cache it runs on. This unportable tuning was the inspiration for Harald Prokop's cache oblivious algorithms. They are algorithms that are cache aware but don't require tuning.

To illustrate cache-oblivious and cache-aware algorithms we first need an algorithm we can improve. The key is that we don't change the time complexity at all. That means we still do the same number of operations - so the processors do the same work. The only variant is the order of memory access and therefore cache utilisation.

Algorithms that access memory often are ripe for the picking. I decided to implement the matrix transpose because it one of the simplest memory heavy algorithms. The matrix transpose simply takes the rows of input algorithm and makes them the columns of the output matrix, like so:



You don't need to understand matrices at all to understand this algorithm. You only need to understand one definition:
[\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji}

This simply states that the entry with jth row and ith column for the input matrix is equal to the entry with ith row and jth column in the output matrix.  

This cache-unaware or naive algorithm directly uses this definition. As shown below:
      for( int i = 0; i < N; i++ ){  
           for( int j = 0; j < N; j++ ){  
                out[j][i] = in[i][j];  
           }  
      }  

Just a side note, the matrices here are large e.g. in this blogs associated code N=1024 i.e. the matrix is 4MB ( each entry is 4 bytes so 4x1024x1024) . Anyway back to the analysis 2D arrays in C are stored in row-major order.

This means that an entry's horizontal neighbours ( i.e. row neighbours ) will be on the same cache line. Unless the entry itself is at the end of a cache line.  Therefore accessing a horizontal neighbour will likely be a cache hit. Conversely accessing a vertical neighbour will likely be  a cache miss ( unless its a relatively small matrix and its row is small enough to fit in a cache line ).

The naive algorithm reading the input matrix will often cache hit because it traverses in row order ( the inner iterator j walks the column ). However writing to the output matrix will likely cache miss due to column order traversal.

How can this be improved? Well we can change the order of writing to minimize the cache misses. That is exactly what the cache-aware algorithm does. It splits the matrix into a set of sub-matrices that have a size equivalent to the cache line. Then runs that algorithm on the sub-matrices:


Don't let sub-matrices confuse you. We are still essentially following the [\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji} rule. The only difference is we are restricting it to a sub-blocks of the matrix. Once the block has been transferred with minimal cache misses we move onto the next block. Use the source Luke if you want to know more.

The problem with the previous cache aware algorithm is it took the cache line size as a parameter. So we would have to tune it for each different target. Can we eliminate this tuning? Sure we can! The cache oblivious (CO) way.

The CO basically keeps splitting the matrix into sub-blocks until we reach blocks of size 1. In reality you set it to slightly larger then one to minimize the cost of recursion. Then apply the transpose on the blocks. So the CO is basically the cache-aware algorithm but the tuning is done automatically through divide and conquer.

But wait, isn't setting the block size too small going to cost a cache miss? No thats the clever bit! If the block is smaller than the cache line then the siblings and eventually parent block ( the block that encompasses the subblock ) will still be in cache.  As shown below



The initial block being accessed is the sub-block in the top left consisting of red and blue blocks. Its entries are shown in the cache line above. Once its been accessed its sibling block ( the top right block consisting of green and yellow ) is accessed. Since the blocks are smaller than the cache line the siblings entries are also in cache.

So lets see if the wu-tang hypothesis is correct.  The average results of running the transpose on a 1024x1024 matrix 100 times are:

NaiveCache AwareCache Oblivious
5704 ms 763.33 ms 1212 ms

Of course the Wu-Tang clan was right on the money. So the algorithms are doing the exact same amount of operations and the difference is huge. The cache oblivious algorithm is slightly slower because its paying for the recursion. As the size of the matrix grows the cost of recursion will reduce. So we can say that the cache oblivious algorithm is asymptotically equivalent to the cache aware one.

This blog has given a very broad overview. For more information please read Harald Prokop's MIT master's thesis and the attached source code

For the people that do read the thesis one thing that had me confused was the cache complexity was often stated as O( 1 + k / N ) and I couldn't figure out where the 1 came from. Its because big-O represents worst case, and the worst case in terms of cache is when the memory address is misaligned so you have to pay for an additional cache miss per object access. 



So the take home for today is, when your code runs slow don't forget the cache! Until next time keep it real...



Glossary 

Time Complexity - A way of comparing algorithms mathematically. Instead of comparing them by actual time measurements that are affected by the computer the algorithm is run on. Compare them by the expected number of operations. The number of operations is typically dependent on the size of the input n. So we say its complexity is a function of n.

Cache Hit - When the CPU attempts to access an object that's already in cache.

Cache Miss - When the CPU attempts to access an object that's not in cache so it must be copied from memory to cache ( a line size at a time ).

Row Major - A 2D ( or n dimensional ) array in C is actually stored as one large single array. In row major order rows are written out after each other. So for example if the array is defined as x[3][5] and you access element x[2][4] its actual set at 2*5 + 4 because it has to skip the first two rows and then move to the 4th element.