Wednesday 30 January 2013

Cache Money Hoes


Before we begin, the underlined words are briefly described in the glossary at the end. If there are any terms you don't understand let me know and I'll place them in the glossary as well.

As the wise contemporary philosophers the Wu-Tang Clan once said: 
"Cache Rules Everything Around Me ( C.R.E.A.M ) "
Many people believe this is just typical rappers exhibiting the materialistic nature of modern life. However, they were in fact visionaries describing the importance of memory hierarchy utilisation on processor performance.

Traditionally the time complexity of an algorithm was based on the RAM model.  The model assumes that access to all memory locations takes equivalent constant time.  In reality a cache hit is an order of magnitude faster than a cache miss.

Therefore it should be possible to significantly improve an algorithm by reordering its memory access to be cache friendly. This isn't a particularly novel idea and it has a name: cache aware algorithms.

The problem with cache aware algorithms is that they depend on the cache size. The algorithm must be tuned for every different cache it runs on. This unportable tuning was the inspiration for Harald Prokop's cache oblivious algorithms. They are algorithms that are cache aware but don't require tuning.

To illustrate cache-oblivious and cache-aware algorithms we first need an algorithm we can improve. The key is that we don't change the time complexity at all. That means we still do the same number of operations - so the processors do the same work. The only variant is the order of memory access and therefore cache utilisation.

Algorithms that access memory often are ripe for the picking. I decided to implement the matrix transpose because it one of the simplest memory heavy algorithms. The matrix transpose simply takes the rows of input algorithm and makes them the columns of the output matrix, like so:



You don't need to understand matrices at all to understand this algorithm. You only need to understand one definition:
[\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji}

This simply states that the entry with jth row and ith column for the input matrix is equal to the entry with ith row and jth column in the output matrix.  

This cache-unaware or naive algorithm directly uses this definition. As shown below:
      for( int i = 0; i < N; i++ ){  
           for( int j = 0; j < N; j++ ){  
                out[j][i] = in[i][j];  
           }  
      }  

Just a side note, the matrices here are large e.g. in this blogs associated code N=1024 i.e. the matrix is 4MB ( each entry is 4 bytes so 4x1024x1024) . Anyway back to the analysis 2D arrays in C are stored in row-major order.

This means that an entry's horizontal neighbours ( i.e. row neighbours ) will be on the same cache line. Unless the entry itself is at the end of a cache line.  Therefore accessing a horizontal neighbour will likely be a cache hit. Conversely accessing a vertical neighbour will likely be  a cache miss ( unless its a relatively small matrix and its row is small enough to fit in a cache line ).

The naive algorithm reading the input matrix will often cache hit because it traverses in row order ( the inner iterator j walks the column ). However writing to the output matrix will likely cache miss due to column order traversal.

How can this be improved? Well we can change the order of writing to minimize the cache misses. That is exactly what the cache-aware algorithm does. It splits the matrix into a set of sub-matrices that have a size equivalent to the cache line. Then runs that algorithm on the sub-matrices:


Don't let sub-matrices confuse you. We are still essentially following the [\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji} rule. The only difference is we are restricting it to a sub-blocks of the matrix. Once the block has been transferred with minimal cache misses we move onto the next block. Use the source Luke if you want to know more.

The problem with the previous cache aware algorithm is it took the cache line size as a parameter. So we would have to tune it for each different target. Can we eliminate this tuning? Sure we can! The cache oblivious (CO) way.

The CO basically keeps splitting the matrix into sub-blocks until we reach blocks of size 1. In reality you set it to slightly larger then one to minimize the cost of recursion. Then apply the transpose on the blocks. So the CO is basically the cache-aware algorithm but the tuning is done automatically through divide and conquer.

But wait, isn't setting the block size too small going to cost a cache miss? No thats the clever bit! If the block is smaller than the cache line then the siblings and eventually parent block ( the block that encompasses the subblock ) will still be in cache.  As shown below



The initial block being accessed is the sub-block in the top left consisting of red and blue blocks. Its entries are shown in the cache line above. Once its been accessed its sibling block ( the top right block consisting of green and yellow ) is accessed. Since the blocks are smaller than the cache line the siblings entries are also in cache.

So lets see if the wu-tang hypothesis is correct.  The average results of running the transpose on a 1024x1024 matrix 100 times are:

NaiveCache AwareCache Oblivious
5704 ms 763.33 ms 1212 ms

Of course the Wu-Tang clan was right on the money. So the algorithms are doing the exact same amount of operations and the difference is huge. The cache oblivious algorithm is slightly slower because its paying for the recursion. As the size of the matrix grows the cost of recursion will reduce. So we can say that the cache oblivious algorithm is asymptotically equivalent to the cache aware one.

This blog has given a very broad overview. For more information please read Harald Prokop's MIT master's thesis and the attached source code

For the people that do read the thesis one thing that had me confused was the cache complexity was often stated as O( 1 + k / N ) and I couldn't figure out where the 1 came from. Its because big-O represents worst case, and the worst case in terms of cache is when the memory address is misaligned so you have to pay for an additional cache miss per object access. 



So the take home for today is, when your code runs slow don't forget the cache! Until next time keep it real...



Glossary 

Time Complexity - A way of comparing algorithms mathematically. Instead of comparing them by actual time measurements that are affected by the computer the algorithm is run on. Compare them by the expected number of operations. The number of operations is typically dependent on the size of the input n. So we say its complexity is a function of n.

Cache Hit - When the CPU attempts to access an object that's already in cache.

Cache Miss - When the CPU attempts to access an object that's not in cache so it must be copied from memory to cache ( a line size at a time ).

Row Major - A 2D ( or n dimensional ) array in C is actually stored as one large single array. In row major order rows are written out after each other. So for example if the array is defined as x[3][5] and you access element x[2][4] its actual set at 2*5 + 4 because it has to skip the first two rows and then move to the 4th element.

63 comments:

  1. Cool article with good illustration, please keep writing!

    ReplyDelete
  2. thank-you ... I will try to blog about every Sunday project I take on. Next Sun is going to be on linear programming or indexing. I may write an article on JIT from the Sunday before last :)

    ReplyDelete
  3. I believe that cache-oblivious algorithms are asymptotically optimal as the number of cache levels in the hierarchy increases, not asymptotically optimal as the input size increases.

    In "The Cost of Cache-Oblivious Searching" (Proceedings. 44th Annual IEEE Symposium on Foundations of Computer Science, 2003), Bender et al. prove that cache-oblivious search structures take (lg e) or approximately 44% more block transfers than the optimal cache-aware algorithm in the 2-level memory model (the DAM model) and can do no better. As the number of levels in the memory model increases, the performance loss of the cache-oblivious algorithm approaches zero.

    Of course real hardware does not have a 2-level memory hierarchy nor does it have an infinite level memory hierarchy. But I have found that in practice, the 2-level memory model is a good abstraction for application performance. There is usually one "jump" in the memory hierarchy that has a significant performance impact. Either the gap from disk to main memory, or the gap from main memory to highest cache level, depending on the application.

    ReplyDelete
    Replies
    1. Thank you for letting me know. Ill read that on Sat and update the correction.

      Delete
  4. Awesome article! Interesting note when I ran the code, the cache oblivious was faster than the cache-aware. (compiled with gcc -O3 and ran on an i7)

    naive : 1696 1707 1707
    cache aware : 798 800 803
    cache oblivous: 364 366 364

    ReplyDelete
    Replies
    1. Thanks Danny ... A friend of mine also noticed similar results. Unfortunately I must restrict myself to Sundays for tangent work so I really can't do more research on the cause.

      If are interested in finding out. Before studying the optimised output make sure:

      1) You have set the cache line parameter for the aware algorithm appropriately
      2) Verify that the matrix begins on a page boundary using mmap ( so that its cached align )

      If you do find out, please let me know and Ill update this blog.

      Thanks

      Delete
    2. Worth noting is that the ratio of performance increase in the three cases are as follows:

      naive: 3.35 times faster
      cache aware: 0.954 times as fast
      cache oblivious: 3.32 times faster

      So that supports the hypothesis that your test machine is around 3.3 times faster than the one used for the article, but you have incorrectly configured the cache size or alignment in the cache-aware case, in one of the ways Iain describes above.

      Incidentally, this is a good anecdotal argument as to why the cache-oblivious algorithm is a better choice in general, because it's harder to misconfigure.

      Delete
  5. get the memory, mega mega bytes ya'll!

    ReplyDelete
  6. Well-written. Made my day.

    (memory) Word!

    ReplyDelete
  7. Hey there, I’m Nikitha Bangalore. I’m a Model living in Bangalore. I am a fan of Independent Bangalore Escorts, Bangalore escorts, and call girls in Bangalore.

    ReplyDelete
  8. You are so awesome! I don’t think I’ve read through anything
    like that before.
    토토
    바카라사이트

    ReplyDelete
  9. This is an informative post review. I am so pleased to get this post article and nice information. Thanks for sharing with us.

    토토
    바카라사이트
    파워볼
    카지노사이트

    ReplyDelete
  10. Your blogs further more each else volume is so entertaining further serviceable It appoints me befall retreat encore. I will instantly grab your rss feed to stay informed of any updates.

    스포츠토토
    안전놀이터
    토토사이트

    ReplyDelete
  11. I recently found many useful information in your website especially this blog page. Among the lots of comments on your articles. Thanks for sharing.

    스포츠토토
    카지노사이트
    파워볼게임
    바카라

    ReplyDelete
  12. Queen Casino & Slots Review (2021) | Play Online Casino
    Queen Casino is rated 8.6/10. 퍼스트 카지노 It provides a fun and modern atmosphere with plenty of games for fun88 vin players to enjoy, the range of bonuses and promotions クイーンカジノ are

    ReplyDelete
  13. If some one desires expert view concerning
    blogging and site-building afterward i suggest him/her to pay a quick visit this website, Keep up the nice work. 스포츠토토

    ReplyDelete
  14. This post gives clear idea in favor of the new users of blogging,
    that in fact how to do running a blog. 파워볼사이트

    ReplyDelete
  15. It’s an awesome paragraph in support of all the web
    viewers; they will get benefit from it I am sure. 바카라사이트

    ReplyDelete
  16. Keep up the good work , I read few blog posts on this website and I believe that your weblog is rattling interesting. Thank you for this effort, I will give you 5 stars for this. Kindly check the link below Thank you! Feel free to visit my website; 배트맨토토

    ReplyDelete
  17. It is not my first time to go to see this website, i am browsing this web site daily and take good facts from here every day. 바카라사이트

    ReplyDelete
  18. I like your blog. i ma happy to read your blog its very informative and your blog is really good and impressive you made it 메이저검증업체

    ReplyDelete
  19. Thank you for the information provided! 토토 Maintain the good performance of your site. You can also check my article

    ReplyDelete
  20. I read this post completely about the comparison of hottest and previous technologies, it’s remarkable article. 안전사이트

    ReplyDelete
  21. 토토
    안전놀이터
    프로토


    This is really interesting, You are a very skilled blogger.
    I've joined your feed and look forward to seeking more of your great post.
    Also, I've shared your web site in my social networks!

    ReplyDelete
  22. 스포츠중계
    스포츠토토티비
    토토사이트

    Asking questions are genuinely good thing if you
    are not understanding something fully, however this post offers
    fastidious understanding even.

    ReplyDelete
  23. 토토
    스포츠토토
    먹튀검증



    Excellent post. I was checking constantly this blog and I'm impressed!
    Very helpful info particularly the last part :) I care for such info much.
    I was seeking this particular info for a long time. Thank
    you and good luck.

    ReplyDelete
  24. Hey friend, it is very well written article, thank you for the valuable and useful information you provide in this post. Keep up the good work! FYI, shih tzu hair products , Airtel Axis Bank Credit Card Review, She erased her pdf download by Himanshu Rai,Paragraph On An Ideal Student

    ReplyDelete
  25. How to Play Casino: Easy Guide to playing slots on
    Casino games are played by 4 players, ventureberg.com/ the average 1xbet app time they casinosites.one take 토토 turns is around 14:20. The house is divided into three distinct wooricasinos.info categories: the house

    ReplyDelete
  26. I'm so happy to finally find a post with what I want. 안전놀이터순위 You have inspired me a lot. If you are satisfied, please visit my website and leave your feedback.

    ReplyDelete
  27. I clearly stumbled upon your weblog and favored to mention that I’ve truly loved reading your blog posts.
    온라인카지노

    ReplyDelete
  28. I really appreciate this wonderful post that you have provided for us.
    바카라사이트

    ReplyDelete
  29. I felt very happy while reading this site. This was really very informative site for me. 바둑이사이트넷

    ReplyDelete
  30. Your post is very helpful and information is reliable. I am satisfied with your post. Thank you so much for sharing this wonderful post. If you have any assignment requirement then you are at the right place. 메이저사이트
    cbb

    ReplyDelete
  31. I know this is one of the most meaningful information for me. And I'm animated reading your article 카지노사이트

    ReplyDelete
  32. But should remark on some general things, the website style is perfect; the articles are great.
    카지노사이트존
    카지노사이트
    바카라사이트

    ReplyDelete
  33. บริการเกมสล็อตออนไลน์ปี 2022 เกมให้เลือกเล่นมากกว่า 1,000 เกม สล็อต d เล่นได้จริง แจกจริง สมัครเลยตอนนี้

    ReplyDelete
  34. If you are interested in sports, my blog will be very helpful to you.

    토토사이트

    ReplyDelete
  35. VERY INTERESTING, I WISH TO SEE MUCH MORE LIKE THIS. THANK YOU FOR SHARING THIS****** KIND OF INFORMATION!

    ReplyDelete
  36. YOUR WRITING SKILL IS SO -------GOOD, KEEP IT UP!

    ReplyDelete
  37. I HAVE READ YOUR POST. THIS IS A GREAT JOB, I WANT TO SAY THANK YOU FOR THIS POST. 😘😘😘***-***

    ReplyDelete
  38. I learned a lot from your post. Thank you for sharing your knowledge." This shows that you found the post informative and helpful.
    Formularios de divorcio de Virginia Beach sin oposición
    preliminary protective order hearing virginia

    ReplyDelete

  39. Crypto.com is a popular cryptocurrency exchange and wallet app that allows users to buy, sell, store, and trade digital assets. if you having issues with your Crypto.com login? Is it not functioning properly? Don’t worry, this is a common issue, and here we will provide you with a few options for fixing the Crypto login Issues.

    ReplyDelete
  40. Your blogs are really good and interesting. It is very great and informative. I got a lots of useful information in your blog. Many people believe this is just typical rappers exhibiting the materialistic nature of modern life. However, they were in fact visionaries describing the importance of memory hierarchy utilisation on processor performance Sex Crime Lawyer. Keeps sharing more useful blogs..

    ReplyDelete
  41. I want to extend a massive appreciation to the mind behind this blog! Your efforts genuinely distinguish themselves and garner immense gratitude. The thought-provoking material you reliably present mirrors your unwavering commitment and fervor. Anticipating a continuous stream of captivating articles. Maintain this exceptional endeavor with utmost brilliance!Abogado de Violencia Doméstica New Jersey

    ReplyDelete
  42. Thank you for sharing the informative article. Abogado DWI Virginia

    ReplyDelete
  43. Cache Money Hoes is a powerful and subversive novel that challenges readers to think about race, class, and gender in America. Beatty's writing is sharp and witty, and he uses humor to address serious issues. The book is not without its flaws, however. Some critics have argued that it is too nihilistic and that it does not offer any solutions to the problems that it raises.Cache Money Hoes is a satirical novel by Paul Beatty that challenges readers to think about race, class, and gender in America. Beatty's writing is sharp and witty, and he uses humor to address serious issues. The book is not without its flaws, but it is a thought-provoking and important novel.
    semi truck accident lawyers

    ReplyDelete
  44. "Cache Money Hoes" is a blog article discussing memory hierarchy and its impact on processor performance. It introduces cache-aware and cache-oblivious algorithms, highlighting their effectiveness in optimizing memory access patterns. The article uses visual aids and performance comparisons to illustrate the practical implications of optimizing memory access. It encourages readers to explore more in-depth resources for further understanding. abogado de planificación patrimonial

    ReplyDelete
  45. I absolutely love how you've demystified the complex world of cache optimization with this blog! How to Apply for Divorce in New York Your analogy to the Wu-Tang Clan's wisdom adds a fun twist to a technical topic. The way you explain cache-aware and cache-oblivious algorithms is so clear and insightful. Divorcio Colaborativo Nueva York Thanks for making this often challenging subject accessible to all. Great job Abogado de Conducir sin Licencia del Condado de Unión

    ReplyDelete
  46. The phrase may be used in various contexts, perhaps as a playful or ironic reference to wealth, success, or a lavish lifestyle. It's essential to be aware of the potential for language to convey different meanings in different contexts, and the interpretation of such phrases can vary widely depending on cultural and social perspectives. truck accidents attorneys

    ReplyDelete
  47. Looking for the Best ivf specialist in delhi? Look no further! Delhi boasts some of the best IVF specialists in the country who are renowned for their expertise and success rates. These specialists have years of experience in the field of reproductive medicine and are dedicated to helping couples achieve their dream of parenthood. With state-of-the-art facilities and advanced techniques, they offer personalized treatment plans tailored to each individual's needs.

    Whether you're struggling with infertility or seeking assisted reproductive technologies, the best IVF specialists in Delhi are equipped to provide you with the highest quality care and support throughout your journey.

    ReplyDelete
  48. The Wu-Tang Clan's wisdom on "Cache Rules Everything Around Me" aligns with the significance of memory hierarchy in processor performance. This blog expertly explores cache-aware and cache-oblivious algorithms with clarity.
    New Jersey Expunge Order of Protection

    ReplyDelete
  49. "Cache Money Hoes" is a provocative and thought-provoking piece that explores themes of wealth, power, and materialism in contemporary society. The title itself challenges the traditional notions of success and raises questions about the values we prioritize. The author's bold approach and compelling narrative style make this work both engaging and impactful. Through vivid imagery and clever wordplay, "Cache Money Hoes" invites readers to critically reflect on societal norms and the pursuit of financial gain. A must-read for those interested in social commentary and cultural critique.
    New York Divorce Waiting Period

    ReplyDelete
  50. how much is an uncontested divorce in virginia
    The text talks about the title and its expectation, the description of the content, the appropriate and professional tone, the concrete use and relevant studies, the clear and organized structure, the grammar and spelling check, links to additional resources and relevant information for the reader, and soliciting feedback from colleagues or experts to ensure content is accurate, relevant and useful to the target audience. Additionally, it is recommended to add concrete examples or relevant studies to support your points and make the article more informative and compelling. The structure of the text is also recommended to check for grammar and spelling to ensure quality and avoid errors.

    ReplyDelete
  51. The phrase "cash, money, hoes" in hip-hop culture may be offensive due to its objectification of women and materialism. The "attached code" is unclear, so provide context or clarification for clarification manassas criminal lawyer.

    ReplyDelete