

Lab 8

Acknowledgements

This lab has been modified by your CMPT 295 instructor and the creators of the RISC-V ISA.

Tools for Cache Visualization in Venus

Understanding how caches work is typically one of the hardest tasks for students.

This exercise will use some cool cache visualization tools to get you more familiar with cache behavior and performance terminology, with the help of the file cache.s provided in the starter files. This is not an actual exercise, but more of an explanation of how to use Venus as a cache visualization tool!

At this point, read through cache.s to get a rough idea of what the program does. Make sure you go through what the pseudocode does and what the argument registers hold before you proceed to analyze cache configurations on it.

  • The most important thing to understand is the section labeled “PSEUDOCODE” at the top of the file. When you run cache.s, instructions that perform this pseudocode will be executed. Basically, you’ll just either zero out some elements of some array (option 0) or you’ll increment them (option 1).
  • Which elements you access is determined by the step size (a1), and how many times you do so is determined by the repcount (a2). These two parameters will most directly affect how many cache hits vs. misses occur. The option (a3) also affects the access pattern, and of course the cache parameters themselves will too. (A rough C sketch of the pseudocode is shown below.)

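If it helps, here is a rough C sketch of that pseudocode (an illustration only; the variable names and units here are ours, and cache.s itself documents exactly what each argument register holds):

    #include <stdio.h>

    /* Rough C sketch of the PSEUDOCODE section in cache.s (illustrative only;
       see cache.s for the exact units of each argument register). */
    int main(void) {
        int arr[16] = {0};
        int arraysize = 16;  /* a0: array size (counted here in int elements)  */
        int stepsize  = 2;   /* a1: gap between accessed elements              */
        int repcount  = 2;   /* a2: how many times the whole pass is repeated  */
        int option    = 1;   /* a3: 0 = zero out elements, 1 = increment them  */

        for (int rep = 0; rep < repcount; rep++) {
            for (int i = 0; i < arraysize; i += stepsize) {
                if (option == 0)
                    arr[i] = 0;    /* option 0: one store per accessed element    */
                else
                    arr[i] += 1;   /* option 1: one load + one store per element  */
            }
        }
        printf("arr[0] = %d\n", arr[0]);   /* prints 2 with these parameters */
        return 0;
    }
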
For each of the scenarios below, you’ll be repeating THESE steps:

  1. Paste the contents of cache.s into Venus
  2. In the code for cache.s, set the appropriate Program Parameters as indicated at the beginning of each scenario (by changing the immediates of the commented li instructions in main)
  3. Go to Simulator --> Cache.
  4. Set the appropriate Cache Parameters as indicated at the beginning of each scenario.
  5. As you execute code in Venus, any DATA memory access (load or store) will show up in the simulator. (Instruction fetches are not shown, because instructions go through a separate instruction cache that Venus does not display.)

The Cache Simulator will show the state of your data cache. If you reset your code, the cache hit/miss statistics will reset as well!

IMPORTANT: If you run the code all at once, you will get the final state of the cache and hit rate. You will probably benefit the most from setting a breakpoint in the loop wordLoop right before or after each memory access to see exactly where the hits and misses are coming from.

Review - Hit and Miss Policies

The Venus cache simulator currently simulates a write-through, write-allocate cache. Here’s a reminder about the three different cache hit policies you should know about:

  • Write-back means that on a write hit, data is written to the cache only, and when this write happens, the dirty bit for the block that was written becomes 1. Writing to the cache is fast, so write latency in write-back caches is usually quite small. However, when a block is evicted from a write-back cache, if the dirty bit is 1, memory must be updated with the contents of that block, as it contains changes that are not yet reflected in memory. This makes write-back caches more difficult to implement in hardware.
  • Write-through means that on a write hit, data is written to both the cache and main memory. Writing to the cache is fast, but writing to main memory is slow; this makes write latency in write-through caches slower than in a write-back cache. However, write-through caches mean simpler hardware, since we can assume in write-through caches that memory always has the most up-to-date data.
  • Write-around means that in every situation, data is written to main memory only; if we have the block we’re writing in the cache, the valid bit of the block is changed to invalid. Essentially, there is no such thing as a write hit in a write-around cache; a write “hit” does the same thing as a write miss.

There are also two miss policies you should know about:

  • Write-allocate means that on a write miss, you pull the block you missed on into the cache. For write-back, write-allocate caches, this can mean that memory is never written to directly; instead, writes are always to the cache and memory is updated upon eviction.
  • No write-allocate means that on a write miss, you do not pull the block you missed on into the cache. Only memory is updated.

Additionally, in this course, we talk about several replacement policies, in order from most useful to least useful (normally):

  • LRU - Least recently used: when we decide to evict a cache block to make space, we select the block that was accessed least recently (i.e., furthest back in time) of all the blocks in the set.
  • Random - When we decide to evict a cache block to make space, we randomly select one of the blocks in the cache to evict.
  • MRU - Most recently used: when we decide to evict a cache block to make space, we select the block that was accessed most recently of all the blocks in the set.

The important takeaway concerning Venus here: in a write-through cache (like in Venus), even though you are updating memory on writes, because we also write to the cache, we consider writes to blocks we have in the cache to be write hits.

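To make this combination concrete, here is a minimal, self-contained toy version in C of what Venus simulates: a direct-mapped, write-through, write-allocate cache that only counts hits and misses. (This is a sketch for intuition, not Venus's actual implementation; the block size and block count are arbitrary.)

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy direct-mapped, write-through, write-allocate cache (illustrative only).
       It only tracks hits and misses; no data is kept, since with write-through
       memory always holds the up-to-date values anyway. */
    #define NUM_BLOCKS 4
    #define BLOCK_SIZE 8   /* bytes */

    static struct { bool valid; unsigned tag; } cache[NUM_BLOCKS];
    static int hits = 0, misses = 0;

    static void cache_access(unsigned addr, bool is_write) {
        unsigned block = addr / BLOCK_SIZE;    /* drop the offset bits         */
        unsigned index = block % NUM_BLOCKS;   /* which block/set it maps to   */
        unsigned tag   = block / NUM_BLOCKS;   /* remaining address bits       */

        if (cache[index].valid && cache[index].tag == tag) {
            hits++;                            /* read hit or write hit        */
        } else {
            misses++;
            cache[index].valid = true;         /* write-allocate: bring the block
                                                  in even on a write miss      */
            cache[index].tag = tag;
        }
        /* write-through: on a store (is_write), memory is updated on every
           access, hit or miss, so there is no dirty bit and nothing special
           to do on eviction. */
        (void)is_write;
    }

    int main(void) {
        for (int pass = 0; pass < 2; pass++)        /* the second pass should hit */
            for (unsigned a = 0; a < 32; a += 8)
                cache_access(a, true);
        printf("hits=%d misses=%d\n", hits, misses);  /* prints hits=4 misses=4 */
        return 0;
    }
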
Common question(s):

  1. Don’t we usually pair write-back with write-allocate and write-through with no write-allocate? Yes, we learned in class that the ordinary pairing of hit policy/miss policy is write-back/write-allocate and write-through/no write allocate. However, with respect to the cache, write-through and write-back caches behave similarly on hits (both write to the cache), so the hit/miss patterns you see in the Venus cache simulator would be the same even if Venus simulated a write-back cache.

Exercise 1 - A Couple of Memory Access Scenarios

Task: Simulate the following scenarios and record the final cache hit rates with the program in cache.s. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!

Do not hesitate to ask questions if you feel confused! This is perfectly normal and the staff is there to help you out!

Check Yourself

  • How big is your cache block?
  • How many consecutive accesses (taking into account the step size) fit within a single block?
  • How much data fits in the WHOLE cache?
  • How far apart in memory are blocks that map to the same set (and could create conflicts)?
  • What is your cache’s associativity?
  • Where in the cache does a particular block map to?
  • When considering why a specific access is a miss or hit: Have you accessed this piece of data before? If so, is it still in the cache or not?

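If you want a template for working these out, here is a small worked example for a made-up configuration (block size 16 B, 8 blocks, 2-way set associative, 32-bit addresses). It is deliberately not one of the scenarios below; the point is just how each quantity falls out of the cache parameters:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical cache parameters; NOT one of the scenarios below. */
        unsigned block_size    = 16;  /* bytes per block                  */
        unsigned num_blocks    = 8;
        unsigned associativity = 2;   /* blocks per set                   */

        unsigned cache_size  = block_size * num_blocks;     /* 128 B of data      */
        unsigned num_sets    = num_blocks / associativity;  /* 4 sets             */
        unsigned offset_bits = 4;     /* log2(block_size): byte within the block  */
        unsigned index_bits  = 2;     /* log2(num_sets):   which set              */
        unsigned tag_bits    = 32 - offset_bits - index_bits;  /* 32-bit addresses */

        printf("cache size = %u bytes, sets = %u, words per block = %u\n",
               cache_size, num_sets, block_size / 4);
        printf("tag:index:offset = %u:%u:%u bits\n", tag_bits, index_bits, offset_bits);

        /* Addresses that are (num_sets * block_size) = 64 bytes apart share the
           same index and therefore compete for the same set. */
        unsigned addr = 0x100;                              /* example address    */
        printf("addr 0x%x -> set %u, offset %u\n",
               addr, (addr / block_size) % num_sets, addr % block_size);
        return 0;
    }
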
Scenario 1:

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 8
  • Rep Count (a2): 4
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 8
  • Number of Blocks: 4
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1 (Venus won’t let you change this, why?)
  • Block Replacement Policy: LRU

Tip: If it’s hard for you to visualize what’s getting pulled into the cache on each memory access just from staring at the code, try getting out some paper and a pencil. Write down what the tag:index:offset breakdown of the 32-bit addresses would be, figure out which memory addresses map to which set in the cache with the index bits, and see if that helps.

Tasks

  1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
  2. What is our hit rate if we increase Rep Count arbitrarily? Why?
  3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

Scenario 2:

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 256 (bytes)
  • Step Size (a1): 2
  • Rep Count (a2): 1
  • Option (a3): 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 16
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: N-Way Set Associative
  • Associativity: 4
  • Block Replacement Policy: LRU

Tasks

  1. How many memory accesses are there per iteration of the inner loop (not the one involving repcount)? It’s not 1.
  2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
  3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
  4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
  5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following:

Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).

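For concreteness, here is a rough C sketch of the program just described, in its naive form (apply_function, NUM_REPS, and ARR_SIZE are made-up names; the array is assumed to be much larger than the cache):

    #include <stdio.h>
    #include <stdlib.h>

    #define ARR_SIZE (1 << 24)   /* 16M ints, far bigger than the cache */
    #define NUM_REPS 16          /* stand-in for the Rep Count          */

    /* Hypothetical stand-in for "a different function per Rep". */
    static int apply_function(int rep, int x) { return x + rep; }

    int main(void) {
        int *arr = calloc(ARR_SIZE, sizeof *arr);
        if (arr == NULL) return 1;

        /* Naive structure: each Rep walks the ENTIRE huge array before the next
           Rep starts, so by the time Rep k+1 comes back to arr[0], that block has
           long since been evicted, leaving very little temporal locality to exploit. */
        for (int rep = 0; rep < NUM_REPS; rep++)
            for (long i = 0; i < ARR_SIZE; i++)
                arr[i] = apply_function(rep, arr[i]);

        printf("arr[0] = %d\n", arr[0]);
        free(arr);
        return 0;
    }
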
QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.

HINT:

  • You do not want to iterate through the entire array at once because it’s much bigger than your cache. Doing so would reduce the amount of temporal locality your program exhibits, which makes cache hit rate suffer. We want to exhibit more locality so that our caches can take advantage of our predictable behavior.
  • SO, instead, we should try to access ______ of the array at a time and apply all of the ______ to that ______ so we can be completely done with it before moving on, thereby keeping that ______ hot in the cache and not having to circle back to it later on!
  • (The 1st, 3rd, and 4th blanks should be the same. It’s not some vocabulary term you should use to fill them in. It’s more of an idea that you should have.)

Scenario 3:

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 1
  • Rep Count (a2): 1
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 2

NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 8
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

Tasks

  1. What is the hit rate of our L1 cache? Our L2 cache? Overall?
  2. How many accesses do we have to the L1 cache total? How many of them are misses?
  3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
  4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
  5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?

Exercise 2 - Loop Ordering and Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays (so element (i, j) of A lives at A[i + j*n]):

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];

Fact: Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.

In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…

  • moves through B with stride 1
  • moves through A with stride n
  • moves through C with stride 0

REMEMBER: To compute the matrix multiplication correctly, the loop order doesn’t matter.

BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.

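To see how the strides change when we reorder, here is one of the other possible orderings (using the same n, A, B, and C as the snippet above), with the innermost loop's strides worked out in the comment. This is meant to show the reasoning to repeat for the rest, not to claim which ordering is fastest:

    /* Same computation, reordered so that i is the innermost loop.
       Looking only at the innermost loop (the one that increments i):
         - C[i+j*n] moves with stride 1 (i changes, j is fixed)
         - A[i+k*n] moves with stride 1
         - B[k+j*n] does not depend on i, so stride 0               */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
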
Take a glance at matrixMultiply.c. You’ll notice that the file contains multiple implementations of matrix multiply with 3 nested loops.

Task: Think about what the strides are for the nested loops in the other five implementations.

Note that the compilation command in the Makefile uses the ‘-O3’ flag; this is important because it turns on compiler optimizations. Compile and run this code with the following command, and then answer the questions below:

    make ex2

This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit “Gflops/s” reads, “Giga-floating-point-operations per second” (for an n-by-n matrix multiply, that is roughly 2n^3 floating-point operations divided by the runtime). THE BIGGER THE NUMBER, THE FASTER IT IS RUNNING!


Tasks

  • Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
  • Which ordering(s) perform the worst? Why?
  • How does the way we stride through the matrices with respect to the innermost loop affect performance?