{:toc}
This lab has been modified by your CMPT 295 instructor and the creators of the RISC-V ISA.
Understanding how caches work is typically one of the hardest tasks for students.
This exercise will use some cool cache visualization tools to get you more familiar with cache behavior and performance terminology, with the help of the file `cache.s` provided in the starter files. This is not an actual exercise, but more of an explanation of how to use Venus as a cache visualization tool!
At this point, read through `cache.s` to get a rough idea of what the program does. Make sure you go through what the pseudocode does and what the argument registers hold before you proceed to analyze cache configurations on it.
The program steps through an array in memory: which elements you access is determined by the array size (`a0`) and the step size (`a1`), and how many times you do so is determined by the repcount (`a2`). These two parameters will most directly affect how many cache hits vs. misses will occur. The option (`a3`) will also change what each access does, and of course the cache parameters themselves will too.
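If it helps, here is a rough C sketch of the access pattern described above (an illustration only; the exact behavior of each option should be checked against the actual assembly in `cache.s`):

```c
#include <stdint.h>

// Rough sketch of cache.s's access pattern (names and the exact
// meaning of option 0 vs. 1 are assumptions -- verify against the
// real assembly). arraysize (a0) is in bytes; step (a1) is in
// elements; repcount is a2; option is a3.
void access_pattern(int32_t *arr, int arraysize, int step,
                    int repcount, int option) {
    int n = arraysize / sizeof(int32_t); // number of word-sized elements
    for (int rep = 0; rep < repcount; rep++) {
        for (int i = 0; i < n; i += step) {
            if (option == 0)
                arr[i] = 0;   // one memory access: a single write
            else
                arr[i] += 1;  // two memory accesses: a read, then a write
        }
    }
}
```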
For each of the scenarios below, you'll be repeating THESE steps:

1. In `cache.s`, set the appropriate Program Parameters as indicated at the beginning of each scenario (by changing the immediates of the commented `li` instructions in `main`).
2. In the Cache tab, set the appropriate Cache Parameters.
3. Run the program. The Cache Simulator will show the state of your data cache. Note that if you reset your code, you will reset the cache hit/miss rate as well!
IMPORTANT: If you run the code all at once, you will only see the final state of the cache and the overall hit rate. You will probably benefit the most from setting a breakpoint in the loop `wordLoop`, right before or after each memory access, to see exactly where the hits and misses are coming from.
The Venus cache simulator currently simulates a write-through, write-allocate cache. Here's a reminder about the three different cache hit policies you should know about:

- Write-back means that on a write hit, data is written to the cache only; the corresponding memory location is updated only when the block containing the data is later evicted from the cache.
- Write-through means that on a write hit, data is written to both the cache and main memory at the same time.
- Write-around means that on a write hit, data is written to main memory only, without updating the cache.
There are also two miss policies you should know about:

- Write-allocate means that on a write miss, you pull the block you missed on into the cache before performing the write.
- No write-allocate means that on a write miss, you write directly to main memory without pulling the block into the cache.
Additionally, in this course, we talk about several replacement policies, in order from most useful to least useful (normally):

- LRU (least recently used): when we need to evict a block from a set, we evict the one that was accessed least recently.
- Random: when we need to evict a block from a set, we pick one of the blocks in the set at random.
- MRU (most recently used): when we need to evict a block from a set, we evict the one that was accessed most recently.
The important takeaway concerning Venus here: in a write-through cache (like in Venus), even though every write updates memory, we also write to the cache, so writes to blocks that are already in the cache count as write hits.
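To make the accounting concrete, here is a minimal sketch of hit/miss classification in a write-through, write-allocate cache, assuming a direct-mapped cache with made-up sizes (Venus supports other configurations as well):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   4  // made-up size for illustration
#define BLOCK_SIZE 8  // bytes per block, also made up

typedef struct { bool valid; uint32_t tag; } Line;
static Line cache[NUM_SETS];

// Returns true on a hit. Reads and writes are classified the same
// way: a write to a resident block is a write hit, even though
// write-through also updates memory (memory itself is not modeled).
bool access_cache(uint32_t addr) {
    uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag   = addr / (BLOCK_SIZE * NUM_SETS);
    if (cache[index].valid && cache[index].tag == tag)
        return true;               // hit
    cache[index].valid = true;     // miss: write-allocate pulls the block in
    cache[index].tag   = tag;
    return false;
}
```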
Common question(s):
Task: Simulate the following scenarios and record the final cache hit rates with the program in `cache.s`. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!
Do not hesitate to ask questions if you feel confused! This is perfectly normal and the staff is there to help you out!
Program Parameters: (set these by initializing the `a` registers in the code)

- Array Size (`a0`): 128 (bytes)
- Step Size (`a1`): 8
- Rep Count (`a2`): 4
- Option (`a3`): 0

Cache Parameters: (set these in the Cache tab)
Tip: If it’s hard for you to visualize what’s getting pulled into the cache on each memory access just from staring at the code, try getting out some paper and a pencil. Write down what the tag:index:offset breakdown of the 32-bit addresses would be, figure out which memory addresses map to which set in the cache with the index bits, and see if that helps.
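For instance, here is a tiny checker you could adapt to verify your by-hand breakdowns (the block size and set count below are placeholders, not the actual Cache Parameters of any scenario):

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 8  // placeholder: bytes per block -> 3 offset bits
#define NUM_SETS   4  // placeholder: sets in the cache -> 2 index bits

int main(void) {
    uint32_t addr   = 0x10000010;  // an example data address
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```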
Program Parameters: (set these by initializing the `a` registers in the code)

- Array Size (`a0`): 256 (bytes)
- Step Size (`a1`): 2
- Rep Count (`a2`): 1
- Option (`a3`): 1

Cache Parameters: (set these in the Cache tab)
QUESTION: How many memory accesses are there per iteration of the inner loop (not the one involving `repcount`)? It's not 1.

Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) `repcount` times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep.)
QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn't matter if Rep `k` is applied to element `arr[i]` before Rep `k` is applied to element `arr[i+1]`, etc.
HINT: You don't have to sweep the entire array on each Rep; think about processing the array one cache-sized chunk at a time, as in the sketch below.
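Here is a minimal sketch of one such restructuring, assuming a hypothetical array `rep[]` of function pointers, one per Rep (the names and the chunk size are illustrative, not part of `cache.s`):

```c
#define CHUNK 64  // elements per chunk; pick it so a chunk fits in the cache

typedef int (*rep_fn)(int);

// Apply ALL repcount functions to one cache-sized chunk while its
// blocks are resident, instead of sweeping the whole array per Rep.
void apply_blocked(int *arr, int n, rep_fn *rep, int repcount) {
    for (int start = 0; start < n; start += CHUNK) {
        int end = (start + CHUNK < n) ? start + CHUNK : n;
        for (int r = 0; r < repcount; r++)
            for (int i = start; i < end; i++)
                arr[i] = rep[r](arr[i]);
    }
}
```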
Program Parameters: (set these by initializing the `a` registers in the code)

- Array Size (`a0`): 128 (bytes)
- Step Size (`a1`): 1
- Rep Count (`a2`): 1
- Option (`a3`): 0

Cache Parameters: (set these in the Cache tab)
NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)
NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:
```c
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
Fact: Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.
In the above code, note that the loops are ordered `i`, `j`, `k`. If we examine the innermost loop (the one that increments `k`), we see that it moves through `A` with a stride of `n`, through `B` with a stride of 1, and doesn't move through `C` at all (stride 0), since `C[i+j*n]` stays fixed while `k` varies.
REMEMBER: The loop order does not affect the correctness of the matrix multiplication.
BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.
Take a glance at `matrixMultiply.c`. You'll notice that the file contains multiple implementations of matrix multiply with 3 nested loops.
Task: Think about what the strides are for the nested loops in the other five implementations.
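As a worked example of that kind of reasoning, here is one alternative ordering (it may or may not match one of the implementations in `matrixMultiply.c`; treat it as an illustration of how to read off the strides):

```c
// Loops ordered i, k, j. In the innermost loop (j), with column-major
// storage: C[i+j*n] moves with stride n, A[i+k*n] is fixed (stride 0),
// and B[k+j*n] moves with stride n.
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```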
Note that the compilation command in the Makefile uses the `-O3` flag; it is important here to turn on compiler optimizations. Compile and run this code with the following command, and then answer the questions below:
```sh
make ex2
```
This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit “Gflops/s” reads, “Giga-floating-point-operations per second.” THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!
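For intuition about the units: a dense n-by-n matrix multiply does about 2·n³ floating-point operations (one multiply and one add per innermost iteration), so you can sanity-check the reported numbers with a helper like this:

```c
// Gflop/s for an n x n dense matrix multiply that took `seconds` to run.
double gflops(int n, double seconds) {
    double flops = 2.0 * (double)n * n * n;  // ~2*n^3 flops total
    return flops / seconds / 1e9;
}
```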