Learning Objectives:
In the first part of this assignment, we build a simulator for modeling a single-level cache. We run traces extracted from valgrind through this cache model and count the number of hits, misses, and evictions.
In the second part, we try to reverse engineer and discover cache geometries. Instead of running programs on real processors and inferring the cache layout from timing results, you will approximate this work by using a simulator.
WARNING: Note that the logical variables used to represent the block offset differ from the lecture slides. In the representation below, case matters:
{: .table-striped .table-bordered }
Lecture | Assignment | Description |
---|---|---|
s | s | Number of set bits |
S | S | $2^s$ Number of sets |
E | E | Number of ways |
k | b | Number of block bits or byte offset. |
K | B | $2^b$ Size of block in bytes |
This image shows the last 2 bytes of the address of each cache block as they align to the 32x32 matrices. Red squares mark the start of a section that would fit in the cache, green lines divide individual integers, and black lines divide cache blocks.
This assignment will help you understand the impact that cache memories can have on the performance of your C programs. You will write a small C (not C++!) program of about 200-300 lines that simulates the behavior of a cache memory.
What is the difference between an `int *` and an `int`?
{: .table-striped .table-bordered }
Tag comparison | Selects the set | Block offset |
---|---|---|
Tag (t) | Set index (s) | Byte offset (b) |
Tag (t) | Set index (s) | 0s for byte offset (block-aligned address) |
```
> cd $REPO/model
> ls traces/
dave.trace    fifo_m2.trace  fifo_s3.trace  yi2.trace
fifo_l.trace  fifo_s1.trace  long.trace     trans.trace
fifo_m1.trace fifo_s2.trace  small.trace    yi.trace
```
The `traces/` directory contains a collection of reference trace files that we will use as input to evaluate the correctness of your cache simulator. The trace files are generated by a Linux program called `valgrind`. For example, the following command runs the executable program `ls -l`, captures a trace of each of its memory accesses in the order they occur, and prints them to stdout.

```
valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l
```
Memory traces have the following form:
```
// Format: operation address,size
M 0421c7f0,4
L 04f6b868,8
S 7ff0005c8,8
```
Each line denotes one or two memory accesses. The format of each line is `operation address,size`:

- The `operation` field denotes the type of memory access: `L` is a data load, `S` is a data store, and `M` is a data modify (i.e., a data load followed by a data store).
- The `address` field specifies a 64-bit hexadecimal memory address.
- The `size` field specifies the number of bytes accessed by the operation.

We have provided you with the binary executable of a reference cache simulator, called `cache-ref`, that simulates the behavior of a cache with arbitrary size and associativity on a `valgrind` trace file. Your assignment is to create a cache model (called `cache`) that mimics the behavior of `cache-ref`.
- `-h`: Optional help flag that prints usage info
- `-s`: S = number of sets = $2^s$ (s is the number of bits used for the set index)
- `-E`: Number of lines per set (i.e., associativity or number of ways)
- `-b`: Number of bits used for the byte offset within the line. Number of bytes per block (i.e., block size) = $2^b$
- `-t`: Trace file to be processed. `cache-ref` reads the file line-by-line and feeds it to the cache simulator
- `-v`: Optional verbose flag that displays what happens on each trace line: `hit`, `miss eviction`, or `miss` for each access, based on what happened when operating on the cache
- `-L`: LRU policy. Your own model `cache` only needs to implement LRU and does not need to implement this option.

WARNING: When running `cache-ref`, you have to include the `-L` option. But when running your own model, you should remove this option.
```
# Reference simulator
> ./cache-ref -v -s 2 -E 1 -b 5 -t traces/yi.trace -L
# sets = 2^2 = 4
# E = 1: number of lines per set (number of ways)
# b = 5
# Block size = 2^5 = 32 bytes
# -L : LRU policy
> ./cache -v -s 2 -E 1 -b 5 -t traces/yi.trace
L 10, L1 insert [status: 0 insert_block: 0x0]
M 20, L1 insert [status: 0 insert_block: 0x20]
L 22, L1 hit [status: 1]
S 18, L1 hit [status: 1]
L 110, L1 insert + eviction [status: 2 victim_block: 0x0 insert_block: 0x100]
L 210, L1 insert + eviction [status: 2 victim_block: 0x100 insert_block: 0x200]
M 12, L1 insert + eviction [status: 2 victim_block: 0x200 insert_block: 0x0]
L1 hits:4 misses:5 evictions:3

# If you have implemented cache.c correctly, run as such:
./cache -v -s 2 -E 1 -b 5 -t traces/yi.trace
```
Here is a commentary on the output format required:
You will write a cache simulator in `cache.c` that takes a `valgrind` memory trace as input (we have provided the traces), simulates the hit/miss behavior of a cache memory on this trace, and outputs the total number of hits, misses, and evictions.
WARNING: The output has to match the reference simulator, in both verbose and concise form.
Functions you will have to fill in:
{: .table-striped .table-bordered }
Function to be implemented | Description |
---|---|
address_to_block | Given an address, return the block-aligned address, i.e., the block bits are zeroed out |
cache_tag | Return the tag of an address |
cache_set | Return the set index of the address |
cacheSetUp | Allocate memory and initialize the cache sets and each line in every set |
probe_cache | Check if the block of the address is in the cache. If yes, return true; else return false |
allocate_cache | Insert block in cache |
avail_cache | Check if an empty way is available. If yes, return true; else return false |
victim_cache | Find the victim line to be removed from the set. Return the way of the block (i.e., the corresponding line index within the set). The victim will be the least-recently-used block |
evict_cache | Remove/Invalidate the cache block in corresponding set and way |
evict_address | Remove/Invalidate the cache given address |
flush_cache | Given a block address, find it in the corresponding set and way and invalidate it |
operateCache | Top-level method that receives an address and invokes other methods to implement the cache model. Returns a struct `result`. Fill in the required fields for each result (hit, miss, eviction + miss) |
cacheSetup | Allocate and initialize cache |
Already Implemented Helpers | |
print_result(result r) | Print the result of the cache operation. This function is provided to you for printing in the required format. |
print_candidate_set_way | This function is provided to you to help debug |
main.c | You do not need to modify this file. However, you may benefit from reading how `runTrace` invokes `operateCache`. It will be useful in Part 2 |
```c
// Per cache-line data structure
typedef struct Line {
    unsigned long long block_addr;
    short valid; // True: cache block valid; False: cache block invalid
    unsigned long long tag;
    // Holds this line's place in the recency order:
    // the greater the value, the more recently it was used
    int timestamp;
} Line;

typedef struct Set {
    Line *lines; // Lines in each set (# of ways)
    int clock;
    int placementRate;
} Set;

typedef struct Cache {
    // Configuration
    int setBits;     // s
    int linesPerSet; // E
    int blockBits;   // b
    Set *sets;       // Array of sets
    // Stats
    int eviction_count;
    int hit_count;
    int miss_count;
    short displayTrace;
    int lfu; // 0: Least Recently Used 1: Least Frequently Used
    char *name;
} Cache;
```
When selecting a victim to evict from a cache set, you need to implement the LRU policy. To do so, we give you a `timestamp` field per line and a `clock` per set. The `clock` is incremented on each access to the set. Thus, the `timestamp` value when a cache line is accessed indicates recency. Try to figure out:

- When is `clock` incremented? (Hint: think of it as a counter that ticks over on accesses to the set and acts as a virtual clock.)
- When is `timestamp` set?

In a set-associative cache, each set has multiple ways (or blocks), and the LRU policy decides which block to evict when a new block is loaded into the set. The per-set counter keeps track of access history, while each block has a flag to indicate its position in the recency order.

- Per-set counter: each set has a counter that tracks the total number of accesses to the set. This counter is incremented with each access, representing a timestamp or clock that ticks at the rate of accesses.
- Per-block flags: each block in the set has a flag (or a small counter) that indicates its recent usage relative to other blocks. This value is updated from the per-set counter to track which blocks are more recently used.

Here is a concise pseudocode version for implementing LRU using a per-set counter and per-block flags:
```
Initialize all counters to 0.

On Cache Access(Set S, Block B):
    Increment SetCounter[S]
    If Block B is in cache:
        Flag[B] = SetCounter[S]   // Update accessed block's flag to current counter

When evicting a block:
    Find LRU block L with smallest Flag[L] in set S
```
Here are some hints and suggestions for working on this assignment. Debug on the small traces first; the trace sizes (in lines) are:

```
     2 small.trace
     5 dave.trace
     7 yi.trace
    10 fifo_s1.trace
    14 fifo_s2.trace
    16 fifo_s3.trace
    16 yi2.trace
   400 fifo_m1.trace
269998 fifo_l.trace
267988 long.trace
```
The reference simulator takes an optional `-v` argument that enables verbose output, displaying the hits, misses, and evictions that occur as a result of each memory access. It helps you debug by allowing you to directly compare the behavior of your simulator with the reference simulator on the reference trace files.

The data modify operation (`M`) is treated as a load followed by a store to the same address. Thus, an `M` operation can result in either two cache hits, or a miss and a hit plus a possible eviction (even when only 1 byte is modified).
This part will help you understand how parallel caches interact and the impact cache memories can have on your C programs' performance. You will write a top-level controller function that maintains cache blocks across two caches at the same level (L1). We will be using a new set of traces to evaluate the correctness of your cache simulator (these traces are provided in the `traces/2x/` directory). Each trace entry now includes an additional field identifying the program making the access (here the traces have two programs, 1 and 2). So entries will look like the following:

```
L 000013F8,4,1 # Program 1's access
L 000013FC,4,2 # Program 2's access
```

Each program uses its own cache (accesses tagged `1` go to P1, accesses tagged `2` go to P2). However, the two programs might access the same block, e.g., `000013F8` and `000013FC` above, and miss locally. When this happens, we will supply the data from the adjacent cache instead of from memory; this is how the dual-cache system works. When the remote cache checks and supplies the data, e.g., P2 supplies an access that originated at P1's cache, we call this a remote hit. On a remote hit, remove the block from the remote cache and insert it into the local one.
Implement the dual-cache policy with 2 caches, P1 and P2.

Access trace says P1 access `L 00001000,4,1`:
{: .table-striped .table-bordered }
P1 cache | P2 remote cache | Action |
---|---|---|
Miss | Miss | Insert block into P1 cache. P1.miss++ |
Miss | Hit | P2.Hit++ in P2 cache. Remove from P2 Cache. Insert into P1 Cache. P1.miss++ |
Hit | Don't Care | Operate Local. P1.hit++. |
Access trace says P2 access `L 00001000,4,2`:
{: .table-striped .table-bordered }
P2 cache | P1 remote cache | Action |
---|---|---|
Miss | Miss | Insert block into P2 cache. P2.miss++ |
Miss | Hit | P1.Hit++ in P1 cache. Remove from P1 Cache. Insert into P2 Cache. P2.miss++ |
Hit | Don't Care | Operate Local. P2.hit++. |
Exclusive means a block can be in either P1 or P2, but never both, i.e., P1 $\cap$ P2 $= \emptyset$. We have included a validate function that checks this condition after each access.
We have provided you with the binary executable of a reference cache simulator, called `1level-shared-ref`, that simulates the behavior of a dual cache. Your assignment is to create a cache model `1level-shared.c` that mimics the behavior of `1level-shared-ref`.
- `-i1`: Interactive mode; stops the program after each access and shows the current state
- `-h`: Optional help flag that prints usage info
- `-c`: Configuration file for the two caches
- `-t`: Trace file to be processed. The reference simulator reads the file line-by-line and feeds it to the cache simulator
- `-v`: Optional verbose flag that displays what happens on each trace line: `hit`, `insert + eviction`, or `insert` for each access, based on what happened when operating on the cache

```
# You do not need to modify this file. This is how we load configurations for `2-level.c`
$ cat 2x_1.config
{
  "P1_SetBits (s)": 2,
  "P1_Ways (E)": 1,
  "BlockBits (b)": 4,
  "P2_SetBits (s)": 2,
  "P2_Ways (E)": 1
}
```
```
$ ./1level-shared-ref -c ./2x_1.config -t ./traces/2x/sample.trace
###### Configuration #########
P1_setBits: 2
P1_ways: 1
P1 and P2 blockBits: 4
P2_setBits: 2
P2_ways: 1
############################
L 1000, P1 insert [status: 0 insert_block: 0x1000]
L 1200, P2 insert [status: 0 insert_block: 0x1200]
L 1202, P2 hit [status: 1] P1 insert + eviction [status: 2 victim_block: 0x1000 insert_block: 0x1200]
L 1002, P2 insert [status: 0 insert_block: 0x1000]
L 2000, P1 insert + eviction [status: 2 victim_block: 0x1200 insert_block: 0x2000]
P1 hits:0 misses:3 evictions:2
P2 hits:1 misses:2 evictions:0

# Implement the 1level simulator
$ make 1level-shared
$ ./1level-shared -c ./2x_1.config -t ./traces/2x/sample.trace
```
Here is a commentary on the output format required:
You will write a cache simulator in `1level-shared-main.c` that takes a `valgrind` memory trace as input (we have provided the traces), simulates the hit/miss behavior of the dual cache, and outputs the total number of hits, misses, and evictions.
WARNING: The output has to match the reference simulator, both verbose and concise.
Functions you will have to fill in:
{: .table-striped .table-bordered }
Function to be implemented | Description |
---|---|
runTrace | Inside this function, invoke the appropriate methods on your P1 and P2 caches to implement an exclusive hierarchy. See the attached diagram on how they interact. Accesses first go to the local cache, with 3 possible outcomes (`hit`, `miss`, `eviction + miss`). On `miss` and `evict-miss`, you need to operate the remote cache |
main | Set cache parameters and instantiate cache |
The key condition that needs to be obeyed in the shared model is that at any given time a block is in either P1's cache or P2's cache, never both. To validate this, we have provided a `validate` function. After each access we check the condition:

```
for each block in P1:
    assert not probe_cache(block, P2)
for each block in P2:
    assert not probe_cache(block, P1)
```

If a probe returns true, we assert and stop.
Hint: you always need to ensure that no block in P1 is also in P2. Think about what you need to do on insertions and evictions to enforce this condition.
Remember from the testing framework section that these sanity tests are not comprehensive, and you should rely on your own tests to decide whether your code is correct. Your score will be determined mostly by hidden tests that are run after the submission deadline has passed.
- Use the `validate` function to double check P1 and P2 blocks against the reference simulator.
- `evict_address(address,..)`: Remove/Invalidate the cache block of the given address.

```
> bash ./scripts/localci.sh
# If you see SUCCESS and *.log.success then you passed. You can also check your *_Grade.json to see your tentative grade.
# If you see FAILED, then inspect *.log.failed. Check the failed section to see what tests failed.
```
This assignment has been created by the CMPT 295 instructors based on the CS:APP textbook.