Computer Architecture

Introduction

The memory hierarchy of a given microprocessor is typically composed of at least three levels of cache between the CPU and main memory.  In some CPUs a fourth level is present in the form of eDRAM.  In multicore processors, each core has its own level 1 and level 2 caches, while the level 3 cache is shared among all the cores.  The memory hierarchy is designed to provide a seamless transfer of data from main memory to the CPU; it is not usually possible to determine by simple observation how many levels of cache a system has.

 

The goal of this assignment is to attempt to expose the memory hierarchy through programming. A program, written in C, is provided that exercises access through all levels of the memory hierarchy to RAM and collects performance information in the form of memory access times.  This information may provide a hint as to the structure of the memory hierarchy of the system being tested.

 

Background

 

The basis for this assignment comes from Case Study 2, described on pages 150–153 of the textbook.  A program is provided that generates data for timing various accesses to the memory hierarchy.  This program, which can be downloaded from Blackboard, is written in C.  A compiled version in the form of an executable for Windows systems is also provided.  For non-Windows systems, the source program should be compiled using a C compiler for the target system.

 

Additional documentation about the process is provided as an appendix to this assignment.

 

Notes About the Program

 

There are two important components embedded in the program.  The first is the ability to access the system clock to collect timing information for memory accesses.  Most programming languages provide an interface for this purpose in the form of a function or method.  In C, the time.h header provides this facility and allows timing values to be captured in nanoseconds.

 

The second component is an appropriate data structure that can be accessed while dynamically varying the stride of the memory access.  The memory access stride is the distance between the addresses of two successive memory accesses and is generally a power of 2.  For this assignment, the simplest suitable data structure is a two-dimensional array declared large enough to encompass the largest potential cache size.  The program declares a 4096 x 4096 array, equivalent to 16 mebibytes, as the maximum size to be tested.
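A minimal sketch of such a structure and a strided traversal over it (the names and the flat byte-indexed view are illustrative, not taken from the provided program):

```c
/* The array is declared at the maximum size to be tested:
   4096 * 4096 bytes = 16 MiB. */
#define MAXBYTES (4096L * 4096L)
char array[MAXBYTES];

/* Touch every stride-th byte of the first csize bytes and
   return the number of accesses performed. */
long stride_touch(long csize, long stride) {
    long accesses = 0;
    for (long i = 0; i < csize; i += stride) {
        array[i]++;            /* one read-modify-write per step */
        accesses++;
    }
    return accesses;
}
```

Doubling the stride halves the number of elements touched within a region, which is what lets the benchmark separate the effects of block size and associativity.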

 

Program Design

 

The provided program is written in C because C compiles directly to native executables and will typically provide more accurate results on most operating systems.  You may consider adapting the sample program to another language.  Java programs compile to bytecode, which is interpreted by the JVM (Java Virtual Machine), and this additional execution overhead may affect the timings that are collected.  Python is similar in that it is an interpreted language, though it is possible to produce an executable file using an add-on utility (your option).  If you wish, you may rewrite the program in whatever language is most convenient for you.

 

The output of the program is directed to a text file so that you have a record of the timing information your program generates.  The output is also printed to the monitor so that you can follow the program’s execution.  Each output record consists of the cache size, the stride, and the read/write time for the array at that cache size/stride increment.  The program writes the data as a comma-separated text file in which each row represents a cache size and each column a stride value.  This format makes it possible to import the data into Excel, which makes the analysis part of this assignment somewhat easier.
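One row of such comma-separated output might be produced along these lines (a sketch; the exact column layout, precision, and function names of the provided program may differ):

```c
#include <stdio.h>

/* Format one CSV row into buf: the cache size followed by the
   measured time for each stride value.  Assumes cap is large
   enough to hold the whole row.  Returns the length written. */
int format_row(char *buf, size_t cap, long csize,
               const double *times, int ntimes) {
    int len = snprintf(buf, cap, "%ld", csize);
    for (int i = 0; i < ntimes; i++)
        len += snprintf(buf + len, cap - len, ",%.1f", times[i]);
    return len;
}
```

Writing one such row per cache size, with one column per stride, yields a file that Excel can import directly as a grid for charting.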

 

Program Structure

 

There are two nested loops: the outer loop increments through the cache sizes (from 1K to 16M) and the inner loop increments through the strides for each cache size (from 1 to cache size/2).  Within the inner loop are two do loops.  The first performs repeated reads/writes to the matrix; the second repeats the loop without accessing the matrix, to capture the overhead of the loop itself.  The difference between the two times gives the data access time, which is averaged over the number of accesses per stride.  This is represented by the variable loadtime in the program.

 

This program takes a long time to run because it loops on each cache size/stride combination for 20 seconds.  Even on a fast computer, the run time can exceed 1.5 hours, so allow enough time for the program to complete execution.

 

Observations and Analysis (What To Do For This Assignment)

 

Run the program on a computer system to generate a complete sequence of memory access timings. Once the program has completed, analyze the results. Using Excel or a comparable spreadsheet, import your data and create graphs to display it.  A sample graph, as presented in the textbook on page 152, is shown below.  Use the graph you create from your own results as the basis for your analysis and conclusions. Review the results and see whether you can use them to answer the following questions:

 

At what cache size and stride level do significant changes in access times occur?

Do these timing changes correlate to typical cache sizes or changes in stride?

Is it possible to determine the cache sizes of the different levels based on the produced data?

What in your data doesn’t make sense? What questions arise from this data?

Compare your data against the actual cache information for the system.  For Windows-based systems, there is a freeware product called CPU-Z that reports detailed CPU information, including cache.  On Unix and Linux systems, /proc/cpuinfo or the lscpu command will provide similar information. MacCPUID is a tool for displaying detailed information about the microprocessor in a Mac computer.  Both CPU-Z and MacCPUID are free and can be downloaded from the Internet.  You may also refer to the specifications for the processor, which are published online.

 

If time allows, run the program on a second, different system and compare the results.  Are they similar or different?  How are they different?

 

What to submit on Blackboard?

A spreadsheet file where you consolidated and analyzed your results

A summary of your observations (create a separate tab in the spreadsheet)

Additional comments about your experience with this assignment (challenges, difficulties, surprises encountered, etc.) on the same page as your summary

 

Note:  due to many potential factors that could influence the outcome of your work on this assignment, there is no right vs. wrong solution.  Grading will be based on the observed level of effort presented through your analysis and documented results.  Your analysis should not just be a reiteration of the results, but should reflect your interpretation of the results, as well as posing any questions you formed in viewing the results.

 

Example graph from textbook showing program results:

Assignment Addendum

Cache Access Measurement Process Summary

 

Most contemporary processors contain multilevel cache memory as part of the memory hierarchy.  Each level of cache can be characterized by the following parameters:

 

Size: typically in the Kibibyte or Mebibyte range

Block size (a.k.a. line size): the number of bytes contained in a block

Associativity: the number of blocks (ways) contained in each set

 

Let D = size, b = block size, and a = associativity. The number of sets in a cache is defined as D / (a × b). So if a cache were 64 KB with a block size of 64 bytes and an associativity of 2, the number of sets would be 64K / (64 × 2) = 512.
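Expressed as code, using the example values above:

```c
/* Number of sets = D / (a * b): total cache size divided by
   (associativity times block size). */
long num_sets(long size_bytes, long assoc, long block_bytes) {
    return size_bytes / (assoc * block_bytes);
}
```

For the 64 KB, 2-way, 64-byte-block example this evaluates to 64K / 128 = 512 sets.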

 

The program that you are using for this assignment exercises the memory hierarchy by repeatedly accessing a data structure in memory and measuring the time associated with each access.  We stated that a simple two-dimensional array would suffice as the test data as long as its declared size was larger than the largest cache in the system.  An appropriate upper limit is 16 MiB, as most caches are smaller than this.  The program logic should vary the array size from some minimum value, e.g. 1 KiB, to the maximum, and for each array size vary the indexing of the array using a stride value in the range 1 to N/2, where N is the size of the array.  Let s represent the stride.

Depending on the magnitudes of N and s with respect to the cache size (D), the block size (b), and the associativity (a), there are four possible categories of operation.  Each category is characterized by the rate at which misses occur in the cache.  The following table summarizes these categories.

 

Category   Size of Array   Stride           Frequency of Misses          Time per Iteration
1          1 ≤ N ≤ D       1 ≤ s ≤ N/2      No misses                    Tno-miss
2          N > D           1 ≤ s < b        1 miss every b/s elements    Tno-miss + M·s/b
3          N > D           b ≤ s < N/a      1 miss every element         Tno-miss + M
4          N > D           N/a ≤ s ≤ N/2    No misses                    Tno-miss

 

Tno-miss is the access time when no miss occurs, and M is the miss penalty: the time it takes to read the data from the next lower level of cache (or from RAM) and resume execution.
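The "time per iteration" column can be written out directly as code (a sketch using the symbols above; the function and parameter names are illustrative):

```c
/* Expected time per iteration for each category in the table.
   t_nomiss = Tno-miss, m = miss penalty M, s = stride, b = block size. */
double iteration_time(int category, double t_nomiss, double m,
                      double s, double b) {
    switch (category) {
    case 2:  return t_nomiss + m * s / b;  /* one miss every b/s elements */
    case 3:  return t_nomiss + m;          /* one miss every element */
    default: return t_nomiss;              /* categories 1 and 4: no misses */
    }
}
```

For example, with Tno-miss = 10 ns, M = 100 ns, s = 8, and b = 64, category 2 gives 10 + 100 × 8/64 = 22.5 ns per iteration, while category 3 gives the full 110 ns.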

 

Discussion

 

Category 1:  N ≤ D

The complete array fits into the cache and thus, independently of the stride (s), once the array is loaded for the first time, there are no more misses.  The execution time per iteration (Tno-miss) includes the time to read the element from the cache, compute its new value and store the result back into the cache.

 

Category 2: N > D and 1 ≤ s < b

 

The array is bigger than the cache, and there are b/s consecutive accesses to each cache line.  The first access to each line generates a miss, because every cache line is displaced from the cache before it can be reused on the next pass; this follows from N > D. Therefore, the execution time per iteration is Tno-miss + M·s/b.

 

Category 3: N > D and b ≤ s < N/a

 

The array is bigger than the cache and there is a cache miss every iteration as each element of the array maps to a different line.  Again, every cache line is displaced from the cache before it can be reused.  The execution time per iteration is Tno-miss + M.

 

Category 4:  N > D and N/a ≤ s < N/2

 

The array is bigger than the cache, but the number of addresses mapping to a single set is no larger than the set associativity.  Thus, once the array is loaded, there are no more misses.  Even though the array has N elements, only N/s ≤ a of them are touched by the program, and all of them can fit in a single set; this follows from the fact that N/a ≤ s.  The execution time per iteration is Tno-miss.

 

By making a plot of the values of execution time per iteration as a function of N and s, we might be able to identify where the program makes a transition from one category to the next.  And using this information we can estimate the values of the parameters that affect the performance of the cache, namely the cache size, block size and associativity.

 

Our approach is somewhat flawed in that it ignores the effects of virtual memory and the TLB (translation-lookaside buffer).  For our purposes, however, we can neglect these issues and still gain an understanding of the operation and performance of the caches in a given system.