CPU cache - Wikipedia
Cache memory is used to bridge the speed gap between the CPU (processor) and main memory (RAM). In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster. A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost of accessing data from main memory. If data is written to the cache, at some point it must also be written to main memory; the timing of that write is governed by the cache's write policy. The cache hit rate and the cache miss rate play an important role in determining performance, and as the latency difference between main memory and the fastest cache has grown, multi-level caches have become standard. (The disk buffer, which is an integrated part of the hard disk drive, is sometimes misleadingly referred to as a "disk cache"; its main functions are write sequencing and read prefetching.)
The portion of the processor that does this translation is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the operating system's page table, segment table, or both. For the purposes of the present discussion, there are three important features of address translation:
- Latency: the physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
- Aliasing: multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
- Granularity: the virtual address space is broken up into pages. There may be multiple page sizes supported; see virtual memory for elaboration.
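The TLB's role as a small cache sitting in front of the page table can be sketched as follows. This is an illustrative model with hypothetical names, not how the hardware structures are actually laid out:

```python
# Illustrative sketch: a TLB as a small cache of page-table translations,
# so most accesses skip the slow page-table walk.
PAGE_SIZE = 4096  # assumed 4 KiB pages

page_table = {0: 7, 1: 3, 2: 9}  # virtual page number -> physical frame number
tlb = {}                          # small cache of recent translations
tlb_hits = tlb_misses = 0

def translate(vaddr):
    """Translate a virtual address, consulting the TLB before the page table."""
    global tlb_hits, tlb_misses
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                # fast path: translation already cached
        tlb_hits += 1
        pfn = tlb[vpn]
    else:                         # slow path: walk the page table, then cache it
        tlb_misses += 1
        pfn = page_table[vpn]
        tlb[vpn] = pfn
    return pfn * PAGE_SIZE + offset

translate(100)   # first touch of page 0: a TLB miss
translate(200)   # same page: a TLB hit, no page-table access
```

A real TLB is a small fixed-size associative structure with its own replacement policy; the unbounded dictionary here only captures the caching behavior.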
Some early virtual memory systems were very slow because they required an access to the page table (held in main memory) before every programmed access to main memory. The first hardware cache used in a computer system was not actually a data or instruction cache, but rather a TLB. Physically indexed, physically tagged (PIPT) caches use the physical address for both the index and the tag.
While this is simple and avoids problems with aliasing, it is also slow, as the physical address must be looked up (which could involve a TLB miss and access to main memory) before that address can be looked up in the cache.
Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the index and the tag. This caching scheme can result in much faster lookups, since the MMU does not need to be consulted first to determine the physical address for a given virtual address.
However, VIVT suffers from aliasing problems, where several different virtual addresses may refer to the same physical address. The result is that such addresses would be cached separately despite referring to the same memory, causing coherency problems. Although solutions to this problem exist, they do not work for standard coherence protocols. Another problem is homonyms, where the same virtual address maps to several different physical addresses.
It is not possible to distinguish these mappings merely by looking at the virtual index itself, though potential solutions include flushing the cache after a context switch, forcing address spaces to be non-overlapping, and tagging the virtual address with an address space ID (ASID). Additionally, virtual-to-physical mappings can change, which would require flushing cache lines, as the VAs would no longer be valid. All these issues are absent if tags use physical addresses (VIPT).
Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index and the physical address in the tag. The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation; however, the tag cannot be compared until the physical address is available.
The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms. Theoretically, VIPT requires more tag bits because some of the index bits could differ between the virtual and physical addresses (for example, bit 12 and above for 4 KiB pages) and would have to be included both in the virtual index and in the physical tag.
In practice this is not an issue because, in order to avoid coherency problems, VIPT caches are designed to have no such index bits; this limits the size of a VIPT cache to the page size times the cache's associativity. Physically indexed, virtually tagged (PIVT) caches are often claimed in literature to be useless and non-existent. The R solves the issue by putting the TLB memory into a reserved part of the second-level cache and having a tiny, high-speed TLB "slice" on chip.
The cache is indexed by the physical address obtained from the TLB slice. However, since the TLB slice only translates those virtual address bits that are necessary to index the cache and does not use any tags, false cache hits may occur, which is solved by tagging with the virtual address.
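The VIPT size limit mentioned earlier follows from simple arithmetic: the index and block-offset bits must fit inside the untranslated page offset. A quick check, with assumed (but typical) parameters:

```python
# Back-of-the-envelope check (assumed parameters): for a VIPT cache, the
# index + block-offset bits must fit inside the page offset, which limits
# cache size to page_size * associativity.
PAGE_SIZE = 4096      # 4 KiB pages -> 12 untranslated offset bits
LINE_SIZE = 64        # bytes per cache line
WAYS = 8              # associativity

max_sets = PAGE_SIZE // LINE_SIZE      # sets indexable without translated bits
max_vipt_size = PAGE_SIZE * WAYS       # = max_sets * WAYS * LINE_SIZE

print(max_sets)        # 64 sets
print(max_vipt_size)   # 32768 bytes, i.e. a 32 KiB limit, typical of L1 caches
```

This is why growing an L1 cache beyond 32 KiB at 8-way associativity with 4 KiB pages forces either more ways, larger pages, or alias handling.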
The speed of this recurrence (the load latency) is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM. But virtual indexing is not the best choice for all cache levels.
The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed. Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging.
Large caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints, as described below.
Homonym and synonym problems[ edit ] A cache that relies on virtual indexing and tagging becomes inconsistent after the same virtual address is mapped into different physical addresses (homonym), which can be solved by using the physical address for tagging, or by storing the address space identifier in the cache line.
However, the latter approach does not help against the synonym problem, in which several cache lines end up storing data for the same physical address. Writing to such locations may update only one location in the cache, leaving the others with inconsistent data. This issue may be solved by using non-overlapping memory layouts for different address spaces, or otherwise the cache or a part of it must be flushed when the mapping changes.
However, coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical addresses into a cache index, generally by storing physical tags as well as virtual tags.
For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler. When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table. It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache.
The operating system makes this guarantee by enforcing page coloring, which is described below. This approach has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen. It can be useful to distinguish the two functions of tags in an associative cache: determining which way of the entry set to select, and determining whether the cache hit or missed. The second function must always be correct, but it is permissible for the first function to guess, and get the wrong answer occasionally.
The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple software interface of a physically tagged cache.
It bears the added cost of duplicated tags, however.
Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted. The extra area and some latency can be mitigated by keeping virtual hints with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag.
Like a virtually tagged cache, there may be a virtual hint match but physical tag mismatch, in which case the cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match.
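The hint-match/tag-mismatch rule above can be sketched with a hypothetical data layout (a list of `[hint, tag, data]` entries standing in for the ways of one cache set); real hardware does this with comparators, not loops:

```python
# Sketch (hypothetical layout): each way stores a short virtual hint plus the
# full physical tag. The hint selects a way early; the physical tag confirms
# the hit, and a hint match with a tag mismatch forces an eviction so at most
# one way per set ever matches a given hint.
def lookup(set_ways, vhint, ptag):
    """set_ways: list of [hint, tag, data] entries for one cache set."""
    for way in set_ways:
        if way[0] == vhint:                  # early way selection by hint
            if way[1] == ptag:               # physical tag confirms the hit
                return way[2]
            way[0] = way[1] = way[2] = None  # hint match, tag mismatch: evict
            return None                      # treated as a miss; refill follows
    return None                              # no hint match: ordinary miss

ways = [[0b10, 0x44, "A"], [0b01, 0x91, "B"]]
lookup(ways, 0b10, 0x44)   # hint and tag both match: returns "A"
lookup(ways, 0b01, 0x55)   # hint matches way 1 but tag differs: evicts it
```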
Since virtual hints have fewer bits than virtual tags to distinguish cache entries from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.
In one simple form of this scheme, the virtual hint is effectively two bits and the cache is four-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory (CAM) is necessary to select the right one of the four ways fetched.
Cache coloring[ edit ] Large physically indexed caches (usually secondary caches) run into a problem: differences in page allocation from one program run to the next lead to differences in the cache collision patterns, which can lead to very large differences in program performance.
These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run.
Sequential physical pages map to sequential locations in the cache until the pattern wraps around. We can label each physical page with a color to denote where in the cache it can go. Locations within physical pages with different colors cannot conflict in the cache.
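As a worked example under assumed geometry (a direct-mapped 1 MiB cache with 4 KiB pages), a page's color is just its physical page number modulo the number of page-sized slots in the cache:

```python
# Worked example (assumed geometry): with a direct-mapped 1 MiB cache and
# 4 KiB pages, sequential physical pages wrap around the cache every
# 256 pages, so a page's "color" is its page number modulo 256.
CACHE_SIZE = 1 << 20                   # 1 MiB, direct-mapped for simplicity
PAGE_SIZE = 4096                       # 4 KiB
NUM_COLORS = CACHE_SIZE // PAGE_SIZE   # 256 colors

def page_color(paddr):
    """Color of the physical page holding paddr: where it lands in the cache."""
    return (paddr // PAGE_SIZE) % NUM_COLORS

page_color(0)                       # color 0
page_color(NUM_COLORS * PAGE_SIZE)  # wraps back around to color 0
page_color(3 * PAGE_SIZE)           # color 3: cannot conflict with color 0
```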
Programmers should also ensure that their access patterns do not have conflict misses. One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. Programmers can then arrange the access patterns of their code so that no two pages with the same virtual color are in use at the same time.
There is a wide literature on such optimizations. The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color, and then locations from those pages will collide in the cache (this is the birthday paradox).
The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called page coloring. Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.
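A page-coloring allocator can be sketched as follows. This is a hypothetical, simplified allocator (free pages bucketed by physical color, matching colors preferred), assuming the 256-color geometry used above; real kernels integrate this into their page allocators:

```python
# Sketch (hypothetical allocator): the OS keeps free pages bucketed by
# physical color and, when mapping a virtual page, prefers a free physical
# page whose color matches the virtual page's color.
from collections import defaultdict

NUM_COLORS = 256   # assumed: a 1 MiB direct-mapped cache with 4 KiB pages

free_pages = defaultdict(list)     # physical color -> free page frame numbers
for pfn in range(1024):
    free_pages[pfn % NUM_COLORS].append(pfn)

def alloc_page(virtual_page_number):
    """Prefer a physical page whose color matches the virtual page's color."""
    color = virtual_page_number % NUM_COLORS
    if free_pages[color]:              # matching color available: no aliasing
        return free_pages[color].pop()
    for bucket in free_pages.values(): # fall back to any free page
        if bucket:
            return bucket.pop()
    raise MemoryError("out of physical pages")

pfn = alloc_page(3)
pfn % NUM_COLORS   # 3: physical color matches the virtual color
```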
If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the OS can flush a page from the cache whenever it changes from one virtual color to another.
Cache hierarchy in a modern processor[ edit ] (Figure: memory hierarchy of an AMD Bulldozer server.) Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).
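The parameters listed above fully determine how an address splits into tag, set index, and block offset. A small sketch with assumed parameters:

```python
# Sketch (assumed parameters): splitting an address for a set-associative
# cache whose geometry is given by size, block size, and associativity.
def address_fields(addr, cache_size, block_size, ways):
    """Return (tag, set index, block offset) for the given cache geometry."""
    num_sets = cache_size // (block_size * ways)
    offset = addr % block_size                 # byte within the cache block
    index = (addr // block_size) % num_sets    # which set the block maps to
    tag = addr // (block_size * num_sets)      # remaining high-order bits
    return tag, index, offset

# 32 KiB, 64-byte blocks, 8-way: 64 sets, so 6 index bits and 6 offset bits.
address_fields(0x12345678, 32 * 1024, 64, 8)
```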
Level 2 and above have progressively larger numbers of blocks, larger block size, more blocks in a set, and relatively longer access times, but are still much faster than main memory. Cache entry replacement policy is determined by a cache algorithm selected for implementation by the processor designers. In some cases, multiple algorithms are provided for different kinds of workloads. Specialized caches[ edit ] Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch. The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline.
Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role. Victim cache[ edit ] A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path, and holds only those blocks of data that were evicted from the main cache. The victim cache is usually fully associative, and is intended to reduce the number of conflict misses.
Many commonly used programs do not require an associative mapping for all the accesses. In fact, only a small fraction of the memory accesses of the program require high associativity.
The victim cache exploits this property by providing high associativity to only these accesses. Trace cache[ edit ] A trace cache stores instructions after they have been decoded, so that the next time an instruction is needed, it does not have to be decoded into micro-ops again. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. The main disadvantage of the trace cache, leading to its power inefficiency, is the hardware complexity required for its heuristic deciding on caching and reusing dynamically created instruction traces. Write coalescing cache[ edit ] The task of the write coalescing cache (WCC) is to reduce the number of writes to the L2 cache. Branch target instruction cache[ edit ] This cache is used by low-powered processors that do not need a normal instruction cache because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one.
However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a few cycles of pipeline bubble after a control transfer.
A branch target cache provides instructions for those few cycles, avoiding a delay after most taken branches. This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache. Multi-level caches[ edit ] Another issue is the fundamental tradeoff between cache latency and hit rate.
Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the fastest, level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If that smaller cache misses, the next fastest cache (level 2, L2) is checked, and so on, before accessing external memory.
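The level-by-level check above translates directly into the standard average memory access time (AMAT) calculation. An illustrative sketch with assumed latencies and hit rates:

```python
# Illustrative arithmetic (assumed latencies and hit rates): average memory
# access time for a two-level hierarchy, checked level by level.
def amat(l1_hit, l1_rate, l2_hit, l2_rate, mem_latency):
    """Each level's latency is paid only when all faster levels miss."""
    l1_miss = 1 - l1_rate
    l2_miss = 1 - l2_rate
    return l1_hit + l1_miss * (l2_hit + l2_miss * mem_latency)

# 4-cycle L1 at 95%, 12-cycle L2 at 90%, 200-cycle memory:
amat(4, 0.95, 12, 0.90, 200)   # 4 + 0.05 * (12 + 0.1 * 200) = 5.6 cycles
```

Note how the slow 200-cycle memory contributes only one cycle on average once two cache levels filter the accesses.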
As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache.
Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but some of the highest-performance designs later returned to having large off-chip caches, which are often implemented in eDRAM and mounted on a multi-chip module, as a fourth cache level. The benefits of L3 and L4 caches depend on the application's access patterns.
A number of products incorporate L3 and L4 caches. At the other extreme, the register file can be viewed as the smallest, fastest level of the hierarchy, with the difference that its entries are allocated in software, typically by the compiler; however, with register renaming most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.
Register files sometimes also have hierarchy: the Cray-1 had eight address ("A") and eight scalar data ("S") registers that were generally usable. There was also a set of 64 address ("B") and 64 scalar data ("T") registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache.
The Cray-1 did, however, have an instruction cache. Multi-core chips[ edit ] When considering a chip with multiple cores, there is a question of whether the caches should be shared or local to each core.
Implementing a shared cache inevitably introduces more wiring and complexity. But then, having one cache per chip, rather than per core, greatly reduces the amount of space needed, and thus one can include a larger cache.
However, for the highest-level cache, the last one called before accessing memory, having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of utilized cache coherency protocols.
A shared highest-level cache, which is called before accessing memory, is usually referred to as the last level cache (LLC). Additional techniques are used to increase the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces, each addressing a certain range of memory addresses, which can be accessed independently.
Exclusive versus inclusive[ edit ] Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache; these caches are called strictly inclusive. Other processors have exclusive caches: data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Throughput[ edit ] The use of a cache also allows for higher throughput from the underlying resource, by assembling multiple fine-grain transfers into larger, more efficient requests.
In the case of DRAM circuits, this might be served by having a wider data bus. Reading larger chunks reduces the fraction of bandwidth required for transmitting address information. Operation[ edit ] Hardware implements cache as a block of memory for temporary storage of data likely to be used again.
A cache is made up of a pool of entries. Each entry has associated data, which is a copy of the same data in some backing store. Each entry also has a tag, which specifies the identity of the data in the backing store of which the entry is a copy. When the cache client (a CPU, web browser, or operating system) needs to access data presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired data, the data in the entry is used instead.
This situation is known as a cache hit. For example, a web browser program might check its local cache on disk to see if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is the tag, and the content of the web page is the data. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache. The alternative situation, when the cache is checked and found not to contain any entry with the desired tag, is known as a cache miss.
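The hit-rate definition above is simple arithmetic; a tiny worked example:

```python
# Tiny worked example: hit rate as the fraction of accesses served from cache.
def hit_rate(hits, misses):
    """Fraction of all accesses that were cache hits."""
    return hits / (hits + misses)

hit_rate(90, 10)   # 0.9, i.e. a 90% hit rate
```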
This requires a more expensive access of data from the backing store. Once the requested data is retrieved, it is typically copied into the cache, ready for the next access.
During a cache miss, some other previously existing cache entry is removed in order to make room for the newly retrieved data. The heuristic used to select the entry to replace is known as the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the oldest entry, the entry that was accessed less recently than any other entry (see cache algorithm).
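The LRU policy just described can be sketched with an ordered mapping. This is an illustrative software model (the kind used in simulators or software caches), not how hardware implements LRU:

```python
# Minimal sketch of LRU replacement using an ordered mapping.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store      # e.g. a dict standing in for RAM
        self.entries = OrderedDict()      # tag -> data, least recent first
        self.hits = self.misses = 0

    def read(self, tag):
        if tag in self.entries:           # cache hit: refresh recency
            self.hits += 1
            self.entries.move_to_end(tag)
        else:                             # miss: fetch, maybe evict LRU entry
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[tag] = self.backing[tag]
        return self.entries[tag]

ram = {addr: addr * 10 for addr in range(8)}
cache = LRUCache(2, ram)
cache.read(0); cache.read(1)   # two misses fill the cache
cache.read(0)                  # hit: tag 0 becomes most recent
cache.read(2)                  # miss: evicts tag 1, the least recently used
```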
More efficient caching algorithms compute the use-hit frequency against the size of the stored contents, as well as the latencies and throughputs for both the cache and the backing store. This works well for larger amounts of data, longer latencies, and slower throughputs, such as those experienced with hard drives and networks, but is not efficient for use within a CPU cache.
Writing policies[ edit ] When a system writes data to cache, it must at some point write that data to the backing store as well. The timing of this write is controlled by what is known as the write policy. There are two basic writing approaches: Write-through: the write is done synchronously both to the cache and to the backing store. Write-back (also called write-behind): the write to the backing store is postponed until the modified content is about to be replaced by another cache block.
A write-back cache is more complex to implement, since it needs to track which of its locations have been written over, and mark them as dirty for later writing to the backing store.
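The dirty-tracking bookkeeping described above can be sketched as follows (an illustrative model with hypothetical names, ignoring capacity and indexing):

```python
# Sketch (illustrative) of write-back bookkeeping: writes only mark the
# cached copy dirty; the backing store is updated when the block is evicted.
class WriteBackCache:
    def __init__(self, backing):
        self.backing = backing
        self.data = {}      # tag -> cached value
        self.dirty = set()  # tags written since they were last synced

    def write(self, tag, value):
        self.data[tag] = value
        self.dirty.add(tag)          # postponed: no backing-store write yet

    def evict(self, tag):
        if tag in self.dirty:        # dirty blocks must be written back
            self.backing[tag] = self.data[tag]
            self.dirty.discard(tag)
        del self.data[tag]           # clean blocks can simply be dropped

store = {"x": 1}
cache = WriteBackCache(store)
cache.write("x", 2)    # backing store still holds the stale value 1
cache.evict("x")       # now store["x"] == 2: written back on eviction
```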