현재 접속중인 등록 사용자는 0명, 익명 사용자는 8명 입니다.
전체 등록 사용자: 751명
|글쓴이: EzDoum 글쓴날: 2002년 04월 13일 오후 12:12
Topics mostly range across memory management and memory hierarchies, and important interactions between them, including:
* adaptive replacement and prefetching policies, and related adaptive algorithms for managing other kinds of resources
* compressed caching and compression algorithms for in-memory data
* persistent object stores and process checkpointing
* memory allocation and storage fragmentation
* generational and real-time garbage collection,
* locality effects of garbage collection strategies
* tracing, trace reduction and simulation techniques
One of these days we'll presumably publish some papers on our work in extensible languages, hygienic macros, parsers for scoped extensible grammars, and logical metalanguages, which started with RScheme system, an object-oriented Scheme system.
Most of our Publications, in reverse chronological order:
1. Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. Composing High-Performance Memory Allocators, PLDI 2001 PDF file.
Current general-purpose memory allocators do not provide sufficient speed or flexibility for modern high-performance applications. Highly-tuned general purpose allocators have per-operation costs around one hundred cycles, while the cost of an operation in a custom memory allocator can be just a handful of cycles. To achieve the performance they require, programmers often write custom memory allocators from scratch, a difficult and error-prone process.
In this paper, we present a flexible and efficient infrastructure for building memory allocators that is based on C++ templates and inheritance. This novel approach allows programmers to build custom and general-purpose allocators as heap layers that can be composed without incurring any runtime overhead or additional programming cost. We show that this infrastructure simplifies allocator construction and results in allocators that either match or improve the performance of heavily-tuned allocators written in C, including the Kingsley allocator and the GNU obstack library. We further show this infrastructure can be used to rapidly build a general-purpose allocator that has performance comparable to Lea's allocator, the best uniprocessor allocator we know of. We thus demonstrate a clean, easy-to-use allocator interface that seamlessly combines the power and efficiency of any number of general and custom allocators within a single application.
2. Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications, ASPLOS-IX, 2000 PDF file.
Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of P (the number of processors) to unbounded memory consumption.
This paper introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.
3. Yannis Smaragdakis and Paul Wilson, Performing Replacement in Modem Pools USENIX Technical Conference 2000 (pdf file) . (213 KB).
This paper explains how to adaptively manage a pool of modems, although the ideas are probably applicable to many similar uniform-resource-pool management problems, such as maintaining persistent connection state in network protocol stacks.
The problem is similar to replacement in a cache, but unlike many memory caching problems, there are typically many mostly-independent (uncorrelated) request streams. It is easy to maintain very simple and cheap per-stream statistics and efficiently predict which streams are unlikely to make a request soon.
The CIRG algorithm presented here is, in a sense, the flip side of the EELRU caching algorithm. Where EELRU keys its adaptation to the aggregate behavior of a single request stream, CIRG keys its behavior to the differences between request streams. (Both algorithms are based on the same general principles of timescale relativity, applied in different contexts.)
4. Scott F. Kaplan. Compressed caching and modern virtual memory simulation. Ph.D. dissertation, The University of Texas at Austin, August 1999. Compressed Postscript (1024 Kbytes).
5. Paul R. Wilson, Scott F. Kaplan, and Yannis Smaragdakis, The Case for Compressed Caching in Virtual Memory Systems , in revision to appear in USENIX Technical Conference '99. (NEW: revised version won Outstanding Paper Award) PostScript File
Compressed caching bridges the large and exponentially growing gap between RAM and disk speeds, effectively building a new level of memory hierarchy between normal RAM and disk.
Since we first published the idea almost a decade ago, compressed caching for memory systems has been experimented with, but not adopted for general-purpose computer systems. It is widely perceived as a failed idea, except for diskless systems, because of several disappointing experimental results; a common belief is that it cannot improve system performance for fast disk-based general-purpose computers, and is only useful for "stretching" memory on diskless machines.
However, as we and Douglis pointed out years ago, it is an idea that gets better with time, largely due to continuing increases in CPU speed relative to disk speeds. Fast CPUs reduce compression costs, increasing the potential benefit of compressed caching. This point went largely overlooked, but can now be seen to be correct: compressed caching is an idea whose time has come.
In this paper, we show that previous mixed results for compressed caching were only to be expected---the tradeoffs then were much more difficult than they currently are. With modern CPU speeds, the situation is very different, and getting better all the time. Extensive detailed simulations show that compressed caching memory should yield major improvements in paging performance now---tens of percent---and still more in the future, since processor speeds double relative to disk speeds every two or three years.
We also present novel algorithms for in-memory data compression and adaptive compressed caching. Our compression algorithms exploit regularities common in in-memory data, as opposed to file data, and achieve excellent compression while being very fast. Our adaptive caching algorithm performs an online cost-benefit analysis, to adjust the size of the compressed page cache to suit the locality characteristics of a running program. Fast compression and proper adaptation significantly increases the effectiveness of compressed caching.
6. Yannis Smaragdakis, Scott F. Kaplan, and Paul R. Wilson. EELRU: Simple and Effective Adaptive Page Replacement , in SIGMETRICS '99. PostScript file .
LRU (or LRU-like) replacement is the nearly universal standard, because for 30 years it has seemed very hard to beat. Most actual replacement policies (e.g., Clock algorithms) attempt to approximate LRU, diverging from it primarily for implementation efficiency reasons. Many attempts to systematically beat LRU have largely failed, partly due to a basic lack of understanding of LRU's great strengths as well as its weaknesses. In this paper, we show how to beat LRU on average by emulating it in its good cases, and diverging radically from it in LRU's most common (and well-known) bad cases, which are very bad indeed.
Our EELRU algorithm adaptively approximates LRU or a modified MRU, as appropriate. It uses an online cost/benefit analysis to estimate costs and benefits of different eviction strategies, based on recent program behavior. Unlike many adaptive replacement policies, EELRU is well-behaved in two major ways: (1) it keys off of features of program behavior that are directly relevant to caching decisions, at the appropriate timescale, and (2) it is competitive with LRU---it never does worse than LRU by more than a modest constant factor. (In contrast, LRU can do worse than EELRU by a factor proportional to the memory size---up to thousands of times worse.)
We show through simulation that EELRU works as it was designed to---it detects and exploits the same regularities LRU does, and others besides, yielding significan miss rate reductions. Perhaps more importantly, it illustrates basic principles of replacement decisions; the very theory that explains why LRU is good also shows what is worst about it, and exactly how to beat it. More importantly, it illustrates constraints on the behavior of any replacement algorithm that is robust---aggregate behavior of large sets of pages is often more important than distinguishing between the behaviors of individual pages, as some attempts at "smarter" replacement do, and attending to these large-scale regularities is typically sufficient for excellent performance. Equally important, an online cost/benefit analysis can be simple and effective, if it attends only to events at the timescale that matters to the decisions it must make.
In many situations EELRU-like replacement is not only sufficient but necessary for good performance. Any excellent replacement algorithm must behave rather similarly to EELRU in common situations where recently-touched pages are much more like to be touched than less-recently touched ones, and also in situations where LRU's heuristic is systematically defeated by frequent touches to recently-evicted pages. Any smarter replacement policies of the future must respect these principles, no matter how it actually makes its replacement choices.
7. Scott F. Kaplan, Yannis Smaragdakis, and Paul R. Wilson Trace Reduction for Virtual Memory Simulations in SIGMETRICS '99. PostScript file .
Trace driven simulation allows virtual memory system researchers to experiment with new designs and ideas without implemented changes in a real operating system kernel. Moreover, it makes it possible for research to be based on reproducible, controlled experiments. The design space can be more fully examined, and the effects of different decisions can be isolated.
Unfortunately, there is a well know problem with reference traces: size. These traces can grow to gigabytes in size for just a few seconds of program execution time. The storage of a reference trace can be difficult. Sharing of these traces rarely happens because the transfer of them is formidable. Worse, virtual memory experiments are limited in the number of variables and the range of those variables, as the time required to process a trace in simulation can be long. As such, reducing the trace size with a lossless compression algorithm is not enough. Rather, lossy methods that discard reference information not needed for accurate simulations are required.
Most such lossy reduction methods are designed for hardware cache simulation, and depend on sampling techniques. However, the simulation of virtual memory performance has become very important, specifically the management of main memory (RAM) and the avoidance of expensive hits to the backing store (disk). The methods available for lossy trace reduction that are appropriate for virtual memory simulation are old, and discard more information than is needed.
In this paper, we present Safely Allowed Drop (SAD) and Optimal LRU Reduction (OLR). These two methods allow the user to choose the smallest LRU memory to be simulated. The resulting reduced trace is then guaranteed not to affect the exact sequence of fetches and evictions for any LRU memory as large or larger than the size chosen for reduction. (SAD additionally guarantees the exact sequence of fetches and evictions for the offline, optimal paging policy, OPT.) In other words, these reduction policies introduce absolutely no error for LRU (and, for SAD, OPT) simulations, provided the simulated memory is larger than the memory size chosen at reduction time.
Moreover, this paper demonstrates that most real policies that do not directly employ LRU are very similar to LRU. As such, we empirically demonstrate that if the simulated memory size is at least five times larger than the reduction memory size, common policies like clock and segmented queue, which approximate LRU, suffer little error -- less than 6% over 15 real reference traces, and for most memory sizes less than 1%. These error values compare favorably to the Stack Deletion method that is often used and cited as the most common trace reduction method for virtual memory simulations.
Both algorithms can be used online, during trace collection time, such that the original reference trace is never stored. Their space overheads are not significant. Implementations are freely available from our trace reduction page.
8. Mark S. Johnstone and Paul R. Wilson. The Memory Fragmentation Problem: Solved? In International Symposium on Memory Management , Vancouver, British Columbia, Canada, October 1998. Adobe Acrobat (.pdf) file .
This paper substantially strengthens the original fragmentation results presented in our earlier allocator survey, showing that for a variety of applications, good allocators exhibit nearly zero fragmentation---about one percent on average, for 8 varied C and C++ programs. While it is clear that there must exist some programs for which this is not true, it shows that the received view of the fragmentation problem is simply wrong, and fragmentation is not the kind of problem 30 years of research generally assumed it was---typical program behavior is very favorable to several clearly excellent allocation policies, due to strong regularities in program allocation and deallocation behavior.
A major goal of future allocator design should be to construct very efficient algorithms that implement those policies; as our survey showed, this is not nearly as difficult as is widely believed. Future work should identify the apparently rare programs that are likely to exhibit substantial fragmentation, and study their behavior and its interactions with allocator placement policies.
9. Sheetal V. Kakkad, Mark S. Johnstone, and Paul R. Wilson. Portable Runtime Type Description for Conventional Compilers. In International Symposium on Memory Management, Vancouver, British Columbia, Canada, October 1998.
Many useful language extensions and support libraries require knowledge of the layout of fields within objects at runtime. Examples include orthogonal persistent object stores, garbage collectors, data structure picklers, parameter marshalling schemes, etc.
For clean and efficient implementation as libraries, these systems require knowledge of the actual layouts of data objects, which is unavailable in most traditionally-compiled and linked programming languages, such as C, C++, and Ada.
We present a facility for runtime type description, or RTTD, which extracts the low-level layout information from debug output of conventional compilers, and makes it available to user programs. We describe the basic strategies of the system, and present details of our interface for C++. We also sketch some extensions we have implemented, including special treatment of C++'s virtual function table pointers.
Our implementation is freely available via anonymous ftp.
10. Sheetal V. Kakkad. Address Translation and Storage Management for Persistent Object Stores. Ph.D. dissertation, The University of Texas at Austin, December 1997.
A common problem in software engineering is efficiently saving the state of application data structures to non-volatile storage between program executions. If this is accomplished using normal file systems, the programmer is forced to explicitly save the data to files as a stream of uninterpreted bytes, thereby losing both pointer semantics and object identity. A better approach is to use persistent object storage, a natural extension to virtual memory that allows heap data to be saved automatically to disk while maintaining the topology of data structures without any explicit programmer intervention.
If persistent object stores are to replace the functionality of normal file systems, they must be able to address large volumes of data efficiently on standard hardware. High-performance address translation techniques are necessary and important for supporting large address spaces on stock hardware. We present pointer swizzling at page fault time (PS@PFT), a coarse-grained address translation scheme suitable for this purpose, and demonstrate it by building a persistent storage system for C++ called the Texas Persistent Store. We also discuss alternative approaches for portably incorporating fine-grained address translation in Texas for situations where coarse-grained swizzling alone is insufficient. As part of the performance results, we present a detailed analysis of various components of a coarse-grained address translation technique, including a comparison with overall I/O costs.
Pointer swizzling requires run-time knowledge of in-memory object layouts to locate pointers in objects. We have developed and implemented Run-Time Type Description (RTTD) for this purpose; our implementation strategy is portable because it is based on a novel use of compiler-generated debugging information for extracting the necessary type description. RTTD is also useful for other applications such as data structure browsing, and advanced profiling and tracing.
Another part of this research is a study of the interaction between systems similar to PS@PFT and operating systems, particularly regarding virtual memory management issues. We suggest areas where operating system implementations can be made more open to improve their performance and extensibility. Finally, we briefly discuss storage management issues, specifically log-structured storage, disk prefetching, and compressed in-memory storage, and provide directions for future research in this area.
11. Mark S. Johnstone. Non-Compacting Memory Allocation and Real-Time Garbage Collection. Ph.D. dissertation, The University of Texas at Austin, December 1997.
Dynamic memory use has been widely recognized to have profound effects on program performance, and has been the topic of many research studies over the last forty years. In spite of years of research, there is considerable confusion about the effects of dynamic memory allocation. Worse, this confusion is often unrecognized, and memory allocators are widely thought to be fairly well understood.
In this research, we attempt to clarify many issues for both manual and automatic non-moving memory management. We show that the traditional approaches to studying dynamic memory allocation are unsound, and develop a sound methodology for studying this problem. We present experimental evidence that fragmentation costs are much lower than previously recognized for most programs, and develop a framework for understanding these results and enabling further research in this area. For a large class of programs using well-known allocation policies, we show that fragmentation costs are near zero. We also study the locality effects of memory allocation on programs, a research area that has been almost completely ignored. We show that these effects can be quite dramatic, and that the best allocation policies in terms of fragmentation are also among the best in terms of locality at both the cache and virtual memory levels of the memory hierarchy.
We extend these fragmentation and locality results to real-time garbage collection. We have developed a hard real-time, non-copying generational garbage collector which uses a write-barrier to coordinate collection work only with modifications of pointers, therefore making coordination costs cheaper and more predictable than previous approaches. We combine this write-barrier approach with implicit non-copying reclamation, which has most of the advantages of copying collection (notably avoiding both the sweep phase required by mark-sweep collectors, and the referencing of garbage objects when reclaiming their space), without the disadvantage of having to actually copy the objects. In addition, we present a model for non-copying implicit-reclamation garbage collection. We use this model to compare and contrast our work with that of others, and to discuss the tradeoffs that must be made when developing such a garbage collector.
12. Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic Storage Allocation: A Survey and Critical Review. In International Workshop on Memory Management, Kinross, Scotland, UK, September 1995.
Dynamic memory allocation has been a fundamental part of most computer systems since roughly 1960, and memory allocation is widely considered to be either a solved problem or an insoluble one. In this survey, we describe a variety of memory allocator designs and point out issues relevant to their design and evaluation. We then chronologically survey most of the literature on allocators between 1961 and 1995. (Scores of papers are discussed, in varying detail, and over 150 references are given.)
We argue that allocator designs have been unduly restricted by an emphasis on mechanism, rather than policy, while the latter is more important; higher-level strategic issues are still more important, but have not been given much attention.
Most theoretical analyses and empirical allocator evaluations to date have relied on very strong assumptions of randomness and independence, but real program behavior exhibits important regularities that must be exploited if allocators are to perform well in practice.
13. Paul R. Wilson, Sheetal Kakkad, and Shubhendu S. Mukherjee. Anomalies and Adaptation in the Analysis and Development of Prefetching Policies. Journal of Systems and Software, 27(2):147-153, November 1994. Technical communication.
In "Analysis and Development of Demand Prepaging Policies," Horspool and Huberman show that it is possible to design prefetching memory policies that preserve a "stack" inclusion property, much like LRU, allowing them to simulate these policies for all sizes of memory in a single pass through a reference trace. We believe that the details of Horspool and Huberman's algorithms introduce unexpected anomalous properties, however. In particular, their policies are not properly timescale relative---events occuring on a timescale that should only matter to some sizes of memory adversely affect replacement decisions for memories of very different sizes. Slight changes to the algorithms can restore timescale relativity and make them much better-behaved. In addition, we would like to point out that Horspool and Huberman's algorithms actually simulate adaptive policies, which may explain some of their unexpectedly positive results. This view suggests that properly timescale-relative inclusion-preserving policies can be used to systematically evaluate adaptive memory management schemes.
14. Paul R. Wilson and Mark S. Johnstone. Real-Time Non-Copying Garbage Collection. Position paper for the 1993 ACM OOPSLA Workshop on Memory Management and Garbage Collection, Washington D.C., September 1993.
15. Paul R. Wilson. Uniprocessor Garbage Collection Techniques. In International Workshop on Memory Management, St. Malo, France, September 1992. (The Proceedings has been published as Springer-Verlag Lecture Notes in Computer Science no. 637).
We survey basic garbage collection algorithms, and variations such as incremental and generational collection. The basic algorithms include reference counting, mark-sweep, mark-compact, copying, and treadmill collection. Incremental techniques can keep garbage collection pause times short, by interleaving small amounts of collection work with program execution. Generational schemes improve efficiency and locality by garbage collecting a smaller area more often, while exploiting typical lifetime characteristics to avoid undue overhead from long-lived objects.
16. Paul R. Wilson. Uniprocessor Garbage Collection Techniques. Draft of much expanded version of the above paper. In revision (accepted for ACM Computing Surveys).
We survey basic garbage collection algorithms, and variations such as incremental and generational collection; we then discuss low-level implementation considerations and relationships between storage management systems, languages, and compilers. Throughout, we attempt to present a unified view based on abstract traversal strategies, addressing issues of conservatism, opportunism, and immediacy of reclamation; we also point out a variety of implemetation details that are likely to have significant impact on the performance.
17. Paul R. Wilson and Sheetal V. Kakkad. Pointer Swizzling at Page Fault Time: Efficiently and Compatibly Supporting Huge Address Spaces on Standard Hardware. In International Workshop on Object Orientation in Operating Systems, pages 364-377, Paris, France, September 1992.
Pointer swizzling at page fault time is a novel address translation mechanism that exploits conventional address translation hardware. It can support huge address spaces efficiently without long hardware addresses; such large address spaces are attractive for persistent object stores, distributed shared memories, and shared address space operating systems. This swizzling scheme can be used to provide data compatibility across machines with different word sizes, and even to provide binary code compatibility across machines with different hardware address sizes.
Pointers are translated ("swizzled") from a long format to a shorter hardware-supported format at page fault time. No extra hardware is required, and no continual software overhead is incurred by presence checks or indirection of pointers. This pagewise technique exploits temporal and spatial locality in much the same way as a normal virtual memory; this gives it many desirable performance characteristics, especially given the trend toward larger main memories. It is easy to implement using common compilers and operating systems.
18. Vivek Singhal, Sheetal Kakkad, and Paul Wilson. Texas: An Efficient, Portable Persistent Store. In Persistent Object Systems: Proceedings of the Fifth International Workshop on Persistent Object Systems, pages 11-33, San Miniato, Italy, September 1992.
Texas is a persistent storage system for C++, providing high performance while emphasizing simplicity, modularity and portability. A key component of the design is the use of pointer swizzling at page fault time, which exploits existing virtual memory features to implement large address spaces efficiently on stock hardware, with little or no change to existing compilers. Long pointers are used to implement an enormous address space, but are transparently converted to the hardware-supported pointer format when pages are loaded into virtual memory.
Runtime type descriptors and slightly modified heap allocation routines support pagewise pointer swizzling by allowing objects and their pointer fields to be identified within pages. If compiler support for runtime type identification is not available, a simple preprocessor can be used to generate type descriptors.
This address translation is largely independent of issues of data caching, sharing, and checkpointing; it employs operating systems' existing virtual memories for caching, and a simple and flexible logging system.
Pagewise virtual memory protections are also used to detect writes for logging purposes, without requiring any changes to compiled code. This may degrade checkpointing performance for small transactions with poor locality of writes, but page diffing and sub-page logging promise to keep performance competitive with finer-grained checkpointing schemes.
Texas presents a simple programming interface; an application creates persistent object by simply allocating them on the persistent heap. In addition, the implementation is relatively small, and is easy to incorporate into existing applications.
19. Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Caching Considerations for Generational Garbage Collection. 1992 ACM Symposium on Lisp and Functional Programming, San Francisco, California, June 1992.
This paper points out a fundamental locality problem due to garbage collection. Garbage collectors normally trade space for time by allocating substantial amounts of memory between collections. Memory can thus only be used once per allocation/collection cycle, and a substantial amount of memory is both touched and dirtied at each cycle.
Generational garbage collection can improve this situation somewhat at the virtual memory level, by reclaiming and reusing a smaller amount of memory more frequently. Still, the problem remains at the hardware (SRAM) cache scale unless the youngest generation is smaller than the cache. (Which it typically isn't for on chip caches, or for many modest-sized off chip caches.)
If a non-generational collector is used, or if the youngest generation is larger than the cache, then with a normal cache,
the rate of cache misses is at least equal to the rate of heap allocation,
most misses due to allocation are write misses, and fetch useless garbage that will immediately be overwritten by the initializing writes that create objects, and
most misses evict dirty blocks which must be written back to main memory.
A properly designed cache can avoid the problem of faulting on blocks during allocation (only to fetch useless garbage), by allocating cache lines "out of thin air" in the cache, without fetching their former contents from slower memory. If sufficient write bandwidth and write buffering are available, the write-backs of dirty blocks may not be a problem. (For caches larger than the youngest generation, this is not critical.)
Subsequent to this paper, Appel showed that a "write-validate" sector cache can avoid the needless faulting and fetching of garbage. (A write-validate cache is a sector cache which allocates space for block in the cache on a write miss, but does not immediately fetch their contents, and simply validates the contents of sub-blocks as they are written.) Similar effects may be possible with newer software controllable caches.
Recent data on the cache performance of Java systems by Lizy John and others indicates that this is indeed a serious issue for the Java systems they studied, as well as the Lisp system we studied and the ML system studied by Appel. (E.g., data cache miss rates are fairly high, especially in light of the low execution speed of current Java implementations, and 70 percent of data cache misses are write-misses.) Write-validate caching (or some other scheme with a similar effect) seems to be a good choice for systems intended to run Java.
20. Michael S. Lam, Paul R. Wilson, and Thomas G. Moher. Object Type Directed Garbage Collection to Improve Locality . 1992 International Workshop on Memory Management .
Separating objects by type during clustering can improve locality, because objects of different types are often used in different ways.
While it might seem that separating obviously related objects would degrade locality, it generally doesn't, because it doesn't matter which objects go in particular pages. At the scale relevant to caching, we can group related objects into a "small" number of pages (relative to the memory size) while still grouping different kinds of objects apart, in subsets of those pages, in case the different kinds of objects are used during different phases of program execution.
(This principle can now be better understood in light of our subsequent explication of the principles of timescale relativity in our EELRU paper and Scott Kaplan's dissertation. In the context of clustering to improve locality, the bottom line is that it's typically easy to group related objects together, but still err because it's just as important keep objects apart if they're not reliably used "together" at a timescale scale relevant to caching. Objects that are clustered together but not used together are the cause of cache pollution , so it is crucial to separate objects that are sometimes used together, but sometimes not. We can freely separate a group of related objects into subgroups if all of those groups are small relative to the cache size ; details of which objects go in which pages don't usually matter at all if all of the objects do turn out to be used together, but may matter a lot if they don't, and only some of the objects are part of a larger working set, in combination with still other objects.)
21. Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Effective Static-Graph Reorganization to Improve Locality in Garbage-Collected Systems Proc. ACM SIGPLAN 1991 Conf. on Programming Language Design and Implementation, ACM SIGPLAN Notices vol. 26, No. 6 (June 1991), pp. 177-191. Adobe Acrobat (.pdf) file .
We show that conventionaly copying traversals by copying garbage collectors have deleterious effects on the locality of data, and show how to avoid this by "hierarchical decomposition" traversal. Hierarchical decomposition is an efficient two-level version of the Cheney algorithm, using a "queue of queues" requiring only a pair of pointers per page of data in the "grey" region of the global Cheney scan. Hierarchical Decomposition clusters local regions of the reachability graph into pages, heuristically clustering the data in a way that minimizes path lengths (and resembles a B-tree). Linearization of lists is done "for free" as the degenerate case of hierarchical decomposition.
We also show that a blind traversal of hash tables can have profound affect on the resulting data clustering, especially in combination with a conventional Cheney breadth-first traversal.
When linked data structures are hierarchically clustered, most pointers can easily be compressed to a single bit by a pagewise compression algorithm.
22. Paul R. Wilson. Operating System Support for Small Objects. 1991 IEEE International Workshop on Object Orientation in Operating Systems. Adobe Acrobat .pdf file
This poorly-named paper discussed quite a few ideas about implementation of advanced memory management, many (but not all) of which we have subsequently elaborated in detailed papers about individual topics.
An important general theme is that standard virtual memory hardware is best viewed as general-purpose hardware support for efficient implementations of a wide variety of advanced memory management features: process checkpointing, persistence, compressed virtual memory caching, adaptive clustering of (virtual memory or persistent store) pages on disk, and address translation for flat address spaces vastly larger than the hardware was designed to support.
A normal page table entry cache (a. k. a. Translation Lookaside Buffer) includes fast parallel hardware that performs simple checks on memory accesses at every memory reference. (It checks whether the referenced page is resident in memory, and whether the access violates the read/write permissions set for the page.) By making protections and traps serve multiple purposes, we can provide many advanced memory management features with zero overhead in the usual case, because the hardware is checking page protections at every memory reference anyway.
23. Paul R. Wilson and Thomas G. Moher. Design of the Opportunistic Garbage Collector Proc. ACM OOPSLA '89. ACM SIGPLAN Notices vol. 24, no. 10 (Oct. 1989), pp. 23-35.
We present three improvements to generational copying garbage collectors: (1) fast unconditional write barriers, (2) a simple and effective bucket brigade advancement ("tenuring") policy, and (3) opportunistic scheduling of garbage collections to reduce cost and/or hide pauses, either by performing GCs nondisruptively when the system is idle due to user-response delays, or by hiding garbage collection pauses in larger computational pauses incurred by the mutatator (running program) anyway.
24. Paul R. Wilson and Thomas G. Moher. Demonic Memory For Process Histories Proc. SIGPLAN '89 Conf. on Programming Language Design and Implementation. ACM SIGPLAN Notices vol. 24, no. 7 (July 1989), pp. 330-343.
We present techniques for efficient incremental checkpointing, replay, and detailed process reconstruction. The original target application for these techniques was time-travel debugging (though that term had not been coined then), but the ideas are more generally applicable.
We explain how to coordinate checkpointing with garbage collection, to avoid checkpointing garbage data. (Subsequently Tolmach and Appel showed how to elegantly invert our stacking of checkpointing atop garbage collection, for nearly-pure functional languages, and use continuation capture for checkpointing. In effect, this uses the garbage collector as the checkpointer.)
A bibliography on heap management, and the source code for Texas Persistent Store are available via anonymous ftp at ftp.cs.utexas.edu:/pub/garbage. The README file lists all the available material including subdirectories which contain collected papers from the 1991 and 1993 OOPSLA Garbage Collection and Memory Management Workshops.
[분류: 리눅스 커널 ]
< C++ FAQ LITE | [Cache] An Overview of Cache >