University of Illinois Center for Supercomputing Research and Development

The Center for Supercomputing Research and Development (CSRD) at the University of Illinois (UIUC) was a research center funded from 1984 to 1993. It built the shared memory Cedar computer system, which included four hardware multiprocessor clusters, as well as parallel system and applications software. It was distinguished from the four earlier UIUC Illiac systems by starting with commercial shared memory subsystems that were based on an earlier paper published by the CSRD founders. Thus CSRD was able to avoid many of the hardware design issues that slowed the Illiac series work. Over its 9 years of major funding, plus follow-on work by many of its participants, CSRD pioneered many of the shared memory architectural and software technologies upon which much of 21st-century computing is based.

History

UIUC began computer research in the 1950s, initially for civil engineering problems. This was eventually followed by cooperative activities among the Math, Physics, and Electrical Engineering Departments to build the Illiac computer series, which led to the founding of the Computer Science Department in 1965.

The early 1980s brought a world-wide expansion of HPC, including the race with the Japanese 5th generation system targeting innovative parallel applications in AI. HPC/supercomputing had emerged as a field, commercial supercomputers were in use by industry and labs (but little by academia), and academic architecture and compiler research were expanding. This led to formation of the Lax committee[1] to study the academic needs of focused HPC research, and to provide commercial HPC systems for university research. When HPC practitioner Ken Wilson won the Nobel physics prize in 1982, he expanded his already strong advocacy of both, and soon several government agencies introduced HPC R&D programs.

As a result, the UIUC Center for Supercomputing R&D (CSRD) was formed in 1984 (with funding from DOE, NSF, and UIUC, as well as DoD DARPA and AFOSR), under the leadership of three CS professors who had worked together since the Illiac 4 project: David Kuck (Director), Duncan Lawrie (Assoc. Dir. for SW) and Ahmed Sameh (Assoc. Dir. for applications), plus Ed Davidson (Assoc. Dir. for hardware/architecture), who joined from ECE. Many graduate students and post-docs were already contributing to constituent efforts; full time academic professionals were hired, and other faculty cooperated. A total of up to 125 people were involved at the peak, over the nine years of full CSRD operation.[2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

The UIUC administration responded to the computing and scientific times. CSRD was set up as a Graduate College unit, with space in Talbot Lab. UIUC President Stanley Ikenberry arranged to have Governor James Thompson directly endow CSRD with $1 million per year to guarantee personnel continuity. CSRD management helped write proposals that led to a gift from Arnold Beckman of a $50 million building, the establishment of NCSA, and a new CSRD building (now CSL).

The CSRD plan for success took a major departure from earlier Illiac machines by integrating four commercially built parallel machines using an innovative interconnection network and global shared memory. Cedar was based on designing and building a limited amount of innovative hardware, driven by SW that was built on top of emerging parallel applications and compiler technology. By breaking the tradition of building hardware first and then dealing with SW details later, this codesign approach led to the name Cedar instead of Illiac 5.

Earlier work by the CSRD founders had intensively studied a variety of new high-radix interconnection networks,[12] [13] built tools to measure the parallelism in sequential programs, designed and built a restructuring compiler (Parafrase) to transform sequential programs into parallel forms, and invented parallel numerical algorithms. During the Parafrase development of the 1970s, several papers were published proposing ideas for expressing and automatically optimizing parallelism.[14] [15] [16] [17] These ideas influenced later compiler work at IBM, Rice U. and elsewhere. Parafrase had been donated to Fran Allen's IBM PTRAN group in the late 1970s, Ken Kennedy had gone there on sabbatical and obtained a Parafrase copy, and Ron Cytron joined the IBM group from UIUC. Also, KAI was founded in 1979 by three Parafrase veterans (Kuck, Bruce Leasure, and Mike Wolfe), who wrote KAP, a new source-to-source restructurer.

The key Cedar idea was to exploit feasible-scale parallelism, by linking together a number of shared memory nodes through an interconnection network and memory hierarchy. Alliant Computer Systems had obtained venture capital funding (in Boston), based on an earlier architecture paper by the CSRD team,[18] and was then shipping systems. The Cedar team was thus immediately able to focus on designing hardware to link 4 Alliant systems and add a global shared memory to the Alliant 8-processor shared memory nodes. In distinction to this, other academic teams of the era pursued massively parallel systems (CalTech, later in cooperation with Intel), fetch-and-add combining networks (NYU), innovative caching (Stanford), dataflow systems (MIT), etc.

In sharp contrast, two decades earlier, the Illiac 4 team required years of work with state of the art industry hardware technology leaders to get the system designed and built. The 1966 industrial hardware proposals for Illiac 4 hardware technology even included a GE Josephson Junction proposal which John Bardeen helped evaluate while he was developing the theory that led to his superconductivity Nobel prize. After contracting with Burroughs Corp to build and integrate an all-transistor hardware system, lengthy discussions ensued about the semiconductor memory design (and schedule slips) with subcontractor Texas Instruments' Jack Kilby (IC inventor and later Nobelist), Morris Chang (later TSMC founder) and others. Earlier Illiac teams had pushed contemporary technologies, with similar implementation problems and delays.

Many attempts at parallel computing startups arose in the decades following Illiac 4, but nothing achieved success until adequate languages and software were developed in the 1970s and 80s. Parafrase veteran Steve Chen joined Cray and led development of the parallel/vector Cray X-MP, released in 1982. The 1990s were a turning point, with many 1980s startups failing, the end of bipolar technology cost-effectiveness, and the general end of academic computer building. By the 2000s, with Intel and others manufacturing massive numbers of systems, shared memory parallelism had become ubiquitous.

CSRD and the Cedar system played key roles in advancing shared memory system effectiveness. Many CSRD innovations of the late 80s (Cedar and beyond) are in common use today, including hierarchical shared memory hardware. Cedar also had parallel Fortran extensions, a vectorizing and parallelizing compiler, and a custom Unix-based OS, which were used to develop advanced parallel algorithms and applications. These are detailed below.

Cedar design and construction

One unusually productive aspect of the Cedar design effort was the ongoing cooperation among the R&D efforts of architects, compiler writers, and application developers. Another was the substantial legacy of ideas and people from the Parafrase project in the 1970s.[19] These enabled the team to focus on several design topics quickly:

The architecture group had a decade of parallel interconnect and memory experience and had already chosen a high-radix shuffle network, so after selecting Alliant as the node manufacturer, custom interfacing hardware was designed in conjunction with Alliant engineers. The compiler team started by designing Cedar Fortran for this architecture, and by modifying the Kuck & Assoc. (KAI) source-to-source translator with Cedar-specific transformations for the Alliant compiler. Having nearly two decades of parallel algorithm experience (starting from Illiac 4), the applications group chose several applications to study, based on emerging parallel algorithms. This was later extended to include some widely used applications that shared the need for the chosen algorithms.[20] Designing, building and integrating the system was then a multi-year effort, including architecture, hardware, compiler, OS and algorithm work.

System Architecture & Hardware

The hardware design led to 3 different types of 24” printed circuit boards, with the network board using CSRD-designed crossbar gate array chips. The boards were assembled into three custom racks in a machine room in Talbot Lab using water-cooled heat exchangers. Cedar’s key architectural innovations and features included:

Language and compiler

In 1984, Fortran was still the standard language of HPC programming, but no standard existed for parallel programming. Building on the ideas of Parafrase and emerging commercial programming methods, Cedar Fortran[23] was designed and implemented for programming Cedar and to serve as the target of the Cedar autoparallelizer.

Cedar Fortran contained a two-level parallel loop hierarchy that reflected the Cedar architecture. Each iteration of outer parallel loops made use of one cluster and a second level parallel loop made use of one of the eight processors of a cluster for each of its iterations. Cedar Fortran also contained primitives for doacross synchronization and control of critical sections. Outer-level parallel loops were initiated, scheduled and synchronized using a runtime library while inner loops relied on Alliant hardware instructions to initiate the loops, schedule and synchronize their iterations.
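The two-level loop hierarchy can be illustrated with a rough analogy in Python. This is not Cedar Fortran syntax and not the Cedar runtime: the thread pools merely stand in for clusters and processors, and the names `scale_rows`, `NUM_CLUSTERS` and `PROCS_PER_CLUSTER` are hypothetical, chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CLUSTERS = 4        # Cedar linked four clusters
PROCS_PER_CLUSTER = 8   # each cluster had eight processors


def scale_rows(matrix, factor):
    """Scale every element of a list-of-lists matrix in place.

    Outer loop (over rows): one iteration per "cluster" worker.
    Inner loop (over columns): one iteration per "processor" worker.
    """
    def inner_body(row, j):
        # Each inner iteration touches a distinct element, so the
        # iterations are independent of one another.
        row[j] *= factor

    def outer_body(i):
        row = matrix[i]
        # Second-level parallel loop, confined to one "cluster".
        with ThreadPoolExecutor(max_workers=PROCS_PER_CLUSTER) as procs:
            list(procs.map(lambda j: inner_body(row, j), range(len(row))))

    # Outer-level parallel loop, spread across the "clusters".
    with ThreadPoolExecutor(max_workers=NUM_CLUSTERS) as clusters:
        list(clusters.map(outer_body, range(len(matrix))))
    return matrix
```

The nesting mirrors the structure described above: the outer pool plays the role of the runtime library that schedules cluster-level iterations, while the inner pool plays the role of the per-cluster hardware scheduling.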

Global variables and arrays were allocated in global memory while those declared local to iterations of outer parallel loops were allocated within clusters. There were no caches between clusters and main memory and therefore, programmers had to explicitly copy from global memory to local memory to attain faster memory accesses. These mechanisms worked well in all cases tested and gave programmers control over processor assignment and memory allocation. As discussed in the next section, numerous applications were implemented in Cedar Fortran.

Cedar compiler work started with the development of a Fortran parallelizer for Cedar built by extending KAP, a vectorizer, which was contributed by KAI to CSRD. Because it was built on a vectorizer, the first modified version of KAP developed at CSRD lacked some important capabilities necessary for an effective translation for multiprocessors, such as array privatization and parallelization of outer loops. Unlike Parafrase (written in PL/1), which ran only on IBM machines, KAP (written in C) ran on many machines (the KAI customer base). To identify the missing capabilities and develop the necessary translation algorithms, a collection of Fortran programs from the Perfect Benchmarks was parallelized by hand.[24] Only techniques that were considered implementable were used in the manual parallelization study. The techniques were later used for a second generation parallelizer that proved effective on collections of programs not used in the manual parallelization study.[25]

Applications and benchmarking

Meanwhile the algorithms/applications group was able to use Cedar Fortran to implement and test algorithms and run them on the four quadrants independently before system integration. The group was focused on developing a library of parallel algorithms and their associated kernels that mainly govern the performance of large-scale computational science and engineering (CSE) applications. Some of the CSE applications that were considered during the Cedar project included: electronic circuit and device simulation, structural mechanics and dynamics, computational fluid dynamics, and the adjustment of very large geodetic networks.

A systematic plan for performance evaluation of many CSE applications on the Cedar platform was outlined in [20] and [26]. In almost all of the above-mentioned CSE applications, dense and sparse matrix computations proved to largely govern the overall performance of these applications on the Cedar architecture. Parallel algorithms that realize high performance on the Cedar architecture were developed for:

In preparing to evaluate candidate hardware building blocks and the final Cedar system, CSRD managers began to assemble a collection of test algorithms; this was described in [20] and later evolved into the Perfect Club.[54] Before that, there were only kernels and focused algorithm approaches (Linpack, NAS benchmarks). In the following decade the idea became popular, especially as many manufacturers introduced high performance workstations, which buyers wanted to compare; SPEC became the workhorse of the field and was followed by many others. SPEC was incorporated in 1988, released its first benchmark suite in 1989 (SPEC89), followed it with SPEC92 in 1992, and formed its High-Performance Group in 1994. (David Kuck and George Cybenko were early advisors, Kuck served on the BoD in the early 90s, and Rudolf Eigenmann drove the SPEC HPG effort, leading to the release of a first high performance benchmark in 1996.)

In a joint effort between the CSRD groups, the Parafrase memory hierarchy loop blocking work of Abu Sufah[55] was exploited for the Cedar cache hierarchy. Several papers were published demonstrating performance enhancement for basic linear algebra algorithms on the Alliant quadrants and Cedar. Jack Dongarra and Danny Sorensen spent a sabbatical at CSRD at the time, which led to this work being carried over into BLAS 3 (extending the simpler BLAS 1 and BLAS 2), a standard that is now widely used.

Cedar conclusion

CSRD had many alumni who went on to important careers in computing. Some left early and others joined late. Among the leaders was UIUC faculty member Dan Gajski, who was affiliated with the CSRD directors in formulating plans and proposals, but left UIUC just before CSRD actually commenced. Another was Mike Farmwald, who joined as an Associate Director for hardware/architecture when Ed Davidson left. Immediately after leaving, Farmwald co-founded Rambus, which continues as a memory design leader. David Padua became Assoc. Director for SW after Duncan Lawrie left, and continued many CSRD projects as a UIUC CS professor. Over time, CSRD researchers became CS and ECE department heads at 5 Big Ten universities.

By 1990, the Cedar system had been completed. The CSRD team was able to scale applications from single clusters to the full 4-cluster system and begin performance measurements. Despite these innovation successes, there was no follow up machine construction project. After the end of the Cedar project, the Stanford DASH/FLASH projects, and the MIT Alewife project around 1995, the era of large, multi-faculty academic machine designs had come to an end. Cedar was a preeminent part of the last wave of such projects. ISCA’s 25th Anniversary Proceedings[56] contain several retrospective papers describing some of the machines in that last wave, including one on Cedar.[57]

About 50 remaining CSRD students, academic professionals and faculty became a research group within the Coordinated Science Laboratory by 1994. For several years, they continued the work initiated in the 1980s, including experimental evaluations of Cedar[58] [59] and continuation of several lines of CSRD compiler research.[60] [25]

Other CSRD contributions

Beyond the core CSRD work of designing, building and using Cedar, many related topics arose. Some were directly motivated by the Cedar project. Many of these had value well beyond Cedar, were pursued well-beyond the official end of CSRD, and were taken up by many academic and industrial groups. Next, the most important such topics are discussed.

Guided Self Scheduling

In the mid 1980s, C. Polychronopoulos developed one of the most influential strategies for the scheduling of parallel loop iterations. The strategy, called Guided Self-Scheduling,[61] schedules the execution of a group of loop iterations each time a processor becomes available. The number of iterations in these groups decreases as the execution of the loop progresses in such a way that the load imbalance is reduced relative to the static or dynamic scheduling techniques used at the time. Guided Self-Scheduling influenced research and practice with numerous citations of the paper introducing the technique and the adoption of the strategy by OpenMP as one of its standard loop scheduling techniques.
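The decreasing group sizes can be made concrete with a short sketch. It assumes the commonly cited form of the rule, in which an idle processor takes the ceiling of (remaining iterations / number of processors); the function name `gss_chunks` is hypothetical, and the sketch ignores refinements such as a minimum chunk size.

```python
import math


def gss_chunks(num_iterations, num_processors):
    """Return the successive chunk sizes handed out to idle processors.

    Assumed rule: each request receives ceil(remaining / num_processors)
    iterations, so chunks shrink as the loop nears completion.
    """
    chunks = []
    remaining = num_iterations
    while remaining > 0:
        size = math.ceil(remaining / num_processors)
        chunks.append(size)
        remaining -= size
    return chunks


# 100 iterations on 4 processors: large chunks first, single iterations last.
print(gss_chunks(100, 4))
# → [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
```

The large early chunks keep scheduling overhead low, while the small late chunks let processors that finish early pick up the remaining work, which is what reduces load imbalance at the end of the loop.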

Approximation by superpositions of a sigmoidal function

In the mid to late 1980s, the so-called “Parallel Distributed Processing” (PDP) effort[62] recast earlier generations of neural computation by demonstrating effective machine learning algorithms and neural architectures. The computing paradigm, far removed from traditional von Neumann computer architecture, demonstrated that PDP approaches and algorithms could address a variety of application problems in novel ways. However, it was not known what kinds of problems could be solved using such massively parallel neural network architectures. In 1989, CSRD researcher George Cybenko demonstrated that even the simplest nontrivial neural network had the representational power to approximate a wide variety of functions, including categorical classifiers and continuous real-valued functions.[63] That work was seminal in that it showed that, in principle, neural machines based on biological nervous systems could effectively emulate any input-output relationship that was computable by traditional machines. As a result, Cybenko’s result has often been called the “Universal Approximation Theorem” in the literature. The proof of that result relied on advanced functional analysis techniques and was not constructive. Even so, it gave rigorous justification for generations of neural network architectures, including deep learning[64] and large language models[65] in wide use in the 2020s. While Cybenko’s Universal Approximation Theorem addressed the capabilities of neural-based computing machines, it was silent on the ability of such architectures to effectively learn their parameter values from data. Cybenko and CSRD colleagues Sirpa Saarinen and Randall Bramley subsequently studied the numerical properties of neural networks, which are typically trained using stochastic gradient descent and its variants. They observed that neurons saturate when network parameters are very negative or very positive, leading to arbitrarily small gradients, which in turn result in optimization problems that are numerically poorly conditioned.[66] This property has been called the “vanishing gradient” problem in machine learning.[67]
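Both ideas can be illustrated numerically with a small sketch, assuming the logistic function as the sigmoid and a one-dimensional input. The functions approximated in the theorem are finite sums of the form G(x) = Σ_j α_j σ(w_j x + θ_j); the helper names and the particular coefficient values below are illustrative choices, not taken from the cited papers.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as the sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def superposition(x, alphas, weights, thetas):
    """Evaluate G(x) = sum_j alpha_j * sigmoid(w_j * x + theta_j)."""
    return sum(a * sigmoid(w * x + t)
               for a, w, t in zip(alphas, weights, thetas))


# Two sigmoids with opposite signs form a "bump" that is close to 1 on
# the interval [1, 2] and close to 0 elsewhere (illustrative values).
bump = lambda x: superposition(x, [1.0, -1.0], [10.0, 10.0], [-10.0, -20.0])

# Saturation: the gradient is largest at 0 and shrinks rapidly as |z| grows.
gradients = [sigmoid_grad(z) for z in (0.0, 5.0, 10.0, 20.0)]
```

Sums of such bumps can approximate a continuous function piece by piece, which gives some intuition for the approximation result, while the rapidly shrinking entries of `gradients` show the saturation behind poorly conditioned training problems.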

BLAS 3

The Basic Linear Algebra Subprograms (BLAS) are among the most important mathematical software achievements. They are essential components of LINPACK and versions are used by every major vendor of computer hardware. The BLAS library was developed in three different phases. BLAS 1 provided optimized implementations for basic vector operations; BLAS 2 contributed matrix-vector capabilities to the library. BLAS 3 involves optimizations for matrix-matrix operations. The multi-cluster shared memory architecture of Cedar inspired a great deal of library optimization research involving cache locality and data reuse for matrix operations of this type. The official BLAS 3 standard was published in 1990 as [68]. This was inspired, in part, by [33]. Additional CSRD research on data management for complex memory systems followed, and some of the more theoretical work was published as [69] and [70]. The performance impact of these algorithms when running on Cedar is reported in [71].
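The data-reuse idea behind matrix-matrix kernels can be sketched with a blocked (tiled) matrix multiplication. This is a minimal pure-Python illustration of loop blocking, not the BLAS 3 interface or any CSRD implementation; the function name `blocked_matmul` and the `block` parameter are hypothetical.

```python
def blocked_matmul(a, b, block=2):
    """Multiply list-of-lists matrices a (n x k) and b (k x m) by tiles.

    Each (ii, jj, kk) step works on small sub-blocks of a, b and c, so
    that on a real machine the blocks can stay in fast local memory and
    be reused many times before being evicted.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, m, block):
            for kk in range(0, k, block):
                # Update one block of c using one block each of a and b.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, m)):
                        acc = c[i][j]
                        for p in range(kk, min(kk + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

A matrix-matrix product performs on the order of n³ arithmetic operations on n² data, which is why blocking pays off far more for this class of operations than for vector or matrix-vector operations.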

OpenMP

Beyond CSRD, the many parallel startup companies of the 1980s created a profusion of ad hoc parallel programming styles, based on various process and thread models. Subsequently, many parallel language and compiler ideas were proposed, including compilers for Cray Fortran, KAI-based source-to-source optimizers, etc. Some of these tried to create product differentiation advantages, but largely went contrary to user desires for performance portability. By the late 1980s, KAI started a standardization effort that led to the ANSI X3H5 draft standard,[72] which was widely adopted.

In the 1990s, after CSRD, these ideas influenced KAI in auto-parallelization, and soon another round of standardization was begun. By 1996 KAI had SGI as a customer and they joined the effort to form the OpenMP consortium; the OpenMP Architecture Review Board was incorporated in 1997 with a growing collection of manufacturers. KAI also developed parallel performance and thread checking tools, which Intel bought with its purchase of KAI in 2000. Many KAI staff members remain, and the Intel development continues, directly inherited from Parafrase and CSRD. Today, OpenMP is the industry standard shared memory programming API for C/C++ and Fortran.

Speculative parallelization

For his PhD thesis, Rauchwerger introduced[73] an important paradigm shift in the analysis of program loops for parallelization. Instead of first validating the transformation into parallel form through a priori analysis, either statically by the compiler or dynamically at runtime, the new paradigm speculatively parallelized the loop and then checked its validity. This technique, named “speculative parallelization”, executes a loop in parallel and subsequently tests whether any data dependences could have occurred. If this validation test fails, then the loop is re-executed in a safe manner, starting from a safe state, e.g., sequentially from a previous checkpoint. This approach is known as the LRPD Test (Lazy Reduction and Privatization Doall Test). Briefly, the LRPD test records the shared memory references of the loop in “shadow” structures and then, after loop execution, analyzes them for dependence patterns. This pioneering contribution has been quite influential and has been applied throughout the years by many researchers from CSRD or elsewhere.
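The speculate-then-validate pattern can be sketched as follows. This is a heavily simplified illustration, not the LRPD test itself: it omits reduction and privatization handling, it stands in for parallel execution by running iterations in reverse order, and the name `speculative_doall` and its accessor-based interface are hypothetical.

```python
def speculative_doall(body, data, num_iters):
    """Run a loop speculatively, then validate it with shadow structures.

    body(i, read, write) must access `data` only through the two
    accessors, so every reference can be recorded. Returns "parallel"
    if the speculation was valid, otherwise "sequential" after rollback.
    """
    checkpoint = list(data)   # safe state to fall back to
    writers = {}              # shadow: index -> iterations that wrote it
    readers = {}              # shadow: index -> iterations that read it

    def run(i, record):
        def read(idx):
            if record:
                readers.setdefault(idx, set()).add(i)
            return data[idx]

        def write(idx, value):
            if record:
                writers.setdefault(idx, set()).add(i)
            data[idx] = value

        body(i, read, write)

    # Speculative phase: a non-sequential order stands in for parallelism.
    for i in reversed(range(num_iters)):
        run(i, record=True)

    # Validation: a written element must have exactly one writer and must
    # not have been read by any other iteration.
    conflict = any(
        len(ws) > 1 or bool(readers.get(idx, set()) - ws)
        for idx, ws in writers.items()
    )
    if not conflict:
        return "parallel"

    # Speculation failed: restore the checkpoint and re-execute in order.
    data[:] = checkpoint
    for i in range(num_iters):
        run(i, record=False)
    return "sequential"
```

A loop whose iterations each touch only their own element passes validation, while a loop in which iteration i reads the element written by iteration i-1 is detected, rolled back and re-executed sequentially.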

Race detection

In 1987, Allen pioneered the use of memory traces for the detection of race conditions in parallel programs.[74] Race conditions are defects of parallel programs that manifest as different outcomes for different executions of the same program on the same input data. Because of their dynamic nature, race conditions are difficult to detect, and the techniques introduced by Allen and expanded in [75] are among the best strategies known to cope with this problem. The strategy has been highly influential, with numerous researchers working on the topic during the last decades. The technique has been incorporated into numerous experimental and commercial tools, including Intel's Inspector.
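The general idea of finding races in a memory trace can be sketched as below. This is a simplified illustration under stated assumptions, not the method of the cited papers: it assumes that barriers are the only synchronization, that each trace record carries the number of barriers its thread has passed (its "epoch"), and the record layout and the name `find_races` are hypothetical.

```python
from collections import defaultdict


def find_races(trace):
    """Report (epoch, address) pairs with conflicting accesses.

    trace: iterable of (thread, epoch, address, op), where op is "r" or
    "w". Two accesses conflict if they come from different threads, fall
    in the same epoch (no barrier orders them), touch the same address,
    and at least one of them is a write.
    """
    by_location = defaultdict(list)
    for thread, epoch, address, op in trace:
        by_location[(epoch, address)].append((thread, op))

    races = set()
    for location, accesses in by_location.items():
        for a in range(len(accesses)):
            for b in range(a + 1, len(accesses)):
                (t1, op1), (t2, op2) = accesses[a], accesses[b]
                if t1 != t2 and "w" in (op1, op2):
                    races.add(location)
    return sorted(races)
```

In this model, a write and a read of the same variable by two threads between the same pair of barriers is reported, whereas concurrent reads, or accesses separated by a barrier, are not.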

Contributions to Benchmarking – SPEC

One of CSRD’s thrusts was to develop metrics able to evaluate both hardware and software systems using real applications. To this end, the Perfect Benchmarks [54] provided a set of computational applications, collected from various science domains, which were used to evaluate and drive the study of the Cedar system and its compilers. In 1994, members of CSRD and the Standard Performance Evaluation Corporation (SPEC) expanded on this thrust, forming the SPEC High-Performance Group. This group released a first real-application SPEC benchmark suite, SPEC HPC 96. SPEC has been continuing the development of benchmarks for high-performance computing to this date, a recent suite being SPEChpc 2021. With CSRD’s influence, the SPEC High-Performance Group also prompted a close collaboration of industrial and academic participants. A joint workshop in 2001 on Real-Application Benchmarking [76] founded a workshop series, eventually leading to the formation of the SPEC Research Group, which in turn co-initiated the now annual ACM/SPEC International Conference on Performance Engineering.

Parallel Programming Tools

Funded by DARPA, the HPC++ project[77] [78] was led by Dennis Gannon and Allen Malony, together with postdocs Francois Bodin (from William Jalby’s group in Rennes) and Peter Beckman (now at Argonne National Lab). This work grew out of a collaboration between Malony, Gannon and Jalby that began at CSRD. HPC++ is based on extensions to the C++ standard template library to support a number of parallel programming scenarios, including single-program-multiple-data (SPMD) and Bulk Synchronous Parallel, on both shared memory and distributed memory parallel systems. The most significant outcome of this collaboration was the development of the TAU Parallel Performance System. Originally developed for HPC++, it has become a standard for measuring, visualizing and optimizing parallel programs for nearly all programming languages and is available for all parallel computing platforms. It supports various programming interfaces such as OpenCL, DPC++/SYCL, OpenACC, and OpenMP. It can also gather performance information of GPU computations from different vendors such as Intel and NVIDIA. TAU has been used for many HPC applications and projects.

Applications

The Cedar project has strongly influenced the research activities of many of CSRD’s faculty members long after the end of the project. After the termination of the Cedar project, the first task undertaken by three members of Cedar’s Algorithm and Application group (A. Sameh, E. Gallopoulos, and B. Philippe) was documenting the parallel algorithms developed, and published in a variety of journals and conference proceedings, during the lifetime of the project. The result was a graduate textbook: “Parallelism in Matrix Computations” by E. Gallopoulos, B. Philippe, and A. Sameh, published by Springer, 2016.[79] The parallel algorithm development experience gained by one of the members of the Cedar project (A. Sameh) proved to be of great value in his research activities after leaving UIUC. He used many of these parallel algorithms in joint research projects:

• fluid-particle interaction with the late Daniel Joseph (a member of the National Academy of Sciences and faculty member in Aerospace Engineering at the University of Minnesota, Twin Cities),

• fluid-structure interaction with Tayfun Tezduyar (Mechanical Engineering at Rice University),

• computational nanoelectronics with Mark Lundstrom (Electrical & Computer Engineering at Purdue University).

These activities were followed, in 2020, by a Birkhäuser volume (edited by A. Grama and A. Sameh) in two parts: part I covers some recent advances in high performance algorithms, and part II covers selected challenging computational science and engineering applications.[80]

Compiler assisted cache coherence

Cache coherence is a key problem in building shared memory multiprocessors. It was traditionally implemented in hardware via coherence protocols. However, the advent of systems like Cedar allowed one to consider a compiler-assisted implementation of cache coherence for parallel programs,[81] with minimal and completely local hardware support. Where a hardware coherence protocol like MESI relies on remote invalidation of cache lines, a compiler-assisted protocol performs a local self-invalidation as directed by a compiler. CSRD researchers developed several different approaches to compiler-assisted coherence,[82] [83] [84] including a scheme with directory assistance.[85] All these schemes performed a post-invalidation at the end of a parallel region. This work has influenced research, with numerous citations across decades until today.[86] [87]

Compilers for GPUs

Early CSRD work on program optimization for classical parallel computers also spurred developments of languages and compilers for more specialized accelerators, such as Graphics Processing Units (GPUs). For example, in the late 2000s, CSRD researcher Rudolf Eigenmann developed translation methods for compilers that enabled programs written in the standard OpenMP programming model to be executed efficiently on GPUs.[88] [89] [90] Until then, GPUs had been programmed primarily in the specialized CUDA language. The new methods showed that high-level programming of GPUs was feasible not only for classical computational applications, but also for certain types of problems that exhibited irregular program patterns. This work incentivized further initiatives toward high-level programming models for GPUs and accelerators in general, such as OpenACC and OpenMP for accelerators. In turn, these initiatives contributed to the use of GPUs for a wide range of computational problems, including neural networks for deep learning, whose mathematical foundation was studied by Cybenko as discussed above.

Notes and References

  1. https://www.nsf.gov/news/special_reports/nsf-net/images/lax_report_1982.pdf Report of the Panel on Large Scale Computing in Science and Engineering
  2. Otis Port. Superfast computers: you ain’t seen nothin’ yet, Parallel processing will leave today’s speediest machines in the dust, Business Week, Science and Technology section, pp. 91,92, August 26, 1985.
  3. David E. Sanger. BREAKING A COMPUTER BARRIER Three Ways to Speed Up Computers, Special To the New York Times, Sept. 9, 1985.
  4. Alexander Wolfe. Full speed ahead for software, Electronics, Mar. 10 1986.
  5. Donna K. H. Walters. A New Breed of Computers : Mini-Supers at Cutting Edge of Technology, Los Angeles Times, April 27, 1986 12 AM PT.
  6. Philip Elmer-DeWitt author, Thomas McCarroll, J. Madeline Nash, Charles Pelton reporters. Fast and Smart, Computers of the Future, Cover story, Time Magazine, pp. 54-58, March 28, 1988.
  7. William Allen, Sci Editor. UI team brings Cedar system up to speed, Illini Week, Oct. 27, 1988.
  8. John Markoff. Measuring how fast computers really are, The New York Times, Sep 22, 1991.
  9. Kevin Kelly, Joseph Weber, Janin Friend, Sandra Atchison, Gail DeGeorge, William J. Holstein. Hot Spots – America’s new growth regions, Business Week, Oct. 19, 1992.
  10. John Markoff. A New Standard to Govern PC's With Multiple Chips, New York Times, Oct. 28, 1997.
  11. Jon Van, Science writer. U. OF I. SUPERCOMPUTER FUNDING OKD, Chicago Tribune, Feb 04, 1985.
  12. Lawrie, Duncan H. (December 1975). "Access and Alignment of Data in an Array Processor". IEEE Transactions on Computers. C-24 (12): 1145–55.
  13. Pin-Yee Chen, Duncan H. Lawrie, David A. Padua, Pen-Chung Yew: Interconnection Networks Using Shuffles. Computer 14(12): 55-64 (1981)
  14. David A. Padua, Michael Wolfe: Advanced Compiler Optimizations for Supercomputers. Commun. ACM 29(12): 1184-1201 (1986)
  15. David J. Kuck, Robert H. Kuhn, David A. Padua, Bruce Leasure, Michael Wolfe: Dependence Graphs and Compiler Optimizations. POPL 1981: 207-218
  16. Michael Joseph Wolfe 1995. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., USA.
  17. Ron Cytron: Doacross: Beyond Vectorization for Multiprocessors. ICPP 1986: 836-844
  18. D. A. Padua, D. J. Kuck and D. H. Lawrie. "High-Speed Multiprocessors and Compilation Techniques", Special Issue on Parallel Processing, IEEE Trans. on Computers, Vol. C-29, No. 9, pp. 763-776, Sept., 1980.
  19. Bruce Leasure: Parafrase. Encyclopedia of Parallel Computing 2011: 1407-1409.
  20. David Kuck and Ahmed Sameh. “A Supercomputing Performance Evaluation Plan”. In: Lecture Notes in Computer Science No. 297: Proc. of First Int'l. Conf. on Supercomputing, Athens, Greece T.S. Papatheodorou E.N. Houstis C.D. Polychronopoulos, editors, Springer-Verlag, New York, NY, pp. 1--17, 1987.
  21. Fujiki, Daichi. In-/near-Memory Computing. Cham, Switzerland: Springer, 2021. ISBN 3-031-01772-2; ISBN 3-031-00644-5
  22. Pen-Chung Yew, Nian-Feng Tzeng and Duncan H. Lawrie, "Distributing Hot-Spot Addressing in Large-Scale Multiprocessors," in IEEE Transactions on Computers, vol. C-36, no. 4, pp. 388-395, April 1987, doi: 10.1109/TC.1987.1676921.
  23. Guzzi, Mark D., David A. Padua, Jay P. Hoeflinger, and Duncan H. Lawrie. 1990. “Cedar Fortran and other vector and parallel Fortran dialects.” The Journal of Supercomputing 4 (1): 37-62. https://doi.org/10.1007/BF00162342.
  24. Eigenmann, Rudolf, Jay P. Hoeflinger, Greg P. Jaxon, Zhiyuan Li, and David A. Padua. 1993. “Restructuring Fortran programs for Cedar.” Concurrency - Practice and Experience 5 (7): 553-573. https://doi.org/10.1002/cpe.4330050704.
  25. Blume, William, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay P. Hoeflinger, Thomas Lawrence, Jaejin Lee, et al. 1996. “Parallel Programming with Polaris.” Computer 29 (12): 78-82. https://doi.org/10.1109/2.546612.
  26. D. Kuck, E. Davidson, D. Lawrie, A. Sameh, et al. “The Cedar System and an Initial Performance Study”, Proceedings of the 20-th International Symposium on Computer Architecture, pp. 213-223, San Diego, CA, May 16--19, 1993
  27. A. Sameh. “An Overview of Parallel Algorithms in Numerical Linear Algebra”. First International Colloquium on Vector and Parallel Computing in Scientific Applications, EDF Bulletin de la Direction des Etudes et Recherches-Series C, Mathematique, Informatique No. 1, pp. 129--134, March 1983.
  28. A. H. Sameh. “On Two Numerical Algorithms for Multiprocessors”. Proceedings of NATO Advanced Research Workshop on High-Speed Computing, (Series F: Computer and Systems Sciences, Vol. 7), Springer-Verlag, pp. 311--328, 1983.
  29. D. H. Lawrie and A. H. Sameh. “Applications of Structural Mechanics on Large-Scale Multiprocessor Computers”. Symposium on the Impact of New Computing Systems on Computational Mechanics, Winter Annual ASME Meeting, pp. 55--64, November 1983.
  30. D. H. Lawrie and A. H. Sameh. “The Computation and Communication Complexity of a Parallel Banded System Solver”. ACM Transactions on Mathematical Software, Vol. 10, No. 2, pp. 185--195, 1984.
  31. C. Kamath and A. Sameh. “The Preconditioned Conjugate Gradient Algorithm on a Multiprocessor”. Fifth IMACS International Symposium on Computer Methods for Partial Differential Equations, pp. 210--217, June, 1984.
  32. A. H. Sameh. “A Fast Poisson Solver on Multiprocessors”. In: Elliptic Problem Solvers II. Academic Press, pp. 175--186, 1984.
  33. M. Berry, K. Gallivan, W. Harrod, W. Jalby, S. Lo, U. Meier, B. Philippe and A. H. Sameh. “Parallel Algorithms on the CEDAR System”. In: CONPAR 86, Lecture Notes in Computer Science, W. Handler et al., editors, Springer-Verlag, pp. 25--39, 1986.
  34. E. Davidson, D. Kuck, D. Lawrie and A. Sameh. “Supercomputing Tradeoffs and the Cedar System”. In: High-Speed Computing: Scientific Applications and Algorithm Design, R. Wilhelmson, editor, University of Illinois Press, pp. 3--11, 1986.
  35. G. H. Golub, R. J. Plemmons and A. Sameh. “Parallel Block Schemes for Large Scale Least Squares Computations”. In: High-Speed Computing: Scientific Application and Algorithm Design, R. Wilhelmson, editor, University of Illinois Press, pp. 171--179, 1986.
  36. J. L. Larson, I. C. Kizilyalli, K. Hess, A. Sameh and D. J. Widiger. “Two-Dimensional Model for the HEMT”. In: Large Scale Computational Device Modeling, K. Hess, editor, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, pp. 131--145, 1986.
  37. Hsin-Chu Chen and Ahmed Sameh. “Numerical Linear Algebra Algorithms on the Cedar System”. In: Parallel Computations and their Impact on Mechanics, A. Noor, editor, The American Society of Mechanical Engineering, pp. 101--125, 1987.
  38. Ulrike Meier and Ahmed Sameh. “Numerical Linear Algebra on the CEDAR Multiprocessor”. Proceedings of SPIE Conf. on Advanced Algorithms & Architectures for Signal Processing II, San Diego, CA, Vol. 826, pp. 1--9, August 1987.
  39. Randall Bramley and Ahmed Sameh. “A Robust Parallel Solver for Block Tridiagonal Systems”. Proceedings of 1988 International Conference on Supercomputing, St. Malo, France, pp. 39--54, ACM Press, 1988.
  40. K. Gallivan and A. Sameh. “Matrix Computations on Shared-Memory Multiprocessors”. In: The Application of Advanced Computing Concepts and Techniques in Control Engineering, NATO ASI SERIES, M.J. Denham and A.J. Laub, editors, Springer-Verlag, pp. 289--359, 1988.
  41. Hsin-Chu Chen and Ahmed Sameh. “A Domain Decomposition Method for 3D Elasticity Problems”. Proceedings of the First International Conference on Applications of Supercomputers in Engineering, C. Brebbia and A. Peters, editors, pp. 171--188, North-Holland, 1989.
  42. E. Gallopoulos and Ahmed Sameh. “Solving Elliptic Equations on the Cedar Multiprocessor”. In: Aspects of Computation on Asynchronous Parallel Processors, M.H. Wright, editor, Elsevier Science Publishers B.V. (North-Holland), pp. 1--12, 1989.
  43. K. Gallivan, E. Ng, B. Peyton, R. Plemmons, J. Ortega, C. Romine, A. Sameh and R. Voigt. Parallel Algorithms for Matrix Computations, SIAM Publications, Philadelphia, PA, 1990.
  44. Randall Bramley, Hsin-Chu Chen, Ulrike Meier and Ahmed Sameh. “On Some Parallel Preconditioned CG Schemes”. Lecture Notes in Mathematics, Preconditioned Conjugate Gradient Methods, O. Axelsson, editor, Springer-Verlag, 1990.
  45. Kyle Gallivan, Ahmed Sameh and Zahari Zlatev. “Solving General Sparse Linear Systems Using Conjugate Gradient-type Methods”. Proceedings of the 1990 Int'l Conf. on Supercomputing, Amsterdam, the Netherlands, pp. 132--139, ACM Press, 1990.
  46. K. Gallivan, E. Gallopoulos and A. Sameh. “Cedar: An Experiment in Parallel Computing”, Computer Mathematics and its Applications, Vol. 1, No. 1, pp. 77--98, 1994.
  47. M. Naumov, M. Manguoglu, and A.H. Sameh, “A tearing-based hybrid parallel system solver,” Journal of Computational and Applied Mathematics 234 (2010) pp. 3025-3038.
  48. Sy-Shin Lo, Bernard Philippe and Ahmed H. Sameh. “A Multiprocessor Algorithm for the Symmetric Tridiagonal Eigenvalue Problem”. SIAM Journal on Scientific and Statistical Computing, Vol. 8, No. 2, pp. s155--s165, March, 1987.
  49. Michael Berry and Ahmed Sameh. “Multiprocessor Jacobi Algorithms for Dense Symmetric Eigenvalue and Singular Value Decompositions”. Proceedings of the 1986 Int'l Conf. on Parallel Processing, St. Charles, IL, pp. 433--440, Aug. 19--22, 1986.
  50. Michael Berry and Ahmed Sameh. “A Multiprocessor Scheme for the Singular Value Decomposition”. In: Parallel Processing for Scientific Computing, G. Rodrigue, editor, SIAM, pp. 67--71, 1987.
  51. Ahmed H. Sameh and John Wisniewski. “A Trace Minimization Algorithm for the Generalized Eigenvalue Problem”. SIAM Journal on Numerical Analysis, Vol. 19, No. 6, pp. 1243--1259, 1982.
  52. C. Carey, H.–C. Chen, G. Golub and A. Sameh, “A New Approach for solving Symmetric Eigenvalue Problems”, Computing Systems in Engineering, Vol. 3, Issue 6, December 1992, pp. 671-679.
  53. M. Berry, B. Parlett and A. Sameh. “Computing Extremal Singular Triplets of Sparse Matrices on a Shared-Memory Multiprocessor”, International Journal on High Speed Computing, Vol. 6, No. 2, pp. 239--275, 1994.
  54. George Cybenko, Lyle D. Kipp, Lynn Pointer, David J. Kuck: Supercomputer performance evaluation and the Perfect Benchmarks. ICS 1990: 254-266
  55. Walid A. Abu-Sufah, David J. Kuck, Duncan H. Lawrie: On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Trans. Computers 30(5): 341-356 (1981).
  56. ISCA. 25 years of the international symposia on Computer architecture (selected papers). Association for Computing Machinery, New York, NY, USA. 1998
  57. A. Veidenbaum, P.-C. Yew, D. J. Kuck, C. D. Polychronopoulos, D. A. Padua, E. S. Davidson, and K. Gallivan. 1998. Retrospective: the Cedar system. In 25 years of the international symposia on Computer architecture (selected papers) (ISCA '98). Association for Computing Machinery, New York, NY, USA, 89–91. https://doi.org/10.1145/285930.285965
  58. Josep Torrellas, David A. Koufaty, David A. Padua: Comparing the Performance of the DASH and CEDAR Multiprocessors. ICPP (2) 1994: 304-308
  59. Josep Torrellas, Zheng Zhang: The Performance of the Cedar Multistage Switching Network. IEEE Trans. Parallel Distributed Syst. 8(4): 321-336 (1997)
  60. Mohammad R. Haghighat, Constantine D. Polychronopoulos: Symbolic Analysis for Parallelizing Compilers. ACM Trans. Program. Lang. Syst. 18(4): 477-518 (1996)
  61. C. D. Polychronopoulos and D. J. Kuck, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," in IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1425-1439, Dec. 1987, doi: 10.1109/TC.1987.5009495.
  62. McClelland, James L., David E. Rumelhart, and PDP Research Group. Parallel distributed processing. Vol. 2. Cambridge, MA: MIT press, 1986.
  63. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314, 1989.
  64. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553, 436-444 (2015).
  65. https://openai.com/blog/chatgpt ChatGPT
  66. Saarinen, Sirpa, Randall Bramley, and George Cybenko. "Ill-conditioning in neural network training problems." SIAM Journal on Scientific Computing 14.3 (1993): 693-714.
  67. Sjöberg, Jonas, et al. "Nonlinear black-box modeling in system identification: a unified overview." Automatica 31.12 (1995): 1691-1724.
  68. Jack Dongarra, Jeremy du Croz, Sven Hammarling and Iain Duff. “A Set of Level 3 Basic Linear Algebra Subprograms”, in the ACM Transactions on Mathematical Software Volume 16 Issue 1, pp. 1–17.
  69. Dennis Gannon, William Jalby, Kyle A. Gallivan. “Strategies for Cache and Local Memory Management by Global Program Transformation”. J. Parallel Distributed Comput. 5(5): 587-616 (1988) also ICS 1987: 229-254.
  70. Kyle A. Gallivan, William Jalby, Dennis Gannon. “On the problem of optimizing data transfers for complex memory systems”, ICS 1988: 238-253.
  71. https://www.osti.gov/biblio/5005497/ Cedar project: Original goals and progress to date
  72. US Fortran Standards Committee. Technical Committee X3H5. X3H5 Parallel Extensions for Fortran April 2, 1993, https://j3-fortran.org/doc/year/93/93-x3h5-SD2-A.pdf
  73. Lawrence Rauchwerger and David A. Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. In Proceedings of the SIGPLAN 1995 Conference on Programming Language Design and Implementation, La Jolla, CA, pages 218-232, June 1995.
  74. Todd R. Allen, David A. Padua: Debugging Parallel Fortran on a Shared Memory Machine. International Conference on Parallel Processing 1987: 721-727
  75. Perry A. Emrath, Sanjoy Ghosh, David A. Padua. Detecting Nondeterminacy in Parallel Programs. IEEE Software 9(1): 69-77 (1992)
  76. Rudolf Eigenmann (Ed.), Performance Evaluation and Benchmarking with Realistic Applications, MIT Press, ISBN 9780262050661, 2001.
  77. Gannon, D., Beckman, P., Johnson, E., Green, T., Levine, M. (2001). HPC++ and the HPC++Lib Toolkit. In: Pande, S., Agrawal, D.P. (eds) Compiler Optimizations for Scalable Parallel Systems. Lecture Notes in Computer Science, vol 1808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45403-9_3
  78. Peter H. Beckman and Dennis Gannon and Elizabeth Johnson. Portable parallel programming in HPC++. 1996 Proceedings ICPP Workshop on Challenges for Parallel Processing. 1996, pp. 132-139.
  79. E. Gallopoulos, B. Philippe, and A.H. Sameh, Parallelism in Matrix Computations, Springer 2016.
  80. A. Grama and A.H. Sameh, editors. Parallel Algorithms in Computational Science and Engineering, Birkhauser 2020.
  81. Alexander V. Veidenbaum: A Compiler-Assisted Cache Coherence Solution for Multiprocessors. In Proceedings of ICPP, 1986. https://www.osti.gov/biblio/7065831
  82. Hoichi Cheong, Alexander V. Veidenbaum: “A cache coherence scheme with fast selective invalidation,” Proceedings of the 15th Annual International Symposium on Computer architecture. June 1988, pp 299–307.
  83. Hoichi Cheong, Alexander V. Veidenbaum: “A version control approach to Cache coherence,” Proceedings of the International Conference on Supercomputing, June 1989, pp 322–330. https://doi.org/10.1145/318789.318824
  84. Hoichi Cheong: “Life span strategy - a compiler-based approach to cache coherence,” Proceedings of the International Conference on Supercomputing (ICS), June 1992, pp. 139-148.
  85. Yung-chin Chen, Alexander V Veidenbaum: “A software coherence scheme with the assistance of directories,” Proceedings of the International Conference on Supercomputing, June 1991, pp. 284-294.
  86. T. J. Ashby, P. Díaz and M. Cintra, "Software-Based Cache Coherence with Hardware-Assisted Selective Self-Invalidations Using Bloom Filters," in IEEE Transactions on Computers, vol. 60, no. 4, pp. 472-483, April 2011, doi: 10.1109/TC.2010.155.
  87. Michael Wilkins, Sam Westrick, Vijay Kandiah, Alex Bernat, Brian Suchy, Enrico Armenio Deiana, Simone Campanoni, Umut A. Acar, Peter Dinda, and Nikos Hardavellas. 2023. WARDen: Specializing Cache Coherence for High-Level Parallel Languages. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2023). Association for Computing Machinery, New York, NY, USA, 122–135. https://doi.org/10.1145/3579990.3580013
  88. Seyong Lee, Seung-Jai Min and Rudolf Eigenmann, “OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization,” in PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009, pages 101-110.
  89. Seyong Lee and Rudolf Eigenmann. "OpenMPC: Extended OpenMP programming and tuning for GPUs." SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2010.
  90. Sabne, Amit, Putt Sakdhnagool, and Rudolf Eigenmann. "Scaling large-data computations on multi-GPU accelerators." Proceedings of the 27th international ACM conference on International conference on supercomputing. 2013.