Monthly Archives: April 2017

Making chip memory faster and more efficient

For decades, computer chips have increased efficiency by using “caches,” small, local memory banks that store frequently used data and cut down on time- and energy-consuming communication with off-chip memory.

Today’s chips generally have three or even four different levels of cache, each of which is more capacious but slower than the last. The sizes of the caches represent a compromise between the needs of different kinds of programs, but it’s rare that they’re exactly suited to any one program.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory have designed a system that reallocates cache access on the fly, to create new “cache hierarchies” tailored to the needs of particular programs.

The researchers tested their system on a simulation of a chip with 36 cores, or processing units. They found that, compared to its best-performing predecessors, the system increased processing speed by 20 to 30 percent while reducing energy consumption by 30 to 85 percent.

“What you would like is to take these distributed physical memory resources and build application-specific hierarchies that maximize the performance for your particular application,” says Daniel Sanchez, an assistant professor in the Department of Electrical Engineering and Computer Science (EECS), whose group developed the new system.

“And that depends on many things in the application. What’s the size of the data it accesses? Does it have hierarchical reuse, so that it would benefit from a hierarchy of progressively larger memories? Or is it scanning through a data structure, so we’d be better off having a single but very large level? How often does it access data? How much would its performance suffer if we just let data drop to main memory? There are all these different tradeoffs.”

Sanchez and his coauthors — Po-An Tsai, a graduate student in EECS at MIT, and Nathan Beckmann, who was an MIT graduate student when the work was done and is now an assistant professor of computer science at Carnegie Mellon University — presented the new system, dubbed Jenga, at the International Symposium on Computer Architecture last week.

Staying local

For the past 10 years or so, improvements in computer chips’ processing power have come from the addition of more cores. The chips in most of today’s desktop computers have four cores, but several major chipmakers have announced plans to move to six cores in the next year or so, and 16-core processors are not uncommon in high-end servers. Most industry watchers assume that the core count will continue to climb.

Each core in a multicore chip usually has two levels of private cache. All the cores share a third cache, which is actually broken up into discrete memory banks scattered around the chip. Some new chips also include a so-called DRAM cache, which is etched into a second chip that is mounted on top of the first.

For a given core, accessing the nearest memory bank of the shared cache is more efficient than accessing more distant banks. Unlike today’s cache management systems, Jenga distinguishes between the physical locations of the separate memory banks that make up the shared cache. For each core, Jenga knows how long it would take to retrieve information from any on-chip memory bank, a measure known as “latency.”
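As a rough illustration of that per-core bookkeeping, here is a minimal sketch in Python. It assumes a simple 6-by-6 mesh of tiles in which access cost grows with the number of network hops between a core and a bank; the layout and constants are invented for illustration, not Jenga's actual numbers.

```python
# Minimal sketch (not Jenga's implementation): estimate per-core access latency
# for each shared-cache bank on a hypothetical 6x6 mesh, where latency grows
# with the network distance between a core's tile and a bank's tile.

MESH_DIM = 6          # 36 cores, one shared-cache bank per tile (assumption)
BASE_LATENCY = 10     # cycles to access the local bank (illustrative number)
HOP_LATENCY = 2       # extra cycles per network hop (illustrative number)

def tile_coords(tile_id):
    """Map a tile index to (row, column) coordinates on the mesh."""
    return divmod(tile_id, MESH_DIM)

def bank_latency(core_id, bank_id):
    """Estimated cycles for `core_id` to reach the bank on tile `bank_id`."""
    (r1, c1), (r2, c2) = tile_coords(core_id), tile_coords(bank_id)
    hops = abs(r1 - r2) + abs(c1 - c2)   # Manhattan distance on the mesh
    return BASE_LATENCY + HOP_LATENCY * hops

# Latency table: for every core, the cost of reaching every on-chip bank.
latency = {
    core: {bank: bank_latency(core, bank) for bank in range(MESH_DIM ** 2)}
    for core in range(MESH_DIM ** 2)
}

print(latency[0][0], latency[0][35])   # local bank vs. farthest bank
```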

Jenga builds on an earlier system from Sanchez’s group, called Jigsaw, which also allocated cache access on the fly. But Jigsaw didn’t build cache hierarchies, and adding that capability makes the allocation problem much more complex.

For every task running on every core, Jigsaw had to calculate a latency-space curve, which indicated how much latency the core could expect from caches of various sizes. It then had to aggregate all those curves to find a space allocation that minimized latency for the chip as a whole.
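The curve-and-aggregation step can be pictured with a small sketch. The latency curves and the greedy chunk-by-chunk allocator below are stand-ins chosen for brevity, under made-up working-set sizes, and are not Jigsaw's actual algorithm.

```python
# Minimal sketch of a Jigsaw-style allocation step: each task exposes a
# latency-vs-capacity curve, and the allocator hands out fixed-size chunks of
# shared cache greedily, always giving the next chunk to the task whose curve
# drops the most. Curves, sizes, and constants here are illustrative only.

CHUNK_MB = 4    # granularity of allocation (assumption)
TOTAL_MB = 64   # total shared-cache capacity to divide up (assumption)

def latency_curve(task_working_set_mb):
    """Illustrative latency model: latency falls as the allocation approaches
    the task's working-set size, then flattens out."""
    def latency(alloc_mb):
        hit_rate = min(1.0, alloc_mb / task_working_set_mb)
        return 100 - 80 * hit_rate   # 100 cycles worst case, 20 best
    return latency

tasks = {"A": latency_curve(16), "B": latency_curve(48), "C": latency_curve(8)}
alloc = {name: 0 for name in tasks}

for _ in range(TOTAL_MB // CHUNK_MB):
    # Give the next chunk to whichever task benefits most from it.
    best = max(tasks, key=lambda t: tasks[t](alloc[t]) - tasks[t](alloc[t] + CHUNK_MB))
    alloc[best] += CHUNK_MB

print(alloc)   # e.g. {'A': 16, 'B': 40, 'C': 8}
```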

Curves to surfaces

But Jenga has to evaluate the tradeoff between latency and space for two layers of cache simultaneously, which turns the two-dimensional latency-space curve into a three-dimensional surface. Fortunately, that surface turns out to be fairly smooth: It may undulate, but it usually won’t have sudden, narrow spikes and dips.

That means that sampling points on the surface will give a pretty good sense of what the surface as a whole looks like. The researchers developed a clever sampling algorithm tailored to the problem of cache allocation, which systematically increases the distances between sampled points. “The insight here is that caches with similar capacities — say, 100 megabytes and 101 megabytes — usually have similar performance,” Tsai says. “So a geometrically increased sequence captures the full picture quite well.”
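The geometric sampling idea Tsai describes can be sketched in a few lines; the size range and growth factor below are arbitrary choices, not the values Jenga uses.

```python
# Minimal sketch of geometric sampling: instead of evaluating every possible
# cache size, sample sizes that grow by a constant factor, on the assumption
# that nearby sizes perform almost identically.

def geometric_sizes(min_mb, max_mb, factor=2.0):
    """Yield cache sizes from min_mb to max_mb with geometrically growing gaps."""
    size = min_mb
    while size < max_mb:
        yield size
        size *= factor
    yield max_mb

print(list(geometric_sizes(1, 512)))   # [1, 2, 4, 8, ..., 256, 512]
```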

Once it has deduced the shape of the surface, Jenga finds the path across it that minimizes latency. Then it extracts the component of that path contributed by the first level of cache, which is a 2-D curve. At that point, it can reuse Jigsaw’s space-allocation machinery.

In experiments, the researchers found that this approach yielded an aggregate space allocation that was, on average, within 1 percent of that produced by a full-blown analysis of the 3-D surface, which would be prohibitively time consuming. Adopting the computational short cut enables Jenga to update its memory allocations every 100 milliseconds, to accommodate changes in programs’ memory-access patterns.

End run

Jenga also features a data-placement procedure motivated by the increasing popularity of DRAM cache. Because they’re close to the cores accessing them, most on-chip caches have virtually no bandwidth restrictions: They can deliver and receive as much data as a core needs. But sending data longer distances requires more energy, and since DRAM caches are off-chip, they have lower data rates.

If multiple cores are retrieving data from the same DRAM cache, this can cause bottlenecks that introduce new latencies. So after Jenga has come up with a set of cache assignments, cores don’t simply dump all their data into the nearest available memory bank. Instead, Jenga parcels out the data a little at a time, then estimates the effect on bandwidth consumption and latency. Thus, even within the 100-millisecond intervals between chip-wide cache re-allocations, Jenga adjusts the priorities that each core gives to the memory banks allocated to it.
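A minimal sketch of that incremental, bandwidth-aware placement loop follows. The bank names, latencies, and congestion penalty are invented for illustration and are not Jenga's actual cost model.

```python
# Minimal sketch: parcel out a core's data to its assigned banks a slice at a
# time, preferring whichever bank currently looks cheapest once an estimated
# congestion penalty is added to its raw access latency.

SLICE_MB = 1   # how much data to place per step (assumption)

def effective_cost(bank, placed_mb, raw_latency, bandwidth_mb):
    """Raw latency plus a simple congestion penalty as a bank fills up."""
    utilization = placed_mb[bank] / bandwidth_mb[bank]
    return raw_latency[bank] * (1.0 + utilization)

def place_data(total_mb, banks, raw_latency, bandwidth_mb):
    placed_mb = {b: 0.0 for b in banks}
    remaining = total_mb
    while remaining > 0:
        best = min(banks, key=lambda b: effective_cost(b, placed_mb, raw_latency, bandwidth_mb))
        placed_mb[best] += SLICE_MB
        remaining -= SLICE_MB
    return placed_mb

# Two on-chip SRAM banks plus a slower but larger off-chip DRAM cache.
banks = ["sram_near", "sram_far", "dram"]
raw_latency = {"sram_near": 12, "sram_far": 20, "dram": 60}   # cycles (illustrative)
bandwidth_mb = {"sram_near": 8, "sram_far": 8, "dram": 64}    # capacity proxy

print(place_data(32, banks, raw_latency, bandwidth_mb))
```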

“There’s been a lot of work over the years on the right way to design a cache hierarchy,” says David Wood, a professor of computer science at the University of Wisconsin at Madison. “There have been a number of previous schemes that tried to do some kind of dynamic creation of the hierarchy. Jenga is different in that it really uses the software to try to characterize what the workload is and then do an optimal allocation of the resources between the competing processes. And that, I think, is fundamentally more powerful than what people have been doing before. That’s why I think it’s really interesting.”

Digitizing signals beyond the voltage limit

Virtually any modern information-capture device — such as a camera, audio recorder, or telephone — has an analog-to-digital converter in it, a circuit that converts the fluctuating voltages of analog signals into strings of ones and zeroes.

Almost all commercial analog-to-digital converters (ADCs), however, have voltage limits. If an incoming signal exceeds that limit, the ADC either cuts it off or flatlines at the maximum voltage. This phenomenon is familiar as the pops and skips of a “clipped” audio signal or as “saturation” in digital images — when, for instance, a sky that looks blue to the naked eye shows up on-camera as a sheet of white.

Last week, at the International Conference on Sampling Theory and Applications, researchers from MIT and the Technical University of Munich presented a technique that they call unlimited sampling, which can accurately digitize signals whose voltage peaks are far beyond an ADC’s voltage limit.

The consequence could be cameras that capture all the gradations of color visible to the human eye, audio that doesn’t skip, and medical and environmental sensors that can handle both long periods of low activity and the sudden signal spikes that are often the events of interest.

The paper’s chief result, however, is theoretical: The researchers establish a lower bound on the rate at which an analog signal with wide voltage fluctuations should be measured, or “sampled,” in order to ensure that it can be accurately digitized. Their work thus extends one of the several seminal results from longtime MIT Professor Claude Shannon’s groundbreaking 1948 paper “A Mathematical Theory of Communication,” the so-called Nyquist-Shannon sampling theorem.

Ayush Bhandari, a graduate student in media arts and sciences at MIT, is the first author on the paper, and he’s joined by his thesis advisor, Ramesh Raskar, an associate professor of media arts and sciences, and Felix Krahmer, an assistant professor of mathematics at the Technical University of Munich.

Wraparound

The researchers’ work was inspired by a new type of experimental ADC that captures not the voltage of a signal but its “modulo.” In the case of the new ADCs, the modulo is the remainder produced when the voltage of an analog signal is divided by the ADC’s maximum voltage.

“The idea is very simple,” Bhandari says. “If you have a number that is too big to store in your computer memory, you can take the modulo of the number. The act of taking the modulo is just to store the remainder.”

“The modulo architecture is also called the self-reset ADC,” Bhandari explains. “By self-reset, what it means is that when the voltage crosses some threshold, it resets, which is actually implementing a modulo. The self-reset ADC sensor was proposed in electronic architecture a couple years back, and ADCs that have this capability have been prototyped.”

One of those prototypes was designed to capture information about the firing of neurons in the mouse brain. The baseline voltage across a neuron is relatively low, and the sudden voltage spikes when the neuron fires are much higher. It’s difficult to build a sensor that is sensitive enough to detect the baseline voltage but won’t saturate during spikes.

When a signal exceeds the voltage limit of a self-reset ADC, it’s cut off, and it starts over again at the circuit’s minimum voltage. Similarly, if the signal drops below the circuit’s minimum voltage, it’s reset to the maximum voltage. If the signal’s peak voltage is several times the voltage limit, the signal can thus wrap around on itself again and again.
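The wraparound behavior is easy to simulate. The sketch below folds a signal whose peaks reach three times the ADC limit; the voltage range and the test signal are arbitrary choices for illustration.

```python
# Minimal sketch of the folding behavior of a self-reset (modulo) ADC: each
# sample is recorded modulo the ADC's voltage range, so a signal that sweeps
# far above the limit wraps around again and again.

import math

V_MAX = 1.0   # ADC range: recorded samples are folded into [0, V_MAX)

def self_reset_sample(true_voltage):
    """Return the value a modulo (self-reset) ADC would record."""
    return true_voltage % V_MAX   # Python's % also handles negative voltages

# A signal whose peaks reach several times the ADC limit.
true_signal = [3.0 * math.sin(2 * math.pi * 0.01 * n) for n in range(200)]
folded = [self_reset_sample(v) for v in true_signal]

print(max(true_signal), max(folded))   # ~3.0 vs. less than 1.0
```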

This poses a problem for digitization. Digitization is the process of sampling an analog signal — essentially, making many discrete measurements of its voltage. The Nyquist-Shannon theorem establishes the number of measurements required to ensure that the signal can be accurately reconstructed.

But existing sampling algorithms assume that the signal varies continuously up and down. If, in fact, the signal from a self-reset ADC is sampled right before it exceeds the maximum, and again right after the circuit resets, it looks to the standard sampling algorithm like a signal whose voltage decreases between the two measurements, rather than one whose voltage increases.

Big mistakes

Bhandari and his colleagues were interested in the theoretical question of how many samples are required to resolve that ambiguity, and the practical question of how to reconstruct the original signal. They found that the number of samples dictated by the Nyquist-Shannon theorem, multiplied by pi and by Euler’s number e, or roughly 8.5, would guarantee faithful reconstruction.

The researchers’ reconstruction algorithm relies on some clever mathematics. In a self-reset ADC, the voltage sampled after a reset is the modulo of the true voltage. Recovering the true voltage is thus a matter of adding some multiple of the ADC’s maximum voltage — call it M — to the sampled value. What that multiple should be, however — M, 2M, 5M, 10M — is unknown.

The most basic principle in calculus is that of the derivative, which provides a formula for calculating the slope of a curve at any given point. In computer science, however, derivatives are often approximated arithmetically. Suppose, for instance, that you have a series of samples from an analog signal. Take the difference between samples 1 and 2, and store it. Then take the difference between samples 2 and 3, and store that, then 3 and 4, and so on. The end result will be a string of values that approximate the derivative of the sampled signal.

The derivative of the true signal to a self-reset ADC is thus equal to the derivative of its modulo plus the derivative of a bunch of multiples of the threshold voltage — the Ms, 2Ms, 5Ms, and so on. But the derivative of the M-multiples is itself always a string of M-multiples, because taking the difference between two consecutive M-multiples will always yield another M-multiple.

Now, if you take the modulo of both derivatives, all the M-multiples disappear, since they leave no remainder when divided by M. The modulo of the derivative of the true signal is thus equivalent to the modulo of the derivative of the modulo signal.

Inverting the derivative is also one of the most basic operations in calculus, but deducing the original signal does require adding in an M-multiple whose value has to be inferred. Fortunately, using the wrong M-multiple will yield signal voltages that are wildly implausible. The researchers’ proof of their theoretical result involved an argument about the number of samples necessary to guarantee that the correct M-multiple can be inferred.
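A heavily simplified, first-order version of that reconstruction can be written down directly. The published algorithm uses higher-order differences and the roughly 8.5-times (pi times e) oversampling factor described above; this sketch instead assumes very heavy oversampling, so that consecutive true samples differ by less than half the ADC range, and a known first sample, which sidesteps the M-multiple inference.

```python
# Simplified first-order sketch of unlimited-sampling reconstruction: take
# differences of the folded samples, fold those differences back into a
# centered range (the "modulo of the derivative" step), then integrate.

import math

V_MAX = 1.0   # ADC range: folded samples lie in [0, V_MAX)

def centered_mod(value, m=V_MAX):
    """Fold a value into [-m/2, m/2)."""
    return (value + m / 2) % m - m / 2

def unfold(folded, x0):
    """Recover the true samples by integrating the folded differences,
    starting from a known first sample x0."""
    diffs = [centered_mod(b - a) for a, b in zip(folded, folded[1:])]
    recovered = [x0]
    for d in diffs:
        recovered.append(recovered[-1] + d)
    return recovered

# Demo: fold a slow sine wave that peaks well above V_MAX, then unfold it.
true_signal = [3.0 * math.sin(2 * math.pi * 0.002 * n) for n in range(500)]
folded = [v % V_MAX for v in true_signal]
recovered = unfold(folded, true_signal[0])

print(max(abs(a - b) for a, b in zip(true_signal, recovered)))   # ~0
```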

“If you have the wrong constant, then the constant has to be wrong by a multiple of M,” Krahmer says. “So if you invert the derivative, that adds up very quickly. One sample will be correct, the next sample will be wrong by M, the next sample will be wrong by 2M, and so on. We need to set the number of samples to make sure that if we have the wrong answer in the previous step, our reconstruction would grow so large that we know it can’t be correct.”

“Unlimited sampling is an intriguing concept that addresses the important and real issue of saturation in analog-to-digital converters,” says Richard Baraniuk, a professor of electrical and computer engineering at Rice University and one of the co-inventors of the single-pixel camera. “It is promising that the computations required to recover the signal from modulo measurements are practical with today’s hardware. Hopefully this concept will spur the development of the kind of sampling hardware needed to make unlimited sampling a reality.”

Designing microstructures

Today’s 3-D printers have a resolution of 600 dots per inch, which means that they could pack a billion tiny cubes of different materials into a cube measuring just 1.67 inches on a side.

Such precise control of printed objects’ microstructure gives designers commensurate control of the objects’ physical properties — such as their density or strength, or the way they deform when subjected to stresses. But evaluating the physical effects of every possible combination of even just two materials, for an object consisting of tens of billions of cubes, would be prohibitively time consuming.

So researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new design system that catalogues the physical properties of a huge number of tiny cube clusters. These clusters can then serve as building blocks for larger printable objects. The system thus takes advantage of physical measurements at the microscopic scale, while enabling computationally efficient evaluation of macroscopic designs.

“Conventionally, people design 3-D prints manually,” says Bo Zhu, a postdoc at CSAIL and first author on the paper. “But when you want to have some higher-level goal — for example, you want to design a chair with maximum stiffness or design some functional soft [robotic] gripper — then intuition or experience is maybe not enough. Topology optimization, which is the focus of our paper, incorporates the physics and simulation in the design loop. The problem for current topology optimization is that there is a gap between the hardware capabilities and the software. Our algorithm fills that gap.”

Zhu and his MIT colleagues presented their work this week at Siggraph, the premier graphics conference. Joining Zhu on the paper are Wojciech Matusik, an associate professor of electrical engineering and computer science; Mélina Skouras, a postdoc in Matusik’s group; and Desai Chen, a graduate student in electrical engineering and computer science.

Points in space

The MIT researchers begin by defining a space of physical properties, in which any given microstructure will assume a particular location. For instance, there are three standard measures of a material’s stiffness: One describes its deformation in the direction of an applied force, or how far it can be compressed or stretched; one describes its deformation in directions perpendicular to an applied force, or how much its sides bulge when it’s squeezed or contract when it’s stretched; and the third measures its response to shear, or a force that causes different layers of the material to shift relative to each other.

Those three measures define a three-dimensional space, and any particular combination of them defines a point in that space.

In the jargon of 3-D printing, the microscopic cubes from which an object is assembled are called voxels, for volumetric pixels; they’re the three-dimensional analogue of pixels in a digital image. The building blocks from which Zhu and his colleagues assemble larger printable objects are clusters of voxels.

In their experiments, the researchers considered clusters of three different sizes — 16, 32, and 64 voxels to a face. For a given set of printable materials, they randomly generate clusters that combine those materials in different ways: a square of material A at the cluster’s center, a border of vacant voxels around that square, material B at the corners, or the like. The clusters must be printable, however; it wouldn’t be possible to print a cluster that, say, included a cube of vacant voxels with a smaller cube of material floating at its center.

For each new cluster, the researchers evaluate its physical properties using physics simulations, which assign it a particular point in the space of properties.

Gradually, the researchers’ algorithm explores the entire space of properties, through both random generation of new clusters and the principled modification of clusters whose properties are known. The end result is a cloud of points that defines the space of printable clusters.
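The exploration loop can be sketched as follows. The cluster layout and the simulate_stiffness stand-in are hypothetical: the real system runs physics simulations and also rejects unprintable clusters, whereas this placeholder just scores random two-material clusters so the overall structure of the database build is visible.

```python
# Minimal sketch of building the property-space point cloud. The
# `simulate_stiffness` function is a made-up placeholder, not a physics solver.

import random

CLUSTER_SIZE = 16        # voxels per edge (the researchers also use 32 and 64)
MATERIALS = [0, 1, 2]    # 0 = empty voxel, 1 = material A, 2 = material B

def random_cluster(size=CLUSTER_SIZE):
    """Random assignment of materials to a size^3 block of voxels."""
    return [[[random.choice(MATERIALS) for _ in range(size)]
             for _ in range(size)] for _ in range(size)]

def simulate_stiffness(cluster):
    """Placeholder for the physics simulation: returns a made-up point in the
    3-D property space (axial, lateral, and shear response)."""
    filled = sum(v > 0 for plane in cluster for row in plane for v in row)
    density = filled / len(cluster) ** 3
    return (density, 0.5 * density, 0.3 * density)   # illustrative only

# Build the point cloud: one property-space point per generated cluster.
database = []
for _ in range(100):
    cluster = random_cluster()
    database.append({"cluster": cluster, "properties": simulate_stiffness(cluster)})

print(database[0]["properties"])
```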

Establishing boundaries

The next step is to calculate a function called the level set, which describes the shape of the point cloud. This enables the researchers’ system to mathematically determine whether a cluster with a particular combination of properties is printable or not.

The final step is the optimization of the object to be printed, using software custom-developed by the researchers. That process will result in specifications of material properties for tens or even hundreds of thousands of printable clusters. The researchers’ database of evaluated clusters may not contain exact matches for any of those specifications, but it will contain clusters that are extremely good approximations.
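That last matching step amounts to a nearest-neighbor lookup in property space. In the sketch below, the tiny hand-written database stands in for the real point cloud of evaluated clusters, and the cluster identifiers are hypothetical.

```python
# Minimal sketch of matching an optimizer's property specification to the
# closest evaluated cluster in the database (nearest neighbor in 3-D
# property space). The database entries here are made up.

import math

database = [
    {"id": "cluster_001", "properties": (0.20, 0.10, 0.06)},
    {"id": "cluster_002", "properties": (0.55, 0.28, 0.17)},
    {"id": "cluster_003", "properties": (0.90, 0.45, 0.27)},
]

def nearest_cluster(target_properties, database):
    """Return the database entry whose properties best approximate the target."""
    return min(database, key=lambda entry: math.dist(entry["properties"], target_properties))

# Example: the optimizer requests a cluster near this point in property space.
spec = (0.6, 0.3, 0.2)
best = nearest_cluster(spec, database)
print(best["id"], best["properties"])
```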

“The design and discovery of structures to produce materials and objects with exactly specified functional properties is central for a large number of applications where mechanical properties are important, such as in the automotive or aerospace industries,” says Bernd Bickel, an assistant professor of computer science at the Institute of Science and Technology Austria and head of the institute’s Computer Graphics and Digital Fabrication group. “Due to the complexity of these structures, which, in the case of 3-D printing, can consist of more than a trillion material droplets, exploring them manually is absolutely intractable.”

“The solution presented by Bo and colleagues addresses this problem in a very clever way, by reformulating it,” he says. “Instead of working directly on the scale of individual droplets, they first precompute the behavior of small structures and put it in a database. Leveraging this knowledge, they can perform the actual optimization on a coarser level, allowing them to very efficiently generate high-resolution printable structures with more than a trillion elements, even with just a regular computer. This opens up exciting new avenues for designing and optimizing structures at a resolution that was out of reach so far.”

The MIT researchers’ work was supported by the U.S. Defense Advanced Research Projects Agency’s SIMPLEX program.

More computing power for phytoplankton modeling

Microbes mediate the global marine cycles of elements, modulating atmospheric carbon dioxide and helping to maintain the oxygen we all breathe, yet there is much about them scientists still don’t understand. Now, an award from the Simons Foundation will give researchers from MIT’s Darwin Project access to bigger, better computing resources to model these communities and probe how they work.

The simulations of plankton populations made by Darwin Project researchers have become increasingly computationally demanding. MIT Professor Michael “Mick” Follows and Principal Research Engineer Christopher Hill, both affiliates of the Darwin Project, were therefore delighted to learn of their recent Simons Foundation award, providing them with enhanced compute infrastructure to help execute the simulations of ocean circulation, biogeochemical cycles, and microbial population dynamics that are the bread and butter of their research.

The Darwin Project, an alliance between oceanographers and microbiologists in the MIT Department of Earth, Atmospheric and Planetary Sciences (EAPS) and the Parsons Lab in the MIT Department of Civil and Environmental Engineering, was conceived as an initiative to “advance the development and application of novel models of marine microbes and microbial communities, identifying the relationships of individuals and communities to their environment, connecting cellular-scale processes to global microbial community structure” with the goal of coupling “state of the art physical models of global ocean circulation with biogeochemistry and genome-informed models of microbial processes.”

Over the decade since the project’s inception in 2007, increases in model complexity and resolution have caused computational demands to balloon. The fidelity and algorithmic sophistication of both the biological and the fluid dynamical component models have grown inexorably, as have forays into new statistical analysis approaches that leverage big-data innovations to analyze the simulations and field data.

“The award allows us to grow our in-house computational and data infrastructure to accelerate and facilitate these new modeling capabilities,” says Hill, who specializes in Earth and planetary computational science.

The boost in computational infrastructure the award provides will advance several linked areas of research: the capacity to model marine microbial systems in more detail, enhanced fidelity of the modeled fluid dynamical environment, support for state-of-the-art data analytics including machine learning techniques, and accelerated and extended genomic data processing.

High diversity is a ubiquitous aspect of marine microbial communities that is not fully understood and, to date, is rarely resolved in simulations. Darwin Project researchers have broken new ground and continue to push the envelope in modeling in this area: In addition to resolving a much larger number of phenotypes and interactions than has typically been attempted by other investigators, the Darwin Project team has also been increasing the fidelity of the underlying physiological sub-models which define traits and trade-offs.

“One thing we are doing is implementing simplified metabolic models which resolve additional constraints [electron and energy conservation] and higher fidelity [dynamic representations of macro-molecular and elemental composition],” says Follows. “These advances require more state variables per phenotype. We also have an explicit radiative transfer model that allows us to better exploit satellite remote sensing data, but both come at a greater computational expense.” Darwin researchers are also expanding their models to resolve not only phototrophic and grazer communities in the surface ocean but also heterotrophic and chemo-autotrophic populations throughout the water column.

Follows and Hill believe these advances will provide better fidelity to real world observations, a more dynamic and fundamental description of marine microbial communities and biogeochemical cycles, and the potential to examine the underlying drivers and significance of diversity in the system.

“Much of the biological action in the surface ocean occurs at scales currently unresolved in most biogeochemical simulations,” Follows explains. “Numerical models and recent observations show that the sub-mesoscale motions in the ocean have a profound impact on the supply of resources to the surface and the dispersal and communication between different populations. The integral impact of this, and how to properly parameterize it, is not yet clear, but one approach that is within reach is to resolve these scales of motion nested within global simulations.”

Hill and Follows hope such advances will allow them to examine both local and regionally integrated effects of fine-scale physical drivers. “We have already completed a full annual cycle numerical simulation that resolves physical processes down to kilometer scales globally,” says Hill. “Such simulations provide a basis for driving targeted modeling of, for example, the role of fronts that may involve fully non-hydrostatic dynamics and that could help explain in-situ measurements that suggest enhanced growth rates under such conditions.” Such work is strongly complementary to another Simons Foundation-sponsored project, the Simons Collaboration on Ocean Processes and Ecology (SCOPE). As an initiative to advance our understanding of the biology, ecology, and biogeochemistry of microbial processes that dominate Earth’s largest biome — the global ocean — SCOPE seeks to measure, model, and conduct experiments at a model ecosystem site located 100 km north of the Hawaiian island of Oahu that is representative of a large portion of the North Pacific Ocean.

The team has also already implemented algorithms to enable explicit modeling of the relevant fluid dynamics, but here too, the approaches are computationally demanding. “The improved facilities this award provides will enable these extremely demanding experiments to proceed,” says Follows.

Enhanced computer resources will also allow Darwin Project researchers to more effectively utilize data analytics. “We are adopting multiple statistical approaches for classifying fluid dynamical and ecosystem features in observations and in simulations which we plan to apply to biogeochemical problems,” says Hill. “One current direction, which employs random forest classification to identify features corresponding to training sets, is showing particular promise for objectively quantifying links between biogeochemical event occurrence and physical environment phenomena.”
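As a generic illustration of that random-forest approach, and not the Darwin Project's actual pipeline, the sketch below trains a scikit-learn classifier on fabricated feature vectors and labels; the feature names suggested in the comments are assumptions.

```python
# Generic sketch of random forest classification for flagging features of
# interest (e.g., fronts) in simulated or observed ocean data. All data here
# are fabricated; the real training sets would come from labeled simulations
# and observations.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Fabricated training set: columns might stand for temperature gradient,
# vorticity, and chlorophyll anomaly (illustrative feature names only).
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0.8).astype(int)   # toy labels

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Classify new (also fabricated) grid cells.
X_new = rng.normal(size=(5, 3))
print(model.predict(X_new))
```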

Not only will these methods provide useful analysis tools for their simulations, but the pair also see them bridging to real-world interpretations of, for example, metagenomic surveys in the ocean. Follows and Hill see this direction as a route to bring simulations and observations closer together in new and meaningful ways. The growth in computational infrastructure that the Simons award allows creates the potential for much larger queries across more realistic datasets.

The Darwin Project is part of a long and fruitful collaboration with Institute Professor Sally “Penny” Chisholm of MIT’s Department of Civil and Environmental Engineering. Steady growth in the available large-scale metagenomic and single-cell genomic data resulting from the genetics work in the Chisholm Lab is also driving additional computational processing needs.

With the new Simons-supported enhancements in computational infrastructure, Darwin Project collaborators in the Chisholm Lab will be able to tackle assembly from larger metagenomic libraries and single-cell genome phylogenies using maximum likelihood and/or Bayesian algorithms. Currently, some large metagenomics assembly activities require compute resources with more memory than this team has readily had available. “Single-cell genome phylogeny activities are computationally demanding and require dedicating compute resources for weeks or months at a time,” Hill explains. “This creates a bottleneck for other work. To accelerate work in these areas, additional compute resources, some with larger memory than current resources and some with GPU accelerators, are going to be hugely beneficial. The new systems will permit larger metagenomics library assembly than is currently possible.”