Monthly Archives: July 2017

The Founder dropped Dropbox to inspire and provoke collaboration

Like MIT’s campus computing environment, Athena, a pre-cloud solution for enabling files and applications to follow the user, Dropbox’s Drew Houston ’05 brings his alma mater everywhere he goes.

After earning his bachelor’s in electrical engineering and computer science, Houston’s frustration with the clunky need to carry portable USB drives drove him to partner with a fellow MIT student, Arash Ferdowsi, to develop an online solution — what would become Dropbox.

Dropbox, which now has over 500 million users, continues to adapt. The file-sharing company recently crossed the $1-billion threshold in annual subscription revenue. It’s expanding its business model by selling at the corporate level — employees at companies with Dropbox can use, essentially, one big box.

True to his company’s goal of using technology to bring people (and files) together, Houston is keen to share his own wisdom with others, especially those at MIT. Houston gave the 2013 Commencement address, saying “The hardest-working people don’t work hard because they’re disciplined. They work hard because working on an exciting problem is fun.”

He has also been a guest speaker in ‘The Founder’s Journey,” a course designed to demystify entrepreneurship, and at the MIT Enterprise Forum Cambridge; a frequent and active participant in StartMIT, a workshop on entrepreneurship held over Independent Activities Period (IAP); and a staple of the MIT Better World tour, an alumni engagement event happening at cities all over the globe.

Houston stopped by MIT in February for the latest iteration of StartMIT to give a “fireside chat” about the early days of Dropbox, when it was run with a few of his friends from Course 6, and discussed the current challenges of the company: managing scale. His firm now employs over 1,000 people.

“With thousands of employees in the company, you need coordination, and it can become total chaos. Ultimately all the vectors need to point in the same direction,” he told students. It turns out that Dropbox’s new online collaboration suite, Paper, could play a key role in getting those vectors to line up, offering lessons to both those new to start-ups and more seasoned entrepreneurs.

The Department of Electrical Engineering and Science (EECS) caught up with Houston to ask him about his perspective on Paper, a new tool Dropbox created out of necessity, and the potential for artificial intelligence to improve how people and teams organize their work online.

Q. Why did you decide to create a new way to collaborate?

A. Keeping files in sync is a way to keep teams in sync — sharing info gets teams on the same page. But teams need a lot more than just shared files; they need a way of organizing their knowledge, which might even mean taking bits of useful information from different files and connecting them in new ways.

At Dropbox, we were using tools like Google Docs. And we found they were really good for some things: creating docs in real-time is super easy, collaborating, editing — that stuff was great. But the problem is after using them we’d end up with this static list of 100 documents and 20 different projects, and it wasn’t very organized. The docs were self-contained and not connected.

The other half of the time, we were using wikis — which are good, because they’re like the web: They’re connected, they’re public by default, so they become this home for your knowledge. But the editing experience is pretty limited. There’s not a lot of formatting. You can’t do much more than text.

One way or the other, we’re like, “God! We’re always compromising.” But what if we had the best of both worlds?

It’s also one thing for tools to just be functional. What if they were also beautiful?

Look at this building [Houston waves his hand to indicate the unique heart of Building 32, the Ray and Maria Stata Center.] We spend so much time and money and thought on the design of our physical space — what about our collaborative work environment?

Q. What, exactly, is beautiful about Paper?

A. [Houston opens his laptop and pulls up Dropbox Paper.] This is how we run the company, really. [Houston has the dashboard interface up, where he can see his documents.] It’s very simple. Look at what there isn’t; there’s not 10,000 buttons. [He clicks a plus sign that opens a new document. The placeholder text reads: Beautiful by default.] I can just add a couple things. [Houston types rapidly, creating a checkbox item]: “@Drew, remember to eat. [Houston selects a date on a mini-calendar that appears to the right]… By tomorrow.” [He laughs.] You can add a picture. You can share the document or send it. There are stickers. You have the formatting you would expect, but also emojis, code, LaTeX, tasks, tables. They’re beautiful by default. You can embed stuff — Google Docs, GIFs, Dropbox files with previews, Spotify files, Soundcloud files, tweets, you can paste a URL. It’s really awesome. And it’s on your phone. There are updates, to-dos — so it gives me a feed of what’s happening in my world.

Q. Paper also incorporates machine learning tricks. What does that mean practically? And where might it go from there?

A. When you zoom out more broadly [than a single document], there’s just a lot of context that helps people figure out what’s important, or how to prioritize or organize things. So, for example, you might be working on a Paper doc and based on the key words or various other signals, we can be, like, ‘Oh wait, here are a few suggestions for where you might want to put it, like your public folder.’ I think there’re a lot of ways we can assist; instead of everyone having to be a librarian and file or tag things and make these connections manually, the algorithms can do a lot of the ranking the elements, and so on, a lot better. And that’s throughout the product. It’s a core part of the experience.

Another area I’m very interested in is natural language processing. A lot of work, like when you’re emailing or texting someone, or leaving a comment, is just unstructured text. But from that you can infer that there’s actually structure, intent, meaning, and associations. For example, if I write “doc scanner,” that’s not just a couple words; it’s a feature with associations. And you should be able to query your team’s knowledge and not have to worry about explicitly asking “Where are the specs for the doc scanner?”

Natural language processing could essentially ask, “Is this what you mean?” As you’re trying to type in a query, it could assist you with automated suggestions: “Oh, are you looking for (this)?” Or “are you searching for something by (this person)?” Or “do you mean something that’s part of this project?” Or “something before this date?” Those are just a couple of ways that [machine learning and automated assistance] could be applied, but there are, potentially, thousands. We’re not very far along yet in true natural language understanding. But I think step one is just to get all the stuff into one place — doing some of that plumbing so you can do basic searches.

Learn to make good decisions when the results are uncertain

Markov decision processes are mathematical models used to determine the best courses of action when both current circumstances and future consequences are uncertain. They’ve had a huge range of applications — in natural-resource management, manufacturing, operations management, robot control, finance, epidemiology, scientific-experiment design, and tennis strategy, just to name a few.

But analyses involving Markov decision processes (MDPs) usually make some simplifying assumptions. In an MDP, a given decision doesn’t always yield a predictable result; it could yield a range of possible results. And each of those results has a different “value,” meaning the chance that it will lead, ultimately, to a desirable outcome.

Characterizing the value of given decision requires collection of empirical data, which can be prohibitively time consuming, so analysts usually just make educated guesses. That means, however, that the MDP analysis doesn’t guarantee the best decision in all cases.

In the Proceedings of the Conference on Neural Information Processing Systems, published last month, researchers from MIT and Duke University took a step toward putting MDP analysis on more secure footing. They show that, by adopting a simple trick long known in statistics but little applied in machine learning, it’s possible to accurately characterize the value of a given decision while collecting much less empirical data than had previously seemed necessary.

In their paper, the researchers described a simple example in which the standard approach to characterizing probabilities would require the same decision to be performed almost 4 million times in order to yield a reliable value estimate.

With the researchers’ approach, it would need to be run 167,000 times. That’s still a big number — except, perhaps, in the context of a server farm processing millions of web clicks per second, where MDP analysis could help allocate computational resources. In other contexts, the work at least represents a big step in the right direction.

“People are not going to start using something that is so sample-intensive right now,” says Jason Pazis, a postdoc at the MIT Laboratory for Information and Decision Systems and first author on the new paper. “We’ve shown one way to bring the sample complexity down. And hopefully, it’s orthogonal to many other ways, so we can combine them.”

Unpredictable outcomes

In their paper, the researchers also report running simulations of a robot exploring its environment, in which their approach yielded consistently better results than the existing approach, even with more reasonable sample sizes — nine and 105. Pazis emphasizes, however, that the paper’s theoretical results bear only on the number of samples required to estimate values; they don’t prove anything about the relative performance of different algorithms at low sample sizes.

Pazis is joined on the paper by Jonathan How, the Richard Cockburn Maclaurin Professor of Aeronautics and Astronautics at MIT, and by Ronald Parr, a professor of computer science at Duke.

Although the possible outcomes of a decision may be described according to a probability distribution, the expected value of the decision is just the mean, or average, value of all outcomes. In the familiar bell curve of the so-called normal distribution, the mean defines the highest point of the bell.

The trick the researchers’ algorithm employs is called the median of means. If you have a bunch of random values, and you’re asked to estimate the mean of the probability distribution they’re drawn from, the natural way to do it is to average them. But if your sample happens to include some rare but extreme outliers, averaging can give a distorted picture of the true distribution. For instance, if you have a sample of the heights of 10 American men, nine of whom cluster around the true mean of 5 feet 10 inches, but one of whom is a 7-foot-2-inch NBA center, straight averaging will yield a mean that’s off by about an inch and a half.

With the median of means, you instead divide your sample into subgroups, take the mean (average) of each of those, and then take the median of the results. The median is the value that falls in the middle, if you arrange your values from lowest to highest.

Value proposition

The goal of MDP analysis is to determine a set of policies — or actions under particular circumstances — that maximize the value of some reward function. In a manufacturing setting, the reward function might measure operational costs against production volume; in robot control, it might measure progress toward the completion of a task.

But a given decision is evaluated according to a much more complex measure called a “value function,” which is a probabilistic estimate of the expected reward from not just that decision but every possible decision that could follow.

The researchers showed that, with straight averaging, the number of samples required to estimate the mean value of a decision is proportional to the square of the range of values that the value function can take on. Since that range can be quite large, so is the number of samples. But with the median of means, the number of samples is proportional to the range of a different value, called the Bellman operator, which is usually much narrower. The researchers also showed how to calculate the optimal size of the subsamples in the median-of-means estimate.

“The results in the paper, as with most results of this type, still reflect a large degree of pessimism because they deal with a worst-case analysis, where we give a proof of correctness for the hardest possible environment,” says Marc Bellemare, a research scientist at the Google-owned artificial-intelligence company Google DeepMind. “But that kind of analysis doesn’t need to carry over to applications. I think Jason’s approach, where we allow ourselves to be a little optimistic and say, ‘Let’s hope the world out there isn’t all terrible,’ is almost certainly the right way to think about this problem. I’m expecting this kind of approach to be highly useful in practice.”

The work was supported by the Boeing Company, the U.S. Office of Naval Research, and the National Science Foundation.

Testing the latest network protocol

The transmission control protocol, or TCP, which manages traffic on the internet, was first proposed in 1974. Some version of TCP still regulates data transfer in most major data centers, the huge warehouses of servers maintained by popular websites.

That’s not because TCP is perfect or because computer scientists have had trouble coming up with possible alternatives; it’s because those alternatives are too hard to test. The routers in data center networks have their traffic management protocols hardwired into them. Testing a new protocol means replacing the existing network hardware with either reconfigurable chips, which are labor-intensive to program, or software-controlled routers, which are so slow that they render large-scale testing impractical.

At the Usenix Symposium on Networked Systems Design and Implementation later this month, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory will present a system for testing new traffic management protocols that requires no alteration to network hardware but still works at realistic speeds — 20 times as fast as networks of software-controlled routers.

The system maintains a compact, efficient computational model of a network running the new protocol, with virtual data packets that bounce around among virtual routers. On the basis of the model, it schedules transmissions on the real network to produce the same traffic patterns. Researchers could thus run real web applications on the network servers and get an accurate sense of how the new protocol would affect their performance.

“The way it works is, when an endpoint wants to send a [data] packet, it first sends a request to this centralized emulator,” says Amy Ousterhout, a graduate student in electrical engineering and computer science (EECS) and first author on the new paper. “The emulator emulates in software the scheme that you want to experiment with in your network. Then it tells the endpoint when to send the packet so that it will arrive at its destination as though it had traversed a network running the programmed scheme.”

Ousterhout is joined on the paper by her advisor, Hari Balakrishnan, the Fujitsu Professor in Electrical Engineering and Computer Science; Jonathan Perry, a graduate student in EECS; and Petr Lapukhov of Facebook.

Traffic control

Each packet of data sent over a computer network has two parts: the header and the payload. The payload contains the data the recipient is interested in — image data, audio data, text data, and so on. The header contains the sender’s address, the recipient’s address, and other information that routers and end users can use to manage transmissions.

When multiple packets reach a router at the same time, they’re put into a queue and processed sequentially. With TCP, if the queue gets too long, subsequent packets are simply dropped; they never reach their recipients. When a sending computer realizes that its packets are being dropped, it cuts its transmission rate in half, then slowly ratchets it back up.

A better protocol might enable a router to flip bits in packet headers to let end users know that the network is congested, so they can throttle back transmission rates before packets get dropped. Or it might assign different types of packets different priorities, and keep the transmission rates up as long as the high-priority traffic is still getting through. These are the types of strategies that computer scientists are interested in testing out on real networks.

Speedy simulation

With the MIT researchers’ new system, called Flexplane, the emulator, which models a network running the new protocol, uses only packets’ header data, reducing its computational burden. In fact, it doesn’t necessarily use all the header data — just the fields that are relevant to implementing the new protocol.

When a server on the real network wants to transmit data, it sends a request to the emulator, which sends a dummy packet over a virtual network governed by the new protocol. When the dummy packet reaches its destination, the emulator tells the real server that it can go ahead and send its real packet.

If, while passing through the virtual network, a dummy packet has some of its header bits flipped, the real server flips the corresponding bits in the real packet before sending it. If a clogged router on the virtual network drops a dummy packet, the corresponding real packet is never sent. And if, on the virtual network, a higher-priority dummy packet reaches a router after a lower-priority packet but jumps ahead of it in the queue, then on the real network, the higher-priority packet is sent first.

The servers on the network thus see the same packets in the same sequence that they would if the real routers were running the new protocol. There’s a slight delay between the first request issued by the first server and the first transmission instruction issued by the emulator. But thereafter, the servers issue packets at normal network speeds.

The ability to use real servers running real web applications offers a significant advantage over another popular technique for testing new network management schemes: software simulation, which generally uses statistical patterns to characterize the applications’ behavior in a computationally efficient manner.

“Being able to try real workloads is critical for testing the practical impact of a network design and to diagnose problems for these designs,” says Minlan Yu, an associate professor of computer science at Yale University. “This is because many problems happen at the interactions between applications and the network stack” — the set of networking protocols loaded on each server — “which are hard to understand by simply simulating the traffic.”

“Flexplane takes an interesting approach of sending abstract packets through the emulated data-plane resource management solutions and then feeding back the modified real packets to the real network,” Yu adds. “This is a smart idea that achieves both high link speed and programmability. I hope we can build up a community using the FlexPlane test bed for testing new resource management solutions.”

Explanation of artificial neural network

In the past 10 years, the best-performing artificial-intelligence systems — such as the speech recognizers on smartphones or Google’s latest automatic translator — have resulted from a technique called “deep learning.”

Deep learning is in fact a new name for an approach to artificial intelligence called neural networks, which have been going in and out of fashion for more than 70 years. Neural networks were first proposed in 1944 by Warren McCullough and Walter Pitts, two University of Chicago researchers who moved to MIT in 1952 as founding members of what’ssometimes called the first cognitive science department.

Neural nets were a major area of research in both neuroscience and computer science until 1969, when, according to computer science lore, they were killed off by the MIT mathematicians Marvin Minsky and Seymour Papert, who a year later would become co-directors of the new MIT Artificial Intelligence Laboratory.

The technique then enjoyed a resurgence in the 1980s, fell into eclipse again in the first decade of the new century, and has returned like gangbusters in the second, fueled largely by the increased processing power of graphics chips.

“There’s this idea that ideas in science are a bit like epidemics of viruses,” says Tomaso Poggio, the Eugene McDermott Professor of Brain and Cognitive Sciences at MIT, an investigator at MIT’s McGovern Institute for Brain Research, and director of MIT’s Center for Brains, Minds, and Machines. “There are apparently five or six basic strains of flu viruses, and apparently each one comes back with a period of around 25 years. People get infected, and they develop an immune response, and so they don’t get infected for the next 25 years. And then there is a new generation that is ready to be infected by the same strain of virus. In science, people fall in love with an idea, get excited about it, hammer it to death, and then get immunized — they get tired of it. So ideas should have the same kind of periodicity!”

Weighty matters

Neural nets are a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance. An object recognition system, for instance, might be fed thousands of labeled images of cars, houses, coffee cups, and so on, and it would find visual patterns in the images that consistently correlate with particular labels.

Modeled loosely on the human brain, a neural net consists of thousands or even millions of simple processing nodes that are densely interconnected. Most of today’s neural nets are organized into layers of nodes, and they’re “feed-forward,” meaning that data moves through them in only one direction. An individual node might be connected to several nodes in the layer beneath it, from which it receives data, and several nodes in the layer above it, to which it sends data.

To each of its incoming connections, a node will assign a number known as a “weight.” When the network is active, the node receives a different data item — a different number — over each of its connections and multiplies it by the associated weight. It then adds the resulting products together, yielding a single number. If that number is below a threshold value, the node passes no data to the next layer. If the number exceeds the threshold value, the node “fires,” which in today’s neural nets generally means sending the number — the sum of the weighted inputs — along all its outgoing connections.

When a neural net is being trained, all of its weights and thresholds are initially set to random values. Training data is fed to the bottom layer — the input layer — and it passes through the succeeding layers, getting multiplied and added together in complex ways, until it finally arrives, radically transformed, at the output layer. During training, the weights and thresholds are continually adjusted until training data with the same labels consistently yield similar outputs.

Minds and machines

The neural nets described by McCullough and Pitts in 1944 had thresholds and weights, but they weren’t arranged into layers, and the researchers didn’t specify any training mechanism. What McCullough and Pitts showed was that a neural net could, in principle, compute any function that a digital computer could. The result was more neuroscience than computer science: The point was to suggest that the human brain could be thought of as a computing device.

Neural nets continue to be a valuable tool for neuroscientific research. For instance, particularnetwork layouts or rules for adjusting weights and thresholds have reproduced observed features of human neuroanatomy and cognition, an indication that they capture something about how the brain processes information.

The first trainable neural network, the Perceptron, was demonstrated by the Cornell University psychologist Frank Rosenblatt in 1957. The Perceptron’s design was much like that of the modern neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched between input and output layers.

Perceptrons were an active area of research in both psychology and the fledgling discipline of computer science until 1959, when Minsky and Papert published a book titled “Perceptrons,” which demonstrated that executing certain fairly common computations on Perceptrons would be impractically time consuming.

“Of course, all of these limitations kind of disappear if you take machinery that is a little more complicated — like, two layers,” Poggio says. But at the time, the book had a chilling effect on neural-net research.

“You have to put these things in historical context,” Poggio says. “They were arguing for programming — for languages like Lisp. Not many years before, people were still using analog computers. It was not clear at all at the time that programming was the way to go. I think they went a little bit overboard, but as usual, it’s not black and white. If you think of this as this competition between analog computing and digital computing, they fought for what at the time was the right thing.”


By the 1980s, however, researchers had developed algorithms for modifying neural nets’ weights and thresholds that were efficient enough for networks with more than one layer, removing many of the limitations identified by Minsky and Papert. The field enjoyed a renaissance.

But intellectually, there’s something unsatisfying about neural nets. Enough training may revise a network’s settings to the point that it can usefully classify data, but what do those settings mean? What image features is an object recognizer looking at, and how does it piece them together into the distinctive visual signatures of cars, houses, and coffee cups? Looking at the weights of individual connections won’t answer that question.

In recent years, computer scientists have begun to come up with ingenious methods fordeducing the analytic strategies adopted by neural nets. But in the 1980s, the networks’ strategies were indecipherable. So around the turn of the century, neural networks were supplanted by support vector machines, an alternative approach to machine learning that’s based on some very clean and elegant mathematics.

The recent resurgence in neural networks — the deep-learning revolution — comes courtesy of the computer-game industry. The complex imagery and rapid pace of today’s video games require hardware that can keep up, and the result has been the graphics processing unit (GPU), which packs thousands of relatively simple processing cores on a single chip. It didn’t take long for researchers to realize that the architecture of a GPU is remarkably like that of a neural net.

Modern GPUs enabled the one-layer networks of the 1960s and the two- to three-layer networks of the 1980s to blossom into the 10-, 15-, even 50-layer networks of today. That’s what the “deep” in “deep learning” refers to — the depth of the network’s layers. And currently, deep learning is responsible for the best-performing systems in almost every area of artificial-intelligence research.

Under the hood

The networks’ opacity is still unsettling to theorists, but there’s headway on that front, too. In addition to directing the Center for Brains, Minds, and Machines (CBMM), Poggio leads the center’s research program in Theoretical Frameworks for Intelligence. Recently, Poggio and his CBMM colleagues have released a three-part theoretical study of neural networks.

The first part, which was published last month in the International Journal of Automation and Computing, addresses the range of computations that deep-learning networks can execute and when deep networks offer advantages over shallower ones. Parts two and three, which have been released as CBMM technical reports, address the problems of global optimization, or guaranteeing that a network has found the settings that best accord with its training data, and overfitting, or cases in which the network becomes so attuned to the specifics of its training data that it fails to generalize to other instances of the same categories.