Posted by manning_pubs
on December 4, 2013 at 2:05 AM PST
Streams vs. Collections: What’s the difference? from Java 8 Lambdas in Action + 45% savings!
From Java 8 Lambdas in Action by Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft. Enter promo code lambdas8jn at manning.com to save 45%.
One of the new features of Java 8 is the Stream API. By processing a dataset as a stream, you can do on-demand computations on part of a dataset, which can reduce latency and improve overall performance.
This article, based on chapter 1 of Java 8 Lambdas in Action, explains the conceptual differences between Streams and Collections. A detailed exploration of how to use Streams, with many code examples and quizzes, will be available soon in chapters 4 and 5 of the MEAP (Manning Early Access Program).
Streams vs. Collections: What’s the difference?
When you’re working with large or changing datasets, you may not want to process the entire dataset as a collection. Let’s start with a visual metaphor. Consider a movie stored on a DVD. This is a Collection (of bytes or of frames; it doesn’t matter which here) because it contains the whole data structure. Now consider watching the same video when it is being streamed over the internet. This is now a Stream (of bytes or frames). The streaming video player only needs to have downloaded a few frames in advance of where the user is watching, so it can start displaying values from the beginning of the Stream before most of the values in the Stream have even been computed. Note particularly that the video player may lack the memory to buffer the whole Stream as a Collection, and the startup time would be appalling if we had to wait for the final frame to arrive before we could start showing the video. We might choose, for video-player implementation reasons, to buffer a part of a Stream into a Collection, but this is distinct from the conceptual difference.
At the basic level, the difference between Collections and Streams has to do with when things are computed. A Collection is an in-memory data structure, which holds all the values that the data structure currently has—every element in the Collection has to be computed before it can be added to the Collection. (You can add things to and remove them from the Collection, but at each moment in time, every element in the Collection is stored in memory; elements have to be computed before becoming part of the Collection).
A Stream is a conceptually fixed data structure in which elements are computed on demand. This gives rise to significant programming benefits. The idea is that a user will extract only the values they require from a Stream, and these elements are produced, invisibly to the user, only as and when required. This is a form of producer-consumer relationship. Another way to look at it is that a Stream is like a lazily constructed Collection: values are computed when they are solicited by a consumer (in management speak, this is demand-driven or even just-in-time manufacturing). In contrast, a Collection is eagerly constructed (supplier-driven: fill your warehouse before you start selling, like a Christmas novelty that has a limited life).
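The contrast between eager and lazy construction can be made concrete with a small sketch. The class name LazyVersusEager, the square helper, and the call counter are all hypothetical, introduced only to make the number of computations visible; they are not taken from the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyVersusEager {

    // Hypothetical stand-in for an expensive computation; counts how often it runs.
    static int square(int n, AtomicInteger calls) {
        calls.incrementAndGet();
        return n * n;
    }

    // Eager: every element is computed before it is added to the Collection.
    static List<Integer> eagerSquares(int size, AtomicInteger calls) {
        List<Integer> squares = new ArrayList<>();
        for (int n = 1; n <= size; n++) {
            squares.add(square(n, calls));
        }
        return squares;
    }

    // Lazy: the Stream describes an unbounded sequence, but only the
    // elements the consumer asks for are ever computed.
    static List<Integer> firstSquares(int wanted, AtomicInteger calls) {
        return Stream.iterate(1, n -> n + 1)
                .map(n -> square(n, calls))
                .limit(wanted)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AtomicInteger eagerCalls = new AtomicInteger();
        List<Integer> all = eagerSquares(1000, eagerCalls);
        System.out.println(all.subList(0, 3) + " after " + eagerCalls.get() + " computations");

        AtomicInteger lazyCalls = new AtomicInteger();
        List<Integer> few = firstSquares(3, lazyCalls);
        System.out.println(few + " after " + lazyCalls.get() + " computations");
    }
}
```

Both calls yield the same first three squares, but the eager version pays for 1000 computations up front while the lazy, sequential pipeline performs only the three that were requested.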
Another example is a browser internet search. When searching for a phrase with many matches on Google or in an online shop, instead of having to wait for the whole Collection of results, along with their photographs, to be downloaded, you get a Stream whose elements are the best 10 or 20 matches, along with a button to click for the next 10 or 20. When you, as a consumer, click for the next 10, the supplier computes them on demand and returns them to your browser for display.
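As a rough sketch of the search-results idea, the following pulls "pages" of results from a lazily generated sequence. The PagedResults class, the matches source, and the "match #n" strings are hypothetical placeholders, not a real search API; a real supplier would compute actual results.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

class PagedResults {

    // Hypothetical result source: a match is only built when the consumer pulls it.
    static Iterator<String> matches() {
        return Stream.iterate(1, n -> n + 1)
                .map(n -> "match #" + n)
                .iterator();
    }

    // The "next 10" button: pulls one more page from the source on demand.
    static List<String> nextPage(Iterator<String> source, int pageSize) {
        List<String> page = new ArrayList<>(pageSize);
        while (page.size() < pageSize && source.hasNext()) {
            page.add(source.next());
        }
        return page;
    }

    public static void main(String[] args) {
        Iterator<String> results = matches();
        System.out.println(nextPage(results, 10)); // first page
        System.out.println(nextPage(results, 10)); // next page, computed only now
    }
}
```

The source is conceptually unbounded, yet the program terminates because each page asks for only as many elements as it displays.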
Streams and collections philosophically
For readers who like philosophical viewpoints, you can see a Stream as a set of values spread out in time, which repeatedly appear at the same point: the parameter of the function that you pass to a Stream-processing method (for example, filter). In contrast, a Collection is a set of values spread out in space (here computer memory), which all exist at a single point in time, and whose members you access using an iterator inside a foreach loop.
External vs. internal iteration
Here’s something else to consider when comparing Streams and Collections. Using the Collection interface requires iteration to be done by the user (for example, using the enhanced for loop, often called foreach); this is called external iteration. The Streams library, by contrast, uses internal iteration: it does the iteration for you and takes care of storing the resulting stream value somewhere; you merely provide a function saying what’s to be done.
Let’s use an analogy to understand the difference and benefits of internal iteration. Let’s say you are talking to your two-year-old daughter Sofia and want her to put her toys away:
You: “Sofia, let’s put the toys away. Is there a toy on the ground?”
Sofia: “Yes, the ball.”
You: “Okay, put the ball in the box. Is there something else?”
Sofia: “Yes, there’s my doll.”
You: “Okay, put the doll in the box. Is there something else?”
Sofia: “Yes, there’s my book.”
You: “Okay, put the book in the box. Is there something else?”
Sofia: “No, nothing else.”
You: “Fine, we’re finished.”
This is exactly what you do every day with your Java collections. You iterate the collection externally, explicitly pulling out and processing the items one by one. It would be far better if you could tell your daughter Sofia just: “Put all the toys that are on the ground inside the box.”
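The two ways of talking to Sofia map onto Java roughly as follows. This is a minimal sketch: the TidyUp class and its Toy type are hypothetical, invented here to mirror the analogy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TidyUp {

    // Hypothetical model of a toy: a name and whether it is lying on the ground.
    static final class Toy {
        final String name;
        final boolean onGround;

        Toy(String name, boolean onGround) {
            this.name = name;
            this.onGround = onGround;
        }
    }

    // External iteration: the caller asks about each toy, one by one.
    static List<String> tidyExternally(List<Toy> toys) {
        List<String> box = new ArrayList<>();
        for (Toy toy : toys) {          // "Is there something else?"
            if (toy.onGround) {
                box.add(toy.name);      // "Okay, put it in the box."
            }
        }
        return box;
    }

    // Internal iteration: "Put all the toys that are on the ground inside the box."
    static List<String> tidyInternally(List<Toy> toys) {
        return toys.stream()
                .filter(toy -> toy.onGround)
                .map(toy -> toy.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Toy> toys = Arrays.asList(
                new Toy("ball", true),
                new Toy("teddy", false),
                new Toy("doll", true),
                new Toy("book", true));
        System.out.println(tidyExternally(toys));
        System.out.println(tidyInternally(toys));
    }
}
```

Both methods fill the box with the same toys; the difference is who controls the loop.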
There are two other reasons why internal iteration is preferable: first, Sofia could choose to take the doll with one hand and the ball with the other at the same time, and second, she could decide to take the objects closest to the box first and then the others. In the same way, with internal iteration the processing of items can be transparently done in parallel or in a different, better-optimized order. These optimizations are very difficult if you iterate the collection externally, as you’re used to doing in Java (and in imperative programming in general).
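To illustrate the point about parallelism, here is a small sketch in which the same pipeline runs sequentially or in parallel, differing only by a call to parallel(). The ParallelSum class and its method names are hypothetical. Whether the parallel version is actually faster depends on the workload and the hardware, so treat this as a demonstration of the programming model, not a benchmark.

```java
import java.util.stream.LongStream;

class ParallelSum {

    // Sequential pipeline: the library iterates over 1..n on the calling thread.
    static long sumOfSquaresSequential(long n) {
        return LongStream.rangeClosed(1, n)
                .map(x -> x * x)
                .sum();
    }

    // Same description of the work; the library may split it across threads.
    static long sumOfSquaresParallel(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(x -> x * x)
                .sum();
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println(sumOfSquaresSequential(n));
        System.out.println(sumOfSquaresParallel(n));
    }
}
```

Because the lambda has no side effects and addition of long values is associative, both versions return the same result regardless of how the library partitions the range. An external foreach loop would need explicit threads or tasks to achieve the same effect.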
This may seem like nit-picking, but it’s much of the raison d’être of Java 8’s introduction of Streams: the internal iteration in the Streams library can automatically choose a data representation and implementation of parallelism to match your hardware. By contrast, once you’ve chosen external iteration by writing foreach, you have essentially committed to managing any parallelism yourself. (Self-managing in practice means either “one fine day we will parallelize this” or “starting the long and arduous battle involving tasks and synchronized”.) Hence, Java 8 needed an interface like Collection but without iterators, ergo Streams! Figure 1 illustrates the difference between a Stream (internal iteration) and a Collection (external iteration).
Figure 1: Internal vs. external iteration
Dealing with large datasets that take a long time to load, or live data that must be processed on demand, can create problems for the traditional practice of creating and iterating through a Java collection. The new Java 8 Streams API makes it possible to design more appropriate strategies for approaching these increasingly common challenges.