
Scala Collections: A Group of groupBy() Examples

Scala provides a rich Collections API. Let's look at the useful groupBy() function.

What does groupBy() do? It takes a collection, assesses each item in that collection against a discriminator function, and returns a Map data structure. Each key in the returned map is a distinct result of the discriminator function, and the key's corresponding value is another collection which contains all elements of the original one that evaluate the same way against the discriminator function.

So, for example, here is a collection of Strings:
val sports = Seq("baseball", "ice hockey", "football", "basketball", "110m hurdles", "field hockey")

Running it through the Scala interpreter produces this output showing our value's definition:
sports: Seq[String] = List(baseball, ice hockey, football, basketball, 110m hurdles, field hockey)

We can group those sports names by, say, their first letter. To do so, we need a discriminator function that takes each element and returns the first character. For example:
sports.groupBy(_.charAt(0))

Running that in the interpreter shows the result:
res0: scala.collection.immutable.Map[Char,Seq[String]] = Map(b -> List(baseball, basketball), 1 -> List(110m hurdles), i -> List(ice hockey), f -> List(football, field hockey))

As you can see, the result is a Map with four key-value pairs. The keys are the letters b, i, and f and the digit 1. All of the sports names that begin with "b" are grouped into a new List, and so on for the other keys.
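
Once the Map exists, the groups can be read back out by key. The following is a small sketch of that; the names byInitial, bSports, zSports and zOrEmpty are made up for this illustration and are not part of the original example.

```scala
val sports = Seq("baseball", "ice hockey", "football", "basketball", "110m hurdles", "field hockey")
val byInitial = sports.groupBy(_.charAt(0))

// Direct lookup: throws NoSuchElementException if the key is absent.
val bSports = byInitial('b')   // List(baseball, basketball)

// Safer lookups for a key that may not be present in the Map.
val zSports = byInitial.get('z')                           // None
val zOrEmpty = byInitial.getOrElse('z', Seq.empty[String]) // an empty Seq
```

Using get() or getOrElse() is usually the safer choice when you cannot be sure that every possible key turned up in the data.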

In the above case, the discriminator function produced a key that was of type Char, the character in the 0th position of each String. Here's another example, one that produces a Boolean type for the keys in the Map:
sports.groupBy(_.contains("ball"))

In the above, contains() is a String method that returns true if "ball" appears in the name of the sport and false otherwise. We would expect at most two entries in our Map, one with true as the key and one with false as the key. When we check it in the interpreter, we get:
res1: scala.collection.immutable.Map[Boolean,Seq[String]] = Map(false -> List(ice hockey, 110m hurdles, field hockey), true -> List(baseball, football, basketball))

In this case, groupBy() has partitioned the original collection into two new collections: the key true maps to List(baseball, football, basketball), and the key false maps to the non-ball sports, List(ice hockey, 110m hurdles, field hockey).
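
The "at most two entries" wording matters: a key only appears in the Map if at least one element produced it. Here is a quick sketch using a made-up collection (ballSports is not from the original example) in which every element contains "ball":

```scala
val ballSports = Seq("baseball", "football", "basketball")
val grouped = ballSports.groupBy(_.contains("ball"))

// Every element evaluated to true, so the Map has a single entry
// and there is no key for false at all.
val hasFalseKey = grouped.contains(false)   // false
```

So code that reads grouped(false) directly would throw here; a lookup with getOrElse() avoids that.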

Let's switch to numeric values instead of Chars, Strings and Booleans. The groupBy() principles are the same. Here is a new collection of Integers:
val s1 = List(1,3,5,7,9)

If our code needs to treat each element in the collection differently, depending on its remainder when divided by 3, we'd write the following:
s1.groupBy(_ % 3)
res2: scala.collection.immutable.Map[Int,List[Int]] = Map(2 -> List(5), 1 -> List(1, 7), 0 -> List(3, 9))

The Map produced by groupBy() has three pairs. The key 0 maps to the collection of all elements that are evenly divisible by 3. Notice that the value mapped to the key 2 is still a List, still a collection, even though it has only one element in it.
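
Because every value in the result is itself a collection, a common next step is to summarize each group. The sketch below is one way to do that; byRemainder, counts and sums are names invented for this illustration.

```scala
val s1 = List(1, 3, 5, 7, 9)
val byRemainder = s1.groupBy(_ % 3)

// Replace each group with its size.
val counts = byRemainder.map { case (remainder, values) => remainder -> values.size }
// contains 0 -> 2, 1 -> 2, 2 -> 1

// Replace each group with the sum of its elements.
val sums = byRemainder.map { case (remainder, values) => remainder -> values.sum }
// contains 0 -> 12, 1 -> 8, 2 -> 5
```

The order in which a Map prints its entries is not something to rely on, so the comments list the contents rather than the exact interpreter output.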

We have seen groupBy() with Strings and Ints in the collections, and producing keys that can be Int, Char, even Boolean. Isn't the groupBy() function flexible? And powerful?
More formally, according to the Scaladocs API, it has this signature:
def groupBy[K](f: (A) ⇒ K): immutable.Map[K, Seq[A]]

In that formal definition, K is the type of the keys in the map, as produced by the discriminator function; f is the function that determines into which collection each item of the original collection will be placed; and the return type is an immutable Map with keys of type K and values that are collections of the original elements (of type A).
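
To make the roles of A, K and f more concrete, here is a hand-written function that behaves like groupBy() for a Seq. It is only a sketch for illustration; groupByManually is a hypothetical name, and this is not how the standard library actually implements groupBy().

```scala
// Hypothetical helper: groups items by the key that f computes for each one.
def groupByManually[A, K](items: Seq[A])(f: A => K): Map[K, Seq[A]] =
  items.foldLeft(Map.empty[K, Seq[A]]) { (groups, item) =>
    val key = f(item)
    // Append the item to the group for its key, starting a new group if needed.
    groups.updated(key, groups.getOrElse(key, Seq.empty[A]) :+ item)
  }

val manual = groupByManually(Seq(1, 3, 5, 7, 9))(_ % 3)
// manual(1) is List(1, 7)
```

Each element is passed through f exactly once, and elements keep their original relative order within each group.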

Let's end with a couple more interesting examples of groupBy(). To date, our examples have used some pretty primitive data types. So let's define a more interesting type, and create a collection of objects of this type.

class Point(val x: Int, val y: Int)
val sp = Seq(new Point(1,1), new Point(1,2), new Point(2,2), new Point(2,1))

The resulting value definition looks like this (I have edited out some of the gory details to improve readability), showing that I have a collection of four Point objects:
sp: Seq[Point] = List(Point@1b80cb0, Point@13002da, Point@34f910, Point@120344c)

Now we can group them by the value of one of the members of the object. In this case, the discriminator function simply names the member. Let's group the Points by their x value, to partition the collection by their location on the x axis:
sp.groupBy(_.x)

The result is a Map with two key-value pairs, partitioning the original collection into the Points with x = 1 and those with x = 2:
res3: scala.collection.immutable.Map[Int,Seq[Point]] = Map(2 -> List(Point@34f910, Point@120344c), 1 -> List(Point@1b80cb0, Point@13002da))
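
The Point@... strings in that output are the default toString() of a plain class, which makes the groups hard to read. One possible variation, shown here as a sketch, is to declare Point as a case class so that each object prints its field values. The names points and byX are invented for this illustration.

```scala
// A case class gets a readable toString (and equals/hashCode) for free.
case class Point(x: Int, y: Int)

val points = Seq(Point(1, 1), Point(1, 2), Point(2, 2), Point(2, 1))
val byX = points.groupBy(_.x)
// byX(1) is List(Point(1,1), Point(1,2))
// byX(2) is List(Point(2,2), Point(2,1))
```

The grouping itself is unchanged; only the way the Points are displayed differs.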

One final example. So far, all our discriminator functions have been pretty simple, so let's do something a little more interesting. Let's group our original sports collection into sports that use balls, sports that use hockey sticks, and a catch-all group of other sports. One way to do so is to create a discriminator function that does a little pattern-matching:
sports.groupBy {
  case sport if sport.contains("ball") => "Balls"
  case sport if sport.contains("hockey") => "Sticks"
  case _ => "Other"
}


We expect the three ball sports in the original collection to be mapped to the key "Balls", the two hockey sports to be mapped to the key "Sticks" and everything else will map to the key "Other". And that is exactly what groupBy() gives us:
res4: scala.collection.immutable.Map[String,Seq[String]] = Map(Sticks -> List(ice hockey, field hockey), Balls -> List(baseball, football, basketball), Other -> List(110m hurdles))
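
One thing worth noticing about a pattern-matching discriminator is that the cases are tried in order, and the first one that matches wins. The sketch below pulls the same logic into a named function and runs it over a slightly different list; equipment, moreSports and byEquipment are names made up for this illustration, and "ball hockey" was added only to show the ordering.

```scala
// Hypothetical named version of the discriminator function.
def equipment(sport: String): String = sport match {
  case s if s.contains("ball")   => "Balls"
  case s if s.contains("hockey") => "Sticks"
  case _                         => "Other"
}

val moreSports = Seq("baseball", "ice hockey", "ball hockey", "110m hurdles")
val byEquipment = moreSports.groupBy(equipment)
// "ball hockey" matches the first case, so it lands under "Balls", not "Sticks".
```

If hockey should take priority, swapping the first two cases would be enough.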
