Learning Scala

Wednesday, October 14, 2015

Here is my experience at learning Scala from scratch, to help the Aalto students with an alternative implementation of Pijul.

Important note: our initial project of writing Pijul in Scala to make it portable was abandoned for several reasons. This page is kept here because no equivalents were found by the author on the internet at the time of this writing.

Hello, World

A tricky part is to learn how to compile Scala programs.

Here is a minimal example of a working program, taken from the Scala website.

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

I used emacs’ scala-mode, which recognizes file extension .scala. Then, compile it using scalac test.scala. After several seconds, this produces several files named after the class (and not after the file name): in my case, “HelloWorld.class” and “HelloWorld$.class”.

Then, run your program using scala HelloWorld.

Constructors in Scala

Here is an alternative “main-less” implementation, also taken from the official website, of the same program:

object HelloWorld extends App {
  println("Hello, world!")
}

We just added “extends App”, which had the effect of making object HelloWorld an instance of a class inherited from App. We defined that class at the same time as instantiating the object. In other words, App already contains a main function, and then anything written in the object/class definition belongs to the class constructor.
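To see the “body is the constructor” rule outside of App, here is a minimal sketch (the names ConstructorDemo, Greeter and log are made up for this illustration): every statement written directly in the class body runs each time an instance is created.

```scala
object ConstructorDemo {
  // Records what the constructor does, so that it can be inspected afterwards.
  val log = scala.collection.mutable.ArrayBuffer.empty[String]

  class Greeter(name: String) {
    // This statement is not inside any method: it is part of the primary
    // constructor, and runs on every `new Greeter(...)`.
    log += s"constructing $name"

    def greet(): String = s"Hello, $name!"
  }
}
```

After new ConstructorDemo.Greeter("world"), the log contains one entry, even though no method of Greeter has been called yet.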

Writing C bindings

A natural way to dislike a language is to write C bindings for it. Things are never documented the way I’d like them to be. But in this case, I thought I had little choice but to start my journey into Scala with C bindings. Then I discovered that I could use the Java bindings for LMDB, but I was a little afraid after looking at their overly complicated code for such an elegant API.

Anyway, I want to be able in the future to translate cool optimizations from the Ocaml versions to Pijul, and some need C bindings (using a version of mdb_cursor_get that didn’t allocate and copy strings took Pijul from twice as slow as Mercurial to faster than Git on our first benchmarks).

A first example

First, let’s start with a fully working example, taken from the IBM website, and adapted to Linux and C.

Calling C from Scala means calling C from Java bytecode, i.e. from the JVM. In other words, we need to create a shared library using the JVM internals, load that library at runtime, and then write a Scala wrapper around it. Here is the C code, in a file called “Binding.c”:

#include <ctype.h>
#include <string.h>
#include <jni.h>

// Convert a NUL-terminated string to uppercase, in place
void uppercase(char* str) {
    size_t n = strlen(str);
    for (size_t i = 0; i < n; i++) {
        str[i] = toupper((unsigned char) str[i]);
    }
}

JNIEXPORT jint JNICALL Java_Binding_intMethod(JNIEnv* env, jobject obj, jint num) {
    return num * num;
}

JNIEXPORT jboolean JNICALL Java_Binding_booleanMethod(JNIEnv* env, jobject obj, jboolean boolean) {
    return !boolean;
}

JNIEXPORT jstring JNICALL Java_Binding_stringMethod(JNIEnv* env, jobject obj, jstring string) {
    const char* str = (*env)->GetStringUTFChars(env, string, 0);
    char cap[128];
    // Bounded copy: cap holds at most 127 bytes plus the terminating NUL
    strncpy(cap, str, sizeof(cap) - 1);
    cap[sizeof(cap) - 1] = '\0';
    (*env)->ReleaseStringUTFChars(env, string, str);
    uppercase(cap);
    return (*env)->NewStringUTF(env, cap);
}

JNIEXPORT jint JNICALL Java_Binding_intArrayMethod(JNIEnv* env, jobject obj, jintArray array) {
    int sum = 0;
    jsize len = (*env)->GetArrayLength(env, array);
    jint* body = (*env)->GetIntArrayElements(env, array, 0);
    for (int i = 0; i < len; i++) {
        sum += body[i];
    }
    (*env)->ReleaseIntArrayElements(env, array, body, 0);
    return sum;
}

I compiled this with:

gcc -shared -fPIC -I/usr/lib/jvm/java-7-openjdk-amd64/include Binding.c -o libBinding.so

And then wrote some Scala to call the C functions:

class Binding {
  // Loads libBinding.so, looked up in java.library.path
  System.loadLibrary("Binding")
  @native def intMethod(n: Int): Int
  @native def booleanMethod(b: Boolean): Boolean
  @native def stringMethod(s: String): String
  @native def intArrayMethod(a: Array[Int]): Int
}

object Test extends App {
  val sample = new Binding
  val square = sample.intMethod(5)
  val bool = sample.booleanMethod(true)
  val text = sample.stringMethod("java")
  val sum = sample.intArrayMethod(Array(1, 1, 2, 3, 5, 8, 13))

  println(s"intMethod: $square")
  println(s"booleanMethod: $bool")
  println(s"stringMethod: $text")
  println(s"intArrayMethod: $sum")
}

Then, compile your scala code using:

scalac Binding.scala

And finally run it using:

scala -Djava.library.path=. -cp . Test

Passing C structs to Scala

This was trickier than it seemed at first, because I had not noticed the jobject obj in the function calls above, which is supposed to be a pointer to the current object (“this”). The real way to pass pointers to Scala is to cast them to jlong. Hiding the naked pointer is then the responsibility of the Scala code. So far, I’ve not found any solution where classes would still expose the pointers to other classes that really need them (in the LMDB bindings, pointers to env should be accessible from the txn class, but not from other classes). Of course I’m not super knowledgeable with Scala classes right now, but if I were, I would definitely try to “hide” the equality between T below and Long, so that calling gettest(t+2) would not pass typechecking.

In Ocaml or Haskell, this would be quite easy, I would just write type t or data T, respectively. But here this gives a java runtime error, which is quite surprising to me.

#include <stdlib.h>
#include <jni.h>

struct test {
  jint a;
};

JNIEXPORT jlong JNICALL Java_Binding_newtest(JNIEnv* env,jobject obj) {
  struct test* t=malloc(sizeof(struct test));
  return (jlong) t;
}
JNIEXPORT jint JNICALL Java_Binding_gettest(JNIEnv* env,jobject obj,jlong l) {
  struct test*t=(struct test*) l;
  return t->a;
}
JNIEXPORT void JNICALL Java_Binding_storetest(JNIEnv* env,jobject obj,jlong l,jint x) {
  struct test* t=(struct test*) l;
  t->a=x;
}

Here is the Scala part:

class Binding {
  type T=Long
  @native def newtest(): T
  @native def gettest(t:T): Int
  @native def storetest(t:T,i:Int): Unit
}
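One possible way to hide the equality between T and Long is to wrap the pointer in a small class whose constructor and field are private to the enclosing object. This is only a sketch under assumptions: the object name Handles is made up, and the three functions are pure-Scala stand-ins for the @native methods (a map plays the role of the C heap), so that the example runs without any C library.

```scala
object Handles {
  // Hypothetical opaque handle: both the constructor and the raw value are
  // private to Handles, so outside code can neither forge nor inspect a T.
  final class T private[Handles] (private[Handles] val raw: Long)

  // Stand-ins for the @native methods: a map plays the role of the C heap.
  private val heap = scala.collection.mutable.Map.empty[Long, Int]
  private var next = 0L

  def newtest(): T = { next += 1; heap(next) = 0; new T(next) }
  def gettest(t: T): Int = heap(t.raw)
  def storetest(t: T, i: Int): Unit = { heap(t.raw) = i }
}
```

With this, Handles.gettest(t + 2) is rejected by the compiler, since T has no + method. Other classes that really need the pointer can be given access by placing them inside the same enclosing object or package as the qualifier of private[...]. Declaring T as a value class (extends AnyVal) should avoid allocating the wrapper, subject to the usual restrictions on value classes.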

Playing nice with the Java Garbage Collector

Sometimes, a value from Java (a pointer to an array, for instance) might be moved by the GC during a call in one of our C functions.

This is handled by the JNI in the following way: whenever we call a function to get access to the underlying structure (the actual array in memory, for instance), it “pins” that memory so that the GC can neither move it nor collect it. But then, as soon as we’re done using that pointer, we need to release it, using functions such as (*env)->ReleaseIntArrayElements(env,array, body, 0); (in our code above).

First steps at binding LMDB

Finding information on Java and the JNI online is not super easy, essentially because Java adepts like to obfuscate things, and also many people do not really seem to understand what a compiler and a linker do (that may be due to using only a bytecode language, and all-integrated editors).

Here is my first attempt at binding LMDB:

Scala side:

class MdbEnv {
  System.loadLibrary("lmdb")
  System.loadLibrary("Lmdbstubs")
  @native def cCreate(): Long
  @native def cOpen(env:Long,path:String,flags:Int,mode:Int): Boolean

  private val env:Long=cCreate()
  def open(path:String,flags:Int,mode:Int):Boolean= {
    cOpen(env,path,flags,mode)
  }
}

object TestLmdb extends App {
  val env = new MdbEnv
  val test=env.open("/home/pe/Recherche/pijul/pijul/.pijul/pristine",0,488)
  println(s"env.open: $test")
}

C side:

#include <jni.h>
#include <lmdb.h>

JNIEXPORT jlong JNICALL Java_MdbEnv_cCreate(JNIEnv* jenv,jobject jobj) {
  MDB_env *env;
  if(mdb_env_create(&env))
    return 0;
  else
    return (jlong) env;
}

JNIEXPORT jboolean JNICALL Java_MdbEnv_cOpen(JNIEnv* jenv,jobject jobj,jlong env,jstring j_path,jint flags,jint mode) {
  const char* c_path = (*jenv)->GetStringUTFChars(jenv,j_path,0);
  int ret=mdb_env_open((MDB_env*)env,c_path,flags,mode);
  (*jenv)->ReleaseStringUTFChars(jenv,j_path,c_path);
  return (ret == 0);
}

Compiling is not that straightforward, because Scala and Java do not seem to have an equivalent of the linker’s -l flag. What I ended up doing was to first load the shared objects from Scala (notice the System.loadLibrary("lmdb") line in the example above), then compile the C part with -llmdb (a shared library has to record its own dependencies at link time). Here is an excerpt from my Makefile:

libLmdbstubs.so:lmdbstubs.c
	gcc -shared -fPIC -I/usr/lib/jvm/java-7-openjdk-amd64/include -o libLmdbstubs.so lmdbstubs.c -llmdb

And then “link” with the library path:

Lmdb:Lmdb.scala libLmdbstubs.so
	scalac Lmdb.scala
	scala -Djava.library.path=".:/usr/lib/x86_64-linux-gnu/" -cp . TestLmdb

Throwing exceptions from C code

Of course this is not really suitable as a first step to learning bindings (and I’ve not learnt it in this order), but it is really necessary to write correct versions of the other examples in this guide. The code to declare and use exceptions is really short. I haven’t yet used more complicated exceptions than just those carrying a string message.

Scala side:

class NotFound(msg:String) extends Exception(msg)

C side:

jclass class=(*jenv)->FindClass(jenv,"NotFound");
(*jenv)->ThrowNew(jenv,class,"not found");

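By the way, caching the class in a static variable (as you would do in OCaml) doesn’t work here: the jclass returned by FindClass is a local reference, which is only valid until the native function returns, so Java really wants to fetch the exception class every time (unless the reference is promoted with NewGlobalRef).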

Also, ThrowNew doesn’t halt the function (it is kind of obvious since it is not a C macro). Therefore, don’t forget to return just after the call to ThrowNew.


Java strings are not char*, they are something “more modern”, where chars are replaced by 16-bit values. I think commenting on these design choices would be superfluous, the bindings speak for themselves.

If you plan to use a C library only with “modified UTF-8” strings (whatever that means), for instance in LMDB where you only want to store Java strings, then working with strings is probably fine. Otherwise, byte arrays are the only solution.
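For what it is worth, “modified UTF-8” can be observed from Scala without any C. The sketch below relies on DataOutputStream.writeUTF, which is documented to use modified UTF-8 (to my knowledge the same encoding as GetStringUTFChars); the object and function names are made up for this illustration. The visible differences are that the NUL character takes two bytes, so that encoded strings never contain a zero byte, and that characters outside the 16-bit range are encoded as two surrogates of three bytes each.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

object ModifiedUtf8 {
  // Encode with writeUTF, then drop the 2-byte length prefix it adds.
  def modified(s: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    new DataOutputStream(bytes).writeUTF(s)
    bytes.toByteArray.drop(2)
  }

  // Standard UTF-8, for comparison.
  def standard(s: String): Array[Byte] = s.getBytes(StandardCharsets.UTF_8)

  // Space-separated hexadecimal dump of a byte array.
  def hex(b: Array[Byte]): String = b.map(x => f"${x & 0xff}%02x").mkString(" ")
}
```

For the two-character string made of 'a' followed by NUL, hex(standard(...)) gives 61 00 while hex(modified(...)) gives 61 c0 80. A C library that compares or stores these bytes will therefore not see standard UTF-8 for such strings.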

A simple warm-up exercise: write the binding to “mdb_put”, which does not need to allocate any string:

JNIEXPORT void JNICALL Java_Dbi_cPut(JNIEnv* jenv,jobject jobj,jlong txn,jint dbi,jbyteArray a,jbyteArray b,jint flags) {
  MDB_val va,vb;
  jbyte* pa=(*jenv)->GetByteArrayElements(jenv,a,NULL);
  jbyte* pb=(*jenv)->GetByteArrayElements(jenv,b,NULL);
  va.mv_data= pa;
  va.mv_size= (*jenv)->GetArrayLength(jenv,a);
  vb.mv_data= pb;
  vb.mv_size= (*jenv)->GetArrayLength(jenv,b);
  int ret=mdb_put((MDB_txn*)txn,(MDB_dbi) dbi,&va,&vb,(int) flags);
  // The arrays were only read: release them without copying anything back
  (*jenv)->ReleaseByteArrayElements(jenv,a,pa,JNI_ABORT);
  (*jenv)->ReleaseByteArrayElements(jenv,b,pb,JNI_ABORT);
  if(ret)failwith(jenv,"error in mdb_put");
}

And the Scala version:

@native def cPut(t:Long,dbi:Int,a:Array[Byte],b:Array[Byte],flags:Int):Unit

Then, “mdb_get” is trickier, since we need to allocate and copy bytes. Another option would have been to return an opaque pointer to MDB_val with just reader functions, and a “copy” from Scala (but that would have been one C heap allocation anyway, and we can’t use the Scala array libraries on it).

JNIEXPORT jbyteArray JNICALL Java_Dbi_cGet(JNIEnv* jenv,jobject jobj,jlong txn,jint dbi,jbyteArray a) {
  MDB_val va,vb;
  jbyte* pa=(*jenv)->GetByteArrayElements(jenv,a,NULL);
  va.mv_data= pa;
  va.mv_size= (*jenv)->GetArrayLength(jenv,a);
  int ret=mdb_get((MDB_txn*)txn,(MDB_dbi) dbi,&va,&vb);
  (*jenv)->ReleaseByteArrayElements(jenv,a,pa,JNI_ABORT);

  if(ret) {
    if(ret==MDB_NOTFOUND) {
      jclass class=(*jenv)->FindClass(jenv,"NotFound");
      (*jenv)->ThrowNew(jenv,class,"not found");
    } else {
      failwith(jenv,"error in mdb_get");
    }
    return NULL; // this value is never used by Java anyway.
  } else {
    jbyteArray arr=(*jenv)->NewByteArray(jenv,vb.mv_size);
    // Copy the value out of the LMDB map into the new Java array
    (*jenv)->SetByteArrayRegion(jenv,arr,0,vb.mv_size,(const jbyte*) vb.mv_data);
    return arr;
  }
}

And the Scala side:

@native def cGet(t:Long,dbi:Int,a:Array[Byte]):Array[Byte]

Allocating memory (and strings) from C, with standard finalizers

I just realized that memory to be garbage-collected by Java can be allocated as byte arrays, using NewByteArray instead of malloc. Then, another confusing issue is the last argument of GetByteArrayElements: it is an output parameter, and if it is not NULL, the JVM uses it to report whether it handed us a copy of the array or a direct pointer. Either way, a call to ReleaseByteArrayElements is needed to “commit the changes to that array back to Java” and release the buffer.

In the absence of proper documentation, I wonder when copies and commits are needed. Is there a problem with the Java memory layout? Does it store arrays as C arrays, or maybe as something else?

Note to self: when designing a language/a religion, try not to create mysteries, and don’t document them too much. That will keep people talking for years/centuries.

Custom finalizers

As far as I understand, custom finalizers can only be added from the Java side, by overriding the finalize() method. So, it seems that the best way to add a finalizer to a C struct is to declare a class like so:

In file test.scala:

class C {
  System.loadLibrary("teststubs")
  @native def cCreate(): Long
  @native def cFinalize(x:Long):Unit
  var ptr:Long=cCreate()
  override def finalize()=cFinalize(ptr)
}

object Test extends App {
  def f():String={
    var x=new C
    return "Hello, World"
  }
  println(f())
}

In file teststubs.c:

#include <jni.h>
#include <stdlib.h>
#include <stdio.h>

struct s{
  char* str;
};
JNIEXPORT jlong JNICALL Java_C_cCreate(JNIEnv* jenv,jobject jobj) {
  struct s*x=malloc(sizeof(struct s));
  return (jlong)x;
}

JNIEXPORT void JNICALL Java_C_cFinalize(JNIEnv* jenv,jobject jobj,jlong x) {
  struct s*s=(struct s*) x;
  free(s);
}

Then compile and run with:

gcc -shared -fPIC -I/usr/lib/jvm/java-7-openjdk-amd64/include teststubs.c -o libteststubs.so
scalac test.scala
scala -Djava.library.path=".:/usr/lib/x86_64-linux-gnu/" -cp . Test

Returning tuples

Many languages today have these collections of heterogeneous values. Of course, Java doesn’t.

This seems to be documented absolutely nowhere, and you need to figure it out by studying examples of uses of the JNI. In this particular case, we are looking up a class name, presumably in a big dictionary in the Java runtime. In order to know the Java class name for tuples of byteArrays, open the scala toplevel, and write:

scala> ("a".getBytes(),"b".getBytes()).getClass

Then, a mistake I made was to use GetMethodID with just “<init>” and then “()V”, which returned a misleading error starting with “No such method: ”. Actually, this is just because methods are looked up together with their argument types: in Java and Scala, it is fine to define several methods with the same name and different arguments. I’m not sure yet how to engineer the strings for these arguments (the (Ljava/lang/Object;Ljava/lang/Object;)V below), but the following (taken from various examples) seems to work:

    jclass tupclass = (*jenv)->FindClass(jenv,"scala/Tuple2");
    jmethodID tupcon = (*jenv)->GetMethodID(jenv,tupclass, "<init>", "(Ljava/lang/Object;Ljava/lang/Object;)V");
    jobject tuple = (*jenv)->NewObject(jenv,tupclass,tupcon,a,b);
    return tuple;
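The strings passed to GetMethodID are JVM method descriptors: the argument types between parentheses, followed by the return type, with L<class>; for objects, [ for arrays, and single letters for primitive types (I for int, J for long, B for byte, V for void). Type parameters are erased, which is why the constructor of Tuple2 takes two Objects. Rather than writing these strings by hand, one option is to let the JVM print them, as in this sketch (the object name Descriptors is made up; MethodType is part of the JDK since Java 7). The javap -s command shows the same strings for a compiled class.

```scala
import java.lang.invoke.MethodType

object Descriptors {
  // void <init>(Object, Object): the constructor of scala.Tuple2 after erasure
  val tuple2Init: String =
    MethodType
      .methodType(java.lang.Void.TYPE, classOf[Object], classOf[Object])
      .toMethodDescriptorString

  // byte[] cGet(long, int, byte[]): the shape of the cGet binding
  val cGet: String =
    MethodType
      .methodType(classOf[Array[Byte]], java.lang.Long.TYPE,
        java.lang.Integer.TYPE, classOf[Array[Byte]])
      .toMethodDescriptorString
}
```

Here Descriptors.tuple2Init is (Ljava/lang/Object;Ljava/lang/Object;)V, the string used in the C code, and Descriptors.cGet is (JI[B)[B.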

Back to some real Scala

Once the dirty low-level part is written, Scala is a kind of Ocaml + invocations, and with explicit types everywhere. Here is a pretty direct mapping.

Ocaml version:

let with_env path f=
  let env=Mdb.env_create () in
  try
    Mdb.env_set_mapsize env (10485760 lsl 7);
    Mdb.env_set_maxdbs env 9;
    let _=Mdb.reader_check env in
    Mdb.env_open env path 0 0o750;
    let x=f env in
    Mdb.env_close env;
    x
  with e->(Mdb.env_close env;raise e)

Scala version:

object TestLmdb extends App {
  def withEnv[A](path:String,f:(Env=>A)):A= {
    val env=new Env
    try {
      env.set_mapsize(10485760 << 7)
      env.open(path,0,488)
      val x=f(env)
      env.close()
      x
    } catch {
      case e:Exception => {
        env.close()
        throw e
      }
    }
  }
}

Help wanted: slow compiler

What I still don’t get is this:

Ocaml, full bindings to LMDB + full core pijul.

% time make pijul.cmx
cc -fPIC -Wall -O3 -c -o lmdb_stubs.o lmdb_stubs.c
ocamlmklib -o mdb lmdb_stubs.o mdb_constants.ml mdb.ml -llmdb -linkall
ocamlcp -P f  -c -w A -o pijul.cmi pijul.mli
ocamlfind ocamlopt  -package cryptokit,yojson -c -w A -o pijul.cmx pijul.ml
make pijul.cmx  1,08s user 0,02s system 96% cpu 1,136 total

Scala, partial bindings to LMDB and just a function that opens an environment:

 % time make
scalac Lmdb.scala
scala -Djava.library.path=".:/usr/lib/x86_64-linux-gnu/" -cp . TestLmdb
make  10,61s user 0,15s system 184% cpu 5,843 total

Pattern matching

This is just a syntactic point: in Scala, pattern matching on a list called l is done by:

l match {
case Nil=>{ /* empty list */ }
case h::s=>{ /* do something with the head and tail */ }
}
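As a complete, if trivial, instance of this syntax, here is a sketch of a function summing a list (ListDemo and sum are names chosen for this illustration). Note that match is an expression, so its result can be returned directly.

```scala
object ListDemo {
  // Sum of a list, by pattern matching on its shape.
  def sum(l: List[Int]): Int = l match {
    case Nil    => 0
    case h :: s => h + sum(s)
  }
}
```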

Local functions and variables

In Scala, local functions are declared with the same syntax as global ones, i.e. there is no “let…in”:

def f(): Unit = {
  def g(): Unit = { /* only visible inside f */ }
  g()
}


Call by name

Scala is strict by default, but has optional “call by name”, which is different from “laziness” in Ocaml and Haskell: in Scala, functions of type Unit=>A can be passed to another function either after or before being evaluated. Example (taken from StackOverflow):

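def something() = {
  println("calling something")
  1 // return value
}

def callByValue(x: Int) = {
  println("x1=" + x)
  println("x2=" + x)
}

def callByName(x: => Int) = {
  println("x1=" + x)
  println("x2=" + x)
}

callByValue(something()) // prints "calling something" once
callByName(something())  // prints "calling something" twice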

This is not real laziness, as x gets evaluated twice in callByName. Caching the value is certainly one option, but then this is just delayed evaluation, not “on demand” evaluation. It can still be useful to mimic Haskell’s undefined, though: throw exception undefined in the execution branch that uses the value.
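The caching option can be sketched with a lazy val inside the function, which seems to get quite close to “on demand” evaluation: the argument is evaluated at most once, and only if it is actually used. All names below are made up for this illustration, and the counter is only there to observe the number of evaluations.

```scala
object CallByNameDemo {
  // Counts how many times the argument expression is actually evaluated.
  var evaluations = 0
  def something(): Int = { evaluations += 1; 1 }

  // Plain call by name: x is re-evaluated at each use.
  def twice(x: => Int): Int = x + x

  // Call by name cached in a lazy val: evaluated at most once, on first use.
  def cached(x: => Int): Int = { lazy val v = x; v + v }

  // The argument is never used, hence never evaluated.
  def ignored(x: => Int): Int = 0
}
```

Calling twice(something()) increments the counter by two, cached(something()) by one, and ignored(something()) not at all.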

Unix file permissions

Unix support is equivalent to Java’s support: choosing Java was actually an interesting strategy, in that it forwards these low-level problems to other Java libraries. Of course, the drawback is, many Java libraries seem to have been written with weird drugs.

Here is how to get an integer representing a file’s permissions:

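  def perm(path:String):Option[Int]={
    import java.nio.file.attribute.PosixFilePermission._
    import java.nio.file.attribute._
    import java.nio.file._
    // u(i) is the permission corresponding to bit i of the Unix mode
    val u=Array(OTHERS_EXECUTE,OTHERS_WRITE,OTHERS_READ,
                GROUP_EXECUTE,GROUP_WRITE,GROUP_READ,
                OWNER_EXECUTE,OWNER_WRITE,OWNER_READ)
    try {
      val p=Paths.get(path)
      val perms=Files.getPosixFilePermissions(p)
      var per=0
      for(i<-0 to 8){
        if(perms.contains(u(i))) per=per | (1<<i)
      }
      Some(per)
    } catch { case e:java.io.IOException=>None }
  }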

Which basically means that the only way to get interoperability between Java and sane languages is to spend time converting back to the formats Java converted from. This is insane.

Compiling C and Scala on Windows

This is really tricky to figure out, but actually surprisingly easy to do. First, while MinGW and Cygwin are certainly useful tools for some things, my simple solution uses the Visual Studio compiler.

Here is my compilation line (in a file called comp.bat) for LMDB (with some edits to the files, if I remember correctly). I ran these in the 64-bit Visual Studio shell (without starting the giant monster GUI):

cl.exe /D_USRDLL /D_WINDLL midl.c mdb.c /link /DLL /OUT:lmdb.dll
lib.exe midl.obj mdb.obj -OUT:lmdb.lib

Then, I’ve copied lmdb.lib to the root of my Scala bindings, and done:

cl.exe /D_USRDLL /D_WINDLL /I"C:\Program Files (x86)\Java\jdk1.8.0_60\include" /I"C:\Program Files (x86)\Java\jdk1.8.0_60\include\win32" lmdbstubs.c lmdb.lib /link /DLL /OUT:Lmdbstubs.dll

It seems that the 32-bit version can be compiled using the 32-bit version of the shell instead of the 64-bit one.

Haskell’s typeclasses

I’m no expert of syntax nor ergonomics, so I won’t comment on these, but Scala has a mechanism very similar to Haskell’s typeclasses (an implicit/automatic version of Ocaml’s functors).

Here is an example.
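The original example is not reproduced here, so what follows is only a minimal sketch of the mechanism, with made-up names (ShowDemo, Show, display): the typeclass is a trait with a type parameter, instances are implicit values, and functions receive the instance through an implicit parameter that the compiler fills in.

```scala
object ShowDemo {
  // The "typeclass": a trait with one type parameter.
  trait Show[A] { def show(a: A): String }

  // Instances are implicit values.
  implicit val showInt: Show[Int] = new Show[Int] {
    def show(a: Int): String = a.toString
  }
  implicit val showBoolean: Show[Boolean] = new Show[Boolean] {
    def show(a: Boolean): String = if (a) "yes" else "no"
  }

  // An instance derived from another one, like `instance Show a => Show [a]`.
  implicit def showList[A](implicit s: Show[A]): Show[List[A]] =
    new Show[List[A]] {
      def show(l: List[A]): String = l.map(s.show).mkString("[", ", ", "]")
    }

  // The instance is looked up by the compiler from the implicit scope.
  def display[A](a: A)(implicit s: Show[A]): String = s.show(a)
}
```

With these definitions, display(3) and display(List(true, false)) both typecheck without naming any instance, while display("text") is rejected at compile time because no Show[String] is in scope.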

The way orphan instances are treated is pretty neat: the concept simply doesn’t exist here, since whenever two instances are in the scope, the compiler will explicitly ask to disambiguate.

Array[Byte], ArrayBuffer[Byte] and String

Scala inherits the design choices on Strings from Java, which are not always the sanest nor the most intuitive ones. Since Pijul will not have any Java part, I would really like to have overloaded strings and get rid of the main Strings type (and then of course provide an external interface using Strings).

One of the biggest annoyances is that sorting and hashes have to be redefined, and this is not always intuitive. Here is a summary:

  • Array[Byte] are easy to map to C (and also probably efficient, depending on the JVM). They are exactly “char* with bound checking”. Big caveat: their hash function and equality function work on their address, so that Array(1,2,3,4) != Array(1,2,3,4).

  • Strings seem to be “short*”. They are encoded and decoded when mapped to byte arrays, but their equality and hash functions work on their contents, which makes them particularly convenient to use in hash tables and trees. The choice to be incompatible with C makes no sense, in my opinion: many cool libraries are written in C, and will lose their performance when called from Java for this absurd reason.

  • ArrayBuffer[Byte] are extensible Array[Byte], with hash and equality on the contents. Unfortunately, there seems to be no fast way to pass the underlying array to C (as of Scala 2.11).
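The difference between the equality on Array[Byte] and on ArrayBuffer[Byte] can be checked with a few lines. This is a small sketch (the object name BytesEquality is made up) that uses only the standard library and java.util.Arrays:

```scala
import scala.collection.mutable.ArrayBuffer

object BytesEquality {
  val a: Array[Byte] = Array[Byte](1, 2, 3, 4)
  val b: Array[Byte] = Array[Byte](1, 2, 3, 4)

  // == on arrays compares references: two distinct arrays are never equal.
  val sameReference: Boolean = a == b

  // Comparing and hashing the contents has to be requested explicitly.
  val sameContents: Boolean = java.util.Arrays.equals(a, b)
  val sameHash: Boolean =
    java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)

  // ArrayBuffer compares by contents.
  val sameBuffers: Boolean =
    (ArrayBuffer.empty[Byte] ++= a) == (ArrayBuffer.empty[Byte] ++= b)
}
```

Here sameReference is false, while the three other values are true.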

I don’t know how to override the hash functions without allocating a wrapper around each string I want to lookup in a hash table, so I’m now using Array[Byte], with the following code to use them in trees:

    type Bytes=Array[Byte]

    implicit private object BytesOrdering extends Ordering[Bytes]{
      def compare(a:Bytes,b:Bytes):Int={
        def comp(i:Int):Int={
          if(i>=a.length && i>=b.length) 0
          else if(i>=a.length) -1
          else if(i>=b.length) 1
          else {
            val c=a(i) compare b(i)
            if(c==0) comp(i+1) else c
          }
        }
        comp(0)
      }
    }
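To show what such an ordering buys, here is a self-contained sketch that uses the same lexicographic comparison as the key order of a TreeMap. The names BytesTree, bytes and tree are made up for this illustration, and the comparison is on signed bytes, which is an assumption that may not match the order used by a C library.

```scala
import scala.collection.immutable.TreeMap

object BytesTree {
  type Bytes = Array[Byte]

  // Lexicographic order on the contents; a strict prefix sorts first.
  implicit object BytesOrdering extends Ordering[Bytes] {
    def compare(a: Bytes, b: Bytes): Int = {
      @scala.annotation.tailrec
      def comp(i: Int): Int =
        if (i >= a.length && i >= b.length) 0
        else if (i >= a.length) -1
        else if (i >= b.length) 1
        else {
          val c = java.lang.Byte.compare(a(i), b(i)) // signed comparison
          if (c == 0) comp(i + 1) else c
        }
      comp(0)
    }
  }

  def bytes(s: String): Bytes = s.getBytes("UTF-8")

  // The implicit ordering above is picked up by TreeMap.
  val tree: TreeMap[Bytes, Int] =
    TreeMap(bytes("b") -> 2, bytes("a") -> 1, bytes("ab") -> 3)
}
```

A lookup such as BytesTree.tree.get(BytesTree.bytes("a")) now succeeds even though the key is a freshly allocated array, because the tree compares contents rather than addresses.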

Using SBT

SBT is cool. Its documentation is not, it’s like teaching a kid to ride a bike by telling him or her about bushingless chains (yes, bushingless chains are a true cool innovation).

One big thing to understand is that an SBT file contains the inside of a Scala class (so val is admissible in there). SBT has variables that need to be declared first (most people declare them lazy, not sure why).

Here’s what I’m using on linux, you can of course generalize it:

import sbt._

lazy val jniCompile = taskKey[Unit]("Compiles jni sources using gcc")

lazy val binPath = settingKey[File]("Shared libraries produced by JNI")

lazy val root = (project in file(".")).settings(
  organization := "com.pijul",
  maintainer := "PE Meunier <pe@pijul.org>",
  version := "0.2",
  name := "Pijul",
  libraryDependencies += "org.scala-lang" % "scala-reflect" % scalaVersion.value,

  binPath := new File((target in Compile).value, "bin"),
  jniCompile:= Def.task{
    val log=streams.value.log
    val mkBinDir = s"mkdir -p ${binPath.value}"
    mkBinDir ! log
    val nativeSource=new File((sourceDirectory).value , "lmdbstubs.c")
    val command=s"gcc -shared -fPIC -I/usr/lib/jvm/java-7-openjdk-amd64/include -o ${binPath.value}/libLmdbstubs.so $nativeSource -llmdb"
    Process(command, binPath.value) ! (log)
  }.dependsOn(compile in Compile).tag(Tags.Compile,Tags.CPU).value,

  compile <<= (compile in Compile, jniCompile).map((result, _) => result),
  fork in run := true,
  javaOptions += s"-Djava.library.path=${binPath.value}:/usr/lib/x86_64-linux-gnu"
)

Algebraic data types are case classes

Timo Moilanen told me about case classes today. I had used case classes that extend an abstract class, which corresponds exactly to algebraic data types in other functional languages like Ocaml and Haskell. But in Scala, it is sometimes useful to define a standalone class as a case class, to get for free things like == on the contents (think of a class/a type constructor with lots of parameters, like a record type in Haskell/Ocaml).
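Both uses can be seen in a short sketch (Shapes, Shape, Circle and Rect are names made up for this illustration): a sealed abstract class with one case class per constructor plays the role of an algebraic data type, and each case class gets equality on its contents, a readable toString and a copy method for free.

```scala
object Shapes {
  // An algebraic data type: one sealed parent, one case class per constructor.
  sealed abstract class Shape
  case class Circle(radius: Double) extends Shape
  case class Rect(width: Double, height: Double) extends Shape

  // Because Shape is sealed, the compiler can warn about missing cases.
  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
}
```

For instance, Shapes.Rect(2, 3) == Shapes.Rect(2, 3) holds, which would not be the case for two instances of an ordinary class.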

As far as I know, Scala doesn’t seem to have non-generalized algebraic data types.