Scala-notes-number-madness

This is another article in the scala-notes series. This one deals with what I call number madness.

What!?

Numbers are used everywhere in code, and often the type used is something like Int (Scala) or Integer/int (Java). These simple types are familiar to those of us who used languages like C but should not be used without considering their inherent limitations. Lets imagine an example web application that during its normal operation records interesting events…

Imagine a field used to represent the key of a database table named INTERESTING_EVENT, specifically an automatically incrementing surrogate key, as is typical. (This is not the best example, in part because one would rarely need perform arithmetic on a key value, but it is simple to explain.) Lets say that on average, only twenty INTERESTING_EVENT records are populated per second.

Is an Int or Integer safe? Lets get some perspective…

365 days per year
$= 365$
8760 hours per year
$= 24 \cdot 365$
525,600 minutes per year
$= 60 \cdot 8760$
31,536,000 seconds per year
$= 60 \cdot 525,600$

Given a liner expansion of the INTERESTING_EVENT sequence, by twenty each second, the value would grow by 630,720,000 each year.

630,720,000 events per year
$= 31,536,000 \cdot 20$

In four years that would grow to 2,524,608,000.

2,522,880,000 events in four years
$= 630,720,000 \cdot 4$

The MAX_VALUE for java.lang.Integer is 2,147,483,647.

The example is contrived and unlikely but also recall that automatically incremented keys do not always grow linearly, and twenty events per second is conservative for a web server… (Insert your own web scale humor here.) But even with those constraints the key grows beyond the capacity of java.lang.Integer in less that four years. Four years is enough time to forget about the capacity constraint built into the example application.

You might forget how these simple number types work and think:

No problem, statistics or other processes are running, constantly computing metrics against Events and some computations use the key value… Surely an exception will be thrown as soon as the Integer overflows.

This is where you would be wrong.

Division will throw an ArithmeticException on division by zero, but a sum, for example, will quietly continue on an overflow. Let me repeat that… The sum will return an incorrect value on overflow, but it will not throw a runtime exception! This is serious problem. The application will quietly continue to run, but the mathematical operations on the value will be incorrect. This should be a real “What!?” moment.

Two’s complement

The JVM stores an int as a two’s complement binary (the core of java.lang.Integer and scala.Int), and thus it is often referred to as a two’s complement type.

Specifically, the JVM uses a 32-bit signed two’s complement integer for int. The most significant bit is used to designate the sign of the number. The max value for int is 2,147,483,647, and would thus be represented1 like 0111 1111 1111 1111 1111 1111 1111 1111.

Simple sum

First lets simply add 1 to 2,147,483,646.

2,147,483,646 + 1.

  0111 1111 1111 1111 1111 1111 1111 1110
+ 0000 0000 0000 0000 0000 0000 0000 0001
-------------------------------------------
  0111 1111 1111 1111 1111 1111 1111 1111

The result is what we expect, 0111 1111 1111 1111 1111 1111 1111 1111, or 2,147,483,647 in decimal.

Negative values

The JVM int is signed, that is to say both positive and negative values can be stored as an int.

Using the MSB (most significant bit) to designate sign makes it possible to store negative values using the same placeholders and arithmetic. A negative is represented with the MSB set to 1. To convert -2,147,483,647 to a two’s complement binary for int, we must find the complement of each bit then add one to the result.

  1. Find the complement of each bit of 0111 1111 1111 1111 1111 1111 1111 1111.
    1000 0000 0000 0000 0000 0000 0000 0000

  2. Add one


  1000 0000 0000 0000 0000 0000 0000 0000
+ 0000 0000 0000 0000 0000 0000 0000 0001
-------------------------------------------
  1000 0000 0000 0000 0000 0000 0000 0001 

The minimum value would be the value with the greatest magnitude and the opposite sign. The 32 bit, negative, two’s complement number with the greatest magnitude would be where the sign bit is set to 1 and all the other bits are set to 0, i.e.. 1000 0000 0000 0000 0000 0000 0000 0000. This means the min value2 is -2,147,483,648 in decimal.

Bad arithmetic without error

Lets see what happens when we overflow by adding one to 2,147,483,647.

2,147,483,647 + 1.
A carry bit needed…

                                       1                                 
  0111 1111 1111 1111 1111 1111 1111 1111
+ 0000 0000 0000 0000 0000 0000 0000 0001
-------------------------------------------
                                        0
and so on…

  1  
  0111 1111 1111 1111 1111 1111 1111 1111
+ 0000 0000 0000 0000 0000 0000 0000 0001
-------------------------------------------
  1000 0000 0000 0000 0000 0000 0000 0000

Wait! We have seen this one before, it is -2,147,483,648! So in this case the result is mathematically incorrect, but a valid int!

You can try out Scala code snips in the online REPL here, and Java in the online REPL here.
The following return -2147483648.

Scala
Int.MaxValue + 1
Java
Integer.MAX_VALUE + 1

BigInteger (java.math.BigInteger) to the rescue?

Clearly when arithmetic is needed, or when it is possible to overflow a field, one should use something safe like BigInteger over its primitive counterpart. Since one can never know for certain how a field will be used, selecting a safe type should be the default.

If one can be sure that arithmetic will never be performed on a field, and benchmarking shows that using a primitive is needed, then maybe it is worth the risk.

Primitive Type

Pros
Simple quick arithmetic.
Fixed memory allocation size.
Symbol operators. ( + - * /)
Cons
Arithmetic can fail without error!

BigInteger (java.math.BigInteger)

Pros
Correct arithmetic
Cons
Can not use symbol operators; must use add, subtract, multiply, and divide methods.
Arithmetic could be slower.
Higher memory allocation requirements because more information is stored and capacity is dynamic.

BigInt (scala.math.BigInt)

If you live in the Scala world you have better options, such as scala.math.BigInt. It can be thought of a Scala’s BigInteger. In Scala, characters such as + - * / can be used as method names, so this makes using scala.math.BigInt a pleasant experience.

Spire

Finally, when talking about handling numbers in Scala, we need to talk about spire. Spire makes working with numbers a beautiful experience.

spire example
1
2
3
4
5
object Inspired extends App {
  import spire.syntax.literals.si._
  val y = big"2 147 483 647"
  println(s"y, using spire string interpolation = ${y + 1}")
}

Look at how clear and simple it is to create a BigInt using spire string interpolation. Using spaces makes it easier to identify the place values for each digit and results in source that is simpler to read. Spire supports additional number types, type classes and more. Lets take SafeLong for example. It works as a nice replacement for BigInt. While it provides accurate arithmetic, independant of value, it performs better than BigInt for values inside the java.lang.Long range.3

Conclusion

A two’s complement type like int should be considered an optimization choice for Scala and Java software development, and not a default. Its performance comes at a cost and that cost is silent failure.
The default choice for a whole number field should be something like BigInt or SafeLong because they ensure correct arithmetic without regard to the size of the operands. This is made far more pleasant in Scala, especially when using a library like spire.

  1. I show the left most bit as the most significant bit because it makes it simple to see the arithmetic.

  2. Yes, the minimum and maximum values are asymmetrical, the minimum is -2,147,483,648 and the maximum is 2,147,483,647.

  3. $-2^{63} \text{ through } 2^{63} - 1$.

Mark Mynsted

Comments