1. 概述

本文我们将学习如何使用 BitSet 来表示位向量(vector of bits,或称位图)。

首先我们讲解为什么不用boolean数组来表示位向量。在熟悉BitSet内部原理之后,进一步了解其API用法。

2. Bit 数组

有人可能会想到使用boolean[],来表示 Bit 数组。乍一看,这似乎很合理。

然而,boolean数组中的每个元素都消耗1个字节而不是1 bit内存大小 因此,当我们对内存使用有严格要求时,boolean[]并不是一个理想的选择。

为了更具体的说明,让我们看看长度为1024的boolean数组会消耗多少内存空间:

boolean[] bits = new boolean[1024];
System.out.println(ClassLayout.parseInstance(bits).toPrintable());

理想情况下,我们期望只占用1024 bit的内存大小。然而使用JOL分析后得出一个完全不同的结论(JOL全称为Java Object Layout,是分析JVM中对象布局的工具):

[Z object internals:
 OFFSET  SIZE      TYPE DESCRIPTION            VALUE
      0     4           (object header)        01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4           (object header)        00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4           (object header)        7b 12 07 00 (01111011 00010010 00000111 00000000) (463483)
     12     4           (object header)        00 04 00 00 (00000000 00000100 00000000 00000000) (1024)
     16  1024   boolean [Z.                    N/A
Instance size: 1040 bytes

如果忽略对象标头的开销,数组元素占用了1024个字节大小。 这比我们预期的1024 bit,要多700%的内存消耗。

可寻址问题和字分裂(word tearing)是导致 boolean 类型不止1 bit的主要原因。

为了解决这个问题,我们可以使用数字类型(比如long)结合位操作来实现——这就是BitSet的源由。

3. BitSet 工作原理

我们前面提到,为实现每个flag只占用1 bit内存空间,BitSet 底层使用long数据类型结合位运算来实现。

为了简单起见,我们假设用一个字节表示八个flag。首先,我们用零初始化该字节上的所有bit:

initial-bytes

现在如何想把索引为3的bit设为true,则我们先把“数字1”左移三位:

left-shift

然后将两则进行按位或(or)运算:

final-or

同理如果我们想把索引为7的bit设为true

another-set

如上图所示,我们将“数字1”左移7位,然后与前一个字节做或运算。

3.1. 获取bit索引

要判断某个索引上的bit值为true还是false,我们使用按位与(and)运算符。例如,我们要检查索引3位置上的值:

  1. 将1左移3位
  2. 将结果与当前字节做按位与运算
  3. 如果运算结果大于0,说明值为true,否则设置为false或未设置

get-set

上图展示了索引3的get操作步骤。我们再看看下面这个例子:

get-clear

运算结果为0,所有索引4的值为false。

3.2. 扩容

目前,我们只能存储8位的向量。要想更多,我们需要使用字节数组,而不是单个字节即可!

Now, every time we need to set, get, or clear a specific index, we should find the corresponding array element, first. For instance, let's suppose we're going to set index 14:

array-set

As shown in the above diagram, after finding the right array element, we did set the appropriate index.

Also, if we want to set an index beyond 15 here, the BitSet will expand its internal array, first. Only after expanding the array and copying the elements will it set the requested bit. This is somewhat similar to how ArrayList works internally.

So far, we used the byte data type for the sake of simplicity. The BitSet API, however, is using an array of long values internally.

4. BitSet API 用法

到目前为止,我们已经清楚 BitSet 背后的原理,下面学习BitSet的用法。

首先,我们把BitSet与前面的boolean[]进行内存占用对比:

BitSet bitSet = new BitSet(1024);

System.out.println(GraphLayout.parseInstance(bitSet).toPrintable());

这将打印BitSet对象本身内存占用大小(shallow size) 和内部数组大小:

[email protected] object externals:
          ADDRESS       SIZE TYPE             PATH         VALUE
        70f97d208         24 java.util.BitSet              (object)
        70f97d220        144 [J               .words       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

如上所示,BitSet内部使用了长度16的long[](16 * 64 bits = 1024 bits)。一共占用了168 字节大小,而boolean[]占用了1024字节大小

bit数越多,占用空间的差异就越大。例如,要保存 1024 * 1024 bit,boolean[] 消耗 1MB,而 BitSet 大概只消耗 130KB 内存。

4.1. 构造 BitSets

The simplest way to create a BitSet instance is to use the no-arg constructor:

BitSet bitSet = new BitSet();

This will create a BitSet instance with a long[] of size one. Of course, it can automatically grow this array if needed.

It's also possible to create a BitSet with an initial number of bits:

BitSet bitSet = new BitSet(100_000);

Here, the internal array will have enough elements to hold 100,000 bits. This constructor comes in handy when we already have a reasonable estimate on the number of bits to store. In such use cases, it can prevent or decrease the unnecessary copying of array elements while growing it.

It's even possible to create a BitSet from an existing long[]byte[]LongBuffer, and ByteBuffer. For instance, here we're creating a BitSet instance from a given long[]:

BitSet bitSet = BitSet.valueOf(new long[] { 42, 12 });

There are three more overloaded versions of the valueOf() static factory method to support the other mentioned types.

4.2. 设置 Bit

We can set the value of a particular index to true using the set(index) method:

BitSet bitSet = new BitSet();

bitSet.set(10);
assertThat(bitSet.get(10)).isTrue();

As usual, the indices are zero-based. It's even possible to set a range of bits to true using the set(fromInclusive, toExclusive) method:

bitSet.set(20, 30);
for (int i = 20; i <= 29; i++) {
    assertThat(bitSet.get(i)).isTrue();
}
assertThat(bitSet.get(30)).isFalse();

As is evident from the method signature, the beginning index is inclusive, and the ending one is exclusive.

When we say setting an index, we usually mean setting it to true. Despite this terminology, we can set a particular bit index to false using the set(index, boolean) method:

bitSet.set(10, false);
assertThat(bitSet.get(10)).isFalse();

This version also supports setting a range of values:

bitSet.set(20, 30, false);
for (int i = 20; i <= 29; i++) {
    assertThat(bitSet.get(i)).isFalse();
}

4.3. Clearing Bits

Instead of setting a specific bit index to false, we can simply clear it using the clear(index) method:

bitSet.set(42);
assertThat(bitSet.get(42)).isTrue();
        
bitSet.clear(42);
assertThat(bitSet.get(42)).isFalse();

Moreover, we can also clear a range of bits with the clear(fromInclusive, toExclusive) overloaded version:

bitSet.set(10, 20);
for (int i = 10; i < 20; i++) {
    assertThat(bitSet.get(i)).isTrue();
}

bitSet.clear(10, 20);
for (int i = 10; i < 20; i++) {
    assertThat(bitSet.get(i)).isFalse();
}

Interestingly, if we call this method without passing any arguments, it'll clear all the set bits:

bitSet.set(10, 20);
bitSet.clear();
for (int i = 0; i < 100; i++) { 
    assertThat(bitSet.get(i)).isFalse();
}

As shown above, after calling the clear() method, all bits are set to zero.

4.4. Getting Bits

So far, we used the get(index) method quite extensively. When the requested bit index is set, then this method will return true. Otherwise, it'll return false:

bitSet.set(42);

assertThat(bitSet.get(42)).isTrue();
assertThat(bitSet.get(43)).isFalse();

Similar to set and clear, we can get a range of bit indices using the get(fromInclusive, toExclusive) method:

bitSet.set(10, 20);
BitSet newBitSet = bitSet.get(10, 20);
for (int i = 0; i < 10; i++) {
    assertThat(newBitSet.get(i)).isTrue();
}

As shown above, this method returns another BitSet in the [20, 30) range of the current one. That is, index 20 of the bitSet variable is equivalent to index zero of the newBitSet variable.

4.5. Flipping Bits

To negate the current bit index value, we can use the flip(index) method. That is, it'll turn true values to false and vice versa:

bitSet.set(42);
bitSet.flip(42);
assertThat(bitSet.get(42)).isFalse();

bitSet.flip(12);
assertThat(bitSet.get(12)).isTrue();

Similarly, we can achieve the same thing for a range of values using the flip(fromInclusive, toExclusive) method:

bitSet.flip(30, 40);
for (int i = 30; i < 40; i++) {
    assertThat(bitSet.get(i)).isTrue();
}

4.6. Length

There are three length-like methods for a BitSet. The size() method returns the number of bits the internal array can represent. For instance, since the no-arg constructor allocates a long[] array with one element, then the size() will return 64 for it:

BitSet defaultBitSet = new BitSet();
assertThat(defaultBitSet.size()).isEqualTo(64);

With one 64-bit number, we can only represent 64 bits. Of course, this will change if we pass the number of bits explicitly:

BitSet bitSet = new BitSet(1024);
assertThat(bitSet.size()).isEqualTo(1024);

Moreover, the cardinality() method represents the number of set bits in a BitSet:

assertThat(bitSet.cardinality()).isEqualTo(0);
bitSet.set(10, 30);
assertThat(bitSet.cardinality()).isEqualTo(30 - 10);

At first, this method returns zero as all bits are false. After setting the [10, 30) range to true, then the cardinality() method call returns 20.

Also, the length() method returns the one index after the index of the last set bit:

assertThat(bitSet.length()).isEqualTo(30);
bitSet.set(100);
assertThat(bitSet.length()).isEqualTo(101);

At first, the last set index is 29, so this method returns 30. When we set the index 100 to true, then the length() method returns 101. It's also worth mentioning that this method will return zero if all bits are clear.

Finally, the isEmpty() method returns false when there is at least one set bit in the BitSet. Otherwise, it'll return true:

assertThat(bitSet.isEmpty()).isFalse();
bitSet.clear();
assertThat(bitSet.isEmpty()).isTrue();

4.7. Combining With Other _BitSet_s

The intersects(BitSet) method takes another BitSet and returns true when two _BitSet_s have something in common. That is, they have at least one set bit in the same index:

BitSet first = new BitSet();
first.set(5, 10);

BitSet second = new BitSet();
second.set(7, 15);

assertThat(first.intersects(second)).isTrue();

The [7, 9] range is set in both _BitSet_s, so this method returns true.

It's also possible to perform the logical and operation on two _BitSet_s:

first.and(second);
assertThat(first.get(7)).isTrue();
assertThat(first.get(8)).isTrue();
assertThat(first.get(9)).isTrue();
assertThat(first.get(10)).isFalse();

This will perform a logical and between the two _BitSet_s and modifies the first variable with the result. Similarly, we can perform a logical xor on two _BitSet_s, too:

first.clear();
first.set(5, 10);

first.xor(second);
for (int i = 5; i < 7; i++) {
    assertThat(first.get(i)).isTrue();
}
for (int i = 10; i < 15; i++) {
    assertThat(first.get(i)).isTrue();
}

There are other methods such as the andNot(BitSet) or the or(BitSet), which can perform other logical operations on two _BitSet_s.

4.8. Miscellaneous

As of Java 8, there is a stream() method to stream all set bits of a BitSet. For instance:

BitSet bitSet = new BitSet();
bitSet.set(15, 25);

bitSet.stream().forEach(System.out::println);

This will print all set bits to the console. Since this will return an IntStream, we can perform common numerical operations such as summation, average, counting, and so on. For instance, here we're counting the number of set bits:

assertThat(bitSet.stream().count()).isEqualTo(10);

Also, the nextSetBit(fromIndex) method will return the next set bit index starting from the fromIndex:

assertThat(bitSet.nextSetBit(13)).isEqualTo(15);

The fromIndex itself is included in this calculation. When there isn't any true bit left in the BitSet, it'll return -1:

assertThat(bitSet.nextSetBit(25)).isEqualTo(-1);

Similarly, the nextClearBit(fromIndex) returns the next clear index starting from the fromIndex:

assertThat(bitSet.nextClearBit(23)).isEqualTo(25);

On the other hand, the previousClearBit(fromIndex) returns the index of the nearest clear index in the opposite direction:

assertThat(bitSet.previousClearBit(24)).isEqualTo(14);

Same is true for previousSetBit(fromIndex):

assertThat(bitSet.previousSetBit(29)).isEqualTo(24);
assertThat(bitSet.previousSetBit(14)).isEqualTo(-1);

Moreover, we can convert a BitSet to a byte[] or a long[] using the toByteArray() or toLongArray() methods, respectively:

byte[] bytes = bitSet.toByteArray();
long[] longs = bitSet.toLongArray();

5. 总结

In this tutorial, we saw how we can use BitSet to represent a vector of bits.

At first, we got familiar with the rationale behind not using boolean[] to represent a vector of bits. Then we saw how a BitSet works internally and what its API looks like.

As usual, all the examples are available over on GitHub.