ByteBuffers, String, and C

I end up doing a lot of marshalling between Java and C over the wire. ByteBuffers are a natural fit for this situation given ByteBuffer.order(ByteOrder) and NIO’s selectors.

The problem comes in when dealing with Strings.

There’s no ByteBuffer.put(String), but that’s OK because there’s CharBuffer.put(String). But wait! In Java a char is two bytes. So writing “bollocks” with CharBuffer.put(String) through a CharBuffer view of the ByteBuffer leaves this in the underlying bytes:

0062006f 006c006c 006f0063 006b0073  .b.o.l.l.o.c.k.s

This is all fine and dandy if you’re going to another Java application (or something that’s commonly double-byte) but when going to vanilla C you’re looking for single byte characters.
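To see the two-bytes-per-char layout for yourself, here is a minimal sketch. The class and method names (CharViewDemo, toHexViaCharBuffer) are made up for illustration, and it assumes the buffer's default big-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

class CharViewDemo {
    /** Writes the string through a CharBuffer view and returns the backing bytes as hex. */
    static String toHexViaCharBuffer(String s) {
        // Two bytes per Java char; a new ByteBuffer is big-endian by default.
        ByteBuffer bytes = ByteBuffer.allocate(s.length() * 2);
        CharBuffer chars = bytes.asCharBuffer();
        chars.put(s);

        // Absolute gets, so the ByteBuffer's own position does not matter here.
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < bytes.capacity(); i++) {
            hex.append(String.format("%02x", bytes.get(i)));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // prints 0062006f006c006c006f0063006b0073
        System.out.println(toHexViaCharBuffer("bollocks"));
    }
}
```

A C reader expecting single-byte characters would hit the leading 0x00 and treat the string as empty.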

Your next bet is to try:

final String string = "bollocks";
final ByteBuffer buffer = ByteBuffer.allocateDirect(string.length());
buffer.put(string.getBytes());

This is fine and dandy for most applications. (It should be noted that the default character set is used in the transformation. Unless this code runs in a controlled environment, a multi-byte encoding can produce more bytes than string.length() and you end up getting a BufferOverflowException. So it’s better to do:

final String string = "bollocks";
final byte[] stringBytes = string.getBytes();
final ByteBuffer buffer = ByteBuffer.allocateDirect(stringBytes.length);
buffer.put(stringBytes);

Or even better yet, explicitly put the charset in String.getBytes(String charsetName).)
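As a rough illustration of both points (explicit charset, and sizing from the encoded bytes rather than String.length()), something like the following should do. ExplicitCharsetDemo and its two methods are hypothetical names, and UTF-8 is just an assumption about what the C side expects.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class ExplicitCharsetDemo {
    // Assumption: the receiving C code expects UTF-8.
    static final Charset UTF8 = Charset.forName("UTF-8");

    /** Sizes the buffer from the encoded bytes, not from String.length(). */
    static ByteBuffer encodeSafely(String s) {
        byte[] encoded = s.getBytes(UTF8);
        ByteBuffer buffer = ByteBuffer.allocateDirect(encoded.length);
        buffer.put(encoded);
        buffer.flip(); // ready for reading or writing to a channel
        return buffer;
    }

    /** Returns true if a buffer sized by String.length() is too small for this string. */
    static boolean overflowsWhenSizedByLength(String s) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(s.length());
        try {
            buffer.put(s.getBytes(UTF8));
            return false;
        } catch (BufferOverflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflowsWhenSizedByLength("bollocks"));   // pure ASCII
        System.out.println(overflowsWhenSizedByLength("na\u00efve")); // one two-byte character
    }
}
```

The pure-ASCII string fits, while the string with a single accented character needs one byte more than its length in chars.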

So what am I complaining about? Everything seems fine. That’s true up to this point. But what if you need to chunk up the string? CharBuffer provides CharBuffer.put(String src, int start, int end) which is ideal except for that problem of double-byte chars. What you actually end up doing is String.getBytes() and then walking over the resulting byte array. This may seem all fine and dandy except for the fact that the whole reason for doing the chunking in the first place is that the string is very large. Using String.getBytes() will cost you about three times the memory (the original string, the string as a byte array and the ByteBuffer into which you are writing).

If you’re NIO Charset savvy then you may have said to do:

final String string = "bollocks";
final Charset charset = Charset.forName("UTF-8");
final ByteBuffer buffer = charset.encode(string);

This kills lots of birds with a single stone and is very tight code. (“UTF-8” must be supported by Charset so there’s no need to check.) The parallel code for chunking is similar:

final String string = "bollocks";
final Charset charset = Charset.forName("UTF-8");
final CharBuffer charBuffer = CharBuffer.wrap(string, 0, 3);
final ByteBuffer buffer = charset.encode(charBuffer);

(where the loop over the remaining chars is not shown). Again, this is nice code that solves the problem. So what am I still complaining about? Well, it’s better on the memory consumption but, even though I know the size of my chunking and can allocate a ByteBuffer of this size, I have to allow it to allocate the buffer for me.
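For completeness, the loop that was left out might look roughly like this. It is only a sketch: ChunkedEncodeDemo, encodeInChunks and the sink callback are invented for the example, and it splits on char indexes without checking for surrogate pairs, so a chunk boundary could land in the middle of one.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.function.Consumer;

class ChunkedEncodeDemo {
    /** Encodes s in slices of at most chunkChars chars and hands each encoded chunk to sink. */
    static void encodeInChunks(String s, int chunkChars, Consumer<ByteBuffer> sink) {
        Charset charset = Charset.forName("UTF-8");
        for (int start = 0; start < s.length(); start += chunkChars) {
            int end = Math.min(start + chunkChars, s.length());
            // Caveat: a boundary inside a surrogate pair would be encoded as a replacement byte.
            CharBuffer slice = CharBuffer.wrap(s, start, end);
            // Charset.encode allocates a fresh ByteBuffer on every call.
            sink.accept(charset.encode(slice));
        }
    }

    public static void main(String[] args) {
        encodeInChunks("bollocks", 3, chunk -> System.out.println(chunk.remaining() + " bytes"));
    }
}
```

In real code the sink would write each chunk to a channel. The per-call allocation inside Charset.encode is exactly the thing being complained about here.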

If you really know your java.nio.charset you would suggest:

final String string = "bollocks";
final Charset charset = Charset.forName("UTF-8");
final CharsetEncoder encoder = charset.newEncoder();
final CharBuffer charBuffer = CharBuffer.wrap(string, 0, 3);
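// 3 bytes is enough here because the 3 chars are ASCII; other chars can need more bytes each in UTF-8
final ByteBuffer buffer = ByteBuffer.allocateDirect(3);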
final CoderResult encodingResult = encoder.encode(charBuffer, buffer, true/*no more input*/);

This is an “elegant” solution that allows for reuse of the ByteBuffer and fits the bill almost exactly! There is the extra CharBuffer in there that has to suck up space but at least it’s limited in size.
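Stretching that idea over a whole string, a hedged sketch of the reuse loop could look like this. ReusableEncoderDemo, encodeWithReusedBuffer and the sink callback are hypothetical names. It wraps the full string once and lets the encoder fill a single fixed-size ByteBuffer over and over, instead of wrapping three chars at a time. In OpenJDK, as far as I know, CharBuffer.wrap over a String is a view rather than a copy, but treat that as an implementation detail.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.function.Consumer;

class ReusableEncoderDemo {
    /** Encodes s as UTF-8 through one reused buffer, passing each filled buffer to sink. */
    static void encodeWithReusedBuffer(String s, int bufferBytes, Consumer<ByteBuffer> sink) {
        if (bufferBytes < 4) {
            // A UTF-8 code point can take up to 4 bytes; a smaller buffer could never make progress.
            throw new IllegalArgumentException("buffer must hold at least 4 bytes");
        }
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocateDirect(bufferBytes);

        CoderResult result;
        do {
            // OVERFLOW means "out is full, drain it and call me again".
            result = encoder.encode(in, out, true /* no more input */);
            if (result.isError()) {
                throw new IllegalArgumentException("cannot encode input: " + result);
            }
            drain(out, sink);
        } while (result.isOverflow());

        do {
            // Flush any internal encoder state (a no-op for UTF-8, but part of the contract).
            result = encoder.flush(out);
            drain(out, sink);
        } while (result.isOverflow());
    }

    /** Hands the written bytes to sink, then resets the buffer for reuse. */
    private static void drain(ByteBuffer out, Consumer<ByteBuffer> sink) {
        out.flip();
        if (out.hasRemaining()) {
            sink.accept(out); // the sink must consume the bytes before returning
        }
        out.clear();
    }

    public static void main(String[] args) {
        encodeWithReusedBuffer("bollocks", 4, b -> System.out.println(b.remaining() + " bytes ready"));
    }
}
```

The same direct ByteBuffer is handed to the sink each time, so the only per-chunk cost is the encode call itself.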

