
Byte Arrays in Java, Size Does Matter

I have recently been doing some processor development in Apache NiFi. Being mostly a C++ guy, this was a nice throwback to some Java work I did many years ago. To describe it briefly, Apache NiFi:

"is an easy to use, powerful, and reliable system to process and distribute data."



In a nutshell, you have a nice web-based WYSIWYG editor that sets up Java "Processors": engines that do some kind of work. You ingest data, create FlowFiles, and then do something interesting with them. Out of the box, NiFi comes with a slew of aptly named Processors:

  • GetFile/PutFile
  • GetSFTP/PutSFTP
  • ConvertXtoY


so on and so forth. Interestingly enough, there was a ListenUDP Processor, but no TCP equivalent. The code to do so is not, in and of itself, particularly interesting: a pretty standard ServerSocket/Channel that threads off connections to interpret the payload. What is interesting is that, coming from cpp land, I had neglected to remember that Java doesn't do unsigned integers.

If your ListenTCP Processor has to be compatible with cpp clients that are sending data of arbitrarily large size, up to a uint64_t's worth, you're going to have problems. What kind of problems, you ask?

Assume you have the following file structure:

4 byte uint32_t header length
header
8 byte uint64_t payload length
payload


Your standard int/long in Java will be okay except for the very largest of edge cases. Java 8 has support for unsigned longs, but alas I am stuck in Java 7 for now. Enter BigInteger, the wrapper that turns a simple value comparison from one symbol into an entire function call plus a check of the result:

a < b                    // what you would write with primitives
a.compareTo(b) < 0       // the BigInteger equivalent (compareTo returns -1, 0, or 1)


Oh boy. Fine. That's fine.
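To make that concrete, here is a minimal sketch of reading the two length fields on Java 7. The class and helper names are mine, and it assumes the fields arrive big-endian (network byte order); the interesting part is the BigInteger(int signum, byte[] magnitude) constructor, which treats the raw 8 bytes as an unsigned value.

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;

// Hypothetical framing reader for the header/payload structure above.
public final class FrameLengths {

    // A uint32_t fits in a signed long without loss once the sign bits are masked off.
    static long readUint32(DataInputStream in) throws IOException {
        return in.readInt() & 0xFFFFFFFFL;
    }

    // A uint64_t can exceed Long.MAX_VALUE, so hand the raw bytes to BigInteger
    // with an explicit positive signum.
    static BigInteger readUint64(DataInputStream in) throws IOException {
        byte[] raw = new byte[8];
        in.readFully(raw);
        return new BigInteger(1, raw);
    }

    static void readFrame(InputStream socketIn) throws IOException {
        DataInputStream in = new DataInputStream(socketIn);

        long headerLength = readUint32(in);
        byte[] header = new byte[(int) headerLength]; // headers are assumed small
        in.readFully(header);

        BigInteger payloadLength = readUint64(in);
        if (payloadLength.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) > 0) {
            // too big for a single byte[] -- the payload has to be streamed (see below)
        }
    }
}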

But wait, Knowles, you ask, what does this have to do with byte arrays and their size? Being able to successfully interpret the uint64_t is step one. Step two is how to move the data around.

Reading the data from the socket is typically done in chunks, but it would make sense for a first pass to write all of that into a buffer in memory before writing it to a FlowFile. Well, that would work until you get a file that is greater than 2 GB in size. Why?

byte[] bytes = new byte[Integer.MAX_VALUE]; //uh oh, signed int is 2^31-1
byte[] bytes = new byte[Long.MAX_VALUE]; //uh oh, this doesn't compile!



Well, that sucks. Okay, but Java is the king of wrappers; surely there is some kind of stream we can use, right? NOPE. BufferedWriter, ByteArrayOutputStream, ByteBuffer, et al. are all backed by plain arrays under the hood! Even a MappedByteBuffer, which memory-maps a file for you, can only "look" at 2 GB at a time!
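To make the MappedByteBuffer point concrete, here is an illustrative sketch (the file name and class are placeholders) of walking a file bigger than 2 GB: each FileChannel.map() call is capped at Integer.MAX_VALUE bytes, so the file has to be visited in windows.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class MappedWindows {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("huge.bin", "r");
             FileChannel channel = file.getChannel()) {

            long remaining = channel.size();
            long position = 0;
            while (remaining > 0) {
                // A single mapping cannot exceed Integer.MAX_VALUE bytes.
                long window = Math.min(remaining, Integer.MAX_VALUE);
                MappedByteBuffer view =
                        channel.map(FileChannel.MapMode.READ_ONLY, position, window);
                // ... read at most 'window' bytes through 'view' ...
                position += window;
                remaining -= window;
            }
        }
    }
}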

What is the only choice? FileInput/OutputStreams. Gross, file I/O??? Say it isn't so. IT IS SO. But how does NiFi handle arbitrarily large files? It merely gives the impression that the FlowFiles are "moving" around the graph in memory like other real-time programs I have worked with. When you have to read or write a FlowFile, there is a callback function that exposes an In/OutputStream which, if you dig deep enough, is simply a FileInput/OutputStream. Thus, a FlowFile is never entirely in memory at once; it is actually written to disk on ingest into NiFi's content repository.

flowFile = session.write(flowFile, new OutputStreamCallback() {
    @Override
    public void process(final OutputStream out) throws IOException {
        out.write(bufferFromSocket);
    }
});
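
For the curious, here is a hedged sketch of what that callback can look like once the payload is streamed straight off the socket instead of being buffered first. socketIn and payloadLength are stand-ins for whatever the surrounding Processor code provides (they would need to be final to be captured by the anonymous class), and EOFException comes from java.io.

flowFile = session.write(flowFile, new OutputStreamCallback() {
    @Override
    public void process(final OutputStream out) throws IOException {
        byte[] chunk = new byte[8192];
        BigInteger remaining = payloadLength; // the uint64_t parsed earlier
        while (remaining.signum() > 0) {
            // Read at most one chunk's worth, bounded by what is left of the payload.
            int toRead = remaining.min(BigInteger.valueOf(chunk.length)).intValue();
            int read = socketIn.read(chunk, 0, toRead);
            if (read == -1) {
                throw new EOFException("socket closed mid-payload");
            }
            out.write(chunk, 0, read);
            remaining = remaining.subtract(BigInteger.valueOf(read));
        }
    }
});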



Clever. I'm not a huge fan of having to work around a language's limitations, but considering the API was originally written in 1995 (with ByteBuffer coming into existence in Java 1.4, circa 2002), I can understand why a signed integer was the upper limit for a buffer at one point in time.

----------------

Having said all of that, I have found Apache NiFi to be very easy to work with, and for data flow tasks it pretty much does most of the work for you. Plug and play, easy packaging, but not really suitable if you need speed. Everything is file I/O; otherwise you would blow your heap every time a couple of massive files made their way through your system.
