VW 7.4 Spoilers - OpenSSL
(Republished from Cincom Smalltalk Tech Tips)
VW 7.4 is going through its final stages, so I thought this might be a good time to brag about what’s coming. Today I’d like to mention one of my pet projects. Well, it’s “my” only in the sense of me being fond of it, but the work is actually done by Dave Wallen. He wrote a DLL/CC wrapper for the crypto library used by OpenSSL. It’s not finished and it’s shipping as a preview with 7.4, but it already provides access to ARC4, Blowfish, DES and AES, including all the cipher modes and padding. The main requirements were to have the wrapper behave exactly like the native Smalltalk implementation, support the same API, and preserve the same runtime characteristics, such as generating as little garbage as possible in the process. And Dave’s done a pretty good job with that. So let’s take a look at this puppy.
The API compatibility is great. If you load the OpenSSL parcel you can pretty much follow the existing examples from SecurityGuide as they are. The only change is to substitute classes from Security.OpenSSL for the ones in the Security namespace. For example:
	plaintext := 'This is the end ...' asByteArray.
	key := #[1 2 3 4 5 6 7 8].
	iv := #[8 7 6 5 4 3 2 1].
	alice := Security.OpenSSL.DES newBP_CBC.
	alice setKey: key;
		setIV: iv copy.
	ciphertext := alice encrypt: plaintext.
and to decrypt
	bob := Security.OpenSSL.DES newBP_CBC.
	bob setKey: key;
		setIV: iv copy.
	(bob decrypt: ciphertext) asString
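Since the class is the only thing that changes, the same snippet can be written once and run against either implementation. The following is a minimal workspace sketch, not part of the shipped parcel: `roundTrip` is a hypothetical helper block introduced here for illustration, and it assumes both the native Security parcel and the OpenSSL preview parcel are loaded.

```smalltalk
"Workspace sketch: run the same DES-CBC round trip against both implementations."
| plaintext key iv roundTrip |
plaintext := 'This is the end ...' asByteArray.
key := #[1 2 3 4 5 6 7 8].
iv := #[8 7 6 5 4 3 2 1].

"roundTrip is a hypothetical helper; desClass is whichever DES class to exercise."
roundTrip := [:desClass |
	| alice bob ciphertext |
	alice := desClass newBP_CBC.
	alice setKey: key; setIV: iv copy.
	ciphertext := alice encrypt: plaintext.
	"A fresh instance with the same key and IV plays the receiving side."
	bob := desClass newBP_CBC.
	bob setKey: key; setIV: iv copy.
	(bob decrypt: ciphertext) asString].

(Array with: Security.DES with: Security.OpenSSL.DES) do: [:desClass |
	Transcript cr; print: desClass; show: ' -> '; show: (roundTrip value: desClass)].
```

If the two libraries really are interchangeable here, both lines on the Transcript should show the original sentence.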
Note that it’s not just the constructors that are polymorphically compatible, but even the wrapper structures work the same way (where possible).
	(Security.OpenSSL.OutputFeedback on: Security.OpenSSL.Blowfish new)
		setKey: #[1 2 3 4];
		setIV: #[8 7 6 5 4 3 2 1] copy;
		encrypt: 'hello' asByteArray
In many cases a switch from one library to the other should be as easy as importing the right namespace into your application code. However, there are and will be some incompatibilities. For example, triple DES is implemented as just another cipher in OpenSSL, and our generic triple encryption wrapper approach just doesn’t map well to that. We could implement a wrapper that would make you create three instances of DES and then just throw two away, but that seems a bit too extreme. Moreover, OpenSSL only supports triple encryption with DES, so a wrapper would just be confusing in terms of where it can be used anyway. So for triple DES there’s only the constructor-based API, as in
	Security.OpenSSL.DES newBP_3EDE_CBC
		setKey: '0123456789abcdefghijklmn' asByteArray;
		setIV: #[8 7 6 5 4 3 2 1];
		encrypt: 'Hello World!' asByteArray
As I hinted earlier, there’s also a capability mismatch between the two libraries. For example, we can do triple encryption with any of the supported block ciphers, not just DES. On the other hand, OpenSSL can do ciphers that we don’t yet and possibly never will (e.g. CAST, IDEA). But for the features that matter the most there should be an overlap.
OK, that’s all nice, but you’re probably asking by now, why on earth did we do this? There are actually a number of reasons that together seemed compelling enough.
First of all, it gives our users a choice. Having native Smalltalk implementations is incredibly convenient: you don’t have to worry if and where you have an external library installed, or if it’s even easily available for the platform that you want to run on. However, even though the performance of the native ciphers is more than adequate for many applications, it will be a stretch in others. Symmetric ciphers are actually the worst case in terms of Smalltalk’s lag behind a C implementation. They are generally tailored for heavily optimized “close to the metal” implementations, with lots of bit shuffling and mixing in a fixed set of registers. Current Smalltalks have little chance to compete with hand-crafted assembler there. To give you an idea, here’s how they compare on this machine, a Pentium4 M/1.7 GHz. The openssl utility clocks its AES performance as follows (cygwin installed on WinXP):
$ openssl speed aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 7744385 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 1939658 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 491623 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 122641 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 15343 aes-128 cbc's in 3.00s
OpenSSL 0.9.7e 25 Oct 2004
built on: Sat Dec 11 12:01:27 WEST 2004
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: gcc -DOPENSSL_SYSNAME_CYGWIN32 -DOPENSSL_THREADS  -DDSO_WIN32 -DOPENSSL_NO_KRB5 -DOPENSSL_NO_IDEA -DOPENSSL_NO_RC5 -DOPENSSL_NO_MDC2 -DTERMIOS -DL_ENDIAN -fomit-frame-pointer -O3 -march=i486 -Wall -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=1000 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      41248.39k    41462.30k    41895.97k    41791.81k    41840.83k
So OpenSSL can generally push through about 40 MB/sec in this mode (the table is in thousands of bytes per second, so 41840.83k is roughly 41.8 MB/s). Now let’s take a look at the native Smalltalk implementation:
| aes megs |
megs := 5. "We'll encrypt 5 MB of data for each block size"
aes := Security.AES newCBC
	setKey: #[0 1 2 3 4 5 6 7 8 9 10  11 12 13 14 15];
	setIV: #[8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8] copy;
	yourself.
#(16 64 256 1024 8192) do: [ :size | | reps ms b |
	b := ByteArray new: size.
	reps := megs*(10**6) // size. 
	ms := Time millisecondsToRun: [ reps timesRepeat: [ aes encryptInPlace: b ] ].
	Transcript cr; print: size; show: ' bytes, '; print: reps; show: ' reps = ';
		print: ms; show: ' ms => '; print: megs * 1000.0 / ms; show: ' MB/s'  ].
16 bytes, 312500 reps = 5979 ms => 0.83626 MB/s
64 bytes, 78125 reps = 6016 ms => 0.831117 MB/s
256 bytes, 19531 reps = 6021 ms => 0.830427 MB/s
1024 bytes, 4882 reps = 6022 ms => 0.830289 MB/s
8192 bytes, 610 reps = 6051 ms => 0.82631 MB/s
It is consistently a bit less than 1 MB/s. I hate to post such a comparatively unimpressive number for my beloved platform, but as I said there are pretty good reasons. One, symmetric ciphers and hashes are designed to fit common hardware. Two, Smalltalk pays a very high price for indexed access into byte arrays, performing bounds checks on both read and write for each byte. So these are really the worst cases. For example, in my ad hoc tests RSA is only about 3-4 times slower than what OpenSSL’s speed test was showing, but we’ll be able to do a better comparison once the wrapper supports those as well. So with all that taken into account I’d say the results aren’t all that bad.
Anyway, here are the results of the same code as above, just using Security.OpenSSL.AES.
16 bytes, 312500 reps = 4988 ms => 1.00241 MB/s
64 bytes, 78125 reps = 1325 ms => 3.77358 MB/s
256 bytes, 19531 reps = 404 ms => 12.3762 MB/s
1024 bytes, 4882 reps = 172 ms => 29.0698 MB/s
8192 bytes, 610 reps = 109 ms => 45.8716 MB/s
It shows nicely that the cost of DLL calls has a significant impact on performance: for small chunks of data, and consequently a high frequency of DLL calls, the wrapper comes quite close to the native implementation. An interesting blip is the largest block size, which reports higher throughput than the openssl command itself. I’m not about to dive into OpenSSL source code to find out why, but since we don’t really know what exactly is being measured by the openssl command, my guess is that there’s some additional activity included in those numbers. But I think the most interesting outcome of these benchmarks is that for small blocks (let’s say up to 64 bytes) the native implementation doesn’t do much worse than calling OpenSSL. Beyond that, the wrapper is something to consider if the performance of encryption is a concern in your application, and as soon as you get up to 8K blocks you’re pretty much matching the speed of pure C code.
For example, this will be quite interesting if you intend to use VW as an SSL server. Symmetric ciphers (and hashes) are the workhorse of SSL connections. The results suggest that with the OpenSSL wrapper you might be able to handle up to 40 times more concurrent connections. That’s of course completely ignoring the cost of any other processing, so in practice the ratio should be much smaller, but still significant. Speaking of that, here’s how you can plug the wrapper into the SSL framework. Basically all it takes is initializing selected SSL “cipher suites” with the right algorithms. Unfortunately the currently shipping version of SSL doesn’t provide mutators for SSLCipherSuites, so we have to do the following for now:
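To see where the DLL call overhead stops dominating on your own machine, the two runs can be combined into one loop that times both classes on the same buffer sizes. This is only a sketch built from the calls used in the benchmark above; the variable names are mine, and the `max: 1` guard is an assumption added to avoid a division by zero on very fast hardware.

```smalltalk
"Workspace sketch: time native and OpenSSL-backed AES-CBC side by side."
| totalBytes key iv |
totalBytes := 5 * (10 ** 6). "5,000,000 bytes per block size, as in the run above"
key := #[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15].
iv := #[8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8].
#(16 64 256 1024 8192) do: [:size |
	| reps buffer timings |
	buffer := ByteArray new: size.
	reps := totalBytes // size.
	"Collect the elapsed milliseconds for each implementation in turn."
	timings := (Array with: Security.AES with: Security.OpenSSL.AES) collect: [:aesClass |
		| aes |
		aes := aesClass newCBC.
		aes setKey: key; setIV: iv copy.
		Time millisecondsToRun: [reps timesRepeat: [aes encryptInPlace: buffer]]].
	Transcript cr; print: size; show: ' bytes: native '; print: timings first;
		show: ' ms, OpenSSL '; print: timings last;
		show: ' ms, speedup '; print: (timings first / (timings last max: 1)) asFloat].
```

The speedup column makes the crossover visible at a glance: it should stay near 1 for the smallest blocks and grow as each DLL call carries more data.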
suite := SSLCipherSuite SSL_RSA_WITH_DES_CBC_SHA.
suite cipherSpec
	instVarAt: 1	"the first variable, bulkCipher"
	put: Security.OpenSSL.DES newCBC.
HttpClient new
	sslContext: (SSLContext suites: (Array with: suite));
	get: 'https://www.microsoft.com'
Of course, the right way to do this would be extending the SSLCipherSuite and SSLCipherSpec classes with new suite definitions (following the existing ones).
Something to note about this example is that the suite as defined will use OpenSSL for DES-CBC encryption and native Smalltalk for SHA and RSA. So you can mix and match algorithm implementations arbitrarily. This is actually quite useful if your application must make use of specific hardware for some cryptographic operations, like having a smart card perform all RSA computations for security reasons. All you need is to create another version of the RSA algorithm that simply calls out to the card and hook it up through an SSLCipherSuite definition. Actually, I have to admit that it’s not yet readily possible with the current implementation, but all it takes is a straightforward extension of the SSLRSAKeyExchange class that is held by the SSLCipherSuite instance. This should all be finished by the time our OpenSSL wrapper supports public key algorithms as well.
And this brings me to another reason to implement the OpenSSL wrapper: it’s an example of how to create such a wrapper for other libraries that might be mandated for cryptographic use by your organization. Similarly to the smart card example, various security policies may require your application to exclude any cryptographic operations from your Smalltalk image. Or you may need to use a hardware-based cryptographic accelerator to speed up these operations on a heavily loaded SSL server. The wrapper allows us to validate the sanity of our APIs and prove that interfacing with these external facilities should be possible in a reasonably straightforward manner.
Finally, it’s also a very useful tool for us. So far we were able to test interoperability of our cryptographic library only indirectly, by running SSL communications to external servers. Any fault in our algorithm implementation would very likely manifest itself there. However, because of the nature of the beast, it was quite difficult to figure out what exactly was wrong when SSL communication failed. This wrapper will make this process much more efficient, robust and pleasant to work with. So, thanks, Dave! You’ve just made my life sooo much easier.
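A direct interoperability check of that kind could look roughly like the following. It is a sketch rather than our actual test suite: it assumes both parcels are loaded, uses only the constructors and messages shown earlier, and all variable names are made up for the example.

```smalltalk
"Workspace sketch: compare native and OpenSSL-backed DES-CBC output directly."
| plaintext key iv native openssl nativeCiphertext opensslCiphertext decryptor |
plaintext := 'Hello World!' asByteArray.
key := #[1 2 3 4 5 6 7 8].
iv := #[8 7 6 5 4 3 2 1].

native := Security.DES newBP_CBC.
native setKey: key; setIV: iv copy.
nativeCiphertext := native encrypt: plaintext.

openssl := Security.OpenSSL.DES newBP_CBC.
openssl setKey: key; setIV: iv copy.
opensslCiphertext := openssl encrypt: plaintext.

"Same key, IV, mode and padding should give byte-identical ciphertext."
Transcript cr; show: 'ciphertexts match: '; print: nativeCiphertext = opensslCiphertext.

"Cross-decrypt: OpenSSL reads what the native implementation wrote."
decryptor := Security.OpenSSL.DES newBP_CBC.
decryptor setKey: key; setIV: iv copy.
Transcript cr; show: 'cross-decrypt ok: '; print: (decryptor decrypt: nativeCiphertext) = plaintext.
```

When a mismatch shows up, it points at one specific cipher, mode or padding combination rather than at a failed SSL handshake somewhere downstream.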