Computers


Here are some that are noteworthy and funny:

  • Is it Christmas – http://isitchristmas.com – Funny because it tracks the timezones across the globe, though it really only has much to do on one day per year. That one day is coming up soon, so start checking in daily.
  • Is it Dark Outside – http://isitdarkoutside.com – Whopping funny. Works every day of the year.
  • Umbrella Today – http://umbrellatoday.com – I guess they are hawking an iPhone app, but I really enjoyed how the interface pulls you in a little bit at a time. And if you visit for the first time on a rainy day (or just put in a Seattle, Washington zip code), that rainy-day graphic is stunning.

It’s all about hiding the complexity. Do something hard with your software, but don’t make the users suffer just as you did.

A recent tweet from my pal #eknock complaining about some juicy ORA-06550: line 1, column 7 reminded me of these error codes that an old CGI script of mine produces:

  • Bad disktype code ORA-99xxx
  • CTXUSER not enabled ORA-010xxx
  • Flagellular misfire ORA-910xxx
  • Nascent order lost code: ORA-82xxx
  • Ferrule injector not found: ORA-14xxx

Of course it’s just a Perl script and the system had nothing to do with Oracle. Just one of those little things that can be done to throw snoopers off the trail.

Yesterday I attended the first-ever Hadoop World, sponsored by Cloudera and held in The Roosevelt Hotel in New York City. I took an early Amtrak train up to the big city and a late train back that same night. The conference was well attended: over 500 big-data heads were there, and the organizers did a fantastic job.

Some of the best stuff was just hearing about how other folks are using Hadoop. I also enjoyed hearing about the sizes of other people’s big-data problems. There were three tracks, so I only heard 1/3 of what took place, but here are some notes on what I did hear after the break.

It was a great day, a long day, glad I went.

The upgrade to WP 2.8 broke the comment permalinks, and 2.8.1 didn’t help. The user would get an error page after posting a comment (even though the comment was posted), and clicking on any of the links to a comment went to the-url-for-the-post/comments-page-1#comment-id, which would get you a 404. In the discussion area of the admin settings there is a (new?) setting to break comments up into pages of 50. Turning this off fixed the problem. I’ve never had 50 non-spam comments, so that shouldn’t hurt too much. Plus, I don’t think I ever turned it on in the first place.

This screencast covers Scaling Rails apps using Rack and Metal and is an excellent tutorial on both subjects. Jason Pollack, one of the Rails Envy guys, does a superb job explaining how Rack and Metal work in Rails 2.3.

Reading an interesting paper on d-Left Hashing (pdf link) by Bonomi, Mitzenmacher, et al. This is a space and efficiency improvement on Bloom filters. Wondering how it could be incorporated into a Hadoop MapFile to avoid scanning compressed blocks for keys that aren’t present. Maybe the work in HBase on o.a.h.hbase.io.BloomFilterMapFile would provide good clues. Need to understand the dynamic bit reassignment stuff first though.
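To make the idea concrete, here is a minimal sketch of guarding an expensive block read with a membership filter. It uses a plain Bloom filter, not the d-left variant from the paper, and everything in it (the `BloomFilter` class, the `lookup` helper, the `read_block` callback, the sizes) is made up for illustration rather than taken from Hadoop or HBase.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over byte-string keys (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per hash by salting SHA-1 with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(b"%d:" % i + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def lookup(key, bloom, read_block):
    """Skip the expensive read when the filter rules the key out.

    read_block is a hypothetical callback standing in for the scan of a
    compressed block.
    """
    if not bloom.might_contain(key):
        return None
    return read_block(key)
```

The filter can give false positives but never false negatives, so a miss is safe to trust and a hit still needs the real read.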

I lost the entire blog today with a carelessly placed argument to
‘rm -rf’. Ironically I was trying to delete some old backups of
this blog. It turned out I deleted everything.

Most of it was restored from backup. A few pictures are still
missing though.

I am an Emacs weenie. I really can’t get much work done without Emacs. I
have tried other things.

* vi/vim – I can use it and I try to learn a new command every once in a
while.
* TextMate – mkay. But I like Emacs. Plus Emacs is free.
* Visual SlickEdit – The parts that are like Emacs are nice. Emacs is
free.
* NetBeans – Can almost be configured to work like Emacs.
* Eclipse – No. I. Just. Can’t. Do. It. Very. Sorry.

So I am making this post from Emacs. Carbon Emacs 22.1 on Mac OS X.
Using weblogger mode. It turned out to be easy. The hard part was
figuring out what the “endpoint URL” was supposed to be
(/blog/xmlrpc.php for this WordPress blog). And enabling the Markdown
plugin. Yeah, that was tricky: had to click a checkbox.
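For the curious, this is roughly what a client sends to that endpoint. It is a sketch, not what weblogger mode actually does internally: I am assuming the MetaWeblog `metaWeblog.newPost` call, which WordPress accepts over XML-RPC, and the blog id, credentials and helper name are placeholders.

```python
import xmlrpc.client

# Path from the post above; the host it lives on is up to you.
ENDPOINT_PATH = "/blog/xmlrpc.php"


def build_new_post_request(username, password, title, body, publish=True):
    """Serialize a metaWeblog.newPost call without sending it anywhere."""
    content = {"title": title, "description": body}
    # "1" is a placeholder blog id; single-blog installs typically ignore it.
    params = ("1", username, password, content, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")
```

To actually post, something like `xmlrpc.client.ServerProxy("http://your-host" + ENDPOINT_PATH).metaWeblog.newPost(...)` with the same arguments should do it, with your-host being whatever serves the blog.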

See ya later, lame textarea edit box and lame WordPress markup style. We
hardly knew ye.

We are starting to use Thrift and needed an Ant build recipe. Here’s what we came up with. It works well, and the only thing that looks like a leaky abstraction to me is that I needed to know the package (the Java namespace) of the generated Thrift code and the name of one of the generated files. The primary goal was to avoid running the Thrift generator when the generated code is newer than the .thrift files. There isn’t a one-to-one mapping between .thrift files and generated output, so if any .thrift file is newer than the generated code, it all gets recreated.

I also didn’t want to have to copy the Thrift output to someplace else, so a javac target was added to just treat the “gen-java” Thrift output as a new source directory for direct Java compilation. The normal Ant target to compile the Java code can now just depend on “thrift-gen”.
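The staleness rule is easy to state outside of Ant. Here is a rough Python sketch of the same check, assuming one known generated file serves as the marker for the whole output directory. The function name and arguments are mine, not part of the build recipe.

```python
from pathlib import Path


def thrift_output_is_stale(thrift_dir, marker_file):
    """Return True when the Thrift generator needs to run again.

    marker_file is one known generated file standing in for all output.
    """
    marker = Path(marker_file)
    if not marker.exists():
        return True  # nothing has been generated yet
    marker_mtime = marker.stat().st_mtime
    # Any single newer .thrift file invalidates all of the generated code,
    # because sources and outputs do not map one to one.
    return any(src.stat().st_mtime > marker_mtime
               for src in Path(thrift_dir).glob("*.thrift"))
```

Needing the name of a single generated file is exactly the leak mentioned above: the check has to compare against something concrete.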

JNA is surely deserving of all the praise it has been getting. It’s being used on some pretty high profile projects like JRuby with great success. After having done JNI the hard way, the painful, tortuous, despicable, bang-head-on-keyboard-while-wondering-if-ReleaseStringUTFChars-applies-here and why-the-jvm-is-segfaulting-again way, well I have a deep appreciation for JNA.

Still, who wants to go write a bunch of useless Java interfaces for stuff that is already built into the native library itself? Not me. So that’s where Jython and JRuby come in.

This week I needed Jython/Python access to some native modules, namely ssdeep for fuzzy hashing. There’s already a pretty nice solution for connecting pure Python to native libraries: you can use either SWIG or Pyrex. The Pyrex piece for ssdeep has been mostly written here. I needed to add the fuzzy_hash_buf method into that mix, but it was nice and easy. From inside pure Python, with ssdeepmodule.so (via Pyrex) and libfuzzy.so (from ssdeep) (or .dylib for Mac or .dll on Windows) sitting there on your LD_LIBRARY_PATH, you get to do this coolness:

  from ssdeep import ssdeep
  import sys, os
  f = open("/bin/ls","rb")
  data = f.read()
  f.close()
  ss = ssdeep()
  fuzzy_hash = ss.fuzzy_hash_buf(data)

Pretty nice, you have to admit. But from my pure Java p2p data-driven workflow framework, I really wanted to do this from Jython to keep from having to start the interpreter up in a subprocess over and over. Pyrex extensions do not work in Jython. Makes perfect sense. M’kay. I could write a whole bunch of lame JNI code to hook libfuzzy.so in there. Or I could use JNA and write some non-DRY interface in Java and figure out all the details of the types and so forth. Or … I could just push all that code down into the Python module that I’m going to call.

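from com.sun.jna import NativeLibrary, Function, Memory

class ssdeep:
    # size of the result buffer that libfuzzy writes the hash into
    FUZZY_HASH_SIZE = 116

    def __init__(self):
        # load libfuzzy through JNA and look up the native function once
        self.fuzlib = NativeLibrary.getInstance('fuzzy')
        self.hash_func = self.fuzlib.getFunction('fuzzy_hash_buf')

    # same name as the pyrex method, so callers work with either module
    def fuzzy_hash_buf(self, data):
        ptr = Memory(self.FUZZY_HASH_SIZE)
        # the int result is a status code; the hash itself lands in ptr
        self.hash_func.invokeInt([data, len(data), ptr])
        return ptr.getString(0, False)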

With a class and method conveniently named exactly the same as the pyrex module I can make it all flexible enough to work either way:

try:
    # try the pyrex extension module
    from ssdeep import ssdeep
except ImportError:
    try:
        # try the jna wrapper when in jython
        from ssdeepjna import ssdeep
    except ImportError:
        # write tmp files and just exec the dumb thing
        ssdeep = None  # placeholder; the exec fallback is not shown here

Which is all pretty nice I think. Not too many worries about creating interfaces or other crazy things. Seems very efficient, a little extra packaging and we are good to go.
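The last-resort branch, writing a temp file and exec'ing the command-line tool, could look something like the sketch below. I am assuming the `ssdeep` binary is on the PATH and that it prints a header line starting with `ssdeep,` followed by `hash,"filename"` lines. Both function names are hypothetical.

```python
import os
import subprocess
import tempfile


def parse_ssdeep_output(text):
    """Pull the first hash out of ssdeep's CSV-style output (format assumed)."""
    for line in text.splitlines():
        if line.startswith("ssdeep,") or not line.strip():
            continue  # skip the header line and any blank lines
        # The hash uses only base64 characters and colons, so the first
        # comma separates it from the quoted file name.
        return line.split(",", 1)[0]
    return None


def fuzzy_hash_via_cli(data):
    """Write data to a temp file, run the ssdeep binary on it, clean up."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        out = subprocess.check_output(["ssdeep", path])
        return parse_ssdeep_output(out.decode("ascii", "replace"))
    finally:
        os.remove(path)
```

It pays a process launch per hash, which is exactly why the in-process options come first.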

Well, there was one problem: the hashes came out different when the data was binary. It turns out the JNA layer needs to be told how to convert strings, with -Djna.encoding=8859_1 on the JVM command line. Since I usually run with -Dfile.encoding=UTF-8 and in a UTF-8 locale, this made all the difference.

If that is inconvenient, or you want to encode things differently only sometimes, the extra steps in the Python layer would be something like:

  from java.lang import String
  def wrap_hash_buf(self,data):
    # round-trip through Latin-1 so each byte maps to exactly one char
    javastr = String(data,"8859_1")
    jbytes = javastr.getBytes("8859_1")
    return self.fuzzy_hash_buf(jbytes)

The same type of thing would work just as well from JRuby.
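My reading of why the encoding matters, shown in plain CPython rather than on the JVM: Latin-1 maps every byte value to exactly one character and back, while pushing the same characters through UTF-8 expands every byte above 0x7F into two. The helper names here are just for illustration.

```python
def round_trips(data, encoding):
    """True if the bytes survive decode-then-encode unchanged."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        return False


def misread_as_utf8(data):
    """Treat each byte as a Latin-1 character, then re-encode as UTF-8.

    This mimics binary data crossing a string boundary whose default
    encoding is UTF-8.
    """
    return data.decode("latin-1").encode("utf-8")


all_bytes = bytes(range(256))
print(round_trips(all_bytes, "latin-1"))   # every byte value is preserved
print(round_trips(all_bytes, "utf-8"))     # arbitrary binary is not valid UTF-8
print(len(misread_as_utf8(all_bytes)))     # longer than 256: high bytes doubled
```

A hash computed over the expanded bytes is a hash of different data, which would explain the mismatch on binary input.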

Welcome to the sweet spot. Code on, baby!
