What's the sweet spot between development time and performance in extracting the 24 from this XML string:
<eltname formatid="24" someotherattribute="xyz" ... />
The eltname varies, and while the formatid is usually in this front spot, it isn't guaranteed.
Presented with this problem, I settled on this pseudocode:
1. Try a regular expression pattern match
2. If the match succeeds, use the value it extracted
3. Otherwise run a SAX parser that looks up the desired attribute in the first startElement call and then aborts the parse.
Using SAX is clearly indicated for this over DOM, mainly because we don't need the parser to scan the whole input. With SAX we can abort the parse ourselves (apparently the way to do this is to intentionally throw an exception) in the startElement callback, after the opening element tag has been read. This may read more attributes than we need, but typically (for me anyway) most of the input is not in the start tag.
I liked this approach, since it looked like the presumably faster pattern match would usually work, and even if it didn't there was a reasonably performing option for getting the right answer. But once I started thinking about performance, I had to see what we would actually get. And by the way, I was working with Java on this problem.
I was horrified to find that the SAX parsing, using Xerces2, was taking from 0.750ms up to a millisecond to run on a 3.2GHz machine. The pattern match, while much better, was not super great, even though it was a compiled pattern returned by Pattern.compile.
So at this point, I just had to see what would happen if I hand-rolled a regular expression parser. It took a while, and increases the technical risk, but it seemed manageable. Here are the results I got:
<eltname formatid="24" someotherattribute="xyz" ... />
1. Hand-rolled parser: 0.9 microseconds
2. Compiled reg-exp match: 15 microseconds
3. Xerces2 SAX parser: 760 microseconds
This is for running each method on a single test string (typical for my intended purpose) of about 350 characters long, to extract the 24 from this form, on a machine with a 3.2GHz CPU (that's a clock period of around 0.3 nanoseconds).
I consulted the Xerces Performance FAQ list and found Xerces Performance Clue Number One:
Avoid creating a new parser each time you parse; reuse parser instances. A pool of reusable parser instances might be a good idea if you have multiple threads parsing at the same time.Yes. Do that, what the blockquote said. Heeding this advice made the SAX parse align a little better with the other methods:
1. Hand-rolled parser: 0.9 microseconds
2. Compiled reg-exp match: 15 microseconds
3. Xerces2 SAX parser with parser reuse: 52 microseconds
I peeped a little into just what takes 710 microseconds to execute new SAXParser() and it's a landscape littered with rattling across classloaders, checking for Xerces configuration files on the file system, checking the last modified time of the configuration file in case it's changed, almost like a makefile. Very, very flexible, and pretty heavy.
Overall, it looked like the compiled reg-exp match is acceptable, and if I do find this code getting called more often than I expected, I can switch to the hand-rolled parser.
No comments:
Post a Comment