Knowledge base

Our knowledge base contains our former blog entries (from 2009), technical problems and solutions, and ideas that we think are worth preserving. We will add new entries from time to time, in the hope that they will be useful to you.

The concept behind fragmented XML text nodes

posted Apr 20, 2010, 5:28 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:50 AM by István Soós ]

A few weeks ago a friend asked me about a problem with XMLStreamReader. We quickly concluded that it is not an error at all, it is simply in the nature of XML processing tools, but if you encounter it for the first time, it can seem strange. It is about the fact that XML text nodes are not necessarily processed at once: while you read the XML, you might receive only fragments.

For example, if you have the text "Q&A", which in XML is escaped to "Q&amp;A", you might end up reading first the string "Q", then "&" and finally "A", instead of reading it as one whole string. Take the following code:
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class TestTextNode {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        reader.next(); // START_ELEMENT: <test>
        reader.next(); // CHARACTERS: only the first fragment of the text node
        System.out.println(reader.getText());
    }
}
On Sun's Java 6 JVM you will receive just "Q" on the first read. On the consecutive reads you will receive the rest of the characters, but if you are unprepared, it just looks strange. So why does this happen?

XML allows you to have very large files. If you look, for example, at the wikipedia.org XML dumps, it is not unusual to have XML files larger than a few GB. There is no limit on how big a text node can be, so it is the responsibility of the tool to process it in reasonable chunks. If you ask it to load everything into a DOM, you will receive a large tree in memory - if you have much more memory than the size of the XML itself, you have a good chance that it will fit. However, for large XMLs or for some kinds of processing, you just stream through the data and do not build a DOM tree.

As in the example above, while you stream through the XML, you will receive text nodes. These are usually delimited by:
  • a closing or another opening tag
  • the buffer size of the streamer (if it is full, the stream reader receives the text read so far)
  • special escape characters (as above, the escaped &amp; resulted in a new fragment)
While the first one is trivial, the second and third are less-known internals of XML parsers, but from a memory consumption perspective there is a good reason behind them.

Now the question remains: can you parse the XML and receive all consecutive text nodes coalesced into one? It depends on the parser, but in Java you can; just put the following line after the factory initialization:
      factory.setProperty(XMLInputFactory.IS_COALESCING, true);
So it is no magic to change the behavior, although with recent hardware and software it might be better to have coalescing on by default and allow it to be turned off - although the current default is definitely the fail-safe one.
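If you cannot turn on coalescing (or prefer not to), the usual alternative is to accumulate the fragments yourself. A minimal sketch of that, using the same test XML as above, could look like this (an illustration, not production code):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CollectTextNode {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS) {
                // collect every fragment of the current text node
                text.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                // the text node is complete when the enclosing element closes
                System.out.println(text); // prints "Q&A" in one piece
                text.setLength(0);
            }
        }
    }
}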

published: 2009-08-29, a:István, y:2009, l:java, l:xml

Profiling Java application: measuring real CPU time

posted Apr 20, 2010, 5:26 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:57 AM by István Soós ]

When profiling applications, it is always important to measure time as precisely as possible. The old way was to read the system clock at increasing granularity, but in the meantime we have gained access to a more precise, thread-specific clock.

In the old-fashioned way, we have two methods to measure the system clock:
  • System.currentTimeMillis()
  • System.nanoTime()
Both methods depend on the operating system's internal clock API, so there is no guarantee on the granularity of the clock; however, nanoTime tends to be more precise - as you would expect. The difference between the two is that nanoTime is not related to calendar time in any way (it is not counting milliseconds since 1970 or anything like that). Measuring a single-threaded application, with not much multitasking in the background (e.g. "nobody shall touch the machine while I do the benchmark"), produces good results.
However, if the thread or process context is switched by the operating system (because the user moves the mouse, something else steals CPU time, or an IO interrupt arrives), measuring the system clock will always yield a larger number than the CPU time actually consumed. Of course, if you repeat the measurements and take the minimum of the numbers, it will give you a good estimate, but it will always remain just an estimate.

On the other hand, the Java platform now provides easy access to another clock: the thread's own CPU time counter, which measures the CPU time spent in the current thread. This is more precise in the sense that thread and process context switches are not counted, only the time when the thread is actually running. Although the usual Thread and System classes don't provide such methods, Sun's JVM offers a simple way to get it:
ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean(); // from java.lang.management
long time = threadMXBean.getCurrentThreadCpuTime(); // CPU time of the current thread, in nanoseconds
Simple, isn't it? Of course you can check other methods on this MXBean, but it gives the basic idea behind precise CPU time measurements.
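A minimal sketch that contrasts the two clocks around a placeholder workload (the loop below is just a stand-in for the code you actually want to measure) might look like this:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();
        long cpuStart = threadMXBean.getCurrentThreadCpuTime();

        // placeholder workload - replace with the code you want to profile
        double sum = 0;
        for (int i = 1; i < 10000000; i++) {
            sum += Math.sqrt(i);
        }

        long wallTime = System.nanoTime() - wallStart;
        long cpuTime = threadMXBean.getCurrentThreadCpuTime() - cpuStart;

        // the wall-clock time includes context switches and other threads' work,
        // the CPU time counts only the nanoseconds this thread spent on the CPU
        System.out.println("result: " + sum);
        System.out.println("wall-clock time (ns): " + wallTime);
        System.out.println("thread CPU time (ns): " + cpuTime);
    }
}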

published: 2009-08-24, a:István, y:2009, l:java, l:profiling

Code review: more politics than technology?

posted Apr 20, 2010, 5:24 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 2:10 AM by István Soós ]

A few weeks ago there was a question on a Hungarian Java user list about PMD and code review that made me wonder about the actual state of automated code review and code analyzer tools like PMD compared to the manual process. The political aspects of the code review process brought up some good and some bad memories too...

On the mail list...
My quick answer on the mailing list was along the following lines: "I wouldn't call PMD a code review tool, it is rather a code checker. Even when the PMD rules pass (especially the rules that actually make sense), we have had projects where our manual code review improved the quality of the code significantly."

Okay, no big surprise so far. But the reply made me think a little bit: "Our current development methodology doesn't have any place for code review. I am looking for a way to include it without disrupting the current development timelines. For the first 2-3 projects it will be feedback only, to build up some routine, and later on we can fine-tune the process. However, we have to start somewhere, and PMD looks like a good candidate for that."

Good luck with such an initiative! :)

On the dark side...
I have some doubts about the long-term success if it remains feedback-only and is not enforced even in actual projects. Why am I so skeptical? I think if something (a) is not mandatory, (b) is not part of the essential culture of the group, and (c) can be skipped to save some time before a hard deadline - it will essentially be skipped to save that crucial time... Been there, done that, forgot that... :)

One of our clients had a similar scenario a year ago: a multinational corporation that works mostly with outsourced developers (read: cheaper than everything = not always as qualified as you would like), while we were hired as architects for project assistance. Our first shock was when we realized that nobody (read: not a single person) had ever required an internal code review of the shipped products. Of course, officially they had it: if department A of the contractor company developed a product, they requested a code review from department B. Guess what: it always passed. Even when they contracted a third party, it was as cheap as or even cheaper than the original one, so it was no surprise that it never produced any reasonable report. By the way: have you noticed that code review reports make no sense at all to product managers who only know Excel sheets? :)

Anyway, as architects we were the first to introduce this requirement into the business process, and to be honest, we made progress on the technical parts but somewhat failed on the political side. At least it reinforced the belief: if it is not mandatory, it will be skipped - period.

In technical terms, we were quite good: we introduced PMD as an IDE plugin, so every developer was able to check the rules for themselves. Unfortunately, PMD contains a lot of rules that just make no sense or are irrelevant in the given project context, so we had to fine-tune the ruleset and eliminate some annoying ones. We created a few new rules to cover some critical scenarios, but generally we left PMD to do its job.

On the political side, we - as architects - were a bit of outsiders: we had no right to push the project deadline or block the handover process. Even if the project was buggy (in the PMD sense and in our common sense, like comparing Strings with == instead of .equals(...)), we had no impact on the process. The PMD reports were sent around, but the management didn't give a sh*t about them (they didn't understand them, and the deadline was too much of a worry anyway), so nothing happened with the results. The only thing that actually resonated was when I periodically sent a report about the number of errors in the project - with the historical numbers, showing a clearly rising tendency :[. Finally we were allocated the budget to do some manual code review as well, and that was a clearly better story - well, at least for us architects :].

And the brighter side...
Speaking of the good experiences, I once had a job at an investment bank, where we had a clever and responsible team of people around the infrastructure libraries. These libraries were crucial, many other departments used them in production environments - so naturally a solid code review process was in place. There was practically no code that wasn't peer reviewed before making it into a new release - with the exception of a few emergency releases, but those lived on a different branch.

Speaking of our internal developments, we have the luxury of doing such code reviews only irregularly, because instead of such a process we rely on:
  • unit testing
  • test coverage tools
  • performance profiling that monitors the critical parts
By the way, this resonates with the "automate everything" motto of successful businesses :)

And some supporting tools...
Cutting a long story short (oh, am I late with that?), my view on the technological part:
  • PMD is a nice tool, but it has limitations and obsolete rulesets (some of its checks are done by the Java compiler anyway). If you need similar tools, even open-source ones, you might check these code analyzers too.
  • Manual code review is always better than such tools. I don't think that static code review for 'feedback' purposes makes any sense without some policies behind it.
  • The weakest link is - as always - the timeline and the budget. If those don't allow the code review process to allocate more time for better quality, you can have the best toolkits in the world and still end up dropping them in the trash.
As I'm typing this entry, I've started looking into tools that allow better formalization and documentation of the manual code review process - basically better team collaboration. As one example, Atlassian Crucible looks interesting. Of course there are open source code review tools out there as well; it might be easy to pick one of those too.

Most of these tools are a mixture of version control repository viewer, issue tracker and commenting, so if you have these installed, you might just use them for the code review process too. Our preferred tool is Redmine, and it is pretty easy to use it for this purpose:
  • When committing code, don't just update the originating issue: also create a new one requesting a code review and optionally assign it to someone.
  • The review process will produce questions and comments; these can be new issues as well.
  • The discussions around these comments can be as organized or as loosely defined as you like - without much technological restriction...
And of course: none of these tools can beat the performance of two developers sitting at the same desk, checking the code on their monitors in real time while discussing their ideas :).

published: 2009-08-15, a:István, y:2009, l:automation, l:codereview, l:governance, l:pmd, l:redmine, l:testing, l:unittesting

Adding and removing rows in a Wicket ListView via AJAX

posted Apr 20, 2010, 5:09 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:45 AM by István Soós ]

As our regular readers already know, Wicket is our favorite web framework and we use it actively in our projects. Wicket is an easy-to-use, well-designed framework and is able to incorporate Ajax in a very nice and easy way. I personally am not a big fan of using Ajax in every corner of an application; however, at some points it can make your app much nicer. Let's look at such a case!

Imagine a form where you have to enter possible answers for a survey question, with the number of answers being dynamic. It would be possible to give the user a fixed number of text fields - let's say 8 should be enough - but what happens if the user wants more possible answers? Or what if (s)he only wants 2, and is not really happy about another 6 empty text fields taking up half of the screen? So let's go dynamic and "ajaxify"!

Let's give the user only 2 possible answers to begin with, plus a link to add more answer rows and another to remove unused ones. The form can contain a lot of other fields in which we are not interested, however adding or removing a row should not have any effect on the other fields. What's more, the app should keep the values that the user has already entered. Of course, the links should use Ajax and only refresh the dynamic list part of our form page, and they should refrain from submitting and/or validating the whole form. If you look up a tutorial or a good book on Wicket, it gives you a solution similar to this:
<!-- DynamicRows.html -->
<form wicket:id="form">
    <!-- Other non-repeating fields in the form -->
    ...
    <div wicket:id="rowPanel">
        <span wicket:id="rows">
            <span wicket:id="index">1.</span>
            <input type="text" wicket:id="text"/>
        </span>
        <a href="#" wicket:id="addRow">Add row</a>
    </div>
    ...
</form>
And here comes the associated Java code:
// Relevant constructor code in DynamicRows.java
...
// Create a panel within the form, to enable AJAX action
final MarkupContainer rowPanel = new WebMarkupContainer("rowPanel");
rowPanel.setOutputMarkupId(true);
form.add(rowPanel);

// List all rows
ArrayList rows = new ArrayList(2);
rows.add(new String());
rows.add(new String());
final ListView lv = new ListView("rows", rows) {
    @Override
    protected void populateItem(ListItem item) {
        int index = item.getIndex() + 1;
        item.add(new Label("index", index + "."));

        TextField text = new TextField("text", item.getModel());
        item.add(text);
    }
};
rowPanel.add(lv);

AjaxSubmitLink addLink = new AjaxSubmitLink("addRow", form) {
    @Override
    public void onSubmit(AjaxRequestTarget target, Form form) {
        lv.getModelObject().add(new String());
        if (target != null)
            target.addComponent(rowPanel);
    }
};
addLink.setDefaultFormProcessing(false);
rowPanel.add(addLink);
...
You may notice that we used an AjaxSubmitLink for adding a new row to the list. This is needed because the user might already have entered values in some of the fields and we don't want those to be lost, so the form values have to be submitted. However, we would like to avoid getting validation errors on some of the other fields when all we want is to add a new row, so we call addLink.setDefaultFormProcessing(false).

This is a nice solution, however if you try it you'll see that the values entered in the repeating rows get lost when a new row is added. The reason is that after pressing the "Add row" Ajax link, the TextFields don't update their backing model (since we have turned off form processing), yet the ListView removes and recreates all its TextFields again in its onPopulate() method. And so the "old-new" TextFields will show the original model values.

So what to do now? You could update the backing model of all the repeating TextFields in the Ajax "Add row" action, however this is still not a 100% solution when an invalid value has been entered. In that case the validation fails, the model doesn't get updated, and after recreating the TextFields the invalid value reverts to the last valid value entered. So it turns out that the problem is again caused by the recreation of all the TextFields.

You could cache and reuse all those TextFields in a custom subclass of ListView, as I did the first time. Or you could browse the source code of ListView and come across the reuseItems property, which is just the nice and clean solution to our problem:
  lv.setReuseItems(true);
This will reuse the already created TextFields and will call populateItem only for the newly added row. Since the TextFields are reused, they will remember the last valid or invalid value entered, and that's what we wanted from the beginning. All of the above logic can also be applied to a "Remove row" Ajax action if needed, as the sketch below shows. So, we have made a nice "ajaxified" ListView.
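For completeness, a sketch of such a "Remove row" action could look like the following (assuming a removeRow link in the markup next to addRow; the minimum row count check is up to your requirements):

AjaxSubmitLink removeLink = new AjaxSubmitLink("removeRow", form) {
    @Override
    public void onSubmit(AjaxRequestTarget target, Form form) {
        List rows = lv.getModelObject();
        if (rows.size() > 2) {
            // drop the last row; the remaining items are kept thanks to setReuseItems(true)
            rows.remove(rows.size() - 1);
        }
        if (target != null)
            target.addComponent(rowPanel);
    }
};
removeLink.setDefaultFormProcessing(false);
rowPanel.add(removeLink);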


published:2009-07-28, a:Szabolcs, y:2009, l:ajax, l:wicket

Captcha in Wicket with cache control and URL encryption

posted Apr 20, 2010, 5:06 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 2:17 AM by István Soós ]

Wicket provides a lot of useful features, among others a lot of out-of-the-box components. And if a component doesn't suit you, you can easily create your own. Recently we encountered this with a captcha component: we required a few different features (e.g. an easier-to-read captcha), so we created our own captcha panel.
  
We have chosen SimpleCaptcha as the image provider: it can create really difficult captchas, yet it also suits our requirement for easier ones, and it can be configured like a charm. Everything seemed to be easy and it basically worked for simple examples. However, we had created an Ajax tabbed panel that, on a few tabs, contained this captcha. The problem hit us when the user switched between these tabs: the captcha text changed on the server side (expected), but the image did not change on the client side (if you enter the text, it will update, but that is not very user-friendly, is it?). What can you do in a similar scenario?
  1. Set the cache control directives in the response header of the image
  2. Add a bit randomness to the image URL
  3. Obfuscate the URL of the image
Some believe that the first point is enough for most scenarios, some would go with the second option as well; however, it is always a good idea to implement the third one too - it comes almost free and effortless with Wicket.

1. Set the cache control directives

You can set these as part of DynamicWebResource or one of its subclasses, e.g. the image resource we used:
BufferedDynamicImageResource bdir = new BufferedDynamicImageResource() {

    private static final long serialVersionUID = 1L;

    @Override
    protected void setHeaders(WebResponse response) {
        super.setHeaders(response);
        response.setHeader("Cache-Control", "no-cache, must-revalidate, max-age=0, no-store");
    }
};

2. Add a bit randomness to the image URL

The Wicket dynamic image generates its URL by itself, but with a simple behavior you can modify it and append an extra item to the end of it:
Image image = new Image("captchaImage", bdir);
image.add(new AbstractBehavior() {

    private static final long serialVersionUID = 1L;

    @Override
    public void onComponentTag(Component component, ComponentTag tag) {
        tag.getAttributes().put("src",
                tag.getAttributes().getString("src") + "&nanoTime=" + System.nanoTime());
    }
});
add(image);

3. Obfuscate the Wicket URLs

This method will obfuscate every non-bookmarkable URL in your application. It is not only for the captcha images: it helps you hide most of your internals and prevents search engines from indexing them as part of the URL. This guide can help you achieve more, but basically you need to add these lines to your Application class:
@Override
protected IRequestCycleProcessor newRequestCycleProcessor() {
    return new WebRequestCycleProcessor() {
        @Override
        protected IRequestCodingStrategy newRequestCodingStrategy() {
            return new CryptedUrlWebRequestCodingStrategy(new WebRequestCodingStrategy());
        }
    };
}

published: 2009-07-22, a:István, y:2009, l:captcha, l:encryption, l:wicket

Spring + GWT: integration with ease

posted Apr 20, 2010, 5:00 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 2:56 AM by István Soós ]

On the company website, we have indicated our intention to share some of our work as open source. This blog seems to be a good opportunity to start that process, especially when the toolkit is as simple as our GWT integration solution with the Spring framework. (Have you noticed that springframework.org is redirected to springsource.org? I wonder if the packages will be renamed too, just to break backwards compatibility - *evil*)

There has always been a fuss about how GWT and Spring can be integrated. There are solutions that go the hard way: defining the services as servlet paths with some hacking to access the application context, or following Spring MVC with separate controllers. A year ago, when Spring annotation support was gaining more awareness, there were only a few solutions for using it with GWT. At that time, we developed a GWT application that required Spring integration, so we took the ideas from Chris Lee's blog and extended them to our needs. Now it is our turn to share that with the open source community, hoping that we can say something new, or at least less known...

Chris and Martin (check the blog's comments), however, started down a much more interesting path, which was near perfect for us: define your service interface in GWT, implement it as a normal class on the server side, and let Spring do the magic with the binding and such. You need the following:
  • GwtServiceHandlerMapping - for mapping the service with a given servlet path
  • GwtServiceHandlerAdapter - for processing the request on that url and delegating the request to the invoked method
  • RpcProxyWrapper - that handles the client side magic
On top of these, we had a requirement to pass a bunch of GWT-serialized objects to the HTML content (through Freemarker), so we have specified a GwtObjectSerializer interface that helps us do just that. And why would this be important? If you are targeting search engines with your GWT-enabled application, it is definitely important to have a mixed content model with both HTML content and GWT. It is not hard at all, so we share that source here too. You can find these sources and the compiled version as an attachment on this page (temporarily, as we had no better idea where to place it).
It is pretty simple, not very well documented, and sometimes it feels like it hasn't been cleaned up (some System.out.println calls were left in by some lazy developer who forgot to remove his debug code) - which is true, as the project received a slightly different version of it. We might clean it up later...

Anyway, how can you use it?

Add the following lines in your servlet config xml:
<bean class="org.squaredframework.gwt.rpc.server.GwtServiceHandlerAdapter"/>
<bean class="org.squaredframework.gwt.rpc.server.GwtServiceHandlerMapping"/>
Suppose you have a Service interface:
public interface SimpleSearchService extends RemoteService {

    public SearchResult search(String text);

    /**
     * Utility class for simplifying access to the instance of the async service.
     */
    public static class Util {
        private static SimpleSearchServiceAsync instance;

        public static SimpleSearchServiceAsync getInstance() {
            if (instance == null) { // gwt client calls are single-threaded
                instance = GWT.create(SimpleSearchService.class);
                RpcProxyWrapper.get().wrapProxy(instance);
            }
            return instance;
        }
    }
}
With the async pair:
public interface SimpleSearchServiceAsync {
    public void search(String text, AsyncCallback<SearchResult> callback);
}
Just implement the service (and observe that nothing special is added in the implementation):
@Controller
public class SimpleGwtSearchServiceImpl implements SimpleSearchService {
...
}
And that is it. On the client, you can use the following code to invoke the service:
SimpleSearchService.Util.getInstance().search(searchText,
        new AsyncCallback<SearchResult>() {
            public void onFailure(Throwable caught) {
                // display error message
            }
            public void onSuccess(SearchResult result) {
                // display result
            }
        });
And we are done. Pretty simple, isn't it? Once the Spring magic is in place, neither the service implementation nor the client code could be much simpler. (If it can, please let me know!) And what happens inside?

The services are mapped in a special way, which is automatically known to both the server and the client: with some prefix and postfix, we simply transform my.server.Service into the /gwt-rpc/my/server/Service URL. Simple as that; the details are in the RpcProxyWrapper and GwtServiceHandlerMapping classes. As a bonus, you are now refactoring-safe.
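As a hypothetical illustration of that convention (the real logic lives in RpcProxyWrapper and GwtServiceHandlerMapping, and the prefix is of course configurable there), the transformation boils down to something like:

// hypothetical sketch of the naming convention described above
public static String serviceUrl(Class<?> serviceInterface) {
    // my.server.Service -> /gwt-rpc/my/server/Service
    return "/gwt-rpc/" + serviceInterface.getName().replace('.', '/');
}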

Oh, and for those of you who are wondering what this squaredframework.org is: it is our registered domain, intended to host our open source initiatives. We are not sure how we will proceed with it, but at least we name our shared classes that way.

Update (on July 19): I've just recently come across Dustin's project: spring4gwt. He had very similar ideas; it might be reasonable to merge these, so if he takes the initiative, expect some merges on his page...


published: 2009-07-13, a:István, y:2009, l:gwt, l:rpc, l:spring

Amazon EC2 + OpenSolaris + ZFS + EBS + encryption

posted Apr 20, 2010, 4:58 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 2:48 AM by István Soós ]

The company made a good decision in recent weeks: the target is the sky, or at least the cloud. Amazon AWS offerings are hard to beat, so we started with them, played around with different configurations a bit, and finally decided that first we shall migrate the company Subversion repository to the cloud, with ZFS mirrors and encryption.

I'm a long-time fan of the ZFS filesystem and Sun's OpenSolaris offering around it, basically because it is the best easily accessible filesystem that provides drive mirroring with checksums, enabling automatic recovery from failures of the underlying storage. So it became a natural plan to run OpenSolaris on EC2, with ZFS on mirrored EBS volumes. Although EBS is meant to be very robust, there are always failures in every system, and we have read a few blog entries where EBS actually did fail, so better be prepared...

We know that we cannot achieve absolute secrecy unless we unplug the server, dump it in a big hole in a deserted location and forget about it, but it seemed reasonable to have some encryption. The plan was that when the instance starts, we log in and attach the encrypted ZFS pool by typing the password. Okay, the running instance may be monitored and the content might be extracted if the infrastructure allows such a move, but we hope that is a much harder and more privileged job than sniffing around a volume snapshot.

I've mailed the Sun OpenSolaris EC2 team, and they were very kind, giving me the initial pointers. I can recommend the following sites on this topic:
Basically, the last one pretty much describes most of the important parts, but there are a few differences on EC2. First, the Web Console doesn't let you attach the EBS volumes properly, because it offers /dev/sdf-like mount points, and that is not what you are looking for, as the OpenSolaris AMI expects the device number instead. So go to the command line or use ElasticFox to attach these drives properly. In our test drive, I attached two 1GB volumes as the 2nd and 3rd drives of the EC2 instance; they became c7d2 and c7d3 respectively.

To cut a long story short, I've used the sun-opensolaris-2009-06/opensolaris_2009.06_32_6.0.img.manifest.xml AMI, and here are the commands that were required to complete the process:
# zpool create rawstorage mirror c7d2 c7d3
# zpool status
# zfs create rawstorage/block
# dd if=/dev/zero of=/rawstorage/block/subversion bs=1024 count=0 seek=$[1024*512]
# ls -lh /rawstorage/block/subversion
# lofiadm -a /rawstorage/block/subversion -c aes-256-cbc
# zpool create subversion /dev/lofi/1
# zpool status
# pkg install SUNWsvn
# svnadmin create /subversion/research/
So what does this setup give me?
  • I have mirrored storage over EBS (the rawstorage pool).
  • I have a ZFS filesystem (/rawstorage/block) on that pool, so I can turn on compression if I'd like, create snapshots, extend it, and so on.
  • I've created a block file (/rawstorage/block/subversion) on this storage with a reasonable starting size. Okay, I haven't checked the size of our ivy repository, so this might not be enough for real use. Is there a more robust (or at least extendable) solution for this?
  • I've attached it as an encrypted loopback device (the /dev/lofi/1 device appeared) and set the password.
  • I created a new zpool on top of this device (the subversion pool).
  • I installed SVN and used it...
This works from this point on, but what happens if I shut down the instance and start a new one? Well, let's attach the EBS volumes again, and follow these commands:
# zpool import
# zpool import rawstorage
# lofiadm -a /rawstorage/block/subversion -c aes-256-cbc
# zpool import -d /dev/lofi
# zpool import -d /dev/lofi subversion
# ls -lh /subversion/research/
Cool, it works again! You just need to import the rawstorage pool first, attach the lofi driver (typing the proper password here), import the second pool, and use it as you like.

But what happens if the password is wrong? First of all, the lofi driver cannot tell. That seems bad at first, but actually it doesn't matter, as we are not going to write any data if we are not able to import the subversion pool. So the worst case is that you type a bad password, zpool import won't import the subversion pool, and that is it. In such a case, you detach the lofi device and retype the password until the pool can be imported.

Simple? It seems to be, but before you put all your crucial data on top of it, you might want to play around a bit with OpenSolaris and EC2 first. Many thanks to the Sun and Amazon teams for enabling such a marvelous combination of technologies.

Update on 2009-07-16

Last week we have made a little proof of concept about the encrypted Subversion on Amazon EC2. This week, we decided to move forward and migrate most of our development-related stuff to the EC2 cloud, and now here goes our little success story.

The ZFS encryption works mostly as described in the previous blog entry, although there is a little difference after we rebundled the OpenSolaris image. (Make sure you follow this guide!) The difference is that on the rebundled image you have to do something like this (assuming that 'storage' is the normal pool and 'safe' is the encrypted pool):
zfs mount storage
lofiadm -a /storage/block/encrypted -c aes-256-cbc
zfs mount safe
Apart from that, everything works as expected. We have made the following setup on EC2:
  • The OpenSolaris image handles two EBS volumes in a mirrored ZFS pool. This 'storage' pool has compression turned on to decrease the number of IO operations a bit.
  • On the storage pool, we store some downloadable stuff, but most of our data is on the encrypted volume (the 'safe' pool).
  • Our issue tracker is Redmine, and although it is hard to set up the first time and it has some limitations in project identifier handling (a 20-character id is not really long), it is good enough for issue and time tracking (+ wiki, + documents, + Subversion access control + ...).
  • We use a PostgreSQL database to store the Redmine data.
  • Our Subversion repository is exported over WebDAV, with access control delegated to Redmine. A single entry point for administration means less overhead...
If we ever need larger storage, we just attach a new drive, let ZFS handle the hard stuff, and detach the old one. We have all our development infrastructure on a reliable remote server (okay, we still need to do regular backups, even on Amazon), and we are paying much less than at our previous server hosting provider. And our public company page can be hosted on a cheap host, as it is 100% static content.

So far so good.

Update on 2009-08-07

We have started evaluating and using Amazon EC2 almost a month ago. Here are our 'lessons learned' items.

Be prepared...
We have evaluated and used encryption with OpenSolaris and ZFS on EBS. We have successfully rebundled the instance to migrate our Subversion repository to this server. Although we have always typed the encryption password correctly since this migration, we finally decided to check a few scenarios, e.g. what happens when we do type it wrong: can we lose data somehow? Just in case something does go wrong, we created EBS snapshots of the volumes. After some testing, we consider the data loss scenario unlikely, because if we type the password wrong, we will see something like the following:
Initial state:
  pool: safe
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        safe           FAULTED      0     0     0  corrupted data
          /dev/lofi/1  ONLINE       0     0     0

  pool: storage
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Fri Jul 24 12:42:15 2009
config:

        NAME         STATE     READ WRITE CKSUM
        storage      ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c7d2     ONLINE       0     0     0
            c7d3     ONLINE       0     0     0  43.5K resilvered

errors: No known data errors

# zfs mount safe
cannot open 'safe': I/O error
So we just need to remove the lofi device with lofiadm and remount it, which solves the problem.

Automate...
It is always a good idea to document things, and this is especially true with a sometimes transient service like Amazon EC2. It turned out that there was a startup bug in the official OpenSolaris bundle, and you need to rebundle your server if you would like to have the fixed version. We did, as we had encountered this bug a few times, and the documentation became very handy: we just had to copy-paste the commands into the console and wait for the output, as most of our documentation reads like a shell script.

The next level of automation will be to create expect-scripts to automatically set up and bundle full images. I'd suggest that anyone starting with EC2 write the setup scripts in this latter fashion from the beginning. For hard-core Java people like myself, ExpectJ or Enchanter are viable options too, but the ultimate solution is to use something like JSch and Groovy to control every aspect of the communication.
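As an illustration, a minimal JSch-based sketch that runs a single remote command could look like the following (the host name, key file and command are placeholders):

import java.io.InputStream;
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteCommand {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity("/path/to/ec2-keypair.pem"); // placeholder key file
        Session session = jsch.getSession("root", "ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 22);
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("zpool status"); // any setup command goes here
        InputStream in = channel.getInputStream();
        channel.connect();

        // print whatever the remote command writes to its standard output
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            System.out.write(buffer, 0, read);
        }
        System.out.flush();

        channel.disconnect();
        session.disconnect();
    }
}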

Automate, automate...
When we start an instance, we attach the drives and the elastic IP, then execute a few commands to mount the encrypted storage and start the services. This is a very boring process, and fortunately you can automate it too:
  • Use the Amazon EC2 command line tools to query information from your available resources.
  • Tag your resources according to your service needs (e.g. if you have a Redmine server, put the redmine tag on the EBS volume and on the elastic IP of that instance).
  • Write scripts that process these tags with the help of the above mentioned command line tools, attach the drives and IP automatically.
  • Execute other scripts (e.g. the encryption) on the running instance to fire up everything.
Even if you are using encryption, late service starting or other exotic requirements, you can reduce the number of required steps to a very small number (1-5, including typing the password).

Automate, automate, automate...
Sometimes it is not known before the server setup how often you would like to run backup or report processes. Rebundling the server just to add a new crontab entry is a very unlucky task for anyone involved. It is better to prepare the bundle image with a few cron jobs that might never be used, but if we do end up needing them, we are not required to re-bundle the image. For example, the following commands define an hourly report script:
export EDITOR=nano
crontab -e
# 58 * * * * [ -x /safe/home/root/hourly-report.sh ] && /safe/home/root/hourly-report.sh
As you can see, this script is placed in the '/safe' directory, which is on the encrypted volume. If for some reason the encryption / mount fails, or if there is no such file at that place, there will be no error: the [ -x ... ] test ensures the script is executed if and only if it is present and executable. Placing it on the encrypted volume gives us the opportunity to store a few more confidential items there as well, e.g. our script can encrypt the report mail, or use some sftp mechanism to access a remote site for the report.

Of course, the type and variety of the scripts you define in your crontab is entirely up to you.

Be patient...
With the ElasticFox plugin, we have encountered some strange problems, e.g. sometimes it takes a very long time to get the list of KeyPairs. One impatient team member clicked on the 'create' button, typed the same name we had used previously, and thereby silently removed our old key and replaced it with a new one. The KeyPair was distributed internally again, but this is the kind of silly move that is better avoided.


published: 2009-07-09, a:István, y:2009, l:aws, l:cloud, l:ebs, l:ec2, l:encryption, l:opensolaris, l:subversion, l:zfs

What is the price to pay for developer productivity?

posted Apr 20, 2010, 4:39 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:40 AM by István Soós ]

I would like to share some thoughts with you about an untold tale in software engineering. Most probably you have already read or seen quite a few marketing-oriented documents or events with a similar message:
  • Use configuration instead of coding
  • Change system behavior or business rules without recompiling
  • Increased developer productivity...
All of these (let's call them dynamics) are nice, however I've never seen (not even in a footnote with a tiny font) what price you have to pay for these features. And there surely is a price to pay compared to the "old" approach, where you didn't put much of your program into configuration and you had to recompile it for even the smallest changes. Let's have a closer look at one very common case, the usage of JavaBeans.

A JavaBean is primarily meant to be a "data holder" (or business object, if you like that terminology better), with no or very little application logic included. Its main purpose is to hold data that is set through setter methods and returned when needed via getter methods. With some simple rules defined (for an attribute named attribute there must be a getAttribute and a setAttribute method), it is possible to dynamically explore and use any JavaBean with the help of the Java Reflection API (java.lang.reflect package). In most of the cases mentioned at the beginning, this is the backing concept, and the one that makes a developer's life easier (at least in the development phase). Let's have a look at a simple JavaBean:
public static class Person {
    private String name;
    private int birthyear;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getBirthyear() {
        return birthyear;
    }
    public void setBirthyear(int birthyear) {
        this.birthyear = birthyear;
    }
}
This is a very simple JavaBean. If you want to work with it the "old" way, you call those getter and setter methods directly, but then you have to explicitly write those at compile time, making your app not very dynamic. The "new" way is this:
java.lang.reflect.Method setNameMethod = Person.class.getDeclaredMethod("setName", String.class);
setNameMethod.invoke(person, "James Bond");
Even better, you can explore the methods beginning with "set" and "get" and their parameters dynamically, and store them in a Map for later use. More or less this is the way all of today's frameworks function, from JSP's Expression Language to Wicket's PropertyModel binding. Since this - not very nice - code can be put into the framework, the developer only sees the nice and easy configuration of ${person.name}="James Bond" or something similar. Ok, but what will the developer see at runtime (with an emphasis on time)?
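A minimal sketch of that caching idea (much simplified compared to what real frameworks do, and shown only to illustrate the mechanism) could be:

import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class SimpleBeanAccessor {
    private final Map<String, Method> getters = new HashMap<String, Method>();
    private final Map<String, Method> setters = new HashMap<String, Method>();

    public SimpleBeanAccessor(Class<?> beanClass) {
        // discover the getXxx/setXxx pairs once, up front
        for (Method method : beanClass.getMethods()) {
            String name = method.getName();
            if (name.startsWith("get") && method.getParameterTypes().length == 0
                    && !name.equals("getClass")) {
                getters.put(property(name), method);
            } else if (name.startsWith("set") && method.getParameterTypes().length == 1) {
                setters.put(property(name), method);
            }
        }
    }

    private static String property(String methodName) {
        // "getName" -> "name"
        return Character.toLowerCase(methodName.charAt(3)) + methodName.substring(4);
    }

    public Object get(Object bean, String property) throws Exception {
        return getters.get(property).invoke(bean);
    }

    public void set(Object bean, String property, Object value) throws Exception {
        setters.get(property).invoke(bean, value);
    }
}

With this in place, the framework-style call is simply new SimpleBeanAccessor(Person.class).set(person, "name", "James Bond").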

Well, attached below is a small tester app that shows the runtime difference between direct and indirect (reflective) calls of a JavaBean's setter and getter methods. The test performs a typical scenario in which some dynamically configured behavior of a framework reads and/or updates all the fields of a JavaBean. Please note that the solution used is probably the simplest possible for indirect method invocation; most frameworks use something more complex, for example via the java.beans package, but in the end it always comes down to java.lang.reflect.Method.invoke(...)! You can easily run the app yourself, so I only show some interesting results:

Test cycles                          10      50     200    1000    5000   20000  100000  500000
Direct calls (average in ns)       5700    5800    5700    5700    5800    5100    4500    4300
Reflective calls (average in ns)  32000   60000   32000   36000   18000   12000    6300    5100

We can see that there are situations when it is 10 times slower to call the same method(s) reflectively than directly. This is a huge difference! And the average doesn't even contain the discarded maximum values, which would make the gap even bigger. By increasing the number of test cycles well above 1000, the difference melts down to about 20%. However, I think the reason for this is the HotSpot optimization in the JVM when it sees that the same cycle is repeated an awful lot of times. I am sure that in a normal application, where such code is not executed in a tight loop like this, JVM optimizations are not likely to be so effective. Most probably, in a normal application, similar developer-friendly features (dynamics) take up to 5-6 times more CPU time for the JVM to accomplish.
Well, what do you think: is this a big price to pay for developer productivity?


published: 2009-09-11, a:Szabolcs, y:2009, l:java, l:productivity, l:profiling

DataOutputStream: encoded string too long

posted Apr 20, 2010, 4:35 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:42 AM by István Soós ]

As I'm preparing to release OKTECH Profiler 1.1, I have checked the performance benchmarks of the profiler itself. It became apparent that the UTF-8 conversion consumes a lot of time, so I started to investigate what happens behind the scenes. I got a little shock at the DataOutputStream class: it has a serious limitation, as it doesn't allow writing strings whose encoded form is larger than 64KB. I thought those times(*) were over and it was just the Java reference source code that had this limitation, so I've written a small program to double-check it:
public static void main(String[] args) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
        sb.append("1234567890");
    String s = sb.toString(); // 100,000 characters, well above the 64KB limit

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    dos.writeUTF(s); // throws UTFDataFormatException
    dos.close();
}
(*) I got lazy in Java and assumed that it just works.

To my great dislike, it fails on my Mac too: DataOutputStream really does have this 64KB limitation. Having another look at the Javadoc, it does state this. At the moment this shouldn't affect OKTECH Profiler, as I cannot imagine any stack trace that has a method or class name longer than that.
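If you ever do need to write longer strings, one possible workaround (a sketch of the general idea, not what OKTECH Profiler does) is to length-prefix the raw UTF-8 bytes with a full int instead of the unsigned short that writeUTF uses:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LongStringIO {
    // write the string as a 4-byte length followed by the raw UTF-8 bytes
    public static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes("UTF-8");
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    public static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, "UTF-8");
    }
}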

On the other hand, we are now working on a change that allows larger flexibility in the dumps, making it easier for 3rd party plugins to contribute to the profiler runtime and analysis. After this experiment, I'm considering a more XML-like dump format, e.g. Fast Infoset, a pretty good binary XML format. (Update: no, it won't be binary XML, as it is so much slower than pure DataOutputStream.)


published:2009-09-26, a:István, y:2009, l:java, l:profiling

Comma separated list: CSS or Wicket?

posted Apr 20, 2010, 4:19 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:41 AM by István Soós ]

Recently I encountered the problem of displaying comma-separated list items on a web page. It came naturally to check whether it can be done in CSS or not. This page explains the concept implemented in CSS, and on this example page you can check it yourself. It works in Safari and Firefox, but does not work in IE 7 - it just doesn't display any commas at all, the items are listed with spaces between them. Too bad :(
That leaves us to implement the comma-separated list inside the application, in our case in Wicket. The following small code fragments explain the basic idea:
<wicket:container wicket:id="list">
  <span wicket:id="comma"> </span>
  <span wicket:id="label">some value</span>
</wicket:container>
We created a container without markup; inside the container we defined a "comma" component and a "label" component. If we take a look at the Java code, it is pretty easy to understand how it works; the actual code fragment uses a ListView:
   add(new ListView<String>("list", new MyListModel(myParentModel)) {
       private static final long serialVersionUID = 1L;
       @Override
       protected void populateItem(final ListItem<String> item) {
           item.add(new Label("comma", ", ") {
               private static final long serialVersionUID = 1L;
               @Override
               public boolean isVisible() {
                   return item.getIndex() != 0;
               }
           });
           item.add(new Label("label", item.getModel()));
       }
   });
The first comma component is not shown, while all the others are. It is really simple to implement - once you have the idea and the design. Of course, you might get rid of the <span> elements too, if that makes sense in your application - see the sketch below.
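For example, one way to drop the <span> tags from the rendered output is to keep them in the template but strip them at render time with setRenderBodyOnly(true) - a sketch, assuming the same populateItem as above:

item.add(new Label("comma", ", ") {
    private static final long serialVersionUID = 1L;
    @Override
    public boolean isVisible() {
        return item.getIndex() != 0;
    }
}.setRenderBodyOnly(true));
item.add(new Label("label", item.getModel()).setRenderBodyOnly(true));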


published:2009-09-27, a:István, y:2009, l:wicket
