Google AppEngine Disappointment

I was eager to try out Google’s AppEngine for Java, but I was soon disappointed to find out that AppEngine offers only a partial implementation of the Java APIs. The biggest problem with the missing APIs is that a lot of third-party software and libraries simply won’t work, which means that you lose all the benefits of the mature Java ecosystem. Porting an existing Java application to AppEngine is simply out of the question.

Another major problem is persistence. You cannot deploy your favourite DBMS on AppEngine; you have to use Google’s Datastore, which is not a conventional relational database. Google provides JDO and JPA interfaces to access the Datastore. However, these are also partial implementations: they offer a familiar syntax for a substantially different persistence mechanism and so create a false sense of familiarity. This becomes obvious when you turn to the JDO documentation for troubleshooting, only to discover that the semantics of Google’s JDO implementation are quite different. It would have been much better to stick with a proprietary persistence API rather than pretend to match JDO or JPA semantics.

Another important missing feature is threads, which makes AppEngine a poor choice for massive computations since parallelization of subtasks is not possible, at least until Google implements the AppEngine Map/Reduce service.

Lastly, the overall sense of a finished product is not there. Things that one would expect to work fail for some obscure reason. Error messages are often cryptic. Apart from an introductory tutorial, documentation is quite scarce. I guess this is justifiable since AppEngine is still a beta release, but I would rather consider it a pre-alpha. I was expecting much more considered choices from the smart Google engineers, who have amazed us many times so far.

There were a few things I did like about AppEngine. For example, the Eclipse plugin simplifies configuration, testing and deployment. Integration with other Google services is a promising factor. All in all, Google AppEngine is an interesting toy, but compared to other solutions it remains just a toy. If you want to play with it, I suggest you should read this.

Time Tracking Tools

Have you ever asked yourself how much time during the day in the office you spend looking at funny YouTube videos, reading emails, having coffee or doing some boring administrative task that keeps you from getting your work finished? I’ve actually been asking myself that for what feels like my entire life. I always thought that a well organized and productive 3 hours could be more valuable than a whole day of distractions. I always wanted to measure that productivity in detail and see what things I’m spending my time on and how much. Now I finally can.

A few months ago I found out on Lifehacker about a piece of software called RescueTime that tracks how you spend your time on a computer. It’s free and it runs in the background. It categorizes your activities into Communication, Development Tools, Reference/Search etc. If you are using a browser, it is smart enough to distinguish which sites you are accessing, so reading The Server Side and Failblog.org are considered two different activities.

rescuetime-screenshot

One downside of RescueTime is that it posts your usage statistics to their servers, so if you don’t want anyone to see how much time you spent… watching porn in the office, you might choose to disable RescueTime for a couple of hours. On the other hand, your usage statistics are accessible only by you, so your boss can’t dock your salary for every 5 minutes you spend checking your private mail.

Another downside is that it doesn’t show you exactly when you started working with a particular program and when you were AFK. So you can’t distinguish between a meeting and a lunch break.

Both of these problems are solved by ManicTime. ManicTime works only locally and gives you information on what you did at an exact moment of the day:
manictime-screenshot

The only downside of ManicTime is that it is too fine-grained, so you don’t get the quick overview that you can get in RescueTime. Currently I’m using both of them until I decide which one is better or one of them implements the functionality of the other.

Active and Passive Replication in Distributed Systems

In distributed systems research, replication is mainly used to provide fault tolerance. The entity being replicated is a process. Two replication strategies have been used in distributed systems: active and passive replication.

active-passive-replication

In active replication each client request is processed by all the servers. Active replication was first introduced by Leslie Lamport under the name state machine replication. It requires that the process hosted by the servers be deterministic: given the same initial state and the same request sequence, all processes produce the same response sequence and end up in the same final state. To make all the servers receive the same sequence of operations, an atomic broadcast protocol must be used. An atomic broadcast protocol guarantees that either all the servers receive a message or none do, and that they all receive messages in the same order. The big disadvantage of active replication is that in practice most real-world servers are non-deterministic. Still, active replication is the preferable choice for real-time systems that require a quick response even in the presence of faults, or for systems that must handle Byzantine faults.
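As a rough illustration of why determinism and a common message order matter, here is a minimal Python sketch. The `CounterReplica` class and the `atomic_broadcast` function are hypothetical names made up for this example; the "broadcast" is just a loop standing in for a real atomic broadcast protocol, and no failures are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class CounterReplica:
    """A deterministic state machine: same initial state plus the same
    request sequence always gives the same responses and final state."""
    value: int = 0
    responses: list = field(default_factory=list)

    def apply(self, request):
        op, amount = request
        if op == "add":
            self.value += amount
        elif op == "mul":
            self.value *= amount
        else:
            raise ValueError(f"unknown operation: {op}")
        self.responses.append(self.value)
        return self.value


def atomic_broadcast(replicas, requests):
    """Stand-in for an atomic broadcast protocol (an assumption of this
    sketch): every replica gets every request, all in the same order."""
    for request in requests:
        for replica in replicas:
            replica.apply(request)


replicas = [CounterReplica() for _ in range(3)]
atomic_broadcast(replicas, [("add", 2), ("mul", 5), ("add", 1)])
states = [replica.value for replica in replicas]

# A replica that sees the same requests in a different order diverges.
out_of_order = CounterReplica()
for request in [("add", 2), ("add", 1), ("mul", 5)]:
    out_of_order.apply(request)
```

All three replicas end in the same state because they applied the same requests in the same order, while the `out_of_order` replica ends in a different one. That is the reason plain, unordered delivery is not enough for active replication.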

In passive replication there is only one server (called primary) that processes client requests. After processing a request, the primary server updates the state on the other (backup) servers and sends back the response to the client. If the primary server fails, one of the backup servers takes its place. Passive replication may be used even for non‐deterministic processes. The disadvantage of passive replication compared to active is that in case of failure the response is delayed.
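The primary/backup scheme can be sketched in a few lines of Python. `Server` and `PrimaryBackupGroup` are hypothetical names for this example only, and the sketch makes simplifying assumptions: failure detection is a flag flipped by hand, and the whole state is copied to the backups after every request.

```python
import random


class Server:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.alive = True


class PrimaryBackupGroup:
    def __init__(self, servers):
        self.servers = servers

    def primary(self):
        # The first server that is still alive acts as the primary.
        for server in self.servers:
            if server.alive:
                return server
        raise RuntimeError("all servers have failed")

    def handle(self, key):
        primary = self.primary()
        # Non-deterministic processing: only the primary executes it.
        value = random.random()
        primary.state[key] = value
        # The backups receive the resulting state, not the request,
        # so they never have to repeat the non-deterministic step.
        for server in self.servers:
            if server.alive and server is not primary:
                server.state = dict(primary.state)
        return value


group = PrimaryBackupGroup([Server("a"), Server("b"), Server("c")])
first = group.handle("x")
group.servers[0].alive = False   # the primary fails
second = group.handle("y")       # server "b" takes over
```

Because the backups copy results instead of re-executing requests, the random value stays consistent across servers. What the sketch leaves out is the time a real system needs to notice the failure and promote a backup, which is where the delayed response comes from.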

Webapp stacks comparison update 1

We have received measurements from one of the remaining level 3 applications. It’s Immo’s implementation based on Ruby + Ramaze + MySQL. Here are the updated results:

performance1

As soon as we confirm the measurements and get the numbers for the ASP implementation we will publish the updated point chart.

Webapp stacks comparison

Ever wondered what the best technology stack for building web applications is? I’m sure you have.

Last week my company went on a workshop, during which we tried to answer that question. I wrote a requirements specification for a simple web application. Everyone had to choose the technology stack they were most familiar with and implement the application. Everyone received a benchmark script for the application, and at the end of the day we compared the performance results. The application was just complex enough to require at least a day to implement. In the end we multiplied each implementation’s performance by a multiplier for the level of functionality it implemented.

The Application

The webapp I chose is the same one used for the cloudspeed project. It is modelled on social networking applications. Once registered and logged in, users can add other users as friends and submit posts. The homepage displays the 20 most recent posts from all of the user’s friends and from the user himself.

There are 2 entities each with 2 attributes:

  • users(email, password)
  • post(date, content)

and there are 2 relationships

  • users are friends with other users (n to n)
  • each post is written by a user (1 to n)

There are 2 web pages: login and home.

From the login page, users can register themselves or log in.

The home page displays the number of friends, the 20 most recent posts (for each post, the date, author and content) and allows users to add posts, friends or logout.

Information has to be persisted somehow.
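To make the specification concrete, here is one possible relational sketch using Python’s built-in sqlite3 module. It is not any contender’s implementation. The `friendships` table, the `id` columns, the name `posted_at` for the post date and the helper functions are all assumptions of this example, as is the choice to store a friendship as one row per direction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    email    TEXT UNIQUE NOT NULL,
    password TEXT NOT NULL          -- a real application would store a hash
);
CREATE TABLE friendships (          -- n-to-n: one row per direction
    user_id   INTEGER NOT NULL REFERENCES users(id),
    friend_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (user_id, friend_id)
);
CREATE TABLE posts (                -- 1-to-n: each post has one author
    id        INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES users(id),
    posted_at TEXT NOT NULL,
    content   TEXT NOT NULL
);
""")


def friend_count(conn, user_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM friendships WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]


def home_page_posts(conn, user_id, limit=20):
    """Most recent posts written by the user or by any of the user's friends."""
    return conn.execute(
        """
        SELECT p.posted_at, u.email, p.content
        FROM posts p
        JOIN users u ON u.id = p.author_id
        WHERE p.author_id = ?
           OR p.author_id IN (SELECT friend_id FROM friendships WHERE user_id = ?)
        ORDER BY p.posted_at DESC, p.id DESC
        LIMIT ?
        """,
        (user_id, user_id, limit),
    ).fetchall()


# Sample data: alice and bob are friends, carol is a stranger.
conn.executemany(
    "INSERT INTO users (id, email, password) VALUES (?, ?, ?)",
    [(1, "alice@example.com", "secret"),
     (2, "bob@example.com", "secret"),
     (3, "carol@example.com", "secret")],
)
conn.executemany(
    "INSERT INTO friendships (user_id, friend_id) VALUES (?, ?)",
    [(1, 2), (2, 1)],
)
conn.executemany(
    "INSERT INTO posts (author_id, posted_at, content) VALUES (?, ?, ?)",
    [((i % 3) + 1, f"2009-01-01T00:{i:02d}:00", f"post {i}") for i in range(45)],
)

alice_home = home_page_posts(conn, 1)
```

The home page query is the expensive part: it has to combine the friendship relation with the posts and sort by date on every page view.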

The Benchmark

Each developer got a copy of JMeter and 3 scripts. The 3 scripts exercised 3 different levels of functionality:

  1. Register Users – registered a number of users (100)
  2. Make Friends – created around 10 friends per user
  3. Post Stuff – posted messages

For the first level of functionality, the developer had to implement the functionality behind the Register button. Registered users had to be persisted. This level had a multiplier of 1.

For the second level of functionality, developers had to implement logging in, keeping session information, and the friendship relationship. The homepage had to show the number of friends. This level had a multiplier of 3.

For the third level of functionality, developers had to implement everything: posting messages and showing the messages from all friends. Since this functionality included a complex query that ate a lot of processing power, the multiplier was 50.
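The scoring rule can be written down as a small Python sketch. The multipliers are the ones listed above; the function name, the assumption that performance is measured in requests per second, and the sample throughput figures are hypothetical and are not measurements from the workshop.

```python
# Multipliers per level of functionality, as described in the text.
LEVEL_MULTIPLIERS = {1: 1, 2: 3, 3: 50}


def score(requests_per_second, level):
    """Points = measured performance x multiplier of the level reached.
    Assumes performance is expressed in requests per second."""
    return requests_per_second * LEVEL_MULTIPLIERS[level]


# Hypothetical throughput figures, only to show the effect of the multipliers.
fast_level_1 = score(100.0, 1)
medium_level_2 = score(40.0, 2)
slow_level_3 = score(10.0, 3)
```

With these made-up numbers the slowest implementation scores highest, because the level 3 multiplier outweighs the difference in raw throughput.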

The Contenders

Ben: Erlang + CouchDB

Team 2: Java + Spring + Hibernate + Postgres

Team 3: Java + Spring + Hibernate

Team 4: Ruby On Rails + MySQL

Team 5: ASP with Visual Basic + SQLServer

Immo: Ruby + Ramaze + MySQL

The Results

performance

Only three of the contenders implemented level 3 of functionality, but we were able to measure the performance only for the Erlang implementation. The other 2 that made it to level 3 were ASP + SQLServer and Ramaze. We will soon publish the results of the two missing level 3 implementations.

total-points

The winner was Ben with 1111.1 points. Erlang + CouchDB proved to be quite fast and productive in the right hands.

Conclusions

Although just one day of development is not nearly enough to measure productivity, we noticed that choosing a language that promises a high level of productivity wasn’t as important as choosing the language you’re most familiar with.

Although Level 1 of functionality does not provide a meaningful scenario for measuring performance, we can see that most of the implementations performed about the same. That includes Ruby, which has been measured to be 100 times slower than C++ but in our scenario actually showed the best performance.

We must point out, however, that both Ruby implementations and the ASP implementation showed instability: error rates ranging from 15% to 30% were reported during the benchmarks.

For comparison, I had written some Java implementations before the competition. They all provide level 3 functionality and use Servlets in the web tier, but the DAO tier changes: JDBC with MySQL, EJB3 with MySQL, and purely in-memory.

Compared to the Erlang implementation, which serves 15 req/s, I get similar results for JDBC (15 req/s). A “heavyweight” stack such as EJB3 actually manages 38 req/s, but by far the fastest solution is keeping everything in memory, which brings us to an impressive 2000 req/s. Obviously the in-memory solution cannot be compared to the others because it doesn’t satisfy the persistence requirement, but it can still serve as a baseline to see how much performance we lose in the persistence layer.
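A back-of-the-envelope calculation shows how large that loss is. The throughput figures in this Python sketch are the ones quoted above for the Java implementations; the helper names are made up, and the conversion to time per request assumes requests are served one after another, which is a simplification.

```python
# Throughput in requests per second for the three Java DAO tiers.
THROUGHPUT = {
    "in-memory": 2000,
    "JDBC + MySQL": 15,
    "EJB3 + MySQL": 38,
}


def ms_per_request(requests_per_second):
    """Average time per request, assuming requests are handled serially."""
    return 1000.0 / requests_per_second


def persistence_share(requests_per_second, baseline=THROUGHPUT["in-memory"]):
    """Fraction of the per-request time attributable to persistence,
    taking the in-memory implementation as the zero-persistence baseline."""
    return 1 - requests_per_second / baseline


for name, rps in THROUGHPUT.items():
    print(f"{name}: {ms_per_request(rps):.1f} ms/request, "
          f"{persistence_share(rps):.2%} spent on persistence")
```

Under these assumptions the in-memory version needs about 0.5 ms per request and the JDBC version about 67 ms, so more than 99% of the JDBC request time goes to the persistence layer.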

Future Work

We have had requests from people who did not take part in the competition to submit their implementations. Since we have no way to measure productivity, all submissions now require level 3 of functionality. All implementations will be published on the cloudspeed website. If you wish to contribute by improving an implementation or submitting your own, just send me an email or leave a comment here.

Currently the benchmark is designed to quickly give a very approximate measure. 100 users is not a realistic number even for a small website. Our next goal is to create a benchmark that will register up to millions of users, have an average of 100 friends per user and post several million messages.

The final goal is to port these implementations to the cloud and measure their scalability. We don’t only want to see the performance on a single machine, but also how the performance holds up when we reach the limits of a single server. Ideally we want to try to run these applications on clusters as big as we can get.

Stay tuned for updates…

Tuning the EJB3 Implementation

During my evaluation of highly scalable technologies I ran into some performance problems with the EJB3 implementation. It seems to me that this is quite a common problem.


Measuring the Speed of Clouds

I have started a project for benchmarking highly scalable technologies. My plan is to use cloud computing platforms and implement the same application using different stacks of technologies. The aspect I want to evaluate is performance when the number of nodes grows a lot.

Centralized VS Distributed SCM

Distributed Source Control Management systems have become a trend in recent years: Bazaar, Git, Mercurial, SVK. Distributed SCM is a fascinating concept, but how well does it perform in practice? Some say that it leads to a phenomenon called branch proliferation.

Playing with Jazz

Yeah, you read correctly: “Playing with Jazz”. Jazz is the new Application Lifecycle Management software developed by IBM Rational. In other words it is an all-in-one Source Control Management, Continuous Integration and Issue Tracking server. Currently I’m developing a lot of plugins for my client that are meant to integrate Jira, Subversion, Maven, ClearCase, our custom bug tracking software, our custom made overnight testing and reporting, Eclipse IDE etc. Jazz is supposed to provide these features and much more out-of-the-box. I’ve seen the demos and I was really impressed. So, now that it’s open for download, I registered and downloaded the server and the client.

Well, things are not that impressive when you start working with it. The product isn’t that ergonomic. It presents a relatively steep learning curve for the basic functionality.

Ideally an administrator would:

  1. install the server (it should be added automatically as a windows/*nix service)
  2. start the Jazz service
  3. connect to the web-admin interface
  4. create some users

The developer:

  1. installs the Eclipse Jazz plugin
  2. connects with his username/password
  3. uploads a project or two in the repository

The server:

  1. autodetects how to build and test the projects (eclipse/ant/maven)
  2. starts doing this immediately with a predefined schedule (continuously)
  3. if there are failures it notifies the responsible developer
  4. build and test status are visible in the status bar of the IDE
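The auto-detection step in that wish list does not have to be complicated. The Python sketch below is purely illustrative and has nothing to do with how Jazz works internally: `detect_build_tool` is a hypothetical helper that looks for the conventional marker files (`pom.xml` for Maven, `build.xml` for Ant, `.project` for Eclipse), and the priority order among them is my own assumption.

```python
import tempfile
from pathlib import Path

# Marker files checked in priority order (the order is an assumption).
BUILD_MARKERS = [
    ("maven", "pom.xml"),
    ("ant", "build.xml"),
    ("eclipse", ".project"),
]


def detect_build_tool(project_dir):
    """Return the first build tool whose marker file exists, else None."""
    for tool, marker in BUILD_MARKERS:
        if (Path(project_dir) / marker).is_file():
            return tool
    return None


# Try it on three throwaway project directories.
with tempfile.TemporaryDirectory() as tmp:
    maven_project = Path(tmp) / "maven-project"
    maven_project.mkdir()
    (maven_project / "pom.xml").write_text("<project/>")
    (maven_project / ".project").write_text("<projectDescription/>")

    eclipse_project = Path(tmp) / "eclipse-project"
    eclipse_project.mkdir()
    (eclipse_project / ".project").write_text("<projectDescription/>")

    empty_project = Path(tmp) / "empty-project"
    empty_project.mkdir()

    detected = [
        detect_build_tool(project)
        for project in (maven_project, eclipse_project, empty_project)
    ]
```

A server could fall back to asking the user only when detection returns `None`, instead of requiring a build configuration up front.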

What happens in reality is not that simple. You need to create a project area, define the process, create a team area (a team area is just a team; the word “Area” was added so that you don’t confuse the representation of a team in Jazz with the people sitting around you), create a development line, create a workspace, create a stream from your workspace to the development line, create an iteration plan, an iteration, a build configuration, a build engine etc.

jazz.png

What I would like to see is a product where the novice user, and initially even the administrator, doesn’t need to know about project areas, iteration plans, development lines or streams. All these are advanced features. The first impression should be of a shared versioned storage where you place your projects and which tells you whether the projects are building and whether the tests are passing. Users should be able to add tasks without having to define an iteration or an iteration plan. And most of all, I don’t like the Team Artifacts view. It’s a heterogeneous collection of entities in the Jazz server unimaginatively organized into a tree. Integration with Eclipse should be tighter. Eclipse already knows how to build and test my projects. Why do I need to specify those things in two different places?

Don’t get me wrong. I don’t want to say that Jazz is overengineered. Most of the entities involved are necessary for such a product. Even multiple project areas and streams. But the point here is that they are presented in the wrong way. Advanced features should be hidden from the user and the product should be usable out-of-the-box. Ok, it took me less than 2 hours to install the server, read the instructions and configure a project, and another hour to troubleshoot the automated build. But there is really no reason why it should take more than 5 minutes.

I really hope that the Jazz developers will take this as constructive criticism and deal with these issues before the release of Jazz. Otherwise they risk losing a multitude of potential customers who need only simple configurations with the most basic process. And it’s from projects like these that you get the most publicity, community involvement and knowledge sharing.