Entering private beta testing mode

Starting today, Observu is no longer limited to two customers. We’ve sent the beta invitations to the mailing list and hope you are on it. If you are not and would like an invitation, just send us an e-mail.

We are very eager to learn what you think and which direction we should take. We’ve got tons of ideas, but need your guidance to build the tool that will help you most. We will give away a free T-shirt and a significant discount to anyone who sends in valuable feedback.

Posted in Progress Report

MySQL queries that kill your responsive website

There are a lot of queries that are fine when your site is small, but take ages as soon as you start to collect some data. Therefore it’s very important to monitor query performance. We usually track at least the following things:

  • total time spent on SQL queries
  • total time spent on rendering a page
  • queries that took more than a certain threshold (query and time)

We log these so we can quickly discover bottlenecks. Using the Observu server agent, we also store them in Observu for a quick overview and the ability to receive notifications when a slow query occurs.
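As a rough illustration of the three measurements listed above, here is a minimal Python sketch. The `QueryProfiler` class, its method names and the 0.5 second threshold are all hypothetical, chosen for the example; this is not the Observu agent or any framework’s profiler.

```python
import time
from contextlib import contextmanager


class QueryProfiler:
    """Collects SQL timings for one page request (hypothetical helper)."""

    def __init__(self, slow_threshold_s=0.5):
        # Threshold is an assumption; pick whatever "slow" means for your site.
        self.slow_threshold_s = slow_threshold_s
        self.total_sql_time_s = 0.0
        self.query_count = 0
        self.slow_queries = []  # list of (sql, seconds)

    @contextmanager
    def track(self, sql):
        """Time the block that executes `sql` and record it."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.total_sql_time_s += elapsed
            self.query_count += 1
            if elapsed >= self.slow_threshold_s:
                self.slow_queries.append((sql, elapsed))

    def summary(self, page_render_time_s):
        """Return the figures worth logging at the end of a request."""
        return {
            "total_sql_time_s": self.total_sql_time_s,
            "page_render_time_s": page_render_time_s,
            "query_count": self.query_count,
            "slow_queries": list(self.slow_queries),
        }
```

In use, you would wrap each database call in `with profiler.track(sql): ...` and write `profiler.summary(...)` to your log when the page has been rendered.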

Many frameworks, such as Zend Framework, have built-in SQL profilers that can already do these things; you just need to check the documentation.

Once you have found the culprits, it’s recommended to run them manually, prefixed with EXPLAIN. Often you will find that you forgot to add an index, or that your index does not match the way the query uses it.

There are, however, some query patterns you can already watch out for when writing and reviewing your code. We’ve encountered these again and again as our databases grew larger:

SELECT ..... ORDER BY created_date DESC LIMIT 0,7 to get the most recent items
This can become slow as the table grows larger, even if there is an index on created_date. The way to counter this is to actually make use of that index by adding a condition that limits the amount of data involved, like: created_date >= '{date_7_days_ago}'
(It’s recommended to generate this date in code and round it down to midnight (00:00), so the result can be cached.)
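The date rounding can be sketched in Python as follows. The function names and the `items` table are hypothetical, and the `%s` placeholders assume a MySQL driver that uses that parameter style (for example PyMySQL or mysqlclient).

```python
from datetime import datetime, timedelta


def recent_items_cutoff(now, days=7):
    """Return midnight (00:00) of the day that lies `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return cutoff.replace(hour=0, minute=0, second=0, microsecond=0)


def recent_items_query(now, limit=7):
    """Build the query and its parameters; `items` is a hypothetical table."""
    sql = (
        "SELECT * FROM items "
        "WHERE created_date >= %s "
        "ORDER BY created_date DESC LIMIT %s"
    )
    cutoff = recent_items_cutoff(now)
    return sql, (cutoff.strftime("%Y-%m-%d %H:%M:%S"), limit)
```

Because the cutoff only changes once a day, every request on the same day produces an identical query, which is what makes caching possible. Note that if fewer than 7 items were created in the last 7 days, this returns fewer than 7 rows.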

SELECT .......... LIMIT 500000,10 created by paging code on a large table
This one is harder to prevent, but there are some approaches:

  • Do not sort the data, but have it returned in its natural order.
  • Do not use LIMIT with a large offset, but use actual conditions on the dimension by which you order the results (e.g. a range of IDs or dates).
  • Just disallow browsing this deep into the data. Will users really need this, or is the ability just an oversight that only gets triggered by search engines?
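The second approach, paging by a condition on the ordering column instead of an offset, might look like the sketch below. It uses Python’s built-in sqlite3 module only so that it runs without a server; the `items` table, its columns and the function names are hypothetical. The same SQL shape applies to MySQL, where the placeholder is typically `%s` rather than `?`.

```python
import sqlite3


def fetch_page_after(conn, last_seen_id, page_size=10):
    """Return the next page of rows with id > last_seen_id, plus the new cursor."""
    rows = conn.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()
    # The id of the last row becomes the starting point for the next page.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor


def make_demo_db(n=25):
    """Create an in-memory table with n rows, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO items (id, title) VALUES (?, ?)",
        [(i, f"item {i}") for i in range(1, n + 1)],
    )
    return conn
```

The page links then carry the last seen ID instead of a page number, so the database can jump straight to the right place in the index no matter how deep the user browses.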

SELECT ..... ORDER BY rand() LIMIT 10 to select random items
This is a very common way to select random items, but it becomes unusably slow as soon as you have more than a few thousand items. MySQL first has to generate a random number for every row in the table and sort on it before it can select the 10 to display.

The way around this is to first determine the range of IDs to select from. ( SELECT MIN(id), MAX(id) FROM mytable )
Then generate a random ID between MIN(id) and MAX(id)-1, and an upper bound, usually something like random_id+1000.
Finally, find a random item by querying SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY id ASC LIMIT 1.

This efficient way to retrieve a random item from a MySQL table can also be applied to multiple items. For a truly random set, just repeat the procedure. However, in most cases you don’t need a truly random set and you can just use something like:
SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY rand() LIMIT 10
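The single-item procedure described above can be sketched in Python. The sketch uses the built-in sqlite3 module so it runs on its own; the function name `pick_random_row` and the `window` parameter are hypothetical. With a MySQL driver the placeholders would typically be `%s`.

```python
import random
import sqlite3


def pick_random_row(conn, window=1000, rng=random):
    """Pick one row from mytable using an ID range instead of ORDER BY rand()."""
    min_id, max_id = conn.execute(
        "SELECT MIN(id), MAX(id) FROM mytable"
    ).fetchone()
    if min_id is None:
        return None  # empty table

    # Random starting point between MIN(id) and MAX(id)-1, as in the text.
    random_id = rng.randint(min_id, max(min_id, max_id - 1))
    upper_bound = random_id + window

    # First existing row at or after the starting point. This returns None
    # if there is a gap in the IDs that is larger than `window`.
    return conn.execute(
        "SELECT * FROM mytable WHERE id >= ? AND id < ? ORDER BY id ASC LIMIT 1",
        (random_id, upper_bound),
    ).fetchone()
```

One property to be aware of: rows that directly follow a gap in the IDs are picked more often than others, so this is only approximately uniform when IDs have been deleted.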

Posted in Uncategorized

Development Update – 8

It has been silent for a while, but we are definitely still going. Today we deployed all updates to our production systems, enabling features that were critical to support our launching customer:

  • Grant permissions to view your monitors to other accounts
  • A proper data explorer to browse all metrics that are collected
  • Auto-archiving for monitors (very useful in combination with EC2 auto-scaling groups)
  • Tracking and limiting of account usage

We are now going through some final tests and bugfixes, but we will definitely open up in the first month of 2013!

Observu Teaser screenshot

Posted in Progress Report

Development Update – 7

It has been a while since our last update. In this time, we’ve been working closely with our first customers to determine and implement various essential features. We’ve also applied our experience and research on hosting in the cloud to their projects. This led to a major milestone last week: observu.com now hosts the latest beta and a more descriptive website. It is still very private, but if you get on our mailing list, we can let you in soon.

In terms of development, we’ve made a lot of progress on properly organizing the API and mobile website code so that they share 100% of the codebase with the main website (proper MVC, with the only difference being the View). We are big fans of Redis, which we’ve used extensively for various queues and rate-limiting solutions that would be challenging to get right otherwise.

What we are working on now is usability and extended reporting. Of course we also have a heap of features in mind, but we would love to have your feedback first, to know what really matters.

Posted in Progress Report

Development Update – 6

As Observu is all about improving uptime and removing bottlenecks, we strongly believe that we can’t make do with an ad-hoc infrastructure ourselves. Especially since the moment you need Observu most is often during an emergency, we care a great deal about the ability to recover from outages quickly.

We’ve selected Amazon for hosting because it is flexible, is available in multiple parts of the world and has excellent network quality. However, earlier this year it was shown multiple times that no datacenter has 100% uptime and that when failure occurs, it is big. Therefore we are currently working on our architecture to be able to quickly overcome such events. The actual details warrant a separate post.

Our obsession with reliability also touched another area of development: our initial implementation of SMS notifications proved unreliable. Therefore we changed to Nexmo as our partner. It provides us with actual delivery confirmations, allowing us to monitor delivery.

Our Growing Development Stack

I personally always like to know what people are using to create their product, so here is a listing of almost everything we use:

  • Ubuntu on Amazon EC2
  • MySQL
  • Redis
  • PHP
  • Perl
  • jQuery
  • RaphaelJS
  • boto
  • Fabric
  • chef-solo

In the area of 3rd party services we rely on:

  • Amazon AWS
  • Nexmo
  • Tropo
  • Sendgrid
  • Github
  • Uservoice

These services allow us to focus on the things that really matter: gaining insight into all parts of your deployment and staying on top of the events that will occur. To improve that insight, we are currently working with the first customers to implement monitoring as part of their stack. A great example is the need to monitor logfiles centrally as soon as there are multiple servers handling your front-end.

Posted in Progress Report

Development Update – 5

An important milestone has been met: we’ve implemented all core features that make up our monitoring system. The last important hurdle was the completion of a flexible, yet easy to understand, system to create and configure event rules.

Of course it is far from the system we imagine. Nevertheless, we feel confident that what we have now is a very useful product.

In addition to this technical progress, I can also present to you the new observu.com logo:

Observu.com Logo

Current efforts are focused on documentation and workflow, to make sure the first users get the experience they deserve. Furthermore, the API is being finalized and the production environment is being architected.

We are very excited about the coming weeks when the first testers will finally enter the system.

Posted in Progress Report

Development Update – 4

We are getting closer each day and are almost ready for the first group of testers. Our attention is shifting more towards the user interface, to create the best experience possible.

An important part of that user interface is the graphs, which are starting to look better and better. An example is this stacked CPU usage graph:

CPU usage graph

This also shows another major point of progress: we now have a fairly easy-to-install daemon script to collect this data on Linux servers. Currently it collects load, CPU, memory, disk and network statistics. We are planning on collecting a whole lot more soon.

The final part of this month’s effort is a mobile website. It provides a simple and clear way to check your site and server status on-the-go. For example, when you receive a text notification, you can instantly check what is actually going on.

We feel confident that we can allow the first testers access next month and are really looking forward to their feedback. To make sure you can be one of those testers, sign up for our mailing list!

Posted in Progress Report

Development Update – 3

Observu Dashboard

It has been a while since I last updated you about our progress. We are still continuing our work on various reports. An important part of this is creating informative graphs. Although we liked Open Flash Chart, it felt a bit sluggish, so we decided to go for a solution based on the Raphael library: an SVG abstraction with a fallback for IE. It comes with a limited graphing library, g.raphael, but that was not mature enough for our needs. Another Raphael-based charting library is Grafico, which is able to display a few great graphs. However, we chose to create our own, mostly because Grafico depends on Prototype, which we do not use, and because we would need to extend it for our own graph types. Although the library is well coded, we did not feel confident about customizing it to our needs. We will open-source our own library as soon as it is in a usable state.

At the same time, we’ve started to work on a very basic mobile website, which allows you to check your status on-the-go. We hope to slowly add more functionality and at the same time keep things really quick and simple.

Another major part of development involves the collection agent. We’ve chosen to use Perl, to maximize portability and reduce dependencies. Additional advantages include easy customization and the ability to verify that it does not contain harmful code. The next challenge in this area is to create an install script that works across distributions.

On the front-end we’ve introduced a new splash page for observu.com. It contains the first iteration of our new logo, which still needs a bit of work. We also got a very nice mascot designed, but we will keep that a secret for now.

If you visit the new splash page, you will also notice that we’ve selected UserVoice for feedback and support. I’ll write a separate blog post about the selection process soon.

We are also talking to potential users about their monitoring needs. If you feel you could contribute by telling us about your problems, please feel free to contact me.

Posted in Progress Report

Amazon RDS vs DIY MySQL on EC2 Benchmark

As I was researching online whether Amazon RDS was a viable option, I had a hard time finding reliable benchmarks. The authors of this good book on EC2 mention that it is a bit faster, but without further clarification. The best benchmark I could find was this one. It uses the sysbench tool to test an EC2 instance vs. RDS, which is exactly what I needed. It provides the tools for benchmarking and points out the difference between running 1 and 10 threads. However, for me this benchmark was missing some vital information, so I decided to run my own benchmark using sysbench in a very similar way, with the following adjustments:

  • I’ve used a much bigger dataset: 50 million objects, in order to create a 12GB database that will surely not fit in the 1.7GB of memory.
  • I’ve made explicit some parameters that were unspecified in the original, such as instance disk vs. EBS and the MySQL configuration.

I’ve used the following setups:

  • A small EC2 instance in the US East region, with Debian squeeze and a standard MySQL install. The database is set up on a separate EBS volume. (named Mysql on EBS (standard) )
  • The same instance with MySQL tuned to more reasonable values: key_buffer=512M, query_cache=128MB (named Mysql on EBS (optimized) )
  • A small RDS instance, set up in the same region (named RDS)

Single Client Thread

First, I repeated the single thread experiment. In this case the instance is not fully utilized. The results are shown below:

System | Transactions/sec | Read/Write ops/sec | Other ops/sec | Min (ms) | Avg (ms) | Max (ms) | 95th perc. (ms)
Mysql on EBS (standard) | 18 | 334 | 35 | 4.4 | 56.9 | 1186.5 | 149.1
Mysql on EBS (optimized) | 52 | 991 | 104 | 0.0 | 19.2 | 728.6 | 84.4
RDS | 23.2 | 440.6 | 46.4 | 11.1 | 43.1 | 691.4 | 90.0

In this experiment the difference between a standard MySQL install and the optimized one is huge. RDS comes in close to the standard MySQL install (23.2 vs. 18 transactions/sec), which seems reasonable.

50 Threads

Now, in real development we don’t care about the difference between fast and faster. If your website is growing, what matters much more is that performance does not deteriorate when things get tougher. Therefore I tried to stretch the database much further by using 50 client threads. This is much closer to the real world, with multiple Apache processes constantly hitting the database, especially when you have multiple front-end servers connecting to a single database instance. Again, the results are shown below:

System | Transactions/sec | Read/Write ops/sec | Other ops/sec | Min (ms) | Avg (ms) | Max (ms) | 95th perc. (ms)
Mysql on EBS (standard) | 38 | 724 | 76 | 30.2 | 1310.7 | 4662.8 | 2179.0
Mysql on EBS (optimized) | 46 | 871 | 92 | 27.55 | 1089.4 | 3031.43 | 1853.76
RDS | 111 | 2110 | 222 | 13.47 | 450.0 | 1557.4 | 807.3

First, the difference between a standard install and the optimized version has been greatly reduced. The most notable result is that RDS performs so much better: 111 transactions/sec against 46 for the optimized install. This confirms the results of the original benchmark, but now under conditions that matter to me. Maybe even more important than the difference in query throughput is that RDS does a much better job of keeping request times within reasonable bounds: 95% of requests return within 807 ms, compared to 1854 ms for the optimized MySQL on the EC2 instance.
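To put a number on "so much better", the small Python sketch below computes the ratios from the 50-thread figures reported in this post. The dictionary layout and the `compare` function are just for illustration.

```python
# Figures copied from the 50-thread results in this post.
RESULTS_50_THREADS = {
    "Mysql on EBS (standard)": {"transactions_per_sec": 38, "p95_ms": 2179.0},
    "Mysql on EBS (optimized)": {"transactions_per_sec": 46, "p95_ms": 1853.76},
    "RDS": {"transactions_per_sec": 111, "p95_ms": 807.3},
}


def compare(system, baseline, results=RESULTS_50_THREADS):
    """Return how `system` compares to `baseline` (higher is better for both)."""
    a, b = results[system], results[baseline]
    return {
        # How many times more transactions per second `system` handles.
        "throughput_ratio": a["transactions_per_sec"] / b["transactions_per_sec"],
        # How many times lower the 95th percentile request time is.
        "p95_ratio": b["p95_ms"] / a["p95_ms"],
    }


if __name__ == "__main__":
    for baseline in ("Mysql on EBS (standard)", "Mysql on EBS (optimized)"):
        r = compare("RDS", baseline)
        print(f"RDS vs {baseline}: "
              f"{r['throughput_ratio']:.1f}x throughput, "
              f"{r['p95_ratio']:.1f}x lower 95th percentile")
```

Against the optimized install, RDS comes out at roughly 2.4 times the throughput and a 95th percentile that is roughly 2.3 times lower.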

My conclusion is that although RDS may not perform as well as a setup you tune yourself under ideal conditions, as soon as you are going for realistic loads, RDS can be pushed much further. Of course this should also be possible with DIY optimizations; RDS is, after all, running MySQL. But I’m sure that would take a significant amount of time, and it would not outweigh the other benefits of RDS: easier backups and much less management.

Update, November 3rd, 2011: Further benchmarking has shown me that it is actually quite easy to bring the throughput of your own instance running MySQL much closer to RDS by increasing innodb_buffer_pool_size. My lack of experience with InnoDB clearly biased the benchmark above. I do, however, still notice the difference in response times: RDS is much more stable.

Notes:
1: I’ve also benchmarked thread counts in between, but there was no interesting pattern. Results on 4 threads and up are largely similar to the 50-thread one for RDS, while for MySQL the times gradually get worse as the number of threads grows.
2: I’ve also done an experiment running MySQL on the instance disk instead of EBS, but it wasn’t better and it removes all benefits of using EBS, therefore results are not included.
3: For more reliable results this should probably be repeated at different points in time with multiple instances.

Posted in Benchmarks

Where does it hurt?

We want to create the best monitoring service around. Therefore we would like to know: where does it hurt?

We’ve got a few pains of our own that we are currently focusing on:

  • Too much information: with a lot of servers, there is always something going on. At some point we were receiving so many notifications that we missed the critical ones. To counter this, we’ve created a summary dashboard and smart notifications.
  • Servers in serious trouble can’t send e-mails. We solve this by providing a central monitoring dashboard that receives updates directly from the individual servers, with the added benefit of being able to alert you when no information is coming in.

We would love to hear what you feel is important for Observu to provide. You can post your suggestions to our feedback forum.

Posted in Uncategorized