Observu from idea ’til launch

At the end of 2010 we decided that our development efforts were too fragmented and we needed to focus. We had dozens of websites, each of which either needed a lot of work or was not really future-proof. We decided to select three of them, one of which was Observu. The two most important reasons: first of all, we really needed it ourselves at that time; secondly, we wanted to appeal to other developers, as that is what we do best. Other projects such as FlexLists mostly appeal to developers and people in education, and even our consumer-oriented website picturepush.com appeals more to the techie, professional crowd than to any other.

We set out with the following key ideas:

  • We want to collect all kinds of data, especially combining availability, server and application data
  • We really wanted notifications by phone
  • It should fit well with the cloud, so it should not rely on manually configuring each server
  • Receive data at a fine-grained (every minute) resolution

After a few months of full-time development (April 2011) we already had a product that helped us a great deal by monitoring our own websites (more than 20 at that time).

We then started setting up the basics for the infrastructure: load balancing, automated deployment, efficiently storing the time series data, etc. As big sufferers of the not-invented-here syndrome we did almost everything ourselves, including designing the website and the logo.

In September 2011 we ran out of funds to continue development. We decided we really believed in the product and sold most of our other websites. Beyond that, we were lucky to find a client where we could apply a lot of the knowledge we gained while building Observu, as well as apply Observu itself in practice. We advised them on performance improvements, automated deployment, auto-scaling, a redundant database setup and proper load testing.

This was a nice opportunity, but it did slow our development down at first. We did, however, learn a lot about features we really needed and had never considered: e.g. auto-scaling your server pool results in a lot of short-lived servers, and thus monitors that just stop receiving data.

By June 2012 we felt development wasn’t progressing as it should: consulting and other projects got in the way again. We decided to invest a bit more of our consulting revenue and hired a developer on oDesk. We were lucky enough to find a young but very bright guy who made a lot of progress, especially on our reporting and data explorer. We continued this until September; unfortunately our funds were limited and the developer had to go back to university, further limiting his availability. Development came down to us again, but our workload was already pretty heavy as we were working on customer projects once more. Finishing those last few features had to be done on weekends, when there were no projects to coordinate.

Of course some ‘last fixes’ had bigger implications than I anticipated, but we’ve finally got to a point where we feel confident that we have a product that is really useful for a lot of admins and developers. It’s unavoidable to leave a lot of features we really want for the future, and we do feel some anxiety about competitors that popped up while we were developing. However, we could not postpone the launch any longer and even skipped payment integration just to get your feedback as soon as possible.

We got quite a few signups from the mailing list we built, but very little actual feedback or requests came our way. In the meantime we had to pay our bills and work on mobile application development. However, that was taking up all of our time, so we were not getting the most out of our trial users at all. It resulted in a big go/no-go moment. So in July 2013 we decided to take the plunge one more time, as well as bring someone in to help us with marketing and business development. This paid off in many ways: we quickly learned a lot more about our users and quickly started to turn trials into paying subscribers.

For the long term we believe we can leverage our open architecture to really monitor anything, utilize machine learning techniques to automatically discover trends and outliers, and take big steps in prioritising information and excluding false positives. We want to apply this not just to infrastructure and availability but to everything measurable in operating an online business.

Some more detailed aspects we feel we need to focus on as soon as possible:

  • Follow the trend toward more real-time data: resolution of every few seconds
  • Full page load measurements and error checking (already in testing)
  • Support for monitoring high-volume log files (e.g. access logs)
  • Log file search and filtering
  • Create low-overhead (async) ways of sending data to Observu
  • Create proper support for rich exception logging that is easy to browse and includes metadata, as well as libraries for all popular platforms
  • Import for CloudWatch metrics
  • Aggregated reporting (e.g. combine error logs from all servers in a cluster into a single view)
  • An app with push notifications

Next time I’ll write more about what we did the last few months to turn our beta into a serious subscription business.


Monitoring A Website In The Cloud

Observu has been designed from the ground up to deal with the monitoring reality of running your website or application in the cloud.

By allowing servers to share the exact same configuration on the server side and not requiring them to be added on the dashboard, deployment is greatly simplified and can easily be automated without losing monitoring capabilities.

By auto-archiving monitors that no longer provide data, Observu can deal with short-lived virtual instances without cluttering your dashboards.

Read more in our cloud monitoring case.


Monitoring Data From Online Sources and APIs

Observu allows you to check the availability of web pages and APIs and test them for the presence of certain text. However, web pages and APIs can provide a wealth of information that is also interesting to track. Maybe your forum lists the current number of users, or your API replies with the number of requests that you have left.

Observu allows you to use regular expressions to capture this information and assign it to a property to be tracked every minute of every day.

API and data monitoring options

Extracting numeric data from a web page

Let’s start with a simple example of extracting a row from a table.

<table>
   <tr>
     <td>EUR - USD</td><td>1.31567</td>
   </tr>
</table>

If we now set /EUR\s-\sUSD<\/td>\s*<td>([0-9\.]+)/si as the expression in our advanced capturing settings and then assign it to currency.EUR_to_USD:float, Observu can keep track of the rates published on this page.

Extracting data from an API

Maybe the same page also publishes this data as XML:

<currency>
  <from>EUR</from>
  <to>USD</to>
  <value>1.31567</value>
</currency>

If we now set /value>([0-9\.]+)/si as the expression in our advanced capturing settings and then assign it to currency.EUR_to_USD:float, Observu can again keep track of the rates published through this API.

Read more about monitoring your API

The :float at the end of the property name is a type hint to make sure Observu knows how to render and report on the extracted data. Our documentation lists all available types.


Respond faster to Internal Server Errors

When a web page shows you an “Internal Server Error”, the webserver also returns a 500 status code. It means there is something wrong on the website itself: the user requested a proper URL, but something on the server makes it unable to fulfil that request. The user has no way to resolve this except to wait for it to disappear. These errors are the responsibility of the website owner to handle and prevent.

Internal Server Error

One of our most basic features is to help you stay on top of errors like this on your pages. Read more on how to get notified about Internal Server Errors
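If you want a feel for what such a check does, here is a minimal sketch in PHP using cURL (the URL is just an example) that flags a 500 response the same way an external monitor would:

<?php
// Hypothetical example: fetch a page and check whether the server answered with a 500.
$url = 'https://www.example.com/';

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_exec($curl);

$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);

if ($status >= 500) {
    // In a real setup a notification would go out here; Observu does this for you.
    echo "ALERT: $url returned HTTP $status\n";
}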


Improved API Monitoring

Last week we significantly improved our ability to monitor APIs that are available over HTTP(S). You can now add custom headers, cookies, urlencoded form data and a raw POST body to your availability monitors.

Furthermore, we allow you to do an additional request, for example to log in to the website before executing the actual request. You can capture data from this initial request (e.g. an authentication token) and re-use it in the actual request you want to monitor.

HTTP API Monitoring Options

Finally, you can capture data from the response using a regular expression and use the captured data as a metric in Observu.
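To illustrate the idea behind this two-step flow, here is a conceptual sketch in PHP with cURL. This is not how you configure it in the Observu dashboard; the URLs, field names and capture patterns are made up:

<?php
// Step 1: hypothetical login request that returns an authentication token.
$curl = curl_init('https://api.example.com/login');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query(array(
    'user' => 'monitor',
    'pass' => 'secret',
)));
$loginBody = curl_exec($curl);
curl_close($curl);

// Capture the token from the response, just like a capture expression would.
preg_match('/"token":"([a-z0-9]+)"/i', $loginBody, $match);
$token = $match[1];

// Step 2: the actual request you want to monitor, re-using the captured token.
$curl = curl_init('https://api.example.com/v1/status');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Authorization: Bearer ' . $token));
$body = curl_exec($curl);
curl_close($curl);

// Finally, capture a metric from the response, e.g. the number of requests left.
preg_match('/requests_left":\s*([0-9]+)/', $body, $match);
echo 'requests_left = ' . $match[1] . "\n";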


Entering private beta testing mode

Starting today, Observu is no longer limited to just two customers. We’ve sent out the beta invitations to the mailing list and hope you are on there as well. If you would like an invitation too, just send us an e-mail.

We are very eager to learn what you think and what direction we should go in. We’ve got tons of ideas, but need your guidance to build the tool that will help you most. We will give away a free T-shirt and a significant discount to anyone who sends in valuable feedback.


MySQL queries that kill your responsive website

There are a lot of queries that are fine when your site is small, but take ages as soon as you start to collect some data. Therefore it’s very important to monitor query performance. We usually track at least the following things:

  • total time spent on SQL queries
  • total time spent on rendering a page
  • queries that took longer than a certain threshold (logging both the query and its time)

We log these so we can quickly discover bottlenecks. (Using the Observu server agent, we also store them in Observu for a quick overview and the ability to receive notifications when it happens.)

Many frameworks, such as Zend Framework, have built-in SQL profilers which can already do these things; you just need to check out the documentation.
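If your framework does not offer a profiler, rolling a minimal timing wrapper yourself is straightforward. A sketch using PDO (the table name and the 100 ms threshold are just examples):

<?php
// Minimal sketch: time individual queries and the page as a whole.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'pass');

$totalSqlTime  = 0.0;
$slowThreshold = 0.1; // 100 ms, pick whatever fits your application
$pageStart     = microtime(true);

function timedQuery(PDO $pdo, $sql, array $params = array()) {
    global $totalSqlTime, $slowThreshold;

    $start = microtime(true);
    $stmt  = $pdo->prepare($sql);
    $stmt->execute($params);
    $elapsed = microtime(true) - $start;

    $totalSqlTime += $elapsed;
    if ($elapsed > $slowThreshold) {
        // Log the offending query and its time; this is what you want to be notified about.
        error_log(sprintf('SLOW QUERY (%.3fs): %s', $elapsed, $sql));
    }
    return $stmt;
}

$rows = timedQuery($pdo, 'SELECT * FROM items ORDER BY created_date DESC LIMIT 7')->fetchAll();

// At the end of the request, log the totals.
error_log(sprintf('SQL: %.3fs, page: %.3fs', $totalSqlTime, microtime(true) - $pageStart));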

After you have found the culprits, it’s recommended to run them manually, prefixed with EXPLAIN. Often you will have forgotten to add an index, or your index does not match how the query uses the table.

There are however some query patterns you can already watch out for when writing and reviewing your code. We’ve encountered these again and again as our databases grew larger:

SELECT ..... ORDER BY created_date DESC LIMIT 0,7 to get the most recent items
This becomes slow as the database grows larger even if there is an index on created_date. The way to counter this is to actually make use of that index by adding a condition that limits the amount of data involved, like: created_date >= ‘{date_7_days_ago}’
(It’s recommended to generate this date in code and round it down to a date with a 00:00 time, so the query result can be cached; see the sketch below.)
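A small sketch of what that looks like in PHP with PDO (table and column names are illustrative):

<?php
// Round the lower bound down to 00:00 seven days ago, so the generated query
// text stays identical for a whole day and its result can be cached.
$since = date('Y-m-d 00:00:00', strtotime('-7 days'));

$pdo  = new PDO('mysql:host=localhost;dbname=example', 'user', 'pass');
$rows = $pdo->query(
    "SELECT * FROM items
      WHERE created_date >= '$since'
   ORDER BY created_date DESC
      LIMIT 0, 7"
)->fetchAll();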

SELECT .......... LIMIT 500000,10 created by paging code on a large table
This one is harder to prevent; however, there are some approaches:

  • Do not sort the data, but have it returned in its natural order.
  • Do not use LIMIT with a large offset, but use actual conditions on the dimension you order the results by (e.g. a range of IDs or dates); see the sketch after this list.
  • Just disallow browsing this deep into the data: will users really need it, or is the ability just an oversight that only gets triggered by search engines?
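A sketch of the second approach, again with hypothetical table and column names: instead of making MySQL skip 500,000 rows, remember where the previous page ended and continue from there.

<?php
// Keyset-style paging: condition on the ordering column instead of LIMIT 500000,10.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'pass');

$lastSeenId = 500010; // taken from the last row of the previous page
$stmt = $pdo->prepare(
    'SELECT * FROM items
      WHERE id < :last_seen_id
   ORDER BY id DESC
      LIMIT 10'
);
$stmt->execute(array(':last_seen_id' => $lastSeenId));
$page = $stmt->fetchAll();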

SELECT ..... ORDER BY rand() LIMIT 10 to select random items
This is a very common way to select random items, but it does not work at all as soon as you have more than a few thousand items. What happens is that MySQL first has to generate a random number for each row in the table before it can select the 10 to display.

The way around this is to first determine the range of IDs to select from: SELECT MIN(id), MAX(id) FROM mytable.
Then generate a random id between MIN(id) and MAX(id)-1 and an upper bound, usually something like random_id+1000.
Finally, find a random item by querying SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY id ASC LIMIT 1.

This efficient way to retrieve a random item from a MySQL table can also be applied to multiple items. For a truly random set, just repeat the procedure. However, in most cases you don't need a truly random set and you can just use something like:
SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY rand() LIMIT 10
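Putting the whole procedure together, a sketch in PHP with PDO (the table name mytable is kept from the example above):

<?php
// Efficient random selection: pick a random id window first, then let the
// primary key index do the work instead of ORDER BY rand() over the whole table.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'pass');

// 1. Determine the id range to pick from.
$range = $pdo->query('SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM mytable')->fetch();

// 2. Generate a random lower bound and an upper bound.
$randomId   = mt_rand($range['min_id'], $range['max_id'] - 1);
$upperBound = $randomId + 1000;

// 3. Select items from that small window only.
$stmt = $pdo->prepare(
    'SELECT * FROM mytable
      WHERE id >= :random_id AND id < :upper_bound
   ORDER BY rand()
      LIMIT 10'
);
$stmt->execute(array(':random_id' => $randomId, ':upper_bound' => $upperBound));
$randomItems = $stmt->fetchAll();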


Development update – 8

It has been silent for a while, but we are definitely still going. Today we’ve deployed all updates to our production systems, enabling features that were critical to support our launching customer:

  • Grant permissions to view your monitors to other accounts
  • A proper data explorer to browse all metrics that are collected
  • Auto-archiving for monitors (very useful in combination with EC2 auto-scaling groups)
  • Tracking and limiting of account usage

We are now going through some final tests and bugfixes, but we will definitely open up in the first month of 2013!

Observu Teaser screenshot


Development Update – 7

It has been a while since our last update. In this time, we’ve been working closely with our first customers to determine and implement various essential features. We’ve also applied our experience and research on hosting in the cloud to their projects. This led up to a major milestone last week: observu.com now hosts the latest beta and a more descriptive website. It is still very private, but if you get on our mailing list, we can let you in soon.

In terms of development, we’ve made a lot of progress on properly organizing the API and mobile website code so it shares 100% of the codebase with the main website (proper MVC, with the only difference being the View). We are big fans of Redis, which we’ve used extensively for various queue and rate-limiting solutions that would be challenging to get right otherwise.
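As an aside, the kind of rate limiting we mean is simple to express with Redis. A minimal fixed-window sketch in PHP using the phpredis extension (key names and limits are made up, not our actual implementation):

<?php
// Fixed-window rate limiter: at most $limit calls per key per minute.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function allowRequest(Redis $redis, $apiKey, $limit = 60) {
    $window = floor(time() / 60);               // current minute
    $key    = "ratelimit:$apiKey:$window";

    $count = $redis->incr($key);                // atomic increment
    if ($count === 1) {
        $redis->expire($key, 120);              // let old windows clean themselves up
    }
    return $count <= $limit;
}

if (!allowRequest($redis, 'customer-123')) {
    header('HTTP/1.1 429 Too Many Requests');
    exit;
}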

What we are working on now is usability and extended reporting. Of course we also have a heap of features in mind, but we would love to have your feedback first, to know what really matters.


Development Update – 6

As Observu is all about improving uptime and removing bottlenecks, we strongly believe that we can’t make do with an ad-hoc infrastructure ourselves either. Especially since the exact moment you need Observu is often during an emergency, we feel strongly about the ability to recover from outages quickly.

We’ve selected Amazon for hosting because it is flexible, available in multiple parts of the world, and has excellent network quality. However, earlier this year it has been shown multiple times that no datacenter has 100% uptime and that when failure occurs, it is big. Therefore we are currently working on our architecture to be able to quickly overcome such events. The actual details warrant a separate post.

Our obsession with reliability also touched another area of development: our initial implementation of SMS notifications proved unreliable. Therefore we switched to Nexmo as our partner. It provides us with actual delivery confirmations, allowing us to monitor delivery.

Our Growing Development Stack

I personally always like to know what people are using to create their product, so here is a listing of almost everything we use:

  • Ubuntu on Amazon EC2
  • MySQL
  • Redis
  • PHP
  • Perl
  • jQuery
  • RaphaelJS
  • boto
  • Fabric
  • chef-solo

In the area of 3rd party services we rely on:

  • Amazon AWS
  • Nexmo
  • Tropo
  • Sendgrid
  • Github
  • Uservoice

These services allow us to focus on the things that really matter: gaining insight into all parts of your deployment and staying on top of the events that will occur. To improve that insight, we are currently working with the first customers to implement monitoring as part of their stack. A great example is the need to monitor log files centrally as soon as there are multiple servers handling your front-end.
