Monitoring private websites and keeping credentials secure

Today I came across this question on Stack Overflow. It asks about monitoring tools that can run locally, because the asker cannot share credentials with third parties (like Observu). Running your own script comes with a disadvantage: you now have to take care of the reliability of that script yourself, make sure you receive alerts if anything goes wrong, and store historical data for later root cause analysis.

As this problem is very relevant to our users as well, I will expand on it a bit more. First, if possible, the best approach is to do the monitoring with a dummy user that has no sensitive data in its account. This is often possible if you are running an account-based service, where data is owned by the user that is logged in. If that is not possible, I can think of a few ways to keep your passwords or API credentials to yourself while still using a remote monitoring service:

  • Create a separate page on your website that executes all critical operations but does not output any actual content. For example, have it output a keyword and a status for each operation, indicating whether it succeeded. A remote monitoring service can then call that page, use regular expressions to parse out those keywords, and record the status of each one as an individual data point.
  • Similarly, you could use the language your website is built in to call its own URLs, do the regex matching, and again build a status overview. The downside is that the URLs are called locally, so firewall rules and the like may differ from what a real user experiences.

The other way would be to run just the part that calls your website on your own VPS (or preferably several) that you can secure the way you like, and send the results to the API of a monitoring service like ours (https://observu.com/docs/api). The script that does the actual fetching, querying and regular-expression testing can be in any language you are comfortable with. (Or you can simply take the results of a JMeter run.)
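As a rough illustration of such a script, here is a minimal Python sketch. The check names, the patterns and the idea of one True/False status per check are assumptions made for this example. Uploading the results is left out, because the payload format of the monitoring API is not described in this post.

```python
import re
import urllib.request

# Hypothetical checks: each name maps to a pattern that must appear in the page.
CHECKS = {
    "login_ok": re.compile(r"Welcome back", re.I),
    "balance_shown": re.compile(r"Balance:\s*[0-9.]+"),
}


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def evaluate(body, checks=CHECKS):
    """Return {check_name: True/False} for one page body."""
    return {name: bool(pattern.search(body)) for name, pattern in checks.items()}
```

On your own VPS you would call something like evaluate(fetch(url)) from cron, adding whatever credentials the request needs, and then hand the resulting dictionary to the monitoring service. The credentials themselves never leave the machine.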

The advantage here is that your scripts run from multiple remote locations while you still keep control of the passwords (given that you either have your own remote datacenters, or trust Amazon or another provider enough to keep your passwords on a small virtual instance).

If you are interested in this approach, we are very willing to provide you with assistance in setting this up. Just send an e-mail to our support department outlining your case and we will point you in the right direction, providing ready-made scripts when possible.

Posted in Howto | Comments Off

Got error 28 from storage engine

A common cause of website and web service failure is running out of disk space. Either the web server is no longer able to write to its log files and fails completely (resulting in an error page or no connection at all), or the database server returns an error. MySQL, for example, returns the fairly cryptic: Got error 28 from storage engine. If you are using availability monitoring, it will start alerting you because your page is no longer showing up properly.

The error basically means that the database server can’t write new data or temporary files (often needed for complex queries). Because not every SQL query needs to write, the error may only become visible on certain pages or actions.

In the worst case, the inability to write data may lead to database corruption, requiring a repair after you have freed some space. This can be a problem if you have large MySQL MyISAM tables, because repairing them requires additional free space equal to the size of the largest table you’ve got: exactly the thing you just ran out of, so the automatic repair fails. How to repair MySQL tables will be the subject of another post.

Once you are aware of the issue, it’s important to find out which partition is full and what is taking up all that space. I’m assuming that you are on a Linux-based server.

The first step is to find the partition in trouble by running:
df -h

This should show you something like: (note the 100% on rootfs)

Filesystem  Size  Used  Avail  Use%  Mounted on
rootfs       20G   20G     0M  100%  /
none        6.3G  344K   6.3G    1%  /run
none        5.0M     0   5.0M    0%  /run/lock
none         32G  176K    32G    1%  /run/shm
none        100M     0   100M    0%  /run/user
/dev/md2     92G  829M    87G    1%  /data
/dev/md3    1.8T  155G   1.5T   10%  /home

To figure out which files are big, you could run du -hs * in the root ( / ), then descend into the biggest directory and repeat. However, on a production server this will often take too long and should only be used as a last resort. Most often, running the following commands will already tell you what is going on:

du -hs /tmp => if this is particularly big, you are probably not cleaning up temporary files after use, or you are writing logs that are never rotated
du -hs /var/log => if this is particularly big, you may be keeping log files forever or some log has gone haywire; consider shipping logs to an external server for long-term archival
du -hs /var/lib/mysql => this is your MySQL database; usually you can’t do much about this except move it to a different server or partition
du -hs /var/* => if none of the above, the culprit is often still somewhere in this part of your file system

What often happens is that the server comes with a default disk layout that has a small root partition and sometimes also a small separate partition for /tmp. Default installs usually put their database files on the root partition, so if your database is particularly big, you may quickly fill up the root partition with database and log files while there is plenty of space somewhere else. This means you will need to move your database and then use a symlink (ln -s), a bind mount (mount --bind) or a change to your server configuration (the datadir setting in the MySQL configuration file, usually my.cnf) to point to the new database location.

Running out of space happens to everyone at some point. To avoid a lot of stress, however, you should make sure you know in advance. This will prevent errors on your service and potential data corruption. To do so, you can use Observu server monitoring, which lets you set alerts at specific levels of disk usage. The notification will tell you which partition is almost full and needs your attention. To resolve the issue you can then (without stress) use the hints provided above.

A special recommendation for those still using MySQL MyISAM tables: make sure you leave enough space for repair/recovery. This means you will need to set warning levels as early as 60 or 70% of disk usage if table size is unbalanced.
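If you want a quick local check alongside a monitoring service, the threshold idea can be sketched in a few lines of Python. This is only an illustration: the function names and the example mount points are made up for this sketch. Also note that shutil.disk_usage does not account for space reserved for root, so the percentage can differ slightly from the Use% column of df.

```python
import shutil


def usage_percent(path):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def partitions_over(paths, threshold, measure=usage_percent):
    """Return {path: percent} for every path at or above `threshold` percent."""
    report = {}
    for path in paths:
        percent = measure(path)
        if percent >= threshold:
            report[path] = round(percent, 1)
    return report
```

Calling partitions_over(["/", "/data", "/home"], 70) from cron and mailing any non-empty result would give an early warning at the 70% level mentioned above.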

Posted in Howto | Leave a comment

Observu from idea ’til launch

At the end of 2010 we decided that our development efforts were too fragmented and we needed to focus. We had dozens of websites, each of which either needed a lot of work or was not really future proof. We decided to select three of them, one of which was Observu, for two main reasons. First, we really needed it ourselves at that time. Second, we wanted to appeal to other developers, as that is what we do best. Other projects such as FlexLists mostly appeal to developers and people in education, and even our consumer-oriented website picturepush.com appeals more to the techie, professional crowd than to any other.

We set out with the following key ideas:

  • We want to collect all kinds of data, especially combining availability, server and application data
  • We really wanted notifications by phone
  • It should fit well with the cloud, so it should not rely on manually configuring each server
  • Receive data at a fine-grained (every minute) resolution

After a few months of full-time development (by April 2011) we already had a product that helped us a great deal by monitoring our own websites (more than 20 at that time).

We then started setting up the basics of the infrastructure: load balancing, automated deployment, efficient storage of the time series data, and so on. As big sufferers of the not-invented-here syndrome, we did almost everything ourselves, including designing the website and the logo.

In September 2011 we ran out of funds to continue development. We decided we really believed in the product and sold most of our other websites. Beyond that, we were lucky to find a client where we could apply a lot of what we had learned while building Observu, as well as apply Observu itself in practice. We advised them on performance improvements, automated deployment, auto scaling, a redundant database setup and proper load testing.

This was a nice opportunity, but it did slow our development down at first. We did however learn a lot about features we really needed and never considered: e.g. auto-scaling your server pool results in a lot of short lived servers and thus monitors that just stop receiving data.

By June 2012 we felt development wasn’t progressing as it should: consulting and other projects got in the way again. We decided to invest a bit more of our consulting revenue and hired a developer on oDesk. We were lucky enough to find a young but very bright guy who made a lot of progress, especially on our reporting and data explorer. We continued this until September. Unfortunately our funds were limited and the developer had to go back to university, which further limited his availability. Development came down to us again, although our workload on customer projects was already pretty heavy. Those last few features had to be finished on weekends, when there were no projects to coordinate.

Of course some ‘last fixes’ had bigger implications than I anticipated, but we finally got to a point where we felt confident that we had a product that is really useful for a lot of admins and developers. It’s unavoidable to leave a lot of features we really want for the future, and we do feel some anxiety about competitors that popped up while we were developing. However, we could not postpone the launch any longer and even skipped payment integration just to get your feedback as soon as possible.

We got quite a few signups from the mailing list we had built, but very little actual feedback or requests came our way. In the meantime we had to pay our bills and worked on mobile application development. That work took up all of our time, so we did not get the most out of our trial users at all. It resulted in a big go/no-go moment. So in July 2013 we decided to take the plunge one more time, as well as bring someone in to help us with marketing and business development. This paid off in many ways: we quickly learned a lot more about our users and soon started to turn trials into paying subscribers.

For the long term we believe we can leverage our open architecture to really monitor anything and utilize machine learning techniques to automatically discover trends and outliers and take big steps in prioritising information and exclusion of false positives. We want to apply this not just to infrastructure and availability but to everything measurable in operating an online business.

Some more detailed aspects we feel we need to focus on as soon as possible:

  • The trend to support more real-time data: every few seconds
  • Full page load measurements and error checking (already in testing)
  • Support for monitoring high-volume log files (e.g. access logs)
  • Log file search and filtering
  • Create low-overhead (async) ways of sending data to Observu
  • Create proper support for rich exception logging that is easy to browse and includes meta data as well as libraries for all popular platforms
  • Import for CloudWatch metrics
  • Aggregated reporting (e.g. combine error logs from all servers in a cluster into a single view)
  • An app with push notifications

Next time I’ll write more about what we did the last few months to turn our beta into a serious subscription business.

Posted in Progress Report | Leave a comment

Monitoring A Website In The Cloud

Observu has been designed from the ground up to deal with the monitoring reality of running your website or application in the cloud.

Because servers can share the exact same configuration and do not need to be added on the dashboard, deployment is greatly simplified and can easily be automated without losing monitoring capabilities.

By auto-archiving monitors that no longer provide data, Observu can deal with short-lived virtual instances without cluttering your dashboards.

Read more in our cloud monitoring case.

Posted in Uncategorized | Leave a comment

Monitoring Data From Online Sources and APIs

Observu allows you to check the availability of web pages and APIs and test them for the presence of certain text. However, web pages and APIs can provide a wealth of information that is also interesting to track. Maybe your forum lists the current number of users, or your API replies with the number of requests that you have left.

Observu allows you to use regular expressions to capture this information and assign it to a property to be tracked every minute of every day.

API and data monitoring options

Extracting numeric data from a web page

Let’s start with a simple example of extracting a row from a table.

<table>
   <tr>
     <td>EUR - USD</td><td>1.31567</td>
   </tr>
</table>

If we now set /EUR\s-\sUSD<\/td>\s*<td>([0-9\.]+)/si as the expression in our advanced capturing settings and assign it to currency.EUR_to_USD:float, Observu can keep track of the rates published on this page. (Note the \s for the spaces around the dash and the explicit <td> between the two cells.)

Extracting data from an API

Maybe the same page also publishes this data as XML:

<currency>
  <from>EUR</from>
  <to>USD</to>
  <value>1.31567</value>
</currency>

If we now set /value>([0-9\.]+)/si as the expression in our advanced capturing settings and assign it to currency.EUR_to_USD:float, Observu can again keep track of the rates published through this API.
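Before entering an expression in the capture settings, it can help to try it locally against a saved copy of the page. The sketch below does that with Python's re module, which is only a stand-in here: it may differ in details from the regex engine that runs your monitors. The table pattern used in the sketch matches the whitespace around the dash with \s and spans the boundary between the two cells explicitly.

```python
import re

TABLE_HTML = """<table>
   <tr>
     <td>EUR - USD</td><td>1.31567</td>
   </tr>
</table>"""

API_XML = """<currency>
  <from>EUR</from>
  <to>USD</to>
  <value>1.31567</value>
</currency>"""

# re.S and re.I correspond to the /s and /i flags of the expressions in the text.
TABLE_RATE = re.compile(r"EUR\s-\sUSD</td>\s*<td>([0-9.]+)", re.S | re.I)
XML_RATE = re.compile(r"value>([0-9.]+)", re.S | re.I)


def capture_float(pattern, text):
    """Return the first captured group as a float, or None if nothing matches."""
    match = pattern.search(text)
    return float(match.group(1)) if match else None
```

A None result is the local equivalent of a capture that would produce no data point, which is usually a sign that the page layout changed.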

Read more about monitoring your API

The :float at the end of the property name is a type hint that tells Observu how to render and report on the extracted data. Our documentation lists all available types.

Posted in Howto | Leave a comment

Respond faster to Internal Server Errors

When a web page shows you an “Internal Server Error”, the webserver also returns a 500 status code. It means there is something wrong on the website itself. The user requested a proper URL, but something on the server makes it unable to fulfil that request. The user has no way to resolve this except to wait for it to disappear. These errors are the responsibility of the website owner to handle and prevent.

Internal Server Error

One of our most basic features is to help you stay on top of errors like this on your pages. Read more on how to get notified about Internal Server Errors.

Posted in Uncategorized | Leave a comment

Improved API Monitoring

Last week we significantly improved our ability to monitor APIs that are available over HTTP(S). You can now set custom headers, cookies, urlencoded form data and a raw POST body on your availability monitors.

Furthermore, we allow you to do an additional request first, for example to log in to the website before executing the actual request. You can capture data from this initial request (e.g. an authentication token) and re-use it in the actual request you want to monitor.
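To make the flow concrete, here is a small Python sketch of the same two-step idea: log in, capture a token, re-use it. It is not how Observu implements it. The JSON-style token format, the request dictionaries and the send callable are all assumptions made for this illustration.

```python
import re

# Hypothetical login response format: {"token": "..."}
TOKEN_PATTERN = re.compile(r'"token"\s*:\s*"([^"]+)"')


def two_step_check(send, login_request, build_request):
    """Run a login request, capture a token, then run the monitored request.

    `send` is any callable that takes a request dict and returns the body text.
    `build_request` turns the captured token into the second request dict.
    """
    login_body = send(login_request)
    match = TOKEN_PATTERN.search(login_body)
    if match is None:
        # Without a token the second request cannot be built.
        return {"login_ok": False, "body": None}
    body = send(build_request(match.group(1)))
    return {"login_ok": True, "body": body}
```

Keeping the transport behind a plain callable makes the capture-and-reuse logic easy to try out with canned responses before pointing it at a real login endpoint.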

HTTP API Monitoring Options

Finally, you can capture data from the response using a regular expression and use the captured data as a metric in Observu.

Posted in Uncategorized | Leave a comment

Entering private beta testing mode

Starting today, Observu is no longer limited to two customers only. We’ve sent out the beta invitations to the mailing list and hope you are on there too. If you would like an invitation, just send us an e-mail.

We are very eager to learn what you think and what direction we should go. We’ve got tons of ideas, but need your guidance to build the tool that will help you most. We will give away a free T-shirt and a significant discount to anyone that sends in valuable feedback.

Posted in Progress Report | Leave a comment

MySQL queries that kill your responsive website

There are a lot of queries that are fine when your site is small, but take ages as soon as you start to collect some data. Therefore it’s very important to monitor query performance. We usually track at least the following things:

  • total time spent on SQL queries
  • total time spent on rendering a page
  • queries that took more than a certain threshold (query and time)

We log these so we can quickly discover bottlenecks. (Using the Observu server agent, we also store them in Observu for a quick overview and the ability to receive notifications when a slow query occurs.)

Many frameworks, such as Zend Framework, have built-in SQL profilers that can already do these things; you just need to check the documentation.

Once you have found the culprits, it’s recommended to run them manually, prefixed with EXPLAIN. Often you will have forgotten to add an index, or your index does not match the way your query uses the table.
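If your framework has no profiler, the bookkeeping is simple enough to do yourself. The sketch below is a minimal Python illustration that uses the standard-library sqlite3 module as a stand-in for a MySQL driver. The class name and the 0.5 second default threshold are arbitrary choices for this example.

```python
import sqlite3
import time


class QueryTimer:
    """Wrap a sqlite3 connection and record how long each query takes."""

    def __init__(self, connection, slow_threshold=0.5):
        self.connection = connection
        self.slow_threshold = slow_threshold  # seconds
        self.total_time = 0.0                 # total time spent on SQL queries
        self.slow_queries = []                # (sql, seconds) pairs over the threshold

    def execute(self, sql, params=()):
        started = time.perf_counter()
        rows = self.connection.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - started
        self.total_time += elapsed
        if elapsed >= self.slow_threshold:
            self.slow_queries.append((sql, elapsed))
        return rows
```

At the end of a request you would log total_time and slow_queries, which covers the first and third items of the list above. Page render time needs a similar timer around the whole request.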

There are however some query patterns you can already watch out for when writing and reviewing your code. We’ve encountered these again and again as our databases grew larger:

SELECT ..... ORDER BY created_date DESC LIMIT 0,7 to get the most recent items
This becomes slow as the database grows larger, even if there is an index on created_date. The way to counter this is to actually make use of that index by adding a condition that limits the amount of data involved, like: created_date >= '{date_7_days_ago}'
(It’s recommended to generate this date in code and round it down to midnight (00:00), so the query stays identical all day and its result can be cached.)
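As an illustration of generating that rounded date in code, here is a small Python sketch. The items table name is hypothetical, and the %s placeholders follow the parameter style used by common Python MySQL drivers.

```python
from datetime import datetime, timedelta


def recent_items_query(now, days=7, limit=7):
    """Build the 'most recent items' query with a date floor rounded to midnight."""
    floor = (now - timedelta(days=days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    sql = (
        "SELECT * FROM items "  # hypothetical table name
        "WHERE created_date >= %s "
        "ORDER BY created_date DESC LIMIT %s"
    )
    return sql, (floor.strftime("%Y-%m-%d %H:%M:%S"), limit)
```

Because the floor only changes once a day, every call on the same day produces exactly the same query and parameters, which is what makes the result cacheable.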

SELECT .......... LIMIT 500000,10 created by paging code on a large table
This one is harder to prevent, however there are some approaches:

  • Do not sort the data, but have it returned in its natural order.
  • Do not use a LIMIT offset, but use actual conditions on the dimension by which you order the results (e.g. a range of IDs or dates).
  • Just disallow browsing this deep into the data. Will users really need this? Or is the ability just an oversight that only gets triggered by search engines?
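The second approach, paging by a condition on the sort column, can be sketched as follows. This is a minimal Python example that runs against an in-memory SQLite database as a stand-in for MySQL. The items table and its columns are made up for the illustration. The sketch still uses LIMIT for the page size, but never an offset.

```python
import sqlite3


def fetch_page(connection, after_id=0, page_size=10):
    """Return the next page of rows whose id is greater than `after_id`."""
    return connection.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id ASC LIMIT ?",
        (after_id, page_size),
    ).fetchall()


# Demo data: 25 rows in a throwaway in-memory table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
connection.executemany(
    "INSERT INTO items (id, title) VALUES (?, ?)",
    [(i, f"item {i}") for i in range(1, 26)],
)

first_page = fetch_page(connection)
# The "next" link carries the last id seen instead of a page number.
second_page = fetch_page(connection, after_id=first_page[-1][0])
```

The database can jump straight to the right place in the primary key index, so page 50,000 costs about the same as page 1. The trade-off is that users can only step forward from a known position rather than jump to an arbitrary page number.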

SELECT ..... ORDER BY rand() LIMIT 10 to select random items
This is a very common way to select random items, but it stops being workable as soon as you have more than a few thousand items. What happens is that MySQL first has to generate a random number for every row in the table and sort on it, before it can select the 10 to display.

The way around this is to first determine the range of IDs to select from. ( SELECT MIN(id), MAX(id) FROM mytable )
Then generate a random id between MIN(id) and MAX(id)-1, and an upper bound, usually something like random_id+1000.
Finally, find a random item by querying SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY id ASC LIMIT 1.

This efficient way to retrieve a random item from a MySQL table can also be applied to multiple items. For really random, just repeat the procedure. However, in most cases, you don't need a really random set and you can just use something like:
SELECT * FROM mytable WHERE id>={random_id} AND id < {upper_bound} ORDER BY rand() LIMIT 10
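Put together, the single-item procedure could look like the Python sketch below, again using sqlite3 as a stand-in for a MySQL driver. Two details are choices of this sketch rather than part of the recipe above. It draws the random id from the full MIN(id) to MAX(id) range, so a one-row table works. It also retries when the random window happens to fall into a gap of deleted ids.

```python
import random
import sqlite3


def random_row(connection, window=1000, rng=random):
    """Pick one row by probing a random id window instead of ORDER BY rand()."""
    low, high = connection.execute(
        "SELECT MIN(id), MAX(id) FROM mytable"
    ).fetchone()
    if low is None:
        return None  # empty table
    while True:
        random_id = rng.randint(low, high)
        row = connection.execute(
            "SELECT * FROM mytable WHERE id >= ? AND id < ? "
            "ORDER BY id ASC LIMIT 1",
            (random_id, random_id + window),
        ).fetchone()
        if row is not None:
            return row
        # The window fell into a gap without rows: try another random id.
```

Keep in mind that rows sitting right after a gap in the id sequence are picked more often than others, so this trades a little uniformity for speed.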

Posted in Uncategorized | Leave a comment

Development update – 8

It has been silent for a while, but we are definitely still going. Today we’ve deployed all updates to our production systems, enabling features that were critical to support our launching customer:

  • Grant permissions to view your monitors to other accounts
  • A proper data explorer to browse all metrics that are collected
  • Auto-archiving for monitors (very useful in combination with EC2 auto-scaling groups)
  • Tracking and limiting of account usage

We are now going through some final tests and bugfixes, but we will definitely open up in the first month of 2013!

Observu Teaser screenshot

Posted in Progress Report | Leave a comment