How to resolve ‘Got error 28 from storage engine’

A common cause of website and webservice failure is running out of disk space. Either the webserver is no longer able to write to it’s log files and fails completely (resulting in an error page or no connection at all) or the database server may return an error. For example MySQL returns the fairly cryptic: Got error 28 from storage engine.  If you are using availability monitoring, it will start alerting you because your page is no longer showing up properly.

Basically meaning the database server can’t write either new data or temporary files (often needed for complex queries). Because the error does not happen on all SQL queries, the error may only become visible on certain pages or actions.

Worst case, the inability to write data, may lead to database corruption, requiring a repair after you freed some space. This may be a problem if you have large MySQL MyISAM tables, because repairing those requires additional free space as much as the largest table you’ve got. The one thing you just ran out of, causing the automatic repair to fail. How to repair MySQL tables will be subject of another post.

After you became aware of the above issue, it’s important to find out what partition is full and what is taking up all that space. I’m assuming that you are on a Linux-based server.

The first thing is to find out the partition in trouble, by running:
df -h

This should show you something like: (note the 100% on rootfs)

Filesystem Size Used Avail Use% Mounted on
rootfs 20G 20G 0M 100% /
none 6.3G 344K 6.3G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 32G 176K 32G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/md2 92G 829M 87G 1% /data
/dev/md3 1.8T 155G 1.5T 10% /home

To figure out what particular files are big, you could run: du -hs * in root ( / ) and then descend into the biggest directory and refine. However, on a production server, this will often take too long and should only be used as a last resort. Most often running these three should already tell you what is going on:

du -hs /tmp => if this is particularly big, you are probably not properly cleaning temporary files after use, or writing logs that are never rotated
du -hs /var/log => if this is particularly big, you may be keeping log files forever or some log has gone haywire, you could consider transporting logs to an external server for long-time archival
du -hs /var/lib/mysql => this is your mysql database, usually you can’t do much about this, except move it to a different server or partition
du -hs /var/* => if none of the above, often it’s still somewhere in this part of your file system

What often happens is that the server comes with a certain partitioning and disks by default, that has a small root partition and sometimes also a small partition just for /tmp. Default installs usually put their database files onto the root partition, so if your database is particularly big, you may quickly fill up the root partition with database and log files, while there is sufficient space somewhere else. This means you will need to move your database and then use a symlink (ln -s), mount –bind or change your server configuration (edit the mysql.conf file) to point to the new database location.

Running out of space happens to everyone at some point, however to avoid a lot of stress, you should make sure you know in advance. This will prevent errors on your service and potential data corruption. To do so, you can use Observu server monitoring, which will set alerts at specific levels of disk usage. The notification will include information which partition is almost full and needs your attention. To resolve the issue you can then (without stress) use the hints provided above.

A special recommendation for those still using MySQL MyISAM tables: make sure you leave enough space for repair/recovery. This means you will need to set warning levels as early as 60 or 70% of disk usage if table size is unbalanced.

Posted in Howto | Tagged | Comments Off on How to resolve ‘Got error 28 from storage engine’

Observu from idea ’til launch

At the end of 2010 we decided that our development efforts were too fragmented and we needed to focus. We had dozens of websites, each either needing a lot of work or were not really future proof. We decided to select three of them, one of which was Observu. The two most important reasons being: first of all, we really needed it ourselves at that time. Secondly, we wanted to appeal to other developers as that is what we do best. Other projects such as FlexLists mostly appeal to developers and people in education and even our consumer oriented website picturepush.com appeals more to the techie, professional crowd than to any other.

We set out with the following key ideas:

  • We want to collect all kinds of data, especially combining availability, server and application data
  • We really wanted notifications by phone
  • It should fit well with the cloud, so it should not rely on manually configuring each server
  • Receive data at a fine-grained (every minute) resolution

After a few months of full-time development (april 2011) we already had a product that helped us a great deal by monitoring our own websites (more than 20 at that time)

We then started setting up the basics for the infrastructure: load balancing, automated deployment, efficiently storing the time series data, etc. etc. As a big sufferer of the not-invented-here syndrom we did almost everything ourselves, including designing the website and the logo.

In september 2011 we ran out of funds to continue development. We decided we really believed in the product and sold most of our other websites. Other than that we were lucky to find a client where we could apply a lot of knowledge we learned while building Observu as well as apply Observu itself in practice. We advised them on performance improvements, automated deployment, auto scaling, a redundant database setup and proper load testing.

This was a nice opportunity, but it did slow our development down at first. We did however learn a lot about features we really needed and never considered: e.g. auto-scaling your server pool results in a lot of short lived servers and thus monitors that just stop receiving data.

By june 2012 we felt development wasn’t progressing as it should: consulting and other projects got in the way again. We decided to invest a bit more of our consulting revenue and hired a developer on Odesk. We were lucky enough to find a young but very bright guy that made a lot of progress on especially our reporting and data explorer. We continued this till september, unfortunately our funds were limited and the dev had to go back to university, further limiting his availability. Development came down to us again, however our workload was already pretty heavy working on customer projects again. Finishing those last few features had to be done in the weekends when there were no projects to coordinate.

Of course some ‘last fixes’ had bigger implications than I anticipated, but we’ve finally got to a point where we felt confident that we got a product that is really useful for a lot of admins and developers. It’s unavoidable to leave a lot of features we really want in there for the future and we do feel some anxiety about competitors that popped up while we were developing. However, we could not postpone launch any longer and even skipped on payment integration just to get your feedback as soon as possible.

We got quite a bit of signups from the mailing list we built, but very little actual feedback or requests came our way. In the mean time we we had to pay our bills and work on mobile application development. However, it was taking up all of our time, resulting in not getting the most out of our trial users at all. It resulted in a big go/no-go moment. So in July 2013 we decided to take the plunge one more time as well as bring someone in to help us with marketing and business development. This paid off in many ways: we quickly learned a lot more about our users and quickly started to turn trials into paying subscribers.

For the long term we believe we can leverage our open architecture to really monitor anything and utilize machine learning techniques to automatically discover trends and outliers and take big steps in prioritising information and exclusion of false positives. We want to apply this not just to infrastructure and availability but to everything measurable in operating an online business.

Some more detailed aspects we feel we need to focus on as soon as possible:

  • The trend to support more real-time data: every few seconds
  • Full page load measurements and error checking (already in testing)
  • Support for monitoring high-volume log files (e.g. access logs)
  • Log file search and filtering
  • Create low-overhead (async) ways of sending data to Observu
  • Create proper support for rich exception logging that is easy to browse and includes meta data as well as libraries for all popular platforms
  • Import for CloudWatch metrics
  • Aggregated reporting (e.g. combine error logs from all servers in a cluster into a single view)
  • An app with push notifications

Next time I’ll write more about what we did the last few months to turn our beta into a serious subscription business.

Posted in Progress Report | Comments Off on Observu from idea ’til launch

Monitoring A Website In The Cloud

Observu has been designed from the ground up to deal with the monitoring reality of running your website or application in the cloud.

By allowing servers to share the exact same configuration on the server and not requiring them to be added on the dashboard, deployment is greatly simplified and can be easily automated without loosing monitoring capabilities.

By auto-archiving monitors that no longer provide data, Observu can deal with short-lived virtual instances without cluttering your dashboards.

Read more in our cloud monitoring case.

Posted in Howto | Comments Off on Monitoring A Website In The Cloud

Monitoring Data From Online Sources and APIs

Observu allows you to check availability on webpages and APIs and test them for presence of certain text. However, web pages and APIs can provide a wealth of information that is also interesting to track. Maybe your forum lists the current number of users or your API replies with the amount of requests that you have left.

Observu allows you to use regular expressions to capture this information and assign them to a property to be tracked every minute of every day.

API and data monitoring options

Extracting numeric data from a web page

Let’s start with a simple example of extracting a row from a table.

<table>
   <tr>
     <td>EUR - USD</td><td>1.31567</td>
   </tr>
</table>

If we now set /EUR\w-\wUSD<\/td>

([0-9\.]+)/si as expression in our advanced capturing settings and then assign it to: currency.EUR_to_USD:float Observu can keep track of the rates published on this page.

Extracting data from an API

Maybe the same page also publishes this data as XML:

<currency>
  <from>EUR</from>
  <to>USD</to>
  <value>1.31567</value>
</currency>

If we now set /value>([0-9\.]+)/si as expression in our advanced capturing settings and then assign it to: currency.EUR_to_USD:float Observu can again keep track of the rates published through this API.

Read more about monitoring your API

The :float at the end of the property name is type hinting to make sure Observu knows how to render and report on the extracted data. Our documentation lists all available types

Posted in Howto | Comments Off on Monitoring Data From Online Sources and APIs

Respond faster to Internal Server Errors

When a web page shows you an “Internal Server Error”, the webserver also returns a 500 status code. It means there is something wrong on the website itself. The user requested a proper URL, but something on the server makes it unable to fulfil that request. The user has no way to resolve this except to wait for it to disappear. These errors are the responsibility of the website owner to handle and prevent.

Internal Server Error

One of our most basic features is to help you stay on top of errors like this on your pages. Read more on how to get notified about Internal Server Errors

Posted in Uncategorized | Comments Off on Respond faster to Internal Server Errors