Benutzer:Rdiez/ErrorHandling

These are the personal user pages of rdiez, please do not modify them! The only exceptions are simple language corrections such as typos, wrong prepositions or the like. Please report anything else to the user!


Error Handling in General and C++ Exceptions in Particular

Introduction

Motivation

The software industry does not seem to take software quality seriously, and a good part of it falls into the error-handling category. After putting up for years with so much misinformation, so many half-truths and with a general sentiment of apathy on the subject, I finally decided to write a lengthy article about error handling in general and C++ exceptions in particular.

I am not a professional technical writer and I cannot afford the time to start a long discussion on the subject, but I still welcome feedback, so feel free to drop me a line if you find any mistakes or would like to see some other aspect covered here.

Scope

This document focuses on the "normal" software development scenarios for user-facing applications or for non-critical embedded systems. There are of course other areas not covered here: there are systems where errors are measured, tolerated, compensated for or even incorporated into the decision process.

Audience

This document is meant for software developers who have already gathered a reasonable amount of programming experience. The main goal is to give practical information and describe effective techniques for day-to-day work.

Although you can probably guess how C++ exceptions work from the source code examples below, it is expected that you already know the basics, especially the concept of stack unwinding upon raising (throwing) an exception. Look into your favourite C++ book for a detailed description of exception semantics and their syntax peculiarities.

Causes of Neglect

Proper error-handling logic is what sets professional developers apart. Writing quality error handlers requires continuous discipline during development, because it is a tedious task that can easily cost more than the normal application logic for the sunny-day scenario that the developer is paid to write. Testing error paths manually with the debugger is recommended practice, but that doesn't make it any less time-consuming. Repeatable test cases that feed the code with invalid data sequences in order to trigger and test each possible error scenario are a rare luxury. This is why error handling in general needs constant encouragement through systematic code reviews or through separate testing personnel. In my experience, a lack of good error handling is also a symptom that the code hasn't been properly developed and tested. A quick look at the error handlers in the source code can give you a pretty reliable measure of the general code quality.

It is hard to assess how much value robust error handling brings to the end product, and therefore any extra development costs in this field are hard to justify. Free-software developers are often investing their own spare time and frequently take shortcuts in this area. Software contracts are usually drafted in positive terms describing what the software should do, and robustness in the face of errors then gets relegated to some implied general quality standards that are not properly described or quantified. Furthermore, when a customer tests a software product for acceptance, he is primarily worried about fulfilling the contractual obligations in the "normal", non-error case, and that tends to be hard enough. That the software is brittle either goes unnoticed or is not properly rated in the software bug list.

As a result, small software errors often cascade into great disasters, because all the error paths in between fail one after another across all the different software layers and communicating devices, as error handlers hardly ever got any attention. But even in this scenario, the common excuse sounds like "yes, but my part wouldn't have failed if the previous one hadn't in the first place".

In addition to all of the above, when the error-handling logic does fail, or when it does not yield helpful information for troubleshooting purposes, it tends to impact first and foremost the users' budget, and not the developer's, and that normally happens after the delivery and payment dates. Even if the error does come back to the original developer, it may find its way through a separate support department, which may even be able to provide a work-around and further justify the business case for that same support department. If nothing else helps, the developer's urgent help is then suddenly required for a real-world, important business problem, which may help make that original developer a well-regarded, irreplaceable person. After all, only the original person understands the code well enough to figure out what went wrong, and any newcomers will shy away from making any changes to a brittle codebase. This scenario can also hold true in open-source communities, where social credit from quickly fixing bugs may be more relevant than introducing those bugs in the first place. All these factors conspire to make poor error handling an attractive business strategy.

As a result, error handling gets mostly neglected, and that is reflected in our day-to-day experience with computer software. I have seen plenty of jokes about unhelpful or funny error messages. Many security issues have their roots in incorrect error detection or handling, and such issues are still getting patched on a weekly basis for operating system releases that have been considered stable for years.

Goals Overview

These are the main goals for a good error-handling strategy:

  1. Provide helpful error messages.
  2. Deliver the error messages in a timely manner and to the right person.
    The developer may want more information than the user.
  3. Limit the fallout after an error condition.
    Only the operation that failed should be affected, the rest should continue to run.
  4. Reduce the development costs of:
    • adding error checks to the source code.
    • repurposing existing code.

Compromises

Coding the error-handling logic can be costly, and sometimes compromises must be made:

Unpleasant Error Messages

In order to keep development costs under control, the techniques described below may tend to generate error messages that are too long or unpleasant to read. However, such drawbacks are easily outweighed by the disadvantages of delivering too little error information. After all, errors should be the exception rather than the rule, so users should not need to read too many error messages during normal operation.

Abrupt Termination

It is often preferable to let an application panic on a severe error rather than to try and cope with the error condition or ignore it altogether.

Some errors are just too expensive or virtually impossible to handle. An example could be a failed close( file_descriptor ); syscall, which should never fail, and when it does, there is not much the error handler can do about it. These errors are symptomatic of a serious logic error, but usually this kind of error is easy to fix.

Other error conditions may indicate that some memory is corrupt or that some data structure has invalid information that hasn't been detected soon enough. If the application carries on, its behaviour may well be undefined (it may act randomly), which may be even more undesirable than an instant crash.

Leaving a memory, handle or resource leak behind is not an option either, because the application will crash later on for a seemingly random reason. The user will probably not be able to provide an accurate error report, and the error will not be easy to reproduce either. The real cause will be very hard to discover and the user will quickly lose confidence in the general application stability.

Abrupt termination is always unpleasant, but a controlled crash at least lets the user know what went wrong. Although it may sound counterintuitive, such an immediate crash will probably help improve the software quality in the long run, as there will be an incentive to fix the error quickly together with a helpful panic report.

If you are worried about adding panic points, keep in mind that you will not be able to completely rule out abrupt termination anyway. Just touching a NULL pointer, calling some OS syscall with the wrong pointer or just using too much stack space at the wrong place may terminate your application at once.

How to Generate Helpful Error Messages

Let's say you press the 'print' button on your accounting application and the printing fails. Here are some example error messages, ordered by message quality:

  1. Kernel panic / blue screen / access violation.
  2. Nothing gets printed, and there is no error message.
  3. There was an error.
  4. Error 0x03A5.
  5. Error 0x03A5: Write access denied.
  6. Error opening file: Error 0x03A5: Write access denied.
  7. Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  8. Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  9. I cannot start printing the letters because write access to file "invoice arrears.txt" was denied.
  10. Before trying to print those letters, please remove the write-protection tab from the SD Card.
    In order to do that, remove the little memory card you just inserted and flip over the tiny white plastic switch on its left side.
  11. You don't need to print those letters. Those customers are not going to pay. Get over it.

Let's evaluate each of the error messages above:

  1. Worst-case scenario.
  2. Awful. Have you ever waited to no avail for a page to come out of a printer?
    When printing, there usually is no success indication either, so the user will wonder and probably try again after a few seconds. If the operation did not actually fail, but the printer just happens to be a little slow, he will end up with 2 or more printed copies. It happens to me all the time, and we live in 2013 now.
    If the printing did fail, where should the user find the error cause? He could try and find the printer's spooler queue application. Or he could try with 'strace'. Or look in the system log file. Or maybe the CUPS printing service maintains a separate log file somewhere?
  3. Negligent development.
  4. Unprofessional development.
  5. You show some hope as a programmer.
  6. You are getting the idea.
  7. You are implementing the idea properly.
  8. This is the most that you can achieve in practice.
    The error message has been generated by a computer, and it shows: it is too long, clunky and sounds artificial. But the error message is still helpful, and it contains enough information for the user to try to understand what went wrong, and for the developer to quickly pin-point the issue. It's a workable compromise.
  9. Unrealistic. This text implies that the error message generation was deferred to a point where both knowledge was available about the high-level operation that was being carried out (printing letters) and about the particular low-level operation that failed (opening a file). This kind of error-handling logic would be too hard to implement in real life.
  10. In your dreams. But there is an aspect of this message that the Operating System could have provided in the messages above: instead of saying "write access denied", it could have said "write access denied because the storage medium is write protected". Or, better still, "cannot modify the file because the memory card is physically write protected". That is doable, because it's a common error and the OS could internally find out the reason for the write protection and provide a textual description of the write-protected media type. But Linux could never build such error messages with its errno-style error reporting.
  11. Your computer has become self-aware. You may stop worrying now about error handling in your source code.

Therefore, the best achievable error message in practice, assuming that the Operating System developers have read this guide too, would be:

Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Error 0x03A5: Cannot modify the file because the memory card is physically write protected.

The end-user will read it left-to-right, and may only understand it up to a point, but that will hopefully be enough to figure out the problem and maybe to work around it. If the user sends the error message to the developer, there will be enough detail to the right to help locate the exact issue.

Such an error message gets built from right to left. When the 'open' syscall fails, the OS delivers the error code (0x03A5) and the low-level error description ("Cannot modify the file because the memory card is physically write protected"). A single string is built out of these 2 components and gets returned to the level above in the call stack. Instead of a normal 'return' statement, you would raise a C++ exception with 'throw'. At every relevant stage on the way up while unwinding the call stack (at every 'catch' point), the error string gets a new prefix (like "Error opening file "invoice arrears.txt": "), and the exception gets passed further up (gets 'rethrown'). At the top level (the last 'catch'), the final error message is presented to the user.

The source code will contain a large number of 'throw' statements but only a few 'catch/rethrow' points. There will be very few final 'catch' levels, except for GUI applications, where each button handler will need one. However, all such GUI 'catch' points will look the same: they will probably call some helper routine in order to display a standard modal error message box.
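
The right-to-left construction described above can be sketched in a few lines. This is only a minimal illustration: the routine names open_file_or_throw(), open_letter_file(), print_arrears_letters() and on_print_button() are hypothetical, and the low-level failure is simulated instead of coming from a real syscall.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical low-level routine. It always fails here, in order to simulate
// the OS refusing to open the file.
void open_file_or_throw ( const std::string & /* filename */ )
{
  throw std::runtime_error( "Error 0x03A5: Write access denied." );
}

// Intermediate level: knows the filename, so it adds it as a prefix.
void open_letter_file ( const std::string & filename )
{
  try
  {
    open_file_or_throw( filename );
  }
  catch ( const std::exception & ex )
  {
    throw std::runtime_error( "Error opening file \"" + filename + "\": " + ex.what() );
  }
}

// High level: knows which user operation is being carried out.
void print_arrears_letters ( void )
{
  try
  {
    open_letter_file( "invoice arrears.txt" );
  }
  catch ( const std::exception & ex )
  {
    throw std::runtime_error( std::string( "Error printing letters for invoice arrears: " ) + ex.what() );
  }
}

// Top level (the last 'catch'): returns the text that would be shown to the user.
std::string on_print_button ( void )
{
  try
  {
    print_arrears_letters();
    return "";
  }
  catch ( const std::exception & ex )
  {
    return ex.what();
  }
}
```

In this sketch, on_print_button() ends up with the complete message, with the most general context on the left and the most specific detail on the right.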

How to Write Error Handlers

Say you have a large program written in C++ with many nested function calls, like this example:

int main ( int argc, char * argv[] )
{
   ...
   b();
   ...
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   // Error check example: we only accept filenames that are at least 10 characters long.

   if ( strlen( filename ) < 10 )
   {
     // What now? Ideally, we should report that the filename should be at least 10 characters long.
   }

   // Yes, you should check the return value of printf(). TODO: More on that still to come.

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     // What now?
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     // What now?
   }
   ...
}

Let's try to deal with the errors in routine e() above. It's a real pain, as it distracts us from the real work we need to do. But it has to be done.

Here is a very common approach where all routines return an integer error code, like most Linux system calls do. Note that zero means no error.

int main ( int argc, char * argv[] )
{
   ...
   int error_code = b();
   if ( error_code != 0 )
   {
     fprintf( stderr, "Error %d calling b().", error_code );
     return 1;  // This is equivalent to exit(1);
                // We could also return error_code directly, but you need to check
                // what the exit code limit is on your operating system.
   }
   ...
}

int b ( void )
{
   ...
   int err_code_1 = c("file1.txt");
   if ( err_code_1 != 0 )
   {
     return err_code_1;
   }

   int err_code_2 = c("file2.txt");
   if ( err_code_2 != 0 )
   {
     return err_code_2;
   }
   ...
}

int c ( const char * filename )
{
   ...
   int err_code = d( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int d ( const char * filename )
{
   ...
   int err_code = e( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     // Return some non-zero value, but which one?
     // Shall we create our own list of error codes?
     // Or should we just pick a random one from errno.h, like EINVAL?
     return EINVAL;  // Arbitrary choice, just for this example.
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     // Return some non-zero value, but which one? Note that printf() sets errno.
     return errno;
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     // Return some non-zero value, but which one? Note that fopen() sets errno.
     const int saved_errno = errno;  // fprintf() below may overwrite errno.
     fprintf( stderr, "Error opening file %s: %s", filename, strerror( saved_errno ) );
     return saved_errno;
   }
   ...
}

As shown in the example above, the code has become less readable. All function calls are now inside if() statements, and you have to manually check the return values for possible errors. Maintaining the code has become cumbersome.

There is just one place in routine main() where the final error message gets printed, which means that only the original error code makes its way to the top and any other context information gets lost, so it's hard to know what went wrong during which operation. We could call printf() at each point where an error is detected, like we do after the fopen() call, but then we would be calling printf() all over the place. Besides, we may want to return the error message to a caller over the network or display it to the user in a dialog box, so printing errors to the standard output may not be the right thing to do.

The same code using C++ exceptions looks much more readable:

int main ( int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     // We can decide here whether we want to print the error message to the console, write it to a log file,
     // display it in a dialog box, send it back over the network, or all of those options at the same time.
     fprintf( stderr, "Error calling b(): %s", e.what() );
     return 1;
   }
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     throw std::runtime_error( "The filename should be at least 10 characters long." );
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     throw std::runtime_error( collect_errno_msg( "Error opening file %s: ", filename ) );
   }
   ...
}

If the strlen() check above fails, the 'throw' statement stops execution of routine e() and control passes all the way up to the 'catch' statement in routine main() without executing any more code in any of the intermediate callers b(), c(), etc., other than the destructors of their local objects.

We still have a number of error-checking if() statements in routine e(), but we could write thin wrappers for library or system calls like printf() and fopen() in order to remove most of those if()'s. A wrapper like fopen_e() would just call fopen() and throw an exception in case of error, so the caller does not need to check with if() any more.
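
A wrapper like fopen_e() could look roughly like the following sketch. The exact signature and the message format are assumptions, as the article does not define them; the point is only that the error check lives in one place and the caller gets either a valid FILE pointer or an exception.

```cpp
#include <errno.h>
#include <stdio.h>
#include <string.h>

#include <stdexcept>
#include <string>

// Sketch of a thin throwing wrapper around fopen().
// On success, the returned pointer is never NULL.
FILE * fopen_e ( const char * filename, const char * mode )
{
  FILE * const f = fopen( filename, mode );

  if ( f == NULL )
  {
    // Save errno straight away, before any other call can overwrite it.
    const int saved_errno = errno;

    throw std::runtime_error( std::string( "Error opening file \"" ) + filename + "\": " +
                              strerror( saved_errno ) );
  }

  return f;
}
```

With such a wrapper, the fopen() check in routine e() shrinks to a single line, and similar wrappers could be written for printf(), fread() and so on.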

Improving the Error Message with try/catch Statements

Let's improve routine e() so that all error messages generated by that routine automatically mention the filename. That should also be the case for any errors generated by any routines called from e(), even though those routines may not get the filename passed as a parameter. The improved code looks like this:

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       throw std::runtime_error( "The filename should be at least 10 characters long." );
     }

     if ( printf( "About to open file %s", filename ) < 0 )
     {
       throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
     }

     FILE * f = fopen( filename, ... );
     if ( f == NULL )
     {
       throw std::runtime_error( collect_errno_msg( "Error opening the file: " ) );
     }
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}

In the example above, helper routines format_msg() and collect_errno_msg() have not been introduced yet, see below for more information.
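
Until those helpers are described, here is one possible sketch of them, inferred only from how they are called in the examples: format_msg() is assumed to be a printf-style formatter that returns a std::string, and collect_errno_msg() is assumed to format a prefix in the same way and then append the textual description of the current errno value. The internal helper format_msg_v() is hypothetical.

```cpp
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical internal helper: formats into a std::string. The caller still owns 'args'.
static std::string format_msg_v ( const char * format_str, va_list args )
{
  // First pass: find out how long the resulting string will be.
  va_list args_copy;
  va_copy( args_copy, args );
  const int len = vsnprintf( NULL, 0, format_str, args_copy );
  va_end( args_copy );

  if ( len < 0 )
  {
    throw std::runtime_error( "Error formatting a message." );
  }

  // Second pass: format into a buffer of the right size.
  std::vector< char > buffer( static_cast< size_t >( len ) + 1 );
  vsnprintf( &buffer[ 0 ], buffer.size(), format_str, args );

  return std::string( &buffer[ 0 ], static_cast< size_t >( len ) );
}

std::string format_msg ( const char * format_str, ... )
{
  va_list args;
  va_start( args, format_str );

  std::string result;
  try
  {
    result = format_msg_v( format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );
  return result;
}

std::string collect_errno_msg ( const char * prefix_format_str, ... )
{
  // Grab errno first, before any other call can overwrite it.
  const int saved_errno = errno;

  va_list args;
  va_start( args, prefix_format_str );

  std::string msg;
  try
  {
    msg = format_msg_v( prefix_format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );

  // Note that strerror() is not guaranteed to be thread-safe.
  msg += strerror( saved_errno );
  return msg;
}
```

Both routines return a std::string, which can be passed directly to the std::runtime_error constructor as in the examples above.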

Note that all exception types are converted to a std::runtime_error object, so only the error message is preserved. There are other options that will be discussed in another section further ahead.

You may not need a catch(...) statement if your application uses exclusively exception classes ultimately derived from std::exception. However, if you always add one, the code will generate better error messages if an unexpected exception type does come up. Note that, in this case, we cannot recover the original exception type or error message (if there was a message at all), but the resulting error message should get the developer headed in the right direction. You should add at least one catch(...) statement at the application top-level, in the main() function. Otherwise, the application might end up in the unhandled exception handler, which may not be able to deliver a clue to the right person at the right time.

We could improve routine b() in the same way too:

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

You need to find a good compromise when placing such catch/rethrow blocks in the source code. Write too many, and the error messages will become bloated. Write too few, and the error messages may miss some important clue that would help troubleshoot the problem. For example, the error message prefix we just added to routine b() may help the user realise that the affected file is part of his personal address book. If the user has just added a new address book entry, he will probably guess that the new entry is invalid or has rendered the address book corrupt. In this situation, that little error message prefix provides the vital clue that removing the new entry or reverting to the last address book backup may work around the problem.

If you look at the original code, you'll realise that routine c() is actually the first one to get the filename as a parameter, so routine c() may be the optimal place for the try/catch block we added to routine e() above. Whether the best place is c() or e(), or both, depends on who may call these routines. If you move the try/catch block from e() to c() and someone calls e() directly from outside, he will need to provide the same kind of try/catch block himself. You need to be careful with your call-tree analysis, or you may end up mentioning the filename twice in the resulting error message, but that's still better than not mentioning it at all.

Using try/catch Statements to Clean Up

Sometimes, you need to add try/catch blocks in order to clean up after an error. Consider this modified c() routine from the example above:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();
  ...
  d( filename );
  ...
  delete my_instance;
}

If d() were to throw an exception, we would get a memory leak. This is one way to fix it:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();

  try
  {
    ...
    d( filename );
    ...
  }
  catch ( ... )
  {
    delete my_instance;
    throw;
  }

  delete my_instance;
}

Unfortunately, C++ lacks the 'finally' clause, which I consider to be a glaring oversight. Many other languages, such as Java or Object Pascal, do have 'finally' clauses. Without it, we need to write "delete my_instance;" twice in the example above. See further below for an alternative approach with smart pointers and other wrapper classes.
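
As a rough preview of the smart pointer alternative, the sketch below lets a destructor do the clean-up, so that "delete" does not need to be written at all. It assumes a C++11 compiler (std::unique_ptr). The instance counter in my_class and the always-failing d() are made up for this illustration, just to make the clean-up observable.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical class that counts its live instances.
struct my_class
{
  static int live_instances;

  my_class ( void )  { ++live_instances; }
  ~my_class ( void ) { --live_instances; }
};

int my_class::live_instances = 0;

// Always fails, in order to simulate an error deep down the call stack.
void d ( const char * /* filename */ )
{
  throw std::runtime_error( "Simulated failure in d()." );
}

void c ( const char * filename )
{
  // The smart pointer's destructor deletes the object whenever c() exits,
  // be it through a normal return or through an exception.
  const std::unique_ptr< my_class > my_instance( new my_class() );

  d( filename );
}
```

After c() has exited with an exception, my_class::live_instances is back to zero, so nothing has leaked even though c() contains neither a try/catch block nor a "delete" statement.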

The Final Version

This is what the example code above looks like with smart pointers, wrapper functions and a little extra polish:

int main ( const int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     return top_level_error( e.what() );
   }
   catch ( ... )
   {
     return top_level_error( "Unexpected C++ exception." );
   }
}

int top_level_error ( const char * const msg )
{
  if ( fprintf( stderr, "Error calling b(): %s", msg ) < 0 )
  {
    // It's hard to decide what to do here. At least let the developer know.
    assert( false );
  }

  return 1;
}

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

void c ( const char * filename )
{
  std::unique_ptr< my_class > my_instance( new my_class() );  // std::auto_ptr before C++11
  ...
  d( filename );
  ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       throw std::runtime_error( "The filename should be at least 10 characters long." );
     }

     printf_to_log_e( "About to open file %s", filename );

     auto_close_file f( fopen_e( filename, ... ) );

     const size_t read_count = fread_e( some_buffer, some_byte_count, 1, f.get_FILE() );

     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}
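
The auto_close_file wrapper class used in routine e() above is not defined in this article. A minimal sketch, assuming only what the example shows (a constructor that takes a FILE pointer and a get_FILE() accessor) and a C++11 compiler, might look like this:

```cpp
#include <stdio.h>

// Sketch of a wrapper that closes a FILE automatically when it goes out of scope.
class auto_close_file
{
public:
  explicit auto_close_file ( FILE * const f )
    : m_file( f )
  {
  }

  ~auto_close_file ( void )
  {
    if ( m_file != NULL )
    {
      // A destructor should not throw, so this sketch cannot report an fclose() failure.
      // A panic may be more appropriate here, see section "Abrupt Termination".
      fclose( m_file );
    }
  }

  FILE * get_FILE ( void ) const
  {
    return m_file;
  }

  // Copying would make two objects close the same FILE, so it is disabled.
  auto_close_file ( const auto_close_file & ) = delete;
  auto_close_file & operator= ( const auto_close_file & ) = delete;

private:
  FILE * m_file;
};
```

The file gets closed on every exit path from routine e(), including when fread_e() or any later call throws, which is why the final version needs no clean-up code in its 'catch' blocks.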

The rest of the article has not been written yet