2
votes

Problem:

I'm seeing intermittent issues which causes SSL connections to be dropped. I am able to reproduce the issue in my test environment (1 out of 50k calls fails). The software I'm testing is a SMTP/POP3/IMAP-server.

The SSL connection is dropped after the handshake has been performed successfully, after a few commands have been sent back and forth between the SMTP client and the server.

My server software (running on Windows) is written using:

  • Boost: 1.56.0
  • OpenSSL: 1.0.1j

The client application is written using .NET3.5, SslStream

The application is quite similar to HTTP Server 3 in the Boost examples

The call I'm making looks like this:

boost::asio::async_read_until(ssl_socket_, receive_buffer_, delimitor, strand_.wrap(AsyncReadCompletedFunction));

In AsyncReadCompletedFunction, the error contains the following:

  • error_code::value = 67567722
  • error_code::message = block type is not 01

Sometimes the error is instead:

  • error_code::value = 67530866
  • error_code::message = padding check failed

Or:

  • error_code::value = 336265225
  • error_code::message = PEM lib

These errors are set by OpenSSL.

Isolation

  1. When reducing the number of threads which calls io_service::run() from 15 to 1, the problem appears to go away.
  2. I don't see the issue for non-SSL connections.
  3. I've made sure that all calls to Asio goes through a strand. Both the call to async_write/async_read and the completion handlers are executed in a strand, as follows.

    std::shared_ptr<TCPConnection> shared_this = shared_from_this();
    
    strand_.post([AsyncWriteCompletedFunction, send_buffer_, shared_this]
    {
            boost::asio::async_write (shared_this->ssl_socket_, send_buffer_, shared_this->strand_.wrap(AsyncWriteCompletedFunction));
    });
    
  4. I've tested multiple variants, such as just wrapping the handler in a strand and just the call to boost::asio::async_write in a strand.

  5. For testing, I introduced a recursive_mutex which I place a scoped_lock on prior to doing anything. All the methods in the connection class starts by taking this lock so no two methods are executing at the same time.
  6. I've introduced synchronous logging to file in my methods, and have confirmed that there's no overlapping read/write calls being made:

    2014-12-01 13:12:45.362: Session 2254, Starting AsyncWrite - Call to boost::asio::async_write
    2014-12-01 13:12:45.362: Session 2254, AsyncWriteCompleted - Write completion handler called
    2014-12-01 13:12:45.362: Session 2254, Starting AsyncRead - Call to boost::asio::async_read
    2014-12-01 13:12:45.362: Session 2254, AsyncReadCompleted - Read completion handler called, with error:
            The read operation failed. Bytes transferred: 0 Remote IP: 127.0.0.1, Session: 2254, Code: 67567722, Message: block type is not 01
    
  7. I've considered that this could be an error in the client I use to run the tests. But it's single-threaded and implemented using a loop with synchronous calls. I've double-checked the documentation for the SslStream class in .NET Framework 3.5, but can't see anything wrong there either (the client code is very simple).

The test runs within VMware on a Windows XP Service Pack 3. I'm running test on other OS:es as well, but so far I'm not seeing the there.

EDIT 1:

I've installed Wireshark to trace the SMTP-traffic. I see the following:

  1. Handshake performed succesfully.
  2. Server sends welcome message ("220 HI ESMTP")
  3. Client sends command EHLO test
  4. Server responds with its supported commands
  5. Client sends command AUTH LOGIN
  6. Server requests user name
  7. Client sends user name. This is the second line from the end.
  8. Server disconnects client. (RST).

When looking at the server logs, I can see that the client user name sent in step 7 was never received by the server. The async_read operation completed, and the completion handler was called with error code 336265225: PEM lib"

Wireshark trace I've run out of ideas to test. Anyone has suggestions on what should be tried?

1
Does the problem manifest with all the serialization enabled? (Including the - redundant - mutex?). Also, does it not manifest without strands/locking on a single IO thread? Or could it be just much less frequent? Is there a (S)SCCE that we can stress test?sehe
Have you considered tcpdumping/wiresharking the dropped connections to see exactly what happens?sehe
Also, try running the same software configuration on different hardware. 1 out of 50k failure rate is pretty spurious. It could be a faulty NIC/driver other related infrastructure.sehe
Yes, the problem manifests with all serialization enabled, normally within 20 minutes. I've ran single-threaded without strands/locking for 20 hours and didn't get the issue. Hard to tell whether its just much less frequent. I considered wireshark, but I'm not sure I'll be able to see what's going on with the connections encryped? Unless someone comes with a hint, I'll try to create a SSCCE based on the Boost examples. Right now I don't have that.Nitramk
Another observation: Prior to Asio calling SSL_read, sometimes there's already an error on the OpenSSL error stack. I added a call to ERR_get_error check to the start of engine::want engine::perform in Asio, and this actually returns an error BEFORE the read-operation has been executed. Which I believe means that even if the SSL_read operation succeeds, Asio will report an error because there's an error in the OpenSSL error queue. This error must be remaining from some prior operation. Troubleshooting continues......Nitramk

1 Answers

1
votes

This issue was caused by bug in the Asio SSL implementation. Reported it to the Asio Github issue tracker.

When an error occur in OpenSSL, errors-codes are pushed onto a queue. A single call to OpenSSL may result in multiple error-codes being pushed to the queue. For example, a low-level component in OpenSSL may push one error to the queue, while another higher-level component may push another. (It could be seen as a stack of errors).

These queues are linked to the thread. In the Asio imlplementation for Windows, this is done using the thread-id (int). So every thread has its own queue containing errors.

After Asio has executed an SSL function, it only removes the first error from the queue. If more than one item have been pushed to the queue, the remaining items will be left behind. This has the effect that when Asio calls a SSL function later on, and checks the queue, it may pick up the error from a previous operation, and think that the current operation has faulted, even though it hasn't.

Sometimes when Asio calls OpenSSL, it clears the queue first. For example, the function add_certificate_authority starts of by executing ::ERR_clear_error(). It seems other people have had the same issue and claimed that this is the proper solution.But other functions in Asio, such as async_read do not do this.

My issue was found when running my test suite. The test suite contains a number of tests which tests error handling, for example by sending invalid data to the SSL server, by trying to use a private key with an invalid password and so on and here's where the fun starts:

My test suite triggers the code:

 context.set_password_callback(callbackfunc)
 context.use_private_key_file(myfile, boost::asio::ssl::context::pem)

The callback func will return an incorrect password (to test that this is handled properly). use_private_key_file will call SSL_CTX_use_PrivateKey_file which pushes two errors to the queue. But Asios implementation of use_private_key_file only removes the first item. use_private_key_file properly reports an error to my application code.

My application code detects this and triggers a restart of the service-component (cleaning up, restarting threads, etc). The actual Windows-process will continue to run.

Some time later when i call async_read, that call may happen to be executed on a thread which has the same ID as the thread where I called use_private_key_file an hour ago. It's really different threads (original one has terminated), but they happen to have the same Thread-Id. After Asio calls SSL_read, it will check the error queue for the current thread id and retrieve the error which was previously reported in the call to use_private_key_file.

So that's how a async_read can succeed but still fail with a error referring to something completely different.