epoll_wait return EPOLLOUT even with EPOLLET flag

Question

I am using Linux epoll in edge-triggered mode. Each time a new connection comes in, I add the file descriptor to epoll with the EPOLLIN|EPOLLOUT|EPOLLET flags. My first question is: what is the right way to check which event(s) occurred for each ready file descriptor after epoll_wait returns? I have seen example code, e.g. https://github.com/yedf/handy/blob/master/raw-examples/epoll-et.cc line 124, that does it like this:

for (int i = 0; i < n; i++) {
    //...
    if (events & (EPOLLIN | EPOLLERR)) {
        if (fd == lfd) {
            handleAccept(efd, fd);
        } else {
            handleRead(efd, fd);
        }
    } else if (events & EPOLLOUT) {
        if (output_log)
            printf("handling epollout\n");
        handleWrite(efd, fd);
    } else {
        exit_if(1, "unknown event");
    }
}

What caught my attention is that it uses an if / else if / else chain to decide which event occurred, so if it calls handleRead it cannot call handleWrite in the same iteration. I think this can lose an event in the following situation: both the read and the write on a socket have hit EAGAIN, and then the remote end both reads and sends some data. epoll_wait may then report both EPOLLIN and EPOLLOUT, but only handleRead runs, and the data remaining in the output buffer is never sent because handleWrite is not called. So is the above usage wrong?

According to the Q&A section of man 7 epoll:

If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?

They will be combined.

If I understand correctly, several events can occur on a single file descriptor between epoll_wait calls. So I think I should use a series of independent if statements to check one by one whether readable, writable, or error events occurred, instead of if / else if. I looked at how the nginx epoll module does it. At https://github.com/nginx/nginx/blob/953f53921505a884f3912f2d8db5217a71c0479a/src/event/modules/ngx_epoll_module.c#L867 I see the following code:

    if (revents & (EPOLLERR|EPOLLHUP)) {
        //...
    }
    if ((revents & EPOLLIN) && rev->active) {
        //....
        rev->handler(rev);
    }
    if ((revents & EPOLLOUT) && wev->active) {
        //....
        wev->handler(wev);
    }

This matches my idea of checking EPOLLERR/EPOLLHUP, EPOLLIN, and EPOLLOUT one after another, so I did the same in my application. But what I found by experiment is this: if I add the file descriptor to epoll with EPOLLIN|EPOLLOUT|EPOLLET and never fill up the output buffer, I still get EPOLLOUT set every time epoll_wait returns because data arrived and the fd became readable. As a result a redundant write_handler call is made, which is not what I expected.

I searched and found that this behavior really exists and is not caused by a bug in my application. The top-voted answer to "epoll with edge triggered event" says:

On a somewhat related note: if you register for EPOLLIN and EPOLLOUT events and assuming you never fill up the send buffer, you still get the EPOLLOUT flag set in the event returned by epoll_wait each time EPOLLIN is triggered - see https://lkml.org/lkml/2011/11/17/234 for a more detailed explanation.

And the link in this answer says:

It's doesn't mean there's an EPOLLOUT "event", it just means a message is triggered (by the socket becoming readable) so you get a status update. In theory the program doesn't need to be told about EPOLLOUT here (it should be assuming the socket is writable already), but it doesn't do any harm.

My understanding of epoll edge-triggered mode so far is:

epoll_wait returns when the state of any monitored fd has changed, e.g. from nothing to read to readable, or from buffer full to buffer writable.
epoll_wait may return one or several event flags for each fd in the ready list.
The flags in the struct epoll_event.events field indicate the current state of the fd. Even if we never fill up the output buffer, EPOLLOUT is set when epoll_wait returns because of readability, since the current state of the fd is simply writable.

Please correct me if I am wrong. My question then is: should I maintain a flag in each connection that records whether EAGAIN occurred when writing to the output buffer, and skip write_handler/handleWrite in the "if (events & EPOLLOUT)" branch when the flag is not set, so that my upper-layer code is not told about EPOLLOUT in this case?

It sounds like the answer is yes. Well, you can make your program deal with it however you want. That is one way. You could also consider using epoll_ctl to disable EPOLLOUT when you're not waiting to write data. — user253751
@demonatic Is there a good reason for using edge-triggered semantics? A cleaner and more robust way would be to use level-triggered polling and modify the event sets as you need them at the moment. — Ctx
@Ctx I use edge trigger mainly to reduce system calls and the number of fds possibly returns from epoll_wait — demonatic

Ron Burk · Accepted Answer · 2020-10-14T06:38:35

What a great question (since I had pretty much the same question)! I'll just summarize what I think I know now with respect to your informative question/description and your helpful links, and hopefully smarter folk will correct any mistakes.

Yes, the if/else handling of event flags is definitely bogus. For sure, at least two events can arrive at effectively the same time. E.g., both the read and write sides might have become unblocked since you last called epoll_wait(). And, of course, as soon as you accept() the connection, both reading and writing suddenly become possible, so you get an "event" of EPOLLIN|EPOLLOUT.

I really didn't grok that epoll_wait() is always delivering the entire current state, rather than only the parts of the state that changed -- thanks for clearing that up. To be perhaps clearer, epoll_wait() won't return an fd unless something changed on that socket, but if something did change, it returns all the flags representing the current state. So, I found myself staring at a stream of EPOLLIN|EPOLLOUT events wondering why it was claiming there was an "output" event, even though I hadn't written anything yet. Your answer being correct: it's just telling me the output side is still writeable.
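
If you want to watch this happen, here is a minimal sketch that uses a socketpair() in place of a real TCP connection. WaitOnce, Observed and ObserveEdgeTriggeredFlags are names I made up for illustration, and the expectations in the comments assume Linux behaves as described above:

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: one epoll_wait() call for a single event.
// Returns the reported flag mask, or 0 if nothing was reported in time.
static uint32_t WaitOnce(int epollFd, int timeoutMs)
{
    epoll_event ev{};
    int n = epoll_wait(epollFd, &ev, 1, timeoutMs);
    return n == 1 ? ev.events : 0;
}

// Flag masks seen at three points in time (illustrative struct).
struct Observed { uint32_t afterAdd, idle, afterPeerWrite; };

Observed ObserveEdgeTriggeredFlags()
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv);
    assert(rc == 0);
    int epollFd = epoll_create1(0);
    assert(epollFd >= 0);

    epoll_event reg{};
    reg.events = EPOLLIN | EPOLLOUT | EPOLLET;
    reg.data.fd = sv[0];
    rc = epoll_ctl(epollFd, EPOLL_CTL_ADD, sv[0], &reg);
    assert(rc == 0);
    (void)rc;

    Observed seen{};
    seen.afterAdd = WaitOnce(epollFd, 100);        // writable, nothing to read yet
    seen.idle = WaitOnce(epollFd, 50);             // nothing changed: no event
    ssize_t w = write(sv[1], "x", 1);              // peer sends one byte
    assert(w == 1);
    (void)w;
    seen.afterPeerWrite = WaitOnce(epollFd, 100);  // readable AND still writable

    close(epollFd);
    close(sv[0]);
    close(sv[1]);
    return seen;
}
```

The only thing that changed before the last wait is readability, yet the reported mask should carry EPOLLOUT alongside EPOLLIN, because it reflects the current state of the fd rather than the delta.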

"Should I maintain a flag..." Yes, but I would imagine that in all but the most trivial situations you were probably going to end up maintaining at least one bit of "am I currently blocked" state for your readers/writers anyway. For example, if you ever want to process data in an order different than how it arrives (e.g., prioritize responses over requests to make your server more resistant to overload) you instantly have to give up the simplicity of just having the arrival of I/O drive everything. In the particular case of writing, epoll simply doesn't have enough information to notify you at the "right" time. As soon as you accept a connection, there's an event that says "you can write now"--but you probably have nothing to write if you're a server who couldn't possibly have already gotten a request from the client. epoll just can't know whether you have something to write or not, so you were always going to have to either suffer essentially "extraneous" events, or maintain your own state.

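One way to keep that "am I currently blocked" bit for the write side is sketched below. This is only an illustration under my own assumptions: Connection, Flush, Send, OnEvents and writeBlocked are hypothetical names, the fd is assumed to be non-blocking, and error handling is reduced to a bool.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical per-connection state; not taken from the question's code.
struct Connection
{
    int fd = -1;
    std::string outBuf;         // bytes the kernel has not accepted yet
    bool writeBlocked = false;  // true only after send() reported EAGAIN

    // Push as much of outBuf as the kernel will take.
    // Returns false on a real error, true otherwise.
    bool Flush()
    {
        assert(fd >= 0);
        while (!outBuf.empty())
        {
            ssize_t n = send(fd, outBuf.data(), outBuf.size(), MSG_NOSIGNAL);
            if (n > 0) { outBuf.erase(0, static_cast<std::size_t>(n)); continue; }
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            {
                writeBlocked = true;   // now we really are waiting for EPOLLOUT
                return true;
            }
            if (n < 0 && errno == EINTR) continue;
            return false;
        }
        writeBlocked = false;
        return true;
    }

    // Queue data; write immediately unless a previous write is still blocked.
    bool Send(const std::string& data)
    {
        outBuf += data;
        return writeBlocked ? true : Flush();
    }

    // Returns true if EPOLLOUT was acted on, false if it was ignored as
    // a mere status report. (A real handler would also check Flush()'s result.)
    bool OnEvents(uint32_t events)
    {
        if ((events & EPOLLOUT) && writeBlocked) { Flush(); return true; }
        return false;
    }
};
```

With this shape, the EPOLLOUT bit that rides along with every EPOLLIN is dropped cheaply, and the write path only reacts to epoll after it has actually seen EAGAIN.
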
In all but the simplest cases, the socket file descriptor ends up being insufficient information for handling I/O events, so you invariably have to associate some data structure with it, or object if you prefer. So, my C++ looks something like:

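// nAwake is an int; events is an array of at least 100 epoll_event entries.
nAwake = epoll_wait(epollFd, events, 100, milliseconds);
if(nAwake < 0 && errno != EINTR)   // EINTR only means a signal interrupted the wait
    {
    perror("epoll_wait failed");
    assert(false);
    }
for(int iSocket=0; iSocket < nAwake; ++iSocket)
    {
    // data.ptr was set to the owning object when the fd was registered with epoll_ctl()
    auto This = static_cast<Eventable*>(events[iSocket].data.ptr);
    auto eventFlags = events[iSocket].events;
    fprintf(stderr, "%s event on socket [%d] -> %s\n",
        This->ClassName(), This->fd, DumpEvent(eventFlags));

    // Hand over the full flag set; the object decides which bits matter right now.
    This->Event(eventFlags);
    }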

Where Eventable is a C++ class (or derivative thereof) that has all the state needed to decide how to handle the flags epoll delivers. (Of course, this is letting the kernel store a pointer to a C++ object, requiring a design that is very clear about pointer ownership/lifetimes.)

And since you're writing low-level code on Linux, you may also care about EPOLLRDHUP. This not-highly-portable flag lets you save one call to read(). If the client (curl seems pretty good at evoking this behavior) closes its write side of the connection (sends a FIN), you normally discover that when epoll tells you EPOLLIN, but read() returns zero bytes. However, Linux maintains an extra bit to indicate your client's write side (your read side) has been closed. So, if you tell epoll you want the EPOLLRDHUP event you can use it to avoid doing a read() whose sole purpose will turn out to be telling you the writer closed their side.

Note that, AFAIK, EPOLLIN will still be turned on whenever EPOLLRDHUP is, even after you do a shutdown(fd, SHUT_RD). This is another example of how you will usually be driven to maintain your own idea of the state of the connection. You care more about clients who are kind enough to do half-shutdowns if you are implementing HTTP.
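
As a rough sketch of the registration and the check (WatchWithRdHup and PeerClosedWriteSide are hypothetical helper names, not part of any library):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Register fd for edge-triggered read readiness plus notification that
// the peer has shut down its writing side. Returns epoll_ctl()'s result.
int WatchWithRdHup(int epollFd, int fd)
{
    assert(epollFd >= 0 && fd >= 0);
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epollFd, EPOLL_CTL_ADD, fd, &ev);
}

// True if the delivered flags say the peer will send nothing more.
bool PeerClosedWriteSide(uint32_t events)
{
    return (events & EPOLLRDHUP) != 0;
}
```

Keep in mind that bytes the peer sent before closing are still sitting in the receive buffer, so EPOLLRDHUP only tells you that end-of-stream follows whatever is left to read; it does not mean you can skip draining that data.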
