Zbr's days.


Tue, 18 Dec 2007

Fundamental race between block layer/IO and networking.

This post is about the impossibility of using the network ->sendpage() method, which is mostly used to transfer IO-mapped pages, without races, unless one either turns off offload capabilities and copies the data into a new buffer, or uses its own acks in the protocol.

->sendpage() in the optimised case (the hardware supports checksum offloading and scatter/gather) will not copy the content of the page into a new buffer, but will instead increase the page's reference counter, so that the page cannot be freed. When ->sendpage() returns, there is no guarantee that the data was sent, received by the remote side or whatever: the packet can be queued (in hardware or in a qdisc) and it can be retransmitted later. There is no way to know that the data was received until the ACK (let's talk about TCP) arrives, and there is no API to learn that the ACK has arrived. When the ACK is received, the appropriate packet is found in the TCP retransmit queue and freed, which drops the page's reference counter.
If the user expects (and there is actually no other way) that after ->sendpage() returns the data can be processed (for example rewritten), then there is a non-zero probability that the remote side will get the new data instead of the old, which can lead to state machine breakage and data corruption.
One can try to use sendfile() and simultaneously write data to the file: the remote side can get a mix of old and new data. One can argue that proper locking around sendfile() and write() will help, but actually it will not. Consider the case when we send only a single page: after sendfile() has returned, the data can still be in the queue, so a subsequent write, which no longer races with sendfile() itself but still races with the actual transmission, will overwrite the data, and the remote side will get the new data instead of the old.
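
To make the race concrete, here is a minimal userspace sketch. It is not kernel code: struct page_model, struct queued_pkt and all the model_*() helpers are hypothetical names invented for illustration. It only models the behaviour described above: the zero-copy path keeps a reference to the caller's page, so whatever is in the page at transmit time goes on the wire, while the copying path is immune to later writes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16

/* Hypothetical stand-in for struct page: payload plus a reference counter. */
struct page_model {
    char data[PAGE_SZ];
    int refcount;
};

/* One queued "packet": it either shares the caller's page or owns a copy. */
struct queued_pkt {
    struct page_model *page;   /* zero-copy path: shared with the caller */
    char copy[PAGE_SZ];        /* copy path: private snapshot */
    int zero_copy;
};

/* Models ->sendpage(): take a reference and return before transmission. */
void model_sendpage(struct queued_pkt *q, struct page_model *p)
{
    p->refcount++;
    q->page = p;
    q->zero_copy = 1;
}

/* Models send(): copy the data, so later writes cannot affect the packet. */
void model_send_copy(struct queued_pkt *q, const struct page_model *p)
{
    memcpy(q->copy, p->data, PAGE_SZ);
    q->page = NULL;
    q->zero_copy = 0;
}

/* Models the moment the bytes finally leave the queue for the wire. */
void model_transmit(const struct queued_pkt *q, char *wire)
{
    memcpy(wire, q->zero_copy ? q->page->data : q->copy, PAGE_SZ);
}

/* Models the ACK: the packet is freed and the page reference is dropped. */
void model_ack(struct queued_pkt *q)
{
    if (q->zero_copy && q->page) {
        assert(q->page->refcount > 0);
        q->page->refcount--;
        q->page = NULL;
    }
}
```

If the page is rewritten between model_sendpage() and model_transmit(), the "wire" buffer ends up with the new content, which is exactly the single-page sendfile() case.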

There are two fixes for this problem: the first is not to use ->sendpage() (or to use it with a copy of the data into a new buffer, which is essentially how the usual send() works); the second is to use a protocol-specific acknowledgement system, so that any subsequent operation on the given data is postponed not until ->sendpage()/sendfile() returns, but until that ACK is received.
Both greatly harm performance.
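
The second fix could look roughly like the sketch below. It is a userspace model under stated assumptions, not an existing API: the protocol is assumed to carry a request id that the peer echoes back in its ack, and struct pending_table and the pending_*() helpers are hypothetical names. A buffer is recorded when it is handed to ->sendpage()/sendfile() and may be rewritten only after the matching ack has arrived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical record of a buffer that may still be read by the network. */
struct pending_req {
    unsigned id;     /* request id carried in the (assumed) protocol header */
    void *buf;       /* buffer that must stay untouched until it is acked */
    bool in_use;
};

struct pending_table {
    struct pending_req slot[MAX_PENDING];
};

/* Record a buffer right after handing it to sendpage/sendfile.
 * Returns false when the table is full and the caller has to wait. */
bool pending_add(struct pending_table *t, unsigned id, void *buf)
{
    assert(buf != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i] = (struct pending_req){ id, buf, true };
            return true;
        }
    }
    return false;
}

/* True while the buffer has been sent but not yet acknowledged. */
bool pending_busy(const struct pending_table *t, const void *buf)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (t->slot[i].in_use && t->slot[i].buf == buf)
            return true;
    return false;
}

/* Called when the peer's protocol-level ack for `id` arrives.
 * Returns the buffer that is now safe to rewrite, or NULL if unknown. */
void *pending_ack(struct pending_table *t, unsigned id)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slot[i].in_use && t->slot[i].id == id) {
            t->slot[i].in_use = false;
            return t->slot[i].buf;
        }
    }
    return NULL;
}
```

The cost is visible even in the model: every write to a sent buffer now waits for a full round trip instead of just for the send call to return.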

I would be really glad to find that my conclusions are incorrect.

/devel/fs :: Link / Comments (8)

Jens Axboe wrote at 2007-12-18 23:55:

Your conclusion is correct, I (re)discovered the same thing when doing the splice stuff. I thought it was common knowledge; since you just discovered it again, I guess not...

Matthew wrote at 2007-12-19 12:55:

Presumably an alternative workaround would be to instrument some kind of callback from the TCP stack when it receives the ACK from the peer, and only then complete the BIO request. Is there really no way of implementing such a mechanism such that it could be useful in the general case?

Zbr wrote at 2007-12-19 13:45:

A really tricky way is to provide an skb destructor to the ->sendpage() callback, which will invoke the original one, or to use page->lru (we have two pointers there!), which is not used for IO-mapped pages but is used for slab-allocated pages.

It is possible to extend the current scheme, but it will be a real hack :)

Jens Axboe wrote at 2007-12-19 21:00:

BTW, my suggestion would be to properly implement socket ops ->splice_write() and do the pipe buffer release on the ACK.

Zbr wrote at 2007-12-19 21:19:

That's the main issue: how to add private callbacks to the network stack so that it can inform users about received acks.

Distributed storage for example would call bio_endio() in that callback, splice would release pipe buffer (and thus release pages), other users would be happy too...

But... There is no generic way to attach private data to the page: in some cases lru is unused, in others mapping, and I'm pretty sure there is a case when every single item inside struct page is used (otherwise SLUB would not have added new members :). So, it would be possible to attach private data to the skb, but the area specially allocated for private data (skb->cb) is only guaranteed to live (and stay untouched) within a single network layer, so we would have to change the generic skb structure, and there is no way to add new members there.

It is possible to overwrite the skb destructor (skb->destructor), but every user of that field (at least some protocols and netfilter) would have to copy it into a private area and invoke it from its own destructor callback.
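
The destructor chaining described above can be sketched in userspace like this. struct skb_model is a hypothetical stand-in for struct sk_buff with only the fields the sketch needs, and the `saved` field plays the role of the "private area"; none of these helpers exist in the kernel under these names.

```c
#include <assert.h>
#include <stddef.h>

struct skb_model;
typedef void (*skb_dtor_t)(struct skb_model *skb);

/* Hypothetical stand-in for struct sk_buff. */
struct skb_model {
    skb_dtor_t destructor;   /* what the stack calls when the skb is freed */
    skb_dtor_t saved;        /* previous owner's destructor, kept privately */
    int orig_calls;          /* bumped by the original destructor */
    int notify_calls;        /* bumped by our completion hook */
};

/* Whatever the protocol had installed before we came along. */
void orig_destructor(struct skb_model *skb)
{
    skb->orig_calls++;
}

/* Our hook: notify the user, then chain to the saved destructor. */
void notify_destructor(struct skb_model *skb)
{
    skb->notify_calls++;     /* bio_endio() or a pipe buffer release would go here */
    if (skb->saved)
        skb->saved(skb);
}

/* Overwrite the destructor while remembering the one we displaced. */
void skb_hook_destructor(struct skb_model *skb, skb_dtor_t hook)
{
    assert(hook != NULL);
    skb->saved = skb->destructor;
    skb->destructor = hook;
}

/* Models the free path calling the destructor. */
void skb_model_free(struct skb_model *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
}
```

The fragile part is that everybody who overwrites the field has to play by the same rule; one user that forgets to chain silently breaks all the others.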

Jens Axboe wrote at 2007-12-19 21:37:

There's no way around wrapping the page into some sort of container; my suggestion is to use splice for this as well. The pipe_buffer is just a page/len/off (yet another one...) and the pipe_info may be ok as a pages container which includes a priv for lookup of e.g. a bio for the distributed storage path.

Zbr wrote at 2007-12-19 21:46:

But how do we pass the pipe_buf to __kfree_skb(), which is called when the ack is received and the skb is taken off the queue?

It does not matter what gets called when the skb is about to be freed; it can be whatever we want: bio_endio() in a loop, a splice buffer release, a userspace wakeup or anything else. But there is no generic way to attach it to the skb (and thus to the page) and get a notification, and this is the main problem.

Such a notification can be invoked in process or BH context, and in some cases even in hardware interrupt context (when LRO is turned on, iirc).

Another hack would be to extend the socket processing code (all interesting data is always attached to sockets). The socket gets an implicit notification when an skb is freed (its private memory limits are adjusted), so if the socket had a private callback (it actually has one), it could check the queue of bio/pipe/whatever requests and find which ones have become ready, but it is a hack too :)
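
One possible shape for "check the queue and find which requests became ready" is sketched below. How readiness is detected is not spelled out above, so this is purely an assumption: each request remembers the sequence number just past its last byte, and the notification compares it with the highest acked sequence number. struct pending_io, seq_before_eq() and complete_acked() are hypothetical names, and this is a userspace model, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending request: done once everything up to end_seq is acked. */
struct pending_io {
    uint32_t end_seq;   /* sequence number just past the request's last byte */
    int done;
};

/* Wrap-safe "a is at or before b" for 32-bit sequence numbers. */
int seq_before_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}

/* Meant to be called from the socket's notification with the highest acked
 * sequence number; marks finished requests and returns how many completed. */
size_t complete_acked(struct pending_io *reqs, size_t n, uint32_t acked_seq)
{
    size_t completed = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!reqs[i].done && seq_before_eq(reqs[i].end_seq, acked_seq)) {
            reqs[i].done = 1;   /* bio_endio() or a pipe buffer release would go here */
            completed++;
        }
    }
    return completed;
}
```

Calling complete_acked() twice with the same acked sequence number completes nothing the second time, so spurious notifications are harmless in this model.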

Zbr wrote at 2007-12-19 22:00:

Actually, the problem is not as complex as it first seemed.

The only thing we have to change is sock_wfree() (I think no one cares about transferring data over UDP): this destructor is called when an skb allocated for sending is freed, for TCP, SCTP and DCCP sockets at least.

What we want is to put a private callback invocation there, if one is provided, which will perform whatever operations we want, and those operations will _not_ race, since a sock_wfree() invocation means the data is _fully_ processed. It has to check for clones though, but that is not too complex.
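
A rough userspace model of that idea, including the clone check, might look like this. Everything here is hypothetical: struct sock_model, struct skb_model, struct shared_data and the wfree_cb hook are invented for the sketch and are not the real kernel structures or an existing interface. The hook fires only when the last skb sharing the data has been freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the data area shared by an skb and its clones. */
struct shared_data {
    int dataref;   /* number of skbs (original plus clones) using the pages */
};

/* Hypothetical socket with write memory accounting and a private hook. */
struct sock_model {
    int wmem_alloc;                                       /* bytes charged for queued skbs */
    void (*wfree_cb)(struct sock_model *sk, void *priv);  /* optional private callback */
    void *cb_priv;
};

struct skb_model {
    struct sock_model *sk;
    struct shared_data *shared;
    int truesize;
};

/* Models sock_wfree(): uncharge the socket, then notify the user once the
 * data is no longer referenced by any clone. */
void model_sock_wfree(struct skb_model *skb)
{
    struct sock_model *sk = skb->sk;

    assert(sk->wmem_alloc >= skb->truesize);
    sk->wmem_alloc -= skb->truesize;

    if (--skb->shared->dataref == 0 && sk->wfree_cb)
        sk->wfree_cb(sk, sk->cb_priv);
}

/* Example hook: counts completions; a real user might end a bio here. */
void count_completion(struct sock_model *sk, void *priv)
{
    (void)sk;
    ++*(int *)priv;
}
```

With an original and one clone sharing the data, freeing the first of them only adjusts the accounting; the callback runs when the second one goes away.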

So, this solution is not very intrusive (sockets are big enough without these changes already :) and could even be accepted.
