Zbr's days.


Fri, 18 Jul 2008

Completed distributed storage redesign.

I also managed to play second-octave F# and, sometimes, the whole chromatic scale down to the small (minor?) octave F on my trumpet, and I believe I am starting to understand overall trumpet kung-fu, but I expect that is not what you wanted to read under the DST tag.

So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop the userspace target completely for now.
The kernel part now operates on a transaction entity, which holds a reference to the node the data should be sent to or received from. There can be at most two such nodes, which happens when a block IO request spans a node boundary. In the case of mirroring (which will be dropped for the first release), the list of nodes to mirror the data to will be maintained by the first node, so the transaction will not need to know about them.
In theory a block request can be as large as BIO_MAX_PAGES pages, which is 256 for now, so I decided to require that a node is never smaller than this bio limit; that way there will always be at most two nodes per request.
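The "at most two nodes per request" rule can be illustrated with a small userspace sketch. This is not DST code: the names node_range, req_part and split_request() are made up for illustration, and it assumes that nodes are laid out back to back and sorted by their starting sector.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node descriptor: a contiguous range of sectors. */
struct node_range {
    uint64_t start;   /* first sector covered by this node */
    uint64_t size;    /* number of sectors in this node */
};

/* One piece of a request that lies entirely inside a single node. */
struct req_part {
    size_t   node;    /* index into the node array */
    uint64_t start;   /* first sector of this piece */
    uint64_t len;     /* number of sectors in this piece */
};

/*
 * Split the request [start, start + len) over the nodes.
 * Assumes nodes are sorted by start and adjacent to each other.
 * Returns the number of parts written, or 0 if the request does not
 * fit into the storage or would need more than max_parts pieces.
 */
static size_t split_request(const struct node_range *nodes, size_t n_nodes,
                            uint64_t start, uint64_t len,
                            struct req_part *parts, size_t max_parts)
{
    size_t n = 0;

    for (size_t i = 0; i < n_nodes && len > 0; i++) {
        uint64_t end = nodes[i].start + nodes[i].size;
        uint64_t chunk;

        if (start < nodes[i].start || start >= end)
            continue;               /* request does not touch this node */
        if (n == max_parts)
            return 0;               /* caller's part array is too small */

        chunk = end - start;        /* room left in this node */
        if (chunk > len)
            chunk = len;

        parts[n].node = i;
        parts[n].start = start;
        parts[n].len = chunk;
        n++;

        start += chunk;             /* continue in the next node, if any */
        len -= chunk;
    }
    return len == 0 ? n : 0;
}
```

If every node is at least as large as the biggest possible request, a call with max_parts set to 2 never fails for a request that lies inside the storage: the request either fits in one node or crosses exactly one boundary.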
Each node has either a block device behind it (so it will just call generic_make_request() with a different block device for the given bio) or a network state machine.
The network state will have two threads: RX and TX. The receiving one gets replies to the read/write messages, finds the matching transaction and completes it. In the case of a DST server it will also handle read/write requests and generate replies. The processing code is exactly the same on both sides: a client node also has the switch that handles read/write requests coming from the network, but only a server should ever receive them.
The sending thread is trickier. It is a fallback for the non-blocking sockets, which are tried first at generic_make_request() time, i.e. when a higher-level user performs a read or write. If the block was not fully sent, it is queued to this thread, which will try to send the rest of the data when polling allows. The ->make_request_fn() function returns in this case, and the higher layer can proceed with its own operations.
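The same "try non-blocking first, hand the rest to a sender thread" idea can be sketched in userspace with ordinary sockets. This is only an analogy to the kernel code. The names pending_tx, send_or_queue() and tx_thread_step() are invented here, and the sketch assumes the caller keeps the buffer alive until everything is sent (in the kernel the bio pages would play that role).

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical record of data that still has to go out on a socket. */
struct pending_tx {
    const char *data;   /* next byte to send (points into caller's buffer) */
    size_t      len;    /* bytes still unsent */
};

/* Push as much as the socket accepts right now; never blocks.
 * Returns 0 on progress or would-block, -1 on a hard error. */
static int push_nonblock(int fd, struct pending_tx *p)
{
    while (p->len > 0) {
        ssize_t n = send(fd, p->data, p->len, MSG_DONTWAIT | MSG_NOSIGNAL);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;           /* socket buffer is full, try later */
            return -1;
        }
        p->data += n;
        p->len -= (size_t)n;
    }
    return 0;
}

/* Fast path, called in the submitter's context.
 * Returns 0 if everything was sent, 1 if the remainder was left
 * in *p for the sender thread, -1 on error. */
static int send_or_queue(int fd, const void *buf, size_t len,
                         struct pending_tx *p)
{
    p->data = buf;
    p->len = len;
    if (push_nonblock(fd, p) < 0)
        return -1;
    return p->len ? 1 : 0;
}

/* One iteration of the sender thread: wait until the socket is
 * writable (or the timeout expires), then push more data.
 * Returns 0 when drained, 1 if data is still pending, -1 on error. */
static int tx_thread_step(int fd, struct pending_tx *p, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int r = poll(&pfd, 1, timeout_ms);

    if (r < 0)
        return errno == EINTR ? 1 : -1;
    if (r > 0 && (pfd.revents & (POLLERR | POLLHUP)))
        return -1;
    if (r > 0 && push_nonblock(fd, p) < 0)
        return -1;
    return p->len ? 1 : 0;
}
```

The submitter calls send_or_queue() and returns immediately in all cases. Only when it reports a queued remainder does the sender thread get involved, calling tx_thread_step() in a loop until it returns 0.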
A transaction is not freed until a reply is received from the remote side or the resend retry count runs out.
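A rough sketch of that lifetime rule as a tiny state machine follows. All names here (trans, trans_on_reply(), trans_on_timeout(), trans_find()) are hypothetical, and the linear lookup by id merely stands in for whatever search structure the real RX thread would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum trans_state { TRANS_IN_FLIGHT, TRANS_DONE, TRANS_FAILED };

/* Hypothetical transaction record. */
struct trans {
    uint64_t         id;            /* matches the id in the reply */
    unsigned         retries_left;  /* resends still allowed */
    enum trans_state state;
};

static void trans_init(struct trans *t, uint64_t id, unsigned max_retries)
{
    t->id = id;
    t->retries_left = max_retries;
    t->state = TRANS_IN_FLIGHT;
}

/* RX path: find the in-flight transaction a reply belongs to.
 * Returns NULL for unknown or already finished transactions. */
static struct trans *trans_find(struct trans *table, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].id == id && table[i].state == TRANS_IN_FLIGHT)
            return &table[i];
    return NULL;
}

/* A reply arrived. Returns true if the transaction was completed by
 * this call and may now be freed. */
static bool trans_on_reply(struct trans *t)
{
    if (t->state != TRANS_IN_FLIGHT)
        return false;
    t->state = TRANS_DONE;
    return true;
}

/* No reply in time. Returns true if the data should be resent; returns
 * false once the transaction is finished, which includes the case where
 * the retry budget has just run out and it is marked as failed. */
static bool trans_on_timeout(struct trans *t)
{
    if (t->state != TRANS_IN_FLIGHT)
        return false;
    if (t->retries_left == 0) {
        t->state = TRANS_FAILED;
        return false;
    }
    t->retries_left--;
    return true;
}
```

In this sketch a transaction leaves the TRANS_IN_FLIGHT state, and so becomes eligible for freeing, through only two paths: a reply, or a timeout with no retries left.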
A transaction is always allocated (from the appropriate memory pool), and that is actually the only allocation in DST itself. When it works with block devices, it may have to clone a bio when the request crosses a node boundary (or maybe even always, I have to check; that is essentially what device mapper does, with lots of its own additional allocations), but that should be a very rare condition.
The network stack will do its own allocations too.

That was the theory. Practice tells me that essentially 90% of the code should be rewritten from scratch, so I recloned the tree and have so far implemented the generic bits: registering a block device, creating various sysfs files and directories, and other similarly trivial things. I still plan to finish it this weekend (without mirroring), but things may turn out differently...

/devel/dst