Zbr's days.

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Sat, 18 Nov 2006

New generic wildcard search design.


Well, it is not new, but it is different from what was described before. And although previous variant has higher lookup speed, it requires too much memory, so it is not acceptible.

New one is quite simple - it will be implemented on top of trie structures. Main feature of tries is that they are quite compact (in case of PATRICIA tries) and allow W^d lookup speed in the worst case, where W is width of the interval in bits (i.e. 32 bits in current netchannels implementation, which does not support IPv6 due to my laziness and absence of requirements feedback from the community), and d is number of dimensions (5 currently), which results in very bad number of 33.5 millions operations in the worst case of (2^W)*(2^d) == 2^(W+d) (137.5 milliards) rules.
Actually with netchannels it will be much smaller since only two dimensions use full 32 bit space (source and destination addresses), while others use either 16 bits (source and destination ports) or 8 bits (protocol number), so it will be something about 2 millions operations.

Basic idea behind tries is following: trie is a simple unbalanced tree each node of which represent only one bit (i'th level of the tree represent bits in i-1 position in the searched value), so it's maximum depth is 32 (in PATRICIA trie several bits can be combined into the same node if there are no other rules, which have different bits in that positions). In that case rule 192.168.0.0 - 192.168.0.255 will be presented as 0b11000000101010000000000000000000(192.168.0.0) - 0b11000000101010000000000011111111(192.168.0.255), so actually trie will be stopped at bit 24 (counted from the most significant one) and the rest will present a wildcard (0-255 in the least significant byte in host order) or it will contain additional nodes if there are additional rules which has bits in that range, for example 192.168.0.0-192.168.0.20.
Each direction is a base for own tree.
Each node contains a pointer to the subtree of the next-dimension tree, where its rule starts. So when each node is examined (i.e. each bit is checked) in the first dimension and bit is found to to belong to some rule, next dimension is checked from the bit position specified in the node and this is being done recursively for each dimension. If in some dimension requested bit is not found in the trie (where it should be by the previous dimension rule), it means that there are no rules for requested values.

In the previous memory-hungry design, if rules do not have ranges but only single tuples (i.e. single source/destination IP address and single ports and protocol), or if all rules present the same range (for example all port rules have the same 1-65535 range), it can not lead to the described problem with memory usage. The more range overlaps rulset has, the more memory such algorithm will use - in theory, when each new rule requires interval split, i.e. when each new rule overlaps with all previous and has additional range (for example when each new rule is slightly bigger than previous one), it is O(d*W*(N^d)) where N is number of rules and d is number of directions, W is width in bits of each dimension (actually it can be hidden by O(), since it is a constant for different set of rules).

/devel/networking :: Link / Comments ()