Skip to main content

Boeing B Tree's

In July 1970 Rudolf Bayer and Mcreight from Boeing Scientific research laboratories published the original B-Trees paper in the mathematical and information sciences report.  They never said what BTrees stand for, and it could well be Boeing Trees though Balanced Tree's is also a good reminder of what they are.

When studying BTrees before getting to the actual algorithm, wouldn't it be nice if we understand the exact motivation of the people who actually published the paper, the original people who thought of this idea.



They explain it in their paper, first just remember we are talking about 1970, this was before most of us were born, computers were slow, really slow.

In the paper they said that they are working on the organization and maintenance of index for dynamic random access file.  Let's dissect it, we know what index is, this is a thing that allows us to locate fast items in our data without scanning the whole data right.  So they want to understand and suggest better how to organize and maintain such an index.

So one of the key issues in btrees is not only search but sequential access to items once you have the page! if I'm in a node and this node has 10 keys then I can sequentially get the keys from smaller to larger in that page.

They use a random access backup store - I'm emphasizing this, they are talking about a backup store so this must be very slow they say it's like a disc or a drum.

They want to achieve from this index, retrieval, insertion and deletion of keys which all would be in logarithmic time relative to the size of the index.

In the actual experiments they did they had up to 100K indices, and they maintained it with their BTree method which they are going to present with at least 4 transactions per second, on an IBM 360/44 with a 2311 disc.

So in this paper they talk about how to dynamically maintain this index, using random access files, doing key deletion insertion retrieval paging, why paging? Because they can't hold everything in memory, so they need paging, this is a core concept of the BTree the fact you can't hold the whole index in memory and you need to page in and out.


So let's think what we want of such a datastructure and this would enable us to build this BTree before knowing what this BTree is about, at least except for the algorithms of insertion and deletion etc but mainly of the core structure of this BTree.

We want to be able to search in logarithmic time, we need therefore a kind of search tree, but in standard search tree if you insert all items in a linearly increasing way you might end up with a simple line tree, so we need to rebalance this tree we must otherwise if we don't rebalance this tree then it won't be logarithmic insertion deletion and search.

In addition, as our memory is limited, we want to store part of the index in memory, once we store some part of the tree in memory let's call it a page even if it contains multiple keys ordered, we could just linearly go through them.

Now why would we want more than one item in a single page, we not only 2 , because we store it on disk, once you store stuff on disk it becomes slow much slower than memory, the fact that disk is slower than memory means we want to store them sequentially so that at least moving from one item to another in a disk would be fast because we only need to move the head of the disk a little bit further and not scan or move through the whole thing, or even better if it's on the same block we could just read this page, and we read multiple items.

So now we understand that the tree need to be rebalanced and also that we want to store blocks so meaning each node of the tree would store multiple keys and not just once so that we are efficient with disc.

Within each page P the keys are sequential in increasing order x1, x2, x3, except for the root page, each such page contains l+1 pointers to p0,p1, ... pl to the sons of P.

I guess this was an impolitely correct era as they call the children sons! :)

And now to the actual algorithm for insertion deletion etc.

We want to avoid a whole line in tree we want it to be balanced tree not an unbalanced tree.

Unbalanced would be O(n) and balanced should be O(log(n)).

For search insert and delete operations.

Each note in BTree have more than 1 key.  for example root could have 2 keys and children of root should, could have more than 2 like 3 keys.

Each node can have 3 nodes, left would be smaller than all keys in current parent, all nodes in middle node would be like both smaller and larger meaning you could fit them in the middle of the current parent.
All nodes of the right child all bigger than the current parent node.

You have k+1 children for each node if a new has 2 keys then you would have 3 children, one more of children then the number of keys in each node.

All leaves would appear in same level in BTree we wanted balancing tree.


Insert
So you want to insert a new node you start from root and find the place, but if node is full you split the node to two, one node would have the larger values, the medium value would be moved up the tree.

Each BTree tree has what's called a degree, the degree is max children a node can have and it can have degree-1 keys in each node.

You insert a new number you move down the tree to find where it fits, if it's full you split and promote the median to parent node.

If you try to prompt and move the new key to parent node then you have to split the parent, but then once parent is being split if you can't just split it then you need to increase the height of the tree by one by creating new nodes and splitting and repeating this process.

So you first find the first leaf node you can insert, if can't insert it to this leaf split the node and then you could split parent and split grandparent etc.

Let's see an example let's create a btree of degree 3 which means 2 keys in each node and 3 children for each node.

We insert 5 we have a single root node with 5 all is fine.
Insert 10 - still all is fine we just have still the root node with both 5 and 10 keys however as the degree is 3 next time we want to insert a node we would need to split the root.

we insert 15 we split the root node which had 5 and 10 now it has only 10 which is median and on left child node we have 5 and on right children nodes we have 15.

Add 7 we start from root node 7 is smaller than 10, so we move below and add it inside the key sub child of 5 so left children has both 5 and 7

And now we add 4 it's less than root node of 10, so we move to the left which has 5 and 7 keys we add it but now we have 3 keys, so we split the node, and we pass the 5 which is the median to the parent.

So now our root node has 2 keys! 5 and 10! And 4 and 7 are on the left sub child.

To sum up

B trees is a self balancing tree data structure that mains search sequential access insertions and deletions in logarithmic time.


Comments

Popular posts from this blog

Functional Programming in Scala for Working Class OOP Java Programmers - Part 1

Introduction Have you ever been to a scala conf and told yourself "I have no idea what this guy talks about?" did you look nervously around and see all people smiling saying "yeah that's obvious " only to get you even more nervous? . If so this post is for you, otherwise just skip it, you already know fp in scala ;) This post is optimistic, although I'm going to say functional programming in scala is not easy, our target is to understand it, so bare with me. Let's face the truth functional programmin in scala is difficult if is difficult if you are just another working class programmer coming mainly from java background. If you came from haskell background then hell it's easy. If you come from heavy math background then hell yes it's easy. But if you are a standard working class java backend engineer with previous OOP design background then hell yeah it's difficult. Scala and Design Patterns An interesting point of view on scala, is

Alternatives to Using UUIDs

  Alternatives to Using UUIDs UUIDs are valuable for several reasons: Global Uniqueness : UUIDs are designed to be globally unique across systems, ensuring that no two identifiers collide unintentionally. This property is crucial for distributed systems, databases, and scenarios where data needs to be uniquely identified regardless of location or time. Standardization : UUIDs adhere to well-defined formats (such as UUIDv4) and are widely supported by various programming languages and platforms. This consistency simplifies interoperability and data exchange. High Collision Resistance : The probability of generating duplicate UUIDs is extremely low due to the combination of timestamp, random bits, and other factors. This collision resistance is essential for avoiding data corruption. However, there are situations where UUIDs may not be the optimal choice: Length and Readability : UUIDs are lengthy (typically 36 characters in their canonical form) and may not be human-readable. In URLs,

Dev OnCall Patterns

Introduction Being On-Call is not easy. So does writing software. Being On-Call is not just a magic solution, anyone who has been On-Call can tell you that, it's a stressful, you could be woken up at the middle of the night, and be undress stress, there are way's to mitigate that. White having software developers as On-Calls has its benefits, in order to preserve the benefits you should take special measurements in order to mitigate the stress and lack of sleep missing work-life balance that comes along with it. Many software developers can tell you that even if they were not being contacted the thought of being available 24/7 had its toll on them. But on the contrary a software developer who is an On-Call's gains many insights into troubleshooting, responsibility and deeper understanding of the code that he and his peers wrote. Being an On-Call all has become a natural part of software development. Please note I do not call software development software engineering b