Patricia Tree
Contents
 Modified Merkle Patricia Trie Specification (also Merkle Patricia Tree)
 Main specification: Merkle Patricia Trie
Modified Merkle Patricia Trie Specification (also Merkle Patricia Tree)
Merkle Patricia tries provide a cryptographically authenticated data structure that can be used to store all (key, value) bindings, although for the scope of this paper we are restricting keys and values to strings (to remove this restriction, just use any serialization format for other data types). They are fully deterministic, meaning that a Patricia trie with the same (key,value) bindings is guaranteed to be exactly the same down to the last byte and therefore have the same root hash, provide the holy grail of O(log(n)) efficiency for inserts, lookups and deletes, and are much easier to understand and code than more complex comparisonbased alternatives like redblack tries.
Preamble: Basic Radix Tries
In a basic radix trie, every node looks as follows:
[i0, i1 ... in, value]
Where i0 ... in
represent the symbols of the alphabet (often binary or hex), value
is the terminal value at the node, and the values in the i0 ... in
slots are either NULL
or pointers to (in our case, hashes of) other nodes. This forms a basic (key, value) store; for example, if you are interested in the value that is currently mapped to dog
in the trie, you would first convert dog
into letters of the alphabet (giving 64 6f 67
), and then descend down the trie following that path until at the end of the path you read the value. That is, you would first look up the root hash in a flat key/value DB to find the root node of the trie (which is basically an array of keys to other nodes), use the value at index 6
as a key (and look it up in the flat key/value DB) to get the node one level down, then pick index 4
of that to lookup the next value, then pick index 6
of that, and so on, until, once you followed the path: root > 6 > 4 > 6 > 15 > 6 > 7
, you look up the value of the node that you have and return the result.
Note there is a difference between looking something up in the "trie" vs the underlying flat key/value "DB". They both define key/values arrangements, but the underlying DB can do a traditional 1 step lookup of a key, while looking up a key in the trie requires multiple underlying DB lookups to get to the final value as described above. To eliminate ambiguity, let's refer to the latter as a path
.
The update and delete operations for radix tries are simple, and can be defined roughly as follows:
def update(node_hash, path, value):
if path == '':
curnode = db.get(node_hash) if node else [ NULL ] * 17
newnode = curnode.copy()
newnode[1] = value
else:
curnode = db.get(node_hash) if node else [ NULL ] * 17
newnode = curnode.copy()
newindex = update(curnode[path[0]], path[1:], value)
newnode[path[0]] = newindex
db.put(hash(newnode), newnode)
return hash(newnode)
def delete(node_hash, path):
if node_hash is NULL:
return NULL
else:
curnode = db.get(node_hash)
newnode = curnode.copy()
if path == '':
newnode[1] = NULL
else:
newindex = delete(curnode[path[0]],path[1:])
newnode[path[0]] = newindex
if len(filter(x > x is not NULL, newnode)) == 0:
return NULL
else:
db.put(hash(newnode), newnode)
return hash(newnode)
The "Merkle" part of the radix trie arises in the fact that a deterministic cryptographic hash of a node is used as the pointer to the node (for every lookup in the key/value DB key == sha3(rlp(value))
, rather than some 32bit or 64bit memory location as might happen in a more traditional trie implemented in C. This provides a form of cryptographic authentication to the data structure; if the root hash of a given trie is publicly known, then anyone can provide a proof that the trie has a given value at a specific path by providing the nodes going up each step of the way. It is impossible for an attacker to provide a proof of a (path, value) pair that does not exist since the root hash is ultimately based on all hashes below it, so any modification would change the root hash.
While traversing a path 1 nibble at a time as described above, most nodes contain a 17element array. 1 index for each possible value held by the next hex character (nibble) in the path, and 1 to hold the final target value in the case that the path has been fully traversed. These 17element array nodes are called branch
nodes.
Main specification: Merkle Patricia Trie
However, radix tries have one major limitation: they are inefficient. If you want to store just one (path,value) binding where the path is (in the case of the ethereum state trie), 64 characters long (number of nibbles in bytes32
), you will need over a kilobyte of extra space to store one level per character, and each lookup or delete will take the full 64 steps. The Patricia trie introduced here solves this issue.
Optimization
Merkle Patricia tries solve the inefficiency issue by adding some extra complexity to the data structure. A node in a Merkle Patricia trie is one of the following:

NULL
(represented as the empty string) 
branch
A 17item node[ v0 ... v15, vt ]

leaf
A 2item node[ encodedPath, value ]

extension
A 2item node[ encodedPath, key ]
With 64 character paths it is inevitable that after traversing the first few layers of the trie, you will reach a node where no divergent path exists for at least part of the way down. It would be naive to require such a node to have empty values in every index (one for each of the 16 hex characters) besides the target index (next nibble in the path). Instead we shortcut the descent by setting up an extension
node of the form [ encodedPath, key ]
, where encodedPath
contains the "partial path" to skip ahead (using compact encoding described below), and the key
is for the next db lookup.
In the case of a leaf
node, which can be determined by a flag in the first nibble of encodedPath
, the situation above occurs and also the "partial path" to skip ahead completes the full remainder of a path. In this case value
is the target value itself.
The optimization above however introduces some ambiguity.
When traversing paths in nibbles, we may end up with an odd number of nibbles to traverse, but because all data is stored in bytes
format, it is not possible to differentiate between, for instance, the nibble 1
, and the nibbles 01
(both must be stored as <01>
). To specify odd length, the partial path is prefixed with a flag.
Specification: Compact encoding of hex sequence with optional terminator
The flagging of both odd vs. even remaining partial path length and leaf vs. extension node as described above reside in the first nibble of the partial path of any 2item node. They result in the following:
hex char bits  node type partial path length

0 0000  extension even
1 0001  extension odd
2 0010  terminating (leaf) even
3 0011  terminating (leaf) odd
For even remaining path length (0
or 2
), another 0
"padding" nibble will always follow.
def compact_encode(hexarray):
term = 1 if hexarray[1] == 16 else 0
if term: hexarray = hexarray[:1]
oddlen = len(hexarray) % 2
flags = 2 * term + oddlen
if oddlen:
hexarray = [flags] + hexarray
else:
hexarray = [flags] + [0] + hexarray
// hexarray now has an even length whose first nibble is the flags.
o = ''
for i in range(0,len(hexarray),2):
o += chr(16 * hexarray[i] + hexarray[i+1])
return o
Examples:
> [ 1, 2, 3, 4, 5, ...]
'11 23 45'
> [ 0, 1, 2, 3, 4, 5, ...]
'00 01 23 45'
> [ 0, f, 1, c, b, 8, 10]
'20 0f 1c b8'
> [ f, 1, c, b, 8, 10]
'3f 1c b8'
Here is the extended code for getting a node in the Merkle Patricia trie:
def get_helper(node,path):
if path == []: return node
if node = '': return ''
curnode = rlp.decode(node if len(node) < 32 else db.get(node))
if len(curnode) == 2:
(k2, v2) = curnode
k2 = compact_decode(k2)
if k2 == path[:len(k2)]:
return get_helper(v2, path[len(k2):])
else:
return ''
elif len(curnode) == 17:
return get_helper(curnode[path[0]],path[1:])
def get(node,path):
path2 = []
for i in range(len(path)):
path2.push(int(ord(path[i]) / 16))
path2.push(ord(path[i]) % 16)
path2.push(16)
return get_helper(node,path2)
Example Trie
Suppose we want a trie containing four path/value pairs ('do', 'verb')
, ('dog', 'puppy')
, ('doge', 'coin')
, ('horse', 'stallion')
.
First, we convert both paths and values to bytes
. Below, actual byte representations for paths are denoted by <>
, although values are still shown as strings, denoted by ''
, for easier comprehension (they, too, would actually be bytes
):
<64 6f> : 'verb'
<64 6f 67> : 'puppy'
<64 6f 67 65> : 'coin'
<68 6f 72 73 65> : 'stallion'
Now, we build such a trie with the following key/value pairs in the underlying DB:
rootHash: [ <16>, hashA ]
hashA: [ <>, <>, <>, <>, hashB, <>, <>, <>, hashC, <>, <>, <>, <>, <>, <>, <>, <> ]
hashC: [ <20 6f 72 73 65>, 'stallion' ]
hashB: [ <00 6f>, hashD ]
hashD: [ <>, <>, <>, <>, <>, <>, hashE, <>, <>, <>, <>, <>, <>, <>, <>, <>, 'verb' ]
hashE: [ <17>, hashF ]
hashF: [ <>, <>, <>, <>, <>, <>, hashG, <>, <>, <>, <>, <>, <>, <>, <>, <>, 'puppy' ]
hashG: [ <35>, 'coin' ]
Where for example rootHash = <59 91 bb 8c 65 14 14 8a 29 db 67 6a 14 ac 50 6c d2 cd 57 75 ac e6 3c 30 a4 fe 45 77 15 e9 ac 84>
.
When one node is referenced inside another node, what is included is H(rlp.encode(x))
, where H(x) = sha3(x) if len(x) >= 32 else x
and rlp.encode
is the RLP encoding function.
Note that when updating a trie, one needs to store the key/value pair (sha3(x), x)
in a persistent lookup table if the newlycreated node has length >= 32. However, if the node is shorter than that, one does not need to store anything, since the function f(x) = x is reversible.
Tries in Ethereum
All of the merkle tries in Ethereum use a Merkle Patricia Trie.
From a block header there are 3 roots from 3 of these tries.
 stateRoot
 transactionsRoot
 receiptsRoot
State Trie
There is one global state trie, and it updates over time. In it, a path
is always: sha3(ethereumAddress)
and a value
is always: rlp(ethereumAccount)
. More specifically an ethereum account
is a 4 item array of [nonce,balance,storageRoot,codeHash]
. At this point it's worth noting that this storageRoot
is the root of another patricia trie:
Storage Trie
Storage trie is where all contract data lives. There is a separate storage trie for each account. To calculate a 'path' in this trie first understand how solidity organizes a variable's position. To get the path
there is one extra hashing (the link does not describe this). For instance the path
for the zeroith variable is sha3(<0000000000000000000000000000000000000000000000000000000000000000>)
which is always 290decd9548b62a8d60345a988386fc84ba6bc95484008f6362f93160ef3e563
. The value at the leaf is the rlp encoding of the storage value 1234
(0x04d2
), which is <82 04 d2>
.
For the mapping example, the path
is sha3(<6661e9d6d8b923d5bbaab1b96e1dd51ff6ea2a93520fdc9eb75d059238b8c5e9>)
Transactions Trie
There is a separate transactions trie for every block. A path
here is: rlp(transactionIndex)
. transactionIndex
is its index within the block it's mined. The ordering is mostly decided by a miner so this data is unknown until mined. After a block is mined, the transaction trie never updates.
Receipts Trie
Every block has its own Receipts trie. A path
here is: rlp(transactionIndex)
. transactionIndex
is its index within the block it's mined. Never updates.