# Just-in-Time EVM Calldata Decoding

By [Alex Miller](https://paragraph.com/@alexmiller) · 2022-07-15

---

Displaying raw, [ABI](https://docs.soliditylang.org/en/latest/abi-spec.html)\-encoded calldata has been a drag on web3 user experience for years. It should be obvious that most users cannot read blocks of hex data; I [wrote about this](https://blog.gridplus.io/readable-ethereum-transactions-a-new-standard-945c5e9ef2c7) in early 2021 when [GridPlus](https://gridplus.io) introduced a new contract readability feature. We have recently [replaced](https://github.com/GridPlus/lattice-firmware-history#v0150) this with a “just-in-time” calldata decoder, which takes advantage of self-validation in the ABI spec. With this new approach, transaction requests may include “decoder data”, which is used to mark down the calldata in a more readable way _in the same request_. This means users no longer need to pre-load ABI definitions, which is a significant UX improvement. It also saves a good bit of code space on our secure microcontroller, which always makes me happy.

Being more of an engineer than a theoretician, I designed this decoder-data encoding protocol by trial and error, so I cannot guarantee its correctness. However, I think the design is pretty good and figured it might be useful for other wallet teams to see.

Basically, the goal is to generate a small piece of data that can be included with transaction calldata to display much more human-readable information on a signing screen. Because naming things is fun, I will term this **Calldata Decoder Data** (**CDD**) and will soon outline a protocol to encode it.

The EVM Function Definition
---------------------------

As many readers will know, Ethereum and all other EVM chains build contract function calldata with a bespoke [ABI encoding protocol](https://docs.soliditylang.org/en/v0.8.13/abi-spec.html#contract-abi-specification). This protocol is interrelated with a much simpler, separate protocol which is used to build strings representing the function’s name and parameter set (a.k.a. “function signature”). Since this latter protocol doesn’t seem to have a name, I will call it “**FSB**” for “function signature builder”. I will also refer to a “function signature” as **fSig** from now on.

> **NOTE:** The ABI docs do kind of [mention](https://docs.soliditylang.org/en/latest/abi-spec.html#function-selector) the FSB, though I found the specification underwhelming and will attempt to elaborate below. For posterity, the official “spec” says: _“The signature is defined as the canonical expression of the basic prototype without data location specifier, i.e. the function name with the parenthesised list of parameter types. Parameter types are split by a single comma - no spaces are used.”_

FSB and ABI are closely related because:

1.  FSB uses \[a subset of\] ABI types.
    
2.  The FSB output is an input to the ABI-encoded calldata.
    

FSB builds a string (fSig) to represent the function using the following serialization:

    "${functionName}(${param0Type},…,${paramNType})"
    

This results in something like `myFunction(address,bytes32,uint256)`. Note the absence of spaces!

It is important to note that only a **subset** of ABI types can be used in building fSigs. These are called “**canonical types**”. To be fair, it is a very large subset, but there are a few [exceptions](https://github.com/ethereum/eth-abi/blob/master/eth_abi/grammar.py) that require conversion from their loosey-goosey Solidity types:

*   `int` → `int256`
    
*   `uint` → `uint256`
    
*   `fixed` → `fixed128x18`
    
*   `ufixed` → `ufixed128x18`
    
*   `function` → `bytes24`
    
*   `byte` → `bytes1`
    

> **NOTE:** I have never used `fixed` or `ufixed` types and have never seen them in the wild, so they are not part of the GridPlus CDD encoding protocol that I will present shortly, though they could be added in a future version.
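
As a rough illustration of the FSB serialization and the alias conversions listed above, here is a minimal TypeScript sketch. The names `CANONICAL`, `canonicalType`, and `buildFSig` are made up for this example (they are not taken from the GridPlus SDK), and the sketch assumes flat, non-tuple parameter types.

```typescript
// Alias -> canonical type, copied from the conversion list above.
const CANONICAL: { [alias: string]: string } = {
  int: "int256",
  uint: "uint256",
  fixed: "fixed128x18",
  ufixed: "ufixed128x18",
  function: "bytes24",
  byte: "bytes1",
};

// Canonicalize one non-tuple type, keeping any array suffix such as "[]" or "[2][5]".
function canonicalType(solType: string): string {
  const m = /^([a-zA-Z0-9]+)((?:\[[0-9]*\])*)$/.exec(solType.trim());
  if (m === null) throw new Error("unsupported type: " + solType);
  const base = m[1];
  const suffix = m[2];
  const canonical = CANONICAL.hasOwnProperty(base) ? CANONICAL[base] : base;
  return canonical + suffix;
}

// Hypothetical helper: join canonical types with single commas and no spaces.
function buildFSig(functionName: string, paramTypes: string[]): string {
  const types: string[] = [];
  for (let i = 0; i < paramTypes.length; i++) types.push(canonicalType(paramTypes[i]));
  return functionName + "(" + types.join(",") + ")";
}

const fSig = buildFSig("myFunction", ["address", "bytes32", "uint"]);
// fSig === "myFunction(address,bytes32,uint256)"
```

Tuple parameters would need a recursive version of `canonicalType`; that is left out here to keep the sketch short.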

### Building/Validating Calldata

Once you have the fSig string constructed, you need to hash it with `keccak256` and take the first four bytes of that hash. This is called the “**function selector**” and it prefixes EVM transaction calldata. Simply put, the function selector **is** the first four bytes of transaction calldata; the rest is param data serialized using ABI. Therefore, the only purpose of FSB is to build that function selector.

This is an important relationship because it means EVM calldata is, in a sense, **self-validating**. You cannot construct a function selector of e.g. `myFunction(address,bool)` and get away with throwing a dynamic `bytes` buffer into the calldata - that will not pass network consensus! It also gives wallets a lot of useful sanity checks, such as making sure that `bool` param value is not some large integer in the calldata. Of course there are situations where these checks do not hold, but they are still, on the whole, pretty helpful.
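
To make the layout concrete, the sketch below splits calldata into its selector and parameter words and runs the `bool` sanity check mentioned above. It assumes standard (non-packed) ABI encoding, so everything after the selector is a whole number of 32-byte words. The function names and the `aabbccdd` selector are invented for illustration; the selector is not the real hash of any fSig.

```typescript
// Split 0x-prefixed calldata into the 4-byte selector and the 32-byte words after it.
// Assumption: standard ABI encoding, so the param data is word-aligned.
function splitCalldata(calldata: string): { selector: string; words: string[] } {
  const hex = calldata.replace(/^0x/, "").toLowerCase();
  if (!/^[0-9a-f]*$/.test(hex) || hex.length < 8 || (hex.length - 8) % 64 !== 0) {
    throw new Error("expected a 4-byte selector plus whole 32-byte words");
  }
  const words: string[] = [];
  for (let i = 8; i < hex.length; i += 64) words.push(hex.slice(i, i + 64));
  return { selector: hex.slice(0, 8), words: words };
}

// An ABI-encoded bool is a 32-byte word whose value is exactly 0 or 1.
function isValidBoolWord(word: string): boolean {
  return /^0{63}[01]$/.test(word);
}

// Helper so the example does not rely on hand-counted zeros.
function zeros(n: number): string {
  return new Array(n + 1).join("0");
}

// Calldata for a hypothetical `myFunction(address,bool)` call.
// "aabbccdd" is a made-up selector, NOT the real hash of that fSig.
const exampleCalldata =
  "0xaabbccdd" +
  zeros(24) + "0123456789012345678901234567890123456789" + // address, left-padded
  zeros(63) + "1"; // bool `true`

const parsed = splitCalldata(exampleCalldata);
const boolLooksSane = isValidBoolWord(parsed.words[1]); // true for this example
```

A wallet could refuse to display (or flag) a request where a word that the fSig says is a `bool` fails this kind of check.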

Now yes yes, I am aware that 4 bytes is not very large; with only 2^32 (about 4.3 billion) possible selectors, there is a roughly 0.000000023% chance of collision on any two random fSigs. But it’s big enough to still be very useful in practice and since the param structure in calldata always needs to match the fSig, attacks in this domain are pretty limited (they generally require deploying a separate contract and changing the function name to some random colliding value).

> **NOTE:** Lattice users are reminded to always use [address tags](https://docs.gridplus.io/lattice-manager/address-tags) with high value contract interactions. A full user-based sanity check would involve validating the address, function name, and param order/values. Tags make this much easier.

Building Encoded CDD
--------------------

We seek to construct a minimal piece of data such that we can rebuild a transaction’s fSig and decode calldata parameters into individual values for better readability on the wallet interface (e.g. a [Lattice1](https://gridplus.io/lattice) secure screen).

At a high level, we need to find each individual parameter and describe it in the context of all other parameters. We call the individual parameter “atomic” because it cannot be further reduced. Don’t worry, this will be clearer with examples.

### Atomic Parameters

Each canonical param type (e.g. `uint256`, `bytes8`, etc) can be encoded with a four-item array descriptor containing:

    [
      paramName,
      paramTypeIdx,
      size,
      arraySizes
    ]
    

*   `paramName` is a string representing the parameter name. Note that this is \[annoyingly\] **not** defined in the fSig, so its usage is left to the wallet (and integrations). Speaking for GridPlus: when we only have an fSig, we use `"#1"`, `"#2"`, etc. to name the parameters because, well, we don’t know what they’re called. If instead we are fetching a [Solidity JSON ABI](https://docs.ethers.io/v5/api/utils/abi/formats/#abi-formats--solidity) (which is what you get from Etherscan’s API), we use those param names here.
    
*   `paramTypeIdx` is an enum value (uint8 type) based on the following set of ABI-ish types: `[address, bool, uint, int, bytes, string, tuple]`. We use these as basic types that can be expanded to build the canonical type, if necessary. `address`, `bool`, and `string` are canonical types already, as is `bytes` in the case of a dynamic buffer type (as opposed to a fixed buffer, e.g. `bytes32`). `tuple` is a “meta type” that will be discussed in the next section.
    
*   `size` is a uint8 type used to further specify the canonical type if the `paramTypeIdx` enum value maps to `uint`, `int`, or `bytes`. _This value describes the param size in bytes, not bits!_ For example, if you wanted `bytes16`, you would have `paramTypeIdx=4` and `size=16`; likewise, `uint256` has `paramTypeIdx=2` and `size=32`. If the type is already canonical, you must use `size=0`.
    
*   `arraySizes` is an array type containing uint8 values that describe the dimension sizes of any arrays associated with this param. For example, `bool[1][5]` would have `arraySizes=[1, 5]`. Dynamic array dimensions are represented as `0`, so `bool[]` would have `arraySizes=[0]`. If no array sizes are used (e.g. `bool`), you should have `arraySizes=[]`, i.e. an empty array.
    

If a parameter can be described using _only_ these four values, we call it an “atomic parameter”. Some types (currently only `tuple`) cannot be described atomically and require additional rules.
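
Here is a small sketch of how an atomic descriptor could be derived from a canonical type string, following the four fields described above. `PARAM_TYPES`, `AtomicDescriptor`, and `atomicDescriptor` are names I made up for this example; the real parsers in the GridPlus SDK may be structured quite differently.

```typescript
// Enum order from the text: the index in this array is paramTypeIdx.
const PARAM_TYPES = ["address", "bool", "uint", "int", "bytes", "string", "tuple"];

// [paramName, paramTypeIdx, size, arraySizes]
type AtomicDescriptor = [string, number, number, number[]];

// Hypothetical helper: describe one canonical, non-tuple type such as "bytes16" or "bool[1][5]".
function atomicDescriptor(paramName: string, canonicalType: string): AtomicDescriptor {
  const m = /^(address|bool|uint|int|bytes|string)([0-9]*)((?:\[[0-9]*\])*)$/.exec(canonicalType);
  if (m === null) throw new Error("not an atomic canonical type: " + canonicalType);
  const base = m[1];
  const width = m[2] === "" ? 0 : parseInt(m[2], 10);
  let size = 0;
  if (base === "uint" || base === "int") {
    // Widths are written in bits in the fSig (uint256) but stored in bytes (32).
    if (width < 8 || width > 256 || width % 8 !== 0) throw new Error("bad width: " + canonicalType);
    size = width / 8;
  } else if (base === "bytes") {
    // bytesN is already in bytes; dynamic `bytes` has no width and keeps size=0.
    if (m[2] !== "" && (width < 1 || width > 32)) throw new Error("bad width: " + canonicalType);
    size = width;
  } else if (m[2] !== "") {
    throw new Error("unexpected width: " + canonicalType);
  }
  // Each "[n]" suffix is one dimension; "[]" (dynamic) is stored as 0.
  const arraySizes: number[] = [];
  const dimRe = /\[([0-9]*)\]/g;
  let dim = dimRe.exec(m[3]);
  while (dim !== null) {
    arraySizes.push(dim[1] === "" ? 0 : parseInt(dim[1], 10));
    dim = dimRe.exec(m[3]);
  }
  return [paramName, PARAM_TYPES.indexOf(base), size, arraySizes];
}

const bytes16Descriptor = atomicDescriptor("#1", "bytes16");  // ["#1", 4, 16, []]
const boolArrayDescriptor = atomicDescriptor("#2", "bool[1][5]"); // ["#2", 1, 0, [1, 5]]
```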

### Non-Atomic Parameters

Atomic parameters are the basic building blocks of CDD but they are not sufficient because sometimes they must be nested, for example with `tuple` types. Fortunately, the solution for building non-atomic descriptors is pretty simple: **recursively fetch atomic descriptors and concatenate them.**

Take the following fSig params: `(uint256,(bool,address))`. This produces two descriptors:

*   `uint256` → `["#1", 2, 32, []]` (**atomic**)
    
*   `(bool,address)` → `["#2", 6, 0, []]` (**not atomic**)
    

The use of enum value `6` indicates this is a tuple, so when building the CDD we would need to recurse until we describe all nested atomic params. For this example, we would build the following nested atomic descriptors:

*   `bool` → `["#2-1", 1, 0, []]`
    
*   `address` → `["#2-2", 0, 0, []]`
    

These would be concatenated to the tuple’s descriptor like so:

    [
      ["#1", 2, 32, []],
      [
        "#2", 6, 0, [],
        ["#2-1", 1, 0, []],
        ["#2-2", 0, 0, []]
      ]
    ]
    

If there were additional tuples after we recursed, we would need to keep recursing. For example, fSig params `((bool,address),(bool,(bytes8[],bytes)[2],bool))` would lead to the following (unserialized) CDD:

    [
      [
        "#1", 6, 0, [],
        ["#1-1", 1, 0, []],
        ["#1-2", 0, 0, []]
      ],
      [
        "#2", 6, 0, [],
        ["#2-1", 1, 0, []],
        [
          "#2-2", 6, 0, [2],
          ["#2-2-1", 4, 8, [0]],
          ["#2-2-2", 4, 0, []]
        ],
        ["#2-3", 1, 0, []]
      ]
    ]
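
The recursion can be sketched as follows, starting from parameters in the Solidity JSON ABI shape (where a tuple has type `tuple` or e.g. `tuple[2]` plus a `components` list). This is only an illustration under my own naming (`AbiParam`, `Descriptor`, `describe`, `describeParams`) and labeling scheme; it does no width validation and is not the Lattice implementation.

```typescript
// Shape of one input/component in a Solidity JSON ABI (only the fields used here).
interface AbiParam {
  type: string;            // e.g. "uint256", "bool[2]", "tuple", "tuple[2]"
  components?: AbiParam[]; // present when the base type is "tuple"
}

// Four header fields, followed by nested descriptors when the param is a tuple.
type Descriptor = Array<string | number | number[] | Descriptor>;

const BASE_TYPES = ["address", "bool", "uint", "int", "bytes", "string", "tuple"];

function describe(param: AbiParam, label: string): Descriptor {
  const m = /^(address|bool|uint|int|bytes|string|tuple)([0-9]*)((?:\[[0-9]*\])*)$/.exec(param.type);
  if (m === null) throw new Error("unsupported type: " + param.type);
  const base = m[1];
  const width = m[2] === "" ? 0 : parseInt(m[2], 10);
  const size = base === "bytes" ? width : width / 8; // bits -> bytes for uint/int
  const arraySizes: number[] = [];
  const dims = m[3].match(/\[[0-9]*\]/g) || [];
  for (let i = 0; i < dims.length; i++) {
    const n = dims[i].slice(1, -1);
    arraySizes.push(n === "" ? 0 : parseInt(n, 10)); // "[]" is stored as 0
  }
  const descriptor: Descriptor = [label, BASE_TYPES.indexOf(base), size, arraySizes];
  if (base === "tuple") {
    // Recurse: each component's descriptor is appended after the tuple's own fields.
    const components = param.components || [];
    for (let i = 0; i < components.length; i++) {
      descriptor.push(describe(components[i], label + "-" + (i + 1)));
    }
  }
  return descriptor;
}

function describeParams(params: AbiParam[]): Descriptor[] {
  const out: Descriptor[] = [];
  for (let i = 0; i < params.length; i++) out.push(describe(params[i], "#" + (i + 1)));
  return out;
}

// Params for `(uint256,(bool,address))`
const exampleDescriptors = describeParams([
  { type: "uint256" },
  { type: "tuple", components: [{ type: "bool" }, { type: "address" }] },
]);
// JSON.stringify(exampleDescriptors) ===
//   '[["#1",2,32,[]],["#2",6,0,[],["#2-1",1,0,[]],["#2-2",0,0,[]]]]'
```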
    

Serializing CDD
---------------

Because [Ethereum RLP](https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/) is an efficient and widely used protocol for EVM things, we use that for serializing. The result will contain all the information a decoder might need to deserialize and digest the definitions.

The function name is concatenated with the param set to produce the full CDD. For example, the fSig `myFunction(uint256,bool[2])` produces full CDD:

    [
      "myFunction",
      [
        ["#1", 2, 32, []],
        ["#2", 1, 0, [2]]
      ]
    ]
    

which RLP serializes into: `0xdb8a6d7946756e6374696f6ecfc68223310220c0c78223320180c102`
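
For readers who want to reproduce that hex string, below is a minimal RLP encoder sketch covering only what CDD needs: strings, small non-negative integers, and nested lists. It assumes names are ASCII (one character per byte) and that integers fit in a JavaScript safe integer; `RlpItem`, `rlpEncode`, and the other helper names are mine, and a production wallet would more likely rely on an existing RLP library.

```typescript
// A CDD item is a string, a small non-negative integer, or a list of items.
type RlpItem = string | number | RlpItem[];

function toHex(bytes: number[]): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) out += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
  return out;
}

// Big-endian bytes of a non-negative integer; zero becomes the empty byte string.
function intToBytes(n: number): number[] {
  const out: number[] = [];
  while (n > 0) {
    out.unshift(n % 256);
    n = Math.floor(n / 256);
  }
  return out;
}

// Length prefix: `offset + len` for payloads up to 55 bytes, otherwise
// `offset + 55 + (number of length bytes)` followed by the length itself.
function lengthPrefix(len: number, offset: number): number[] {
  if (len <= 55) return [offset + len];
  const lenBytes = intToBytes(len);
  return [offset + 55 + lenBytes.length].concat(lenBytes);
}

function rlpEncode(item: RlpItem): number[] {
  if (Array.isArray(item)) {
    let payload: number[] = [];
    for (let i = 0; i < item.length; i++) payload = payload.concat(rlpEncode(item[i]));
    return lengthPrefix(payload.length, 0xc0).concat(payload);
  }
  let bytes: number[];
  if (typeof item === "number") {
    bytes = intToBytes(item);
  } else {
    // Assumption: ASCII names, so one char code is one byte.
    bytes = [];
    for (let i = 0; i < item.length; i++) bytes.push(item.charCodeAt(i));
  }
  // A single byte below 0x80 is its own encoding.
  if (bytes.length === 1 && bytes[0] < 0x80) return bytes;
  return lengthPrefix(bytes.length, 0x80).concat(bytes);
}

const fullCdd: RlpItem = ["myFunction", [["#1", 2, 32, []], ["#2", 1, 0, [2]]]];
const serialized = "0x" + toHex(rlpEncode(fullCdd));
// serialized === "0xdb8a6d7946756e6374696f6ecfc68223310220c0c78223320180c102"
```

Note how the `size=0` field encodes as `0x80` (the empty string) and the empty `arraySizes` as `0xc0` (the empty list), which is part of why the result is only 28 bytes.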

Validating CDD
--------------

The serialized CDD above can be used to reconstruct the fSig and function selector, which can then be checked against the calldata of a transaction calling `myFunction(uint256,bool[2])`.

A wallet would receive an incoming request with both calldata and serialized CDD. The latter can be RLP-deserialized and the fSig can be reconstituted into a string using some logic that is outside of the scope of this article. The fSig is now hashed:

    > keccak256(myFunction(uint256,bool[2]))
    '91061af786aadc13d8e123a127b60be62170486ce5e1ba89bdff34d5be95bbb4'
    

The first four bytes of this should match the first four bytes of the transaction calldata. If they do, we have **validated the parameter types and function name** and can safely decode the calldata with our CDD.

Again, this validation does **not** extend to param names, since they are not covered in the FSB protocol. 🙄 So display of param names is left up to both the wallet and requester/integration.
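
The fSig reconstitution step could look something like the sketch below, which walks a deserialized CDD and rebuilds the type string, then compares selectors. It is one possible approach, not the Lattice firmware's logic. All names are hypothetical, and the hash is passed in as a `keccak256Hex` function (assumed to take the fSig string and return the digest as hex) because keccak256 is not available in the JavaScript standard library.

```typescript
type Descriptor = Array<string | number | number[] | Descriptor>;

const BASE_TYPES = ["address", "bool", "uint", "int", "bytes", "string", "tuple"];
const TUPLE_IDX = 6;

// Rebuild the canonical type string for one descriptor, recursing into tuples.
function typeString(d: Descriptor): string {
  const typeIdx = d[1] as number;
  const size = d[2] as number;
  const arraySizes = d[3] as number[];
  let out: string;
  if (typeIdx === TUPLE_IDX) {
    // Everything after the four header fields is a nested descriptor.
    const inner: string[] = [];
    for (let i = 4; i < d.length; i++) inner.push(typeString(d[i] as Descriptor));
    out = "(" + inner.join(",") + ")";
  } else {
    const base = BASE_TYPES[typeIdx];
    // size is in bytes: uint/int widths are written in bits, bytesN in bytes.
    out = size === 0 ? base : base + (base === "bytes" ? size : size * 8);
  }
  for (let i = 0; i < arraySizes.length; i++) {
    out += arraySizes[i] === 0 ? "[]" : "[" + arraySizes[i] + "]";
  }
  return out;
}

// Full CDD is [functionName, paramDescriptors].
function fSigFromCdd(cdd: [string, Descriptor[]]): string {
  const types: string[] = [];
  for (let i = 0; i < cdd[1].length; i++) types.push(typeString(cdd[1][i]));
  return cdd[0] + "(" + types.join(",") + ")";
}

// Compare the first four bytes of hash(fSig) with the first four bytes of calldata.
function selectorMatches(
  fSig: string,
  calldata: string,
  keccak256Hex: (input: string) => string // injected; any keccak256 implementation
): boolean {
  const expected = keccak256Hex(fSig).replace(/^0x/, "").slice(0, 8).toLowerCase();
  const actual = calldata.replace(/^0x/, "").slice(0, 8).toLowerCase();
  return actual.length === 8 && actual === expected;
}

const exampleCdd: [string, Descriptor[]] = [
  "myFunction",
  [
    ["#1", 2, 32, []],
    ["#2", 1, 0, [2]],
  ],
];
const rebuiltFSig = fSigFromCdd(exampleCdd);
// rebuiltFSig === "myFunction(uint256,bool[2])"
```

If `selectorMatches` returns `false`, the CDD does not describe the function being called and the wallet should fall back to showing raw calldata rather than a decoded view.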

CDD Size
--------

One nice thing about our CDD encoding protocol is that the data is relatively small, largely thanks to RLP. I pulled ~10,000 fSigs from [4byte](https://www.4byte.directory/) and built CDD data from each one (these use the param names “#1”, “#2”, etc). The average size is **32.18 bytes** and very few signatures are >50 bytes. This makes the CDD protocol useful in a constrained memory environment such as Lattice firmware.

![CDD size distribution](https://storage.googleapis.com/papyrus_images/95903ee1c88efb90b185aef2974ac6058f8ec3397077645b3f1efc184c2136d1.png)

CDD size distribution

There were of course a few outliers (>800 bytes 😳), but anything >150 bytes is extremely uncommon.

Limitations of GridPlus CDD
---------------------------

A list of fSigs that are decodable by current Lattice firmware ([v0.15.0](https://github.com/GridPlus/lattice-firmware-history#v0150)) can be found [here](https://github.com/GridPlus/gridplus-sdk/blob/v2.2.0/src/__test__/vectors.jsonc#L34). There are also two parsers that may be useful for reference: [fSig](https://github.com/GridPlus/gridplus-sdk/blob/v2.2.0/src/calldata/evm.ts#L37) and [Solidity JSON-ABI](https://github.com/GridPlus/gridplus-sdk/blob/v2.2.0/src/calldata/evm.ts#L11). There are a few notable limitations related to GridPlus’ implementation:

*   We do not use `fixed128x18` or `ufixed128x18` types because I’ve never seen them in the wild and did not want to add the complexity to our pure-C decoder lib for them. If demand exists in the future I will work on incorporating these types into the enum.
    
*   There remain some [“extreme” edge cases](https://github.com/GridPlus/gridplus-sdk/blob/v2.2.0/src/__test__/vectors.jsonc#L122) that still do not decode, though I believe this is related to constraints in the C decoder implementation (FYI writing this stuff in C is **not** fun). These edge cases are related to multi-layered tuples that contain arrays. Needless to say, these situations are uncommon - not sure if anyone will ever need something like `nestedTup(((bytes)[2],bool))`, but if they do we cannot support it yet.
    
*   Param names cannot be validated just-in-time (i.e. in the same request), though this is not related to GridPlus’ implementation specifically.
    

That said, the CDD decoding protocol can easily be extended and rewritten in a higher-level language such as TypeScript, which would probably be more robust.

Hopefully this was useful, or at least interesting.✌️

---

*Originally published on [Alex Miller](https://paragraph.com/@alexmiller/just-in-time-evm-calldata-decoding)*
