Farcaster is a decentralized social network. This post discusses why we opted to only support little-endian systems and UTF8-encoded strings at the protocol level. We originally made these decisions in early 2022, but are publishing it now (early 2025) for posterity.
Endianness refers to the order in which the bytes of a word are arranged in memory. Some quick definitions to get us on the same page:
Word: Unit of data. For 64-bit systems, a word is 8 bytes (8 bits/byte x 8 bytes = 64 bits)
Big-endian (BE): The most significant byte (i.e. the leftmost byte if written in binary notation left to right) of a word occupies the lowest memory address.
Little-endian (LE): The least significant byte (i.e. the rightmost byte if written in binary notation left to right) of a word occupies the lowest memory address.
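To make this concrete, here's a quick sketch using Node.js's Buffer (covered in more detail below), which lets you choose the byte order when writing a value. Writing the 32-bit value 0x12345678 both ways shows which byte lands at the lowest address:

> buf = Buffer.alloc(4)
<Buffer 00 00 00 00>
> buf.writeUInt32BE(0x12345678); buf // big-endian: most significant byte (0x12) first
<Buffer 12 34 56 78>
> buf.writeUInt32LE(0x12345678); buf // little-endian: least significant byte (0x78) first
<Buffer 78 56 34 12>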
Each CPU architecture decides which kind of endianness it uses. Some (like ARM) even allow bi-endianness in their instruction set.
Most of the time, the endianness of the underlying CPU architecture doesn’t matter because it is abstracted away from you. When you write code comparing two numbers, it will be compiled into the relevant instructions performing the comparison without you needing to tell it the endianness of the numbers represented.
However, some algorithms—like those used for hashing—need to inspect exact bits and perform comparisons/bit shifting based on that, especially for performance reasons. In cases like these, the data you pass in to the algorithm (the message bytes you want to hash) must be in a format where you’re consistently getting the result you expect.
Generally speaking, little-endian is preferable because it's trivial to cast between different data types since the least significant byte is always at the same starting memory address—this is partially why most computers in the world use it today.
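For example (a sketch, again using Node's Buffer): with a value stored little-endian, reading a narrower integer from the same starting offset simply yields its low-order bytes.

> buf = Buffer.from([0x78, 0x56, 0x34, 0x12]) // 0x12345678 stored little-endian
<Buffer 78 56 34 12>
> buf.readUInt32LE(0).toString(16)
'12345678'
> buf.readUInt16LE(0).toString(16) // same offset, narrower read: the low 16 bits
'5678'
> buf.readUInt8(0).toString(16) // same offset again: just the low byte
'78'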
TL;DR: Always use JavaScript's Uint8Array to represent a sequence of bytes. JS typed arrays use the endianness of the underlying CPU architecture, which is little-endian on the vast majority of devices in the world.
Generally you don’t need to worry about endianness in JS. However, any time you are manipulating raw bytes (which is the case with most crypto signatures and hashes) care is required.
JavaScript exposes abstractions for dealing with raw byte arrays. In increasing level of abstraction:
ArrayBuffer: array of raw bytes (length is the number of bytes).
TypedArray: a view of an ArrayBuffer, where each element is the same size and type. The most common one you're likely to interact with is Uint8Array, i.e. each element is an unsigned 8-bit integer. Note that the Node.js Buffer type is treated like a Uint8Array in most contexts, but Uint8Array should be used if writing code that may run in a browser, since browsers don't support Buffer.
DataView: a different view of an ArrayBuffer, but unlike TypedArray it allows you to write arbitrary values/types at arbitrary locations in the buffer.
See this article for a deeper dive into these abstractions. Generally you'll want to use Uint8Array unless you have some specific need to operate at a lower level.
Note that DataView allows you to specify whether the underlying bytes are stored as little-endian or big-endian, but will default to big-endian. TypedArray will use the native endianness of the system.
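Here's a short sketch of both behaviors (the byte values shown assume a little-endian machine, which is almost certainly what you're running on):

> buf = new ArrayBuffer(4)
> view = new DataView(buf)
> view.setUint32(0, 0x12345678) // no flag passed, so DataView writes big-endian
> new Uint8Array(buf)
Uint8Array(4) [ 18, 52, 86, 120 ] // 0x12 0x34 0x56 0x78
> view.setUint32(0, 0x12345678, true) // pass true to write little-endian
> new Uint8Array(buf)
Uint8Array(4) [ 120, 86, 52, 18 ] // 0x78 0x56 0x34 0x12
> new Uint8Array(Uint32Array.of(0x12345678).buffer) // TypedArray uses native (little-endian) order
Uint8Array(4) [ 120, 86, 52, 18 ]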
At the time we were weighing big- vs. little-endian, the cryptographic libraries we used (like Paul Miller's great work in @noble/hashes) explicitly required that the system be little-endian, and would fail to start if that precondition was not met. Since the first versions of Farcaster Hubs were written in TypeScript running on Node.js, this nudged us towards requiring little-endian.
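For reference, that kind of startup check boils down to a couple of lines; something along these lines (a sketch, not their exact code): store a known 32-bit value and see which byte lands first in memory.

> new Uint8Array(Uint32Array.of(0x11223344).buffer)[0] === 0x44
true // true on little-endian hardware; a big-endian machine would see 0x11 first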
Strings can be encoded multiple ways, the most common of which is UTF-8. The primary reason UTF-8 became so popular is that it efficiently represents Latin-script text (English and many other European languages), since most of those characters need only a single byte.
UTF-8 is less desirable if you are writing in a non-Latin alphabet. In those languages, UTF-16 allows for a more efficient representation of most characters—for example, the Chinese character 你 ("you") is 3 bytes in UTF-8, but only 2 bytes in UTF-16.
> char = '\u4f60'
'你'
> char.length // Returns the number of code units, NOT bytes!
1
> Buffer.from(char, 'utf16le').length // Returns number of bytes to represent as UTF-16
2
> Buffer.from(char, 'utf8').length // Returns number of bytes to represent as UTF-8
3
JavaScript represents all strings as UTF-16. If you are storing strings and want them in UTF-8 encoding, you'll need to convert them:
> encoder = new TextEncoder()
{ encoding: 'utf-8' }
> encoder.encode(char);
Uint8Array(3) [ 228, 189, 160 ] // 3 bytes
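And for completeness, TextDecoder converts those bytes back into a string:

> new TextDecoder().decode(Uint8Array.of(228, 189, 160)) // TextDecoder defaults to 'utf-8'
'你'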
For Farcaster, the reality is that English is the modern-day lingua franca and the most commonly used language on the protocol today, so storing text as UTF-8 saves a significant amount of space, though at the cost of reduced storage efficiency for languages written in non-Latin scripts.
UTF-8 is also the required encoding for strings in Protocol Buffers, which the Farcaster protocol uses to represent messages on the network, giving us one more reason to prefer it.
The Farcaster protocol:
Uses little-endian byte order because of its prevalence in modern hardware and because some core libraries used by the original Node.js-based hubs only supported little-endian architectures.
Uses UTF-8 to store strings since it saves storage space for the languages most commonly written on the network, and integrates well with Protocol Buffers, which require UTF-8 for strings.
Shane da Silva