Here be dragons: using JavaScript to store JSON in a Postgres database
It was an otherwise normal day until we started seeing errors like the following on our Postgres database:
Unsupported Unicode escape sequence
Unicode high surrogate must not follow a high surrogate
Unicode low surrogate must follow a high surrogate
We hadn't recently deployed any relevant changes, and this was only affecting the display of some posts but not others. Why were we suddenly seeing unsupported escape sequences? What exactly were these "surrogate" pairs, and why now?
As discussed in an earlier post, strings can be represented with different encodings. Since JSON is serialized as a Unicode string (one of UTF-8, UTF-16, or UTF-32†) there are multiple ways to represent the same Unicode character in a JSON string. Consider the following two literal strings:
"✓"
"\u2713"
If you parse each of these in JavaScript, they'll evaluate to the same JSON value:
> unicodeChar = "✓"
'✓'
> escapedChar = "\\u2713"
'\\u2713'
> fromUnicode = JSON.parse(`"${unicodeChar}"`)
'✓'
> fromEscapeSequence = JSON.parse(`"${escapedChar}"`);
'✓'
> fromUnicode === fromEscapeSequence
true
> unicodeChar === escapedChar
false
What's the big deal? The problem is that the JSON parser doesn't check that a Unicode escape sequence is valid. As long as the escape sequence is \u followed by 4 hexadecimal digits, it will accept it, but that doesn't mean it actually represents a valid Unicode string.
For example, the dragon 🐉 can be represented in UTF-16 with the surrogate pair \ud83d\udc09. The first 16-bit code unit \ud83d is the high surrogate (from the range U+D800–U+DBFF), and the second \udc09 is the low surrogate (from the range U+DC00–U+DFFF). Mix them up and you have an invalid pair, which can lead to surprising behavior.
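The arithmetic behind the pair can be sketched in a few lines of JavaScript: the high surrogate supplies the top 10 bits of the code point and the low surrogate the bottom 10, offset by U+10000.

```javascript
const high = 0xd83d; // high surrogate (range U+D800–U+DBFF)
const low = 0xdc09;  // low surrogate (range U+DC00–U+DFFF)

// Top 10 bits from the high surrogate, bottom 10 from the low,
// plus the U+10000 offset for the supplementary planes.
const codePoint = 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);

console.log(codePoint.toString(16));          // "1f409"
console.log(String.fromCodePoint(codePoint)); // "🐉"
```

Run the same formula with the surrogates swapped and you get a number outside any valid decoding, which is exactly why a reversed pair is meaningless.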
In JavaScript, an invalid surrogate pair is gracefully ignored and the underlying byte representation is preserved—no error is thrown:
> JSON.parse('"\\ud83d\\udc09"') // Valid surrogate pair
'🐉'
> JSON.parse('"\\udc09\\ud83d"') // Invalid surrogate pair
'\udc09\ud83d'
But in Postgres, the behavior depends on whether you use the json or jsonb type:
postgres=> select '"\ud83d\udc09"'::jsonb; -- Valid surrogate pair
jsonb
-------
"🐉"
(1 row)
postgres=> select '"\udc09\ud83d"'::jsonb; -- Invalid surrogate pair disallowed with jsonb
ERROR: invalid input syntax for type json
LINE 1: select '"\udc09\ud83d"'::jsonb;
^
DETAIL: Unicode low surrogate must follow a high surrogate.
CONTEXT: JSON data, line 1: "\udc09..."
postgres=> select '"\udc09\ud83d"'::json; -- Invalid surrogate pair allowed with json
json
----------------
"\udc09\ud83d"
(1 row)
Since Postgres stores all strings by default as UTF-8 and explicitly doesn't support UTF-16, it needs all surrogate pairs to be valid in order to represent the actual JSON value (rather than raw bytes). This becomes a problem any time you try to access, index, or manipulate the JSON value using Postgres, since doing so requires the values to be jsonb.
Postgres does document this behavior:
RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by
\uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow \u). However, the input function for jsonb is stricter: it disallows Unicode escapes for characters that cannot be represented in the database encoding. The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type)...
The key takeaway is that you can have a string value with invalid surrogate pairs in JavaScript (UTF-16) which can successfully be saved to a json column in your Postgres database, only to lead to a failure at some point in the future if you manipulate that JSON in any way (e.g., by casting it to jsonb).
Whether to use json or jsonb is highly dependent on your workload—there's no single right answer. A quick rule of thumb is that json saves on storage but isn't easily indexable, whereas jsonb is more efficient to index or otherwise manipulate.
Handling invalid Unicode sequences generally comes down to one of two approaches:
Prevent the invalid input from ever entering your system (raise an error as early as possible)
Modify/repair the invalid input before ingesting into your system (remove the invalid data upon detection)
In our case, since the JSON in question was data returned by a third-party API that we didn't control, we simply wanted to gracefully ignore the problematic record(s), but still have a copy of the data for posterity and debugging purposes.
To that end, we opted to continue storing the data as json since we weren't accessing any particular values from this data or searching for records with JSON matching certain criteria. We then added our own helper function that allowed us to quickly filter out problematic records in situations where we needed only valid Unicode sequences:
CREATE OR REPLACE FUNCTION valid_json_string(input anyelement)
RETURNS bool
LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL RESTRICTED AS
$func$
BEGIN
RETURN input::jsonb IS NOT NULL;
EXCEPTION
WHEN SQLSTATE '22P02' THEN -- invalid_text_representation
RETURN false;
END
$func$;
This can then be used on any value that can be cast to jsonb (e.g. text, json):
SELECT * FROM my_table WHERE valid_json_string(data);
You can add a functional index so that you do not need to recalculate validity every time you run the query:
CREATE INDEX my_table_valid_json_string_data ON my_table (valid_json_string(data));
Different programming languages have different ways of handling invalid Unicode sequences.
When using JavaScript to write JSON data to a database like Postgres, care must be taken to either avoid storing invalid data, or to track invalid data and omit/mutate when necessary.
Invalid sequences can unexpectedly appear if you ever ingest data from a third party you don't control (especially JSON data, since JSON strings can carry escape sequences for invalid surrogate pairs), or if you truncate data at an arbitrary number of code units or bytes.
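The truncation case is easy to reproduce (a sketch): slicing a JavaScript string at an arbitrary index can cut a surrogate pair in half, and JSON.stringify will happily emit the lone surrogate as an escape sequence, which is exactly the kind of value a later jsonb cast will reject.

```javascript
const s = "price: 🐉";
// .slice counts UTF-16 code units, so index 8 falls between the
// dragon's high and low surrogates.
const truncated = s.slice(0, 8);
console.log(JSON.stringify(truncated)); // '"price: \ud83d"'
```

If you must truncate, do it on code points (e.g. by iterating the string, which yields whole characters) rather than on raw UTF-16 indices.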