Here be dragons: using JavaScript to store JSON in a Postgres database
It was an otherwise normal day until we started seeing errors like the following on our Postgres database:
Unsupported Unicode escape sequence
Unicode high surrogate must not follow a high surrogate
Unicode low surrogate must follow a high surrogate
We hadn't recently deployed any relevant changes, and this was only affecting the display of some posts but not others. Why were we suddenly seeing unsupported escape sequences? What exactly were these "surrogate" pairs, and why now?
As discussed in an earlier post, strings can be represented with different encodings. Since JSON is serialized as a Unicode string (one of UTF-8, UTF-16, or UTF-32†) there are multiple ways to represent the same Unicode character in a JSON string. Consider the following two literal strings:
"✓"
"\u2713"
If you parse each of these in JavaScript, they'll evaluate to the same JSON value:
> unicodeChar = "✓"
'✓'
> escapedChar = "\\u2713"
'\\u2713'
> fromUnicode = JSON.parse(`"${unicodeChar}"`)
'✓'
> fromEscapeSequence = JSON.parse(`"${escapedChar}"`);
'✓'
> fromUnicode === fromEscapeSequence
true
> unicodeChar === escapedChar
false
What's the big deal? The problem is that the JSON parser doesn't check that a Unicode escape sequence is valid. As long as the escape sequence is \u followed by 4 hexadecimal digits, it will accept it, but that doesn't mean it actually represents a valid Unicode string.
For example, the dragon 🐉 can be represented in UTF-16 with the surrogate pair \ud83d\udc09. The first 16-bit code unit \ud83d is the high surrogate (from the range U+D800–U+DBFF), and the second \udc09 is the low surrogate (from the range U+DC00–U+DFFF). Mix them up and you have an invalid pair, which can lead to surprising behavior.
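The arithmetic behind the pair can be sketched in a few lines of JavaScript: the high surrogate supplies the top 10 bits of the code point and the low surrogate the bottom 10, offset by U+10000.

```javascript
const high = 0xd83d; // high surrogate (range U+D800–U+DBFF)
const low = 0xdc09;  // low surrogate (range U+DC00–U+DFFF)

// Top 10 bits from the high surrogate, bottom 10 from the low,
// plus the U+10000 offset for the supplementary planes.
const codePoint = 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);

console.log(codePoint.toString(16));          // "1f409"
console.log(String.fromCodePoint(codePoint)); // "🐉"
```

Run the same formula with the surrogates swapped and you get a number outside any valid decoding, which is exactly why a reversed pair is meaningless.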
In JavaScript, an invalid surrogate pair is gracefully ignored and the underlying byte representation is preserved—no error is thrown:
> JSON.parse('"\\ud83d\\udc09"') // Valid surrogate pair
'🐉'
> JSON.parse('"\\udc09\\ud83d"') // Invalid surrogate pair
'\udc09\ud83d'
But in Postgres, the behavior depends on whether you use the json or jsonb type:
postgres=> select '"\ud83d\udc09"'::jsonb; -- Valid surrogate pair
jsonb
-------
"🐉"
(1 row)
postgres=> select '"\udc09\ud83d"'::jsonb; -- Invalid surrogate pair disallowed with jsonb
ERROR: invalid input syntax for type json
LINE 1: select '"\udc09\ud83d"'::jsonb;
^
DETAIL: Unicode low surrogate must follow a high surrogate.
CONTEXT: JSON data, line 1: "\udc09..."
postgres=> select '"\udc09\ud83d"'::json; -- Invalid surrogate pair allowed with json
json
----------------
"\udc09\ud83d"
(1 row)
Since Postgres stores all strings by default as UTF-8 and explicitly doesn't support UTF-16, it needs all surrogate pairs to be valid in order to represent the actual JSON value (rather than raw bytes). This becomes a problem any time you try to access, index, or manipulate the JSON value using Postgres, since doing so requires the values to be jsonb.
Postgres does document this behavior:
RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by
\uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow \u). However, the input function for jsonb is stricter: it disallows Unicode escapes for characters that cannot be represented in the database encoding. The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type)...
The key takeaway is that you can have a string value with invalid surrogate pairs in JavaScript (UTF-16) which can successfully be saved to a json column in your Postgres database, only to lead to a failure at some point in the future if you manipulate that JSON in any way (e.g., by casting it to jsonb).
Whether to use json or jsonb is highly dependent on your workload—there's no single right answer. A quick rule of thumb is that json saves on storage but isn't easily indexable, whereas jsonb is more efficient to index or otherwise manipulate.
Handling invalid Unicode sequences generally comes down to one of two approaches:
Prevent the invalid input from ever entering your system (raise an error as early as possible)
Modify/repair the invalid input before ingesting into your system (remove the invalid data upon detection)
In our case, since the JSON in question was data returned by a third-party API that we didn't control, we simply wanted to gracefully ignore the problematic record(s), but still have a copy of the data for posterity and debugging purposes.
To that end, we opted to continue storing the data as json since we weren't accessing any particular values from this data or searching for records with JSON matching certain criteria. We then added our own helper function that allowed us to quickly filter out problematic records in situations where we needed only valid Unicode sequences:
CREATE OR REPLACE FUNCTION valid_json_string(input anyelement)
RETURNS bool
LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL RESTRICTED AS
$func$
BEGIN
RETURN input::jsonb IS NOT NULL;
EXCEPTION
WHEN SQLSTATE '22P02' THEN -- invalid_text_representation
RETURN false;
END
$func$;
This can then be used on any value that can be cast to jsonb (e.g. text, json):
SELECT * FROM my_table WHERE valid_json_string(data);
You can add a functional index so that you do not need to recalculate validity every time you run the query:
CREATE INDEX my_table_valid_json_string_data ON my_table (valid_json_string(data));
Different programming languages have different ways of handling invalid Unicode sequences.
When using JavaScript to write JSON data to a database like Postgres, care must be taken to either avoid storing invalid data, or to track invalid data and omit/mutate when necessary.
Invalid sequences can unexpectedly appear if you ever ingest data from a third party you don't control (especially JSON data, since JSON strings can carry escape sequences for invalid surrogate pairs), or if you truncate data at an arbitrary number of code units or bytes.
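The truncation case is easy to reproduce (a sketch): slicing a JavaScript string at an arbitrary index can cut a surrogate pair in half, and JSON.stringify will happily emit the lone surrogate as an escape sequence, which is exactly the kind of value a later jsonb cast will reject.

```javascript
const s = "price: 🐉";
// .slice counts UTF-16 code units, so index 8 falls between the
// dragon's high and low surrogates.
const truncated = s.slice(0, 8);
console.log(JSON.stringify(truncated)); // '"price: \ud83d"'
```

If you must truncate, do it on code points (e.g. by iterating the string, which yields whole characters) rather than on raw UTF-16 indices.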