How to (Structurally) Serialize a JavaScript Object

In this post I want to catalog the myriad ways of serializing JavaScript objects, with a focus on mimicking the behavior of the Structured Clone Algorithm while preserving the ease of use of JSON.stringify. The comparison is incomplete and focuses on “self-describing” formats that serve similar use cases to JSON. Also excluded are libraries and protocols that don’t have first-class JS implementations, are bundled with larger unrelated libraries, or are otherwise unlikely candidates for your typical JS developer.

Introduction

The Structured Clone Algorithm (SCA) is the web’s canonical way of serializing/deserializing JavaScript objects. It is used when passing objects via postMessage calls to workers or iframes, or when storing objects in IndexedDB. One major benefit over JSON is support for ECMAScript types like Map, Set, Date, RegExp and Error, as well as typed arrays such as Uint8Array.

It also supports various platform objects, such as File, Blob, and FileList, as well as more unusual ones, like DOMMatrix or VideoFrame. For the purpose of this article we will only check for support for Blob and File, which have implementations outside the browser.

In addition to the ECMAScript types and platform objects mentioned above, the SCA also supports preserving object identity and circular references. If the same object is present in the source multiple times, the identities in the cloned object will reflect the original identities. This is a big advantage over JSON and something that is frequently requested. For example, the flatted module on npm has over 29 million downloads per week.

const obj = { a: 3 };
const clone = structuredClone({ main: obj, copy: obj });
clone.main === clone.copy // true

While all JS engines implement the Structured Clone Algorithm, and even expose it through the structuredClone builtin function, they do not support serialization to a buffer or provide access to their internal binary representation (which isn’t standardized), leaving it in developers’ hands how to achieve similar results.

User-Defined Types

When the topic of serialization comes up, usually it is implied that there is also a way to serialize user-defined types. This is not what the SCA is about. The SCA is closer to “Extended JSON” than Protocol Buffers and as such provides no mechanism for developers to add custom types.

When the algorithm encounters an unknown class instance, it will extract its enumerable properties and put them in a POJO, with no corresponding step on the deserialization side. Interestingly, it will even ignore the toJSON method (it is not JSON after all).
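A quick demonstration of this flattening behavior (Point is a made-up class for illustration):

```javascript
class Point {
  constructor(x, y) { this.x = x; this.y = y; }
  // structuredClone ignores toJSON; JSON.stringify does not
  toJSON() { return `(${this.x},${this.y})`; }
}

const p = structuredClone(new Point(1, 2));
console.log(p);                  // { x: 1, y: 2 } — a POJO, the prototype is lost
console.log(p instanceof Point); // false
console.log(JSON.stringify(new Point(1, 2))); // '"(1,2)"' — JSON does call toJSON
```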

Perhaps this is for the better, since executing arbitrary constructors based on wire data can be a security issue. Of course, such a mechanism can always be layered on top through a pre- and post-processing step, at the cost of performance and inconvenience. For this reason, it will be noted on each candidate if there is a way to define custom “transferable” types.
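A sketch of such a pre-/post-processing layer (Temperature and the $type marker are made up for illustration; circular references are not handled here):

```javascript
// Hypothetical user-defined type we want to survive a structured clone.
class Temperature { constructor(celsius) { this.celsius = celsius; } }

// Replace known class instances with tagged POJOs before cloning.
function preProcess(value) {
  if (value instanceof Temperature) return { $type: 'Temperature', celsius: value.celsius };
  if (value && typeof value === 'object') {
    const out = Array.isArray(value) ? [] : {};
    for (const [k, v] of Object.entries(value)) out[k] = preProcess(v);
    return out;
  }
  return value;
}

// Restore tagged POJOs to class instances after cloning.
function postProcess(value) {
  if (value && typeof value === 'object') {
    if (value.$type === 'Temperature') return new Temperature(value.celsius);
    for (const [k, v] of Object.entries(value)) value[k] = postProcess(v);
  }
  return value;
}

const restored = postProcess(structuredClone(preProcess({ temp: new Temperature(21) })));
console.log(restored.temp instanceof Temperature); // true
```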

A Word on Typed Arrays

Typed Arrays deserve special attention in this context. It is perhaps poorly understood that JS’ typed arrays are actually “Array Buffer Views”, as in, they provide a view on an underlying ArrayBuffer. In the most common scenarios this distinction is irrelevant, such as when creating a fresh array:

crypto.getRandomValues(new Uint8Array(16))

…or receiving one from a platform method:

new TextEncoder().encode("Hello World")

In these cases, a new ArrayBuffer of the exact size of the Uint8Array is created and the developer needn’t worry about it. However, this isn’t the only way typed arrays can be used. Take the following example:

const ab = new ArrayBuffer(1024);
const view8 = new Uint8Array(ab, 0, 8);
const view16 = new Uint16Array(ab, 0, 4);

Here, both views represent a view on the first 8 bytes of the same underlying array buffer. Modifications made through one view are reflected in the other and vice versa. More importantly, strictly following the SCA when serializing these views means preserving the object identity of the buffer, meaning all 1024 bytes have to be included in the serialized output, even though we’re most likely only interested in the first 8 bytes!
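The aliasing can be seen directly (the example above is redeclared so the snippet is self-contained; the expected value assumes a little-endian platform):

```javascript
const ab = new ArrayBuffer(1024);
const view8 = new Uint8Array(ab, 0, 8);
const view16 = new Uint16Array(ab, 0, 4);

// Writes through one view are visible through the other:
view8[0] = 0xff;
view8[1] = 0x01;
console.log(view16[0]); // 0x01ff on little-endian platforms
```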

Not making a distinction between a view and the underlying buffer can be a source of bugs. SQLite Viewer for VSCode had two bugs because I didn’t account for the fact that a Uint8Array can have a larger underlying array buffer. In my case, an extra byte was prepended, which made the SQLite format unreadable.
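The class of bug is easy to reproduce: wrapping view.buffer directly drags the surrounding bytes along, while honoring byteOffset and byteLength (or using view.slice()) copies exactly the window the view covers:

```javascript
// A Uint8Array may cover only part of its underlying buffer.
const ab = new ArrayBuffer(16);
new Uint8Array(ab).fill(0xaa);          // surrounding bytes we do NOT want
const view = new Uint8Array(ab, 1, 8);  // 8-byte window starting at byte 1

// Wrong: ignores byteOffset/byteLength and grabs all 16 bytes.
const wrong = new Uint8Array(view.buffer);
console.log(wrong.byteLength); // 16

// Right: copy exactly the bytes the view covers.
const right = view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
console.log(right.byteLength); // 8

// view.slice() does the same and returns a typed array over a fresh buffer.
console.log(view.slice().buffer.byteLength); // 8
```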

Every serialization implementation can decide whether it wants to preserve the object identity of the underlying buffer property and include it wholesale, or if it wants to treat typed arrays like slices and only inline its particular contents. The web platform is clear on which version it prefers:

const ab = new ArrayBuffer(4096)
const view1 = new Uint8Array(ab, 0, 8);
const view2 = new Uint8Array(ab, 8, 16);
const [view1Copy, view2Copy] = structuredClone([view1, view2]);
view1Copy.buffer === view2Copy.buffer // true
view1Copy.buffer.byteLength // 4096

However, for the purpose of storage or sending messages across a network, inlining just the slice that a view provides is probably closer to what developers would expect, even if the underlying buffer identities are not preserved in that case.

Not included: BSON

BSON inherits the semantics of JSON and does not implement the SCA. It also does not support circular references out of the box. While useful in many circumstances, including binary-encoding some of the candidates discussed below, it will not be part of this article otherwise.

Not included: Protocol Buffers, Flat Buffers and Friends

While these libraries certainly have what it takes to replicate anything described in the SCA, they rely on predefined schemas and generally lack the “drop-in replacement for JSON” ease of use that many JS developers are looking for. While some may have builtin ways to include the schema along with the serialized output, they are excluded here nonetheless.

Honorable mention: v8::ValueSerializer

V8 has an internal serialization format that implements the SCA, is backwards compatible (can be stored), and is the closest thing to a canonical binary representation for JS objects. While it’s unlikely to be shared by other JS engines, given V8’s prevalence, it is a strong contender for a default format.

Unfortunately, it is poorly documented and has no pure JS implementation that would make it usable outside of the V8 contexts in which it is exposed to developers (i.e. the node:v8 module in Node and Deno). It also has no libraries for other languages, making it a poor choice for an exchange format.
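Where the format is available, using it is straightforward; the node:v8 module exposes serialize and deserialize directly (Node-specific, and the resulting buffer is in V8’s internal wire format):

```javascript
import v8 from 'node:v8';

const original = { map: new Map([['a', 1]]), data: new Uint8Array([1, 2, 3]) };
const buf = v8.serialize(original);  // Buffer in V8's internal wire format
const clone = v8.deserialize(buf);

console.log(clone.map.get('a'));     // 1
console.log(clone.data);             // Uint8Array [ 1, 2, 3 ]
```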

@ungap/structured-clone

@ungap/structured-clone is a pure JS library that implements the SCA without any opinions on binary serialization.

For example the serialization for an object like this:

const ab = new ArrayBuffer(8);
const data = new Uint8Array(ab).fill(255);
const view = new Uint16Array(ab);
const obj = {
  big: 2n**64n - 1n, 
  set: new Set([1,2,3]),
  exp: new RegExp("/^\d+$/", "g"),
  map: new Map([["a", 3]]), 
  data,
  view,
}
obj.self = obj; // add circular reference

will produce a JS array that looks like this:

[
  [ 2, [ [ 1, 2 ], [ 3, 4 ], [ 8, 9 ], [ 10, 11 ], [ 13, 14 ], [ 15, 16 ], [ 17, 0 ] ] ],
  [ 0, "big" ],
  [ 8, "18446744073709551615" ],
  [ 0, "set" ],
  [ 6, [ 5, 6, 7 ] ],
  [ 0, 1 ],
  [ 0, 2 ],
  [ 0, 3 ],
  [ 0, "exp" ],
  [ 4, { source: "\\/^d+$\\/", flags: "g" } ],
  [ 0, "map" ],
  [ 5, [ [ 12, 7 ] ] ],
  [ 0, "a" ],
  [ 0, "data" ],
  [ "Uint8Array", [ 255, 255, 255, 255, 255, 255, 255, 255 ] ],
  [ 0, "view" ],
  [ "Uint16Array", [ 65535, 65535, 65535, 65535 ] ],
  [ 0, "self" ]
]
The snippet used to produce and round-trip it:

import * as ungap from 'npm:@ungap/structured-clone';
ungap.deserialize(ungap.serialize(obj));

While this lends itself well for further JSON or binary serialization, doing so has a number of disadvantages:

  • It is barely human-readable, comparable to a fully fledged binary format
  • When used in combination with JSON, the overhead of inlined binary is larger than even the 33% penalty of using Base64 encoding.
  • It doesn’t inline buffers well when using binary encodings like MessagePack either, because each byte is a number in a regular array. This means every number above 127 will be prefixed with an extra header byte [1], leading to a ~50% penalty [2].
  • It does not support common platform objects like Blob and File.
  • It has no hooks for handling custom types
  • No support for BigUint64Array and BigInt64Array

Deserializing the object will preserve circular references, but not the identity of the underlying array buffer. As mentioned earlier, this is a reasonable choice and probably even preferable if accuracy isn’t the primary goal.

On the flip side, the code is simple and compact and has no external dependencies.

@worker-tools/structured-json (Typeson)

This is my own entry to the mix, though it is really just a thin wrapper around the Typeson library’s structured clone preset, which does the heavy lifting.

Like the ungap version, it is intended to produce JSON output, but instead of a custom array-based format, it attaches a $types property that carries meta information.

import * as SJSON from 'npm:@worker-tools/structured-json';
SJSON.revive(SJSON.encapsulate(obj));

Using the same obj as above, it will produce the following POJO:

{
  big: "18446744073709551615",
  set: [ 1, 2, 3 ],
  exp: { source: "\\/^d+$\\/", flags: "g" },
  map: [ [ "a", 3 ] ],
  data: { encoded: "//////////8=", byteOffset: 0, length: 8 },
  view: { index: 0, byteOffset: 0, length: 4 },
  self: "#",
  "$types": {
    big: "bigint",
    set: "set",
    exp: "regexp",
    map: "map",
    "map.0": "arrayNonindexKeys",
    data: "uint8array",
    view: "uint16array",
    self: "#"
  }
}

The footprint is much heavier than the ungap version, since it depends on a whole separate library. I’ve had issues importing it in various scenarios, which is why I’ve created this wrapper. That being said, it also has a few advantages:

  • The format is human readable
  • It can be consumed by endpoints that aren’t aware of the encoding:
    • It makes reasonable choices for representing non-JSON types, e.g. arrays for sets and arrays of tuples for maps
    • The $types property can be ignored entirely. If you were to put this in a JSONB column in Postgres or SQLite, it would not be out of place, work with builtin functions for the most part, and could even be updated without corrupting the format.
  • It has support for some builtin types like Blob, File and FileList, but the use of -Async variants of each library function is required.

In addition to the issues mentioned above, it also has more fundamental problems:

  • When encoding with MessagePack or similar, you pay a 33% Base64 tax for nested binary data
  • It preserves the object identity of array buffers in typed arrays, meaning extra caution is necessary when dealing with array buffer views.
  • The $types object duplicates the keys and can add significant overhead for deeply nested structures.
  • No support for Error, BigUint64Array and BigInt64Array
  • Defining custom types is only possible by opening up the underlying Typeson library
  • Generally more quirky and less battle-tested

MessagePack + Extensions

MessagePack is a popular and simple binary format that is in many cases a drop-in replacement for JSON. While it inherits the semantics of JSON, it comes with support for extensions that some libraries use to implement the SCA out of the box. For example, msgpackr has a structuredClone option that matches the semantics of the SCA pretty accurately.

  • It supports basic types and preserves (circular) references
  • Typed arrays are treated as slices (the array buffer reference is not preserved)
  • Unlike the options so far, it is a binary format that will efficiently inline binary data without overhead
  • The extension system is open, i.e. developers can define additional types
  • It has no support for platform objects like Blob

import { Packr, Unpackr } from 'npm:msgpackr';
const packr = new Packr({ structuredClone: true });
const unpackr = new Unpackr({ structuredClone: true });
unpackr.unpack(packr.pack(obj))

The obj above produces the following buffer:

d6 69 00 00 00 01 d4 72 40 97 a3 62 69 67 a3 73 65 74 a3 65 78 70 a3 6d 61 70 a4 64 61 74 61 a4
76 69 65 77 a4 73 65 6c 66 cf ff ff ff ff ff ff ff ff d4 73 00 93 01 02 03 d4 78 00 92 a8 5c 2f
5e 64 2b 24 5c 2f a1 67 81 a1 61 03 c7 09 74 01 ff ff ff ff ff ff ff ff c7 09 74 04 ff ff ff ff
ff ff ff ff d6 70 00 00 00 01 

Note the 3 distinct blocks of ff that are the Uint64Max bigint, Uint8Array and Uint16Array respectively.

Problems arise when interacting with other MessagePack implementations. The extensions used to make the SCA work are not standardized and different implementations make different choices. For example, msgpackr uses extension type 0x78 (lowercase x) to denote a RegExp object, while msgpack-lite, another popular library, uses 0x0a for much the same purpose.

Besides mismatching extension codes, the concept of a “reference” isn’t standardized either and msgpackr’s SCA preset uses additional optimizations that will be difficult to replicate in other libraries.

For these reasons, MessagePack in general and msgpackr/msgpack-lite in particular are only good choices when controlling both ends of the serialization and both ends are written in JavaScript and have access to these libraries. For other languages, rather than porting these libraries, there’s probably more luck to be had with the following approach.

CBOR + Standard Extensions

While I can’t say I’m a big fan of CBOR, I have to admit that its centralized registry for extension types makes it a better exchange format for JS objects when interoperability is desired. Since CBOR is an integral part of Passkeys, the world isn’t entirely devoid of CBOR parsers either, raising this above the level of a theoretical benefit.

Specifically, the extensions relevant to the SCA are:

JS                | Tag   | Data Item                      | Semantics
------------------|-------|--------------------------------|----------------------------------------------------------------------
Date              | 1     | integer or float               | Epoch-based date/time; see Section 3.4.2
Set               | 258   | array                          | Mathematical finite set
Map               | 259   | map                            | Map datatype with key-value operations (e.g. .get()/.set()/.delete())
RegExp            | 21066 | Array[UTF8string, UTF8string?] | ECMAScript RegExp
Uint8Array        | 64    | byte string                    | uint8 Typed Array
Uint8ClampedArray | 68    | byte string                    | uint8 Typed Array, clamped arithmetic
Uint16Array       | 69    | byte string                    | uint16, little endian, Typed Array
Uint32Array       | 70    | byte string                    | uint32, little endian, Typed Array
BigUint64Array    | 71    | byte string                    | uint64, little endian, Typed Array
Int16Array        | 77    | byte string                    | sint16, little endian, Typed Array
Int32Array        | 78    | byte string                    | sint32, little endian, Typed Array
BigInt64Array     | 79    | byte string                    | sint64, little endian, Typed Array
Float32Array      | 85    | byte string                    | IEEE 754 binary32, little endian, Typed Array
Float64Array      | 86    | byte string                    | IEEE 754 binary64, little endian, Typed Array
Errors            | 27    | array                          | Serialised language-independent object with type name and constructor arguments
Object references | 28    | multiple                       | mark value as (potentially) shared
Object references | 29    | unsigned integer               | reference nth marked value

Late additions like 21066 for RegExp and the odd naming here and there make it rather obvious that this format wasn’t designed with the SCA in mind either, but the current set of extensions is enough to mimic the behavior of the SCA reasonably well.

Thanks to its centralized registry and unlike your typical MessagePack library, there’s at least a chance that something reasonable will pop out the other end when two different CBOR implementations are used, making this the preferred choice when a binary format with interoperability and efficient inlining of nested buffers is desired.

For completeness sake, here is our trusty obj encoded as CBOR with the extensions above:

d8 1c b9 00 07 63 62 69 67 1b ff ff ff ff ff ff ff ff 63 73 65 74 d9 01 02 83 01 02 03 63 65 78
70 d9 52 4a 82 68 5c 2f 5e 64 2b 24 5c 2f 61 67 63 6d 61 70 d9 01 03 a1 61 61 03 64 64 61 74 61
d8 40 48 ff ff ff ff ff ff ff ff 64 76 69 65 77 d8 45 48 ff ff ff ff ff ff ff ff 64 73 65 6c 66
d8 1d 19 00 00

The cbor-x library was used to generate the above, with the following settings and additions:

import * as CBOR from 'npm:cbor-x'
const encoder = new CBOR.Encoder({ structuredClone: true, useRecords: false, pack: false, tagUint8Array: true, structures: null })
CBOR.addExtension({
  tag: 21066,
  Class: RegExp,
  encode(regexp, encode) { return encode([regexp.source, regexp.flags]) },
  decode(data) { return new RegExp(data[0], data[1]) },
});
encoder.encode(obj);

There are no registered types for platform objects like File or Blob. As with Errors, the generic “Serialised language-independent object with type name and constructor arguments” (tag 27) will have to do in these cases.

Conclusion

When needing a serialization format that lives within a Node, Deno or Bun code base (or possibly as an exchange format between them), the node:v8 module is a good and performant choice. However, it lacks documentation besides its source code in the V8 repository and I’m not aware of any alternative implementations.

When a more open format with more interoperability is needed, a world of possibilities opens up. @ungap/structured-clone and @worker-tools/structured-json offer backwards compatibility with JSON. The latter is human-readable, while the former is mostly not. Both can be binary-encoded with a variety of libraries like BSON, but suffer from overhead associated with nested buffers, ranging from 33% to ~50%. Structured JSON suffers from overhead with (deeply) nested objects in general.

Formats with binary serialization in mind from the beginning are MessagePack and CBOR, each requiring a set of extensions to achieve SCA compatibility. CBOR is the better choice when documentation and standardization are desired. It has registered extension codes for the core ECMAScript types. If it’s more important to have any library at all, MessagePack + non-standard extensions might be a better choice, but one library’s SCA implementation will not match another’s, most likely requiring additional user code. With only 127 extension codes to pick from, freedom from conflicts is not guaranteed.

  1. 0xcc for u8; Numbers below 127 are encoded as “fixedints” and do not carry an extra header byte. The situation is similar in other MsgPack-like formats such as CBOR. ↩︎

  2. The effect is diminished for Uint16Array and Uint32Array, but having to pick these for the sake of more efficient wire serialization only furthers the point that this is not a good target when there are nested blobs. ↩︎


© 2024 Florian Klampfer
