AIP-210

Unicode

APIs should be consistent on how they explain, limit, and bill for string values and their encodings. This ranges from little ambiguities (like fields “limited to 1024 characters”) all the way to billing confusion (are names and values of properties in Datastore billed based on characters or bytes?).

In general, if we talk about limits measured in bytes, we are discriminating against non-ASCII text since it takes up more space. On the other hand, if we talk about “characters”, we are ambiguous about whether those are Unicode “code points”, “code units” for a particular encoding (e.g. UTF-8 or UTF-16), “graphemes”, or “grapheme clusters”.

Unicode primer

Character encoding tends to be an area we often gloss over, so a quick primer:

  • Strings are just bytes that represent numbers according to some encoding format.
  • When we talk about characters, we sometimes mean Unicode code points, which are numbers in the Unicode spec (up to 21 bits).
  • Other times we might mean graphemes or grapheme clusters, which may have multiple numeric representations and may be represented by more than one code point. For example, á may be represented as a composition of U+0061 + U+0301 (the a + the accent combining mark) or as a single code point, U+00E1.
  • Protocol buffers uses UTF-8 (“Unicode Transformation Format”) which is a variable-length encoding scheme using up to 4 code units (8-bit bytes) per code point.

Guidance

Character definition

TL;DR: In our APIs, “characters” means “Unicode code points”.

In API documentation (e.g., API reference documents, blog posts, marketing documentation, billing explanations, etc), “character” must be defined as a Unicode code point.

Length units

TL;DR: Set size limits in “characters” (as defined above).

All string field length limits defined in API comments must be measured and enforced in characters as defined above. This means that there is an underlying maximum limit of (4 * characters) bytes, though this limit will only be hit when using exclusively characters that consist of 4 UTF-8 code units (32 bits).

If you use a database system (e.g. Spanner) which allows you to define a limit in characters, it is safe to assume that this byte-defined requirement is handled by the underlying storage system.

Billing units

APIs may use either code points or bytes (using the UTF-8 encoding) as the unit for billing or quota measurement (e.g., Cloud Translation chooses to use characters). If an API does not define this, the assumption is that the unit of billing is characters (e.g., $0.01 per character, not $0.01 per byte).

Unique identifiers

TL;DR: Unique identifiers should limit to ASCII, generally only letters, numbers, hyphens, and underscores, and should not start with a number.

Strings used as unique identifiers should limit inputs to ASCII characters, typically letters, numbers, hyphens, and underscores ([a-zA-Z][a-zA-Z0-9_-]*). This ensures that there are never accidental collisions due to normalization. If an API decides to allow all valid Unicode characters in unique identifiers, the API must reject any inputs that are not in Normalization Form C. Generally, unique identifiers should not start with a number as that prefix is reserved for Google-generated identifiers and gives us an easy way to check whether we generated a unique numeric ID for or whether the ID was chosen by a user.

Unique identifiers should use a maximum length of 64 characters, though this limit may be expanded as necessary. 64 characters should be sufficient for most purposes as even UUIDs are only 60 characters.

Normalization

TL;DR: Unicode values should be stored in Normalization Form C.

Values should always be normalized into Normalization Form C. Unique identifiers must always be stored in Normalization Form C (see the next section).

Imagine we’re dealing with Spanish input “estaré” (the accented part will be bolded throughout). This text has what we might visualize as 6 “characters” (in this case, they are grapheme clusters). It has two possible Unicode representations:

  • Using 6 code points: U+0065 U+0073 U+0074 U+0061 U+0072 U+00E9
  • Using 7 code points: U+0065 U+0073 U+0074 U+0061 U+0072 U+0065 U+0301

Further, when encoding to UTF-8, these code points have two different serialized representations:

  • Using 7 code-units (7 bytes): 0x65 0x73 0x74 0x61 0x72 0xC3 0xA9
  • Using 8 code-units (8 bytes): 0x65 0x73 0x74 0x61 0x72 0x65 0xCC 0x81

To avoid this discrepancy in size (both code units and code points), use Normalization Form C which provides a canonical representation for strings.

Uniqueness

TL;DR: Unicode values must be normalized to Normalization Form C before checking uniqueness.

For the purposes of unique identification (e.g., name, id, or parent), the value must be normalized into Normalization Form C (which happens to be the most compact). Otherwise we may have what is essentially “the same string” used to identify two entirely different resources.

In our example above, there are two ways of representing what is essentially the same text. This raises the question about whether the two representations should be treated as equivalent or not. In other words, if someone were to use both of those byte sequences in a string field that acts as a unique identifier, would it violate a uniqueness constraint?

The W3C recommends using Normalization Form C for all content moving across the internet. It is the most compact normalized form on Unicode text, and avoids most interoperability problems. If we were to treat two Unicode byte sequences as different when they have the same representation in NFC, we’d be required to reply to possible “Get” requests with content that is not in normalized form. Since that is definitely unacceptable, we must treat the two as identical by transforming any incoming string data into Normalized Form C or rejecting identifiers not in the normalized form.

There is some debate about whether we should view strings as sequences of code points represented as bytes (leading to uniqueness determined based on the byte-representation of said string) or to interpret strings as a higher level abstraction having many different possible byte-representations. The stance taken here is that we already have a field type for handling that: bytes. Fields of type string already express an opinion of the validity of an input (it must be valid UTF-8). As a result, treating two inputs that have identical normalized forms as different due to their underlying byte representation seems to go against the original intent of the string type. This distinction typically doesn’t matter for strings that are opaque to our services (e.g., description or display_name), however when we rely on strings to uniquely identify resources, we are forced to take a stance.

Put differently, our goal is to allow someone with text in any encoding (ASCII, UTF-16, UTF-32, etc) to interact with our APIs without a lot of “gotchas”.

References