nrk.no

How we talk about things in systems

Category: Dev

A handsome boy in high grass by marneejill on Flickr CC BY-SA 2.0

– Max, it’s time for dinner!

Suddenly the kid next door is at your doorstep – along with half the neighborhood’s dogs.

We give names to things we care about, and not only dogs.

Things are named to differentiate one thing from another, so we can have a shared understanding of what we are talking about. Without it, we are lost, unable to communicate.

This article discusses different methods for providing identifiers to objects in systems. For each method, I’ll include examples from systems in NRK, the Norwegian Broadcasting Corporation (where I have been working for over 10 years), as well as other examples from both computer and real-life systems.

I’ll discuss the pros and cons of each method, and explain how they can be used to provide stable and unique identifiers that can be exchanged effectively between systems.

Introduction to identifiers

Let’s start by defining an identifier:

An identifier is a name that gives an identity to either an object, or a classification of objects.

At NRK we have a lot of things to name.

This article, for instance. It is something we have named.

You are reading it on a web page we have named. It’s only because we named it that you were able to find it. That’s how the web system works, and how the web browser is able to find and retrieve web pages. I’ll return to that shortly.

– “Let’s switch to NRK”

The most widely known name we have here at NRK is probably the name NRK itself. Everyone in Norway knows what it is and what it means. …Or do they?

How do you interpret a Norwegian saying “la oss skru på NRK” (“let’s switch to NRK”)? Does it even make sense to say something like that?

Maybe it used to, back when we offered a single TV channel. Today we have four channels on broadcast TV, and even more on the web. If we include our radio stations, there’s a substantial number of alternatives.

Still, we usually understand each other when we say “let’s switch to NRK”, because of the context and history we share in Norway. For a computer system, though, that isn’t good enough.

At this point, computers don’t yet deal with cultural history or social environments like humans do. Computers need something unambiguous and durable to be able to talk to other computers about the same things.

Ideally, identifiers should be durable. In a computer system, identifiers that change over time are costly and error prone. A change in an identifier needs to be forwarded to all connected systems. The more systems that share information about the referenced object, the longer it takes to update them all, and the greater the risk of forgetting to update one or of making an erroneous change.

Methods for providing identifiers

There are several methods of providing identifiers that can be assigned to objects. I’ll be discussing these:

  • manually given identifiers
  • incrementing identifiers
  • calculated identifiers
  • randomly generated identifiers
  • combinations of identifiers
  • web identifiers

I will explain how they work, and how we relate to them at NRK.

Manually given identifiers

The name NRK is actually an abbreviation for “Norsk Rikskringkasting” (the Norwegian Broadcasting Corporation). This is our official company name as registered in Enhetsregisteret in Brønnøysundregistrene; the Central Coordinating Register for Legal Entities of Brønnøysund Register Centre.

Our name is unique in Norway: No other company or legal entity may use this name. Both the name and the abbreviation were given manually by someone back in 1933.

Somewhat similarly, we name our children with unique names within the family. Outside the family, this given name is no longer unique. Many friends outside of work just call me Olav. This is only half of my given name, but in most situations it identifies me well enough for friends to get my attention.

But in my department at work, we are four people whose names include Olav. In this context, this name is not enough to identify one of us uniquely.

In a larger organization of people, such as the country of Norway, even a full name, including the family name, will not be enough to ensure uniqueness for some. We need to add more information, and it will probably begin to sound like the start of a Norwegian folk tale: “Knut-Olav Hoven, son of Petter, from Sande in Vestfold county”. Yes, there are two municipalities, and several other places, named Sande here in Norway…

But let’s return to NRK.
Hurtigruten Midnatsol shot from a helicopter, mountainous Norwegian landscape surrounding a fjord.

One service most Norwegians know is our web TV service, NRK TV.

Every program published on this platform is identified by something that might look random to you, but that in fact follows a very specific pattern.

Let’s take “Hurtigruten Minutt for Minutt” as an example, a five-day TV marathon in the midnight sun that you can watch at

https://tv.nrk.no/serie/hurtigruten-minutt-for-minutt/DVFJ67003511/27-12-2011

This URL (Uniform Resource Locator) is both an identifier and an address to a document on the Internet. We’ll get back to that last bit later.

First, let’s inspect this URL.

The most obvious part is the hint of which show this is, “hurtigruten-minutt-for-minutt”.

Further, this URL contains DVFJ67003511, a program code identifying this content.

The letters DVFJ are a code identifying the organizational unit in NRK responsible for the production. The number 67 identifies the production job, the number 0035 means this is the 35th part of this series (or a part of a longer program that was split by the news), and the number 11 means it was produced in 2011.

For other kinds of programs, such as news, these numbers mean something different. Luckily for you, you won’t need to understand what they mean to watch the program, but for some of my coworkers, this is essential information.
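As a rough illustration of the layout just described, here is a minimal Python sketch that splits this one program code into its parts. The field names and the `split_program_code` helper are my own, hypothetical choices, and the layout only holds for this kind of program; as noted above, the digits mean something else for news and other program types.

```python
from typing import NamedTuple


class ProgramCode(NamedTuple):
    unit: str   # organizational unit responsible for the production
    job: int    # production job
    part: int   # part number within the series
    year: int   # two-digit production year


def split_program_code(code: str) -> ProgramCode:
    """Split a 12-character code: 4 letters, then 2 + 4 + 2 digits."""
    if len(code) != 12 or not code[:4].isalpha() or not code[4:].isdigit():
        raise ValueError(f"unexpected program code layout: {code!r}")
    return ProgramCode(
        unit=code[:4],
        job=int(code[4:6]),
        part=int(code[6:10]),
        year=int(code[10:12]),
    )


print(split_program_code("DVFJ67003511"))
```

Only the system that owns these codes should rely on this structure; everyone else is better off treating the code as an opaque string.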

Further, the Internet domain name “tv.nrk.no” is also a given name, given by someone at NRK, assigned to a web server that delivers you the video you want to watch.

This name is guaranteed to be unique on the web (the public Internet). This is ensured by the Domain Name System (DNS) and rules coordinated by the Internet Corporation for Assigned Names and Numbers (ICANN), a non-profit organization.

Incrementing identifiers

Incrementing identifiers are very common in many database systems for identifying objects. For example, when using an auto increment primary key in a relational database management system, the first created object is assigned the identifier 1, the second 2 etc.
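To make the auto-increment behavior concrete, here is a small sketch using SQLite from the Python standard library. The `article` table and its columns are made up for the example and are not taken from any NRK system.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  title TEXT NOT NULL"
    ")"
)

ids = []
for title in ("First article", "Second article", "Third article"):
    cursor = conn.execute("INSERT INTO article (title) VALUES (?)", (title,))
    ids.append(cursor.lastrowid)  # identifier assigned by the database

print(ids)  # [1, 2, 3]
```

The database, not the caller, hands out the identifier, which is why it is only meaningful inside that one system.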

Take for instance a web page located at the URL https://www.nrk.no/about/a-gigantic-small-broadcaster-1.3698462, which ends with a numeric identifier 3698462. This number is the identifier for the article A Gigantic Small Broadcaster in our publishing platform.

When a user requests this URL from their web browser, our systems pass this number on when talking to each other, so they know which article the browser requested.

As with given names, these sequential identifiers are only unique within the context of the system that provided the identifier.

We have other systems using sequential identifiers, such as Stadnamn, the location service that provides our weather service yr.no with all its location information.

Take for instance the identifier 1-2831398, which identifies the location Hamar OL amfi Nordlyshallen inside the Stadnamn system. The data about this location is provided by The Norwegian Mapping Authority (Norwegian: Kartverket), and their identifier – which is called an SSR ID – is 709466. It actually used to be 1285642, but they changed their identifier system some time ago.

Stadnamn holds information on locations from all over the world, not only those provided by The Norwegian Mapping Authority. Hence we need to keep references to identifiers from several external data sources, so we can load updates from their datasets into ours. Remember what I wrote earlier about changing identifiers being costly and error prone? Well, we still show you the old SSR ID on the yr.no web pages… Sorry!

Incrementing identifiers aren’t limited to sequential numbers. Time is also a good source for incrementing identifiers. Twitter uses it for its Snowflake identifier system. An example is 818115652540125185. It’s a combination of milliseconds since the Twitter epoch, a short numerical identifier for the machine in their computer cluster, and a machine-local sequential identifier.
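The three parts can be pulled apart with a few bit operations. The sketch below assumes the commonly documented Snowflake layout (41 bits of milliseconds, 10 bits of machine identifier, 12 bits of sequence) and the commonly cited Twitter epoch of 1288834974657 Unix milliseconds; neither number comes from this article, so treat them as assumptions.

```python
from datetime import datetime, timezone

# Assumed start of the Snowflake clock, in Unix milliseconds (November 2010).
TWITTER_EPOCH_MS = 1288834974657


def decode_snowflake(snowflake: int) -> dict:
    """Split a Snowflake identifier into timestamp, machine and sequence."""
    timestamp_ms = (snowflake >> 22) + TWITTER_EPOCH_MS
    return {
        "timestamp_ms": timestamp_ms,
        "created": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "machine": (snowflake >> 12) & 0x3FF,  # 10 bits
        "sequence": snowflake & 0xFFF,          # 12 bits
    }


decoded = decode_snowflake(818115652540125185)
print(decoded["created"].isoformat(), decoded["machine"], decoded["sequence"])
```

Because the timestamp sits in the most significant bits, sorting the identifiers numerically also sorts them by creation time.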

The identifiers are sortable by time, and each machine guarantees the uniqueness of the identifiers it assigns. A consequence of using incrementing identifiers is that they grow; every new identifier is larger than the previous one, and this suddenly became a problem for Twitter.

They used to be handled as numerical identifiers, but some programming languages, such as JavaScript, can’t represent numbers as large as the ones Twitter uses exactly, resulting in a loss of precision (i.e. the value is interpreted as a different identifier) or a data parse error. Their solution was to treat the identifiers as strings instead of numbers. This affected many third-party libraries and companies integrating with Twitter services.
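The precision loss is easy to reproduce. JavaScript numbers are 64-bit floating point values, which represent integers exactly only up to 2^53; Python’s `float` is the same kind of value, so the sketch below uses it as a stand-in for what a JavaScript client would see.

```python
snowflake = 818115652540125185

# A 64-bit float cannot hold every integer above 2**53, so the value is rounded.
as_double = float(snowflake)
print(snowflake > 2**53)            # True
print(int(as_double) == snowflake)  # False: it now names a different identifier

# Treated as a string, the identifier survives the round trip unchanged.
as_text = str(snowflake)
print(int(as_text) == snowflake)    # True
```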

Another time-based identifier is the UUID (universally unique identifier) version 1, which looks like 00018f8-f371-11e6-bc64-92361f002671. This example identifier includes a timestamp taken on Wednesday, February 15, 2017 at 12:22:40 CET (24-hour clock, Norwegian time zone).
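Reading the timestamp back out of a version 1 UUID takes only a few lines with Python’s standard `uuid` module. This sketch uses a freshly generated UUID rather than the example above; the `uuid1_time` helper is my own name. Version 1 UUIDs count 100-nanosecond intervals since October 15, 1582, hence the offset constant.

```python
import uuid
from datetime import datetime, timezone

# 100-nanosecond intervals between 1582-10-15 and the Unix epoch (1970-01-01).
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000


def uuid1_time(value: uuid.UUID) -> datetime:
    """Return the creation time embedded in a version 1 UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 UUID")
    unix_seconds = (value.time - GREGORIAN_TO_UNIX_100NS) / 10_000_000
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


fresh = uuid.uuid1()
print(fresh, uuid1_time(fresh).isoformat())
```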

Calculated identifiers

Calculated identifiers are based on the properties of the objects they identify.

In git, the source control system we use at NRK, each commit object is identified by a cryptographic checksum (SHA-1) of the content in that commit.

A receiver of a git commit can verify its authenticity by calculating the checksum of the content. Every commit is chained to a previous commit, so this method is also verifying the validity of all the files in the git branch including its history.
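Git calculates these identifiers by hashing a short header followed by the object’s content. The sketch below reproduces that for a file (“blob”) object, which is simpler to show than a commit; commit objects are hashed the same way, with `commit` in the header and the commit’s text (tree, parents, author, message) as the body. The `git_object_id` helper is my own name.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 of '<kind> <length>' + NUL byte + body, as git does for its objects."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


print(git_object_id("blob", b"hello world\n"))
```

You can compare the result with `git hash-object` on a file holding the same bytes. Anyone holding the content can recompute the identifier, which is what makes verification possible.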

The service we use for publishing images on the web identifies the images using a calculated checksum from the color and position of all pixels encoded in the image.

Let’s examine the image at the URL https://gfx.nrk.no/rofUXy01X1ZIpUBpaDnjrwgcTTug6PP9zDclDnRY0wtw, which embeds a 44 character image identifier. This identifier is a combination of a checksum and crop information.

The checksum part identifies the original image that was uploaded to the image service, and the crop part identifies which area of the original image we want your device to display.

This solution allows us to distribute and cache these images very effectively on the web, because it enables us to use different sections of the same picture on the front page and in articles on different devices. If the author wants a different aspect ratio or size, or wants to zoom in on the image, that results in a new identifier for the new image crop.

With cryptographic checksums as identifiers, when two peers hold the same identifier, they know it’s the same content. And if the identifiers are not equal, they know the content isn’t identical.

Let’s get back to our example of given identifiers.

As mentioned earlier, personal names are not unique in a larger population such as a nation. In Norway, every person has a national identification number (Norwegian: fødselsnummer) which uniquely identifies the person. This is an 11 digit number that starts with a 6 digit birth date and ends with a 5 digit personal number.

This personal ID number is partly calculated from your properties, such as your gender, and it ends with a checksum. One digit indicates a person’s gender, and if that person goes through sex reassignment surgery that digit would be wrong, so the person has to be assigned a new national identification number.
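The checksum part can be sketched in a few lines. The weights below follow the publicly documented modulo 11 scheme for the two check digits as I understand it; they are not taken from this article, so verify them against an official source before relying on them. The number in the example is made up.

```python
# Assumed weights for the two check digits (modulo 11 scheme).
K1_WEIGHTS = (3, 7, 6, 1, 8, 9, 4, 5, 2)
K2_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def _check_digit(digits, weights):
    """Return the check digit, or None when the scheme yields no valid digit."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    digit = (11 - remainder) % 11  # a remainder of 0 gives check digit 0
    return None if digit == 10 else digit


def check_digits(first_nine: str):
    """Calculate the two trailing check digits for the first nine digits."""
    digits = [int(c) for c in first_nine]
    k1 = _check_digit(digits, K1_WEIGHTS)
    if k1 is None:
        return None
    k2 = _check_digit(digits + [k1], K2_WEIGHTS)
    if k2 is None:
        return None
    return f"{k1}{k2}"


def is_valid(number: str) -> bool:
    """True when an 11-digit number ends with the expected check digits."""
    return len(number) == 11 and number.isdigit() and check_digits(number[:9]) == number[9:]


print(check_digits("010203125"), is_valid("01020312560"))
```

A check like this catches most typing errors, but it says nothing about whether the number has actually been assigned to anyone.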

We can look at Norway as a big system of distributed systems, with different government and public services, and companies like banks. When they all use that same number for identifying a person, there are many systems to update when that number changes – a costly but necessary change.

Of course, these identifiers are only unique within Norway, and if we want to identify ourselves across borders we need to bring our passports, which hold more identifiers.

It’s worth mentioning that in Norway the national identification number is considered both an identifier and an authenticator, thus you should handle it with care, especially the 5 digit personal number.

Randomly generated identifiers

Randomness is important in cryptographic algorithms, a requirement in creating secure information systems. The size of a randomly generated key is one factor in how difficult the key is to guess, and thus how difficult it is to open or modify the cryptographically secured content. But very large keys are not efficient for identifying purposes, as they require a lot of computing power to generate.

For identifying purposes we need something smaller and more lightweight. UUID version 4 provides random identifiers, such as b1714e8c-9d27-4d3f-b14e-8c9d279d3f8b, which identifies an archival record in our radio archives.

UUID version 4 has a very low risk of creating duplicate IDs, but there can be no guarantees. The probability for a duplicate increases for every generated identifier, so some validation needs to be applied.
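To get a feeling for how low that risk is, the usual birthday approximation can be used. A version 4 UUID has 122 random bits (the remaining 6 of the 128 bits are fixed), so the chance of at least one duplicate among n identifiers is roughly n(n − 1) / 2 divided by 2^122, as long as that number is small. The function name is my own.

```python
def uuid4_collision_probability(count: int) -> float:
    """Birthday approximation; only accurate while the result is much less than 1."""
    random_bits = 122  # 128 bits minus the fixed version and variant bits
    pairs = count * (count - 1) / 2
    return pairs / 2**random_bits


for count in (10**6, 10**9, 10**12):
    print(f"{count:>16,d} identifiers: {uuid4_collision_probability(count):.3e}")
```

The probability grows with the square of the number of identifiers, which is why a cheap validation step is still worth having.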

In the NRK radio archives this generated identifier is stored in a database with a uniqueness constraint. That way we can ensure uniqueness of this identifier within the context of this system.
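A minimal sketch of that approach, using SQLite so it runs anywhere: the `record` table and the `insert_record` helper are made up for the example, and I don’t claim this is how the radio archives system is implemented.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# PRIMARY KEY implies a uniqueness constraint on the identifier column.
conn.execute("CREATE TABLE record (id TEXT PRIMARY KEY, title TEXT NOT NULL)")


def insert_record(title: str, max_attempts: int = 3) -> str:
    """Insert a record under a random identifier, retrying on a duplicate."""
    for _ in range(max_attempts):
        candidate = str(uuid.uuid4())
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO record (id, title) VALUES (?, ?)",
                    (candidate, title),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # identifier already taken, draw a new one
    raise RuntimeError("could not find an unused identifier")


record_id = insert_record("Radio interview")
print(record_id)
```

The retry loop will almost never run a second time, but the constraint turns "very unlikely" into "cannot happen" within this one database.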

Combinations of identifiers

An object can be assigned more than one identifier. Assigning multiple identifiers to an object will give you more options when referring to that object at a later time.

For example, when you enter a customer loyalty program you might register with your telephone number, and then receive a message on your phone, instructing you to activate your customer account.

You then supply your name, home address and possibly also your bank account number, depending on what kind of program you enter. Then, when you buy something with your debit card, that purchase is registered in the customer program on your account. That transaction uses your bank account number as a lookup identifier.

But when you call their support line they might ask you for your name. A lookup on your name in their system might not result in a single result, so they might also ask you for your telephone number.
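The support-line scenario can be sketched as a lookup that narrows down as more identifiers are supplied. The customers, phone numbers and account numbers are invented, and `find_customers` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    phone: str
    account: str


customers = [
    Customer("Kari Nordmann", "40000001", "0000.11.11111"),
    Customer("Kari Nordmann", "40000002", "0000.22.22222"),
    Customer("Ola Nordmann", "40000003", "0000.33.33333"),
]


def find_customers(**known):
    """Return every customer matching all the identifiers we were given."""
    return [
        c for c in customers
        if all(getattr(c, field) == value for field, value in known.items())
    ]


print(len(find_customers(name="Kari Nordmann")))                    # 2
print(len(find_customers(name="Kari Nordmann", phone="40000002")))  # 1
print(len(find_customers(account="0000.33.33333")))                 # 1
```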

Web identifiers

A URL is an address to a document on the web. It’s also a URI (Uniform Resource Identifier), a standardized way to identify resources and objects. A URI does not guarantee uniqueness by nature, but it contains something called a scheme, which names the protocol that defines the rules for how to interpret the identifier.

The URLs used in my examples have the scheme https, a protocol that most web browsers understand (HTTP/1.1), and thus they are able to retrieve the documents you want from the web. This protocol defines that the authority information can be translated to a server address by using the Domain Name System, ensuring uniqueness to the resources provided by that authority, i.e. to the documents and services on the server.

In the example of “tv.nrk.no”, the URL points to a server that NRK controls.

A URL will also include some system-specific local identifier that identifies a single object within that system. When combined, the authority and the local identifier uniquely identify that content on the web.
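The parts discussed here are easy to see by splitting one of the earlier example URLs with Python’s standard library. Only the generic URI structure is used; which part of the path is the "local identifier" is something only the server knows.

```python
from urllib.parse import urlsplit

url = "https://tv.nrk.no/serie/hurtigruten-minutt-for-minutt/DVFJ67003511/27-12-2011"
parts = urlsplit(url)

print(parts.scheme)  # https: the rules for interpreting the rest
print(parts.netloc)  # tv.nrk.no: the authority, resolved through DNS
print(parts.path)    # the system-specific part, owned by that authority
```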

Pros and cons with the different methods

Manually given identifiers are easy to handle, and short enough for humans to easily exchange, verbally or written. They might include some information, such as the episode number and year of production in our example, providing the system user with hints on what to expect to find behind the identifier.

But there are some downsides to this strategy. They have a time cost, because someone needs to assign them manually. The risk of human error is high when they are transferred in written form or verbally, and it’s easy to create duplicates, or to mistakenly reuse an identifier.

If the identifier includes some information, for example a property of the object it identifies, how do you handle a change in that property? If you change the identifier, how do you handle such a change in connected systems?

Incrementing identifiers

Incrementing identifiers are easy to use and lightweight, and some are relatively easy to exchange verbally, at least until they grow long. They are quite easy to reason about and can be useful when debugging system behavior, for example by extracting a timestamp or seeing whether one object was created before or after another. But that’s also their downside: they can be very predictable and thus susceptible to several kinds of attack, such as denial of service or the German tank problem.
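The German tank problem shows how much sequential identifiers can leak. An outsider who has seen only a handful of identifiers can estimate how many objects exist in total with the classic estimator m + m/k − 1, where m is the largest identifier seen and k is the number of identifiers seen. The sketch assumes the identifiers start at 1 and were observed at random; the sample values are made up.

```python
def estimate_total(observed_ids):
    """Estimate how many sequential identifiers exist from a random sample."""
    k = len(observed_ids)
    m = max(observed_ids)
    return m + m / k - 1


# Four identifiers spotted in, say, public URLs.
print(estimate_total([19, 40, 42, 60]))  # 74.0
```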

Randomly generated identifiers

Randomly generated identifiers solve some of the problems that incrementing identifiers have, because they are harder to predict and attack.

On the downside, generating real randomness can be quite heavy, since it requires a lot of good entropy. Some generators use pseudo-randomness, trading unpredictability (and thus security) for speed, which in some cases is desirable.

Calculated identifiers

Calculated identifiers are very useful in distributed systems, such as distributing images on the web or working on shared code bases. This can be a big saving in Internet networking costs, since one can fetch content from a closely located cache instead of querying the central authority server.

Cryptographic checksums

When using cryptographic checksums, you can trust that the content has not been tampered with, because you can check the validity of the content yourself.

On the downside, calculating checksums of very large content can be slow and require a lot of processing power, which can drain the battery faster on mobile devices. There are no guarantees that a calculated identifier is unique, and SHA-1 has already been shown to be broken: colliding identifiers have been produced for two different contents (Shattered, 2017).

Ensuring uniqueness in identifiers for different content is preferable, as it helps the stability of the identifier. But if the identifier is derived from properties of the object, a change in a property might require a change of the identifier.

Combined identifiers

For combined identifiers, any of the single identifiers might change. For example, some people change their name when getting married, people sometimes switch banks, and people sometimes change their telephone number. But most likely, not all change at once. Taken together, these identifiers bring more stability in identifying the correct object than they do separately.

Identifiers of the web

A very important quality of web URLs as identifiers is that they guarantee uniqueness across distributed systems. They are easy to work with, as they are just strings of characters that every system can handle in the same uniform way, which is a big win for computer systems.

Unfortunately, in order to ensure uniqueness to every object in a distributed system such as the web, many URIs tend to be very long – often too long for humans to exchange verbally.

Let’s assume we bring all our systems onto the web, and have a unified way of uniquely identifying any object in the world. Not all URIs, not even https URLs, have to be resolvable to a document, and that is actually just fine.

How do you represent the concept of a boat? Let’s say that you direct your web browser to a URI identifying the concept of a boat, for example https://example.com/boat, what would you expect to see?

Would you expect to see a picture of a boat or a description of what the term boat means?

According to the World Wide Web Consortium (W3C) note Cool URIs for the Semantic Web from 2008:

There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can’t stand for both a Web document and a real-world object.

What this means, is that a document describing what a boat is has to be located at a different URI than the URI identifying the concept. A web server might send a redirect response to the URI identifying the document describing this concept, for example https://example.com/boat.html. Or it might not send any response at all. Which doesn’t mean that there are no boats (or spoons)…
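The redirect can be sketched as a tiny routing function. The W3C note recommends the HTTP status 303 See Other for this purpose; the paths come from the example above, while the `respond` function and its lookup table are hypothetical.

```python
# Maps URIs identifying concepts to the documents describing them.
DESCRIBED_BY = {"/boat": "/boat.html"}
DOCUMENTS = {"/boat.html": "<h1>Boat</h1><p>A vessel for traveling on water.</p>"}


def respond(path: str):
    """Return (status, headers, body) for a request path."""
    if path in DESCRIBED_BY:
        # The concept itself cannot be sent over the wire, so point to a document about it.
        return 303, {"Location": DESCRIBED_BY[path]}, ""
    if path in DOCUMENTS:
        return 200, {"Content-Type": "text/html"}, DOCUMENTS[path]
    return 404, {}, ""


print(respond("/boat"))
print(respond("/boat.html")[0])
```

The concept and the document about it keep separate identifiers, so statements about one are never confused with statements about the other.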

Exchanging identifiers

I have discussed several methods for providing identifiers, with different pros and cons. Those that provide higher qualities of uniqueness and stability are often difficult for humans to read and exchange, and those that humans find easy are mostly lacking these qualities.

That is within systems. When exchanging information between systems, we look for some particular qualities in the way we reference the objects from other systems.

We need standardized ways to exchange identifiers.

Since not all systems are digitally connected, we sometimes need a physical bearer of the identifier for the objects.

For example, when you buy a box of cereal, that box has a physical identifier tag, so the store can charge you the right amount of money. Identifiers can be encoded and represented in machine-readable form in barcodes, QR codes or RFID chips, to name a few.

We want uniqueness of the identifier, so that there is no ambiguity about which object we are talking about. Identifiers in the form of https URIs offer us the quality of uniqueness.

We also want stability in the identifiers, so the same identifier should reference the same object, always. A URI doesn’t guarantee this by itself, and the more information that is embedded into the identifier, the more difficult it is to ensure stability over time.

What makes a cool URI? A cool URI is one which does not change. What sorts of URI change? URIs don’t change: People change them.

Sir Tim Berners-Lee, Cool URIs don’t change (1999)

Dealing with stability of URIs has always been difficult, as technology changes, servers are reconfigured, and organizations restructure or go out of business.

At NRK we have an identification system, called Ice Age, that provides a way of ensuring stability of identifiers as new systems are created and old systems die. It has a structure which includes the year when the system was introduced. This identification system is slightly inspired by the URI namespace guidelines at W3C (2006).

A namespace provides context to the identifiers.

It might be organized hierarchically, so that a system can delegate authority of a sectioned namespace to another system. A URI namespace is hierarchical, and it’s used as a prefix to a system specific local identifier.

For example, the NRK radio archives system that was created in 2013 has been given the URI namespace http://id.nrk.no/2013/radioarkiv/.

The radio archives system has full authority over this namespace, which means that it can assign any identifier to any object as long as it starts with this namespace, and these identifiers are unique across the Internet. No other systems are allowed to assign identifiers to objects using this prefix.
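In code, a namespace is little more than an agreed prefix. The sketch below mints an identifier by appending a local identifier to the namespace, and checks whether a given URI falls under it. Combining the namespace with the UUID from the earlier example is my own illustration, not a documented identifier, and the helper names are hypothetical.

```python
from urllib.parse import quote

NAMESPACE = "http://id.nrk.no/2013/radioarkiv/"


def mint(local_id: str) -> str:
    """Build a URI identifier from the namespace and a system-local identifier."""
    return NAMESPACE + quote(local_id, safe="")


def in_namespace(uri: str) -> bool:
    """True when the identifier was assigned under this system's authority."""
    return uri.startswith(NAMESPACE)


uri = mint("b1714e8c-9d27-4d3f-b14e-8c9d279d3f8b")
print(uri)
print(in_namespace(uri), in_namespace("http://id.nrk.no/2038/radioarkiv/1"))
```

Receiving systems only need to compare the strings; the prefix check is for the owning system deciding what it is allowed to assign.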

If a new radio archives system gets created in 20 years, then it could be assigned the URI namespace http://id.nrk.no/2038/radioarkiv/.

As a side note, notice that identifiers using this identification system have the URI scheme http instead of https. The reason for that choice was, sadly, just that we didn’t want to spend time and money on buying and managing a cryptographic certificate for that domain.

That’s unfortunate from a security perspective, but let’s not forget that these URIs are primarily used for identifying objects, not as URLs for resolving documents. Today things would have been done differently, as we try to use https everywhere.

Interpreting identifiers

I often come across computer systems that parse identifiers from other systems to extract information, for instance by using regular expressions to extract a number. One common reason I’ve observed for doing so is that one needs a part of the identifier that is specific to an integrated system, in order to communicate with that system.

However, in the definition of an identifier in the beginning of this article, I explained that an identifier is something that gives an identity. An identifier does not provide properties to the object it identifies, so there is actually no reason for parsing it. When referencing, an identifier can be handled as a sequence of characters.

For web identifiers, both a web server and web browser might need to parse the URIs, but they do so differently.

A web server receiving a request for a web resource might need to parse the requested URI in order to know what it means to the system. As mentioned, a URI has an authority part that references the server owning the identifier. The authority server has first-hand knowledge of the structure of its identifiers and knows how to interpret them.

Web browsers, which integrate with web servers, have no knowledge of how the web server is structuring its web resources. They parse the URI identifier according to the URI specification, to figure out which scheme – and thus which protocol – to use for retrieving the web documents.

Conclusion

Naming our pets has been a common practice for a long time. In recent years we have also started injecting small identity chips under their skin, so they can be identified, should they get lost.

Identifiers we assign to things help us communicate better. More and more systems are digitized, and we interact more with digital systems. Some identifiers create bridges between real-life and digital systems, while others are only meant to exist within the digital world.

I could say that you should always use URIs to identify anything, but not all systems are digital systems. Every need is different.

Still, we humans need to use identifiers, for instance when we call somebody by their name to get their attention. When we assign multiple identifiers to an object, we are able to cover both human needs and digital needs.

When two digital systems exchange information, there are no theoretical reasons not to use URIs to identify objects.

A practical obstacle might be the time and money needed to extend the systems to support this, as few systems use URI identifiers internally for their own data.

Often it’s quicker and easier to do customized ad-hoc integrations directly to other systems, but the total cost will depend on how many systems you will need to integrate in total.

Spending a little more time designing each system to support a uniform protocol for information exchange might be more cost effective in the longer run – and it doesn’t have to be URIs and https…

7 comments

  1. Interesting article!

    «Alexa, play NRK P1 on kitchen» results in NRK P1+ being played. If you ask for just NRK, you get an odd regional broadcast. Any thoughts on how this can or should be solved? Is there anything NRK or consumers can do?

    • Kjartan Michalsen (NRK) (reply to Per Magnus)

      Hi Per Magnus!

      Alexa gets its radio channels from TuneIn, and the challenge is that Alexa is not yet very good at understanding Norwegian, especially not our channel names.

      To hear what you want, you can simply open the Alexa app on your phone and choose which channel to listen to under the TuneIn card. You can then hear the same channel next time by saying «Alexa, play tunein», and it plays the most recently used one.

      Otherwise, NRK will probably come with an offering when Alexa actually launches in Norway. We see that it is used a lot for audio consumption in the countries where it has launched. For now, only our Dagsnytt offering is available.

      Kjartan
