Thoughts on Strings in Swift

Okay, I really do understand the plight. Honestly, I do. I mean, I get it! Strings in Swift are not randomly accessible collections of bytes like they are in C, or of 16-bit words like they are in Objective-C or Java. In Swift, a String is a collection of grapheme clusters. So why not let you access them as such, using integer indexes?

In just about every other language out there I can do this:

char *someString = "I like it like that!";
char ch = someString[4];

This will take the fifth character (zero-based arrays) from the string someString and place it in the variable ch. In this example ch would now have the value 'k'.

But in Swift, you can’t do this. If you tried the following:

let s = "I like it like that!"
let ch = s[4]

You would get a compile-time error. Instead, Swift wants you to do the following:

let s = "I like it like that!"
let ch = s[s.index(s.startIndex, offsetBy: 4)]

Yeah… like that's a lot more readable. 🙄

But like I said, I really do understand the problem here. You see my little “eye-rolling” emoji in the previous paragraph? That’s the problem!

Welcome To The Real World 🖕🏼

The problem is that we can't represent all of the languages and symbols in the world within the range of the standard ASCII character set. Even if we use all eight bits of a byte, that still only leaves us with about 224 distinct characters (256 minus the first 32 or so control characters). To support all of the known written languages on this planet we need room for more than a million characters! And that's not counting symbols like punctuation and 😀 (emoji). And let's not forget about diacritics!

Over the years a LOT of different schemes have been developed to try to accommodate all these languages, characters, and symbols: expanding the definition of a character from one byte to two bytes, and code pages, just to name a couple.

Unicode was developed to help standardize the numerical value of all of the characters and symbols in the world, but it didn't really do anything about the need for more than eight bits. After all, the Unicode standard currently requires 21 bits to encode all of the possible code points.
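If you're curious what those code points look like, here's a quick playground sketch using nothing but the standard unicodeScalars view:

let scalar = "😀".unicodeScalars.first!    // the single Unicode scalar behind the emoji
print(scalar.value)                        // 128512
print(String(scalar.value, radix: 16))     // "1f600" -- the max code point, U+10FFFF, needs 21 bits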

UTF-8 to the Rescue

To help with the problem the Unicode Consortium came up with the "Universal Coded Character Set Transformation Format - 8-bit", or simply "UTF-8". It's a method that allows characters to use only as many bytes as they need. This way the letter "a" only takes up one byte, but the Euro symbol "€" takes up three bytes. Some characters take up two bytes, like the symbol for cents "¢", and some take up a full four bytes, like the smiley face emoji "😀".
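You don't have to take my word for those sizes. A string's standard utf8 view is exactly that variable-length byte sequence, so a quick playground loop will show them:

// Count the UTF-8 bytes each character needs.
for s in ["a", "¢", "€", "😀"] {
    print("\(s) takes \(s.utf8.count) byte(s) in UTF-8")
}
// a takes 1 byte(s) in UTF-8
// ¢ takes 2 byte(s) in UTF-8
// € takes 3 byte(s) in UTF-8
// 😀 takes 4 byte(s) in UTF-8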

Yes, I’m Finally Getting To The Problem!

So, let’s see what happens in the C language when I try to access characters by their numerical index.

char *someString = "Hi! 😀 Do you like my smile?";
char ch = someString[6];

So if you wanted to get the seventh character in the string "Hi! 😀 Do you like my smile?", you might try grabbing the character at index 6 (remember, C arrays are zero-based). But you would probably be surprised to find that what you actually got was one of the bytes that make up the emoji 😀.

To see why, let's look at the individual bytes that make up the first seven "characters" of that string.

Index  Character  Byte value
0      H          72
1      i          105
2      !          33
3      (space)    32
4      😀         240
5      (cont.)    159
6      (cont.)    152
7      (cont.)    128
8      (space)    32
9      D          68

The first thing you should notice is that they take up 10 bytes instead of seven. That's because the fifth character (😀) actually takes up four bytes. So, instead of picking out individual bytes, what I really want is to pick out individual "characters" no matter how many bytes each one takes up.
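And you can verify the table yourself. Here's a quick Swift sketch that dumps the same bytes using the string's standard utf8 view:

let someString = "Hi! 😀 Do you like my smile?"
// Dump the first ten UTF-8 bytes -- the same values as in the table above.
for (index, byte) in someString.utf8.prefix(10).enumerated() {
    print(index, byte)   // 0 72, 1 105, 2 33, 3 32, 4 240, ... 9 68
}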

Swift is actually trying to do this, in its own weird way. When I say, in Swift:

let someString = "Hi! 😀 Do you like my smile?"
let ch = someString[someString.index(someString.startIndex, offsetBy: 6)]

What I'm really asking is for it to give me the seventh character from the beginning of the string, which would be the character "D". Down inside the mechanics of the string-handling code, the string may be stored as a series of UTF-8 encoded bytes, or perhaps UTF-16 encoded bytes, or perhaps UTF-32 encoded bytes. We don't know. And we're not supposed to. Whatever the encoding is, the code scans the string, decoding the bytes into characters as it goes along, until it gets to the seventh character.
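This is also why the same string has a different "length" depending on which view of it you ask. All of these are standard views you can try in a playground:

let someString = "Hi! 😀 Do you like my smile?"
print(someString.count)                // 27 characters (grapheme clusters)
print(someString.unicodeScalars.count) // 27 Unicode scalars
print(someString.utf16.count)          // 28 UTF-16 code units (😀 is a surrogate pair)
print(someString.utf8.count)           // 30 UTF-8 bytes (😀 takes four)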

Okay, fine. But why not just let me do that?!?! 🤬

It’s just a step to the left…

But, because Swift is a very flexible language, we can actually teach it to let us do this.

public extension String {
    // Allow subscripting a String with an Int, e.g. someString[4].
    // Walks from startIndex, so each access costs O(n).
    subscript (index: Int) -> Character {
        let actualIndex = self.index(self.startIndex, offsetBy: index)
        return self[actualIndex]
    }
}

I plugged this into an Xcode playground and it works just fine.
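With the extension in place, the original one-liner does what you'd expect:

let someString = "Hi! 😀 Do you like my smile?"
let ch = someString[6]   // "D" -- the seventh character, not the seventh byte

Just keep in mind that each access still walks the string from the start to count grapheme clusters, so it's O(n) per lookup, and an out-of-range index will trap exactly the way index(_:offsetBy:) does.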

By the way, the type "Character" in Swift is actually just an alias for "String". They're one and the same.[1] More on this in my next post about grapheme clusters.

[1] Okay, so, I stand corrected. They're not the same, but they are interchangeable in a LOT of cases and they're tightly coupled! They're almost identical from a definition standpoint, and a String is nothing more than a series of Characters, or mini-Strings. Again, more on this in my next post.
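Here's a quick sketch of that coupling, using only the standard conversions between the two types:

let ch: Character = "😀"      // a Character is a single grapheme cluster
let s = String(ch)            // any Character converts to a String...
let chars = Array("Hi! 😀")   // ...and a String is a sequence of Characters
print(type(of: chars))        // Array<Character>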
