Avoiding a Swift NSRegularExpression Pitfall

Until Swift gets a native regular expression class of its own we are stuck with the Objective-C version – NSRegularExpression. It is very capable and works very well, but it’s important to remember that it operates on Objective-C NSStrings which, unlike Swift’s native Strings, are based on the UTF-16 encoding. That means the values NSRegularExpression returns in its NSRange structures are integer offsets of UTF-16 code units within the NSString, not String.Index values into a Swift String. So if you want to use those values to get the substring a range represents, you have to interpret them through the string’s String.UTF16View so that they make sense. If you don’t, any character that is represented by two or more UTF-16 code units will throw those integer offsets off.
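
As a quick illustration of what that looks like, here is a minimal sketch (the string and the NSRange values are made up purely for illustration) that walks the UTF-16 view to turn an NSRange into a substring:

import Foundation

let s:   String  = "a🇺🇸b"
let rng: NSRange = NSRange(location: 1, length: 4) // the flag occupies 4 UTF-16 code units
let lo = s.utf16.index(s.utf16.startIndex, offsetBy: rng.lowerBound)
let hi = s.utf16.index(lo, offsetBy: rng.length)
print(s[lo ..< hi]) // prints 🇺🇸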

For example, if you were just dealing with plain ASCII text, you might write the following code:

let decl:       String                 = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
let regex:      NSRegularExpression    = try NSRegularExpression(pattern: "^\\<\\?xml\\s+version=\"([^\"]+)\"(?:\\s+encoding=\"([^\"]+)\")?(?:\\s+standalone=\"([^\"]+)\")?\\s*\\?\\>")
let allMatches: [NSTextCheckingResult] = regex.matches(in: decl, range: NSRange(location: 0, length: decl.count))
if !allMatches.isEmpty {
    let aMatch: NSTextCheckingResult = allMatches[0]
    for x in (0 ..< aMatch.numberOfRanges) {
        let rng:  NSRange      = aMatch.range(at: x)
        let idx1: String.Index = decl.index(decl.startIndex, offsetBy: rng.lowerBound)
        let idx2: String.Index = decl.index(decl.startIndex, offsetBy: rng.upperBound)
        let str:  String       = String(decl[idx1 ..< idx2])
        print("Range \(x): \"\(str)\"")
    }
}
else {
    print("No Matches!")
}

This would, indeed, produce the expected output:

Range 0: "<?xml version="1.0" encoding="UTF-8" standalone="yes"?>"
Range 1: "1.0"
Range 2: "UTF-8"
Range 3: "yes"

But what happens if we introduce an emoji into the text? Let’s change the decl string so that it includes an American Flag emoji (🇺🇸) and run the application again.

let decl: String = "<?xml version=\"1.0\" encoding=\"UTF-8🇺🇸\" standalone=\"yes\"?>"

Now, suddenly, we get this output:

No Matches!

The introduction of the American Flag emoji (🇺🇸) caused our code to produce incorrect results. That’s because the emoji is not really one character but rather TWO Unicode scalars: the regional indicator symbol “🇺” (U+1F1FA) followed by the regional indicator symbol “🇸” (U+1F1F8). Because they sit side by side they form what’s called a grapheme cluster. Swift’s String correctly treats that cluster as a single Character – 🇺🇸 – but Objective-C’s NSString doesn’t; it works in terms of UTF-16 code units. To further complicate matters, both 🇺 and 🇸 lie outside the Basic Multilingual Plane, so each one is represented by TWO UTF-16 code units (a surrogate pair) rather than just one.
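
You can see those layers for yourself in a playground. This little snippet (purely illustrative, not part of the example above) prints how the same flag looks from each of Swift’s string views:

let flag = "🇺🇸"
print(flag.count)                // 1 - one Character (one grapheme cluster)
print(flag.unicodeScalars.count) // 2 - two regional-indicator scalars
print(flag.utf16.count)          // 4 - each scalar needs a surrogate pair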

So let’s continue the demonstration by correcting just the call to matches(in:range:) so that we calculate the range from the string’s UTF-16 length. We’re also going to pad the decl string by adding about 10 spaces to the end of it (to keep the mis-calculated indexes from running past the end of the string) and add a couple of debugging lines right after the call.

let decl:       String                 = "<?xml version=\"1.0\" encoding=\"UTF-8🇺🇸\" standalone=\"yes\"?>          "
let regex:      NSRegularExpression    = try NSRegularExpression(pattern: "^\\<\\?xml\\s+version=\"([^\"]+)\"(?:\\s+encoding=\"([^\"]+)\")?(?:\\s+standalone=\"([^\"]+)\")?\\s*\\?\\>")
// The range length is now the string's UTF-16 length rather than its Character count.
let allMatches: [NSTextCheckingResult] = regex.matches(in: decl, range: NSRange(location: 0, length: decl.endIndex.utf16Offset(in: decl)))
print("                         decl.count = \(decl.count)")
print("decl.endIndex.utf16Offset(in: decl) = \(decl.endIndex.utf16Offset(in: decl))")
if !allMatches.isEmpty {
    let aMatch: NSTextCheckingResult = allMatches[0]
    for x in (0 ..< aMatch.numberOfRanges) {
        let rng:  NSRange      = aMatch.range(at: x)
        let idx1: String.Index = decl.index(decl.startIndex, offsetBy: rng.lowerBound) // still wrong: index(_:offsetBy:) counts Characters,
        let idx2: String.Index = decl.index(decl.startIndex, offsetBy: rng.upperBound) // but rng holds UTF-16 code unit offsets
        let str:  String       = String(decl[idx1 ..< idx2])
        print("Range \(x): \"\(str)\"")
    }
}
else {
    print("No Matches!")
}

And now the output:

                         decl.count = 66
decl.endIndex.utf16Offset(in: decl) = 69
Range 0: "<?xml version="1.0" encoding="UTF-8🇺🇸" standalone="yes"?>   "
Range 1: "1.0"
Range 2: "UTF-8🇺🇸" s"
Range 3: ""?>"

As the output demonstrates, the difference between decl.count (the first line of the output) and decl.endIndex.utf16Offset(in: decl) (the second line) is three. Those are the three extra UTF-16 code units I mentioned previously: the American flag emoji (🇺🇸) is two Unicode scalars comprising a total of four UTF-16 code units, but only one Character as far as Swift’s String is concerned. And sure enough, every range boundary that falls after the emoji is off by three – Range 0 picks up three of the trailing spaces, Range 2 runs three characters past the closing quote, and Range 3 lands three characters to the right of where “yes” actually is.
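
Incidentally, decl.endIndex.utf16Offset(in: decl) is simply the UTF-16 length of the string, so the same three-unit difference can be checked directly from the string’s views:

print(decl.count)       // 66 - Characters
print(decl.utf16.count) // 69 - UTF-16 code units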

The code below demonstrates one way (there are actually a few) to correctly get the substring using the values in the NSRange.

let decl:       String                 = "<?xml version=\"1.0\" encoding=\"UTF-8🇺🇸\" standalone=\"yes\"?>          "
let regex:      NSRegularExpression    = try NSRegularExpression(pattern: "^\\<\\?xml\\s+version=\"([^\"]+)\"(?:\\s+encoding=\"([^\"]+)\")?(?:\\s+standalone=\"([^\"]+)\")?\\s*\\?\\>")
let allMatches: [NSTextCheckingResult] = regex.matches(in: decl, range: NSRange(location: 0, length: decl.endIndex.utf16Offset(in: decl)))
print("                         decl.count = \(decl.count)")
print("decl.endIndex.utf16Offset(in: decl) = \(decl.endIndex.utf16Offset(in: decl))")
if !allMatches.isEmpty {
    let aMatch: NSTextCheckingResult = allMatches[0]
    for x in (0 ..< aMatch.numberOfRanges) {
        let rng:  NSRange      = aMatch.range(at: x)
        let idx1: String.Index = String.Index(utf16Offset: rng.lowerBound, in: decl) // interpret the NSRange bounds
        let idx2: String.Index = String.Index(utf16Offset: rng.upperBound, in: decl) // as UTF-16 code unit offsets
        let str:  String       = String(decl[idx1 ..< idx2])
        print("Range \(x): \"\(str)\"")
    }
}
else {
    print("No Matches!")
}

In the two lines that compute idx1 and idx2 we’re using String.Index(utf16Offset:in:) to build the proper indexes from the NSRange values. And you can see that the output of the program is now correct.

                         decl.count = 66
decl.endIndex.utf16Offset(in: decl) = 69
Range 0: "<?xml version="1.0" encoding="UTF-8🇺🇸" standalone="yes"?>"
Range 1: "1.0"
Range 2: "UTF-8🇺🇸"
Range 3: "yes"
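
For completeness, one of those other ways is to let Foundation do the conversions for you. The sketch below (reusing the same decl and regex as above) builds the search range with NSRange(_:in:) and converts each result back with Range(_:in:), which also handles capture groups that did not participate in the match:

let wholeRange = NSRange(decl.startIndex ..< decl.endIndex, in: decl)
for aMatch in regex.matches(in: decl, range: wholeRange) {
    for x in (0 ..< aMatch.numberOfRanges) {
        // Range(_:in:) returns nil when a capture group did not match
        // (its NSRange location is NSNotFound), so skip those.
        guard let rng = Range(aMatch.range(at: x), in: decl) else { continue }
        print("Range \(x): \"\(decl[rng])\"")
    }
}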

I am in the process of creating a “native” Swift regular expression class. I put native in quotes because, for now, it will simply be a wrapper around the NSRegularExpression class that behaves in a more Swift-like manner. Stay tuned for news of that in the coming week.
