[swift unboxed]

Safely unboxing the Swift language & standard library


Swift Substrings

When is a (sub)string not a string? Always and never.

27 November 2017 ∙ Standard Library

Text strings are common enough that programming languages include special features or syntactic sugar to deal with them. Take everyone’s favorite language, C: strings are just arrays of characters, but instead of typing the character array {'h','e','l','l','o','\0'} you can type "hello" and the compiler takes care of it, terminating null included.

Higher level languages such as Swift treat strings as more than just character arrays — they’re full types with all kinds of features. In this first look at strings, we’ll look at a slice of their behavior: substrings.

Strings, Briefly

First, let’s take a whirlwind tour of how strings are implemented. This is from String.swift in the standard library:

public struct String {
  public var _core: _StringCore
}

There are some initializers too, but only a single stored property in the main type declaration! All the good stuff must be in StringCore.swift:

public struct _StringCore {
  public var _baseAddress: UnsafeMutableRawPointer?
  var _countAndFlags: UInt
  public var _owner: AnyObject?
}

There’s a lot more to this type but again, let’s focus on the stored properties:

  • Base address: a raw pointer to the underlying storage.
  • Count: length of the string, stored in the lower (UInt.bitWidth - 2) bits of _countAndFlags. On a 64-bit runtime, that means 62 bits are available, for a maximum string length on the order of 4 × 10^18. That’s about 4 exabytes!
  • Flags: two bits for flags. One bit says whether the string is backed by a Swift native _StringBuffer or an NSString-style Cocoa buffer, and the second bit says whether the buffer stores ASCII or UTF-16.

_StringCore has many more complexities, but this quick look at properties should get us most of the way there: strings have some underlying storage and size.
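
To make the bit-packing idea concrete, here’s a minimal sketch of squeezing a count and two flags into a single UInt. The PackedCountAndFlags type and its member names are made up for illustration, and putting the flags in the top two bits is an assumption rather than a copy of what _StringCore does.

```swift
// Hypothetical illustration only; not the standard library's code.
struct PackedCountAndFlags {
  // Assumption: the two flags live in the top two bits,
  // and the count uses the remaining low bits.
  static let countMask: UInt = UInt.max >> 2
  static let utf16Bit: UInt = 1 << UInt(UInt.bitWidth - 1)
  static let cocoaBit: UInt = 1 << UInt(UInt.bitWidth - 2)

  private(set) var rawValue: UInt

  init(count: Int, isUTF16: Bool, isCocoa: Bool) {
    precondition(count >= 0 && UInt(count) <= PackedCountAndFlags.countMask)
    rawValue = UInt(count)
    if isUTF16 { rawValue |= PackedCountAndFlags.utf16Bit }
    if isCocoa { rawValue |= PackedCountAndFlags.cocoaBit }
  }

  // Mask off the flag bits to recover the count.
  var count: Int { return Int(rawValue & PackedCountAndFlags.countMask) }
  var isUTF16: Bool { return rawValue & PackedCountAndFlags.utf16Bit != 0 }
  var isCocoa: Bool { return rawValue & PackedCountAndFlags.cocoaBit != 0 }
}

let packed = PackedCountAndFlags(count: 12, isUTF16: true, isCocoa: false)
// packed.count == 12, packed.isUTF16 == true, packed.isCocoa == false
```

One word holds everything, which keeps the struct small at the cost of a mask on every read of the count.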

Substrings

How do you construct a substring in Swift? The easiest way is to take a slice of a string via subscript:

let str = "Hello Swift!"

let slice = str[str.startIndex..<str.index(str.startIndex, offsetBy: 5)]
// "Hello"

Well, it’s easy but the code might not look so great 😄.

String indices aren’t plain old integers, thus the dance with startIndex and index(_:offsetBy:). Since we’re starting at startIndex, we could save a little space with a partial range:

let withPartialRange = str[..<str.index(str.startIndex, offsetBy: 5)]
// still "Hello"

Or with the collection slicing methods:

let withPrefix = str.prefix(5)
// still "Hello"

Remember, strings are collections and you can use all the usual collection methods such as prefix(), suffix(), dropFirst(), etc.
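
As a quick sketch of those collection methods in action (the variable names are just for illustration), each call below hands back a Substring rather than a String:

```swift
let greeting = "Hello Swift!"

let firstWord = greeting.prefix(5)      // "Hello"
let lastWord = greeting.suffix(6)       // "Swift!"
let afterHello = greeting.dropFirst(6)  // "Swift!"

// All three results are Substring values, not String.
let isSubstring = type(of: firstWord) == Substring.self  // true
```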

Inside a Substring

Part of the magic of substrings is that they reuse the storage of their “parent” string. You can think of a substring as a base string and a range.

[Figure: Substring is a base string and a range]

That means a 100-character slice of an 8000-character string doesn’t need to allocate and duplicate storage for those 100 characters.

That also means you could be unintentionally extending the lifetime of your base string. If you have the text of an entire novel in a massive string and take a single word as a slice, the big string will stick around as long as the substring is alive.
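
One way to sidestep that, sketched here with a made-up KeywordCache type, is to copy a slice into its own String before storing it anywhere long-lived. Whether the big string’s memory is actually released still depends on what else references it.

```swift
// Hypothetical long-lived cache, for illustration only.
struct KeywordCache {
  // Storing String (not Substring) means the cache doesn't hold on to
  // whatever large text each keyword was sliced from.
  private(set) var keywords: [String] = []

  mutating func rememberFirstWord(of text: String) {
    // prefix(while:) returns a Substring sharing storage with `text`.
    let word = text.prefix(while: { $0 != " " })
    guard !word.isEmpty else { return }
    keywords.append(String(word))  // copy just the word
  }
}

let novel = String(repeating: "whale ", count: 50_000)
var cache = KeywordCache()
cache.rememberFirstWord(of: novel)
// cache.keywords == ["whale"]
```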

What exactly is inside the substring that keeps track of all this?

public struct Substring {
  internal var _slice: RangeReplaceableBidirectionalSlice<String>

The internal _slice property holds all the needed information from the base string:

// Still inside Substring
internal var _wholeString: String {
  return _slice._base
}
public var startIndex: Index { return _slice.startIndex }
public var endIndex: Index { return _slice.endIndex }

The computed properties _wholeString (returning the entire original string), as well as startIndex and endIndex (specifying what part of the string to slice) simply pass through to the underlying slice properties.

You can also see how the slice holds on to the original string with _base.
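
You can observe the “base string plus range” idea from the outside, too: a slice uses the same indices as its base. The TextSlice struct below is a hypothetical stand-in for the mental model, not how Substring is actually declared.

```swift
let base = "Hello Swift!"
let start = base.index(base.startIndex, offsetBy: 6)
let sub = base[start...]                      // "Swift!"

// A slice shares its base collection's indices, so these line up.
let sameStart = sub.startIndex == start       // true
let sameEnd = sub.endIndex == base.endIndex   // true

// Hypothetical stand-in for the mental model: a base string and a range.
struct TextSlice {
  let base: String
  let range: Range<String.Index>
  var text: Substring { return base[range] }
}

let model = TextSlice(base: base, range: start..<base.endIndex)
// model.text == sub
```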

Substring to String

So you have a bunch of substrings hanging around, but your functions expect strings. What to do? Not to worry, you can easily convert a substring into a string:

let string = String(substring)

Since substrings share storage with their base string, creating a new string will presumably create a new bucket of storage. What’s going on inside the String initializer that takes a substring?

extension String {
  public init(_ substring: Substring) {
    // 1
    let x = substring._wholeString
    // 2
    let start = substring.startIndex
    let end = substring.endIndex
    // 3
    let u16 = x._core[start.encodedOffset..<end.encodedOffset]
    // 4A
    if start.samePosition(in: x.unicodeScalars) != nil
    && end.samePosition(in: x.unicodeScalars) != nil {
      self = String(_StringCore(u16))
    }
    // 4B
    else {
      self = String(decoding: u16, as: UTF16.self)
    }
  }
}

  1. Grab a reference to the entire base string.

  2. Get the start and end indices.

  3. Get the UTF-16 representation of the slice. _core is a _StringCore instance, and the encodedOffset properties are UTF-16-friendly indices into the string.

  4. Check whether both indices also correspond to positions in the Unicode scalar view. Branch 4A means neither end of the slice lands in the middle of a surrogate pair (tl;dr: Unicode is hard), so you can instantiate a new string with a _StringCore based directly on the UTF-16 buffer for the slice.

    Otherwise, take branch 4B and re-decode the buffer as UTF-16 to instantiate a string with init(decoding:as:). Either way, you end up with a fresh String instance.
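
To get a feel for the check in step 4, here’s a small sketch that calls samePosition(in:) directly. The sample text and variable names are invented for illustration; the emoji takes two UTF-16 code units, so an index between them has no matching position in the Unicode scalar view.

```swift
let text = "a😀b"          // "a" = 1 UTF-16 unit, "😀" = 2, "b" = 1
let utf16 = text.utf16

// Offset 1 is where the emoji starts: a valid scalar boundary.
let onBoundary = utf16.index(utf16.startIndex, offsetBy: 1)
// Offset 2 sits between the emoji's two surrogates.
let insidePair = utf16.index(utf16.startIndex, offsetBy: 2)

let boundaryIsValid = onBoundary.samePosition(in: text.unicodeScalars) != nil
let insideIsValid = insidePair.samePosition(in: text.unicodeScalars) != nil
// boundaryIsValid == true, insideIsValid == false
```

Indices you get from the string’s own character view always sit on scalar boundaries, so in everyday slicing you’d expect the 4A path.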

It’s easy enough to convert substrings to strings, but is that required? Do you need to wrap every substring in String() when you want to use it? Doesn’t that eat up a lot of the efficiency gain we got from having lightweight substrings in the first place? 🤔

StringProtocol

Enter StringProtocol! In a great example of protocol-oriented programming, StringProtocol abstracts out the functionality of strings such as uppercased(), lowercased(), being comparable, hashable, a collection, etc. Then both String and Substring conform to StringProtocol.

That means you can == a string and a substring without first converting the substring:

let helloSwift = "Hello Swift"
let swift = helloSwift[helloSwift.index(helloSwift.startIndex, offsetBy: 6)...]

// comparing a substring to a string 😱
swift == "Swift"  // true

You can also iterate over substrings, and take substrings of substrings.
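
A short sketch of both, with illustrative variable names:

```swift
let phrase = "Hello Swift"
let lastFive = phrase.suffix(5)     // Substring "Swift"

// Iterating a substring yields Characters, just like a string.
var letters: [Character] = []
for ch in lastFive {
  letters.append(ch)
}

// Slicing a substring gives you another Substring.
let firstTwo = lastFive.prefix(2)   // "Sw"
```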

You’ll find a few places in the standard library that take StringProtocol arguments rather than String. For instance, converting a string to an integer or float uses a generic initializer along the lines of init?<S: StringProtocol>(_ text: S).
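
For example, the pieces that split(separator:) returns are substrings, and you can hand them straight to the numeric initializers. The record text here is invented for illustration.

```swift
let record = "42,3.5"
let fields = record.split(separator: ",")   // [Substring]

// No String(...) wrapping needed before parsing.
let count = Int(fields[0])       // Optional(42)
let ratio = Double(fields[1])    // Optional(3.5)
```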

Maybe your own code doesn’t care whether it’s dealing specifically with a string or a substring. In that case, consider taking a generic StringProtocol argument so callers can pass either one without converting beforehand.
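
Here’s a minimal sketch of such a function. isComment is a hypothetical helper; the only standard library piece it leans on is hasPrefix(_:), which StringProtocol provides.

```swift
// Hypothetical helper: generic over anything conforming to StringProtocol.
func isComment<S: StringProtocol>(_ line: S) -> Bool {
  return line.hasPrefix("//")
}

let fromString = isComment("// note")                    // String argument
let fromSubstring = isComment("x // note".dropFirst(2))  // Substring argument
let notComment = isComment("let x = 1")
// fromString == true, fromSubstring == true, notComment == false
```

The generic parameter is what lets one function accept both types without the caller allocating a new String first.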

The Closing Brace

To summarize:

  • Strings are strings, just as they’ve always been.
  • Substrings are slices of strings. They share a storage buffer with their base string and keep a start-end index range into it.
  • StringProtocol factors out what a string-like thing is, and how to access its features, into a protocol. Both String and Substring conform to StringProtocol.

[Figure: Substring and String conform to StringProtocol]

So who’s ready to code up their own custom string type? Make your own type that conforms to StringProtocol and join the string party? 🎊

/// Do not declare new conformances to `StringProtocol`. Only the `String` and
/// `Substring` types in the standard library are valid conforming types.
public protocol StringProtocol

Oh well. Let’s all make our own custom boolean types instead? 😜
