
HTMLKit Swift Parsing

Parse between elements, e.g. 7:33AM, Dinner, 12:23

Solution 1:

You can solve this the same way you would in any browser; the problem is not specific to HTMLKit.

Since CSS selectors cannot select an HTML text node, you have to select its parent element and then either read the text via the textContent property or walk the parent's child nodes.

So here are some options to solve your problem, using HTMLKit as an example and the following sample DOM:

let html = """
<html>
<body>
<dl>
  <dt>Breakfast</dt>
  <dd id="Breakfast"><span>10:00</span>AM</dd>
  <dt>Dinner</dt>
  <dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
"""

let doc = HTMLDocument(string: html)
let elements = doc.querySelectorAll("dd")
  • Option 1: Select the dd elements and access the textContent
elements.forEach { ddElement in
  print(ddElement.textContent)
}

// Would produce:
// 10:00AM
// 12:23PM
  • Option 2: Select the dd elements and iterate over the nodes in each subtree with a node iterator that shows only HTMLText nodes:
elements.forEach { ddElement in
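  // Visit only the text nodes below this <dd>; no additional filter
  let iter: HTMLNodeIterator = ddElement.nodeIterator(showOptions: [.text], filter: nil)
  iter.forEach { node in
    // The force-cast relies on the iterator yielding only text nodes
    let textNode = node as! HTMLText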
    print(textNode.textContent)
  }
}

// Would produce:
// 10:00
// AM
// 12:23
// PM
  • Option 3: Expanding on the previous option, you can pass a custom filter to the node iterator, here one that accepts only text nodes containing AM or PM:
for dd in elements {
  let iter: HTMLNodeIterator = dd.nodeIterator(showOptions: [.text]) { node in
    // Reject text nodes that contain neither "AM" nor "PM"
    if !node.textContent.contains("AM") && !node.textContent.contains("PM") {
      return .reject
    }
    return .accept
  }

  iter.forEach { node in
    let textNode = node as! HTMLText
    print(textNode.textContent)
  }
}

// Would produce:
// AM
// PM
  • Option 4: Wrap the AM and PM in their own <span> elements and access those, e.g. with the dd > span selector:
doc.querySelectorAll("dd > span").forEach { elem in
   print(elem.textContent)
}

// Given the sample DOM would produce:
// 10:00
// 12:23

// If you wrap the AM/PM in spans, those would also appear in the output
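
If you stay with Option 1 and only have the combined string (e.g. 10:00AM), the AM/PM marker can also be split off afterwards in plain Swift. The following is a minimal sketch; splitMeridiem is a hypothetical helper name and not part of HTMLKit, and it assumes the marker is always an uppercase AM or PM at the end of the text:

```swift
import Foundation

// Hypothetical helper (not part of HTMLKit): splits a string such as
// "10:00AM" into the time and the trailing AM/PM marker.
// Returns nil when the text does not end in "AM" or "PM".
func splitMeridiem(_ text: String) -> (time: String, meridiem: String)? {
  // Text taken from the DOM is often surrounded by whitespace or newlines
  let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
  for marker in ["AM", "PM"] where trimmed.hasSuffix(marker) {
    // Drop the marker and any space that separated it from the time
    let time = trimmed.dropLast(marker.count)
      .trimmingCharacters(in: .whitespaces)
    return (time: time, meridiem: marker)
  }
  return nil
}

for text in ["10:00AM", "12:23PM"] {
  if let parts = splitMeridiem(text) {
    print(parts.time, parts.meridiem)
  }
}

// Would produce:
// 10:00 AM
// 12:23 PM
```

This keeps the DOM query simple, at the cost of relying on the text format rather than on the document structure.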

Your snippet produces: ["", ""] with the sample DOM from above. Here is why:

let test: [String] = doc.querySelectorAll("span")
  .compactMap { element in  // element is a <span> HTMLElement

    // However, this looks up a <dt> element in the document and ignores the <span> passed in
    guard let span = doc.querySelector("dt") else {
        return nil
    }
    // The <dt> elements in the DOM do not have IDs, hence an empty string is returned
    return span.elementId
  }
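
One possible fix, sketched under the assumption that you want either the times or the IDs from the sample DOM: use the element that is passed into the closure instead of querying the document again, and read the IDs from the dd elements, since those are the ones that carry them. The names times and ids are only illustrative:

```swift
import HTMLKit

// Same sample DOM as above, repeated so the snippet stands on its own
let html = """
<dl>
  <dt>Breakfast</dt>
  <dd id="Breakfast"><span>10:00</span>AM</dd>
  <dt>Dinner</dt>
  <dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
"""

let doc = HTMLDocument(string: html)

// Use the <span> handed to the closure rather than re-querying the document
let times: [String] = doc.querySelectorAll("span").map { span in
  span.textContent
}
print(times)

// Would produce:
// ["10:00", "12:23"]

// In the sample DOM the IDs are on the <dd> elements, so select those
let ids: [String] = doc.querySelectorAll("dd").map { dd in
  dd.elementId
}
print(ids)
```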

I hope this helps and clarifies some things.

