HTMLKit Swift Parsing
Parse between elements, e.g. 7:33AM \n Dinner \n \n 12:23
Solution 1:
You can solve this the same way you would in any browser; the problem is not specific to HTMLKit.
Since there is no way to select an HTML text node via CSS, you have to select its parent and then either read the text via the textContent property or walk the parent node's child nodes.
So here are some options to solve your problem, using HTMLKit as an example and the following sample DOM:
import HTMLKit

let html = """
<html>
<body>
<dl>
<dt>Breakfast</dt>
<dd id="Breakfast"><span>10:00</span>AM</dd>
<dt>Dinner</dt>
<dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
"""
let doc = HTMLDocument(string: html)
let elements = doc.querySelectorAll("dd")
- Option 1: Select the dd elements and access their textContent:
elements.forEach { ddElement in
    print(ddElement.textContent)
}
// Would produce:
// 10:00AM
// 12:23PM
- Option 2: Select the dd elements and iterate through the nodes beneath them, filtering out everything except HTMLText nodes. No custom filter is supplied here (filter: nil):
elements.forEach { ddElement in
    let iter: HTMLNodeIterator = ddElement.nodeIterator(showOptions: [.text], filter: nil)
    iter.forEach { node in
        // Only text nodes are shown, so the forced cast is safe here
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// 10:00
// AM
// 12:23
// PM
- Option 3: Expanding on the previous option, you can provide a custom filter for the node iterator:
for dd in elements {
    let iter: HTMLNodeIterator = dd.nodeIterator(showOptions: [.text]) { node in
        // Keep only text nodes that contain AM or PM
        if !node.textContent.contains("AM") && !node.textContent.contains("PM") {
            return .reject
        }
        return .accept
    }
    iter.forEach { node in
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// AM
// PM
- Option 4: Wrap the AM and PM in their own <span> elements and access those, e.g. with the dd > span selector:
doc.querySelectorAll("dd > span").forEach { elem in
    print(elem.textContent)
}
// Given the sample DOM, this would produce:
// 10:00
// 12:23
// If you wrap AM/PM in spans, those would also appear in the output.
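The child-node route mentioned at the top can also be sketched directly, which keeps the time and the AM/PM suffix as separate values per dd. This is only a minimal sketch: it assumes HTMLKit exposes childNodes as an NSOrderedSet (hence the .array and the cast), and rows with its tuple labels is a name made up for this illustration.

```swift
import HTMLKit

let doc = HTMLDocument(string: """
<html>
<body>
<dl>
<dt>Breakfast</dt>
<dd id="Breakfast"><span>10:00</span>AM</dd>
<dt>Dinner</dt>
<dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
""")

// One entry per <dd>: its id, the text of its <span>, and the text
// that sits directly inside the <dd> (not inside the <span>).
let rows = doc.querySelectorAll("dd").map { dd -> (id: String, time: String, suffix: String) in
    // Assumption: childNodes is an NSOrderedSet of nodes, so cast each entry.
    let directText = dd.childNodes.array
        .compactMap { $0 as? HTMLText }
        .map { $0.textContent }
        .joined()
    let time = dd.querySelector("span")?.textContent ?? ""
    return (id: dd.elementId, time: time, suffix: directText)
}

for row in rows {
    print(row.id, row.time, row.suffix)
}
```

Unlike the node iterator, this looks only at the direct children of each dd, so text nested inside the span is not mixed in with the suffix.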
Your snippet produces ["", ""] with the sample DOM from above. Here is why:
let test: [String] = doc.querySelectorAll("span")
    .compactMap { element in // element is a <span> HTMLElement
        // However, this ignores `element` and re-queries the whole document,
        // so `span` is always the first <dt> element, not a <span>
        guard let span = doc.querySelector("dt") else {
            return nil
        }
        // The <dt> elements in the DOM do not have IDs, hence an empty string is returned
        return span.elementId
    }
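If the goal was to collect the id of the dd that encloses each span (an assumption, since the intent of the snippet is not stated), one possible fix is to walk up from each matched span via parentElement instead of re-querying the document. A minimal sketch under that assumption; ids is an illustrative name:

```swift
import HTMLKit

let doc = HTMLDocument(string: """
<html>
<body>
<dl>
<dt>Breakfast</dt>
<dd id="Breakfast"><span>10:00</span>AM</dd>
<dt>Dinner</dt>
<dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
""")

let ids: [String] = doc.querySelectorAll("span").compactMap { span in
    // Use the matched <span> itself and step up to its enclosing element.
    // Spans whose parent has no id are skipped instead of yielding "".
    guard let parent = span.parentElement, !parent.elementId.isEmpty else {
        return nil
    }
    return parent.elementId
}

print(ids)
```

With the sample DOM, each span sits directly inside a dd that carries an id, so the result should contain one id per span.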
I hope this helps and clarifies some things.