Htmlkit Swift Parsing
Parse the text between elements, e.g. 7:33AM \n Dinner \n \n 12:23
Solution 1:
You can solve this the same way you would in any browser; the problem is not specific to HTMLKit.
Since there is no way to select an HTML text node via CSS, you have to select its parent and then either read the text via its textContent property or walk the parent node's child nodes.
So here are some options to solve your problem, using HTMLKit as an example and the following sample DOM:
import HTMLKit

let html = """
<html>
<body>
<dl>
<dt>Breakfast</dt>
<dd id="Breakfast"><span>10:00</span>AM</dd>
<dt>Dinner</dt>
<dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
"""
let doc = HTMLDocument(string: html)
let elements = doc.querySelectorAll("dd")
- Option 1: Select the dd elements and access their textContent:
elements.forEach { ddElement in
    print(ddElement.textContent)
}
// Would produce:
// 10:00AM
// 12:23PM
- Option 2: Select the dd elements and iterate through their child nodes, filtering out everything except HTMLText nodes. Additionally, you can provide your own custom filter:
elements.forEach { ddElement in
    let iter: HTMLNodeIterator = ddElement.nodeIterator(showOptions: [.text], filter: nil)
    iter.forEach { node in
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// 10:00
// AM
// 12:23
// PM
- Option 3: Expanding on the previous option, you can provide a custom filter for the node iterator:
for dd in elements {
    let iter: HTMLNodeIterator = dd.nodeIterator(showOptions: [.text]) { node in
        if !node.textContent.contains("AM") && !node.textContent.contains("PM") {
            return .reject
        }
        return .accept
    }
    iter.forEach { node in
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// AM
// PM
- Option 4: Wrap the AM and PM in their own <span> elements and access those, e.g. with the dd > span selector:
doc.querySelectorAll("dd > span").forEach { elem in
    print(elem.textContent)
}
// Given the sample DOM would produce:
// 10:00
// 12:23
// If you wrap the AM/PM in spans, you would also get those in the output.
Your snippet produces ["", ""] with the sample DOM from above. Here is why:
let test: [String] = doc.querySelectorAll("span")
    .compactMap { element in
        // element is a <span> HTMLElement
        // However, the elements returned here are <dt> elements and not <span>
        guard let span = doc.querySelector("dt") else {
            return nil
        }
        // The <dt> elements in the DOM do not have IDs, hence an empty string is returned
        return span.elementId
    }
I hope this helps and clarifies some things.