
Htmlkit Swift Parsing

Parse the text between elements, e.g. 7:33AM \n
Dinner
\n
\n 12:23

Solution 1:

You can solve this the same way you would in a browser. The problem is not specific to HTMLKit.

Since there is no way to select an HTML text node via CSS, you have to select its parent and then either read the text via the textContent property or walk the parent node's child nodes.
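To make the distinction concrete without any library, here is a minimal sketch using a toy node type. The names Node and directTextChildren are hypothetical and are not part of HTMLKit; the sketch only assumes a dd element that holds a span followed by a bare text node. It shows that textContent joins all descendant text, while the direct child text nodes can be picked out one by one.

```swift
// Toy DOM model for illustration only (hypothetical types, not HTMLKit's API).
enum Node {
    case text(String)
    case element(tag: String, children: [Node])

    // Concatenated text of this node and all of its descendants.
    var textContent: String {
        switch self {
        case .text(let value):
            return value
        case .element(_, let children):
            return children.map { $0.textContent }.joined()
        }
    }

    // Only the text nodes that are direct children of this element.
    var directTextChildren: [String] {
        guard case .element(_, let children) = self else { return [] }
        return children.compactMap { child in
            if case .text(let value) = child { return value }
            return nil
        }
    }
}

// Models: <dd><span>10:00</span>AM</dd>
let dd = Node.element(tag: "dd", children: [
    .element(tag: "span", children: [.text("10:00")]),
    .text("AM")
])

print(dd.textContent)        // 10:00AM
print(dd.directTextChildren) // ["AM"]
```

The "AM" cannot be reached by a selector because it is not an element; it is only reachable through its parent, which is what the options in this answer do.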

So here are some options to solve your problem, using HTMLKit as an example and the following sample DOM:

import HTMLKit

let html = """
<html>
<body>
<dl>
  <dt>Breakfast</dt>
  <dd id="Breakfast"><span>10:00</span>AM</dd>
  <dt>Dinner</dt>
  <dd id="Dinner"><span>12:23</span>PM</dd>
</dl>
</body>
</html>
"""

let doc = HTMLDocument(string: html)
let elements = doc.querySelectorAll("dd")
  • Option 1: Select the dd elements and access the textContent
elements.forEach { ddElement in
  print(ddElement.textContent)
}

// Would produce:
// 10:00AM
// 12:23PM
  • Option 2: Select the dd elements and iterate through their child nodes, filtering out everything except HTMLText nodes. The iterator also accepts an optional custom filter, which is nil here:
elements.forEach { ddElement in
  let iter: HTMLNodeIterator = ddElement.nodeIterator(showOptions: [.text], filter: nil)
  iter.forEach { node in
    let textNode = node as! HTMLText
    print(textNode.textContent)
  }
}

// Would produce:
// 10:00
// AM
// 12:23
// PM
  • Option 3: Expanding on the previous option, you can provide a custom filter for the node iterator:
for dd in elements {
  let iter: HTMLNodeIterator = dd.nodeIterator(showOptions: [.text]) { node in
    // Reject every text node that contains neither "AM" nor "PM"
    if !node.textContent.contains("AM") && !node.textContent.contains("PM") {
      return .reject
    }
    return .accept
  }

  iter.forEach { node in
    let textNode = node as! HTMLText
    print(textNode.textContent)
  }
}

// Would produce:
// AM
// PM
  • Option 4: Wrap the AM and PM in their own <span> elements and access those, e.g. with the dd > span selector:
doc.querySelectorAll("dd > span").forEach { elem in
  print(elem.textContent)
}

// Given the sample DOM, this would produce:
// 10:00
// 12:23
// If you wrap the AM/PM in spans, those would also appear in the output.

Your snippet produces: ["", ""] with the sample DOM from above. Here is why:

let test: [String] = doc.querySelectorAll("span")
  .compactMap { element in
    // element is a <span> HTMLElement
    // However, the element returned here is a <dt> element and not a <span>
    guard let span = doc.querySelector("dt") else {
      return nil
    }
    // The <dt> elements in the DOM do not have IDs, hence an empty string is returned
    return span.elementId
  }

I hope this helps and clarifies some things.
