
Java Equivalent To PHP Simple HTML DOM Parser

Since I have to multithread, which I cannot solve elegantly in PHP, I would like to program in Java. Unfortunately, I could not find a library which will allow me to parse a HTML

Solution 1:

I went from Simple HTML DOM Parser to JSoup and I'm quite happy with it.


Solution 2:

I can see that we have two challenges here:

  • Parsing HTML that might not be well-formed XHTML, which would be easy and nice to parse. I'd recommend the TagSoup library, which can read ugly HTML and produce a well-formed SAX event stream that can then be used elsewhere.

  • Building a DOM representation of the HTML document and working with it. As you probably know, the JDK includes a full-blown implementation of XML DOM (org.w3c.dom.*), but I guess this is not the type of API you've been looking for. What about DOM4J or the older JDOM, which can wrap a JDK Document and give you an easy-to-use API?
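To make the second point concrete, here is a minimal sketch of the JDK's built-in DOM (org.w3c.dom.*) combined with the JDK's XPath support. It assumes the input is already well-formed XHTML; real-world HTML would first need to go through something like TagSoup. The class name JdkDomExample and the helper extractLinks are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class JdkDomExample {

    // Hypothetical helper: returns the href value of every <a> element.
    // Assumes the input is well-formed XHTML without namespaces or a DOCTYPE.
    static List<String> extractLinks(String xhtml) throws Exception {
        // Parse the string into a JDK DOM Document.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select all href attributes of <a> elements, in document order.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            result.add(hrefs.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(extractLinks(page)); // prints [a.html, b.html]
    }
}
```

This works, but the casts and index-based loops are probably what makes the plain JDK API feel clumsy next to Simple HTML DOM Parser, which is why a wrapper such as DOM4J or JDOM may be more pleasant.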


Solution 3:

I've successfully used TagSoup as a SAX parser to populate DOM4J Documents which I then query with XPath. It took me a while to work out the incantations - (Scala, but I'm sure that you can convert):

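import java.io.StringReader
import org.dom4j.io.SAXReader
import org.xml.sax.InputSource
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl // TagSoup's JAXP SAX parser factory
val reader = new SAXReader(parserFactory.newSAXParser.getXMLReader) // DOM4J reader backed by TagSoup
val doc = reader.read(new InputSource(new StringReader(page))) // page: the HTML source as a String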
