Getting Html Elements Via Xpath In Bash

I was trying to parse a page (Kaggle Competitions) with XPath on macOS, as described in another SO question: curl 'https://www.kaggle.com/competitions/search?SearchVisibility=AllCom

Solution 1:

Getting HTML elements via XPath in bash

From an HTML file (that is not valid XML)

One possibility is to use xsltproc (I hope it is available on macOS). xsltproc has an --html option that parses the input as HTML instead of XML. With this approach you need an XSLT stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>
</xsl:stylesheet>

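Notice that the XPath has changed: there is no tbody in the input file (browsers add tbody to the DOM, so an XPath copied from the browser's developer tools usually contains it). Call xsltproc: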

xsltproc --html test.xsl competitions.html 2> /dev/null

Here the errors that xsltproc reports about the HTML are ignored (sent to /dev/null).

The output is: /c/R

To use a different XPath expression from the command line, you can use an XSLT template and replace the __xpath__ placeholder.

For example, the XSLT template (test.xslt.tmpl):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

Then use, for example, sed for the replacement:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html test.xsl competitions.html 2> /dev/null
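
The sed substitution uses a comma as its delimiter, so it breaks as soon as the XPath itself contains a comma (for example in contains(@class,'x')). A minimal sketch of one way around that is to generate the stylesheet from a shell function instead of a template file. The function names make_xsl and xpath_html are hypothetical and not part of the original answer; the sketch assumes bash, sed, mktemp and xsltproc are on the PATH.

```shell
#!/usr/bin/env bash

# make_xsl (hypothetical helper): print a stylesheet that outputs the
# string value of the XPath expression given as $1.
make_xsl() {
  local xpath
  # Escape the characters that are special inside a double-quoted XML attribute.
  xpath=$(printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/"/\&quot;/g')
  cat <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="$xpath" />
  </xsl:template>
</xsl:stylesheet>
EOF
}

# xpath_html (hypothetical helper): evaluate XPath $1 against HTML file $2.
xpath_html() {
  local xsl status
  xsl=$(mktemp "${TMPDIR:-/tmp}/xpath.XXXXXX") || return 1
  make_xsl "$1" > "$xsl"
  # As above, HTML parser complaints on stderr are discarded.
  xsltproc --html "$xsl" "$2" 2> /dev/null
  status=$?
  rm -f "$xsl"
  return "$status"
}
```

After sourcing these functions, the call xpath_html "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html should behave like the two-step sed and xsltproc invocation, without leaving test.xsl behind.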
