Getting HTML Elements Via XPath In Bash
I was trying to parse a page (Kaggle Competitions) with XPath on macOS, as described in another SO question: curl 'https://www.kaggle.com/competitions/search?SearchVisibility=AllCom
Solution 1:
Getting HTML elements via XPath in bash, from an HTML file (which is not valid XML):
One possibility may be to use xsltproc (I hope it is available for Mac). xsltproc has an --html option to accept HTML as input, but for that you need an XSLT stylesheet.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>
</xsl:stylesheet>
Notice that the XPath has changed: there is no tbody in the input file.
Call xsltproc:
xsltproc --html test.xsl competitions.html 2> /dev/null
where xsltproc's complaints about errors in the HTML are ignored (sent to /dev/null).
The output is: /c/R
To use a different XPath expression from the command line, you may use an XSLT template and replace the __xpath__ placeholder.
E.g. an XSLT template:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>
And use (e.g.) sed for the replacement:
sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
xsltproc --html test.xsl competitions.html 2> /dev/null
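The stylesheet generation and the xsltproc call can also be wrapped in a small shell function so the XPath is passed as an argument. This is only a minimal sketch of that idea; the function name xpath_html and the use of a temporary stylesheet file are my own choices, not part of the original answer:

xpath_html() {
  # $1 = XPath expression, $2 = HTML file
  local xpath=$1 html=$2 xsl
  xsl=$(mktemp) || return 1
  # Build a one-off stylesheet with the XPath substituted into the template.
  cat > "$xsl" <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="$xpath" />
  </xsl:template>
</xsl:stylesheet>
EOF
  # Run it against the HTML file, discarding xsltproc's parser complaints.
  xsltproc --html "$xsl" "$html" 2> /dev/null
  rm -f "$xsl"
}

Example call:

xpath_html "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html

Since the expression is substituted verbatim into the stylesheet, XPath expressions containing XML-special characters such as & or < would need escaping.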