Jsoup Not Downloading Entire Page
The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm I want to extract all the elements using Jsoup. The code I am
Solution 1:
Update: the cause is the default maxBodySize of Jsoup's Connection. Nothing is wrong with the selector engine. By default, Jsoup.connect(url).get() stops reading the response body at a fixed limit (1MB here), so a larger page is silently truncated and the elements past the cut-off never reach the parser. Raise the limit with maxBodySize(...), or pass 0 for unlimited size.

My original answer blamed Jsoup's internal HTTP connection handling and recommended replacing it with HttpClient - http://hc.apache.org/ . That is not needed to fix this problem, but I have kept the HttpClient code as a sample.
Output of the program
- load from file= 1452
- load from http client= 1452
- load from jsoup connect= 1350
- load from jsoup connect using maxBodySize= 1452
package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

    /**
     * Counts the "tr_normal" rows four ways: from a saved copy of the page,
     * via HttpClient, via Jsoup with its default body limit, and via Jsoup
     * with a raised limit.
     *
     * @param args unused
     * @throws IOException if a page cannot be loaded
     */
    public static void main(String[] args) throws IOException {
        // 1. Parse a local copy of the page (the reference count).
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        // 2. Download with HttpClient and let Jsoup parse the stream.
        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";

        // 3. Download with Jsoup's default limit: the body is truncated.
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        // 4. Download with a raised limit: the whole page is read.
        int maxBodySize = 2048000; // 2MB (default is 1MB), 0 for unlimited size
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    /** Fetches the page with Apache HttpClient and returns the raw body stream. */
    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    /** Opens the saved copy of the page (eisdeqty_pf.htm) from the classpath. */
    public static InputStream loadContentFromClasspath() {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }
}
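To see why a body-size cap lowers the element count rather than raising an error, here is a rough, standard-library-only sketch. It is not Jsoup's implementation: the class BodyCapDemo, its methods, the synthetic page, and the 20000-byte cap are all made up for illustration. It only mimics the idea of reading at most N bytes (with 0 meaning unlimited, as in the answer above) and then counting whatever rows survived.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it imitates a body-size cap, it does not use Jsoup.
class BodyCapDemo {

    /** Reads at most maxBytes from the stream; maxBytes == 0 means no limit. */
    static byte[] readCapped(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int remaining = maxBytes;
        try {
            while (true) {
                int want = (maxBytes == 0) ? buf.length : Math.min(buf.length, remaining);
                if (want == 0) {
                    break; // cap reached: the rest of the body is dropped silently
                }
                int n = in.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Counts non-overlapping occurrences of marker in text. */
    static int countOccurrences(String text, String marker) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    /** Builds a synthetic table with the given number of tr_normal rows. */
    static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    /** Reads the page through the cap and counts the rows that survived. */
    static int rowsSeen(String page, int maxBytes) {
        byte[] raw = page.getBytes(StandardCharsets.UTF_8);
        byte[] body = readCapped(new ByteArrayInputStream(raw), maxBytes);
        String text = new String(body, StandardCharsets.UTF_8);
        return countOccurrences(text, "class=\"tr_normal\"");
    }

    public static void main(String[] args) {
        String page = buildPage(1452);
        System.out.println("rows with 20000-byte cap = " + rowsSeen(page, 20000));
        System.out.println("rows with no cap (0)     = " + rowsSeen(page, 0));
    }
}
```

The capped read reports fewer rows than the uncapped one without any exception being thrown. That matches the symptom in the output above, where the default Jsoup connection finds 1350 rows and the other three loads find 1452.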