Channel: Scraping XML with JSoup - Stack Overflow

Answer by ollo for Scraping XML with JSoup


There are two problems with that feed:

  1. The parsed document contains only an empty <link /> tag followed by the actual link as loose text, instead of a full link tag.
  2. The description (which contains the price tag) is escaped HTML, which won't get parsed.
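Both symptoms are consistent with the feed being read by an HTML parser, where link is a void element that cannot hold text. As a rough illustration only (this is not part of the original answer), the sketch below reads a hand-written sample item with the JDK's built-in XML parser instead of JSoup. The class name FeedShapeSketch, the sample markup, and the example.com URL are all invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FeedShapeSketch {
    // Hand-written stand-in for one feed entry (assumed shape, not real feed data).
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>#1: Example Product</title>"
        + "<link>http://example.com/dp/EXAMPLE</link>"
        + "<description>&lt;span class=\"price\"&gt;$1.23&lt;/span&gt;</description>"
        + "</item></channel></rss>";

    // Parses the XML string and returns its first <item> element.
    static Element firstItem(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("item").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample feed", e);
        }
    }

    // Returns the trimmed text content of the first <tag> element below parent.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Element item = firstItem(SAMPLE);
        System.out.println(childText(item, "link"));        // link text stays inside <link>
        System.out.println(childText(item, "description")); // entities come back as markup
    }
}
```

Read as XML, the link text stays inside its element and the description comes back as a plain markup string, which would still need an HTML parser to pick out the price.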

Solution:

    import org.apache.commons.lang3.StringEscapeUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    final String url = "http://www.amazon.com/gp/rss/movers-and-shakers/appliances/ref=zg_bsms_appliances_rsslink";
    Document doc = Jsoup.connect(url).get();

    for( Element item : doc.select("item") ) // Select all items
    {
        final String title = item.select("title").first().text(); // select the 'title' of the item
        final String link = item.select("link").first().nextSibling().toString().trim(); // select 'link' (-1-)
        final Document descr = Jsoup.parse(StringEscapeUtils.unescapeHtml4(item.select("description").first().toString()));
        final String price = descr.select("span.price").first().text(); // select 'price' (-2-)

        // Output - Example
        System.out.println(title);
        System.out.println(link);
        System.out.println(price);
        System.out.println();
    }

Note 1: Workaround for the link: select the (empty) link tag and take the text of the next node (a TextNode holding the actual link).
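The tree shape behind Note 1 can be sketched without JSoup. The example below uses the JDK's W3C DOM API as a stand-in, so the calls differ from the JSoup ones in the solution; the class name LinkSiblingSketch, the sample markup, and the example.com URL are assumptions made for illustration. The markup is written by hand to mimic what the note describes: an empty link element followed by a text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LinkSiblingSketch {
    // Assumed stand-in for the parsed tree: empty <link/> followed by loose text.
    static final String PARSED_SHAPE =
        "<item><title>Example</title><link/>http://example.com/dp/EXAMPLE\n</item>";

    // Finds the empty <link/> element and returns the text of the node right after it.
    static String linkAfterEmptyTag(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node link = doc.getElementsByTagName("link").item(0);
            Node next = link.getNextSibling(); // the text node holding the URL
            return next.getNodeValue().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }
}
```

Calling LinkSiblingSketch.linkAfterEmptyTag(LinkSiblingSketch.PARSED_SHAPE) returns the URL without the trailing newline, which corresponds to the nextSibling().toString().trim() chain in the solution.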

Note 2: Workaround for the price: select the description tag, unescape the HTML, parse it, and select the price. For unescaping I used StringEscapeUtils.unescapeHtml4() from Apache Commons Lang.
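To make the two steps in Note 2 concrete, here is a minimal sketch that uses only the JDK. It deliberately swaps in simpler techniques than the solution: unescapeBasic is a hypothetical helper covering just four entities (not a replacement for unescapeHtml4), and a regular expression stands in for parsing with JSoup and selecting span.price. The class name PriceSketch and the sample strings are likewise made up.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceSketch {
    // Matches <span class="price">...</span> and captures the text inside it.
    private static final Pattern PRICE =
        Pattern.compile("<span\\s+class=\"price\"[^>]*>([^<]*)</span>");

    // Simplified stand-in for unescapeHtml4: handles only &lt; &gt; &quot; &amp;.
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Step 1: unescape the description. Step 2: pull out the price, if there is one.
    static Optional<String> extractPrice(String escapedDescription) {
        Matcher m = PRICE.matcher(unescapeBasic(escapedDescription));
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

Returning an Optional makes the missing-price case explicit. In the solution, descr.select("span.price").first() would return null for an item without a price, so a null check before calling text() may be worth adding there.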

Output:
(using link from above)

    #1: Epicurean Gourmet Series 20-Inch-by-15-Inch Cutting Board with Cascade Effect, Nutmeg with Natural Core
    http://www.amazon.com/Epicurean-Gourmet-20-Inch-15-Inch-Cutting/dp/B003MU9PLU/ref=pd_zg_rss_ms_la_appliances_1
    $72.95

    #2: GE 45600 Z-Wave Basic Handheld Remote
    http://www.amazon.com/GE-45600-Z-Wave-Handheld-Remote/dp/B0013V6RW0/ref=pd_zg_rss_ms_la_appliances_2
    $3.00

    #3: First Alert RD1 Radon Gas Test Kit
    http://www.amazon.com/First-Alert-RD1-Radon-Test/dp/B00002N83E/ref=pd_zg_rss_ms_la_appliances_3
    $10.60

    #4: Presto 04820 PopLite Hot Air Popper, White
    http://www.amazon.com/Presto-04820-PopLite-Popper-White/dp/B00006IUWA/ref=pd_zg_rss_ms_la_appliances_4
    $9.99

    #5: New 20 oz Espresso Coffee Milk Frothing Pitcher, Stainless Steel, 18/8 gauge
    http://www.amazon.com/Espresso-Coffee-Frothing-Pitcher-Stainless/dp/B000FNK3Z4/ref=pd_zg_rss_ms_la_appliances_5
    $8.19

    #6: PUR 18 Cup Dispenser with One Pitcher Filter DS-1800Z
    http://www.amazon.com/PUR-Dispenser-Pitcher-Filter-DS-1800Z/dp/B0006MQCA4/ref=pd_zg_rss_ms_la_appliances_6
    $22.17

    #7: Hamilton Beach 70610 500-Watt Food Processor, White
    http://www.amazon.com/Hamilton-Beach-70610-500-Watt-Processor/dp/B000SAOF5S/ref=pd_zg_rss_ms_la_appliances_7
    $21.95

    #8: West Bend 77203 Electric Can Opener, Metallic
    http://www.amazon.com/West-Bend-77203-Electric-Metallic/dp/B00030J1U2/ref=pd_zg_rss_ms_la_appliances_8
    $35.79

    #9: Custom Leathercraft 2077L Black Ski Glove, Large
    http://www.amazon.com/Custom-Leathercraft-2077L-Black-Glove/dp/B00499BS9A/ref=pd_zg_rss_ms_la_appliances_9
    $8.83

    #10: Cuisinart CPC-600 1000-Watt 6-Quart Electric Pressure Cooker, Brushed Stainless and Matte Black
    http://www.amazon.com/Cuisinart-CPC-600-1000-Watt-Electric-Stainless/dp/B000MPA044/ref=pd_zg_rss_ms_la_appliances_10
    $64.95
