1
votes

It's perfectly easy to download all images from a website using wget.

But I need this feature on client-side, best would be in Java.

I know wget's source is available online, but I don't know any C and the source is quite complex. Of course, wget also has many other features, which "blow up the source" for me.

Java has a built-in HttpClient, but I don't know how sophisticated wget really is. Could you tell me whether it is hard to re-implement the "download all images recursively" feature in Java?

How is this done, exactly? Does wget fetch the HTML source code of the given URL, extract all URLs with the given file endings (.jpg, .png) from the HTML and download them? Does it also search for images in the stylesheets that are linked in that HTML document?

How would you do this? Would you use regular expressions to search for (both relative and absolute) image URLs within the HTML document and let HttpClient download each of them? Or is there already some Java library that does something similar?

3
You might want to take a look at Jerry. It provides jQuery-like selectors for HTML documents and might help you find all of the images to download. - Christian Trimble
If you are familiar with wget, why don't you use wget in Java? I mean, write a simple Java class that invokes a script which in turn contains your wget call! - Krishna
@Krishna: I'm implementing this task for two programs, one that runs on Android and one on Windows, where I don't have access to wget, unfortunately. This is why I need a pure Java solution, without calling any external programs. - caw
@C.Trimble: Thanks, Jerry definitely looks cool and is good to have in the toolbox :) - caw

3 Answers

2
votes

In Java you could use the jsoup library to parse any web page and extract anything you want, for example all <img> elements via its CSS-style selectors.

0
votes

For me, crawler4j was the open-source library of choice to recursively crawl (and replicate) a site; it also supports CSS URL crawling. Their QuickStart example looks like this:

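import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {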

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
     @Override
     public boolean shouldVisit(Page referringPage, WebURL url) {
         String href = url.getURL().toLowerCase();
         return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
     }

     /**
      * This function is called when a page is fetched and ready
      * to be processed by your program.
      */
     @Override
     public void visit(Page page) {
         String url = page.getWebURL().getURL();
         System.out.println("URL: " + url);

         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             String text = htmlParseData.getText();
             String html = htmlParseData.getHtml();
             Set<WebURL> links = htmlParseData.getOutgoingUrls();

             System.out.println("Text length: " + text.length());
             System.out.println("Html length: " + html.length());
             System.out.println("Number of outgoing links: " + links.size());
         }
     }
}

More webcrawlers and HTML parsers can be found here.
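
On the stylesheet part of the question: as a rough, library-free illustration of what picking images out of CSS involves (this is not how crawler4j implements it), one could scan the stylesheet text for url(...) references and resolve them against the stylesheet's own URL. The class name CssImageUrls, the regular expression, and the list of image endings are assumptions made for this sketch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds image URLs referenced via url(...) in CSS text. */
final class CssImageUrls {

    // url(...) with optional single or double quotes around the target.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*([\"']?)([^\"')]+?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // File endings treated as images in this sketch (an assumption; extend as needed).
    private static final Pattern IMAGE_EXT = Pattern.compile(
            ".*\\.(png|jpe?g|gif)$", Pattern.CASE_INSENSITIVE);

    /** Returns absolute image URLs, resolved against the stylesheet's own URL. */
    static List<String> extract(String css, String stylesheetUrl) {
        URI base = URI.create(stylesheetUrl);
        List<String> result = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            try {
                URI resolved = base.resolve(target);
                // getPath() is null for opaque URIs such as data:..., which are skipped.
                String path = resolved.getPath();
                if (path != null && IMAGE_EXT.matcher(path).matches()) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Target is not a valid URI; ignored in this sketch.
            }
        }
        return result;
    }
}
```

Relative references in CSS are resolved against the stylesheet's URL, not the URL of the page that links it, which is why the second argument should be the address the CSS was fetched from.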

-1
votes

I found this program, which downloads images. It is open source.

You could find the images on a website via its <IMG> tags. The following question might help you: "Get all Images from WebPage Program | Java".
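
A minimal sketch of the <IMG> tag approach using only the standard library, assuming the page's HTML is already available as a String. The class ImgSrcExtractor and its methods are hypothetical names, and the regular expression only approximates HTML parsing, so a real parser would likely be more robust on messy markup. The download helper relies on java.nio.file, which may not be available on older Android versions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects image URLs from <img> tags in an HTML string. */
final class ImgSrcExtractor {

    // Matches src="..." or src='...' inside an <img ...> tag, case-insensitively.
    // Requiring whitespace before "src" avoids matching attributes like data-src.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns absolute image URLs, resolving relative ones against pageUrl. */
    static List<String> extract(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2).trim();
            if (src.isEmpty() || src.startsWith("data:")) {
                continue; // skip empty values and inline (data URI) images
            }
            try {
                result.add(base.resolve(src).toString());
            } catch (IllegalArgumentException e) {
                // src is not a valid URI (e.g. contains spaces); skipped in this sketch
            }
        }
        return result;
    }

    /** Downloads one URL into targetDir, naming the file after the last path segment. */
    static Path download(String url, Path targetDir) throws IOException {
        URI uri = URI.create(url);
        String path = uri.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            name = "image"; // fallback name, an arbitrary choice for this sketch
        }
        Path target = targetDir.resolve(name);
        try (InputStream in = uri.toURL().openStream()) {
            // Overwrites an existing file of the same name.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

One would call extract(html, pageUrl) for each fetched page and pass every returned URL to download. Making this recursive would additionally require collecting the page's links, restricting them to the same site, and keeping a set of already visited URLs.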