
I'm trying to build a CBIR system and recently wrote a program in Python using OpenCV functions that lets me query a local database of images and return a result (I followed this tutorial). I now need to link this up with another web-scraping module (built with Scrapy) that outputs ~1000 links to images online. These images are scattered throughout the web and should be fed into the first OpenCV module. Is it possible to perform calculations on this online image set without downloading it?

These are the steps I followed for the OpenCV module:

1) Define the region-based color image descriptor
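
The descriptor class itself isn't shown above, but a minimal sketch of the kind of region-based HSV color histogram descriptor the tutorial describes could look like the following (the five-region layout and normalization details are my assumptions, not the exact tutorial code):

# colordescriptor.py -- sketch of a region-based HSV histogram descriptor
import numpy as np
import cv2

class ColorDescriptor:
    def __init__(self, bins):
        # number of bins for (hue, saturation, value)
        self.bins = bins

    def describe(self, image):
        # convert to HSV and build the feature vector region by region
        image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
        features = []
        (h, w) = image.shape[:2]
        (cX, cY) = (int(w * 0.5), int(h * 0.5))

        # four corner segments plus an elliptical center region
        segments = [(0, cX, 0, cY), (cX, w, 0, cY),
                    (cX, w, cY, h), (0, cX, cY, h)]
        (axesX, axesY) = (int(w * 0.75) // 2, int(h * 0.75) // 2)
        ellipMask = np.zeros(image.shape[:2], dtype="uint8")
        cv2.ellipse(ellipMask, (cX, cY), (axesX, axesY), 0, 0, 360, 255, -1)

        for (startX, endX, startY, endY) in segments:
            # corner region = rectangle minus the center ellipse
            cornerMask = np.zeros(image.shape[:2], dtype="uint8")
            cv2.rectangle(cornerMask, (startX, startY), (endX, endY), 255, -1)
            cornerMask = cv2.subtract(cornerMask, ellipMask)
            features.extend(self.histogram(image, cornerMask))

        # finally, the center ellipse itself
        features.extend(self.histogram(image, ellipMask))
        return features

    def histogram(self, image, mask):
        # 3D HSV histogram over the masked region, normalized and flattened
        hist = cv2.calcHist([image], [0, 1, 2], mask, self.bins,
                            [0, 180, 0, 256, 0, 256])
        return cv2.normalize(hist, hist).flatten()

With bins = (8, 12, 3), each of the five regions yields an 8 x 12 x 3 = 288-bin histogram, so the full feature vector has 1440 entries.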

2) Extract features from the dataset (indexing); the dataset directory is passed as a command-line argument

# import the necessary packages
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
from colordescriptor import ColorDescriptor
import argparse
import glob
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
  help = "Path to the directory that contains the images to be indexed")
ap.add_argument("-i", "--index", required = True,
  help = "Path to where the computed index will be stored")
args = vars(ap.parse_args())

# initialize the color descriptor
cd = ColorDescriptor((8, 12, 3))
# open the output index file for writing
output = open(args["index"], "w")

# use glob to grab the image paths and loop over them
for imagePath in glob.glob(args["dataset"] + "/*.jpg"):
    # extract the image ID (i.e. the unique filename) from the image
    # path and load the image itself
    imageID = imagePath[imagePath.rfind("/") + 1:]
    image = cv2.imread(imagePath)

    # describe the image
    features = cd.describe(image)

    # write the features to file
    features = [str(f) for f in features]
    output.write("%s,%s\n" % (imageID, ",".join(features)))

# close the index file
output.close()
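
Assuming the indexing script above is saved as index.py (my name for it; only search.py is named below), this step would be run as:

python index.py --dataset dataset --index index.csv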

3) Defining the similarity metric
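
The metric used in the code below is the chi-squared distance between two histograms, d(a, b) = 0.5 * sum_i (a_i - b_i)^2 / (a_i + b_i + eps), where eps is a small constant that avoids division by zero; smaller distances mean more similar images.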

# import the necessary packages
import numpy as np
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import csv

class Searcher:
    def __init__(self, indexPath):
        # store our index path
        self.indexPath = indexPath

    def search(self, queryFeatures, limit = 5):
        # initialize our dictionary of results
        results = {}

        # open the index file for reading
        with open(self.indexPath) as f:
            # initialize the CSV reader
            reader = csv.reader(f)

            # loop over the rows in the index
            for row in reader:
                # parse out the image ID and features, then compute the
                # chi-squared distance between the features in our index
                # and our query features
                features = [float(x) for x in row[1:]]
                d = self.chi2_distance(features, queryFeatures)

                # now that we have the distance between the two feature
                # vectors, we can update the results dictionary -- the
                # key is the current image ID in the index and the
                # value is the distance we just computed, representing
                # how 'similar' the image in the index is to our query
                results[row[0]] = d


        # sort our results so that the smaller distances (i.e. the
        # more relevant images) are at the front of the list
        results = sorted([(v, k) for (k, v) in results.items()])

        # return our (limited) results
        return results[:limit]

    def chi2_distance(self, histA, histB, eps = 1e-10):
        # compute the chi-squared distance
        d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
            for (a, b) in zip(histA, histB)])

        # return the chi-squared distance
        return d


4) Perform the actual search

# import the necessary packages
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
from colordescriptor import ColorDescriptor
from searcher import Searcher
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--index", required = True,
    help = "Path to where the computed index will be stored")
ap.add_argument("-q", "--query", required = True,
    help = "Path to the query image")
ap.add_argument("-r", "--result-path", required = True,
    help = "Path to the result path")
args = vars(ap.parse_args())

# initialize the image descriptor
cd = ColorDescriptor((8, 12, 3))

# load the query image and describe it
query = cv2.imread(args["query"])
features = cd.describe(query)

# perform the search
searcher = Searcher(args["index"])
results = searcher.search(features)

# display the query
cv2.imshow("Query", query)

# loop over the results
for (score, resultID) in results:
    # load the result image and display it
    result = cv2.imread(args["result_path"] + "/" + resultID)
    cv2.imshow("Result", result)
    cv2.waitKey(0)

And the final command-line invocation is:

python search.py --index index.csv --query query.png --result-path dataset

where index.csv is the file generated in step 2 from the database of images, query.png is my query image, and dataset is the folder containing the ~100 images.

So is it possible to modify the indexing so that I don't need a local dataset, and the querying and indexing can be done directly from the list of URLs?
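
For reference, a minimal sketch of what I have in mind, assuming Python 2 (to match the paths above) and a hypothetical helper url_to_image that decodes an image in memory without ever writing it to disk:

# sketch: decode an image fetched from a URL entirely in memory;
# url_to_image is a hypothetical helper, not part of the tutorial code
import urllib2
import numpy as np
import cv2

def url_to_image(url):
    # fetch the raw bytes, wrap them in a NumPy array, and let
    # cv2.imdecode build the image without touching the filesystem
    resp = urllib2.urlopen(url)
    data = np.asarray(bytearray(resp.read()), dtype="uint8")
    return cv2.imdecode(data, cv2.IMREAD_COLOR)

# the indexing loop would then iterate over scraped URLs instead of
# the glob results, e.g.:
#
# for url in urls:
#     image = url_to_image(url)
#     if image is None:
#         continue  # skip URLs that fail to download or decode
#     features = cd.describe(image)

Note that the image bytes still travel over the network either way; this only avoids persisting the files to disk.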

For clarity: you want to scrape websites for images and then run OpenCV on them without downloading them all at once? – rady

Yes exactly. My code scrapes through a few websites and returns URLs for ~1000 images. – Tushar

Isn't storing to a temp file still a download? How do you work on file(s) you don't own? – dsgdfg

This definitely helped: pyimagesearch.com/2015/03/02/… But anyway, I decided not to run the OpenCV module on the URLs, since my dataset is really large. The latency and download time would be too high, so I'm willing to trade disk space for faster searches. – Tushar