1
votes

Please see the Google Doc below:

https://docs.google.com/document/d/1dw6mJW0VxHzD3_h86RgtZwmelBQE8tYGgi41jb1oz-o/edit

I am attempting to load the data into HBase using either MapReduce or ImportTsv, but my main problem is dealing with the photos. I would like to put the photos in a separate column family. How do I go about selecting only the photos and importing them into HBase, given that the photos don't have anything they can be identified by, such as a (text) name?

I thought about using a regex, but some of the districts have a different structure, for instance "Arizona 1" vs. "Alaska at large".

I need to know how to identify the photos specifically, so that they can be distinguished and imported appropriately.

2
Check my answer below and let me know if you found it useful. - hex494D49

2 Answers

1
votes

Given the structure of the document mentioned above, this is the expression you need. It matches every image URL together with its image description.

<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>

Usage in PHP:

$html = '<p>Members of our tim</p><image xlink:href="https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Bradley Byrne.jpg</desc></image><h1>Some big title</h1><p>Something <span>more</span> here</p><image xlink:href="https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Spencer Bachus 113th Congress.jpg</desc></image><h1>TITLE</h1><p>Testing, testing, testing</p><image xlink:href="https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image><p>Last updated on 25th of July, 2014</p>';
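// Capture group 1 is the image URL; group 2 is the file name stored in <desc>.
$pattern = '/<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>/';
// preg_match_all() returns the number of full matches (0 if there are none).
if (preg_match_all($pattern, $html, $matches)) {
  // $matches[0] holds the full matches; $matches[1] and $matches[2] hold the two groups.
  $size_of_matches = count($matches[0]);
  for ($i = 0; $i < $size_of_matches; $i++) {
    echo $matches[1][$i] . " -> " . $matches[2][$i] . "<br />";
  }
}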

Output:

https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo -> Bradley Byrne.jpg
https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms -> Spencer Bachus 113th Congress.jpg
https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s -> Kyrsten Sinema 113th Congress.jpg
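
To get from these URL/description pairs to something ImportTsv can load, one option is to write them out as tab-separated lines. The following is only a rough sketch in Python, not part of the tested PHP above. It assumes the file name without its extension is acceptable as a row key, and the `example.com` URLs and the `photo_rows` helper are placeholders made up for illustration.

```python
import re

# Same expression as in the PHP example, written as a Python raw string
# (no delimiter, so the forward slashes need no escaping).
IMAGE_RE = re.compile(
    r'<image\sxlink:href="(https://[^"\s]+)".*?<title></title><desc>(.+?)</desc></image>'
)


def photo_rows(source):
    """Yield one tab-separated line per image: row key, URL, file name."""
    for url, desc in IMAGE_RE.findall(source):
        # Assumption: the file name minus its extension works as a row key.
        row_key = desc.rsplit(".", 1)[0]
        yield "\t".join([row_key, url, desc])


# Placeholder markup with made-up URLs, shaped like the document source.
sample = (
    '<p>x</p><image xlink:href="https://example.com/photo1" width="100%">'
    '<title></title><desc>Bradley Byrne.jpg</desc></image><h1>t</h1>'
    '<image xlink:href="https://example.com/photo2" width="100%">'
    '<title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image>'
)

for line in photo_rows(sample):
    print(line)
```

A file of such lines could then be loaded with a column mapping along the lines of `-Dimporttsv.columns=HBASE_ROW_KEY,photo:url,photo:filename`, where `photo` is a hypothetical column family name. Note that this stores the image URL and file name, not the image bytes themselves.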
0
votes

I do not have experience with MapReduce or ImportTsv, so I went about this a different way, using C#. As hex494D49 pointed out, the images do have text associated with them. You just have to obtain that data from the document's source (i.e. right-click > View page source).

This code reads in the document's source, attempts to match each politician with an image file (based on the information that was posted), and writes the results to a text file. The code has many examples of the C# flavor of regex. A sample of the output is here.
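
For readers who only want the gist of the matching step, here is a small sketch in Python. It is not the C# code referred to above. It assumes the file names look like the ones shown in the other answer ("Name.jpg" or "Name 113th Congress.jpg"), and both helper names are made up for illustration.

```python
import re


def name_from_filename(filename):
    """Guess a politician's name from a file name like 'Jane Doe 113th Congress.jpg'."""
    stem = filename.rsplit(".", 1)[0]
    # Drop a trailing "<N>th Congress" suffix when present.
    stem = re.sub(r"\s+\d+(?:st|nd|rd|th)\s+Congress$", "", stem)
    return stem.strip()


def match_photos(politicians, filenames):
    """Map each politician to the first file whose guessed name matches, else None."""
    by_name = {}
    for filename in filenames:
        # Compare case-insensitively; keep the first file seen for each name.
        by_name.setdefault(name_from_filename(filename).lower(), filename)
    return {p: by_name.get(p.lower()) for p in politicians}


files = [
    "Bradley Byrne.jpg",
    "Spencer Bachus 113th Congress.jpg",
    "Kyrsten Sinema 113th Congress.jpg",
]
# "Jane Doe" is a hypothetical name with no photo, to show the unmatched case.
for person, photo in match_photos(["Spencer Bachus", "Jane Doe"], files).items():
    print(person, "->", photo)
```

Exact name comparison is the simplest possible rule. File names that use nicknames, middle initials, or a different suffix would fall through to `None` and need manual handling.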