
I have 2 questions:

I have a big file of records, a few million of them. I need to transfer this file from one machine to a Hadoop cluster machine. I guess there is no scp command in Hadoop (or is there?). How do I transfer files to the Hadoop machine?

Also, once the file is on my Hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head start.

This is the very first time I am working with Hadoop/Pig, so please pardon me if this is a "too basic" question.

EDIT 1

I tried what Jagaran suggested and I got the following error:

2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
    <QUOTEDSTRING> ...

Also, please note that I want to search for the string anywhere in the record, so I am reading the tab-separated record as one single column:

A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);

Comment: Copy into HDFS: stackoverflow.com/q/1533330/179529. Pig is not meant for search; it is used for scanning a lot of data for manipulation (ETL). – Guy

3 Answers

2 votes

For your first question, I think Guy has already answered it. As for the second question: if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:

-- the jar name is illustrative; package the Contains UDF below and REGISTER it first
REGISTER contains.jar;
A = load '/user/abc/' using PigStorage(',') AS (Y:chararray);
B = filter A by Contains(Y, 'XYZTechnologies');
store B into 'output' using PigStorage();

keep in mind that PigStorage's default delimiter is tab, so pick a delimiter that does not appear in your file. Then you should write a UDF named Contains that returns a boolean, something like:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns true when the first argument contains the second as a substring.
public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        return input.get(0).toString().contains(input.get(1).toString());
    }
}

I didn't test this, but this is the direction I would have tried.
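
Since this is untested, one quick way to sanity-check the UDF off the cluster is to call exec directly; the ContainsTest class and the sample strings below are my own illustration, not part of the original answer:

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ContainsTest {
    public static void main(String[] args) throws Exception {
        Contains udf = new Contains();
        // build a two-field tuple: (record text, search string)
        Tuple t = TupleFactory.getInstance().newTuple(2);
        t.set(0, "abc\tXYZTechnologies\tdef");
        t.set(1, "XYZTechnologies");
        System.out.println(udf.exec(t)); // prints: true
    }
}

If this prints true, the substring check behaves as expected before you build and REGISTER the jar.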

1 vote

For copying to Hadoop:

1. You can install the Hadoop client on the other machine and then run hadoop dfs -copyFromLocal <local-file> <hdfs-dir> from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop, as sketched below.
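
A minimal sketch of the second option, assuming the Hadoop configuration files (core-site.xml, etc.) are on the classpath so FileSystem.get can find the cluster; the class name and paths are placeholders of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // picks up fs.default.name / fs.defaultFS from the config on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // copies the local file into HDFS; the source file stays in place
        fs.copyFromLocalFile(new Path("/local/records/part-00000"),
                             new Path("/user/abc/part-00000"));
        fs.close();
    }
}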

For Pig, assuming you know that field 2 may contain XYZTechnologies:

A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = Filter A by Y matches '.*XYZTechnologies.*';
STORE B into '<output-hadoop-dir>' using PigStorage();

0 votes

Hi, you can use hadoop fs -text together with grep to find a specific string in the file. E.g., my file contains some data as follows:

Hi myself xyz. i like hadoop. hadoop is good. i am practicing.

So the Hadoop command is:

hadoop fs -text 'file name with path' | grep 'string to be found out'

Pig shell:

-- load the file data into a Pig relation
data = LOAD 'file with path' USING PigStorage() AS (text:chararray);

-- find the required text
txt = FILTER data BY ($0 MATCHES '.*string to be found out.*');

-- display the data
DUMP txt; -- or use ILLUSTRATE txt;

-- store it in another file
STORE txt INTO 'path' USING PigStorage();