1
votes

I'm trying to extract data from multiple files using a custom CSV extractor, filtering which files are read based on the content of another file. For example, Files.txt contains:

file1
file4

Directories structure

/file1/file.txt
/file2/file.txt
/file3/file.txt
/file4/file.txt

I've extracted the content of Files.txt into the rowset @files and the files in the directories into the rowset @filesDirectory.

My problem is that when I join @filesDirectory with @files, all files are read, no matter which files are listed in Files.txt. I only want to read the files listed there. If I specify the file directly (without joining the two rowsets), it works. Any help?


Here is the query:

DECLARE @input string = @"/{dirname}/file.txt";
DECLARE @filterFile string = @"/fileFilter.txt";

@inputData =
    EXTRACT
        dirname string,
        content string
    FROM @input
    USING Extractors.Text(delimiter : '\n', quoting : false);

@inputFilter =
    EXTRACT
        directories string
    FROM @filterFile
    USING Extractors.Text();

@result = SELECT * FROM @inputData AS id
            LEFT JOIN @inputFilter AS if ON (id.dirname = id.directories)
2
Can you provide a sample script and code so we can reproduce this? Otherwise it is really hard to help. You did not post the logic of the custom extractor or the U-SQL scripts. - Peter Bons
I edited the first post. The problem is that all files are read; the filter does not work. - Jorge Ribeiro

2 Answers

1
votes

I used INNER JOIN and the U-SQL equality comparison in the join condition, which is two equals signs (==), and this worked for me. Note that the files are still read, but they are filtered out of the results:

DECLARE @inputFile string = "/input/{dirName}/file.txt";

@input =
    EXTRACT dirName string,
            content string
    FROM @inputFile
    USING Extractors.Csv();


@inputFilter =
    EXTRACT directories string
    FROM "/input/files.txt"
    USING Extractors.Csv();


@output =
    SELECT *
    FROM @input
         INNER JOIN
             @inputFilter
         ON dirName == directories
    WHERE dirName LIKE "file%";


OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();

My results with a similar folder structure:

[Image: My results]

0
votes

Did you consider using a file list in your EXTRACT expression? The list cannot be a dynamic expression or parameter, so you'll have to generate the U-SQL script before each run based on the data in /input/files.txt, but this avoids reading all of the files and filtering them at runtime.

// The file list is written out literally; it cannot come from a variable or expression.
@inputData =
    EXTRACT
        dirname string,
        content string
    FROM "/file1/file.txt",
         "/file4/file.txt"
    USING Extractors.Text(delimiter : '\n', quoting : false);
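
Since the script has to be regenerated before each run, here is a minimal sketch of how that generation step might look in Python. It is only an illustration: the names `USQL_TEMPLATE` and `build_usql` are hypothetical, the template simply mirrors the EXTRACT statement above, and the `/{name}/file.txt` path layout is assumed from the question.

```python
from typing import Iterable

# Hypothetical template mirroring the EXTRACT statement from this answer;
# only the FROM file list is filled in for each run.
USQL_TEMPLATE = """@inputData =
    EXTRACT
        dirname string,
        content string
    FROM {file_list}
    USING Extractors.Text(delimiter : '\\n', quoting : false);
"""


def build_usql(dir_names: Iterable[str]) -> str:
    """Build a U-SQL script whose FROM clause lists one file per directory name.

    dir_names is any iterable of lines, e.g. an open handle on the filter file.
    Blank lines are skipped.
    """
    names = [line.strip() for line in dir_names if line.strip()]
    if not names:
        raise ValueError("the filter file lists no directories")
    for name in names:
        # Reject characters that would break the quoted path literal
        # or change the assumed one-level directory layout.
        if '"' in name or "/" in name:
            raise ValueError(f"unexpected character in directory name: {name!r}")
    # Assumed layout from the question: /<directory>/file.txt
    paths = [f'"/{name}/file.txt"' for name in names]
    # Continuation lines are indented to line up under the first path.
    file_list = (",\n" + " " * 9).join(paths)
    return USQL_TEMPLATE.format(file_list=file_list)
```

For example, passing the lines of a local copy of the filter file (`build_usql(open("files.txt"))`) returns the script text, which can then be saved and submitted as the job. Submitting the job is left out here because it depends on your tooling.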