1
votes

I'm trying to scrape email addresses with PowerShell from a directory that contains subdirectories, which in turn contain .txt files. So I have this code:

$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file

But when I execute it, it gives me an error:

select-string : The file C:\Users\Me\Documents\toscrape\ can not be read: Could not
path 'C:\Users\Me\Documents\toscrape\'.
At line:1 char:1
+ select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Select-String], ArgumentException
    + FullyQualifiedErrorId : ProcessingFile,Microsoft.PowerShell.Commands.SelectStringCommand

I've tried variations of the $input_path, with Get-Item, Get-ChildItem, and -Recurse, but nothing seems to work. Can anyone figure out how I need to search my location and all its subdirectories and files for the regex pattern?

2
I'm not totally clear on what you're trying to do, but if you need to get a list of TXT files from a directory structure, you need something like this: Get-ChildItem -Path $input_path -Include "*.txt" -Recurse – boxdog
I don't think that's the correct regex – ArcSet

2 Answers

3
votes

The error is because Select-String assumes the -Path points to a file or is a wildcard pattern, and $input_path is pointing to a folder. You could use:

$input_path = 'C:\Users\Me\Documents\toscrape\*.txt'
Select-String $input_path ....

However, since you want to recurse through subdirectories, you'll need to use Get-ChildItem to do that.

$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

Get-ChildItem $input_path -Include *.txt -Recurse |
    Select-String -Pattern $regex -AllMatches |
    Select-Object -ExpandProperty Matches |
    Select-Object -ExpandProperty Value |
    Set-Content $output_file

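Note that your regex may cause problems here. You're using \b for word boundary, but period ., hyphen -, and percent sign % are all non-word (\W) characters. The word characters (\w) are [A-Za-z0-9_]. As a result, a leading ., - or % is silently dropped from the match, because the match can only begin at a word character.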

For example:

PS C:\> '%[email protected]' -match '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
True
PS C:\> $Matches.Values
[email protected]

If that's what you want the pattern to do, that's great, but it is something to be aware of. Regex for an email address is notoriously difficult.
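
The same effect can be reproduced with a made-up address. This is only a sketch: jane.doe@example.com is a hypothetical value chosen for illustration and does not come from the question.

```powershell
# Hypothetical sample address, used only for illustration.
$regex  = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
$sample = '%jane.doe@example.com'

# -match only reports that the pattern matched somewhere in the string.
$found = $sample -match $regex

# The matched text starts at the first word character, because \b cannot
# match between the start of the string and the non-word character %.
$matched = $Matches[0]
$matched
```

Here $found is True, yet $matched holds the address without the leading %, so checking the -match result alone does not tell you what was actually extracted.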

0
votes

Your correction didn't work and gave me another error, @Bacon Bits. However, you put me on the right track. I adapted it a bit, and this seems to work for me.

$input_path = 'C:\Users\Me\Documents\toscrape'
$output_file = 'C:\Users\Me\Documents\toscrape\output.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

Get-ChildItem $input_path -Recurse | Select-String -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
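
The thread does not say which error appeared. One possible cause (an assumption, not confirmed here) is that output.txt sits inside the folder being scanned, so the pipeline may try to read the file it is writing, or pick up results from an earlier run. A minimal sketch that sidesteps this by returning the matches from a function and letting the caller write them somewhere else; the name Get-EmailMatch and the emails.txt path are hypothetical:

```powershell
# Hypothetical helper: returns the unique matches found in .txt files under $Path.
# Requires PowerShell 3.0 or later (-File, member enumeration on Matches).
function Get-EmailMatch {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [string] $Pattern = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
    )

    Get-ChildItem -Path $Path -Filter *.txt -File -Recurse |
        Select-String -Pattern $Pattern -AllMatches |
        ForEach-Object { $_.Matches.Value } |   # one string per match
        Sort-Object -Unique                     # drop repeated addresses
}

# Usage (assumed paths): write the results outside the scanned folder.
# Get-EmailMatch -Path 'C:\Users\Me\Documents\toscrape' |
#     Set-Content -Path 'C:\Users\Me\Documents\emails.txt'
```

Keeping the output file out of the input tree means repeated runs do not feed earlier results back into the search.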