
I'm developing an application in Python to pull multiple types of data from free-form text. This text can include email addresses, URLs, and file paths.

My question is: how can I extract file paths (both Linux and Windows) using a regex while excluding URLs, which tend to look similar to file paths?

I have tried a variety of regular expressions to pull Linux and Windows file paths from the text. However, these expressions also pick up parts of the URLs, and I would like to prevent that.

Currently, I am using the following regular expressions for emails and URLs.

Emails:

([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\\.[a-zA-Z0-9_-]+)

URLs:

(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?

The desired end behavior of this application is to store valid email addresses, URLs, and file paths in a data structure.


Here is an example of some text:

This is an example of some text which will include email addresses: [email protected], websites such as: http://www.example.com, and file paths like: /Users/example/Documents/example.text and C:\Windows\System32\ I need to pull out only the file paths both Unix and Windows format.

Can you also add some sample data that you want to capture as a path but that could be confused with a URL? - Pushpesh Kumar Rajwanshi
Sure, I have added an example of some text. All the regular expressions that I have tried identified a portion of the URL, namely //www.example.com, as a file path. - Brandon Dalton
Can you also add the regex that you used for matching file paths? The fix is very simple: just ensure that what you identify as a path does not contain :// inside it. Give me your regex for paths and I will correct it to reject URLs. - Pushpesh Kumar Rajwanshi
You can use one of the regexes that matches both file paths and URLs, and then accept a string as a file path only if it matches the file path regex but does not match the URL regex. - nimishxotwod
@PushpeshKumarRajwanshi It is not possible to match arbitrary file paths with regex, especially Linux ones, which can contain any character. You never know where the path ends, and it is easy to over- or undermatch. - Wiktor Stribiżew
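
The match-then-filter idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `URL_RE`, `PATH_RE`, and `paths_excluding_urls` are made up here, the URL pattern is adapted from the question, and the path pattern deliberately assumes that paths contain no whitespace or commas (which, as noted above, is not true of arbitrary paths).

```python
import re

# URL pattern adapted from the question, written as a raw string.
URL_RE = re.compile(
    r"(?:http|ftp|https)://[\w-]+(?:\.[\w-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

# Loose path candidates. Assumption: a path starts with "/" or a drive
# letter followed by ":\" and contains no whitespace or commas.
PATH_RE = re.compile(r"(?:[A-Za-z]:\\|/)[^\s,]+")


def paths_excluding_urls(text):
    """Return path candidates that do not overlap any URL match."""
    url_spans = [m.span() for m in URL_RE.finditer(text)]
    paths = []
    for m in PATH_RE.finditer(text):
        start, end = m.span()
        # Keep the candidate only if it lies outside every URL span.
        overlaps = any(start < u_end and end > u_start
                       for u_start, u_end in url_spans)
        if not overlaps:
            paths.append(m.group())
    return paths
```

Because the path pattern is so loose, it would also pick up fragments such as the `/bar` in `foo/bar`, so it probably needs tightening for real input.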

1 Answer


Here's a solution that handles your example correctly:

import re

# Raw string so the backslashes in the Windows path stay literal.
example = r"This is an example of some text which will include email addresses: [email protected], websites such as: http://www.example.com, and file paths like: /Users/example/Documents/example.text and C:\Windows\System32\ I need to pull out only the file paths both Unix and Windows format."

# Every pattern expects a leading space; the path patterns also expect a trailing space or comma.
emails = re.findall(r"(?: )([^ ]*@[^ ]*\.[a-z]{2,3})", example)
urls = re.findall(r"(?: )((?:http|ftp|https):[^ ,]*)", example)
unix_paths = re.findall(r"(?: )(/[^ ,]*)(?:[ ,])", example)
windows_paths = re.findall(r"(?: )(C:\\[^ ,]*)(?:[ ,])", example)

It uses spaces and commas as delimiters. It doesn't work with paths at the very beginning or end of the text, but that shouldn't be too hard to correct.
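
One possible way to cover paths at the beginning or end of the text is to replace the literal leading space with a lookbehind and let the greedy match run to the next delimiter. The sketch below is an assumption-laden variant, not a general path matcher: `START`, `UNIX_PATH_RE`, `WINDOWS_PATH_RE`, and `find_paths` are illustrative names, any drive letter is accepted instead of only `C:`, and paths are still assumed to contain no whitespace or commas.

```python
import re

# Succeeds at the start of the text or right after whitespace or a comma,
# so the "/" in "http://" (preceded by ":" or "/") is never a path start.
START = r"(?<![^\s,])"

# The greedy [^\s,]* runs up to the next whitespace, comma, or end of text.
UNIX_PATH_RE = re.compile(START + r"/[^\s,]*")
WINDOWS_PATH_RE = re.compile(START + r"[A-Za-z]:\\[^\s,]*")


def find_paths(text):
    """Return (unix_paths, windows_paths) found in text."""
    return UNIX_PATH_RE.findall(text), WINDOWS_PATH_RE.findall(text)


unix_paths, windows_paths = find_paths(
    "/etc/hosts and http://www.example.com, then C:\\Windows\\System32\\"
)
```

Trailing punctuation other than a comma (for example a sentence-ending period) would still be captured as part of the path, so some post-processing may be needed.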