I'm developing an application in Python to extract several types of data from free-form text. The text can include email addresses, URLs, and file paths.
My question is: how can I extract file paths (both Linux and Windows) with a regex while excluding URLs, which tend to look similar to file paths?
I have tried a variety of regular expressions to pull Linux and Windows file paths from the text. However, these expressions also match URLs, which I would like to prevent.
Currently, I am using the following regular expressions for emails and URLs.
Emails:
([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)
URLs:
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
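For reference, here is a minimal sketch of how these two patterns might be applied in Python. It assumes they are written as raw strings, where each regex escape takes a single backslash, and it uses non-capturing groups so the whole match can be read with `group(0)`. The names `EMAIL_RE`, `URL_RE`, and `find_emails_and_urls` are hypothetical, chosen only for this illustration.

```python
import re

# Email pattern from above, as a raw string (single backslash per escape).
EMAIL_RE = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")

# URL pattern from above; groups made non-capturing for simpler extraction.
URL_RE = re.compile(
    r"(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)


def find_emails_and_urls(text):
    """Return every email address and URL found in text, grouped by type."""
    return {
        "emails": [m.group(0) for m in EMAIL_RE.finditer(text)],
        "urls": [m.group(0) for m in URL_RE.finditer(text)],
    }


if __name__ == "__main__":
    sample = "Contact bob@example.org or see http://www.example.com, thanks."
    print(find_emails_and_urls(sample))
```

The last character class of the URL pattern leaves out `,` and `.`, so punctuation that directly follows a URL in running text is not included in the match.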
The desired end behavior of this application is to store valid email addresses, URLs, and file paths in a data structure.
Here is an example of some text:
This is an example of some text which will include email addresses: [email protected], websites such as: http://www.example.com, and file paths like: /Users/example/Documents/example.text and C:\Windows\System32\ I need to pull out only the file paths both Unix and Windows format.
:// inside it. Give me your regex for path and I will correct it to reject URLs. – Pushpesh Kumar Rajwanshi
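One possible way to act on the `://` observation is to blank out every URL first and only then search for paths, so the path-like tail of a URL can never match. The sketch below is an illustration under stated assumptions, not a definitive solution. `UNIX_PATH_RE`, `WINDOWS_PATH_RE`, and `extract_paths` are hypothetical names and patterns that do not come from the question, and they assume absolute paths that contain no spaces.

```python
import re

# URL pattern from the question, written as a raw string.
URL_RE = re.compile(
    r"(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

# Hypothetical path patterns (assumptions, not from the question).
# Unix: one or more /segment parts, not starting in the middle of a token.
UNIX_PATH_RE = re.compile(r"(?<![\w/:.\\])(?:/[\w.-]+)+/?")
# Windows: drive letter, colon, backslash, then backslash-separated segments.
WINDOWS_PATH_RE = re.compile(r"(?<!\w)[A-Za-z]:\\(?:[\w.-]+\\)*[\w.-]*")


def extract_paths(text):
    """Return Unix paths, then Windows paths, ignoring anything inside a URL."""
    # Replace each URL with spaces of equal length so offsets are preserved.
    masked = URL_RE.sub(lambda m: " " * len(m.group(0)), text)
    paths = [m.group(0) for m in UNIX_PATH_RE.finditer(masked)]
    paths += [m.group(0) for m in WINDOWS_PATH_RE.finditer(masked)]
    return paths


if __name__ == "__main__":
    sample = (
        r"See http://www.example.com/docs/index.html, and "
        r"/Users/example/Documents/example.text and C:\Windows\System32\ here."
    )
    for path in extract_paths(sample):
        print(path)
```

The lookbehind in `UNIX_PATH_RE` also keeps fragments such as the `/or` in "and/or" from being reported. A known limitation of this sketch is that a path followed directly by a sentence-ending period will include that period, and paths containing spaces are not handled.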