I have parse trees that contain some information. To extract what I need, I split the string on the forward slash (/), but that approach is not robust. Details follow.
I used this code in an earlier project and it worked perfectly. But the parse trees in my new dataset are more complicated, and the code now sometimes splits in the wrong place.
The parse tree is something like this:
(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )
As you can see, the leaves of the tree are the words immediately before the forward slashes (each leaf has the form word/TAG). To get these words, I have been using this code:
parse_tree.split("/");
But now, in my new data, I see instances like these:
1) (TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )
where there are multiple slashes because of a website address (in this case, only the last slash separates the word from its tag).
2) (NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )
where the slash is itself a word.
Could you please help me replace my current simple split with a regular expression that can handle these cases?
To summarize: I need a regular expression that splits on the forward slash but handles two exceptions: 1) if the word is a website address, it must split on the last slash only; 2) if there are two consecutive slashes, it must split on the second slash (the first slash must NOT be treated as a separator, because it is the WORD).
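Both exceptions reduce to a single rule: within each whitespace-delimited leaf token, the separator is the last slash. Here is a minimal sketch of that rule, written in Python because the post does not say which language is in use; the pattern uses only constructs that most regex flavours support. The names `LEAF` and `leaf_words` are made up for illustration. The sketch assumes that leaves are always separated by whitespace (as in the samples above), that tags never contain a slash, and that bracket labels such as `(TOP~did~1~1` never contain one either.

```python
import re

# One leaf per whitespace-delimited token of the form word/TAG.
# (?<!\S) and (?!\S) pin the match to a whole token. The greedy \S+
# backtracks only as far as the LAST slash, because the tag part
# ([^\s/]+) is not allowed to contain a slash itself.
# Assumption: bracket labels such as "(TOP~did~1~1" contain no slash,
# so they never match and are skipped.
LEAF = re.compile(r"(?<!\S)(\S+)/([^\s/]+)(?!\S)")


def leaf_words(tree):
    """Return (word, tag) pairs for every leaf in a bracketed parse string."""
    return [(m.group(1), m.group(2)) for m in LEAF.finditer(tree)]


if __name__ == "__main__":
    samples = [
        "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/"
        "first_page/first_page01.htm/X ./. )",
        "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )",
    ]
    for sample in samples:
        for word, tag in leaf_words(sample):
            print(word, tag)
```

If only the words are needed, take the first element of each pair. The same idea works without a regex: split the tree on whitespace and, for each token that contains a slash, split it once from the right.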