I want to preprocess a Weka ARFF file that contains 2000 lines for an NLP project (sentiment analysis).

I want code that just adds a single quotation mark at the start and end of each sentence. For example, here is a sample from my dataset:

The Da Vinci Code is one of the most beautiful movies ive ever seen.,1
The Da Vinci Code is an * amazing * book, do not get me wrong.,1
then I turn on the light and the radio and enjoy my Da Vinci Code.,1
The Da Vinci Code was REALLY good.,1
i love da vinci code....,1 

I want the output to be:

'The Da Vinci Code is one of the most beautiful movies ive ever seen.',1
'The Da Vinci Code is an * amazing * book, do not get me wrong.',1
'then I turn on the light and the radio and enjoy my Da Vinci Code.',1
'The Da Vinci Code was REALLY good.',1
'i love da vinci code....',1 

I just want to add a single quotation mark at the beginning and end of each sentence (before the ,1).

I would really appreciate your help with this.

Is there any tool that I can use instead of writing code?

Can you edit your question to show what you have tried so far and where it failed? Please also explain why C++ is mentioned specifically. - KompjoeFriek

1 Answer

You could use regular expressions to achieve this. Regular expressions are a powerful formalism for pattern matching in strings. Many existing tools support them, so you can match and replace the text you want without writing any code yourself.

To match and replace using Regular Expressions (regexp), you need two parts:

  1. Match: An expression to match something in your string or strings.
  2. Substitution/Replace: An expression to indicate what to replace a match with.

Match:

/([^\.]+)(\.+)(,1\s+)/g
  • Group 1: Match all characters except for a literal dot, at least 1 character.
  • Group 2: Match only literal dots, at least 1 character.
  • Group 3: Match a literal comma, followed by a literal 1, followed by at least 1 whitespace character.
  • Regex flag g (global): multiple matches
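
To make the grouping concrete, here is a minimal Python sketch that runs the same pattern over two lines from the question and prints what each group captures. Python and the helper name captured_groups are my own choices for illustration, not part of the tools discussed in this answer.

```python
import re

# Same pattern as above. Python has no g flag; finditer already
# returns every match in the input.
PATTERN = re.compile(r"([^\.]+)(\.+)(,1\s+)")

# Two lines from the question's sample, joined by newlines as in the file.
text = "The Da Vinci Code was REALLY good.,1\ni love da vinci code....,1 \n"


def captured_groups(data):
    """Return a (group 1, group 2, group 3) tuple for every match in data."""
    return [m.groups() for m in PATTERN.finditer(data)]


for groups in captured_groups(text):
    print(groups)
```

Note that group 3 includes the newline at the end of each line, which is what satisfies the "at least 1 whitespace character" requirement when the whole file is matched at once.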

Substitution:

'$1$2'$3
  • Enclose group 1 and 2 with quotes, followed by group 3.
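
As a rough sketch of how the match and substitution work together, the following Python snippet applies both to two sample lines from the question. Python writes the group references as \1, \2, \3 instead of $1, $2, $3; the function name quote_sentences is hypothetical.

```python
import re


def quote_sentences(data):
    """Apply the match/substitution pair above to a block of text."""
    # \1, \2 and \3 are Python's spelling of $1, $2 and $3.
    return re.sub(r"([^\.]+)(\.+)(,1\s+)", r"'\1\2'\3", data)


sample = (
    "The Da Vinci Code is an * amazing * book, do not get me wrong.,1\n"
    "The Da Vinci Code was REALLY good.,1\n"
)

# Each sentence is now wrapped in single quotes, the ,1 label is untouched.
print(quote_sentences(sample), end="")
```

Text that does not match the pattern is passed through unchanged.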

You can view an interactive version of the above Match and Substitution here

Now you can use that match and substitution to work with your favorite regexp tool.

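Like GNU sed (sed passes each line to the expression without its trailing newline, so the whitespace after ,1 is made optional here with \s*):

sed -i -E "s/([^\.]+)(\.+)(,1\s*)/'\1\2'\3/g" yourfile.txt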

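Or Windows PowerShell (Get-Content returns each line without its newline, so the trailing whitespace is made optional with \s*):

(Get-Content yourfile.txt) -replace '([^\.]+)(\.+)(,1\s*)', '''$1$2''$3' | Out-File output.txt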

Other tools might use a different syntax. The patterns above can probably be optimized further; as written, they only match sentences that end in at least one dot and carry the label 1.
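
If you do end up writing a little code, one possible variant (a sketch only, not the approach described above) is to skip the regexp and split each line at its last comma. The assumptions here are mine: the label is whatever follows the last comma, header and comment lines start with @ or %, and apostrophes inside a sentence should be escaped with a backslash, which I believe is what Weka expects but have not verified against your data. The second input line below is a made-up example, not from the question's dataset.

```python
def quote_line(line):
    """Wrap everything before the last comma of a data line in single quotes."""
    stripped = line.strip()
    # Assumption: blank lines, @-declarations and %-comments are not data rows.
    if not stripped or stripped[0] in "@%":
        return stripped
    sentence, sep, label = stripped.rpartition(",")
    if not sep:
        return stripped  # no comma found: leave the line untouched
    # Assumption: backslashes and apostrophes are escaped with a backslash.
    escaped = sentence.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}',{label.strip()}"


for raw in ["i love da vinci code....,1 ", "I don't like it,0"]:
    print(quote_line(raw))
```

Unlike the regexp, this does not depend on the sentence ending in a dot or on the label being 1, but it does assume the sentence is everything up to the last comma.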