0
votes

I'm working with a PHP array that contains values parsed in an earlier scraping step (using Simple HTML DOM Parser). I can print / echo the values of this array, which contain special characters (é, à, è, etc.), without trouble. The problem is the following:

When I use fwrite to save the values in a .csv file, some characters are not saved correctly. For example, Székesfehérvár is displayed correctly in my PHP view in HTML, but is saved as Sz&#233;kesfeh&#233;rv&#225;r in the .csv file that I generate with the PHP script below.

I've already checked or tried several things in the PHP script:

  • The page I'm scraping seems to be utf-8 encoded
  • My PHP script is also declared as utf-8 in the header
  • I've tried a lot of iconv and mb_encode methods in different places in the code
  • NOTE that when I do a JS console.log of my PHP array, using json_encode, the characters are also broken. Maybe this is linked to the original encoding of the page I'm scraping?

Here's the part of the script that writes the values to a .csv file:

<?php 

$data = array(
            array("item1", "item2"), 
            array("item1", "item2"),
            array("item1", "item2"),
            array("item1", "item2")
            // ...
);

//filename
$filename = 'myFileName.csv';

$string_txt = ""; //declares the content of the .csv as a string

foreach($data as $line) {
    //starts a new line of the .csv
    $line_txt = "";
    foreach($line as $item) {
        //each line of the .csv holds the values of one php subarray, tab separated
        $line_txt .= $item . "\t";
    }

    //PHP end-of-line constant, marks the end of the current line of the .csv
    $line_txt .= PHP_EOL;

    //add the line to the string which is the global content of the .csv
    $string_txt .= $line_txt;
}

//writing the string in a .csv file 
$file = fopen($filename, 'w+');
fwrite($file, $string_txt);
fclose($file);

I am currently stuck because I can't save values with accented characters correctly.

4
“The page I'm scraping seems to be utf-8 encoded” - it rather seems that the page you are scraping already uses numeric entities to represent these characters. You probably just haven’t noticed, because you looked at your debug output after the browser had interpreted it as HTML. html_entity_decode should help. – misorude
@misorude, thanks for your help. I don't really understand your comment. Let me add that when I do a print_r of my $data array, all the characters are there, but the problem appears when I try to do something else with this array, such as a json_encode for JS, or writing to a .csv. Do you understand what I mean? Thanks. – Maxime
Do a print_r("Sz&#233;kes"); - notice something? – misorude
Yes, the print_r returns Székes. Following your advice, I used htmlentities to get the original numeric entities of the values, but my question is now: how can I store the values as Székes, for example, and not as Sz&#233;kes? Thanks @misorude – Maxime
“Yes, the print_r returns Székes” - so do you understand my initial comment now? “How can I store the values as Székes, for example, and not as Sz&#233;kes?” - by turning the value that you have into the value that you want – you currently have Sz&#233;kes. And no, I did not say to use htmlentities. – misorude

4 Answers

1
votes

Put this line in your code

header('Content-Type: text/html; charset=UTF-8');

Hope this helps you!

1
votes

Try it


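$file = fopen('myFileName.csv', 'w');
foreach ($data as $line) {
    //convert each value of the row from UTF-8 to ISO-8859-1, then write the row
    fputcsv($file, array_map('utf8_decode', $line));
}
fclose($file);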
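
Note that utf8_decode is deprecated as of PHP 8.2. A minimal sketch of the same UTF-8 to ISO-8859-1 conversion using iconv, assuming ISO-8859-1 really is the encoding the target program expects; the helper name to_latin1 and the sample row are hypothetical and only there for illustration:

```php
<?php
// Sample row standing in for the question's $data array (assumed values, UTF-8).
$data = array(
    array("Székesfehérvár", "item2"),
);

// Hypothetical helper: converts every value of one row from UTF-8 to ISO-8859-1.
// iconv returns false for characters that ISO-8859-1 cannot represent.
function to_latin1(array $row): array
{
    return array_map(function ($value) {
        return iconv('UTF-8', 'ISO-8859-1', $value);
    }, $row);
}

$file = fopen('myFileName.csv', 'w');
foreach ($data as $line) {
    // Tab-separated values followed by a line break, as in the question.
    fwrite($file, implode("\t", to_latin1($line)) . PHP_EOL);
}
fclose($file);
```

This only changes the byte encoding. If the values still contain HTML entities such as &#233;, they are written to the file unchanged.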

0
votes

Excel has problems displaying UTF-8 encoded CSV files; I have seen this before. You can try adding a UTF-8 BOM, which worked for me. This simply means adding these bytes once at the start of your UTF-8 string:

//prepend the BOM once to the whole content, just before writing it to the file
$string_txt = chr(239) . chr(187) . chr(191) . $string_txt;

For more info: Encoding a string as UTF-8 with BOM in PHP

Alternatively, you can use the file import feature in Excel and make sure the file origin says 65001: Unicode (UTF-8). It should display your text properly, and you will need to save it as an Excel file to preserve the format.
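
A sketch of how the BOM could fit into a script like the one in the question, assuming the values are already valid UTF-8. The sample rows are made up for illustration. The three bytes are written a single time, right after the file is opened and before any data:

```php
<?php
// Sample rows standing in for the question's $data array (assumed values, UTF-8).
$data = array(
    array("Székesfehérvár", "Sóstói Stadion"),
    array("item1", "item2"),
);

$filename = 'myFileName.csv';
$file = fopen($filename, 'w');

// UTF-8 byte order mark (bytes EF BB BF), written once before any data.
fwrite($file, "\xEF\xBB\xBF");

foreach ($data as $line) {
    // Tab-separated values followed by a line break, as in the question.
    fwrite($file, implode("\t", $line) . PHP_EOL);
}
fclose($file);
```

The BOM only tells the reading program which encoding to expect. It does not change the values themselves.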

0
votes

The solution (provided by @misorude) :

When scraping HTML content from web pages, there is a difference between what is displayed in your debug output and what the script has really scraped. The scraped values contained HTML entities, which only looked right because the browser interpreted them. I had to use html_entity_decode so that PHP works with the actual characters instead of relying on the browser's interpretation.

To check that the values are retrieved correctly before storing them somewhere, you can use a console.log in JS to see whether they come through as expected:

PHP

//decoding the numeric HTML entities that represent "Sóstói Stadion"
$b = html_entity_decode("S&#243;st&#243;i Stadion"); 

Javascript (to test):

<script>
var b = <?php echo json_encode($b) ;?>;

//prints "Sóstói Stadion" correctly
console.log(b); 
</script>
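
Applied to the script from the question, the same idea could look like the sketch below, where every value is decoded before it is written to the file. The helper name decode_row, the sample rows, and the explicit ENT_QUOTES | ENT_HTML5 flags are assumptions for illustration, not part of the original code:

```php
<?php
// Sample rows standing in for the scraped $data array (assumed values).
$data = array(
    array("Sz&#233;kesfeh&#233;rv&#225;r", "S&#243;st&#243;i Stadion"),
    array("item1", "item2"),
);

// Hypothetical helper: turns the HTML entities of one row back into UTF-8 characters.
function decode_row(array $row): array
{
    return array_map(function ($value) {
        return html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }, $row);
}

$filename = 'myFileName.csv';
$file = fopen($filename, 'w');
foreach ($data as $line) {
    // Tab-separated values followed by a line break, as in the question.
    fwrite($file, implode("\t", decode_row($line)) . PHP_EOL);
}
fclose($file);
```

Decoding at the point where the values are scraped, rather than just before writing, would also fix the json_encode output mentioned in the question.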