
I use ExtraTreesClassifier for training and prediction. I run the same source code on the same dataset on Windows 10 and on Ubuntu 16.04 and, surprisingly, I get a huge difference in execution time.

The results:

+--------------+---------------+--------------+--------------+-------------+
| Dataset (MB) | Win Train (s) | Win Pred (s) | Ub Train (s) | Ub Pred (s) |
+--------------+---------------+--------------+--------------+-------------+
| 430          | 104           | 11           | 2420         | 2019        |
+--------------+---------------+--------------+--------------+-------------+
| 530          | 122           | 14           | 2948         | 2162        |
+--------------+---------------+--------------+--------------+-------------+
| 699          | 140           | 18           | 3672         | 2500        |
+--------------+---------------+--------------+--------------+-------------+

Note: the time needed to load the CSV file and create the DataFrame is negligible.

The source code:

import time
import pandas as pd
import datatable as dt
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


class Model:  # the class name is not shown in the original post
    def __init__(self):
        self.ExTrCl = ExtraTreesClassifier()
        # self.list_motifs (the feature column names) is used below
        # but is not defined in the code shown here.

    def train_with_dt(self, csv_file_path):
        start_0_time = time.time()
        data_arn = dt.fread(csv_file_path)
        end_time = time.time()
        print(" time Read_csv file : ", end_time - start_0_time, " s")

        # Split off the label column, then drop it from the features.
        data_classe = np.ravel(data_arn[:, "familyId"])
        del data_arn[:, "familyId"]

        start_time_train = time.time()
        self.ExTrCl.fit(data_arn, data_classe)
        end_time = time.time()

        print(" train only time : ", end_time - start_time_train, " s")

    def test_groupe_score_dt(self, test_matrix, list_classes):
        start_0_time = time.time()
        dt_dftest = dt.Frame(np.array(test_matrix), names=self.list_motifs)
        end_time = time.time()

        print(" time creating dt Frame = ", end_time - start_0_time, " s")

        result = self.ExTrCl.predict(dt_dftest)
        end_time = time.time()

        # Measured from start_0_time, so this includes the Frame creation above.
        print(" Time pred = ", end_time - start_0_time, " s")
    

The OS, hardware and library versions used are listed in the table below. I updated all the libraries used.

+---------------------------------------+-------------------------------------------+
| Windows 10                            | Ubuntu 16.04                              |
+---------------------------------------+-------------------------------------------+
| Intel i7-8550U CPU @ 1.80GHz  1.99GHz | Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz |
| cpu cores       : 4                   | cpu cores       : 1                       |
| 64-bit OS                             | 64-bit OS                                 |
| RAM 16 GB                             | RAM 1007 GB                               |
+---------------------------------------+-------------------------------------------+
| Python 3.7.7                          | Python 3.5.2                              |
+---------------------------------------+-------------------------------------------+
| biopython==1.77                       | biopython==1.73                           |
| datatable==0.11.0a0+pr2536.12         | datatable==0.10.1                         |
| numpy==1.19.0                         | numpy==1.18.5                             |
| pandas==1.0.5                         | pandas==0.24.2                            |
| pyahocorasick==1.4.0                  | pyahocorasick==1.4.0                      |
| scikit-learn==0.23.1                  | scikit-learn==0.22.2.post1                |
| scipy==1.5.0                          | scipy==1.4.1                              |
| suffix-trees==0.3.0                   | suffix-trees==0.3.0                       |
+---------------------------------------+-------------------------------------------+
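
Since the two machines end up with different package versions, it can help to print the versions from inside the interpreter that actually runs the program, rather than relying on what pip reports. The following is only a minimal sketch: the helper name installed_versions is hypothetical, and a module without a __version__ attribute is reported as "unknown".

```python
import importlib
import sys


def installed_versions(module_names):
    """Return {module name: version string} as seen by the running interpreter."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


# sys.executable is the Python binary that is running this code.
print("interpreter:", sys.executable)
print("python:", sys.version.split()[0])

# These are import names, not pip names: scikit-learn is imported as sklearn.
checked = ["numpy", "scipy", "pandas", "sklearn", "datatable"]
for name, version in installed_versions(checked).items():
    print(name, version)
```

Running the same script on both machines makes it easy to compare the output side by side.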

Using cProfile:

         1619734 function calls (1589052 primitive calls) in 6495.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4828 6248.349    1.294 6248.349    1.294 {built-in method numpy.core.multiarray.array}
      100  130.458    1.305  130.458    1.305 {method 'build' of 'sklearn.tree._tree.DepthFirstTreeBuilder' objects}
        1   48.288   48.288   48.288   48.288 {built-in method datatable.lib._datatable.gread}
        2   21.834   10.917   25.749   12.874 Main.py:40(get_matrix_nbOcrrs_listStr_AhoCorasick)
        2   20.747   10.374 2570.626 1285.313 model.py:233(test_groupe_score_dt)
     4365    6.476    0.001    6.476    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        1    5.851    5.851 6492.121 6492.121 Main.py:309(main)
     6710    3.705    0.001    3.705    0.001 {method 'copy' of 'list' objects}
      400    2.548    0.006    2.548    0.006 {method 'predict' of 'sklearn.tree._tree.Tree' objects}
        1    2.288    2.288 6495.453 6495.453 Main.py:1(<module>)
        1    1.334    1.334 3889.596 3889.596 model.py:189(train_with_dt)
      400    0.827    0.002    3.628    0.009 _classes.py:880(predict_proba)
        4    0.522    0.131 4936.793 1234.198 _forest.py:591(predict)
      400    0.354    0.001    3.982    0.010 _forest.py:442(_accumulate_prediction)
   376662    0.150    0.000    0.150    0.000 {method 'add_word' of 'ahocorasick.Automaton' objects}
      803    0.120    0.000    0.120    0.000 {built-in method marshal.loads}
2272/2260    0.070    0.000    0.144    0.000 {built-in method builtins.__build_class__}
   1081/1    0.069    0.000 6495.453 6495.453 {built-in method builtins.exec}
  143/119    0.064    0.000    0.116    0.001 {built-in method _imp.create_dynamic}
        2    0.046    0.023    0.046    0.023 {method 'make_automaton' of 'ahocorasick.Automaton' objects}

...etc
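
A listing like the one above, ordered by internal time (tottime), can be produced with the standard-library cProfile and pstats modules. This is a minimal sketch, not the exact command used for the post: workload is a hypothetical stand-in for the program's real entry point, and profile_by_tottime is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def workload():
    # Hypothetical stand-in for the real entry point (for example main()).
    return sum(i * i for i in range(100000))


def profile_by_tottime(func, top=20):
    """Run func under cProfile and return a text report sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "tottime" is the time spent inside a function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(top)
    return stream.getvalue()


report = profile_by_tottime(workload)
print(report)
```

In the profile shown in the question, almost all of the internal time is attributed to numpy.core.multiarray.array rather than to the tree builder, which suggests that converting the input data to NumPy arrays, not building the trees, dominates the run.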

Thank you for your help.

The newer packages on Windows are the first place I would look. Package version upgrades often come with performance improvements. - 0x5453
Thank you for your suggestion :) In my case, I think the problem comes mainly from 'scikit-learn', 'numpy' and 'scipy'. I installed the latest updates on both Windows and Linux but, as you can see, the versions are slightly different (I don't know why). If the latest versions on Windows come with improvements, is there really nothing I can do on Linux to get approximately the same running time? Do I just have to wait for the next release, or manually install exactly the same versions as on Windows (if available) on Linux? - ibra
@0x5453, I think the upgrade doesn't work on the Linux side. For example, for pandas, (python3 -m pip install pandas --upgrade) just says (Requirement already satisfied,...) and I still have the same old version (pandas==0.24.2), whereas on Windows I have (pandas==1.0.5). I found a question that deals with this update problem at (stackoverflow.com/questions/36244753/…) - ibra

1 Answer


I found the problem that led to the huge difference in execution time between Windows and Linux.
When I updated the Python packages, the system said Requirement already satisfied..., so I thought the packages were up to date. After the comment from @0x5453, I tried to update the packages again, but it did not work. I then found a question that deals with this update problem (Python3.4: Upgrade pandas does not work).
Solution: I installed Python 3.8 alongside Python 3.5 and created a virtual environment in which I installed the latest versions of the packages needed. This improved the execution time considerably.
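
The environment setup described above can also be scripted with the standard-library venv module, run with the newer interpreter. This is only a minimal sketch under assumptions: create_env and in_virtualenv are hypothetical helper names, and the directory name et-env in the comment is made up for illustration.

```python
import os
import sys
import venv


def create_env(env_dir, with_pip=True):
    """Create a virtual environment for the interpreter running this script.

    Returns the path of the pyvenv.cfg file that marks the environment.
    """
    venv.create(env_dir, with_pip=with_pip)
    return os.path.join(env_dir, "pyvenv.cfg")


def in_virtualenv():
    """Return True if the current interpreter runs inside a venv environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())

# Example call, left commented out because it creates a directory
# (et-env is a hypothetical name):
# create_env("et-env")
```

Once the environment is activated, running pip as a module of that interpreter (for example python -m pip install --upgrade scikit-learn) makes sure the packages are installed for the Python that will actually run the program.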

+--------------+-----------+----------+------------------+-----------------+------------------+-----------------+
| Dataset (MB) | Win Train | Win Pred | Ub Train (Py3.5) | Ub Pred (Py3.5) | Ub Train (Py3.8) | Ub Pred (Py3.8) |
+--------------+-----------+----------+------------------+-----------------+------------------+-----------------+
| 430          | 104       | 11       | 2420 (win x23)   | 2019 (win x183) | 353 (win x3)     | 270 (win x24)   |
+--------------+-----------+----------+------------------+-----------------+------------------+-----------------+
| 530          | 122       | 14       | 2948 (win x24)   | 2162 (win x154) | 771 (win x6)     | 646 (win x46)   |
+--------------+-----------+----------+------------------+-----------------+------------------+-----------------+
| 699          | 140       | 18       | 3672 (win x26)   | 2500 (win x138) | 901 (win x6)     | 430 (win x23)   |
+--------------+-----------+----------+------------------+-----------------+------------------+-----------------+

The remaining difference between Windows and Linux with Python 3.8 could be explained by the difference between the two machines used.