
I've seen similar questions posted but they're not exactly the same as what I've encountered. I am using Python 3.7 and Pandas 0.25.0.

Weirdly, if I download this zip file directly from this link, I am able to read it via pd.read_csv as follows:

pd.read_csv('publicleaderboarddata.zip')
       TeamId           TeamName       SubmissionDate    Score
0      688191  Sergey Mushinskiy  2017-05-24 12:20:34  0.06630
1      688203       DeepVoltaire  2017-05-24 12:25:03  0.06630
2      688237        RakeshNikam  2017-05-24 13:02:31  0.06512
......

However, if I do:

this_leaderboard_df = pd.read_csv('https://www.kaggle.com/c/6649/publicleaderboarddata.zip',
                                  compression='zip')

I get a BadZipFile error, as follows. Why does this happen?

---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
----> 1 this_leaderboard_df = pd.read_csv(this_leaderboard_link, compression='zip')
      2 this_leaderboard_df.head(e)

~/.virtualenvs/py3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684
--> 685         return _read(filepath_or_buffer, kwds)
    686
    687     parser_f.__name__ = name

~/.virtualenvs/py3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458
    459     if chunksize or iterator:

~/.virtualenvs/py3/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894
--> 895         self._make_engine(self.engine)
    896
    897     def close(self):

~/.virtualenvs/py3/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133     def _make_engine(self, engine="c"):
   1134         if engine == "c":
-> 1135             self._engine = CParserWrapper(self.f, **self.options)
   1136         else:
   1137             if engine == "python":

~/.virtualenvs/py3/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1915         kwds["usecols"] = self.usecols
   1916
-> 1917         self._reader = parsers.TextReader(src, **kwds)
   1918         self.unnamed_cols = self._reader.unnamed_cols
   1919

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel)
   1223         try:
   1224             if mode == 'r':
-> 1225                 self._RealGetContents()
   1226             elif mode in ('w', 'x'):
   1227                 # set the modified flag so central directory gets written

/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in _RealGetContents(self)
   1290             raise BadZipFile("File is not a zip file")
   1291         if not endrec:
-> 1292             raise BadZipFile("File is not a zip file")
   1293         if self.debug > 1:
   1294             print(endrec)

BadZipFile: File is not a zip file

To download it you have to be logged in to Kaggle. Log out of Kaggle and then try to download directly from the link, and you will see the login form. pandas can't log in to this page, so it gets an HTML page with the login form instead of the zip file. - furas
@furas ah yes, this is the answer to my error. Thanks! - Zhiya
You could use Selenium to control a web browser, log in to Kaggle, and click the link to download the file. - furas

1 Answer


To download it you have to be logged in to Kaggle. If you log out of Kaggle and try to download directly from the link, you will see the login form.

pandas can't log in to this page, so it receives the HTML of the login form instead of the zip file. That HTML is what zipfile rejects with "File is not a zip file".
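One way to confirm this diagnosis is to check what the downloaded bytes actually are before handing them to pandas. The following is only a minimal sketch using the standard library: describe_payload, the leaderboard.csv member name, and the login_page bytes are hypothetical stand-ins made up for illustration, not Kaggle's real response.

```python
import io
import zipfile


def describe_payload(payload: bytes) -> str:
    """Classify downloaded bytes as a zip archive, an HTML page, or unknown."""
    # is_zipfile accepts a file-like object and looks for the zip end record.
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return "zip"
    head = payload[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


# Build a tiny in-memory zip to stand in for the real leaderboard file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("leaderboard.csv", "TeamId,Score\n688191,0.06630\n")
zip_bytes = buffer.getvalue()

# Hypothetical stand-in for the login page an anonymous client receives.
login_page = b"<!DOCTYPE html><html><body><form>Sign in</form></body></html>"

print(describe_payload(zip_bytes))   # zip
print(describe_payload(login_page))  # html
```

If the bytes fetched from the URL come back as "html", the server sent a page rather than the archive. If they come back as "zip", passing io.BytesIO(payload) to pd.read_csv with compression='zip' should work the same way as reading the local file.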

You could use Selenium to control a web browser, so that your script uses the browser to log in to Kaggle and download the file.