1
votes

I am trying to unzip file.zip containing files (a, b, c) in Pentaho Kettle (File management -> Unzip file), and it works fine. But if I try to unzip a file.zip containing files (a, b, ж), for example, I get errors:

2016/01/18 17:46:17 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Could not unzip file [file:///D:/projects/loaders/loader_little_files/src.zip]. Exception : [MALFORMED]
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : java.lang.IllegalArgumentException: MALFORMED
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipFile.getZipEntry(ZipFile.java:566)
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipFile.access$900(ZipFile.java:60)
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:524)
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
2016/01/18 17:46:17 - Unzip file -  at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:91)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:53)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.provider.AbstractFileProvider.addFileSystem(AbstractFileProvider.java:103)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:88)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.findFile(AbstractLayeredFileProvider.java:61)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:790)
2016/01/18 17:46:17 - Unzip file -  at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:712)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:151)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:106)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.entries.unzip.JobEntryUnZip.unzipFile(JobEntryUnZip.java:618)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.entries.unzip.JobEntryUnZip.processOneFile(JobEntryUnZip.java:516)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.entries.unzip.JobEntryUnZip.execute(JobEntryUnZip.java:461)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.execute(Job.java:730)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.execute(Job.java:546)
2016/01/18 17:46:17 - Unzip file -  at org.pentaho.di.job.Job.run(Job.java:435)

I was using Windows 7 when I created the "ж" file.

I also tried renaming the file to "ж" on Linux, but the result did not change.

How can I make this work? Is there a hidden setting? Thanks!

3
My first guess would be that the symbol ж is not recognized, and therefore the program decides that the name is malformed. - Zimano
That is exactly my question :) How can I tune Kettle to make this work? - DanStopka
What version are you using? This is working for me. What charset is the Cyrillic character from? From your stack trace this looks like an error in the zip library that Kettle is using. - bolav
I can confirm the problem in 5.2.0: I can't unzip a file that contains the letter 'ж'. - simar
Weird, but ZipCoder expects file names to be in UTF-8, regardless of the -Dfile.encoding parameter. - simar

3 Answers

3
votes

Non-UTF-8 encoding in zip files.

Taken from here: https://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in

Important parts

  1. The Zip specification (historically) does not specify what character encoding to be used for the embedded file names
  2. Jar specification meanwhile explicitly specifies to use UTF-8 as the encoding to encode and decode all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed Jar specification to use UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.

The Windows NTFS file system stores file names in UTF-16, and zip tools on Windows typically do not write entry names as UTF-8. Cyrillic characters in file names therefore cause problems in Java applications. Trouble arises when you create a zip archive with a third-party tool (unless it is a Java-based tool, which is rare) and then unzip it with a Java tool like PDI.
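To illustrate why such a name fails to decode, here is a minimal sketch that uses only the JDK. It assumes the entry name was written as cp866 (the JDK calls this charset IBM866); the class and method names are made up for this example. The cp866 byte for 'ж' is 0xA6, which cannot start a character in UTF-8, so a strict UTF-8 decoder rejects it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EntryNameDecoding {

    /** Strictly decodes raw entry-name bytes; returns null if they are not valid in the given charset. */
    public static String decodeStrict(byte[] raw, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(raw))
                     .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumption: the archiver stored the name as cp866 (IBM866 in the JDK).
        Charset cp866 = Charset.forName("IBM866");
        byte[] raw = "\u0436.txt".getBytes(cp866); // "\u0436" is the letter 'ж'

        // null: the bytes are not valid UTF-8
        System.out.println(decodeStrict(raw, StandardCharsets.UTF_8));
        // decodes back to the original name when the right charset is used
        System.out.println(decodeStrict(raw, cp866));
    }
}
```

The same bytes decode cleanly once the reader is told the charset the archiver actually used.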

Good news for Linux users: file names on ext4 are effectively UTF-8 by default. (Strictly speaking, ext4 does not care about encoding and only stores byte sequences, but the environment where you create files, whether a shell or a GUI such as the GNOME Nautilus file manager, assumes UTF-8 when writing file names to disk. Qt relies on the locale.) Of course there are ways to override this, but as far as I know UTF-8 has become the widely used default locale encoding.

Conclusion:

  • a zip file created on Linux (tested on Ubuntu) can be unzipped using PDI.
  • a zip file created using the Java API can be unzipped using PDI wherever it was created.
  • a zip file created on Windows can cause trouble when unzipped using PDI.
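If you write the reading code yourself, the JDK (Java 7 and later) accepts an explicit charset in the ZipInputStream and ZipOutputStream constructors. Below is a rough sketch of a round trip with cp866 entry names. The class name, the method names and the choice of cp866 are assumptions for illustration, and this does not change how the PDI "Unzip file" entry itself behaves:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipCharsetRoundTrip {

    /** Builds an in-memory zip whose entry names are encoded with the given charset. */
    public static byte[] writeZip(List<String> names, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, cs)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(cs)); // dummy content
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Lists entry names, decoding them with the given charset instead of the UTF-8 default. */
    public static List<String> listNames(byte[] zip, Charset cs) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Calling listNames with the same charset that was used for writing returns the original names; the one-argument constructors fall back to UTF-8.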
1
votes

Here is how to decompress a zip file that was created on Windows 8.1 with 7-Zip and whose entry names contain Cyrillic characters. The archive contains 3 files named:

  • а.txt
  • ж.txt
  • ё.txt

Fortunately, all the needed libraries (Apache commons-compress and commons-io) are already in the PENTAHO_HOME/lib directory, so you don't have to add extra libraries to Kettle.

Here is the code for a "User Defined Java Class" step:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.io.IOUtils;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {

    Object[] r = getRow();
    // No more input rows: signal completion, otherwise the step would unzip again forever.
    // This assumes the step receives one input row per run (for example from a Generate Rows step).
    if (r == null) {
        setOutputDone();
        return false;
    }

    r = createOutputRow(r, data.outputRowMeta.size());

    // FNAME: absolute path to the zip file, OUT: directory to extract into.
    String fname = getVariable("FNAME", null);
    String outDir = getVariable("OUT", null);
    System.out.println(fname + "  " + outDir);

    ZipFile zipFile = null;
    try {
        // "cp866" is the encoding of the entry names; false = do not use Unicode extra fields.
        zipFile = new ZipFile(new File(fname), "cp866", false);
        Enumeration enumEntry = zipFile.getEntries();
        int i = 0;
        while (enumEntry.hasMoreElements()) {
            ZipArchiveEntry entry = (ZipArchiveEntry) enumEntry.nextElement();
            String entryName = entry.getName();
            System.out.println(entryName);
            // The counter prefix keeps output names unique even if the names decode badly.
            OutputStream os = new FileOutputStream(new File(outDir, Integer.valueOf(++i) + entryName));
            InputStream is = zipFile.getInputStream(entry);
            try {
                IOUtils.copy(is, os);
            } finally {
                // Close both streams even if the copy fails.
                is.close();
                os.close();
            }
        }
    } catch (Exception exc) {
        System.out.println("Failed to unzip");
        exc.printStackTrace();
    } finally {
        ZipFile.closeQuietly(zipFile);
    }

    putRow(data.outputRowMeta, r);
    return true;
}

Important parts of code are:

String fname = getVariable("FNAME", null);
String outDir = getVariable("OUT", null);

These lines mean that 2 variables must be available in the transformation:

FNAME - absolute path to the zip file,

OUT - directory to extract the files into

In this line:

ZipFile zipFile = new ZipFile(inputFile, "cp866", false);

"cp866" is the encoding that 7-Zip used for the zip entry names (cp866 on my Windows machine). If you use another archiver, you might need to change the encoding. There are some notes on this at https://commons.apache.org/proper/commons-compress/zip.html, in the section "Recommendations for Interoperability". You can write your own algorithm to identify the encoding, relying for example on a known part of the file names in the archive. In any case, most likely this Kettle job/transformation will receive zip files from a single known source, so you just need to identify the proper encoding once and set it in the code.

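The idea of identifying the encoding from a known part of the file name could look roughly like this. It is only a sketch under assumptions: the class name, the method name and the candidate list are hypothetical, and it works on the raw name bytes (with commons-compress these should be available from ZipArchiveEntry.getRawName()). Each candidate charset is tried in turn, and the first one whose decoded name contains the expected fragment wins:

```java
import java.nio.charset.Charset;
import java.util.List;

public class EncodingGuess {

    /**
     * Returns the first supported candidate charset under which the raw entry name
     * contains the known fragment, or null if none of the candidates matches.
     */
    public static Charset guess(byte[] rawName, String knownPart, List<String> candidates) {
        for (String candidate : candidates) {
            if (!Charset.isSupported(candidate)) {
                continue; // skip charsets this JVM does not provide
            }
            Charset cs = Charset.forName(candidate);
            // new String(...) substitutes a replacement character for invalid bytes,
            // so a wrong charset simply fails the contains() check.
            if (new String(rawName, cs).contains(knownPart)) {
                return cs;
            }
        }
        return null;
    }
}
```

For example, if you know that every archive from your source contains a file whose name includes a certain Cyrillic word, you could call guess with that word and a list such as "UTF-8", "windows-1251", "IBM866", then pass the resulting charset name to the ZipFile constructor.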
This expression:

Integer.valueOf(++i) + entryName

Why is the file name generated with an integer prefix? If the wrong encoding is used, ZipFile will decode the entry names to something like [].txt (it can't decode а.txt or ж.txt, so it replaces the symbols 'а' and 'ж' with '[]'). As a result, if the encoding is wrong and the file names have the same length and are written in Cyrillic, each entry in the loop will overwrite the same file, and in the end you will get a single file named [].txt.

With a counter in the file name you guarantee that all files get different names even if the correct names can't be decoded:

1[].txt 
2[].txt
3[].txt

If you know the exact encoding, just remove this part of the code to get rid of the numbers in the file names.

0
votes

Only one thing worked for me on Debian Jessie: installing WinRAR under Wine and choosing the file name encoding there.