Good afternoon,

I am working with Saxon 9.8.0.4 for Java. I would like to use the EXPath File Module function "file:list" with its third parameter, "pattern", but I am not sure which pattern syntax is supported.

I have read both the Saxon documentation and the EXPath documentation, but I still do not know which patterns are supported in Saxon 9.8.0.4. It would be great if regular expressions were supported, but I understand that is overkill for most users. I tried several blind tests, but only the * and ? wildcards work for me, as defined in the EXPath documentation.

Yes, I can quite easily do regexp post-processing in a for-each, but it would help to know more about the list function.

Thank you in advance for your help, Stepan

P.S.: My use case is to get all files without an extension ("test" and not "test.txt") recursively from a large and deep directory structure and process all matching files with XSL-T 3.0. Most of these files have identical file names, so I cannot copy them into one folder as a pre-processing step for a single Saxon invocation with -s:directory -o:directory, and invoking Java (Saxon) once per file is of course a terrible time overhead. So I would like to read all matching files into a sequence and process each item of that sequence using for-each (the files are text files and I read them using unparsed-text). And no, GAWK is not a solution, because I already have the whole XML-to-SQL transformation infrastructure in XSL-T, as 95 % of the files are XML.

--ADDED code and explanation below:

Example of my test files.

XML file "a.xml":

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="a.xsl"?>
<root/>

XSL-T file "a.xsl":

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:saxon="http://saxon.sf.net/"
  xmlns:expathFile="http://expath.org/ns/file"
  exclude-result-prefixes="xs saxon"
  version="3.0">
  <xsl:output method="text" />
  <xsl:template match="/root">
    <xsl:variable name="list" select="expathFile:list('C:\temp\temp\test\', false(), '^.*$')"/>
    <xsl:for-each select="$list">
      <xsl:value-of select="."/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

My folder "C:\temp\temp\test\" contains 6 test files: "a.txt", "b.txt", "c.txt", "e", "f", "g".

But after testing with the online Java RegExp tester at "http://www.regexplanet.com/advanced/java/index.html" I found that the problem is solely on my side, because Java regular expressions behave a little differently from PCRE (Perl), sed, and gawk regular expressions. So it is my fault and I need to learn Java regular expressions.

1 Answer

Saxon uses the same code for this pattern as for the filter in select="pattern" in collection URIs, which is described at http://www.saxonica.com/documentation/index.html#!sourcedocs/collections

Extracting the relevant details:

The pattern used in the select parameter can use glob-like syntax, for example *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?", and it is then used to match the file names appearing in the directory using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions. Note however, that special characters used in the URL (that is, characters such as backslash and curly braces that are not allowed in the query part of a URI) must be escaped using the %HH convention. For example, vertical bar needs to be written as %7C. This escaping can be achieved using the encode-for-uri() function.
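To see what those rewrite rules do to a given pattern, here is a minimal Java sketch. It is not Saxon's actual code: the class GlobSketch and the helpers globToRegex and filter are hypothetical names, and the sketch simply assumes the rules are applied literally, character by character, as the quoted paragraph describes. The file names are the six test files from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class GlobSketch {

    // Hypothetical helper (not Saxon's API): prepend "^", append "$",
    // replace "." by "\.", "*" by ".*", and "?" by ".?".
    static String globToRegex(String glob) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '.': sb.append("\\."); break;
                case '*': sb.append(".*"); break;
                case '?': sb.append(".?"); break;
                default:  sb.append(c);
            }
        }
        return sb.append('$').toString();
    }

    // Keep only the names that the converted pattern matches (Java regex rules).
    static List<String> filter(String glob, List<String> names) {
        Pattern p = Pattern.compile(globToRegex(glob));
        return names.stream()
                    .filter(n -> p.matcher(n).matches())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "b.txt", "c.txt", "e", "f", "g");

        System.out.println(globToRegex("*.txt"));     // ^.*\.txt$
        System.out.println(filter("*.txt", names));   // [a.txt, b.txt, c.txt]

        // A full regex is rewritten too: "." becomes a literal dot.
        System.out.println(globToRegex("^.*$"));      // ^^\..*$$
        System.out.println(filter("^.*$", names));    // []

        // A character class survives the rewrite, so this selects names without a dot.
        System.out.println(globToRegex("[^.]*"));     // ^[^\.]*$
        System.out.println(filter("[^.]*", names));   // [e, f, g]
    }
}
```

Under these assumptions, the pattern '^.*$' from the question would become a regex that only matches names starting with a literal dot, which would explain why it selects none of the test files. A pattern such as [^.]* might then cover the "files without extension" use case, but that should be checked against the real file:list in Saxon 9.8.0.4 before relying on it.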

Note that Saxon's collection() function now also supports match=pattern in the URI, where the pattern is a standard XPath 3.1 regular expression.