How do I unescape a html attribute value in Prolog?

3

votes

I find a predicate xml_quote_attribute/2 in a library(sgml) of SWI-Prolog. This predicate works with the first argument as input and the second argument as output:

?- xml_quote_attribute('<abc>', X).
X = '&lt;abc&gt;'.

But I couldn't figure out how I can do the reverse conversion. For example the following query doesn't work:

?- xml_quote_attribute(X, '&lt;abc&gt;').
ERROR: Arguments are not sufficiently instantiated

Is there another predicate that does the job?

Bye

htmlprologswi-prolog

Seems to be a good bounty candidate – false

When I needed this functionality, I wrote it myself. But that was almost 5 years ago... maybe something has been added in meanwhile. – CapelliC

4

votes

This is how Ruud's solution looks like with DCG notation + pushback lists / semicontext notation.

:- use_module(library(dcg/basics)).

html_unescape --> sgml_entity, !, html_unescape.
html_unescape, [C] --> [C], !, html_unescape.
html_unescape --> [].

sgml_entity, [C] --> "&#", integer(C), ";".
sgml_entity, "<" --> "&lt;".
sgml_entity, ">" --> "&gt;".
sgml_entity, "&" --> "&amp;".

Using DCGs makes the code a bit more readable. It also does away with some of the superfluous backtracking that Cookie Monster noted is the result of using append/3 for this.

2

votes

Here's the naive solution, using lists of character codes. Most likely it will not give you the best performance possible, but for strings that are not extremely long, it might just be alright.

html_unescape("", "") :- !.

html_unescape(Escaped, Unescaped) :-
    append("&", _, Escaped),
    !,
    append(E1, E2, Escaped),
    sgml_entity(E1, U1),
    !,
    html_unescape(E2, U2),
    append(U1, U2, Unescaped).

html_unescape(Escaped, Unescaped) :-
    append([C], E2, Escaped),
    html_unescape(E2, U2),
    append([C], U2, Unescaped).

sgml_entity(Escaped, [C]) :-
    append(["&#", L, ";"], Escaped),
    catch(number_codes(C, L), error(syntax_error(_), _), fail),
    !.

sgml_entity("&lt;", "<").
sgml_entity("&gt;", ">").
sgml_entity("&amp;", "&").

You will have to complete the list of SGML entities yourself.

Sample output:

?- html_unescape("&lt;a&gt; &#26361;&#25805;", L), format('~s', [L]).
<a> 曹操
L = [60, 97, 62, 32, 26361, 25805].

2

votes

If you don't mind linking a foreign module, then you can make a very efficient implementation in C.

html_unescape.pl:

:- module(html_unescape, [ html_unescape/2 ]).
:- use_foreign_library(foreign('./html_unescape.so')).

html_unescape.c:

#include <stdio.h>
#include <string.h>
#include <SWI-Prolog.h>

static int to_utf8(char **unesc, unsigned ccode)
{
    int ok = 1;
    if (ccode < 0x80)
    {
        *(*unesc)++ = ccode;
    }
    else if (ccode < 0x800)
    {
        *(*unesc)++ = 192 + ccode / 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode - 0xd800u < 0x800)
    {
        ok = 0;
    }
    else if (ccode < 0x10000)
    {
        *(*unesc)++ = 224 + ccode / 4096;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode < 0x110000)
    {
        *(*unesc)++ = 240 + ccode / 262144;
        *(*unesc)++ = 128 + ccode / 4096 % 64;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else
    {
        ok = 0;
    }
    return ok;
}

static int numeric_entity(char **esc, char **unesc)
{
    int consumed;
    unsigned ccode;
    int ok = (sscanf(*esc, "&#%u;%n", &ccode, &consumed) > 0 ||
              sscanf(*esc, "&#x%x;%n", &ccode, &consumed) > 0) &&
             consumed > 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += consumed;
    }
    return ok;
}

static int symbolic_entity(char **esc, char **unesc, char *name, int ccode)
{
    int ok = strncmp(*esc, name, strlen(name)) == 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += strlen(name);
    }
    return ok;
}

static foreign_t pl_html_unescape(term_t escaped, term_t unescaped)
{
    char *esc;
    if (!PL_get_chars(escaped, &esc, CVT_ATOM | REP_UTF8))
    {
        PL_fail;
    }
    else if (strchr(esc, '&') == NULL)
    {
        return PL_unify(escaped, unescaped);
    }
    else
    {
        char buffer[strlen(esc) + 1];
        char *unesc = buffer;
        while (*esc != '\0')
        {
            if (*esc != '&' || !(numeric_entity(&esc, &unesc) ||
                                 symbolic_entity(&esc, &unesc, "&lt;", '<') ||
                                 symbolic_entity(&esc, &unesc, "&gt;", '>') ||
                                 symbolic_entity(&esc, &unesc, "&amp;", '&')))
                                    // TODO: more entities...
            {
                *unesc++ = *esc++;
            }
        }
        return PL_unify_chars(unescaped, PL_ATOM | REP_UTF8, unesc - buffer, buffer);
    }
}

install_t install_html_unescape()
{
    PL_register_foreign("html_unescape", 2, pl_html_unescape, 0);
}

The following statement will build a shared library html_unescape.so from html_unescape.c. Tested on Ubuntu 14.04; may be different on Windows.

swipl-ld -shared -o html_unescape html_unescape.c

Start up SWI-Prolog:

swipl html_unescape.pl

Sample output:

?- html_unescape('&lt;a&gt; &#26361;&#25805;', S).
S = '<a> 曹操'.

With special thanks to the SWI-Prolog documentation and source code, and to C library to convert unicode code points to UTF8?

1

votes

Not aspiring as being the ultimate answer, since it doesn't give a solution for SWI-Prolog. For a Java based interpreter the problem is that XML escaping is not part of J2SE, at least not in a simple form (didn't figure out how to use Xerxes or the like).

A possible route would be to interface to StringEscapeUtils ( * ) from Apache Commons. But then again this would not be necessary on Android since there is a class TextUtil. So we rolled our own ( * * ) little conversion. It works as follows:

?- text_escape('<abc>', X).
X = '&lt;abc&gt;'
?- text_escape(X, '&lt;abc&gt;').
X = '<abc>'

Note the use of the Java methods codePointAt() and charCount() respectively appendCodePoint() in the Java source code. So it could also escape and unescape code points above the basic plane, i.e. in a range >0xFFFF (currently not implemented, left as an exercise).

On the other hand the Apache libraries, at least version 2.6, are NOT surrogate pair aware and will place two decimal entities per code point instead as one.

Bye

( * ) Java: Class StringEscapeUtils Source
http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/Entities.java#Entities.escape%28java.io.Writer,java.lang.String%29

( * * ) Jekejeke Prolog: Module xml
http://www.jekejeke.ch/idatab/doclet/prod/en/docs/05_run/10_docu/05_frequent/07_theories/20_system/03_xml.html

How do I unescape a html attribute value in Prolog?

4 Answers