8
votes

The documentation of Perl's Marpa parser contains the following section about tainted data:

Marpa::R2 exists to allow its input to alter execution in flexible and powerful ways. Marpa should not be used with untrusted input. In Perl's taint mode, it is a fatal error to use Marpa's SLIF interface with a tainted grammar, a tainted input string, or tainted token values.

I am not sure I understand the consequences of this limitation. I understand that the grammar must not be tainted, but I do not understand why the input must not be tainted. To me, validating the input is the parser's job. It seems unreasonable that a parser has to trust its input.

Is it really that way? Is it impossible to implement any kind of public network service with Marpa?

I ask because one of the reference use cases is the Marpa HTML parser. It seems contradictory to me to offer an HTML parser that must not be used with tainted data when about 99.99% of all HTML is potentially tainted.

Can anybody explain this contradiction?

There are ways to untaint data in Perl. (choroba)
@choroba Yes, by parsing it. (ceving)
See "Laundering and Detecting Tainted Data" in perlsec. You can, e.g., at least check the length of the input. (choroba)

2 Answers

6
votes

Marpa is actually safer than other parsers, because the language it parses is exactly that specified by the BNF. With regexes, PEG, etc., it's very hard to determine what language is actually parsed. In practice programmers tend to get a few test cases working and then give up.

In particular, the parsing of unwanted inputs could be a major security issue -- with traditional parsers you usually don't know everything you are letting through. Rarely does a test suite check to see if inputs which should be errors are in fact accepted. Marpa parses exactly the language in its specification -- nothing less and nothing more.

So why the scare language about taint mode? Marpa, in its most general case, can be seen as a programming language, and has exactly the same security issues. Allowing the user to execute arbitrary code is by definition insecure, and it is exactly what C, Perl, Marpa, etc. do by design. You cannot give an untrusted user a general language interface. This would be clear for C, Python, etc., but I thought someone might overlook it in the case of Marpa. Hence the scare language.

Marpa is IMHO more secure than competing technologies. However, in the most general case, that is not secure enough.

1
votes

Taint mode is an optional Perl setting (enabled with the -T switch) that says: treat user input as untrusted. It stops you from passing any "tainted" values, such as those read directly from STDIN or %ENV, to certain functions, because doing so is dangerous.

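The typical example is a code injection exploit, as in the xkcd comic "Exploits of a Mom", where a student's name containing SQL gets executed by the school's database.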

That's all "taint mode" does - it enforces running a sanitisation prior to using untrusted input in a risky way.

Untainting is straightforward: apply a regular expression to your source data and keep only what it captures, choosing the pattern so that any 'dangerous' metacharacters are excluded. (Note that Perl doesn't actually know what is 'dangerous' and what isn't; it assumes you're not being an idiot and just matching everything.)

This will error:

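#!/usr/bin/perl -T
use strict;
use warnings;

# Anything read from %ENV is tainted under -T.
my $tainted = $ENV{'USERNAME'};
# Fatal under -T: tainted data is interpolated into a shell command.
system ( "echo $tainted" );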

Because I'm passing an untrusted variable through to "system" and it might have embedded code injection.

Insecure dependency in system while running with -T switch at

(It might also complain about insecure path)

So to untaint, I need to sanitise. A reasonable sanitisation would be - username must be only alphanumeric:

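#!/usr/bin/perl -T
use strict;
use warnings;

$ENV{'PATH'} = '/bin'; # an untainted value
# Remove other environment variables that taint mode treats as unsafe.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

my $tainted = $ENV{'USERNAME'};
# Anchor the pattern so the whole value must be alphanumeric; an
# unanchored m/(\w+)/ would accept "alice; rm -rf /" and capture "alice".
my ( $untainted ) = $tainted =~ m/\A(\w+)\z/
    or die "USERNAME is not alphanumeric\n";
system ( "echo $untainted"); # no error now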

And because I have used a regex - perl assumes I haven't done something boneheaded (like (.*)) and thus considers the data untainted.
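To see that Perl really does not judge the pattern, here is a minimal sketch using `tainted` from the core module Scalar::Util. The helper name `launder` and the fallback value 'nobody' are my own, purely for illustration. Run it with `perl -T`; without -T nothing is ever tainted and every line reports "no".

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: return what the pattern captures (which Perl
# treats as untainted), or undef if the value does not match.
sub launder {
    my ($value, $pattern) = @_;
    return unless defined $value;
    my ($clean) = $value =~ $pattern;
    return $clean;
}

# Under -T a value read from %ENV is tainted; the literal fallback is not.
my $input = defined $ENV{USERNAME} ? $ENV{USERNAME} : 'nobody';

my $strict = launder($input, qr/\A(\w+)\z/);   # real validation
my $sloppy = launder($input, qr/\A(.*)\z/s);   # accepts anything at all

printf "input tainted:  %s\n", tainted($input) ? 'yes' : 'no';
printf "strict tainted: %s\n",
    defined $strict && tainted($strict) ? 'yes' : 'no';
printf "sloppy tainted: %s\n",
    defined $sloppy && tainted($sloppy) ? 'yes' : 'no';
```

Both captures come back untainted, even though the second pattern checks nothing. The taint flag only records that a regex was applied, not that it was a sensible one.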

Why is this important? Well, it depends what your parser does. It's not uncommon for parsers - by their nature - to get 'broken' by invalid input. See the above, for example - where escaping some inline SQL bypasses validation.

In your specific case:

  • taint mode is optional. You should use it when you're getting untrusted input (e.g. from potentially malicious users) but it's perhaps more trouble than it's worth for your own use.

  • Filtering HTML to validate length and character set is probably sensible. For example - checking it's an "ascii compatible character encoding".

Fundamentally though I think you're overthinking what taint checking is - it's not an exhaustive validation method - it's a safety net. All it does is ensure you've done some basic sanitisation before passing user input to an unsafe mechanism. That's to stop ridiculous gotchas like the one I outline - most of these can be caught by a simple regex.

If you're aware of the problem, and aren't concerned about malicious user input, then I don't think you need to be worried overly. A character whitelist will suffice, and then parse away.
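As a rough sketch of such a whitelist, assuming printable ASCII plus tab, CR and LF is acceptable for your input: the name `whitelist_input` and the 64 KiB limit are hypothetical, chosen only for illustration, and you would adapt the character class to your own data.

```perl
use strict;
use warnings;

# Hypothetical size limit, chosen only for illustration.
my $MAX_LENGTH = 64 * 1024;

# Accept only printable ASCII plus tab, CR and LF, up to $MAX_LENGTH
# characters. Returns the captured (and therefore untainted) copy,
# or undef if the input is rejected.
sub whitelist_input {
    my ($raw) = @_;
    return unless defined $raw;
    return if length($raw) > $MAX_LENGTH;
    my ($clean) = $raw =~ /\A([\x20-\x7E\t\r\n]*)\z/;
    return $clean;
}

my $document = whitelist_input("<p>Hello, world</p>\n");
print defined $document ? "accepted\n" : "rejected\n";
```

This only clears the taint flag after a basic sanity check. Whether the text is actually valid for your grammar is still for the parser to decide.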