Taint mode is an optional Perl setting that says: treat user input as untrusted. It stops you from using any "tainted" variables, such as those read directly from STDIN or %ENV, in certain functions, because doing so is dangerous.
The typical example is a code injection exploit:
That's all "taint mode" does - it enforces running a sanitisation prior to using untrusted input in a risky way.
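If you want to see which values perl currently regards as tainted, the core module Scalar::Util exports a tainted() function. This is a minimal sketch, assuming you run it as `perl -T script.pl`; taint_status is a hypothetical helper name, and without -T everything simply reports as clean.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: report whether a value is currently flagged as tainted.
# Without the -T switch, tainted() always returns false.
sub taint_status {
    my ($value) = @_;
    return tainted($value) ? 'tainted' : 'clean';
}

my $from_env = $ENV{'USERNAME'} // '';    # environment data: untrusted under -T
my $literal  = 'hello';                   # literal in the source: trusted

printf "from_env: %s\n", taint_status($from_env);
printf "literal:  %s\n", taint_status($literal);
```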
Untainting is straightforward: all you need to do is apply a regular expression to your source data and capture only the part you consider safe, so that any 'dangerous' metacharacters are excluded. (Note that perl doesn't actually know what is 'dangerous' and what isn't; it assumes you're not being an idiot and simply matching everything.)
This will error:
#!/usr/bin/perl -T
use strict;
use warnings;
my $tainted = $ENV{'USERNAME'};
system ( "echo $tainted" );
That's because I'm passing an untrusted variable through to "system", and it might contain embedded shell commands. The error is:
Insecure dependency in system while running with -T switch at
(It might also complain about an insecure $ENV{PATH}.)
So to untaint, I need to sanitise. A reasonable sanitisation would be - username must be only alphanumeric:
#!/usr/bin/perl -T
use strict;
use warnings;
$ENV{'PATH'} = '/bin'; # an untainted value
my $tainted = $ENV{'USERNAME'};
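# Anchor with \A and \z so the whole value must be word characters, not just part of it
my ( $untainted ) = $tainted =~ m/\A(\w+)\z/ or die "Invalid username\n";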
system ( "echo $untainted"); # no error now
And because I have used a regex capture, perl assumes I haven't done something boneheaded (like (.*)) and thus considers the data untainted.
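If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch under my own assumptions: untaint_username is a hypothetical name, and the 1 to 32 character limit is an arbitrary choice, not something perl requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return an untainted copy of $value if the whole
# string is 1-32 ASCII word characters, or undef if it is not.
sub untaint_username {
    my ($value) = @_;
    return undef unless defined $value;

    # Only a regex capture ($1) clears the taint flag. \A and \z anchor
    # the match so the entire string must pass, not just a fragment of it.
    return $value =~ /\A([A-Za-z0-9_]{1,32})\z/ ? $1 : undef;
}

my $name = untaint_username( $ENV{'USERNAME'} );
if ( defined $name ) {
    print "accepted: $name\n";
}
else {
    print "rejected: USERNAME is missing or has unexpected characters\n";
}
```

Returning undef on failure, rather than quietly trimming the bad characters, makes the caller decide what to do with input that doesn't fit.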
Why is this important? Well, it depends what your parser does. It's not uncommon for parsers - by their nature - to get 'broken' by invalid input. See the above, for example - where escaping some inline SQL bypasses validation.
In your specific case:
Taint mode is optional. You should use it when you're getting untrusted input (e.g. from potentially malicious users), but it's perhaps more trouble than it's worth for your own use.
Filtering HTML to validate length and character set is probably sensible. For example, checking that it's in an ASCII-compatible character encoding.
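A length and character set check of that kind might look something like the sketch below. The function name check_html_input and the 10,000 character default are my own assumptions, so adjust both to suit your data. It only confirms the input is plain ASCII of a sane size before you hand it to the parser; it doesn't make the content safe to pass to a shell.

```perl
use strict;
use warnings;

# Hypothetical pre-parse check: accept input only if it is within a length
# limit and contains nothing but printable ASCII plus tab, CR and LF.
# Returns the captured (and therefore untainted) text, or undef on failure.
sub check_html_input {
    my ( $input, $max_length ) = @_;
    $max_length //= 10_000;    # assumed default limit
    return undef unless defined $input;
    return undef if length($input) > $max_length;
    return $input =~ /\A([\x20-\x7E\t\r\n]*)\z/ ? $1 : undef;
}

my $html = "<p>Hello, world</p>\n";
if ( defined( my $checked = check_html_input($html) ) ) {
    print "ok to parse: ", length($checked), " characters\n";
}
else {
    print "rejected: too long or contains non-ASCII characters\n";
}
```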
Fundamentally though I think you're overthinking what taint checking is - it's not an exhaustive validation method - it's a safety net. All it does is ensure you've done some basic sanitisation before passing user input to an unsafe mechanism. That's to stop ridiculous gotchas like the one I outline - most of these can be caught by a simple regex.
If you're aware of the problem, and aren't concerned about malicious user input, then I don't think you need to be overly worried. A character whitelist will suffice, and then parse away.