
Trouble Getting Regular Expression To Work

I'm trying to use regular expressions to remove certain blocks of coding from a text file. So far, most of my regular expression lines have worked to remove the codes. However, I h

Solution 1:

If you are not allowed to use anything but Perl regular expressions, you could adapt this code, which strips HTML tags from text:

#!/usr/bin/perl -w
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags that require a corresponding
#           end tag, plus everything up to that end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
s{
  <               # open tag
  (?:             # open group (A)
    (!--) |       #   comment (1) or
    (\?) |        #   another comment (2) or
    (?i:          #   open group (B) for /i
      (           #     one of the start tags
        SCRIPT |  #     whose content
        APPLET |  #     must be skipped
        OBJECT |  #     entirely, up to
        STYLE     #     the corresponding
      )           #     end tag (3)
    ) |           #   close group (B), or
    ([!/A-Za-z])  #   one of these chars, remember in (4)
  )               # close group (A)
  (?(4)           # if previous case is (4)
    (?:           #   open group (C)
      (?!         #     and next is not : (D)
        [\s=]     #       \s or "="
        ["`']     #       with open quotes
      )           #     close (D)
      [^>] |      #     and not close tag or
      [\s=]       #     \s or "=" with
      `[^`]*` |   #     something in quotes ` or
      [\s=]       #     \s or "=" with
      '[^']*' |   #     something in quotes ' or
      [\s=]       #     \s or "=" with
      "[^"]*"     #     something in quotes "
    )*            #   repeat (C) 0 or more times
  |               # else (if previous case is not (4))
    .*?           #   minimum of any chars
  )               # end if previous char is (4)
  (?(1)           # if comment (1)
    (?<=--)       #   wait for "--"
  )               # end if comment (1)
  (?(2)           # if another comment (2)
    (?<=\?)       #   wait for "?"
  )               # end if another comment (2)
  (?(3)           # if one of tags-containers (3)
    </            #   wait for end
    (?i:\3)       #   of this tag
    (?:\s[^>]*)?  #   skip junk to ">"
  )               # end if (3)
  >               # tag closed
 }{}gsx;         # STRIP THIS TAG

print;

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

remove script, ul


1
2
paragraph

NOTE: This regex doesn't work for nested tag containers, e.g.:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be stripped too!!!
</object>

Output

Nested &lt;object> example

!!!this text should be stripped too!!!

Don't parse HTML with regexes. Use an HTML parser, or a tool built on top of one, e.g. HTML::Parser:

#!/usr/bin/perl -w
use strict;
use warnings;

use HTML::Parser ();

HTML::Parser->new(
    ignore_elements => ["script"],
    ignore_tags => ["ul"],
    default_h => [ sub { print shift }, 'text'],
    )->parse_file(\*DATA) or die "error: $!\n";

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

<html><title>remove script, ul</title>

<body>
<li>1
<li>2
<p>paragraph

Solution 2:

To reply to your last comment:

perl -e'$file="<script etc>\nfoo\n</script>bar"; $file =~ s/<script.*script>//gis; print $file'

This does seem to do what you want, as others suggested. I don't see how it differs from what you're trying, though.

....

Can you add this:

use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($file);

before the regexp and give us the result?

.....

Bingo:

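Lines 5 and 6 of your $file =~ list already filter them out: by the time a multi-line <script> regex runs, the opening and closing tags are gone, so it has nothing left to match: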

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
## Here they come:
$file =~ s/<script(.*)>//gi;
$file =~ s/<\/script>//gi;
$file =~ s/<head>//gi;
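
To see why the order matters, here is a minimal sketch using the same sample string as the one-liner above. The sub names tags_first and block_first are made up for illustration and are not from the question:

```perl
use strict;
use warnings;

# Same sample string as the one-liner earlier in this answer.
my $sample = "<script etc>\nfoo\n</script>bar";

# Per-tag substitutions run first, as in the list above.
sub tags_first {
    my ($file) = @_;
    $file =~ s/<script(.*)>//gi;       # removes only the opening tag
    $file =~ s/<\/script>//gi;         # removes only the closing tag
    $file =~ s/<script.*script>//gis;  # nothing left to match
    return $file;
}

# Whole-block substitution runs first.
sub block_first {
    my ($file) = @_;
    $file =~ s/<script.*script>//gis;  # removes tags and body together
    $file =~ s/<script(.*)>//gi;       # no-op by now
    $file =~ s/<\/script>//gi;         # no-op by now
    return $file;
}

print "tags_first:  [", tags_first($sample),  "]\n";  # the body "foo" survives
print "block_first: [", block_first($sample), "]\n";  # only "bar" is left
```

So if the script body is still in your output, moving the multi-line substitution above the per-tag ones is the first thing to try.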

Solution 3:

I'm not sure what programming language you're using, but assuming it's Perl, try putting the s modifier at the end of the regex:

$file =~ /<script type(.*)<\/script>/sgi

The /s modifier makes the . match any character, including newlines (normally it matches anything except a newline).
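
A small sketch of the effect, assuming a made-up three-line string (the sample text and sub names are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample: a script element that spans several lines.
my $html = "a<script type=\"x\">\nfoo();\n</script>b";

# Without /s the dot stops at the first newline, so the pattern cannot
# reach the closing tag.
sub matches_without_s {
    my ($text) = @_;
    return $text =~ /<script type(.*)<\/script>/i ? 1 : 0;
}

# With /s the dot also matches newlines, so the whole element is matched.
sub matches_with_s {
    my ($text) = @_;
    return $text =~ /<script type(.*)<\/script>/si ? 1 : 0;
}

print "without /s: ", matches_without_s($html), "\n";  # 0
print "with /s:    ", matches_with_s($html),    "\n";  # 1
```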


Edit: I apologize, I'm not good at Perl, but I did some looking around and I finally realized that the s/ in front is for substitutions. In this case, your regex should be:

$file =~ s/<script type(.*)<\/script>//sgi;

to remove everything, including the script tags. However, if you just want to remove the content between the tags and keep the tags themselves, it is:

$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/$1$2/sgi;

Notice the $1$2 between the slashes. This is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're replacing the whole match with an empty string. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20), like s/<ul(.*)>/ /gi.
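
Here is a rough sketch of the two variants side by side. The sample HTML and the sub names empty_script and remove_script are assumptions for illustration, not part of the question:

```perl
use strict;
use warnings;

# Hypothetical sample input with one multi-line script element.
my $html = qq{<p>x</p><script type="text/javascript">\nalert(1);\n</script><p>y</p>};

# Keep both tags and drop what is between them: $1 and $2 are put back.
sub empty_script {
    my ($file) = @_;
    $file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/$1$2/sgi;
    return $file;
}

# Empty replacement: the tags and their content are all removed.
sub remove_script {
    my ($file) = @_;
    $file =~ s/<script type="[^"]*"\s*>.*<\/script>//sgi;
    return $file;
}

print empty_script($html),  "\n";  # <p>x</p><script type="text/javascript"></script><p>y</p>
print remove_script($html), "\n";  # <p>x</p><p>y</p>
```

Note that .* is greedy, so with /s it runs from the first opening tag to the last closing tag in the file. If there are several script elements, everything between them goes too; .*? stops at the first closing tag instead.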


Since your last edit: you'll want to use one regex for the scripts, since you don't want their contents:

$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/ /sgi;

and another generic regex for all the other tags:

$file =~ s/<\/?\s*[^>]+>//sgi;

I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.


Solution 4:

You’re going to have to be a lot more careful than that. See both approaches in this answer.


Solution 5:

This:

$file =~ s/<div(.*)>//gi;

won't do what you expect. The '*' operator is greedy. If you have a line like:

hello<div id="foo"><b>bar!</b>baz

it'll match as much as it can, from <div through the last > on the line, leaving only:

hellobaz

You want:

$file =~ s/<div[^>]*>//gi;

or

$file =~ s/<div.*?>//gi;
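
A quick sketch comparing the three patterns on the line above (the sub names are made up for illustration):

```perl
use strict;
use warnings;

# Greedy: .* runs to the last ">" on the line.
sub strip_greedy {
    my ($file) = @_;
    $file =~ s/<div(.*)>//gi;
    return $file;
}

# Negated class: stops at the first ">".
sub strip_negated {
    my ($file) = @_;
    $file =~ s/<div[^>]*>//gi;
    return $file;
}

# Non-greedy: .*? also stops at the first ">".
sub strip_lazy {
    my ($file) = @_;
    $file =~ s/<div.*?>//gi;
    return $file;
}

my $line = 'hello<div id="foo"><b>bar!</b>baz';
print strip_greedy($line),  "\n";  # hellobaz
print strip_negated($line), "\n";  # hello<b>bar!</b>baz
print strip_lazy($line),    "\n";  # hello<b>bar!</b>baz
```

The two fixes are not quite identical: [^>]* also matches newlines, so it handles a tag whose attributes wrap onto the next line, while .*? needs the /s modifier for that.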
