
Tuesday, March 10, 2009

My first Perl script

It's nothing big, but it's the first one and, as Perl is a write-only language, I'd better add a short description. The script takes a list of files passed as command-line arguments, reads all lines (HTTP addresses) from them, and prints a list of unique domain names.
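#!/usr/bin/env perl

# Domains that have already been printed.
%seen = ();
foreach (@ARGV)
{
open (LFILE,"$_") or die "Cannot open $_: $!";

# Read every line of the current file.
for $line (<LFILE>)
{
       chomp($line);
       # For http://host/path, field 2 of the "/"-split is the host.
       @sline=split(/\//,$line);
       print ("$sline[2]\n") unless $seen{$sline[2]}++;
}

close LFILE;
}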
Perl tutorial from tizag.com was helpful.

5 comments:

Anonymous said...

I like Perl, but:

cat file1 file2 file... fileN | sed 's/.*\/\///' | uniq

seems to be enough. ;)
BTW, it seems that you want www.google.com and google.com to be separate domains?

PS. sort -u instead of uniq may be more useful.
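
To illustrate the PS: uniq only collapses adjacent repeated lines, while sort -u sorts first and then drops duplicates. A minimal sketch with made-up sample data (the variable names are only for illustration):

```shell
# Hypothetical sample input with a non-adjacent duplicate.
sample='google.com
wp.pl
google.com'

# uniq only merges adjacent identical lines, so both google.com lines survive.
with_uniq=$(printf '%s\n' "$sample" | uniq)

# sort -u sorts the lines and keeps one copy of each.
with_sort_u=$(printf '%s\n' "$sample" | sort -u)

printf 'uniq:\n%s\n\nsort -u:\n%s\n' "$with_uniq" "$with_sort_u"
```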

Wawrzek said...

Cat is not necessary, is it?

I think that sed 's/...' file1 file2 ...fileN with uniq (or sort -u) should be enough.

Anyway, they don't do the same thing as my Perl script. The script outputs a list of domains, not HTTP addresses. Compare
wp.pl vs. http://wp.pl/jakis/tam/adres=?coswiecej

Indeed, the script treats www.google.com and google.com as different domains.
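
If the two should ever count as one domain, a leading "www." could be stripped before deduplicating. This is only a sketch, assuming the input is already one domain per line; strip_www is a hypothetical helper name, not part of the script above.

```shell
# strip_www: hypothetical helper that removes one leading "www." from each line.
strip_www() {
    sed 's/^www\.//'
}

# www.google.com and google.com collapse into a single entry.
printf '%s\n' www.google.com google.com wp.pl | strip_www | sort -u
```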

Anonymous said...

Oh, I missed the part after domain. You can use slightly modified sed, or:

awk -F "/" '{print $3}' file1 file2 | sort -u

I think it's still more readable, easier to remember, faster, and more elegant than this Perl script.

And yes, cat was not necessary. But it was debug friendly. ;)
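
The "slightly modified sed" is not spelled out above; one possible form, as a sketch only, uses two substitutions: drop everything up to and including "//", then drop the first "/" and whatever follows. The helper name extract_domain and the sample URLs are assumptions for illustration.

```shell
# extract_domain: hypothetical helper that keeps only the host part of each URL.
extract_domain() {
    # 1st expression: remove the scheme, up to and including "//".
    # 2nd expression: remove the first "/" and everything after it.
    sed -e 's|^[^/]*//||' -e 's|/.*$||'
}

printf '%s\n' 'http://wp.pl/jakis/tam/adres=?coswiecej' 'http://wp.pl/' \
    | extract_domain | sort -u
```

With real files, the same function can read them via cat or input redirection.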

Demon said...

Perl is the best scripting language for text processing and regex handling. I have posted a few related articles on my blog:

http://icfun.blogspot.com/search/label/perl

Also, Perl's CPAN offers so much support that I hardly need to think about anything extra while developing a project. I haven't found such help for other programming languages, except Java and .NET.

Anonymous said...

You can use the diamond loop to scan all lines from all files:


use strict;

my %uniq;

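while (<>) {
  # Capture the host part; \s keeps a trailing newline out of the key.
  m!//([^/\s]+)!
   and $uniq{$1} = 1;
}

# Print one domain per line.
print "$_\n" for keys %uniq;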


Then you can use Perl 5.10 named captures:


m!//(?<domain>[^/]+)!
   and $uniq{$+{domain}} = 1;


and then you can write something really elegant:


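use strict;

# Build an anonymous hash keyed by domain, then print its keys one per line.
print "$_\n" for keys %{
  {
   map {
    # Lines without "//" contribute nothing, so the key/value list stays even.
    m!//(?<domain>[^/\s]+)! ? ($+{domain} => 1) : ()
   } <>
  }
};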

BTW: Formatting comments on Blogger sucks.