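#!/usr/bin/env perl
# Print each unique domain found in the URL lists given as arguments.
%seen = ();
foreach (@ARGV) {
    open(LFILE, "$_") or die "Cannot open $_: $!";
    for $line (<LFILE>) {
        chomp $line;                  # drop the newline so a bare "http://host" line matches too
        @sline = split(/\//, $line);  # "http://host/path" -> ("http:", "", "host", "path")
        print("$sline[2]\n") unless $seen{$sline[2]}++;
    }
    close LFILE;
}
Perl tutorial from tizag.com was helpful.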
Kind of my extended memory with thoughts mostly on Linux and related technologies. You might also find some other stuff, a bit of SF, astronomy as well as old (quantum) chemistry posts.
Tuesday, March 10, 2009
My first Perl script
It's nothing big, but it's the first one and, as Perl is a write-only language, I'd better add a short description.
The script takes a list of files passed as command-line arguments, reads all lines (HTTP addresses) from them, and prints a list of unique domain names.
5 comments:
I like Perl, but:
cat file1 file2 file... fileN | sed 's/.*\/\///' | uniq
seems to be enough. ;)
BTW, it seems that you want www.google.com and google.com to be treated as separate domains?
PS. sort -u instead of uniq may be more useful.
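To illustrate the uniq vs. sort -u point: uniq only collapses duplicates that sit next to each other, so unsorted input can keep repeats. A minimal sketch, assuming a few made-up sample lines (the sample function and its hosts are placeholders for illustration, not taken from the post):

```shell
# Hypothetical sample input: the same host appears twice, but not on adjacent lines.
sample() {
    printf '%s\n' 'wp.pl' 'google.com' 'wp.pl'
}

# uniq alone compares neighbouring lines only, so both wp.pl lines survive.
sample | uniq

# sort -u sorts first and then drops duplicates, leaving one line per host.
sample | sort -u
```

The price of sort -u is that the original order of the addresses is lost, which probably does not matter for a list of domains.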
Cat is not necessary, is it?
I think that sed 's/...' file1 file2 ...fileN with uniq (or sort -u) should be enough.
Anyway, they do not do the same thing as my Perl script. The script outputs a list of domains, not HTTP addresses. Compare
wp.pl vs. http://wp.pl/jakis/tam/adres=?coswiecej
Indeed, the script treats www.google.com and google.com as different domains.
Oh, I missed the part after the domain. You can use a slightly modified sed, or:
awk -F "/" '{print $3}' file1 file2 | sort -u
I think it's still more readable, easier to remember, faster, and more elegant than this Perl.
And yes, cat was not necessary. But it was debug friendly. ;)
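For anyone wondering why $3 is the right field: with / as the separator, an address like http://host/path splits into "http:", an empty field (between the two slashes), and then the host. A small sketch, assuming a hypothetical helper named domains (not part of the original comment) and the example address from this thread:

```shell
# domains: hypothetical helper; reads URLs on stdin and prints each unique host once.
domains() {
    # Fields for "http://wp.pl/x": $1="http:", $2="", $3="wp.pl", $4="x"
    awk -F/ '{ print $3 }' | sort -u
}

# Both lines share the host wp.pl, so a single line is printed.
printf '%s\n' 'http://wp.pl/jakis/tam/adres=?coswiecej' 'http://wp.pl' | domains
```

Note that a line without "//" has no third field, so this sketch would emit an empty line for it.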
Perl is the best scripting language for text processing and regex handling. I have posted a few articles on those topics on my blog:
http://icfun.blogspot.com/search/label/perl
Also, Perl's CPAN has so much support that I don't even need to think extra while developing a project. I haven't found such help in any other programming language except Java and .NET.
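You can use the diamond operator loop to scan all lines from all files:
use strict;
my %uniq;
while (<>) {
    # [^/\s]+ stops at the first slash or whitespace, so a trailing newline is not captured
    m!//([^/\s]+)!
        and $uniq{$1} = 1;
}
# one domain per line; a bare "print keys %uniq" would run them all together
print "$_\n" for keys %uniq;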
Then you can use Perl 5.10 named captures:
    m!//(?<domain>[^/\s]+)!
        and $uniq{$+{domain}} = 1;
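And then you can write something really elegant:
use strict;
# Lines without "//" contribute an empty list, so the key/value pairs stay aligned.
print map { "$_\n" } keys %{
    {
        map {
            m!//(?<domain>[^/\s]+)! ? ( $+{domain} => 1 ) : ()
        } <>
    }
};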
BTW: Formatting comments on blogger sucks.