Log Analysis Shell Script

Discussion in 'Software' started by Mada_Milty, Nov 22, 2006.

  1. Mada_Milty

    Mada_Milty MajorGeek

    Good Day,

    I'm running a Squid (v 2.6) proxy server on Ubuntu 6.06. This app generates a client access log in the format:

    Code:
    time elapsed--remotehost--code/status--bytes--method--URL--rfc931--peerstatus/peerhost type
    where dashes separate fields. I've also attached this log for convenience. What I need to do is create a simple shell script that will analyze this log and produce a summary showing, first, how long clients spent online. I'm currently researching how to do this (it's been a while since I've used Linux's file manipulation commands), but while I do so, I thought I would open a thread for advice and recommendations. I have a vague recollection of the sed and awk commands... am I looking in the right direction?

    Thanks everyone...
     


  2. TimW

    TimW MajorGeeks Administrator - Jedi Malware Expert Staff Member

  3. goldfish

    goldfish Lt. Sushi.DC

    If it were me I'd go for Perl.

    What sort of processing do you want to apply to it? Just tracking user session lengths? That would be reasonably simple to do in Perl, as opposed to a bash script, which I reckon would be quite complex.
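
    For what it's worth, here is a rough sketch of what that could look like. It assumes the fields are separated by whitespace, as in the sample log line posted later in this thread, and it treats the gap between a client's first and last request as a crude stand-in for "time online". The sub name summarize_clients and the second sample line are made up for illustration.

```perl
use strict;
use warnings;

# Return a hash ref: client IP => { first, last, requests }.
# (summarize_clients is a hypothetical helper, not part of Squid or Perl.)
sub summarize_clients {
    my @lines = @_;
    my %clients;
    for my $line (@lines) {
        # split ' ' splits on runs of whitespace and ignores leading whitespace.
        # Field 2 (the per-request elapsed time) is not used in this sketch.
        my ($time, $elapsed, $client) = split ' ', $line;
        next unless defined $client;    # skip blank or malformed lines
        my $c = $clients{$client} ||= { first => $time, last => $time, requests => 0 };
        $c->{first} = $time if $time < $c->{first};
        $c->{last}  = $time if $time > $c->{last};
        $c->{requests}++;
    }
    return \%clients;
}

# First line is the sample from this thread; the second is invented for the demo.
my @sample = (
    '1164117990.680  11380 192.168.0.123 TCP_MISS/200 24160 GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html',
    '1164118050.680    120 192.168.0.123 TCP_MISS/200 512 GET http://www.asus.com/logo.gif - DIRECT/216.148.234.177 image/gif',
);

my $summary = summarize_clients(@sample);
for my $ip (sort keys %$summary) {
    my $c = $summary->{$ip};
    printf "%s: %d requests over %.0f seconds\n",
        $ip, $c->{requests}, $c->{last} - $c->{first};
}
```

    First-to-last timestamp is only one possible definition of a session; splitting on long idle gaps would be a natural refinement.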
     
  4. Mada_Milty

    Mada_Milty MajorGeek

    That's what I was thinking, but hoping to avoid. It's been a good while since I've worked with Perl, and even then, I hardly touched it... looks like back to the textbooks for me! I'm glad it's so similar to C++; I'm still pretty decent with that...
     
  5. goldfish

    goldfish Lt. Sushi.DC

    Apart from the fact that you can do regular expressions SO much more easily :)
     
  6. Mada_Milty

    Mada_Milty MajorGeek

    Okay, I've found my textbook sadly lacking!

    :rolleyes:

    I've learned how to open a file and loop over all the lines... whoop-de-doo!

    There's nothing here on pattern matching (which is what I really need to be able to extract the pertinent information from this file), so... does anyone have any good references on Perl? (I'm currently looking at www.perl.com.) I don't suppose I can embed regular shell commands?
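
    As a partial answer to both questions, here is a minimal sketch of matching each line against a pattern and of running a shell command from Perl with backticks. To keep it runnable anywhere it reads from an in-memory string instead of the real log; the path in the comment is only a placeholder, and the echo command assumes a Unix-like shell.

```perl
use strict;
use warnings;

# Stand-in for the real log. For the actual file you would use something like:
#   open(my $fh, '<', '/path/to/access.log') or die "Cannot open log: $!";
# (that path is a placeholder, not the real location)
my $log = <<'END';
1164117990.680  11380 192.168.0.123 TCP_MISS/200 24160 GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html
END

open(my $fh, '<', \$log) or die "Cannot open in-memory log: $!";
my @misses;
while (my $line = <$fh>) {
    chomp $line;
    # =~ tests the line against a pattern; here, any line containing TCP_MISS
    push @misses, $line if $line =~ /TCP_MISS/;
}
close $fh;
print scalar(@misses), " cache miss(es)\n";

# Backticks run a shell command and return whatever it printed.
my $greeting = `echo hello`;
chomp $greeting;
print "$greeting\n";
```

    Perl's system() function is the other common way to run a shell command when you only need the exit status rather than the output.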
     
  7. Mada_Milty

    Mada_Milty MajorGeek

  8. goldfish

    goldfish Lt. Sushi.DC

    I would tend to agree with that sentiment :)

    There are plenty of perl books out there - a "perl-monger" friend of mine wrote one :eek:
     
  9. Mada_Milty

    Mada_Milty MajorGeek

    Code:
    1164117990.680  11380 192.168.0.123 TCP_MISS/200 24160 GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html
    Any recommendations on how to extract the URL from these lines? It's variable length, but there is always a " -" at the end of it.

    I'm trying some combination of the index and substr functions, but I'm having no luck...
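
    For reference, a minimal sketch of how index and substr could be combined for this, next to a split-based alternative. It assumes the URL starts with "http://" and that fields are whitespace-separated as in the sample line above; the variable names are made up.

```perl
use strict;
use warnings;

my $line = '1164117990.680  11380 192.168.0.123 TCP_MISS/200 24160 GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html';

# index/substr: find where the URL starts, then the " -" that follows it.
my $start = index($line, 'http://');
my $end   = index($line, ' -', $start);
my $url_by_substr = ($start >= 0 && $end > $start)
    ? substr($line, $start, $end - $start)
    : undef;

# split: the URL is the seventh whitespace-separated field (index 6).
my $url_by_split = (split ' ', $line)[6];

print "$url_by_substr\n";
print "$url_by_split\n";
```

    The split version does not care what scheme the URL uses, which may matter for entries that are not plain http requests.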
     
  10. goldfish

    goldfish Lt. Sushi.DC

    Regular expressions my friend :)

    Let's see what we can do...
    Code:
     
    if ($string =~ /(http:\/\/.+?) -/) {
        print "$1\n";    # $1 holds the text captured by the brackets
    }
    
    Give that a try :)
     
  11. Mada_Milty

    Mada_Milty MajorGeek

    Okay, sorry that I'm so new to this... correct me if I'm wrong here.

    I'm trying to figure out this pattern you're trying to match...

    $string is obvious, it's the current line of the file I'm reading... I'm just using the default $_

    I see you have
    "http:" - That much is clear to me
    \/\/ - two forward slashes escaped by backslashes to get "http://"
    . - concatenation operator so we can add to this string
    +? - not too sure about this one... what's this do? Wildcard for any number of characters?

    Next question: why is this part in brackets?

    then you have the dash, and finally the pattern delimiter.

    If this evaluates as true, then it's just going to print the match? (of course, I can add my own statements...)
     
  12. goldfish

    goldfish Lt. Sushi.DC

    Ok, let me explain this for you :)

    In a regex, . isn't the concatenation operator. It means "any character except a newline". The + is a quantifier, which means match whatever came before it one or more times. So .+ means match any character one or more times. By default this is a greedy match, i.e. it will match as many characters as possible. The ? after it makes the match non-greedy, so it stops at the first place where the rest of the pattern (in this case " -") can match.
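
    A small sketch to make the greedy versus non-greedy difference visible. The test string is made up so that it contains two " -" sequences; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Made-up text with two " -" sequences so the difference shows up.
my $text = 'http://www.asus.com/ - DIRECT/216.148.234.177 - text/html';

my ($greedy) = $text =~ /(http:\/\/.+) -/;     # .+  runs on to the LAST " -"
my ($lazy)   = $text =~ /(http:\/\/.+?) -/;    # .+? stops at the FIRST " -"

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```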

    The brackets capture the part of the match you want. Otherwise you'd be using $&, which gives you the entire match, from http to the " -". We want just the URL itself, so the brackets will load http://your-matched-url.com/ into $1.

    So instead of getting:
    Code:
    http://your-url.com/ - 
    
    You'll get
    Code:
    http://your-url.com/
    
    Also it should be noted that wrapping the match in an if statement means you only use $1 when the match actually succeeded, which makes things a bit easier. After a failed match, $1 still holds whatever the last successful match put there, which can get confusing.
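
    A minimal sketch of both points, assuming a shortened version of the sample line. The second line and the host example.invalid are invented to show a non-matching case.

```perl
use strict;
use warnings;

my $line = 'GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html';

my ($whole, $url);
if ($line =~ /(http:\/\/.+?) -/) {
    $whole = $&;    # everything the pattern matched, including the trailing " -"
    $url   = $1;    # only the part inside the brackets
}
print "whole match: '$whole'\n";
print "capture:     '$url'\n";

# A failed match leaves $1 untouched, so copy it out only after a successful test.
my $other = 'CONNECT example.invalid:443 - DIRECT/- -';
my $found = ($other =~ /(http:\/\/.+?) -/) ? $1 : undef;
print defined $found ? "found $found\n" : "no URL in second line\n";
```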

    In your case you might add $1 into an array or list, rather than printing it.

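    Also, if you're going to run multiple regexes on the current line, it would be a very good idea to copy $_ into a named variable. It sounds a bit silly, but $_ is a global variable, and other code (a nested while loop reading another file, or a function you call) can change it underneath you, so regexes that reference $_ directly can start behaving oddly.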
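
    Putting those last two suggestions together, a rough sketch: a named loop variable instead of $_, two regexes run against the same line, and the results pushed onto an array instead of printed straight away. The names @log_lines and @hits are made up, and the client pattern assumes the address is the third whitespace-separated field.

```perl
use strict;
use warnings;

my @log_lines = (
    '1164117990.680  11380 192.168.0.123 TCP_MISS/200 24160 GET http://www.asus.com/ - DIRECT/216.148.234.177 text/html',
);

my @hits;
for my $line (@log_lines) {    # named variable; $_ is left alone
    my ($client) = $line =~ /^\S+\s+\S+\s+(\S+)/;    # third field: client address
    my ($url)    = $line =~ /(http:\/\/.+?) -/;      # URL up to the first " -"
    push @hits, "$client $url" if defined $client && defined $url;
}

print "$_\n" for @hits;
```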
     
