Parsing CSVs in Power Shell


I  was messing around with some log files for the podcast and exported a CSV. There's lots of ways for us to get stats, but I was just poking around.
Given a CSV format like this:
"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
And noticing that there are different file extensions, but one logical show...for example there's an MP3 of Show 76 and WMA of that same show I wanted to find out how many downloads per show.
I fired up PowerShell. First I can bring in the CSV file and notice that it's parsed into a PowerShell object list nicely:
PS C:\> import-csv File.csv | Select File,Hits
File                                                        Hits
----                                                        ----
/hanselminutes_0026_lo.wma                                  78173
/hanselminutes_0076_robert_pickering.wma                    24626
/hanselminutes_0077.wma                                     17204
/hanselminutes_0076_robert_pickering.mp3                    15796
/hanselminutes_0078.wma                                     14832
/hanselminutes_0075.mp3                                     13685
/hanselminutes_0075.wma                                     12129
/hanselminutes_0078.mp3                                     11058
Notice that there are headings in this output? PowerShell's Import-CSV is smart enough to infer the headers and positions and split things up. After the first | pipe I don't ever have to think CSV again.
I'm going to need a new column, though, called Show with the Show number it in. I'll probably glean it from the Filename. Just to get started, I'll make a new "calculated column."
PS C:\> import-csv file.csv | select File, Hits, @{Name="Show";Expression={ $_.File } }
File                                    Hits    Show
----                                    ----    ----
/hanselminutes_0026_lo.wma              78173   /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626   /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma                 17204   /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796   /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma                 14832   /hanselminutes_0078.wma
/hanselminutes_0075.mp3                 13685   /hanselminutes_0075.mp3
/hanselminutes_0075.wma                 12129   /hanselminutes_0075.wma
/hanselminutes_0078.mp3                 11058   /hanselminutes_0078.mp3
/hanselminutes_0077.mp3                 11003   /hanselminutes_0077.mp3
/hanselminutes_0074.mp3                 6494    /hanselminutes_0074.mp3
Except you can see that while I have a new column called "Show" it's just got the File duplicated. Notice the format for making a new Column:
select File, Hits, @{Name="Show";Expression={ $_.File } }.
Inside that { } scriptblock I need to do my work to get a show number.
It's a classic joke, so I'll say it again: "So you've got a problem and you've decided to use Regular Expressions to solve it. Now you've got two problems."
The show format has changed a bit over the years and some have leading zeros, some have names at the end. It's organic. So, in front of the expression we add a Regular Expression. You can make them easily in PowerShell by putting [regex] in front of a string. The ; is just a separator between statements. The first statement is just the assignment of a RegEx to a variable called $re. Later in my script block I apply that RegEx to the $_.File (remember that $_ is the equivalent of "this" in PowerShell, so $_.File is this.File).
The @{Name="Value";Name="Value"} format is how to express a Hashtable to PowerShell.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv | 
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }
File                                    Hits     Show
----                                    ----     ----
/hanselminutes_0026_lo.wma              78173    26
/hanselminutes_0076_robert_pickering... 24626    76
/hanselminutes_0077.wma                 17204    77
/hanselminutes_0076_robert_pickering... 15796    76
/hanselminutes_0078.wma                 14832    78
/hanselminutes_0075.mp3                 13685    75
/hanselminutes_0075.wma                 12129    75
/hanselminutes_0078.mp3                 11058    78
/hanselminutes_0077.mp3                 11003    77
/hanselminutes_0074.mp3                 6494     74
The RegEx "\d{2}(?=[_.])" says "find me the first two decimals to the left of either a "_" or a "." It's not foolproof, at least not until show 100, but this is a Poor Man's Solution.
Next, I'll sort by Show descending then group all the like shows together.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
        sort Show -desc | group Show
Count Name  Group
----- ----  -----
    9 79    {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
   12 78    {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
   12 77    {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
   18 76    {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
   11 75    {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...
In this listing I can see via Count that there were 9 different formats of Show 79 downloaded in this time period. However, the good data is trapped in the Group column. All my previous line items are grouped in there. I need to sum up all the Hits for all the downloads of the the 9 different formats of Show 76...and all the shows.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show | 
    select Name,{($_.Group | Measure-Object -Sum Hits).Sum }
Name   ($_.Group | Measure-Object -Sum Hits).Sum
----   ------------------------------------------
79      9205
78     27575
77     29807
76     42798
75     27174
74     13060
73     10532
72     10145
71      9826
70      8745
69      8065
68      8132
67      7024
66      8535
65     13041
Now, I select out Name and then take each item's Group (remember $_ is this, so $_.Group means "the current item's Group." Each item will go through Measure-Object which has options like -sum and -average, etc. We grab them up and Sum them.
But the Column Header looks silly. I want a pretty name.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show |  
    select Name,
    @{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}
I just make a Hashtable with the Name "Hits" and put the scriptblock in a key called "Expression." Add some more sorting and:
Name   Hits
----   ----
26     78673
76     42798
77     29807
78     27575
75     27174
74     13060
65     13041
73     10532
Now I can see that show 76 got 42798 downloads this time period and the classic show 26 did nicely with 78673.

Reference: http://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspxhttp://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspx

Comments

Popular posts from this blog

Enterprise CA option is greyed out / unavailable

Rearm the Office 2013 installation

MP has rejected registration request due to failure in client certificate