Parsing CSVs in Power Shell

I  was messing around with some log files for the podcast and exported a CSV. There's lots of ways for us to get stats, but I was just poking around.
Given a CSV format like this:
And noticing that there are different file extensions, but one logical show...for example there's an MP3 of Show 76 and WMA of that same show I wanted to find out how many downloads per show.
I fired up PowerShell. First I can bring in the CSV file and notice that it's parsed into a PowerShell object list nicely:
PS C:\> import-csv File.csv | Select File,Hits
File                                                        Hits
----                                                        ----
/hanselminutes_0026_lo.wma                                  78173
/hanselminutes_0076_robert_pickering.wma                    24626
/hanselminutes_0077.wma                                     17204
/hanselminutes_0076_robert_pickering.mp3                    15796
/hanselminutes_0078.wma                                     14832
/hanselminutes_0075.mp3                                     13685
/hanselminutes_0075.wma                                     12129
/hanselminutes_0078.mp3                                     11058
Notice that there are headings in this output? PowerShell's Import-CSV is smart enough to infer the headers and positions and split things up. After the first | pipe I don't ever have to think CSV again.
I'm going to need a new column, though, called Show with the Show number it in. I'll probably glean it from the Filename. Just to get started, I'll make a new "calculated column."
PS C:\> import-csv file.csv | select File, Hits, @{Name="Show";Expression={ $_.File } }
File                                    Hits    Show
----                                    ----    ----
/hanselminutes_0026_lo.wma              78173   /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626   /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma                 17204   /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796   /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma                 14832   /hanselminutes_0078.wma
/hanselminutes_0075.mp3                 13685   /hanselminutes_0075.mp3
/hanselminutes_0075.wma                 12129   /hanselminutes_0075.wma
/hanselminutes_0078.mp3                 11058   /hanselminutes_0078.mp3
/hanselminutes_0077.mp3                 11003   /hanselminutes_0077.mp3
/hanselminutes_0074.mp3                 6494    /hanselminutes_0074.mp3
Except you can see that while I have a new column called "Show" it's just got the File duplicated. Notice the format for making a new Column:
select File, Hits, @{Name="Show";Expression={ $_.File } }.
Inside that { } scriptblock I need to do my work to get a show number.
It's a classic joke, so I'll say it again: "So you've got a problem and you've decided to use Regular Expressions to solve it. Now you've got two problems."
The show format has changed a bit over the years and some have leading zeros, some have names at the end. It's organic. So, in front of the expression we add a Regular Expression. You can make them easily in PowerShell by putting [regex] in front of a string. The ; is just a separator between statements. The first statement is just the assignment of a RegEx to a variable called $re. Later in my script block I apply that RegEx to the $_.File (remember that $_ is the equivalent of "this" in PowerShell, so $_.File is this.File).
The @{Name="Value";Name="Value"} format is how to express a Hashtable to PowerShell.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv | 
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }
File                                    Hits     Show
----                                    ----     ----
/hanselminutes_0026_lo.wma              78173    26
/hanselminutes_0076_robert_pickering... 24626    76
/hanselminutes_0077.wma                 17204    77
/hanselminutes_0076_robert_pickering... 15796    76
/hanselminutes_0078.wma                 14832    78
/hanselminutes_0075.mp3                 13685    75
/hanselminutes_0075.wma                 12129    75
/hanselminutes_0078.mp3                 11058    78
/hanselminutes_0077.mp3                 11003    77
/hanselminutes_0074.mp3                 6494     74
The RegEx "\d{2}(?=[_.])" says "find me the first two decimals to the left of either a "_" or a "." It's not foolproof, at least not until show 100, but this is a Poor Man's Solution.
Next, I'll sort by Show descending then group all the like shows together.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
        sort Show -desc | group Show
Count Name  Group
----- ----  -----
    9 79    {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
   12 78    {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
   12 77    {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
   18 76    {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
   11 75    {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...
In this listing I can see via Count that there were 9 different formats of Show 79 downloaded in this time period. However, the good data is trapped in the Group column. All my previous line items are grouped in there. I need to sum up all the Hits for all the downloads of the the 9 different formats of Show 76...and all the shows.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show | 
    select Name,{($_.Group | Measure-Object -Sum Hits).Sum }
Name   ($_.Group | Measure-Object -Sum Hits).Sum
----   ------------------------------------------
79      9205
78     27575
77     29807
76     42798
75     27174
74     13060
73     10532
72     10145
71      9826
70      8745
69      8065
68      8132
67      7024
66      8535
65     13041
Now, I select out Name and then take each item's Group (remember $_ is this, so $_.Group means "the current item's Group." Each item will go through Measure-Object which has options like -sum and -average, etc. We grab them up and Sum them.
But the Column Header looks silly. I want a pretty name.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show |  
    select Name,
    @{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}
I just make a Hashtable with the Name "Hits" and put the scriptblock in a key called "Expression." Add some more sorting and:
Name   Hits
----   ----
26     78673
76     42798
77     29807
78     27575
75     27174
74     13060
65     13041
73     10532
Now I can see that show 76 got 42798 downloads this time period and the classic show 26 did nicely with 78673.



Popular posts from this blog

The install4j wizard could not find a Java Runtime Environment on your system. Please locate a suitable JRE...

Clients Unable to update - "Cached cookie has expired or new PID is available"

Java Control Panel Icon "Application Not Found"