Parsing CSVs in Power Shell
I was messing around with some log files for the podcast and exported a CSV. There's lots of ways for us to get stats, but I was just poking around.
Given a CSV format like this:
"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
And noticing that there are different file extensions, but one logical show...for example there's an MP3 of Show 76 and WMA of that same show I wanted to find out how many downloads per show.
I fired up PowerShell. First I can bring in the CSV file and notice that it's parsed into a PowerShell object list nicely:
PS C:\> import-csv File.csv | Select File,Hits
File Hits
---- ----
/hanselminutes_0026_lo.wma 78173
/hanselminutes_0076_robert_pickering.wma 24626
/hanselminutes_0077.wma 17204
/hanselminutes_0076_robert_pickering.mp3 15796
/hanselminutes_0078.wma 14832
/hanselminutes_0075.mp3 13685
/hanselminutes_0075.wma 12129
/hanselminutes_0078.mp3 11058
---- ----
/hanselminutes_0026_lo.wma 78173
/hanselminutes_0076_robert_pickering.wma 24626
/hanselminutes_0077.wma 17204
/hanselminutes_0076_robert_pickering.mp3 15796
/hanselminutes_0078.wma 14832
/hanselminutes_0075.mp3 13685
/hanselminutes_0075.wma 12129
/hanselminutes_0078.mp3 11058
Notice that there are headings in this output? PowerShell's Import-CSV is smart enough to infer the headers and positions and split things up. After the first | pipe I don't ever have to think CSV again.
I'm going to need a new column, though, called Show with the Show number it in. I'll probably glean it from the Filename. Just to get started, I'll make a new "calculated column."
PS C:\> import-csv file.csv | select File, Hits, @{Name="Show";Expression={ $_.File } }
File Hits Show
---- ---- ----
/hanselminutes_0026_lo.wma 78173 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626 /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma 17204 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796 /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma 14832 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 /hanselminutes_0078.mp3
/hanselminutes_0077.mp3 11003 /hanselminutes_0077.mp3
/hanselminutes_0074.mp3 6494 /hanselminutes_0074.mp3
---- ---- ----
/hanselminutes_0026_lo.wma 78173 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626 /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma 17204 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796 /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma 14832 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 /hanselminutes_0078.mp3
/hanselminutes_0077.mp3 11003 /hanselminutes_0077.mp3
/hanselminutes_0074.mp3 6494 /hanselminutes_0074.mp3
Except you can see that while I have a new column called "Show" it's just got the File duplicated. Notice the format for making a new Column:
select File, Hits, @{Name="Show";Expression={ $_.File } }.
select File, Hits, @{Name="Show";Expression={ $_.File } }.
Inside that { } scriptblock I need to do my work to get a show number.
It's a classic joke, so I'll say it again: "So you've got a problem and you've decided to use Regular Expressions to solve it. Now you've got two problems."
The show format has changed a bit over the years and some have leading zeros, some have names at the end. It's organic. So, in front of the expression we add a Regular Expression. You can make them easily in PowerShell by putting [regex] in front of a string. The ; is just a separator between statements. The first statement is just the assignment of a RegEx to a variable called $re. Later in my script block I apply that RegEx to the $_.File (remember that $_ is the equivalent of "this" in PowerShell, so $_.File is this.File).
The @{Name="Value";Name="Value"} format is how to express a Hashtable to PowerShell.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }
File Hits Show
---- ---- ----
/hanselminutes_0026_lo.wma 78173 26
/hanselminutes_0076_robert_pickering... 24626 76
/hanselminutes_0077.wma 17204 77
/hanselminutes_0076_robert_pickering... 15796 76
/hanselminutes_0078.wma 14832 78
/hanselminutes_0075.mp3 13685 75
/hanselminutes_0075.wma 12129 75
/hanselminutes_0078.mp3 11058 78
/hanselminutes_0077.mp3 11003 77
/hanselminutes_0074.mp3 6494 74
---- ---- ----
/hanselminutes_0026_lo.wma 78173 26
/hanselminutes_0076_robert_pickering... 24626 76
/hanselminutes_0077.wma 17204 77
/hanselminutes_0076_robert_pickering... 15796 76
/hanselminutes_0078.wma 14832 78
/hanselminutes_0075.mp3 13685 75
/hanselminutes_0075.wma 12129 75
/hanselminutes_0078.mp3 11058 78
/hanselminutes_0077.mp3 11003 77
/hanselminutes_0074.mp3 6494 74
The RegEx "\d{2}(?=[_.])" says "find me the first two decimals to the left of either a "_" or a "." It's not foolproof, at least not until show 100, but this is a Poor Man's Solution.
Next, I'll sort by Show descending then group all the like shows together.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show
Count Name Group
----- ---- -----
9 79 {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
12 78 {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
12 77 {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
18 76 {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
11 75 {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...
----- ---- -----
9 79 {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
12 78 {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
12 77 {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
18 76 {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
11 75 {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...
In this listing I can see via Count that there were 9 different formats of Show 79 downloaded in this time period. However, the good data is trapped in the Group column. All my previous line items are grouped in there. I need to sum up all the Hits for all the downloads of the the 9 different formats of Show 76...and all the shows.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,{($_.Group | Measure-Object -Sum Hits).Sum }
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,{($_.Group | Measure-Object -Sum Hits).Sum }
Name ($_.Group | Measure-Object -Sum Hits).Sum
---- ------------------------------------------
79 9205
78 27575
77 29807
76 42798
75 27174
74 13060
73 10532
72 10145
71 9826
70 8745
69 8065
68 8132
67 7024
66 8535
65 13041
---- ------------------------------------------
79 9205
78 27575
77 29807
76 42798
75 27174
74 13060
73 10532
72 10145
71 9826
70 8745
69 8065
68 8132
67 7024
66 8535
65 13041
Now, I select out Name and then take each item's Group (remember $_ is this, so $_.Group means "the current item's Group." Each item will go through Measure-Object which has options like -sum and -average, etc. We grab them up and Sum them.
But the Column Header looks silly. I want a pretty name.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,
@{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,
@{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}
I just make a Hashtable with the Name "Hits" and put the scriptblock in a key called "Expression." Add some more sorting and:
Name Hits
---- ----
26 78673
76 42798
77 29807
78 27575
75 27174
74 13060
65 13041
73 10532
---- ----
26 78673
76 42798
77 29807
78 27575
75 27174
74 13060
65 13041
73 10532
Now I can see that show 76 got 42798 downloads this time period and the classic show 26 did nicely with 78673.
Reference: http://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspxhttp://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspx
Comments
Post a Comment