Introduction

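Thanks to everyone's support and the recognition of the Microsoft Community Elite Program (微软社区精英计划) team, I was invited to set up a personal home page on Microsoft's MSDN network. Creating the page for the first time requires submitting information about my related blog posts, so I used PowerShell to collect it. This article describes how to use PowerShell to collect post information from cnblogs (博客园).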

Requirements

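The submitted posts need to be organized into a table showing the publication date, title, link, technical category, and content type, as in the table below (the content type 博客 means "blog").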

Published	Title	Link	Technical category	Content type
2010年07月22日	Windows Phone 7书托	http://www.cnblogs.com/procoder/archive/2010/07/22/Windows-Phone-7-Books.html	Windows Phone	博客

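Although generating the article list is a one-off task, copying and pasting is tedious and error-prone, so once again I used PowerShell to simplify the work. I admit I am a lazy programmer. The previous article, 如何使用PowerShell提升开发效率(以Windows Embedded CE为例) ("How to use PowerShell to improve development efficiency, with Windows Embedded CE as an example"), describes how to use PowerShell to streamline the Windows Mobile and Windows Embedded CE development workflow.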

Source code

Here is the code first; the walkthrough follows.

# Global variables
$blogName = "procoder";
$articles = New-Object System.Collections.Generic.List``1[System.Object]

$OutputEncoding = New-Object -typename System.Text.UTF8Encoding;

$webClient = New-Object System.Net.WebClient;
$webClient.Encoding = [System.Text.Encoding]::UTF8;

# [^"]+ and .+? keep a match from running past the closing quote or the first </a>
$regex = New-Object System.Text.RegularExpressions.Regex('<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>');
# Literal dots are escaped so that they match only "."
$regexDate = New-Object System.Text.RegularExpressions.Regex('http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html');


# Analyse the pages
# The upper bound of 100 is hard-coded; the loop ends earlier, at the first page with no matches.
for($i=1; $i -lt 100; ++$i)
{
    echo "Analysing Page $i ...";
    $html = $webClient.DownloadString("http://www.cnblogs.com/" + $blogName +"/default.html?page=" + $i);

    $matches = $regex.Matches($html);
    if($matches.Count -eq 0)
    {
        # No more pages
        $j = $i - 1;
        $count = $articles.Count;
        echo "Finished analysing, total $j pages and $count articles.";
        break;
    }

    foreach ($match in $matches)
    {
        # Empty record with the five report columns
        $article = "" | select title, url, date, catalog, type;
        $article.title = $match.Groups["title"].Value;
        $article.url = $match.Groups["url"].Value;
        $article.catalog = "Windows Mobile`r`n Windows Embedded CE";
        $article.type = "博客";

        # The publication date is encoded in the post URL
        $date = $regexDate.Matches($article.url);
        if($date.Count -gt 0)
        {
            $article.date = $date[0].Groups["year"].Value + "年" + $date[0].Groups["month"].Value + "月" + $date[0].Groups["day"].Value+ "日";
        }
        else
        {
            echo "Cannot find the date."
        }

        $articles.Add($article);
    }
}


# Generate the report
$head = '<style>
BODY{font-family:Verdana; background-color:lightblue;}
TABLE{border-width: 1px;border-style: solid;border-color: black;border-collapse: collapse;}
TH{font-size:1.3em; border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:#FFCCCC}
TD{border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:white}
</style>'

$header = "<H1>博客文章列表</H1>"
$title = "博客文章列表"

$path = Get-Location;
$path = $path.Path + "/report.html";

$articles |
    Select-Object date, title, url, catalog, type |
    ConvertTo-HTML -head $head -body $header -title $title |
    Out-File -FilePath $path -encoding "unicode";

Below is a screenshot of the script running in PowerShell. For the PowerShell environment setup, see the previous article.

Below is the generated article list.

Code walkthrough

$blogName = "procoder";

The name of the blog to collect from. Change it to your own blog name if needed; it could also be passed in as a parameter.
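
As a minimal sketch of passing the blog name in as a parameter, the URL construction can be wrapped in a small function. The function name Get-BlogPageUrl and its parameter names are hypothetical, introduced only for illustration; the URL pattern is the one the script uses.

```powershell
# Hypothetical helper: builds the list-page URL for a given blog and page number.
# The URL pattern is the one used by the script in this article.
function Get-BlogPageUrl
{
    param(
        [string]$BlogName = "procoder",   # default taken from the article
        [int]$Page = 1
    )
    "http://www.cnblogs.com/" + $BlogName + "/default.html?page=" + $Page
}

Get-BlogPageUrl -BlogName "procoder" -Page 2
# http://www.cnblogs.com/procoder/default.html?page=2
```

With this shape, the main loop would call the helper instead of concatenating the string inline, and the blog name would no longer need to be edited in the script body.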

$articles = New-Object System.Collections.Generic.List``1[System.Object]

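$articles is the container that holds the collected article information. Note that the type name looks slightly odd when the object is created: it needs the ``1 suffix (an escaped backtick followed by the number of type parameters).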
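
The backtick has to be doubled because it is PowerShell's escape character, while the CLR name of a generic type with one type parameter ends in a backtick and 1. As a hedged alternative, assuming PowerShell 2.0 or later, the same container can be created with the bracket syntax inside a quoted string, which avoids the escaping:

```powershell
# Same container, created with the bracket syntax for generic types.
$articles = New-Object 'System.Collections.Generic.List[System.Object]'

# Add returns nothing, so only the count is printed.
$articles.Add("first")
$articles.Add("second")
$articles.Count
# 2
```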

$OutputEncoding = New-Object -typename System.Text.UTF8Encoding;

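Because I use an English-language operating system, $OutputEncoding needs to be changed to UTF-8. $OutputEncoding is a PowerShell preference variable rather than an operating system environment variable; it sets the encoding PowerShell uses when it pipes text to external programs.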

$webClient = New-Object System.Net.WebClient;

$webClient.Encoding = [System.Text.Encoding]::UTF8;

WebClient does the downloading. Because the collected content contains Chinese, its encoding is set to UTF-8.
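
To illustrate why the encoding matters, the following sketch decodes the same UTF-8 bytes twice: once as UTF-8 and once as ISO-8859-1. ISO-8859-1 is chosen here only as an example of a mismatched encoding; it does not appear in the original script. Only the first decoding returns the original Chinese text.

```powershell
$text  = "博客"
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Decoding with the matching encoding restores the text.
$good = [System.Text.Encoding]::UTF8.GetString($bytes)

# Decoding with a different encoding (an arbitrary example) garbles it.
$bad = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetString($bytes)

$good -eq $text   # True
$bad  -eq $text   # False
```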

# [^"]+ and .+? keep a match from running past the closing quote or the first </a>
$regex = New-Object System.Text.RegularExpressions.Regex('<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>');

# Literal dots are escaped so that they match only "."
$regexDate = New-Object System.Text.RegularExpressions.Regex('http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html');

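Regular expressions pick out the results. They are written to fit the collected content: for example, the line below comes from the downloaded HTML source, and the title, link, and date are extracted according to its format.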

<a id="homepage1_HomePageDays_ctl00_DayList_ctl00_TitleUrl" class="postTitle2" href="http://www.cnblogs.com/procoder/archive/2010/05/17/Microsoft_Word_Save_As_PDF.html">[Office 2010 易宝典]怎样直接将Office文档保存为PDF格式？</a>
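
As a sketch, both patterns can be tried against the sample line above without touching the network. The variable names $sample, $titlePattern, and $datePattern are introduced only for this illustration. The patterns here use [^"]+ and .+? so that a match cannot run past the closing quote or the first </a>, and they escape the literal dots.

```powershell
$sample = '<a id="homepage1_HomePageDays_ctl00_DayList_ctl00_TitleUrl" class="postTitle2" href="http://www.cnblogs.com/procoder/archive/2010/05/17/Microsoft_Word_Save_As_PDF.html">[Office 2010 易宝典]怎样直接将Office文档保存为PDF格式？</a>'

$titlePattern = '<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>'
$datePattern  = 'http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html'

# Extract the link and the title from the anchor tag.
$m     = [regex]::Match($sample, $titlePattern)
$url   = $m.Groups["url"].Value
$title = $m.Groups["title"].Value

# Extract the date from the link, in the same format the script produces.
$d    = [regex]::Match($url, $datePattern)
$date = $d.Groups["year"].Value + "年" + $d.Groups["month"].Value + "月" + $d.Groups["day"].Value + "日"

$url
$title
$date
# 2010年05月17日
```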

echo "Analysing Page $i ...";
$html = $webClient.DownloadString("http://www.cnblogs.com/" + $blogName +"/default.html?page=" + $i);

$matches = $regex.Matches($html);
if($matches.Count -eq 0)
{
    # No more pages
    $j = $i - 1;
    $count = $articles.Count;
    echo "Finished analysing, total $j pages and $count articles.";
    break;
}

$webClient.DownloadString downloads the page and returns the HTML source as a string. $regex.Matches($html) then extracts the titles, links, and so on. If nothing matches, collection is complete.
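
The comment in the listing notes that the page limit is hard-coded. A minimal sketch of an open-ended loop follows. To keep it runnable offline, the download is replaced by a hypothetical stub, Get-FakePage, which returns one link each for pages 1 and 2 and nothing afterwards; in the real script this call would be $webClient.DownloadString. The simplified link pattern is also an assumption made for the stub's output.

```powershell
# Hypothetical stand-in for $webClient.DownloadString:
# two pages with content, then an empty page.
function Get-FakePage([int]$page)
{
    if ($page -le 2) { '<a href="post' + $page + '.html">Post ' + $page + '</a>' }
    else { '' }
}

$linkRegex = New-Object System.Text.RegularExpressions.Regex('<a href="(?<url>[^"]+)">')
$urls = New-Object 'System.Collections.Generic.List[System.Object]'

$i = 1
while ($true)
{
    $html  = Get-FakePage $i
    $found = $linkRegex.Matches($html)
    if ($found.Count -eq 0) { break }   # no more pages

    foreach ($match in $found) { $urls.Add($match.Groups["url"].Value) }
    ++$i
}

$pages = $i - 1
"Finished: $pages pages and $($urls.Count) links."
# Finished: 2 pages and 2 links.
```

The loop ends on the first page without matches, so no upper bound is needed; a real scraper might still keep a generous cap as a safeguard against a page that never comes back empty.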

foreach ($match in $matches)
{
    $article = "" | select title, url, date, catalog, type;
    $article.title = $match.Groups["title"].Value;
    $article.url = $match.Groups["url"].Value;
    $article.catalog = "Windows Mobile`r`n Windows Embedded CE";
    $article.type = "博客";

    $date = $regexDate.Matches($article.url);
    if($date.Count -gt 0)
    {
        $article.date = $date[0].Groups["year"].Value + "年" + $date[0].Groups["month"].Value + "月" + $date[0].Groups["day"].Value+ "日";
    }
    else
    {
        echo "Cannot find the date."
    }

    $articles.Add($article);
}

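The year, month, and day are extracted from the URL, all the matched information is stored in the $article object, and the object is then added to the container.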
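
The "" | select ... idiom creates an empty object that has the named properties. As a sketch of an alternative that should work on PowerShell 2.0 and later, the same kind of record can be built in one step with New-Object PSObject -Property. The helper name New-Article is hypothetical; the field values are the sample row from the requirements table.

```powershell
# Hypothetical helper that builds one article record.
function New-Article([string]$title, [string]$url, [string]$date)
{
    New-Object PSObject -Property @{
        title   = $title
        url     = $url
        date    = $date
        catalog = "Windows Phone"
        type    = "博客"
    }
}

$article = New-Article "Windows Phone 7书托" "http://www.cnblogs.com/procoder/archive/2010/07/22/Windows-Phone-7-Books.html" "2010年07月22日"
$article.title
$article.date
```

A hashtable does not guarantee property order, so the column order of the report still comes from the Select-Object call in the reporting step.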

# Generate the report
$head = '<style>
BODY{font-family:Verdana; background-color:lightblue;}
TABLE{border-width: 1px;border-style: solid;border-color: black;border-collapse: collapse;}
TH{font-size:1.3em; border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:#FFCCCC}
TD{border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:white}
</style>'

$header = "<H1>博客文章列表</H1>"
$title = "博客文章列表"

$path = Get-Location;
$path = $path.Path + "/report.html";

$articles |
    Select-Object date, title, url, catalog, type |
    ConvertTo-HTML -head $head -body $header -title $title |
    Out-File -FilePath $path -encoding "unicode";

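Finally, ConvertTo-HTML converts the container's contents to HTML, and Out-File writes the result to a file. Because the content contains Chinese, the encoding must be specified as "unicode", which for Out-File means UTF-16 little-endian.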
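
A minimal offline sketch of the report step follows, using one hand-written record in place of the scraped list. The variable names, the file name sample-report.html, and the use of the temp directory are assumptions made only for this illustration.

```powershell
# One hand-written record standing in for the scraped articles.
$rows = @(
    New-Object PSObject -Property @{ date = "2010年07月22日"; title = "Windows Phone 7书托"; type = "博客" }
)

# Select-Object fixes the column order; ConvertTo-Html returns the page as lines of text.
$html = $rows |
    Select-Object date, title, type |
    ConvertTo-Html -Title "博客文章列表" -Body "<H1>博客文章列表</H1>"

$path = Join-Path ([System.IO.Path]::GetTempPath()) "sample-report.html"
$html | Out-File -FilePath $path -Encoding "unicode"

Test-Path $path
# True
```

Opening the resulting file in a browser is a quick way to check that the Chinese text survived the round trip.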

This article is reposted from Jake Lin's blog on cnblogs. Original link: http://www.cnblogs.com/procoder/archive/2010/07/28/use-PowerShell-collect-cnblogs-articles.html. To repost it, please contact the original author.
