Fetching a web page in .NET Core

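When building a website, it is common to move some logic into a separate Web API, and calling that API usually means using HttpClient. It is not hard to use, as the following example shows: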

// Requires: using System.Net.Http;
HttpClient client = new HttpClient();
client.BaseAddress = new Uri("https://example.com");
// The relative path "/" is resolved against BaseAddress.
var response = await client.GetAsync("/");
string result = await response.Content.ReadAsStringAsync();

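Fetching a page's HTML source works the same way, but the returned text may contain encoded content (HTML entities and backslash escape sequences), so it needs an extra decoding step: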

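// Fetch the raw HTML of the page.
HttpClient client = new HttpClient();
client.BaseAddress = new Uri("https://www.google.com");

var response = await client.GetAsync("/");
string htmlStr = await response.Content.ReadAsStringAsync();
// Turn HTML entities such as &amp; back into the characters they stand for.
string htmlDecodeStr = System.Web.HttpUtility.HtmlDecode(htmlStr);
// Turn backslash escapes such as \u0026 into literal characters (this is not URL decoding).
string unescapedStr = System.Text.RegularExpressions.Regex.Unescape(htmlDecodeStr);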
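
To make the effect of the two decoding calls easier to see, here is a minimal sketch that applies them to a small hard-coded string instead of a live response. The class name DecodeDemo, the method name Decode, and the sample input are assumptions made up for illustration; only HttpUtility.HtmlDecode and Regex.Unescape come from the snippet above.

```csharp
using System.Text.RegularExpressions;
using System.Web;

// Hypothetical helper for illustration; not part of any library.
public static class DecodeDemo
{
    // Applies the same two steps as the snippet above, in the same order.
    public static string Decode(string raw)
    {
        // Step 1: HTML entities (&amp;, &lt;, ...) become literal characters.
        string htmlDecoded = HttpUtility.HtmlDecode(raw);

        // Step 2: backslash escapes (\u0026, \n, ...) become literal characters.
        // Regex.Unescape throws ArgumentException on a malformed escape,
        // so real page content may need a try/catch around this call.
        return Regex.Unescape(htmlDecoded);
    }
}
```

Calling DecodeDemo.Decode("Tom &amp; Jerry \\u0026 friends") should return "Tom & Jerry & friends": the entity is resolved by the first step and the \u0026 escape by the second.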

Once the string has been decoded, you can apply whatever processing logic you need to it.
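
As one possible example of such processing, the sketch below pulls the page title out of an HTML string with a regular expression. The class name TitleExtractor and the method name ExtractTitle are hypothetical, and a regex is only a rough tool here; a dedicated HTML parser would be more robust for anything beyond a quick check.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class TitleExtractor
{
    // Returns the text inside the first <title> element, or null if none is found.
    public static string ExtractTitle(string html)
    {
        // IgnoreCase handles <TITLE>; Singleline lets "." match across line breaks.
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

For instance, passing "<html><head><title> Example Domain </title></head></html>" should give back "Example Domain", and a string with no title element gives null.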