In this post I will show you how to parse and extract proxies on the fly using the HtmlAgilityPack library and a regular expression. Let's get started.
Requirements:
Before working through this example code, you need the HtmlAgilityPack library; you can get the latest version of HtmlAgilityPack here
Next, search Google for a proxy site whose proxies you want to parse and extract. For demo purposes I created a new page on my blog that contains some proxies (more than 4,000); you can check it out here: http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx
To keep it simple, I will create a console application in Visual Studio named ParseProxyUsingHtmlPackandRegex.
C#
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ParseProxyUsingHtmlPackandRegex
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the proxy page.
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx");

            // Select the div that holds the proxy list.
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='ctl00_cphBody_divText']");
            if (node == null)
            {
                // SelectSingleNode returns null when nothing matches the XPath query.
                Console.WriteLine("Proxy container not found.");
                return;
            }
            string returnValue = node.InnerHtml.ToString();

            // Match every ip:port pair in the markup.
            Regex reg = new Regex(@"\d+\.\d+\.\d+\.\d+:\d+");
            MatchCollection mathCollects = reg.Matches(returnValue);
            if (mathCollects.Count > 0)
            {
                for (int i = 0; i < mathCollects.Count; i++)
                {
                    Console.WriteLine(mathCollects[i].Value);
                }
            }

            // Keep the console window open until a key is pressed.
            Console.Read();
        }
    }
}
VB.NET
Imports System
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module Module1
    Sub Main()
        Console.Title = "get proxies on the fly using vb.net"

        ' Download and parse the proxy page.
        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load("http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx")

        ' Select the div that holds the proxy list.
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='ctl00_cphBody_divText']")
        If node Is Nothing Then
            ' SelectSingleNode returns Nothing when no element matches the XPath query.
            Console.WriteLine("Proxy container not found.")
            Return
        End If
        Dim returnValue = node.InnerHtml.ToString()

        ' Match every ip:port pair in the markup.
        Dim reg As New Regex("\d+\.\d+\.\d+\.\d+:\d+")
        Dim mathCollects As MatchCollection = reg.Matches(returnValue)
        If mathCollects.Count > 0 Then
            For i As Integer = 0 To mathCollects.Count - 1
                Console.WriteLine(mathCollects(i).Value)
            Next
        End If

        ' Keep the console window open until a key is pressed.
        Console.Read()
    End Sub
End Module
Code explanation:
First of all, I create an HtmlWeb object and call its Load method with the URL as the parameter, then store the result in an HtmlDocument object. The Load method of HtmlWeb has 4 overloads, as you can see below:
// Summary:
// Gets an HTML document from an Internet resource.
//
// Parameters:
// url:
// The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
//
// Returns:
// A new HTML document.
public HtmlDocument Load(string url);
//
// Summary:
// Loads an HTML document from an Internet resource.
//
// Parameters:
// url:
// The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
//
// method:
// The HTTP method used to open the connection, such as GET, POST, PUT, or PROPFIND.
//
// Returns:
// A new HTML document.
public HtmlDocument Load(string url, string method);
//
// Summary:
// Loads an HTML document from an Internet resource.
//
// Parameters:
// url:
// The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
//
// method:
// The HTTP method used to open the connection, such as GET, POST, PUT, or PROPFIND.
//
// proxy:
// Proxy to use with this request
//
// credentials:
// Credentials to use when authenticating
//
// Returns:
// A new HTML document.
public HtmlDocument Load(string url, string method, WebProxy proxy, NetworkCredential credentials);
//
// Summary:
// Gets an HTML document from an Internet resource.
//
// Parameters:
// url:
// The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
//
// proxyHost:
// Host to use for Proxy
//
// proxyPort:
// Port the Proxy is on
//
// userId:
// User Id for Authentication
//
// password:
// Password for Authentication
//
// Returns:
// A new HTML document.
public HtmlDocument Load(string url, string proxyHost, int proxyPort, string userId, string password);
Next, I used Firebug to inspect the proxy page and find out where the proxy values are located. They sit in a div tag with the id ctl00_cphBody_divText, and there is only one such div, so I use the SelectSingleNode method with "//*[@id='ctl00_cphBody_divText']" as the XPath query.
I put the entire content of that node into the returnValue variable, as you can see in this line:
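To try the XPath query without downloading anything, here is a minimal sketch that runs it against an in-memory HTML string. The sample markup, the XPathSketch class, and the InnerHtmlById helper are made up for illustration and are not part of the original project; only the id ctl00_cphBody_divText comes from the page described above.

```csharp
using System;
using HtmlAgilityPack;

static class XPathSketch
{
    // Hypothetical helper: returns the inner HTML of the element with the
    // given id, or null when no element matches the XPath query.
    public static string InnerHtmlById(string html, string id)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node == null ? null : node.InnerHtml;
    }

    static void Main()
    {
        // Made-up stand-in for the downloaded proxy page.
        const string html =
            "<html><body>" +
            "<div id='menu'>Home</div>" +
            "<div id='ctl00_cphBody_divText'>1.2.3.4:8080<br>5.6.7.8:3128</div>" +
            "</body></html>";

        // The first lookup finds the proxy container, the second one does not.
        Console.WriteLine(InnerHtmlById(html, "ctl00_cphBody_divText") ?? "not found");
        Console.WriteLine(InnerHtmlById(html, "missing") ?? "not found");
    }
}
```

The null check matters because SelectSingleNode returns null when nothing matches, for example if the site changes the id of the div.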
string returnValue = node.InnerHtml.ToString();
Now it's time to find out which values are proxies. I use a regular expression with the pattern @"\d+\.\d+\.\d+\.\d+:\d+" to pick up every ip:port pair that matches.
To make sure we actually got some proxies, I added a condition that checks the number of matches returned, then I loop over the matches and print each value.
Have fun!
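To see what the pattern picks up, here is a small sketch that applies it to a made-up string instead of the downloaded page. The ProxyPatternSketch class, the FindProxies helper, and the sample text are illustrative assumptions; the pattern itself is the one used above.

```csharp
using System;
using System.Text.RegularExpressions;

static class ProxyPatternSketch
{
    // Same pattern as in the post: four dot-separated digit groups, a colon, a port.
    static readonly Regex ProxyPattern = new Regex(@"\d+\.\d+\.\d+\.\d+:\d+");

    // Hypothetical helper: returns every ip:port substring found in the text.
    public static string[] FindProxies(string text)
    {
        MatchCollection matches = ProxyPattern.Matches(text);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    static void Main()
    {
        // Made-up sample: two proxies separated by markup and plain text.
        string sample = "<br>10.0.0.1:8080<br>not a proxy<br>192.168.1.20:3128";
        foreach (string proxy in FindProxies(sample))
        {
            Console.WriteLine(proxy);
        }
    }
}
```

Note that the pattern is deliberately loose: \d+ accepts any run of digits, so a string such as 999.999.999.999:99999 would also match. If that matters for your list, validate the numbers after matching.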
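As an alternative to the count check plus for loop, here is a sketch that walks the matches with foreach and collects them into a list. This is not the code from the project above, just another way to write the same step; the MatchLoopSketch class and the Collect helper are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchLoopSketch
{
    // Hypothetical helper: collects every ip:port match into a list.
    public static List<string> Collect(string text)
    {
        List<string> proxies = new List<string>();
        // foreach over an empty MatchCollection simply runs zero times,
        // so no separate Count check is needed here.
        foreach (Match m in Regex.Matches(text, @"\d+\.\d+\.\d+\.\d+:\d+"))
        {
            proxies.Add(m.Value);
        }
        return proxies;
    }

    static void Main()
    {
        Console.WriteLine(Collect("no proxies here").Count);             // prints 0
        Console.WriteLine(Collect("1.2.3.4:80 and 5.6.7.8:8080").Count); // prints 2
    }
}
```

Returning a list instead of printing directly makes it easier to reuse the proxies later, for example to save them to a file.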