parse and extract proxies on the fly

by Neon Quach, 31 May 2010 03:57

In this post I will show you how to parse and extract proxies on the fly using the HtmlAgilityPack library and a regular expression. Let's get started.

Prerequisites:
Before working through this example code, you need the HtmlAgilityPack library; you can get the latest version of HtmlAgilityPack here

Next, search Google for a proxy site whose proxies you want to parse and extract. For demo purposes I created a new page on my blog that contains some proxies (more than 4,000); you can check it out here:  http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx


To keep it simple, I will create a Console application in Visual Studio named ParseProxyUsingHtmlPackandRegex

using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ParseProxyUsingHtmlPackandRegex
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page that lists the proxies.
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx");

            // Select the div that wraps the proxy list.
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='ctl00_cphBody_divText']");
            if (node == null)
            {
                // SelectSingleNode returns null when nothing matches the XPath query.
                Console.WriteLine("Proxy container not found.");
                return;
            }

            string returnValue = node.InnerHtml.ToString();

            // Find every ip:port pair in the div's HTML.
            Regex reg = new Regex(@"\d+\.\d+\.\d+\.\d+:\d+");
            MatchCollection matches = reg.Matches(returnValue);

            if (matches.Count > 0)
            {
                for (int i = 0; i < matches.Count; i++)
                {
                    Console.WriteLine(matches[i].Value);
                }
            }

            // Keep the console window open until a key is pressed.
            Console.Read();
        }
    }
}

VB.NET
Imports System
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Console.Title = "get proxies on the fly using vb.net"

        ' Download and parse the page that lists the proxies.
        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load("http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx")

        ' Select the div that wraps the proxy list.
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='ctl00_cphBody_divText']")
        If node Is Nothing Then
            ' SelectSingleNode returns Nothing when nothing matches the XPath query.
            Console.WriteLine("Proxy container not found.")
            Return
        End If

        Dim returnValue As String = node.InnerHtml.ToString()

        ' Find every ip:port pair in the div's HTML.
        Dim reg As New Regex("\d+\.\d+\.\d+\.\d+:\d+")
        Dim matches As MatchCollection = reg.Matches(returnValue)

        If matches.Count > 0 Then
            For i As Integer = 0 To matches.Count - 1
                Console.WriteLine(matches(i).Value)
            Next
        End If

        ' Keep the console window open until a key is pressed.
        Console.Read()
    End Sub

End Module


Code explanation:

First, I create an HtmlWeb object and call its Load method with the URL as the parameter, then store the result in an HtmlDocument object. The Load method of HtmlWeb has four overloads, listed below:


        // Summary:
        //     Gets an HTML document from an Internet resource.
        //
        // Parameters:
        //   url:
        //     The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
        //
        // Returns:
        //     A new HTML document.
        public HtmlDocument Load(string url);
        //
        // Summary:
        //     Loads an HTML document from an Internet resource.
        //
        // Parameters:
        //   url:
        //     The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
        //
        //   method:
        //     The HTTP method used to open the connection, such as GET, POST, PUT, or PROPFIND.
        //
        // Returns:
        //     A new HTML document.
        public HtmlDocument Load(string url, string method);
        //
        // Summary:
        //     Loads an HTML document from an Internet resource.
        //
        // Parameters:
        //   url:
        //     The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
        //
        //   method:
        //     The HTTP method used to open the connection, such as GET, POST, PUT, or PROPFIND.
        //
        //   proxy:
        //     Proxy to use with this request
        //
        //   credentials:
        //     Credentials to use when authenticating
        //
        // Returns:
        //     A new HTML document.
        public HtmlDocument Load(string url, string method, WebProxy proxy, NetworkCredential credentials);
        //
        // Summary:
        //     Gets an HTML document from an Internet resource.
        //
        // Parameters:
        //   url:
        //     The requested URL, such as "http://Myserver/Mypath/Myfile.asp".
        //
        //   proxyHost:
        //     Host to use for Proxy
        //
        //   proxyPort:
        //     Port the Proxy is on
        //
        //   userId:
        //     User Id for Authentication
        //
        //   password:
        //     Password for Authentication
        //
        // Returns:
        //     A new HTML document.
        public HtmlDocument Load(string url, string proxyHost, int proxyPort, string userId, string password);
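
To illustrate one of the overloads listed above, here is a minimal sketch that loads the demo page through a proxy using the (url, method, proxy, credentials) overload. The proxy address, port, user name, and password are placeholder assumptions, not values from this post; replace them with a proxy you are allowed to use. The catch clause assumes the request fails with a WebException, which may differ between HtmlAgilityPack versions.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

class LoadThroughProxyExample
{
    static void Main()
    {
        // Hypothetical proxy settings, for illustration only.
        const string proxyHost = "127.0.0.1";
        const int proxyPort = 8080;

        var web = new HtmlWeb();
        var proxy = new WebProxy(proxyHost, proxyPort);
        var credentials = new NetworkCredential("user", "password");

        try
        {
            // Same page as in the example, fetched with GET through the proxy.
            HtmlDocument doc = web.Load(
                "http://code2code.info/page/transparent-proxy-updated-25022010-2100.aspx",
                "GET", proxy, credentials);
            Console.WriteLine("Downloaded " + doc.DocumentNode.OuterHtml.Length + " characters.");
        }
        catch (WebException ex)
        {
            // The proxy or the site may be unreachable.
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```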


Next, I used Firebug to inspect the proxy page and find where the proxy values are located. They sit in a div tag whose id is ctl00_cphBody_divText, and it is the only such div on the page, so I call the SelectSingleNode method with the XPath query "//*[@id='ctl00_cphBody_divText']".
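
The XPath step can be tried without any network access. The sketch below parses an inline HTML string with HtmlDocument.LoadHtml and runs the same kind of id query; the DivFinder class, its method name, and the sample markup are my own illustrative assumptions. It also shows that SelectSingleNode returns null when nothing matches, which is worth checking before reading InnerHtml.

```csharp
using System;
using HtmlAgilityPack;

static class DivFinder
{
    // Returns the inner HTML of the first element with the given id,
    // or null when no element matches. The id is assumed to contain no quotes.
    public static string InnerHtmlById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node == null ? null : node.InnerHtml;
    }
}

class XPathExample
{
    static void Main()
    {
        // Made-up markup that mimics the structure of the demo page.
        string html = "<html><body><div id='ctl00_cphBody_divText'>10.0.0.1:8080</div></body></html>";

        string found = DivFinder.InnerHtmlById(html, "ctl00_cphBody_divText");
        Console.WriteLine(found ?? "(not found)");

        string missing = DivFinder.InnerHtmlById(html, "noSuchId");
        Console.WriteLine(missing ?? "(not found)");
    }
}
```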

I put the div's entire content into the returnValue variable, as this code shows:

string returnValue = node.InnerHtml.ToString();

Now it's time to pick out the proxies. I use a regular expression with the pattern @"\d+\.\d+\.\d+\.\d+:\d+" to capture every ip:port pair that matches.
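
The pattern above accepts any run of digits, so text such as 999.1.1.1:80 would also count as a proxy. As a possible refinement (not part of the original code), the sketch below uses named groups and then checks that each octet is at most 255 and the port is between 1 and 65535. The ProxyParser class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ProxyParser
{
    // Candidate ip:port pairs; numeric ranges are checked separately below.
    static readonly Regex Candidate =
        new Regex(@"\b(?<ip>\d{1,3}(?:\.\d{1,3}){3}):(?<port>\d{1,5})\b");

    public static List<string> Extract(string text)
    {
        var result = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            if (IsValidIp(m.Groups["ip"].Value) && IsValidPort(m.Groups["port"].Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }

    // Every octet must be in the range 0..255.
    static bool IsValidIp(string ip)
    {
        foreach (string part in ip.Split('.'))
        {
            if (int.Parse(part) > 255) return false;
        }
        return true;
    }

    // Ports are limited to 1..65535.
    static bool IsValidPort(string port)
    {
        int value = int.Parse(port);
        return value >= 1 && value <= 65535;
    }
}

class RegexExample
{
    static void Main()
    {
        string sample = "1.2.3.4:80, 999.1.1.1:80, 10.0.0.1:99999";
        foreach (string proxy in ProxyParser.Extract(sample))
        {
            Console.WriteLine(proxy); // prints only 1.2.3.4:80
        }
    }
}
```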

To check whether we actually found any proxies, I added a condition on the number of matches returned. Then I loop over the matches and print each value.
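
A proxy list with thousands of entries may contain repeats. The following sketch is an optional variation on the loop described above: it walks the MatchCollection with foreach and keeps only the first occurrence of each proxy, preserving the original order. The ProxyList class and its method name are my own, and the sample text is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ProxyList
{
    // Returns each matched ip:port once, in order of first appearance.
    public static List<string> DistinctMatches(string text)
    {
        var seen = new HashSet<string>();
        var ordered = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\d+\.\d+\.\d+\.\d+:\d+"))
        {
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(m.Value))
            {
                ordered.Add(m.Value);
            }
        }
        return ordered;
    }
}

class DistinctExample
{
    static void Main()
    {
        string sample = "10.0.0.1:8080 10.0.0.2:3128 10.0.0.1:8080";
        foreach (string proxy in ProxyList.DistinctMatches(sample))
        {
            Console.WriteLine(proxy);
        }
    }
}
```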

Have fun!


Categories: regular expression | c#

About me

I'm currently employed as a software developer at sutrixmedia.com. I am also a Microsoft Certified Technology Specialist (MCTS) and a Microsoft Certified Professional Developer (MCPD) in .NET Framework 2.0 and 3.5: Web Applications, and MCTS .NET Framework 3.5, ADO.NET Applications.