To recreate the page that we need to scrape, I've created a simple function to build a HTML page containing two tables. You can use this function whilst doing your testing, but I imagine that in a real-life situation you will want to retrieve the HTML directly from the web page, or maybe read all the lines from a locally based file. The function I created looks like this, although feel free to modify this if you need to:
tags:
- <table[^>]*>(.*?)</table>
This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had tags, and some didn't, I decided to include a check in the function to see if they did exist. If they did, I would use the text inside these tags for the column names in my DataTable; if they didn't exist, I would simply create a default naming scheme (e.g. Column1, Column2 etc).
Logic
The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:
1. Retrieve each instance of the table elements on the page.
2. Loop through each table, performing the following checks.
3. Check for the existence of <th> tags to determine if we know the names of the columns, otherwise just add a default name for each column.
4. Loop through the rows of the table and for each column, add the value to our column in the DataTable.
Implementation
Recreating these steps into a .NET function, I came up with this function named "ConvertHTMLTablesToDataSet" which accepts the full HTML string, performs the actions we identified above and then returns a DataSet with a corresponding DataTable for each HTML table that was found:
- Private Function ConvertHTMLTablesToDataSet(ByVal HTML As String) As DataSet
- ' Declarations
- Dim ds As New DataSet
- Dim dt As DataTable
- Dim dr As DataRow
- Dim dc As DataColumn
- Dim TableExpression As String = "<table[^>]*>(.*?)</table>"
- Dim HeaderExpression As String = "<th[^>]*>(.*?)</th>"
- Dim RowExpression As String = "<tr[^>]*>(.*?)</tr>"
- Dim ColumnExpression As String = "<td[^>]*>(.*?)</td>"
- Dim HeadersExist As Boolean = False
- Dim iCurrentColumn As Integer = 0
- Dim iCurrentRow As Integer = 0
-
- ' Get a match for all the tables in the HTML
- Dim Tables As MatchCollection = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
-
- ' Loop through each table element
- For Each Table As Match In Tables
-
- ' Reset the current row counter and the header flag
- iCurrentRow = 0
- HeadersExist = False
-
- ' Add a new table to the DataSet
- dt = New DataTable
-
- ' Create the relevant amount of columns for this table (use the headers if they exist, otherwise use default names)
- If Table.Value.Contains("<th") Then
- ' Set the HeadersExist flag
- HeadersExist = True
-
- ' Get a match for all the rows in the table
- Dim Headers As MatchCollection = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
-
- ' Loop through each header element
- For Each Header As Match In Headers
- dt.Columns.Add(Header.Groups(1).ToString)
- Next
- Else
- For iColumns As Integer = 1 To Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Count
- dt.Columns.Add("Column " & iColumns)
- Next
- End If
-
- ' Get a match for all the rows in the table
- Dim Rows As MatchCollection = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
-
- ' Loop through each row element
- For Each Row As Match In Rows
-
- ' Only loop through the row if it isn't a header row
- If Not (iCurrentRow = 0 And HeadersExist = True) Then
-
- ' Create a new row and reset the current column counter
- dr = dt.NewRow
- iCurrentColumn = 0
-
- ' Get a match for all the columns in the row
- Dim Columns As MatchCollection = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
-
- ' Loop through each column element
- For Each Column As Match In Columns
-
- ' Add the value to the DataRow
- dr(iCurrentColumn) = Column.Groups(1).ToString
-
- ' Increase the current column
- iCurrentColumn += 1
- Next
-
- ' Add the DataRow to the DataTable
- dt.Rows.Add(dr)
-
- End If
-
- ' Increase the current row counter
- iCurrentRow += 1
- Next
-
- ' Add the DataTable to the DataSet
- ds.Tables.Add(dt)
-
- Next
-
- Return (ds)
-
- End Function
Viewing the results
If you want to test this function, you can create a simple .aspx page with a Panel on it:
- <%@ Page Language="VB" AutoEventWireup="false" CodeFile="Default1.aspx.vb" Inherits="Default1" %>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-
- <html xmlns="http://www.w3.org/1999/xhtml" >
- <head runat="server">
- <title>Untitled Page</title>
- </head>
- <body>
- <form id="form1" runat="server">
- <div>
- <asp:Panel ID="pnlGridViews" runat="server">
- </asp:Panel>
- </div>
- </form>
- </body>
- </html>
And then create some dynamic GridView's for each DataTable e.g.
- Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
-
- ' Get some HTML (you can replace this test data with your real HTML)
- Dim HTML As String = GetHTML()
- Dim Results As DataSet = ConvertHTMLTablesToDataSet(HTML)
- Dim Grid As GridView
-
- ' Create a GridView for each DataTable
- For Each Data As DataTable In Results.Tables
- .Grid = New GridView
- .Grid.DataSource = Data
- .Grid.DataBind()
- .pnlGridViews.Controls.Add(Grid)
- .Next()
-
- End Sub
You may also need to include the following Import statements on your page:
- Imports System.Data
- Imports System.Text.RegularExpressions
- Imports System.Net
- Imports System.IO
When you run this test page in your development environment, if you have used the sample data from the GetHTML function above you should see the following tables:
Considerations and Improvements
You may want to entend the functionality of this approach. For example, the function assumes that the HTML that is retrieved will be valid and in the correct format. I was lucky in the sense that I knew exactly what would be included in the HTML before writing the function, however, if you are retrieving data from an external site this may not always be the case so you may want to build in your own validity check and associated error handling.
This Hack is part of the ASP.NET Hacks collection