Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Community Wiki

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Navigation

Google Ads

ASP.NET: Convert HTML tables to a DataSet

From Wiki

Jump to: navigation, search

Summary: An example of how we can extract data from HTML tables and create a DataSet object containing this data.


Contents [Hide]

Requirements

I recently needed to do some "screen scraping" from a locally installed 3rd party web application and then do some data manipulation based on the results. The application in question wrote the results out to the page inside tags and there were several of these tables on the page. I decided that the approach I would take would be to read all of these HTML tables, identifying them with a Regular Expression, and then convert them into one DataSet where I could then perform the required manipulation.

Sample Data

To recreate the page that we need to scrape, I've created a simple function to build a HTML page containing two tables. You can use this function whilst doing your testing, but I imagine that in a real-life situation you will want to retrieve the HTML directly from the web page, or maybe read all the lines from a locally based file. The function I created looks like this, although feel free to modify this if you need to:

  1. Private Function GetHTML() As String
  2.     ' Declarations  
  3.     Dim sb As New StringBuilder
  4.  
  5.     ' Create a valid HTML file  
  6.     sb.AppendLine("<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">")
  7.     sb.AppendLine("<html>")
  8.     sb.AppendLine("<head>")
  9.     sb.AppendLine("<meta http-equiv=""Content-Type"" content=""text/html; charset=iso-8859-1" > "")
  10.     sb.AppendLine("<title>Title</title>")
  11.     sb.AppendLine("</head>")
  12.     sb.AppendLine("<body>")
  13.  
  14.     ' Table One (with headers)  
  15.     sb.AppendLine("<table>")
  16.     sb.AppendLine("<tr>")
  17.     sb.AppendLine("<th>Table 1 - Header 1</th>")
  18.     sb.AppendLine("<th>Table 1 - Header 2</th>")
  19.     sb.AppendLine("<th>Table 1 - Header 3</th>")
  20.     sb.AppendLine("</tr>")
  21.     sb.AppendLine("<tr>")
  22.     sb.AppendLine("<td>Table 1 - Row 1 - Column 1</td>")
  23.     sb.AppendLine("<td>Table 1 - Row 1 - Column 2</td>")
  24.     sb.AppendLine("<td>Table 1 - Row 1 - Column 3</td>")
  25.     sb.AppendLine("</tr>")
  26.     sb.AppendLine("<tr>")
  27.     sb.AppendLine("<td>Table 1 - Row 2 - Column 1</td>")
  28.     sb.AppendLine("<td>Table 1 - Row 2 - Column 2</td>")
  29.     sb.AppendLine("<td>Table 1 - Row 2 - Column 3</td>")
  30.     sb.AppendLine("</td>")
  31.     sb.AppendLine("</tr>")
  32.     sb.AppendLine("<tr>")
  33.     sb.AppendLine("<td>Table 1 - Row 3 - Column 1</td>")
  34.     sb.AppendLine("<td>Table 1 - Row 3 - Column 2</td>")
  35.     sb.AppendLine("<td>Table 1 - Row 3 - Column 3</td>")
  36.     sb.AppendLine("</td>")
  37.     sb.AppendLine("</tr>")
  38.     sb.AppendLine("</table>")
  39.  
  40.     ' Table Two (without headers)  
  41.     sb.AppendLine("<table>")
  42.     sb.AppendLine("<tr>")
  43.     sb.AppendLine("<td>Table 2 - Row 1 - Column 1</td>")
  44.     sb.AppendLine("<td>Table 2 - Row 1 - Column 2</td>")
  45.     sb.AppendLine("<td>Table 2 - Row 1 - Column 3</td>")
  46.     sb.AppendLine("</td>")
  47.     sb.AppendLine("</tr>")
  48.     sb.AppendLine("<tr>")
  49.     sb.AppendLine("<td>Table 2 - Row 2 - Column 1</td>")
  50.     sb.AppendLine("<td>Table 2 - Row 2 - Column 2</td>")
  51.     sb.AppendLine("<td>Table 2 - Row 2 - Column 3</td>")
  52.     sb.AppendLine("</td>")
  53.     sb.AppendLine("</tr>")
  54.     sb.AppendLine("<tr>")
  55.     sb.AppendLine("<td>Table 2 - Row 3 - Column 1</td>")
  56.     sb.AppendLine("<td>Table 2 - Row 3 - Column 2</td>")
  57.     sb.AppendLine("<td>Table 2 - Row 3 - Column 3</td>")
  58.     sb.AppendLine("</td>")
  59.     sb.AppendLine("</tr>")
  60.     sb.AppendLine("</table>")
  61.  
  62.     ' Close the HTML elements  
  63.     sb.AppendLine("</body>")
  64.     sb.AppendLine("</html>")
  65.  
  66.     Return sb.ToString
  67.  
  68. End Function


Data Extraction

Whichever method we use to retrieve this HTML, we then need to be able to extract the relevant table elements. I decided to use a Regular Expression to do this (adding some options in to make sure that the case and any line breaks were ignored), specifically this one which targets the beginning and end
tags:
  1. <table[^>]*>(.*?)</table>

This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had tags, and some didn't, I decided to include a check in the function to see if they did exist. If they did, I would use the text inside these tags for the column names in my DataTable; if they didn't exist, I would simply create a default naming scheme (e.g. Column1, Column2 etc).


Logic

The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:

 1. Retrieve each instance of the table elements on the page.
 2. Loop through each table, performing the following checks.
 3. Check for the existence of <th> tags to determine if we know the names of the columns, otherwise just add a default name for each column.
 4. Loop through the rows of the table and for each column, add the value to our column in the DataTable.


Implementation

Recreating these steps into a .NET function, I came up with this function named "ConvertHTMLTablesToDataSet" which accepts the full HTML string, performs the actions we identified above and then returns a DataSet with a corresponding DataTable for each HTML table that was found:

  1. Private Function ConvertHTMLTablesToDataSet(ByVal HTML As String) As DataSet  
  2.     ' Declarations  
  3.     Dim ds As New DataSet
  4.     Dim dt As DataTable
  5.     Dim dr As DataRow
  6.     Dim dc As DataColumn
  7.     Dim TableExpression As String = "<table[^>]*>(.*?)</table>"
  8.     Dim HeaderExpression As String = "<th[^>]*>(.*?)</th>"
  9.     Dim RowExpression As String = "<tr[^>]*>(.*?)</tr>"
  10.     Dim ColumnExpression As String = "<td[^>]*>(.*?)</td>"
  11.     Dim HeadersExist As Boolean = False
  12.     Dim iCurrentColumn As Integer = 0
  13.     Dim iCurrentRow As Integer = 0
  14.  
  15.     ' Get a match for all the tables in the HTML  
  16.     Dim Tables As MatchCollection = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
  17.  
  18.     ' Loop through each table element  
  19.     For Each Table As Match In Tables
  20.  
  21.         ' Reset the current row counter and the header flag  
  22.         iCurrentRow = 0
  23.         HeadersExist = False
  24.  
  25.         ' Add a new table to the DataSet  
  26.         dt = New DataTable
  27.  
  28.         ' Create the relevant amount of columns for this table (use the headers if they exist, otherwise use default names)  
  29.         If Table.Value.Contains("<th") Then
  30.             ' Set the HeadersExist flag  
  31.             HeadersExist = True
  32.  
  33.             ' Get a match for all the rows in the table  
  34.             Dim Headers As MatchCollection = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
  35.  
  36.             ' Loop through each header element  
  37.             For Each Header As Match In Headers
  38.                 dt.Columns.Add(Header.Groups(1).ToString)
  39.             Next
  40.         Else
  41.             For iColumns As Integer = 1 To Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Count
  42.                 dt.Columns.Add("Column " & iColumns)
  43.             Next
  44.         End If
  45.  
  46.         ' Get a match for all the rows in the table  
  47.         Dim Rows As MatchCollection = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
  48.  
  49.         ' Loop through each row element  
  50.         For Each Row As Match In Rows
  51.  
  52.             ' Only loop through the row if it isn't a header row  
  53.             If Not (iCurrentRow = 0 And HeadersExist = True) Then
  54.  
  55.                 ' Create a new row and reset the current column counter  
  56.                 dr = dt.NewRow
  57.                 iCurrentColumn = 0
  58.  
  59.                 ' Get a match for all the columns in the row  
  60.                 Dim Columns As MatchCollection = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
  61.  
  62.                 ' Loop through each column element  
  63.                 For Each Column As Match In Columns
  64.  
  65.                     ' Add the value to the DataRow  
  66.                     dr(iCurrentColumn) = Column.Groups(1).ToString
  67.  
  68.                     ' Increase the current column  
  69.                     iCurrentColumn += 1
  70.                 Next
  71.  
  72.                 ' Add the DataRow to the DataTable  
  73.                 dt.Rows.Add(dr)
  74.  
  75.             End If
  76.  
  77.             ' Increase the current row counter  
  78.             iCurrentRow += 1
  79.         Next
  80.  
  81.         ' Add the DataTable to the DataSet  
  82.         ds.Tables.Add(dt)
  83.  
  84.     Next
  85.  
  86.     Return (ds)
  87.  
  88. End Function


Viewing the results

If you want to test this function, you can create a simple .aspx page with a Panel on it:

  1. <%@ Page Language="VB" AutoEventWireup="false" CodeFile="Default1.aspx.vb" Inherits="Default1" %>  
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">  
  3.  
  4. <html xmlns="http://www.w3.org/1999/xhtml" >  
  5. <head runat="server">  
  6.     <title>Untitled Page</title>  
  7. </head>  
  8. <body>  
  9.     <form id="form1" runat="server">  
  10.     <div>  
  11.         <asp:Panel ID="pnlGridViews" runat="server">  
  12.         </asp:Panel>  
  13.     </div>  
  14.     </form>  
  15. </body>  
  16. </html>

And then create some dynamic GridView's for each DataTable e.g.

  1. Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
  2.  
  3.     ' Get some HTML (you can replace this test data with your real HTML)  
  4.     Dim HTML As String = GetHTML()
  5.     Dim Results As DataSet = ConvertHTMLTablesToDataSet(HTML)
  6.     Dim Grid As GridView
  7.  
  8.     ' Create a GridView for each DataTable  
  9.     For Each Data As DataTable In Results.Tables
  10.         .Grid = New GridView
  11.         .Grid.DataSource = Data
  12.         .Grid.DataBind()
  13.         .pnlGridViews.Controls.Add(Grid)
  14.         .Next()
  15.  
  16. End Sub

You may also need to include the following Import statements on your page:

  1. Imports System.Data
  2. Imports System.Text.RegularExpressions
  3. Imports System.Net
  4. Imports System.IO

When you run this test page in your development environment, if you have used the sample data from the GetHTML function above you should see the following tables:

HTMLTableToDataSet.JPG

Considerations and Improvements

You may want to entend the functionality of this approach. For example, the function assumes that the HTML that is retrieved will be valid and in the correct format. I was lucky in the sense that I knew exactly what would be included in the HTML before writing the function, however, if you are retrieving data from an external site this may not always be the case so you may want to build in your own validity check and associated error handling.


This Hack is part of the ASP.NET Hacks collection

469 Rating: 3.8/5 (12 votes cast)