Screen Scraping – When All You Have Is A Hammer…

I had decided to create a list of the videos already available on the Learning pages of Silverlight.net. When I clicked through to the page for the entire list, however, I was daunted by the sheer number. I opened the page source and found that it lent itself to easy screen scraping: the name of each video was also a link to its landing page, so I could grab the HTML and search for the appropriate links.

Using A Hand Grenade To Catch A Mouse

I’m happy to agree that Silverlight probably isn’t the first development platform to come to mind for implementing this little utility (and it probably shouldn’t be), but the more I thought about how I’d do it, the more I realized that, to a first approximation, the out-of-browser capabilities of Silverlight make it at least a non-insane alternative. And thus was born a mini-tutorial on using Silverlight for creating desktop utilities.

Getting the HTML was a snap; I just opened the page source in IE and saved the file with a known name to a known location. I will leave it as an exercise for the reader (or a future tutorial) to explore how one might grab the HTML directly from the ‘Net.
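For readers who want to try that exercise, here is a minimal sketch of one way it might look. It assumes the application is running out of browser with elevated trust, which in Silverlight 4 relaxes the cross-domain policy check on network requests. `HtmlFetcher`, `Fetch`, and the two callback parameters are hypothetical names, not part of the original program; `WebClient.DownloadStringAsync` and its `DownloadStringCompleted` event are the standard asynchronous download calls.

```csharp
using System;
using System.Net;

public static class HtmlFetcher
{
    // Downloads the page at 'address' and hands either the HTML or the
    // error to the matching callback. Both callbacks are assumptions of
    // this sketch; the original program reads a saved file instead.
    public static void Fetch(Uri address, Action<string> onHtml, Action<Exception> onError)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            if (e.Error != null)
            {
                // Reading e.Result after a failure would throw, so check Error first.
                onError(e.Error);
            }
            else if (!e.Cancelled)
            {
                onHtml(e.Result);
            }
        };
        client.DownloadStringAsync(address);
    }
}
```

The address to pass in is whatever page you would otherwise have saved by hand, and the `onHtml` callback could be the same parsing method used later in this tutorial.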

The first real step, then, was to create a workable UI. This too was a snap: I wrote this program the way I write most, by looking for an existing program that I could steal from. Usually I steal from myself (that is, from earlier work by the human construct that, to the best of my ability to discern, I share a common set of memories and a clear world-line with, in this particular fork of the multi-verse).

This time, however, I stole from Tim Heuer. As often happens, not much of the original code survived, but that didn’t make it any less a valuable starting point.

Parsing A Local File

The key techniques to making this work are:

  • Ensure that the application is installed locally
  • Ensure you can make changes and debug without having to uninstall and reinstall
  • Require elevated permissions (for file access)

After that, it is just a question of using Regex to parse the HTML.  Let’s walk through it, briefly.

The Xaml

To recreate this, open a new Silverlight 4 Application and add three rows to the Grid. The top row will announce the name of the program and will display a button to install the program or, if the program is installed, to parse the file. The second row will display the original HTML, and the third row will display the links we’re looking for.

<UserControl
   x:Class="ScreenScraper.MainPage"
   xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
   xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
   xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
   xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
   mc:Ignorable="d">

   <Grid
      x:Name="LayoutRoot"
      Height="390"
      Width="550">
      <Grid.RowDefinitions>
         <RowDefinition
            Height="2*" />
         <RowDefinition
            Height="1*" />
         <RowDefinition
            Height="1*" />
      </Grid.RowDefinitions>
      <StackPanel
         HorizontalAlignment="Center"
         VerticalAlignment="Center"
         Grid.Row="0">
         <TextBlock
            Foreground="Black"
            Text="Trusted Application - Screen Scraper"
            FontSize="20"
            HorizontalAlignment="Center" />
         <Button
            x:Name="Parse"
            Content="Parse File"
            FontSize="20"
            Height="Auto"
            Width="Auto"
            HorizontalAlignment="Center" />
         <TextBlock
            x:Name="Warning"
            Text="This application must be run out of browser"
            TextWrapping="Wrap"
            TextAlignment="Center"
            FontSize="14"
            Foreground="Maroon" />
         <Button
            x:Name="InstallButton"
            Content="Install"
            Foreground="Maroon"
            FontSize="20"
            Width="Auto"
            Height="Auto"
            HorizontalAlignment="Center" />
      </StackPanel>

      <RichTextBox
         x:Name="FileContents"
         Grid.Row="1"
         FontSize="10"
         VerticalAlignment="Stretch"
         HorizontalAlignment="Stretch"
         TextWrapping="Wrap"
         VerticalScrollBarVisibility="Auto"
         IsReadOnly="True"
         Margin="10" />
      <ListBox
         x:Name="Output"
         Grid.Row="2"
         FontSize="10"
         VerticalAlignment="Stretch"
         HorizontalAlignment="Stretch"
         Margin="10" />
   </Grid>
</UserControl>

Key to making this look right is to set the visibility of the appropriate text and buttons in the first row. To ensure that there is no ambiguity, I begin by setting them all to collapsed. The top of MainPage.xaml.cs looks like this:

using System;
using System.Text.RegularExpressions;
using System.Windows;
using System.Windows.Controls;
using System.IO;
using System.Windows.Documents;

namespace ScreenScraper
{
    public partial class MainPage : UserControl
    {
        public MainPage()
        {
           InitializeComponent();
           Loaded += new RoutedEventHandler(MainPage_Loaded);
        }

        void MainPage_Loaded(object sender, RoutedEventArgs e)
        {
           Warning.Visibility = System.Windows.Visibility.Collapsed;
           InstallButton.Visibility = Visibility.Collapsed;
           Parse.Visibility = System.Windows.Visibility.Visible;
           FileContents.Visibility = System.Windows.Visibility.Collapsed;
           Output.Visibility = System.Windows.Visibility.Collapsed;

Event handlers must be set up for the two buttons:

           Parse.Click += new RoutedEventHandler( Parse_Click );
           InstallButton.Click += new RoutedEventHandler(InstallButton_Click);

The final task for the loaded event handler is to determine if the program has been installed yet (and if not, to make the Install button visible) and to make sure the program is actually running “out of browser.”

            if (App.Current.InstallState != InstallState.Installed)
            {
               Parse.Visibility = System.Windows.Visibility.Collapsed;
               InstallButton.Visibility = Visibility.Visible;
               Warning.Visibility = System.Windows.Visibility.Visible;
               Warning.Text = "Please install to run...";
            }
            else if (! App.Current.IsRunningOutOfBrowser)
            {
               Parse.Visibility = System.Windows.Visibility.Collapsed;
               Warning.Visibility = System.Windows.Visibility.Visible;
               Warning.Text = "This application must be run out of browser.";
            }
        }
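The handler checks the first two conditions from the list of key techniques, but not the third: elevated permissions. As a sketch, the decision could be factored into a small function that also takes `Application.HasElevatedPermissions` into account, so that a missing trust setting shows up as a clear message rather than a security exception on file access. `StartupState` and `StartupCheck` are hypothetical names introduced for illustration.

```csharp
using System;

// Hypothetical names; the original code sets Visibility inline instead.
public enum StartupState
{
    NeedsInstall,
    NeedsOutOfBrowser,
    NeedsElevatedTrust,
    Ready
}

public static class StartupCheck
{
    // Mirrors the order of the checks in MainPage_Loaded and adds a
    // third one for elevated permissions.
    public static StartupState Evaluate(
        bool isInstalled, bool isOutOfBrowser, bool hasElevatedPermissions)
    {
        if (!isInstalled) return StartupState.NeedsInstall;
        if (!isOutOfBrowser) return StartupState.NeedsOutOfBrowser;
        if (!hasElevatedPermissions) return StartupState.NeedsElevatedTrust;
        return StartupState.Ready;
    }
}

// Inside MainPage_Loaded the call might look like this (Silverlight only):
//
//   var state = StartupCheck.Evaluate(
//       App.Current.InstallState == InstallState.Installed,
//       App.Current.IsRunningOutOfBrowser,
//       App.Current.HasElevatedPermissions );
```

Only the `Ready` state would leave the Parse File button visible; the other three map onto the warning text.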

The event handler for the Install button does nothing more than instruct the application to install itself; all the heavy lifting is done by Silverlight:

private void InstallButton_Click(object sender, RoutedEventArgs e)
{
    App.Current.Install();
}

The Parse button’s event handler, however, must check whether the saved HTML file exists and, if so, create a file reader to obtain the HTML as a string.

void Parse_Click( object sender, RoutedEventArgs e )
{
   string filePath = System.IO.Path.Combine(
      Environment.GetFolderPath(
         Environment.SpecialFolder.MyDocuments ), "page.htm" );

   if ( File.Exists( filePath ) )
   {
      StreamReader fileReader = File.OpenText( filePath );
      string contents = fileReader.ReadToEnd();

Once we have the string, we want to display it in the Rich Text Box. The newly updated Rich Text Box has a Blocks property, which is typically a collection of Paragraphs. Paragraphs, in turn, are collections of Inlines, and Inline is an abstract class from which are derived, among other things, Runs and Spans. A Run, you’ll be happy to know, has a Text property.

var para = new Paragraph();
var text = new Run();
text.Text = contents;
para.Inlines.Add( text );
FileContents.Blocks.Add( para );
fileReader.Close();

The code shown would be fine, except that we can’t be certain that the contents string isn’t null (or empty for that matter) and, now that we have the string, we need to parse out the part we want. Let’s modify the code above to take this into account and to factor out the parsing into its own method:

void Parse_Click( object sender, RoutedEventArgs e )
{
   string filePath = System.IO.Path.Combine(
      Environment.GetFolderPath(
        Environment.SpecialFolder.MyDocuments ), "page.htm" );

   if ( File.Exists( filePath ) )
   {
      StreamReader fileReader = File.OpenText( filePath );
      string contents = fileReader.ReadToEnd();
      var para = new Paragraph();
      var text = new Run();
      if ( string.IsNullOrEmpty( contents ) )
         MessageBox.Show( "No contents found!" );
      else
      {
         text.Text = contents;
         para.Inlines.Add( text );
         FileContents.Blocks.Add( para );
         FileContents.Visibility = System.Windows.Visibility.Visible;
         ParseString( contents );
      }
      fileReader.Close();
   }
   else
   {
      MessageBox.Show( "File page.htm not found." );
   }
}

All that remains, then, is to write the ParseString method, which will create an instance of Regex initialized with the regular expression we’ll use. This is just the invariant bit of text that comes before each file name, followed by any number of characters terminated by the closing quote and the closing angle bracket:

a href=\"/learn/videos/all/.*\">

We can then iterate through the collection of matches and display each match and its position in the original HTML string:

private void ParseString( string contents )
{
   // The invariant prefix, then any characters up to the closing quote and bracket
   var rx = new Regex("a href=\"/learn/videos/all/.*\">");
   var matches = rx.Matches(contents);
   if ( matches.Count > 0 )
   {
      Output.Visibility = System.Windows.Visibility.Visible;
      foreach (Match match in matches)
      {
         Output.Items.Add(
           match.Value
           + " at position "
           + match.Index );
      }
   }
}
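One caveat worth knowing: `.*` is greedy, and `.` matches everything except a newline, so if two links ever share a line the pattern swallows both in a single match. The sketch below compares the tutorial's pattern with a tightened variant that uses `[^"]*` to stop at the first closing quote. The variant is an assumption of this sketch rather than something the original program uses, and `LinkPatterns` and its members are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkPatterns
{
    // The pattern from the tutorial: greedy, so it runs to the LAST "> on a line.
    public const string Greedy = "a href=\"/learn/videos/all/.*\">";

    // Assumed alternative: [^"]* cannot cross a quote, so it stops at the first one.
    public const string Tight = "a href=\"/learn/videos/all/[^\"]*\">";

    // Made-up sample input (not the real page): two links on one line.
    public const string SampleLine =
        "<a href=\"/learn/videos/all/intro\">Intro</a> " +
        "<a href=\"/learn/videos/all/binding\">Binding</a>";

    public static int CountMatches(string pattern, string html)
    {
        return new Regex(pattern).Matches(html).Count;
    }
}
```

Against `SampleLine`, `CountMatches(LinkPatterns.Greedy, ...)` finds one match spanning both links, while `CountMatches(LinkPatterns.Tight, ...)` finds two. When every link sits on its own line, the two patterns behave the same.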

Running this in debug mode causes the Install button to be shown.

[Screenshot: the application showing the Install button]

Clicking the install button brings up the security check, where the user can decide where to put the shortcuts to the newly installed application:

[Screenshot: the install security dialog]

Clicking Install here installs the application and launches the (now) out-of-browser application, which now displays the Parse File button:

[Screenshot: the installed application showing the Parse File button]

Clicking that button runs the logic to read the file on the client machine and to parse the results.

[Screenshot: the application after parsing the file]

A quick look at the Start menu reveals, sure enough, that our Silverlight Application is installed and ready to be run out of browser:

[Screenshot: the application’s entry in the Start menu]

Uninstall and Re-install?

Having written this, it occurs to us that we can do a better job with the output by breaking the regular expression into three groups (using parentheses):

var rx = new Regex("(a href=\"/learn/videos/all/)(.*)(\">)");

By doing so, we can extract just the name of the video by taking the entry at offset 2 in the Groups collection of the Match.

Output.Items.Add(match.Groups[2].Value);
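Putting the grouped pattern to work, a helper along these lines would return just the names. This is a sketch meant for input with one link per line; `VideoLinks` and `ExtractNames` are hypothetical names, and `Groups[2].Value` is used so the list holds strings rather than `Group` objects.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VideoLinks
{
    // Group 1: the invariant prefix. Group 2: the video name.
    // Group 3: the closing quote and angle bracket.
    private static readonly Regex Rx =
        new Regex("(a href=\"/learn/videos/all/)(.*)(\">)");

    // Returns the text captured by group 2 for every link found in 'html'.
    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match match in Rx.Matches(html))
        {
            names.Add(match.Groups[2].Value);
        }
        return names;
    }
}
```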

This is an improvement, but how do we get the improved code to run? The application will only run if it is installed, and now that it is installed, we won’t get the button offering to install it!  Is the only answer to open the Control Panel for Programs and Features and uninstall?

While that will work, there is a much better way, one that will also let us step through the program even if it is installed. Here’s how:

  • Re-open the Silverlight project’s properties (either right-click the ScreenScraper project and choose Properties from the context menu, or select the project and press Alt+Enter).
  • Click the Debug tab and click the radio button for Out of Browser application (the ScreenScraper.Web application will show in the drop down)
  • Right-click on the ScreenScraper project (not the .web project) and set it as the startup project.
  • Press F5 to debug.

You can now run and debug your “out of browser, installed only” application just like any other Silverlight application.

About Jesse Liberty

Jesse Liberty has three decades of experience writing and delivering software projects and is the author of 2 dozen books and a couple dozen online courses. His latest book, Building APIs with .NET will be released early in 2025. Liberty is a Senior SW Engineer for CNH and he was a Senior Technical Evangelist for Microsoft, a Distinguished Software Engineer for AT&T, a VP for Information Services for Citibank and a Software Architect for PBS. He is a Microsoft MVP.

14 Responses to Screen Scraping – When All You Have Is A Hammer…

  1. Right here is the perfect web site for anybody who would like to find
    out about this topic. You understand a whole lot
    its almost hard to argue with you (not that I actually would want to…HaHa).
    You definitely put a new spin on a subject that’s been discussed for years. Wonderful stuff, just wonderful!

    Here is my website … social media bo5s

  2. Hi dude,
    how are you,
    i have came to know about yr solution
    by hitting a hammer or throwing a hand granade is a best solutin i ve ever find on the web

    You are amazing muuuuuuuuuuuhhhhhhhhh

    thanks again for this

  3. Woops look like I meant to reply to Mike Apken, at the bottom of this post 😉 @silverlightversion.com

  4. Mike, yes that works just fine, and is required for many client side browser apps which need to work with web content but don’t have permissions to directly access it. The biggest problem turns into scalability though.

  5. Ben Hayat says:

    Jesse, I created a simple page.htm and no matter where I put (desktop, MyDocuments or favorites or etc) I get error that File operation not permitted. Here is the error. What could be the problem?

    —————
    System.Security.SecurityException was unhandled by user code

    Message=File operation not permitted. Access to path ” is denied.

    StackTrace:

    at System.IO.FileSecurityState.EnsureState()

    at System.Environment.InternalGetFolderPath(SpecialFolder folder, SpecialFolderOption option, Boolean checkHost)

    at System.Environment.GetFolderPath(SpecialFolder folder)

    at OOBScreenScraper.MainPage.Parse_Click(Object sender, RoutedEventArgs e)

    at System.Windows.Controls.Primitives.ButtonBase.OnClick()

    at System.Windows.Controls.Button.OnClick()

    at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(MouseButtonEventArgs e)

    • Ben, this is one of those too many interlocking factors, combinatorial explosion problems. I’d strip down to the very basics, tearing away until I get it to work, and then build back up. Consider things like attached drives not having full trust, not being in or out of browser or elevated priv’s, etc. If you get fully stuck, create the smallest possible program to illustrate the problem, post it to the forums and if you don’t get an answer in a few days, please send me a follow up email. Thanks!

  6. @Patrick Long

    The HTML agility pack is great but I wanted to stay 100% within Silverlight for this little exercise. I’m of two minds about my ambivalence, however; as showing folks how to use Regular Expressions for parsing HTML may *not* be a great idea.

    On the other hand, some of the total certainty I see in the Stackoverflow discussion gives me the willies.

  7. @David V. Corbin
    I’m pretty sure it is ethical to create a small application without using MVVM. But, since you brought it up, I think I’ll refactor it as an MVVM app and repost that as well 🙂

  8. David V. Corbin says:

    EEK…Click event handlers…where is the use of MVVM and a good Command Pattern?

    Shame, Shame Shame…..

  9. Vito P. Jokubaitis says:

    @Patrick Long

    Hear, hear with regards to the HTML Agility Pack. It, coupled with LINQ-to-XML, allowed me to make short work of a screen-scraping exercise that I faced a while back. If “parse HTML” is anywhere in the requirements, take the time to consider this valuable tool.

  10. denny says:

    something i have used for years is iMacros from this site:

    http://www.iopus.com/

    i call it from .net and it’s not hard at all to automate….

  11. Patrick Long says:

    You might want to consider the HTML agility pack whenever you want to parse HTML. It allows you to use XPath on the document and gives you a DOM tree to work with. It is excellent.

    There are a few on stackoverflow.com who go crazy when they hear the words “parse”, “HTML” and “RegEx” in the same sentence.

    HTML Agility Pack Download
    http://htmlagilitypack.codeplex.com/

    HTML Agility Pack – How To
    http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack

    Stackoverflow discussion on parsing HTML with RegEx 🙂
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

  12. Mike Apken says:

    It is possible to create a web service that accepts an url as a parameter. The web service then opens the page reads the html and returns it as a string?

    If I understand the web service correctly, we do not have the cross-domain restriction from the web service.

    Been having this one on my to-do list for a couple of weeks now. Still not exactly sure how to open the url and do we have access to the inner text of the new frame?
