I had decided to create a list of the videos already available on the Learning Pages of Silverlight.net. When I opened the page that lists them all, however, I was daunted by the sheer number. A look at the page source showed that screen scraping would be easy: the name of each video is also a link to its landing page, so I could grab the HTML and search for the appropriate links.
Using A Hand Grenade To Catch A Mouse
I’m happy to agree that Silverlight probably isn’t the first development platform to come to mind for implementing this little utility (and it probably shouldn’t be), but the more I thought about how I’d do it, the more I realized that, at least to a first approximation, the out-of-browser capabilities of Silverlight make it a non-insane alternative. And thus was born a mini-tutorial on using Silverlight to create desktop utilities.
Getting the HTML was a snap; I just viewed the page source in IE and saved the file with a known name (page.htm) to a known location (My Documents). I will leave it as an exercise for the reader (or a future tutorial) to explore how one might grab the HTML directly from the ‘Net.
The first real step, then, was to create a workable UI. This too was a snap. I wrote this program the way I write most: I looked for some existing program that I could steal from. Usually I steal from myself (that is, from earlier work by the human construct that, to the best of my ability to discern, I share a common set of memories and a clear world-line with, in this particular fork of the multiverse).
This time, however, I stole from Tim Heuer. As often happens, not much of the original code survived, but that didn’t make it any less a valuable starting point.
Parsing A Local File
The key techniques to making this work are:
- Ensure that the application is installed locally
- Ensure you can make changes and debug without having to uninstall and reinstall
- Require elevated permissions (for file access)
After that, it is just a question of using Regex to parse the HTML. Let’s walk through it, briefly.
The Xaml
To recreate this, open a new Silverlight 4 Application and add three rows to the Grid. The top row announces the name of the program and displays a button to install the program or, if the program is already installed, to parse the file. The second row displays the original HTML, and the third row displays the links we’re looking for.
<UserControl x:Class="ScreenScraper.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d">
    <Grid x:Name="LayoutRoot" Height="390" Width="550">
        <Grid.RowDefinitions>
            <RowDefinition Height="2*" />
            <RowDefinition Height="1*" />
            <RowDefinition Height="1*" />
        </Grid.RowDefinitions>
        <StackPanel HorizontalAlignment="Center" VerticalAlignment="Center" Grid.Row="0">
            <TextBlock Foreground="Black"
                       Text="Trusted Application - Screen Scraper"
                       FontSize="20"
                       HorizontalAlignment="Center" />
            <Button x:Name="Parse"
                    Content="Parse File"
                    FontSize="20"
                    Height="Auto" Width="Auto"
                    HorizontalAlignment="Center" />
            <TextBlock x:Name="Warning"
                       Text="This application must be run out of browser"
                       TextWrapping="Wrap"
                       TextAlignment="Center"
                       FontSize="14"
                       Foreground="Maroon" />
            <Button x:Name="InstallButton"
                    Content="Install"
                    Foreground="Maroon"
                    FontSize="20"
                    Width="Auto" Height="Auto"
                    HorizontalAlignment="Center" />
        </StackPanel>
        <RichTextBox x:Name="FileContents"
                     Grid.Row="1"
                     FontSize="10"
                     VerticalAlignment="Stretch"
                     HorizontalAlignment="Stretch"
                     TextWrapping="Wrap"
                     VerticalScrollBarVisibility="Auto"
                     IsReadOnly="True"
                     Margin="10" />
        <ListBox x:Name="Output"
                 Grid.Row="2"
                 FontSize="10"
                 VerticalAlignment="Stretch"
                 HorizontalAlignment="Stretch"
                 Margin="10" />
    </Grid>
</UserControl>
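Key to making this look right is to set the visibility of the appropriate text and buttons in the first row. To ensure that there is no ambiguity, I begin by collapsing everything except the Parse button; the checks later in the handler adjust this if the application is not installed or is running in the browser. The top of MainPage.xaml.cs looks like this: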
using System;
using System.Text.RegularExpressions;
using System.Windows;
using System.Windows.Controls;
using System.IO;
using System.Windows.Documents;

namespace ScreenScraper
{
    public partial class MainPage : UserControl
    {
        public MainPage()
        {
            InitializeComponent();
            Loaded += new RoutedEventHandler(MainPage_Loaded);
        }

        void MainPage_Loaded(object sender, RoutedEventArgs e)
        {
            Warning.Visibility = Visibility.Collapsed;
            InstallButton.Visibility = Visibility.Collapsed;
            Parse.Visibility = Visibility.Visible;
            FileContents.Visibility = Visibility.Collapsed;
            Output.Visibility = Visibility.Collapsed;
Event handlers must be set up for the two buttons:
            Parse.Click += new RoutedEventHandler(Parse_Click);
            InstallButton.Click += new RoutedEventHandler(InstallButton_Click);
The final task for the loaded event handler is to determine if the program has been installed yet (and if not, to make the Install button visible) and to make sure the program is actually running “out of browser.”
            if (App.Current.InstallState != InstallState.Installed)
            {
                Parse.Visibility = Visibility.Collapsed;
                InstallButton.Visibility = Visibility.Visible;
                Warning.Visibility = Visibility.Visible;
                Warning.Text = "Please install to run...";
            }
            else if (!App.Current.IsRunningOutOfBrowser)
            {
                Parse.Visibility = Visibility.Collapsed;
                Warning.Visibility = Visibility.Visible;
                Warning.Text = "This application must be run out of browser.";
            }
        }
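The decision made in the loaded handler can also be read as a small pure function, which makes the three possible start-up states easier to see. What follows is only a sketch: the `StartupView` enum and the `StartupLogic.Decide` method are hypothetical names invented for illustration, and the two booleans stand in for `App.Current.InstallState == InstallState.Installed` and `App.Current.IsRunningOutOfBrowser`.

```csharp
using System;

// Hypothetical names, introduced only for this illustration.
public enum StartupView
{
    ShowInstall,              // not installed yet: offer the Install button
    ShowOutOfBrowserWarning,  // installed, but currently running in the browser
    ShowParse                 // installed and running out of browser
}

public static class StartupLogic
{
    // Mirrors the if / else-if chain in MainPage_Loaded.
    public static StartupView Decide(bool isInstalled, bool isRunningOutOfBrowser)
    {
        if (!isInstalled)
        {
            return StartupView.ShowInstall;
        }
        if (!isRunningOutOfBrowser)
        {
            return StartupView.ShowOutOfBrowserWarning;
        }
        return StartupView.ShowParse;
    }

    public static void Main()
    {
        // Installed, but opened in the browser.
        Console.WriteLine(Decide(true, false));
    }
}
```

Keeping the decision separate from the controls is a design choice of this sketch, not something the original program does; it simply makes the logic testable without a running Silverlight host.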
The event handler for the Install button does nothing more than instruct the application to install itself; all the heavy lifting is done by Silverlight:
        private void InstallButton_Click(object sender, RoutedEventArgs e)
        {
            App.Current.Install();
        }
The Parse button’s event handler, however, must see if the saved html file exists, and then, if so, create a file reader to obtain the html as a string.
        void Parse_Click( object sender, RoutedEventArgs e )
        {
            string filePath = System.IO.Path.Combine(
                Environment.GetFolderPath( Environment.SpecialFolder.MyDocuments ),
                "page.htm" );
            if ( File.Exists( filePath ) )
            {
                StreamReader fileReader = File.OpenText( filePath );
                string contents = fileReader.ReadToEnd();
Once we have the string, we want to display it in the Rich Text Box. The newly updated Rich Text Box has a Blocks property, which is typically a collection of Paragraphs. Paragraphs in turn are collections of Inlines. Inline is an abstract class from which Run and Span, among others, are derived. A Run, you’ll be happy to know, has a Text property.
                var para = new Paragraph();
                var text = new Run();
                text.Text = contents;
                para.Inlines.Add( text );
                FileContents.Blocks.Add( para );
                fileReader.Close();
The code shown would be fine, except that we can’t be certain that the contents string isn’t null (or empty for that matter) and, now that we have the string, we need to parse out the part we want. Let’s modify the code above to take this into account and to factor out the parsing into its own method:
        void Parse_Click( object sender, RoutedEventArgs e )
        {
            string filePath = System.IO.Path.Combine(
                Environment.GetFolderPath( Environment.SpecialFolder.MyDocuments ),
                "page.htm" );
            if ( File.Exists( filePath ) )
            {
                StreamReader fileReader = File.OpenText( filePath );
                string contents = fileReader.ReadToEnd();
                var para = new Paragraph();
                var text = new Run();
                if ( string.IsNullOrEmpty( contents ) )
                    MessageBox.Show( "No contents found!" );
                else
                {
                    text.Text = contents;
                    para.Inlines.Add( text );
                    FileContents.Blocks.Add( para );
                    FileContents.Visibility = System.Windows.Visibility.Visible;
                    ParseString( contents );
                }
                fileReader.Close();
            }
            else
            {
                MessageBox.Show( "File page.htm not found." );
            }
        }
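The file-reading part of the handler can be pulled out on its own. Here is a minimal sketch, assuming it is acceptable to report a missing file as null; `PageLoader` and `ReadAllTextOrNull` are hypothetical names, not part of the original program. It wraps the reader in a `using` block so the file is closed even if `ReadToEnd` throws, which the explicit `Close()` call above would not guarantee.

```csharp
using System;
using System.IO;

// Hypothetical helper, introduced only for this illustration.
public static class PageLoader
{
    // Returns the file's text, or null when the file does not exist.
    public static string ReadAllTextOrNull(string filePath)
    {
        if (!File.Exists(filePath))
        {
            return null;
        }

        // The using block disposes (and so closes) the reader on every path.
        using (StreamReader reader = File.OpenText(filePath))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Same location and file name the article uses.
        string filePath = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
            "page.htm");

        string contents = ReadAllTextOrNull(filePath);
        if (string.IsNullOrEmpty(contents))
        {
            Console.WriteLine("No contents found.");
        }
        else
        {
            Console.WriteLine(contents.Length + " characters read.");
        }
    }
}
```

In the Silverlight application itself, the same calls only succeed when the application runs out of browser with elevated permissions, as noted in the list of key techniques.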
All that remains, then, is to write the ParseString method, which creates an instance of Regex initialized with the regular expression we’ll use. The pattern is just the invariant bit of text that comes before each file name, followed by any number of characters, terminated by the closing quote and the closing angle bracket:
a href=\"/learn/videos/all/.*\">
We can then iterate through the collection of matches and display each match and its position in the original HTML string:
        private void ParseString( string contents )
        {
            var rx = new Regex( "a href=\"/learn/videos/all/.*\">" );
            var matches = rx.Matches( contents );
            if ( matches.Count > 0 )
            {
                Output.Visibility = System.Windows.Visibility.Visible;
                foreach ( Match match in matches )
                {
                    Output.Items.Add( match.Value + " at position " + match.Index );
                }
            }
        }
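To try the pattern without the Silverlight UI, the matching loop can be run as a small console program. This is a sketch only: `LinkFinder`, `Describe`, and the sample HTML (including the video names in it) are made up for illustration and are not taken from the real Silverlight.net page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for this illustration.
public static class LinkFinder
{
    // The pattern described in the article: the fixed prefix, any characters,
    // then the closing quote and closing angle bracket.
    private static readonly Regex VideoLink =
        new Regex("a href=\"/learn/videos/all/.*\">");

    // Returns one "match at position N" string per match, as ParseString displays them.
    public static List<string> Describe(string html)
    {
        var results = new List<string>();
        foreach (Match match in VideoLink.Matches(html))
        {
            results.Add(match.Value + " at position " + match.Index);
        }
        return results;
    }

    public static void Main()
    {
        // Invented sample markup with one link per line.
        string sample =
            "<li><a href=\"/learn/videos/all/intro-to-xaml\">Intro</a></li>\n" +
            "<li><a href=\"/learn/videos/all/data-binding\">Binding</a></li>";

        foreach (string line in Describe(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because `.` does not match a newline by default, each match stays on its own line of the HTML, which is why the sample places one link per line.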
Running this in debug mode causes the Install button to be shown.
Clicking the Install button brings up the security check, where the user can decide where to put the shortcuts to the newly installed application.
Clicking Install here installs the application and launches the (now) out-of-browser application, which displays the Parse File button.
Clicking that button runs the logic to read the file on the client machine and to parse the results.
A quick look at the Start menu reveals, sure enough, that our Silverlight application is installed and ready to be run out of browser.
Uninstall and Re-install?
Having written this, it occurs to us that we can do a better job with the output by breaking the regular expression into three groups (using parentheses):
var rx = new Regex("(a href=\"/learn/videos/all/)(.*)(\">)");
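By doing so, we can extract just the name of the video by taking the entry at offset 2 in the Groups collection of the Match (offset 0 holds the entire match, and offsets 1 through 3 hold the three parenthesized groups in order):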
Output.Items.Add(match.Groups[2]);
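One caveat is worth illustrating. The `.*` in the middle group is greedy, so if two links happen to share a line, the capture runs to the last `">` on that line. The sketch below swaps in `[^"]*` for the middle group, which is an assumption of this example rather than the article's pattern, so that the capture stops at the first closing quote. `VideoNames`, `Extract`, and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for this illustration.
public static class VideoNames
{
    // Three groups as in the article, but the middle group uses [^"]* instead
    // of .* (an assumption of this sketch) so it cannot run past a quote.
    private static readonly Regex VideoLink =
        new Regex("(a href=\"/learn/videos/all/)([^\"]*)(\">)");

    // Returns only the text captured by group 2, the video name.
    public static List<string> Extract(string html)
    {
        var names = new List<string>();
        foreach (Match match in VideoLink.Matches(html))
        {
            names.Add(match.Groups[2].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Invented sample: two links on the same line.
        string sample =
            "<a href=\"/learn/videos/all/first\">A</a> " +
            "<a href=\"/learn/videos/all/second\">B</a>";

        foreach (string name in Extract(sample))
        {
            Console.WriteLine(name);
        }
    }
}
```

If the saved page always puts each link on its own line, the original `.*` gives the same names; the narrower character class only matters when several links share a line.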
This is an improvement, but how do we get the improved code to run? The application will only run if it is installed, and now that it is installed, we won’t get the button offering to install it! Is the only answer to open the Control Panel for Programs and Features and uninstall?
While that will work, there is a much better way, one that will also let us step through the program even if it is installed. Here’s how:
- Re-open the Silverlight project’s properties (either right-click on the ScreenScraper project and choose Properties from the context menu, or select that project and press Alt+Enter).
- Click the Debug tab and click the radio button for Out of Browser application (the ScreenScraper.Web application will show in the drop-down).
- Right-click on the ScreenScraper project (not the .Web project) and set it as the startup project.
- Press F5 to debug.
You can now run and debug your “out of browser, installed only” application just like any other Silverlight application.