With F#, you can create ‘script files’ with the .fsx extension; these run on their own, without a containing project (much like Python scripts). Combined with F# Interactive (fsi), the F# REPL, this gives you a full scripting solution for .NET.
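As a minimal sketch (the file name and function are just illustrations), a script like this runs directly with `dotnet fsi scrape.fsx`, no project file required:

```fsharp
// scrape.fsx -- a self-contained script; no .fsproj needed
let greet name = sprintf "Hello, %s!" name

[ "F#"; "fsi" ]
|> List.map greet
|> List.iter (printfn "%s")
```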
Note: In this article I do not assume any previous knowledge of F#, but knowing some F# would certainly help.
There are three obvious approaches to web scraping with F#:

1. Use existing .NET libraries designed for web scraping in C#
2. Use F# Data (an awesome F# data access library)
3. Use Selenium through Canopy (an F# Selenium wrapper)
There are other methods and libraries (such as Http.fs) but these are less common and you could explore them yourself if you need to.
Option 1: Existing C# Libraries
I will not be discussing option 1 (although F#/C# interop makes it a perfectly valid one) since it doesn’t produce the best-looking F# code; we’ll be trying to use the functional paradigm wherever possible. Nor do you want to write as much code as C# demands; in F#, you usually end up writing about a third of it.
Option 2: F# Data
The second approach is undoubtedly the best. F# Data allows you to define type providers for accessing data and includes HTTP tools as well as an HTML parser.
If you want to parse the contents of an HTML table, you’re in luck, because F# Data provides a type provider for exactly that (amongst other formats). A type provider generates an F# type from the data you’re dealing with at design time. This means F# downloads the webpage for you behind the scenes, and you get live intellisense for the data you’re parsing, so you can dot into the schema of your data seamlessly. A modified example from their website:
```fsharp
// Configure the type provider
type NugetStats = HtmlProvider<"https://www.nuget.org/packages/FSharp.Data">

// load the live package stats; you get intellisense for this
let rawStats = NugetStats().Tables.``Version History``
// note the backticks (``) for writing natural text in F#!
```
As far as I can tell, this feature is exclusive to F#! There are also type providers for JSON and XML, so you can use the same technique on websites that expose an API or an RSS feed.
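As a sketch of the JSON case (the sample document and its fields are made up; `JsonProvider` comes from the FSharp.Data NuGet package, and a live API URL works as the sample too):

```fsharp
#r "nuget: FSharp.Data" // reference the package from within a script
open FSharp.Data

// Infer a strongly-typed schema from a sample JSON document
type Repo = JsonProvider<"""{ "name": "FSharp.Data", "stars": 1000 }""">

// Parse another document of the same shape; Name and Stars are typed properties
let repo = Repo.Parse("""{ "name": "Canopy", "stars": 500 }""")
printfn "%s has %d stars" repo.Name repo.Stars
```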
If you still need to parse non-table HTML, there’s the HTML Parser and the HTML CSS Selectors. These let you walk the DOM tree manually, either through F# methods or through CSS selectors. Combined with F#’s excellent data manipulation features, they make iterating over scraped data swift. Consider this example from one of my web scraping projects:
```fsharp
let parsedList =
    lovelyPageDoc.CssSelect ".portlet.clearfix.reorderableModule td > font" // css selector on a .Parse()d webpage
    |> Seq.map (fun (x: HtmlNode) -> x.InnerText()) // get the text of each element
    |> Seq.skip 26                                  // skip some elements
    |> Seq.pairwise                                 // convert to pairs (in tuples)
    |> Seq.indexed                                  // pair each item with its index
    |> Seq.filter (fun (i, _) -> i % 2 <> 0)        // get rid of every second element
    |> Seq.map snd                                  // take the second element of every pairing
    |> Seq.toList                                   // convert to a list
```
As you can see, manipulating data in F# is really fluid and produces nice-looking code. Once you learn F#, you hardly need any of these comments, since the pipeline becomes self-documenting.
Finally, there are the HTTP Utilities, which let you customise the requests you make (supplying query parameters, editing headers, and so on). Most importantly, in my opinion, they let you specify a cookie container (which is surprisingly easy) to maintain cookies across requests, so you can access content behind a login with little effort.
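A sketch of that pattern (the login URL and form field names here are hypothetical; `Http.RequestString` and its `cookieContainer` parameter are from FSharp.Data):

```fsharp
#r "nuget: FSharp.Data"
open System.Net
open FSharp.Data

// One container shared across requests keeps the session alive
let cookies = CookieContainer()

// Hypothetical login form; the server's session cookie lands in `cookies`
Http.RequestString(
    "https://example.com/login",
    httpMethod = "POST",
    body = FormValues [ "username", "me"; "password", "secret" ],
    cookieContainer = cookies)
|> ignore

// Later requests reuse the same container, so this page sees us as logged in
let memberPage =
    Http.RequestString("https://example.com/members", cookieContainer = cookies)
```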
Option 3: Selenium through Canopy
This is the easiest option (if you consider the previous ones hard, which, by the way, they’re not). Canopy is designed for web UI testing (opening your site, clicking on things, making assertions about elements, and so on) and is built on Selenium. Selenium automates browsers through various Web Drivers; there are drivers for Chrome, IE, Firefox, and so on, and then there’s my favourite, the headless PhantomJS (the closest thing to a purpose-built web scraper). What makes Canopy stand out is its syntax: so easy a five-year-old could understand it.
Show, don’t tell:
```fsharp
// logging in to Pluralsight and downloading the homepage
start phantomJS // start the web driver; could've been "start chrome"
url "https://app.pluralsight.com/library/" // go to URL
"#Username" << "my username" // set the username and password elements
"#Password" << "my pa$$word"
click "#login.button.primary" // click on an element
let mainPage = (element "html").GetAttribute("innerHTML") // get the innerHTML
File.WriteAllText("MainPage.html", mainPage) // write the page to a file
```
You probably don’t want to click through pages manually just for web scraping (after all, that’s not Canopy’s main goal), but it’s nice to know you can. Most of the time, when a response comes back as unparsable minified JS, you can open Chrome’s dev tools and inspect the XHR requests to find an API to consume; if you can’t find one, Canopy will be most helpful.
I hope this has prompted you to explore F#, or at least adopt it as your web scraping language. The code you write will be just as concise in whichever domain you choose, and access to .NET means there’s nothing you can’t do. The tooling around F# right now is not at its best (.NET Core is not yet supported for production), but the community is actively working on it. On a side note, I’d like to say that I’m not a professional F# developer, nor am I a writer, so feedback of any form is unconditionally welcome!