Namespaces:
- Regex = using System.Text.RegularExpressions
- WebClient, WebUtility = using System.Net
- List<T> = using System.Collections.Generic
Here’s a very simple example of using Regex for web scraping (note: for brevity, I didn’t include validation, try/catch, and the like):
private string ExecuteURL(string url)
{
    // Download the page body as a string; the using block disposes the client afterwards.
    using (WebClient w = new WebClient())
    {
        return w.DownloadString(url);
    }
}
private List<CurrencyItem> LoadCurrencies()
{
    List<CurrencyItem> list = new List<CurrencyItem>();
    string result = ExecuteURL("https://www.google.com/finance/converter");
    // Group 1 captures the value attribute (the code); group 2 captures the option text (the name).
    MatchCollection m1 = Regex.Matches(result, @"<option value=\""(.*?)\"">(.*?)</option>", RegexOptions.Singleline);
    foreach (Match m in m1)
    {
        // Decode HTML entities (such as &amp;) in the display name.
        list.Add(new CurrencyItem(m.Groups[1].Value, WebUtility.HtmlDecode(m.Groups[2].Value)));
    }
    return list;
}
public class CurrencyItem
{
    public string Name { get; set; }
    public string Code { get; set; }
    public CurrencyItem(string _code, string _name)
    {
        Name = _name;
        Code = _code;
    }
}
In the sample above, we used Google’s Converter and we want to list each country’s currency. The response is HTML that contains a list of lines like this one:
- <option value="USD">United States ($)</option>
What I wanted to extract from each line are:
- “USD”
- “United States ($)”
The regex I used is
- @"<option value=\""(.*?)\"">(.*?)</option>"
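The above regex means:
- .*? == match any characters, but as few as possible (lazy), so the capture stops at the first closing quote or </option> instead of the last one
- () == a capture group; we want to get only what’s inside these parentheses
- \"" == a literal double quote; "" is how a quote is written inside a verbatim (@"...") string, and the backslash is optional because a quote has no special meaning in a regex
- RegexOptions.Singleline == lets . match line breaks too, so an option that spans several lines still matches
- The rest is literal text that pins the match to the exact line we want to extract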
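To see why the ? after .* matters, here is a small sketch that runs the lazy and the greedy version of the capture against a made-up one-line snippet. The class name LazyVsGreedyDemo, the helper FirstCapture, and the sample HTML are all my own inventions for illustration, not part of the scraper above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Hypothetical helper: returns the first capture group of the first match,
    // or an empty string when the pattern does not match at all.
    public static string FirstCapture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample input: two options on a single line.
        string html = "<option value=\"USD\">US Dollar</option><option value=\"EUR\">Euro</option>";

        // Lazy (.*?): stops at the first closing quote.
        Console.WriteLine(FirstCapture(html, @"<option value=\""(.*?)\"">"));
        // prints: USD

        // Greedy (.*): keeps going until the last "> in the input.
        Console.WriteLine(FirstCapture(html, @"<option value=\""(.*)\"">"));
        // prints: USD">US Dollar</option><option value="EUR
    }
}
```

The greedy version swallows everything up to the last option, which is why the lazy form is the one you want when the same tag repeats on a page.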
For each match m in the collection we retrieved, we will have:
- m.Groups[0].Value == the whole matched text
- m.Groups[1].Value == the first pair of parentheses (USD)
- m.Groups[2].Value == the second pair of parentheses (United States ($))
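If you want to try the group indexes without hitting the network, a rough sketch like the one below runs the same regex over a hard-coded string. The class name OptionParser, the method Parse, and the sample HTML (including the XYZ entry) are assumptions I made up for this demo; the real page may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class OptionParser
{
    // Hypothetical helper: extracts (code, name) pairs from
    // <option value="...">...</option> elements in the given HTML.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        MatchCollection matches = Regex.Matches(
            html, @"<option value=\""(.*?)\"">(.*?)</option>", RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            // Groups[0] is the whole match; Groups[1] and Groups[2] are the two captures.
            string code = m.Groups[1].Value;
            string name = WebUtility.HtmlDecode(m.Groups[2].Value);
            pairs.Add(new KeyValuePair<string, string>(code, name));
        }
        return pairs;
    }

    public static void Main()
    {
        // Made-up sample markup standing in for the downloaded page.
        string html =
            "<select>\n" +
            "<option value=\"USD\">United States ($)</option>\n" +
            "<option value=\"XYZ\">Example &amp; Co (X$)</option>\n" +
            "</select>";

        foreach (KeyValuePair<string, string> pair in Parse(html))
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);
        }
        // prints:
        // USD => United States ($)
        // XYZ => Example & Co (X$)
    }
}
```

Keeping the parsing in its own method that takes a string makes it easy to check the regex against saved HTML before pointing it at a live URL.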
...and that’s it! You can now bind the returned list to a control such as a ComboBox or a DataGridView.