Stuck! No API, rip data off the page

You want to use some data that is provided by a site, but the site doesn’t expose any API to get that data. Well, a possible (and very dirty) trick is to get the page HTML and rip the data off it for our own use.

Not so easy dude… there are many hurdles:

  • The first is the “Same origin policy” security restriction, which prevents a page on your site from requesting any data (in our case, HTML) from a remote site. Here is a wiki page that explains this issue and possible workarounds.
  • The (X)HTML page can look like a bag full of garbage. I would suggest that you never parse HTML yourself… it will FAIL soon, you will have to fix it, and it will FAIL again; it's a vicious circle.

    So, let us assume you somehow get the page HTML within your page, through plain JavaScript/AJAX/jQuery or any other way. How will your page then get the data it wants?

    Well, most pages have a similar structure:

    <doctype....
    <html>
    <head>
    ..
    ..
    <script>..</script>
    .
    </head>
    <body ....>
    ...
    ..
    <<<<<Tags jungle>>>>
    ..
    <some sweet tag> <3 Your Data <3 </some tag>
    ..
    .<more tags>
    
    </body>
    </html>

    What we will do is load this page HTML into a JavaScript variable, then simply chop off everything from the start up to the <body> tag, and also trim any <script> tags. Then we will run a wild replace, using a regular expression, that turns every opening tag <* into a <div tag.
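The chopping steps described above can be sketched as one small function. This is only a rough illustration: the helper name toDivSoup and the sample markup are made up, and the script-stripping regex is an assumption about how one might do the "trim any <script> tags" step.

```javascript
// Hypothetical helper: the name and the sample markup are illustrative only.
function toDivSoup(html) {
  // 1. Drop everything before the opening <body tag (doctype, <head>, ...).
  var start = html.search(/<body/i);
  var body = start === -1 ? html : html.substr(start);

  // 2. Remove <script>...</script> blocks, contents included.
  body = body.replace(/<script[\s\S]*?<\/script\s*>/gi, "");

  // 3. Rename every opening tag to <div; attributes are left untouched.
  //    Closing tags (</span>, </p>, ...) do not match and stay as they are.
  return body.replace(/<[a-z]+/gi, "<div");
}

var sample =
  "<!DOCTYPE html><html><head><title>t</title></head>" +
  "<body class=\"page\"><script>var x = 1;</script>" +
  "<span class=\"temp\">72</span></body></html>";

var result = toDivSoup(sample);
// result: <div class="page"><div class="temp">72</span></body></html>
```

Note that the wild replace only looks at letters, so a tag such as <h2 ends up as <div2. The browser will still parse it as an unknown element, but it is one more reason to treat the result as tag soup rather than clean markup.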

    You can continue this chopping and transformation further to get closer to pure data. Once the HTML has gone through the above process, we will have something that looks like this:

    <div>
    <div>
    .
    .
    </div>
    <div>
    <div> <3 Your Data <3 </div>
    <div>
    ..
    .

    which we will host (assign to the innerHTML property) in an invisible div. This forces the browser to parse the HTML and make it part of your page's DOM.

    Later you can query this data, using something like…
    var data = document.getElementById("someId").innerText;

    where someId is the parent element containing the required data.
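A slightly more defensive version of this lookup might look like the sketch below. The function name readHostedText and the ids are made up for illustration. It takes the document as a parameter, returns null when an element is missing (which is what you would see once the source page changes), and falls back from textContent to innerText, since browsers differ on which one they support.

```javascript
// Hypothetical helper; names and ids are illustrative, not from any library.
function readHostedText(doc, hostId, html, targetId) {
  var host = doc.getElementById(hostId);
  if (!host) {
    return null; // no container to host the markup in
  }

  // Assigning innerHTML makes the browser parse the markup into DOM nodes.
  host.innerHTML = html;

  var target = doc.getElementById(targetId);
  if (!target) {
    return null; // the source page layout has probably changed
  }

  // textContent is the standard DOM property; innerText started out as an
  // IE extension, so try both.
  var text = target.textContent || target.innerText || "";
  return text.replace(/^\s+|\s+$/g, ""); // trim surrounding whitespace
}

// In a browser page that has <div id="host" style="display:none;"></div>:
// var data = readHostedText(document, "host", tag, "someId");
```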

    Let's take the example of the New York weather data page from AccuWeather.com:
    URL: http://www.accuweather.com/us/ny/new-york/10017/city-weather-forecast.asp

    If you inspect the page in the IE developer tools, you will observe the following:

    Now, assuming that you have managed to load this HTML into your page's JavaScript and it is assigned to the src variable, src should look something like this:

    Let's execute the following JavaScript statements:


    if(src === "")
        return;
    var i = src.indexOf("<body");
    src = src.substr(i); // chopped the Head part

    tag = src.replace(/<[a-z]+/gi,"<div");
    alert(tag);

    This should give us the output shown below:

    By executing a few more JavaScript statements, given below, we can reach our data:

    document.getElementById("host").innerHTML = tag;
    data = $(".info").find(".cond").find(".temp")[0].innerText;

    data = data.substr(0,3);
    alert("Today's Temp: "+data);
    // host here is a div element
    document.getElementById("host").innerHTML = "Today's Temp: "+data+"°";
    $("#host").show();

    The result will give us the temperature.
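Taking the first three characters with substr(0,3) depends on the exact shape of the text. A slightly more forgiving sketch pulls out the first number instead. The helper name parseTemperature and the sample strings are made up for illustration; they are not taken from the AccuWeather page.

```javascript
// Hypothetical helper: returns the first number found in the text, or null.
function parseTemperature(text) {
  // Optional minus sign, digits, optional decimal part.
  var match = /-?\d+(?:\.\d+)?/.exec(text);
  return match ? parseFloat(match[0]) : null;
}

var warm = parseTemperature("72°F");    // 72
var cold = parseTemperature(" -3° C");  // -3
var none = parseTemperature("n/a");     // null
```

Returning null instead of a chopped string makes it easier to notice that the source page no longer has the shape you expected.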

    Here is the full source code:

    
    <html>
    <head>
    <title>Page rip demo</title>
    <script type="text/javascript" src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.6.1.min.js"></script>
    </head>
    <body>
    <script type="text/javascript" >
    var url = "weather.htm";
    var src = "";
    onload = function()
    		{
    			$.ajax({'url':url,'success':function(source)
    								{	src = source;
    									alert(src);
    								},
    								'error':function(r,s,x)
    								{
    									alert("Error while trying to get data :(");
    								}});
    		}
    
    function GetTemperature()
    {
    	if(src === "")
    		return;
    
    	var i = src.indexOf("<body");
    	src = src.substr(i);		// chopped the Head part
    
    	tag = src.replace(/<[a-z]+/gi,"<div");
    	alert(tag);
    
    	document.getElementById("host").innerHTML = tag;
    	data = $(".info").find(".cond").find(".temp")[0].innerText;
    
    	data = data.substr(0,3);
    	alert("Today's Temp: "+data);
    	document.getElementById("host").innerHTML = "<h2>Today's Temp: "+data+"° C</h2>";
    	$("#host").show();
    }
    </script>
    <input type="button" onclick="GetTemperature()" value="Show Temp">
    
    <div id="host" style="display:none;"></div>
    </body>
    </html>
    

    Which will give us the following result:

    Pros: This approach is better than trying to parse the whole HTML yourself, which is more error-prone and bound to fail frequently. Here we let the browser parse our HTML and include it in the DOM, so we can run some simple JavaScript statements to get the required data.

    Cons: This approach is itself quite dirty and will fail someday, as soon as somebody alters the source page's HTML layout. Not to mention browser compatibility, as different browsers treat things differently.

    With the recent and continuing revolution on the web, things are getting easier, with sites exposing their data through APIs or at least through Microdata/Microformats.