You want to use some data that is provided by some site, but the site doesn’t expose any API to get that data. Well, a possible (and very dirty) trick is to fetch the page HTML and rip the data out of it for your own use.
Not so easy, dude… there are many hurdles:
The HTML page is full of garbage
So, let us assume you somehow get the page HTML into your own page, through plain JavaScript/AJAX/jQuery or any other way. How will your page then get at the data it wants?
Well, most pages have a similar structure:
<doctype....
<html>
<head>
.. ..
<script>..</script>
.
</head>
<body ....>
... ..
<<<<<Tags jungle>>>>
..
<some sweet tag> <3 Your Data <3 </some tag>
.. .
<more tags>
</body>
</html>
What we will do is load this page HTML into a JavaScript variable, then simply chop off everything from the start up to the <body> tag, and also trim any <script> tags. Then we will run a wild replace using a regular expression that turns every opening tag <* into a <div tag.
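The chopping and tag rewriting described above could be sketched as a single function. This is only a rough sketch under a few assumptions: the name prepareHtml and the sample markup are made up for illustration, and rewriting the closing tags as well (and allowing digits in tag names such as h1) is an addition on top of the plain "replace every opening tag" idea.

```javascript
// Hypothetical helper: turns a full HTML page into a "div soup" string.
function prepareHtml(src) {
  // Drop everything before the opening <body> tag (doctype, <head>, ...).
  var start = src.search(/<body/i);
  var body = start === -1 ? src : src.substr(start);

  // Remove <script> blocks so they never end up inside our own page.
  body = body.replace(/<script[\s\S]*?<\/script\s*>/gi, "");

  // Rewrite every opening tag name to "div"; attributes are left untouched.
  body = body.replace(/<[a-z][a-z0-9]*/gi, "<div");

  // Assumption: rewrite closing tags too, so opening and closing tags pair up.
  return body.replace(/<\/[a-z][a-z0-9]*/gi, "</div");
}

// Made-up sample page, just to show the effect.
var sample = '<html><head><script>var x = 1;</script></head>' +
             '<body><p class="temp">72</p><script>track();</script></body></html>';
console.log(prepareHtml(sample));
// prints: <div><div class="temp">72</div></div></div>
```

Because class and id attributes survive the rewrite, the usual selectors keep working on the result. The stray closing tag at the end is harmless, since browsers ignore it when parsing.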
You can continue this chopping and transformation to get closer to the pure data. Once the HTML has gone through the above process, we will have something that looks like this:
<div> <div> . . </div> <div> <div> <3 Your Data <3 </div> <div> .. .
which we will host (assign to the innerHTML property) in an invisible div; this forces the browser to parse the HTML and make it part of your page's DOM.
Later you can query this data, using something like…
var data = document.getElementById("someId").innerText;
where someId is the id of the parent element containing the required data.
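To make the hosting and querying step a bit more concrete, here is a minimal sketch. The function name readHostedText and the ids "host" and "someId" are assumptions for illustration only; the document object is passed in as a parameter so the function does not depend on any particular page.

```javascript
// Hypothetical helper: `doc` is the page's document object; `hostId` and
// `targetId` are element ids assumed for this example.
function readHostedText(doc, hostId, targetId, html) {
  var host = doc.getElementById(hostId);
  if (!host) {
    return null; // no host element on our own page
  }
  host.style.display = "none"; // keep the borrowed markup invisible
  host.innerHTML = html;       // the browser parses the string into DOM nodes

  var target = doc.getElementById(targetId);
  if (!target) {
    return null; // the source page layout has probably changed
  }
  // innerText is not available in every browser, so fall back to textContent.
  var text = target.innerText || target.textContent || "";
  return text.replace(/^\s+|\s+$/g, ""); // trim surrounding whitespace
}

// In a browser, with the transformed HTML in a variable named tag:
// var data = readHostedText(document, "host", "someId", tag);
```

Returning null instead of throwing lets the caller decide what to do when the element it expects is no longer there.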
Let's take the New York weather page from AccuWeather.com as an example:
URL: http://www.accuweather.com/us/ny/new-york/10017/city-weather-forecast.asp
If you inspect the page in the IE developer tools, you will observe the following:
Now, assuming that you have managed to load this HTML into your page's JavaScript and assigned it to the src variable, src should look something like this:
Let's execute the following JavaScript statements:
if(src === "")
return;
var i = src.indexOf("<body");
src = src.substr(i); // chopped the Head part
tag = src.replace(/<[a-z]+/gi,"<div"); // every opening tag becomes a div
alert(tag);
This should give us the output shown below:
Executing a few more JavaScript statements, given below, gets us to our data:
document.getElementById("host").innerHTML = tag;
data = $(".info").find(".cond").find(".temp")[0].innerText;
data = data.substr(0,3);
alert("Today's Temp: "+data);
// host here is a div element
document.getElementById("host").innerHTML = "Today's Temp: "+data+"°";
$("#host").show();
Result will give us the temperature.
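Note that substr(0,3) assumes the temperature always sits in the first three characters of the text. As an alternative sketch (not what the code above does), the number could be pulled out with a regular expression; the function name parseTemperature and the sample strings are made up for illustration.

```javascript
// Hypothetical helper: returns the first number found in a text such as
// "72°F", or null when the text contains no number at all.
function parseTemperature(text) {
  var match = /-?\d+(\.\d+)?/.exec(String(text));
  return match ? parseFloat(match[0]) : null;
}

console.log(parseTemperature("72°F"));  // 72
console.log(parseTemperature("-5° C")); // -5
console.log(parseTemperature("n/a"));   // null
```

This copes with one-digit, three-digit, and negative readings, and a null result is a cheap hint that the source page no longer looks the way the script expects.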
Here is the full source code:
<html>
<head>
<title>Page rip demo</title>
<script type="text/javascript" src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.6.1.min.js"></script>
</head>
<body>
<script type="text/javascript" >
var url = "weather.htm";
var src = "";
onload = function()
{
$.ajax({'url':url,'success':function(source) {
src = source; // keep the fetched HTML for GetTemperature()
alert(src);
},
'error':function(r,s,x)
{
alert("Error while trying to get data :(");
}});
}
function GetTemperature()
{
if(src === "")
return;
var i = src.indexOf("<body");
src = src.substr(i); // chopped the Head part
tag = src.replace(/<[a-z]+/gi,"<div");
alert(tag);
document.getElementById("host").innerHTML = tag;
data = $(".info").find(".cond").find(".temp")[0].innerText;
data = data.substr(0,3);
alert("Today's Temp: "+data);
document.getElementById("host").innerHTML = "<h2>Today's Temp: "+data+"° C</h2>";
$("#host").show();
}
</script>
<input type="button" onclick="GetTemperature()" value="Show Temp">
<div id="host" style="display:none;"></div>
</body>
</html>
Which will give us the following result:
Pros: this approach is better than trying to parse the whole HTML yourself, which is more error-prone and bound to fail frequently. Here we let the browser parse our HTML and include it in the DOM, so we can run some simple JavaScript statements to get the required data.
Cons: this approach is pretty dirty itself and will fail someday, as soon as somebody alters the source page's HTML layout. Not to mention browser compatibility, as different browsers treat things differently.
With the recent and continuing revolution going on around the web, things are getting easier, with sites exposing their data through APIs or at least through Microdata/Microformats.