{"id":136,"date":"2011-07-01T14:17:28","date_gmt":"2011-07-01T14:17:28","guid":{"rendered":"http:\/\/thewiredguy.com\/wordpress\/?p=136"},"modified":"2011-07-01T14:17:28","modified_gmt":"2011-07-01T14:17:28","slug":"dont-have-an-apirip-dat-off-the-page","status":"publish","type":"post","link":"https:\/\/inullable.in\/blog\/?p=136","title":{"rendered":"Stuck! No API, rip data off the page"},"content":{"rendered":"<p style=\"text-align: justify;\">You want to use some data that is provided by some site, but it doesn\u2019t expose any API to get that data. Well, a possible (and a very dirty) trick is to get the page HTML and rip data off it for our use.<\/p>\n<p>Not so easy dude&#8230; there are many hurdles:<\/p>\n<li>The first one is the \u201c<a href=\"http:\/\/en.wikipedia.org\/wiki\/Same_origin_policy\">Same origin policy<\/a>\u201d security restriction, which prevents your site\u2019s page from requesting any data (in our case, HTML) from a remote site. <a href=\"http:\/\/en.wikipedia.org\/wiki\/Same_origin_policy\">Here is a wiki page<\/a> that explains this issue and possible <a href=\"http:\/\/en.wikipedia.org\/wiki\/Same_origin_policy#Workarounds\">workarounds<\/a>.<\/li>\n<p><img class=\"alignright\" style=\"width: 106px; height: 106px;\" src=\"http:\/\/t0.gstatic.com\/images?q=tbn:ANd9GcQdBo0aQ8luCwrIEum76p2Fm4rOHbqo-1qWiaHihhNec1aGebPO\" alt=\"bag of garbage\" \/><\/p>\n<li>The (X)HTML page can look like a bag full of garbage. 
I would suggest that you <a href=\"http:\/\/www.codinghorror.com\/blog\/2009\/11\/parsing-html-the-cthulhu-way.html\">never parse HTML<\/a>&#8230; it will FAIL soon, you will have to fix it, and then it will FAIL again; it\u2019s a vicious circle.<\/li>\n<h3 style=\"text-align: center;\"><span style=\"color: #333399;\">HTML page is full of Garbage<\/span><\/h3>\n<p style=\"text-align: center;\"><img loading=\"lazy\" class=\"aligncenter\" src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/html-page1.png\" alt=\"\" width=\"658\" height=\"385\" \/><\/p>\n<p>So, let us assume you somehow <a href=\"http:\/\/en.wikipedia.org\/wiki\/Same_origin_policy#Workarounds\">get the page HTML<\/a> within your page through plain JavaScript\/AJAX\/jQuery or any other way. How will your page then get at the data it wants?<\/p>\n<p>Well, most pages have a similar construct:<\/p>\n<pre>&lt;doctype....\n&lt;html&gt;\n&lt;head&gt;\n..\n..\n&lt;script&gt;..&lt;\/script&gt;\n.\n&lt;\/head&gt;\n&lt;body ....&gt;\n...\n..\n&lt;&lt;&lt;&lt;&lt;Tags jungle&gt;&gt;&gt;&gt;\n..\n&lt;some sweet tag&gt; &lt;3 Your Data &lt;3 &lt;\/some sweet tag&gt;\n..\n.&lt;more tags&gt;\n\n&lt;\/body&gt;\n&lt;\/html&gt;<\/pre>\n<p>What we will do is load this page\u2019s HTML into a JavaScript variable, then simply chop off everything from the start up to the &lt;body tag, and also trim any &lt;script&gt; tags. Then we will run a wild replace using a regular expression that turns every opening tag &lt;* into a &lt;div tag.<\/p>\n<p>You can continue this chopping and transformation further to get closer to pure data. 
Once the HTML has gone through the above process, we will have something that looks like this:<\/p>\n<pre>&lt;div&gt;\n&lt;div&gt;\n.\n.\n&lt;\/div&gt;\n&lt;div&gt;\n&lt;div&gt; &lt;3 Your Data &lt;3 &lt;\/div&gt;\n&lt;div&gt;\n..\n.<\/pre>\n<p>which we will host (assign to the innerHTML property) in an <span style=\"text-decoration: underline;\">invisible div<\/span>. This forces the browser to parse the HTML and make it part of your page\u2019s DOM.<\/p>\n<p>Later you can query this data, using something like&#8230;<br \/>\n<code>var data = document.getElementById(\"someId\").innerText;<\/code><\/p>\n<p>where someId is the id of the parent element containing the required data.<\/p>\n<p>Let\u2019s take the New York weather page from AccuWeather.com as an example:<br \/>\n<strong><code style=\"border: black solid 1px; padding: 2px;\">URL: http:\/\/www.accuweather.com\/us\/ny\/new-york\/10017\/city-weather-forecast.asp<\/code><\/strong><\/p>\n<p><img src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/AccuWeather.png\" alt=\"\" \/><\/p>\n<p>If you inspect the page in the IE developer tools, you will observe the following:<\/p>\n<p><a href=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/IEDevTools.png\" target=\"_blank\"><img loading=\"lazy\" src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/IEDevTools.png\" alt=\"\" width=\"1008\" height=\"756\" \/><\/a><\/p>\n<p>Now, assuming that you have managed to load this HTML into your page\u2019s JavaScript and assigned it to the src variable, src should look something like this:<\/p>\n<p><img src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/message_src.png\" alt=\"\" \/><\/p>\n<p>Let\u2019s execute the following JavaScript statements:<\/p>\n<p><code><br \/>\nif(src === \"\")<br \/>\nreturn;<br \/>\nvar i = src.indexOf(\"&lt;body\");<br \/>\nsrc = src.substr(i);\t\t\/\/ chopped the Head part<\/code><\/p>\n<p><code> <\/code><\/p>\n<p><code> tag = 
src.replace(\/&lt;[a-z]+\/gi,\"&lt;div\");<br \/>\nalert(tag);<br \/>\n<\/code><\/p>\n<p>This should give us the output shown below:<\/p>\n<p><img src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/div_src.png\" alt=\"\" \/><\/p>\n<p>Executing a few more JavaScript statements, provided below, gets us to our data:<br \/>\n<code><br \/>\ndocument.getElementById(\"host\").innerHTML = tag;<br \/>\ndata = $(\".info\").find(\".cond\").find(\".temp\")[0].innerText;<\/code><\/p>\n<p><code>data = data.substr(0,3);<br \/>\nalert(\"Today's Temp: \"+data);<br \/>\n\/\/ host here is a div element<br \/>\ndocument.getElementById(\"host\").innerHTML = \"Today's Temp: \"+data+\"\u00b0\";<br \/>\n$(\"#host\").show();<br \/>\n<\/code><\/p>\n<p>The result gives us the temperature.<\/p>\n<p>Here is the full source code:<\/p>\n<p><a style=\"border: outset 3px blue; background: #3A93F3; color: #000061; text-decoration: none; padding:2px;\" href=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/Weather.zip\" target=\"_blank\">download source code<\/a><\/p>\n<pre><code>\n&lt;html&gt;\n&lt;head&gt;\n&lt;title&gt;Page rip demo&lt;\/title&gt;\n&lt;script type=\"text\/javascript\" src=\"http:\/\/ajax.aspnetcdn.com\/ajax\/jQuery\/jquery-1.6.1.min.js\"&gt;&lt;\/script&gt;\n&lt;\/head&gt;\n&lt;body&gt;\n&lt;script type=\"text\/javascript\" &gt;\nvar url = \"weather.htm\";\t\/\/ relative URL, so the request stays on the same origin\nvar src = \"\";\nonload = function()\n\t\t{\n\t\t\t\/\/ fetch the page HTML and keep it in src\n\t\t\t$.ajax({'url':url,'success':function(source)\n\t\t\t\t\t\t\t\t{\tsrc = source;\n\t\t\t\t\t\t\t\t\talert(src);\n\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t'error':function(r,s,x)\n\t\t\t\t\t\t\t\t{\n\t\t\t\t\t\t\t\t\talert(\"Error while trying to get data :(\");\n\t\t\t\t\t\t\t\t}});\n\t\t}\n\nfunction GetTemperature()\n{\n\tif(src === \"\")\n\t\treturn;\t\t\/\/ nothing loaded yet\n\n\tvar i = src.indexOf(\"&lt;body\");\n\tsrc = src.substr(i);\t\t\/\/ chopped the Head part\n\n\t\/\/ turn every opening tag into a &lt;div\n\tvar tag = 
src.replace(\/&lt;[a-z]+\/gi,\"&lt;div\");\n\talert(tag);\n\n\t\/\/ let the browser parse the markup inside the hidden host div\n\tdocument.getElementById(\"host\").innerHTML = tag;\n\tvar data = $(\".info\").find(\".cond\").find(\".temp\")[0].innerText;\n\n\tdata = data.substr(0,3);\t\/\/ keep only the first three characters\n\talert(\"Today's Temp: \"+data);\n\tdocument.getElementById(\"host\").innerHTML = \"&lt;h2&gt;Today's Temp: \"+data+\"\u00b0 C&lt;\/h2&gt;\";\n\t$(\"#host\").show();\n}\n&lt;\/script&gt;\n&lt;input type=\"button\" onclick=\"GetTemperature()\" value=\"Show Temp\"&gt;\n\n&lt;div id=\"host\" style=\"display:none;\"&gt;&lt;\/div&gt;\n&lt;\/body&gt;\n&lt;\/html&gt;\n<\/code><\/pre>\n<p>This gives us the following result:<\/p>\n<p><img src=\"http:\/\/thewiredguy.com\/wordpress\/wp-content\/uploads\/2011\/07\/temp_show.png\" alt=\"\" \/><\/p>\n<p><strong>Pros: <\/strong>This approach is better than trying to parse the whole HTML yourself, which is more error-prone and bound to fail frequently. Here we let the browser parse the HTML and include it in the DOM, so we can run a few simple JavaScript statements to get the required data.<\/p>\n<p><strong>Cons: <\/strong>This approach is pretty dirty itself and will fail someday, as soon as somebody changes the source page\u2019s HTML layout. Not to mention browser compatibility, as different browsers treat things differently.<\/p>\n<p>With the recent and continuing revolution on the web, things are getting easier, with sites exposing their data through APIs or at least through Microdata\/Microformats.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You want to use some data that is provided by some site, but it doesn\u2019t expose any API to get that data. Well, a possible (and a very dirty) trick is to get the page HTML and rip data off it for our use. 
Not so easy dude&#8230; there are many hurdles: First one is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[31,41,161,241,251,271],"tags":[],"_links":{"self":[{"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=\/wp\/v2\/posts\/136"}],"collection":[{"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=136"}],"version-history":[{"count":0,"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=\/wp\/v2\/posts\/136\/revisions"}],"wp:attachment":[{"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inullable.in\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}