Web Scraping Services, Web Data Scraping, Website Data Scraping, Data Scraping Services, Business Directory Scraping, Yahoo Answers Scraping, Artindex.Com Scraping, Scrape Autotrader Database, Scrape Cars Database, Product Scraping Services

Friday, 22 August 2014

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are visible at first, and I have to click "view more" to see the next 15.

The source code for the "View more" button looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of clicking this four times before scraping? I want the most recent 60 posts on the site. Python is preferable.

You could probably use Selenium to browse to the website and click the button/link a few times; a minimal sketch follows the links below. You can get Selenium here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/
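
For example, here is a minimal sketch of the Selenium route. It is only a sketch: the profile URL is a placeholder, it assumes the Selenium API current at the time of this post (the find_element_by_* style used elsewhere on this blog), and it locates the button via the name="commit" attribute visible in the HTML above.

import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://ask.fm/some_profile')  # placeholder profile URL

for i in range(4):  # 15 more posts per click -> roughly 60 posts in total
    # re-find the button each time, since the page changes after every click
    button = browser.find_element_by_name('commit')  # the "View more" button
    button.click()
    time.sleep(2)  # crude wait for the next batch of posts to load

html = browser.page_source  # hand this off to your parser of choice
browser.quit()

A fixed sleep is the bluntest way to wait; Selenium's explicit waits are more robust if the page loads slowly.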



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Wednesday, 20 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how I can solve a design problem I'm going to face when building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I could end up with the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person, but their name is written slightly differently on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can over time build up a list of failures and store them per scraper, so that each scraper can know (if I find id=100, then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB
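
To make the reference-list idea above concrete, here is a minimal record-linkage sketch in Python (the class in the question is Java, but the approach is language-neutral). The nickname table and the records are illustrative assumptions, not part of the original question.

# Hand-maintained reference list mapping common nicknames to a canonical name.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def canonical(name):
    # Normalise case/whitespace, then fold nicknames onto one canonical form.
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def same_person(a, b):
    return (canonical(a["firstname"]) == canonical(b["firstname"])
            and a["surname"].strip().lower() == b["surname"].strip().lower())

site_a = {"id": 100, "firstname": "William", "surname": "Doe"}
site_b = {"id": 1974, "firstname": "Bill", "surname": "Doe"}
print(same_person(site_a, site_b))  # True: "Bill" canonicalises to "william"

For fuzzier mismatches (typos rather than nicknames), an edit-distance threshold on the canonicalised names is the usual next step.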


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Tuesday, 19 August 2014

Scrape Data Point Using Python


I am looking to scrape a data point using Python from the URL http://www.cavirtex.com/orderbook.

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant value is the 860.00000. I am looking to build this into a script that can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite new at this, so if in your explanations you could offer your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far, which returns the page title correctly; I'm having trouble grabbing the table data, though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

# Walk every cell of the "Buying BTC" table and keep the smallest number found.
lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')  # peel the surrounding <b></b> tags
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:
        pass  # cell wasn't a plain number (a date, a price pair, a "CAD" amount)

browser.quit()
print lowest_bid

To install Selenium for Python on your Windows PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If performance is critical, you can implement the data scraping via a plain URL connection instead of Selenium, and your program will run much faster. However, your code will probably end up a lot "messier", due to the tedious HTML parsing that you'll be obliged to apply...
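
For instance, a rough sketch of that faster approach, reusing the asker's urllib2 + BeautifulSoup setup from above. It assumes the "Buying BTC" table is present in the static HTML under the same div id="orderbook_buy" used in the Selenium code; if the table is filled in by JavaScript, this will not see it, which is exactly the trade-off just described.

import urllib2
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0'}  # same browser-like header as before
req = urllib2.Request('http://www.cavirtex.com/orderbook', headers=hdr)
soup = BeautifulSoup(urllib2.urlopen(req))

lowest_bid = float('inf')
for cell in soup.select('#orderbook_buy table td'):
    try:
        lowest_bid = min(lowest_bid, float(cell.get_text().strip()))
    except ValueError:
        pass  # cell wasn't a plain number (a date, a price pair, a "CAD" amount)

print lowest_bid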

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib, ssl

def SendMail(username, password, contents):
    server = Connect(username)
    try:
        server.login(username, password)
        server.sendmail(username, username, contents)
    except smtplib.SMTPException as error:
        print(error)
    Disconnect(server)

def Connect(username):
    # e.g. "gmail" out of "user@gmail.com": take what sits between the "@"
    # and the next "." (searching after the "@", so dots in the user part are safe)
    serverName = username[username.index("@") + 1 : username.index(".", username.index("@"))]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException as error:
            print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException, ssl.SSLError) as error:
            print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException as error:
        print(error)

serverDict = {
    "gmail"  : "smtp.gmail.com",
    "hotmail": "smtp.live.com",
    "yahoo"  : "smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com", "your_password", str(lowest_bid))

The above code should work if your email provider is Gmail, Hotmail or Yahoo.

Please note that, depending on your firewall configuration, you may be asked for permission the first time you run it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

Saturday, 16 August 2014

Data From Web Scraping Using Node.JS Request Is Different From Data Shown In The Browser

Right now I am doing some simple web scraping, for example getting the current train arrival/departure information for one railway station. Here is an example link, http://www.thetrainline.com/Live/arrivals/chester; from this link you can see the trains currently arriving at Chester station.

I am using the Node.js request module to do some simple web scraping:

var request = require('request');   // modules this snippet relies on
var cheerio = require('cheerio');

app.get('/railway/arrival', function (req, res, next) {
    console.log("/railway/arrival/  " + req.query["city"]);
    var city = req.query["city"];
    // typeof yields a string, so compare against the string "undefined"
    // (the original `typeof city == undefined` was always false)
    if (typeof city === "undefined" || city === null) {
        console.log("city is undefined");
        city = "liverpool-james-street";
    }
    getRailwayArrival(city, function (err, data) {
        res.send(data);
    });
});

function getRailwayArrival(station, callback) {
    request({
        uri: "http://www.thetrainline.com/Live/arrivals/" + station,
    }, function (error, response, body) {
        var $ = cheerio.load(body);
        var a = [];

        $(".results-contents li a").each(function () {
            // strip newlines and tabs from the scraped cell text
            var clean = function (s) { return s.replace(/(\r\n|\n|\r|\t)/gm, ""); };
            var due = clean($(this).find('.due').text());
            var destination = clean($(this).find('.destination').text());
            var on_time = clean($(this).find('.on-time-yes .on-time').text());
            // .text() returns "" (never undefined) when nothing matches
            var on_time_no = "";
            if (on_time === "") on_time_no = clean($(this).find('.on-time-no').text());
            var platform = clean($(this).find('.platform').text());

            a.push({ due: due, destination: destination, on_time: on_time, platform: platform });
            console.log("arrival  " + due + "  " + destination + "  " + on_time + "  " + platform + "  " + on_time_no);
        });
        console.log("get station data  " + a.length + "   " + $(".updated-time").text());
        callback(null, a);
    });
}

The code works and gives me a list of data, but the data differ from what is shown in the browser, even though they come from the same URL. I don't know why. Is it because the server can distinguish requests sent by a server from requests sent by a browser, and serves me different data? How can I overcome this problem?

Thanks in advance.

2 Answers

They have probably stored a session per click event: when you first visit the page, a session is created, and that session is validated on your next action. Say you select a value from a drop-down list; a new session value is generated for that click, which loads the data for your selected combo-box value. When you then ask to show the list, the previous session value is validated and you get the accurate data.

If you don't capture that session value programmatically and pass it as a parameter with your request, you will get the default data or nothing at all. Catching that value is the challenging part; use Firebug to help.

Another possibility is that the content is generated by JavaScript run in the browser. jsdom is a module which will execute page scripts and provide such content, though it is not as lightweight as cheerio.

Cheerio does not execute these scripts, so content generated that way may not be visible to it (as you're experiencing). This comes from an article I read a while back that led me to the same discovery; search it for "jsdom is more powerful" for a quick answer.

Source: http://stackoverflow.com/questions/15785360/data-from-web-scraping-using-node-js-request-is-different-from-data-shown-in-the?rq=1

Tuesday, 5 August 2014

Business Intelligence Data Mining

Data mining can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making.

Data mining requires the use of mathematical algorithms and statistical techniques integrated with software tools. The final product is an easy-to-use software package that can be used even by non-mathematicians to effectively analyze the data they have. Data Mining is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, web site personalization, e-commerce, healthcare, customer relationship management, financial services and telecommunications.

Business intelligence data mining is used in market research, industry research, and competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. BI uses various technologies, such as data mining, scorecarding, data warehouses, text mining, decision support systems, executive information systems, management information systems and geographic information systems, to analyze useful information for business decision-making.

Business intelligence is a broader arena of decision-making that uses data mining as one of its tools. In fact, the use of data mining in BI makes the data more relevant in application. There are several kinds of data mining: text mining, web mining, social-network data mining, relational-database mining, pictorial data mining, audio data mining and video data mining, all of which are used in business intelligence applications.

Some data mining tools used in BI are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based learning (case-based, memory-based, non-parametric), regression algorithms, Bayesian networks, Gaussian mixture models, K-means and hierarchical clustering, Markov models and so on.

Source: http://ezinearticles.com/?Business-Intelligence-Data-Mining&id=196648

Thursday, 31 July 2014

Automated SEO Tools Can Keep You Out of the SERPs

Every day, thousands of newcomers enter the world of internet marketing. They join all of the popular forums, follow popular advice, and purchase a bunch of tools to automate the process. While all three of these things are full of potential problems and pitfalls, it is the automated tools that have the potential to cause the greatest harm.

Why do People Buy Automated SEO Tools?

Automated SEO tools make some very grand promises. First, they promise to eliminate all of the hard work and effort that is needed to succeed with making money online. Next, they promise to give you an "edge" over your competitors. Finally, they promise to do it all for less than outsourcing.

But most of these tools don't work properly, meaning that any money you spent is money wasted. If you can't use it, you can't get any sort of return on your investment.

The few that do work properly, though, typically use techniques that are frowned upon by the search engines. They violate the terms of service, as well as commonly accepted web etiquette.

For instance, Scrape Box is a tool that is designed to find blogs on which you can comment and generate back links. It's a valuable tool if you're using it to automate the finding of relevant blogs. After that, though, it can only get you in trouble.

Just like most of these tools, Scrape Box is going to guide you through the process of creating a generic comment template. It will then assist you with "spinning" that comment, so that you can have hundreds (or even thousands) of "unique" versions.

What ends up happening, though, is that you end up with comments that resemble gibberish more than anything else. The tool then pushes these comments to blogs with links back to your website. And that's when everything goes downhill.

Search Engine Algorithms vs Automated SEO Tools

The most recent batch of algorithm updates, like Panda and Penguin, are designed to help Google better identify spam that is used for nothing other than search engine optimization. When you use Scrape Box to build and publish your content, you're spamming the web.

When Google's spiders crawl the content on these blogs and identify them as spam, they'll also note the fact that a link was sent back to your site. If you have too many of these, it will raise a red flag and your site may end up deindexed.

If you're lucky, you won't be penalized that harshly. Instead, all of your links will end up devalued. That isn't much better, though, because then the original cash investment in the tool, as well as all of the time you spent setting everything up, will be for nothing.

All of the popular SEO automation tools focus on spamming the web to build back links. This includes:

    Tools designed to automate blog comments

    Tools designed to automate article directory submissions

    Tools designed to submit your site to social bookmarking directories

    Tools designed to spin your content for uniqueness

The end results are disappointing, at best. You spend money on a tool that pushes out thousands of back links that never really help your SEO campaign. Ultimately, your site develops a back link profile that is so damaged that no amount of hard, legitimate work can ever overcome it.

Slow and Steady Wins the Race

The allure and appeal of these tools is simple to understand: they eliminate all of the hard work that would go into building a real back link profile. The problem, though, is just as plain.

If you were to perform these tasks manually it would take a long time. But the quality of the back links you receive will far outweigh anything the automated tools could do for you.

Sure, you might not be able to manually generate a thousand back links in one day. But Google knows that, and generating that many at once immediately sets off a red flag. That's not how you build a solid, long-lasting ranking.

By handling each step in your SEO campaign manually you will build a more natural and diverse back link footprint over time. Your rankings won't come quickly, but they'll stick for a very long time.

Source: http://ezinearticles.com/?Automated-SEO-Tools-Can-Keep-You-Out-of-the-SERPs&id=7427357

Wednesday, 9 July 2014

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection

Data scraping is the process of extracting data from the web using software programs, from proven websites only. Anyone can use the extracted data for any purpose, across various industries, since the web holds every kind of important data in the world. We provide some of the best web data extraction software, and we have the expertise and unique knowledge in web data extraction, image scraping, screen scraping, email extraction services, data mining and web grabbing.

Who can use Data Scraping Services?

Data scraping and extraction services can be used by any organization, company or firm that wants data from a particular industry, data on targeted customers, data on a particular company, or anything else available on the net, such as email IDs, website names or search terms. Most of the time, a marketing company will use data scraping and data extraction services to market a particular product in a certain industry and to reach targeted customers. For example, if company X wants to contact restaurants in California, our software can extract the data on those restaurants, and a marketing company can use this data to market its restaurant-related products.

MLM and network marketing companies also use data extraction and data scraping services to find new customers: by extracting data on prospective customers, they can contact them by telephone, postcard or email marketing, and in this way build their huge networks and large groups for their own products and companies.

We have helped many companies find the particular data they need. Some examples follow.

Web Data Extraction

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. We help you create the kind of API that lets you scrape data as per your need. We provide a quality, affordable web data extraction application.

Data Collection

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all. That's why the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user.

Email Extractor

A tool that automatically extracts email IDs from reliable sources is called an email extractor. It basically serves the function of collecting business contacts from various web pages, HTML files, text files or other formats, without duplicate email IDs.

Screen scraping

Screen scraping refers to the practice of reading text information from a computer display terminal's screen, collecting visual data from a source instead of parsing data as in web scraping.

Data Mining Services

Data mining is the process of extracting patterns from information, and it is becoming an increasingly important tool for transforming data into information. We can deliver the output in any format, including MS Excel, CSV, HTML and many others, according to your requirements.

Web spider

A web spider is a computer program that browses the World Wide Web in a methodical, automated manner. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.

Web Grabber

Web grabber is just another name for data scraping or data extraction.

Web Bot

Web Bot is a software program that is claimed to be able to predict future events by tracking keywords entered on the Internet. Web bot software is also good at pulling out articles, blogs, relevant website content and similar website-related data. We have worked with many clients on data extraction, data scraping and data mining, and they are really happy with our services; we provide high-quality service and make your data work easy and automatic.

Source: http://ezinearticles.com/?How-Web-Data-Extraction-Services-Will-Save-Your-Time-and-Money-by-Automatic-Data-Collection&id=5159023