Jigsaw Data Scraping: November 2016

Cleaning up Data Scraped from the Web

This section deals with how you can clean up data – having extracted it from the web by scraping.
Useful clean up steps

One advantage of scraping data from the web is that you can actually have a better dataset than the original. Because you need to take steps to understand the dataset’s inconsistencies, you can eliminate or at least minimise them. From another perspective, spending time cleaning up messy data can fill the large gaps that your processor will experience when waiting for it to be downloaded from its host.

This section provides an example of several useful clean-up operations.

    Cleaning HTML
    Strip whitespace
    Converting numbers to number types:
    Converting Boolean values: ‘Yes’ -> True
    Converting dates to machine-readable formats: “24 June 2004” -> “2004-06-24”

Clean the HTML

HTML you find on the web can be atrocious. Here’s a quick function that can help. We make use of the `lxml`_ library. It’s very good at understanding broken HTML and will render a perfectly-formed page for your extractor functions.

You may be concerned that this is computationally wasteful. This is true, but it can reduce lots of the irritation of extracting specific information from messy HTML:

def clean_page(html, pretty_print=False):
    """
    >>> junk = "some random HTML<P> for you to try to parse</p>"
    >>> clean_page(junk)
    '<div><p>some random HTML</p><p> for you to try to parse</p></div>'
    >>> print clean_page(junk, pretty_print=True)
    <div>
    <p>some random HTML</p>
    <p> for you to try to parse</p>
    </div>
    """
    from lxml.html import fromstring
    from lxml.html import tostring
    return tostring(fromstring(html), pretty_print=pretty_print)

Converting yes/no to Boolean values

Computers are far better at interpreting Boolean values when they are consistently provided. Irrespective of the programming language, normalising these values will make any automatic comparisons much richer:

def to_bool(yes_no, none_to_false=True):
    """
    >>> to_bool('')
    False
    >>> to_bool(None):
    False
    >>> to_bool('y')
    True
    >>> to_bool('yip')
    True
    >>> to_bool('Yes')
    True
    >>> to_bool('nuh')
    False
    """
    yes_no = yes_no.strip().lower()
    if not yes_no.strip() and none_to_false:
        return False
    if yes_no.startswith('y'):
        return True
    elif yes_no.startswith('n'):
        return False

Converting numbers to the correct type

If you’re extracting numbers from HTML tables, they will each be represented as a string or Unicode, even though it would be more sensible to treat as integers or floating point numbers:

def to_int(number, european=False):
    """
    >>> to_int('32')
    32
    >>> to_int('3,998')
    3998
    >>> to_int('3.998', european=True)
    3998
    """
    if european:
        number = number.replace('.', '')
    else:
        number = number.replace(',', '')
    return int(number)

def to_float(number, european=False)
    """
    >>> to_float(u'42.1')
    42.1
    >>> to_float(u'32,1', european=True)
    32.1
    >>> to_float('3,132.87')
    3132.87
    >>> to_float('3.132,87')
    3132.87
    >>> to_float('(54.12)')
    -54.12

    Warning
    -------

    Incorrectly declaring `european` leads to troublesome results:

    >>> to_float('54.2', european=True)
    542
    """
    import string
    if european:
        table = string.maketrans(',.','.,')
        number = string.translate(number, table)
    number = number.replace(',', '')
    if number.startswith('(') and number.endswith(')'):
        number = '-' + number[1:-1]
return float(number)

If you are dealing with numbers from another region consistently, it may be appropriate to call upon the locale module. You will then have the advantage of code written in C, rather than Python:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
>>> locale.atoi('1,000,000')
1000000

Stripping whitespace

Removing whitespace from a string is built into many languages string. Removing left and right whitespace is highly recommended. Your database will be unable to sort data properly which have inconsistent treatment of whitespace:

>>> u'\n\tTitle'.strip()
u'Title'

Converting dates to a machine-readable format

Python is well blessed with a mature date parser, dateutil. We can take advantage of this to make light work an otherwise error-prone task.

dateutil can be reluctant to raise exceptions to dates that it doesn’t understand. Therefore, it can be wise to store the original along with the parsed ISO formatted string. This can be used for manual checking if required later.

Example code:

def date_to_iso(datestring):
    """
    Takes a string of a human-readable date and
    returns a machine-readable date string.

    >>> date_to_iso('20 July 2002')
    '2002-07-20 00:00:00'
    >>> date_to_iso('June 3 2009 at 4am')
    '2009-06-03 04:00:00'
    """
    from dateutil import parser
    from datetime import datetime
    default = datetime(year=1, month=1, day=1)
    return str(parser.parse(datestring, default=default))

Source: http://schoolofdata.org/handbook/recipes/cleaning-data-scraped-from-the-web/

How to scrape search results from search engines like Google, Bing and Yahoo

Search giants like Google, Yahoo and Bing made their empire on scraping others content. However, they don’t want you to scrape them. How ironic, isn’t it?

Search engine performance is a very important metric all digital marketers want to measure and improve. I’m sure you will be using some great SEO tools to check how your keywords perform. All great SEO tool comes with a search keyword ranking feature. The tools will tell you how your keywords are performing in google, yahoo bing etc.

How will you get data from search engines If you want to build a keyword ranking app?

These search engines have API’s but the daily query limit is very low and not useful for the commercial purpose. The only solution is to scrape search results. Search engine giants obviously know this :). Once they know that you are scraping, they will block your IP, Period!

How do Search engines detect bots?

Here are the common methods of detection of bots.

* IP address: Search engines can detect if there are too many requests coming from a single IP. If a high amount of traffic is detected, they will throw a captcha.

* Search patterns: Search engines match traffic patterns to an existing set of patterns and if there is huge variation, they will classify this as a bot.

If you don’t have access to sophisticated technology, it is impossible to scrape search engines like google, Bing or Yahoo.

How to avoid detection

There are some things you can do to avoid detection.

    Scrape slowly and don’t try to squeeze everything at once.
    Switch user agents between queries
    Scrape randomly and don’t follow the same pattern
    Use intelligent IP rotations
    Clear Cookies after each IP change or disable them completely

Thanks for reading this blog post.

Source: http://blog.datahut.co/how-to-scrape-search-results-from-search-engines-like-google-bing-and-yahoo/

Jigsaw Data Scraping

Tuesday, 29 November 2016

Cleaning up Data Scraped from the Web

Tuesday, 15 November 2016

How to scrape search results from search engines like Google, Bing and Yahoo