Saturday, 11 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for their poor security, Aruna Seneviratne, who leads NICTA's mobile systems research group within the organisation's Networks Research Group, told CSO Australia that many tracking libraries were inadvertently added when developers incorporated third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

And collect they do. Apps analysed by the team – whose paper 'Early Detection of Spam Mobile Apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information from users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural profiles.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.


Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.


“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application, which is now being developed into a product to be sold to large corporates for developer training and testing.





Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Friday, 26 June 2015

Data Scraping - Hand Scraped Hardwood Flooring Gives Your Home That Exclusive Look

Today, hand scraped hardwood flooring is becoming extremely popular in more opulent homes as well as in some commercial properties. Although this type of flooring has only recently become fashionable, it has been around for many centuries.

Certainly, before the invention of modern sanding techniques, all floors were hand scraped at the location where they were to be installed to ensure that the floor would be flat and even. Today, however, this method is used instead to provide texture, richness and a unique look and feel to the flooring.

Although manufacturers have produced machines which can provide a scraped look to their flooring, it looks cheap compared to the real thing. The main problem with using a machine to scrape the flooring is that it produces a uniform pattern in the wood, so it lacks the natural feel that you would see in a floor which has been scraped by hand.

When done by hand, scraping creates a truly unique look for the floor. However, the actual look and feel of each floor will vary, as it depends on the skills of the person carrying out the work. If there is no quality control in place while the work is being carried out, the finished product can end up looking disastrous.

Many manufacturers who provide hand scraped hardwood flooring will just dent, scoop or rough the floor up, while others use sanding techniques to create a worn and uneven look. The more professional teams will scrape the entire surface of the wood in order to create the unique handmade look for their customers.

Many companies will allow their customers to choose what type of scraping takes place on their wood: light, medium or heavy. The companies who are really good at hand scraping will be able to give the hardwood floor a reclaimed look by including wormholes, splits and other naturally occurring features of the wood.

If you do decide to choose hand scraped hardwood flooring, you will need to factor the associated costs into your budget. This type of flooring does not come cheap, and you can find yourself paying upwards of $15 per sq ft. But once installed, it will give a room a unique, warm and rich feel, and it is certainly going to wow your friends and family when they see it for the first time.

Source: http://ezinearticles.com/?Hand-Scraped-Hardwood-Flooring-Gives-Your-Home-That-Exclusive-Look&id=572577

Saturday, 20 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web, and I thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-). I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave off some potential comments: due to the way this table is set up and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output à la magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"  "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1024648/"  "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"  "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"


# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

## 2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract the latest news headlines. At the other end of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search-results pages, and finally following all of the "details" links within the search results to get to the data you're actually after. In cases of the former, a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
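
To make the discovery stage concrete, here is a minimal Python sketch using the requests library; the login URL, search URL and form-field names are all hypothetical stand-ins for whatever the target site actually uses:

import requests

# A session keeps cookies across requests, which is what makes
# logging in and then searching in later requests possible.
session = requests.Session()

# Step 1: log in (URL and form-field names are made up).
session.post("http://example.com/login",
             data={"user": "me", "password": "secret"})

# Step 2: submit the search form as a POST request.
results = session.post("http://example.com/search",
                       data={"q": "widgets", "page": "1"})

# Step 3 would follow the "details" links found in this results page;
# pulling those links out belongs to the extraction stage.
print(results.status_code)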

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
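
For instance, a single regular expression can pull link URLs and titles out of a chunk of HTML; a minimal Python sketch (real-world HTML is messier, which is exactly why tools hide the regex details):

import re

html = '<a href="/item/1">First item</a> <a href="/item/2">Second item</a>'

# One expression captures each link's URL and its title text.
for url, title in re.findall(r'<a href="([^"]+)">([^<]+)</a>', html):
    print(url, title)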

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.
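
The output stage is usually the simplest of the three; a short sketch writing extracted records to CSV with Python's standard csv module:

import csv

rows = [("/item/1", "First item"), ("/item/2", "Second item")]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])  # header row
    writer.writerows(rows)             # one row per extracted record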

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Wednesday, 3 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (à la Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and that I’ll get some ideas for improving it and handling post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

The script imports its three libraries up front; after that, there are two variables which can be changed to search for other terms, or to use a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as URL parameters, and I use Requests’ .text attribute to get the page’s HTML, which I then pass through lxml into the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload)  # This'll be the results page
html = r.text
root = lxml.html.fromstring(html)  # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

# Loop over each result heading; the link we want is the <a> inside it.
for i in root.cssselect("h1.entry-title"):
    link = i.cssselect("a")
    text = i.text_content()
    if link:
        data = {
            'uri': link[0].attrib['href'],
            'post-title': str(text),
            'search-term': str(term)
        }
        print link
        print text
        print data
        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)
    else:
        print "No results."

The results land in an SQLite database via the ScraperWiki library, and I end up with a database holding the title of, and link to, every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which a request is made to a website URL to get the HTML document text, and that text is then parsed to extract data from the HTML. Scraping a website for data is a general approach and can be implemented in any programming language, such as PHP, Java, C#, Python and many others.
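
That request-and-parse mechanism takes only a few lines in most languages; a minimal Python sketch using the requests and BeautifulSoup libraries (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

# Step 1: request the URL and receive the HTML document text.
html = requests.get("http://example.com/").text

# Step 2: parse that text and extract data from the HTML.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)          # the page title
for a in soup.find_all("a"):      # every link on the page
    print(a.get("href"))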

There is plenty of web scraping software available on the market with which you can extract data with no coding knowledge. In many cases, though, such tools don’t help, because the data requires a custom crawling flow; in that case you have to build your own web scraping application in a programming language you know. In this post I have collected scraping video tutorials for all the major programming languages.

I am mostly familiar with web scraping using PHP, C# and some other scraping tools, and I provide a web scraping service. If you have any scraping requirement, send me your requirements and I will get back to you with sample scraped data and my best price.

Web Scraping Using PHP

You can do web scraping in PHP using the cURL library and the Simple HTML DOM parsing library. The PHP function file_get_contents() can also be useful for making web requests. One drawback of scraping with PHP is that it can’t execute JavaScript, so AJAX-based pages can’t be scraped using PHP alone.

Web Scraping Using C#

There are many libraries available in .NET for HTML parsing and data scraping. I have used the WebBrowser control and the HTML Agility Pack for data extraction in .NET using C#.

I haven’t done web scraping in Java, Perl or Python myself, though I have learned web scraping in Node.js using the CasperJS and PhantomJS libraries. But I thought the tutorials below would be helpful for those who are Java or Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorials, please share the links in the comments so that other readers can benefit from them.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrape contact details, stats, lists, data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we help global businesses build their own databases and mailing lists, generate leads, and get access to the vast resources of unstructured data available on the World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories, and convert it into structured formats such as Excel, CSV, Access, text, or MySQL using automated and manual scraping technologies. Through our web data scraping services, we crawl websites and gather sales leads, competitors’ product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurants
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, universities, hospitals, medicine, property, restaurants, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunications and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing
•    Satisfied clients worldwide
•    The latest and most effective web scraping technologies
•    Timely delivery
•    Round-the-clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday, 25 May 2015

Which language is the most flexible for scraping websites?


I'm new to programming. I know a little Python and a little Objective-C, and I've been going through tutorials for each. Then it occurred to me: I need to know which language is more flexible (Python, Objective-C, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed to stuff the results into a database, and so on. I'm a C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday, and by Sunday morning I was harvesting about a million scores from a show horse site. I used Python and SQLite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server database that would back my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer: Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

Answer (5 votes)

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its parsing and string-manipulation functionality requires little code, is fast, and has a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do, or there's at least something along the same lines that needs very little refactoring.
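
To give a sense of how little code that can be, here is a minimal sketch using the requests and BeautifulSoup libraries mentioned in the question (the URL and tag choice are illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out every second-level heading in a few lines.
soup = BeautifulSoup(requests.get("http://example.com/").text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(headlines)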

Answer (1 vote)

I think it's safe to say that Python is a better place to start than Objective-C. Honestly, just about any language meets the "flexible" requirement. All you need is well-thought-out configuration parameters. Also, a dynamic language like Python can go a long way towards increasing flexibility, provided that you account for runtime type errors.

Answer (1 vote)

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Saturday, 23 May 2015

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job, with every other organization getting its feet wet with it for its own data gathering needs. There have been enough crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing it at a large scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This is still achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls, i.e. to “extract” data in a format that your DB can process and that aids in generating insights. There are two ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in a few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level metadata – Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle for just the document-level information from such crawls – the URL, meta keywords, blog or news title, author, date and article content – which is still enough information to be happy with if your requirement is analyzing the sentiment of the data.
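
To illustrate, a generic extractor of this kind touches only the fields almost every page shares; a minimal Python sketch with requests and BeautifulSoup, deliberately tuned to no particular site:

import requests
from bs4 import BeautifulSoup

def generic_extract(url):
    """Pull only the document-level fields almost every page shares."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    keywords = soup.find("meta", attrs={"name": "keywords"})
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "meta_keywords": keywords.get("content") if keywords else None,
        # Author, date and article body need per-site rules; the generic
        # fallback is the raw page text, surprises included.
        "text": soup.get_text(" ", strip=True),
    }

print(generic_extract("http://example.com/"))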


A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets, rendering them unusable. The reason is that programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article? Delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

•    minimal manual intervention
•    low on effort and time
•    can work on any scale

Cons-

•    Data quality compromised
•    inaccurate and incomplete datasets
•    lesser details suited only for high-level analyses
•    Suited for gathering- blogs, forums, news
•    Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/mid-scale crawls; detailed datasets – If precise extraction is the mandate, there’s no getting away from site-specific extractors. But realistically this is doable only if your scope of work is limited, i.e. a few hundred sites or less. Using site-specific extractors, you can extract any number of fields from any nook or corner of the web pages. Most of the time, most pages on a website share similar templates. If not, they can still be accommodated using site-specific extractors.
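
A site-specific extractor, by contrast, encodes the template of a single site; a sketch with hypothetical CSS selectors for an imaginary product page:

import requests
from bs4 import BeautifulSoup

# Hypothetical selectors: each target site needs its own set, and they
# must be updated whenever the site's template changes.
PRODUCT_SELECTORS = {
    "name":  "h1.product-title",
    "price": "span.price-now",
    "specs": "table.spec-sheet td",
}

def extract_product(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return {field: [el.get_text(strip=True) for el in soup.select(css)]
            for field, css in PRODUCT_SELECTORS.items()}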


Designing extractor for each website

Pros-

•    High data quality
•    Better data coverage on the site

Cons-

•    High on effort and time
•    Site structures keep changing from time to time, and maintaining these extractors requires a lot of monitoring and manual intervention
•    Only for limited scale
•    Suited for gathering – any data from any domain on any site, be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.
•    Uses – Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both kinds of extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (read our post on standard data formats here). However, given the internet’s penetration to the masses and the variety of things folks like to do on the web, that is overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday, 20 May 2015

The Features of the "Holographic Meridian Scraping Therapy"

1. Systematic nature: Brief introduction to the knowledge of viscera, meridians and points in traditional Chinese medicine, theory of holographic diagnosis and treatment; preliminary discussion of the treatment and health care mechanism of scraping therapy; systematic introduction to the concrete methods of the holographic meridian scraping therapy; enumerating a host of therapeutic methods of scraping for disorders in both Chinese and Western medicine to embody a combination of disease differentiation and syndrome differentiation; and summarizing the health care scraping methods. It is a practical handbook of gua sha.

2. Scientific: Applying the theories of Chinese and Western medicine to explain the health care and treatment mechanism and clinical applications of scraping therapy; introducing in detail the practical manipulations, items for attention, and indications and contraindications of the scraping therapy. Here are introduced representative diseases in different clinical departments, for which scraping therapy has a better curative effect, and the therapeutic methods of scraping for these diseases. Stress is placed on disease differentiation in Western medicine and syndrome differentiation in Chinese medicine, which should be combined in practical application.

Although there are more than 140,000 kinds of disease known to modern medicine, all diseases are related to dysfunction of the 14 meridians and internal organs, according to traditional Chinese medicine. The object of scraping therapy is to correct the disharmony in the meridians and internal organs to recover the normal bodily functions. Thus, the scraping of a set of meridian points can be used to treat many diseases. In the section on clinical application only about 100 kinds of common diseases are discussed, although the actual number is much greater than that. For easy reference the "Index of Diseases and Symptoms" is appended at the back of the book.

3. Practical: Using simple language and plenty of pictures and diagrams to guarantee that readers can easily learn, memorize and apply the principles of scraping therapy. As long as they master the methods explained in Chapter Three, readers without any medical knowledge can apply scraping therapy to themselves or others, with reference to the pictures in Chapters Four and Five. Besides scraping therapy, herbal treatment for each disease or syndrome is explained and may be used in combination with the scraping techniques.

Referring to the Holographic Meridian Hand Diagnosis and pictures at the back of the book will enhance accuracy of diagnosis and increase the effectiveness of scraping therapy.

Since the first publication and distribution of the Chinese edition of the book in July 1995, it has been welcomed by both medical specialists and lay people. In March 1996 this book was republished and adopted as a textbook by the School for Advanced Studies of Traditional Chinese Medicine affiliated to the Institute of Acupuncture and Moxibustion of the China Academy of Traditional Chinese Medicine.

In order to bring this health care method to more and more people and to make traditional Chinese medicine better appreciated, the authors have modified and replenished this book in the spirit of constant improvement. They hope that they may make a contribution to the health care of mankind with this natural therapy, which has no side-effects and causes no pollution.

They hope that the Holographic Meridian Scraping Therapy can help the health and happiness of more and more families in the world.

Source: http://ezinearticles.com/?The-Features-of-the-Holographic-Meridian-Scraping-Therapy&id=5005031

Sunday, 17 May 2015

Introducing ScrapeShield: Discover, Defend & Deter Content Scraping

If you're a publisher, whether an individual blogger or a major media outlet, you've undoubtedly experienced content scraping: you search the web for an article you've published or other original content you've created, and you find it copied and republished on some other random website. Often the site will be full of ads. And, sometimes, it will even rank higher in search results than your original work.

While you may envision an army of individuals copying and pasting your content on their sites, the truth is content scraping is typically an automated process with bots that grab original content and then republish it without human intervention onto link farm sites. CloudFlare has blocked many of these bots automatically in the past, but we decided it was time to do something to more actively stop them.

Introducing ScrapeShield

ScrapeShield is an app created by the CloudFlare team. It incorporates several existing CloudFlare features like email obfuscation and hotlink protection that serve to protect from content scraping and adds a number of new features as well. Because we believe every publisher of original content should be able to understand and control how their work is used, we're providing ScrapeShield free for every CloudFlare user.

Detect, Defend & Deter

ScrapeShield has different elements to help you detect when your content is scraped, defend your site against content scrapers, and even deter content scrapers from targeting you in the first place. If you enable ScrapeShield, CloudFlare will automatically insert invisible tracking beacons in your content. When automated bots scrape your content, they pull the beacons along with them. CloudFlare detects these beacons when they ping from sites that aren't your own. You can access your ScrapeShield control panel to see where your content is being republished. Not only is this useful in showing scraping, but you can also see users who are reading your content through proxy services like Flipboard or Pulse.

The data from the content beacons is fed back into CloudFlare's protection system. As CloudFlare identifies content scraping bots, we automatically prevent them from accessing your site. Just like Project Honey Pot, the original inspiration for CloudFlare, used traps to detect when spammers were harvesting email addresses, CloudFlare now uses data from ScrapeShield to identify content scrapers and keep them off publishers' sites.

Maze

We didn't want to just stop scrapers from attacking sites on CloudFlare, we also wanted to tie up their resources so they couldn't harm the rest of the web. To do this, we created Maze. Maze routes known content scrapers who are visiting ScrapeShield-protected sites into a virtual labyrinth of gibberish and gobbledygook. We dynamically throttle the bandwidth and speed so that instead of the pages loading as fast as possible, the connection is held open to the scrapers and their resources are tied up.

We use excess resources on the CloudFlare network to generate Maze, and it doesn't consume any of our publishers' resources or add any additional load to their sites. What's beautiful about the system is that the only way that content scrapers can be sure they're avoiding Maze is to avoid CloudFlare's IP addresses entirely. For any content scrapers who may be reading this, here's a helpful list of all of our IPs so you can make sure to stay away.

No Pinning

Finally, with the rise of sites like Pinterest, innocent content scraping may become even more prolific. While many sites welcome having their images pinned, we wanted to make it easy to opt out. ScrapeShield includes an option to add the no-pinning meta tag to your site to prevent your images from being pinned. As other similar services include a mechanism to opt out, expect that we'll include an easy way for you to do so right from the ScrapeShield interface.

The health of the web depends on publishers who create original content getting credit for their creations. CloudFlare is committed to building a better web, and we're extremely excited about ScrapeShield as a new tool to help publishers do exactly that.

Source: https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/

Wednesday, 13 May 2015

Web Scraping Services Are Important Tools For Knowledge

Data extraction and web scraping techniques are important tools for finding relevant data and information for personal or business use. Many companies and self-employed people copy and paste data from web pages by hand. That process is reliable, but it is also very expensive, because it wastes time and effort relative to the value of the data collected.

At present, several data-mining companies have developed effective web scraping techniques that can crawl thousands of pages of information. The extracted information can be delivered as a CSV file, database, XML file, or any other source in the required format. Understanding the correlations and patterns in the data means that policies can be designed to assist decision making. The information can also be stored for future reference.

The following are some common examples of the data extraction process:

Pulling the names of citizens registered for a given survey from a government portal

Scraping competitor websites for pricing and product data

Downloading videos and photos from stock or web-design sites

Automatic Data Collection

Automated data collection gathers data at regular intervals. Automated data collection techniques are very important because they help a company discover customer trends and market trends. By determining market trends, it is possible to understand customer behaviour and predict how the data is likely to change.

The following are some examples of automated data collection:

Monitoring stock rates on an hourly basis

Collecting daily mortgage rates from various financial institutions

Checking the weather on a regular basis

By using web scraping services, you can extract all the data related to your business. The data can then be downloaded to a spreadsheet or database for analysis and comparison. Storing the data in a database, or in another required format, makes it easier to interpret the correlations and to identify hidden patterns.

With data extraction services, it is possible to consistently gather pricing, email, database, and profile data, as well as information about competitors. Different techniques and processes for collecting and analyzing data have been designed and developed over time. Web scraping is one of the business processes that has recently taken the market by storm. It is a process that provides large amounts of data from various sources, such as websites and databases.

Some of the most common methods used for scraping include web crawling, text grepping, DOM parsing, and regular-expression matching. Beyond these, data can also be pulled from HTML pages through parsers or semantic annotations. There are many different ways of scraping data, but all of them work toward the same goal: the main purpose of using a web scraping service is to retrieve and compile data from databases and websites. In the business world, the aim is to remain relevant to the business process.

The central question is about the relevance of web scraping: is the process relevant to the business world? The answer is yes. The fact that it is used by large companies around the world speaks for itself.

Source: http://www.selfgrowth.com/articles/web-scraping-services-are-important-tools-for-knowledge

Friday, 1 May 2015

Customized Web Data Extraction Solutions for Business

As you begin leading your business on the path to success, competitive analysis forms a major part of your homework. You have already mobilized your efforts in finding the appropriate website data scraping tool that will help you collect relevant data from competitors' websites and shape it into usable information. There is, however, a need to look for a customized approach in your search for data extraction tools in order to leverage their benefits in the best possible way.

Off-the-shelf Tools Impede Data Extraction

In the current scenario, internet technologies are evolving in abundance. Every organization leverages this development and builds its website using a different programming language and technology. Off-the-shelf website data extraction tools are unable to interpret these differences. They fail to understand the data elements that need to be captured, and they end up gathering data without any change to their own source code.

As a result of this incapability in their technology, off-the-shelf solutions often deliver unclean, incomplete and inaccurate data. Developers then need to contribute a humongous effort to cleaning up and structuring the data to make it usable. Despite this time-consuming activity, the data seldom metamorphoses into the desired information, and the personnel dealing with the clean-up process need sufficient technical expertise to participate in the activity. The endeavour thus becomes an impediment to the whole process of data extraction, leaving you thirsting for the information required to augment business growth.

Understanding how Web Extraction tools work

Web scraping tools are designed to extract data from the web automatically. They are usually small pieces of code written in programming languages such as Python, Ruby or PHP, depending upon the expertise of the community building them. There are, however, several single-click models available which tend to make life easier for non-technical personnel.

The biggest challenge faced by a successful web extractor tool is knowing how to tackle the right page, and the right elements on that page, in order to extract the desired information. Consequently, a web extractor needs to be designed to understand the anatomy of a web page in order to accomplish its task successfully. It should be designed to interpret the meaning of HTML elements such as tables (<table>), table rows (<tr>) within those tables, and table data (<td>) cells within those rows, in order to extract the exact data. It also needs to interface with the paragraph (<p>) element, which encloses blocks of text, and know how to extract the desired information from it.
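
As a rough illustration of that kind of interpretation, a minimal Python sketch walking a table's rows and cells with BeautifulSoup (the HTML fragment is a stand-in):

from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget</td><td>$19.99</td></tr>
</table>
<p>Prices updated daily.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Walk each <tr> row of the <table>, then each <td> cell in that row.
for row in soup.table.find_all("tr"):
    print([td.get_text(strip=True) for td in row.find_all("td")])

# The <p> element encloses a block of text.
print(soup.p.get_text(strip=True))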

Customized Solutions for your business

Customized solutions are provided by most data scraping experts. These tools help to minimize the cumbersome effort of writing elaborate code to accomplish the feat of data extraction. They are designed to seamlessly search competitive websites, identify relevant data elements, and extract appropriate data that will be useful for your business. Owing to their focused approach, these tools provide clean and accurate data, thereby eliminating the need to waste valuable time and effort on any clean-up.

Most customized data extraction tools are also capable of delivering the extracted data in customized formats like XML or CSV. They can likewise store data in local databases like Microsoft Access, MySQL, or Microsoft SQL Server.

Customized Data scraping solutions therefore help you take accurate and informed decisions in order to define effective business strategies.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html 

Tuesday, 28 April 2015

Benefits of Scraping Data from Real Estate Website

With so much growth in the real estate industry in recent times, it is likely that companies will want to create something different or use another method to get the desired benefits. Thus, it is best to go with the technological advancements and create real estate websites to get an edge over others in the industry. And to get all the information regarding website content, one can opt for real estate data scraping methods.

About real estate website scraping

The internet has become an important part of our daily lives and of industry marketing procedures too. With website scraping, one can easily scrape real estate listings from various websites. One just needs the help of experts, and with proper software and tools they can collect all the relevant real estate data from the required websites and compile it into a structured file. With the internet becoming a platform for information and data submitted by numerous sources around the globe, it is necessary for companies to gather it all in one place. In this way, a company can see what it lacks and work upon its strategies so as to gain profit and get to the top of the business world by taking one step at a time.

Uses of real estate website scraping

With proper use of website scraping, one can collect and scrape the real estate listings that can help the company in the real estate market. One can draw the attention of potential customers by designing company strategies that contemplate the changing trends in the global real estate arena. All this is done with the help of the data collected from various real estate websites. The collected data gets updated whenever new information enters the web portal, so the company is kept aware of the various changes happening around the global market and can make its plans accordingly. This way one can plan ahead and take steps that can lead to the company gaining profits in future.

Thus, with the help of proper real estate website scraping one can be sure of getting all the information regarding real estate market. This way one can work upon making the company move as per the market trends and get a stronghold in real estate business.

Source: https://3idatascraping.wordpress.com/2013/09/25/benefit-of-scraping-data-from-real-estate-website/

Sunday, 26 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed, they want to dump corporate life and work for themselves out of their home. They decide they are going to create an online money making website. Therefore, they look around to see what everyone else is doing, and watch the methods others use to attract online buyers, and then they mimic their marketing, their strategies, and their business models.

Still, if you are copying what other people (less ethical people) are doing in online article marketing, those which are scraping the bottom of the barrel and using false advertising and misrepresentations, then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, things that are being done by people who are ethically challenged. Far too many people write articles and then, in their byline, send the internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it; rather, it asks for their name and e-mail address.

If the would-be Internet surfer is unwise enough to type in their name and email address they will be spammed by e-mail, receiving various hard-sell marketing pieces. Then, if the Internet Surfer does decide to put in their e-mail address, the website grants them access and then takes them to the page with information about what they are selling, or their online marketing "make you a millionaire" scheme.

Generally, these are five page sales letters, with tons of testimonials of people you've never heard of, and may not actually exist, and all sorts of unsubstantiated earnings claims of how much money you will make if you give them $39.35 by way of PayPal, for this limited offer "Now!" And they will send you an E-book with a strategic plan of how you can duplicate what they are doing. The reality is whatever they are doing is questionable to begin with.

If you are going to do online article marketing please don't scrape the bottom of the barrel, there's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have light-colored woodwork you should go for light scraping, and if your floor has a darker shade you should opt for hard scraping. But, irrespective of the type of scraping you go for, you must ensure that the laborers doing the scraping are very skilled and impeccable in their job, as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved. But such people are unable to achieve the requisite natural effect on their floors, as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping, and the consequent look is as random and unique as it gets.

Though hand scraped flooring is a costly option, it demands little maintenance. With perfectly smooth surfaces you will always be on edge guarding against scratches; with hand scraped floors you need not be concerned, as any new scratches only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays, eliminating the need for any on-site scraping. This option is of course unsuitable for those whose floors are already installed. In any case, on-site scraping gives you more control, since you can watch the scraping as it develops and are therefore better placed to exercise your preferences.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Saturday, 18 April 2015

What is HTML Scraping and how it works

There are many reasons why you might need to pull data or information from other sites, and the process usually begins by checking whether the site has an official API. What few people realize is that every website already exposes structured data automatically: its HTML. Pulling data straight from the HTML is referred to as HTML scraping, and it is an effective way of gleaning data and information from third-party websites.

Any webpage content that can be viewed can be scraped. If a website gives a visitor's browser any way to download content and render it in a structured manner, then that same content can be accessed programmatically.

Before diving into HTML scraping, inspect the page's network traffic in your browser's developer tools. Site owners have a few tricks up their sleeves to thwart this kind of access, but the majority of them can be worked around.

Before moving on to how HTML scraping works, we should understand why it is needed at all. If the data exists as an RSS feed, an API, or some other traditional structured form, use that first. The key insight is that, to their owners, websites matter more than APIs.

A site owner's priority is maintaining the website that visitors actually use, not safeguarding structured data feeds. This has played out publicly with Twitter, which clamped down on its developer ecosystem. API feeds often change or move without any prior warning. Sometimes that is deliberate, but more often such problems arise because no authority or organization maintains the structured data: if a feed gets severely mangled or goes offline, it is rarely even noticed. If the website itself has issues or stops working, by contrast, the ball is squarely in someone's court and it gets dealt with without losing any time.

Rate limiting is another consideration, and on public websites it virtually doesn't exist. Beyond the occasional sign-up page or captcha, most business websites build no defenses against unwarranted automated access. A single website can often be scraped for four hours straight without anyone noticing. Unless you make many concurrent requests, you are unlikely to be mistaken for a DDoS attack; in the logs you will look like just another avid visitor, and that is only if anyone is looking at all.

Another advantage of HTML scraping is that you can access a website anonymously. A website administrator has only a few ways to track behavior, which is helpful if you want to gather data privately. APIs usually require registration to obtain a key, and that key must be sent with every request. With plain, straightforward HTTP requests, the visitor stays anonymous apart from cookies and an IP address, both of which can be cleared or spoofed.

HTML scraping is universally available: there is no need to wait for a site to open up an API, or to contact anyone in the organization. You simply spend some time browsing the site at a leisurely pace until you find where the data you want lives, then work out the basic patterns for accessing it.

Now you put on the hat of a professional scraper and dive in. Initially it may take some time to figure out how the data is structured and how it can be accessed, much as you would read an API. Since there is no documentation, unlike with an API, you need to be a little smarter about it and use some clever tricks.

Some of the most useful tricks are:

Data Fetching

The first requirement is fetching the data. Start by finding the endpoints: the URLs that return the data you need. If you are sure what data you want and how it should be structured to match your requirements, identify the particular subset you need, then browse the site with its navigation tools to see where that data appears.

GET Parameter

Pay attention to the URLs and how they change as you click between sections and drill down into subsections. Another option is to go straight to the site's search functionality: type in a few terms and watch how the URL changes based on what you searched for. You will probably see a GET parameter, something like q, that changes with your search term. Remove the GET parameters that aren't being used from the URL until only the ones needed to load the data are left. The query string always begins with a "?". A minimal sketch of probing such an endpoint follows.
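
The sketch below uses R (the language of the rvest post elsewhere on this page) with the httr and rvest packages; the URL, the q parameter and the CSS selector are hypothetical placeholders rather than a real endpoint.

library(httr)   # for GET() with query parameters
library(rvest)  # for parsing the returned HTML

# Request the hypothetical search endpoint with only the parameter we need
resp <- GET("http://www.example.com/search", query = list(q = "lego"))

# Parse the HTML body and pull out the pieces we care about
page   <- read_html(content(resp, as = "text"))
titles <- html_text(html_nodes(page, ".result-title"))  # assumed selector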

By now you should be finding the data you want to access, but sometimes there are pagination issues to deal with, so a single page may not show the data in its entirety. Many APIs paginate for the same reason: to keep a single request from slamming the database. Often, clicking to the next page adds an offset parameter to the URL that controls which slice of data the page shows. Iterating over that parameter, as sketched below, completes the picture.
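
Here is a hedged sketch of that pagination loop, again in R; the offset parameter name, the page size of 30 and the selector are assumptions for illustration.

library(httr)
library(rvest)

all_rows <- list()
for (offset in seq(0, 90, by = 30)) {    # first four pages of 30 results each
  resp <- GET("http://www.example.com/search",
              query = list(q = "lego", offset = offset))
  page <- read_html(content(resp, as = "text"))
  rows <- html_text(html_nodes(page, ".result-title"))
  if (length(rows) == 0) break           # stop once a page comes back empty
  all_rows[[length(all_rows) + 1]] <- rows
  Sys.sleep(2)                           # a polite pause between requests
}
results <- unlist(all_rows)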

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Tuesday, 7 April 2015

rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)
lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
#>                                              X 1            NA
#> 1 this movie is very very deep and philosophical   mrdoctor524
#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm
#> 3                         Discouraging Building?       Laestig
#> 4                              LEGO - the plural      neil-476
#> 5                                 Academy Awards   browncoatjw
#> 6                    what was the funniest part? actionjacksin

Other important functions

•    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td").

•    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

•    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

•    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

Sunday, 5 April 2015

Some Traps to know and avoid in Web Scraping

In the present day and age, web scraping comes across as a handy tool in the right hands. In essence, web scraping means quickly crawling the web for specific information, using pre-written programs. Scraping efforts are designed to crawl and analyze the data of entire websites, saving the parts that are needed. Many industries have successfully used web scraping to create massive banks of relevant, actionable data that they use daily to further their business interests and provide better service to customers. This is the age of Big Data, and web scraping is one of the ways in which businesses can tap into this huge data repository and come up with relevant information that aids them in every way.

Web scraping, however, comes with its own share of problems and roadblocks. With every passing day, a growing number of websites are trying to actively minimize scraping and protect their own data amid today's immense competition. Several other complications can arise, and several traps can slow you down during your web scraping pursuits. Knowing these traps and how to avoid them is a great help if you want to accomplish your web scraping goals and get the data you require.

Complications in Web Scraping

Over time, various complications have arisen in the field of web scraping. Many websites have grown wary of data duplication and data security problems and have begun to protect their data in many ways. Some websites object to the moral and ethical implications of web scraping and do not want their content scraped. Website owners can set traps and roadblocks in many places to slow down or stop scraping activities. Major search engines also have systems in place to discourage scraping of search results. Last but not least, many websites and web services declare a blanket ban on web scraping in their terms and conditions, potentially leading to legal issues in the event of any scraping.

Here are some of the most common complications you might face during your web scraping efforts, and of which you should be particularly aware:

•    Some sites on the Internet discourage web scraping to prevent data duplication or data theft.

•    Many websites have in place a number of different traps to detect and ban web scraping tools and programs.

•    Certain websites make it clear in their terms and conditions that they consider web scraping an infringement of their privacy and might even consider legal redress.

•    In many places, simple measures are implemented to block non-human traffic, making it difficult for web scraping tools to keep collecting data at a fast pace.

To surmount these difficulties, you need a deeper and more insightful understanding of how web scraping works and of website owners' attitudes towards it. Most major issues can be circumvented or quietly avoided if you maintain good working practices during your web scraping efforts and understand the mentality of the people whose sites you are scraping.

Common Problems


With automated scraping you might face a number of common problems. The behavior of a web scraping program or spider presents a certain picture to the target website, which uses that behavior to distinguish between human users and scraping spiders. Depending on what it sees, a website may deploy particular traps to stop your efforts. Some of the commonly employed traps are:

Crawling Pattern Checks – Some websites detect scraping activities by analyzing crawling patterns. Web scraping robots follow a distinct crawling pattern which incorporates repetitive tasks like visiting links and copying content. By carefully analyzing these patterns, websites can determine that they are being caused by a web scraping robot and not a human user, and can take preventive measures.

Honeypots – Some websites have honeypots in their webpages to detect and block web scraping activities. These can take the form of links that are disguised so as not to be visible to human users. Since your web crawler program does not operate the way a human user does, it may try to scrape information from such a link. As a result, the website can detect the scraping effort and block the source IP address. One simple, partial countermeasure is sketched below.
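
This R sketch (using the rvest package) skips links hidden with an inline display:none style, one common honeypot pattern; real honeypots also use CSS classes and off-screen positioning, which this simple check will not catch, and the URL is a placeholder.

library(rvest)

page  <- read_html("http://www.example.com/listing")  # hypothetical page
links <- html_nodes(page, "a")
style <- html_attr(links, "style")

# Keep only links with no inline style, or whose style doesn't hide them
keep  <- is.na(style) | !grepl("display\\s*:\\s*none", style)
hrefs <- html_attr(links[keep], "href")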

Policies – Some websites make it absolutely apparent in their terms and conditions that they are averse to web scraping of their content. This can act as a deterrent and leaves you exposed to possible ethical and legal consequences.

Infinite Loops – Certain URL-building techniques can trick your web scraping program into visiting the same URL again and again; a visited-set guard, sketched below, is the usual defense.
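
Here is a minimal R sketch of such a guard, with a deliberately simple URL normaliser; real crawlers normalise more aggressively (query-parameter ordering, default ports and so on).

visited <- new.env()  # an environment works as a fast lookup set

normalise <- function(url) {
  url <- sub("#.*$", "", url)      # drop fragment identifiers
  sub("/$", "", tolower(url))      # drop trailing slash, lower-case
}

should_visit <- function(url) {
  key <- normalise(url)
  if (exists(key, envir = visited)) return(FALSE)  # already crawled
  assign(key, TRUE, envir = visited)
  TRUE
}

should_visit("http://example.com/a/")     # TRUE on first sight
should_visit("http://example.com/a#top")  # FALSE: same page once normalised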

These traps can be detrimental to your efforts, and you need innovative and effective ways to get past them. Learning some crawler tips for avoiding traps, and applying them judiciously, is a great way to make sure your web scraping requirements are met without hassle.

What you can do

The first and foremost rule of thumb about web scraping is that you have to make your efforts as inconspicuous as possible. This way you will not arouse suspicion and hostility from your target websites. To this end, you need a well-designed web scraping program with a human touch. Such a program can operate in flexible ways so as not to alert website owners through the usual traffic criteria used to spot scraping tools.


Some of the measures you can implement to steer clear of common web scraping traps are:

•    The first thing you need to do is ascertain whether a particular website you are trying to scrape harbors any dislike of web scraping tools. If you see any indication of this in their terms and conditions, tread cautiously, and stop scraping the website if you receive any notification of disapproval. Being polite and honest can help you get away with a lot.

•    Try to minimize the load on every website you visit for scraping. Putting a high load on a website can alert it to your intentions and often provokes a hostile response. There are many techniques for decreasing the overall load on a particular website:

•    Start by caching the pages that you have already crawled to ensure that you do not have to load them again.

•    Also store the URLs of crawled pages.

•    Take things slow and do not flood the website with multiple parallel requests that put a strain on their resources.

•    Handle your scraping in gentle phases and take only the content you require.

•    Your scraping spider should be able to diversify its actions, change its crawling pattern and present a polymorphic front to websites, so as not to cause an alarm and put them on the defensive.

•    Arrive at an optimum crawling speed, so as not to tax the resources and bandwidth of the target website. Use auto-throttling mechanisms to optimize web traffic and put random breaks between page requests, with the lowest number of concurrent requests that you can work with.

•    Use multiple IP addresses for your scraping efforts, or take advantage of proxy servers and VPN services. This will help to minimize the danger of getting trapped and blacklisted by a website.

•    Be prepared to respect the express wishes and policies of a website regarding web scraping by taking a good look at its 'robots.txt' file. This file contains clear instructions on exactly which pages you are allowed to crawl and the requisite intervals between page requests. It might also require a predetermined user-agent identification string that classifies you as a scraping bot. Adhering to these instructions minimizes the chance of getting on the bad side of website owners and risking bans. A minimal sketch combining a robots.txt check with polite throttling follows this list.
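
As a minimal sketch of the last few points in R: fetch robots.txt, note the Disallow rules, identify yourself, and pause randomly between requests. A hand-rolled check like this only reads plain Disallow lines; the real format is richer (per-agent groups, wildcards), so treat this strictly as an illustration, and the bot name is invented.

library(httr)

robots <- strsplit(content(GET("http://www.example.com/robots.txt"),
                           as = "text"), "\n")[[1]]
disallowed <- sub("Disallow:\\s*", "",
                  grep("^Disallow:", robots, value = TRUE))

polite_get <- function(url) {
  Sys.sleep(runif(1, 2, 5))  # random 2-5 second pause between requests
  GET(url, user_agent("example-research-bot/0.1"))  # declare who you are
}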

Use an advanced web scraping tool that can store and check data, URLs and patterns. Whether your web scraping needs are confined to one domain or spread over many, you need to appreciate that many website owners do not take kindly to scraping. The trick is to maintain industry best practices while extracting data from websites. This prevents misunderstandings and keeps a clear pathway open to most of the data sources you want to leverage.

We hope this article helps you understand the traps and roadblocks you might face during your web scraping endeavors, and helps you figure out smart, sensible ways to work around them so that your experience remains smooth and the information keeps flowing. Following these basic guidelines will help you avoid getting banned or blacklisted, stay in the good books of website owners, and continue your web scraping activities unencumbered.

Source: https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/

Friday, 27 March 2015

Scraping expert's Amazon Scraper provides huge access to find your desired product on Amazon

Today, with the latest advances in technology, we find plenty of ecommerce websites offering huge benefits to people, selling products from many different categories at affordable cost. A new tool, Amazon Scraper, has been launched for one of the most renowned ecommerce websites, Amazon, for the comfort of its customers. Amazon Scraper is a web harvesting (web scraping) tool: a software technique for extracting data from websites.

Today you can find web scraping tools designed specifically for particular websites. Amazon Scraper, for example, is a tool used to crawl and extract data from the largest ecommerce website, Amazon.com. Scrapingexpert.com offers one of the best Amazon scrapers for easily extracting data on plenty of products.

Let us see how the Amazon Scraper works:


How to use:
Step 1) Select the category, then enter your keyword, UPC or ASIN.
Step 2) Set the delay in seconds.
Step 3) Click Start.

You can also scrape the following details from Amazon.com:

  •     Product Title & Description
  •     Category & Cost
  •     Manufacturer
  •     Qty
  •     Seller Name & Total Sellers
  •     Shipping Cost & Shipping/Product Weight
  •     ImageURL, IsBuyBoxFBA, Source Link
  •     Stars, Customer Reviews
  •     ASIN, UPC, Model Number
  •     Sales Rank & Sales Rank in Category

Here are some interesting product features:
  •     Single-screen dashboard showing total extracted records, extracted keywords and elapsed time
  •     Filter search: skip data that do not match your phrases or keywords
  •     Compatible with Microsoft Windows XP/Vista/7
  •     Option to set a delay between requests to simulate a human surfing in a browser
  •     Extracted data is stored in CSV format, which you can easily open in Excel

Benefits:
  •     Less expensive: our services save you both effort and money; some of our competitors even outsource their scraping projects to us.
  •     Guaranteed accurate results: we assure you of reliable solutions, with accurate results that could not be gathered by hand.
  •     Fast results: we promise to get your work done in just a few hours, work that would take anyone else far longer, saving you time, workforce and money and letting you stand out from your many competitors.

System requirements: Windows XP, Windows Vista or Windows 7, with .NET Framework 2.0.

Are you searching for a cost-effective way to extract data? If so, we offer Amazon Screen Scraping, one of the best methods of data extraction. In today's competitive world of advanced technology, many companies claim to offer the best Amazon screen scraping services, allowing you to scrape almost any data in any format you wish to obtain. At Scrapingexpert.com we study every last detail of your scraping project and then provide a free quote and a completion date for the work.

To get accurate data on a specific product, you can use our Amazon Scraper tool. It is a very effective tool that will help you extract information about any product on Amazon.


Wednesday, 25 March 2015

Web Data Extraction

The Internet as we know it today is a repository of information that can be accessed across geographical boundaries. In just over two decades, the Web has moved from a university curiosity to a fundamental research, marketing and communications vehicle that impinges upon the everyday life of most people all over the world. It is accessed by over 16% of the world's population, spanning over 233 countries.

As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Compounding the matter, this information is spread over billions of Web pages, each with its own independent structure and format. So how do you find the information you're looking for, in a useful format, quickly and easily, without breaking the bank?

Search Isn't Enough

Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes. For all the power of Google and its kin, all that search engines can do is locate information and point to it. They go only two or three levels deep into a Web site to find information and then return URLs. Search engines cannot retrieve information from the deep web (information that is available only after filling in some sort of registration form and logging in), nor can they store it in a desirable format. After using a search engine to locate data, you still have to do the following tasks to capture the information you need in a desirable format or a particular application:

• Scan the content until you find the information.

• Mark the information (usually by highlighting with a mouse).

• Switch to another application (such as a spreadsheet, database or word processor).

• Paste the information into that application.

It's not all copy and paste


Consider the scenario of a company looking to build an email marketing list of over 100,000 names and email addresses from a public group. Even if a person manages to copy and paste each name and email in one second, that is 100,000 seconds, nearly 28 man-hours of work, translating to over $500 in wages alone, not to mention the other costs associated with it. The time involved in copying a record is directly proportional to the number of fields that have to be copied and pasted.

Is there an alternative to copy-paste?


A better solution, especially for companies aiming to exploit the broad swath of data about markets or competitors available on the Internet, lies in the use of custom Web harvesting software and tools.

Web harvesting software automatically extracts information from the Web, picking up where search engines leave off and doing the work they can't. Extraction tools automate the reading, copying and pasting necessary to collect information for further use. The software mimics human interaction with the website, gathering data as if the site were being browsed, but it navigates, locates, filters and copies the required data at much higher speeds than is humanly possible. Advanced software can even browse the website and gather data silently, without leaving footprints of access. A minimal sketch of what this automation looks like follows.
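
Using the rvest package in R, the mailing-list scenario above collapses to a few lines: pull name/email pairs from a hypothetical public directory page and write them straight to a spreadsheet-ready CSV. The URL and CSS selectors are invented for illustration.

library(rvest)

page          <- read_html("http://www.example.com/members")  # hypothetical
member_names  <- html_text(html_nodes(page, ".member .name"))
member_emails <- html_text(html_nodes(page, ".member .email"))

# Paste-free: the whole list lands in one file, ready for a spreadsheet
write.csv(data.frame(name = member_names, email = member_emails),
          "members.csv", row.names = FALSE)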

The next article in this series will give more details about how such software works and uncover some myths about web harvesting.

Source: http://ezinearticles.com/?Web-Data-Extraction&id=575212

Wednesday, 18 March 2015

Why Online Coverage Matters

Tracking online coverage is not just a fad these days. It is a tool anyone can use to maintain a good image; monitor responses from guests or clients; and sell ideas, products or services. It is one of the innovations of the computer age that has made online presence a strategic avenue to success in business, research, and virtually any other venture.

The benefits that online tracking brings can never be measured in money, because it is itself an investment that no one can steal from you. You control it, and you can make good things happen for you and your business beyond your expectations.

Maintain a Good Image


A good image is something everyone works hard to achieve, yet one simple error of judgment or one piece of destructive criticism can ruin it in a split second. Just as you put effort into looking good and attractive in person, you should also invest time and energy in making your online presence appealing and pleasing to all viewers regardless of age, nationality, and preferences.

Maintaining an online presence can be done by the owner personally, if the company is small enough to handle single-handedly, or you may need an expert to do it for you at a reasonable cost. If you do not know how to do it yourself, you can read instructional articles and blogs, or research providers who can do the job for you efficiently and effectively.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/online-coverage-matters/

Monday, 16 March 2015

The New Gold Rush: Exploring the Untapped ‘Data Mining’ Reserves of Top 3 Industries

In a bid to reach new moms bang on time, Target knows when you'll get pregnant. Microsoft knows the Return on Investment (ROI) of each of its employees. Pandora knows your current music mood. Amazing, isn't it?

Call it the stereotype of mathematician nerds or the Holy Grail of modern-day predictive analysts: data mining is the new gold rush for many industries.

Today, companies are mining data to predict the exact actions of their prospective customers. When a huge body of customer data is run through a sophisticated, systematic data mining process, it can help create future-ready marketing and buying messages, reducing the scope for error and maximizing customer loyalty.

A progressive team of coders and statisticians can also push the envelope of marketing and business tactics through empowering data collection and mining practices.

Below is a detailed low-down of three such industries (real estate, retail and automobile) where LoginWorks Software has employed some of the most talented predictive analysts and comprehensive behavioral marketing platforms in the industry. Let's take a look.

Real Estate Industry Looks Past the Spray-And-Pray Marketing Tactic By Mining User Data.

A supremely competitive and somewhat unstructured market, the real estate industry stands to reap the benefits of data mining. And we at LoginWorks Software understand this extremely well!

Our robust team of knowledge-driven analysts predicts future trends, processes historical data and ranks areas using actionable predictive analytics techniques. Applying a long-term strategy to analyze trends and identify the influential factors in property buying, our data warehouse team excels at classical techniques such as neural networks, C&R trees, linear regression, the multilayer perceptron and SPSS in order to uncover hidden knowledge. A toy illustration of the regression piece follows.
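
As a toy illustration of the linear regression piece, here is an R sketch that fits sale price on a few invented property features and ranks new listings by predicted price; all data and column names are made up.

# Invented historical sales
sold <- data.frame(
  sale_price      = c(500, 620, 410, 705, 455, 580) * 1000,
  floor_area      = c(120, 150, 95, 180, 105, 140),
  bedrooms        = c(3, 4, 2, 4, 3, 3),
  distance_to_cbd = c(12, 8, 20, 5, 16, 9))

model <- lm(sale_price ~ floor_area + bedrooms + distance_to_cbd, data = sold)

# Invented unsold listings, ranked by predicted sale price
listings <- data.frame(floor_area      = c(110, 160, 90),
                       bedrooms        = c(3, 4, 2),
                       distance_to_cbd = c(10, 6, 18))
listings$predicted_price <- predict(model, newdata = listings)
listings[order(-listings$predicted_price), ]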

Using Big Data as the bedrock of our Predictive Marketing Platform, we help you zero in on the best possible property for your interests. We draw on more than a dozen reliable national and international sources to give you the most accurate, up-to-the-minute data. From extracting a refined database of neighborhood insights to classic knowledge discovery techniques, our statisticians have proven accuracy. We scientifically model your data by:

•    Understanding powerful insights that lead to property-buying decisions.

•    Studying properties and ranking them city-wise, based on how likely they are to sell in the future.

•    Measuring trends at micro level by making use of Home Price Index, Market Strength Indicator, Automated Valuation Model and Investment analytics.

Our marketing platform consists of the automated features shown in the image below.

Image source: http://www.smartzip.com/corp/wp-content/uploads/2013/07/marketing-automation1.png

Data Mining Techniques for Customer Relationship Management and Customer Retention in Retail Industry

Data mining is to a retailer what gold mining is to a goldsmith: priceless, to say the least. To understand the dynamics and patterns of customer habits, a retailer is always scouting for information to lift sales and generate future leads from existing and prospective consumers; hence everything from sourcing your birth date from your social media profiles to zooming in on your buying behaviour in different seasons.

For a retailer, data mining transforms point-of-sale customer information into a detailed understanding of (1) customer identification, (2) customer attraction, (3) customer retention and (4) customer development. A retailer can realize these benefits, and improve the Return on Investment (ROI) of its customers, by:

•    Gaining customer loyalty and long-term association

•    Saving up on huge spend on non-targeted advertising and marketing costs

•    Accessing customer information, which leads to directly targeting the profitable customers

•    Extending product life cycle

•    Uncovering predictable buying patterns that lead to a decrease in spoilage, distribution costs and holding costs

Our specialised marketing team targets customers for retention by applying data mining techniques at many levels, from both technological and statistical perspectives. We primarily use the 'basket' analysis technique, which unearths links between distinct products, and 'visual' mining techniques, which harness the power of instant visual association in buying. A small sketch of basket analysis follows.
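
As a toy illustration of basket analysis, here is an R sketch using the arules package, which mines association rules of the form "customers who buy X tend to buy Y"; the transactions and thresholds are invented.

library(arules)

baskets <- list(c("bread", "butter", "jam"),
                c("bread", "butter"),
                c("beer", "chips"),
                c("bread", "jam"))

trans <- as(baskets, "transactions")  # encode the toy baskets
rules <- apriori(trans,
                 parameter = list(supp = 0.25, conf = 0.6, minlen = 2))
inspect(rules)                        # e.g. {butter} => {bread}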

Role of Data Mining in Retail Sector

Image Source: http://www.coolavenues.com/sites/default/files/Data%20Mining%201.jpg

Spinning the Magic Wheel of Data Mining Algorithms in Automobile Industry of Today

Often called the 'industry of industries', the automobile industry of today is busy constructing new plants and extracting more production from existing ones. Like food manufacturers and drug companies, automakers urgently need sophisticated data extraction processes to guard against exorbitantly expensive and reputation-damaging incidents. If analysis by Teradata Corp, a data analytics company, is to be believed, the "auto industry spends $45 billion to $50 billion a year on recalls and warranty claims". A potentially damaging number for the automobile industry at large, we reckon!

Hence, it becomes all the more imperative for a reputable automobile company to make use of enhanced data mining algorithms.

Our analysts will help you spot insightful patterns, trends, rules and relationships in volumes of information that are next to impossible for the human eye to trace or process. Our technicians understand that an automotive manufacturer does not interact one-to-one with end consumers, so we step into the picture and use our fully integrated data mining capabilities to help you with:

•    Supply chain procedure (pre-sales and post-sales services, inventory, orders, production plan).

•    Full A-to-Z marketing facts and figures (dealers, business centers, social media handling, direct marketing tactics, etc.).

•    Manufacturing detailing (car configurations/packages/options codes and description).

•    Customers' inclination information (website activity).

Impact of Big Data Analytics on Direct Vehicle Pricing

Image Source: http://www.slideshare.net/FrostandSullivan/impact-of-big-data-on-automotive-industry

Bottom line

To wrap it all up, customer data is just as crucial for actionable insights as your regular listings data. Behavioural data and predictive analysis are where the real deal lies, because at the end of the day it is all about targeting the right audience in the right context!

Move forward in your industry by availing yourself of LoginWorks Software's comprehensive, integrated, strategic and sophisticated data mining services.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/can-identify-buying-preferences-customers-using-data-mining-techniques/