Defining a scraper

Defining a scraper is similar to defining a model in django. In my example I'll use github. Github has an API which would be far more suitable for any use, but it's a well-known site, which helps understanding the examples.

Let's start by defining GitHub's repository page

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/python/cpython/"

    description = Css(".repository-meta-content")

page = GithubProjectPage()
print(page.description)
# will output the description
print(page._dict)
# will output {"description": "<whatever the description is>"}

That's nice and all, but we don't just want to address the project page for the cpython mirror, but just any project page. We can do this by adding string formatting parameters to scrape_url.

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"

    description = Css(".repository-meta-content")

page = GithubProjectPage(username="python", projectname="cpython")
print(page.description)
# will output the description

You can avoid using keyword arguments by defining scrape_args like this:

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    description = Css(".repository-meta-content")

page = GithubProjectPage("python", "cpython")
print(page.description)
# will output the description

Cleaning up the data

Now when you run the previous example, you may notice that the description is padded with a lot of whitespace. We really don't want that, so we can pass in a cleanup function with the cleanup= keyword argument. Its signature is cleanup(extracted_data). In this example I'll use a lambda.

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    description = Css(".repository-meta-content", 
                      cleanup=lambda value: value.strip())

page = GithubProjectPage("python", "cpython")
print(page.description)
# will output the description

By default, data is extracted by taking the text contents of the element. Sometimes, however, the data you need is in an attribute. In that case, you can provide the attribute= keyword argument:

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    git_repo = Css("input.input-monospace", attribute="value")

page = GithubProjectPage("python", "cpython")
print(page.description)

If the data you're after is even more complicated (e.g. a combination of elements), you may want to perform the extraction yourself, by providing an extractor function with the extract= argument. Its signature is extract(element). The extracted data will be passed into the cleanup chain unmodified, which means you're not limited to strings.

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    git_repo = @Css("input.input-monospace",
                    extract=lambda elem: {"repo": elem.get("value")})

page = GithubProjectPage("python", "cpython")
print(page.description)

While lambda's are nice for simple conversions, sometimes you'll need to do something more complicated. A lambda would be too cramped fo that. In that case, it may be useful to declare the cleanup function using the decorator syntax. The signature of the decorated function is attributename(extracted_data, element).

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    @Css(".repository-meta-content")
    def description(self, value, element):
        return value.strip()

page = GithubProjectPage("python", "cpython")
print(page.description)
# will output the description

For some common datatypes, there are special Css selectors: CssInt, CssFloat, CssDate , CssRaw (for raw html) and CssBoolean (testing whether some selector is present).

List data

Normaly when scraping, only the first matching element is used, but sometimes you'll want to go over lists of things. To do so, specify the multiple argument. In this example contents will produce the names of all the root directories in the project.

from livescrape import ScrapedPage, Css

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    contents = Css('.js-directory-link', multiple=True)

Note that cleanup code runs per list item, not on the list as a whole.

Tabular data

If you need more than one datum per list item, you will need to use CSSGroup. You can provide the additional selectors by assigning them to attributes of the group. It will produce an object for each of the table contents.

from livescrape import ScrapedPage, Css, CssGroup

class GithubProjectPage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    table_contents = CssGroup('tr.js-navigation-item', multiple=True)

    table_contents.name = Css("td.content a")
    table_contents.message = Css("td.message a")
    table_contents.age = Css("td.age time", attribute="datetime")

Note that cleanup code runs per list item, not on the list as a whole.

Links

Websites typically have links, which you'll want to follow. The CssLink selector helps you by allowing you to specify what ScrapedPage should handle the target of that link. In the following example, we're reusing one of the GithubProjectPage definitions above.

from scrape import ScrapedPage, CssLink

class GithubOveriew(ScrapedPage):
    scrape_url = "https://github.com/%(username)s"
    scrape_args = ("username")

    repos = CssLink(".repo-list-name a", GithubProjectPage, multiple=True)

You could now type GithubOverview("python").repos[0].description to retrieve the description of the first repository on the overview page.