December 24, 2024 | 7 min read

Crawlee-Python: The Ultimate Guide to Web Scraping and Browser Automation

Published by @Merlio

Crawlee-Python is a robust, open-source library designed to simplify web scraping and browser automation tasks for developers. Bringing the popular features of its JavaScript counterpart to the Python community, Crawlee-Python empowers you to build scalable, efficient, and reliable scrapers with ease. This guide will explore its key features, installation process, advanced usage, and best practices to help you get started.

Table of Contents

  • Key Features of Crawlee-Python
  • Getting Started with Crawlee-Python
    • Installation
    • Creating Your First Crawler
  • Advanced Usage
    • Crawling Multiple URLs
    • Implementing Custom Logic
    • Handling Different Page Types
  • Best Practices and Tips
  • Conclusion
  • FAQs

Key Features of Crawlee-Python

Crawlee-Python stands out due to its feature-rich framework, offering tools to tackle both simple and complex web scraping projects:

  • Unified Interface: Seamlessly switch between HTTP and headless browser crawling with a consistent API (see the sketch after this list).
  • Automatic Parallel Crawling: Scale operations automatically based on system resources.
  • Type Hints: Enjoy enhanced development with full type hint coverage for better IDE support.
  • Automatic Retries: Built-in error handling and retry mechanisms improve reliability.
  • Proxy Rotation & Session Management: Integrated tools ensure robust scraping and avoid IP bans.
  • Pluggable Storage: Flexible options for storing tabular data and files.
  • Headless Browser Support: Out-of-the-box support using Playwright.
  • Asyncio-Based: Leverage efficient asynchronous operations with Python’s Asyncio.
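
For example, the same handler-based structure works whether you crawl over plain HTTP or with a headless browser. Below is a minimal sketch of the HTTP variant using BeautifulSoupCrawler; exact import paths can vary slightly between Crawlee versions:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    # HTTP-based crawler: pages are fetched over plain HTTP and parsed with BeautifulSoup
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        # context.soup is the parsed page; push_data works the same as with PlaywrightCrawler
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())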

Getting Started with Crawlee-Python

Installation

To begin, ensure you have Python 3.9 or higher installed. Install Crawlee using pip:

pip install crawlee

For additional features, use the following:

pip install 'crawlee[beautifulsoup]'  # For BeautifulSoupCrawler
pip install 'crawlee[playwright]'     # For PlaywrightCrawler

If you plan to use PlaywrightCrawler, install the Playwright browsers:

playwright install
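
If you only need a single browser, you can limit the download to, for example, Chromium:

playwright install chromium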

Creating Your First Crawler

Here’s how to create a simple crawler that extracts page titles and content:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        data = {
            'url': context.page.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:1000],  # First 1000 characters
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])
    await crawler.export_data('./result.csv')


if __name__ == '__main__':
    asyncio.run(main())

Advanced Usage

Crawling Multiple URLs

Expand your crawler to handle multiple starting URLs and advanced features:

async def main():
    crawler = PlaywrightCrawler(max_requests_per_crawl=100, headless=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        url = page.url
        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')
        links = await page.evaluate('() => Array.from(document.links).map(link => link.href)')

        await context.push_data({
            'url': url,
            'title': title,
            'content': content[:500],
            'links_found': links,
        })

        await context.enqueue_links()

    start_urls = [
        'https://example.com',
        'https://another-example.com',
    ]
    await crawler.run(start_urls)
    await crawler.export_data('./result.json')

Implementing Custom Logic

Filter pages and process data:

async def main():
    crawler = PlaywrightCrawler(max_requests_per_crawl=200)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page

        if 'blog' not in page.url:
            return

        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')
        date = 'Unknown'

        await context.push_data({
            'url': page.url,
            'title': title,
            'content_preview': ' '.join(content.split())[:1000],
            'date': date,
        })

    await crawler.run(['https://example-blog.com'])

Handling Different Page Types

Use the router's labeled handlers to treat different page types differently: enqueue links with a label, and they are routed to the matching handler:

async def main():
    crawler = PlaywrightCrawler(max_requests_per_crawl=100)

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext):
        # Send product detail pages to the dedicated handler below
        await context.enqueue_links(selector='a[href*="/product/"]', label='PRODUCT')

    @crawler.router.handler('PRODUCT')
    async def product_handler(context: PlaywrightCrawlingContext):
        product_name = await context.page.locator('.product-name').text_content()
        price = await context.page.locator('.price').text_content()
        await context.push_data({
            'type': 'product',
            'name': product_name,
            'price': price,
        })

    await crawler.run(['https://example-store.com'])

Best Practices and Tips

Respect Robots.txt: Check a site's robots.txt rules and terms of service before crawling it.

Use Delays: Throttle request rate and concurrency so you don't overwhelm the target server.
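
For example, Crawlee's concurrency settings can cap how fast the crawler goes. A minimal sketch follows; the limits are illustrative and the import path may vary slightly by version:

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler

# Illustrative limits; tune them for the site you are crawling
crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=5,         # at most 5 requests in flight at once
        max_tasks_per_minute=60,   # roughly one request per second overall
    ),
)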

Handle Errors Gracefully: Lean on Crawlee's built-in retries so one failed request doesn't stop the whole crawl.
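
Crawlee retries failed requests automatically; a small sketch of making it more persistent (the value is illustrative):

from crawlee.playwright_crawler import PlaywrightCrawler

# Retry each failing request up to 5 times before marking it as failed
crawler = PlaywrightCrawler(max_request_retries=5)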

Optimize Storage: Push only the fields you need and choose an export format that suits downstream processing.

Proxy Rotation: Route requests through multiple IPs to prevent bans.
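
A minimal sketch of wiring in your own proxies; the proxy URLs below are placeholders, and Crawlee rotates through the list automatically:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Placeholder proxy URLs; replace with your own
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
)

crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)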

Monitor Progress: Use logging to track what the crawler is doing and spot problems early.
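
Each request handler receives a context-scoped logger, so a simple sketch of progress tracking looks like this:

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext):
    # context.log is tied to the current request; Crawlee also prints
    # periodic crawl statistics while the run is in progress
    context.log.info(f'Processing {context.request.url}')
    await context.push_data({'url': context.request.url})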

Conclusion

Crawlee-Python is a feature-rich tool for web scraping and browser automation. Whether you’re a beginner or an experienced developer, this library offers flexibility and power to build efficient scrapers. By adhering to best practices and exploring its advanced features, you can create reliable and scalable solutions for your data extraction needs.

FAQs

1. What makes Crawlee-Python different from other scraping libraries?

Crawlee-Python offers a unified interface for HTTP and browser crawling, proxy rotation, and built-in error handling, making it versatile and user-friendly.

2. Is Crawlee-Python suitable for large-scale scraping?

Yes, with features like automatic parallel crawling, proxy management, and persistent queues, it is well-suited for large-scale projects.

3. Can I use Crawlee-Python for browser automation?

Absolutely! Its integration with Playwright allows robust browser automation for tasks beyond scraping.

4. How do I handle websites with anti-scraping measures?

Use proxy rotation, user-agent rotation, and implement delays between requests to reduce detection risk.

Happy scraping!