Building a Web Scraper in 6 Easy Steps with Cursor and ChatGPT

I recently experimented with Cursor to build a web scraper. The goal was to scrape websites efficiently, extract meaningful content, and refine the process through iteration. Here’s how it unfolded:

Step 1: Basic Web Scraping

I started by asking Cursor:

"Write code to scrape a website."

The initial result worked, but it only retrieved links from the top-level site. I needed it to go deeper.
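
I no longer have Cursor's exact first draft, but it was roughly along these lines: fetch a single page and collect its links (the URL below is just a placeholder).

import requests
from bs4 import BeautifulSoup

def scrape_links(url):
    # Fetch the page and parse it
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect every href found on the top-level page
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(scrape_links("https://www.example.com"))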

Step 2: Expanding the Scraping Depth

I then asked:

"Can you modify the code to go 4 layers deep on every link?"

Cursor generated a depth-first recursive approach, which almost worked. However, at some point, it got stuck and stopped producing output.
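
A rough reconstruction of the shape of that approach (not Cursor's exact output): the same fetch-and-parse step, but each link is followed recursively until a maximum depth is reached.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(url, depth=0, max_depth=4):
    # Stop once we are max_depth layers past the starting page
    if depth > max_depth:
        return
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.find_all("a"):
        href = a.get("href")
        if href:
            # Follow every link one level deeper (depth-first)
            crawl(urljoin(url, href), depth + 1, max_depth)

Note that nothing here stops the crawl from revisiting pages it has already seen, which is one way a crawl like this can loop or appear to hang.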

Step 3: Debugging the Depth Issue

To diagnose the problem, I asked:

"Why does this stop working before reaching the max depth?"

Cursor made some adjustments, but they were actually worse than the previous version. So I reverted to the earlier working version and looked for another solution.
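
The guards that ended up in the final version (shown in full at the end of this post) are the usual fixes for a stalled crawl: track visited URLs so the recursion can't loop, catch request errors so one bad page doesn't kill the run, and pause between requests. A minimal sketch:

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()

def crawl(url, depth=0, max_depth=4, delay=1):
    # Skip anything too deep or already seen; revisits are a common cause of a "stuck" crawl
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return
    soup = BeautifulSoup(response.text, "html.parser")
    time.sleep(delay)  # be polite between requests
    for a in soup.find_all("a"):
        href = a.get("href")
        if href:
            crawl(urljoin(url, href), depth + 1, max_depth, delay)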

Step 4: Extracting Content from HTML

Switching to ChatGPT, I asked:

"Write a Python script to extract all <p> content from an HTML file."

This worked perfectly, but I realized I also needed <h1>, <h2>, and <h3> content.
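
Roughly what that script looked like (the file name is just a placeholder):

from bs4 import BeautifulSoup

def extract_paragraphs(html_path):
    # Read a saved HTML file and return the text of every <p> element
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

print(extract_paragraphs("page.html"))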

Step 5: Expanding Content Extraction

To improve the script, I asked:

"Add <h1>, <h2>, and <h3> text to the script."

I then integrated this improved script into the Cursor-based scraper, ensuring that extracted text was included as another field in the output.
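
The extended extraction loops over the tags of interest instead of hard-coding <p>; this is essentially what became the get_text method in the final scraper:

from bs4 import BeautifulSoup

def extract_text(soup):
    # Collect paragraph and heading text, keyed by tag name
    elements = {"p": [], "h1": [], "h2": [], "h3": []}
    for tag in elements:
        elements[tag] = [elem.get_text() for elem in soup.find_all(tag)]
    return elements

soup = BeautifulSoup("<h1>Title</h1><p>Body text</p>", "html.parser")
print(extract_text(soup))  # {'p': ['Body text'], 'h1': ['Title'], 'h2': [], 'h3': []}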

Step 6: Filtering by Keywords

Finally, I manually added a keyword variable to the script so that it only scraped pages whose URLs contained a specific keyword. This refinement made the scraper more targeted and efficient.
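
The filter itself is just a substring check on the URL, applied before a page is fetched; in the full code below it lives at the top of scrape_page. A small illustration (the example URLs are made up):

def should_scrape(url, keyword):
    # Only scrape pages whose URL contains the keyword
    return keyword in url

print(should_scrape("https://www.example.com/shoes/running", "shoes"))  # True
print(should_scrape("https://www.example.com/about", "shoes"))          # False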

The Final Product

After these six steps, I had a functional web scraper capable of:

- Crawling a site several links deep, up to a configurable maximum depth
- Extracting <p>, <h1>, <h2>, and <h3> text from every page it visits
- Skipping pages whose URLs don't contain the chosen keyword
- Saving a summary and the detailed results to timestamped CSV files

This iterative process using both Cursor and ChatGPT proved to be a quick and effective way to build a working web scraper. If you’re experimenting with web scraping, this approach can help you refine your own scripts efficiently!

PS: Here's the code:


import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
from urllib.parse import urljoin, urlparse
import time
from collections import defaultdict

class WebScraper:
    def __init__(self, max_depth=4, delay=1):
        self.max_depth = max_depth
        self.delay = delay  # Delay between requests in seconds
        self.visited_urls = set()
        self.results = defaultdict(list)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    def is_valid_url(self, url):
        """Check that the URL has a scheme and a network location."""
        try:
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except ValueError:
            return False

    def get_links(self, url, soup):
        """Extract and normalize links from the page."""
        links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                absolute_url = urljoin(url, href)
                if self.is_valid_url(absolute_url):
                    links.append((link.text.strip(), absolute_url))
        return links

    def get_text(self, url, soup):
        """Extract <p>, <h1>, <h2>, and <h3> text from the page."""
        elements = {"p": [], "h1": [], "h2": [], "h3": []}
        for tag in elements:
            elements[tag] = [elem.get_text() for elem in soup.find_all(tag)]
        return elements

    def scrape_page(self, url, keyword, depth=0):
        """Recursively scrape a page and its links, skipping URLs that don't contain the keyword."""
        if depth > self.max_depth or url in self.visited_urls or keyword not in url:
            if keyword not in url:
                print(f"'{keyword}' not in URL {url}, skipping")
            return

        try:
            print(f"Scraping {url} (depth {depth})")
            response = requests.get(url, headers=self.headers)
            response.raise_for_status()
            
            # Add to visited set before processing
            self.visited_urls.add(url)
            
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract links from the page
            links = self.get_links(url, soup)
            txt = self.get_text(url, soup)
            
            # Store the results
            self.results[depth].append({
                'url': url,
                'links': links,
                'text': txt
            })
            
            # Add delay between requests
            time.sleep(self.delay)
            
            # Recursively scrape each link
            for _, link_url in links:
                self.scrape_page(link_url, keyword, depth + 1)
                
        except requests.RequestException as e:
            print(f"Error scraping {url}: {e}")

    def save_results(self):
        """Save all scraped data to CSV files."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        
        # Save summary file
        with open(f'scraping_summary_{timestamp}.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Depth', 'URLs Scraped'])
            for depth in range(self.max_depth + 1):
                writer.writerow([depth, len(self.results[depth])])
        
        # Save detailed results
        with open(f'scraped_data_{timestamp}.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Depth', 'URL', 'Page Text', 'Link Text', 'Link URL'])
            for depth in range(self.max_depth + 1):
                for page_data in self.results[depth]:
                    url = page_data['url']
                    text = page_data['text']
                    for link_text, link_url in page_data['links']:
                        writer.writerow([depth, url, text, link_text, link_url])

def scrape_website(url, keyword, max_depth=4, delay=1):
    scraper = WebScraper(max_depth=max_depth, delay=delay)
    scraper.scrape_page(url,keyword)
    scraper.save_results()
    print(f"Scraping completed. Total URLs visited: {len(scraper.visited_urls)}")
    return True

if __name__ == "__main__":
    # Example usage
    target_url = "https://www.example.com"  # Replace with your target website
    scrape_website(target_url, "shoes", max_depth=2, delay=0.1)  # Replace "shoes" with your keyword