I recently experimented with Cursor to build a web scraper. The goal was to scrape websites efficiently, extract meaningful content, and refine the process through iteration. Here’s how it unfolded:
I started by asking Cursor:
"Write code to scrape a website."
The initial result worked, but it only retrieved links from the top-level site. I needed it to go deeper.
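For reference, a first pass like that boils down to something close to this (a minimal sketch, not Cursor's exact output; the example URL is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_links(url):
    """Fetch one page and return the absolute URLs of all links on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Resolve relative hrefs against the page's own URL
    return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]

if __name__ == "__main__":
    print(scrape_links("https://www.example.com"))  # placeholder URL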
I then asked:
"Can you modify the code to go 4 layers deep on every link?"
Cursor generated a depth-first recursive approach, which almost worked. However, at some point, it got stuck and stopped producing output.
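Conceptually, the depth-limited, depth-first version looks roughly like this (a simplified sketch of the idea, reusing the scrape_links helper from the sketch above, not Cursor's actual code):

def crawl(url, depth=0, max_depth=4):
    """Naive depth-first crawl: scrape a page, then recurse into each of its links."""
    if depth > max_depth:
        return
    print(f"Scraping {url} (depth {depth})")
    for link in scrape_links(url):
        crawl(link, depth + 1, max_depth)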
To diagnose the problem, I asked:
"Why does this stop working before reaching the max depth?"
Cursor made some adjustments, but they were actually worse than the previous version, so I reverted to the earlier working code and looked for another solution.
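In hindsight, a naive crawler like the sketch above can easily hang or die long before max depth: it can revisit the same URLs over and over, and a single failed request kills the recursion. The final script below guards against both with a visited set and error handling; in sketch form (same helper and imports as above):

def crawl(url, depth=0, max_depth=4, visited=None):
    """Depth-first crawl with a visited set and basic error handling."""
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    try:
        links = scrape_links(url)  # may raise on timeouts, 404s, etc.
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
        return
    for link in links:
        crawl(link, depth + 1, max_depth, visited)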
Switching to ChatGPT, I asked:
"Write a Python script to extract all <p> content from an HTML file."
This worked perfectly, but I realized I also needed <h1>, <h2>, and <h3> content.
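That script was essentially this (a minimal sketch; the filename is a placeholder):

from bs4 import BeautifulSoup

def extract_paragraphs(html_path):
    """Return the text of every <p> element in an HTML file."""
    with open(html_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    return [p.get_text(strip=True) for p in soup.find_all('p')]

print(extract_paragraphs("page.html"))  # placeholder filename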
To improve the script, I asked:
"Add <h1>, <h2>, and <h3> text to the script."
I then integrated this improved script into the Cursor-based scraper, ensuring that extracted text was included as another field in the output.
Finally, I manually added a keyword variable to the script so that it only scraped pages whose URLs matched a specific keyword. This refinement made the scraper more targeted and efficient.
After these six steps, I had a functional web scraper capable of crawling a site several links deep, extracting paragraph and heading text from each page, filtering pages by a URL keyword, and saving the results to CSV files.
This iterative process using both Cursor and ChatGPT proved to be a quick and effective way to build a working web scraper. If you’re experimenting with web scraping, this approach can help you refine your own scripts efficiently! Here’s the full script:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
from urllib.parse import urljoin, urlparse
import time
from collections import defaultdict


class WebScraper:
    def __init__(self, max_depth=4, delay=1):
        self.max_depth = max_depth
        self.delay = delay  # Delay between requests in seconds
        self.visited_urls = set()
        self.results = defaultdict(list)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    def is_valid_url(self, url):
        """Check if URL is valid and belongs to the same domain."""
        try:
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except:
            return False

    def get_links(self, url, soup):
        """Extract and normalize links from the page."""
        links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                absolute_url = urljoin(url, href)
                if self.is_valid_url(absolute_url):
                    links.append((link.text.strip(), absolute_url))
        return links

    def get_text(self, url, soup):
        """Extract text from the page."""
        elements = {"p": [], "h1": [], "h2": [], "h3": []}
        for tag in elements.keys():
            elements[tag] = [elem.get_text() for elem in soup.find_all(tag)]
        return elements

    def scrape_page(self, url, keyword, depth=0):
        """Recursively scrape a page and its links."""
        if depth > self.max_depth or url in self.visited_urls or keyword not in url:
            if keyword not in url:
                print(f"{keyword} not in url {url}")
            return
        try:
            print(f"Scraping {url} (depth {depth})")
            response = requests.get(url, headers=self.headers)
            response.raise_for_status()

            # Add to visited set before processing
            self.visited_urls.add(url)

            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract links and text from the page
            links = self.get_links(url, soup)
            txt = self.get_text(url, soup)

            # Store the results
            self.results[depth].append({
                'url': url,
                'links': links,
                'text': txt
            })

            # Add delay between requests
            time.sleep(self.delay)

            # Recursively scrape each link
            for _, link_url in links:
                self.scrape_page(link_url, keyword, depth + 1)
        except requests.RequestException as e:
            print(f"Error scraping {url}: {e}")

    def save_results(self):
        """Save all scraped data to CSV files."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # Save summary file
        with open(f'scraping_summary_{timestamp}.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Depth', 'URLs Scraped'])
            for depth in range(self.max_depth + 1):
                writer.writerow([depth, len(self.results[depth])])

        # Save detailed results
        with open(f'scraped_data_{timestamp}.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Depth', 'URL', 'URL Text', 'Link Text', 'Link URL'])
            for depth in range(self.max_depth + 1):
                for page_data in self.results[depth]:
                    url = page_data['url']
                    page_text = page_data['text']  # dict of extracted <p>/<h1>/<h2>/<h3> text
                    for link_text, link_url in page_data['links']:
                        writer.writerow([depth, url, page_text, link_text, link_url])


def scrape_website(url, keyword, max_depth=4, delay=1):
    scraper = WebScraper(max_depth=max_depth, delay=delay)
    scraper.scrape_page(url, keyword)
    scraper.save_results()
    print(f"Scraping completed. Total URLs visited: {len(scraper.visited_urls)}")
    return True


if __name__ == "__main__":
    # Example usage
    target_url = "https://www.example.com"  # Replace with your target website
    scrape_website(target_url, "shoes", max_depth=2, delay=0.1)  # Replace "shoes" with your keyword