Craigslist Data Scraper (MySQL)

I don’t normally talk about my personal life, but I have been in the market for a new ski boat, as my family and I are avid skiers and wakeboarders. In that pursuit I found that while Craigslist is an awesome tool for finding a boat, spending hours manually searching the listings is not my cup of tea. So I went looking for a way to automate the process, and Python seemed to fit the bill. While I am certainly not an expert in Python, I found plenty of examples online that helped me quickly put together a script to automate my entire search. What I needed was a script that could crawl Craigslist for listings matching my criteria, store those listings in a database, and email me the results. Below is what I came up with; hopefully it will be as useful for others as it has been for me. For this script to run on Debian or Ubuntu, you will need to ensure you have the appropriate Python modules installed, as shown in the command below.

# Install required python modules
apt-get install python-mysqldb python-feedparser python-mailer

Now that we have the required prerequisites installed, let's move on and get our script configured. First, you will need to edit the ### Database Initialization ### section with your MySQL or MariaDB server info. Next, you will need to edit the ### Search Parameters ### section with the information you are searching for. Lastly, you will need to edit the ### Email Parameters ### section so you are not sending me your search results. Note that after running the script the first time, you can safely remove the database creation block marked by ### Database Creation, remove after first run ###.
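Before looking at the full script, here is a quick standalone sketch of how it assembles the Craigslist RSS search URLs. The site ("baltimore"), section ("boo", boats), and search term below are example values only; note that `quote()` escapes the space in a multi-word term.

```python
# Standalone sketch of the RSS search URL generation.
# The site, section, and term values are examples only.
try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

rss_generic_link = "http://%s.craigslist.org/search/%s?query=%s"

# Build one URL for the "baltimore" site, "boo" (boats) section,
# searching for "ski supreme"; quote() turns the space into %20.
link = rss_generic_link % ("baltimore", "boo", quote("ski supreme"))
link = link + "&format=rss"
print(link)
```

The full script simply repeats this for every combination of site, section, and search term.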

Note: Following the success of the Craigslist Data Scraper (MySQL), I was asked to rewrite the script for use with SQLite. If you would prefer the SQLite version, please see Craigslist Data Scraper (SQLite).

#!/usr/bin/env python
'''
	clscraper-mysql - Craigslist Scraper w/MySQL
	Copyright (c) 2014 Matthew Marable
	Email: mmarable@oitibs.com
	Web: https://oitibs.com

	clscraper-mysql is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	clscraper-mysql is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License
	along with clscraper-mysql.  If not, see <http://www.gnu.org/licenses/>.
'''
import sys
import time
import urllib
import datetime
import feedparser
import MySQLdb as db
from mailer import Mailer
from mailer import Message

### Database Initialization ###
con = db.connect(host="[Host]",user="[Username]",passwd="[Password]")
cur = con.cursor()

### Database Creation, remove after first run ###
try:
	cur.execute("""CREATE DATABASE IF NOT EXISTS `clscraper`""")
	cur.execute("""USE `clscraper`""")
	cur.execute("""
		CREATE TABLE IF NOT EXISTS `LISTINGS` (
		`SAID` INT(11) NOT NULL AUTO_INCREMENT,
		`URL` TEXT NOT NULL,
		`TITLE` TEXT NOT NULL,
		`MODIFIED` TEXT NOT NULL,
		`NEW` INT(1) NOT NULL DEFAULT '0',
		PRIMARY KEY (`SAID`))"""
		)

except db.Error, e:
	print "Error %s:" % e.args[0]
	sys.exit(1)
### End Database Creation, remove after first run ###

### Search Parameters ###
sections = ["boo"]
sites = ["smd","annapolis","baltimore","washingtondc",
	"frederick","fredericksburg","westmd","delaware",
	"easternshore","martinsburg"]
terms = ["malibu","moomba","nautique","supra","mastercraft",
	"centurion","american skier","tige","brendella","axis",
	"ski supreme"]

### Craigslist RSS Search URL ###
rss_generic_link = "http://%s.craigslist.org/search/%s?query=%s"

### Generate the RSS links ###
rss_links = []
modified = datetime.datetime.now()
html_text = ""
count = 0

for section in sections:
	for site in sites:
		for term in terms:
			term = urllib.quote(term)
			rss_link = rss_generic_link % (site, section, term)
			rss_link = rss_link + "&format=rss"
			rss_links.append(rss_link)

for rss_link in rss_links:
	feed = feedparser.parse(rss_link)

	for entry in feed.entries:
		title = entry["title"]
		url = entry["link"]

		cur.execute("""
			SELECT MODIFIED
			FROM LISTINGS
			WHERE TITLE = %s""",
			(title,)
		)

		if cur.fetchone() is None:
			count = count + 1
			html_text = html_text + "<a href='" + url + "' >" + title + "</a><br><br>"
			cur.execute("""
				INSERT INTO LISTINGS
				(URL, TITLE, MODIFIED, NEW)
				VALUES (%s, %s, %s, %s)""",
				(url, title, modified, 1,)
			)

con.commit()
con.close()

if count > 0:
	### Email Parameters ###
	message = Message(From="Search Automation <tech@oitibs.com>",To="mmarable@oitibs.com")
	message.Subject = "New Listings (" + str(count) + ")"
	message.Html = html_text.encode('utf-8')
	sender = Mailer('[SMTP Server IP]')
	sender.send(message)

Once you have edited the script to your liking, you can schedule the script to run at intervals using your crontab.
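For example, to run the script at the top of every hour, you could add a line like the following with `crontab -e` (the path /usr/local/bin/clscraper-mysql.py is just an assumption; adjust it to wherever you saved the script):

```shell
# Run the Craigslist scraper at the top of every hour
# (assumes the script was saved as /usr/local/bin/clscraper-mysql.py)
0 * * * * /usr/bin/python /usr/local/bin/clscraper-mysql.py
```

Since the script emails only when new listings are found, an hourly schedule keeps the inbox noise down while still catching fresh listings quickly.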

Craigslist Data Scraper (MySQL) Download
