If you’ve ever needed to pull B2B contact info—names, emails, phone numbers—from a bunch of company websites or directories, you already know the pain: scattered formats, tricky layouts, and anti-bot roadblocks everywhere. This guide is for anyone who’s tired of copy-paste, wants more than generic scraping, and is considering using Crawlbase to set up custom extraction rules that actually work.
Below, I’ll walk you through the real steps to create extraction rules that pull what you need without hours of trial and error. I’ll call out what works, what’s a waste of time, and how to avoid the biggest headaches.
Why Custom Extraction Rules Matter (and When You Need Them)
Out of the box, most scraping tools (Crawlbase included) grab the raw HTML or a simple set of fields. That’s fine if you want generic data, but B2B contact details are rarely that tidy:
- Emails are buried in weird places or protected by JavaScript.
- Names and job titles are stuck inside custom HTML tags.
- Directory listings are paginated or loaded dynamically.
Custom extraction rules let you target exactly what you want, so you’re not sifting through junk later. If you’re serious about getting usable B2B leads, this is where you have to spend your effort.
Who should care:
- Anyone scraping for sales lead lists, recruiting, partnerships, or market research.
- Folks sick of reformatting noisy spreadsheet dumps.
- People who want to avoid manual cleaning or “almost-good-enough” automation.
1. Get the Basics Ready
Before you dive into Crawlbase’s custom rules, make sure you have the basics covered:
- Crawlbase account: You’ll need an account (free trial’s available, but serious usage is paid).
- Target URLs: Have a list of company or directory URLs you want to scrape.
- Know what you want: Be clear on which fields you need—email, name, title, LinkedIn, whatever.
Pro tip: Don’t try to build a mega-scraper for every possible field. Start with 2-3 fields you actually need, then expand.
2. Understand How Crawlbase Extraction Works
Here’s the short version: Crawlbase uses “extraction rules” (think: recipes) to tell its crawler which data to pull from a page. You define these using CSS selectors, XPath, or simple text patterns.
- CSS selectors: Easiest for most pages (e.g., .contact-email, span.name)
- XPath: Powerful for weird or deeply nested structures, but less readable
- Regex (sometimes): For pulling things like emails or phone numbers out of blobs of text
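Before wiring any of these into Crawlbase, it can help to sanity-check a selector or regex locally against a saved copy of the page. Here's a minimal sketch using BeautifulSoup and Python's re module—the sample HTML and class names are purely illustrative, and testing XPath the same way would need something like lxml instead:

```python
# Quick local sanity check of a CSS selector and an email regex before
# turning them into Crawlbase extraction rules. The sample HTML below is
# made up; paste in a snippet from your real target page instead.
import re
from bs4 import BeautifulSoup

sample_html = """
<div class="card">
  <span class="name">Alice Smith</span>
  <span class="contact-email">alice@company.com</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: the same string you would hand to Crawlbase as a rule
print(soup.select_one(".contact-email").get_text(strip=True))

# Regex fallback for emails buried in free text
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", sample_html))
```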
What doesn’t work:
- “Just scrape everything and filter later.” This blows up your post-processing time.
- Hoping Crawlbase will “figure it out.” It’s not magic—you have to tell it what to grab.
3. Inspect the Target Page
This is where most people zone out, but it’s the step that makes or breaks your extraction.
- Open your browser’s Inspect tool (F12 or right-click → Inspect).
- Find the data you want: email address, name, phone, etc.
- Look for a pattern in the HTML—class names, IDs, tag types.
For example, if you see something like:

```html
<span class="contact-email">alice@company.com</span>
```

your CSS selector is .contact-email.
If it's more complicated:

```html
<div>
  <p>Contact: <b>Alice Smith</b></p>
  <p>Email: <span>alice@company.com</span></p>
</div>
```
Here, you might use something like:
- For name: div > p:first-child > b
- For email: div > p:nth-child(2) > span
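If you want to confirm selectors like these actually grab what you expect before spending crawl credits, you can run them against the snippet locally. A quick sketch with BeautifulSoup, standing in for Crawlbase's selector engine (which may differ on edge cases):

```python
# Check the two selectors above against the example snippet.
from bs4 import BeautifulSoup

snippet = """
<div>
  <p>Contact: <b>Alice Smith</b></p>
  <p>Email: <span>alice@company.com</span></p>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
print(soup.select_one("div > p:first-child > b").get_text())      # Alice Smith
print(soup.select_one("div > p:nth-child(2) > span").get_text())  # alice@company.com
```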
What to ignore:
- Don’t obsess over every minor variation. Start with the most common pattern, then handle exceptions as they come up.
4. Write Your Extraction Rules in Crawlbase
Once you have your selectors, it’s time to plug them into Crawlbase.
A. Using the Crawlbase Dashboard
- Log in and go to the “Crawler” section.
- Start a new crawl or select an existing project.
- Under “Extraction Rules,” add a new rule for each field you want:
  - Field name: (e.g., “email”)
  - Selector type: CSS, XPath, or Regex
  - Selector: The actual rule you found earlier
Example:
- Field: email
- Type: CSS
- Selector: .contact-email
You can preview results for one or more sample URLs—do this before launching a big crawl, or you’ll waste credits on junk data.
B. Using the API (for Developers)
If you want to automate or scale, you can define extraction rules in JSON and send them with your crawl job. Example payload:
json { "url": "https://example.com/contacts", "extractRules": { "name": { "selector": ".contact-name", "type": "css" }, "email": { "selector": ".contact-email", "type": "css" } } }
Send this via the Crawlbase API, and you’ll get back just the fields you specified.
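For reference, here's roughly what that call can look like from Python with the requests library. This is a sketch, not Crawlbase's official client: the endpoint, token parameter, and the way extraction rules are attached simply mirror the example payload above, so confirm the exact names and request format against Crawlbase's current API docs (they also publish SDKs that wrap this for you).

```python
# Minimal sketch of sending extraction rules with a crawl request.
# Endpoint, token parameter, and "extractRules" field are assumptions
# based on the payload above; verify them in Crawlbase's docs.
import requests

CRAWLBASE_TOKEN = "YOUR_TOKEN"  # placeholder, from your Crawlbase dashboard

payload = {
    "url": "https://example.com/contacts",
    "extractRules": {
        "name": {"selector": ".contact-name", "type": "css"},
        "email": {"selector": ".contact-email", "type": "css"},
    },
}

response = requests.post(
    "https://api.crawlbase.com/",      # assumed endpoint; check the docs
    params={"token": CRAWLBASE_TOKEN},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Response shape depends on your rules and plan; inspect it before parsing.
print(response.json())
```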
5. Handle Common Roadblocks
Let’s be honest: not all pages cooperate. Here’s what to do when things get messy.
Obfuscated or Protected Emails
Some sites hide emails behind JavaScript or image files. Crawlbase can render JavaScript if you enable browser rendering, but:
- If emails are images: You’re out of luck—OCR scraping is a pain and rarely worth it.
- Emails shown on click: Use browser rendering and try a selector that targets the revealed element. Sometimes you’ll need to simulate a click (advanced—see Crawlbase’s docs).
Pagination and “Load More” Buttons
If the data is spread across multiple pages:
- Static page numbers: Add all URLs to your target list.
- Infinite scroll or “Load More”: Enable browser rendering and set Crawlbase to scroll to the bottom or click “Load More” a set number of times.
Don’t try to scrape huge directories in one go. Monitor errors and adjust.
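For the static-page-number case, the URL list is usually easier to generate than to paste by hand. A tiny sketch—the URL pattern and page count here are made up, so swap in whatever the directory actually uses:

```python
# Build a batch of paginated directory URLs (hypothetical pattern).
base = "https://example-directory.com/companies?page={page}"
target_urls = [base.format(page=n) for n in range(1, 21)]  # pages 1-20, one batch

for url in target_urls[:3]:
    print(url)  # spot-check a few before feeding the list to your crawl
```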
Anti-Bot Protections
Crawlbase is better than most at getting past basic blocks, but nothing’s perfect.
- If you’re hitting CAPTCHAs, try slowing your crawl or rotating user agents.
- For really locked-down sites, it may not be worth the hassle—move on or try a different target.
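Slowing down is mostly about pacing on your side: put a jittered pause between the requests you send to Crawlbase instead of firing them all at once. A rough sketch below, using the same assumed endpoint as the earlier API example; whether you can also set a custom user agent per request is a Crawlbase option to confirm in their docs.

```python
# Pace requests to the (assumed) Crawlbase endpoint with a jittered delay.
import random
import time

import requests

CRAWLBASE_TOKEN = "YOUR_TOKEN"  # placeholder

def fetch(url: str) -> str:
    # Assumed endpoint and parameters, as in the earlier sketch; confirm in the docs.
    resp = requests.get(
        "https://api.crawlbase.com/",
        params={"token": CRAWLBASE_TOKEN, "url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text  # raw body; the shape depends on the options you send

for url in ["https://example.com/contacts", "https://example.com/team"]:
    body = fetch(url)
    time.sleep(random.uniform(2.0, 6.0))  # jittered pause between requests
```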
6. Test, Validate, and Iterate
Don’t assume your rules work everywhere just because they work on one page.
- Run a test crawl on 3-5 URLs.
- Check the output for missing or garbled data.
- Adjust selectors as needed.
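A quick automated pass over the test output catches most selector problems before a full run. The sketch below assumes your results are a list of dicts with "name" and "email" keys, which is just one way you might store them:

```python
# Flag rows with missing names or obviously malformed emails in test output.
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

rows = [
    {"name": "Alice Smith", "email": "alice@company.com"},
    {"name": "", "email": "Contact us"},  # selector grabbed the wrong element
]

for i, row in enumerate(rows):
    problems = []
    if not row.get("name", "").strip():
        problems.append("missing name")
    if not EMAIL_RE.match(row.get("email", "")):
        problems.append("bad email")
    if problems:
        print(f"row {i}: {', '.join(problems)} -> {row}")
```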
Pro tip: Save your extraction rules somewhere versioned (even a Google Doc). You’ll thank yourself later when sites change.
What to skip:
- Don’t waste time perfecting for edge cases you don’t care about. Get a solid 80% solution, then revisit if needed.
7. Export and Use the Data
Once you’re pulling the right info, export it in the format you need (CSV, JSON, whatever). Import it into your CRM, sales tool, or wherever you actually use the data.
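If you end up with JSON from the API but need CSV for a CRM import, the conversion is a few lines. File names and the field list below are placeholders; match them to whatever you actually extract:

```python
# Convert a JSON results file into a CSV ready for CRM import.
import csv
import json

with open("crawl_results.json") as f:   # hypothetical results file
    rows = json.load(f)                 # expected: a list of dicts

with open("leads.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
```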
Clean up as you go: Even with custom rules, expect a little manual review for typos, broken fields, or the occasional junk entry. Don’t overthink it—just fix the obvious issues and move on.
Keep It Simple: Final Thoughts
Custom extraction with Crawlbase isn’t rocket science, but it does take a little upfront work. The key is to stay focused: set up clear rules, test on real pages, and don’t try to solve every possible site variation on day one.
Start small, get your core data, and only add complexity if you actually need it. Most tools (Crawlbase included) will handle the heavy lifting if you point them in the right direction.
If you hit a wall, step back and ask: is this data worth the pain, or is there an easier source? Sometimes the answer is just “move on.” Good luck, and may your inbox fill with real leads, not gibberish.