Building a Smart LinkedIn Job Scraper: How I Automated Data Collection for My Job Search

LinkedIn doesn't offer a public API for job searches, and existing scraping solutions were either very expensive, unreliable, or lacked the features I needed. So I built my own scraper from scratch using Puppeteer, complete with session persistence and Google Sheets integration to avoid processing duplicate jobs.

I was using this previously - https://apify.com/curious_coder/linkedin-jobs-scraper but apify is expensive so i decided to build my own.

💡

Full technical docs, code, and setup instructions: GitHub Repository

The Problem

Manual job searching meant:

Seeing the same jobs daily with no efficient tracking
No way to export or bulk process LinkedIn data
Inconsistent application methods (Easy Apply vs external sites)
Manually copying details into spreadsheets for 50+ jobs daily

I needed automated data collection with duplicate prevention and structured output.

How the Scraper Works

The system I built has two main parts: a login tool that saves your LinkedIn session, and the scraper itself that collects job information.

Key Components:

Login Script - One-time manual auth, saves session permanently via Chrome's userDataDir
Express API - REST endpoint (POST /scrape-jobs) triggers scraping on demand
Puppeteer - Headless Chrome with stealth plugin to avoid bot detection
Google Sheets API - Duplicate detection without a database

Two-Phase Scraping Process

Phase 1: Fast URL Collection (2-3 minutes)

// Extract job IDs without loading full pages
const jobIds = await page.evaluate(() => {
  return Array.from(
    document.querySelectorAll('[data-occludable-job-id]')
  ).map(el => el.getAttribute('data-occludable-job-id'));
});
// Returns: ['3847562910', '3851234567', ...]

Phase 2: Smart Filtering + Detail Scraping

Calls Google Sheets API to fetch existing job IDs
Filters duplicates before scraping (50 collected → 35 duplicates → 15 scraped)
For new jobs only: extracts title, company, location, description, apply URL

Efficiency: If all 50 jobs are duplicates, completes in <30 seconds instead of 8-10 minutes.

// Extract job IDs without loading full pages
const jobIds = await page.evaluate(() => {
  return Array.from(
    document.querySelectorAll('[data-occludable-job-id]')
  ).map(el => el.getAttribute('data-occludable-job-id'));
});
// Returns: ['3847562910', '3851234567', ...]

Application URL Extraction

LinkedIn jobs have 3 application methods. The scraper handles all of them:

// 1. External sites (70%): Listen for new tab redirects
browser.on('targetcreated', async (target) => {
  const newPage = await target.page();
  if (newPage && !newPage.url().includes('linkedin.com')) {
    applyUrl = newPage.url();
  }
});

// 2. Easy Apply (15%): Extract from modal
const modalUrl = await page.evaluate(() => 
  document.querySelector('.artdeco-modal a[href^="https:"]')?.href
);

// 3. Fallback (10%): Parse description for URLs/emails
const urls = description.match(/(https?:\/\/[^\s"<>()]+)/g);
const emails = description.match(/[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+/g);

Optimization & Anti-Detection

Resource Blocking (60% faster loads):

page.on('request', (req) => {
  if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
    req.abort();  // Only load text
  } else {
    req.continue();
  }
});

Stealth + Rate Limiting:

puppeteer-extra-plugin-stealth masks automation fingerprints
2-3 second delays between jobs prevent rate limiting
Session persistence via saved Chrome profile (lasts 2-4 weeks)

Real-World Performance

Stats after 3+ months of daily use:

Speed: 8-10 min for 50 new jobs, <1 min if mostly duplicates
Success rate: 95%+ (failures from LinkedIn security challenges)
Volume: 3,000+ jobs processed
Resource usage: ~300MB memory, ~20% CPU

Sample API Response:

{
  "success": true,
  "jobsScraped": 15,
  "duplicatesSkipped": 35,
  "jobs": [
    {
      "id": "3847562910",
      "title": "Security Analyst",
      "companyName": "Tech Solutions Inc",
      "applyUrl": "https://techsolutions.com/careers/apply/12345",
      "location": "Toronto, Ontario, Canada",
      "salaryInfo": "$70,000 - $90,000 / year",
      "applicantsCount": "47 applicants"
      // ... more fields
    }
  ]
}

Tech Stack

Component	Version	Purpose
Node.js	20.x	Runtime
Puppeteer	24.31.0	Browser automation
puppeteer-extra-plugin-stealth	2.11.2	Anti-detection
Express	4.21.2	REST API
Google Sheets API	v4	Duplicate tracking

Why Google Sheets?

No database setup needed
N8N workflow writes to same sheets
Can manually review/edit in familiar interface
Free and instant integration

Quick Start

# 1. Clone and install
git clone https://github.com/Murali2602/Projects.git
cd Projects/LinkedIn-Job-Scraper
npm install

# 2. One-time login
node login.js  # Browser opens, log in manually, auto-saves session

# 3. Start API
npm start

# 4. Scrape jobs
curl -X POST http://localhost:3000/scrape-jobs \
  -H "Content-Type: application/json" \
  -d '{"searchUrl": "https://linkedin.com/jobs/search?keywords=security", "targetCount": 25}'

Full documentation with Google Sheets integration, Docker deployment, and advanced config: GitHub

Integration with Job Automation

This scraper is the foundation of my automated job search system. It feeds structured data to my N8N workflow, which:

Uses AI to analyze job fit (scores 0-100 based on my skills)
Generates tailored resumes for qualified positions (65%+ match)
Tracks everything in Google Sheets

The scraper handles the messy data collection so AI agents can focus on analysis and customization.

What I Learned:

Sessions expire every 2-4 weeks (simple manual re-login beats complex automation)
2-3 second delays prevent rate limiting (patience > speed)
LinkedIn's HTML is inconsistent (multiple fallback selectors essential)
10-15% of jobs lack clear application URLs (acceptable trade-off)

Future improvements: Automatic session refresh, multi-platform support (Indeed/Glassdoor), better salary parsing, company enrichment.

The code is open source and free to use. Check the GitHub repo for issues, suggestions, or contributions.

The Problem

How the Scraper Works

Two-Phase Scraping Process

Application URL Extraction

Optimization & Anti-Detection

Real-World Performance

Tech Stack

Quick Start

Integration with Job Automation

Related Posts

Automating My Cybersecurity Job Hunt with N8N and AI

Wazuh Unbound: Activity Monitoring and Threat Detection with Custom Rules and Tool Integration

Windows Privilege Escalation Tricks