
Easy Web Scraping with Crawl4AI, DeepSeek & Gemini (Python Guide)

21 Nov 2025 · 3 min read

Can an LLM make web scraping simpler and smarter?

Have you ever wondered if a smart language model can help pull data from websites? The short answer is yes. With Crawl4AI and an LLM, you can turn messy pages into neat data. This makes web scraping faster and easier to read.

Why this matters: dynamic web data is everywhere

Web pages change all the time. Prices, tables, and lists update each day. If you want fresh facts, you need a good scraper. Letting an LLM guide the extraction helps you pull out exactly the parts you need. It can also output data in a clean format, like JSON, so you can save it straight to a database.

What you’ll learn: quick overview of Crawl4AI + LLM workflow

First, make a Python virtual environment. Then install Crawl4AI and other packages. You may also add Playwright for browser control. Next, choose an LLM provider such as DeepSeek or Gemini. Set the model, give a schema, and let Crawl4AI convert page HTML into markdown. The model reads that markdown and fills your schema. You can ask for a strict JSON format to load data into a table.

Who this guide is for: developers, analysts, and builders

This guide is for people who write code in Python. It also helps analysts who need clean data fast. Builders who want to scale scraping will learn about costs and model choices.

| Tool | Main role | Speed | Cost tip |
| --- | --- | --- | --- |
| Crawl4AI | Scrape + clean HTML | Medium | Can run without an LLM to save cost |
| DeepSeek | LLM extraction | Slower in tests | Good accuracy; watch tokens used |
| Gemini | LLM extraction | Faster with Flash models | Prompt may need tuning |

  • Tip: Test prompts on one page first.
  • Tip: Track tokens and requests to control cost.
  • Tip: Validate results by spot checking the output.

Setup: create a virtualenv and install Crawl4AI, the LiteLLM proxy and Playwright

Want to scrape a site fast with Python? First, make a clean workspace. Use a virtual environment. Then install the tools you need for web scraping and LLM scraping. This guide uses Crawl4AI, the LiteLLM proxy (or OpenRouter), and Playwright.

Create and activate a Python virtual environment

Make a new env with: python -m venv env. On Mac/Linux run: source env/bin/activate. On Windows run: env\Scripts\activate. This keeps your project tidy: run Python from inside the env so this project's packages don't mix with other projects.

Install packages: crawl4ai, litellm (or OpenRouter), and dependencies

Install the main packages with pip. Example: pip install crawl4ai litellm playwright. These add the scraper, a simple LLM bridge, and the browser runner. If you use DeepSeek or Gemini, LiteLLM lets you talk to both with one API style.
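
To confirm everything landed in the env, here is a quick sanity check you can run from Python:

```python
# Verify the packages are importable and print their installed versions.
from importlib.metadata import version

for pkg in ("crawl4ai", "litellm", "playwright"):
    print(pkg, version(pkg))
```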

Set API keys for DeepSeek / Gemini and optional base URLs

Set your keys as environment variables. For example: export DEEPSEEK_API_KEY=your_key on Mac/Linux, or set DEEPSEEK_API_KEY=your_key on Windows; do the same for GEMINI_API_KEY if you use Gemini. If you route through a custom host via the proxy, also set a base URL. Keep keys private and never hard-code them in scripts.
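
In your script, read the keys at run time instead of pasting them in. A minimal sketch (LLM_BASE_URL is a hypothetical variable name for an optional custom host):

```python
import os

# Pull provider keys from the environment so they never land in source control.
deepseek_key = os.environ.get("DEEPSEEK_API_KEY")
gemini_key = os.environ.get("GEMINI_API_KEY")

# Optional custom host when routing through a proxy; the variable name here
# is hypothetical -- use whatever your proxy setup expects.
base_url = os.environ.get("LLM_BASE_URL")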

Install Playwright browsers if you see browser errors

If Playwright shows errors, run: playwright install. Or run: python -m playwright install. This downloads browsers Playwright needs. After that, the scraper can render pages and extract content.

| Command | Purpose | Notes |
| --- | --- | --- |
| python -m venv env | Make virtual environment | Keeps packages separate |
| source env/bin/activate | Activate env (Mac/Linux) | Windows: env\Scripts\activate |
| pip install crawl4ai litellm playwright | Install scraper and tools | Adds Crawl4AI and proxy support |
| export DEEPSEEK_API_KEY=... | Set API key | Also set GEMINI_API_KEY if used |
| playwright install | Download browsers | Fixes Playwright browser errors |

How to extract structured data with Crawl4AI + an LLM (step-by-step)

Want to turn a messy webpage into neat data you can use? This guide shows a simple way to do that. You will use Crawl4AI and an LLM. The steps are short and clear. You can try them with Python and Playwright.

Provide URL(s) and define the exact schema (JSON) you want

First, give the scraper the web page link or a list of links. Next, write the exact JSON schema you want back. For example, ask for rank, model name, score, confidence, words, organization, license. This makes the output easy to store in a database. The scraper will try to match that schema.
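
One way to pin the schema down is a Pydantic (v2) model that emits JSON Schema; this is a sketch using the example fields above, not the only option:

```python
from typing import Optional
from pydantic import BaseModel

# Field names follow this guide's leaderboard example; rename for your page.
class LeaderboardRow(BaseModel):
    rank: int
    model_name: str
    score: float
    confidence: Optional[float] = None
    words: Optional[int] = None
    organization: str
    license: str

# Crawl4AI's LLM extraction takes a plain JSON schema; Pydantic generates one.
schema = LeaderboardRow.model_json_schema()
```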

Configure the LLM strategy: markdown input, chunking and schema output

Choose a strategy that tells the LLM to read markdown. Let Crawl4AI convert the page into markdown first. Turn on chunking so the model handles big pages in pieces. Ask the LLM to return valid JSON that fits your schema. You can also skip the LLM and let Crawl4AI scrape alone if you want lower cost.
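
A minimal sketch of that configuration is below. Parameter names have moved between Crawl4AI releases (newer versions wrap provider and api_token in an llm_config object), so treat this as illustrative and check the docs for your installed version:

```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    provider="deepseek/deepseek-chat",     # or "gemini/gemini-1.5-flash"
    api_token=os.environ["DEEPSEEK_API_KEY"],
    schema=schema,                         # JSON schema from the step above
    extraction_type="schema",              # ask for schema-shaped JSON back
    instruction="Extract every row that fits the schema. Return valid JSON only.",
    input_format="markdown",               # hand the model markdown, not raw HTML
    apply_chunking=True,                   # split big pages into pieces
)
```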

Run the Python script (example: python web_scraping.py) and inspect results

Run the script after you install the packages and activate your virtual env. You might need the Playwright browser install step from earlier. The run can take time depending on the model; in testing, DeepSeek and Gemini ran at noticeably different speeds. Check the JSON output and make sure the fields match your schema.
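
The core of such a script looks roughly like this (the URL is a placeholder, and older Crawl4AI versions accept extraction_strategy directly on arun while newer ones wrap it in a run config):

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/leaderboard",  # placeholder URL
            extraction_strategy=strategy,           # from the previous step
        )
    rows = json.loads(result.extracted_content)     # the LLM returns a JSON string
    print(json.dumps(rows[:3], indent=2))           # spot-check the first few rows

asyncio.run(main())
```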

Troubleshooting: missing fields? tweak system prompt and hyperparameters

If a field is wrong or missing, change the system prompt. Be explicit about the exact field names to extract. Try different chunk sizes, and turn off iframe crawling if needed. Note the cost: one test used ~150,000 tokens and cost around $0.08. Costs add up at scale, so test with small runs first.
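
As a sketch, the usual knobs look like this (the attribute names are assumptions that vary by Crawl4AI version):

```python
# Spell out exact field names if the model keeps missing them.
strategy.instruction = (
    "Extract rank, model_name, score, confidence, words, organization, and "
    "license exactly as shown in the page's table. Use null for missing cells."
)

# Smaller chunks can keep table rows intact on dense pages.
strategy.chunk_token_threshold = 2000
```

Once the fields come back clean, go use Crawl4AI with your favorite model and collect clean data.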

Costs, model trade-offs, prompt tuning — and go try it now

Want to scrape web pages fast and get clean data? Web pages change a lot, which makes scraping very useful. But a Crawl4AI + LLM setup can cost more than you expect. Read on to learn when to use an LLM and when to skip it.

Token & cost example: ~150k tokens used — why this adds up at scale

A short test used about 150,000 tokens across roughly 25 requests and cost around $0.08. That is pocket change for a small run, but costs grow fast across many pages: big runs can reach thousands of dollars if a large LLM touches every page. Always measure tokens before you scale.
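
Here is the back-of-envelope math behind that warning; the per-token price is implied by the article's own test figures, not a provider's published rate:

```python
# From the test above: ~150,000 tokens across ~25 requests cost ~$0.08.
tokens_per_page = 150_000 / 25      # ~6,000 tokens per page
cost_per_token = 0.08 / 150_000     # price implied by that test run

pages = 1_000_000
estimate = pages * tokens_per_page * cost_per_token
print(f"~${estimate:,.0f} to run {pages:,} pages")  # roughly $3,200
```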

When to skip LLMs: use Crawl4AI’s native scraping and markdown mode

Crawl4AI can scrape without an LLM. It also makes markdown output. That saves tokens and money. Use native scraping when you only need raw tables or lists. Use LLMs only for complex extraction or strict JSON schemas.
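
A minimal sketch of the token-free path (the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

# No LLM in the loop: Crawl4AI fetches, renders, and returns cleaned markdown.
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/prices")  # placeholder
        print(result.markdown)  # store or parse this directly, zero token spend

asyncio.run(main())
```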

Model choices: DeepSeek (accuracy) vs Gemini Flash (speed) and provider notes

| Model | Strength | Speed | Cost notes | Best use |
| --- | --- | --- | --- | --- |
| DeepSeek | High accuracy | Slower | More tokens per call | Careful extraction; tricky pages |
| Gemini Flash | Fast | Very fast | Lower latency; still check tokens | Large scale, quick runs |

Prompts matter. The same prompt can behave differently on each model. Tune the instructions for each provider. If a model misses names or fields, change the prompt and retry.

CTA — Run the script now: download the sample, set keys, and go use Crawl4AI

Ready to try? Get the sample script, set your API keys, and run it with Python and Playwright. Start with a few pages. Watch token use. Switch to native Crawl4AI mode if costs climb. Try DeepSeek for accuracy and Gemini for speed.

  • Download the sample script.
  • Set environment keys for your model.
  • Install Playwright and run the script.
  • Check token counts and results.