09 Apr 2025

How to Build an AI Agent with n8n and OpenAI

Learn how to create an AI-powered web scraping agent using n8n and OpenAI. This simple, step-by-step guide helps you perform searches, summarise content automatically, and export results to CSV.


Building an AI-powered web scraping agent in n8n is easier than you might think. In this tutorial, you’ll learn how to build a simple AI-powered agent using n8n that accepts a search term, performs a Google search, visits each search result, summarises the content using OpenAI, and outputs the summaries to a downloadable CSV file.

Workflow Overview

Our workflow includes these steps:

  1. Form Trigger: Collects user input (search term).
  2. Google Search (HTTP Request): Searches Google for the term.
  3. Extract Result Links (HTML Node): Parses links from the Google results.
  4. Split Out: Splits the array of links into one item per URL.
  5. Visit Links (HTTP Request): Fetches each search result page.
  6. Scrape Page Content (HTML Node): Extracts the title and content of each page.
  7. Summarise Content (OpenAI Node): Summarises content using AI.
  8. Compile Results (Convert to File Node): Generates a CSV file.
  9. Return the CSV: Makes the file available to the user.
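Before building the nodes, it helps to see the data flow end to end. Here is a hypothetical Python sketch of what the workflow does; the real work happens inside n8n nodes, so the search, fetch, scrape, and AI calls are passed in as stand-ins:

```python
# Hypothetical sketch of the workflow's data flow. Each argument stands in
# for an n8n node; nothing here is n8n's actual implementation.

def run_agent(query, search, extract_links, fetch, scrape, summarise):
    html = search(query)                 # Google Search (HTTP Request)
    links = extract_links(html)          # Extract Result Links (HTML node)
    rows = []
    for link in links:                   # Split Out: one item per link
        page = fetch(link)               # Visit Links (HTTP Request)
        title, content = scrape(page)    # Scrape Page Content (HTML node)
        rows.append({"title": title, "summary": summarise(title, content)})
    return rows                          # ready for the CSV export
```

Every step below configures one of these stages as an n8n node.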

Step-by-Step Tutorial

Step 1: Form Trigger Setup

Begin by creating a new n8n workflow. Delete the default trigger (if any) and add an n8n Form Trigger node. This node will generate a web form for us to collect user input. In the Form Trigger’s settings, configure the form with one field:

  • Form Title: For example, “Web Search Agent”. This appears as the heading on the form.
  • Form Path: A custom path (e.g. search-agent) for the form URL.
  • Form Description: (Optional) A short instruction, e.g. “Enter a search term and get a CSV of results.”
  • Form Elements: Add a single Text field, give it a name like “query” (this will be the field key), and label it “Search Term”. Mark it required so the user must fill it in.
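When the form is submitted, the Form Trigger emits one item whose JSON carries the field you defined. Assuming the field name “query” from the setup above, the item looks roughly like this, and later expressions such as {{$json["query"]}} read from it:

```python
# Approximate shape of the item the Form Trigger emits. The field name
# ("query") depends on your form configuration, not on n8n itself.
item = {"json": {"query": "workflow automation tools"}}

# An n8n expression like {{$json["query"]}} resolves against the item's json:
search_term = item["json"]["query"]
```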

Step 2: Google Search (HTTP Request)

After the Form Trigger, add an HTTP Request node. This node will perform a direct Google search query for the term input by the user. Configure the HTTP Request node as follows:

  • Method: GET
  • URL: https://www.google.com/search – This is the Google search URL.
  • Query Parameters: Click “Add Parameter” and add a parameter named q with value set to an expression referencing the form input, e.g. {{$json["query"]}}. This appends the search term as ?q=<term> in the URL.
  • Response Format: Set to String. This tells n8n not to parse the response as JSON, but to keep the raw HTML as text.
  • Header: It’s good practice to add a User-Agent header to mimic a real browser, since Google may otherwise return a different or blocked result. In the Options or Headers section, add a header User-Agent with a value like:
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36.
    (This is a common desktop browser agent string.)

Connect the Form Trigger node to this HTTP Request node. Now, when the form is submitted, the HTTP node will execute a Google search for the provided term and return the HTML of the search results page.
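For intuition, here is what this node does expressed as plain Python with the standard library; the URL builder and header match the node configuration above (this is a sketch of the equivalent request, not n8n’s internals):

```python
import urllib.parse
import urllib.request

def google_search_url(query: str) -> str:
    """Build the same URL the HTTP Request node produces (?q=<term>)."""
    return "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})

def fetch_search_page(query: str) -> str:
    """Fetch the results page with a desktop User-Agent, as the node does."""
    req = urllib.request.Request(
        google_search_url(query),
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/91.0.4472.124 Safari/537.36"
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode(resp.headers.get_content_charset() or "utf-8")
```

Note that `urlencode` handles the escaping of spaces and special characters in the search term, exactly as n8n’s query-parameter field does.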

Step 3: Extract Result Links (HTML Node)

Now we need to parse the Google HTML to extract the links for each search result. n8n provides an HTML node (formerly “HTML Extract”) that can parse HTML content using CSS selectors. Add an HTML node next in the flow, and configure it to Extract HTML Content:

  • Source Data: Select JSON (because our HTML is in the JSON output of the previous node).
  • JSON Property: Set this to data (the field where the HTTP Request node stored the HTML string).
  • Extraction Values: Here we define what we want to extract. We need all the result URLs, so click “Add Value”:
    • Key: e.g. link (this will be the name of the field for each extracted item).
    • CSS Selector: We can use a CSS selector that targets the anchor (<a> tag) of each result. Google’s results HTML typically has each result link wrapped in an <a> that contains an <h3> title. A convenient selector is:
      a:has(h3)
      

      This selector finds every <a> element that has an <h3> inside – which corresponds to the clickable title of each search hit.

    • Return Value: Choose Attribute, because we want the URL (href attribute) of the link.
    • Attribute: Enter href (the attribute that contains the URL).
  • Options: Enable Return Array (set it to true/on). This makes sure that if multiple links are found, the node returns them as an array of values under our link field, rather than a single combined string.

After configuring, connect the HTTP Request node (Google search) into this HTML Extract node. Execute the HTML node once with the data from the previous step (you can click “Execute Node” in the editor on the HTML node after a successful HTTP Request). In the HTML node’s output, you should see a field (e.g. link) that contains an array of URLs – these are the links to the search result pages. For example, link: ["http://example.com/...","http://anotherresult.com/...", ...]. Great! We have all the result URLs.
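If you want to see what the `a:has(h3)` selector is doing, here is a rough stdlib-only Python equivalent (a simplified sketch; n8n’s HTML node uses a full CSS-selector engine and handles edge cases this parser does not):

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect href values of <a> elements that contain an <h3>,
    mimicking the CSS selector a:has(h3) used in the HTML node."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> we are currently inside
        self._has_h3 = False   # did we see an <h3> inside it?
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._has_h3 = False
        elif tag == "h3" and self._href is not None:
            self._has_h3 = True

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href and self._has_h3:
                self.links.append(self._href)
            self._href = None

def extract_result_links(html: str) -> list[str]:
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links
```

Anchors without an inner heading (navigation links, footers) are skipped, which is exactly why `a:has(h3)` is a convenient filter for Google’s results page.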

Step 4: Split Out

At this stage, we have an array of result links from Google. We need to visit each link to scrape content from each page. n8n will automatically loop through multiple items if we split them into separate outputs. To do this, insert a Split Out node next.

The Split Out node takes an array and outputs each element as a separate item. Configure the Split Out node:

  • Field to Split Out: Set to link (the name of the field containing our array of URLs).
  • Include: Choose No Other Fields (we only need the link itself for the next step).

Connect the HTML Extract node into the Split Out node. Now the workflow will fork the execution into multiple items – one for each URL. Essentially, if we had (say) 5 links, after the Split Out node we’ll have 5 items, each with a link field containing one URL.
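In plain Python terms, Split Out with “No Other Fields” performs this transformation (a sketch of the node’s behaviour, not its implementation):

```python
def split_out(item: dict, field: str) -> list[dict]:
    """Turn one item holding an array under `field` into one item per
    element, mirroring n8n's Split Out node with 'No Other Fields'."""
    return [{field: value} for value in item.get(field, [])]
```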

Step 5: Visit Links (HTTP Request)

Now add another HTTP Request node, this time to retrieve each result page’s HTML. Connect it after the Split Out. Configure this HTTP Request as:

  • Method: GET
  • URL: Set to an expression to use the current item’s link. For example, in the URL field type {{$json["link"]}}. This will take the link value from each incoming item (provided by Split Out) and request it.
  • Response Format: String (we want the raw HTML again).
  • Timeout: (Optional) You may increase the timeout in Options if some pages are slow to respond, or set Ignore HTTP Errors to true to continue even if one page fails.

There’s no need to set the User-Agent header here explicitly; many websites respond fine to the default, but you can reuse the same header if desired.

Because of the Split Out, this single HTTP Request node will automatically execute once for each link. n8n handles this under the hood – you’ll see the node process multiple items. After this node runs, you’ll have multiple outputs (one per page), each containing the HTML of a result page.
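The per-link loop that n8n runs implicitly looks roughly like this in Python, including the “Ignore HTTP Errors” behaviour of skipping pages that fail (a hedged sketch, not n8n’s internals):

```python
import urllib.error
import urllib.request

def fetch_pages(links, timeout=30):
    """Fetch each result page, skipping failures - the script analogue
    of enabling 'Ignore HTTP Errors' on the HTTP Request node."""
    pages = []
    for link in links:
        try:
            with urllib.request.urlopen(link, timeout=timeout) as resp:
                pages.append({
                    "link": link,
                    "html": resp.read().decode("utf-8", "replace"),
                })
        except (urllib.error.URLError, TimeoutError):
            continue  # one bad page should not stop the whole run
    return pages
```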

Step 6: Scrape Page Content (HTML Node)

The OpenAI prompt in the next step expects a title and content field for each page, so we extract them first. Add another HTML node after the HTTP Request, configured to Extract HTML Content, just as in Step 3:

  • Source Data: JSON, with JSON Property set to data (where the HTTP Request stored each page’s HTML).
  • Extraction Values: Add two values:
    • Key title, CSS Selector title, Return Value Text – the page title.
    • Key content, CSS Selector body (or a more specific selector such as article if you know the site structure), Return Value Text – the page text.

Each item now carries title and content fields, ready for summarisation.

Step 7: Summarise Content (OpenAI Node)

Now, we’ll use the OpenAI node to summarise the content extracted from each page.

  • Add the OpenAI node:
    • Operation: Chat (called “Message a Model” in recent versions) – gpt-3.5-turbo is a chat model, so use the chat operation rather than Completion.
    • Model: gpt-3.5-turbo or similar
    • Prompt: Use an expression: Summarise the following text in 50 words or less:\n\nTitle: {{$json["title"]}}\n\nContent: {{$json["content"]}}
  • Additional Parameters:
    • Max tokens: 200
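The expression above is rendered once per item, with title and content interpolated from that item’s JSON. A plain-Python equivalent of the prompt construction (the API call itself is left to the node):

```python
def build_prompt(item: dict) -> str:
    """Render the same prompt string the OpenAI node's expression
    produces for one item (title/content come from the HTML node)."""
    return (
        "Summarise the following text in 50 words or less:\n\n"
        f"Title: {item['title']}\n\nContent: {item['content']}"
    )
```

Keeping the word limit in the prompt, plus a modest max-tokens cap, keeps the summaries short and the API cost predictable.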

Step 8: Compile Results into CSV

The final step of data processing is to take the summarised data from each page and compile it into a CSV file. n8n’s Spreadsheet File (or Convert to File) node can do this easily. Add a Spreadsheet File node (in recent n8n versions, use the “Convert to File” node set to the Convert to CSV operation). Connect the OpenAI node into it.

Configure the node to Convert to CSV:

  • Fields: If prompted, you can usually leave this empty to include all fields by default (typically title and the generated summary, depending on which fields the OpenAI node passes through).
  • Output: Set Put Output File in Field to something like file (this will be the name of the binary property for the resulting file).
  • File Name: e.g. results.csv (so the user sees a proper filename when downloading).
  • Header Row: Enable this option so that the CSV’s first row will contain the column names (title, content).

When executed, this node will take all incoming items and convert them into a single CSV file (as a binary). Under the hood, it uses the field names as columns and each item as a row in the CSV.
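That conversion is the same as this stdlib snippet (a sketch of the node’s behaviour using Python’s csv module):

```python
import csv
import io

def items_to_csv(items: list[dict]) -> str:
    """Convert a list of items into CSV text with a header row, the way
    the Convert to File node does: field names become columns, items rows."""
    if not items:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()
```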

The output of this node will be one item containing a binary file (with field name “file”). You can see in the editor UI that it’s a binary; you may even have a download option in the execution preview.

At this point, our data processing is done – we have the file ready.

Step 9: Return the CSV File to the User

To hand the CSV back to the user, first make sure the Form Trigger is configured to respond only after the workflow finishes (its Respond When option), rather than immediately on submission. Then there are a couple of ways to deliver the file:

Option 1: Use Form Trigger’s Redirect (Simplest) – In the Form Trigger node settings, change Form Response to Redirect URL. For the URL, you need a place where the file can be accessed. If you’re self-hosting n8n, one quick trick is to serve the file from the n8n instance. For example, you could use an HTTP Response node (Webhook Response) to send the file, or host it on an external storage and provide that link here. In a simple setup, you might not have a public URL for the file directly. Alternatively, you could display a message with instructions to retrieve it (or send via email, etc.). For a beginner tutorial, an easy approach is to instead provide the CSV via the n8n Editor when testing, and focus on the workflow logic.

Option 2: Webhook Response Node – (Advanced) For more control, you can use a Respond to Webhook node connected after the CSV generation. Configure it to send back the binary file in the HTTP response. However, using this with Form Trigger requires the form submission to actually be handled by that webhook. As of now, the Form Trigger node doesn’t directly accept a binary response in the same way a Webhook node does (this is a known limitation). This method may require using a standard Webhook trigger instead of Form Trigger for the final step.

For simplicity, you can stick with a confirmation message. The user can then retrieve the CSV from the n8n execution log or any storage node you connect. (In a real deployment, you might email the file to the user or upload it to cloud storage and give a link.)

Final Workflow Structure

Your final workflow:

Form Trigger → Google Search → Extract Links → Split Out → Fetch Page → Scrape Content → Summarise (OpenAI) → Compile CSV → Return CSV

Testing the Workflow

Activate the workflow and open the form. Enter a search term like “workflow automation tools” and submit. Your CSV file with AI-generated summaries will be ready for download or inspection in the n8n editor.

Conclusion

You’ve successfully built an AI-enhanced web scraping agent using n8n and OpenAI. Explore further by adjusting summarisation prompts or integrating additional AI processing steps.

Happy automating!

Osher Digital Business Process Automation Experts Australia
