#ExtractSummit2026 is coming back to Austin and Dublin. Last year's events were something special: the conversations, the people, the energy in the room. We're bringing it back, and we're already excited about what this year is going to look like. Speaker proposals are now open for both events. If you're working in the web data space and you've got something worth sharing (a project you shipped, a problem you solved in an interesting way, a perspective that hasn't been talked about enough), we'd love to hear from you. Submit your proposal here: https://lnkd.in/gQFSTMnp #ExtractSummit2026
Extract Summit
IT Services and IT Consulting
Dublin, Dublin · 921 followers
Join the #1 web scraping event, Extract Summit 2025 in Austin (Sept 24–25) & Dublin (Nov 5–6)
About us
Extract Summit 2025 is the #1 in-person web scraping event, bringing together industry leaders, developers, data professionals, and legal experts shaping the future of web data extraction. Whether you’re an experienced professional or just getting started, this is the place to learn the latest strategies, explore real-world use cases, and connect with the best in the industry.

🚀 Now accepting speaker submissions! If you have expertise, innovative ideas, or real-world case studies on web scraping, AI, or data extraction, we want to hear from you. Apply to speak below.
- Website: https://extractsummit.io/
- Industry: IT Services and IT Consulting
- Company size: 51–200 employees
- Headquarters: Dublin, Dublin
Updates
-
We're busy planning another TWO events this year. Details & CFP to follow shortly! https://lnkd.in/d7WfpaM will be updated soon.
-
If you work with web data, the last year has made something very clear: the internet looks human… even when most of it isn’t. 🤖⚠️ At Extract Summit 2025 in Dublin, Domagoj Marić from Pontis Technology unpacked how the “Dead Internet Theory” is becoming a real, measurable problem for anyone scraping the web. A few themes hit hard:

- The synthetic web isn’t theoretical anymore: nearly half of all traffic is bots, and AI-generated content is shaping feeds, trends, and even entire conversations.
- Building “slop bots” is frighteningly easy: with a couple dozen lines of Python and a cheap LLM, anyone can mass-produce comments or posts at scale.
- Platforms benefit from the noise: fake engagement still counts as engagement, and AI-managed profiles only blur the line further.
- Detection tools are playing catch-up: from text patterns to image watermarking to new transparency laws, progress is happening, but the synthetic ecosystem evolves faster.
- Authentic data is becoming a strategic asset: as models start training on synthetic data, real human signals are becoming the most valuable part of the web.

If you missed the session in Dublin, this write-up is a great way to catch up and to rethink how you approach web data in a synthetic-first world. 📖 Read the full blog → https://lnkd.in/dMGnUCs5 #ExtractSummit #WebScraping #Zyte #DeadInternetTheory
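The “text patterns” detection theme can be made concrete with a toy sketch (our illustration, not from the talk): clustering near-duplicate comments via word-shingle Jaccard similarity, one of the simplest signals for spotting templated, mass-produced replies. The function names and threshold are our own assumptions.

```python
from itertools import combinations

def shingles(text, k=3):
    """Lowercased k-word shingles of a comment."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(comments, threshold=0.6, k=3):
    """Flag comment pairs whose shingle overlap suggests templated output."""
    sets = [shingles(c, k) for c in comments]
    return [
        (i, j)
        for i, j in combinations(range(len(comments)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

Real detectors layer many more signals (posting cadence, account graphs, stylometry), but even this crude measure separates templated bot replies from organic conversation surprisingly often.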
-
We saw it launch live at the #ExtractSummit. You can now read the story behind Web Scraping Copilot and why we built it.
🚀 We recently launched Web Scraping Copilot, an AI-powered VS Code extension that helps engineers code faster while staying in control. In our latest blog, Zyte’s Chief Product Officer Iain Lennon shares the thinking behind it. Production-scale web data demands a high level of rigour from an AI-powered solution: deterministic quality, support for engineering discipline, and precise control. That’s why Web Scraping Copilot takes a "partial autonomy" approach: AI that operates on a spectrum from 'assistant' to 'agent', multiplying productivity while keeping outcomes well defined. 📖 Read the full story → https://lnkd.in/dY3fYeQk #WebScrapingCopilot #WebScraping #AI #Scrapy #Zyte
-
If you work with web data, the last week probably raised as many questions as answers. 🤔 At Extract Summit 2025 in Dublin, engineers, researchers and legal experts got together to talk about where web scraping is really heading. A few themes stood out:

- AI is speeding up the hard parts: from reverse-engineering anti-bot systems to tools like Web Scraping Copilot, teams are going from months of work to minutes, while still keeping control of their code.
- The “dead internet” problem is real: as more of the web fills with synthetic content and bot traffic, models risk training on their own output. High-quality, authentic sources are becoming a strategic advantage.
- Access is now an arms race: IPs are a “weak signal” now. Fingerprints, behaviour and full “personas” are in play, raising the bar (and cost) for serious web data gathering.
- Details matter more than ever: one brittle selector or one legal misstep can have cascading consequences. An investigative mindset and strong compliance are becoming core scraping skills.
- Agents may need an “ID card”: with AI agents joining humans and bots on the web, ideas like Web Bot Auth hint at a future where authenticated, differentiated access becomes the norm.

If you couldn’t make it, here's a 5-minute recap. 📖 Read the full summary of Extract Summit 2025 → https://lnkd.in/dmRRMDM7 #ExtractSummit #WebScraping
-
Zyte Chief Operating Officer Suzanne Hassett closed Extract Summit Dublin by capturing the day’s biggest threads: ➡️ The rise of AI across every conversation ➡️ The growing sophistication of anti-bot systems ➡️ The evolving legal and ethical frameworks that will define the next era of web data. She reflected on how far the industry has come, and how fast it’s moving, while reminding everyone to approach AI’s potential with care and responsibility. And she did an excellent job as the emcee for #ExtractSummit2025 in Dublin. #ExtractSummit2025 #Zyte #WebScraping #AI #DataEthics #AIFuture #WebData #Compliance #Community #DataInnovation
-
John Rooney, Zyte Developer Engagement Manager, closed out Extract Summit Dublin with a story every engineer in the room recognized: how “doing it all yourself” in web scraping has become unsustainable. He walked through the shift from home-built scrapers to API-first architectures, where uptime, CAPTCHA solving, proxy rotation, and browser management are handled automatically so teams can focus on what matters: the data itself.

🔑 Key takeaways
- Complexity has exploded: AI-driven antibot systems, fingerprinting, and CAPTCHAs make DIY scraping an arms race.
- Stop firefighting: APIs reclaim developer time and eliminate 3 a.m. maintenance calls.
- Shift from maintenance to innovation: move talent from fixing to building value.
- Your clients don’t care about scrapers: they care about consistent, high-quality data.
- Reframe the trade-offs: visible API costs replace hidden TCO, “black box” risk is offset by SLAs, and hybrid models handle edge cases.

🚀 The takeaway: migrating to a web scraping API isn’t tactical anymore, it’s strategic. #WebScrapingAPIs #ExtractSummit2025 #WebScraping #DeveloperExperience #ExtractDataCommunity
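The shift described above can be sketched in a few lines. This is a hedged illustration, not any specific vendor's client: `call` stands in for a hypothetical web-data API request, and the point is that proxy rotation, CAPTCHA solving, and browser management all live behind the API, so the client shrinks to a call plus a retry policy.

```python
import time

def fetch_with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call a web-data API, retrying transient failures with exponential backoff.

    `call` is any zero-argument function that returns a result or raises.
    A real client would also treat HTTP 429/5xx responses as retryable;
    here we catch ConnectionError to keep the sketch self-contained.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
```

Compare this to the DIY equivalent, where the same loop would also juggle proxy pools, CAPTCHA solvers, and headless-browser lifecycles: that is the hidden TCO the talk reframed.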
-
Our Extract Summit Dublin legal panel featured moderator Sanaea Daruwalla, Chief Legal Officer at Zyte; Dr Nikos Minas, Global IP Counsel at Wesco; Dr. Bernd Justin Jütte, Associate Professor in Intellectual Property Law at UCD Sutherland School of Law; and Callum Henry, Legal Counsel at Zyte. The panel covered where AI and IP law are actually moving (EU AI Act, copyright, training data, TDM opt-outs), and what web-data teams should do now.

Key takeaways:
- EU AI Act ≠ one-size-fits-all. It’s risk-based: prohibited systems, high-risk systems, limited-risk, minimal-risk, or General Purpose AI, each of which carries different obligations. Most AI-assisted scraping tools land in limited risk → transparency obligations, including disclosing AI use and keeping records.
- Extraterritorial reality. If you deploy an AI system in the EU, you’re expected to comply with the EU AI Act, even if you are based outside the EU and trained the model there.
- Procurement/contracting implication. Downstream customers of web data (esp. GPAI) may have additional disclosure/record-keeping requirements under the AI Act. Keep provenance and TDM opt-out handling auditable.
- Individuals aren’t “unregulated.” The AI Act targets providers/users commercially, but ordinary laws (fraud, defamation, malicious comms, data protection) still apply to people using AI.
- Web scraping best practices still rule. AI-generated spiders don’t shift liability; you remain responsible for respectful crawling and web-scraping compliance.
- EU Copyright Directive matters. There’s an exception for text & data mining (TDM), unless rightsholders opt out. Be sure to comply with opt-outs where applicable.
- Anthropic snapshot (US). Scanning lawfully acquired books for training was found to be fair use; training on torrented/pirated books leaned against fair use (the case settled after the judge’s indicative views on the pirated books).
- Getty v. Stability (UK), which this panel highlighted. The court didn’t deal with primary copyright infringement, as the training took place outside the UK. On secondary infringement, the model’s learned statistical weights weren’t treated as copies of the images. An open question still pending elsewhere: is the initial copying during training an infringement under UK/EU law?

#AICompliance #WebScrapingLaw #DataEthics #Copyright #DataProtection #AIRegulation #EUAIAct #LegalTech #EthicalAI #ExtractSummit2025
-
IPv6 isn’t a “future thing” for scraping. It’s here, widely deployed, and it changes how sites see you. Yuli Azarch, CEO at RapidSeedbox, unpacked how websites bucket IPv6 traffic, why naïve rotations fail, and the exact block sizing/config patterns that actually raise success rates and lower costs.

Key takeaways:
🌍 Adoption is real: ~46% of global Google traffic now runs over IPv6; major sites (Google, YouTube, Amazon, etc.) accept it.
🧠 Think in /48s, not single IPs: most sites treat a whole /48 as one “identity boundary.” Abuse it and they’ll block the /48, not a lone address.
🧰 Serious scale starts at /29: use an IPv6 /29 (≈524,288 /48s). Route it, split into /48s per job/session, and rotate within /64s.
🔄 Rotate the right way: don’t blast all traffic through one /48 just because it has 1.2 septillion addresses. Sites don’t care; they score the /48 as a unit.
🧾 rDNS helps trust: set reverse DNS (PTR) per /48 and keep forward/reverse consistent. Small config, measurable friction drop.
💸 Cost angle: IPv4 is scarce and pricey; IPv6 blocks can cut proxy cost per successful request when configured properly.

What teams should do now:
🧭 Inventory targets: confirm which sites are IPv6-enabled; plan dual-stack where needed.
🧩 Get the right blocks: acquire at least one /29, allocate dedicated /48s per scraper/project; rotate within /64s.
⚙️ Bind correctly: ensure proxies/scrapers truly bind outbound to IPv6; verify at packet level.
🏷️ Configure rDNS: delegate PTR per /48 and align with AAAA forward records.
🚦 Throttle & separate: isolate high-volume jobs across multiple /48s to avoid cross-contamination/blocks.
📊 Measure success: track success/error by /48, not just IP/request, to catch block-level issues early.

#ExtractSummit2025 #IPv6 #WebScraping #DataExtraction #Antibot #ProxyArchitecture #Networking #DevInfra
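The /29 → /48 → /64 arithmetic in that talk can be sketched with Python's standard `ipaddress` module. The prefix below is illustrative (built on the 2001:db8:: documentation range), and the helper names are our own; the point is allocating one /48 per job and rotating addresses within /64s.

```python
import ipaddress
import random

def plan_blocks(prefix, jobs):
    """Carve an allocation into one /48 per job, since sites tend to score
    a whole /48 as a single identity."""
    net = ipaddress.ip_network(prefix)
    subnets = net.subnets(new_prefix=48)  # lazily yields 2**(48 - prefixlen) /48s
    return [next(subnets) for _ in range(jobs)]

def random_address(block_48, rng=random):
    """Pick a random /64 inside the job's /48, then a random address in it."""
    idx = rng.getrandbits(16)  # a /48 contains 2**16 /64 subnets
    subnet_64 = ipaddress.ip_network(
        (int(block_48.network_address) + (idx << 64), 64)
    )
    return subnet_64[rng.getrandbits(64)]  # 2**64 addresses per /64
```

A /29 yields 2**(48 - 29) = 524,288 distinct /48 identities, matching the figure from the talk; each /48 in turn holds about 1.2 septillion (2**80) addresses, which is why rotating inside a single /48 buys you nothing.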
-
Fabien Vauchelles, creator of Scrapoxy, took the audience deep into the evolving arms race between bots and antibots. From TCP/TLS fingerprinting to self-healing spiders, he showed how modern scraping now demands both deep technical skill and intelligent automation to stay ahead.

Takeaways:
- Detection is easy; faithful emulation is hard: off-the-shelf libraries make detection straightforward, but matching those stacks cleanly demands kernel-level packet shaping and bespoke TLS, which is specialist work.
- Residential IPs aren’t the moat anymore: clean supply is commoditized; the signal now lives in low-level stacks and interaction patterns.
- On the browser side, open source helps… and hurts: projects that patch browsers at the C++ layer (e.g., custom Firefox builds) enable deep fingerprint control, but their signatures get cataloged quickly. The gold standard is truly custom builds: expensive, and niche talent is required.
- New fingerprints keep arriving: canvas/emoji glyphs → audio context → subtle rendering/driver quirks; antibots add “weirdness detectors,” not just 0/1 checks.
- CAPTCHA is solvable at scale with AI + good automation: in response, some sites are testing intrusive “proof-you’re-human” UX (e.g., webcam gestures). That blocks bots and users alike, so expect limited uptake outside high-risk flows.
- AI supercharges reverse engineering and upkeep: obfuscated JS and antibot flows that took months to analyze can be summarized in minutes by strong models, turning “what’s happening” into actionable diffs.
- Self-healing scrapers are here: Fabien’s MCP “Scrapy Inspector” pattern records requests/responses, then lets an LLM run targeted selector tests, propose code edits, and iterate until green, with no human touching the spider.
- Economics raise the bar: to really “look human,” you need custom browsers, TCP/TLS control, QA, and monitoring, i.e., serious capex/opex. Only large platforms (major aggregators, AI companies) can fund the full stack.
- Toward a more closed internet: big sites will increasingly gate access via authentication + commercial terms (agent-access protocols, allowlists), not perpetual blocker roulette.

☝🏻 What teams should do now:
🧭 Capture full request/response, TLS hints, and browser signals; make it searchable for both humans and LLMs.
🔁 Add an inspection middleware + LLM test harness that can detect failures → validate selectors/flows → propose and apply minimal patches → re-run.
🧱 Separate discovery from extraction; prefer product/internal APIs where possible to reduce breakage.
🛠️ Only build low-level TCP/TLS/browser capabilities for your most valuable targets; rent the rest via a high-grade access layer.
🔐 Prepare for auth deals: stand up token management, session stewardship, and throttled, respectful crawling to qualify for partner access.
🧍♂️💻 Plan for UX-hostile challenges: have alternates ready when sites deploy proof-of-human steps (like webcam gestures) that degrade real users.

#WebScraping #AntiBot #AI #LLM #Scrapy #ExtractSummit2025
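The "capture full request/response and make it searchable" step could look like this minimal sketch. It is plain Python of our own devising, not the actual Scrapy Inspector: a log that flags failed or selector-broken fetches so an LLM repair loop knows where to start.

```python
import json
import time

class InspectionLog:
    """Record request/response outcomes so a human or an LLM test harness
    can replay failures, re-check selectors against captured HTML, and
    propose minimal patches."""

    def __init__(self):
        self.entries = []

    def record(self, url, status, body, selectors_ok):
        """Append one fetch outcome; `selectors_ok` is whether extraction
        selectors still matched on this response."""
        self.entries.append({
            "ts": time.time(),
            "url": url,
            "status": status,
            "body": body,
            "selectors_ok": selectors_ok,
        })

    def failures(self):
        """Entries a repair loop should look at first: blocked requests
        and pages where the selectors silently stopped matching."""
        return [e for e in self.entries
                if e["status"] != 200 or not e["selectors_ok"]]

    def to_jsonl(self):
        """Searchable JSON-lines dump for humans and models alike."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.entries)
```

In a real pipeline this would sit as downloader/response middleware and also capture headers and TLS hints; the repair loop then iterates propose-patch → re-run until `failures()` is empty.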