Web scraping: what it is, and why it doesn't work
One of the most satisfying parts of our job at Violet is talking with customers. We especially love it when we get to help our customers solve some of their thorniest problems, or ones they thought they just had to live with.
In these conversations, web scraping is a big offender. Anyone and everyone attempting to monetize affiliate links usually defaults to scraping as a way to get product information. What they often don’t realize is they’re digging their own grave, foreclosing meaningful scale and innovation in the process.
What is scraping?
To understand why ecommerce so desperately needs to retire scraping, it’s important first to understand how it works (or doesn’t).
Sometimes also called "web crawling," web scraping is the process of extracting data from a website and repurposing it on another site. While scraping could technically be done by a person manually, it is typically performed by a bot or script.
In the case of ecommerce, web scraping is the fundamental building block of most affiliate marketing: channels and aggregators scrape data from the web with a bot farm that scans a merchant site for product details (image, price, URL, etc.). Then, affiliate companies repurpose those extracted product details to create a new web page of their own, where they can drive traffic and sales and, hopefully, earn a commission.
Generally, scraping entails just two main steps:
Scan source code: Ever hit “view source” in your browser? If you have, you’ve seen the main source of a web scraper’s information: metadata, captured in HTML code. To scrape, a bot searches that code for certain tags and information it requires to build a product detail page (PDP), such as the item’s name, available variants like size, color, etc., and product media.
Extract specific data: Once the bot has identified the product details it needs, it extracts that information and downloads it into a database, with each type of information tagged and organized (e.g. name, make/model, color, size, media) so that it can use that database to create a new PDP on the channel’s site.
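To make those two steps concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not how any particular scraper is built: the sample page, the `merchant.example` URL, the choice of `<meta>` tags, and the names `ProductMetaParser` and `scrape_product` are all hypothetical, and a plain dictionary stands in for the database record.

```python
from html.parser import HTMLParser

# Hypothetical merchant page; real PDP markup varies widely from site to site.
SAMPLE_PDP = """
<html><head>
<meta property="og:title" content="Trail Runner Shoe">
<meta property="og:image" content="https://merchant.example/img/shoe.jpg">
<meta property="product:price:amount" content="89.00">
</head><body><h1 class="product-name">Trail Runner Shoe</h1></body></html>
"""


class ProductMetaParser(HTMLParser):
    """Step 1: scan the source for the <meta> tags the scraper cares about."""

    # Maps a tag's "property" attribute to the field name used in the record.
    WANTED = {
        "og:title": "name",
        "og:image": "image",
        "product:price:amount": "price",
    }

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        field = self.WANTED.get(attr_map.get("property"))
        if field and attr_map.get("content") is not None:
            # Step 2: extract the value and file it under a labeled field.
            self.record[field] = attr_map["content"]


def scrape_product(html):
    """Return a tagged record (a stand-in for a database row) for one page."""
    parser = ProductMetaParser()
    parser.feed(html)
    return parser.record


product = scrape_product(SAMPLE_PDP)
print(product)
```

Note what the record contains: a name, an image, and a price as text. Nothing in it says whether the item is in stock, what shipping costs, or how to place an order.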
While web data extraction has existed for decades, scraping for product data has accelerated over the past few years due to growth in ecommerce, where it is used to connect online channels with products they can present to their audiences.
At face value, scraping appears to save channels time by automating the process of scanning and duplicating PDPs. Considering that online merchants today work with over two hundred different platforms, scraping becomes the de facto solution for getting product information simply because it is the one most channels can afford.
But appearances can be deceiving. And though scraping does automate product data retrieval, it also creates a host of problems for shoppers, channels, and merchants alike as soon as someone actually wants to make a purchase.
No scale, no dynamics: the problem with scraping
Over the years at Violet, we’ve seen scraping create several issues for online channels hoping to monetize through affiliate marketing. Namely, scraping fails to address seven key features of a successful shopping experience:
Inventory: Data that is scraped from a website is static, and reflects only a snapshot in time. Without scraping again, channels have no way of knowing or relaying to shoppers what’s actually available for purchase.
Updates: Scraping bots rely on pattern recognition to read and update the information in their databases. So if a merchant updates the layout or design of their own PDP, scrapers can’t detect the information they need. One small revision can render a merchant page unreadable by the scraper, and thus untranslatable to the channel’s site.
Tax & shipping: PDPs don’t contain information about shipping providers and warehouse locations, so a channel working from scraped product information cannot accurately estimate (let alone charge) shipping and tax fees. Most often, the channel eats the cost to avoid any unpleasant surprises for its shoppers.
Payments: Scraping is a one-way transfer of information from the merchant’s site to the channel’s database: it can’t send information back. So when it comes time for a shopper to pay for their order, channels either redirect the shopper to the merchant’s site (creating friction) or collect payment directly from the shopper. Collecting payment makes the channel the merchant of record, which can come with a host of liabilities.
Orders: If a channel collects payment from the shopper, they still need to manually re-order the products from the real merchant(s). This means they have to serve as an intermediary between the merchant and the shopper, confirming the order with each party for every single purchase.
Tracking, exchanges, & returns: Because scraping doesn’t send information back, the entire post-purchase experience must still be manually coordinated by the channel. Any shipping tracking, exchange, or return must run first through the channel, then through the merchant. If an item needs to be returned, the channel refunds the shopper and then must request a refund from the merchant, etc.
Payouts: The entire purpose of scraping is to help channels monetize their own sites by making a commission on third-party sales. But scraping in no way tracks or relays information between merchants and channels about what or how much is sold. Filling this gap is the entire reason affiliate networks exist, and they are prone to leakage, attribution error, and fraud. Channels that handle it themselves typically keep track of what they sell and invoice merchants individually on a monthly or quarterly basis, adding operational overhead, delays, and headaches.
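The inventory and update problems in the list above are easy to reproduce. The following Python sketch is illustrative only: the markup, the CSS class names, and the `scrape_price` helper are hypothetical, and real scrapers use more elaborate matching, but the failure mode is the same whenever extraction depends on a page's layout.

```python
import re

# The scraper "recognizes" a price by the markup pattern it saw when it was
# written. The class name "price" is an assumption made for illustration.
PRICE_PATTERN = re.compile(r'<span class="price">\$([0-9.]+)</span>')


def scrape_price(html):
    """Return the price as a float, or None if the pattern no longer matches."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None


# Hypothetical PDP fragments before and after the merchant redesigns the page
# (and, in this example, also lowers the price).
before_redesign = '<div><span class="price">$89.00</span></div>'
after_redesign = '<div><span class="pdp-price__amount">$79.00</span></div>'

snapshot = scrape_price(before_redesign)  # value stored in the channel's database
latest = scrape_price(after_redesign)     # re-scrape after the redesign

print(snapshot, latest)
```

After the redesign the re-scrape finds nothing, so the channel is left showing the old snapshot of a price that has already changed. The same applies to stock levels, variants, and media.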
Each of these gaps on its own would create challenges for an online channel trying to generate reliable business through affiliate marketing. Together, they compound: the more products a scraper extracts, the more overhead it creates for itself with re-orders, payments, returns, and payouts. It also means that on big shopping days like Black Friday or Super Saturday, the volume of work required to keep up can overwhelm already stretched resources.
In conversations with online channels, we’ve seen some companies where nearly half of the headcount is devoted entirely to order fulfillment. These are companies that are explicitly not merchants, but that take on those responsibilities simply because they have no better operational model.
What appears initially to be a quick fix ends up requiring more and more stop-gap solutions of its own, drowning companies in overhead and siphoning headcount away from innovation and into patching a broken system.
Built for advertising, not for shopping
If scraping is so ineffective and error-prone, why does everybody use it? Simply put: until very recently, it’s been the only option out there. What was designed for an advertising-centric internet is currently the only tool available for those attempting to build a commerce-centric internet.
Specifically, web scraping was created for tagging products. Online advertisements traffic in representations only: when you buy an ad unit, you’re just buying the digital depiction of the item, not the actual item. You don’t need to know the inventory, shipping information, or payment info. No physical product needs to change hands.
But we’ve now reached the point where people’s behavior and expectations have outpaced existing commerce infrastructure: they not only expect to discover products and services on the internet, they expect to be able to purchase them there as well. To close the gap between discovery and purchase, we need a completely different infrastructural paradigm, one based on integration.
Unified integration: built for ecommerce
Violet’s unified ecommerce API is infrastructure built for ecommerce, by engineers who understand the challenges and needs of both channels and merchants firsthand. Our single API integrates with many of the most popular ecommerce platforms, allowing direct connection between merchants’ inventories and channels’ front-end experiences. Instead of just extracting a PDP, Violet addresses every key feature of a successful shopping experience, from discovery all the way through order fulfillment and payout:
Inventory: Because the channel is directly connected to a merchant’s backend, the PDP syncs with inventory information instantly, so shoppers (and channels) always know what’s in stock.
Updates: Updates to a merchant’s website have no effect on their ability to read and exchange product and order information with a channel. Because the data comes directly from their backend, the integration is frontend-agnostic.
Tax & shipping: Violet’s API also pulls all relevant shipping & tax information, so shoppers are never surprised and channels never have to eat unnecessary fees.
Payments: Because Violet connects a channel’s checkout experience directly with the merchant’s backend, payment can be directly facilitated with the merchant: no duplicate payments, no serving as a merchant of record.
Orders: Orders placed with Violet are indistinguishable from orders placed on a merchant’s website. This means the merchant receives the order directly from the shopper, along with all the necessary contact, shipping, and billing information, with no middleman necessary.
Tracking, exchanges, & returns: Similarly, all order tracking, exchanges, and returns are handled directly by the merchant. The channel is not involved once the order is placed.
Payouts: Violet also has the ability to automate payouts between merchants and channels, so neither side gets bogged down in extra paperwork or processing.
The end result is a seamless, automated infrastructure that meets shoppers where they are, and frees channels and merchants to do what they do best. Instead of time- and resource-consuming “automations,” online channels and aggregators get a frictionless, responsive foundation that finally allows them to scale with ease and devote their resources to innovating and improving their core product.
At Violet, we’re working diligently to create a shopping-forward internet, where integration rules the day and scraping becomes a thing of the past.