There's been a lot of conversation recently about leveraging web proxies to prevent data loss. Cloud Access Security Broker (CASB) and Secure Access Service Edge (SASE) proponents suggest that, essentially, DLP (Data Loss Prevention) can be enforced by web proxies (forward or reverse). In my humble opinion, this is an exceptionally poor architectural choice, delivering a non-functional “solution” that reflects the flawed thinking behind it.
My rationale for this bold statement? Beyond many years of experience with both web proxies and DLP technology, consider the following:
- Web proxies are simply not designed for this. They’re high-performance data-path engines that add minimal latency to user requests and responses from websites. They’re not designed for complex content-filtering tasks, such as decomposing a PDF file buried under multiple layers of gzip and tar archives. To address these tasks, firewall vendors such as Check Point created the OPSEC initiative in the late ‘90s and introduced CVP (Content Vectoring Protocol) to offload content inspection to a 3rd-party engine, such as an AV product, often running on another host.
Eventually, a variant of CVP called ICAP (Internet Content Adaptation Protocol) emerged as the de facto standard for web proxy vendors to integrate with an AV or content-filtering engine. Although logical by design, a significant problem soon emerged: the inspection engine had to “see” the entire file before it could decide on its “goodness”. Bearing in mind that an individual file could be hundreds of megabytes, this approach impacted user experience and led to cumbersome keep-alive hacks within the proxy stack to prevent the client or server from timing out. Despite multiple attempts at custom integrations and acquisitions among web proxy and content-filtering vendors, these problems persisted.
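To make the buffering problem concrete, here's a minimal Python sketch (class and function names, the verdict rule, and the chunk count are all invented for illustration) of an ICAP-style flow: the inspection engine can only render a verdict after the last byte arrives, so the proxy must hold everything, sending keep-alives in the meantime.

```python
class IcapStyleInspector:
    """Toy stand-in for an ICAP content engine (hypothetical names):
    it cannot judge a payload until it has buffered every byte."""

    def __init__(self):
        self.buffer = bytearray()

    def feed(self, chunk: bytes) -> None:
        self.buffer.extend(chunk)  # no verdict yet - must see the whole file

    def finish(self) -> str:
        # A verdict is only possible once the entire payload is assembled.
        return "block" if b"SECRET" in self.buffer else "allow"


def proxy_stream(chunks):
    """Proxy data path: every chunk detours through the inspector, and the
    client receives nothing until the final verdict - hence the keep-alive
    hacks needed to stop either side from timing out."""
    inspector = IcapStyleInspector()
    held = 0  # chunks the proxy must buffer while the client waits
    for chunk in chunks:
        inspector.feed(chunk)
        held += 1
    return inspector.finish(), held


# A file streamed in 300 chunks: all 300 are buffered before any verdict.
verdict, held_chunks = proxy_stream(b"x" * 1024 for _ in range(300))
```

The larger the file, the longer both connections hang open with zero bytes delivered - exactly the user-experience problem described above.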
- The second problem is that DLP enforces policy after content inspection. A key aspect of that is content classification, i.e. establishing whether a file or unstructured piece of data includes sensitive data. None of the methodologies used to do this today (document fingerprinting/sub-fingerprinting, regex/keyword matching, text classification using ML/NLP) works efficiently in cooperation with a web proxy. Here’s why:
- Light changes to a document alter the fingerprint/sub-fingerprint, yielding an ever-growing matching database that quickly becomes unmanageable and impractical to deploy at scale.
- Regex/keyword matching generally leads to significant false positives. Think about it: the word ‘bank’ in a document could mean a place to manage money, a river bank, or to ‘bank’ my future on unwieldy technology. Where is the context to disambiguate even this small example? Now imagine relying on CASB/SASE vendors to manage this complexity.
- A key problem with machine-learning DLP is that it requires continuous training to learn what is or isn’t sensitive, using statistical ML and/or natural language processing (NLP) to classify content. Applying this learning within a complex workflow involving a 3rd-party SASE/CASB service comes with significant - and obvious - challenges. To illustrate the point, let’s take a classic problem that even the best DLP solutions today can’t handle:
“My company’s 4th quarter financial results are confidential ‘til 4pm EST on the 4th Tuesday after the quarter ends. After that, it’s public information.”
Enforcing this, even with an on-prem solution and eyes on glass, is extremely difficult. Relying on a 3rd-party web-proxy DLP to accomplish it is risky - or even negligent.
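To illustrate the fingerprinting point above, here's a rough Python sketch (the windowing scheme and document text are invented for illustration): a one-character edit invalidates the whole-file fingerprint entirely, and even a sub-fingerprint scheme must absorb new entries with every revision, which is how the matching database grows without bound.

```python
import hashlib


def fingerprint(doc: bytes) -> str:
    """Whole-file fingerprint: any edit changes it completely."""
    return hashlib.sha256(doc).hexdigest()


def sub_fingerprints(doc: bytes, window: int = 8) -> set:
    # Hypothetical sub-fingerprinting: hash every overlapping byte window.
    return {hashlib.sha256(doc[i:i + window]).hexdigest()
            for i in range(len(doc) - window + 1)}


original = b"Q4 revenue was 12.4M USD, up 9 percent."
edited   = b"Q4 revenue was 12.5M USD, up 9 percent."  # one character changed

whole_file_match = fingerprint(original) == fingerprint(edited)  # no match

# Every revision adds fresh windows the matching database must also store.
new_windows = sub_fingerprints(edited) - sub_fingerprints(original)
```

A single changed byte produces no whole-file match and a handful of brand-new sub-fingerprints; multiply that by every edit of every tracked document and the database balloons.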
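The keyword-matching point is just as easy to demonstrate. A minimal sketch, assuming a naive ‘bank’ rule of the kind such products ship (the rule and sample documents are invented for illustration):

```python
import re

# Naive keyword rule: flag any document mentioning "bank" as financial.
rule = re.compile(r"\bbank\b", re.IGNORECASE)

docs = [
    "Please wire the funds to our bank account by Friday.",  # truly financial
    "We picnicked on the river bank all afternoon.",         # false positive
    "I wouldn't bank my future on unwieldy technology.",     # false positive
]

# All three match: the pattern has no way to tell the senses apart.
flagged = [d for d in docs if rule.search(d)]
```

Two of the three hits are noise, and nothing in a regex engine supplies the context needed to tell them apart.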
- It's publicly documented that services such as Dropbox use an optimized protocol (i.e. de-duplication on top of HTTP) to upload files and avoid sending unnecessary data, meaning that large files are often split into chunks and only novel chunks cross the wire. A network proxy sitting in the middle doesn’t even see the entire file - and this incomplete view is what DLP, enforced via web proxy, would rely on for enforcement.
Another example is a CASB intercepting responses from a SaaS application to the user and re-writing HTTP links within the page to force subsequent user communications through filtering web proxies. In the real world, no CASB has a maintainable offering that re-writes ALL links within a response. Often, links are constructed on the fly; try figuring that out in real time within a reverse proxy - it would be a nightmare. In other words, not all HTTP communication from the user’s browser is being intercepted by CASB-directed web proxies.
How can such a broken paradigm be suitable for enforcing DLP when not all web traffic is visible?
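And a sketch of the link-rewriting gap (the proxy hostname and sample page are hypothetical): a rewrite pass catches static hrefs but has no way to touch URLs the page assembles in JavaScript at runtime, so those requests bypass the filtering proxy entirely.

```python
import re


def rewrite_links(html: str, proxy: str = "https://proxy.example") -> str:
    # Hypothetical CASB rewrite: prefix every *static* href with the proxy.
    return re.sub(r'href="(https?://[^"]+)"',
                  lambda m: f'href="{proxy}/{m.group(1)}"', html)


page = '''<a href="https://app.example/report">Report</a>
<script>location = "https://" + region + ".app.example/export";</script>'''

rewritten = rewrite_links(page)
# The static link is rewritten, but the script-built URL is untouched,
# so that request never flows through the CASB-directed proxy.
```

The static anchor now routes through the proxy; the dynamically constructed export URL does not - and that is one click the DLP enforcement point will never see.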
Poor design amplifies the problem
DLP is a complex problem that can’t be reduced to “Is this content sensitive or not?” Even though a piece of content within a document might be sensitive, if it exists for legitimate purposes, downstream stakeholders still need to read it, edit it, or forward it to trusted third parties, such as law firms.
From a broader perspective, it would seem that DLP functionality (inline observation, action, and enforcement) is best delivered from an observation point such as an endpoint or within a browser, where one has much more context and metadata. Unfortunately, none of today’s DLP products is designed for this, and when they try to function this way they deliver a deluge of false positives or bloated keyword/regex dictionaries.
Even endpoint DLP offerings have no visibility into what happens within a browser, where some of the most sensitive content originates. Hacking DLP functionality into web proxies, or into any latency-sensitive data path, is a poor design choice that amplifies the problem by adding adjacent complexity and resource requirements. It may deliver a good demo, but it is totally impractical in the real world.