Is Amazon's data lake secure? [ref]

Tags:  Archive

It's a little difficult to piece together from the details presented in this Politico piece, but the crux of the referenced article is that Amazon, Inc. (the consumer properties, not Amazon Web Services) isn't fully adhering to Data Rights like the right of erasure, simply because they have no idea where all of a user's data is and have a cultural unwillingness to invest in finding it. Their data inventory practices are insufficient because their culture of frugality and obsession with the end-consumer experience prevents them from caring about it, and punishes and excludes those who do. Three engineers who championed these issues internally were shut down, told to stop looking for problems, sidelined, and eventually forced to exit the company.

The three engineers sourced for this article believe they were sidelined and eventually quit Amazon because they raised these issues repeatedly and up through leadership chains. I have seen similar things in big tech (both the technical issues raised and the sidelining and political gamesmanship), and my feelings on big data are pretty clear: Tech is Not Ready for New User Rights. I never came across anything so egregious that I felt the need to blow the whistle, but I can still see echoes of this article's claims throughout my own experience and in what I've gleaned through my time in the industry. One of the under-accounted risks of big data is that it has a habit of getting away from you: storage costs trend towards zero, so there is a great incentive to simply copy data when you're modifying it, and without proper inventory controls those copies can escape audits or take on a life of their own.

The tools required to implement proper access controls are only starting to scale up now, and I think companies running on the bleeding edge can implement Data Rights, and specifically Data Access Rights, to their full extent. However, along with the simple cost of hardware, the biggest implementation blocker on large roll-outs of corporate data governance is still dealing with internal customers who are unwilling or unable to migrate to more secure access patterns. It's not just technical security at play here: these engineers clearly think quite highly of their own organizations' work and speak to the strength of the security controls, if only the end-user could apply them properly. This, despite the fact that companies have for years now been leaving large databases of user data publicly accessible in EC2 and S3, usually through laziness or misconfiguration[1] but always inherently aided by poor tooling UX and incomprehensible documentation.

Regarding the difficulty of proper account lifecycle and systems access: "we found hundreds of thousands of accounts where the employee is no longer there but they still have system access" is quite a claim, especially if it's about access to the data lake. I can believe the claim that IT systems and account offboarding missed certain types of account integrations, or that certain access keys were not being invalidated, perhaps IAM security keys with access to the data lake AWS account. Incidentally, offboarding automation is probably also chronically underfunded.
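
To make the offboarding gap concrete, here is a minimal sketch of the kind of cross-check I have in mind: compare the HR roster against the list of credentials that are still active. The `Employee` and `Credential` records, their fields, and the `orphaned_credentials` function are all hypothetical names of mine, not anything described in the article or any real Amazon system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Employee:
    employee_id: str
    termination_date: Optional[date]  # None while the person is still employed


@dataclass(frozen=True)
class Credential:
    credential_id: str
    owner_id: str   # employee_id the credential was issued to
    system: str     # hypothetical system name, e.g. "data-lake"
    active: bool


def orphaned_credentials(employees, credentials, as_of):
    """Return active credentials whose owner has left or is unknown to HR."""
    by_id = {e.employee_id: e for e in employees}
    orphaned = []
    for cred in credentials:
        if not cred.active:
            continue  # already revoked, nothing to flag
        owner = by_id.get(cred.owner_id)
        if owner is None:
            orphaned.append(cred)  # no matching HR record at all
        elif owner.termination_date is not None and owner.termination_date <= as_of:
            orphaned.append(cred)  # owner left on or before the audit date
    return orphaned
```

The hard part in practice is not this loop, it's getting every system that issues keys to report into one inventory in the first place.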

"You've got tens of thousands of … teams connecting to big data." "You should have a way to follow all the different types of data. From a technology point of view, you need to know where the data is going and how it's being protected. That does not exist." These claims about the security controls and the difficulty of managing access to the data lake tell me that they couldn't get secure authentication rolled out to all of their internal customers. That left a handful of unclosable, unsecured back doors into the data lake, and no will to invest in fixing those customers' systems, where users were either anonymous, could masquerade as someone else, or could evade authenticating securely. A system like this is unfortunately only as good as its weakest link, and it's a terrible position to be in from a threat-modeling perspective: the people most likely to be able to engage in espionage like this are the people technically capable of doing it and the people most likely to have the security keys lying around in the first place. All you lack is a motive at that point.

Even with access management in place, there are technical challenges surrounding data tagging and carrying those tags through the lineage of copies and modified subsets of data sets, but it should be possible at the very least to audit queries and table consumers, even at the scale Amazon operates at. Frankly, if the "second to none" data security and privacy teams are unwilling, or it is somehow economically or technically infeasible, we're all fucked. It can't be as easy as making copies of the data into a new S3 bucket and then downloading that S3 bucket at home, and it can't be as simple as plugging the IAM tokens into the S3 client libraries at home and downloading the files. I simply don't believe that, but three engineers thought it was serious enough to risk their careers raising it to management.
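
As a rough illustration of what "carrying tags through lineage" means, here is a small sketch that pushes sensitivity tags from source data sets down to every copy or subset derived from them. It assumes you already have a lineage graph, which is exactly the part the article says is missing; the `propagate_tags` function and the data set and tag names are made up for the example.

```python
from collections import deque


def propagate_tags(lineage, source_tags):
    """Compute effective tags for every data set reachable in the lineage graph.

    lineage:     dict mapping a data set name to the data sets derived from it
    source_tags: dict mapping a data set name to the tags applied to it directly
    Returns a dict of data set name -> set of direct and inherited tags.
    """
    effective = {name: set(tags) for name, tags in source_tags.items()}
    queue = deque(source_tags)
    while queue:
        parent = queue.popleft()
        for child in lineage.get(parent, []):
            child_tags = effective.setdefault(child, set())
            missing = effective[parent] - child_tags
            if missing:
                # The child inherits anything new, and its own children
                # need to be revisited so the tags keep flowing downstream.
                child_tags |= missing
                queue.append(child)
    return effective


# Hypothetical lineage: a raw table, a cleaned copy, a sample, and an export.
lineage = {
    "orders_raw": ["orders_clean", "orders_sample"],
    "orders_clean": ["marketing_export"],
}
tags = propagate_tags(lineage, {"orders_raw": {"customer-data"}})
```

With tags resolved like this, an erasure request or an access audit can at least enumerate every downstream copy instead of only the table someone remembered.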

You can't legislate big data out of existence any more than you can legislate AI or financial crime out of existence, and these behemoths aren't going to be weaned off the big data diet by Ethereum decentralized apps; the interoperable ecosystems only go as far as their killer apps take them right now. Even if users switched from Facebook to their community-managed Mastodon nodes, data brokers and the government and your regional bank and some consultant with a dozen servers aren't going to willingly switch to building systems without unaudited and unprincipled use of big data like this, despite preaching their love of privacy by design. Despite the public conversation and growing consensus around the need for new data rights, the Graph of Rational Actors Betting on Our Future has done little to course-correct these companies' hungry diets in spite of increasingly massive breaches, and so these companies have managed to scrape by with such small investment. But then none of that stuff is really rational or fair, is it?

My opinion on the matter? Forget it being about "personal data" or "PII" and making sure every data point falls into or out of this arbitrary legal category; this is a fool's errand that only gets harder as more data piles up. These companies have already managed to so constrict their interpretation of PII worthy of data rights that the majority of data powering this awful machine would already be exempt and the gears would continue turning. We can't afford a permanent record, but these data-hungry companies can because that diet feeds their profit. I think all data should get an expiration date by default. The machine learning models and clickstreams and analytics and "legitimate interest" data processes can't get a free pass, as they power a panopticon and put us in danger of exposure, and as this article shows, an individual right of erasure isn't sufficient.
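
To sketch what "expiration by default" could look like mechanically: every data set gets a retention window whether or not anyone asked for one, and keeping it longer requires an explicit override. The 90-day default, the `Dataset` record, and the `expired` helper are all assumptions of mine for illustration, not a proposal from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed default; the point is only that some default exists.
DEFAULT_RETENTION = timedelta(days=90)


@dataclass(frozen=True)
class Dataset:
    name: str
    created_at: datetime
    retention: Optional[timedelta] = None  # explicit, reviewed override

    def expires_at(self) -> datetime:
        window = DEFAULT_RETENTION if self.retention is None else self.retention
        return self.created_at + window


def expired(datasets, now):
    """Return the names of data sets whose retention window has passed."""
    return [d.name for d in datasets if d.expires_at() <= now]
```

The design choice that matters is the direction of the default: nobody has to argue for deletion, somebody has to argue for keeping.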

How do we down-size the big data diet? What are the trade-offs? What are some of the first- and second-order effects of doing so?

They say that they didn't start centrally planning for General Data Protection Regulation enactment until the 11th hour, and it appears they still aren't done doing their data governance homework even if they think they've got a passing grade. Not having a dedicated team for GDPR strategy doesn't mean they hadn't spent time preparing for it in a coordinated yet decentralized manner; it just shows how simplistic the company thought the issue at stake was and how few systemic changes the governance team was able to institute to an effective degree. There is no Privacy by Design on display here. Tracking the lineage and use of big data is no small task, technically and politically speaking. Hadoop's authentication system, for example, is known to be difficult to work with, especially outside of the Java and Windows ecosystems that these systems were largely designed around. But, for better or worse, not everything can be built in those ecosystems, some of those systems still need access to the data lake, and the path forward is dark and full of terrors. And user and role management in Hadoop without an enforced authentication system is worse than useless[2]: it's misleading in its guarantees.

And so we see poor risk management dovetailing nicely with poor governance… To the business and shareholders and corporate values it's good risk management at least until they have a massive data breach!

This article lays out what I think is a pretty accurate portrayal of how Amazon's corporate culture of gamesmanship plays out, and of how the industry has underfunded compliance and security improvements that it felt were cheaper to lobby around or ignore than to fully address. Amazon has quite a reputation as a difficult, very intense, and very ideological place to work among folks I know, including ex-Amazon employees in the Seattle and Bay Area tech scenes. I know myself that Uber learned so many of these tactics in 2015 and 2016 from tepid Amazon middle-managers and big tech careerists we hired, and they put them to such awful ends. Wielding your corporate cultural values as a weapon in retaliatory meetings, or only to further your project or career? Hell yeah, we can do that! Lobbying a standards body to put off security improvement work? Now that's not even innovative. Of course an honest whistleblower would be having "ongoing performance issues at the company" and decide to leave. FOH.

Both former U.S.-based employees were told at one point by their direct management to “stop looking for problems,” even though they were required to do exactly that under multiple laws, regulations and industry requirements and despite the fact that they could be personally liable for issues.



  1. laziness: i want to run this dashboard at home/i don't want to run elasticsearch on my laptop

    misconfiguration: i dare you to get an IAM policy with custom ARN limitations to restrict access to S3 objects working properly on the first try, i fucking dare you.↩︎

  2. for example of the complexity in operating something like this, and the absolute madness of not doing so:

    This allows any user to trivially bypass the HDFS security model or to change file permissions at will:

    $ whoami
    $ hadoop fs -chown dwoods /user/admin/test2
    chown: changing ownership of '/user/admin/test2': Permission denied. user=dwoods is not the owner of inode=test2
    $ HADOOP_USER_NAME=hdfs hadoop fs -chown dwoods /user/admin/test2
    $ hadoop fs -ls -d /user/admin/test2
    drwxr-xr-x   - dwoods admin          0 2015-07-21 22:11 /user/admin/test2

    The HDFS file system authorization model is useless without proper authentication.


This is Referenced


in "Essays"

2021 Is Amazon's data lake secure?