Hot take? “Industry self-regulation” for AI is a good intention that just won’t work
As a society we’ve become lax in considering our own data privacy. Our data is used all over the internet. But with Generative AI, that data can be used not just as facts, but in the creation of new information. False information. Fighting against that is an incredibly complex and expensive technical challenge.
Deciding to deal with data privacy is not something I’d like to leave up to the whims of technology executives. (And I am one.)
To date there are two broad options for regulating how data privacy is protected: government intervention and/or industry self-regulation. As of Jan 2025, the U.S. has only self-regulation, and that scares the crap out of me — because the motivation for tech to do so is left to the intentions of CEOs. And even good intentions simply don’t work.
One of the most successful cases of industry self-regulation (outside of medicine) is Journalism. Though many iterations have expanded and clarified it, the profession’s code of ethics originates from the 1923 Canons of Journalism, created by the American Society of Newspaper Editors in response to backlash from false reporting. That accusation hit hard for a profession that originated during the American Revolution against the English Crown. The code was a natural consequence of a commonly desired outcome (reliable and factual information) and, importantly, can be governed by that same population.
Print something false, and someone can fact-check your ass and feel professional pride in doing so. The standard makes it clear what is expected, the professional code gives it the backbone to be on watch, and the nature of reporting means it’s actually possible to verify the information. The mechanism keeps journalists in check. Not so in tech.
The tech industry lacks the motivation, and the one-two punch is that Generative AI, especially from Large Language Models (LLMs), lacks traceability.
What motivates the tech industry? Capitalism.
Tech was built on making businesses more successful, for profit. The original companies prior to the 1911 formation of IBM built employee time-keeping systems among other B2B systems, and were acquired to form a more competitive IBM. Tech’s roots are in capitalism: revenue and market competition. Every tech company needs adoption growth, which is generated by the usefulness and value of its solution. Personalization is the bare minimum for achieving that value in a meaningful way, and that requires data, whether directly from the user, the customer, and/or publicly available sources.
Business executives and investors know that reputation has a direct impact on growth. But when it comes to governing data privacy, tech companies have traditionally pushed the boundaries and put the responsibility on the user. In other words — you are the data source, whether or not you understand the consequences.
Individuals and groups can’t control their own narrative
Accepted best practice for user account creation is for the company to request your consent for how it will use your data. But I submit that practice offers near-zero impact for data privacy now, given how we use technology today.
Some accounts are so embedded in our world that people are forced to give away their data in order to operate in modern society — your email, how you maintain contact with family members, how you build your professional network, where you find local information, and more. Further, these companies have existed long enough that most of us agreed to the use of our data long before AI was in play. Yes, you can modify settings. But only so far if you want to use those applications.
The value of those accounts has increased over time and adoption has thrived. That’s great for business, and even great for consumers… if the use of the application was the only consideration for how our data is being used. But our data is not solely used in that application (at a minimum, consider how many times you “log in” with one account to access another). Now our data is locked in, its usage ever-growing and our control over it ever-shrinking, even when builders have good intentions.
Builders aren’t the sole creators anymore
The purpose of Generative AI is to create: the output does not simply merge data, rather, something new is being created. The train of “thought” back to the origination of that new idea is not necessarily traceable. With the specific lens of data privacy, this means that reacting to the problems AI creates for individuals and society is pointless. The harm will have been done, and depending on the model, proliferated. By its very nature privacy is not something that can be retroactive. If we label data privacy errors as “AI is early and we are iterating,” then we have failed miserably to hold companies accountable for their creations.
The AI experiences that excite us and add value to our lives only work because of the vast amount of data these companies can access. So whose fault is it when the enormous data of the internet is ingested with false or biased data? Whose fault is it when hallucinations occur that are harmful to individuals or groups?
A Generative AI model’s inputs do not have a direct line to its output: the builder/owner cannot always trace false information back to the bad source or sources that created it. Again, the tool’s job is to create, not regurgitate.
And the companies building the LLMs are not the only ones using that data — other software companies use those LLMs to create and then often retrain the models without removing the bad data. Garbage in/garbage out is the oldest error in the books, but the amount of data being used to make LLMs function means it’s essentially impossible to properly classify and correct prior to use.
Self-regulation is not useless. It’s just not enough.
We do have worthy efforts in developing frameworks for self-regulation, from NIST and IAPP for example, and they are well-informed and articulated. But the only motivation tech companies have to follow them, other than good intentions (which depend on the values of the CEO), is reputation. We’re not driving a revolution here.
Even Journalism’s code of ethics was boosted in efficacy when the government backed and enhanced it in the Hutchins Commission Report in 1947. Without some type of intervention to motivate the tech industry, frameworks like NIST’s are helpful, but they run on good intentions alone.
I’ve not given up hope. There are ways to make progress, as demonstrated by NIST and by the European Union’s AI Act, which will have global influence. And more evidence is demonstrating that smaller language models with smaller data sets are actually more useful and accurate (which will reduce, but never remove, the risks to data privacy).
But LLMs are not going away. We need to be mindful of who we ask for self-regulation — think critically about their motivations and ask if that’s an acceptable driver for overcoming barriers to solving incredibly hard and expensive problems.
AI is changing faster than anything we’ve seen before. So maybe something effective will happen soon that supersedes the effectiveness of government regulations to create motivation for ensuring data privacy. I hope that’s the case.
✨ I’d truly love to hear alternative opinions on this! ✨