One of the newest ways that social media companies are monetizing user data is through deals with AI companies. But is there anything that ordinary users can do to protect their data and content?

Social Media Platforms Reach Deals With AI Companies

Using social media data to train generative AI models has been a controversial move—but this doesn’t seem to be stopping social media companies from handing out user data.

Meta already uses social media data to train the generative AI features announced at Meta Connect in 2023. This includes Meta AI and features like creating AI-generated stickers on WhatsApp.

As Mike Clark, Director of Product Management at Meta, stated in a Meta Newsroom post:

“Publicly shared posts from Instagram and Facebook—including photos and text—were part of the data used to train the generative AI models underlying the features we announced at Connect.”

This trend doesn’t seem to be slowing down in 2024. According to Reuters, Reddit reached a deal with Google to make the social media platform’s content available for training AI models.

Reddit’s S-1 filing for its IPO, filed in February 2024, confirms that the company is exploring licensing deals. The filing states:

“Reddit data is a foundational piece to the construction of current AI technology and many LLMs. We believe that Reddit’s massive corpus of conversational data and knowledge will continue to play a role in training and improving LLMs.”

It specifies that Reddit is “in the early stages of allowing third parties to license access to search, analyze, and display historical and real-time data from our platform” in order to train LLMs.

And while Meta and Reddit are some of the biggest names in social media, they are not the only platforms involved in using social media data to train AI. According to a report by 404 Media, Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI.

Can You Stop Platforms From Selling Your Social Media Data for AI Training?

Chances are that if you use Facebook, Instagram, Reddit, Tumblr, or WordPress.com, your publicly available content has already been used in the training of LLMs.

For example, if you use the Washington Post’s search tool to see which sites were included in Google’s C4 dataset, which was used as part of Bard’s training, you’ll see that Reddit.com accounts for 7.9 million tokens.

Tumblr.com accounts for 1.6 million tokens. My own small website, which uses WordPress.com, accounted for 14,000 tokens—so small personal blogs may have been included in the dataset.
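If you want to go beyond the Washington Post’s tool, the C4 dataset is also publicly available on Hugging Face, so it’s possible to check for a domain yourself. Below is a minimal Python sketch (not the Post’s tool, and only a sample-based check) that streams part of the allenai/c4 dataset and counts how many documents come from a given domain; the target domain and sample size are placeholders you would adjust.

```python
# Minimal sketch: stream a sample of Google's public C4 dataset from Hugging Face
# and count how many documents were scraped from a given domain.
# Assumes the `datasets` library is installed (pip install datasets).
from urllib.parse import urlparse

from datasets import load_dataset

TARGET_DOMAIN = "reddit.com"   # hypothetical domain to look for
SAMPLE_SIZE = 100_000          # scan only a slice; the full dataset is very large

# Streaming avoids downloading the whole dataset up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

matches = 0
for i, doc in enumerate(c4):
    if i >= SAMPLE_SIZE:
        break
    host = urlparse(doc["url"]).netloc.lower()
    if host == TARGET_DOMAIN or host.endswith("." + TARGET_DOMAIN):
        matches += 1

print(f"{matches} of the first {SAMPLE_SIZE:,} documents came from {TARGET_DOMAIN}")
```

Note that this only samples the dataset, so a count of zero doesn’t prove a site is absent; the Washington Post’s tool remains the simpler way to look up a specific domain.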

With these ongoing deals between AI companies and social media platforms, this data will now be actively licensed and sold rather than just scraped off the web.

But when it comes to future processing, what can you do about it? Meta has introduced a form for generative AI data subject rights that allows you to object to or restrict the processing of your personal data from third parties for training Meta’s generative AI models.

Notably, this option doesn’t let you object to Meta’s own first-party processing of your data for training generative AI. Furthermore, when I used the form to object to the use of my personal data, the resulting support ticket required me to prove that my personal information was already appearing in Meta’s generative AI results.

Tumblr has also introduced an option to opt out of sharing your public blogs' content with third parties using your blog settings. You can find it in your settings by clicking on your blog and scrolling down to the Visibility settings. Then choose to Prevent third-party sharing for your blog.

When it comes to a platform like Instagram, you could try to switch your Instagram account to private to prevent the use of your data. This doesn’t guarantee that your data won’t be used, but since data scraping for LLMs seems to focus on public data, it could be a potential safeguard.

You can also make your X (Twitter) account private, but once again, this is just a potential safeguard and doesn’t guarantee your data remains private.

A joint statement by various national information commissioners and experts around the world has also suggested some actions for individuals looking to minimize the privacy risk of data scraping by AI companies, such as understanding and managing the privacy settings each platform offers.

You can also delete certain information online if you’re not comfortable with third parties having access to it, though publicly available information on your profiles may have already been scraped.

Unfortunately, there is only so much we as regular users can do to protect our data from AI companies. Real control over this information will likely only come with the help of regulators.