Security researchers found that datasets used by companies to develop large language models included API keys, passwords, and other sensitive credentials.
Large language models are dominating the online landscape, with companies promoting AI solutions that claim to solve all problems.
For an AI to be effective, it needs extensive training data, much of which is gathered from the Internet by specialized companies and organizations.
Common Crawl gathers information from the internet and provides it as datasets that companies use to train their AI. That data may include sensitive information.
Researchers from Truffle Security discovered that credentials, API keys, and passwords are exposed in Common Crawl's dataset. The main issue is that some web developers hardcode sensitive information into websites, which then ends up in LLM training data.
Researchers discovered 11,908 live secrets, such as API keys and passwords, across 2.76 million websites.
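To illustrate how hardcoded secrets in page source can be spotted, here is a minimal Python sketch that searches HTML or JavaScript text for strings that look like credentials. It is not Truffle Security's method: the two patterns, the function name `find_candidate_secrets`, and the sample page are assumptions made for illustration, and the sketch only flags candidates without checking whether a key is still live.

```python
import re

# Illustrative patterns only (an assumption for this sketch);
# real scanners ship far more detectors than these two.
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A variable or property named like "apiKey" / "api_key" assigned a long string.
    "generic_api_key": re.compile(
        r"""(?i)\bapi[_-]?key\b\s*[:=]\s*["']([A-Za-z0-9_\-]{16,})["']"""
    ),
}


def find_candidate_secrets(page_source: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_value) pairs found in the page source."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(page_source):
            # Use the captured group when the pattern defines one,
            # otherwise the whole match.
            value = match.group(1) if match.groups() else match.group(0)
            findings.append((name, value))
    return findings


# Hypothetical front-end snippet with placeholder (non-working) keys.
SAMPLE_PAGE = """
<script>
  const apiKey = "abcd1234efgh5678ijkl";
  const awsKey = "AKIAIOSFODNN7EXAMPLE";
</script>
"""

if __name__ == "__main__":
    for name, value in find_candidate_secrets(SAMPLE_PAGE):
        print(f"{name}: {value}")
```

A pattern match alone does not prove a secret is usable. Confirming that would require testing the credential against the service it belongs to, which this sketch deliberately leaves out.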
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
Companies that create LLMs have warned developers against hardcoding sensitive information on websites, since users may unintentionally share that code in their own work, worsening the issue.
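One common alternative to hardcoding, sketched here in Python, is to keep the secret out of the source entirely and read it from the environment on the server at runtime. The variable name `PAYMENT_API_KEY` and the function `load_api_key` are hypothetical, and this is only one of several possible approaches (a secrets manager is another).

```python
import os


def load_api_key(env_var: str = "PAYMENT_API_KEY") -> str:
    """Read an API key from the environment instead of the source code.

    The variable name is a placeholder chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly rather than falling back to a key embedded in the code.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

Because the key never appears in the HTML or JavaScript sent to the browser, a web crawler that archives the page has nothing to collect. Calls that need the key would have to be made from server-side code.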