The Hidden Dangers: How Note-Taking Apps Serve as Leak Vectors to Large Language Models
- Robert Westmacott
- Sep 30
- 6 min read
Updated: Oct 5

In the digital age, note-taking apps have revolutionized how we capture, organize, and retrieve information. From Evernote’s early dominance in the 2000s to the explosive growth of Notion, Obsidian, OneNote, and Roam Research, these tools have become indispensable for professionals, students, and casual users alike.
The global note-taking app market is booming, projected to grow from $9.54 billion in 2024 to $11.02 billion in 2025, driven by increasing demand for productivity tools in remote work and education. This surge is partly fueled by integrations with artificial intelligence (AI) and large language models (LLMs), such as Notion AI, which uses models like GPT-4 for generating summaries and editing content, or Microsoft OneNote’s Copilot for intelligent note enhancement. Evernote and Roam Research also offer AI-assisted features, making notes smarter and more interactive.
At first glance, these apps appear harmless - simple repositories for thoughts, to-do lists, and research. However, beneath this veneer lies a significant risk: they can act as leak vectors to LLMs like OpenAI’s GPT series or similar models from Google and Meta. User data, including sensitive personal information, intellectual property, or confidential business strategies, could inadvertently end up in LLM training datasets, used for unauthorized inferences, or exposed through breaches.
This happens because many apps sync data to the cloud, integrate with external AI services, or rely on user behaviours that facilitate leaks. As LLMs scrape vast amounts of data for training, often including publicly or semi-publicly shared content, the boundary between a private note and a model’s knowledge base blurs.
In this article, we’ll delve into the technical mechanisms of these leaks, explore human factors that amplify them, and provide actionable mitigation strategies. By understanding these risks, users can better protect their data in an AI-driven world.
Direct Integrations and API Calls
One of the most straightforward ways data leaks from note-taking apps to LLMs is through direct AI integrations. Apps like Notion AI and Roam Research send note content to external LLM APIs for features such as summarization, idea generation, or grammar correction. For instance, Notion AI leverages OpenAI’s endpoints to process user inputs, transmitting snippets of notes over the internet.
Technically, this involves HTTP POST requests carrying payloads of note text. A typical interaction might look like this in simplified Python:
import requests

def send_to_llm_api(note_content, api_key):
    # The note text leaves the device here: it is sent verbatim to the provider's API.
    url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # bearer token; leaks if stored insecurely
        "Content-Type": "application/json",
    }
    payload = {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": note_content}],
    }
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()
Here, note_content, which may contain sensitive data, is sent in the request payload. HTTPS encrypts the data in transit, but without end-to-end encryption (E2EE) the provider still receives the plaintext, and vulnerabilities remain at either endpoint. Token-based authentication, such as OAuth or API keys, can be compromised if credentials are stored insecurely in the app’s local storage or intercepted.
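One way apps (or users scripting against these APIs) can reduce that credential risk is to keep keys out of code and out of plaintext local storage. A minimal sketch using the OS keychain via the keyring package; the service and account names are placeholders, not anything a real app uses:

import os
import keyring  # talks to the OS keychain (Keychain, Credential Manager, Secret Service)

def load_api_key():
    # Prefer the OS keychain; fall back to an environment variable.
    # Never a hardcoded string shipped inside the app.
    key = keyring.get_password("notes-ai-integration", "api_key")
    return key or os.environ.get("OPENAI_API_KEY")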
Man-in-the-middle (MITM) attacks exacerbate this. An attacker on the same network could use tools like sslstrip to downgrade HTTPS to HTTP, capturing payloads. For example, in a hypothetical scenario, a user on public Wi-Fi uses Notion AI to summarize a confidential business plan; an attacker spoofs DNS to reroute the API call, exposing the data. Even without interception, if the LLM provider’s policies allow data retention for model improvement, notes could feed into training datasets, as seen in controversies around OpenAI’s data usage.
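As a defensive illustration, here is a minimal sketch (not taken from any particular app) of a client that refuses downgraded connections and relies on certificate verification; the endpoint and function name are assumptions carried over from the earlier example:

import requests

API_URL = "https://api.openai.com/v1/chat/completions"

def send_with_strict_tls(payload, api_key):
    # Refuse anything that is not HTTPS, so a downgraded http:// URL is never used silently.
    if not API_URL.lower().startswith("https://"):
        raise ValueError("Refusing to send note content over a non-HTTPS URL")
    # verify=True (the default) validates the server certificate against the system CA bundle,
    # so a MITM presenting a self-signed certificate makes the request fail.
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        verify=True,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()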
Real-world example: In 2023, similar integrations in productivity tools led to unintended data sharing, highlighting the need for apps to adopt zero-knowledge architectures or local AI processing to minimize leaks.
Cloud Storage and Sync Mechanisms
Most note-taking apps rely on cloud storage for seamless syncing across devices, turning them into potential leak vectors. Evernote, for instance, uses AWS S3 buckets to store notes, while OneNote integrates with Microsoft Azure. Data at rest is often encrypted with AES-256, but vulnerabilities persist.
Data at rest refers to stored notes, protected by encryption keys managed by the provider. However, weak key management, such as hardcoded keys in app code, or misconfigured buckets (e.g., public read access) can expose data. In 2019, AWS S3 misconfigurations affected millions, and similar issues could plague note apps.
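As a rough sketch of how an operator might audit for this, the snippet below uses boto3 to check whether a bucket has a public-access block configured. The bucket name and the required s3:GetPublicAccessBlock permission are assumptions, and passing this check alone does not prove a bucket is safe:

import boto3
from botocore.exceptions import ClientError

def bucket_lacks_public_access_block(bucket_name):
    # Checks one common misconfiguration: no public-access block on the bucket.
    s3 = boto3.client("s3")
    try:
        config = s3.get_public_access_block(Bucket=bucket_name)["PublicAccessBlockConfiguration"]
    except ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
            return True  # nothing blocks public ACLs or policies on this bucket
        raise
    # Locked down only if every flag (BlockPublicAcls, IgnorePublicAcls, etc.) is enabled.
    return not all(config.values())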
In transit, sync uses protocols like HTTPS or WebSockets. WebSockets enable real-time updates but are susceptible to interception if not secured with TLS 1.3. Risks include SSL stripping, where attackers force unencrypted connections, or DNS spoofing, redirecting traffic to malicious servers.
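A client can at least refuse weak transport on its side. The sketch below, using the third-party websockets package, pins the minimum TLS version; the wss:// URL is a placeholder, not a real sync endpoint:

import ssl
import websockets  # third-party 'websockets' package

async def sync_notes(url="wss://example-notes-sync.invalid/stream"):
    # Default-verifying context, and refuse anything older than TLS 1.3.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    async with websockets.connect(url, ssl=ctx) as ws:
        await ws.send("sync-handshake")
        print(await ws.recv())

# Usage (needs a real endpoint): asyncio.run(sync_notes("wss://your-sync-server/stream"))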
Provider policies add another layer: some allow anonymized note data to be used for LLM training. Imagine a flowchart: User note → App sync → Cloud bucket → Provider’s data pipeline → LLM training set.
If notes contain PII, de-anonymization is still possible even after anonymization, because models can infer identities from surrounding context.
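One practical countermeasure is to scrub obvious PII before a note ever reaches a sync or AI pipeline. The sketch below uses crude regular expressions purely for illustration; real redaction would need a dedicated PII-detection library:

import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_note(text):
    # Replace matches with labelled placeholders before the note leaves the device.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(scrub_note("Call Anna on +44 7700 900123 or email anna@example.com"))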
Hypothetical: A misconfigured Evernote S3 bucket leaks notes, which web crawlers scrape into datasets like Common Crawl, used for open-source LLMs.
Third-Party Plugins and Extensions
Plugins and extensions expand functionality but introduce vulnerabilities. Obsidian’s plugin ecosystem, for example, accesses local note vaults, while Notion extensions use OAuth for integration.
Permission scopes often grant broad access, like read/write to all notes. Malicious plugins could exfiltrate data to LLM-powered services, perhaps under the guise of “AI enhancement.”
Real-world vulnerabilities: CVE-2019-12592 affected Evernote’s Chrome extension, enabling universal XSS to steal notes. Obsidian plugins pose risks from unvetted code, potentially injecting malware. In 2024, browser extension studies revealed common issues like privilege escalations.
Amplification: A rogue plugin sends notes to a remote LLM for “analysis,” leaking data en masse.
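A quick, admittedly crude way to spot this is to grep a plugin’s bundled code for network calls before enabling it. The sketch below assumes Obsidian’s default vault layout (<vault>/.obsidian/plugins/<id>/main.js); hits are worth reviewing, not proof of malice:

import re
from pathlib import Path

SUSPICIOUS = [r"https?://", r"\bfetch\(", r"XMLHttpRequest", r"\bWebSocket\("]

def audit_plugin(vault, plugin_id):
    # Flag lines in the plugin bundle that appear to reach out to the network.
    plugin_file = Path(vault) / ".obsidian" / "plugins" / plugin_id / "main.js"
    findings = []
    for lineno, line in enumerate(plugin_file.read_text(errors="ignore").splitlines(), 1):
        if any(re.search(p, line) for p in SUSPICIOUS):
            findings.append(f"{plugin_file.name}:{lineno}: {line.strip()[:80]}")
    return findings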
Sharing and Collaboration Features
Collaboration features, like shared links in Notion or real-time editing in OneNote, facilitate oversharing. URL-based sharing with weak controls, e.g., no expiration or password, allows unauthorized access.
Session hijacking via stolen cookies can expose shared notes. Moreover, public shares might be scraped by web crawlers feeding LLM corpora.
Example: A shared OneNote page with sensitive IP is indexed by search engines, ending up in training data.
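A simple sanity check is to fetch a share link from a session with no cookies or credentials; if the content comes back, a crawler can read it too. The sketch below is a heuristic only, since some services return a login page with a 200 status:

import requests

def looks_publicly_readable(share_url):
    # A fresh requests call carries no cookies and no auth header.
    response = requests.get(share_url, allow_redirects=True, timeout=10)
    # A 200 that did not end up on a login page suggests anyone with the link,
    # including web crawlers feeding LLM corpora, can read the note.
    return response.ok and "login" not in response.url.lower()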
Export and Import Functions
Exporting notes to PDF, Markdown, or CSV seems innocuous, but these files can leak when uploaded to LLM interfaces like ChatGPT or stored publicly on GitHub.
Format risks: PDFs embed metadata (e.g., author, creation date), revealing user info. A Markdown export with embedded links could lead to further scraping.
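Before uploading an exported PDF anywhere, it is worth checking what its document-information dictionary reveals and writing a cleaned copy. A minimal sketch using the pypdf library; the function name is illustrative:

from pypdf import PdfReader, PdfWriter

def strip_pdf_metadata(src, dst):
    reader = PdfReader(src)
    leaked = dict(reader.metadata or {})   # e.g. /Author, /CreationDate, /Producer

    writer = PdfWriter()                   # a fresh writer carries none of the source's metadata
    for page in reader.pages:
        writer.add_page(page)
    with open(dst, "wb") as fh:
        writer.write(fh)
    return leaked                          # what the original file would have revealed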
Scenario: User exports notes to PDF, uploads to an LLM for summarization, unknowingly contributing to model training.
Device-Level Access and Malware
Local storage in apps like Joplin uses SQLite databases, accessible via filesystem permissions. Malware like keyloggers or rootkits can read these.
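To see how little stands between local notes and any process running as the same user, the sketch below opens a Joplin-style database read-only and lists note titles; the path and the notes table are assumptions based on Joplin’s default desktop profile on Linux:

import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".config" / "joplin-desktop" / "database.sqlite"

def dump_note_titles(db_path=DB_PATH):
    # No app-level authentication: filesystem permissions are the only barrier.
    with sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) as conn:
        rows = conn.execute("SELECT title FROM notes").fetchall()
    return [title for (title,) in rows]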
Android/iOS sandboxing helps, but flaws exist, e.g., clipboard managers enabling side-channel attacks.
Compromised devices feed data to LLM-integrated bots. Examples include North Korea-linked malware that steals data from note-taking and productivity apps, and malicious OneNote files used to deliver payloads like Qakbot.
Indirect Leaks via User Interactions
Users often copy and paste notes into LLM chats, indirectly contributing them to model training wherever the provider retains chat data. Browser history syncing and autocomplete can also pull content from notes, and AI search engines might index synced notes.
Data Breaches and Secondary Exploitation
Historical breaches: Evernote’s 2013 hack exposed roughly 50 million users’ email addresses and hashed passwords [10]. Dumped data ends up on the dark web and can be ingested into LLM pipelines via corpora such as Common Crawl.
De-anonymization: LLMs can re-identify individuals by cross-referencing patterns across datasets [60]. Stats: 2025 saw over 30,000 new vulnerabilities, with breaches costing $4.88M on average.
Emerging Vectors: IoT and Voice Integrations
Voice-to-text services such as Otter.ai integrate with note apps, sending audio to cloud LLMs [77]. IoT syncing (e.g., smart speakers and other connected devices) enables ambient-listening leaks.
Example: Otter.ai uses voice data for AI training and improvement, risking exposure of recorded content.
Human Vulnerabilities Amplifying Leaks
Technology isn’t solely to blame; humans are the weak link, amplifying risks through behaviors and oversights.
Behavioral Risks
Poor password practices (reuse or weak passphrases) lead to account takeovers. Once an account is compromised, notes can be exfiltrated and fed to LLMs.
Social Engineering
Phishing that mimics app updates can trick users into granting access, increasingly with lures generated by LLM-powered scam tools.
Misconfiguration
Disabling E2EE, using public Wi-Fi for sync, or ignoring settings exposes data.
Oversharing and Cognitive Biases
Users store passwords in their notes, downplay risks because of confirmation bias, or overshare content through LLM-connected platforms.
Insider Threats
App employees leak data to LLM projects.
Lack of Awareness
Misunderstanding provider policies that allow AI training worsens the problem [88].
Mitigations: Use local-first apps like Standard Notes with E2EE, audit plugins before installing them, and enable 2FA. Stats show breaches up roughly 10% in 2025.
Note-taking apps, while convenient, pose hidden dangers as leak vectors to LLMs through integrations, breaches, and human errors. Vigilance, better designs with local AI, and regulatory oversight on AI data handling are essential to mitigate these risks. As the market grows, so must our protections.
Sources: The references below are listed in ascending order by citation ID, each with a brief title and a direct URL.
• Citation 0: Note Taking App Market Report 2025 Analysis And Forecast 2034 - https://www.thebusinessresearchcompany.com/report/note-taking-app-global-market-report
• Citation 2: Does Evernote Sync with Microsoft? How-to-Sync Guide - https://www.multcloud.com/tutorials/does-evernote-sync-with-microsoft-2223.html
• Citation 10: 50 million compromised in Evernote hack - https://www.cnn.com/2013/03/04/tech/web/evernote-hacked
• Citation 33: Meet the new Notion AI - https://www.notion.com/product/ai
• Citation 41: Plugins - Obsidian - https://obsidian.md/plugins
• Citation 42: Google Chrome Extensions Vulnerabilities - https://www.cmu.edu/iso/news/2025/google-vulnerabilities.html
• Citation 43: Plugin security - Obsidian Help - https://help.obsidian.md/plugin-security
• Citation 45: CVE-2019-12592 Detail - https://nvd.nist.gov/vuln/detail/CVE-2019-12592
• Citation 50: Cost of a Data Breach Report 2025 - https://www.ibm.com/reports/data-breach
• Citation 60: Deidentifying Medical Documents with Local, Privacy-Preserving Large Language Models - https://ai.nejm.org/doi/full/10.1056/AIdbp2400537
• Citation 67: Qakbot Evolves to OneNote Malware Distribution - https://www.trellix.com/blogs/research/qakbot-evolves-to-onenote-malware-distribution/
• Citation 72: RustDoor and Koi Stealer for macOS Used by North Korea-Linked Hackers - https://unit42.paloaltonetworks.com/macos-malware-targets-crypto-sector/
• Citation 77: Otter Meeting Agent - AI Notetaker, Transcription, Insights - https://otter.ai/
• Citation 88: Artificial Intelligence Staff Awareness E-learning Course - https://www.itgovernanceusa.com/shop/product/artificial-intelligence-staff-awareness-e-learning-course
• Citation 91: Privacy & Security | Otter.ai - https://otter.ai/privacy-security



