Adversarial machine learning is the study of attacks on machine learning algorithms and of defenses against such attacks. Attacks typically introduce specially crafted inputs (known as adversarial examples) that cause a model to make incorrect predictions.
How Adversarial Machine Learning Works
1. Adversarial Examples:
- These are input samples that have been slightly perturbed to deceive a trained model into misclassifying them. The perturbations are often imperceptible, or barely noticeable, to humans.
2. Generating Adversarial Examples:
- Techniques like the Fast Gradient Sign Method (FGSM) or the Jacobian-based Saliency Map Attack (JSMA) are used. For instance, FGSM computes the gradient of the loss with respect to the input, then shifts each input feature a small step in the direction of that gradient's sign, increasing the loss while keeping the perturbation small.
3. Types of Attacks:
- White-box Attacks: Attackers have full knowledge of the model, including its architecture and parameters. They use this knowledge to craft adversarial examples.
- Black-box Attacks: Attackers have limited knowledge of the target model. They might know its type or training data but not its internal parameters, so they typically rely on querying the model or on crafting adversarial examples against a substitute model and transferring them.
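To make the FGSM step concrete, here is a minimal sketch against a toy logistic-regression model. The weights, input, and epsilon below are illustrative assumptions, not values from any real system; real attacks apply the same idea to deep networks via automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Model confidence that x belongs to class 1.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    # For binary cross-entropy with p = sigmoid(w.x + b), the gradient
    # of the loss w.r.t. the input is (p - y) * w. FGSM perturbs each
    # input feature by eps in the sign of that gradient -- the direction
    # that increases the loss under an L-infinity bound of eps.
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w, b = [2.0, -1.0], 0.0        # toy "trained" weights (assumed)
x, y = [0.6, 0.1], 1.0         # clean input, true label 1

x_adv = fgsm(w, b, x, y, eps=0.5)
print(predict(w, b, x))        # > 0.5: clean input classified correctly
print(predict(w, b, x_adv))    # < 0.5: perturbed input is misclassified
```

Each feature moves by at most eps, yet the model's prediction flips: this is the white-box setting, since the attack reads the model's weights to compute the gradient.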