
Releasing Canica: A Text Dataset Viewer

Discover Canica, Lakera's interactive text dataset viewer that elevates data analysis with visual exploration tools like t-SNE and UMAP. Now available for the machine learning community under the MIT license.

Lakera Team
December 1, 2023
November 14, 2023


At Lakera, we collect huge datasets of text that we use to train, test, and improve our models, but a model can only be as good as the data we train it on.

We developed canica, a text dataset viewer, to help us understand the quality of our datasets.

Canica takes a set of texts and their corresponding embeddings and lets you explore them interactively as a 2D plot, using dimensionality reduction algorithms such as t-distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP).
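To make the projection step concrete, here is a minimal sketch of reducing embeddings to 2D with t-SNE. It uses scikit-learn's TSNE rather than canica itself, and the function name project_to_2d and the synthetic "embeddings" are assumptions made purely for illustration; canica's own implementation and parameters may differ.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Reduce (n_samples, n_dims) embeddings to (n_samples, 2) with t-SNE.

    Hypothetical helper for illustration; not part of canica's API.
    """
    n_samples = embeddings.shape[0]
    # scikit-learn requires perplexity < n_samples; 30 is its default value.
    perplexity = min(30.0, max(1.0, (n_samples - 1) / 3))
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(embeddings)


# Synthetic stand-in for real text embeddings: two Gaussian blobs in 16 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 16)),
    rng.normal(loc=5.0, scale=1.0, size=(30, 16)),
])

points_2d = project_to_2d(fake_embeddings)
print(points_2d.shape)  # one (x, y) pair per input embedding
```

Each row of points_2d is the plot position of the text at the same index, which is what makes it possible to hover over a point and look up the original text.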

This tool is already a valuable part of our workflows, and as part of our efforts to help the machine learning community, we are releasing canica under the MIT license.

The source code is available on GitHub, and the canica package has been published to the Python Package Index, so you can install it right now via pip.

You may be wondering why we call it canica. In Spanish, canica means marble (the toy). During the development of canica we ran some experiments that showed the t-SNE optimisation process in real time. It looked like a group of marbles bouncing around, hence the name. Plus, I think canica has a nice ring to it (doesn’t it?).

Exploring canica

Let’s take a well-known dataset of Amazon reviews, filter it down to 1000 reviews in English and German, and generate text embeddings for these reviews using OpenAI's embeddings API.

Screenshot showing a text dataset visualization in canica

This plot shows two clusters of reviews, with English reviews in orange and German reviews in blue. They are mostly separated, but some reviews end up in the other language's region.

We could ask many questions about this dataset. For example, why are some points surrounded by points of the other color?

Screenshot showing a visualization of two clusters of reviews

Hovering over a point will give us more information about it. The point we’ve highlighted is a German review of a Nokia phone cover with two English reviews surrounding it.

“Not convenient to store your phone, pouch is too small.. The pouch is too small for an iphone and not really…”
“This bluetooth is very handy and easy to carry around. This bluetooth is very handy and easy to carry around. What I like about this device is that Selfie don't need to raise your hand far just to get a good picture…”

All these reviews are for phone accessories. Points share semantic similarities with nearby points, even though the larger overall clusters correspond to different languages. This suggests that our embeddings represent semantic information similarly across languages.

Inspecting a local neighbourhood

We often faced a challenge when using tools like matplotlib or plotly to plot t-SNE results: there was no easy solution that could help us relate the 2D space back to the original embedding space.

Dimensionality reduction is great for simplifying data, but it can leave out crucial context, which can make it harder to grasp the structure of your data.

One of the unique features of canica is that it lets us explore neighbourhoods in the original embedding space through the 2D plot: clicking on a point highlights its nearest neighbours, and a slider adjusts the number of neighbours shown.

This gives us a better idea of how the dimensionality reduction works and which information we see in the resulting plot. Hovering over the highlighted points allows us to understand how our embeddings work and which information they contain.
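As a rough illustration of what "nearest neighbours in the original embedding space" means, the sketch below ranks points by cosine similarity to a selected point using the full-dimensional vectors, not the 2D coordinates. The function name nearest_neighbours, the toy vectors, and the choice of cosine similarity are assumptions; the article does not say which distance canica uses.

```python
import numpy as np


def nearest_neighbours(embeddings: np.ndarray, index: int, k: int) -> np.ndarray:
    """Return the indices of the k points most similar to embeddings[index].

    Similarity is cosine similarity in the original embedding space
    (an assumption for this sketch, not necessarily canica's metric).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    similarity = unit @ unit[index]
    similarity[index] = -np.inf  # exclude the selected point itself
    order = np.argsort(-similarity)  # most similar first
    return order[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
])

print(nearest_neighbours(toy, index=0, k=2))  # → [1 3]
```

Because the ranking is computed before any dimensionality reduction, neighbours found this way can land far apart in the 2D plot, which is exactly the kind of distortion this view helps to reveal.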

Focusing on a subset

Canica also allows you to focus on a specific subset of your data.

After selecting a data point and adjusting the neighbour count, the re-plot button will rerun the dimensionality reduction algorithm on the selected subset only. The plot will re-render, and we can investigate structure within the subset that may not have been visible before.

Thanks to feedback from our internal users at Lakera, canica highlights the last focused point to help you keep track of this process.
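The focus-and-re-plot loop can be sketched as follows: pick a point, take its k nearest neighbours in the embedding space, and rerun the 2D reduction on just that subset. Everything here is an assumption for illustration: refocus and pca_2d are hypothetical names, Euclidean distance is an arbitrary choice, and a small NumPy PCA stands in for t-SNE or UMAP so the example has no extra dependencies.

```python
from typing import Callable, Tuple

import numpy as np

Reducer = Callable[[np.ndarray], np.ndarray]


def pca_2d(x: np.ndarray) -> np.ndarray:
    """Stand-in reducer: project onto the first two principal components."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


def refocus(
    embeddings: np.ndarray, index: int, k: int, reduce_to_2d: Reducer
) -> Tuple[np.ndarray, np.ndarray]:
    """Re-run the reduction on a point and its k nearest neighbours.

    Returns (subset_indices, subset_points_2d). The focused point is always
    first, so a viewer could highlight row 0 of the new plot.
    """
    distances = np.linalg.norm(embeddings - embeddings[index], axis=1)
    distances[index] = -1.0  # guarantee the focused point sorts first
    subset = np.argsort(distances, kind="stable")[: k + 1]
    return subset, reduce_to_2d(embeddings[subset])


# Synthetic stand-in for real embeddings: 50 points in 8 dimensions.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 8))

subset_indices, subset_points = refocus(data, index=7, k=5, reduce_to_2d=pca_2d)
print(subset_indices[0], subset_points.shape)
```

Passing the reducer in as a function keeps the subset selection independent of the projection method, so the same sketch would work with a t-SNE or UMAP callable in place of pca_2d.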

We’re excited to share canica with you and always welcome your feedback and contributions.

To learn more about canica and how it can enhance your approach to data analysis, explore the tutorial notebook in our GitHub repository.
