LanceDB handles multimodal data—images, audio, video, and PDF files—natively by storing the raw bytes in a binary column alongside your vectors and metadata. This approach simplifies your data infrastructure by keeping the raw assets and their embeddings in the same database, eliminating the need for separate object storage for many use cases. This guide demonstrates how to ingest, store, and retrieve image data using standard binary columns, and also introduces the Lance Blob API for optimized handling of larger multimodal files.

Store binary data

To store binary data, define a binary Arrow field in your schema (pa.binary() in Python, Binary in TypeScript, and DataType::Binary in Rust).

1. Setup and imports

First, import the necessary libraries for LanceDB and Arrow in your SDK.

2. Prepare data

For this example, we’ll create some dummy in-memory images. In a real application, you would read these from files or an API. The key is to convert your data (image, audio, etc.) into a raw bytes object.
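As a rough sketch of this step, the snippet below builds two tiny solid-color PNG images entirely in memory using only the Python standard library. The helper names (png_chunk, make_png) and the 2x2 image size are illustrative assumptions, not part of LanceDB; any encoder that yields a bytes object would serve equally well.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Frame one PNG chunk: length, type, payload, CRC over type + payload."""
    body = kind + payload
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + body + struct.pack(">I", crc)


def make_png(width: int, height: int, rgb: tuple) -> bytes:
    """Encode a solid-color 8-bit RGB image as PNG bytes (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter type 0, followed by the pixel data.
    row = b"\x00" + bytes(rgb) * width
    pixels = zlib.compress(row * height)
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", header)
        + png_chunk(b"IDAT", pixels)
        + png_chunk(b"IEND", b"")
    )


red_png = make_png(2, 2, (255, 0, 0))
blue_png = make_png(2, 2, (0, 0, 255))
```

For files already on disk, pathlib.Path(path).read_bytes() produces the same kind of bytes object without any encoding step.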

3. Define the schema

Define the schema explicitly when you create the table. An explicit schema ensures that Arrow and LanceDB store the column as a binary type rather than inferring a generic string or list type.

4. Ingest data

Now, create the table using the data and the defined schema.

Retrieve and use blobs

When you search your LanceDB table, you can retrieve the binary column just like any other metadata.

Convert bytes back to objects

Once you have the bytes back from the search result, you can decode them into the original format (for example, an image object or audio buffer).
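The usual pattern is to wrap the bytes in io.BytesIO and hand the resulting file-like object to a decoder. The sketch below demonstrates this with the standard-library wave module on a short silent clip; the encode_wav helper and the clip itself are stand-ins for bytes read from a search result. For images, a library such as Pillow can be used the same way via Image.open(io.BytesIO(image_bytes)).

```python
import io
import wave


def encode_wav(frames: bytes, sample_rate: int = 8000) -> bytes:
    """Encode mono 16-bit PCM frames as WAV bytes (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate)
        writer.writeframes(frames)
    return buf.getvalue()


# Stands in for the bytes retrieved from a binary column.
stored_bytes = encode_wav(b"\x00\x00" * 800)

# Decode: wrap the bytes in a file-like object and pass it to the decoder.
with wave.open(io.BytesIO(stored_bytes), "rb") as clip:
    sample_rate = clip.getframerate()
    n_frames = clip.getnframes()
    pcm = clip.readframes(n_frames)
```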

Large blobs (Blob API)

For larger files like high-resolution images or videos, Lance provides a specialized Blob API. By using a large-binary Arrow type (pa.large_binary() in Python, LargeBinary in TypeScript, and DataType::LargeBinary in Rust) and specific metadata, you enable lazy loading and optimized encoding. This allows you to work with massive datasets without loading all binary data into memory upfront.

1. Define a blob schema

To use the Blob API, you must mark the column with {"lance-encoding:blob": "true"} metadata.

2. Ingest large blobs

You can then ingest data normally, and Lance will handle the optimized storage.

For more advanced usage, including random access and file-like reading of blobs, see the Lance format’s blob API documentation.

3. Convert blob tables to pandas

When you call to_pandas() on a local LanceDB table that contains Blob API columns, the blob_mode argument controls how those columns materialize. This is available in the Python SDK on local tables; remote tables raise NotImplementedError. blob_mode accepts:
  • "lazy" (default): returns blob columns without eagerly materializing their payloads. Use this when you want existing behavior and don’t need to inspect blob metadata. For namespace-managed tables and in-memory datasets, lazy mode falls back to the standard PyArrow to_pandas() path.
  • "bytes": eagerly materializes each blob as bytes. Use this when you need the raw payload in the DataFrame, for example to decode an image or audio clip in-process.
  • "descriptions": returns blob descriptors (offsets, sizes, and positions) instead of the data itself. Use this when you want to plan I/O without paying the cost of loading every blob.

"bytes" and "descriptions" require a filesystem-backed Lance dataset; they are not supported on in-memory tables or namespace-managed tables.

Extra keyword arguments are forwarded to the underlying PyArrow / Lance pandas conversion, so you can also pass options like split_blocks or self_destruct.

Query builders accept the same PyArrow to_pandas kwargs, but not blob_mode. They materialize Arrow results first, apply LanceDB-specific flatten and timeout handling, and then forward the remaining kwargs.

Other modalities

The pa.binary() and pa.large_binary() types accept arbitrary bytes, so you can use the same pattern for other kinds of multimodal data:
  • Audio: Read .wav or .mp3 files as bytes.
  • Video: Store video transitions or full clips using the Blob API.
  • PDFs/Documents: Store the raw file content for document search.