BTI: An Index for the BDSS PDF Repository

Abstract

The USPTO Bulk Data Storage System (BDSS) is a public repository of U.S. patents and pre-grant publications. The repository includes a set of TAR files, each containing PDFs of documents that were published during one week. Because the BDSS webserver supports the HTTP Range parameter, it is possible to download a specific portion of an archive that comprises a selected PDF. However, the HTTP Range parameter requires the offset and size of the portion. Because the archives do not include an index, determining the location of the desired data would require linearly searching the archive.

This project presents a database comprising an index of the BDSS PDF archives. The database can be used to determine the archive, offset, and size of any PDF, which can then be requested from BDSS using the HTTP Range parameter. This project allows users to retrieve any PDF from BDSS quickly, efficiently, freely, and automatically.

Quick Start

This project, the BDSS TAR Index (“BTI”), can be used to retrieve PDFs from BDSS in four ways:

Download and run the BTI Python 3 script, which is published (and viewable) at https://www.usptodata.com/bti/bti.py. This script requires only a version of the Python 3 interpreter, and it uses only Python built-in libraries. The script presents a graphical user interface (GUI) that accepts a U.S. patent, publication, or application number, downloads the PDF from BDSS, and opens it in the default PDF viewer. Downloaded documents are locally cached for future viewing.
Download and run bti.py with a command-line argument indicating a U.S. patent, publication, or application number. The script downloads the PDF from BDSS and opens it in the default PDF viewer. Downloaded documents are locally cached for future viewing.
Call the BTI API, published at https://www.usptodata.com/bti/bti.php, specifying a U.S. patent, publication, or application number as the ref parameter (e.g.: https://www.usptodata.com/bti/bti.php?ref=6123456). The API queries a local version of the BTI database and returns a JSON-formatted response indicating the archive, offset, and size of the requested PDF. This information can be used to retrieve the PDF from BDSS via an HTTP request specifying the Range parameter.
Download the BTI SQLite database, published at https://www.usptodata.com/bti/bti_sqlite.zip (updated weekly), and query the database using a U.S. patent or publication number. The database returns the archive, offset, and size of the requested PDF. This information can be used to retrieve the PDF from BDSS via an HTTP request specifying the Range parameter.

Background

Patents and publications are commonly available from sources like the USPTO PatFT/AppFT search engines and Google Patents. Each search engine initially presents requested documents as a full-text extract, which is suitable for many purposes, but is sometimes inadequate. For example, full-text extracts do not include the figures, lack page and line numbers, and do not accurately reproduce mathematical equations or chemistry notation. As a result, it is often necessary to examine the PDF as published by the USPTO.

Unfortunately, the options for retrieving patent and publication PDFs are inconvenient and are not easy to automate:

The USPTO PatFT/AppFT search engines include a link to the PDF in each search result. However, viewing the PDF requires viewing the full-text extract page and clicking “Images,” which opens a PDF viewer applet. The user can view the PDF one page at a time in the viewer, click “Full Pages” to instead view a scrollable version, or click “Download” to save the document to a local volume. All of these steps require additional user input, which can slow down the process of accessing PDFs. Further, the steps involving the PDF viewer require user input that is difficult to emulate.
Similarly, Google Patents includes a link to the PDF in each search result. Unlike the PatFT/AppFT search engines that require using a PDF viewer applet, Google Patents provides a download link. However, the PDF download link is not standardized, so the page must be accessed manually or scraped by an automated process to reach the link. Also, scraping techniques might violate the Google Patents acceptable use policy.
The USPTO Public PAIR interface provides PDFs of all documents in any published application, including issued patents. However, PAIR also requires several user input steps to reach any particular PDF. PAIR actively blocks automated processes through the use of captchas.
Some third-party services have been developed for the specific purpose of providing access to U.S. patent and publication PDFs, such as Free Patents Online, Patent Retriever, and Pat2PDF. These services do not provide options for automated access, likely because they are ad-supported, and automated access would reduce impressions and ad revenue. Further, these services are arguably no more convenient than Google Patents.
Still other third-party services, such as patent analytics services, can provide patent and publication PDFs through various user interfaces. However, patent analytics services are not free and can be prohibitively expensive for some users.

As another option, the USPTO provides the Bulk Data Storage System (BDSS), which includes a wealth of data relating to U.S. patents. BDSS includes Patent Grant Multi-Page PDF Images, which are presented as individual TAR files (tape archives, colloquially known as “tarballs”), where each archive includes all of the patent PDFs issued by the USPTO within a one-week period (except for patents issued before the year 2000, for which each archive covers multiple weeks, months, or years). Similarly, BDSS includes Patent Application Multi-Page PDF Images for pre-grant publication PDFs, which are also published as weekly TARs.

BDSS, by its nature, is amenable to bulk download and automated access. A particular PDF could be viewed by downloading the TAR and extracting the PDF of interest. However, each TAR is quite large – for example, the TAR for pre-grant publications for the week of May 5, 2022 (https://bulkdata.uspto.gov/data/patent/application/multipagepdf/2022/app_pdf_20220505.tar) is 18 gigabytes. As a result, downloading an archive in order to retrieve a single PDF (which is typically about 100 kilobytes) is time-consuming and inefficient. Further, because the archives are organized by date, accessing any PDF requires determining the date of publication or issuance and locating the corresponding TAR file.

Thus, the patent community has an unmet need for accessing patent and pre-grant publication PDFs in a manner that is:

Quick (i.e., minimizing the number of user interactions between a patent/publication number and the corresponding PDF),
Efficient (i.e., minimizing the volume of data transmitted or received),
Free (i.e., not requiring access fees or viewing ads), and
Subject to automation (i.e., not dependent upon user input).

Concept

The Problem

Earlier this year (2022), I wondered if it might be possible to download only the part of a BDSS TAR file that contained a desired PDF. I quickly discovered that the BDSS webserver supports the HTTP Range parameter. As a result, a specially crafted HTTP request could chop out the desired part of any file – as indicated by its offset from the file start and its length.
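As a minimal sketch of such a request (not the code used by this project), the following Python builds the Range header for a given offset and length and fetches only that span. The function names are hypothetical, and the sketch assumes that the offset points at the first byte of the PDF data rather than at its TAR header.

```python
import urllib.request


def range_header(offset: int, size: int) -> str:
    """Build an HTTP Range header value for `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # Range end positions are inclusive, hence the "- 1".
    return f"bytes={offset}-{offset + size - 1}"


def fetch_range(url: str, offset: int, size: int) -> bytes:
    """Download only the requested byte span of the file at `url`."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, size)}
    )
    with urllib.request.urlopen(request) as response:
        # A server that honors the range replies with 206 Partial Content.
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return response.read()
```

Checking for status 206 guards against a server that ignores the header and starts sending the entire multi-gigabyte archive.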

Next, I started looking into how one could determine which part of a TAR file contains a PDF of interest. Here, I encountered several issues that rendered the simplest options unusable.

First: BDSS does not offer an index of its archives. The files are added to the repository, but are not externally indexed.

Second: BDSS files are stored in the TAR format, which does not include an index.

Many file packaging and archive formats (e.g., zip and rar) append a file index to the file, so that an archive can be randomly accessed by examining the index to find the location of a desired portion of the archive.

Unfortunately, the TAR format specification does not include an index. Rather, TAR files are organized as linked lists. A TAR file begins with a header of a first file that indicates its name and size, followed by the contents of the first file, followed by the header for a second file and the contents of the second file, etc. As a result, in order to find any file within a TAR, a process must examine the first header to determine if the first file is the desired file. If so, the process can skip to the end of the (fixed-size) header and extract the first file. If not, the process hops over the first file and examines the second header to determine if the second file is the desired file. This iterative process continues, in a similar manner as navigating a linked list, until the desired file is located.
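The header-hopping process can be sketched in Python as follows. This is an illustrative sketch rather than the indexing script used for this project: index_tar is a hypothetical name, the sketch reads only the basic name and size fields of each 512-byte header, and it ignores long-name extensions. The demonstration archive is built in memory with the standard tarfile module.

```python
import io
import tarfile

BLOCK = 512  # TAR headers and file data are laid out in 512-byte blocks


def index_tar(stream):
    """Yield (name, data_offset, size) by hopping from header to header."""
    position = 0
    while True:
        header = stream.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            return  # end of stream, or the all-zero block that ends an archive
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        # Size field: bytes 124-135, ASCII octal digits padded with NUL/space.
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        yield name, position + BLOCK, size
        # File data is padded to a whole number of blocks; hop over it.
        # (A non-seekable network stream would read and discard instead.)
        padded = -(-size // BLOCK) * BLOCK
        stream.seek(padded, io.SEEK_CUR)
        position += BLOCK + padded


def demo_archive():
    """Build a tiny two-member archive in memory for demonstration."""
    buffer = io.BytesIO()
    members = [("contents.txt", b"a" * 10), ("D0950861.pdf", b"x" * 1004)]
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


entries = list(index_tar(demo_archive()))
```

Each tuple records where a member's data begins and how long it is, which is the information needed to request that member later without rescanning the archive.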

Given this format, one way to find a desired PDF in BDSS is to download an entire TAR and scan it from start to finish. But each TAR is at least 10 gigabytes, and the BDSS webserver has an approximate maximum transfer rate of 5 MB/s, so downloading an entire archive imposes an initial delay of at least 30 minutes. This could be reduced slightly by streaming the TAR and scanning it while streaming. If you’re lucky, the PDF is near the start of the TAR and can be retrieved with only a few hops. If you’re unlucky, the PDF is near the end of the TAR and can only be retrieved by streaming the entire TAR. On average, you still have to scan through half of the TAR to find the PDF, which only cuts the delay to about 15 minutes.

Third: Although the TAR format does not include an index, the first file within each BDSS TAR is a text file that lists its contents. Theoretically, this could be used to determine the locations of files in the TAR (or at least their approximate locations, due to variable PDF file sizes). Unfortunately, the order of the files in the contents list does not match the order of the files in the TAR.

For example, the TAR for patent PDFs issued during the week of May 3, 2022 contains a contents file that begins:

11317552 B2 20220503 22
11317553 B2 20220503 23
11317554 B2 20220503 30
11317555 B2 20220503 14
11317556 B2 20220503 31

…but the first files stored in the TAR (after the contents file) are:

P20220503-20220503/D0/950/861/D0950861.pdf
P20220503-20220503/D0/950/503/D0950503.pdf
P20220503-20220503/D0/950/487/D0950487.pdf
P20220503-20220503/D0/950/510/D0950510.pdf
P20220503-20220503/D0/950/728/D0950728.pdf

Due to this mismatch, the contents list cannot be used to determine the location of a file in the archive.

The Solution

For the foregoing reasons, it is not possible to predict the location of any file within a BDSS TAR. Finding the file requires linearly scanning the TAR.

But what if someone scanned the entire BDSS repository and generated an index? That index could be published for use by the entire patent community to access BDSS at will.

I completed this task by the following process:

First, I wrote a Python script to stream a TAR from BDSS and index its contents. (This turned out to be an interesting task because the TAR format has some quirks. For instance, the TAR format specification indicates that file size and offset are stored as 12-byte integers. However, in reality, both values are stored as 11-byte octal values followed by a null terminator. Why octal? And why include a null terminator in a fixed-length field?)

I started running the Python script on a server in my home to compile data for the BDSS PDF repository. This effort proved to be unrealistic, because the BDSS PDF repository is 21 terabytes, which vastly exceeds the monthly download quota of my home internet service provider.

Instead, I purchased an array of six Google Cloud Platform (GCP) virtual machines to run the indexing script in a distributed manner. GCP was very well-suited to this task. Each virtual machine requires a vast amount of data ingress (to stream the entirety of the 21-terabyte PDF repository), but a negligible amount of processing power (since the data scan is computationally simple), a negligible amount of storage (since the VMs only stored the filenames, sizes, and offsets of each TAR), and a negligible amount of data egress (to send me the scan results). The pricing of GCP VMs is metered by compute, storage, and data egress – but data ingress is free!

Over the course of three weeks, the GCP VM array indexed the 21-terabyte BDSS PDF archives for a total cost of about $40.
Next, I wrote a Python script to compile the index data into a SQLite database. Organizing this database for search performance, storage efficiency, and flexibility took a few design attempts (see the full description below). I also decided to correlate patent and publication numbers with U.S. application number, so that the index can also be searched by application number.

The resulting database, bti.sqlite, identifies the TARs, offsets, and lengths of 12.3 million patents and 6.9 million publications in a 1.8-gigabyte database (under 1.0 gigabytes zipped).
Next, I wrote a Python script to monitor BDSS, download the two new PDF TARs that are published each week, update bti.sqlite to include the newest TARs, and upload the updated database to usptodata.com (as well as an unzipped version for use by the webserver).
I wrote bti.php, which is a PHP script that runs on the usptodata.com webserver as an API for the BDSS TAR Index.
Finally, I wrote bti.py, which is an end-user Python script that retrieves and displays PDFs based on either a downloaded copy of bti.sqlite or the usptodata.com API.

The results of this project are now available and will be supplemented with weekly updates of bti.sqlite.

Detailed Description

The BTI Database

bti.sqlite is a SQLite 3 database that is organized as follows:

Patents and publications are stored in separate tables. For each record, the table contains an id (table key, an incrementing integer), offset, size, and two fields (year_part and date_code) that are used to determine the URL of the TAR file.

BDSS stores TARs for PDFs according to the following URL format:

https://bulkdata.uspto.gov/data/patent/grant/multipagepdf/YEAR_PART/grant_pdf_DATE_PART.tar

https://bulkdata.uspto.gov/data/patent/application/multipagepdf/YEAR_PART/app_pdf_DATE_PART.tar

In both cases, year_part is the four-digit year – except for patents issued before January 11, 2000, in which case year_part is 1790_1999 (see this page of historic patents).

date_part is more variable. For many TAR files, the URL indicates a single date, like 20100706. For others, the URL includes a date range, like 20100622_20100629. The scheme is rather inconsistent, as shown on this page. Also, the size of the date ranges increases relative to their age; the range of the oldest patent archive is 17900731_18641101. In order to store the date_part efficiently, each unique date_part is stored in the DATE_CODES table with a code (table key, an incrementing integer). For both PATENTS and PUBLICATIONS, the date_code field is a foreign-key relationship field indicating a DATE_CODES code.
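The relationship between a patent record, its date code, and the TAR URL can be sketched with an in-memory SQLite database. This is only an approximation of the design described above: the table names come from the text, but the exact column names and types (including date_part as the name of the text column in DATE_CODES) are assumptions, and the sample row reuses the values reported for U.S. Patent No. 6,123,456.

```python
import sqlite3

GRANT_URL = ("https://bulkdata.uspto.gov/data/patent/grant/multipagepdf/"
             "{year_part}/grant_pdf_{date_part}.tar")

db = sqlite3.connect(":memory:")
# Assumed schema: column names and types are illustrative only.
db.executescript("""
    CREATE TABLE DATE_CODES (code INTEGER PRIMARY KEY, date_part TEXT UNIQUE);
    CREATE TABLE PATENTS (
        id INTEGER PRIMARY KEY,
        "offset" INTEGER,
        size INTEGER,
        year_part TEXT,
        date_code INTEGER REFERENCES DATE_CODES(code)
    );
    INSERT INTO DATE_CODES VALUES (1, '20000926_20001017');
    INSERT INTO PATENTS VALUES (1, 513618432, 630676, '2000', 1);
""")


def patent_location(db, patent_id):
    """Return (tar_url, offset, size) for a PATENTS row, or None if absent."""
    row = db.execute(
        'SELECT p.year_part, d.date_part, p."offset", p.size '
        "FROM PATENTS p JOIN DATE_CODES d ON d.code = p.date_code "
        "WHERE p.id = ?",
        (patent_id,),
    ).fetchone()
    if row is None:
        return None
    year_part, date_part, offset, size = row
    url = GRANT_URL.format(year_part=year_part, date_part=date_part)
    return url, offset, size
```

Storing the date string once in DATE_CODES and referencing it by integer avoids repeating a string of up to 17 characters in each of the millions of rows.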

Each of PATENTS and PUBLICATIONS has a corresponding FTS5 table. FTS5 is a full-text search extension of SQLite 3 that provides fast text searches, enabling near-instantaneous lookup of records with patent numbers like 12012345 or D951131. Each FTS5 table includes a foreign-key relation field to the base table (patent related to PATENTS and publication related to PUBLICATIONS), a number text field, and an application_number text field for looking up documents by application number.

The database was initially compiled from the data scavenged by the Google Cloud Platform virtual machines. A Python script running on a local server periodically checks the BDSS repository for updates, downloads any new TAR files, updates bti.sqlite, and deploys the updated database to USPTO Data.

The USPTO Data BTI API

bti.php is a small PHP script that serves as an API for the BTI database.

This API accepts two HTTP GET parameters:

ref (required) indicates a patent number, publication number, or application number. The format is flexible: U.S. patents can be specified as US6123456 or 6,123,456, etc. (but the format can aid with the inference of type).
type (optional) indicates a document type: patent, publication, or application. If this parameter is omitted, the API will attempt to infer the document type from the format of ref.

The API will query a local copy of bti.sqlite and will return a JSON object, such as the following for https://www.usptodata.com/bti/bti.php?ref=6123456:

{"ref":"6123456","reduced_ref":"6123456","type":"patent","url":"https://bulkdata.uspto.gov/data/patent/grant/multipagepdf/2000/grant_pdf_20000926_20001017.tar","offset":513618432,"size":630676,"db_timestamp":"20220511 12:07:22"}

…or for https://www.usptodata.com/bti/bti.php?ref=20160123456:

{"ref":"20160123456A1","reduced_ref":"20160123456A1","type":"publication","url":"https://bulkdata.uspto.gov/data/patent/application/multipagepdf/2016/app_pdf_20160505.tar","offset":4990404608,"size":883883,"db_timestamp":"20220511 12:07:22"}

For application numbers, bti.php will prefer a corresponding publication (if published) over a corresponding patent (if issued).

If the API cannot infer the document type, or if the database query returns no results, then the API returns:

{"ref":"asdf","reduced_ref":null,"type":null,"url":null,"offset":null,"size":null,"db_timestamp":"20220511 12:07:22"}
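A client might call the API and interpret its responses along the following lines. This is a hedged sketch: the function names are hypothetical, SAMPLE is the patent response shown above, and only the ref and type parameters and the url, offset, and size fields described in this section are relied upon.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.usptodata.com/bti/bti.php"

# The response shown above for ref=6123456, used here for offline parsing.
SAMPLE = (
    '{"ref":"6123456","reduced_ref":"6123456","type":"patent",'
    '"url":"https://bulkdata.uspto.gov/data/patent/grant/multipagepdf/2000/'
    'grant_pdf_20000926_20001017.tar","offset":513618432,"size":630676,'
    '"db_timestamp":"20220511 12:07:22"}'
)


def api_query_url(ref, doc_type=None):
    """Build the GET URL; `type` is sent only when explicitly given."""
    params = {"ref": ref}
    if doc_type is not None:
        params["type"] = doc_type
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_api_response(text):
    """Return (url, offset, size), or None when the API found no document."""
    data = json.loads(text)
    if data.get("url") is None:
        return None
    return data["url"], int(data["offset"]), int(data["size"])


def lookup(ref, doc_type=None):
    """Query the live API (requires network access)."""
    with urllib.request.urlopen(api_query_url(ref, doc_type)) as response:
        return parse_api_response(response.read().decode("utf-8"))
```

Treating a null url as "not found" covers both failure cases described above, since the API returns the same shape when it cannot infer the document type and when the query has no results.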

The BTI Python Script

bti.py is a Python 3 script that downloads patents and publications from BDSS using bti.sqlite. This Python script is a single, fully self-contained file that relies only on Python 3 built-in libraries and has no external dependencies. A GitHub repository for this script is provided at https://github.com/neuron-whisperer/bti.

The Python script can query either the USPTO Data API or a local, downloaded copy of bti.sqlite. Querying the API requires Internet access and is slightly slower; querying a local copy of the database (which is about 1.8 gigabytes uncompressed) is slightly faster. The downloadable copy of bti.sqlite is available at https://www.usptodata.com/bti/bti_sqlite.zip – just extract the .sqlite database and place it in the same directory as the script. The archive is updated weekly to add the latest patents and publications.

The Python script also stores all downloaded PDFs in a local cache directory, which can be set to any desired directory. The documents are stored with filenames such as:

U.S. Patent No. 6,123,456.pdf

U.S. Pub. No. 20160123456A1.pdf

GUI mode: If run without command-line arguments, the Python script presents a graphical user interface using the Tkinter Tcl/Tk built-in library. (Note: For macOS users, the built-in Tkinter library is broken, as indicated in this documentation page. As a result, the GUI cannot be shown on macOS unless you install a more recent version of Tkinter, as discussed in that documentation page. The Python script detects this problem on macOS devices and provides information to resolve it.)

The GUI accepts a reference number and, if necessary, infers a document type. If the document is cached and if “open” is checked, the GUI opens the PDF in the default PDF viewer. If the document is not cached, the GUI queries the local database or the USPTO Data API. If the database returns a result, the GUI downloads the document and stores it in the cache, and (if “open” is checked) opens the PDF in the default PDF viewer.

The GUI provides a configuration tab that includes:

An indication of the online or offline status of the BDSS archives;
An indication of whether the script is using the API or a local copy of the database; and
An indication of whether the current version of the Python script at USPTO Data is newer than the running version, and if so, an option to download the latest version.

Command-line mode: The Python script can be run in command-line mode with parameters such as the following:

python bti.py 6123456                      # U.S. Patent No. 6,123,456
python bti.py -t patent 6123456            # U.S. Patent No. 6,123,456 (explicitly specifying document type)
python bti.py -t publication 20160123456   # U.S. Publication No. 2016/0123456A1
python bti.py -d 6123456                   # download only - do not open in default PDF viewer
python bti.py -q 6123456                   # quiet mode - no console output
python bti.py -h                           # lists help options

Like GUI mode, the command-line mode of the Python script accepts the reference number and, if necessary, infers a document type. If the document is cached (and if -d is not indicated), the script opens the PDF in the default PDF viewer. If the document is not cached, the script queries the local database or the USPTO Data API to find the URL of the TAR file and the offset and size of the PDF. If the database returns a result, the script downloads the document and stores it in the cache, and (if -d is not indicated) opens the PDF in the default PDF viewer.
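The cache-first flow described above can be outlined as follows. This is a simplified sketch and not the logic of bti.py itself: retrieve is a hypothetical function, the lookup and download steps are passed in as callables, and the cache filename is a plain placeholder rather than the friendlier names shown earlier. The demonstration uses stand-in callables so that it runs without network access.

```python
import tempfile
from pathlib import Path


def retrieve(ref, cache_dir, lookup, fetch_range):
    """Return the cached PDF path for `ref`, downloading it if needed."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{ref}.pdf"  # placeholder naming scheme
    if path.exists():
        return path  # cache hit: no lookup and no download
    location = lookup(ref)  # local database or API
    if location is None:
        return None  # unknown reference
    url, offset, size = location
    path.write_bytes(fetch_range(url, offset, size))
    return path


# Demonstration with stand-in callables that record how they are used.
calls = []


def fake_lookup(ref):
    calls.append("lookup")
    return ("https://example.invalid/archive.tar", 1536, 8)


def fake_fetch(url, offset, size):
    calls.append("fetch")
    return b"%PDF-1.4"


with tempfile.TemporaryDirectory() as tmp:
    first = retrieve("6123456", tmp, fake_lookup, fake_fetch)
    second = retrieve("6123456", tmp, fake_lookup, fake_fetch)
    cached_bytes = second.read_bytes()
```

The second call returns the cached file without touching either callable, which is the behavior that makes repeat viewing of a document effectively instantaneous.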

Frequently Asked Questions (FAQ)

Question: What are the objectives of the BDSS TAR Index?
Answer: The BDSS TAR Index project has three key objectives:
1. Improving access to patent and publication PDFs in a manner that is quick, efficient, free, and subject to automation;
2. Reducing the cost to the USPTO to provide patent and publication PDFs to the patent community; and
3. Advancing the state of software tools and resources that the patent community uses to access the data products of the USPTO.

Question: What kinds of documents does the BDSS TAR Index cover?
Answer: The BDSS TAR Index includes the following document types:
- Original utility patents (e.g., U.S. Patent No. 10,123,456)
- Utility publications (e.g., U.S. Publication No. 2017/0127557A1)
- Utility republications (e.g., U.S. Publication No. 2015/0175286A2)
- Reissue utility patents (e.g., U.S. Patent No. RE38,737)
- Plant patents (e.g., U.S. Patent No. PP15,369)
- Plant publications (e.g., U.S. Publication No. 2004/0111777P1)
- Design patents (e.g., U.S. Patent No. D887,060)
- Statutory invention registrations (e.g., H1,808)
- Defensive publications (e.g., T965,002)
- Additional Improvement (AI) publications (e.g., AI225)
- X-Patents (e.g., X1)
- Reissue X-Patents (e.g., RX1)

Question: What formats does BTI require for references?
Answer: BTI is designed to be as robust as possible. Internally, references are identified according to a canonical numbering format: no punctuation, no whitespace, and no leading zeros in number parts. However, BTI handles user input much more robustly and tries to convert many forms of user input to the canonical representation. Thus, U.S. Patent No. 6,123,456 can be queried as: 6123456, or 6,123,456, or US6123456, or U.S. 6,123,456 B1, etc. – all of these forms of input will be converted to 6123456.
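As a rough illustration of this kind of normalization (not the rules that BTI actually applies), the following sketch reduces the patent forms listed above to the canonical representation. The function name and the regular expressions are assumptions, and publication numbers are deliberately left out of scope because the sample API response shows them retaining a kind code such as A1.

```python
import re


def canonicalize_patent(ref: str) -> str:
    """Reduce a patent reference to prefix letters plus unpadded digits."""
    s = re.sub(r"[^A-Z0-9]", "", ref.upper())  # drop punctuation, whitespace
    s = re.sub(r"^US", "", s)                  # drop the country prefix
    s = re.sub(r"(?<=\d)[A-Z]\d?$", "", s)     # drop a trailing kind code (B1)
    match = re.fullmatch(r"([A-Z]*)0*(\d+)", s)
    if match is None:
        raise ValueError(f"unrecognized patent reference: {ref!r}")
    prefix, digits = match.groups()
    return prefix + digits


examples = ["6123456", "6,123,456", "US6123456", "U.S. 6,123,456 B1"]
canonical = [canonicalize_patent(e) for e in examples]
```

The same function also handles prefixed document types, for example reducing D0950861 to D950861 and RE38,737 to RE38737.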

Question: Why does the BDSS TAR Index use SQLite as the storage format?
Answer: SQLite has three key advantages for this project. First, the Python 3 standard library includes support for SQLite (the sqlite3 module), so no additional modules or software are required. Second, SQLite databases are stored as a single file, which provides greater portability than packages like MySQL where the engine stores, owns, and manages the data. And third, while SQLite is not as full-featured as MySQL or other packages, its feature set is completely sufficient for this project. For example, SQLite supports full-text-search (FTS5) tables that enable rapid searching of text fields over millions of records.

Question: What is the future of the BDSS TAR Index?
Answer: As of this initial release, I plan to spend the next several months monitoring, stabilizing, and curating the BDSS TAR Index. In the future, features that might be added to BTI include:
- Adding indices for other data sets within the BDSS repository. Currently, the only other data that could be indexed is the “single-page TIFFs” collection. (However, I am not sure whether the use of TIFFs by the patent community justifies the effort to index it.) The other files in BDSS would not be aided by an index. For example, BDSS includes TARs containing weekly XML files (which are typically published as a single massive XML file, and thus not indexable). BDSS also includes various collections of data that are published as zip files – but, as previously mentioned, the zip file specification appends an index to each zip file.
- Adding a web interface that can retrieve the PDF from BDSS and send it to the user. Ideally, this could be handled by client-side JavaScript, but this is not possible due to the browser's same-origin policy. (In essence, a web page that downloads JavaScript from domain #1, such as usptodata.com, cannot read data from domain #2, such as uspto.gov, unless domain #2 explicitly permits it.) The alternative is to extend the PHP script to perform this task, but this requires an unknown amount of data ingress, compute, and egress, and the demands might exceed the bounds of this project.

More Information

For any questions, please contact me via email to mail@usptodata.com.

For further information about USPTO data products and projects, please visit https://usptodata.com.