Bitquery Blockchain Data Lake: Sample and Streaming Example

This repo shows what a block in the Bitquery Blockchain Data Lake looks like and how to stream one from the lake over a standard S3 interface.

What's here

.
├── stream.py           # stream a block from the lake; throughput, decode, tx JSON, transfers
├── transfers.py        # transfers_from_block(): native, ERC-20, ERC-721 extraction
├── requirements.txt
├── Dockerfile          # builds the self-seeding demo lake image
├── entrypoint.sh       # boots SeaweedFS and loads the sample block
├── stream.log          # full sample output of stream + decode + tx JSON
└── transfers.log       # sample output of the --transfers run

The sample block

The demo image ships a real Base block (Chain ID 8453), block number 46,600,927, so you can see the exact format the lake serves:

A single block stored as a Protobuf message, LZ4-compressed.
About 3.2 MB compressed and 11.6 MB decoded.
Contains the block header, 169 transactions, 1,180 logs, plus full receipts and traces.

The block schema is published at bitquery/streaming_protobuf, and the Python bindings come from bitquery-pb2-kafka-package. It is the same schema Bitquery uses for its Kafka streams, so one decoder works for both.

Sample output

After pointing the script at the demo lake (see Stream a block), this command streams the block, prints throughput, decodes it, and dumps the first three transactions as JSON straight from the pb2 schema:

python3 stream.py --bucket archive --key "$KEY" --tx 3

  streamed     3.24 MB  in   0.0s  ->    203.3 MB/s (1.63 Gbps)
  reads: 1, object size: 3.24 MB

  decoded 11.55 MB (evm):
  number     : 46,600,927
  hash       : 0x0133403c4fe53c434b1d2a1686d339eebd4e8e7f50ab52ab84cd68029e82e955
  timestamp  : 1779991201
  gas used   : 50,521,537
  transactions: 169
  logs        : 1,180

  first 3 transaction(s) as JSON:

--- transaction #0 ---
{
  "TransactionHeader": {
    "Hash": "w9VR+y41FcMsk8LvKHYlRRHgdCfg3kkd/lWd1K8HvS4=",
    "Gas": "1000000",
    "Data": "Pba+KwAACN0AEBwSAAAAAAAAAAQAAAAAahiCYwAAAAABgHPlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA5HOxsAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPzQHYlB/RvebPz97nUBKWDMsXMW3xtFk5B0Vv8WHJWmDXIiAAAAAAAAAAAAAAAAUFD2mpeG8IFQkjTxp/RoS15bdskAAAAAAAAAAAAAAAAAlA==",
    "Protected": true,
    "Type": 126,
    "To": "QgAAAAAAAAAAAAAAAAAAAAAAABU=",
    "From": "3q3erd6t3q3erd6t3q3erd6tAAE=",
    "ToCode": {
      "Hash": "H5WGVKsGoVKZPnoK57bbsNSxkmXMkze4eJ/hNTvZ3DU=",
      "Size": 2055
    },
    "IsSystemTx": false,
    "SourceHash": "QdzORbzbW87xDkjNIAiUs6c+ERcQyK3Mpqs4tbhw2GM=",
    "Time": "1779991201000000000"
  },
  "Signature": {},
  "Receipt": {
    "ReceiptHeader": {
      "GasUsed": "46218",
      "Type": 126,
      "CumulativeGasUsed": "46218",
      "Status": "1"
    }
  },
  "Trace": { "... full call trace and opcode capture states ..." }
}

--- transaction #1 ---
{
  "TransactionHeader": {
    "Index": "1",
    "Hash": "y+Zdue1Xv1ewK5fb3WZylm1rMqkHzdtRHUcxmjX1asM=",
    "Gas": "100000",
    "Type": 2,
    "To": "QIdhDPfPAYPmJs87yEFSpCc3S6M=",
    "From": "guFZ1j5YUGfp+jpLun2ZL7xmd1E="
  },
  "Receipt": {
    "ReceiptHeader": {
      "GasUsed": "46371",
      "CumulativeGasUsed": "92589",
      "Status": "1"
    },
    "Logs": [
      {
        "LogHeader": {
          "Address": "QIdhDPfPAYPmJs87yEFSpCc3S6M=",
          "Data": "//////////////////////////////////////////8="
        },
        "Topics": [
          { "Hash": "jFvh5evsfVvRT3FCfR6E890DFMD3sikeWyAKyMfDuSU=" },
          { "Index": "1", "Hash": "AAAAAAAAAAAAAAAAguFZ1j5YUGfp+jpLun2ZL7xmd1E=" },
          { "Index": "2", "Hash": "AAAAAAAAAAAAAAAA2LqdGpn8IfDsok6bhXN8KKGUpOI=" }
        ]
      }
    ]
  },
  "Trace": { "... full execution trace ..." }
}

--- transaction #2 ---
{ "... same pb2 fields: header, signature, receipt, logs, trace ..." }

Each transaction JSON includes the full pb2 structure: header, signature, receipt (with logs), and trace. The excerpt above trims bloom filters, long Data fields, and trace internals for readability. See stream.log for the complete run (~23k lines).

How blocks are stored

The data lake holds the complete archive of each chain, from genesis to the current tip, as one .block.lz4 file per block. Files are named by block number and hash:

<block_number>_<block_hash>_<...>.block.lz4

Bitquery hosts the lake and writes the blocks. You read them over an S3 interface, which means any standard S3 client works (aws s3, boto3, s3fs).

Stream a block

pip install -r requirements.txt

KEY="base/blocks/000046600927_0x0133403c4fe53c434b1d2a1686d339eebd4e8e7f50ab52ab84cd68029e82e955_49e9339dd61bdb91320044378bff935efd925d868ca257ef8c3bc42177f9fd44.block.lz4"

Run the demo lake. We publish a ready-to-run image that starts a SeaweedFS lake with the sample block already loaded. You do not upload anything.

docker run -p 8333:8333 marketingbitquery/datalake-demo

This gives you a lake at http://localhost:8333. Point the script at it:

export DATA_LAKE_ENDPOINT=http://localhost:8333
export DATA_LAKE_ACCESS_KEY=admin
export DATA_LAKE_SECRET_KEY=secret

Stream, decode, and print the first three transactions:

python3 stream.py --bucket archive --key "$KEY" --tx 3

For a block summary only (no transaction JSON):

python3 stream.py --bucket archive --key "$KEY" --decode

Keep streaming for a fixed window to see a steady rate:

python3 stream.py --bucket archive --key "$KEY" --duration 15

Run several readers in parallel to see aggregate throughput scale:

python3 stream.py --bucket archive --key "$KEY" --duration 15 --concurrency 16

Extract transfers from the block (native, ERC-20, ERC-721):

python3 stream.py --bucket archive --key "$KEY" --transfers
python3 stream.py --bucket archive --key "$KEY" --transfers 5 --token 0x4200000000000000000000000000000000000006

  469 transfers  {'erc20': 422, 'native': 28, 'erc721': 19}

  [erc20] 0x4200000000000000000000000000000000000006  0x498581ff... -> 0xbf4195ab...  amount 3390958905493657

The parser is transfers_from_block() in transfers.py. See Extract transfers from a block for the full walkthrough.

Options

Flag	Default	Purpose
`--endpoint`	`DATA_LAKE_ENDPOINT` env	S3 endpoint of the data lake
`--bucket`	(required)	Bucket name
`--key`	(required)	Object key of the `.block.lz4` file
`--region`	`DATA_LAKE_REGION` env, else `us-east-1`	S3 region
`--chain`	`evm`	Schema to decode with: `evm`, `solana`, `tron`, `utxo`
`--decode`	off	Decode the block after streaming and print a summary
`--tx N`	`0`	After streaming, decode and print the first N transactions as JSON (full pb2 fields). Implies decode.
`--json`	off	After streaming, decode and print the full block as JSON (large). Implies decode.
`--transfers N`	`0`	After streaming, extract transfers (native, ERC-20, ERC-721) and print the first N (`-1` for all). EVM only. Implies decode.
`--token 0x...`	none	With `--transfers`, only show this token contract
`--duration`	`0`	Keep streaming for N seconds
`--concurrency`	`1`	Number of parallel readers; greater than 1 runs fan-out mode

Decode the local sample without streaming

The demo image bakes the block into /seed/. You can also decode a downloaded .block.lz4 directly:

import lz4.frame
from evm.block_message_pb2 import BlockMessage

raw = open("000046600927_0x0133403c4fe53c434b1d2a1686d339eebd4e8e7f50ab52ab84cd68029e82e955_49e9339dd61bdb91320044378bff935efd925d868ca257ef8c3bc42177f9fd44.block.lz4", "rb").read()
block = BlockMessage()
block.ParseFromString(lz4.frame.decompress(raw))
print(int.from_bytes(block.Header.Number, "big"), len(block.Transactions))

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Bitquery Blockchain Data Lake: Sample and Streaming Example

What's here

The sample block

Sample output

How blocks are stored

Stream a block

Options

Decode the local sample without streaming

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
000046600927_0x0133403c4fe53c434b1d2a1686d339eebd4e8e7f50ab52ab84cd68029e82e955_49e9339dd61bdb91320044378bff935efd925d868ca257ef8c3bc42177f9fd44.block.lz4		000046600927_0x0133403c4fe53c434b1d2a1686d339eebd4e8e7f50ab52ab84cd68029e82e955_49e9339dd61bdb91320044378bff935efd925d868ca257ef8c3bc42177f9fd44.block.lz4
Dockerfile		Dockerfile
README.md		README.md
entrypoint.sh		entrypoint.sh
requirements.txt		requirements.txt
stream.log		stream.log
stream.py		stream.py
transfers.log		transfers.log
transfers.py		transfers.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Bitquery Blockchain Data Lake: Sample and Streaming Example

What's here

The sample block

Sample output

How blocks are stored

Stream a block

Options

Decode the local sample without streaming

Links

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages