@@ -104,7 +104,7 @@ model=BAAI/bge-large-en-v1.5
 revision=refs/pr/5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4 --model-id $model --revision $revision
 ```
 
 And then you can make requests like
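For reference (unchanged by this diff), the request goes to the server started above; a minimal sketch, assuming the default `/embed` route and the mapped port 8080:

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```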
@@ -309,13 +309,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to
 
 | Architecture                        | Image                                                                    |
 |-------------------------------------|--------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.3                    |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.4                    |
 | Volta                               | NOT SUPPORTED                                                            |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.3 (experimental)  |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.3                        |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.3                     |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.3                     |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.3 (experimental)  |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.4 (experimental)  |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.4                        |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.4                     |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.4                     |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.4 (experimental)  |
 
 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
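As an illustration of that variable (a sketch, not part of this diff), combining the Turing image from the table above with the opt-in flag:

```shell
model=BAAI/bge-large-en-v1.5
volume=$PWD/data

# Flash Attention v1 is off by default on Turing; opt in explicitly
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.4 --model-id $model
```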
@@ -344,7 +344,7 @@ model=<your private model>
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your cli READ token>
 
-docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model
+docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4 --model-id $model
 ```
 
 ### Using Re-rankers models
@@ -362,7 +362,7 @@ model=BAAI/bge-reranker-large
 revision=refs/pr/4
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4 --model-id $model --revision $revision
 ```
 
 And then you can rank the similarity between a query and a list of texts with:
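A sketch of such a request (the `/rerank` route takes a `query` plus a list of `texts` in the TEI API):

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```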
@@ -382,7 +382,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4 --model-id $model
 ```
 
 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
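A sketch of a `predict` call for the classification model above (same JSON `inputs` payload as the embedding routes):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```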
@@ -402,7 +402,7 @@ You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM archi
 model=naver/efficient-splade-VI-BT-large-query
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --pooling splade
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4 --model-id $model --pooling splade
 ```
 
 Once you have deployed the model you can use the `/embed_sparse` endpoint to get the sparse embedding:
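A sketch of a sparse-embedding request; the response is a list of non-zero token weights rather than a dense vector:

```shell
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```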
@@ -432,7 +432,7 @@ model=BAAI/bge-large-en-v1.5
 revision=refs/pr/5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3-grpc --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.4-grpc --model-id $model --revision $revision
 ```
 
 ```shell