We recently conducted an experiment, AI Freight Forwarder, where we tested various large language models (LLMs) to assess their ability to pass the National Freight Forwarder examination in Poland.
This experiment provided valuable insights into the strengths and weaknesses of different models, helping us understand their suitability for real-world applications in the Transportation, Shipping, and Logistics (TSL) industry.
Testing the different LLM models
The number of LLM models continues to proliferate – there are over 1.4 million models published on Hugging Face! For this experiment, we selected the leading, most popular models for evaluation:
- Claude by Anthropic (Haiku and Sonnet releases)
- ChatGPT 4o (August and November releases, as well as Mini)
- Grok by xAI
- Gemini by Google (Pro and Flash)
- DeepSeek Chat v3 by DeepSeek
- Qwen Instruct by Alibaba Cloud
- Mixtral by Mistral AI
- Llama 3 by Meta
Most of these models successfully passed the Freight Forwarder exam, achieving different scores. If you want to explore our full findings, check out our previous article detailing the experiment. For a deeper dive into the technical setup of the AI Freight Forwarder test, read Technical aspects of the AI Freight Forwarder experiment.
Which LLM model works best?
The short answer: it depends on your criteria.
Below is a comparative table showcasing key performance metrics for each model, including cost, context window, latency, and throughput.
Let’s take a quick look at what results the different models scored.
Performance comparison table

Exam results and implications
The passing threshold for the Freight Forwarder exam was 75%. The highest-scoring model achieved 83%, while the lowest scored 63%. This indicates that while most models can handle automation tasks in logistics, their effectiveness varies.
Claude 3 Haiku, ChatGPT 4o Mini, Mixtral and Llama 3 didn’t pass the exam, so we wouldn’t recommend using them in your custom TSL solutions.
How to choose the right LLM model?
When selecting an LLM model for your business use case, consider the following criteria:
Output quality
Output quality reflects how each model scored on the exam. Claude Sonnet was the winner, with ChatGPT and Grok not far behind. Any of these models could thus work well for your use case, as they displayed similar results.
If you prioritize accuracy and response quality:
- Best option: Claude Sonnet (Top performer in the exam)
- Other strong options: ChatGPT 4o and Grok (Close behind in accuracy)
Cost efficiency
The best balance of cost and performance was achieved by DeepSeek – it didn’t score the top result, but came 6th at the lowest cost. There are cheaper models, but they didn’t perform as well on the test. If price per token matters to you, go with DeepSeek. It is an open-source model, hence the low price per token. However, self-hosting it brings ongoing infrastructure costs that will take a significant portion of your budget, so this option only makes sense if you plan to process a large volume of tokens.
If you won’t be using a large volume of tokens, a more suitable option is an LLM accessed through an API. In that case, the cheapest model would be Claude 3.5 Haiku.
While Claude Sonnet achieved the best score, it incurred the highest bills and showed slower performance compared to other models. Claude Sonnet is thus a good option when you prioritize output quality over costs and speed.
If minimizing costs is your priority:
- Best option: DeepSeek for large volume of tokens (open-source, so you’d have to cover infrastructure costs) or Claude 3.5 Haiku (API only - pricier than DeepSeek, but no infrastructure costs are involved)
- Trade-off: Cheaper models exist, but they performed worse on the exam
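To see when self-hosting an open-source model beats pay-per-token API pricing, here is a minimal break-even sketch in Python. All prices below are made-up placeholders, not actual vendor rates – plug in current pricing for your shortlisted models.

```python
# Break-even sketch: pay-per-token API vs. self-hosting an open-source model.
# All figures are illustrative placeholders, NOT actual vendor rates.

def api_cost(tokens: int, price_per_million: float) -> float:
    """Total cost of processing `tokens` through a pay-per-token API."""
    return tokens / 1_000_000 * price_per_million

def self_hosted_cost(monthly_infra: float, months: int) -> float:
    """Self-hosting: you pay for infrastructure, not per token."""
    return monthly_infra * months

def breakeven_tokens(price_per_million: float, monthly_infra: float, months: int) -> float:
    """Token volume at which self-hosting becomes cheaper than the API."""
    return monthly_infra * months / price_per_million * 1_000_000

# Example with made-up numbers: $1 per 1M tokens via API,
# a $500/month GPU server, over a 12-month horizon.
volume = breakeven_tokens(price_per_million=1.0, monthly_infra=500.0, months=12)
print(f"Self-hosting pays off above {volume:,.0f} tokens per year")
```

At these placeholder rates the break-even point sits in the billions of tokens per year, which is why we only recommend the self-hosted route for high-volume use cases.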
Efficiency (speed & throughput)
Gemini Flash offers the highest throughput, followed by Claude Haiku and ChatGPT 4o. Note, however, that Gemini Flash was released in May 2024, which already makes it a relatively old model. If throughput is your priority, a newer model such as ChatGPT 4o may be the more advisable choice (though potentially at a higher cost).
If speed and high throughput matter:
- Best option: Gemini Flash (Highest throughput)
- Alternatives: Claude Haiku, ChatGPT 4o (Newer models with strong performance)
Context length (input capacity)
If you need to provide the LLM model with broad additional context to work for your business, your best option is Google Gemini Pro – it surpasses all competitors by a wide margin, offering the largest context window at a relatively affordable price.
For a very broad context, we’d suggest applying retrieval-augmented generation (RAG). RAG improves the output of an LLM by fetching relevant data from external sources and using it to enrich prompts. This way, you don’t have to fit all of your context into a single prompt.
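As an illustration, here is a minimal RAG sketch in Python. The keyword-overlap retriever is a toy stand-in – production systems typically use embeddings and a vector database – but the retrieve-then-enrich flow is the same.

```python
# Minimal RAG sketch: retrieve the most relevant snippets from an external
# knowledge base and prepend them to the prompt, instead of stuffing the
# entire context into the LLM's window. Word overlap stands in for a real
# embedding-based retriever here.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Enrich the prompt with only the retrieved snippets."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

# A tiny freight-forwarding knowledge base (illustrative content):
docs = [
    "Incoterms define responsibility for freight costs and risk.",
    "A bill of lading is a document issued by a carrier to a shipper.",
    "Demurrage is a charge for exceeding free time at the port.",
]
prompt = build_prompt("What is a bill of lading?", docs)
# The assembled prompt contains only the most relevant snippets,
# keeping the token count low regardless of knowledge-base size.
```

The enriched prompt is then sent to whichever LLM you chose; the knowledge base can grow far beyond any model’s context window because only the retrieved snippets are ever included.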
If your use case requires longer context windows (e.g. for analysis of large volumes of data):
- Best option: Google Gemini Pro (Supports 2,000,000 tokens, far exceeding competitors)
- Why it matters: LLM models vary in their context-handling capacity. Those with smaller context windows may not be able to analyze large volumes of data in a single pass.
Model control & deployment options
If you want full control over the model, opt for DeepSeek or Qwen - both are available for download and self-hosted usage. Most other well-performing models are accessible via API only.
If you need self-hosted models for security and control:
- Best options: DeepSeek and Qwen (Available for download)
- Other models: Most high-performing models are API-only.
The best LLMs for your business
The best LLM model for your use case depends on your priorities:
If you prioritize accuracy, go for Claude Sonnet or ChatGPT 4o.
If cost per token is a major factor, DeepSeek offers a great balance between price and performance for large token volumes.
If you need high efficiency, consider Gemini Flash or ChatGPT 4o.
If your use case requires large input context, Google Gemini Pro is the clear winner.
If you prefer self-hosting, DeepSeek and Qwen are your best options.
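For teams that want to make this checklist executable, the priorities above can be encoded as a simple lookup. This sketch just mirrors the recommendations in this article and will naturally age as new models ship.

```python
# This article's decision guide, encoded as a lookup table.
# Recommendations reflect the exam results at the time of writing.

RECOMMENDATIONS = {
    "accuracy": ["Claude Sonnet", "ChatGPT 4o"],
    "cost": ["DeepSeek", "Claude 3.5 Haiku"],
    "throughput": ["Gemini Flash", "ChatGPT 4o"],
    "context": ["Google Gemini Pro"],
    "self-hosting": ["DeepSeek", "Qwen"],
}

def recommend(priority: str) -> list[str]:
    """Return the models this article suggests for a given priority."""
    try:
        return RECOMMENDATIONS[priority]
    except KeyError:
        raise ValueError(
            f"Unknown priority: {priority!r}. "
            f"Choose one of {sorted(RECOMMENDATIONS)}"
        )

print(recommend("context"))  # ['Google Gemini Pro']
```

In practice you would weigh several priorities at once – for instance, accuracy first, then cost as a tiebreaker – but the table makes the trade-offs explicit.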
When you carefully match the model’s capabilities with your business needs, you will be able to optimize performance, costs, and scalability, ensuring the best return on investment in your AI-powered automation.
It’s critical to add that the landscape of LLM models is changing at an extremely high speed. What we present in this article may soon become outdated, as both the models and the assessment criteria may change in a matter of months, if not weeks. We intend to continue testing the different LLM models as the landscape evolves, to ensure we can advise you on their application in the best possible way.
How we can help with AI integration
Choosing the right LLM model is just the first step – successfully integrating AI into your business operations is where the real value lies. This is where we come in.
As a software development company with two decades of experience working with logistics, transportation, and supply chain businesses, we understand the unique challenges of the industry. We specialize in custom AI solutions that:
- Streamline operations: AI-powered automation can optimize freight management, document processing, and customer support.
- Reduce costs: smart AI models help minimize errors, cut manual workloads, and improve efficiency.
- Enhance decision-making: AI-driven insights can help with route optimization, demand forecasting, and risk assessment.
- Keep you ahead of the competition: our AI-driven strategies help logistics companies stay competitive in a fast-changing market.
If you are looking to integrate AI into your logistics operations, we would love to help. Book a free discovery call with our team and we’ll explore your needs together.