From Entities to Alphas: Launching the Python Version of the Equities Entity Store

Introduction

When we launched the Equities Entity Store in Mathematica, it revolutionized how financial professionals interact with market data by bringing semantic structure, rich metadata, and analysis-ready information into a unified framework. Mathematica’s EntityStore provided an elegant way to explore equities, ETFs, indices, and factor models through a symbolic interface. However, the industry landscape has evolved—the majority of quantitative finance, data science, and machine learning now thrives in Python.

While platforms like FactSet, WRDS, and Bloomberg provide extensive financial data, quantitative researchers still spend up to 80% of their time wrangling data rather than building models. Current workflows often involve downloading CSV files, manually cleaning them in pandas, and stitching together inconsistent time series—all while attempting to avoid subtle lookahead bias that invalidates backtests.

Recognizing these challenges, we’ve reimagined the Equities Entity Store for Python, focusing first on what the Python ecosystem does best: scalable machine learning and robust data analysis.

The Python Version: What’s New

Rather than beginning with metadata-rich entity hierarchies, the Python Equities Entity Store prioritizes the intersection of high-quality data and predictive modeling capabilities. At its foundation lies a comprehensive HDF5 dataset containing over 1,400 features for 7,500 stocks, measured monthly from 1995 to 2025—creating an extensive cross-sectional dataset optimized for sophisticated ML applications.

Our lightweight, purpose-built package includes specialized modules for:

  • Feature loading: Efficient extraction and manipulation of data from the HDF5 store
  • Feature preprocessing: Comprehensive tools for winsorization, z-scoring, neutralization, and other essential transformations
  • Label construction: Flexible creation of target variables, including 1-month forward information ratio
  • Ranking models: Advanced implementations including LambdaMART and other gradient-boosted tree approaches
  • Portfolio construction: Sophisticated tools for converting model outputs into actionable investment strategies
  • Backtesting and evaluation: Rigorous performance assessment across multiple metrics
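To make the preprocessing step concrete, here is a minimal sketch of winsorization followed by z-scoring on a single monthly cross-section. It uses plain NumPy rather than the package's own functions; the helper names, the percentile cut-offs and the sample data are illustrative assumptions only.

```python
import numpy as np

def winsorize(x, lower_pct=1.0, upper_pct=99.0):
    """Clip one cross-section to the given percentiles, ignoring NaNs."""
    lo, hi = np.nanpercentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def zscore(x):
    """Standardize one cross-section to zero mean and unit standard deviation."""
    return (x - np.nanmean(x)) / np.nanstd(x)

# Hypothetical single-month cross-section of one feature, with an outlier and a gap
raw = np.array([0.5, 1.2, -0.3, 25.0, 0.9, np.nan, 0.1, -0.8])
clipped = winsorize(raw, lower_pct=5.0, upper_pct=95.0)
clean = zscore(clipped)
```

Because each month is processed using only that month's values, a cross-sectional transform of this kind cannot see future data; missing values simply stay missing.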


Guaranteed Protection Against Lookahead Bias

A critical advantage of our Python Equities Entity Store implementation is its robust safeguards against lookahead bias—a common pitfall that compromises the validity of backtests and predictive models. Modern ML preprocessing pipelines often inadvertently introduce information from the future into training data, leading to unrealistic performance expectations.

Unlike platforms such as QuantConnect, Zipline, or even custom research environments that require careful manual controls, our system integrates lookahead protection at the architectural level:

# Example: Time-aware feature standardization with strict temporal boundaries
from equityentity.features.preprocess import TimeAwareStandardizer

# This standardizer only uses data available up to each point in time
standardizer = TimeAwareStandardizer(lookback_window='60M')
zscore_features = standardizer.fit_transform(raw_features)

# Instead of the typical approach that inadvertently leaks future data:

# DON'T DO THIS: sklearn.preprocessing.StandardScaler().fit_transform(raw_features)
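For readers who want to see the idea behind a time-aware standardizer, the following is a minimal sketch in plain NumPy. It is not the implementation of TimeAwareStandardizer; the function name, the 60-observation window and the min_periods threshold are assumptions chosen to mirror the example above.

```python
import numpy as np

def trailing_zscore(series, window=60, min_periods=12):
    """Z-score each observation using only the trailing `window` values up to and including it."""
    series = np.asarray(series, dtype=float)
    out = np.full(series.shape, np.nan)
    for t in range(len(series)):
        start = max(0, t - window + 1)
        history = series[start : t + 1]   # nothing from t+1 onward is ever touched
        if len(history) < min_periods:
            continue                      # not enough history yet: leave as NaN
        sd = history.std(ddof=1)
        if sd > 0:
            out[t] = (series[t] - history.mean()) / sd
    return out

# Hypothetical monthly feature history for a single stock
rng = np.random.default_rng(0)
monthly = rng.normal(size=120)
z = trailing_zscore(monthly)
```

The mean and standard deviation are re-estimated at every date from past data only, so later observations cannot influence earlier standardized values.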

Multiple safeguards are integrated throughout the system:

  • Time-aware preprocessing: All transformations (normalization, imputation, feature engineering) strictly respect temporal boundaries
  • Point-in-time data snapshots: Features reflect only information available at the decision point
  • New listing delay: Stocks are only included after a customizable delay period from their first trading date
# From our data_loader.py - IPO bias protection through months_delay

for i, symbol in enumerate(symbols):
    first_date = universe_df[universe_df["Symbol"] == symbol]["FirstDate"].iloc[0]
    delay_end = first_date + pd.offsets.MonthEnd(self.months_delay)
    valid_mask[:, i] = dates_pd > delay_end

  • Versioned historical data: Our HDF5 store maintains proper vintages to reflect real-world information availability
  • Pipeline validation tools: Built-in checks flag potential lookahead violations during model development
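One simple way to test a transformation for lookahead is to perturb the data after a cutoff date and confirm that outputs before the cutoff do not change. The sketch below illustrates that idea with plain NumPy; it is not the package's validation tooling, and all function names are hypothetical.

```python
import numpy as np

def is_causal(transform, data, cutoff, shock=1e3):
    """True if outputs before `cutoff` are unchanged when data from `cutoff` onward is perturbed."""
    baseline = transform(data)
    shocked_input = data.copy()
    shocked_input[cutoff:] += shock
    shocked = transform(shocked_input)
    return bool(np.allclose(baseline[:cutoff], shocked[:cutoff], equal_nan=True))

def full_sample_zscore(x):
    # Leaky: the mean and standard deviation use the whole sample, including the future
    return (x - x.mean()) / x.std()

def expanding_zscore(x):
    # Causal: statistics at time t use observations 0..t only
    out = np.full(x.shape, np.nan)
    for t in range(1, len(x)):
        hist = x[: t + 1]
        out[t] = (x[t] - hist.mean()) / hist.std(ddof=1)
    return out

rng = np.random.default_rng(1)
data = rng.normal(size=60)
leaky_passes = is_causal(full_sample_zscore, data, cutoff=40)
causal_passes = is_causal(expanding_zscore, data, cutoff=40)
```

The full-sample scaler fails the check because its statistics depend on the perturbed future values, while the expanding version passes.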

While platforms like Numerai provide pre-processed features to prevent lookahead, they limit you to their feature set. The Equities Entity Store (EES) gives you the same guarantees while allowing complete flexibility in feature engineering, with verification tools to validate your pipeline’s temporal integrity.

Application: Alpha from Feature Ranking

As a proof of concept, we’ve implemented a sophisticated stock ranking system using the LambdaMART algorithm, applied to a universe of current and former components of the S&P 500 Index. The target label is the 1-month information ratio (IR_1m), constructed as:

IR_1m = (r_i,t+1 – r_benchmark,t+1) / σ(r_i – r_benchmark)

Where r_i,t+1 is the forward 1-month return of stock i, r_benchmark is the corresponding sector benchmark return, and σ is the tracking error.
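As a rough illustration of the label definition, the sketch below computes IR_1m from arrays of monthly stock and benchmark returns. The 12-month trailing window used to estimate tracking error, the function name and the random sample data are assumptions; the article does not specify how σ is estimated.

```python
import numpy as np

def ir_1m_labels(stock_ret, bench_ret, te_window=12):
    """Both inputs have shape (n_months, n_stocks); row t of the result is the label formed at month t."""
    active = stock_ret - bench_ret
    n_months = active.shape[0]
    labels = np.full(active.shape, np.nan)
    for t in range(te_window - 1, n_months - 1):
        # Tracking error estimated from active returns in months <= t only
        te = active[t - te_window + 1 : t + 1].std(axis=0, ddof=1)
        te = np.where(te > 0, te, np.nan)
        # Numerator is next month's active return, which is the quantity being predicted
        labels[t] = active[t + 1] / te
    return labels

rng = np.random.default_rng(7)
stock_ret = rng.normal(0.01, 0.05, size=(36, 5))   # hypothetical monthly stock returns
bench_ret = rng.normal(0.008, 0.03, size=(36, 5))  # hypothetical sector benchmark returns
labels = ir_1m_labels(stock_ret, bench_ret)
```

The final month has no label because its forward return is not yet known, and the first eleven months have none because the tracking error window is incomplete.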

Using the model’s predicted rank scores, we form decile portfolios rebalanced monthly over a 25-year period (2000-2025), with an average turnover of 66% per month.
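The mechanics of forming a top-decile portfolio and measuring its turnover can be sketched as follows. The turnover definition used here (the share of current holdings that were not held the previous month, for an equal-weighted portfolio) is an assumption, as are the helper names and the random scores.

```python
import numpy as np

def top_decile(scores):
    """Indices of the stocks in the highest-scoring tenth of the cross-section."""
    n_top = max(1, len(scores) // 10)
    return set(np.argsort(scores)[-n_top:].tolist())

def one_way_turnover(prev_holdings, curr_holdings):
    """Fraction of the current equal-weighted portfolio that was not held last month."""
    if not curr_holdings:
        return 0.0
    return len(curr_holdings - prev_holdings) / len(curr_holdings)

rng = np.random.default_rng(2)
scores = rng.normal(size=(24, 500))   # hypothetical (months, stocks) rank scores
holdings = [top_decile(row) for row in scores]
turnover = [one_way_turnover(a, b) for a, b in zip(holdings[:-1], holdings[1:])]
avg_turnover = float(np.mean(turnover))
```

With purely random scores, as here, turnover is close to its maximum; persistent model scores are what bring it down to the levels reported above.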

The top decile (Decile 10) portfolio demonstrates a Sharpe Ratio of approximately 0.8 with an annualized return of 17.8%—impressive performance that validates our approach. As shown in the cumulative return chart, performance remained consistent across different market regimes, including the 2008 financial crisis, the 2020 pandemic crash, and subsequent recovery periods.
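For reference, the two headline statistics can be computed from a series of monthly returns as sketched below. This assumes geometric compounding for the annualized return and a √12 scaling of the monthly Sharpe ratio with a zero risk-free rate by default; the article does not state which conventions were used.

```python
import math
import statistics

def annualized_return(monthly_returns):
    """Geometric annualized return from a sequence of monthly simple returns."""
    growth = math.prod(1.0 + r for r in monthly_returns)
    return growth ** (12.0 / len(monthly_returns)) - 1.0

def annualized_sharpe(monthly_returns, monthly_rf=0.0):
    """Mean monthly excess return over its sample standard deviation, scaled by sqrt(12)."""
    excess = [r - monthly_rf for r in monthly_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(12.0)

# Hypothetical year of monthly portfolio returns
monthly = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.015, -0.005, 0.02, 0.01, -0.01]
ann_ret = annualized_return(monthly)
sharpe = annualized_sharpe(monthly)
```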

Risk-adjusted performance increases across the decile portfolios, indicating that the selected factors appear to provide real explanatory power.

Looking at the feature importance chart, the most significant features include:

  • Technical features:
    • Volatility metrics dominate with “Volatility_ZScore” being the most important feature by a wide margin
    • “Mu_1m_ZScore” (1-month return z-score)
    • “relPriceAverage_3m_ZScore” (3-month relative price average)
    • “Convexity_3m_ZScore” (price path convexity over 3 months)
  • Fundamental features:
    • “PB_RMW_60m” (Price-to-Book adjusted for profitability over 60 months)
  • Interaction terms:
    • “CAGR_60m_ROCE” (compound annual growth rate combined with return on capital employed)
    • “ProfitFactor_60m_CAGR_60m” (interaction between profit factor and growth)
  • Cross-sectional features:
    • “CalmarRatio_6m_ZScore” (risk-adjusted return metric)
    • “Volatility_GICSSectorPctRank” (sector-normalized volatility percentile rank)

Our model was trained on data from 1995-1999 and validated on an independent holdout set before final out-of-sample testing from 2000-2025, in which the model is updated every 60 months.

This rigorous approach to validation ensures that our performance metrics reflect realistic expectations rather than in-sample overfitting.
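A retraining schedule of this kind can be written down explicitly. The sketch below generates train and test windows in calendar years, assuming an expanding training window that always starts in 1995; whether the actual study used an expanding or a rolling window is not stated, and the function name is hypothetical.

```python
def walk_forward_windows(first_train_year, first_test_year, last_test_year, refit_years=5):
    """Return (train_start, train_end, test_start, test_end) tuples, all years inclusive."""
    windows = []
    test_start = first_test_year
    while test_start <= last_test_year:
        test_end = min(test_start + refit_years - 1, last_test_year)
        # Training data always ends the year before the test block begins
        windows.append((first_train_year, test_start - 1, test_start, test_end))
        test_start = test_end + 1
    return windows

schedule = walk_forward_windows(1995, 2000, 2025)
for train_start, train_end, test_start, test_end in schedule:
    print(f"train {train_start}-{train_end} -> test {test_start}-{test_end}")
```

Each model is fitted only on years that precede its test block, which is the property that keeps the reported results out-of-sample.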

This diverse feature set confirms that durable alpha generation requires the integration of multiple orthogonal signals unified under a common ranking framework—precisely what our Python Equities Entity Store facilitates. The dominance of volatility-related features suggests that risk management is a critical component of the model’s predictive power.

Package Structure and Implementation

The Python EES is organized as follows:

equityentity/
├── __init__.py
├── features/
│   ├── loader.py        # Load features from HDF5
│   ├── preprocess.py    # Standardization, neutralization, filtering
│   └── labels.py        # Target generation (e.g., IR@1m)
├── models/
│   └── ranker.py        # LambdaMART, LightGBM ranking models
├── portfolio/
│   └── constructor.py   # Create portfolios from rank scores
├── backtest/
│   └── evaluator.py     # Sharpe, IR, turnover, hit rate
└── entity/              # Optional metadata (JSON to dataclass)
    ├── equity.py
    ├── etf.py
    └── index.py

Code Example: Ranking Model Training

Here’s how the ranking model module works, leveraging LightGBM’s LambdaMART implementation:

import lightgbm as lgb


class RankModel:
    """Thin wrapper around LightGBM's LambdaMART (lambdarank) objective."""

    def __init__(self, max_depth=4, num_leaves=32, learning_rate=0.1, n_estimators=500,
                 use_gpu=True, feature_names=None):

        self.params = {
            "objective": "lambdarank",
            "max_depth": max_depth,
            "num_leaves": num_leaves,
            "learning_rate": learning_rate,
            "n_estimators": n_estimators,
            "device": "gpu" if use_gpu else "cpu",
            "verbose": -1,
            "max_position": 50
        }

        self.model = None
        self.feature_names = feature_names if feature_names is not None else []

    def train(self, features, labels):

        # Reshape features and labels for LambdaMART format:
        # one row per (month, stock), with each month forming one query group
        n_months, n_stocks, n_features = features.shape
        X = features.reshape(-1, n_features)
        y = labels.reshape(-1)
        group = [n_stocks] * n_months
        # Fall back to LightGBM's automatic feature names when none were supplied
        feature_name = self.feature_names or "auto"
        train_data = lgb.Dataset(X, label=y, group=group, feature_name=feature_name)
        self.model = lgb.train(self.params, train_data)
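One practical detail when preparing data for a lambdarank objective is that LightGBM expects integer relevance labels, so a continuous target such as IR_1m has to be converted into grades first. The sketch below shows one way this might be done, by ranking within each month and binning into five grades; the number of grades and the helper name are assumptions, and the snippet uses NumPy only.

```python
import numpy as np

def to_relevance_grades(labels, n_grades=5):
    """Map continuous labels of shape (n_months, n_stocks) to integer grades 0..n_grades-1 within each month."""
    ranks = labels.argsort(axis=1).argsort(axis=1)   # 0 = lowest label in that month
    n_stocks = labels.shape[1]
    return (ranks * n_grades // n_stocks).astype(np.int32)

rng = np.random.default_rng(3)
features = rng.normal(size=(6, 20, 4))   # hypothetical (months, stocks, features)
labels = rng.normal(size=(6, 20))        # hypothetical continuous IR_1m values

grades = to_relevance_grades(labels)
X = features.reshape(-1, features.shape[2])        # one row per (month, stock)
y = grades.reshape(-1)
group = [features.shape[1]] * features.shape[0]    # one query group per month
```

This sketch does not handle missing labels; in practice stocks without a valid label for a given month would need to be dropped and the group sizes adjusted accordingly.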

Portfolio Construction

The system seamlessly transitions from predictive scores to portfolio allocation with built-in transaction cost modeling:

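# Portfolio construction with transaction cost awareness
# (excerpt from the portfolio class; assumes `import numpy as np`)

def construct_portfolios(self):

    n_months, n_stocks = self.pred_scores.shape
    n_decile = max(1, n_stocks // 10)   # assumed decile size; not shown in the original excerpt
    symbols_t = self.symbols            # assumed: array of tickers aligned with the score columns

    for t in range(n_months):

        # Get predictions and forward returns
        scores = self.pred_scores[t]
        returns_t = self.returns[min(t + 1, n_months - 1)]

        # Select top and bottom deciles
        sorted_idx = np.argsort(scores)
        long_idx = sorted_idx[-n_decile:]
        short_idx = sorted_idx[:n_decile]

        # Equal-weighted raw return of each leg (assumed weighting scheme)
        long_raw = returns_t[long_idx].mean()
        short_raw = returns_t[short_idx].mean()

        # Calculate transaction costs from portfolio turnover
        curr_long_symbols = set(symbols_t[long_idx])
        curr_short_symbols = set(symbols_t[short_idx])
        long_trades = len(curr_long_symbols.symmetric_difference(self.prev_long_symbols))
        short_trades = len(curr_short_symbols.symmetric_difference(self.prev_short_symbols))

        tx_cost_long = self.tx_cost * long_trades
        tx_cost_short = self.tx_cost * short_trades

        # Calculate net returns with costs
        long_ret = long_raw - tx_cost_long
        short_ret = -short_raw - tx_cost_short - self.loan_cost

        # Remember this month's holdings for the next turnover calculation
        self.prev_long_symbols = curr_long_symbols
        self.prev_short_symbols = curr_short_symbols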

Complete Workflow Example

The package is designed for intuitive workflows with minimal boilerplate. Here’s how simple it is to get started:

from equityentity.features import FeatureLoader, LabelGenerator
from equityentity.models import LambdaMARTRanker
from equityentity.portfolio import DecilePortfolioConstructor

# Load features with point-in-time awareness
loader = FeatureLoader(hdf5_path='equity_features.h5')
features = loader.load_features(start_date='2010-01-01', end_date='2025-01-01')

# Generate IR_1m labels
label_gen = LabelGenerator(benchmark='sector_returns')
labels = label_gen.create_information_ratio(forward_period='1M')

# Train a ranking model
ranker = LambdaMARTRanker(n_estimators=500, learning_rate=0.05)
ranker.fit(features, labels)

# Create portfolios from predictions
constructor = DecilePortfolioConstructor(rebalance_freq='M')
portfolios = constructor.create_from_scores(ranker.predict(features))

# Evaluate performance
performance = portfolios['decile_10'].evaluate()
print(f"Sharpe Ratio: {performance['sharpe_ratio']:.2f}")
print(f"Information Ratio: {performance['information_ratio']:.2f}")
print(f"Annualized Return: {performance['annualized_return']*100:.1f}%")

The package supports both configuration file-based workflows for production use and interactive Jupyter notebook exploration. Output formats include pandas DataFrames, JSON for web applications, and HDF5 for efficient storage of results.

Why Start with Cross-Sectional ML?

While Mathematica’s EntityStore emphasized symbolic navigation and knowledge representation, Python excels at algorithmic learning and numerical computation at scale. Beginning with the HDF5 dataset enables immediate application by quantitative researchers, ML specialists, and strategy developers interested in:

  • Exploring sophisticated feature engineering across time horizons and market sectors
  • Building powerful predictive ranking models with state-of-the-art ML techniques
  • Constructing long-short portfolios with dynamic scoring mechanisms
  • Developing robust factor models and alpha signals

And because we’ve already created metadata-rich JSON files for each entity, we can progressively integrate the symbolic structure—creating a hybrid system where machine learning capabilities complement knowledge representation.

Increasingly, quantitative researchers are integrating tools like LangChain, GPT-based agents, and autonomous research pipelines to automate idea generation, feature testing, and code execution. The structured design of the Python Equities Entity Store—with its modularity, metadata integration, and time-consistent features—makes it ideally suited for use as a foundation in LLM-driven quantitative workflows.

Competitive Pricing and Value

While alternative platforms in this space typically come with significant cost barriers, we’ve positioned the Python Equities Entity Store to be accessible to firms of all sizes.

While open-source platforms like QuantConnect, Zipline, and Backtrader provide accessible backtesting environments, they often lack the scale, granularity, and point-in-time feature control required for advanced cross-sectional ML strategies. The Python Equities Entity Store fills this gap—offering industrial-strength data infrastructure, lookahead protection, and extensibility without the steep cost of commercial platforms.

Unlike these competitors that often require multiple subscriptions to achieve similar functionality, Python Equities Entity Store provides an integrated solution at a fraction of the cost. This pricing strategy reflects our commitment to democratizing access to institutional-grade quantitative tools.

Next Steps

We’re excited to announce our roadmap for the Python Equities Entity Store:

  1. July 2025 Release: The official launch of our HDF5-compatible package, complete with:
    • Comprehensive documentation and API reference
    • Jupyter notebooks demonstrating key workflows from data loading to portfolio construction
    • Example strategies showcasing the system’s capabilities across different market regimes
    • Performance benchmarks and baseline models with full backtest history
    • Python package available via PyPI (pip install equityentity)
    • Docker container with pre-loaded example datasets
  2. Q3 2025: Integration of the symbolic entity framework, allowing seamless navigation between quantitative features and qualitative metadata
  3. Q4 2025: Extension to additional asset classes and alternative data sources, expanding the system’s analytical scope
  4. Early 2026: Launch of a cloud-based computational environment for collaboration and strategy sharing

Accessing the Python Equities Entity Store

As a special promotion, existing users of the current Mathematica Equities Entity Store Enterprise Edition will be given free access to the Python version on launch.

So, if you sign up now for the Enterprise Edition you will receive access to both the existing Mathematica version and the new Python version as soon as it is released.

After the launch of the Python Equities Entity Store, each product will be charged individually. So this limited-time offer represents a 50% discount.

See our website for pricing details: https://store.equityanalytics.store/equities-entity-store

Conclusion

By prioritizing scalable feature datasets and sophisticated ranking models, the Python version of the Equities Entity Store positions itself as an indispensable tool for modern equity research. It bridges the gap between raw data and actionable insights, combining the power of machine learning with the structure of knowledge representation.

The Python Equities Entity Store represents a significant step forward in quantitative finance tooling—enabling faster iteration, more robust models, and ultimately, better investment decisions.

Matlab vs. Python

In a previous article I made a detailed comparison of Mathematica and Python and tried to identify areas where the former excels. Despite the many advantages of the Python technology stack, I was able to pinpoint a few areas in which I think Mathematica holds the upper hand. Whether those are sufficient to warrant the investment of time and money required to master the Wolfram Language is another matter, which users must decide for themselves.

In this comparison between Matlab and Python I won’t reiterate the strengths of Python that make it the programming language of choice for so many developers. Let me instead focus on some of the key aspects of Matlab where I think the MathWorks product outshines its rival.

Matlab is designed for numerical computing, while Python is a general-purpose programming language that has become a major tool for scientific computing through libraries like NumPy, SciPy, and Matplotlib.

The key advantages of Matlab relative to Python, as I see them, are as follows:

Integrated Development Environment (IDE):

Matlab comes with a feature-rich IDE that is tailored for mathematical and engineering workflows. This includes tools for debugging, data visualization, GUI creation, and managing workspace variables. The Matlab IDE is specifically designed to streamline the development of mathematical and engineering applications.

Advanced Toolboxes:

Matlab offers a wide range of specialized toolboxes for different applications, including signal processing, control systems, neural networks, image processing, and many others. These toolboxes are professionally developed, rigorously tested, and regularly updated, providing a comprehensive suite of algorithms and functions for specific domains. With its vast ecosystem of scientific libraries, Python has caught up with Matlab in recent years, and even overtaken it in some areas, but Matlab’s toolboxes are battle-tested technologies that are used by millions of users in state-of-the-art applications.

Simulink:

Matlab provides Simulink, a platform for Model-Based Design for dynamic and embedded systems. Simulink is a graphical programming environment for modeling, simulating, and analyzing multidomain dynamical systems. This is particularly useful in engineering applications where system modeling and simulation are crucial.

Built-in Support for Matrix Operations:

Matlab (Matrix Laboratory) has inherent support for matrix operations and linear algebra, making it highly efficient for tasks that involve complex mathematical computations.
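For comparison, the Python equivalents of these operations come from the NumPy library rather than the core language. The short sketch below shows matrix multiplication, a linear solve and an eigenvalue computation, which in Matlab would be written as A*A', A\b and eig(A); the matrix values are arbitrary.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

product = A @ A.T                     # matrix multiplication
x = np.linalg.solve(A, b)             # solves A x = b without forming the inverse
eigenvalues = np.linalg.eigvals(A)    # eigenvalues of A
```

The functionality is comparable, but in Python it arrives through an import and explicit function calls rather than built-in operators.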

Performance:

Matlab is optimized for operations involving matrices and vectors, which are central to engineering and scientific computations. For certain numerical tasks, Matlab’s performance is superior due to its highly optimized code and ability to handle parallel computing and GPU acceleration effectively.

Matlab’s speed has further accelerated over the last decade due to just-in-time compilation. This feature automatically compiles Matlab’s interpreted code into machine code at runtime, which speeds up execution, especially in loops and computationally intensive tasks. The JIT compilation process is entirely transparent to the user, requiring no modifications to the code or the development process.

Python itself is an interpreted language and has not traditionally included JIT compilation in its standard implementation (CPython); an experimental JIT was introduced in CPython 3.13 but is disabled by default. JIT compilation can, however, be introduced through third-party libraries such as Numba or alternative Python implementations such as PyPy.

Testing and Debugging:

Both Matlab and Python are equipped with robust testing and debugging tools that cater to their specific user bases. Matlab’s tools are tightly integrated into its IDE and are particularly tailored for numerical computing and engineering tasks. I would regard them as the industry standard in terms of features, ease of use and helpfulness. In contrast, Python’s testing and debugging ecosystem is more diverse, with multiple options available for different tasks, including third-party libraries that extend its capabilities.

Documentation and Support:

Matlab’s documentation is extensive, well-organized, and includes examples for a wide range of functions and toolboxes. Additionally, MathWorks provides excellent support services, including technical support and community forums, which can be particularly valuable for complex or specialized projects.

Conclusion

While Python has gained significant popularity in scientific computing, data science, and machine learning due to its open-source nature and the vast ecosystem of libraries, Matlab holds strong advantages in numerical computing, engineering applications, and when integrated solutions with robust support and documentation are required.

However, Python offers greater flexibility, scalability and has grown significantly in scientific computing. MATLAB historically had limitations with very large datasets, but recent releases have added features to improve performance with big data. Still, Python likely retains an advantage for extreme scales. The choice depends on the specific use case – for small-scale numerical computing and modeling MATLAB provides an integrated optimized environment while Python excels in general-purpose programming and very large-scale data intensive applications. However, both continue to evolve impressive capabilities so the lines are blurring. Ultimately data scientists and engineers are best served by being proficient in both languages.

Python vs. Wolfram Language

As an avid user of both Python and Wolfram Language for technical computing, I’m often asked how they compare. Python’s strengths as an open-source language are clear:

  • Ubiquity – With millions of users, Python has become ubiquitous across fields like data science, ML engineering, web development, and scientific research. This massive adoption fuels continuous enhancement of its tools.
  • Comprehensive capabilities – Python’s expansive ecosystem of 200,000+ libraries spans everything from numerical computing to web frameworks to industrial automation. It is a versatile, widely-supported language for building end-to-end applications.
  • Approachability – Python’s straightforward syntax, multitude of online resources, and abundance of machine learning libraries like TensorFlow and PyTorch make it highly accessible for new programmers and non-CS domain experts alike.
  • Interoperability – Python integrates smoothly with everything from SQL and NoSQL databases to enterprise IT environments and microcontrollers like Raspberry Pi. This flexibility enables diverse production deployments.


In summary, Python offers benefits in ubiquity, breadth, approachability, and seamless interoperability with external systems. Set alongside the strengths of the Wolfram Language, these benefits show the value of both general-purpose and domain-specific languages for tackling modern analytics and engineering challenges.

However, while Python is a versatile, open-source language popular among developers, the Wolfram Language offers some unique advantages:

Powerful Symbolic Capabilities

One of the most powerful aspects of the Wolfram Language is its unparalleled symbolic manipulation abilities for mathematical computation. Operations like symbolic integration, solving equations analytically, theorem proving, model simplification and more are built deeply into the language in a way no other programming language matches. Python can conduct numeric computation and data analysis well, but does not have this domain of symbolic capabilities natively.

For any usage involving abstract mathematical development, derivation of analytical results, or formal proofs, the symbolic nature of the Wolfram Language is a major differentiator.


Wolfram Notebooks

Wolfram Notebooks offer notable advantages over Jupyter notebooks in Python:

  • More visual appeal – The Wolfram notebooks produce beautifully typeset output and publication-ready visualizations by default, whereas Jupyter’s output is more basic.
  • Greater configurability – Wolfram’s notebooks allow extensive styling, templating, and customization of content for different applications. Jupyter also enables some configuration, but not to the same degree.
  • Tighter integration – The Wolfram notebooks leverage the language’s underlying functions and capabilities more fluidly since it’s one integrated environment. Jupyter interfaces well with Python but there is still some separation.
  • Interactivity – Wolfram notebooks support advanced interactivity through Manipulate/Animate and instant visual output.

Overall, while Jupyter notebooks are hugely popular among Python developers and enable great functionality, Wolfram’s notebook solution stands out as more robust, customizable, and visually polished. The tight integration with the Wolfram Language and computational capabilities augments interactive analysis in a way Jupyter can’t match.

Integrated Knowledge and Data

The Wolfram Language stands out in providing an “integrated knowledge base” that spans from sophisticated algorithms to real-world data across domains. This includes vast curated datasets on topics from architecture to chemistry to finance that can readily feed models and analyses without additional wrangling.

Additionally, the entity store concept allows users to author their own object-based, customizable data repositories. Python’s classes are focused on methods rather than data and while Python offers strong libraries for storing and accessing data, Wolfram facilitates more zero-friction application of real-world knowledge and entity-oriented data storage out-of-the-box. For minimizing time manipulating data or searching for reference algorithms before modeling, Wolfram Language excels.

The entity store in particular enables a very natural object/entity-based programming style that can integrate smoothly with Wolfram’s class system and its underlying symbolic capabilities. This distinctive data representation system is a key strength (for example, see the Equities Entity Store).

Interactivity and Prototyping

The Wolfram Language excels in hands-on analysis and rapid iteration thanks to its line-by-line execution and built-in Manipulate/Animate functions for customizable graphics, animations and interactive simulations. Python does allow some interactivity in Jupyter notebooks, but does not match Wolfram’s capabilities for creating interactive visualizations on-the-fly. This makes Wolfram Language uniquely well-suited for highly iterative, prototyping tasks that involve visual output. If ease of exploration and fluid development is a priority, the Wolfram Language has clear strengths.

Seamless Parallelization

The Wolfram Language has seamless built-in parallelization capabilities that allow code to efficiently utilize multi-core systems without the developer needing to directly manage threads or processes. Python can achieve parallelism through libraries, but the developer bears responsibility for managing dependencies and avoiding conflicts. Similarly, the Wolfram Language directly interfaces with Nvidia GPUs out-of-the-box for high performance numerical code with minimal extra effort. Thus, for users focused on computational speedup, Wolfram simplifies parallelization and GPU integration in very useful ways.

Python libraries like TensorFlow and PyTorch do hide GPU complexities well for deep learning. But in general, achieving parallel execution in Python places a greater burden on the developer. Wolfram’s approach dramatically lowers the barriers to leveraging multiple cores and GPU power for everyday computations.

Sophisticated Visualization

Creating publication-quality, customized visualizations requires just a few lines of code in the Wolfram Language, thanks to the built-in graphics capabilities. While Python offers powerful visualization through add-on libraries like Matplotlib, Seaborn, Bokeh, and Plotly, Wolfram’s out-of-the-box solutions may provide greater ease of use. However, from low-level control to interactive web plots, Python’s visualization options are quite extensive despite requiring more setup. Ultimately, for rapid high-level plotting, Wolfram Language has advantageous default capabilities. But Python gives more flexibility and customization options through its ecosystem of graphic libraries.

In summary, while Python offers flexibility and a large user base – advantages in its own right – the Wolfram Language dramatically reduces lines of code and development time. By curating real-world data, algorithms, and visualization in one coherent language and platform, it streamlines and accelerates quantitative work for scientists, analysts, economists and more.

If you do significant data analysis or modeling, I encourage you to try the Wolfram Language and see the difference yourself. It’s been a gamechanger for my productivity.

Data Science | Handling Big Data

Handling Large Files in CSV format with NumPy and Pandas

One of the major challenges that users face when trying to do data science is how to handle big data. Leaving aside the important topic of database connectivity/functionality and the handling of data too large to fit in memory, my concern here is with the issue of how to handle large data files, which are often in csv format, but which are not too large to fit into available memory.

It is well known that, due to their generality, Mathematica’s Import and Export functions are horribly slow when handling large csv files. For example, writing out a list of 10 million 64-bit reals takes almost 5 minutes, and reading is also unacceptably slow.

Performance results like these create the impression that Mathematica is suitable for handling only “toy” problems, rather than the kind of large and complex data challenges faced by data scientists in the real world.

Sure, you can speed this up with ReadLine, but not by much, after doing all the string processing. And while the mx binary file format speeds up data handling enormously, it doesn’t address the issue of how to get the data into the requisite file format, other than via the WL DumpSave function – in other words, the data already has to be in a Mathematica notebook in order to write an mx file.

With purely numerical data, one way to address this is by using non-proprietary binary file formats. For example, in Python we create a NumPy array and use the tofile() method to output the data in real64 binary format, in less than 2 seconds.

Then in Mathematica the read process is equally fast when processing a file of numerical data in binary format, around 50x faster than the time taken to process the same file in csv format.

The procedure is just as fast in the reverse direction, with binary data exports from Mathematica taking a fraction of the time required to process the same data in csv format (around 200x faster!).

And the data can be read back very quickly in Python using the NumPy fromfile method.
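The Python side of this round trip can be sketched as follows. The array is smaller than the 10 million values discussed in the text to keep the example quick, the file location is a temporary directory chosen for illustration, and no timings are quoted because they depend on the machine.

```python
import os
import tempfile
import time

import numpy as np

n = 1_000_000                                  # reduced from 10 million for a quick run
data = np.random.default_rng(4).random(n)      # float64 ("real64") values

path = os.path.join(tempfile.mkdtemp(), "reals.bin")

start = time.perf_counter()
data.tofile(path)                              # raw bytes, no header, native byte order
write_seconds = time.perf_counter() - start

start = time.perf_counter()
restored = np.fromfile(path, dtype=np.float64) # the reader must supply the dtype
read_seconds = time.perf_counter() - start

file_bytes = os.path.getsize(path)
print(f"wrote {file_bytes} bytes in {write_seconds:.3f}s, read back in {read_seconds:.3f}s")
```

Because the file carries no header, the data type and array shape have to be agreed between writer and reader; on the Mathematica side a file like this would typically be read with BinaryReadList using the "Real64" type.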

This procedure is robust enough to accommodate missing data. For instance, let’s replace some of the values in our data array with np.nan values and export the file once again in binary format.
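A minimal sketch of the missing-data case in Python is shown below; the array size and the positions of the missing values are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

data = np.random.default_rng(5).random(1000)
missing_idx = np.array([3, 250, 999])          # hypothetical positions of missing values
data[missing_idx] = np.nan

path = os.path.join(tempfile.mkdtemp(), "with_nan.bin")
data.tofile(path)                              # NaNs are written as ordinary 8-byte floats
restored = np.fromfile(path, dtype=np.float64)

nan_positions = np.flatnonzero(np.isnan(restored))
```

The NaN entries occupy the same eight bytes as any other value, so the file size and the read speed are unaffected.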

Reading the binary file into Mathematica, we find no reduction in speed. The np.nan values are stored as ordinary IEEE 754 floating-point NaN values, which are replaced by the value Indeterminate in the imported Mathematica array.

So, for purely numerical data we have a fast and reliable procedure for transferring data between Python, R and Mathematica using binary format. This means that we can load very large csv files in Python, do some pre-processing in pandas and export the massaged data in binary format for further analysis in Mathematica, if required.

More Complex Data Structures: the HDF5 Format

A major step in the right direction has been achieved through the significant effort that Wolfram Research has put into implementing the HDF5 binary file format standard in the Wolfram Language. This serves two purposes: firstly, it can speed up the storage and retrieval of large datasets by orders of magnitude (depending on the data type); secondly, unlike Wolfram’s proprietary mx file format, HDF5 is an open-source format that can store large, complex and hierarchical datasets that are accessible via Python, R and Matlab, as well as other languages/platforms, including Mathematica. So, working with the same dataset as before, but using HDF5 format, we get a speed-up of around 500x on the file write and around 270x on the file read.

Another major benefit of working in binary format is the enormous saving in disk storage, compared to csv.

So it becomes feasible to envisage a workflow in which some pre-processing of a very large dataset in csv format takes place initially in e.g. Python pandas, the results of which are exported to an HDF5 or binary format file for further processing in Mathematica.
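The Python half of such a workflow might look like the sketch below, which reads a csv file in chunks with pandas, applies a simple filter, and appends the result to a raw float64 binary file. The column names, the filter and the chunk size are all hypothetical, and a small stand-in csv is generated so that the example is self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "prices.csv")
bin_path = os.path.join(workdir, "prices.bin")

# Build a small stand-in for the large csv file (hypothetical columns)
rng = np.random.default_rng(6)
pd.DataFrame({
    "price": rng.random(10_000) * 100.0,
    "volume": rng.integers(1, 1000, 10_000),
}).to_csv(csv_path, index=False)

rows_written = 0
with open(bin_path, "wb") as out:
    # Reading in chunks keeps memory use bounded for genuinely large files
    for chunk in pd.read_csv(csv_path, chunksize=2_500):
        chunk = chunk[chunk["volume"] > 10]                         # example pre-processing step
        block = chunk[["price", "volume"]].to_numpy(dtype=np.float64)
        block.tofile(out)                                           # appended in row-major order
        rows_written += len(block)

# Reading back requires knowing the dtype and the number of columns
restored = np.fromfile(bin_path, dtype=np.float64).reshape(-1, 2)
```

A raw binary file stores only the numbers, so the column layout must be recorded separately; a self-describing format such as HDF5 avoids that bookkeeping at the cost of an extra library dependency.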

This advance does a great deal to address some of the major concerns about using Mathematica for large data science projects. And I am not sure that users are necessarily aware of its significance, given all the hoopla over more glamorous features that tend to get all the attention in new version releases.