Finance Data Copilot

🤖 Automated Quantitative Trading & Factors Extraction from Financial Reports

📖 Background

Research reports are treasure troves of insights, often unveiling potential factors that can drive successful quantitative trading strategies. Yet, with the sheer volume of reports available, extracting the most valuable insights efficiently becomes a daunting task.

Furthermore, rather than hastily replicating factors from a report, it’s essential to delve into the underlying logic of their construction. Does the factor capture the essential market dynamics? How unique is it compared to the factors already in your library?

This calls for a systematic framework that manages the whole process, from reading reports to evaluating the factors they propose. This is where the Finance Data Copilot steps in.

🌟 Introduction

In this scenario, RDAgent demonstrates the process of extracting factors from financial research reports, implementing these factors, and analyzing their performance through Qlib backtesting. This process continually expands and refines the factor library.

Here is an outline of the steps:

Step 1 : Hypothesis Generation 🔍

  • Generate and propose initial hypotheses based on insights from financial reports with thorough reasoning and financial justification.

Step 2 : Factor Creation ✨

  • Based on the hypothesis and the financial reports, split the work into separate factor tasks.

  • Each task involves developing, defining, and implementing a new financial factor, including its name, description, formulation, and variables.
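As a minimal sketch of what one such task might carry, the four attributes listed above can be grouped into a small record. The class name FactorSpec and the example momentum factor are hypothetical and chosen only for illustration; they are not RDAgent classes or factors taken from a report.

```python
from dataclasses import dataclass, field


@dataclass
class FactorSpec:
    """Hypothetical container for one factor task (not an RDAgent class)."""

    name: str
    description: str
    formulation: str  # formula as text, e.g. LaTeX
    variables: dict[str, str] = field(default_factory=dict)  # symbol -> meaning


# Illustrative example only: a 20-day price momentum factor.
example = FactorSpec(
    name="Momentum_20D",
    description="Return over the previous 20 trading days.",
    formulation=r"\frac{close_t}{close_{t-20}} - 1",
    variables={
        "close_t": "closing price on day t",
        "close_{t-20}": "closing price 20 trading days earlier",
    },
)
```

Keeping the formulation and variable definitions next to the name makes it easier to compare a new factor with those already in the library before any code is written.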

Step 3 : Factor Implementation 👨‍💻

  • Implement the factor code based on the description, evolving it as a developer would.

  • Quantitatively validate the newly created factors.

Step 4 : Backtesting with Qlib 📉

  • Integrate the full dataset into the factor implementation code and prepare the factor library.

  • Run a backtest in Qlib with LGBModel, using the Alpha158 factors plus the newly developed factors, to evaluate the new factors’ effectiveness and performance.

  Dataset   Model      Factors         Data Split
  CSI300    LGBModel   Alpha158 Plus   Train: 2008-01-01 to 2014-12-31
                                       Valid: 2015-01-01 to 2016-12-31
                                       Test:  2017-01-01 to 2020-08-01
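The data split above can be written down as three inclusive date ranges. The sketch below is only an illustration of how the boundaries partition the timeline; the names SEGMENTS and segment_of are hypothetical and this is not the Qlib configuration itself.

```python
from datetime import date

# Inclusive date ranges taken from the data split above.
SEGMENTS = {
    "train": (date(2008, 1, 1), date(2014, 12, 31)),
    "valid": (date(2015, 1, 1), date(2016, 12, 31)),
    "test": (date(2017, 1, 1), date(2020, 8, 1)),
}


def segment_of(day):
    """Return the name of the segment containing `day`, or None if outside all ranges."""
    for name, (start, end) in SEGMENTS.items():
        if start <= day <= end:
            return name
    return None
```

The ranges do not overlap, so the model is tuned on 2015 to 2016 data and evaluated on data from 2017 onward that it never saw during training.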

Step 5 : Feedback Analysis 🔍

  • Analyze backtest results to assess performance.

  • Incorporate feedback to refine hypotheses and improve the model.

Step 6 : Hypothesis Refinement ♻️

  • Refine hypotheses based on feedback from backtesting.

  • Repeat the process to continuously improve the model.
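The six steps form a loop in which each round's backtest result feeds the next hypothesis. The sketch below shows only that control flow, using toy stand-in functions. Every name in it is hypothetical, and it makes no claim about how RDAgent structures its own loop.

```python
def run_factor_loop(propose, implement, backtest, n_rounds):
    """Run n_rounds of propose -> implement -> backtest, feeding results back."""
    history = []  # (hypothesis, result) pairs from earlier rounds
    for _ in range(n_rounds):
        hypothesis = propose(history)         # Steps 1 and 6: generate or refine
        factors = implement(hypothesis)       # Steps 2 and 3: create and code factors
        result = backtest(factors)            # Step 4: evaluate the factors
        history.append((hypothesis, result))  # Step 5: keep feedback for next round
    return history


# Toy stand-ins so the sketch runs; real components would call an LLM and Qlib.
def toy_propose(history):
    return f"hypothesis_{len(history) + 1}"


def toy_implement(hypothesis):
    return [f"{hypothesis}_factor"]


def toy_backtest(factors):
    return {"n_factors": len(factors)}


history = run_factor_loop(toy_propose, toy_implement, toy_backtest, n_rounds=3)
```

Passing the full history to the proposal step is what lets later hypotheses build on earlier feedback instead of starting from scratch.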

⚡ Quick Start

Please refer to the installation section of Installation and Configuration to set up your system dependencies.

You can try our demo by following these steps:

  • 🐍 Create a Conda Environment

    • Create a new conda environment with Python (3.10 and 3.11 are well tested in our CI):

      conda create -n rdagent python=3.10
      
    • Activate the environment:

      conda activate rdagent
      
  • 📦 Install the RDAgent

    • You can install the RDAgent package from PyPI:

      pip install rdagent
      
  • 🚀 Run the Application

    • Download the financial reports you wish to extract factors from and store them in your preferred folder.

    • Specifically, you can follow this example, or use your own method:

      wget https://github.com/SunsetWolf/rdagent_resource/releases/download/reports/all_reports.zip
      unzip all_reports.zip -d git_ignore_folder/reports
      
    • Run the application with the following command:

      rdagent fin_factor_report --report-folder=git_ignore_folder/reports
      
    • Alternatively, you can list the report paths in the JSON file that report_result_json_file_path points to (git_ignore_folder/report_list.json by default). The format should be:

      [
          "git_ignore_folder/report/fin_report1.pdf",
          "git_ignore_folder/report/fin_report2.pdf",
          "git_ignore_folder/report/fin_report3.pdf"
      ]
      
    • Then, run the application using the following command:

      rdagent fin_factor_report
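If you use the JSON list option described above, the file can be generated rather than written by hand. The helper below is a small sketch, not part of RDAgent: write_report_list is a hypothetical name, and it assumes that the reports are PDF files and that a flat JSON array of path strings is what the application expects, as in the format shown above.

```python
import json
from pathlib import Path


def write_report_list(report_folder, output_file, limit=None):
    """Write the paths of all PDFs under report_folder to output_file as a JSON list."""
    paths = sorted(p.as_posix() for p in Path(report_folder).rglob("*.pdf"))
    if limit is not None:
        paths = paths[:limit]  # optionally keep only the first `limit` reports
    output = Path(output_file)
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(paths, indent=4), encoding="utf-8")
    return paths


# Example call, using the folder from the download step and the default list path:
# write_report_list("git_ignore_folder/reports", "git_ignore_folder/report_list.json")
```

Sorting the paths keeps the file stable between runs, which makes it easier to see which reports were added or removed.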
      

🛠️ Usage of modules

  • Env Config

The following environment variables can be set in the .env file to customize the application’s behavior:

pydantic settings rdagent.app.qlib_rd_loop.conf.FactorFromReportPropSetting

Bases: FactorBasePropSetting

JSON schema:
{
   "title": "FactorFromReportPropSetting",
   "type": "object",
   "properties": {
      "scen": {
         "default": "rdagent.scenarios.qlib.experiment.factor_from_report_experiment.QlibFactorFromReportScenario",
         "title": "Scen",
         "type": "string"
      },
      "knowledge_base": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Knowledge Base"
      },
      "knowledge_base_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Knowledge Base Path"
      },
      "hypothesis_gen": {
         "default": "rdagent.scenarios.qlib.proposal.factor_proposal.QlibFactorHypothesisGen",
         "title": "Hypothesis Gen",
         "type": "string"
      },
      "interactor": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Interactor"
      },
      "hypothesis2experiment": {
         "default": "rdagent.scenarios.qlib.proposal.factor_proposal.QlibFactorHypothesis2Experiment",
         "title": "Hypothesis2Experiment",
         "type": "string"
      },
      "coder": {
         "default": "rdagent.scenarios.qlib.developer.factor_coder.QlibFactorCoSTEER",
         "title": "Coder",
         "type": "string"
      },
      "runner": {
         "default": "rdagent.scenarios.qlib.developer.factor_runner.QlibFactorRunner",
         "title": "Runner",
         "type": "string"
      },
      "summarizer": {
         "default": "rdagent.scenarios.qlib.developer.feedback.QlibFactorExperiment2Feedback",
         "title": "Summarizer",
         "type": "string"
      },
      "evolving_n": {
         "default": 10,
         "title": "Evolving N",
         "type": "integer"
      },
      "train_start": {
         "default": "2008-01-01",
         "title": "Train Start",
         "type": "string"
      },
      "train_end": {
         "default": "2014-12-31",
         "title": "Train End",
         "type": "string"
      },
      "valid_start": {
         "default": "2015-01-01",
         "title": "Valid Start",
         "type": "string"
      },
      "valid_end": {
         "default": "2016-12-31",
         "title": "Valid End",
         "type": "string"
      },
      "test_start": {
         "default": "2017-01-01",
         "title": "Test Start",
         "type": "string"
      },
      "test_end": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": "2020-08-01",
         "title": "Test End"
      },
      "report_result_json_file_path": {
         "default": "git_ignore_folder/report_list.json",
         "title": "Report Result Json File Path",
         "type": "string"
      },
      "max_factors_per_exp": {
         "default": 6,
         "title": "Max Factors Per Exp",
         "type": "integer"
      },
      "report_limit": {
         "default": 20,
         "title": "Report Limit",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • env_prefix: str = QLIB_FACTOR_

  • protected_namespaces: tuple = ()

field max_factors_per_exp: int = 6

Maximum number of factors implemented per experiment

field report_limit: int = 20

Maximum number of reports to process

field report_result_json_file_path: str = 'git_ignore_folder/report_list.json'

Path to the JSON file listing research reports for factor extraction

field scen: str = 'rdagent.scenarios.qlib.experiment.factor_from_report_experiment.QlibFactorFromReportScenario'

Scenario class for Qlib Factor from Report
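The env_prefix above determines how these fields map to environment variables: the variable name is the prefix followed by the field name. The sketch below renders .env lines under that convention. The helper names are hypothetical and the override values are arbitrary examples. Writing the names in upper case is an assumption based on common practice; pydantic settings match environment variable names case-insensitively by default.

```python
ENV_PREFIX = "QLIB_FACTOR_"  # env_prefix from the Config entry above


def env_name(field_name, prefix=ENV_PREFIX):
    """Environment variable name for a settings field (upper case by convention)."""
    return prefix + field_name.upper()


def render_env(overrides, prefix=ENV_PREFIX):
    """Render a dict of field overrides as lines for a .env file."""
    return "\n".join(f"{env_name(key, prefix)}={value}" for key, value in overrides.items())


# Arbitrary example values: process fewer reports and fewer factors per experiment.
print(render_env({"report_limit": 5, "max_factors_per_exp": 3}))
```

The printed lines can be pasted into the .env file; fields that are not overridden keep the defaults listed in the schema.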

pydantic settings rdagent.components.coder.factor_coder.config.FactorCoSTEERSettings

JSON schema:
{
   "title": "FactorCoSTEERSettings",
   "type": "object",
   "properties": {
      "coder_use_cache": {
         "default": false,
         "title": "Coder Use Cache",
         "type": "boolean"
      },
      "max_loop": {
         "default": 10,
         "title": "Max Loop",
         "type": "integer"
      },
      "fail_task_trial_limit": {
         "default": 20,
         "title": "Fail Task Trial Limit",
         "type": "integer"
      },
      "v1_query_former_trace_limit": {
         "default": 3,
         "title": "V1 Query Former Trace Limit",
         "type": "integer"
      },
      "v1_query_similar_success_limit": {
         "default": 3,
         "title": "V1 Query Similar Success Limit",
         "type": "integer"
      },
      "v2_query_component_limit": {
         "default": 1,
         "title": "V2 Query Component Limit",
         "type": "integer"
      },
      "v2_query_error_limit": {
         "default": 1,
         "title": "V2 Query Error Limit",
         "type": "integer"
      },
      "v2_query_former_trace_limit": {
         "default": 3,
         "title": "V2 Query Former Trace Limit",
         "type": "integer"
      },
      "v2_add_fail_attempt_to_latest_successful_execution": {
         "default": false,
         "title": "V2 Add Fail Attempt To Latest Successful Execution",
         "type": "boolean"
      },
      "v2_error_summary": {
         "default": false,
         "title": "V2 Error Summary",
         "type": "boolean"
      },
      "v2_knowledge_sampler": {
         "default": 1.0,
         "title": "V2 Knowledge Sampler",
         "type": "number"
      },
      "knowledge_base_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Knowledge Base Path"
      },
      "new_knowledge_base_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "New Knowledge Base Path"
      },
      "enable_filelock": {
         "default": false,
         "title": "Enable Filelock",
         "type": "boolean"
      },
      "filelock_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Filelock Path"
      },
      "max_seconds_multiplier": {
         "default": 1000000,
         "title": "Max Seconds Multiplier",
         "type": "integer"
      },
      "data_folder": {
         "default": "git_ignore_folder/factor_implementation_source_data",
         "title": "Data Folder",
         "type": "string"
      },
      "data_folder_debug": {
         "default": "git_ignore_folder/factor_implementation_source_data_debug",
         "title": "Data Folder Debug",
         "type": "string"
      },
      "simple_background": {
         "default": false,
         "title": "Simple Background",
         "type": "boolean"
      },
      "file_based_execution_timeout": {
         "default": 3600,
         "title": "File Based Execution Timeout",
         "type": "integer"
      },
      "select_method": {
         "default": "random",
         "title": "Select Method",
         "type": "string"
      },
      "python_bin": {
         "default": "python",
         "title": "Python Bin",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • env_prefix: str = FACTOR_CoSTEER_

field coder_use_cache: bool = False

Indicates whether to use cache for the coder

field data_folder: str = 'git_ignore_folder/factor_implementation_source_data'

Path to the folder containing financial data (default is fundamental data in Qlib)

field data_folder_debug: str = 'git_ignore_folder/factor_implementation_source_data_debug'

Path to the folder containing partial financial data (for debugging)

field enable_filelock: bool = False
field file_based_execution_timeout: int = 3600

Timeout in seconds for each factor implementation execution

field filelock_path: str | None = None
field knowledge_base_path: str | None = None

Path to the knowledge base

field max_loop: int = 10

Maximum number of task implementation loops

field max_seconds_multiplier: int = 1000000
field new_knowledge_base_path: str | None = None

Path to the new knowledge base

field select_method: str = 'random'

Method for selecting factor implementations

field simple_background: bool = False

Whether to use simple background information for code feedback

field v2_add_fail_attempt_to_latest_successful_execution: bool = False