Large Scale Financial Filing Analysis on HPCC Systems

2020 IEEE International Conference on Big Data (Big Data), 2020

Abstract
Insights from public companies’ financial filings are necessary for securities analysts and investors to make the right investment decisions. Synthesizing salient facts from such filings is a complex language task, especially as the volume of data grows at an overwhelming pace. To reduce the human labor involved, we propose a financial filing analysis pipeline that automatically scrapes financial filings, generates embeddings of their contextual data, and performs sentiment analysis to predict the future performance of the underlying companies. The pipeline is built on the Big Data processing platform HPCC Systems so that it can process large volumes of financial filings in a scalable and timely manner. Applying word embedding and machine learning models to a large set of SEC financial filings, the pipeline processes 20 GB of XBRL files (5,000 filing documents for more than 3,500 companies) into 50,000 sentence embeddings within 5 minutes and transforms the same data into TF-IDF embeddings in about 8 minutes. To test sentiment analysis, we randomly sampled and manually labeled 5,000 SEC filings. The sentiment analysis suggested that the usefulness of stock price as a metric is specific to each industry and to the overall market, but that it is usable as long as the scope of inquiry is sufficiently narrow. Additionally, although our model was trained on only 5,000 manually labeled filings with unigrams, reaching a final loss of 0.09, the sentiment analysis results exhibited discriminatory power exceeding naïve label selection through random or biased choice, suggesting that Natural Language Processing can be effective for analyzing SEC filings.
Keywords
SEC, Sentiment Analysis, Natural Language Processing, HPCC Systems
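
The abstract mentions transforming filings into TF-IDF embeddings over unigrams. As a rough illustration of that idea only, the following is a minimal single-machine sketch in plain Python. It is not the authors' HPCC Systems implementation; the function names `tokenize` and `tfidf`, the sample texts, and the specific weighting (length-normalized term frequency times log(N / df)) are assumptions chosen for clarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into unigram tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: term frequency normalized by document length,
    multiplied by idf = log(N / df). This is one common TF-IDF variant;
    the paper does not say which variant its pipeline uses.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical filing snippets, invented for illustration only.
filings = [
    "Revenue increased and net income increased",
    "Revenue decreased due to impairment charges",
]
vectors = tfidf(filings)
```

In this sketch a term that appears in every document, such as "revenue" here, receives a weight of zero, while terms concentrated in one document receive positive weights. A distributed pipeline would compute the document frequencies across nodes rather than in a single in-memory `Counter`.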