
Using Pydantic for Classifying Categories of Stocks

Even the best APIs sometimes return messy categorization. Pydantic plus ChatGPT helped me sort it out.

As so often in data journalism, the truth is messy, and so are endpoints, APIs and libraries. I recently found a great library that gives our team a list of all stocks in indices like the S&P 500. It also offers a categorization of the industries each stock falls into. There is just one problem, which seasoned data journalists will spot right away.
Not only is each stock assigned to a multitude of industries. Even for the same stock, different spellings of, say, "Health Care" can turn up. This is tricky because it means that in a sector-based analysis you will catch a different number of stocks depending on the spelling or the exact wording.
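
To make the problem concrete, here is a minimal sketch with made-up rows (the tickers and labels are placeholders, not output from the library). It only shows how an exact-match count on a sector label misses stocks that carry a different spelling:

```python
from collections import Counter

# Made-up (ticker, sector label) rows; placeholders, not real library output.
rows = [
    ("AAA", "Health Care"),
    ("BBB", "Healthcare"),
    ("CCC", "health care"),
    ("DDD", "Energy"),
]

# Count stocks per label exactly as spelled in the data.
counts = Counter(sector for _, sector in rows)

# An exact-match filter only catches one of the three health care stocks.
health_care_exact = counts["Health Care"]
print(counts)
print("Exact matches for 'Health Care':", health_care_exact)
```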

Pre-AI, fuzzy matching would have been the tool to reach for. However, the problem remains that labels like "Medical Devices" and "Medical Equipment" might not be matched as the same category, even though they most probably are.
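
A small sketch of that limitation, using Python's standard-library `difflib` as a stand-in for whatever fuzzy matcher you prefer. The 0.8 cutoff is an arbitrary assumption for illustration, not a recommended value:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity between two labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 0.8  # arbitrary cutoff, chosen only for this illustration

# Spelling variants of the same label score high...
spelling_score = similarity("Health Care", "Healthcare")

# ...but labels that mean roughly the same thing in different words score lower.
wording_score = similarity("Medical Devices", "Medical Equipment")

print(f"Health Care vs Healthcare: {spelling_score:.2f}")
print(f"Medical Devices vs Medical Equipment: {wording_score:.2f}")
print("Spelling variants merged:", spelling_score >= THRESHOLD)
print("Wording variants merged:", wording_score >= THRESHOLD)
```

String similarity handles the spelling variants but has no notion that "devices" and "equipment" belong together, which is where a language model can help.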

So what could you do? I tried something that I learned from Marcel Pauly from Der Spiegel, who has been experimenting with Retrieval-Augmented Generation, a.k.a. RAG, for a while. His preferred tool, back then at least, as he shows in this GitHub repo: Pydantic. I used it for this particular task like this.
But why have I stuck to such a rather deterministic way of using AI? Why have I still prompted and programmed this task more or less line by line, rather than just throwing the whole chunk of messy data at an LLM and letting it figure out the rest?

Well, maybe I am too much of a control freak. But I thought that by giving the model constraints through the classes defined in Pydantic, I could put guardrails on the results I get back, and be more confident that the categorization of stocks is consistent across the whole analysis.
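
Until the real code is up, here is a rough sketch of the general idea, not my actual implementation. The sector list, the class name `StockClassification`, the helper `parse_classification` and the example replies are all assumptions made up for illustration, and no API call is made: the JSON strings simply stand in for what a model might send back.

```python
import json
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical closed list of canonical sectors; the real list depends on the analysis.
Sector = Literal["Health Care", "Information Technology", "Financials", "Energy"]


class StockClassification(BaseModel):
    """One stock mapped to exactly one canonical sector."""

    ticker: str
    sector: Sector


def parse_classification(raw_json: str):
    """Return a validated classification, or None if the reply breaks the schema."""
    try:
        return StockClassification(**json.loads(raw_json))
    except (ValidationError, json.JSONDecodeError, TypeError):
        return None


# Simulated model replies (placeholders, not real model output).
good_reply = '{"ticker": "ABT", "sector": "Health Care"}'
bad_reply = '{"ticker": "ABT", "sector": "Healthcare"}'

accepted = parse_classification(good_reply)
rejected = parse_classification(bad_reply)

print("Accepted:", accepted)
print("Rejected:", rejected)
```

The point of the `Literal` type is that any label outside the fixed list fails validation, so a stray "Healthcare" gets caught and can be retried or reviewed by hand instead of silently creating a new category.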

Since the story is not published yet, I will add the actual code behind my methodology here at a later stage. So bear with me and my two screenshots for now.
