Next.js Discord

Discord Forum

I need some help with matching some data from 2 different datasets

Answered
Sun bear posted this in #help-forum
Sun bearOP
So I'm trying to match the locations for each company in Dataset 1 with the locations I have in Dataset 2, so that in the output I can use the location ID from Dataset 2 for each company instead of the location text from Dataset 1 (I hope this makes sense).

Here is what my datasets look like:

Dataset 1 input (the dataset starts with a comma for some reason, as you can see):
,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
4425416,tata consultancy services,tcs.com,1968.0,information technology and services,10001+,"bombay, maharashtra, india",india,linkedin.com/company/tata-consultancy-services,190771,341369

Matching against Dataset 2 input:
geonameID,locationName,countryName,countryCode
5128638,New York,United States of America,US


The script generates a list with all the unmatched locations and inside it I can see locations which should be matched but aren't, such as New York, US.


Unmatched locations:
new york, united states
bombay, india
alexandria, united states
london, united kingdom
palo alto, united states


The issue is that currently no location matches are found, and I'm not sure why, so I need someone to have a look at my Python script (below) and let me know what the issue is.

The output of the script looks like this currently (I'm only extracting the company name and location for the moment)

company_name,company_location
Ibm,"new york, united states"
Tata Consultancy Services,"bombay, india"
Accenture,2964574
Us Army,"alexandria, united states"
Ey,"london, united kingdom"


Here is my python Script: https://paste.ec/paste/ADhzoCQ4#1OD1ynbbmrV-WVlhmnM2FdjrDrZ1yP5uCY98/hqMOvV
Answered by Anay-208
Yes, I want to help you. However, you need to be familiar with the basics of Python; it'd take 10-15 mins if you already know Python.

I'm familiar with Python. However, we are just volunteers. We can basically guide you, by looking at the error messages, on how to solve the issues, or sometimes even edit the code for you!

However, from this message, I can tell that you don't even know how your code works.

For now, I'm looking through your code properly.

But in the meantime, at least spend 5-10 mins learning the basics

178 Replies

You'll have to debug through this. I'll be able to guide you to identify the error
@Anay-208 You'll have to debug through this. I'll be able to guide you to identify the error
Sun bearOP
Sure thing! If you can tell me what debug lines to add and at which line, that would help a lot, as I could run the script and show the results back here
So, after each operation, you can add a print and see exactly where it produces the invalid output
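A minimal sketch of that kind of print debugging, using only the sample rows from the first post. The `make_key` helper and its "city|country" key format are assumptions for illustration and are not taken from the actual script.

```python
import csv
import io

COMPANIES_CSV = ''',name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
'''
LOCATIONS_CSV = '''geonameID,locationName,countryName,countryCode
5128638,New York,United States of America,US
'''


def make_key(city, country):
    # Hypothetical key format: trimmed, lowercased "city|country".
    return f"{city.strip().lower()}|{country.strip().lower()}"


company = next(csv.DictReader(io.StringIO(COMPANIES_CSV)))
location = next(csv.DictReader(io.StringIO(LOCATIONS_CSV)))

# Build the key each side would be compared on, then print both with repr()
# so stray spaces or casing differences become visible.
company_key = make_key(company["locality"].split(",")[0], company["country"])
location_key = make_key(location["locationName"], location["countryName"])

print("company key: ", repr(company_key))
print("location key:", repr(location_key))
print("equal?", company_key == location_key)
```

With these sample rows the two keys differ only in the country part ("united states" versus "united states of america"), so an exact comparison fails even after lowercasing. Whether that is what happens in the real script can only be confirmed by printing the keys it actually builds.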
Sun bearOP
You'll have to tell me exactly what to add and where to add it.
Such as, add this line "qwwrqioruio" after line 193
@Sun bear You'll have to tell me exactly what to add and where to add it.
I just recommend you to get an idea on how the code is working first.
I don't want to be rude, but in #help-forum, we can guide you to fix errors, However, if you don't make an effort yourself we can't help you.
Ask claude to explain the code first
Once you get an idea, you can debug yourself
Sun bearOP
I know what the code does, because I had to tell Claude exactly what to do; I just don't know how it works, because I don't know Python.
You can do a quick crash course on python
That would be necessary.

For example, if someone has no experience in development, they'll just hire a web dev.
But if they don't have any knowledge about it, they can't just ask people how to fix the code in full.

The same applies here: if you are confused, you can perhaps hire someone (like me) to write the whole Python program for you
@Anay-208 That would be necessary. For example, if someone has no experience in development, they'll just hire a web dev. But if they don't have any knowledge about it, they can't just ask people how to fix the code in full. The same applies here: if you are confused, you can perhaps hire someone (like me) to write the whole Python program for you
Sun bearOP
I just need help with the matching algorithm, something is wrong there. You said you're willing to help Anay, that was the whole point of this thread.

I've also presented all the information needed to understand the issue. Instead of having a look at the code, specifically at the matching part, to see what could be misconfigured in there, you started recommending that I learn Python (which is not bad advice by any means), but that's not why I opened the thread in the first place.

You said you were willing to help and that you're familiar with Python, so I was expecting you to try and help.
@Sun bear You'll have to tell me exactly what to add and where to add it.
Yes, I want to help you. However, you need to be familiar with the basics of Python; it'd take 10-15 mins if you already know Python.

I'm familiar with Python. However, we are just volunteers. We can basically guide you, by looking at the error messages, on how to solve the issues, or sometimes even edit the code for you!

However, from this message, I can tell that you don't even know how your code works.

For now, I'm looking through your code properly.

But in the meantime, at least spend 5-10 mins learning the basics
Answer
Masai Lion
@Sun bear
nws
im here
hehe
Sun bearOP
I was waiting for some backup @Masai Lion 😁
So here's where I think the issue must be happening
Masai Lion
K so if im not wrong, the main issue with the unmatched locations likely stems from discrepancies in how locations are formatted and matched.
let me do a quick snippet of ur code
Sun bearOP
Awesome! And if you need to better see how the data looks in the 2 datasets I can also make some screenshots 😁
Masai Lion
Mind I put it on uhhh github?
@Masai Lion Mind I put it on uhhh github?
Sun bearOP
I don't
@Masai Lion This is the Companies Dataset
That's how it looks
@Sun bear Could it be because of case sensitivity? I just looked at your inputs, and the casing in the "matching against" data and in the unmatched list is different
Sun bearOP
And this is my World Locations dataset @Masai Lion
@Sun bear @Masai Lion This is the Companies Dataset
Masai Lion
mmmm alr...
alr let me send code
i think it should do
@Anay-208 @Sun bear Could it be because of case sensitivity? I just looked at your inputs, and the casing in the "matching against" data and in the unmatched list is different
Sun bearOP
It could be related to that. The code does try to do a bit of case matching if you look.
@Masai Lion https://github.com/yoboywhat/johnny/blob/main/main.py
Sun bearOP
Why is it so small ? 😮
@Sun bear It could be related to that. The code does try to do a bit of case matching if you look.
Masai Lion
ye, on the code i gave you i emphasize matching and casing
@Sun bear Why is it so small ? 😮
Masai Lion
111 lines
Sun bearOP
Yeah that's much smaller than what I currently have, did you remove something ?
@Sun bear Yeah that's much smaller than what I currently have, did you remove something ?
Masai Lion
i just grouped some stuff to make it smaller
like funcs and all that
nws
Sun bearOP
Awesome lol
The smaller the better
@Sun bear The smaller the better
Masai Lion
exactly
i also see u trying to do python + next?
Sun bearOP
Yeah, cause this is for my Database basically
@Sun bear Yeah, cause this is for my Database basically
Masai Lion
nice ! i also have a next + python project, but my python works as server for an AI im doing
kinda dumb AI rn but hopefully it learns fast
Sun bearOP
xd
Complex stuff 😁
@Sun bear xd
Masai Lion
mmmm ye gotchu
so
uhm
we could do this:
Sun bearOP
Whoops so there's an issue I think, or maybe it just takes much longer
Masai Lion
Load and Normalize Locations from Both Datasets
Match and Replace Locations with Location IDs
Create a Company List with Location IDs
we could do those 3
tho idk how efficient it would be
@Masai Lion > Load and Normalize Locations from Both Datasets > Match and Replace Locations with Location IDs > Create a Company List with Location IDs
Sun bearOP
Yup, and if for some reason a location isn't found in my locations dataset then we can just copy paste the location as it is from the Companies Dataset
But I have like 12 million locations
Chances for that should be very low
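The three steps, plus the fallback for unmatched locations, could look roughly like the sketch below. It is an illustration under assumptions, not the code from the thread: the company columns are trimmed to the ones used here, and `COUNTRY_CODES`, `normalize`, `build_location_index` and `match_company` are hypothetical names. Loading the locations into a dict once means each company lookup is a single hash lookup instead of a scan over millions of rows.

```python
import csv
import io

LOCATIONS_CSV = """geonameID,locationName,countryName,countryCode
5128638,New York,United States of America,US
"""

# Trimmed sample of the companies dataset (only the columns used below).
COMPANIES_CSV = """,name,locality,country
5872184,ibm,"new york, new york, united states",united states
4425416,tata consultancy services,"bombay, maharashtra, india",india
"""

# Hypothetical alias table; a real run would need one entry per country.
COUNTRY_CODES = {"united states": "US", "india": "IN"}


def normalize(text):
    # Lowercase and collapse whitespace so "New York " matches "new york".
    return " ".join(text.lower().split())


def build_location_index(rows):
    # Step 1: load the locations once into a dict keyed by (city, country code).
    index = {}
    for row in rows:
        key = (normalize(row["locationName"]), row["countryCode"].upper())
        index.setdefault(key, row["geonameID"])  # keep the first ID on duplicates
    return index


def match_company(row, index):
    # Step 2: replace the location with its ID, or keep the text if unmatched.
    city = normalize(row["locality"].split(",")[0])
    country = normalize(row["country"])
    fallback = f"{city}, {country}"
    return index.get((city, COUNTRY_CODES.get(country)), fallback)


index = build_location_index(csv.DictReader(io.StringIO(LOCATIONS_CSV)))

# Step 3: build the company list, with location IDs where a match was found.
companies = [
    (row["name"].title(), match_company(row, index))
    for row in csv.DictReader(io.StringIO(COMPANIES_CSV))
]
print(companies)
```

Here the IBM row resolves to the New York ID, while the Bombay row keeps its text because the sample locations data has no entry for it.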
@Sun bear But I have like 12 million locations
Masai Lion
holy moly
Sun bearOP
xD I wanted to make sure I don't miss anything lol
The dataset will have to be cleaned a bit at some point lol
But for now it's okay
Masai Lion
try this:
i updated the code
Sun bearOP
So the issue the current code has is for some reason it runs extremely slow
@Sun bear The dataset will have to be cleaned a bit at some point lol
Masai Lion
ye but rn it sounds okie
Sun bearOP
Like 19 seconds per line lol
@Sun bear Like 19 seconds per line lol
Masai Lion
holy
k let me fix that
Sun bearOP
Please, something is probably wrong lol
It usually only takes 80 seconds for the whole process start to finish (minus the uploading)
Also please make sure it doesn't upload anything to Supabase yet if I say "No" when it asks me.
Masai Lion
i think its all funcs i made lmao
Sun bearOP
For now we should just focus on making sure the data is correct first.
Masai Lion
k reduced cache now
@Sun bear Also please make sure it doesn't upload anything to Supabase yet if I say "No" when it asks me.
Sun bearOP
Nevermind that won't work anyway cause the Keys are not set up yet.
Masai Lion
let me commit
Sun bearOP
It needs to go through this many locations when it does the matching 😁
commit done
@Sun bear It needs to go through this many locations when it does the matching 😁
Masai Lion
maybe thats why it taking too long lmao
Sun bearOP
Woah the code is even smaller now
@Sun bear Woah the code is even smaller now
Masai Lion
just a tiny bit
Sun bearOP
I'm waiting for it to do something I'm not sure why it takes so long to start
Maybe we should add a progress bar for that too
So that I can see if it's doing something or not
Masai Lion
the issue here is that
u want to read lot of data
between two files
Sun bearOP
Yeah, that's about 2GB of data
Masai Lion
so the program is going crazy
@Sun bear Yeah, that's about 2GB of data
Masai Lion
exactly lmao
Sun bearOP
Well actually it's more like 1.5GB
Oh there we go
I can see lots of errors
@Sun bear Click to see attachment
Masai Lion
bruh
the more u know:
the issue is that we are attempting to use a dictionary as a key in another dictionary or cache.
and the solution MIGHT BE that we need to change how the world_locations dictionary is passed to the match_location function or refactor the approach
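A small sketch of how that kind of error can arise and one way around it. It assumes the cache is `functools.lru_cache`, which the thread does not confirm; `world_locations` and `match_location` are the names mentioned above, while `match_location_cached` and `make_matcher` are hypothetical.

```python
from functools import lru_cache

world_locations = {("new york", "US"): "5128638"}


@lru_cache(maxsize=None)
def match_location_cached(city, country_code, locations):
    # lru_cache hashes every argument to build its cache key,
    # so passing a dict (which is unhashable) fails before this body runs.
    return locations.get((city, country_code))


try:
    match_location_cached("new york", "US", world_locations)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)


def make_matcher(locations):
    # Close over the dict so only hashable strings reach the cache.
    @lru_cache(maxsize=100_000)
    def match_location(city, country_code):
        return locations.get((city, country_code))

    return match_location


match_location = make_matcher(world_locations)
print(match_location("new york", "US"))
```

Since the lookup is already a plain dict access, the cache may not buy much here; dropping it entirely is another reasonable option.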
Sun bearOP
We can try, If it doesn't work then we can try something else 😁
try it
Sun bearOP
I can tell we're probably on the right track because it takes its time before it starts showing the ui progress bar.
Masai Lion
does it work?
Sun bearOP
Before that it would only spend like 10-20 seconds at most
Then show the progress bar
@Masai Lion does it work?
Sun bearOP
It's still "thinking"
xD
@Masai Lion is it normal to take this long ?
Sun bearOP
That's okay then I can wait. I'll let you know when I see something moving.
Masai Lion
my AI rn doesn’t take too long cuz it’s dumb, but later on it will due to its data
@Masai Lion my AI rn doesn’t take too long cuz it’s dumb, but later on it will due to its data
Sun bearOP
I don't think you're dumb, on the contrary!
@Sun bear I don't think you're dumb, on the contrary!
Masai Lion
No no, my AI hehe
It’s dumb cuz it’s neuronal
and it has to learn
and be cloned
and stuff so it’s like
Sun bearOP
Wow
Masai Lion
How is it going
Sun bearOP
@Masai Lion did you remove the progress bar ui ?
That would've helped to see what's going on
Cause I can't see anything
@Sun bear @Masai Lion did you remove the progress bar ui ?
Masai Lion
As far as I know I kept it
🤡
Sun bearOP
Okay then it must mean it didn't get there yet
Sun bearOP
All I can see is that it's taking as much RAM as it can
Sun bearOP
Masai Lion
BRO
HELL NAH
Sun bearOP
Yeah something's not right lol
Masai Lion
Let me see the code
Sun bearOP
It shouldn't take this long to even start lol
Masai Lion
Well let me give it a look again
Any updates on the issue?
@Sun bear It shouldn't take this long to even start lol
Masai Lion
Ye lol
@Anay-208 Any updates on the issue?
Masai Lion
Doesn’t start and take 4 RAM
@Anay-208 Any updates on the issue?
Sun bearOP
Not working yet 😁
Masai Lion
:sunglasses_1:
@Anay-208 wth
Masai Lion
Exactly
I will try fix it on my side
Sun bearOP
I interrupted it and this is what it showed @Masai Lion
@Sun bear I interrupted it and this is what it showed @Masai Lion
Masai Lion
It issued matching
Breh
Sun bearOP
I just made myself a Cappuccino ☕
Sun bearOP
@Masai Lion Let me know if having access to the 2 datasets would make it easier for you to help me out.
:sunglasses_1:
@Masai Lion I don’t think so, but will tell u if I do!
Sun bearOP
Awesome, if needed just let me know, in the meantime I'll wait so take your time. If you don't have time for this now we can leave it for another day too, don't feel pressured or anything.
Masai Lion
@Sun bear
I tried doing some stuff
But time took me
Can we try tomorrow?
:meow_stare:
@Masai Lion Can we try tomorrow?
Sun bearOP
For sure!
@Sun bear any updates?
@Anay-208 @Sun bear any updates?
Sun bearOP
I'm waiting for Whois to reply so that we can continue, unless you know what's causing the issue and can help me fix it 😁
Sun bearOP
@Anay-208 I've managed to make some progress with the help of ChatGPT. The processing is quite slow though, and the success rate is only about 55% at the moment.

I have to do some checks to see if those locations are really missing from my dataset. If they aren't, that probably means there are still issues with the way matching is done.
It's much better than no matching at all, at least now it's finally starting to find matches.
Alright
Sun bearOP
I think the issue has to do with the country mapping. I noticed GPT had created a manual list of just a few countries, which seemed to work great, but there are a lot of countries in the dataset. So I told it to instead build the list of countries from the dataset at runtime and then do the matching. That seems to be working, but now the success rate has dropped by more than half.

I'm currently trying to figure out why the success rate is so low using this method.
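One possible reason, going only by the sample rows in the first post: a list built from the locations dataset contains names like "United States of America", while the companies dataset says "united states", so exact lookups against the runtime list can miss. The sketch below shows a runtime mapping with a cautious prefix fallback. The `build_country_codes` and `resolve_country` helpers are hypothetical, and the United Kingdom and India rows are made-up examples in the same shape as the locations dataset.

```python
def build_country_codes(location_rows):
    # Map each lowercase country name found in the locations data to its code.
    return {
        row["countryName"].strip().lower(): row["countryCode"]
        for row in location_rows
    }


def resolve_country(name, country_codes):
    name = name.strip().lower()
    if name in country_codes:
        return country_codes[name]
    # Fallback (an assumption): accept a dataset name that starts with the
    # company's country name, but only when exactly one code qualifies.
    candidates = {
        code for full, code in country_codes.items() if full.startswith(name)
    }
    return candidates.pop() if len(candidates) == 1 else None


location_rows = [
    {"countryName": "United States of America", "countryCode": "US"},
    {"countryName": "United Kingdom", "countryCode": "GB"},
    {"countryName": "India", "countryCode": "IN"},
]
country_codes = build_country_codes(location_rows)

print(resolve_country("united states", country_codes))
print(resolve_country("india", country_codes))
print(resolve_country("united", country_codes))
```

An ambiguous name such as "united" resolves to nothing rather than to a guess, which keeps wrong matches out of the output at the cost of a few more unmatched rows.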
Sun bearOP
@Masai Lion @Anay-208

The issue at hand is one of the datasets that I'm using. That's the reason for the low success rate when it comes to matching; I only realised this recently. My locations dataset is too big, contains too many duplicates and seems focused on addresses. After cleaning some of the duplicates, the number of locations dropped from about 12 million to about 8 million, but I'm looking to switch to a better locations dataset and then come back to this.

I'd like to thank both of you for your participation and doing your best to help out a friend, I'll pick something in here as a solution, plus kudos cause you deserve it.