Saving time during data integration
Here are a few things that will save you time when integrating this data:
- linkedin_profile.user_id and linkedin_company.org_id are the original IDs, which come from LinkedIn itself
- linkedin_profile.id and linkedin_company.id are our internal primary keys. Their values can look similar to user_id and org_id, but they diverge, so treat them as ordinary autogenerated primary keys
- There are duplicate linkedin_profile and linkedin_company records. Deduping them is an ongoing project, but fortunately we have a way to filter them out: WHERE EXISTS (SELECT 1 FROM linkedin_profile_slug s WHERE s.linkedin_profile_id = lp.id), where lp is the alias for linkedin_profile. We use these slug tables to route writes internally, so duplicate records without a slug entry won't receive new updates.
- Most normalized tables, such as linkedin_profile_position2, use the same scaffolding, which allows for deduplication and time travel. Each type of normalized record has its own hash expression used to determine uniqueness, plus min_snapshot_id and max_snapshot_id columns. If you're looking for the most recent version of a profile, the range from min_snapshot_id to max_snapshot_id should contain linkedin_profile.max_snapshot_id. For details, see the definition of v0r0.lkd_profile, which unites all the tables into a single row.
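
The duplicate filter and the snapshot rule above can be combined in one query. The following is a minimal sketch, not the definition of v0r0.lkd_profile. It uses an in-memory SQLite database only so that it runs anywhere; the query itself is plain SQL. Assumptions made for illustration: the join column linkedin_profile_position2.linkedin_profile_id, the title column, the slug column, all sample data, and the reading of "contain" as a BETWEEN check on the snapshot range.

```python
import sqlite3

# Toy schema: only the columns named in the notes above, plus a few
# hypothetical ones (slug, title, linkedin_profile_position2.linkedin_profile_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE linkedin_profile (
    id INTEGER PRIMARY KEY,      -- internal autogenerated key
    user_id INTEGER,             -- original LinkedIn id
    max_snapshot_id INTEGER
);
CREATE TABLE linkedin_profile_slug (
    slug TEXT,
    linkedin_profile_id INTEGER
);
CREATE TABLE linkedin_profile_position2 (
    linkedin_profile_id INTEGER,
    title TEXT,
    min_snapshot_id INTEGER,
    max_snapshot_id INTEGER
);

-- Profiles 1 and 2 are duplicates of one LinkedIn user; only 1 has a slug.
INSERT INTO linkedin_profile VALUES (1, 9001, 30), (2, 9001, 10);
INSERT INTO linkedin_profile_slug VALUES ('jane-doe', 1);
INSERT INTO linkedin_profile_position2 VALUES
    (1, 'Engineer', 10, 20),         -- older version, superseded
    (1, 'Senior Engineer', 21, 30),  -- current version
    (2, 'Engineer', 10, 10);         -- attached to the duplicate profile
""")

CURRENT_POSITIONS = """
SELECT lp.id, lp.user_id, p.title
FROM linkedin_profile lp
JOIN linkedin_profile_position2 p
  ON p.linkedin_profile_id = lp.id
 -- keep only the record version whose snapshot range covers the
 -- profile's latest snapshot
 AND lp.max_snapshot_id BETWEEN p.min_snapshot_id AND p.max_snapshot_id
-- drop duplicate profiles: only profiles referenced by a slug are kept
WHERE EXISTS (
    SELECT 1 FROM linkedin_profile_slug s
    WHERE s.linkedin_profile_id = lp.id
)
ORDER BY lp.id
"""

rows = conn.execute(CURRENT_POSITIONS).fetchall()
print(rows)
```

With this sample data, only the slug-backed profile and its current position come back; the duplicate profile and the superseded position are filtered out. The same pattern should apply to linkedin_company, assuming it has an equivalent slug table.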