Saving time during data integration

Here are a few things that will save you time when integrating this data:

  1. linkedin_profile.user_id and linkedin_company.org_id are the original IDs, which come from LinkedIn itself.
  2. linkedin_profile.id and linkedin_company.id are our internal primary keys. They can look similar to user_id and org_id, but the values diverge, so treat them as ordinary autogenerated primary keys.
  3. There are duplicate linkedin_profile and linkedin_company records. Deduplicating them is an ongoing project, but fortunately there is a way to filter them out: WHERE EXISTS (SELECT 1 FROM linkedin_profile_slug s WHERE s.linkedin_profile_id = lp.id), where lp is the alias for linkedin_profile. We use these slug tables to route writes internally, so the duplicate records won't receive new updates.
  4. Most normalized tables, such as linkedin_profile_position2, use the same scaffolding, which allows for deduplication and time travel. Each type of normalized record has a hash expression that enforces uniqueness, along with min_snapshot_id and max_snapshot_id columns. If you're looking for the most recent version of a profile, select the rows whose min_snapshot_id to max_snapshot_id range contains linkedin_profile.max_snapshot_id. For details, check out the definition of v0r0.lkd_profile, which unites all the tables into a single row.
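
As a rough illustration of points 3 and 4 together, here is a minimal sketch that runs the slug filter and the snapshot-range check against a toy in-memory SQLite database. The table names and the WHERE EXISTS clause come from the notes above; everything else is an assumption made for the example: the column types, the linkedin_profile_id join column and the title column on linkedin_profile_position2, the slug column, the sample rows, and the reading of "contain" as an inclusive BETWEEN. The real schema and database engine may differ, so treat this as a starting point rather than the actual definition.

```python
import sqlite3

# Toy schema: only the table names and the id / user_id / max_snapshot_id /
# min_snapshot_id / linkedin_profile_id names are taken from the notes.
# Column types, `slug`, and `title` are hypothetical.
SCHEMA = """
CREATE TABLE linkedin_profile (
    id INTEGER PRIMARY KEY,      -- internal autogenerated key
    user_id TEXT,                -- original LinkedIn id
    max_snapshot_id INTEGER
);
CREATE TABLE linkedin_profile_slug (
    slug TEXT,
    linkedin_profile_id INTEGER
);
CREATE TABLE linkedin_profile_position2 (
    linkedin_profile_id INTEGER, -- assumed join column
    title TEXT,                  -- hypothetical payload column
    min_snapshot_id INTEGER,
    max_snapshot_id INTEGER
);
"""

# Current positions for non-duplicate profiles:
#  - the EXISTS clause drops profiles that have no slug row (duplicates);
#  - the BETWEEN clause keeps only position rows whose snapshot range
#    contains the profile's latest snapshot (assumed inclusive).
CURRENT_POSITIONS = """
SELECT lp.id, lp.user_id, p.title
FROM linkedin_profile lp
JOIN linkedin_profile_position2 p
  ON p.linkedin_profile_id = lp.id
 AND lp.max_snapshot_id BETWEEN p.min_snapshot_id AND p.max_snapshot_id
WHERE EXISTS (
    SELECT 1 FROM linkedin_profile_slug s
    WHERE s.linkedin_profile_id = lp.id
)
ORDER BY lp.id, p.title
"""


def current_positions(conn):
    """Return (id, user_id, title) rows for the latest profile versions."""
    return conn.execute(CURRENT_POSITIONS).fetchall()


def demo():
    """Build the toy database with made-up rows and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO linkedin_profile VALUES (?, ?, ?)",
        [(1, "u-100", 7),   # routed profile (has a slug row)
         (2, "u-100", 3)],  # duplicate of the same person, no slug row
    )
    conn.executemany(
        "INSERT INTO linkedin_profile_slug VALUES (?, ?)",
        [("jane-doe", 1)],
    )
    conn.executemany(
        "INSERT INTO linkedin_profile_position2 VALUES (?, ?, ?, ?)",
        [(1, "Engineer", 1, 4),         # superseded version
         (1, "Senior Engineer", 5, 7),  # range contains snapshot 7
         (2, "Engineer", 1, 3)],        # belongs to the duplicate profile
    )
    return current_positions(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

With the sample rows above, only the "Senior Engineer" row for profile 1 is returned: profile 2 is dropped because it has no slug row, and the older "Engineer" row is dropped because its snapshot range ends before the profile's max_snapshot_id. The same pattern should carry over to other normalized tables that share the scaffolding, but check v0r0.lkd_profile for the actual join conditions.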