Developing With Production Data
Test data isn't very good
Vox Media
- One big media platform called Chorus
- The Verge
- SBNation.com
- 320+ sports blogs
- 21 Regional Sites
- Lots of user generated content, and lots of different styles
Problems
- What does a developer see when they first start the app?
- What kind of data is a designer using for example?
- What data does a QA tester use?
Script it?
# Seed five placeholder stories for the local app
5.times do |j|
  Story.create!(
    :user      => users[:author],
    :community => community,
    :title     => title(j),
    :body      => lorem_ipsum
  )
end
That solution isn't very good
- Have to constantly update the script
- Still have to recreate production data by hand to reproduce bugs
- From a team perspective, using the local app is a terrible experience
Use production data!
That would be great if my database wasn't massive
Alternative talk titles
- Creating a representative subset of production data
- Imperfect, hacky SQL that greatly improves developers' lives
Complications
Size - Production database is > 100 GB, so copying all of it is not really an option. How do we get a subset?
Assets - How do we share external assets and make that work? (S3, etc)
Services - How do we make internal services play nice? (Staircar)
This is my dev environment
Results
- Production DB is 118 GB
- Development DB is 4.7 GB
- Gzipped SQL is 650 MB
- 90 minutes to create
- 60 minutes to import (SSD)
Prerequisites
- You actually have this problem
- Slave DB is a must
- Time for trial and error
- Figure out what your largest tables are, and approximate what a meaningful subset would be.
General Process
- Connect AR to DB slave
- For every table, dump to a file (only a subset if it's a big table)
  - mysqldump database table --skip-comments --single-transaction --quick
- Import that dump into a temporary DB
  - Why? Because we need to anonymize/delete things.
- For every big model
  - For every table
    - If it has foreign keys to the big table, DELETE rows whose parent row is missing
  - end
- end
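A minimal sketch of the dump and prune steps, in Ruby. It only builds strings and runs nothing. The helper names (`dump_command`, `orphan_delete_sql`), the `app_production` database, the `stories`/`comments` tables, the `story_id` column, and the row cutoff are all hypothetical. The dump flags are the ones from the slide, plus mysqldump's standard `--where` option to take a subset of a big table.

```ruby
require "shellwords"

# Flags taken from the slide above.
DUMP_FLAGS = %w[--skip-comments --single-transaction --quick].freeze

# Build a mysqldump command for one table. For big tables, pass a SQL
# condition in `where` so only a subset of rows is dumped.
def dump_command(database, table, where: nil)
  args = ["mysqldump", database, table, *DUMP_FLAGS]
  args << "--where=#{where}" if where
  args.shelljoin # shell-escape every argument
end

# Build a MySQL multi-table DELETE that removes rows in `child` whose
# foreign key points at a row that is missing from `parent`.
def orphan_delete_sql(child, foreign_key, parent)
  "DELETE c FROM `#{child}` AS c " \
    "LEFT JOIN `#{parent}` AS p ON c.`#{foreign_key}` = p.id " \
    "WHERE c.`#{foreign_key}` IS NOT NULL AND p.id IS NULL"
end

# Hypothetical names, for illustration only.
puts dump_command("app_production", "stories", where: "id > 1000000")
puts orphan_delete_sql("comments", "story_id", "stories")
```

The delete runs against the temporary DB, after the subset dumps are imported, so rows that pointed at stories outside the subset are cleaned up.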
Process continued...
- Anonymize user data
- Wipe all stored ip addresses
- Wipe all stored analytics related data
- Anonymize all emails, etc, etc
- Change all passwords to a known crypted password/salt pair
- Dump that temporary DB, gzip, upload to S3
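The anonymization pass could be sketched as a list of raw SQL statements run against the temporary DB. Everything named here is an assumption: the `users` and `analytics_events` tables, the column names, and the helpers `quote` and `anonymize_statements`. The real schema will differ.

```ruby
# Quote a string for use in a MySQL statement by backslash-escaping
# single quotes and backslashes.
def quote(value)
  escaped = value.to_s.gsub(/['\\]/) { |char| "\\#{char}" }
  "'#{escaped}'"
end

# Build the UPDATE/TRUNCATE statements for the anonymization pass.
# Table and column names are hypothetical.
def anonymize_statements(crypted_password:, salt:)
  [
    # Wipe stored IP addresses.
    "UPDATE users SET last_login_ip = NULL, current_login_ip = NULL",
    # Keep emails unique per row, in case the column has a unique index.
    "UPDATE users SET email = CONCAT('user', id, '@example.com')",
    # Every account gets the same known password hash and salt.
    "UPDATE users SET crypted_password = #{quote(crypted_password)}, " \
      "salt = #{quote(salt)}",
    # Drop analytics data entirely.
    "TRUNCATE TABLE analytics_events"
  ]
end

# Placeholder values; use a hash/salt pair your auth code accepts.
anonymize_statements(crypted_password: "known-hash", salt: "known-salt").each do |sql|
  puts sql
end
```

With a shared known password, any developer can log in locally as any user to reproduce a bug.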
It's a Trap!
- Need a list of all models. Can be tricky w/ namespaces and inheritance
- def self.has_belongs_relationship_to?(model)
- self.reflect_on_all_associations.select{ |x| x.macro == :belongs_to }
- Detect polymorphism
- Does it belong to a subtype of that other class?
- Always raw SQL if possible - but don't let AR go to sleep
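A rough sketch of the relationship check, written against stand-in objects so it runs without Rails. The `Reflection` struct mimics the `macro`, `name`, `klass`, and `options` readers of an ActiveRecord association reflection. The classes and the helper `belongs_to_links` are hypothetical and not the code from the talk.

```ruby
# Stand-in for an ActiveRecord association reflection (assumption:
# only the four readers used below are needed).
Reflection = Struct.new(:macro, :name, :klass, :options)

# Hypothetical models.
class Story; end
class Article < Story; end # a subtype of Story
class User; end

# Return the names of belongs_to associations that point at `model` or at
# one of its subtypes. Polymorphic associations have no single target
# class, so they are listed separately for manual handling.
def belongs_to_links(reflections, model)
  belongs = reflections.select { |r| r.macro == :belongs_to }
  polymorphic, concrete = belongs.partition { |r| r.options[:polymorphic] }
  # Class#<= is true for the same class or a subclass, nil if unrelated.
  direct = concrete.select { |r| r.klass <= model }
  { direct: direct.map(&:name), polymorphic: polymorphic.map(&:name) }
end

comment_reflections = [
  Reflection.new(:belongs_to, :article, Article, {}),
  Reflection.new(:belongs_to, :author, User, {}),
  Reflection.new(:belongs_to, :commentable, nil, { polymorphic: true }),
  Reflection.new(:has_many, :votes, nil, {})
]

p belongs_to_links(comment_reflections, Story)
```

Here `:article` counts as a link to `Story` because `Article` is a subtype, while `:commentable` is flagged as polymorphic and needs its type column inspected separately.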
Assets
- Every non-production environment assumes assets are from production
- Append token to start of newly created asset filenames
- Check for token when generating url. If local, use non-production S3 settings
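The token scheme above could look roughly like this. The token value, both hosts, and the helper names `stored_filename` and `asset_url` are placeholders, since the talk does not give the real ones.

```ruby
# Hypothetical token and bucket hosts.
LOCAL_TOKEN = "dev-"
PRODUCTION_HOST = "https://production-assets.example.com"
NON_PRODUCTION_HOST = "https://dev-assets.example.com"

# Name a newly uploaded file. Outside production, prepend the token so the
# file can later be recognized as locally created.
def stored_filename(original, env)
  env == "production" ? original : "#{LOCAL_TOKEN}#{original}"
end

# Pick the host from the filename alone: tokened files live in the
# non-production bucket, everything else is assumed to be a production asset.
def asset_url(filename)
  host = filename.start_with?(LOCAL_TOKEN) ? NON_PRODUCTION_HOST : PRODUCTION_HOST
  "#{host}/#{filename}"
end

puts asset_url("logo.png")                                   # came with the production data
puts asset_url(stored_filename("upload.png", "development")) # created locally
```

Because the decision is made from the filename, imported production rows keep working without rewriting any stored paths.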
Services
- Non-production environments all point to staging
- Staging data mirrors production data, maybe synced weekly
We're Hiring
We need developers (and designers if that's your thing)