11
votes

The Python native capabilities for lists, sets & dictionaries totally rock. Is there a way to continue using the native capability when the data becomes really big? The problem I'm working on involved matching (intersection) of very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and don't want to be surprised with a big reimplementation after the data grows as expected.

Is it reasonable to deploy on something like Google App Engine that advertises no practical scale limit and continue using the native capability as-is forever and not really think about this?

Is there some Python magic that can hide whether the list, set or dictionary is in Python-managed memory vs. in a DB -- so physical deployment of data can be kept distinct from what I do in code?

How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?

3
Maybe u need SQLAlchemy? Or another ORM? And save u data to database?Denis
Mr. and Ms. Python Super Experts are called pythonistas. ;-)Aufwind
It's extremely hard, if not impossible, to serialize and deserialize arbitrary Python objects but it's easy to persistent a subset of Python objects, like int, str, list and dict using pickle/json or whatever. However data persistence only consists of a small part in your problem. Another problem to solve is that you need to create some kind of mapper to map your object with the database. If you are using relational databases like Postgresql or MySQL, you may take a look at ORMs like Sqlalchemy but If you can only use GAE's bigtable then you may need to write your own ORM...Overmind Jiang
@Druss: not official, nor ever will be. For myself, I'm a snake charmer.Chris Morgan
@Druss, first time I hear of snake charmers. Sounds neat, too. ;-)Aufwind

3 Answers

8
votes

I'm not quite sure what you mean by native capabilities for lists, sets & dictionaries. However, you can create classes that emulate container types and sequence types by defining some methods with special names. That means that you could create a class that behaves like a list, but stores its data in a SQL database or on GAE datastore. Simply speaking, this is what an ORM does. However, mapping objects to a database is very complicated and it is probably not a good idea to invent your own ORM, but to use an existing one.

I'm afraid there is no one-size-fits-all solution. Especially GAE is not some kind of of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity, others are specific to the environment your code runs in. E.g. on GAE the maximum response time is limited to 30 seconds and querying the datastore works different that on other databases.

It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.

In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.

2
votes

You are describing my dreams! However, I think you cannot do it. I always wanted something just like LINQ for Python but the language does not permit to use Python syntax for native database operations AFAIK. If it would be possible, you could just write code using lists and then use the same code for retrieving data from a database.

I would not recommend you to write a lot of code based only in lists and sets because it will not be easy to migrate it to a scalable platform. I recommend you to use something like an ORM. GAE even has its own ORM-like system and you can use other ones such as SQLAlchemy and SQLObject with e.g. SQLite.

Unfortunately, you cannot use awesome stuff such as list comprehensions to filter data from the database. Surely, you can filter data after it was gotten from the DB but you'll still need to build a query with some SQL-like language for querying objects or return a lot of objects from a database.

OTOH, there is Buzhug, a curious non-relational database system written in Python which allows the use of natural Python syntax. I have never used it and I do not know if it is scalable so I would not put my money on it. However, you can test it and see if it can help you.

0
votes

You can use ORM: Object Relational Mapping: A class gets a table, an objects gets a row. I like the Django ORM. You can use it for non-web apps, too. I never used it on GAE, but I think it is possible.