메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

Nice programing

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

nicepro 2020. 10. 26. 21:04

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

SqlAlchemy를 사용하여 인터페이스하는 ~ 10M 레코드 MySQL 테이블이 있습니다. 이 테이블의 큰 하위 집합에 대한 쿼리는 데이터 집합의 한입 크기의 청크를 지능적으로 가져 오는 내장 생성기를 사용한다고 생각했지만 너무 많은 메모리를 소비한다는 것을 발견했습니다.

for thing in session.query(Things):
    analyze(thing)

이것을 피하기 위해, 나는 덩어리로 깨지는 내 자신의 반복자를 만들어야한다는 것을 알았다.

lastThingID = None
while True:
    things = query.filter(Thing.id < lastThingID).limit(querySize).all()
    if not rows or len(rows) == 0: 
        break
    for thing in things:
        lastThingID = row.id
        analyze(thing)

이것이 정상입니까 아니면 SA 내장 생성기와 관련하여 내가 놓친 것이 있습니까?

이 질문에 대한 대답 은 메모리 소비가 예상되지 않음을 나타내는 것 같습니다.

대부분의 DBAPI 구현은 행을 가져올 때 완전히 버퍼링합니다. 따라서 일반적으로 SQLAlchemy ORM이 하나의 결과를 가져 오기 전에 전체 결과 집합이 메모리에 있습니다.

그러나 Query작동 방식은 객체로 돌아 가기 전에 기본적으로 주어진 결과 집합을 완전히로드하는 것입니다. 여기서 근거는 단순한 SELECT 문 이상의 쿼리에 관한 것입니다. 예를 들어, 하나의 결과 집합에서 동일한 개체 ID를 여러 번 반환 할 수있는 다른 테이블에 대한 조인 (즉시로드와 공통 됨)에서 전체 행 집합이 메모리에 있어야합니다. 그렇지 않으면 올바른 결과가 반환 될 수 있습니다. 부분적으로 만 채워질 수 있습니다.

따라서을 Query통해이 동작을 변경할 수있는 옵션을 제공합니다 yield_per(). 이 호출은 Query배치 크기를 제공하는 배치에서 행을 생성하도록합니다. 문서 상태에서 이것은 컬렉션을 열심히로드하지 않는 경우에만 적합하므로 기본적으로 수행중인 작업을 실제로 알고있는 경우입니다. 또한 기본 DBAPI가 행을 사전 버퍼링하는 경우에도 여전히 해당 메모리 오버 헤드가 있으므로 접근 방식은 사용하지 않는 것보다 약간만 더 잘 확장됩니다.

나는 거의 사용하지 않는다 yield_per(); 대신 창 함수를 사용하여 위에서 제안한 LIMIT 접근 방식의 더 나은 버전을 사용합니다. LIMIT 및 OFFSET에는 OFFSET 값이 매우 크면 쿼리가 느리고 느려진다는 큰 문제가 있습니다. N의 OFFSET로 인해 N 개의 행이 페이지를 넘기 게되므로 같은 쿼리를 한 번이 아니라 50 번 수행하는 것과 같습니다. 더 크고 더 많은 수의 행. 창 기능 접근 방식을 사용하여 선택하려는 테이블의 청크를 참조하는 "창"값 집합을 미리 가져옵니다. 그런 다음 각 창에서 한 번에 가져 오는 개별 SELECT 문을 내 보냅니다.

창 기능 접근 방식은 위키에 있으며 큰 성공을 거두었습니다.

또한 참고 : 모든 데이터베이스가 창 기능을 지원하는 것은 아닙니다. Postgresql, Oracle 또는 SQL Server가 필요합니다. 적어도 Postgresql을 사용하는 IMHO는 그만한 가치가 있습니다. 관계형 데이터베이스를 사용하는 경우 가장 좋은 방법을 사용하는 것이 좋습니다.

SQLAlchemy를 사용하여 효율적인 순회 / 페이징을 조사해 왔으며이 답변을 업데이트하고 싶습니다.

슬라이스 호출을 사용하여 쿼리 범위를 적절하게 제한하고 효율적으로 재사용 할 수 있다고 생각합니다.

예:

window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start,stop = window_size*window_idx, window_size*(window_idx+1)
    things = query.slice(start, stop).all()
    if things is None:
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1

저는 데이터베이스 전문가는 아니지만 SQLAlchemy를 간단한 Python 추상화 계층으로 사용할 때 (즉, ORM 쿼리 개체를 사용하지 않음) 메모리 사용량을 늘리지 않고 300M 행 테이블을 쿼리 할 수있는 만족스러운 솔루션을 찾았습니다.

다음은 더미 예입니다.

from sqlalchemy import create_engine, select

conn = create_engine("DB URL...").connect()
q = select([huge_table])

proxy = conn.execution_options(stream_results=True).execute(q)

그런 다음 SQLAlchemy fetchmany()메서드를 사용하여 결과를 무한 while루프로 반복합니다.

while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time

    if not batch:
        break

    for row in batch:
        # Do your stuff here...

proxy.close()

This method allowed me to do all kind of data aggregation without any dangerous memory overhead.

NOTE the stream_results works with Postgres and the pyscopg2 adapter, but I guess it won't work with any DBAPI, nor with any database driver...

There is an interesting usecase in this blog post that inspired my above method.

In the spirit of Joel's answer, I use the following:

WINDOW_SIZE = 1000
def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if things is None:
            break
        for thing in things:
            yield(thing)
        start += WINDOW_SIZE

Using LIMIT/OFFSET is bad, because you need to find all {OFFSET} columns before, so the larger is OFFSET - the longer request you get. Using windowed query for me also gives bad results on large table with large amount of data (you wait first results for too long, that it's not good in my case for chunked web response).

Best approach given here https://stackoverflow.com/a/27169302/450103. In my case I resolved problem simply using index on datetime field and fetching next query with datetime>=previous_datetime. Stupid, because I used that index in different cases before, but thought that for fetching all data windowed query would be better. In my case I was wrong.

AFAIK, the first variant still gets all the tuples from the table (with one SQL query) but builds the ORM presentation for each entity when iterating. So it is more efficient than building a list of all entities before iterating but you still have to fetch all the (raw) data into memory.

Thus, using LIMIT on huge tables sounds like a good idea to me.

참고URL : https://stackoverflow.com/questions/7389759/memory-efficient-built-in-sqlalchemy-iterator-generator

'Nice programing' 카테고리의 다른 글

Objective-C : 초기화 대 초기화 (0)	2020.10.26
파이썬 히스토그램에 로그 빈을 갖는 방법 (0)	2020.10.26
중첩 된 목록 이해 이해 (0)	2020.10.26
Julia의 잘못 작성된 R 예제 속도 향상 (0)	2020.10.26
ActiveRecord 또는 쿼리 해시 표기법 (0)	2020.10.26

현재글메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

nicepro

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

'Nice programing' 카테고리의 다른 글

'Nice programing'의 다른글

티스토리툴바

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

메모리 효율적인 내장 SqlAlchemy 반복기 / 생성기?

'Nice programing' 카테고리의 다른 글

'Nice programing'의 다른글

관련글

티스토리툴바